CN111524570B - Ultrasonic follow-up patient screening method based on machine learning - Google Patents

Info

Publication number
CN111524570B
Authority
CN
China
Legal status: Active
Application number
CN202010371381.XA
Other languages
Chinese (zh)
Other versions
CN111524570A (en)
Inventor
张敬谊
李静
潘怀燕
郑文婕
李学源
李光亚
肖筱华
Current Assignee
SHANGHAI PUBLIC HEALTH CLINICAL CENTER
WONDERS INFORMATION CO Ltd
Original Assignee
SHANGHAI PUBLIC HEALTH CLINICAL CENTER
WONDERS INFORMATION CO Ltd
Priority date
Filing date
Publication date
Application filed by SHANGHAI PUBLIC HEALTH CLINICAL CENTER, WONDERS INFORMATION CO Ltd filed Critical SHANGHAI PUBLIC HEALTH CLINICAL CENTER
Priority to CN202010371381.XA priority Critical patent/CN111524570B/en
Publication of CN111524570A publication Critical patent/CN111524570A/en
Application granted granted Critical
Publication of CN111524570B publication Critical patent/CN111524570B/en

Classifications

    • G16H 10/60: ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06N 20/00: Machine learning
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G16H 30/20: ICT specially adapted for handling medical images, e.g. DICOM, HL7 or PACS
    • G16H 50/70: ICT specially adapted for mining of medical data, e.g. analysing previous cases of other patients


Abstract

The invention provides a machine learning based method for screening ultrasound follow-up patients. With the rapid development of deep learning, natural language processing and deep learning techniques have become important means of analysing medical texts and an effective substitute for manual text screening. In the invention, the text content is segmented with the JIEBA word segmentation tool, word vectors are constructed with the TF-IDF method and the Word2Vec algorithm respectively, and feature vectors are then selected by the chi-square test. For classification, XGBoost, LightGBM and CNN models are trained on the feature data, realising automatic screening of the ultrasound examination follow-up list.

Description

Ultrasonic follow-up patient screening method based on machine learning
Technical Field
The invention relates to an ultrasonic follow-up patient screening method based on an electronic health record, and belongs to the field of ultrasonic follow-up knowledge discovery.
Background
With the rapid development of ultrasound technology in recent years, its application in clinical diagnosis has become increasingly widespread, and ultrasound equipment is now standard in most hospitals; ultrasound examination does not expose patients to ionizing radiation and thus carries no risk of radiation-induced cancer. The accuracy of ultrasound diagnosis depends on two aspects: first, whether the doctor can acquire, through the ultrasound probe, images clear enough to support clinical diagnosis; and second, whether the sonographer gives a correct diagnostic description. Because the accuracy of diagnostic results affects the diagnosis and treatment of disease, medical institutions in many countries strive to improve the level of ultrasound diagnosis.
In China, to ensure the quality of ultrasound diagnosis, ultrasound departments often carry out retrospective investigation of examination results through follow-up. The tertiary-hospital ultrasound quality control guidelines issued by the society of sonographers of the Chinese Medical Association indicate that every tertiary hospital should carry out selective, periodic or unscheduled follow-up of patients and their data after ultrasound examination. Ultrasound follow-up is based on diagnostic data and periodic review, and the following diagnostic and prognostic data should be considered: (1) conclusions of pathological examination; (2) findings of surgical treatment; (3) important laboratory results; (4) results of other medical imaging examinations (e.g., CT, magnetic resonance, nuclear medicine or cardiovascular imaging); (5) data related to scientific research projects; (6) other data requiring follow-up collection. Ultrasound follow-up data should be analysed statistically on a regular basis, using pathology or surgery as the reference to calculate the lesion localisation coincidence rate and the physical-property (benign/malignant) coincidence rate of ultrasound examination; in a conforming ultrasound department, both rates should reach at least 95 percent. For cases of misdiagnosis or missed diagnosis, the cause should be analysed promptly.
The primary step of ultrasound follow-up is to find the key patients who need follow-up, i.e. those who, after an ultrasound examination, underwent surgery, imaging examination or pathological examination on the same body part. In the past, obtaining a follow-up patient list required assigning dedicated staff to review a large number of archived medical records, find all image reports and pathology reports the ultrasound patient subsequently received, and exclude a large amount of irrelevant report content. The traditional screening approach is therefore labour-intensive and inefficient, and its accuracy often depends on staff ability and the medical-record archiving cycle.
With the advancement of IT technology, hospital electronic medical record systems contain increasingly rich patient information; electronic medical records include medical orders, test reports, image reports, ultrasound reports, pathology reports, operation records and various disease-course records. However, most reports are still unstructured text, and descriptions of the ultrasound examination site are buried in long passages of free text.
Disclosure of Invention
The invention aims to solve the technical problems that: traditional follow-up patient screening methods are heavy in workload and inefficient, and accuracy often depends on the level of staff's ability and the period of medical history archiving.
In order to solve the technical problems, the technical scheme of the invention is to provide an ultrasonic follow-up patient screening method based on machine learning, which is characterized by comprising the following steps:
step 1, collecting patient treatment record data, wherein the patient treatment record data comprises pathology report data, image report data, ultrasonic report data and a patient unique identifier corresponding to a patient, and constructing a basic information data warehouse according to the collected patient treatment record data of different patients.
Step 2, according to the unique patient identification, associating all patient treatment record data in the basic information data warehouse with the patient, and constructing an ultrasonic patient information table associated with each patient;
step 3, addressing sample imbalance in the ultrasound patient information table obtained in step 2 by oversampling, so as to balance the positive and negative sample sizes; then dividing the samples into follow-up samples and non-follow-up samples according to the follow-up information and marking them with different values; finally, sampling from the population and merging the results to obtain training samples;
step 4, performing word segmentation processing on the pathology report and the ultrasonic report in the training sample, and screening out some irrelevant word segmentation results;
step 5, converting text Word segmentation results into feature vector matrixes by using a TF-IDF method and a Word2Vec method respectively to describe the document;
step 6, selecting the feature vector constructed in the step 5 through chi-square test, and selecting useful information to perform machine learning modeling;
step 7, selecting three models (XGBoost, LightGBM and CNN) for binary classification modelling, predicting the probability that a sample is a follow-up patient, choosing the training feature matrix from TF-IDF and Word2Vec by comparing model performance, and selecting one of XGBoost, LightGBM and CNN as the final prediction model;
and step 8, setting a threshold: samples with predicted probability greater than or equal to the threshold are added to the follow-up patient list, while samples below it are treated as non-follow-up patients; model evaluation indexes are calculated from the classification results obtained in step 7, and the optimal model is selected according to these indexes.
Preferably, the step 2 includes the steps of:
step 201, invalid data in patient treatment record data in a basic information data warehouse are removed;
step 202, merging the ultrasound text fields (e.g. findings and impression) that belong to the same examination in the patient visit record data into one ultrasound report, and likewise merging the pathology text fields into one pathology report;
step 203, performing many-to-many matching on the ultrasonic report and the pathology report in each patient treatment record data, so as to split the patient treatment record data into a plurality of new data records, and after performing many-to-many matching, constructing a new data set by each patient treatment record comprising one ultrasonic report and one pathology report of the same patient;
step 204, extracting patient characteristic information in text data from the data in the new data set obtained in step 203 through a regular expression, and converting the patient characteristic information into numerical data; then filling the missing value and processing the abnormal value; and finally, eliminating irrelevant indexes, deleting indexes with the missing value proportion being larger than a certain value, and normalizing the data to obtain an ultrasonic patient information table.
Preferably, in step 201, the invalid data is patient visit record data with ultrasound report but without pathology report.
Preferably, in step 4, the word segmentation tool employs JIEBA word segmentation.
Preferably, in step 5, a TF-IDF algorithm is used to construct a word feature vector matrix, and the TF-IDF matrix is trained on the segmentation results of the pathology report and the ultrasound report in the labeled sample, including the following contents:
For each word t in each document of the training set, the weight K(t, D_i) of t in document D_i (i = 1, 2, …, M, with M the total number of training documents) is calculated with the TF-IDF algorithm. TF-IDF combines the frequency tf of word t within a single document with the weight idf of t over the whole document set. The idf of word t is calculated as idf(t) = log(M / n_t + 0.01), where n_t is the number of training documents in which t appears. The TF-IDF weight is computed as

K(t, D_i) = tf(t, D_i) · idf(t) / sqrt( Σ_{t' ∈ D_i} [ tf(t', D_i) · idf(t') ]² )

where tf(t, D_i) is the frequency of word t in document D_i and the denominator is a normalization factor.
Preferably, in step 5, word vector training is performed by using a CBOW neural network framework in a Word2Vec deep learning model, and a feature vector matrix is constructed, wherein:
the CBOW neural network is a three-layer neural network, which is obtained by inputting the current word w t C words of the context to output a word w for the current word t Is expressed in mathematical terms:
wherein w is t For a word in the dictionary D, i.e. w is predicted by a window T adjacent to the word t Probability of occurrence; p (w) i |Context i ) Representing the current word w i Probability of c words before and after occurrence;
the output layer of the CBOW neural network takes N words appearing in the dictionary D as leaf nodes, takes the number of times of word appearing in the corpus as weight value to construct a binary tree, and uses a random gradient rising algorithm to project a layer vector X w Is predicted so thatMaximizing, and finally obtaining N-dimensional word vectors w corresponding to each word segmentation through model training.
Preferably, in step 6, feature screening is performed by the chi-square test: the deviation between the observed values and the theoretical values computed under the independence assumption is analysed; if the deviation is smaller than a preset threshold the two variables are judged independent, and if it is larger they are considered correlated. On this basis the chi-square value χ²(t, c) is calculated for each feature, the values are sorted in descending order, and features above the threshold are selected. The chi-square statistic is

χ²(t, c) = Σ_{e_t ∈ {0,1}} Σ_{e_c ∈ {0,1}} ( A_{e_t e_c} − T_{e_t e_c} )² / T_{e_t e_c}

where t is a feature, c is a class, T_{e_t e_c} denotes the computed theoretical count for feature value e_t and class value e_c, and A_{e_t e_c} denotes the corresponding observed count.
Preferably, in step 8, the model evaluation indexes include precision P, recall R, F1 measure, area under the ROC curve AUC, accuracy ACC, specificity TNR and sensitivity TPR, calculated as follows:

P = TP / (TP + FP)
R = TP / (TP + FN)
F1 = 2 · P · R / (P + R)
ACC = (TP + TN) / (TP + FP + TN + FN)
TNR = TN / (TN + FP)
TPR = TP / (TP + FN)

Here TP (true positives) is the number of positive samples predicted positive by the model; FP (false positives) is the number of negative samples predicted positive; TN (true negatives) is the number of negative samples predicted negative; FN (false negatives) is the number of positive samples predicted negative.
With the rapid development of deep learning, natural language processing and deep learning techniques have become important means of analysing medical texts and an effective substitute for manual text screening. In the invention, the text content is segmented with the JIEBA word segmentation tool, word vectors are constructed with the TF-IDF method and the Word2Vec algorithm respectively, and feature vectors are then selected by the chi-square test. For classification, XGBoost, LightGBM and CNN models are trained on the feature data, realising automatic screening of the ultrasound examination follow-up list.
The invention has the following advantages: first, automatic mining of follow-up patient information based on the mass data of medical institutions; second, quantitative analysis of high-dimensional indexes by machine learning, so that ultrasound follow-up patients can be identified quickly and accurately; finally, using this method, a follow-up screening system for ultrasound data can be established that is easy to popularise across different medical institutions.
Drawings
FIG. 1 is a flow chart of a machine learning based ultrasound follow-up patient screening method provided by the invention;
fig. 2 is a method for matching an ultrasound report with a pathology report in a visit record, where a and b are natural numbers.
Detailed Description
The invention will be further illustrated with reference to specific examples. It is to be understood that these examples are illustrative of the present invention and are not intended to limit the scope of the present invention. Further, it is understood that various changes and modifications may be made by those skilled in the art after reading the teachings of the present invention, and such equivalents are intended to fall within the scope of the claims appended hereto.
The invention provides a machine learning based ultrasound follow-up patient screening method, which aims to solve the problems in the prior art of collecting large amounts of data and screening ultrasound follow-up patients manually, and to assist clinical scientific research. The method specifically comprises the following steps:
and step 1, collecting patient treatment record data including pathology report data, image report data and ultrasonic report data. In this embodiment, the patient's visit record data may be collected from the electronic health record, and may include basic information data, physical examination data, admission record data, discharge record data, medical records diagnosis data, operation information data, medical history and genetic history data in addition to pathology report data, image report data and ultrasound report data. And constructing a basic information data warehouse according to the acquired patient treatment record data of different patients.
Step 2, according to patient unique identifiers corresponding to different patients, such as patient case numbers, index serial numbers, and the like, associating all patient treatment record data in a basic information data warehouse with the patients, and constructing an ultrasonic patient information table associated with each patient, comprising the following steps:
step 201, invalid data in patient treatment record data in a basic information data warehouse is removed. In this embodiment, the invalid data is patient visit record data with ultrasound reports but without pathology reports.
The field names and meanings of the patient visit record data are shown in table 1 below:
table 1 field names and meanings of patient visit records (raw data set)
The same patient may undergo multiple ultrasound and pathology examinations, so each visit record may contain multiple examination reports of various kinds for one patient, resulting in multiple visit records for the same patient in the ultrasound patient information table. To facilitate subsequent text processing, the method further comprises the following steps:
step 202, merging the ultrasonic field and the ultrasonic field which belong to the same examination in the patient treatment record data into an ultrasonic report, and merging the pathological field and the pathological field into a pathological report.
In step 203, since the original patient visit record data may contain multiple pathology examinations and multiple ultrasound reports for one patient, the ultrasound reports and pathology reports in each original record are matched many-to-many, splitting the record into several new data records; the specific matching method is shown in fig. 2.
Assuming that after step 202 a patient visit record includes a ultrasound reports and b pathology reports (a and b natural numbers), many-to-many matching splits the current visit record into a×b new data records, from which a new data set is constructed.
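The many-to-many matching above amounts to a cross join of the two report lists. A minimal sketch (the field names and report identifiers here are illustrative, not taken from the patent):

```python
from itertools import product

def split_visit(patient_id, ultrasound_reports, pathology_reports):
    """Cross-match every ultrasound report with every pathology report of one
    visit, yielding a x b new records, each with one ultrasound report and
    one pathology report for the same patient."""
    return [
        {"patient_id": patient_id, "ultrasound": us, "pathology": path}
        for us, path in product(ultrasound_reports, pathology_reports)
    ]

rows = split_visit("P001", ["US-1", "US-2", "US-3"], ["PA-1", "PA-2"])
# 3 ultrasound reports x 2 pathology reports -> 6 new records
```

Each resulting row is one candidate ultrasound-pathology pair for the later text-similarity and classification steps.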
After many-to-many matching, each patient visit record contains one ultrasound report and one pathology report for the same patient. The data forms of the original data set and the new data set formed by steps 202 and 203 are shown in tables 2 and 3 below:
table 2 original patient visit record form
Table 3 new data set format
Step 204, extracting patient characteristic information (such as the affected body part) in the text data from the new data set obtained in step 203 through regular expressions, and converting it into numerical data. Missing values are then filled with "-1", and outliers are handled by deleting the affected samples. Finally, irrelevant indexes (such as patient source) are eliminated, indexes whose missing-value proportion exceeds 0.5 are deleted, and the data are normalized to obtain the ultrasound patient information table.
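The extraction and normalisation in step 204 can be sketched as follows; the body-site vocabulary, the example report text and the min-max normalisation choice are assumptions for illustration, not details given in the patent:

```python
import re

def extract_site(report_text, site_vocab):
    """Extract the first body-site keyword found in free text
    (hypothetical site vocabulary)."""
    for site in site_vocab:
        if re.search(site, report_text):
            return site
    return None

def normalize(values):
    """Min-max normalise a numeric column after filling missing values
    (None) with -1, as described in step 204."""
    filled = [-1.0 if v is None else float(v) for v in values]
    lo, hi = min(filled), max(filled)
    if hi == lo:
        return [0.0 for _ in filled]
    return [(v - lo) / (hi - lo) for v in filled]

sites = ["liver", "thyroid", "kidney"]
record = "Ultrasound examination of the liver: echotexture coarse."
site = extract_site(record, sites)   # "liver"
norm = normalize([3.0, None, 7.0])   # missing value filled with -1 first
```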
Step 3 addresses the sample imbalance in the ultrasound patient information table obtained in step 2 by oversampling, so as to balance the positive and negative sample sizes. In this embodiment, records are randomly drawn with replacement from the minority class to increase its size until the positive and negative proportions are balanced.
The samples in the ultrasound patient information table are then divided into follow-up and non-follow-up samples according to the follow-up information and labelled 1 and 0 respectively. The number of follow-up and of non-follow-up samples is set to 2000 each; they are drawn from the population by systematic sampling and then merged.
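The labelling and random oversampling with replacement described above can be sketched like this; the class sizes, target of 2000 per class and record identifiers are illustrative:

```python
import random

def balance_by_oversampling(follow_up, non_follow_up, target=2000, seed=42):
    """Label follow-up records 1 and non-follow-up records 0, then randomly
    oversample (with replacement) each class up to `target` records."""
    rng = random.Random(seed)

    def upsample(records, label):
        sampled = records if len(records) >= target else \
            records + [rng.choice(records) for _ in range(target - len(records))]
        return [(r, label) for r in sampled[:target]]

    return upsample(follow_up, 1) + upsample(non_follow_up, 0)

train = balance_by_oversampling([f"f{i}" for i in range(500)],
                                [f"n{i}" for i in range(1500)], target=2000)
# 2000 positive + 2000 negative training samples
```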
Step 4: because the body parts mentioned in the ultrasound report and the pathology report are important characteristic information, the traditional method of selecting follow-up patients computes the text similarity between the ultrasound report and the pathology report from body-part keywords and admits patients within a certain threshold range to the follow-up list. Specifically, all body-part keywords are extracted from the training samples, word vectors are trained with the TF-IDF method, and the cosine similarity of the two text passages is computed. At each similarity level a follow-up patient list can be selected and the model indexes calculated; the performance indexes were optimal at a text similarity of 0.1.
To improve the model, the invention first uses the visit records of the 17,765 patients obtained in step 3 as training samples for word-vector training. JIEBA word segmentation is applied to the 30,586 pathology reports and 50,069 ultrasound reports of these patients, yielding a vocabulary of 4,030 distinct words and 2,001,603 tokens in total. A Word2Vec model is then trained on the segmented text to obtain a 200-dimensional word-vector model. JIEBA segmentation is then applied to the pathology and ultrasound reports of the labelled samples. Since business experience indicates that body parts are the model's important features, the JIEBA segmentation results are adjusted by word length, part of speech, word frequency and so on to filter out irrelevant information, deleting prepositions, adjectives, numbers, letters, punctuation marks and the like, leaving a total of 1,984 words.
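The token-filtering step (dropping stopwords, numbers, punctuation and very short tokens after segmentation) can be sketched as follows. The real pipeline uses JIEBA on Chinese text; the English tokens and the stopword list here are stand-ins for illustration:

```python
import string

# Hypothetical stop list standing in for the preposition/adjective filters.
STOPWORDS = {"of", "in", "the", "slightly"}

def filter_tokens(tokens, min_len=2):
    """Keep tokens that are not stopwords, not too short, not pure digits
    and not punctuation, mirroring the word-length/part-of-speech filtering."""
    kept = []
    for tok in tokens:
        if tok in STOPWORDS or len(tok) < min_len:
            continue
        if tok.isdigit() or all(ch in string.punctuation for ch in tok):
            continue
        kept.append(tok)
    return kept

tokens = ["thyroid", "of", "3", "nodule", ",", "left", "lobe", "slightly"]
print(filter_tokens(tokens))  # ['thyroid', 'nodule', 'left', 'lobe']
```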
And 5, after the word segmentation result is obtained in the step 4, converting the text word segmentation result into a feature vector matrix for subsequent modeling analysis. The invention adopts two methods to train word vectors and compare model effects.
The first method: a word feature vector matrix is constructed with the TF-IDF algorithm, training the segmentation results of the pathology and ultrasound reports in the labelled samples into a TF-IDF matrix of size 62604 × 1984.
For each word t in each document of the training set, the weight K(t, D_i) of t in document D_i (i = 1, 2, …, M, with M the total number of training documents) is calculated with the TF-IDF algorithm. TF-IDF combines the frequency tf of word t within a single document with the weight idf of t over the whole document set. The idf of word t is calculated as idf(t) = log(M / n_t + 0.01), where n_t is the number of training documents in which t appears. The TF-IDF weight is computed as

K(t, D_i) = tf(t, D_i) · idf(t) / sqrt( Σ_{t' ∈ D_i} [ tf(t', D_i) · idf(t') ]² )

where tf(t, D_i) is the frequency of word t in document D_i and the denominator is a normalization factor.
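A minimal pure-Python sketch of this TF-IDF weighting (the toy documents are illustrative; the real pipeline operates on the 62604 × 1984 matrix of segmented report text):

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Compute TF-IDF weights K(t, D_i) with idf(t) = log(M / n_t + 0.01),
    each document row normalised by its Euclidean norm."""
    M = len(docs)
    vocab = sorted({t for d in docs for t in d})
    n_t = {t: sum(1 for d in docs if t in d) for t in vocab}
    idf = {t: math.log(M / n_t[t] + 0.01) for t in vocab}
    rows = []
    for d in docs:
        counts = Counter(d)
        raw = [counts[t] / len(d) * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

docs = [["liver", "cyst", "liver"], ["thyroid", "nodule"], ["liver", "nodule"]]
vocab, K = tfidf_matrix(docs)  # 3 documents x len(vocab) weights
```

In practice a library implementation (e.g. a TfidfVectorizer-style tool) would be used; this sketch only makes the formula above concrete.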
Besides constructing a word feature vector matrix with the TF-IDF algorithm, Word2Vec word-vector training with the CBOW neural network framework can be used to build the feature matrix. The second method is therefore: the segmentation results of the pathology and ultrasound reports in the labelled samples are represented as vectors; for each record, the 200-dimensional word vector of every token is extracted, and the vectors are summed and averaged to obtain a 200-dimensional representation of the record, giving a final feature matrix of size 62604 × 200.
CBOW is a three-layer neural network that takes as input the c context words around the current word w_t and outputs a prediction of w_t; mathematically, training maximises

L = Σ_{w_t ∈ C} log p(w_t | Context(w_t))

where w_t is a word in the dictionary D, i.e. the probability of w_t occurring is predicted from the window of adjacent words, and p(w_t | Context(w_t)) denotes the probability of the current word given the c words before and after it.

The output layer of CBOW builds a binary tree whose leaf nodes are the N words appearing in dictionary D, weighted by their frequency in the corpus. A stochastic gradient ascent algorithm adjusts the projection-layer vector X_w so that L is maximised. Finally, model training yields an N-dimensional word vector w = (v_1, v_2, …, v_N) for each segmented word.
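The per-record feature construction of the second method (averaging the word vectors of all tokens in a record) can be sketched as follows; the toy 2-dimensional vectors stand in for the 200-dimensional Word2Vec vectors of the embodiment:

```python
def record_vector(tokens, word_vectors, dim=200):
    """Average the word vectors of all tokens in a record to obtain one
    fixed-length feature vector; tokens without a vector are skipped."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

wv = {"liver": [1.0, 3.0], "cyst": [3.0, 5.0]}  # toy 2-d "Word2Vec" vectors
vec = record_vector(["liver", "cyst", "unknown"], wv, dim=2)  # [2.0, 4.0]
```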
According to the invention, the characteristic vector matrix is respectively constructed by using the TF-IDF method and the Word2Vec method to describe the document.
Step 6: the feature vectors constructed in step 5 are selected by the chi-square test with a threshold of 0.1; a feature is retained as important when the p-value of its chi-square test is below 0.1, and rejected otherwise, realising feature selection. Class labels are encoded with '1' for 'yes' and '0' for 'no'. The useful information is thus selected for machine learning modelling.
And (3) carrying out feature screening by using chi-square test, analyzing the deviation result of the actual value and the calculated theoretical value, judging that the two variables are independent if the obtained deviation result is smaller than a preset threshold value, and considering that the two variables are related if the obtained deviation result is larger than the preset threshold value. On the basis of the above-mentioned, the chi-square value of each characteristic is calculated 2 And (t, c), sorting the chi-square values from large to small, and selecting the characteristics larger than the threshold value. The formula of the chi-square test method is as follows:
where t is a feature and c is a category; E_{e_t,e_c} denotes the calculated theoretical value for feature value e_t and class e_c, and A_{e_t,e_c} denotes the actual value for feature value e_t and class e_c.
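The chi-square screening above can be sketched for a single binary feature against a binary class label; the 2x2 contingency table supplies the actual counts A and the independence assumption supplies the theoretical counts E (the feature and label arrays here are made up).

```python
def chi_square(feature, labels):
    """Chi-square value χ²(t, c) for a binary feature: sum of (A - E)² / E
    over the 2x2 contingency table of feature presence versus class label."""
    n = len(labels)
    actual = {(f, c): 0 for f in (0, 1) for c in (0, 1)}
    for f, c in zip(feature, labels):
        actual[(f, c)] += 1
    chi2 = 0.0
    for f in (0, 1):
        row = actual[(f, 0)] + actual[(f, 1)]
        for c in (0, 1):
            col = actual[(0, c)] + actual[(1, c)]
            expected = row * col / n            # theoretical value under independence
            if expected:
                chi2 += (actual[(f, c)] - expected) ** 2 / expected
    return chi2

# A feature that tracks the label perfectly scores high; an unrelated one scores 0.
print(chi_square([1, 1, 0, 0], [1, 1, 0, 0]))  # 4.0
print(chi_square([1, 0, 1, 0], [1, 1, 0, 0]))  # 0.0
```

Features would then be ranked by this value (or by the corresponding p-value, as with the 0.1 threshold above) and only the top-ranked ones retained.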
Step 7: three models, XGBoost, Lightgbm and CNN, are selected for binary-classification modeling, and the probability that each sample is a follow-up patient is predicted.
Model training results were obtained for the different feature combinations; the evaluation indexes are compared in Table 4 below.
Table 4. Model effect comparison of different feature engineering schemes
As the comparison in the table shows, combining the TF-IDF feature matrix with a machine learning algorithm achieves better results, so the TF-IDF features are selected as the model training features.
From the performance metrics in the table, XGBoost is slightly more accurate than Lightgbm. Both XGBoost and Lightgbm are fast, effective Tree Boosting tools suited to large-scale data computation. Because XGBoost has been available longer, its tuning practice is more mature, and it outperforms Lightgbm in accuracy; Lightgbm is fast and efficient, but its tuning practice is still maturing owing to its shorter release history. Considering both accuracy and speed, XGBoost is therefore selected as the final model for prediction.
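The two-class probability modeling of step 7 follows the usual fit/predict-probability pattern. A minimal sketch on synthetic data, using scikit-learn's GradientBoostingClassifier as a stand-in tree-boosting model (the feature matrix, labels, and split sizes are all made up for illustration, not drawn from the patent's data):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                   # stand-in for the TF-IDF feature matrix
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # synthetic follow-up labels

# Train a tree-boosting classifier and predict the probability that each
# held-out sample is a follow-up patient (class '1').
clf = GradientBoostingClassifier(random_state=0).fit(X[:150], y[:150])
proba = clf.predict_proba(X[150:])[:, 1]
follow_up = proba >= 0.5                        # threshold screening as in step 8
```

XGBoost and Lightgbm expose sklearn-style estimators with the same fit/predict_proba calls, so swapping the classifier in this sketch is a one-line change.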
Step 8: follow-up patient screening. A threshold is set; samples whose predicted probability is greater than or equal to the threshold are added to the follow-up patient list, and samples whose predicted probability is below the threshold are treated as non-follow-up patients. Model evaluation indexes are calculated from the classification results, and the optimal model is selected according to these indexes.
In step 8, the classification results are comprehensively evaluated with several indexes, and the optimal classification model is selected. The evaluation indexes comprise precision P, recall R, F1 measure, area under the curve AUC, accuracy ACC, specificity TNR and sensitivity TPR, calculated as follows:

P = TP / (TP + FP)
R = TPR = TP / (TP + FN)
F1 = 2PR / (P + R)
ACC = (TP + TN) / (TP + FP + TN + FN)
TNR = TN / (TN + FP)
TP (true positives) denotes the number of positive samples predicted positive by the model; FP (false positives) denotes the number of negative samples predicted positive; TN (true negatives) denotes the number of negative samples predicted negative; FN (false negatives) denotes the number of positive samples predicted negative.
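From these four counts the evaluation indexes of step 8 follow directly; a small helper (the function name and the sample counts are illustrative):

```python
def evaluation_indexes(tp, fp, tn, fn):
    """Precision P, recall R (= sensitivity TPR), F1 measure,
    accuracy ACC and specificity TNR from the confusion-matrix counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    acc = (tp + tn) / (tp + fp + tn + fn)
    tnr = tn / (tn + fp)
    return {"P": p, "R": r, "F1": f1, "ACC": acc, "TNR": tnr}

m = evaluation_indexes(tp=8, fp=2, tn=85, fn=5)   # made-up counts
```

AUC, by contrast, is computed from the ranked predicted probabilities rather than from a single threshold, so it is not derivable from these four counts alone.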
In this example, screening predictions were made on patient data from February 2018. After preprocessing, 1808 records from 310 patients were obtained, comprising 840 ultrasound reports and 468 pathology reports. Manual annotation by 3 experts yielded 351 records labeled '1' (meeting the follow-up requirement) and 1457 records labeled '0' (not meeting the follow-up requirement). By threshold screening, samples with a predicted probability greater than or equal to the set threshold were added to the follow-up patient list, and the remaining samples were treated as non-follow-up patients. According to the classification results, the model predicted 143 patients requiring follow-up, with 318 records predicted as '1' and 1490 records predicted as '0'. The overall labeling accuracy reached 93%, and the results were endorsed by the experts, demonstrating the feasibility and effectiveness of the model.

Claims (5)

1. The ultrasonic follow-up patient screening method based on machine learning is characterized by comprising the following steps of:
step 1, collecting patient treatment record data, wherein the patient treatment record data comprises pathology report data, image report data, ultrasonic report data and a patient unique identifier corresponding to a patient, and constructing a basic information data warehouse according to the collected patient treatment record data of different patients;
step 2, according to the unique patient identification, associating all patient treatment record data in the basic information data warehouse with the patient, and constructing an ultrasonic patient information table associated with each patient;
step 3, to address sample imbalance in the ultrasound patient information table obtained in step 2, the data are processed by oversampling so as to balance the positive and negative sample sizes; the samples in the ultrasound patient information table are then divided into follow-up samples and non-follow-up samples according to the follow-up information and marked with different values respectively; finally, samples are drawn from the population and merged to obtain the training samples;
step 4, word segmentation is performed on the pathology reports and ultrasound reports in the training samples, and irrelevant segmentation results are screened out;
step 5, the text word-segmentation results are converted into feature vector matrices describing a document by using the TF-IDF method and the Word2Vec method, wherein the TF-IDF algorithm is used to construct a word feature vector matrix, and the word-segmentation results of the pathology reports and ultrasound reports in the labeled samples are used to train the TF-IDF matrix, comprising the following steps:
for each word in the document set, the weight value K(t, D_i) of the word in each document is calculated with the TF-IDF algorithm, where K(t, D_i) denotes the weight of word t in document D_i (i = 1, 2, ..., M) and M is the total number of training documents; the TF-IDF algorithm jointly considers the frequency tf of word t in a single document and the weight idf of word t in the whole document set; the weight idf of word t is calculated as idf(t) = log(M/n_t + 0.01), where n_t is the number of documents in the training document set in which word t appears; the calculation formula of the TF-IDF algorithm is:

K(t, D_i) = tf(t, D_i) · idf(t) / sqrt( Σ_{t' ∈ D_i} [tf(t', D_i) · idf(t')]² )

where tf(t, D_i) is the frequency of word t in document D_i, and the denominator is a normalization factor;
word vector training is carried out by using a CBOW neural network framework in a Word2Vec deep learning model, and a feature vector matrix is constructed, wherein:
the CBOW neural network is a three-layer neural network that takes as input the c context words surrounding the current word w_t and outputs a prediction of the current word w_t; expressed mathematically, the training objective is to maximize

Σ_{w_t ∈ D} log p(w_t | Context_t)

where w_t is a word in dictionary D, i.e. the probability of w_t occurring is predicted from a window of T words adjacent to it, and p(w_i | Context_i) denotes the probability of the current word w_i occurring given the c words before and after it;
the output layer of the CBOW neural network takes the N words appearing in dictionary D as leaf nodes and constructs a binary tree using each word's frequency in the corpus as its weight; a stochastic gradient ascent algorithm makes the prediction from the projection-layer vector X_w so that the above log-likelihood is maximized, and finally model training yields the N-dimensional word vector w corresponding to each segmented word;
step 6, the feature vectors constructed in step 5 are filtered by chi-square test, and useful information is selected for machine learning modeling, wherein the chi-square test is used for feature screening by analyzing the deviation between the actual values and the calculated theoretical values; if the deviation is smaller than a preset threshold, the two variables are judged to be independent, and if it is larger than the preset threshold, the two variables are considered related; on this basis, the chi-square value χ²(t, c) of each feature is calculated, the chi-square values are sorted from large to small, and the features above the threshold are selected; the formula of the chi-square test is:

χ²(t, c) = Σ_{e_t ∈ {0,1}} Σ_{e_c ∈ {0,1}} (A_{e_t,e_c} - E_{e_t,e_c})² / E_{e_t,e_c}

where t is a feature and c is a category, E_{e_t,e_c} denotes the calculated theoretical value for feature value e_t and class e_c, and A_{e_t,e_c} denotes the actual value for feature value e_t and class e_c;
step 7, the three models XGBoost, Lightgbm and CNN are selected for binary-classification modeling to predict the probability that a sample is a follow-up patient; the training feature matrix is chosen between TF-IDF and Word2Vec according to a comparison of model effects, and one of XGBoost, Lightgbm and CNN is selected as the model finally used for prediction;
step 8, a threshold is set, samples whose predicted probability value is greater than or equal to the set threshold are added to the follow-up patient list, and samples whose predicted probability value is smaller than the set threshold are treated as non-follow-up patients; model evaluation indexes are calculated from the model classification results obtained in step 7, and the optimal model is selected according to these indexes.
2. The machine learning based ultrasound follow-up patient screening method of claim 1, wherein step 2 comprises the steps of:
step 201, invalid data in patient treatment record data in a basic information data warehouse are removed;
step 202, the ultrasound fields in the patient treatment record data that belong to the same examination are combined into one ultrasound report, and likewise the pathology fields are combined into one pathology report;
step 203, many-to-many matching is performed between the ultrasound reports and pathology reports in each patient treatment record, so that the record is split into a plurality of new data records; after the matching, each record comprises one ultrasound report and one pathology report of the same patient, and these records constitute a new data set;
step 204, the patient characteristic information in the text data is extracted from the new data set obtained in step 203 through regular expressions and converted into numerical data; missing values are then filled and outliers processed; finally, irrelevant indexes are eliminated, indexes whose missing-value proportion exceeds a certain value are deleted, and the data are normalized to obtain the ultrasound patient information table.
3. The machine learning based ultrasound follow-up patient screening method of claim 2, wherein in step 201, the invalid data is patient visit record data that has an ultrasound report but no pathology report.
4. The machine learning based ultrasound follow-up patient screening method of claim 1, wherein in step 4, the segmentation tool employs JIEBA segmentation.
5. The machine learning based ultrasound follow-up patient screening method of claim 1, wherein in step 8, the model evaluation indexes comprise precision P, recall R, F1 measure, area under the curve AUC, accuracy ACC, specificity TNR and sensitivity TPR, calculated as follows:

P = TP / (TP + FP)
R = TPR = TP / (TP + FN)
F1 = 2PR / (P + R)
ACC = (TP + TN) / (TP + FP + TN + FN)
TNR = TN / (TN + FP)

TP (true positives) denotes the number of positive samples predicted positive by the model; FP (false positives) denotes the number of negative samples predicted positive; TN (true negatives) denotes the number of negative samples predicted negative; FN (false negatives) denotes the number of positive samples predicted negative.
CN202010371381.XA 2020-05-06 2020-05-06 Ultrasonic follow-up patient screening method based on machine learning Active CN111524570B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010371381.XA CN111524570B (en) 2020-05-06 2020-05-06 Ultrasonic follow-up patient screening method based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010371381.XA CN111524570B (en) 2020-05-06 2020-05-06 Ultrasonic follow-up patient screening method based on machine learning

Publications (2)

Publication Number Publication Date
CN111524570A CN111524570A (en) 2020-08-11
CN111524570B true CN111524570B (en) 2024-01-16

Family

ID=71907066

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010371381.XA Active CN111524570B (en) 2020-05-06 2020-05-06 Ultrasonic follow-up patient screening method based on machine learning

Country Status (1)

Country Link
CN (1) CN111524570B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112201360B (en) * 2020-10-09 2023-06-20 平安科技(深圳)有限公司 Method, device, equipment and storage medium for collecting chronic disease follow-up record
CN114334169B (en) * 2022-03-07 2022-06-10 四川大学 Medical object category decision method and device, electronic equipment and storage medium
CN115458162A (en) * 2022-11-10 2022-12-09 四川京炜数字科技有限公司 Bone-related disease treatment plan prediction system and method based on machine learning

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3056328A1 (en) * 2016-09-16 2018-03-23 L'air Liquide, Societe Anonyme Pour L'etude Et L'exploitation Des Procedes Georges Claude DATA PROCESSING SYSTEM FOR PREDICTING HOSPITALIZATION OR RE-HOSPITALIZATION OF A PATIENT WITH CHRONIC RESPIRATORY DISEASE
CN107833629A (en) * 2017-10-25 2018-03-23 厦门大学 Aided diagnosis method and system based on deep learning
CN109741806A (en) * 2019-01-07 2019-05-10 北京推想科技有限公司 A kind of Medical imaging diagnostic reports auxiliary generating method and its device
CN110265153A (en) * 2019-05-16 2019-09-20 平安科技(深圳)有限公司 Chronic disease follow-up method and electronic device
CN110415831A (en) * 2019-07-18 2019-11-05 天宜(天津)信息科技有限公司 A kind of medical treatment big data cloud service analysis platform
EP3573068A1 (en) * 2018-05-24 2019-11-27 Siemens Healthcare GmbH System and method for an automated clinical decision support system
WO2020006495A1 (en) * 2018-06-29 2020-01-02 Ai Technologies Inc. Deep learning-based diagnosis and referral of diseases and disorders using natural language processing
CN110781333A (en) * 2019-06-26 2020-02-11 杭州鲁尔物联科技有限公司 Method for processing unstructured monitoring data of cable-stayed bridge based on machine learning
CN110795564A (en) * 2019-11-01 2020-02-14 南京稷图数据科技有限公司 Text classification method lacking negative cases
CN113689927A (en) * 2021-10-26 2021-11-23 湖北经济学院 Ultrasonic image processing method and device based on deep learning model

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7392185B2 (en) * 1999-11-12 2008-06-24 Phoenix Solutions, Inc. Speech based learning/training system using semantic decoding
US10169863B2 (en) * 2015-06-12 2019-01-01 International Business Machines Corporation Methods and systems for automatically determining a clinical image or portion thereof for display to a diagnosing physician
CN106021364B (en) * 2016-05-10 2017-12-12 百度在线网络技术(北京)有限公司 Foundation, image searching method and the device of picture searching dependency prediction model
JP2022505676A (en) * 2018-10-23 2022-01-14 ブラックソーン セラピューティクス インコーポレイテッド Systems and methods for patient screening, diagnosis, and stratification

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3056328A1 (en) * 2016-09-16 2018-03-23 L'air Liquide, Societe Anonyme Pour L'etude Et L'exploitation Des Procedes Georges Claude DATA PROCESSING SYSTEM FOR PREDICTING HOSPITALIZATION OR RE-HOSPITALIZATION OF A PATIENT WITH CHRONIC RESPIRATORY DISEASE
CN107833629A (en) * 2017-10-25 2018-03-23 厦门大学 Aided diagnosis method and system based on deep learning
EP3573068A1 (en) * 2018-05-24 2019-11-27 Siemens Healthcare GmbH System and method for an automated clinical decision support system
WO2020006495A1 (en) * 2018-06-29 2020-01-02 Ai Technologies Inc. Deep learning-based diagnosis and referral of diseases and disorders using natural language processing
CN109741806A (en) * 2019-01-07 2019-05-10 北京推想科技有限公司 A kind of Medical imaging diagnostic reports auxiliary generating method and its device
CN110265153A (en) * 2019-05-16 2019-09-20 平安科技(深圳)有限公司 Chronic disease follow-up method and electronic device
CN110781333A (en) * 2019-06-26 2020-02-11 杭州鲁尔物联科技有限公司 Method for processing unstructured monitoring data of cable-stayed bridge based on machine learning
CN110415831A (en) * 2019-07-18 2019-11-05 天宜(天津)信息科技有限公司 A kind of medical treatment big data cloud service analysis platform
CN110795564A (en) * 2019-11-01 2020-02-14 南京稷图数据科技有限公司 Text classification method lacking negative cases
CN113689927A (en) * 2021-10-26 2021-11-23 湖北经济学院 Ultrasonic image processing method and device based on deep learning model

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
丁尚伟; 谢玉环; 陈俊君; 陈沛芬; 何志忠; 罗海波. Application of a digital case follow-up system in the standardized training of ultrasound physicians. Southern Medical Education. 2018, (01), full text. *
刘再毅. Clinical value and challenges of radiomics. Medical Journal of Peking Union Medical College Hospital. 2018, (04), full text. *
常炳国; 刘清星. Similarity analysis of chronic liver disease CT reports based on deep learning. Computer Applications and Software. 2018, (08), full text. *
王根生; 黄学坚. A convolutional neural network text classification model based on Word2vec and improved TF-IDF. Journal of Chinese Computer Systems. 2019, (05), full text. *
胡婧; 刘伟; 马凯. Text classification of hypertension medical records based on machine learning. Science Technology and Engineering. 2019, (33), full text. *

Also Published As

Publication number Publication date
CN111524570A (en) 2020-08-11

Similar Documents

Publication Publication Date Title
CN111524570B (en) Ultrasonic follow-up patient screening method based on machine learning
CN109036577B (en) Diabetes complication analysis method and device
CN108511056A (en) Therapeutic scheme based on patients with cerebral apoplexy similarity analysis recommends method and system
CN112541066B (en) Text-structured-based medical and technical report detection method and related equipment
CN109065174B (en) Medical record theme acquisition method and device considering similarity constraint
CN113555077B (en) Suspected infectious disease prediction method and device
CN113808738B (en) Disease identification system based on self-identification image
CN110739076A (en) medical artificial intelligence public training platform
CN110556173A (en) intelligent classification management system and method for inspection report
CN112489740A (en) Medical record detection method, training method of related model, related equipment and device
Livieris et al. Identification of blood cell subtypes from images using an improved SSL algorithm
Hasan et al. Understanding current states of machine learning approaches in medical informatics: a systematic literature review
CN116189866A (en) Remote medical care analysis system based on data analysis
CN109192312B (en) Intelligent management system and method for adverse events of heart failure patients
CN116844733B (en) Medical data integrity analysis method based on artificial intelligence
CN113435200A (en) Entity recognition model training and electronic medical record processing method, system and equipment
CN113360643A (en) Electronic medical record data quality evaluation method based on short text classification
JP2017167738A (en) Diagnostic processing device, diagnostic processing system, server, diagnostic processing method, and program
CN116775897A (en) Knowledge graph construction and query method and device, electronic equipment and storage medium
Norman Systematic review automation methods
CN110610766A (en) Apparatus and storage medium for deriving probability of disease based on symptom feature weight
CN115862897A (en) Syndrome monitoring method and system based on clinical data
RU2723674C1 (en) Method for prediction of diagnosis based on data processing containing medical knowledge
CN109840275B (en) Method, device and equipment for processing medical search statement
Kaur et al. An Accurate Integrated System to detect Pulmonary and Extra Pulmonary Tuberculosis using Machine Learning Algorithms

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20210608

Address after: 200233 5th floor, building 20, 481 Guiping Road, Xuhui District, Shanghai

Applicant after: WONDERS INFORMATION Co.,Ltd.

Applicant after: SHANGHAI PUBLIC HEALTH CLINICAL CENTER

Address before: 200233 5th floor, building 20, 481 Guiping Road, Xuhui District, Shanghai

Applicant before: WONDERS INFORMATION Co.,Ltd.

GR01 Patent grant
GR01 Patent grant