CN109344250B

CN109344250B - Rapid structuring method of single disease diagnosis information based on medical insurance data

Info

Publication number: CN109344250B
Application number: CN201811045058.2A
Authority: CN
Inventors: 王胜锋; 詹思延; 许璐; 冯菁楠; 刘国臻; 高培; 王金喜; 尉晨
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2018-09-07
Filing date: 2018-09-07
Publication date: 2021-11-19
Anticipated expiration: 2038-09-07
Also published as: CN109344250A

Abstract

The invention discloses a method for rapidly structuring single-disease diagnosis information based on medical insurance data, which structures the diagnosis information in medical big data and constructs a single-disease word bank. The method comprises the following steps: extracting diagnostic information from a medical insurance database; segmenting the unstructured text into a plurality of vocabulary texts; labeling the part of speech of the vocabulary texts; training word vectors; ranking the texts in ascending order of edit distance and cutting them into corresponding word sets; using the cosine distance to obtain the relevance between words; obtaining the word list most similar to the standard expression of the disease as a standard word list; and having professionals perform computer-assisted manual checking, repeated several times. The method enables personalized, rapid structuring of single-disease diagnostic text data, provides technical support for making full and efficient use of the diagnosis information in medical insurance data, can greatly improve the efficiency of data processing and utilization, and accelerates the translation and application of medical big data.

Description

Rapid structuring method of single disease diagnosis information based on medical insurance data

Technical Field

The invention provides a rapid structuring method of diagnosis data/information about a single disease category in a medical insurance database, belonging to the technical field of medical text processing.

Background

Medical insurance data (claims data) is generated in the course of medical insurance business. It is huge in volume, covers large populations, and completely and faithfully records the treatment information, reimbursement records and similar details of those populations within a given time range.

Currently, more and more workers in the medical field are trying to process and apply medical big data. For example, the Sentinel Initiative of the United States Food and Drug Administration (FDA) adopts a common data model (CDM) to process data from different sources in a unified, standardized way, so that drug evaluation can be carried out through active surveillance.

However, medical big data contains a large amount of unstructured text (mainly diagnostic information). For example, the same disease may be expressed in many different ways, most of the expressions are not standardized, and some even contain wrongly written characters. This makes medical big data such as medical insurance data, regional data and electronic medical records very difficult to use, and leaves a large amount of data "idle".

Existing methods for text similarity retrieval over multiple types of Chinese medical records combine statistical learning and deep learning; their key points are correctly classifying diseases and symptoms and computing distances between long-text vectors. The conventional schemes and their disadvantages are mainly as follows:

(1) A Chinese word segmentation tool is applied directly to the long text, long-text vectors are computed, and similar long-text medical records are found directly from the distances between the vectors, with similarity calculated by dependency syntactic analysis. The disadvantage is that the long texts obtained are similar only in the literal sense and not in meaning.

(2) Dictionary-based Chinese word segmentation and named entity recognition tools are applied directly, long-text vectors are computed, and similar long texts are then found with a combined-distance method.

(3) Each word in a sentence is assigned a correct lexical label and a category (part-of-speech tagging), a long-text vector is computed directly, and the similarity of long texts is calculated with a single distance. The similar texts found in this way cannot meet the requirements of doctors.

The purpose of structuring medical insurance data is to promote medical research, which usually starts from one or a few diseases. Traditional structuring processes target the whole database and all disease types and are therefore broad and coarse; what is lacking is a more personalized, fine-grained and rapid medical text processing technique aimed at a single disease.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a method for rapidly structuring the diagnosis data of a single disease (such as multiple myeloma, amyotrophic lateral sclerosis, albinism, Alport syndrome or autoimmune encephalitis) based on medical insurance data. The method enables personalized, rapid structuring of the diagnostic text data of a single disease, provides technical support for making full and efficient use of the diagnosis information in medical insurance data, can greatly improve the efficiency of data processing and utilization, and accelerates the translation and application of medical big data. Such applications include descriptive epidemiological estimates such as disease incidence, mortality and case fatality, as well as case-control analysis, long-term large-scale cohort analysis, randomized controlled trials and other analyses in analytical and experimental epidemiology.

To address the unstructured disease diagnosis text and coding data (for example, for multiple myeloma) in current medical insurance databases, the invention departs from the "large and complete" matching approach used in earlier database construction in this field. It adopts a single-disease-centered strategy: the broadest extraction fields are used so that all potential patients with the target disease fall within the scope of structuring, and computer-assisted screening and matching is then used to structure, rapidly and in a personalized way, all diagnostic text related to the single disease (such as multiple myeloma) in the medical insurance database. The screening process makes full use of computer data processing and follows the rule that every confirmed target patient must have a definite diagnosis entry and every confirmed non-target patient must match a definite exclusion entry; the process is repeated until every record is clearly classified. This workflow produces, at the same time, a final single-disease word bank and a final non-target-disease exclusion word bank. The word banks are highly portable: when the method is applied to another medical insurance database, data processing software does most of the matching, and only the small share of records that cannot be fully matched needs manual matching. The work is goal-oriented, greatly improves efficiency through man-machine cooperation, provides technical support for analyses that use the diagnosis information, and facilitates the rapid structuring of other single diseases or groups of diseases.

The technical scheme provided by the invention is as follows:

a method for rapidly structuring single-disease diagnosis information based on medical insurance data, in which the redundant text of the diagnosis information in medical big data is structured to construct a single-disease word bank (such as a multiple myeloma word bank); the method comprises the following steps:

1) extracting the fields containing diagnosis information from a medical insurance database, and segmenting the unstructured diagnostic text into a plurality of short vocabulary texts;

using a conditional random field (CRF) to label the part of speech of the segments of the long text sentences, to obtain labeled segmented sentences;

2) classifying the labeled segmented sentences by part of speech;

this process is carried out with existing mature open-source packages; the resulting words are then trained into word vectors;

3) using the edit distance to rank, in ascending order, the diagnostic texts in the original database by their literal similarity to the single disease name (such as multiple myeloma), and cutting the ranked texts into corresponding word sets according to step 1);

the edit distance is mainly used to calculate the similarity of two character strings and is defined as follows. Given character strings A and B, where B is the pattern string, three operations are allowed: deleting a character from the string, inserting a character into the string, and replacing a character in the string. The minimum number of such operations required to edit string A into pattern string B is called the shortest edit distance of A and B, denoted ED(A, B). The algorithm for solving the shortest edit distance uses a two-dimensional array ED[i][j] to represent the minimum number of operations required to edit the first i characters of string A into the first j characters of string B. The recurrence for ED[i][j] is:

31) ED[i][0] = i and ED[0][j] = j, where 0 ≤ i ≤ A.len and 0 ≤ j ≤ B.len;

32) if A[i] = B[j], then ED[i][j] = ED[i-1][j-1];

33) if A[i] ≠ B[j], then ED[i][j] = min(ED[i-1][j-1], ED[i-1][j], ED[i][j-1]) + 1, where the three terms correspond to replacement, deletion and insertion, respectively.

The smaller the edit distance, the more similar the two character strings; the larger the edit distance, the less similar.
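The edit distance described above can be illustrated with a minimal Python sketch. It assumes the standard Levenshtein recurrence, in which the minimum is taken over replacement, deletion and insertion; the function names `edit_distance` and `rank_by_edit_distance` are illustrative and do not come from the patent.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character deletions, insertions and
    replacements needed to turn string a into pattern string b."""
    rows, cols = len(a) + 1, len(b) + 1
    # ed[i][j]: cost of editing the first i characters of a
    # into the first j characters of b.
    ed = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        ed[i][0] = i  # delete all i characters of a
    for j in range(cols):
        ed[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                ed[i][j] = ed[i - 1][j - 1]
            else:
                ed[i][j] = 1 + min(
                    ed[i - 1][j - 1],  # replacement
                    ed[i - 1][j],      # deletion
                    ed[i][j - 1],      # insertion
                )
    return ed[rows - 1][cols - 1]


def rank_by_edit_distance(candidates, target):
    """Sort candidate texts in ascending order of edit distance to target."""
    return sorted(candidates, key=lambda text: edit_distance(text, target))
```

Ranking in ascending order puts the texts that are literally closest to the disease name first, which is how step 3) uses the distance.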

4) computing the cosine similarity over the word sets obtained in step 3) to represent the relevance between words. A threshold is set; if the cosine similarity is smaller than the threshold, the relevance is set to 0, that is, the words are regarded as unrelated. The distances of the related words are summed and sorted in ascending order to obtain the similar words of second priority. Cosine similarity (referred to here as cosine distance) measures the similarity between two texts by the cosine of the angle between their two vectors in a vector space; compared with distance measures, it emphasizes the difference in direction between the two vectors. The calculation formula is as follows:

$$\cos(X, Y) = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert} = \frac{\sum_{k=1}^{n} x_k y_k}{\sqrt{\sum_{k=1}^{n} x_k^{2}} \, \sqrt{\sum_{k=1}^{n} y_k^{2}}}$$

wherein X = (x_1, ..., x_n) is the query word vector and Y = (y_1, ..., y_n) is the matched word vector in the library.
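As an illustration of the thresholded relevance in step 4), the following Python sketch computes the cosine similarity of two word vectors and sets it to 0 when it falls below the threshold. The default of 0.4 is the value the description reports; the function names and the handling of zero-length vectors are assumptions made for the example.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between vectors x and y (equal-length sequences)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumption: a zero vector is treated as unrelated
    return dot / (norm_x * norm_y)


def relevance(x, y, threshold=0.4):
    """Cosine similarity, set to 0 when it is below the threshold."""
    sim = cosine_similarity(x, y)
    return sim if sim >= threshold else 0.0
```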

Through calculation of the editing distance and the similarity, a word list most similar to the standard expression of the disease (multiple myeloma) is obtained and is provided to a doctor for manual checking as a standard word list.

The core of the discrimination technique lies in choosing the recognition threshold. For this method, candidate values between 0.1 and 0.9 were tried at intervals of 0.05. Combining the results of these trials, and with reference to two indicators, matching speed and accuracy (above 95%), the optimal threshold for the national medical insurance database was set to 0.4.
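The threshold search can be sketched as a simple grid scan. The grid (0.1 to 0.9 in steps of 0.05) comes from the description; the selection rule used here (the fastest candidate among those reaching 95% accuracy) and the `evaluate` callback are assumptions, since the patent does not state exactly how speed and accuracy were weighed.

```python
def candidate_thresholds(start=0.10, stop=0.90, step=0.05):
    """Thresholds from start to stop inclusive, rounded to two decimals."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + k * step, 2) for k in range(count)]


def pick_threshold(evaluate, min_accuracy=0.95):
    """Scan the grid and return the fastest threshold that is accurate enough.

    evaluate is a hypothetical callback: evaluate(threshold) returns
    (accuracy, seconds) measured on the database being structured.
    Returns None when no candidate reaches min_accuracy.
    """
    best = None
    for threshold in candidate_thresholds():
        accuracy, seconds = evaluate(threshold)
        if accuracy < min_accuracy:
            continue
        if best is None or seconds < best[1]:
            best = (threshold, seconds)
    return None if best is None else best[0]
```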

5) Computer-assisted manual checking is repeated several times. The specific operation is as follows:

based on the clinical name of the target disease (for example, multiple myeloma), the broadest extraction fields are defined, such as "myeloma", "Kahler", "bone marrow cancer", "myelopathy", "203.0", "C90.0" and "M97320/3", and all patients who potentially have the target disease are extracted. Matching then begins with computer-assisted screening, and each round of matching results undergoes computer-assisted manual checking. The checking rule is that every confirmed target patient must have a definite diagnosis entry (matching the standard word list, that is, the inclusion word bank) and every confirmed non-target patient must match a definite exclusion entry (matching the non-target-disease exclusion word bank, which consists of the words manually judged during checking not to indicate multiple myeloma). The process is repeated until every record is clearly classified. Whenever a round of manual checking shows that the accuracy of the current version of the standard word list is below 95%, the new words found in that round are added to produce a new version of the standard word list, the edit distance and similarity are recalculated, and the next round of manual checking is performed. Iteration stops when manual checking shows that the accuracy of the word list has reached 95%.

The accuracy is defined as follows. The inclusion word bank and the exclusion word bank are used to judge whether each patient in the medical insurance database is a multiple myeloma patient. As a result, some patients are definitely judged to be multiple myeloma patients (their number is denoted a), some are judged to be non-multiple myeloma patients (denoted b), and the remaining patients still need further judgment (denoted c). The accuracy is a/(a + c).
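The accuracy a/(a + c) can be computed as in the following Python sketch. The data layout (a mapping from patient identifier to that patient's segmented diagnostic terms) and the precedence rule (an inclusion match wins over an exclusion match) are assumptions made for the example; the description only states that target patients must have an inclusion entry and non-target patients an exclusion entry.

```python
from collections import Counter


def classify(terms, inclusion, exclusion):
    """Label one patient from the segmented diagnostic terms."""
    if any(term in inclusion for term in terms):
        return "target"        # has a definite diagnosis entry
    if any(term in exclusion for term in terms):
        return "non-target"    # matches a definite exclusion entry
    return "undetermined"      # still needs manual judgment


def accuracy(patients, inclusion, exclusion):
    """Return a / (a + c), where a counts target patients and c counts
    patients that remain undetermined."""
    counts = Counter(
        classify(terms, inclusion, exclusion) for terms in patients.values()
    )
    a, c = counts["target"], counts["undetermined"]
    # Assumption: when nothing is target or undetermined, report 1.0.
    return a / (a + c) if (a + c) else 1.0
```

The count b of non-target patients does not enter the ratio, so the measure reflects how many of the potentially relevant patients the word banks have already resolved.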

The invention has the beneficial effects that:

The invention provides a method for rapidly structuring the diagnostic data of a single disease in a medical insurance database. Unlike the earlier purely manual screening and matching, it uses computer-assisted screening and matching, so that efficiency is improved through man-machine cooperation. First, a conditional random field (CRF) is used to label the part of speech of the segments of long text sentences to obtain labeled segmented sentences, part-of-speech classification is carried out, and the resulting words are trained into word vectors. The edit distance is then used to rank the texts in ascending order by literal similarity to "myeloma", and the ranked texts are cut into corresponding word sets. Finally, the cosine distance is used to obtain the relevance between words, the distances of the related words are summed and sorted in ascending order, and the similar words of second priority are obtained. In the invention, the way the distance algorithms are combined and the internal threshold were tuned over many trials, and the results are good. The threshold value of 0.4 can serve as a direct parameter reference for future text standardization based on medical insurance data and facilitates the wider adoption of related work.

The technical scheme of the invention provides a standardized word bank and a construction method for structuring single-disease diagnosis data in medical insurance databases in China. It thereby provides technical support for using the diagnosis information in medical insurance data, can greatly improve the efficiency of data processing and utilization, and can accelerate the translation and application of medical big data. In practice, for example, the method can be used to rapidly structure and classify by disease the diagnosis information in hospital electronic medical records, medical insurance data and regional health data. On this basis, the structured big data can be used for descriptive epidemiological analyses such as the incidence, mortality and case fatality of diseases, and for case-control studies, long-term large-scale cohort studies, randomized controlled trials and other questions in analytical and experimental epidemiology. In addition, because patients and non-patients in a medical big data system can be identified, the data can also be used for economic analyses of, for example, medical expenses. In short, rapid structuring and standardization of the diagnostic information in medical big data is the basis for almost all such research, and the method of the invention provides technical support for combining big data with research in traditional epidemiology, medical policy and health economics.

Drawings

Fig. 1 is a flow chart of the implementation of the multiple myeloma diagnosis information word bank generation method based on medical insurance data in China.

Detailed Description

The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.

As shown in fig. 1, the process for generating the multiple myeloma diagnosis information word bank based on medical insurance data of the present invention specifically includes the following operations:

A1. extracting fields containing diagnostic information from the medical insurance database:

the medical insurance database mainly contains six variables that carry diagnosis information: primary diagnostic name, primary diagnostic code, secondary diagnostic name 1, secondary diagnostic code 1, secondary diagnostic name 2 and secondary diagnostic code 2. The six diagnostic variables are queried, and a record is extracted if any of them matches any of the multiple myeloma diagnostic keywords (see Table 1). An example of the extracted primary diagnostic names is shown in Table 2.

TABLE 1 Diagnostic descriptions and ICD codes for multiple myeloma

Diagnostic description	ICD-9	ICD-10	ICD-O-3
Multiple myeloma
Osteomyelitis disease	203.0	C90.051	M97320/3
Plasma cell myeloma	203.0	C90.002	M97320/3
Multiple myeloma	203.0	C90.001	M97320/3
Myeloma nephropathy	203.0	C90.003+	M97320/3

TABLE 2 variable listing of "Primary diagnosis names" in the medical insurance database
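Step A1 can be sketched in Python as a filter over records. The field names in `DIAGNOSIS_FIELDS` are hypothetical stand-ins for the six diagnostic variables, the keyword list is a subset of the extraction fields named in the description, and case-insensitive substring matching is an assumption about how a keyword is "satisfied".

```python
# Hypothetical column names for the six diagnostic variables.
DIAGNOSIS_FIELDS = [
    "primary_diagnosis_name", "primary_diagnosis_code",
    "secondary_diagnosis_name_1", "secondary_diagnosis_code_1",
    "secondary_diagnosis_name_2", "secondary_diagnosis_code_2",
]

# A subset of the broad extraction keywords listed in the description.
KEYWORDS = ["myeloma", "bone marrow cancer", "myelopathy",
            "203.0", "C90.0", "M97320/3"]


def is_candidate(record, keywords=KEYWORDS, fields=DIAGNOSIS_FIELDS):
    """True if any diagnostic field contains any keyword (case-insensitive)."""
    for field in fields:
        value = str(record.get(field) or "").lower()
        if any(keyword.lower() in value for keyword in keywords):
            return True
    return False


def extract_candidates(records):
    """Keep every record in which at least one diagnostic field matches."""
    return [record for record in records if is_candidate(record)]
```

Matching on the code prefix "C90.0" also catches extended codes such as C90.001, which fits the stated aim of using the broadest extraction fields.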

A2. Word segmentation:

the extracted diagnostic text is "cut" at a series of separators such as ",", "/" and "\", so that a long string of diagnostic text is divided into several short vocabulary texts. An example of the results is shown in Table 3.

TABLE 3 list of word segmentation results
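The separator-based cutting in step A2 can be sketched with a regular expression. The comma, slash and backslash are named in the description; the full-width comma, the semicolons and the enumeration comma "、" are assumptions added because such marks are common in Chinese free-text fields.

```python
import re

# Separators: , / \ from the description; the others are assumed extras.
SEPARATORS = re.compile(r"[,，;；、/\\]+")


def split_diagnosis(text):
    """Cut one diagnostic string into short vocabulary texts."""
    return [part.strip() for part in SEPARATORS.split(text) if part.strip()]
```

For example, `split_diagnosis("multiple myeloma, hypertension/diabetes")` yields three short texts, one per diagnosis.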

A conditional random field (CRF) is then used to label the part of speech of the segments of the long text sentences to obtain labeled segmented sentences, and a mature open-source package is used to complete the part-of-speech classification.

The feature function in the CRF accepts four parameters:

the sentence s (that is, the sentence whose parts of speech are to be labeled);

i, the position of the i-th word in sentence s;

l_i, the part of speech that the label sequence being scored assigns to the i-th word;

l_{i-1}, the part of speech that the label sequence being scored assigns to the (i-1)-th word.

Its output value is 0 or 1: 0 indicates that the label sequence being scored does not match the feature, and 1 indicates that it does. After a set of feature functions has been defined, each feature function f_j is assigned a weight λ_j. Then, given a sentence s and a label sequence l, l can be scored with the set of feature functions defined above:

$$\mathrm{score}(l \mid s) = \sum_{j=1}^{m} \sum_{i=1}^{n} \lambda_j \, f_j(s, i, l_i, l_{i-1})$$

There are two summations in the above equation: the outer one sums over the feature functions f_j (j = 1, ..., m), and the inner one sums over the word positions in the sentence (i = 1, ..., n).

Exponentiating and normalizing this score gives the probability p(l | s) of the label sequence l, where the sum in the denominator runs over all possible label sequences l':

$$p(l \mid s) = \frac{\exp\bigl(\mathrm{score}(l \mid s)\bigr)}{\sum_{l'} \exp\bigl(\mathrm{score}(l' \mid s)\bigr)}$$
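The scoring and normalization can be made concrete with a toy Python sketch. It enumerates every label sequence by brute force, which is only feasible for very short sentences; practical CRF packages compute the normalizer with dynamic programming. The two feature functions, their weights and the tag names "a" and "n" are invented for illustration and are not taken from the patent.

```python
import math
from itertools import product


def score(sentence, labels, features, weights):
    """Weighted sum of feature values over all features and positions."""
    total = 0.0
    for feature, weight in zip(features, weights):   # outer sum: features
        for i in range(len(sentence)):               # inner sum: positions
            previous = labels[i - 1] if i > 0 else None
            total += weight * feature(sentence, i, labels[i], previous)
    return total


def probability(sentence, labels, tagset, features, weights):
    """p(l | s): exponentiated score divided by the sum over all sequences."""
    normalizer = sum(
        math.exp(score(sentence, candidate, features, weights))
        for candidate in product(tagset, repeat=len(sentence))
    )
    return math.exp(score(sentence, labels, features, weights)) / normalizer


# Hypothetical feature functions; each returns 0 or 1.
def f_oma_is_noun(s, i, label, previous):
    return 1 if s[i].endswith("oma") and label == "n" else 0


def f_adjective_then_noun(s, i, label, previous):
    return 1 if previous == "a" and label == "n" else 0


FEATURES = [f_oma_is_noun, f_adjective_then_noun]
WEIGHTS = [1.0, 0.5]
```

With these toy features, the labeling ("a", "n") of ["multiple", "myeloma"] satisfies both features and therefore receives a higher probability than ("n", "a"), which satisfies neither.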

After the above steps are finished, the resulting words are trained into word vectors for subsequent use.

A3. Words related to the diagnosis of myeloma are extracted from the segmented short vocabulary texts:

the extraction is based on the disease names and ICD codes of multiple myeloma. The many forms in which the text and the ICD codes may be expressed are fully considered; see Table 1 for details.

Word vectors are trained on the segmented diagnostic text with the existing gensim package. Similar words are first screened and matched using the edit distance, selecting the words that are literally most similar; the screened results are then screened a second time, using the cosine distance to calculate the relevance between words; the optimal similar words are finally obtained through this combined distance calculation. The specific steps are as follows:

First, the edit distance is used to rank the texts in ascending order by literal similarity to "myeloma", and the ranked texts are cut into corresponding word sets according to step A2. The edit distance is mainly used to calculate the similarity of two character strings and is defined as follows:

given character strings A and B, where B is the pattern string, three operations are allowed: deleting a character from the string, inserting a character into the string, and replacing a character in the string. The minimum number of such operations required to edit string A into pattern string B is called the shortest edit distance of A and B, denoted ED(A, B). The algorithm for solving the shortest edit distance is described as follows: a two-dimensional array ED[i][j] represents the minimum number of operations required to edit the first i characters of string A into the first j characters of string B. The recurrence for ED[i][j] is:

1) ED[i][0] = i and ED[0][j] = j, where 0 ≤ i ≤ A.len and 0 ≤ j ≤ B.len;

2) if A[i] = B[j], then ED[i][j] = ED[i-1][j-1];

3) if A[i] ≠ B[j], then ED[i][j] = min(ED[i-1][j-1], ED[i-1][j], ED[i][j-1]) + 1, where the three terms correspond to replacement, deletion and insertion, respectively.

Second, the relevance between words is calculated on the results obtained, using the cosine distance. A threshold is set (in this implementation the threshold was set to 0.4 after repeated tuning, at which matching speed and accuracy are relatively optimal); if the similarity is smaller than the threshold, the relevance is set to 0 and the words are regarded as unrelated. The distances of the related words are summed and sorted in ascending order to obtain the similar words of next priority. Cosine similarity measures the similarity between two texts by the cosine of the angle between their vectors in a vector space; compared with distance measures, it emphasizes the difference in direction between the two vectors. In general, once vector representations of two texts have been obtained by embedding, cosine similarity can be used to calculate the similarity between them. The calculation formula is as follows:

$$\cos(X, Y) = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert} = \frac{\sum_{k=1}^{n} x_k y_k}{\sqrt{\sum_{k=1}^{n} x_k^{2}} \, \sqrt{\sum_{k=1}^{n} y_k^{2}}}$$

wherein X = (x_1, ..., x_n) is the query word vector and Y = (y_1, ..., y_n) is the matched word vector in the library.

An example of the results of the first extraction is shown in Table 4.

TABLE 4 example of computer extraction and human review results
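The two-stage screening of step A3 can be sketched end to end in Python. This is one possible reading of the combined-distance procedure, not the patented implementation: stage 1 keeps the `keep` words with the smallest edit distance to the target, and stage 2 drops those whose cosine similarity to the target vector is below the threshold. The `vectors` mapping stands in for word vectors that would in practice come from a trained embedding model; the function names and the `keep` parameter are assumptions.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or replacement
        prev = cur
    return prev[-1]


def cosine(x, y):
    dot = sum(p * q for p, q in zip(x, y))
    norm = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
    return dot / norm if norm else 0.0


def two_stage_screen(target, vectors, keep=3, threshold=0.4):
    """Return (word, similarity) pairs that survive both screening stages."""
    words = [word for word in vectors if word != target]
    # Stage 1: literal similarity, ascending edit distance.
    shortlist = sorted(words, key=lambda w: edit_distance(w, target))[:keep]
    # Stage 2: vector similarity, discarding values below the threshold.
    scored = []
    for word in shortlist:
        similarity = cosine(vectors[word], vectors[target])
        if similarity >= threshold:
            scored.append((word, similarity))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The second stage is what removes words that look like the disease name but are used in unrelated contexts, which a purely literal ranking cannot detect.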

A4. Repeated manual checking and computer re-extraction:

the computer extracts the words that match the given name or code of the target disease, and the words screened by the computer are then submitted to professionals for further manual review and labeling. Each word is given a label such as "multiple myeloma", "suspected multiple myeloma" or "not multiple myeloma". The words labeled "multiple myeloma" form the inclusion word bank of the dictionary, and the words labeled "not multiple myeloma" form the exclusion word bank (see Table 5 for examples). The inclusion and exclusion word banks are then used to judge whether each patient in the medical insurance database is a multiple myeloma patient. As a result, some patients are definitely judged to be multiple myeloma patients (their number is denoted a), some are judged to be non-multiple myeloma patients (denoted b), and the rest still need further judgment (denoted c). When a/(a + c) < 95%, the above steps are repeated: the diagnosis information of the patients who could not be classified is browsed manually, diagnostic words found there are added to enrich the word bank used for computer extraction, and the new word bank is passed back to the computer for another round of extraction. This is repeated until a/(a + c) reaches 95%. An example of a manual check is shown in Table 3.

TABLE 5 Examples from the inclusion and exclusion word banks

Word submitted for manual check	Manual judgment
Bone marrow cancer	Myeloma cell
Marrow ca	Myeloma cell
Endothelial cell myeloma	Excluded
Primary myeloma	Excluded
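The iteration in step A4 can be sketched as a loop that alternates computer classification with manual review. The `classify` and `review` callbacks are hypothetical: `classify` labels a patient as "target", "non-target" or "undetermined" from the current word banks, and `review` stands in for the professional who inspects the undetermined records and returns new inclusion and exclusion words. The `max_rounds` guard is an assumption added so the sketch always terminates.

```python
def refine_word_banks(patients, inclusion, exclusion, classify, review,
                      target=0.95, max_rounds=20):
    """Repeat classification and manual review until a / (a + c) >= target.

    patients maps a patient id to that patient's segmented diagnostic terms.
    Returns (inclusion, exclusion, rounds_used, final_ratio).
    """
    inclusion, exclusion = set(inclusion), set(exclusion)
    ratio = 0.0
    for round_no in range(1, max_rounds + 1):
        labels = {pid: classify(terms, inclusion, exclusion)
                  for pid, terms in patients.items()}
        a = sum(1 for label in labels.values() if label == "target")
        c = sum(1 for label in labels.values() if label == "undetermined")
        ratio = a / (a + c) if (a + c) else 1.0
        if ratio >= target:
            break
        # Hand the unresolved records to the reviewer for new words.
        pending = {pid: patients[pid]
                   for pid, label in labels.items() if label == "undetermined"}
        new_inclusion, new_exclusion = review(pending)
        if not (new_inclusion or new_exclusion):
            break  # the reviewer found nothing new; stop early
        inclusion |= set(new_inclusion)
        exclusion |= set(new_exclusion)
    return inclusion, exclusion, round_no, ratio
```

Because the word banks only grow between rounds, each pass leaves fewer undetermined records for the reviewer.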

It is noted that the disclosed embodiments are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various substitutions and modifications are possible without departing from the spirit and scope of the invention and appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.

Claims

1. A method for constructing a single-disease word bank by rapidly structuring single-disease diagnosis information based on medical insurance data, characterized in that the redundant text of the diagnosis information in the medical insurance data is structured to construct the single-disease word bank; the method comprises the following steps:

1) extracting fields containing diagnostic information from a medical insurance database; segmenting the unstructured diagnostic information text into a plurality of words; marking the part of speech of the vocabulary obtained by segmentation to obtain a marked word set;

2) performing part-of-speech classification on the labeled word set; training the obtained words into word vectors; specifically, the gensim package is used to train word vectors on the segmented diagnostic text;

the method comprises first using the edit distance to screen and match similar words, selecting the words that are literally most similar; then screening the screened results a second time, using the cosine distance to calculate the relevance among the words; and finally obtaining the optimal similar words through the combined calculation of the edit distance and the cosine distance; the specific steps are as follows:

first, using the edit distance to rank the texts in ascending order by literal similarity to myeloma, and cutting them into corresponding word sets;

the edit distance is used to calculate the similarity of two character strings and is defined as follows:

given character strings A and B, where B is the pattern string, the following operations are allowed: deleting a character from string A; inserting a character into string A; replacing a character in string A; the minimum number of such operations required to edit string A into pattern string B is called the shortest edit distance of A and B and is denoted ED(A, B);

the algorithm for solving the shortest edit distance is described as follows: a two-dimensional array ED[i][j] represents the minimum number of operations required to edit the first i characters of string A into the first j characters of pattern string B; the recurrence for ED[i][j] is:

21) ED[i][0] = i and ED[0][j] = j, where 0 ≤ i ≤ A.len and 0 ≤ j ≤ B.len;

22) if A[i] = B[j], then ED[i][j] = ED[i-1][j-1];

23) if A[i] ≠ B[j], then ED[i][j] = min(ED[i-1][j-1], ED[i-1][j], ED[i][j-1]) + 1;

the smaller the edit distance, the more similar the two character strings; the larger the edit distance, the less similar;

second, using the cosine distance to obtain the relevance between the words from the result obtained; a threshold is set, and if the result is smaller than the threshold, the relevance is 0, indicating no relevance; the distances of the related words are summed and sorted in ascending order to obtain the similar words of second priority;

cosine similarity measures the similarity between two texts by the cosine of the angle between two vectors in a vector space; after vector representations of the two texts are obtained by embedding, the similarity between the two texts is calculated using cosine similarity; the calculation formula is as follows:

$$\cos(X, Y) = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert} = \frac{\sum_{k=1}^{n} x_k y_k}{\sqrt{\sum_{k=1}^{n} x_k^{2}} \, \sqrt{\sum_{k=1}^{n} y_k^{2}}}$$

wherein X = (x_1, ..., x_n) is a query word vector and Y = (y_1, ..., y_n) is a matched word vector in the library;

3) using the edit distance of step 2) to rank, in ascending order, the diagnostic information texts in the medical insurance database by literal similarity to the single disease, and cutting them into corresponding word sets according to step 1);

4) using the cosine distance of step 2) to compute the cosine similarity over the word sets obtained in step 3), to represent the relevance among the words;

setting the threshold to 0.4, a cosine similarity smaller than the threshold indicating no relevance;

summing the distances of the related words and sorting them in ascending order to obtain the similar words of second priority;

5) through the calculation of the edit distance and the cosine similarity, obtaining the word list most similar to the standard expression of the disease, which serves as the standard word list;

6) according to the standard word list, professional personnel perform computer-aided manual check and repeat for many times; the execution operation is as follows:

61) formulating the broadest extraction field according to the clinical name of the target disease; extracting all potential target disease patients;

62) starting matching with computer-assisted screening, and carrying out computer-assisted manual checking on each round of matching results; the checking method is as follows: a patient is identified as a target disease patient or a non-target disease patient according to condition 62A or 62B;

62A) every confirmed target disease patient must have a definite diagnosis entry, the diagnosis entry matching the single-disease word bank;

62B) every confirmed non-target disease patient must match a definite exclusion entry, the exclusion entry matching the non-target-disease exclusion word bank;

63) performing the operations of step 62) repeatedly until all records are clearly classified;

7) when the accuracy rate of the standard word list of the current version is lower than 95% through manual check, updating the standard word list; the following operations are performed:

71) adding the new words obtained by the current manual check into the list to generate a standard word list of a new version;

72) recalculating the edit distance and the cosine similarity, and then performing the next round of manual checking;

73) stopping updating until the manual checking result shows that the accuracy rate of the word list reaches 95%;

through the steps, the single-disease diagnosis information based on the medical insurance data is quickly structured, and the diagnosis information in the medical insurance data is efficiently utilized.

2. The method of claim 1, wherein the single disease is selected from the group consisting of multiple myeloma, amyotrophic lateral sclerosis, albinism, Alport syndrome, and autoimmune encephalitis.

3. The method for rapidly constructing the word stock of the single disease species based on the single disease species diagnostic information of the medical insurance data as claimed in claim 1, wherein the part of speech of the segmented words of the long text sentences is labeled by using conditional random field CRF in step 1).

4. The method for rapidly structuring single-disease diagnosis information based on medical insurance data as claimed in claim 1, wherein in step 61) the clinical name of the target disease is multiple myeloma, and the extraction fields comprise: "myeloma", "Kahler", "bone marrow cancer", "myelopathy", "203.0", "C90.0", "M97320/3".