CN109344250B - Rapid structuring method of single disease diagnosis information based on medical insurance data - Google Patents

Info

Publication number
CN109344250B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811045058.2A
Other languages
Chinese (zh)
Other versions
CN109344250A (en
Inventor
王胜锋
詹思延
许璐
冯菁楠
刘国臻
高培
王金喜
尉晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H70/00 ICT specially adapted for the handling or processing of medical references

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Primary Health Care (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Epidemiology (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses a method for rapidly structuring single-disease diagnosis information based on medical insurance data, which structures the diagnosis information in medical big data and constructs a single-disease word bank. The method comprises the following steps: extracting diagnostic information from a medical insurance database; segmenting the unstructured text into a plurality of vocabulary texts; tagging the part of speech of the vocabulary texts; training word vectors; sorting the texts in ascending order of edit distance and cutting them into corresponding word sets; using the cosine distance to obtain the relevance between words; obtaining the word list most similar to the standard expression of the disease as a standard word list; and having professionals perform computer-assisted manual checks, repeated several times. The method enables personalized, rapid structuring of single-disease diagnosis text data, provides technical support for making full and efficient use of the diagnosis information in medical insurance data, can greatly improve the efficiency of data processing and utilization, and accelerates the adoption of medical big data.

Description

Rapid structuring method of single disease diagnosis information based on medical insurance data
Technical Field
The invention provides a rapid structuring method of diagnosis data/information about a single disease category in a medical insurance database, belonging to the technical field of medical text processing.
Background
Medical insurance data (claims data) are generated in the course of medical insurance business. They are huge in volume, cover large populations, and completely and faithfully record the treatment information, reimbursement records and so on of those populations within a given time range.
Currently, more and more workers in the medical field are trying to process and apply medical big data. For example, the Sentinel Initiative of the U.S. Food and Drug Administration (FDA) adopts a Common Data Model (CDM) to process data from different sources in a unified, standardized way, so that drug evaluation can be carried out through active surveillance.
However, medical big data contain a large amount of unstructured text (mainly diagnostic information); for example, the same disease may be expressed in many different ways. Most of these expressions are not standard enough, and some even contain wrongly written characters. These problems make it very difficult to use medical big data such as medical insurance data, regional data and electronic medical records, and leave a large amount of data idle.
Existing methods for similarity retrieval over multiple types of Chinese medical records combine statistical learning and deep learning; their key points are classifying diseases and symptoms correctly and computing distances between long-text vectors. The conventional schemes and their disadvantages mainly include the following:
(1) A Chinese word segmentation tool is applied directly to the long text, long-text vectors are computed, and the distance between vectors is used directly to find similar long-text medical records, with similarity calculated from dependency syntactic analysis. The disadvantage is that the long texts obtained are not truly similar in meaning.
(2) Dictionary-based Chinese word segmentation and named entity recognition tools are applied directly, long-text vectors are computed, and a combined distance method is then used to find similar long texts.
(3) Each word in a sentence is assigned the correct lexical label and a category, that is, part-of-speech tagging is performed; long-text vectors are then computed directly and the similarity of long texts is calculated with a single distance. The similar texts obtained cannot meet doctors' requirements.
The purpose of structuring medical insurance data is to promote medical research, which usually starts from one or a few diseases. The traditional structuring process, which targets the whole database and all disease types, is broad but coarse, and lacks a more personalized, fine-grained and rapid medical text processing technique aimed at a single disease type.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a method for rapidly structuring diagnosis data of a single disease (such as multiple myeloma, amyotrophic lateral sclerosis, albinism, Alport syndrome or autoimmune encephalitis) based on medical insurance data. The method enables personalized, rapid structuring of single-disease diagnosis text data, provides technical support for making full and efficient use of the diagnosis information in medical insurance data, greatly improves the efficiency of data processing and utilization, and accelerates the adoption of medical big data. The structured data can support descriptive epidemiological estimates such as disease incidence, mortality and case fatality, as well as case-control analyses, long-term large-scale cohort analyses, randomized controlled trials and other analyses in analytical and experimental epidemiology.
To address the unstructured nature of disease diagnosis text and coding data (for example, for multiple myeloma) in current medical insurance databases, the invention departs from the large, all-encompassing matching approach used to build earlier field databases. It uses a strategy centered on a single disease: the broadest extraction fields are used to bring all potential target-disease patients into the scope of structuring, and computer-assisted screening and matching are then used to rapidly structure, in a personalized way, all diagnosis texts related to the single disease (such as multiple myeloma) in the medical insurance database. The screening process makes full use of the computer's data-processing capability and follows two rules: every confirmed target patient must have a definite diagnosis entry, and every confirmed non-target patient must match a definite exclusion entry. The process is repeated until every record is clearly classified. The work simultaneously produces a final single-disease word bank and a final non-target-disease exclusion word bank. These word banks are highly portable: when the method is applied to another medical insurance database, data-processing software does most of the matching, and only the small share of records that cannot be fully matched needs manual matching. The work is goal-directed, greatly improves efficiency through human-machine cooperation, provides technical support for analysis and application work that uses the diagnosis information, and facilitates the rapid structuring of other single diseases or groups of diseases.
The technical scheme provided by the invention is as follows:
A method for rapidly structuring single-disease diagnosis information based on medical insurance data, in which the redundant text of the diagnosis information in medical big data is structured to construct a single-disease word bank (such as a multiple myeloma word bank). The method comprises the following steps:
1) extracting fields containing diagnosis information from a medical insurance database; segmenting the unstructured diagnostic information text into a plurality of vocabulary texts;
using a Conditional Random Field (CRF) to tag the part of speech of the segmented long-text sentences, obtaining tagged segmented sentences;
2) classifying the tagged segmented sentences by part of speech;
this step is carried out with existing, mature open-source packages; the resulting words are then trained into word vectors;
3) using the edit distance to rank the diagnosis texts in the original database in ascending order, so that those literally most similar to the single disease (such as multiple myeloma) come first, and dividing them into corresponding word sets according to step 1);
The edit distance is mainly used to measure the similarity of two character strings and is defined as follows. Given a character string A and a pattern string B, three operations are allowed: deleting a character from the string, inserting a character into the string, and replacing a character in the string. The minimum number of these operations required to edit string A into pattern string B is called the shortest edit distance of A and B, denoted ED(A, B). The algorithm for the shortest edit distance uses a two-dimensional array ED[i][j] to represent the minimum number of operations required to edit the first i characters of string A into the first j characters of string B. The recurrence for ED[i][j] is:
31) ED[i][0] = i, ED[0][j] = j, where 0 ≤ i ≤ A.len and 0 ≤ j ≤ B.len;
32) if A[i] = B[j], ED[i][j] = ED[i-1][j-1];
33) if A[i] ≠ B[j], ED[i][j] = min(ED[i-1][j-1], ED[i-1][j]) + 1.
The smaller the edit distance, the more similar the two strings; the larger the distance, the less similar.
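As a non-limiting illustration, the edit-distance calculation can be sketched in Python as follows. The function names `edit_distance` and `rank_by_edit_distance` are assumptions introduced here and do not appear in the original text. The sketch takes the minimum over all three neighbouring cells (replace, delete, insert), which is the standard Levenshtein recurrence covering the three operations defined above, whereas the recurrence as printed lists only two of them.

```python
def edit_distance(a, b):
    """Minimum number of single-character deletions, insertions and
    replacements needed to edit string a into pattern string b."""
    # ed[i][j]: cost of editing the first i characters of a
    # into the first j characters of b.
    ed = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        ed[i][0] = i                      # delete all i characters
    for j in range(len(b) + 1):
        ed[0][j] = j                      # insert all j characters
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:      # strings are 0-indexed in Python
                ed[i][j] = ed[i - 1][j - 1]
            else:
                ed[i][j] = 1 + min(
                    ed[i - 1][j - 1],     # replace
                    ed[i - 1][j],         # delete from a
                    ed[i][j - 1],         # insert into a
                )
    return ed[len(a)][len(b)]


def rank_by_edit_distance(target, candidates):
    """Sort candidates in ascending order of edit distance to the target."""
    scored = [(word, edit_distance(word, target)) for word in candidates]
    return sorted(scored, key=lambda pair: pair[1])
```

Ranking in ascending order places the candidates that are literally closest to the target disease name at the top of the list, which corresponds to the ordering described in step 3).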
4) Using the cosine distance, the cosine similarity is computed for the word sets obtained in step 3) to represent the relevance between words. A threshold is set; if the similarity is smaller than the threshold, the relevance is set to 0, that is, the words are considered unrelated. The distances of the associated words are added and sorted in ascending order to obtain the similar words with the second priority. Cosine distance (cosine similarity) measures the similarity between two texts by the cosine of the angle between two vectors in a vector space; compared with distance measures, cosine similarity emphasizes the difference in direction between the two vectors. The calculation formula is as follows:
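$$\cos(X, Y) = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}}$$
where X is the query word vector, Y is the matched word vector in the library, and x_i and y_i are their components.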
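As a non-limiting illustration, the cosine similarity and the thresholding rule can be sketched in Python as follows. The names `cosine_similarity` and `relevance` are assumptions introduced here; the default threshold of 0.4 is the value this description reports for the national medical insurance database; returning 0 for a zero vector is also an assumption.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumption: a zero vector is treated as unrelated
    return dot / (norm_x * norm_y)


def relevance(x, y, threshold=0.4):
    """Cosine similarity, mapped to 0 (no relevance) below the threshold."""
    similarity = cosine_similarity(x, y)
    return similarity if similarity >= threshold else 0.0
```

Because only the angle between the vectors matters, two word vectors of very different lengths still score 1 when they point in the same direction.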
Through the calculation of the edit distance and the similarity, the word list most similar to the standard expression of the disease (multiple myeloma) is obtained and provided to a doctor for manual checking as the standard word list.
The core of the discrimination technique lies in choosing the identification threshold. For this method, candidate values between 0.1 and 0.9 were tried at intervals of 0.05. Combining the results of these trials and referring to two indicators, matching speed and accuracy (over 95%), the optimal threshold for the national medical insurance database was set to 0.4.
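The threshold search can be sketched as follows, as an illustration only. The grid of 0.1 to 0.9 in steps of 0.05 comes from the text; the function names, the manually labelled `(similarity, is_related)` pairs and the use of a plain hit rate as the selection criterion are assumptions. The description also weighs matching speed, which this sketch does not model.

```python
def threshold_grid(start=0.10, stop=0.90, step=0.05):
    """Candidate thresholds from start to stop inclusive, rounded to 2 decimals."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + k * step, 2) for k in range(count)]


def hit_rate(threshold, scored_pairs):
    """Share of pairs whose thresholded decision matches the manual label.

    scored_pairs: list of (cosine_similarity, is_related) tuples, where
    is_related is a manually assigned boolean (an assumed input format).
    """
    hits = sum((sim >= threshold) == label for sim, label in scored_pairs)
    return hits / len(scored_pairs)


def pick_threshold(scored_pairs, thresholds):
    """Return the threshold with the highest hit rate; the lowest wins ties."""
    pairs = list(scored_pairs)
    return max(thresholds, key=lambda t: (hit_rate(t, pairs), -t))
```

Generating the grid with integer steps and rounding avoids the accumulated floating-point drift that repeated addition of 0.05 would introduce.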
5) Computer-assisted manual checking is repeated several times. The specific operation is as follows:
Based on the clinical name of the target disease (e.g., multiple myeloma), the broadest extraction fields are defined, such as "myeloma", "Kahler", "bone marrow cancer", "myelopathy", "203.0", "C90.0" and "M97320/3", and all potential target-disease patients are extracted. Matching then begins with computer-assisted screening, and each matching result is checked manually with computer assistance. The checking principle is that every confirmed target patient must have a definite diagnosis entry (matching the standard word list, that is, the inclusion word bank) and every confirmed non-target patient must match a definite exclusion entry (matching the non-target-disease exclusion word bank, which consists of the words judged manually during checking to be non-multiple-myeloma diagnoses). The process is repeated until every record is clearly classified. Whenever a manual check shows that the accuracy of the current version of the standard word list is below 95%, the new words found in the check are added to produce a new version of the standard word list, the edit distance and similarity are calculated again, and the next round of manual checking is performed. Iteration stops when the manual check shows that the accuracy of the word list has reached 95%.
The accuracy is defined as follows. The inclusion and exclusion word banks are used to judge whether each patient in the medical insurance database is a multiple myeloma patient. Some patients are definitely judged to be multiple myeloma patients, and their number is denoted a; some are judged to be non-multiple-myeloma patients, and their number is denoted b; the remaining patients still need further judgment, and their number is denoted c. The accuracy is a/(a + c).
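As an illustration, the accuracy a/(a + c) can be computed along the following lines. The function names and the decision rules are assumptions: a patient counts as a target patient when any diagnosis term is in the inclusion word bank, and as a non-target patient only when every term is in the exclusion word bank. The text does not spell out how mixed cases are handled.

```python
def classify(diagnosis_terms, inclusion_bank, exclusion_bank):
    """Return 'target', 'non_target' or 'undetermined' for one patient."""
    terms = set(diagnosis_terms)
    if terms & inclusion_bank:
        return "target"            # has a definite diagnosis entry
    if terms and terms <= exclusion_bank:
        return "non_target"        # every entry is a definite exclusion entry
    return "undetermined"          # still needs manual judgment


def accuracy(patients, inclusion_bank, exclusion_bank):
    """Compute a / (a + c); patients is a list of lists of diagnosis terms."""
    labels = [classify(terms, inclusion_bank, exclusion_bank) for terms in patients]
    a = labels.count("target")
    c = labels.count("undetermined")
    return a / (a + c) if (a + c) else 0.0
```

Note that b, the number of confirmed non-target patients, does not enter the ratio, so the measure reflects how many of the not-yet-excluded patients have been positively resolved.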
The invention has the beneficial effects that:
The invention provides a rapid structuring method for the diagnostic information of a single disease in a medical insurance database. Unlike the earlier purely manual screening and matching, it uses a computer to assist screening and matching, so that human-machine cooperation improves efficiency. First, a Conditional Random Field (CRF) is used to tag the part of speech of the segmented long-text sentences, yielding tagged segmented sentences; part-of-speech classification is then performed, and the resulting words are trained into word vectors. The edit distance is used to rank texts in ascending order, those literally most similar to myeloma first, and the texts are cut into corresponding word sets. The cosine distance is used to obtain the relevance between words; the distances of the associated words are added and sorted in ascending order to obtain the similar words with the second priority. In the invention, the way the distance algorithms are combined and the internal threshold were tuned through repeated trials, giving better results. The threshold value of 0.4 can serve as a direct parameter reference for future text standardization based on medical insurance data and facilitates the spread of related work.
The technical scheme of the invention provides a standardized word bank and a construction method for structuring single-disease diagnosis data in medical insurance databases in China. It thereby provides technical support for using the diagnosis information in medical insurance data, can greatly improve the efficiency of data processing and utilization, and can accelerate the adoption of medical big data. In practice, the method can be used, for example, to rapidly structure and classify the diagnosis information in hospital electronic medical records, medical insurance data and regional health data. On that basis, the structured big data can be used for descriptive epidemiological analysis and calculation, such as disease incidence, mortality and case fatality, and also for case-control studies, long-term large-scale cohort studies, randomized controlled trials and other work in analytical and experimental epidemiology. In addition, because patient and non-patient populations in the medical big data system can be identified, the data can be used for economic analyses of medical expenses and the like. In short, rapid structuring and standardization of the diagnostic information in medical big data is the basis for almost all research work, and the method of the invention provides technical support for combining research in traditional epidemiology, medical policy and health economics with big data.
Drawings
Fig. 1 is a flow chart of the implementation of the multiple myeloma diagnosis information word bank generation method based on medical insurance data in China.
Detailed Description
The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.
As shown in fig. 1, the process for generating the multiple myeloma diagnosis information word bank based on medical insurance data of the present invention specifically includes the following operations:
A1. extracting fields containing diagnostic information from the medical insurance database:
The medical insurance database mainly contains six variables carrying diagnosis information: primary diagnosis name, primary diagnosis code, secondary diagnosis name 1, secondary diagnosis code 1, secondary diagnosis name 2 and secondary diagnosis code 2. The six diagnostic variables are queried, and a record is extracted if any of them matches any of the multiple myeloma diagnostic keywords (see Table 1). An example of the extracted primary diagnosis names is shown in Table 2.
Table 1. Diagnostic descriptions and ICD codes for multiple myeloma
Diagnostic description    ICD-9    ICD-10      ICD-O-3
Multiple myeloma
Osteomyelitis disease     203.0    C90.051     M97320/3
Plasma cell myeloma       203.0    C90.002     M97320/3
Multiple myeloma          203.0    C90.001     M97320/3
Myeloma nephropathy       203.0    C90.003+    M97320/3
Table 2. Values of the "primary diagnosis name" variable in the medical insurance database
[Table 2 is available only as images in the original publication: BDA0001793090970000051, BDA0001793090970000061.]
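Step A1 can be sketched in Python as follows, as an illustration only. The field names in `DIAGNOSIS_FIELDS` are hypothetical stand-ins for the six diagnostic variables, the record format (one dictionary per record) is an assumption, and case-insensitive substring matching is an assumed matching rule. The keywords are taken from Table 1 and from the extraction fields named in this description.

```python
# Hypothetical column names for the six diagnostic variables.
DIAGNOSIS_FIELDS = (
    "primary_diagnosis_name", "primary_diagnosis_code",
    "secondary_diagnosis_name_1", "secondary_diagnosis_code_1",
    "secondary_diagnosis_name_2", "secondary_diagnosis_code_2",
)

# Keywords drawn from the disease names and ICD codes in the text.
KEYWORDS = ("myeloma", "203.0", "C90.0", "M97320/3")


def matches_keyword(record, keywords=KEYWORDS, fields=DIAGNOSIS_FIELDS):
    """True if any diagnostic field contains any keyword (case-insensitive)."""
    for field in fields:
        value = str(record.get(field) or "").lower()
        if any(keyword.lower() in value for keyword in keywords):
            return True
    return False


def extract_records(records, keywords=KEYWORDS, fields=DIAGNOSIS_FIELDS):
    """Keep every record in which at least one diagnostic field matches."""
    return [r for r in records if matches_keyword(r, keywords, fields)]
```

Substring matching is deliberately broad, in line with the strategy of using the broadest extraction fields first and leaving false positives to the later manual check.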
A2. Word segmentation:
The extracted diagnostic information text is "cut" at a series of separators such as ",", "/" and "\\", dividing a long string of diagnostic text into a plurality of short vocabulary texts. An example of the results is shown in Table 3.
Table 3. Word segmentation results
[Table 3 is available only as images in the original publication: BDA0001793090970000062, BDA0001793090970000071.]
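The separator-based cut of step A2 can be sketched as follows, as an illustration only. The comma, slash and backslash come from the text; the semicolon and the full-width punctuation in `SEPARATOR_PATTERN` are assumed additions, and the function name `segment` is hypothetical.

```python
import re

# Comma, slash and backslash are named in the text; the semicolon and the
# full-width variants are assumptions added for Chinese-language input.
SEPARATOR_PATTERN = re.compile(r"[,/\\;，；、]+")


def segment(text):
    """Split a diagnosis string at separators and drop empty pieces."""
    pieces = SEPARATOR_PATTERN.split(text)
    return [piece.strip() for piece in pieces if piece.strip()]
```

Runs of consecutive separators are treated as a single cut, so inputs such as "a,,b" do not produce empty vocabulary texts.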
A Conditional Random Field (CRF) is used to tag the part of speech of the segmented long-text sentences, yielding tagged segmented sentences; a mature open-source package is then used to complete part-of-speech classification.
A feature function in the CRF accepts four parameters:
the sentence s (that is, the sentence whose parts of speech are to be tagged);
i, the position of the i-th word in sentence s;
l_i, the part of speech that the tag sequence being scored assigns to the i-th word;
l_{i-1}, the part of speech that the tag sequence being scored assigns to the (i-1)-th word.
Its output value is 0 or 1: 0 indicates that the tag sequence being scored does not match the feature, and 1 indicates that it does. After a set of feature functions has been defined, each feature function f_j is assigned a weight λ_j. Then, given a sentence s and a tag sequence l, l can be scored with the set of feature functions defined above:
$$\mathrm{score}(l \mid s) = \sum_{j=1}^{m} \sum_{i=1}^{n} \lambda_j f_j(s, i, l_i, l_{i-1})$$
There are two summations in the above equation: the outer one sums over the m feature functions f_j, and the inner one sums the feature values over the n word positions in the sentence.
Exponentiating and normalizing this score gives the probability p(l | s) of the tag sequence l:
$$p(l \mid s) = \frac{\exp[\mathrm{score}(l \mid s)]}{\sum_{l'} \exp[\mathrm{score}(l' \mid s)]}$$
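As a non-limiting illustration, the CRF scoring and normalization can be sketched in Python as follows. The two feature functions, their weights and the tag names are hypothetical and serve only to make the sketch runnable; passing `None` as the previous tag of the first word is an assumption. The normalizer is computed by brute-force enumeration of all tag sequences, which is feasible only for toy inputs; practical CRF packages use dynamic programming instead.

```python
import itertools
import math


def score(sentence, labels, features):
    """Weighted sum of feature values over all features and positions."""
    total = 0.0
    for weight, feature in features:          # outer sum over feature functions f_j
        for i in range(len(sentence)):        # inner sum over word positions i
            previous = labels[i - 1] if i > 0 else None  # assumed start marker
            total += weight * feature(sentence, i, labels[i], previous)
    return total


def probability(sentence, labels, features, tagset):
    """p(l | s): exponentiated score divided by the sum over all tag sequences."""
    normalizer = sum(
        math.exp(score(sentence, candidate, features))
        for candidate in itertools.product(tagset, repeat=len(sentence))
    )
    return math.exp(score(sentence, labels, features)) / normalizer


# Hypothetical feature functions and weights, for illustration only.
def ends_in_oma_is_disease(s, i, l_i, l_prev):
    return 1 if s[i].endswith("oma") and l_i == "DISEASE" else 0


def disease_follows_modifier(s, i, l_i, l_prev):
    return 1 if l_prev == "MOD" and l_i == "DISEASE" else 0


FEATURES = [(2.0, ends_in_oma_is_disease), (1.0, disease_follows_modifier)]
```

With these toy features, tagging "multiple myeloma" as a modifier followed by a disease name satisfies both features and therefore receives the highest score among the four possible tag sequences.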
After the above steps, the resulting words are trained into word vectors for subsequent use.
A3. Extracting words related to myeloma diagnosis from the segmented short words:
The extraction is based on the disease name of multiple myeloma and its ICD codes. The many forms in which the text and ICD codes may be expressed are fully considered; see Table 1 for details.
The segmented diagnostic text is trained using the existing gensim package. Similar words are first screened and matched using the edit distance, selecting the words that are literally most similar; the screened results are then screened a second time, using the cosine distance to calculate the relevance between words; the optimal similar words are finally obtained through the combined distance calculation. The specific steps are as follows:
First, the edit distance is used to rank texts in ascending order, those literally most similar to myeloma first, and the texts are segmented into corresponding word sets according to step A2. The edit distance is mainly used to measure the similarity of two character strings and is defined as follows:
Given a character string A and a pattern string B, three operations are allowed: deleting a character from the string, inserting a character into the string, and replacing a character in the string. The minimum number of these operations required to edit string A into pattern string B is called the shortest edit distance of A and B, denoted ED(A, B). The algorithm for the shortest edit distance is described as follows: a two-dimensional array ED[i][j] represents the minimum number of operations required to edit the first i characters of string A into the first j characters of string B. The recurrence for ED[i][j] is:
1) ED[i][0] = i, ED[0][j] = j, where 0 ≤ i ≤ A.len and 0 ≤ j ≤ B.len;
2) if A[i] = B[j], ED[i][j] = ED[i-1][j-1];
3) if A[i] ≠ B[j], ED[i][j] = min(ED[i-1][j-1], ED[i-1][j]) + 1.
The smaller the edit distance, the more similar the two strings; the larger the distance, the less similar.
Second, the cosine distance is applied to the result to calculate the relevance between words, and a threshold is set (in this implementation the threshold was set to 0.4 after repeated tuning, at which matching speed and accuracy are relatively optimal). If the relevance is smaller than the threshold, it is set to 0 and the words are considered unrelated. The distances of the associated words are added and sorted in ascending order to obtain the similar words with the next priority. Cosine distance (cosine similarity) measures the similarity between two texts by the cosine of the angle between two vectors in a vector space; compared with distance measures, cosine similarity emphasizes the difference in direction between the two vectors. Generally, after vector representations of two texts have been obtained with an embedding, cosine similarity can be used to calculate the similarity between the two texts. The calculation formula is as follows:
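$$\cos(X, Y) = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}}$$
where X is the query word vector, Y is the matched word vector in the library, and x_i and y_i are their components.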
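The two-stage screening of step A3 can be sketched as follows, as an illustration only. The text does not specify exactly how the two distances are combined, so the rule used here is an assumption: keep the `top_k` literally closest words, then rank them by thresholded cosine relevance, breaking ties by edit distance. The function name, the `top_k` parameter and the convention of passing the distance and similarity functions as arguments are all hypothetical.

```python
def two_stage_match(query, query_vec, candidates, vectors,
                    distance_fn, similarity_fn, top_k=20, threshold=0.4):
    """Stage 1: keep the top_k literal matches; stage 2: rank them by relevance.

    candidates:    list of words from the segmented diagnosis text.
    vectors:       dict mapping a word to its trained word vector.
    distance_fn:   callable (word, query) -> edit distance.
    similarity_fn: callable (vector, vector) -> cosine similarity.
    Returns (word, edit_distance, relevance) tuples, most relevant first.
    """
    # Stage 1: ascending edit distance, i.e. most similar spelling first.
    literal = sorted(candidates, key=lambda w: distance_fn(w, query))[:top_k]

    results = []
    for word in literal:
        if word in vectors:
            similarity = similarity_fn(query_vec, vectors[word])
        else:
            similarity = 0.0                  # assumption: no vector, no relevance
        rel = similarity if similarity >= threshold else 0.0
        results.append((word, distance_fn(word, query), rel))

    # Stage 2: highest relevance first; ties broken by smaller edit distance.
    results.sort(key=lambda r: (-r[2], r[1]))
    return results
```

Running the cheap literal comparison first keeps the more expensive vector comparison confined to a short list of plausible candidates.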
An example of the results of the first extraction is shown in Table 4.
Table 4. Example of computer extraction and manual review results
[Table 4 is available only as images in the original publication: BDA0001793090970000092, BDA0001793090970000101.]
A4. Repeated manual checking and computer re-extraction:
A computer extracts the words consistent with the name or code of the given target disease, and the words screened by the computer are then submitted to professionals for further manual review and labeling. Each word is given a label such as "multiple myeloma", "suspected multiple myeloma" or "not multiple myeloma". The words labeled "multiple myeloma" form the inclusion word bank of the dictionary, and the words labeled "not multiple myeloma" form the exclusion word bank (see Table 5 for an example). The inclusion and exclusion word banks are then used to judge whether each patient in the medical insurance database is a multiple myeloma patient. Some patients are definitely judged to be multiple myeloma patients, and their number is denoted a; some are judged to be non-multiple-myeloma patients, and their number is denoted b; the remaining patients still need further judgment, and their number is denoted c. When a/(a + c) < 95%, the steps are repeated: the diagnosis information of the patients who could not be judged is browsed manually, diagnostic words found there are added to enrich the word bank used for computer extraction, the new word bank is sent to the computer for another round of extraction, and the operation is repeated until a/(a + c) > 95%. An example of a manual check is shown in Table 3.
Table 5. Examples from the inclusion and exclusion word banks
Word submitted for manual check    Manual check judgment
Bone marrow cancer                 Myeloma cell
Marrow ca                          Myeloma cell
Endothelial cell myeloma           Excluded
Primary myeloma                    Excluded
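The iteration of step A4 can be sketched as follows, as an illustration only. The professional's judgment is represented by a hypothetical `review` callable that returns "include", "exclude" or anything else for an undecided word; these label strings, the function name, the `max_rounds` safeguard and the patient classification rules are assumptions. The 95% target comes from the text.

```python
def build_word_banks(patients, seed_inclusion, review,
                     target=0.95, max_rounds=10):
    """Grow inclusion/exclusion word banks until a / (a + c) reaches target.

    patients: list of lists of diagnosis terms, one list per patient.
    review:   hypothetical stand-in for the professional; maps a term to
              "include", "exclude" or anything else (left undecided).
    """
    inclusion, exclusion = set(seed_inclusion), set()
    for _ in range(max_rounds):
        a = c = 0
        pending = set()
        for terms in patients:
            terms = set(terms)
            if terms & inclusion:
                a += 1                        # confirmed target patient
            elif terms and terms <= exclusion:
                continue                      # confirmed non-target patient (b)
            else:
                c += 1                        # still undetermined
                pending |= terms - exclusion
        if a + c == 0 or a / (a + c) >= target:
            break
        decided = False
        for term in sorted(pending):
            label = review(term)
            if label == "include":
                inclusion.add(term)
                decided = True
            elif label == "exclude":
                exclusion.add(term)
                decided = True
        if not decided:
            break                             # reviewer added nothing new
    return inclusion, exclusion
```

Each round submits only the words of still-undetermined patients for review, so the manual workload shrinks as the word banks grow.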
It is noted that the disclosed embodiments are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various substitutions and modifications are possible without departing from the spirit and scope of the invention and appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.

Claims (4)

1. A method for constructing a single-disease word bank by rapidly structuring single-disease diagnosis information based on medical insurance data, characterized in that the redundant text of the diagnosis information in the medical insurance data is structured to construct the single-disease word bank; the method comprises the following steps:
1) extracting fields containing diagnostic information from a medical insurance database; segmenting the unstructured diagnostic information text into a plurality of words; marking the part of speech of the vocabulary obtained by segmentation to obtain a marked word set;
2) performing part-of-speech classification on the tagged word set; training the resulting words into word vectors; specifically, the gensim package is used to train word vectors on the segmented diagnosis texts;
the procedure first uses the edit distance to perform a first screening and matching of similar words, selecting the words that are literally most similar, then screens the results a second time, using the cosine distance to calculate the relevance between words, and finally obtains the optimal similar words through the combined distance calculation of the edit distance and the cosine distance; the specific steps are as follows:
first, the edit distance is used to rank texts in ascending order, those literally most similar to myeloma first, and the texts are cut into corresponding word sets;
the edit distance is used to calculate the similarity of two character strings, which is defined as follows:
with character strings A and B, B being the pattern strings, the following operations are now given: deleting a character from the character string A; inserting a character from the character string A; replacing a character from the character string a; through the three operations, the minimum operand required by editing the character string A into the mode string B is called the shortest editing distance of A and B and is marked as ED (A, B);
the algorithm for solving the shortest edit distance is described as follows: using a two-dimensional array ED [ i ] [ j ] to represent the minimum operand required by the first i characters of the character string A to be edited into the first j characters of the pattern string B; the recurrence formula for ED [ i ] [ j ] is:
21) ED [ i ] [0] ═ i, ED [0] [ j ] ═ j, where i is 0 ≦ a.len, and j is 0 ≦ b.len;
22) if a [ i ] ═ B [ j ], ED [ i ] [ j ] ═ ED [ i-1] [ j-1 ];
23) if a [ i ] is not equal to B [ j ], ED [ i ] [ j ] ═ min (ED [ i-1] [ j-1], ED [ i-1] [ j ]) + 1;
the smaller the editing distance is, the more similar the two character strings are; conversely, the more dissimilar;
second, the cosine distance is used to obtain the relevance between the words in the result so obtained, and a threshold is set; if the result is smaller than the threshold, the relevance is set to 0, indicating no relevance; the distances of the relevant words are summed and the words are sorted in ascending order, yielding the second-priority similar words;
cosine similarity measures the similarity between two texts by the cosine of the angle between two vectors in a vector space; after vector representations of the two texts are obtained by embedding, the similarity between the two texts is calculated with cosine similarity; the calculation formula is as follows:
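$$\cos(X, Y) = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert} = \frac{\sum_{i} X_i Y_i}{\sqrt{\sum_{i} X_i^2}\,\sqrt{\sum_{i} Y_i^2}}$$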
wherein X is a query word vector and Y is a matched word vector in the library;
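As an illustration of the cosine computation and the relevance threshold, a minimal Python sketch is given below. The function names `cosine_similarity` and `relevance` are hypothetical, the vectors are toy values rather than trained embeddings, and the default threshold of 0.4 is taken from step 4) of this claim.

```python
import math
from typing import Sequence


def cosine_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine of the angle between the query vector x and the library vector y."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0.0 or norm_y == 0.0:
        # Assumption: a zero vector is treated as unrelated to everything.
        return 0.0
    return dot / (norm_x * norm_y)


def relevance(x: Sequence[float], y: Sequence[float],
              threshold: float = 0.4) -> float:
    """Return the similarity, or 0 when it falls below the threshold."""
    sim = cosine_similarity(x, y)
    return sim if sim >= threshold else 0.0
```

For example, `relevance([1.0, 0.0], [0.1, 1.0])` evaluates to 0.0 because the underlying similarity (about 0.1) lies below the threshold.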
3) using the edit distance of step 2), ranking in ascending order the diagnostic information texts in the medical insurance database that are literally most similar to the single disease, and segmenting them into corresponding word sets according to step 1);
4) for the word sets obtained in step 3), calculating the cosine similarity using the cosine distance of step 2) to represent the relevance between words;
setting the threshold to 0.4; a cosine similarity smaller than the threshold indicates no relevance;
summing the distances of the relevant words and sorting the words in ascending order, yielding the second-priority similar words;
5) combining the edit distance and the cosine similarity to obtain the word list most similar to the standard expression of the disease, which serves as the standard word list;
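Steps 3) to 5) can be sketched as a two-stage ranking. The sketch below rests on several assumptions that the claim does not fix: the name `rank_similar`, the `top_k` cut-off for the first screening, the combined score (edit distance plus one minus cosine similarity), and the toy English words and two-dimensional vectors, which stand in for segmented diagnosis terms and trained embeddings.

```python
import math


def edit_distance(a, b):
    # Rolling-row Levenshtein distance (standard three-operation form).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def cosine(x, y):
    dot = sum(p * q for p, q in zip(x, y))
    nx = math.sqrt(sum(p * p for p in x))
    ny = math.sqrt(sum(q * q for q in y))
    return dot / (nx * ny) if nx and ny else 0.0


def rank_similar(query, query_vec, library, top_k=4, threshold=0.4):
    """library maps each candidate word to its word vector."""
    # Stage 1: literal screening, ascending by edit distance.
    screened = sorted(library, key=lambda w: edit_distance(query, w))[:top_k]
    # Stage 2: drop words whose cosine similarity is below the threshold,
    # then order the rest by the assumed combined distance.
    scored = []
    for word in screened:
        sim = cosine(query_vec, library[word])
        if sim < threshold:
            continue  # no relevance
        scored.append((edit_distance(query, word) + (1.0 - sim), word))
    return [word for _, word in sorted(scored)]


library = {
    "myelomas": [0.9, 0.1],
    "myeloid": [0.1, 0.9],
    "melanoma": [0.2, 0.8],
    "myelomatosis": [0.8, 0.2],
    "fracture": [0.0, 1.0],
}
ranked = rank_similar("myeloma", [1.0, 0.0], library)
print(ranked)  # -> ['myelomas', 'myelomatosis']
```

In this toy run, "myeloid" and "melanoma" pass the literal screening but are dropped in the second stage because their vectors point away from the query, which is the behavior the two-stage design is meant to provide.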
6) according to the standard word list, professionals perform computer-aided manual checks, repeated over multiple rounds; the operations are as follows:
61) formulating the broadest extraction fields according to the clinical name of the target disease, and extracting all potential target-disease patients;
62) starting matching with computer-aided screening, and performing a computer-aided manual check on each round of matching results; the check identifies each patient as a target-disease patient or a non-target-disease patient according to condition 62A or 62B:
62A) each patient confirmed as having the target disease must have a definite diagnostic entry that matches the single-disease word bank;
62B) each patient confirmed as not having the target disease must have a definite exclusion entry that matches the exclusion word bank for non-target diseases;
63) repeating the operations of step 62) until every record is clearly classified;
7) when manual checking shows that the accuracy of the current version of the standard word list is lower than 95%, updating the standard word list by performing the following operations:
71) adding the new words obtained from the current manual check to the list to generate a new version of the standard word list;
72) recalculating the edit distance and the cosine similarity, and then performing the next round of manual checking;
73) stopping the update once the manual check shows that the accuracy of the word list reaches 95%;
through the above steps, the single-disease diagnosis information in the medical insurance data is rapidly structured, so that the diagnosis information in the medical insurance data is used efficiently.
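The update loop of step 7) can be sketched as follows. The names `refine_word_list` and `simulated_check` are hypothetical, the `max_rounds` guard is an added safeguard not found in the claim, and the callback stands in for the whole of steps 72) and 73), namely recomputing similarities and having professionals check the results. The simulated check against a fixed reference set is purely illustrative.

```python
from typing import Callable, Iterable, Set, Tuple

CheckFn = Callable[[Set[str]], Tuple[float, Iterable[str]]]


def refine_word_list(initial: Iterable[str], manual_check: CheckFn,
                     target_accuracy: float = 0.95,
                     max_rounds: int = 20) -> Tuple[Set[str], int]:
    """Grow the standard word list until the checked accuracy reaches the target."""
    vocab = set(initial)
    for round_no in range(1, max_rounds + 1):
        # One round of checking: returns the measured accuracy and the
        # new words discovered by the reviewers.
        accuracy, new_words = manual_check(vocab)
        if accuracy >= target_accuracy:
            return vocab, round_no  # step 73): stop updating
        vocab |= set(new_words)     # step 71): new version of the list
    raise RuntimeError("target accuracy not reached within max_rounds")


# Toy stand-in for the manual check: accuracy is the share of a fixed
# reference set already covered, and one missing word is found per round.
REFERENCE = {"myeloma", "multiple myeloma", "plasma cell myeloma", "Kahler disease"}


def simulated_check(vocab: Set[str]) -> Tuple[float, Iterable[str]]:
    accuracy = len(REFERENCE & vocab) / len(REFERENCE)
    return accuracy, sorted(REFERENCE - vocab)[:1]


final_vocab, rounds = refine_word_list({"myeloma"}, simulated_check)
print(rounds)  # -> 4
```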
2. The method of claim 1, wherein the single disease species diagnostic information is selected from the group consisting of multiple myeloma, amyotrophic lateral sclerosis, albinism, Alport syndrome, and autoimmune encephalitis.
3. The method for rapidly structuring single-disease diagnosis information based on medical insurance data as claimed in claim 1, wherein in step 1) a conditional random field (CRF) is used to label the parts of speech of the words segmented from long text sentences.
4. The method for rapidly structuring single-disease diagnosis information based on medical insurance data as claimed in claim 1, wherein in step 61) the clinical name of the target disease is multiple myeloma, and the extraction fields comprise: "myeloma", "Kahler", "bone marrow cancer", "myelopathy", "203.0", "C90.0", "M97320/3".
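The broad extraction of step 61) amounts to keeping every record whose diagnosis text contains at least one extraction field. A minimal sketch follows, assuming a hypothetical record layout (`patient_id`, `diagnosis`) and hypothetical sample records; the field strings are English renderings of the fields in claim 4 (with "Kahler" for Kahler's disease), whereas real medical insurance data would be matched against the original-language strings.

```python
EXTRACTION_FIELDS = [
    "myeloma", "Kahler", "bone marrow cancer", "myelopathy",
    "203.0", "C90.0", "M97320/3",
]


def extract_candidates(records, fields=EXTRACTION_FIELDS):
    """Return records whose diagnosis text contains any extraction field."""
    lowered = [f.lower() for f in fields]
    return [
        r for r in records
        if any(f in r["diagnosis"].lower() for f in lowered)
    ]


# Hypothetical sample records for illustration only.
records = [
    {"patient_id": 1, "diagnosis": "Multiple myeloma, IgG type"},
    {"patient_id": 2, "diagnosis": "C90.0"},
    {"patient_id": 3, "diagnosis": "Hypertension"},
    {"patient_id": 4, "diagnosis": "Kahler disease, suspected"},
]
candidates = extract_candidates(records)
print([r["patient_id"] for r in candidates])  # -> [1, 2, 4]
```

The match is deliberately over-inclusive: false positives are expected at this stage and are resolved later by the manual checks of steps 62) and 63).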
CN201811045058.2A 2018-09-07 2018-09-07 Rapid structuring method of single disease diagnosis information based on medical insurance data Active CN109344250B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811045058.2A CN109344250B (en) 2018-09-07 2018-09-07 Rapid structuring method of single disease diagnosis information based on medical insurance data

Publications (2)

Publication Number Publication Date
CN109344250A CN109344250A (en) 2019-02-15
CN109344250B CN109344250B (en) 2021-11-19

Family

ID=65304994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811045058.2A Active CN109344250B (en) 2018-09-07 2018-09-07 Rapid structuring method of single disease diagnosis information based on medical insurance data

Country Status (1)

Country Link
CN (1) CN109344250B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109887562B (en) * 2019-02-20 2021-10-29 广州天鹏计算机科技有限公司 Similarity determination method, device, equipment and storage medium for electronic medical records
CN110070929A (en) * 2019-04-30 2019-07-30 上海复繁信息科技有限公司 A kind of acquisition and cleaning method for atrial fibrillation Single diseases data
CN110096533A (en) * 2019-04-30 2019-08-06 上海复繁信息科技有限公司 The method that a kind of pair of atrial fibrillation Single diseases data carry out multi-dimensional query analysis
CN110895961A (en) * 2019-10-29 2020-03-20 泰康保险集团股份有限公司 Text matching method and device in medical data
CN111063446B (en) * 2019-12-17 2023-06-16 医渡云(北京)技术有限公司 Method, apparatus, device and storage medium for standardizing medical text data
CN111178444B (en) * 2019-12-31 2023-06-02 山东中医药大学第二附属医院 Traditional Chinese medicine formula treatment effect statistical method based on vector analysis
CN111599463B (en) * 2020-05-09 2023-07-14 吾征智能技术(北京)有限公司 Intelligent auxiliary diagnosis system based on sound cognition model
CN112185520B (en) * 2020-09-27 2024-06-07 志诺维思(北京)基因科技有限公司 Text structuring processing system and method for medical pathology report picture
CN112837771B (en) * 2021-01-25 2022-09-13 山东浪潮智慧医疗科技有限公司 Abnormal physical examination item normalization method integrating text classification and lexical analysis
CN115083618A (en) * 2022-05-18 2022-09-20 深圳大学 Artificial intelligent epidemiology investigation system and method based on Internet of things
CN114996388A (en) * 2022-07-18 2022-09-02 湖南创星科技股份有限公司 Intelligent matching method and system for diagnosis name standardization
CN115862853B (en) * 2022-08-12 2023-11-21 内蒙古自治区综合疾病预防控制中心 Method for evaluating cardiovascular disease occurrence risk of prostate cancer patient
CN116166698B (en) * 2023-01-12 2023-09-01 之江实验室 Method and system for quickly constructing queues based on general medical terms
CN117438025B (en) * 2023-12-19 2024-03-22 南京江北新区生物医药公共服务平台有限公司 Single-disease electronic medical record database construction method based on deep learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106354715A (en) * 2016-09-28 2017-01-25 医渡云(北京)技术有限公司 Method and device for medical word processing
CN106383853A (en) * 2016-08-30 2017-02-08 刘勇 Realization method and system for electronic medical record post-structuring and auxiliary diagnosis
CN107578798A (en) * 2017-10-26 2018-01-12 北京康夫子科技有限公司 The processing method and system of electronic health record
CN107610761A (en) * 2017-09-30 2018-01-19 电子科技大学 A kind of clinical path analysis method based on medical insurance data
CN108172269A (en) * 2017-12-15 2018-06-15 广州市康软信息科技有限公司 The Visual Report Forms generation method and system of medical data

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100217768A1 (en) * 2009-02-20 2010-08-26 Hong Yu Query System for Biomedical Literature Using Keyword Weighted Queries

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant