CN117454843A

CN117454843A - Data preprocessing system based on electronic medical record question-answering model

Info

Publication number: CN117454843A
Application number: CN202311516587.7A
Authority: CN
Inventors: 刘立宇; 初乃强; 赵瑞莹
Original assignee: Singularity Digital Beijing Technology Co ltd; Singularity Of Life Beijing Technology Co ltd
Current assignee: Singularity Digital Beijing Technology Co ltd; Singularity Of Life Beijing Technology Co ltd
Priority date: 2023-11-14
Filing date: 2023-11-14
Publication date: 2024-01-26

Abstract

The invention provides a data preprocessing system based on an electronic medical record question-answering model. The system comprises a sample electronic medical record information set, a processor, and a memory storing a computer program. When the computer program is executed by the processor, the following steps are implemented: a candidate text set is obtained according to the sample electronic medical record information set; a candidate keyword set is obtained according to the candidate text set and a target term knowledge graph; an initial text set is obtained according to the candidate text set and the candidate keyword set; a target text set is obtained according to the initial text set; and a specified text vector is obtained according to the target text set, thereby realizing data preprocessing.

Description

Data preprocessing system based on electronic medical record question-answering model

Technical Field

The invention relates to the technical field of text processing, in particular to a data preprocessing system based on an electronic medical record question-answering model.

Background

With the continuous growth of medical services and the continuous development of artificial intelligence technology, electronic medical records have become a trend. How to process the text data of electronic medical records to generate data for training models in the medical field has become a popular research direction. When building such models, data preprocessing is a vital step: processing the text data appropriately can effectively improve the performance of the trained model.

At present, the prior-art method for preprocessing data is as follows: a target text string number is obtained from the average of the text string numbers of the texts in a database; when the text string of a text is too long, it is truncated starting from the end of the text; when the text string number of a text is lower than the target text string number, randomly selected text is appended as a supplement; a specified text vector set is thus obtained to realize data preprocessing.

In summary, the existing data preprocessing has the following problems: the types of the texts are not considered when the text string numbers are unified, so the comprehensiveness of the obtained specified text vectors cannot be guaranteed; keyword factors in the texts are not considered, in that the priority of keywords is ignored when text is truncated and the texts associated with the keywords are ignored when text is supplemented; and the texts are not processed by different means according to different factors. As a result, the accuracy of the obtained specified text vector set is reduced.

Disclosure of Invention

The invention provides a data preprocessing system based on an electronic medical record question-answering model, comprising: a sample electronic medical record information set, a processor, and a memory storing a computer program, wherein the sample electronic medical record information set comprises a plurality of pieces of sample electronic medical record information, each being abnormal-state feature information corresponding to a medical record obtained from a database, and when the computer program is executed by the processor, the following steps are implemented:

S1, acquire a candidate text set A = {A_1, …, A_i, …, A_n} according to the sample electronic medical record information set, where A_i is the i-th candidate text, i = 1, …, n, and n is the number of candidate texts.

S3, acquire, according to A and the target term knowledge graph, a candidate keyword set Q = {Q_1, …, Q_i, …, Q_n} corresponding to A, where Q_i is the candidate keyword list corresponding to A_i.

S5, acquire an initial text set T = {T_1, …, T_i, …, T_n} according to A and Q, where T_i = {A_i, Q_i} is the i-th initial text.

S7, acquire a specified text set U = {U_1, …, U_i, …, U_n} according to T, where U_i is the i-th specified text and is obtained in S7 by the following steps:

S71, obtain, according to T_i, the text string WT_i = (WT^0_{i1}, …, WT^0_{ix}, …, WT^0_{ip}, WT^1_{i1}, …, WT^1_{iy}, …, WT^1_{iq}) corresponding to T_i, where WT^0_{ix} is the x-th character corresponding to A_i, x = 1, …, p, p is the number of characters corresponding to A_i, WT^1_{iy} is the y-th character corresponding to Q_i, y = 1, …, q, and q is the number of characters corresponding to Q_i.

S72, when p + q = K, set U_i = T_i, where K is a preset key priority threshold.

S73, when p + q > K, acquire a candidate priority set P = {P_1, …, P_i, …, P_n} corresponding to Q, where P_i = {P_{i1}, …, P_{ie}, …, P_{if(i)}}, P_{ie} is the candidate priority corresponding to the e-th candidate keyword in the candidate keyword list corresponding to Q_i, e = 1, …, f(i), and f(i) is the number of candidate keywords in the candidate keyword list corresponding to Q_i.

S74, process WT_i based on P to obtain U_i.

S75, when p + q < K, acquire a specified keyword set R_i = {R_{i1}, …, R_{ie}, …, R_{if(i)}} corresponding to Q_i and a specified priority set G_i = {G_{i1}, …, G_{ie}, …, G_{if(i)}} corresponding to Q_i, where R_{ie} is the specified keyword list corresponding to Q_{ie} and G_{ie} is the specified priority list corresponding to Q_{ie}.

S76, process WT_i according to R_i and G_i to obtain U_i.

S9, acquire a specified text vector set according to U to realize data preprocessing, where the specified text vector set comprises a plurality of specified text vectors, and each specified text vector is obtained by inputting a specified text into a pre-training electronic medical record coding model.

The invention provides a data preprocessing system based on an electronic medical record question-answering model. The system comprises a sample electronic medical record information set, a processor, and a memory storing a computer program, wherein the sample electronic medical record information set comprises a plurality of pieces of sample electronic medical record information, each being abnormal-state feature information corresponding to a medical record obtained from a database. When the computer program is executed by the processor, the following steps are implemented: a candidate text set is obtained according to the sample electronic medical record information set; a candidate keyword set corresponding to the candidate text set is obtained according to the candidate text set and a target term knowledge graph; an initial text set is obtained according to the candidate text set and the candidate keyword set; a target text set is obtained according to the initial text set, wherein the number of text strings corresponding to each initial text is processed in different ways under different conditions to obtain the target text; and a specified text vector is obtained according to the target text set.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flowchart of the computer program executed by a data preprocessing system based on an electronic medical record question-answering model according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Examples

An LLM model-based electronic medical record question-answering system, the system comprising: a sample electronic medical record information set, a processor and a memory storing a computer program which, when executed by the processor, performs the steps of, as shown in fig. 1:

Specifically, the sample electronic medical record information set includes a plurality of pieces of sample electronic medical record information, each being abnormal-state feature information corresponding to a medical record obtained from a database. The abnormal-state feature information is feature information associated with a disease, for example: an abnormal result in the abnormal glycoprotein (TAP) test, or a nasopharynx presenting poorly differentiated squamous cell carcinoma.

Furthermore, those skilled in the art know that any public medical database from which cases can be obtained may be selected according to actual requirements; all such choices fall within the protection scope of the present invention and are not described again here.

Further, the data format of the sample electronic medical record information comprises a text format and a table format.

Specifically, the system further comprises a target term knowledge graph represented as triples, where each triple comprises two entities related to an abnormal state and the relationship between those two entities.

Further, those skilled in the art know that any method for constructing a knowledge graph based on a target term in the prior art falls into the protection scope of the present invention, and is not described herein.

Specifically, in S1, candidate texts are obtained by:

S11, when the data format of the sample electronic medical record information is a text format, the sample electronic medical record information is split at delimiter symbols to generate candidate texts.

S13, when the data format of the sample electronic medical record information is a table format, each record in the sample electronic medical record information is merged with the field names corresponding to that record to generate a candidate text. For example, when the field names in the sample electronic medical record information are, from left to right, ID, biopsy site, and histological classification, and the contents of a certain row are, from left to right, 008, nasopharynx, and squamous cell carcinoma, the candidate text obtained is: the biopsy site of ID 008 is the nasopharynx, and the histological classification is squamous cell carcinoma.
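Steps S11 and S13 can be illustrated with a minimal Python sketch. It is not the patented implementation: the delimiter set, the "field: value" rendering of a table row, and the function names `split_text_record` and `row_to_candidate_text` are assumptions made for illustration only.

```python
import re


def split_text_record(record: str, delimiters: str = "。；;.\n") -> list[str]:
    """S11 (sketch): split a free-text record into candidate texts at delimiter symbols.

    The delimiter set is an assumption; note that splitting on '.' would also
    break decimal values, so a real system would need a more careful rule.
    """
    pattern = "[" + re.escape(delimiters) + "]+"
    return [part.strip() for part in re.split(pattern, record) if part.strip()]


def row_to_candidate_text(field_names: list[str], row: list[str]) -> str:
    """S13 (sketch): merge one table row with its field names into one candidate text.

    A simple 'field: value' rendering is assumed here instead of a full sentence.
    """
    return ", ".join(f"{name}: {value}" for name, value in zip(field_names, row))


if __name__ == "__main__":
    print(split_text_record("TAP test abnormal; nasopharynx shows carcinoma."))
    print(row_to_candidate_text(
        ["ID", "biopsy site", "histological classification"],
        ["008", "nasopharynx", "squamous cell carcinoma"],
    ))
```

Keeping the field names next to the values preserves the meaning of each table cell once the row is flattened into text, which is the purpose of S13.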

Specifically, Q_i is obtained in S3 by the following steps:

S31, obtain, according to A, a first intermediate word set B = {B_1, …, B_i, …, B_n} corresponding to A, where B_i = {B_{i1}, …, B_{ij}, …, B_{im(i)}}, B_{ij} is the j-th first intermediate word in the first intermediate word list corresponding to A_i, j = 1, …, m(i), and m(i) is the number of first intermediate words in the first intermediate word list corresponding to A_i.

Specifically, the first intermediate word is a word obtained from the candidate text, and those skilled in the art know that any method for extracting a word from the text in the prior art falls within the protection scope of the present invention, and is not described herein in detail.

S33, acquire a target word list D = {D_1, …, D_r, …, D_s} according to the target term knowledge graph, where D_r is the r-th target word, r = 1, …, s, and s is the number of target words.

Specifically, the target word is an entity related to the abnormal state obtained from the target term knowledge graph.

S35, obtain, according to B and D, a first intermediate similarity set F = {F_1, …, F_i, …, F_n} corresponding to B, where F_i = {F_{i1}, …, F_{ij}, …, F_{im(i)}}, F_{ij} = {F^1_{ij}, …, F^r_{ij}, …, F^s_{ij}}, and F^r_{ij} is the first intermediate similarity between B_{ij} and D_r.

Specifically, the first intermediate similarity is a similarity between a word vector corresponding to the first intermediate word and a word vector corresponding to the target word, where one skilled in the art knows that any method for calculating the similarity between the vectors in the prior art falls within the protection scope of the present invention, and is not described herein.

Further, the word vector corresponding to the first intermediate word is obtained by inputting the first intermediate word into a natural language processing model. Those skilled in the art know that any prior-art natural language processing model for converting text into a vector falls within the protection scope of the present invention, which is not described again here.

S37, when F^r_{ij} ≥ F^0, B_{ij} is inserted into Q_i, where F^0 is a preset first intermediate similarity threshold.

Specifically, the value range of F^0 is 0.8 to 0.9. Those skilled in the art know that F^0 may be selected according to actual requirements; all such choices fall within the protection scope of the present invention and are not described again here.
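The filtering in S31 to S37 can be sketched as follows. This is only an illustration under stated assumptions: the word vectors are taken as precomputed by some embedding model, cosine similarity stands in for the similarity measure (the text leaves the choice open), and the names `cosine` and `select_candidate_keywords` are hypothetical.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors; returns 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_candidate_keywords(
    word_vectors: dict[str, list[float]],
    target_vectors: dict[str, list[float]],
    threshold: float = 0.8,
) -> list[str]:
    """Keep each first intermediate word B_ij whose similarity to at least one
    target word D_r reaches the threshold F^0 (S37)."""
    keywords = []
    for word, vector in word_vectors.items():
        if any(cosine(vector, target) >= threshold for target in target_vectors.values()):
            keywords.append(word)
    return keywords


if __name__ == "__main__":
    words = {"nasopharynx": [1.0, 0.0], "patient": [0.0, 1.0]}
    targets = {"nasopharyngeal": [0.9, 0.1]}
    print(select_candidate_keywords(words, targets))
```

The default threshold of 0.8 is the lower end of the stated range for F^0; any value in the range can be passed instead.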

Specifically, the initial text is the text obtained by concatenating the candidate text and the candidate keywords, with the candidate keywords appended after the candidate text.

S71, obtain, according to T_i, the text string WT_i = (WT^0_{i1}, …, WT^0_{ix}, …, WT^0_{ip}, WT^1_{i1}, …, WT^1_{iy}, …, WT^1_{iq}) corresponding to T_i, where WT^0_{ix} is the x-th character corresponding to A_i, x = 1, …, p, p is the number of characters corresponding to A_i, WT^1_{iy} is the y-th character corresponding to Q_i, y = 1, …, q, and q is the number of characters corresponding to Q_i.

Specifically, K in S72 is obtained by the following steps:

S721, obtain, according to T, a key text type set C = {C_1, …, C_d, …, C_z}, where C_d = {C_{d1}, …, C_{dg}, …, C_{dh(d)}}, C_{dg} is the g-th key text in the d-th key text list, g = 1, …, h(d), h(d) is the number of key texts in the d-th key text list, d = 1, …, z, and z is the number of key text types.

Specifically, a key text is an initial text obtained from T according to the text type of that initial text. Those skilled in the art know that any prior-art method for classifying text, for example classification by the keywords of the text, falls within the protection scope of the present invention and is not described again here. The text type is the type to which the initial text belongs, such as a cardiac type or an eye, ear, nose and throat type.

S723, obtain, according to C, a first text string number set C^0 = {C^0_1, …, C^0_d, …, C^0_z} corresponding to C, where C^0_d = {C^0_{d1}, …, C^0_{dg}, …, C^0_{dh(d)}} and C^0_{dg} is the first text string number corresponding to C_{dg}.

Specifically, the first text string number is the number of text strings corresponding to the key text.

S725, obtain, according to C^0, a second text string number set C^1 = {C^1_1, …, C^1_d, …, C^1_z} corresponding to C, where C^1_d = {C^1_{d1}, …, C^1_{du}, …, C^1_{dh(d)}}, C^1_{du} is the u-th second text string number in the second text string number list corresponding to the d-th key text list, u = 1, …, h(d), and C^1_{d1} ≥ … ≥ C^1_{du} ≥ … ≥ C^1_{dh(d)}.

Specifically, the second text string numbers are the first text string numbers arranged in descending order.

Further, the number of text strings of a text is the number of characters in the text string corresponding to that text.

S727, obtain K according to C^1, where K satisfies the following condition:

where C^1_{dα} is the number of text strings of the key text corresponding to the α-th second text string number in the d-th key text list, and ε is a preset first number threshold.

Specifically, α is an integer not greater than h(d) × ε.

Specifically, the value range of epsilon is 0.85-1, wherein, the person skilled in the art knows that epsilon can be selected according to the actual requirement, and the epsilon falls into the protection range of the invention, and is not repeated here.
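The per-type selection described above can be sketched in Python. The text here does not reproduce the condition that combines the per-type values into K, so the sketch makes two loudly labeled assumptions: α is taken as floor(h(d) × ε), clamped to at least 1, and the per-type values are combined with `max`. The function names are hypothetical.

```python
import math


def per_type_lengths(lengths_by_type: dict[str, list[int]], epsilon: float = 0.85) -> dict[str, int]:
    """For each text type d, sort the text string numbers in descending order
    (the list C^1_d) and pick the alpha-th entry, where alpha is an integer
    not greater than h(d) * epsilon.

    ASSUMPTION: alpha = floor(h(d) * epsilon), clamped to at least 1 so that
    very short lists still yield a value.
    """
    picked = {}
    for text_type, lengths in lengths_by_type.items():
        ordered = sorted(lengths, reverse=True)
        alpha = max(1, math.floor(len(ordered) * epsilon))  # 1-based position
        picked[text_type] = ordered[alpha - 1]
    return picked


def length_threshold(lengths_by_type: dict[str, list[int]], epsilon: float = 0.85) -> int:
    """ASSUMPTION: combine the per-type values with max(); the actual
    condition on K is not given in this text."""
    return max(per_type_lengths(lengths_by_type, epsilon).values())


if __name__ == "__main__":
    sample = {"cardiac": [120, 80, 100, 90], "ent": [60, 70]}
    print(per_type_lengths(sample))
    print(length_threshold(sample))
```

Replacing `max` with another aggregate (for example a mean) only changes `length_threshold`; the per-type selection stays the same.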

In the above, the preset key priority threshold is obtained based on the types of the key texts and the number of text strings corresponding to the key texts of each type, so that the numbers of text strings corresponding to the initial texts can be unified. Taking the text types into account helps guarantee the comprehensiveness of the texts corresponding to the specified text vectors obtained later, and setting the threshold from the per-type numbers of text strings improves the accuracy of the unified text string number. This avoids both the loss of text data caused by an overly short text string length and the reduced processing efficiency caused by an overly long one, thereby improving the accuracy of the specified text vector set obtained later.

Specifically, P_{ie} is obtained in S73 by the following steps:

S731, obtain the candidate keyword list Q_i = {Q_{i1}, …, Q_{ie}, …, Q_{if(i)}}, where Q_{ie} is the e-th candidate keyword in Q_i.

S733, obtain, according to the target term knowledge graph, the specified keyword list R_{ie} = {R^1_{ie}, …, R^a_{ie}, …, R^{b(e)}_{ie}} corresponding to Q_{ie} and the specified priority list G_{ie} = {G^1_{ie}, …, G^a_{ie}, …, G^{b(e)}_{ie}} corresponding to Q_{ie}, where R^a_{ie} is the a-th specified keyword corresponding to Q_{ie}, a = 1, …, b(e), b(e) is the number of specified keywords corresponding to Q_{ie}, and G^a_{ie} is the specified priority between Q_{ie} and R^a_{ie}.

Specifically, the specified keyword is a target word associated with the candidate keyword, which is obtained from the target term knowledge graph.

Specifically, the specified priority is the association degree between the candidate keyword and the specified keyword, where those skilled in the art know that any method for obtaining the association degree between two texts in the prior art falls into the protection scope of the present invention, and is not described herein in detail.

S735, obtain P_{ie} according to Q_{ie}, R_{ie}, and G_{ie}, where P_{ie} satisfies the following condition:

where M_{ie} is the number of occurrences of Q_{ie} in the candidate text set A, N_{ie} is the number of first intermediate words corresponding to the candidate texts in A that contain Q_{ie}, V_{ie} is the number of candidate texts in A that contain Q_{ie}, E^a_{ie} is the number of occurrences of G^a_{ie} in the candidate text set A, L^a_{ie} is the number of first intermediate words corresponding to the candidate texts in A that contain G^a_{ie}, and J^a_{ie} is the number of candidate texts in A that contain G^a_{ie}.

S74, process WT_i based on P to obtain U_i.

Specifically, S74 further includes the following steps:

S741, obtain, according to P_i, the first intermediate text β^1_i = (A_i, Q_{i1}, …, Q_{i(e-1)}, Q_{i(e+1)}, …, Q_{if(i)}) corresponding to T_i, where P_{ie} is the smallest candidate priority in P_i.

S743, when the number of text strings corresponding to β^1_i is not greater than K, set U_i = β^1_i.

S745, when the number of text strings corresponding to β^1_i is greater than K, obtain the smallest candidate priority in P_i other than P_{ie}, and delete the corresponding candidate keyword in Q_i from the initial text to obtain the second intermediate text β^2_i corresponding to T_i.

S747, repeat S743 to S745 until the number of text strings corresponding to the obtained U_i is not greater than K, thereby obtaining U_i.
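The loop in S741 to S747 can be sketched as below. It is a minimal illustration, not the patented implementation: the text length is measured with Python's `len`, the candidate priorities are passed in as plain numbers because the condition defining P_{ie} is not reproduced here, and the function name `truncate_by_priority` is hypothetical.

```python
def truncate_by_priority(
    candidate_text: str,
    keywords: list[str],
    priorities: list[float],
    k: int,
) -> str:
    """Drop the lowest-priority candidate keywords one at a time until the
    concatenation of the candidate text and the remaining keywords has at
    most k characters.

    ASSUMPTION: if the candidate text alone is longer than k, it is returned
    unchanged once all keywords are removed; the text does not cover this case.
    """
    remaining = list(zip(keywords, priorities))

    def total_length() -> int:
        return len(candidate_text) + sum(len(word) for word, _ in remaining)

    while remaining and total_length() > k:
        # Index of the keyword with the smallest candidate priority.
        lowest = min(range(len(remaining)), key=lambda idx: remaining[idx][1])
        del remaining[lowest]

    return candidate_text + "".join(word for word, _ in remaining)


if __name__ == "__main__":
    print(truncate_by_priority("abcde", ["xx", "yyy", "z"], [0.5, 0.1, 0.9], k=8))
```

Removing keywords rather than cutting characters from the end keeps the highest-priority keywords in the specified text, which is the stated advantage over plain truncation.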

S76, process WT_i according to R_i and G_i to obtain U_i.

Specifically, S76 further includes the following steps:

S761, when G^a_{ie} is the largest specified priority in G_{ie}, obtain the first candidate text set corresponding to T_i, where the first candidate text set comprises a plurality of first candidate texts, and each first candidate text is a candidate text obtained from A that contains the specified keyword R^a_{ie} corresponding to G^a_{ie}.

S763, obtain, based on the first candidate text set corresponding to T_i, the second candidate text H_i corresponding to T_i, where H^0_i = K - p - q and H^0_i is the number of text strings corresponding to H_i.

S765, obtain U_i = (A_i, Q_i, H_i) according to H_i.
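The supplementing branch S761 to S765 can be sketched as follows. The text fixes only the length of H_i (K - p - q), not how H_i is built from the first candidate text set, so the sketch assumes the matching candidate texts are concatenated and cut to the missing length. The specified keywords and priorities of all candidate keywords are flattened into one list of pairs, and the function name `supplement_text` is hypothetical.

```python
def supplement_text(
    candidate_text: str,
    keywords: list[str],
    related: list[tuple[str, float]],
    all_candidate_texts: list[str],
    k: int,
) -> str:
    """Pad a short initial text with text associated with its keywords.

    related: (specified keyword, specified priority) pairs.
    ASSUMPTIONS: the filler is the concatenation of the other candidate texts
    that contain the highest-priority specified keyword, cut to k - p - q
    characters; the candidate text itself is excluded from the filler.
    """
    initial = candidate_text + "".join(keywords)
    missing = k - len(initial)
    if missing <= 0 or not related:
        return initial

    # S761: specified keyword with the largest specified priority.
    best_keyword, _ = max(related, key=lambda pair: pair[1])
    pool = [
        text for text in all_candidate_texts
        if best_keyword in text and text != candidate_text
    ]
    # S763: second candidate text H_i with at most k - p - q characters.
    filler = "".join(pool)[:missing]
    # S765: U_i = (A_i, Q_i, H_i).
    return initial + filler


if __name__ == "__main__":
    texts = ["cough", "flu season peaks in winter", "common cold"]
    print(supplement_text("cough", ["fever"], [("flu", 0.9), ("cold", 0.4)], texts, k=15))
```

If the pool of matching texts is shorter than the missing length, the result stays below k characters; a fuller implementation would have to decide how to handle that case.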

In the above, when the number of text strings corresponding to the initial text is less than the preset threshold, the initial text is supplemented with text associated with its candidate keywords. Different processing is applied according to the number of text strings corresponding to the initial text so that the numbers of text strings are unified, which improves the accuracy of the obtained specified text vector set.

S9, acquire a specified text vector set according to U, where the specified text vector set comprises a plurality of specified text vectors, and each specified text vector is obtained by inputting a specified text into a pre-training electronic medical record coding model.

Specifically, the pre-training electronic medical record coding model is a model for converting text into vectors, obtained by training a pre-training model on a medical record text training set.

Further, those skilled in the art know that the pre-training model may be selected according to actual requirements, for example the ERNIE pre-training model; all such choices fall within the protection scope of the present invention and are not described again here.

Further, the medical record text training set is a medical record text set for model training, which is acquired based on different search engines, and the medical record text set comprises a plurality of medical record texts with different types and forms.

Further, those skilled in the art know that any prior-art method of obtaining text from multiple search engines, such as Baidu, falls within the protection scope of the present invention and is not described again here.

According to the method, the number of the text strings is unified based on the types of the texts, the comprehensiveness of the acquired appointed text vectors is guaranteed, meanwhile, keyword factors in the texts are considered, text characters are truncated based on the priorities of the keywords, the texts are processed by different means based on different factors, and the accuracy of the acquired appointed text vector sets is improved.

Specifically, the method further comprises the following steps after S9:

S100, acquire, based on a first preset text set and the specified text vector set, a first target text set corresponding to the first preset text set.

Specifically, the first preset text set includes a plurality of first preset texts, where the first preset texts are question texts related to abnormal states, which are acquired based on the abnormal states.

Further, a question text is a text that asks for an answer and an explanation in the form of a question, for example a question text about the manifestations of luteinizing hormone being lower than 3.

Further, the first preset text is a question text obtained from a public medical database. Those skilled in the art know that any prior-art medicine-related question text obtained from a public medical database falls within the protection scope of the present invention and is not described again here.

Specifically, the step S100 further includes the following steps:

S101, acquire a first preset text vector set I = {I_1, …, I_t, …, I_θ}, where I_t is the first preset text vector corresponding to the t-th first preset text, t = 1, …, θ, and θ is the number of first preset texts.

Specifically, the first preset text vector is obtained by inputting a first preset text into the pre-training electronic medical record coding model.

S103, acquire the specified text vector set, whose i-th element is the i-th specified text vector.

S105, obtain, according to I and the specified text vector set, a first target similarity set ER = {ER_1, …, ER_t, …, ER_θ} corresponding to I, where ER_t = {ER_{t1}, …, ER_{ti}, …, ER_{tn}} and ER_{ti} is the first target similarity between I_t and the i-th specified text vector.

Specifically, those skilled in the art know that any prior-art method for obtaining the similarity between vectors, such as cosine similarity, falls within the protection scope of the present invention and is not described again here.

S107, when ER_{ti} ≥ ER^0, the specified text U_i corresponding to the i-th specified text vector is taken as a first target text corresponding to I_t, where ER^0 is a preset second priority threshold.

Specifically, the value range of ER^0 is 0.8 to 0.85. Those skilled in the art know that ER^0 may be selected according to actual requirements; all such choices fall within the protection scope of the present invention and are not described again here.
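Steps S101 to S107 amount to a thresholded similarity lookup, sketched below. The vectors are assumed to come from the coding model and are passed in as plain lists; cosine similarity is used because the text names it as one option; the function names are hypothetical.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def first_target_texts(
    question_vectors: list[list[float]],
    text_vectors: list[list[float]],
    texts: list[str],
    threshold: float = 0.8,
) -> list[list[str]]:
    """For each first preset text vector I_t, return every specified text U_i
    whose vector has similarity ER_ti >= ER^0 (the threshold)."""
    results = []
    for question in question_vectors:
        matched = [
            texts[i]
            for i, vector in enumerate(text_vectors)
            if cosine_similarity(question, vector) >= threshold
        ]
        results.append(matched)
    return results


if __name__ == "__main__":
    questions = [[1.0, 0.0], [0.0, 1.0]]
    vectors = [[0.9, 0.1], [0.0, 1.0]]
    print(first_target_texts(questions, vectors, ["U1", "U2"]))
```

A question may match several specified texts or none; the sketch returns one list per question so that both cases are visible to the caller.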

S200, based on the first preset text set and the first target text set, acquiring a second target text set corresponding to the first preset text set.

Specifically, the second target text set includes a plurality of second target texts. A second target text is an explanatory content text associated with a first preset text, generated through a prompt instruction based on the first preset text and the first target text set. For example, when the first preset text relates to the heart, the heart is briefly explained by combining the first target texts related to the first preset text with related knowledge from the relevant abnormal-state fields, and the first preset text together with the explanatory content obtained for it is regarded as the second target text.

Further, those skilled in the art know that any method for training by using a prompt instruction in the prior art to output a result falls within the protection scope of the present invention, and is not described herein.

According to the method, the second target text set corresponding to the first preset text set is generated based on the first preset text set and the first target text through the prompt instruction, the medical record text corresponding to each question text is obtained, and the prompt instruction is used for setting the instruction for the corresponding question text, so that understanding and replying of the electronic medical record question-answering system are facilitated, and accuracy of an output result of the electronic medical record question-answering system is improved.

S300, inputting the first preset text set and the second target text set into a preset first initial LLM model, and obtaining a third target text set corresponding to the first preset text set.

Specifically, the third target text set includes a plurality of third target texts, where the third target texts are answer texts and interpretation texts corresponding to a first preset text obtained based on the first preset text.

Further, the answer text is a text which answers based on the question text.

Further, the explanation text is text for obtaining explanation of the answer text based on the question text.

Further, in S300, a third target text is acquired by:

S301, acquire ψ fourth target texts corresponding to the first preset text according to the first preset text and the second target text corresponding to the first preset text, where each fourth target text is an answer text and an explanation text corresponding to the first preset text, obtained from a plurality of LLM models based on the second target text.

Specifically, those skilled in the art know that any method of outputting a result through the LLM model in the prior art falls within the protection scope of the present invention, and will not be described herein, where the LLM model is, for example, a Baichuan-13B model, a LLaMA model, or the like.

Specifically, the value range of ψ is 30-50, where those skilled in the art know that selection of ψ can be performed according to actual requirements, which falls into the protection range of the present invention, and will not be described herein.

S303, acquire, according to the fourth target texts, the priority corresponding to each fourth target text, where the priority is a score obtained by a voting method. Those skilled in the art know that any prior-art method for obtaining a score by voting falls within the protection scope of the present invention and is not described again here.

Specifically, the value range of the priority is 0-1.

S305, acquiring a third target text corresponding to the first preset text according to the priority, wherein the third target text is a fourth target text corresponding to the maximum priority.
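The selection in S301 to S305 can be sketched with a simple plurality vote. The text leaves the voting method open, so the scoring below is an assumption: a candidate's priority is the fraction of all candidates whose whitespace- and case-normalized text matches it, which keeps the score in the range 0 to 1. The function name `select_by_vote` is hypothetical, and real answers from different LLM models would usually need a softer notion of agreement than exact matching.

```python
from collections import Counter


def select_by_vote(candidates: list[str]) -> tuple[str, float]:
    """Return the candidate with the highest vote share and that share.

    ASSUMPTION: two candidates vote for each other when their normalized
    texts are identical. Ties are resolved in favor of the earliest candidate.
    """
    if not candidates:
        raise ValueError("at least one candidate text is required")

    normalized = [" ".join(text.lower().split()) for text in candidates]
    counts = Counter(normalized)
    scores = [counts[text] / len(candidates) for text in normalized]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


if __name__ == "__main__":
    answer, priority = select_by_vote(["Low LH", "low  lh", "Normal"])
    print(answer, round(priority, 3))
```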

S400, the first target text set, the second target text set and the third target text set are used as training sets to be input into a preset second initial LLM model, and an initial electronic medical record question-answering model is generated.

In another specific embodiment, the following steps are further included after S400:

S401, when the data volume of the training set corresponding to the initial electronic medical record question-answering model is greater than a preset data volume threshold, acquire a candidate parameter list ω = {ω_1, …, ω_c, …, ω_w} corresponding to the initial electronic medical record question-answering model, where ω_c is the c-th candidate parameter, c = 1, …, w, w is the number of candidate parameters, ω_c = 2^c, and w = 6.

Specifically, the candidate parameter is the rank of a matrix that is set in order to reduce the training time of the training set in the initial electronic medical record question-answering model. This can be understood as follows: data processing in an LLM model involves matrix-by-matrix multiplication, and training efficiency drops when the data volume of the training set is too large; to reduce the training time, matrices of smaller rank are therefore set to assist training, and the candidate parameter is the rank set for such a matrix.

Further, the value range of the preset data quantity threshold is 100 GB-1 TB, and those skilled in the art know that the selection of the preset data quantity threshold can be performed according to the actual requirement, which falls into the protection range of the present invention, and will not be described herein.

S402, according to $\omega$, acquiring a first intermediate priority list $T\omega = \{T\omega_1, \ldots, T\omega_c, \ldots, T\omega_w\}$ corresponding to $\omega$, where $T\omega_c$ is the first intermediate priority corresponding to $\omega_c$.

Specifically, the first intermediate priority is the occupancy rate of the GPU in the running process of the initial electronic medical record question-answering model, where those skilled in the art know that any method for obtaining the occupancy rate of the GPU in the prior art falls into the protection scope of the present invention, and is not described herein.

S403, when the first preset text is a first-type first preset text, acquiring, based on the preset weight types, a second intermediate priority set $E\omega = \{E\omega_1, \ldots, E\omega_c, \ldots, E\omega_w\}$ corresponding to $\omega$, where $E\omega_c = \{E\omega_{c1}, \ldots, E\omega_{c\mu}, \ldots, E\omega_{c\tau}\}$, $E\omega_{c\mu}$ is the μ-th second intermediate priority in the second intermediate priority list corresponding to $\omega_c$, $\mu = 1, \ldots, \tau$, and τ is the number of preset weight types.

Specifically, the first-type first preset text is a question text that contains a single question unrelated to other questions.

Specifically, the second intermediate priority is a score value of the initial electronic medical record question-answering model obtained, under different preset weight types, based on the candidate parameter and the first preset text; those skilled in the art know that any prior-art method of obtaining a score for a model under different conditions falls within the protection scope of the present invention, and details are not repeated here.

Specifically, the preset weight type is the type of weight matrix used in the calculation, which can be understood as follows: in the Transformer architecture, the self-attention module has four weight matrices (Wq, Wk, Wv, Wo), where Wq (or Wk, Wv) is treated as a single square matrix.

Specifically, τ is more than or equal to 4 and less than or equal to 30.

Preferably, τ is 6; taking τ as 6 avoids the inefficiency caused by running a large number of tests while keeping the tests comprehensive.

S404, when the first preset text is a second-type first preset text, acquiring, based on the preset weight types, a third intermediate priority set $L\omega = \{L\omega_1, \ldots, L\omega_c, \ldots, L\omega_w\}$ corresponding to $\omega$, where $L\omega_c = \{L\omega_{c1}, \ldots, L\omega_{c\mu}, \ldots, L\omega_{c\tau}\}$ and $L\omega_{c\mu}$ is the μ-th third intermediate priority in the third intermediate priority list corresponding to $\omega_c$.

Specifically, the second-type first preset text is a question text that contains a plurality of questions that are associated with one another.

Specifically, the third intermediate priority is a score value corresponding to an initial electronic medical record question-answering model obtained under different preset weight types based on the candidate parameter and the second type first preset text.

Further, the obtaining mode of the third intermediate priority is consistent with the obtaining mode of the second intermediate priority.

S405, acquiring a final priority list $F\omega = \{F\omega_1, \ldots, F\omega_c, \ldots, F\omega_w\}$ corresponding to $\omega$ according to $T\omega$, $E\omega$ and $L\omega$, where $F\omega_c$ meets the following condition:

S406, according to $F\omega$, acquiring $\omega_c$ as the target parameter of the initial electronic medical record question-answering model, where $F\omega_c$ is the largest final priority in $F\omega$.

In this way, the performance of the initial electronic medical record question-answering model is evaluated for each candidate parameter. Setting the candidate parameters saves model training time and avoids wasting resources without affecting the reasoning and response capabilities of the model, and adjusting the parameters at the same time makes the output of the electronic medical record question-answering model more accurate.

S500, inputting the second preset text set into the initial electronic medical record question-answering model, and obtaining the priority to be selected corresponding to the initial electronic medical record question-answering model.

Specifically, the second preset text set includes a plurality of second preset texts, where the second preset texts are question texts related to abnormal states for testing the effect of the initial electronic medical record question-answering model.

Specifically, in S500, the candidate priority is obtained by:

S501, inputting the second preset text set into the initial electronic medical record question-answering model to obtain a first key text set $EP = \{EP_1, \ldots, EP_\delta, \ldots, EP_\zeta\}$ corresponding to the second preset text set, where $EP_\delta$ is the first key text corresponding to the δ-th second preset text, $\delta = 1, \ldots, \zeta$, and ζ is the number of second preset texts.

Specifically, the first key text is an answer text and an explanation text corresponding to a second preset text obtained based on an initial electronic medical record question-answering model.

S503, according to EP, acquiring a first key text vector set $EP^0 = \{EP^0_1, \ldots, EP^0_\delta, \ldots, EP^0_\zeta\}$ corresponding to EP, where $EP^0_\delta = (EP^0_{\delta 1}, \ldots, EP^0_{\delta\gamma}, \ldots, EP^0_{\delta\eta})$, $EP^0_{\delta\gamma}$ is the bit value of the γ-th bit in the first key text vector corresponding to $EP_\delta$, $\gamma = 1, \ldots, \eta$, and η is the number of bits of the first key text vector.

Specifically, the first key text vector is a vector obtained by inputting the first key text into a natural language processing model, where any method for converting text into a vector by using a natural language processing model in the prior art is known to those skilled in the art, and all the methods fall into the protection scope of the present invention and are not described herein.

S505, acquiring a second key text set $FP = \{FP_1, \ldots, FP_\delta, \ldots, FP_\zeta\}$ corresponding to the second preset text set, where $FP_\delta$ is the second key text corresponding to the δ-th second preset text.

Specifically, the second key text is an accurate answer text and an interpretation text corresponding to the second preset text.

S507, according to FP, acquiring a second key text vector set $FP^0 = \{FP^0_1, \ldots, FP^0_\delta, \ldots, FP^0_\zeta\}$ corresponding to FP, where $FP^0_\delta = (FP^0_{\delta 1}, \ldots, FP^0_{\delta\gamma}, \ldots, FP^0_{\delta\eta})$ and $FP^0_{\delta\gamma}$ is the bit value of the γ-th bit in the second key text vector corresponding to $FP_\delta$.

Specifically, the obtaining mode of the second key text vector is consistent with the obtaining mode of the first key text vector.

S509, according to $EP^0$ and $FP^0$, acquiring the priority to be selected KL corresponding to the initial electronic medical record question-answering model, where KL meets the following condition:

In another specific embodiment, the priority to be selected is obtained in S500 by the following steps:

S610, inputting the second preset text set into the initial electronic medical record question-answering model to obtain a first initial text set $EW = \{EW_1, \ldots, EW_\lambda, \ldots, EW_\sigma\}$, where $EW_\lambda$ is the λ-th first initial text, $\lambda = 1, \ldots, \sigma$, and σ is the number of first initial texts.

Specifically, the first initial text is a first key text with a Chinese-English ratio in a preset ratio range, which is obtained from a first key text set.

Further, the first key text set comprises a plurality of first key texts, wherein the first key texts are answer texts and explanation texts corresponding to second preset texts obtained based on an initial electronic medical record question-answering model.

Further, the answer text is a text which answers based on the question text.

Further, the preset ratio range is $tr^1$ to $tr^2$, where $tr^1 = tr - tr^0$, $tr^2 = tr + tr^0$, tr is the average English-Chinese ratio of the texts in the obtained sample texts, and $tr^0$ is a preset ratio threshold.

Further, the value of $tr^0$ is in the range of 0.01-0.1; those skilled in the art know that $tr^0$ can be selected according to actual requirements, which falls within the protection scope of the present invention, and details are not repeated here.

Further, the sample text is a text which is output by inputting a preset sample text into an initial electronic medical record question-answering model, wherein the property of the preset sample text is consistent with that of a first preset text, and the acquisition mode of the preset sample text can refer to the acquisition mode of the first preset text.

S620, according to EW, acquiring a first initial text vector set $EW^0 = \{EW^0_1, \ldots, EW^0_\lambda, \ldots, EW^0_\sigma\}$, where $EW^0_\lambda = (EW^0_{\lambda 1}, \ldots, EW^0_{\lambda\gamma}, \ldots, EW^0_{\lambda\eta})$, $EW^0_{\lambda\gamma}$ is the bit value of the γ-th bit in the first initial text vector corresponding to $EW_\lambda$, $\gamma = 1, \ldots, \eta$, and η is the number of bits of the first initial text vector.

Specifically, the first initial text vector is a vector obtained by inputting the first initial text into a natural language processing model, where any method for converting the text into the vector by using any natural language processing model in the prior art is known to those skilled in the art, and all the methods fall into the protection scope of the present invention and are not described herein.

S630, acquiring a second initial text set $FW = \{FW_1, \ldots, FW_\lambda, \ldots, FW_\sigma\}$ according to the first initial text set, where $FW_\lambda$ is the λ-th second initial text.

Specifically, the second initial text is an accurate answer text and an accurate interpretation text of a second preset text corresponding to the first initial text.

S640, according to FW, acquiring a second initial text vector set $FW^0 = \{FW^0_1, \ldots, FW^0_\lambda, \ldots, FW^0_\sigma\}$ corresponding to FW, where $FW^0_\lambda = (FW^0_{\lambda 1}, \ldots, FW^0_{\lambda\gamma}, \ldots, FW^0_{\lambda\eta})$ and $FW^0_{\lambda\gamma}$ is the bit value of the γ-th bit in the second initial text vector corresponding to $FW_\lambda$.

Specifically, the obtaining mode of the second initial text vector is consistent with the obtaining mode of the first initial text vector.

S650, according to $EW^0$ and $FW^0$, acquiring a first similarity list $\Delta W = \{\Delta W_1, \ldots, \Delta W_\lambda, \ldots, \Delta W_\sigma\}$, where $\Delta W_\lambda$ meets the following condition:

S660, acquiring a first initial keyword set corresponding to EW according to EW, wherein the first initial keyword set comprises a plurality of first initial keyword lists, each first initial keyword list comprises first initial keywords, and the first initial keywords are keywords in a first initial text.

Specifically, the first initial keyword is a word obtained from the first initial text that is similar to a target word in the target term knowledge graph.

Specifically, the first initial keyword may be obtained in a manner consistent with the candidate keyword, and reference may be made to steps S731 to S737.

S670, acquiring a second initial keyword set corresponding to FW according to the FW, wherein the second initial keyword set comprises a plurality of second initial keyword lists, each second initial keyword list comprises a second initial keyword, and the second initial keywords are keywords in a second initial text.

Specifically, the obtaining mode of the second initial keyword is consistent with the obtaining mode of the first initial keyword.

S680, according to the first initial keyword set and the second initial keyword set, acquiring a second similarity list $\Delta V = \{\Delta V_1, \ldots, \Delta V_\lambda, \ldots, \Delta V_\sigma\}$, where $\Delta V_\lambda$ is the similarity between the first initial keyword and the second initial keyword corresponding to the same second preset text.

Specifically, $\Delta V_\lambda$ is obtained in the same manner as $\Delta W_\lambda$.

S690, according to $\Delta W$ and $\Delta V$, acquiring the priority to be selected KL corresponding to the initial electronic medical record question-answering model.

Specifically, KL is obtained in S690 by:

S691, when $\Delta W_\lambda \le ZM^0$, $KL = 0$, where $ZM^0$ is a preset first similarity threshold.

Specifically, the value range of $ZM^0$ is 0.6-0.85; those skilled in the art know that the preset first similarity threshold can be selected according to actual requirements, which falls within the protection scope of the present invention, and details are not repeated here.

S693, when $\Delta W_\lambda \ge ZM^0$ and $\Delta V_\lambda \le ZM^1$, KL meets the following condition:

where $ZM^1$ is a preset second similarity threshold.

Specifically, the value range of $ZM^1$ is 0.5-0.9; those skilled in the art know that the preset second similarity threshold can be selected according to actual requirements, which falls within the protection scope of the present invention, and details are not repeated here.

S695, when $\Delta W_\lambda \ge ZM^0$ and $\Delta V_\lambda \ge ZM^1$, KL meets the following condition:

In this way, different coefficients for calculating the priority are set according to the first similarity and the second similarity, so that the priority to be selected corresponding to the electronic medical record question-answering model is acquired along different dimensions and is more accurate. In addition, the priority to be selected is acquired in different ways under different conditions, and setting the priority reasonably makes the results output by the electronic medical record question-answering system more accurate.

S600, adjusting the parameters of the initial electronic medical record question-answering model based on the priority to be selected until the priority to be selected is not smaller than a preset priority threshold to be selected, so as to obtain the target electronic medical record question-answering model.

Specifically, the value range of the preset priority threshold to be selected is 0.7-0.9; those skilled in the art know that the preset priority threshold to be selected can be chosen according to actual needs, which falls within the protection scope of the present invention, and details are not repeated here.

Specifically, those skilled in the art know that any process of parameter adjustment for the training model in the prior art falls into the protection scope of the present invention, and is not described herein.
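The stopping rule of S600 can be sketched as a simple loop. The callables `evaluate` and `adjust` are hypothetical stand-ins for computing the priority to be selected and for one round of parameter adjustment, neither of which the source specifies, and the `max_rounds` safety bound is likewise an assumption.

```python
from typing import Callable, Tuple, TypeVar

M = TypeVar("M")


def tune_until_threshold(model: M,
                         evaluate: Callable[[M], float],
                         adjust: Callable[[M], M],
                         threshold: float = 0.8,
                         max_rounds: int = 100) -> Tuple[M, float]:
    """Adjust the model until its priority to be selected reaches the threshold.

    `evaluate` returns the priority to be selected (KL) of a model and
    `adjust` returns an adjusted model; both are placeholders.
    """
    kl = evaluate(model)
    rounds = 0
    while kl < threshold and rounds < max_rounds:
        model = adjust(model)
        kl = evaluate(model)
        rounds += 1
    return model, kl


# Toy stand-in: the "model" is an integer score in percent.
final_model, final_kl = tune_until_threshold(
    50,
    evaluate=lambda m: m / 100,
    adjust=lambda m: m + 10,
    threshold=0.8,
)
```

The loop stops as soon as the priority is not smaller than the threshold, matching the wording of S600; the default of 0.8 is just one value inside the stated 0.7-0.9 range.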

S700, acquiring a preset key text and inputting the preset key text into the target electronic medical record question-answering model to acquire a target text, wherein the preset key text is a question text to be queried that is related to an abnormal state, and the target text is the answer text and explanation text corresponding to the preset key text.

By applying the LLM model to the electronic medical record questions and answers, the large-scale data can be processed, the application limitation of the electronic medical record questions and answers model is reduced, the instruction is set for the electronic medical record questions and answers model through the prompt instruction, the understanding and the reply of the electronic medical record questions and answers system are facilitated, and the accuracy of the output result of the electronic medical record questions and answers system is improved.

Specifically, the step S700 further includes the following steps:

S701, acquiring a key entity set according to a sample database, wherein the key entity set comprises a plurality of key entities, and the key entities are entities related to abnormal states and acquired based on the sample database.

Specifically, the sample database includes a plurality of information related to abnormal states, such as a drug data table, a human body part, an ICD-10 standard word stock, symptom signs, infectious diseases and the like.

Further, in S701, the key entity is acquired by:

S7011, acquiring a sample entity set according to the sample data set, wherein the sample entity set comprises a plurality of sample entities, and the sample entities are entities related to abnormal states acquired from the sample data set. This can be understood as follows: the sample data set comprises a plurality of text descriptions related to abnormal states, and the terms related to the medical field extracted from those descriptions are the sample entities.

In particular, the sample entity set includes millions of sample entities.

Further, it is known in the art that any method of extracting an entity from text in the prior art falls within the protection scope of the present invention, and is not described herein.

S7013, acquiring a first sample entity set according to the sample entity set, wherein the first sample entity set comprises a plurality of first sample entities, and the first sample entities are entities similar to the sample entities obtained based on an LLM model.

Specifically, those skilled in the art know that any prior-art method of obtaining similar entities based on an LLM model, for example an LLM model such as ChatGLM, falls within the protection scope of the present invention, and details are not repeated here.

S7015, obtaining a second sample entity set according to the first sample entity set, where the second sample entity set includes a plurality of second sample entities, and the second sample entities are entities that have no similar features with the first sample entities.

Specifically, those skilled in the art know that any method for acquiring an entity with no similar characteristics to an entity based on an entity characteristic in the prior art falls within the protection scope of the present invention, and is not described herein, for example, acquiring an entity with no similar characteristics to an entity through an FM model, an FFM model, or the like.

S7017, a key entity set is obtained based on the sample entity set, the first sample entity set, and the second sample entity set, wherein the key entity set includes the sample entity set, the first sample entity set, and the second sample entity set.

Specifically, the number of the key entities in the key entity set is tens of millions, wherein, those skilled in the art know that the ratio of the first sample entity to the second sample entity can be selected according to the actual requirement, which falls into the protection scope of the present invention, and is not described herein.
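Steps S7011 to S7017 amount to expanding the sample entities twice and taking the union of the three sets. In the sketch below, `expand_similar` and `expand_dissimilar` are hypothetical callables standing in for the LLM-based and feature-based models mentioned above; the lookup tables are made-up illustration data, not output of any real model.

```python
from typing import Callable, Iterable, Set


def build_key_entity_set(sample_entities: Iterable[str],
                         expand_similar: Callable[[str], Iterable[str]],
                         expand_dissimilar: Callable[[str], Iterable[str]]
                         ) -> Set[str]:
    """Union of the sample, first sample and second sample entity sets."""
    samples = set(sample_entities)

    # S7013: entities similar to each sample entity.
    first: Set[str] = set()
    for entity in samples:
        first.update(expand_similar(entity))

    # S7015: entities with no features similar to the first sample entities.
    second: Set[str] = set()
    for entity in first:
        second.update(expand_dissimilar(entity))

    # S7017: the key entity set contains all three sets.
    return samples | first | second


# Hypothetical lookup tables replacing the real models.
similar = {"hypertension": ["high blood pressure", "HTN"]}
dissimilar = {"high blood pressure": ["fracture"], "HTN": ["fracture"]}

key_entities = build_key_entity_set(
    ["hypertension"],
    expand_similar=lambda e: similar.get(e, []),
    expand_dissimilar=lambda e: dissimilar.get(e, []),
)
```

Using sets removes duplicates automatically, so an entity produced by more than one expansion step is kept only once.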

S702, inputting the key entity set and the target entity set into a first intermediate model to obtain the key entity vector set and the target entity vector set.

Specifically, the target entity set includes a plurality of target entities, wherein the target entities are standard terms related to abnormal states.

Specifically, the first intermediate model is a model for converting text into vectors; those skilled in the art know that any natural language processing model for converting text into vectors, for example a BERT model, can be selected according to actual requirements, which falls within the protection scope of the present invention, and details are not repeated here.

Specifically, the key entity vector set includes a plurality of key entity vectors, where the key entity vectors are vectors corresponding to key entities.

Further, the target entity vector set includes a plurality of target entity vectors, where the target entity vectors are vectors corresponding to target entities.

S703, inputting the key entity vector set and the target entity vector set into a second intermediate model, and obtaining a final entity set corresponding to the key entity set, wherein the second intermediate model is a preset neural network model.

Specifically, in S703, the final entity set is acquired by:

S7031, acquiring any key entity vector $XY = (XY_1, \ldots, XY_{ab}, \ldots, XY_{jk})$ from the key entity vector set, where $XY_{ab}$ is the bit value of the ab-th bit in the key entity vector, $ab = 1, \ldots, jk$, and jk is the number of bits of the key entity vector.

S7032, acquiring a target entity vector set $ZH = \{ZH_1, \ldots, ZH_{cd}, \ldots, ZH_{ef}\}$, where $ZH_{cd} = (ZH^{1}_{cd}, \ldots, ZH^{ab}_{cd}, \ldots, ZH^{jk}_{cd})$, $ZH^{ab}_{cd}$ is the bit value of the ab-th bit of the cd-th target entity vector, $cd = 1, \ldots, ef$, and ef is the number of target entity vectors.

S7033, according to XY and ZH, acquiring a first intermediate priority list $XH = \{XH_1, \ldots, XH_{cd}, \ldots, XH_{ef}\}$ corresponding to XY, where $XH_{cd}$ is the first intermediate priority between XY and $ZH_{cd}$, and $XH_{cd}$ meets the following condition:

When the priority corresponding to an entity is acquired, the method is not limited to a single approach; the final priority corresponding to the entity is obtained by combining several methods, which improves the accuracy of the acquired priority, so that the standardized result corresponding to the output of the electronic medical record question-answering model is more accurate.

S7035, acquiring the final entity corresponding to XY according to XH, wherein when $XH_{cd}$ is the largest first intermediate priority in XH, the target entity corresponding to $ZH_{cd}$ is taken as the final entity corresponding to XY.

S704, acquiring a target model based on the sample entity set and the final entity set, wherein the target model is a model trained in the process of acquiring the final entity set based on the sample entity set.

S705, acquiring a first candidate entity set corresponding to a target text, wherein the first candidate entity set comprises a plurality of first candidate entities, and the first candidate entities are entities acquired from the target text.

Specifically, those skilled in the art know that any method for obtaining an entity from a text in the prior art falls within the protection scope of the present invention, and is not described herein.

S706, inputting the first candidate entity into the target model, and acquiring a second candidate entity set corresponding to the target text, wherein the second candidate entity set comprises a plurality of second candidate entities, and the second candidate entities are entities in the target entity corresponding to the first candidate entity acquired based on the first candidate entity and the target model.

S707, replacing the first candidate entity set in the target text with the corresponding second candidate entity set to realize the standardization processing of the target text.

By means of the method, the results output by the electronic medical record question-answering model are subjected to standardized processing, and follow-up data query and statistics are facilitated.
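The replacement in S705 to S707 can be sketched as a single-pass substitution. The mapping from first candidate entities to second candidate entities would come from the target model; here it is a hypothetical dictionary, and the function name `standardize_text` is likewise an illustration only.

```python
import re
from typing import Dict


def standardize_text(target_text: str, entity_map: Dict[str, str]) -> str:
    """Replace each first candidate entity with its standard term.

    All mentions are replaced in one pass, longest first, so that a
    replacement is never rewritten by a later, shorter mention.
    """
    if not entity_map:
        return target_text
    mentions = sorted(entity_map, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(m) for m in mentions))
    return pattern.sub(lambda match: entity_map[match.group(0)], target_text)


# Hypothetical mapping standing in for the output of the target model.
mapping = {
    "heart attack": "myocardial infarction",
    "high blood pressure": "hypertension",
}
normalized = standardize_text(
    "History of heart attack and high blood pressure.", mapping
)
```

A single compiled pattern is chosen over repeated `str.replace` calls because sequential replacement could alter a standard term that happens to contain another mention.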

This embodiment provides a data preprocessing system based on an electronic medical record question-answering model, comprising a sample electronic medical record information set, a processor and a memory storing a computer program, wherein the sample electronic medical record information set comprises a plurality of pieces of sample electronic medical record information, the sample electronic medical record information is abnormal state characteristic information corresponding to medical records obtained from a database, and when the computer program is executed by the processor, the following steps are realized: a candidate text set is acquired according to the sample electronic medical record information set; a candidate keyword set corresponding to the candidate text set is acquired according to the candidate text set and a target term knowledge graph; an initial text set is acquired according to the candidate text set and the candidate keyword set; a target text set is acquired according to the initial text set, wherein the initial text is processed in different ways depending on its number of text strings to obtain the target text; and a specified text vector is acquired according to the target text set.

While certain specific embodiments of the invention have been described in detail by way of example, it will be appreciated by those skilled in the art that the above examples are for illustration only and are not intended to limit the scope of the invention. Those skilled in the art will also appreciate that many modifications may be made to the embodiments without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims

1. A data preprocessing system based on an electronic medical record question-answering model, the system comprising: the system comprises a sample electronic medical record information set, a processor and a memory storing a computer program, wherein the sample electronic medical record information set comprises a plurality of sample electronic medical record information, the sample electronic medical record information is abnormal state characteristic information corresponding to medical records obtained from a database, and when the computer program is executed by the processor, the following steps are realized:

S1, acquiring a candidate text set $A = \{A_1, \ldots, A_i, \ldots, A_n\}$ according to the sample electronic medical record information set, where $A_i$ is the i-th candidate text, $i = 1, \ldots, n$, and n is the number of candidate texts;

S3, acquiring a candidate keyword set $Q = \{Q_1, \ldots, Q_i, \ldots, Q_n\}$ corresponding to A according to A and the target term knowledge graph, where $Q_i$ is the candidate keyword list corresponding to $A_i$;

S5, acquiring an initial text set $T = \{T_1, \ldots, T_i, \ldots, T_n\}$ according to A and Q, where $T_i = \{A_i, Q_i\}$ and $T_i$ is the i-th initial text;

S71, according to $T_i$, acquiring the text string $WT_i = (WT^0_{i1}, \ldots, WT^0_{ix}, \ldots, WT^0_{ip}, WT^1_{i1}, \ldots, WT^1_{iy}, \ldots, WT^1_{iq})$ corresponding to $T_i$, where $WT^0_{ix}$ is the x-th character corresponding to $A_i$, $x = 1, \ldots, p$, p is the number of characters corresponding to $A_i$, $WT^1_{iy}$ is the y-th character corresponding to $Q_i$, $y = 1, \ldots, q$, and q is the number of characters corresponding to $Q_i$;

S72, when $p + q = K$, acquiring $U_i = T_i$, where K is a preset key priority threshold;

S73, when $p + q > K$, acquiring a candidate priority set $P = \{P_1, \ldots, P_i, \ldots, P_n\}$ corresponding to Q, where $P_i = \{P_{i1}, \ldots, P_{ie}, \ldots, P_{if(i)}\}$, $P_{ie}$ is the candidate priority corresponding to the e-th candidate keyword in the candidate keyword list corresponding to $Q_i$, $e = 1, \ldots, f(i)$, and $f(i)$ is the number of candidate keywords in the candidate keyword list corresponding to $Q_i$;

S74, processing $WT_i$ based on P to obtain $U_i$;

S75, when $p + q < K$, acquiring a specified keyword set $R_i = \{R_{i1}, \ldots, R_{ie}, \ldots, R_{if(i)}\}$ corresponding to $Q_i$ and a specified priority set $G_i = \{G_{i1}, \ldots, G_{ie}, \ldots, G_{if(i)}\}$ corresponding to $Q_i$, where $R_{ie}$ is the specified keyword list corresponding to $Q_{ie}$ and $G_{ie}$ is the specified priority list corresponding to $Q_{ie}$;

S76, processing $WT_i$ according to $R_i$ and $G_i$ to obtain $U_i$;

2. The system for preprocessing data based on an electronic medical record question-answering model according to claim 1, wherein the data format of the sample electronic medical record information includes a text format and a table format.

3. The data preprocessing system based on the electronic medical record question-answering model according to claim 2, wherein in S1, candidate texts are obtained by:

S11, when the data format of the sample electronic medical record information is a text format, segmenting the sample electronic medical record information according to segmentation symbols to generate candidate texts;

S13, when the data format of the sample electronic medical record information is a table format, integrating each record in the sample electronic medical record information with the field name corresponding to the record to generate a candidate text.

4. The data preprocessing system based on the electronic medical record question-answering model according to claim 1, wherein in S3, $Q_i$ is obtained by the following steps:

S31, according to A, acquiring a first intermediate word set $B = \{B_1, \ldots, B_i, \ldots, B_n\}$ corresponding to A, where $B_i = \{B_{i1}, \ldots, B_{ij}, \ldots, B_{im(i)}\}$, $B_{ij}$ is the j-th first intermediate word in the first intermediate word list corresponding to $A_i$, $j = 1, \ldots, m(i)$, and $m(i)$ is the number of first intermediate words in the first intermediate word list corresponding to $A_i$;

S33, acquiring a target word list $D = \{D_1, \ldots, D_r, \ldots, D_s\}$ according to the target term knowledge graph, where $D_r$ is the r-th target word, $r = 1, \ldots, s$, and s is the number of target words;

S35, according to B and D, acquiring a first intermediate similarity set $F = \{F_1, \ldots, F_i, \ldots, F_n\}$ corresponding to B, where $F_i = \{F_{i1}, \ldots, F_{ij}, \ldots, F_{im(i)}\}$, $F_{ij} = \{F^{1}_{ij}, \ldots, F^{r}_{ij}, \ldots, F^{s}_{ij}\}$, and $F^{r}_{ij}$ is the first intermediate similarity between $B_{ij}$ and $D_r$;

5. The system for preprocessing data based on an electronic medical record question-answering model according to claim 1, wherein the initial text is a text in which candidate text is spliced with candidate keywords and the candidate keywords are spliced after the candidate text.

6. The data preprocessing system based on the electronic medical record question-answering model according to claim 1, wherein K is acquired in S72 by:

S721, according to T, acquiring a key text type set $C = \{C_1, \ldots, C_d, \ldots, C_z\}$, where $C_d = \{C_{d1}, \ldots, C_{dg}, \ldots, C_{dh(d)}\}$, $C_{dg}$ is the g-th key text in the d-th key text list, $g = 1, \ldots, h(d)$, $h(d)$ is the number of key texts in the d-th key text list, $d = 1, \ldots, z$, and z is the number of key text types;

S723, according to C, acquiring a first text string number set $C^0 = \{C^0_1, \ldots, C^0_d, \ldots, C^0_z\}$ corresponding to C, where $C^0_d = \{C^0_{d1}, \ldots, C^0_{dg}, \ldots, C^0_{dh(d)}\}$ and $C^0_{dg}$ is the first text string number corresponding to $C_{dg}$;

S725, according to $C^0$, acquiring a second text string number set $C^1 = \{C^1_1, \ldots, C^1_d, \ldots, C^1_z\}$ corresponding to C, where $C^1_d = \{C^1_{d1}, \ldots, C^1_{du}, \ldots, C^1_{dh(d)}\}$, $C^1_{du}$ is the u-th second text string number in the second text string number list corresponding to the d-th type of key text list, $u = 1, \ldots, h(d)$, and $C^1_{d1} \ge \ldots \ge C^1_{du} \ge \ldots \ge C^1_{dh(d)}$;

S725, according to $C^0$, acquiring K, where K meets the following condition:

7. The data preprocessing system based on the electronic medical record question-answering model according to claim 6, wherein the key text is an initial text obtained from T based on a text type corresponding to the initial text.

8. The data preprocessing system based on the electronic medical record question-answering model according to claim 6, wherein the second number of text strings is the number of text strings sequentially acquired in order from the top to the bottom according to the first number of text strings.

9. The data preprocessing system based on the electronic medical record question-answering model according to claim 1, wherein in S73, $P_{ie}$ is obtained by the following steps:

S731, acquiring the candidate keyword list $Q_i = \{Q_{i1}, \ldots, Q_{ie}, \ldots, Q_{if(i)}\}$, where $Q_{ie}$ is the e-th candidate keyword in $Q_i$;

S733, according to the target term knowledge graph, acquiring the specified keyword list $R_{ie} = \{R^{1}_{ie}, \ldots, R^{a}_{ie}, \ldots, R^{b(e)}_{ie}\}$ corresponding to $Q_{ie}$ and the specified priority list $G_{ie} = \{G^{1}_{ie}, \ldots, G^{a}_{ie}, \ldots, G^{b(e)}_{ie}\}$ corresponding to $Q_{ie}$, where $R^{a}_{ie}$ is the a-th specified keyword corresponding to $Q_{ie}$, $a = 1, \ldots, b(e)$, $b(e)$ is the number of specified keywords corresponding to $Q_{ie}$, and $G^{a}_{ie}$ is the specified priority between $Q_{ie}$ and $R^{a}_{ie}$;

10. The data preprocessing system based on the electronic medical record question-answering model according to claim 1, further comprising the steps of:

S741, according to $P_i$, acquiring the first intermediate text $\beta^1_i = (A_i, Q_{i1}, \ldots, Q_{i(e-1)}, Q_{i(e+1)}, \ldots, Q_{if(i)})$ corresponding to $T_i$, where $P_{ie}$ is the smallest candidate priority in $P_i$;

S743, when the number of text strings corresponding to $\beta^1_i$ is not more than K, acquiring $U_i = \beta^1_i$;

S745, when the number of text strings corresponding to $\beta^1_i$ is larger than K, acquiring the smallest candidate priority in $P_i$ other than $P_{ie}$, and deleting the corresponding candidate keyword from $Q_i$ in the initial text to obtain the second intermediate text $\beta^2_i$ corresponding to $T_i$;