CN114334068B - Radiology report generation method, device, terminal and storage medium - Google Patents


Info

Publication number
CN114334068B
CN114334068B (application CN202111346347.8A)
Authority
CN
China
Prior art keywords
character
features
report
radiology report
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111346347.8A
Other languages
Chinese (zh)
Other versions
CN114334068A (en)
Inventor
张灵艳
陈志鸿
李米芳
万翔
朱记超
谢尚煌
孙崎元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Longgang Central Hospital; Shenzhen Longgang Central Hospital Group; Shenzhen Ninth People's Hospital; Acupuncture Research Institute Of Shenzhen Longgang Central Hospital
Shenzhen Research Institute of Big Data SRIBD
Original Assignee
Shenzhen Longgang Central Hospital; Shenzhen Longgang Central Hospital Group; Shenzhen Ninth People's Hospital; Acupuncture Research Institute Of Shenzhen Longgang Central Hospital
Shenzhen Research Institute of Big Data SRIBD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Longgang Central Hospital Shenzhen Longgang Central Hospital Group Shenzhen Ninth People's Hospital Acupuncture Research Institute Of Shenzhen Longgang Central Hospital, Shenzhen Research Institute of Big Data SRIBD filed Critical Shenzhen Longgang Central Hospital Shenzhen Longgang Central Hospital Group Shenzhen Ninth People's Hospital Acupuncture Research Institute Of Shenzhen Longgang Central Hospital
Priority to CN202111346347.8A priority Critical patent/CN114334068B/en
Publication of CN114334068A publication Critical patent/CN114334068A/en
Application granted granted Critical
Publication of CN114334068B publication Critical patent/CN114334068B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Processing (AREA)
  • Compression Of Band Width Or Redundancy In Fax (AREA)

Abstract

The invention discloses a radiology report generation method, device, terminal and storage medium. In the radiology report generation method, an image to be processed is input into a trained report generation model comprising a visual feature encoder, a potential feature encoder and a layered decoder. After the visual features of the image to be processed are extracted by the visual feature encoder, the potential features are extracted by the potential feature encoder. A multi-layer attention mechanism is adopted in the layered decoder, so that the character features and sentence features of the report are alternately aggregated and distributed and the potential features and visual features are encoded into the semantic features of the report, which ensures the accuracy of the next character predicted from the existing characters of the report. Generating the radiology report of the image to be processed with a deep learning model improves the efficiency of compiling radiology reports.

Description

Radiology report generation method, device, terminal and storage medium
Technical Field
The present invention relates to the field of deep learning technologies, and in particular, to a method, an apparatus, a terminal, and a storage medium for generating a radiology report.
Background
Radiological images are widely used in the medical field, and diagnostic reports describing them must be compiled; however, compiling a radiology report is generally time-consuming and requires comprehensive knowledge and experience to interpret the radiological image.
Thus, there is a need for improvements and enhancements in the art.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a radiology report generation method, aiming to solve the problem that compiling radiology reports in the prior art is time-consuming.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
in a first aspect of the present invention, a method for generating a radiology report is provided, where the method includes:
acquiring an image to be processed, and inputting the image to be processed into a trained report generation model, wherein the report generation model comprises a visual feature encoder, a target embedding matrix, a potential feature encoder and a layered decoder, and the layered decoder comprises a first attention layer, a second attention layer and a third attention layer;
acquiring the visual features of the image to be processed through the visual feature encoder, inputting the visual features of the image to be processed into the potential feature encoder, and acquiring the potential features corresponding to the image to be processed output by the potential feature encoder;
acquiring an embedded feature of each character in a current radiology report according to the target embedded matrix, inputting each embedded feature into the first attention layer, and acquiring a first character-level feature of each character and a first aggregation feature of each sentence in the current radiology report output by the first attention layer;
inputting the first aggregation feature of each sentence of the current radiology report and the potential feature of the image to be processed into the second attention layer, and coding the potential feature of the image to be processed into the semantic feature of each sentence of the current radiology report through the second attention layer to obtain a second aggregation feature of each sentence of the current radiology report;
inputting each second aggregation feature, each first character-level feature and the visual feature of the to-be-processed image into the third attention layer, coding the visual feature of the to-be-processed image into the semantic feature of each character of the current radiology report through the third attention layer to obtain a second character-level feature corresponding to each character of the current radiology report, and obtaining the next character in the current radiology report according to each second character-level feature;
repeatedly executing the step of obtaining the embedding characteristics of each character in the current radiology report according to the target embedding matrix until a preset end character is obtained, and obtaining a target radiology report corresponding to the image to be processed;
wherein the initial content of the radiology report is a preset sentence marking character.
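The iterative generation procedure above can be sketched as follows. Here `next_char_fn` is a hypothetical stand-in for one full pass through the three attention layers, and the marker strings are purely illustrative, not the patent's actual tokens:

```python
# Sketch of the autoregressive generation loop: the report starts as
# the preset sentence-marking character, and characters are appended
# until the preset end character is produced.
SENT_MARK, END_MARK = "[sent]", "[end]"

def generate_report(next_char_fn, max_len=100):
    report = [SENT_MARK]                 # initial content: sentence marker
    for _ in range(max_len):
        nxt = next_char_fn(report)       # predict next char from all current chars
        if nxt == END_MARK:              # preset end character terminates the loop
            break
        report.append(nxt)
    return report

# Dummy predictor that emits a fixed short "report" for illustration.
def dummy_next(report):
    script = ["lung", "texture", END_MARK]
    return script[len(report) - 1]
```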
The radiology report generation method, wherein the report generation model is obtained by training on a preset data set, the preset data set comprises a plurality of groups of training samples, and each group of training samples comprises a sample image and a corresponding sample radiology report; before inputting the image to be processed into the trained report generation model, the method includes:
selecting a target training sample in the preset data set;
inputting a sample image in the target training sample into the report generation model, and acquiring the visual feature of the sample image;
inputting a sample radiology report in the target training sample into a text encoder, acquiring text features of the sample radiology report, inputting the text features into the potential feature encoder, and acquiring the potential features corresponding to the sample radiology report;
inputting the visual features of the sample image, a first character in the sample radiology report and the potential features corresponding to the sample radiology report into the layered decoder to obtain a prediction report corresponding to the sample radiology report;
obtaining the loss of the target training sample according to the prediction report, and updating the network parameters of the report generation model according to the loss of the target training sample;
and re-executing the step of selecting the target training sample in the preset data set until the parameters of the report generation model are converged.
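A minimal sketch of this training loop, assuming a hypothetical `model_step` callable that runs the forward pass, updates the network parameters, and returns the sample loss. The convergence test on the loss change is an illustrative choice; the patent only requires that the parameters converge:

```python
import random

def train(model_step, dataset, max_epochs=10, tol=1e-4):
    """Repeatedly select a target training sample, compute its loss,
    update parameters, and stop once the loss change falls below a
    convergence tolerance."""
    prev = float("inf")
    for _ in range(max_epochs):
        sample = random.choice(dataset)   # select a target training sample
        loss = model_step(sample)         # forward pass + parameter update
        if abs(prev - loss) < tol:        # parameters considered converged
            break
        prev = loss
    return prev
```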
The method for generating a radiology report, wherein the obtaining of the loss of the target training sample according to the prediction report includes:
acquiring a first probability distribution according to the prediction report, wherein the first probability distribution is the probability distribution of the prediction report being the sample radiology report under the joint condition of the potential features corresponding to the text features and the sample images;
inputting the visual features of the sample image into the potential feature encoder, and obtaining a second probability distribution according to the output of the potential feature encoder, wherein the second probability distribution is the probability distribution of the potential features corresponding to the text features under the condition of the sample image;
obtaining a third probability distribution according to the potential features corresponding to the text features, wherein the third probability distribution is the probability distribution of the potential features corresponding to the text features under the condition of the sample radiology report;
obtaining a loss of the target training sample based on the first probability distribution, the second probability distribution, and the third probability distribution.
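The three distributions are characteristic of a conditional variational formulation: the loss combines a reconstruction term (the first distribution) with a divergence tying the report-conditioned latent distribution (the third) to the image-conditioned one (the second). A sketch under the assumption of diagonal Gaussian parameterizations, which the patent does not specify:

```python
import numpy as np

def kl_diag_gauss(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians: here q stands for the
    third distribution (latents given the report) and p for the second
    (latents given the image). The Gaussian form is an assumption."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(logvar_p - logvar_q
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def sample_loss(recon_nll, mu_q, logvar_q, mu_p, logvar_p):
    # Total loss: negative log-likelihood under the first distribution
    # plus the KL term relating the second and third distributions.
    return recon_nll + kl_diag_gauss(mu_q, logvar_q, mu_p, logvar_p)
```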
The radiology report generating method, wherein the obtaining of the first character-level feature of each character and the first aggregate feature of each sentence in the current radiology report output by the first attention layer includes:
and taking the first character-level features corresponding to the preset sentence marking characters of each sentence in the current radiology report as the first aggregation features of each sentence in the radiology report.
The radiology report generating method, wherein the encoding, by the second attention layer, of the potential features of the to-be-processed image into the semantic features of each sentence of the current radiology report to obtain the second aggregate features of each sentence of the current radiology report includes:
generating a query embedding of a sentence according to the first aggregated features of the sentence;
generating key embedding and value embedding of sentences according to the potential features of the images to be processed;
performing a multi-head attention mechanism based on the query embedding, key embedding, and value embedding of each sentence to obtain the second aggregate features of each sentence.
The radiology report generation method, wherein the encoding, by the third attention layer, of the visual features of the to-be-processed image into the semantic features of each character of the current radiology report to obtain second character-level features corresponding to each character of the current radiology report includes:
generating query embeddings of the literal characters according to the first character-level features of the literal characters, and generating query embeddings of the preset sentence marking characters of the sentence according to the second aggregation features of the sentence;
generating key embedding and value embedding of characters according to the visual characteristics of the image to be processed;
performing a multi-head attention mechanism based on the query embedding, key embedding, and value embedding of each character to obtain the second character-level feature of each character.
The radiology report generating method, wherein the layered decoder further comprises a feed-forward layer, the feed-forward layer comprising at least one linear transformation layer; the obtaining of a next character in the current radiology report according to each of the second character-level features includes:
inputting each of the second character-level features to the feed-forward layer;
and obtaining the next character in the current radiology report according to the output of the feedforward layer.
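A minimal sketch of this step, assuming the feed-forward layer is two linear transformations with a ReLU between them and that the next character is the vocabulary entry with the highest output score; all weight names are illustrative, not the patent's:

```python
import numpy as np

def feed_forward(h, w1, b1, w2, b2):
    """One plausible reading of "at least one linear transformation
    layer": linear transform, ReLU, linear transform."""
    return np.maximum(h @ w1 + b1, 0.0) @ w2 + b2

def next_character(h, w1, b1, w2, b2, vocab):
    """Project a second character-level feature to vocabulary logits
    and take the most probable entry as the next report character."""
    logits = feed_forward(h, w1, b1, w2, b2)
    return vocab[int(np.argmax(logits))]
```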
In a second aspect of the present invention, there is provided a radiology report generation apparatus including:
the image acquisition module is used for acquiring an image to be processed and inputting the image to be processed into a trained report generation model, wherein the report generation model comprises a visual feature encoder, a target embedding matrix, a potential feature encoder and a layered decoder, and the layered decoder comprises a first attention layer, a second attention layer and a third attention layer;
the potential feature extraction module is used for acquiring the visual features of the image to be processed through the visual feature encoder, inputting the visual features of the image to be processed into the potential feature encoder, and acquiring the potential features corresponding to the image to be processed output by the potential feature encoder;
a first attention module, configured to obtain an embedding feature of each character in a current radiology report according to the target embedding matrix, input each of the embedding features to the first attention layer, and obtain a first character-level feature of each character and a first aggregation feature of each sentence in the current radiology report output by the first attention layer;
a second attention module, configured to input the first aggregate features of each sentence of the current radiology report and the potential features of the to-be-processed image into the second attention layer, and encode the potential features of the to-be-processed image into semantic features of each sentence of the current radiology report through the second attention layer, so as to obtain second aggregate features of each sentence of the current radiology report;
a third attention module, configured to input each of the second aggregate features, each of the first character-level features, and the visual features of the to-be-processed image into the third attention layer, encode the visual features of the to-be-processed image into semantic features of each character of the current radiology report through the third attention layer, obtain a second character-level feature corresponding to each character of the current radiology report, and obtain a next character in the current radiology report according to each of the second character-level features;
the circulation module is used for calling the first attention module to re-execute the step of acquiring the embedded features of each character in the current radiology report after the third attention module outputs the next character of the current radiology report until a preset end character is acquired, and obtaining a target radiology report corresponding to the image to be processed;
wherein the initial content of the radiology report is a preset sentence marker character.
In a third aspect of the present invention, there is provided a terminal comprising a processor, and a computer-readable storage medium communicatively connected to the processor, the computer-readable storage medium being adapted to store a plurality of instructions, and the processor being adapted to invoke the instructions in the computer-readable storage medium to perform the steps of the radiology report generation method according to any one of the above.
In a fourth aspect of the invention, there is provided a computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to perform the steps of the radiology report generation method according to any one of the above.
Compared with the prior art, the invention provides a radiology report generation method, device, terminal and storage medium. In the radiology report generation method, an image to be processed is input into a trained report generation model comprising a visual feature encoder, a potential feature encoder and a layered decoder. After the visual features of the image to be processed are extracted by the visual feature encoder, the potential features are extracted by the potential feature encoder. A multi-layer attention mechanism is adopted in the layered decoder, so that the character features and sentence features of the report are alternately aggregated and distributed and the potential features and visual features are encoded into the semantic features of the report, ensuring the accuracy of the next character predicted from the existing characters of the report. Generating the radiology report of the image to be processed with a deep learning model improves the efficiency of compiling radiology reports.
Drawings
FIG. 1 is a flow chart of an embodiment of a radiology report generation method provided by the present invention;
FIG. 2 is a schematic diagram of the report generation model training process in the radiology report generation method provided by the present invention;
fig. 3 is a schematic diagram illustrating a similarity calculation method for sentences in an evaluation process of a generated radiology report in an embodiment of a radiology report generation method according to the present invention;
fig. 4 is an exemplary diagram of a radiology report generated by the radiology report generating method provided by the present invention;
FIG. 5 is a statistical data plot of a data set employed during an experiment for a radiology report generation method provided by the present invention;
FIG. 6 is a schematic diagram of experimental results of a radiology report generation method provided by the present invention;
FIG. 7 is a schematic diagram of a configuration of an embodiment of a radiology report generating device provided by the present invention;
fig. 8 is a schematic diagram illustrating an embodiment of a terminal according to the present invention.
Detailed Description
In order to make the objects, technical solutions and effects of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit it.
The radiology report generating method provided by the invention can be applied to a terminal with computing power, the terminal can execute the radiology report generating method provided by the invention to generate a radiology report, and the terminal can be, but is not limited to, various computers, mobile terminals, intelligent household appliances, wearable devices and the like.
Example one
As shown in fig. 1, one embodiment of the radiology report generation method includes the steps of:
s100, acquiring an image to be processed, and inputting the image to be processed into a trained report generation model, wherein the report generation model comprises a visual feature encoder, a target embedding matrix, a potential feature encoder and a layered decoder, and the layered decoder comprises a first attention layer, a second attention layer and a third attention layer.
Specifically, the image to be processed is a radiological image, such as an X-ray image. In the prior art, after reading the radiological image a doctor needs to write a radiology report describing it, for example: 'The texture of both lungs is slightly increased, and patchy fuzzy shadows are seen in both lower lung fields…'. Such a report requires an experienced doctor to carefully read and understand the image before writing it, which is inefficient. To solve this problem, the radiology report generation method provided in this embodiment constructs and trains a report generation model; the to-be-processed image for which a radiology report is needed is input into the trained report generation model, and the radiology report output by the model is obtained.
S200, acquiring the visual characteristics of the image to be processed through the visual characteristic encoder, inputting the visual characteristics of the image to be processed into the potential characteristic encoder, and acquiring the potential characteristics corresponding to the image to be processed output by the potential characteristic encoder.
After the image to be processed is input into the report generation model, the visual features of the image are first extracted by the visual feature encoder in the report generation model. Specifically, the visual feature encoder includes an initial feature extraction layer and a Transformer encoder; the initial feature extraction layer may adopt the structure of an existing image feature extraction model, for example a CNN network. The image to be processed can be decomposed into at least one region, the features of each region are extracted by the initial feature extraction layer, and the features output by the initial feature extraction layer can be concatenated into a long vector and organized into a sequence {x_1, x_2, ..., x_l, ..., x_L}, where x_l is the feature extracted for the l-th region and L is the number of regions. After the features of each region are extracted, in order to further summarize the visual features and explore the similarity between the features of the regions, a Transformer encoder is used to encode the region features together with a visual marker as an overall representation of the image. The process can be expressed as: v = f_ve(x_[VIS], x_1, x_2, ..., x_l, ..., x_L), where x_[VIS] denotes the visual marker, f_ve(·) denotes the Transformer encoder, and v denotes the collected visual features, i.e., the visual features of the image to be processed.
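The sequence assembly described above can be sketched as follows. Mean pooling stands in for the Transformer encoder f_ve purely to keep the example self-contained; it is not the patent's encoder:

```python
import numpy as np

def encode_visual(region_feats, vis_token):
    """Organise the L region features x_1..x_L into a sequence behind
    the visual marker x_[VIS], then aggregate them into an overall
    image representation v (here by mean pooling as a stand-in)."""
    seq = np.vstack([vis_token[None, :], region_feats])  # shape (L+1, d)
    v = seq.mean(axis=0)   # stand-in for the encoded summary feature
    return seq, v
```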
The potential feature encoder in the report generation model is used for extracting potential features according to the visual features of the images to be processed, wherein the potential features are potential representation features which are common to the images to be processed and radiology reports corresponding to the images to be processed, so that the radiology reports corresponding to the images to be processed can be obtained according to the potential features. The parameters of the visual encoder and the latent feature encoder are determined by training the report generation model in advance, and in order to enable the latent feature encoder to achieve the effect of outputting the latent features, when the report generation model is trained, embedding of a sample radiology report corresponding to a sample image is used as the input of the latent feature encoder for training, which will be described later in detail.
After the visual features are input into the potential feature encoder and the potential features output by the potential feature encoder are obtained, the radiology report generation method provided by this embodiment further includes:
s300, acquiring the embedded features of each character in the current radiology report according to the target embedded matrix, inputting each embedded feature into the first attention layer, and acquiring the first character-level features of each character and the first aggregation features of each sentence in the current radiology report output by the first attention layer.
Specifically, in this embodiment the next character is predicted from all characters in the current radiology report. The initial content of the radiology report is a preset sentence-marking character; that is, at first the embedding feature of the preset sentence-marking character is obtained from the target embedding matrix and input into the first attention layer. After each next character is obtained, the embedding features of the preset sentence-marking character and of all characters obtained so far are input into the first attention layer again, and finally a complete report is obtained.
Specifically, the first attention layer employs a self-attention mechanism: self-attention is executed according to the embedded features of the context characters of each character in the current radiology report, and the first character-level feature of the character is obtained. In the attention mechanism, a query embedding matrix (Q matrix), a key embedding matrix (K matrix) and a value embedding matrix (V matrix) are provided; the query embedding Q, key embedding K and value embedding V of each character are obtained through the corresponding matrices, and the attention output for a character is obtained from its Q together with the K and V of its context characters. The attention mechanism can be expressed as:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
where d_k is the dimension of the key embeddings.
when the self-attention mechanism is executed, the inquiry embedding matrix, the key embedding matrix and the value embedding matrix corresponding to each character are obtained by multiplying the embedding matrix of each character with the inquiry embedding matrix, the key embedding matrix and the value embedding matrix in the first attention layer respectively.
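The attention formula and the Q/K/V projections described above can be sketched as follows (a single-head illustration, not the patent's full implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def self_attention(E, Wq, Wk, Wv):
    # The query, key and value embeddings of each character are obtained
    # by multiplying its embedding with the layer's Q, K and V matrices.
    return attention(E @ Wq, E @ Wk, E @ Wv)
```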
Specifically, the context range of a character in the self-attention mechanism adopted in the first attention layer may be the other characters belonging to the same sentence, a preset number of characters before and after the character, all other characters in the whole report, and so on.
In this embodiment, during training each sentence of the sample radiology report is preceded by the preset sentence-marking character, e.g. the character y_[sent]. After training is completed, when the report generation model generates the preset sentence-marking character as the next character of the radiology report, this indicates that one sentence has ended and a new sentence begins; that is, the preset sentence-marking character exists before each sentence in the current radiology report. The obtaining of the first character-level feature of each character and the first aggregation feature of each sentence in the current radiology report output by the first attention layer comprises:
and taking the first character-level features corresponding to the preset sentence marking characters of each sentence in the current radiology report as the first aggregation features of each sentence in the radiology report.
For example, assume that the current radiology report is {y_[sent], y_1, y_2, ..., y_[sent], ..., y_t}, where y_t represents the t-th character of the current radiology report excluding the preset sentence-marking characters, and y_[sent] is the preset sentence-marking character. First the embedding of each character is obtained through the target embedding matrix; then the embedding features of the characters are input into the first attention layer, in which, for each character, self-attention is performed according to the embedded features of its context characters. This yields the first character-level features of the characters and the first aggregation features of the sentences in the current radiology report: {c_[sent1], c_1, c_2, ..., c_[sent2], ..., c_t}, where c_t represents the first character-level feature of the t-th character of the current radiology report excluding the preset sentence-marking characters, and c_[senti] represents the first aggregation feature of the i-th sentence in the current radiology report.
The parameters of the target embedding matrix and the parameters of the first attention layer (including the query embedding matrix, the key embedding matrix, and the value embedding matrix of the self-attention mechanism) are determined by training the report generation model in advance.
S400, inputting the first aggregation feature of each sentence of the current radiology report and the potential feature of the image to be processed into the second attention layer, and coding the potential feature of the image to be processed into the semantic feature of each sentence of the current radiology report through the second attention layer to obtain the second aggregation feature of each sentence of the current radiology report.
Specifically, the encoding, by the second attention layer, of the potential features of the to-be-processed image into the semantic features of each sentence of the current radiology report to obtain the second aggregation features of each sentence of the current radiology report includes:
generating a query embedding of a sentence according to the first aggregated features of the sentence;
generating key embedding and value embedding of sentences according to the potential features of the images to be processed;
performing a multi-head attention mechanism based on the query embedding, key embedding, and value embedding of each sentence to obtain the second aggregate features of each sentence.
In the second attention layer of the layered decoder, a multi-head attention mechanism is executed and semantic features are operated on at the sentence level. Specifically, for each sentence in the current radiology report, the query embedding of the sentence in the second attention layer is obtained from the first aggregation feature of the sentence and the query embedding matrix of the second attention layer; the key embedding and value embedding of the sentence are obtained from the potential features of the image to be processed and the key embedding matrix and value embedding matrix of the second attention layer; and the multi-head attention mechanism is executed according to the query embedding, key embedding and value embedding of each sentence to obtain the second aggregation feature of each sentence.
Through the second attention layer, the potential features of the image to be processed are encoded into the semantic features of each sentence of the current radiology report, so that predicting the next character from the semantic features of the current radiology report is more accurate.
Parameters in the second attention layer (including query embedding matrix, key embedding matrix, and value embedding matrix) are determined by training the report generation model in advance.
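A single-head sketch of the second attention layer, with queries taken from the sentence aggregation features and keys/values from the image's potential features. A real model would use multiple heads, and the weight names are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sentence_cross_attention(sent_aggs, latent, Wq, Wk, Wv):
    """Each sentence's first aggregation feature produces a query that
    attends over the image's potential features, yielding the second
    aggregation feature of that sentence."""
    Q = sent_aggs @ Wq              # one query per sentence
    K, V = latent @ Wk, latent @ Wv # keys/values from potential features
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V
```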
Referring to fig. 1 again, the method for generating a radiology report provided in the present embodiment further includes the steps of:
s500, inputting the second aggregation features, the first character-level features and the visual features of the to-be-processed image into the third attention layer, coding the visual features of the to-be-processed image into the semantic features of each character of the current radiology report through the third attention layer to obtain second character-level features corresponding to each character of the current radiology report, and obtaining a next character in the current radiology report according to the second character-level features.
Specifically, in step S400 the first character-level features of the preset sentence-marking characters in the current radiology report have already been processed into the second aggregation features; therefore, each of the first character-level features in step S500 refers to the first character-level feature of each literal character. The encoding, by the third attention layer, of the visual features of the to-be-processed image into the semantic features of each character of the current radiology report to obtain a second character-level feature corresponding to each character of the current radiology report includes:
generating query embeddings of the literal characters according to the first character-level features of the literal characters, and generating query embeddings of the preset sentence marking characters of the sentence according to the second aggregation features of the sentence;
generating key embedding and value embedding of characters according to the visual features of the image to be processed;
performing a multi-head attention mechanism based on query embedding, key embedding, and value embedding of each character to obtain the second character-level features of each character.
In the third attention layer of the hierarchical decoder, a multi-head attention mechanism is executed and the semantic features are operated on at the character level. Specifically, for the literal characters other than the preset sentence marking characters, the query embedding of each literal character in the third attention layer is generated from the first character-level feature corresponding to that literal character and the query embedding matrix of the third attention layer; for each preset sentence marking character, the query embedding of the preset sentence marking character in the third attention layer is generated from the second aggregation feature of the sentence to which the preset sentence marking character belongs and the query embedding matrix of the third attention layer. In this way, a corresponding query embedding is generated for each character in the current radiology report.
Generating key embedding of each character in the third attention layer according to the visual feature of the image to be processed and the key embedding matrix of the third attention layer, generating value embedding of each character in the third attention layer according to the visual feature of the image to be processed and the value embedding matrix of the third attention layer, and executing a multi-head attention mechanism according to query embedding, key embedding and value embedding of each character in the third attention layer to obtain the second character-level feature of each character.
It can be seen that, through the above steps, the visual features of the image to be processed are encoded into the second character-level features of each character of the current radiology report, so that the second character-level features of each character include both the visual features and the potential features of the image to be processed. The hierarchical decoder thus makes full use of the potential features and the visual features of the image to be processed, which improves the accuracy of predicting the next character from the current radiology report and yields a more accurate radiology report.
The parameters of the third attention layer (including the query embedding matrix, the key embedding matrix, and the value embedding matrix in the third attention layer) are determined in advance by training the report generation model.
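A minimal sketch of the character-level step is given below, under assumed names: literal characters keep their first character-level feature as the query input, while each preset sentence marking character is replaced by the second aggregation feature of its sentence, and all characters then cross-attend to the visual features (single-head here for brevity):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def third_layer_queries(char_feats, sent_agg, is_marker, sent_of_char):
    """Build per-character query inputs: literal characters use their first
    character-level feature, while each preset sentence marking character
    uses the second aggregation feature of the sentence it belongs to."""
    q_in = char_feats.copy()
    for i, marker in enumerate(is_marker):
        if marker:
            q_in[i] = sent_agg[sent_of_char[i]]
    return q_in

def char_level_attention(q_in, visual, Wq, Wk, Wv):
    """Cross-attention from character queries to the visual features,
    producing the second character-level features (one row per character)."""
    Q, K, V = q_in @ Wq, visual @ Wk, visual @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return A @ V
```

In the embodiment this would run with multiple heads and trained matrices; the sketch only shows how the two kinds of query inputs are mixed.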
The hierarchical decoder further comprises a feed-forward layer, and the acquiring the next character in the current radiology report according to each second character-level feature comprises:
inputting each of the second character-level features to the feed-forward layer;
and obtaining the next character in the current radiology report according to the output of the feedforward layer.
Specifically, the feedforward layer includes at least one linear transformation layer; for example, the feedforward layer may include two linear transformation layers with a ReLU activation function disposed between them.
The hierarchical decoder also comprises a classification layer, which is a linear transformation followed by a softmax activation function. The output of the feedforward layer is input into the classification layer to obtain a distribution over the vocabulary, that is, the probability of the next character being each preset character, and the character with the largest probability is selected as the next character of the radiology report.
The parameters of the feed-forward layer and the classification layer are determined by training the report generation model in advance.
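The feedforward layer and classification layer just described can be sketched as follows; the shapes, names and single-vector input are assumptions of this illustration, not the trained parameters of the embodiment:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # two linear transformations with a ReLU activation in between
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

def next_char(x, W1, b1, W2, b2, W_cls, vocab):
    """Feedforward layer, then a linear classification layer with softmax:
    the character with the largest probability is taken as the next character."""
    h = feed_forward(x, W1, b1, W2, b2)
    logits = h @ W_cls
    e = np.exp(logits - logits.max())
    probs = e / e.sum()                 # distribution over the vocabulary
    return vocab[int(np.argmax(probs))]
```

Since argmax is invariant under softmax, the probabilities are computed here only to mirror the description; a beam-search decoder would keep the full distribution instead.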
S600, repeatedly executing the step of obtaining the embedding features of each character in the current radiology report according to the target embedding matrix until a preset end character is obtained, so as to obtain the target radiology report corresponding to the image to be processed.
According to steps S300 to S600, the next character of the current radiology report may be generated; the next character is added to the current radiology report to update it, and then the embedding feature of each character in the updated radiology report is obtained according to the target embedding matrix, that is, step S300 is repeatedly performed, until the next character obtained is the preset end character. The preset end character may be set as any character that does not coincide with a literal character or the preset sentence marking character, for example "。" or "end".
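The repeat-until-end-character loop of step S600 can be sketched as below; the start and end characters and the `predict_next` callable (standing in for one pass of the hierarchical decoder) are assumptions of this illustration:

```python
def generate_report(predict_next, start_char="[SEP]", end_char="。", max_len=100):
    """Repeatedly predict the next character and append it to the current
    report until the preset end character (or a length cap) is reached."""
    report = [start_char]
    for _ in range(max_len):
        ch = predict_next(report)   # one decoder pass over the current report
        report.append(ch)
        if ch == end_char:
            break
    return "".join(report[1:])      # drop the initial sentence marking character
```

A length cap is added as a safeguard that the description does not mention, since an untrained model might never emit the end character.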
The following describes the training process of the report generation model. Specifically, the report generation model is obtained by training on a preset data set, the preset data set includes a plurality of groups of training samples, and each group of training samples includes a sample image and a sample radiology report corresponding to the sample image. Before the image to be processed is input into the trained report generation model, the method includes the following steps:
selecting a target training sample in the preset data set;
inputting a sample image in the target training sample into the report generation model, and acquiring the visual features of the sample image;
inputting a sample radiology report in the target training sample into a text encoder, acquiring text features of the sample radiology report, inputting the text features into the potential feature encoder, and acquiring the potential features corresponding to the sample radiology report;
inputting the visual features of the sample image, a first character in the sample radiology report and the potential features corresponding to the sample radiology report into the layered decoder to obtain a prediction report corresponding to the sample radiology report;
obtaining the loss of the target training sample according to the prediction report, and updating the network parameters of the report generation model according to the loss of the target training sample;
and re-executing the step of selecting the target training sample in the preset data set until the parameters of the report generation model are converged.
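The steps above can be sketched as a generic training loop; the sample-wise `compute_loss` and `update` callables and the loss-plateau convergence test are assumptions standing in for the model-specific details described later:

```python
def train(model, dataset, compute_loss, update, max_epochs=20, tol=1e-4):
    """Repeatedly select a training sample, compute the loss of its
    prediction report, and update the model parameters, stopping once the
    epoch loss stops changing (a proxy for parameter convergence)."""
    prev_loss = float("inf")
    for _ in range(max_epochs):
        epoch_loss = 0.0
        for sample in dataset:
            loss = compute_loss(model, sample)   # loss of the target training sample
            update(model, sample)                # update the network parameters
            epoch_loss += loss
        if abs(prev_loss - epoch_loss) < tol:    # parameters converged
            break
        prev_loss = epoch_loss
    return model
```

In the embodiment, `compute_loss` would be the probabilistic loss derived below and `update` a gradient step of an optimizer such as Adam.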
Specifically, the sample radiology report corresponding to the sample image in each set of training samples is a radiology report compiled by a doctor according to the sample image, that is, the sample radiology report can be regarded as the correct radiology report of the corresponding sample image; the first character of the sample radiology report is the preset sentence marking character, and the last character of the sample radiology report is the preset end character.
During training, the parameters of the report generation model are updated with one group of target training samples at a time. As shown in fig. 2, a text encoder is provided during training, and the parameters of the text encoder are updated together with the parameters of the report generation model; however, after the parameters of the report generation model converge, that is, after training is completed, the text encoder is not used in the process of generating the target radiology report corresponding to the image to be processed.
For a target training sample, the sample image in the target training sample is input to the visual encoder in the report generation model, and the visual features of the sample image are acquired; the specific process is consistent with the above-described process of acquiring the visual features of the image to be processed by the visual encoder. The sample radiology report in the target training sample is input into the text encoder. In the text encoder, an embedding matrix with the same parameters as the target embedding matrix in the report generation model is first adopted to obtain the embedding sequence of the characters of the sample radiology report, and a Transformer encoder then encodes the embedding of each character in the sample radiology report to obtain the text features of the sample radiology report. In one possible implementation, a text label y[TXT] may be added before the first character of the sample radiology report; the text label is also embedded and encoded.
After the text features of the sample radiology report are obtained, the text features are input to the potential feature encoder in the report generation model, and the potential features corresponding to the sample radiology report are obtained. The visual features of the sample image, the first character in the sample radiology report and the potential features corresponding to the sample radiology report are input to the layered decoder to obtain the prediction report corresponding to the sample radiology report. Specifically, the visual features of the sample image are taken as the visual features of the image to be processed in steps S300-S500, and the first character in the sample radiology report is taken as the initial content of the radiology report in steps S300-S500; through steps S300-S500, given the first character in the sample radiology report, the next character predicted by the report generation model is obtained, and thus the radiology report of the sample image predicted according to the current parameters of the report generation model, namely the prediction report, is obtained. It should be noted that, each time the next character is generated during training, either the predicted next character may be appended to the report content from which it was predicted, or the character at the corresponding position in the original sample radiology report may be appended instead, before the following character is predicted.
Obviously, in order for the report generation model to work better, the update direction of the parameters of the report generation model should be such that the prediction report generated by the model from the sample image and the sample radiology report is as close as possible to the sample radiology report, and the potential features obtained from the sample image are as close as possible to the potential features obtained from the sample radiology report. In order to capture the uncertainty of radiology reports and improve the generalization capability of the model, in this embodiment a probabilistic modeling approach is adopted to obtain the training loss of the report generation model, so as to capture the uncertainty, diversity and complex structure of radiology reports and make the output of the model more accurate.
The obtaining the loss of the target training sample according to the prediction report includes:
acquiring a first probability distribution according to the prediction report, wherein the first probability distribution is the probability distribution of the prediction report being the sample radiology report under the joint condition of the potential features corresponding to the text features and the sample image;
inputting the visual features of the sample image into the potential feature encoder, and acquiring a second probability distribution according to the output of the potential feature encoder, wherein the second probability distribution is the probability distribution of the potential features corresponding to the text features under the condition of the sample image;
obtaining a third probability distribution according to the potential features corresponding to the text features, wherein the third probability distribution is the probability distribution of the potential features corresponding to the text features under the condition of the sample radiology report;
obtaining a loss of the target training sample based on the first probability distribution, the second probability distribution, and the third probability distribution.
Based on probabilistic modeling, the objective function of the report generation model can be constructed as follows:
L_ELBO = log p_θ(Y|Z,I) - β·KL[q_θ(Z|Y) || p_θ(Z|I)]
wherein L_ELBO represents the value of the objective function, log p_θ(Y|Z,I) represents the model output probability distribution of the sample radiology report Y under the joint condition of the corresponding potential feature Z of the sample radiology report and the sample image I, q_θ(Z|Y) denotes the probability distribution of generating the potential feature Z under the condition of the sample radiology report Y, p_θ(Z|I) denotes the probability distribution of generating the potential feature Z under the condition of the sample image I, KL[q_θ(Z|Y) || p_θ(Z|I)] denotes the KL divergence between q_θ(Z|Y) and p_θ(Z|I), and β is a hyperparameter used to control the weight of the KL divergence. As explained above, in order for the report generation model to work better, the parameters of the report generation model should be updated in a direction such that the prediction report generated by the model from the sample image and the sample radiology report is closer to the sample radiology report; therefore, the larger the value of log p_θ(Y|Z,I) the better. Likewise, the closer the potential features obtained from the sample image are to those obtained from the sample radiology report, the better; therefore, the smaller the value of KL[q_θ(Z|Y) || p_θ(Z|I)] the better. In other words, the larger the value of the objective function corresponding to the target training sample, the smaller the training loss corresponding to that training sample, so the negative of the objective function value may be taken as the loss of the target training sample.
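Assuming, as is common in this kind of probabilistic modeling (the patent text does not state the distribution family), that q_θ(Z|Y) and p_θ(Z|I) are diagonal Gaussians parameterized by means and log-variances, the negated objective can be sketched as:

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL[ N(mu_q, var_q) || N(mu_p, var_p) ] for diagonal Gaussians."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

def elbo_loss(log_likelihood, mu_q, logvar_q, mu_p, logvar_p, beta=1.0):
    """Loss = -L_ELBO = -log p(Y|Z,I) + beta * KL[q(Z|Y) || p(Z|I)], where
    q(Z|Y) comes from the sample radiology report and p(Z|I) from the image."""
    elbo = log_likelihood - beta * gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
    return -elbo
```

When the two distributions coincide, the KL term vanishes and the loss reduces to the negative log-likelihood, matching the discussion above.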
The network parameters of the report generation model are updated according to the loss of the target training sample; once the parameters of the report generation model converge, the training of the report generation model is completed, and the trained report generation model can be used for predicting the radiology report of the image to be processed, that is, generating the target radiology report according to the image to be processed.
The inventor also verified the effectiveness of the radiology report generation method provided by this embodiment through experiments, including both qualitative and quantitative experimental verification. In the qualitative experimental analysis, the target radiology reports generated by the radiology report generation method provided by this embodiment are compared directly, as shown in fig. 4: the radiology reports (Generated Sample1 and Generated Sample2 in the figure) generated for the same image to be processed by report generation models trained on two training sets are compared with the real radiology report (Ground report in the figure). It can be found that, compared with the real report, the reports generated by the radiology report generation method provided by this embodiment give an accurate description and cover the important findings in the real report. In addition, it can be seen that the report generation models obtained by training on the two data sets have two different styles: the sample radiology reports in the two data sets were written by research groups of two hospitals with different writing styles, and this difference is captured by the potential features, showing that the probabilistic modeling approach in this embodiment can capture the potential uncertainty between reports.
In the quantitative experiment, the radiology report generated by the radiology report generation method provided in this embodiment (hereinafter referred to as the candidate report) and the reference report are evaluated with a rule-based and model-based evaluation method (RMM). The candidate report and the reference report are respectively expressed as

S_c = {s_c^1, s_c^2, ..., s_c^N} and S_r = {s_r^1, s_r^2, ..., s_r^M}

wherein s_c^n represents the nth sentence in the candidate report S_c, s_r^m represents the mth sentence in the reference report S_r, and N and M represent the total number of sentences in the candidate report and the reference report, respectively. First, rule-based information extraction is performed: for the candidate and reference reports, information is extracted from the sentences according to anatomical positions. In practice, a large number of anatomical positions are collated to match the anatomical positions mentioned in the report, and one anatomical position can be extracted from each sentence, thus generating anatomical position information for the candidate and reference reports, which can be expressed as

A_c = {a_c^1, a_c^2, ..., a_c^N} and A_r = {a_r^1, a_r^2, ..., a_r^M}

wherein a_c^n represents the anatomical position in the nth sentence of the candidate report, a_r^m represents the anatomical position in the mth sentence of the reference report, and N and M represent the total number of sentences in the candidate report and the reference report, respectively.
After information extraction, sentences in the candidate report and the reference report are matched, and a pre-trained sentence embedding model is then applied to calculate a similarity score between each candidate sentence and its matched reference sentence. Specifically, as shown in fig. 3, BERTScore is adopted as the pre-trained sentence embedding, using contextual embeddings (i.e. BERT) to evaluate the similarity between two sentences. Formally, given two matched sentences s_c and s_r, two vector sequences {v_c^1, ..., v_c^K} and {v_r^1, ..., v_r^L} are generated from a pre-trained BERT model. Each vector in the sequence corresponding to s_c is matched with one vector in the sequence corresponding to s_r to calculate the recall, and each vector in the sequence corresponding to s_r is matched with one vector in the sequence corresponding to s_c to calculate the precision; greedy matching is performed so as to maximize the matching similarity score, so that each vector in one sentence is matched with the most similar vector in the other sentence. Finally, the similarity between the two sentences is calculated by combining the precision and the recall. This can be formulated as:

SIM(s_c, s_r) = 2·P·R / (P + R)

wherein SIM(s_c, s_r) represents the similarity between sentences s_c and s_r, v_c^i represents the ith vector in the vector sequence corresponding to s_c, v_r^j represents the jth vector in the vector sequence corresponding to s_r, R = (1/K)·Σ_i max_j (v_c^i · v_r^j) is the recall, and P = (1/L)·Σ_j max_i (v_c^i · v_r^j) is the precision.
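Assuming the token embeddings are L2-normalised (so dot products are cosine similarities), the greedy-matching similarity can be sketched in a few lines; the names and the F1 combination of precision and recall are this sketch's reading of the description:

```python
import numpy as np

def sentence_similarity(vc, vr):
    """BERTScore-style greedy matching: rows of vc and vr are normalised
    token embeddings of the two matched sentences."""
    sim = vc @ vr.T                      # pairwise cosine similarities
    recall = sim.max(axis=1).mean()      # each vector of s_c vs. its best match in s_r
    precision = sim.max(axis=0).mean()   # each vector of s_r vs. its best match in s_c
    return 2 * precision * recall / (precision + recall)
```

Identical sentences score 1.0; partially overlapping token sets score strictly between 0 and 1.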
A matching function MATCH(s, S) is defined, which represents the sentence in report S that matches the sentence s; specifically, the matching is based on the anatomical position information, and the matched sentence is the sentence in report S whose anatomical position is consistent with the anatomical position of s. An evaluation metric F_RMM for a candidate report is constructed as follows:

F_RMM = (1/N)·Σ_{n=1}^{N} SIM(s_c^n, MATCH(s_c^n, S_r))

The difference between the candidate report and the reference report is evaluated according to F_RMM, enabling a more accurate evaluation.
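The metric above can be sketched as follows; the handling of candidate sentences with no position match (scored 0) and the best-of-several-matches rule are assumptions, since the garbled formula in the source does not pin these cases down:

```python
def f_rmm(cand_sents, cand_pos, ref_sents, ref_pos, sim):
    """Average, over candidate sentences, of the similarity to the reference
    sentence whose anatomical position matches (0 when nothing matches)."""
    scores = []
    for s, p in zip(cand_sents, cand_pos):
        matches = [r for r, q in zip(ref_sents, ref_pos) if q == p]
        scores.append(max((sim(s, r) for r in matches), default=0.0))
    return sum(scores) / len(scores)
```

Here `sim` would be the sentence-embedding similarity; any callable returning a score in [0, 1] works.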
Experiments were performed on a Chinese radiology report dataset, SRIBD X-Ray, which contains 226347 cases, each with a frontal chest film and the corresponding report. To split the dataset, 10000 cases were randomly selected for validation, 100000 cases for testing, and the rest were used for training; the statistics of the dataset, including the number of cases, the average length of the reports (Report Len), the average length of the findings (Findings Len) and the average length of the conclusions (Impression Len), are shown in fig. 5. The comparative models used in the experiments were mainly recurrent models (i.e. ST, SAT, Att2all, AdaAtt and UpDown) and non-recurrent models (i.e. Trans, AoA and M2Trans); their performance was evaluated by conventional word-overlap metrics (WOMs, including BLEU, METEOR, ROUGE-L and CIDEr) and the RMM metric described above. Before the dataset was input into the model, the reports were processed at the character level and characters with a frequency of less than 10 were filtered out. The initial feature extraction layer in the visual encoder of the model is a ResNet101 pre-trained on ImageNet, which extracts 2048-dimensional patch features; three layers and eight attention heads are adopted for the visual encoder, the text encoder and the hierarchical decoder, with a 512-dimensional hidden state and random initialization, and the model is trained under cross-entropy loss using the Adam optimizer. The learning rate of the visual encoder and that of the other parameters are set to 5×10^-5 and 1×10^-4, respectively. In the generation process, the beam size is set to 3 to balance the effectiveness and efficiency of all models. The optimal values of the hyperparameters are obtained by evaluating the model on the validation sets of both data sets.
The results are shown in fig. 6; by comparison with all the other models, the report generation model of the present invention demonstrates its superiority. Although the most competitive model, AoA, and our model are both Transformer-based, our model is a significant improvement over it. The reason may be that AoA only focuses on improving the attention structure and does not model the uncertainty in the report. The results indicate that potential topic modeling and hierarchical decoding are critical to generating high-quality radiology reports.
In summary, this embodiment provides a method for generating a radiology report in which an image to be processed is input to a trained report generation model comprising a visual feature encoder, a potential feature encoder and a layered decoder. After the visual features of the image to be processed are extracted by the visual feature encoder, the potential feature encoder extracts the potential features, and a multi-layer attention mechanism is used in the layered decoder, so that the character features and sentence features of the report are alternately aggregated and distributed, and the potential features and visual features are encoded into the semantic features of the report. This ensures the accuracy of the next character predicted from the existing characters of the report, achieves generation of the radiology report of the image to be processed by a deep learning model, and improves the efficiency of compiling radiology reports.
It should be understood that, although the steps in the flowcharts shown in the figures of the present specification are shown in order as indicated by the arrows, the steps are not necessarily performed in that order. The steps are not limited to being performed in the exact order illustrated and, unless explicitly stated herein, may be performed in other orders. Moreover, at least a portion of the steps in the flowchart may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time, but may be performed at different times; the order of performing these sub-steps or stages is not necessarily sequential, and they may be performed in turns or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by hardware instructions of a computer program, which may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, databases or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), rambus (Rambus) direct RAM (RDRAM), direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM), among others.
Example two
Based on the foregoing embodiments, the present invention further provides a radiology report generating device, as shown in fig. 7, the radiology report generating device includes:
an image obtaining module, configured to obtain an image to be processed, and input the image to be processed into a trained report generation model, where the report generation model includes a visual feature encoder, a target embedding matrix, a latent feature encoder, and a layered decoder, and the layered decoder includes a first attention layer, a second attention layer, and a third attention layer, which is specifically described in embodiment one;
a potential feature extraction module, configured to obtain, by the visual feature encoder, a visual feature of the to-be-processed image, input the visual feature of the to-be-processed image to the potential feature encoder, and obtain a potential feature corresponding to the to-be-processed image output by the potential feature encoder, which is specifically described in embodiment one;
a first attention module, configured to obtain an embedding feature of each character in a current radiology report according to the target embedding matrix, input each embedding feature to the first attention layer, and obtain a first character-level feature of each character and a first aggregation feature of each sentence in the current radiology report output by the first attention layer, as described in embodiment one;
a second attention module, configured to input the first aggregate features of each current sentence of the radiology report and the latent features of the to-be-processed image into the second attention layer, and encode the latent features of the to-be-processed image into semantic features of each current sentence of the radiology report through the second attention layer, so as to obtain second aggregate features of each current sentence of the radiology report, as described in embodiment one;
a third attention module, configured to input each of the second aggregate features, each of the first character-level features, and the visual features of the to-be-processed image into the third attention layer, encode the visual features of the to-be-processed image into semantic features of each character of the current radiology report through the third attention layer, obtain a second character-level feature corresponding to each character of the current radiology report, and obtain a next character in the current radiology report according to each of the second character-level features, which is specifically described in embodiment one;
a circulation module, configured to invoke the first attention module to re-execute the step of obtaining the embedding feature of each character in the current radiology report after the third attention module outputs a next character of the current radiology report until a preset end character is obtained, so as to obtain a target radiology report corresponding to the to-be-processed image, which is specifically described in embodiment one.
EXAMPLE III
Based on the above embodiment, the present invention further provides a terminal, as shown in fig. 8, where the terminal includes a processor 10 and a memory 20. Fig. 8 shows only some of the components of the terminal, but it is to be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.
The memory 20 may in some embodiments be an internal storage unit of the terminal, such as a hard disk or a memory of the terminal. The memory 20 may also be an external storage device of the terminal in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the terminal. Further, the memory 20 may also include both an internal storage unit and an external storage device of the terminal. The memory 20 is used for storing application software installed in the terminal and various data. The memory 20 may also be used to temporarily store data that has been output or is to be output. In one embodiment, the memory 20 has stored thereon a radiology report generating program 30, and the radiology report generating program 30 is executable by the processor 10 to implement the radiology report generating method of the present application.
The processor 10 may be a Central Processing Unit (CPU), microprocessor or other chip in some embodiments, and is used for running the program codes stored in the memory 20 or processing data, for example executing the radiology report generation method described above.
In one embodiment, the following steps are implemented when the processor 10 executes the radiology report generation program 30 in the memory 20:
acquiring an image to be processed, and inputting the image to be processed into a trained report generation model, wherein the report generation model comprises a visual feature encoder, a target embedding matrix, a potential feature encoder and a layered decoder, and the layered decoder comprises a first attention layer, a second attention layer and a third attention layer;
acquiring the visual features of the image to be processed through the visual feature encoder, inputting the visual features of the image to be processed into the potential feature encoder, and acquiring the potential features corresponding to the image to be processed output by the potential feature encoder;
acquiring an embedded feature of each character in a current radiology report according to the target embedded matrix, inputting each embedded feature into the first attention layer, and acquiring a first character-level feature of each character and a first aggregation feature of each sentence in the current radiology report output by the first attention layer;
inputting the first aggregation feature of each sentence of the current radiology report and the potential feature of the image to be processed into the second attention layer, and coding the potential feature of the image to be processed into the semantic feature of each sentence of the current radiology report through the second attention layer to obtain a second aggregation feature of each sentence of the current radiology report;
inputting each second aggregation feature, each first character-level feature and the visual feature of the to-be-processed image into the third attention layer, coding the visual feature of the to-be-processed image into the semantic feature of each character of the current radiology report through the third attention layer to obtain a second character-level feature corresponding to each character of the current radiology report, and obtaining a next character in the current radiology report according to each second character-level feature;
repeatedly executing the step of acquiring the embedded features of each character in the current radiology report according to the target embedding matrix until a preset end character is obtained, and obtaining a target radiology report corresponding to the image to be processed;
wherein the initial content of the radiology report is a preset sentence marking character.
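The generation steps above amount to a standard autoregressive decoding loop. The following is a minimal illustrative sketch, not the patented implementation; the component callables (`visual_encoder`, `latent_encoder`, `decoder_step`) and the marker strings are hypothetical stand-ins for the trained modules:

```python
# Illustrative sketch of the generation loop described above (hypothetical
# stand-ins for the trained modules; not the actual patented implementation).

END_CHAR = "<end>"   # preset end character
SENT_MARK = "<s>"    # preset sentence marking character (initial report content)
MAX_LEN = 512        # safety cap so the loop always terminates

def generate_report(image, visual_encoder, latent_encoder, decoder_step):
    """Start from the sentence marking character and repeatedly ask the
    layered decoder for the next character until the end character appears."""
    visual = visual_encoder(image)    # visual features of the image to be processed
    latent = latent_encoder(visual)   # potential (latent) features
    report = [SENT_MARK]              # initial content is the preset marker
    for _ in range(MAX_LEN):
        next_char = decoder_step(report, visual, latent)
        if next_char == END_CHAR:
            break
        report.append(next_char)
    return "".join(report[1:])        # target radiology report, marker stripped
```

In use, `decoder_step` would run the target embedding matrix, the three attention layers, and the feed-forward layer on the partial report at each iteration.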
The report generation model is obtained by training according to a preset data set, the preset data set comprises a plurality of groups of training samples, and each group of training samples comprises a sample image and a corresponding sample radiology report; before inputting the image to be processed into the trained report generation model, the method includes:
selecting a target training sample in the preset data set;
inputting a sample image in the target training sample into the report generation model, and acquiring the visual features of the sample image;
inputting a sample radiology report in the target training sample into a text encoder, acquiring text features of the sample radiology report, inputting the text features into the potential feature encoder, and acquiring the potential features corresponding to the sample radiology report;
inputting the visual features of the sample image, a first character in the sample radiology report, and the potential features corresponding to the sample radiology report into the layered decoder to obtain a prediction report corresponding to the sample radiology report;
obtaining the loss of the target training sample according to the prediction report, and updating the network parameters of the report generation model according to the loss of the target training sample;
and re-executing the step of selecting the target training sample in the preset data set until the parameters of the report generation model are converged.
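As a rough sketch, the training procedure above reduces to the following loop. All callables are hypothetical placeholders for the model's forward pass, parameter update, and convergence test; none of their names come from the patent:

```python
import random

def train_report_model(dataset, forward_loss, update_params, converged):
    """Training loop from the description: repeatedly select a target
    training sample (a sample image and its sample radiology report),
    compute the loss of the resulting prediction report, and update the
    network parameters until they converge. All callables are
    hypothetical stand-ins for the real modules."""
    steps = 0
    while not converged(steps):
        sample_image, sample_report = random.choice(dataset)
        loss = forward_loss(sample_image, sample_report)  # prediction report -> loss
        update_params(loss)                               # update network parameters
        steps += 1
    return steps
```

Here `forward_loss` would encode the sample report with the text encoder, obtain its latent features, and run the layered decoder, as the steps above describe.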
Wherein the obtaining a loss of the target training sample according to the prediction report comprises:
acquiring a first probability distribution according to the prediction report, wherein the first probability distribution is the probability distribution of the prediction report being the sample radiology report under the joint condition of the potential features corresponding to the text features and the sample images;
inputting the visual features of the sample image into the potential feature encoder, and acquiring a second probability distribution according to the output of the potential feature encoder, wherein the second probability distribution is the probability distribution of the potential features corresponding to the text features under the condition of the sample image;
obtaining a third probability distribution according to the potential features corresponding to the text features, wherein the third probability distribution is the probability distribution of the potential features corresponding to the text features under the condition of the sample radiology report;
obtaining a loss of the target training sample according to the first probability distribution, the second probability distribution, and the third probability distribution.
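The description does not spell out the loss formula, but the three distributions are the usual ingredients of a conditional-VAE-style evidence lower bound: a reconstruction term from the first distribution, plus a KL term pulling the report-conditioned posterior (third distribution) toward the image-conditioned prior (second distribution). The sketch below assumes diagonal-Gaussian latent distributions, which the patent does not state explicitly; it is an illustration of that common instantiation, not the patented loss:

```python
import math

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL(q || p) between two diagonal Gaussians given as lists of
    per-dimension means and variances."""
    return sum(
        0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
        for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p)
    )

def training_loss(recon_log_prob, post_mu, post_var, prior_mu, prior_var):
    """Hypothetical ELBO-style loss: negative log-likelihood of the sample
    report under the first distribution, plus KL between the third
    distribution (posterior, conditioned on the report) and the second
    distribution (prior, conditioned on the sample image)."""
    return -recon_log_prob + gaussian_kl(post_mu, post_var, prior_mu, prior_var)
```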
Wherein the obtaining of the first character-level feature of each character and the first aggregate feature of each sentence in the current radiology report output by the first attention layer comprises:
taking the first character-level feature corresponding to the preset sentence marking character of each sentence in the current radiology report as the first aggregation feature of each sentence in the radiology report.
Wherein the encoding, by the second attention layer, the latent features of the image to be processed into the semantic features of each sentence of the current radiology report to obtain a second aggregation feature of each sentence of the current radiology report includes:
generating a query embedding of each sentence according to the first aggregation feature of the sentence;
generating key embedding and value embedding of sentences according to the potential features of the images to be processed;
performing a multi-head attention mechanism based on the query embedding, key embedding, and value embedding of each sentence to obtain the second aggregation feature of each sentence.
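The second attention layer is thus cross-attention: queries come from the sentence-level first aggregation features, while keys and values come from the image's latent features. A minimal NumPy sketch of multi-head scaled dot-product attention follows; the learned per-head projection matrices are omitted for brevity, so this is a simplified illustration rather than the patented layer:

```python
import numpy as np

def multi_head_attention(queries, keys, values, num_heads):
    """Scaled dot-product attention run independently per head.
    queries: (num_sentences, d) first aggregation features;
    keys/values: (num_latents, d) derived from the latent features."""
    d = queries.shape[-1]
    assert d % num_heads == 0, "feature size must divide evenly into heads"
    head_dim = d // num_heads
    out = np.zeros_like(queries, dtype=float)
    for h in range(num_heads):
        sl = slice(h * head_dim, (h + 1) * head_dim)
        q, k, v = queries[:, sl], keys[:, sl], values[:, sl]
        scores = q @ k.T / np.sqrt(head_dim)
        # numerically stable softmax over the key dimension
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[:, sl] = w @ v
    return out
```

The third attention layer applies the same mechanism with character-level queries against keys and values derived from the visual features.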
Wherein, the encoding, by the third attention layer, the visual feature of the to-be-processed image into the semantic feature of each character of the current radiology report to obtain a second character-level feature corresponding to each character of the current radiology report includes:
generating the query embedding of each ordinary text character according to its first character-level feature, and generating the query embedding of the preset sentence marking character of each sentence according to the second aggregation feature of that sentence;
generating key embedding and value embedding of characters according to the visual features of the image to be processed;
performing a multi-head attention mechanism based on the query embedding, key embedding, and value embedding of each character to obtain the second character-level feature of each character.
Wherein the layered decoder further comprises a feed-forward layer comprising at least one linear transformation layer; the obtaining a next character in the current radiology report according to each of the second character-level features includes:
inputting each of the second character-level features to the feed-forward layer;
and obtaining the next character in the current radiology report according to the output of the feed-forward layer.
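This final step can be sketched as a linear projection of the last position's second character-level feature onto the vocabulary, followed by a greedy pick. The weight matrix `W`, bias `b`, and `vocab` below are hypothetical learned parameters, and the single projection stands in for the "at least one linear transformation layer"; a real decoder could also sample instead of taking the argmax:

```python
import numpy as np

def next_character(second_char_features, W, b, vocab):
    """Feed the character-level features through a linear transformation
    layer and return the most probable next character for the last position.
    second_char_features: (seq_len, d); W: (d, vocab_size); b: (vocab_size,)."""
    logits = second_char_features[-1] @ W + b   # linear transformation layer
    return vocab[int(np.argmax(logits))]
```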
Example four
The present invention also provides a computer readable storage medium having stored thereon one or more programs, the one or more programs being executable by one or more processors to perform the steps of the radiology report generation method described above.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A radiology report generation method, the method comprising:
acquiring an image to be processed, and inputting the image to be processed into a trained report generation model, wherein the report generation model comprises a visual feature encoder, a target embedding matrix, a potential feature encoder and a layered decoder, and the layered decoder comprises a first attention layer, a second attention layer and a third attention layer;
acquiring the visual features of the image to be processed through the visual feature encoder, inputting the visual features of the image to be processed into the potential feature encoder, and acquiring the potential features corresponding to the image to be processed output by the potential feature encoder;
acquiring an embedded feature of each character in a current radiology report according to the target embedded matrix, inputting each embedded feature into the first attention layer, and acquiring a first character-level feature of each character and a first aggregation feature of each sentence in the current radiology report output by the first attention layer;
inputting the first aggregation feature of each sentence of the current radiology report and the potential features of the image to be processed into the second attention layer, and encoding the potential features of the image to be processed into the semantic features of each sentence of the current radiology report through the second attention layer to obtain a second aggregation feature of each sentence of the current radiology report;
inputting each second aggregation feature, each first character-level feature and the visual features of the image to be processed into the third attention layer, encoding the visual features of the image to be processed into the semantic features of each character of the current radiology report through the third attention layer to obtain a second character-level feature corresponding to each character of the current radiology report, and obtaining the next character in the current radiology report according to each second character-level feature;
repeatedly executing the step of obtaining the embedding characteristics of each character in the current radiology report according to the target embedding matrix until a preset end character is obtained, and obtaining a target radiology report corresponding to the image to be processed;
wherein the initial content of the radiology report is a preset sentence marking character.
2. The radiology report generation method of claim 1, wherein the report generation model is trained according to a preset data set, the preset data set includes a plurality of groups of training samples, and each group of training samples includes a sample image and a corresponding sample radiology report; before inputting the image to be processed into the trained report generation model, the method includes:
selecting a target training sample in the preset data set;
inputting a sample image in the target training sample into the report generation model, and acquiring the visual features of the sample image;
inputting a sample radiology report in the target training sample into a text encoder, acquiring text features of the sample radiology report, inputting the text features into the potential feature encoder, and acquiring the potential features corresponding to the sample radiology report;
inputting the visual features of the sample image, a first character in the sample radiology report, and the potential features corresponding to the sample radiology report to the layered decoder to obtain a prediction report corresponding to the sample radiology report;
obtaining the loss of the target training sample according to the prediction report, and updating the network parameters of the report generation model according to the loss of the target training sample;
and re-executing the step of selecting the target training sample in the preset data set until the parameters of the report generation model are converged.
3. The radiology report generation method of claim 2, wherein the obtaining the loss of the target training sample from the prediction report comprises:
acquiring a first probability distribution according to the prediction report, wherein the first probability distribution is the probability distribution of the prediction report as the sample radiology report under the joint condition of the potential feature corresponding to the text feature and the sample image;
inputting the visual features of the sample image into the potential feature encoder, and acquiring a second probability distribution according to the output of the potential feature encoder, wherein the second probability distribution is the probability distribution of the potential features corresponding to the text features under the condition of the sample image;
obtaining a third probability distribution according to the potential features corresponding to the text features, wherein the third probability distribution is the probability distribution of the potential features corresponding to the text features under the condition of the sample radiology report;
obtaining a loss of the target training sample based on the first probability distribution, the second probability distribution, and the third probability distribution.
4. The radiology report generation method of claim 1, wherein the obtaining a first character-level feature for each character and a first aggregate feature for each sentence in the current radiology report output by the first attention layer comprises:
taking the first character-level feature corresponding to the preset sentence marking character of each sentence in the current radiology report as the first aggregation feature of each sentence in the radiology report.
5. The radiology report generation method of claim 1, wherein the encoding, by the second attention layer, the latent features of the to-be-processed image into the semantic features of each sentence of the current radiology report to obtain a second aggregation feature of each sentence of the current radiology report comprises:
generating a query embedding of each sentence according to the first aggregation feature of the sentence;
generating key embedding and value embedding of sentences according to the potential features of the images to be processed;
performing a multi-head attention mechanism based on the query embedding, key embedding, and value embedding of each sentence to obtain the second aggregation feature of each sentence.
6. The radiology report generation method of claim 1, wherein the encoding, by the third attention layer, the visual features of the to-be-processed image into semantic features of each character of the current radiology report to obtain second character-level features corresponding to each character of the current radiology report comprises:
generating the query embedding of each ordinary text character according to its first character-level feature, and generating the query embedding of the preset sentence marking character of each sentence according to the second aggregation feature of that sentence;
generating key embedding and value embedding of characters according to the visual features of the image to be processed;
performing a multi-head attention mechanism based on the query embedding, key embedding, and value embedding of each character to obtain the second character-level feature of each character.
7. The radiology report generation method of claim 1, wherein the layered decoder further comprises a feed-forward layer comprising at least one linear transformation layer; the obtaining a next character in the current radiology report according to each of the second character-level features includes:
inputting each of the second character-level features to the feed-forward layer;
and obtaining the next character in the current radiology report according to the output of the feed-forward layer.
8. A radiology report generating device, comprising:
the image acquisition module is used for acquiring an image to be processed and inputting the image to be processed into a trained report generation model, wherein the report generation model comprises a visual feature encoder, a target embedding matrix, a potential feature encoder and a layered decoder, and the layered decoder comprises a first attention layer, a second attention layer and a third attention layer;
the potential feature extraction module is used for acquiring the visual features of the image to be processed through the visual feature encoder, inputting the visual features of the image to be processed into the potential feature encoder, and acquiring the potential features corresponding to the image to be processed output by the potential feature encoder;
a first attention module, configured to obtain an embedding feature of each character in a current radiology report according to the target embedding matrix, input each embedding feature to the first attention layer, and obtain a first character-level feature of each character and a first aggregation feature of each sentence in the current radiology report output by the first attention layer;
a second attention module, configured to input the first aggregation feature of each sentence of the current radiology report and the latent features of the to-be-processed image into the second attention layer, and encode the latent features of the to-be-processed image into semantic features of each sentence of the current radiology report through the second attention layer, so as to obtain a second aggregation feature of each sentence of the current radiology report;
a third attention module, configured to input each of the second aggregate features, each of the first character-level features, and the visual features of the to-be-processed image into the third attention layer, encode the visual features of the to-be-processed image into semantic features of each character of the current radiology report through the third attention layer, obtain a second character-level feature corresponding to each character of the current radiology report, and obtain a next character in the current radiology report according to each of the second character-level features;
the circulation module is used for calling the first attention module to re-execute the step of acquiring the embedded features of each character in the current radiology report after the third attention module outputs the next character of the current radiology report until a preset end character is acquired, and obtaining a target radiology report corresponding to the image to be processed;
wherein the initial content of the radiology report is a preset sentence marking character.
9. A terminal, characterized in that the terminal comprises: a processor, a computer readable storage medium communicatively connected to the processor, the computer readable storage medium adapted to store a plurality of instructions, the processor adapted to invoke the instructions in the computer readable storage medium to perform the steps of implementing the radiology report generation method of any one of claims 1-7 above.
10. A computer readable storage medium, storing one or more programs, the one or more programs being executable by one or more processors for performing the steps of the radiology report generation method of any one of claims 1-7.
CN202111346347.8A 2021-11-15 2021-11-15 Radiology report generation method, device, terminal and storage medium Active CN114334068B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111346347.8A CN114334068B (en) 2021-11-15 2021-11-15 Radiology report generation method, device, terminal and storage medium


Publications (2)

Publication Number Publication Date
CN114334068A CN114334068A (en) 2022-04-12
CN114334068B true CN114334068B (en) 2022-11-01

Family

ID=81044900

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111346347.8A Active CN114334068B (en) 2021-11-15 2021-11-15 Radiology report generation method, device, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN114334068B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109545302A (en) * 2018-10-22 2019-03-29 复旦大学 A kind of semantic-based medical image report template generation method
CN110111864A (en) * 2019-04-15 2019-08-09 中山大学 A kind of medical report generation model and its generation method based on relational model
CN111126024A (en) * 2018-10-12 2020-05-08 西门子医疗有限公司 Statement generation
WO2020121308A1 (en) * 2018-12-11 2020-06-18 Cvaid Ltd. Systems and methods for diagnosing a stroke condition
CN112529857A (en) * 2020-12-03 2021-03-19 重庆邮电大学 Ultrasonic image diagnosis report generation method based on target detection and strategy gradient
CN112992308A (en) * 2021-03-25 2021-06-18 腾讯科技(深圳)有限公司 Training method of medical image report generation model and image report generation method
CN113035311A (en) * 2021-03-30 2021-06-25 广东工业大学 Medical image report automatic generation method based on multi-mode attention mechanism
CN113065496A (en) * 2021-04-13 2021-07-02 湖南大学 Neural network machine translation model training method, machine translation method and device
CN113313199A (en) * 2021-06-21 2021-08-27 北京工业大学 Brain CT medical report automatic generation method based on weak supervision attention
CN113378112A (en) * 2021-06-18 2021-09-10 浙江工业大学 Point cloud completion method and device based on anisotropic convolution
CN113505701A (en) * 2021-07-12 2021-10-15 辽宁工程技术大学 Variational self-encoder zero sample image identification method combined with knowledge graph
CN113538662A (en) * 2021-07-05 2021-10-22 北京工业大学 Single-view three-dimensional object reconstruction method and device based on RGB data

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021051091A1 (en) * 2019-09-13 2021-03-18 Rad Al, Inc. Method and system for automatically generating a section in a radiology report


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"Hybrid retrieval-generation reinforced agent for medical image report generation";Yuan Li 等;《Advances in neural information processing systems》;20181231;第1-9页 *
"Knowledge-driven encode,retrieve,paraphrase for medical image report generation";Christy Y Li 等;《Proceedings of the AAAI Conference on Artificial Intelligence》;20191231;第6666-6673页 *
"Automatic Generation of Lung Descriptions for Chest X-ray Images Based on Deep Learning"; Huang Xin et al.; Pattern Recognition and Artificial Intelligence; 20210615; Vol. 34, No. 6; pp. 552-560 *
"Constituency Parsing Based on Deep Neural Networks"; Dai Biyun; China Masters' Theses Full-text Database (Information Science and Technology); 20210815; pp. I138-695 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant