CN112417241B - Method for mining topic learning pipeline based on neuroimaging literature of event - Google Patents


Info

Publication number
CN112417241B
Authority
CN
China
Prior art keywords
word
topic
event
vector
model
Prior art date
Legal status
Active
Application number
CN202011226838.4A
Other languages
Chinese (zh)
Other versions
CN112417241A (en)
Inventor
闫健卓
陈丽红
陈建辉
于涌川
Current Assignee
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202011226838.4A
Publication of CN112417241A
Application granted
Publication of CN112417241B
Legal status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for mining a topic learning pipeline from neuroimaging literature based on events. An event-based topic learning task is designed to obtain semantically rich neuroimaging research topics, improving their interpretability and accuracy. A novel topic learning method is provided by fusing deep learning and domain knowledge with a probabilistic topic model, realizing event-based topic learning over full-text neuroimaging documents. Finally, for the two core aspects of topic learning, topic coherence and KL divergence are selected as evaluation metrics. A set of experiments on real data compares the proposed method with four mainstream topic learning methods. The experimental results show that the neuroimaging Event-BTM significantly improves the accuracy and completeness of topics mined from neuroimaging literature.

Description

Method for mining topic learning pipeline based on neuroimaging literature of event
Technical Field
The invention belongs to the field of computer science and relates to a method for mining a topic learning pipeline from neuroimaging literature based on events.
Background
Neuroimaging text mining extracts knowledge from neuroimaging text and has received extensive attention; topic learning is an important focus of neuroimaging text mining. However, current neuroimaging topic learning studies mainly use traditional probabilistic topic models to extract topics from documents and cannot obtain high-quality neuroimaging topics. Existing topic learning methods cannot meet the requirements of topic learning over full-text neuroimaging documents. Therefore, the invention provides a method for mining a topic learning pipeline from neuroimaging literature based on events.
Disclosure of Invention
The technical problem the invention aims to solve is to provide a method for mining a topic learning pipeline from neuroimaging literature based on events. The method determines three types of neuroimaging research topic events by analyzing the neuroimaging research process and the information availability of neuroimaging documents, and then provides a novel topic learning method that fuses deep learning and domain knowledge with a probabilistic topic model to realize event-based topic learning for full-text neuroimaging documents.
In order to solve the problems, the invention adopts the following technical scheme:
the method for mining the subject learning pipeline based on the neuroimaging literature of the event comprises the following steps:
and step 1, preprocessing data.
The stop word processing is performed on paper data crawled from the PLoS One website.
And 2, expressing a predefined event.
By analyzing the course and results of the neuro-imaging study and the availability of relevant information in the neuro-imaging literature, a set of neuro-imaging study events is determined.
And 3, training an LSTM-CNN model.
Firstly, converting words into word vectors through an Embedding Layer, inputting LSTM for semantic feature extraction, and finally, taking the output of the LSTM as the input of CNN for further feature extraction.
And 4, training a PCNN model.
By vector representation, convolution, max pooling, classifying four parts, and obtaining a vector representation of the relationship.
And 5, constructing a neuroimaging Event-BTM topic learning pipeline.
And inputting functional neuroimage document data, and acquiring a theme representation result of the document.
And 6, evaluating the model.
Model performance was evaluated using model evaluation indicators.
Drawings
FIG. 1 is a diagram of a neuroimaging Event-BTM topic learning pipeline framework;
FIG. 2 is an Event-BTM topic learning model;
Detailed Description
The invention is further described below with reference to the accompanying drawings:
as shown in fig. 1, the method of the present invention mainly comprises the following steps:
step 1, data preprocessing
Stop-word removal is performed on the document data crawled from the PLoS ONE website.
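A minimal sketch of the stop-word preprocessing step. The patent only names "the", "a", and "an" explicitly; the rest of the stop-word list below is an illustrative assumption, as is the tokenization rule.

```python
# Minimal stop-word preprocessing sketch. The stop-word list beyond
# "the"/"a"/"an" and the tokenizer are assumptions, not the patent's exact list.
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}

def preprocess(text: str) -> list:
    """Lowercase, tokenize on letter runs, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The fMRI task evokes a response in the visual cortex"))
# → ['fmri', 'task', 'evokes', 'response', 'visual', 'cortex']
```

In practice the crawled full-text documents would be streamed through this filter before any vectorization.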
Step 2, expression of predefined event
By analyzing the process and results of neuroimaging studies and the availability of relevant information in the neuroimaging literature, a set of neuroimaging study events is determined and divided into three topic events, "cognitive response", "experiment", and "analysis", which describe the results of a neuroimaging study, the experimental process, and the analytical process, respectively. Each topic event contains several meta-events designed for the event extraction task. The specific form is as follows:
Event_deduce-results = [trigger, &lt;argument1, role1&gt;?, &lt;argument2, role2&gt;?] = [{evoke, indicate, reveal, ...}, &lt;{EXPERIMENT TASK, COGNITIVE FUNCTION, MEDICAL PROBLEM}, research object&gt;+, &lt;{FEATURES OF PHYSIOLOGY AND PSYCHOLOGY}, biological mechanism&gt;+]
wherein Event_deduce-results is an inference event; trigger denotes the trigger word; argument1 and argument2 denote the first and second arguments, respectively; and role1 and role2 denote the roles of the first and second arguments, respectively.
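A hypothetical data structure for a deduce-results meta-event following the "trigger + <argument, role>" template above. The class and field names are assumptions for illustration, not part of the patent.

```python
# Hypothetical representation of a meta-event as "trigger + (argument, role)" pairs.
# Class and field names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class MetaEvent:
    trigger: str                                    # e.g. "evoke", "indicate", "reveal"
    arguments: list = field(default_factory=list)   # list of (argument, role) tuples

event = MetaEvent(
    trigger="evoke",
    arguments=[
        ("visual task", "research object"),              # an EXPERIMENT TASK argument
        ("occipital activation", "biological mechanism"),# a PHYSIOLOGY FEATURE argument
    ],
)
print(event.trigger, len(event.arguments))
```

Downstream, the event extraction models would fill such structures from sentences before topic learning.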
Step 3, training LSTM-CNN model
Event recognition includes trigger word recognition, argument recognition and trigger word type recognition. BiLSTM-CNN is used to model text features for event recognition.
And 3.1, vectorizing the text data.
v_word = [v_w, v_c, v_t, v_char]
wherein v_word is the combined vector of word_i in the sentence, and v_w, v_c, v_t, and v_char are the word vector, case vector, term-dictionary vector, and character vector, respectively.
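The combined vector is a simple concatenation, which can be sketched as follows; the vector dimensions are arbitrary illustrative choices, not values from the patent.

```python
# Sketch of the combined word vector v_word = [v_w, v_c, v_t, v_char].
# All dimensions below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
v_w = rng.normal(size=100)       # pretrained word vector
v_c = np.array([1.0, 0.0])       # case vector (e.g. one-hot: capitalized or not)
v_t = np.array([0.0, 1.0, 0.0])  # term-dictionary vector (domain term type)
v_char = rng.normal(size=30)     # character-level vector from a char encoder

v_word = np.concatenate([v_w, v_c, v_t, v_char])
print(v_word.shape)  # → (135,)
```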
Step 3.2: event element recognition. The BiLSTM-based feature modeling process is described as follows: each word_i in the sentence is represented by its combined vector v_word, from which the word representation f_i is obtained, and h_i is the output of the LSTM hidden layer. Based on the BiLSTM output, a log-softmax function is used to obtain the log probability of each trigger word or argument.
Step 4: training the PCNN model.
The output of the BiLSTM-CNN model is taken as the input of the PCNN model; the specific process is as follows:
and 4.1, feature vector.
V lf =[E 1t ,E 2t ,E 1tf ,E 1tb ,E 2tf ,E 2tb ,r]
Wherein V is lf Is the feature vector E 1t Word vector, E, which is a trigger word 2t Word vector being argument, E 1tf Is the word vector of the word preceding the trigger, E 1tb Is the word vector of the next word of the trigger, E 2tf Is a word vector of a word preceding the parameter, E 2tb Is the word vector of the word following the parameter, r is the index of the event role type.
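The lexical feature vector can be assembled as below. The toy embedding table, sentence, and role index are assumptions for illustration only.

```python
# Sketch of the PCNN lexical feature vector
# V_lf = [E_1t, E_2t, E_1tf, E_1tb, E_2tf, E_2tb, r].
# The embedding table, sentence, and indices are illustrative assumptions.
import numpy as np

dim = 4
vocab = ["<pad>", "task", "evokes", "activation", "strong"]
emb = {w: np.full(dim, float(i)) for i, w in enumerate(vocab)}

sent = ["strong", "task", "evokes", "activation", "<pad>"]
trig_idx, arg_idx = 2, 3      # "evokes" is the trigger, "activation" the argument
role_index = np.array([1.0])  # r: index of the event role type

V_lf = np.concatenate([
    emb[sent[trig_idx]],      # E_1t: trigger word vector
    emb[sent[arg_idx]],       # E_2t: argument word vector
    emb[sent[trig_idx - 1]],  # E_1tf: word before the trigger
    emb[sent[trig_idx + 1]],  # E_1tb: word after the trigger
    emb[sent[arg_idx - 1]],   # E_2tf: word before the argument
    emb[sent[arg_idx + 1]],   # E_2tb: word after the argument
    role_index,               # r
])
print(V_lf.shape)  # → (25,)
```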
Step 4.2: word representation.
v_wp = [v_wf, d_pft, d_pfa]
wherein v_wp is the word representation vector; v_wf is the word vector of the current word; d_pft is the distance vector between the current word and the trigger word; and d_pfa is the distance vector between the current word and the argument.
Step 4.3: the CNN extracts sentence-level global features to predict roles.
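The sentence-level feature of step 4.3 follows the form n = max(M_1 · v_wp) with V_sf = tanh(W_2 · n), as also given in the claims. A NumPy sketch under assumed (illustrative) matrix sizes:

```python
# Numpy sketch of the sentence feature: per-position linear map (convolution),
# max pooling over positions, then a tanh hidden layer.
# All dimensions and random matrices are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
seq_len, wp_dim, n_filters, sf_dim = 6, 8, 5, 3

v_wp = rng.normal(size=(seq_len, wp_dim))  # word representation at each position
M1 = rng.normal(size=(n_filters, wp_dim))  # convolution filters (linear map per position)
W2 = rng.normal(size=(sf_dim, n_filters))  # hidden-layer transformation

conv = v_wp @ M1.T         # (seq_len, n_filters): filter responses per position
n = conv.max(axis=0)       # max pooling over the sentence
V_sf = np.tanh(W2 @ n)     # sentence feature, bounded in (-1, 1)
print(V_sf.shape)  # → (3,)
```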
Step 5: constructing the neuroimaging Event-BTM topic learning pipeline.
Step 5.1: pre-train the BiLSTM-CNN model. An input sentence first passes through an embedding layer, which maps each word or character to a word vector or character vector; these vectors are fed into the BiLSTM layer to obtain the forward and backward vectors of the sentence, which are then concatenated as the hidden state vector of the current word or character. Meanwhile, the CNN layer is used to extract local features of the current word.
Step 5.2: the feature vector output by the BiLSTM-CNN model is taken as input, and the PCNN model is used to identify the role of each argument.
Step 5.3: the results identified in step 5.2 are fed into the topic model Event-BTM to learn the topic results of the documents.
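In the Event-BTM stage, topics are inferred over event pairs in the standard biterm-topic-model form: with θ_z the topic distribution and φ_z(e) the per-topic event distribution, P(b) = Σ_z θ_z φ_z(e_i) φ_z(e_j). The sketch below computes these probabilities; the parameter values are illustrative, not learned from data.

```python
# Sketch of Event-BTM probabilities (standard biterm-topic-model form).
# theta and phi below are illustrative numbers, not learned parameters.
import numpy as np

theta = np.array([0.6, 0.4])       # θ: distribution over 2 topics
phi = np.array([[0.5, 0.3, 0.2],   # φ_z(e): topic 0 over 3 events
                [0.1, 0.2, 0.7]])  # topic 1

def p_pair(i, j):
    """P(b) = Σ_z θ_z φ_z(e_i) φ_z(e_j) for event pair (e_i, e_j)."""
    return float(np.sum(theta * phi[:, i] * phi[:, j]))

def p_topic_given_pair(i, j):
    """P(z | b) by Bayes' rule: normalize the per-topic joint."""
    joint = theta * phi[:, i] * phi[:, j]
    return joint / joint.sum()

print(round(p_pair(0, 2), 4))  # → 0.088
print(p_topic_given_pair(0, 2))
```

A document's topic distribution then aggregates P(z|b) over its event pairs, weighted by how often each pair occurs in the document.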
Step 6: model evaluation.
The performance of the model is evaluated using the following evaluation metrics:
Coherence(V) = Σ_{i=2}^{|V|} Σ_{j=1}^{i-1} log((D(v_i, v_j) + ε) / D(v_j))
KL(p‖q) = Σ_x p(x) log(p(x) / q(x))
wherein Coherence is the degree of topic aggregation; KL is the divergence; V is the word set of a topic; ε is a smoothing factor (usually taken as 1); D(v_i, v_j) is the number of documents containing both words v_i and v_j; D(v_j) is the number of documents containing v_j; p is a topic distribution and p(x) is the probability of topic word x in p; q is another topic distribution and q(x) is the probability of topic word x in q.
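The two metrics can be sketched in a few lines. Since the patent's exact formula images are not reproduced here, the coherence below uses the common UMass-style form consistent with the definitions in the text (ε = 1); the toy corpus is an assumption.

```python
# Sketch of the evaluation metrics: UMass-style topic coherence and KL divergence.
# The toy document corpus is an illustrative assumption.
import math

docs = [{"brain", "fmri", "task"},
        {"brain", "task", "memory"},
        {"fmri", "memory"}]

def coherence(topic_words, docs, eps=1.0):
    """Sum over ordered word pairs of log((D(v_i, v_j) + eps) / D(v_j))."""
    score = 0.0
    for i in range(1, len(topic_words)):
        for j in range(i):
            vi, vj = topic_words[i], topic_words[j]
            d_ij = sum(1 for d in docs if vi in d and vj in d)  # co-document count
            d_j = sum(1 for d in docs if vj in d)               # document count of v_j
            score += math.log((d_ij + eps) / d_j)
    return score

def kl(p, q):
    """KL(p || q) = Σ p(x) log(p(x)/q(x)) between two topic distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(round(coherence(["brain", "task", "fmri"], docs), 4))
print(round(kl([0.7, 0.3], [0.5, 0.5]), 4))
```

Higher coherence indicates more interpretable topics; a larger KL divergence between two topic distributions indicates better topic separation.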
As described above, the present invention has the advantages that:
1. Three types of neuroimaging study topic events were determined by analyzing the course of neuroimaging studies and the availability of information in the neuroimaging literature. On this basis, an event-based topic learning task is designed to acquire semantically rich neuroimaging research topics, improving their interpretability and accuracy.
2. By fusing deep learning and domain knowledge with a probabilistic topic model, a novel topic learning method is provided to realize event-based topic learning for full-text neuroimaging documents.
The above embodiments are only exemplary embodiments of the present invention and are not intended to limit the present invention, the scope of which is defined by the claims. Various modifications and equivalent arrangements of this invention will occur to those skilled in the art, and are intended to be within the spirit and scope of the invention.

Claims (4)

1. A method for mining a subject learning pipeline based on neuroimaging literature of events, comprising the steps of:
step 1: preprocessing data; performing stop word processing on paper data crawled from a PLoS One website;
step 2: expression of predefined events;
step 3: training an LSTM-CNN model; firstly, converting words into word vectors through an embedding layer and inputting them into the LSTM (long short-term memory) network for semantic feature extraction; finally, taking the output of the LSTM as the input of the CNN for further feature extraction;
step 4: training a PCNN model; obtaining a vector representation of the relation through four parts: vector representation, convolution, max pooling, and classification;
step 5: constructing a neuroimaging Event-BTM topic learning pipeline, inputting functional neuroimage literature data, and obtaining topic representation results of the literature;
step 6: evaluating the model; evaluating model performance using model evaluation indicators;
the expression of the predefined event in the step 2 is specifically: the subject event is constructed by using meta event, which is expressed as a structure of "trigger word + argument";
the PCNN model is trained in the step 4, and the specific method comprises the following steps: taking the output of the BiLSTM-CNN model as the input of the PCNN model;
step one: a feature vector;
V_lf = [E_1t, E_2t, E_1tf, E_1tb, E_2tf, E_2tb, r]
wherein V_lf is the feature vector; E_1t is the word vector of the trigger word; E_2t is the word vector of the argument; E_1tf and E_1tb are the word vectors of the words immediately before and after the trigger; E_2tf and E_2tb are the word vectors of the words immediately before and after the argument; and r is the index of the event role type;
step two: word representation;
v_wp = [v_wf, d_pft, d_pfa]
wherein v_wp is the word representation vector; v_wf is the word vector of the current word; d_pft is the distance vector between the current word and the trigger word; and d_pfa is the distance vector between the current word and the argument;
step three: the CNN extracts sentence-level global features to predict roles, as follows:
n = max(M_1 v_wp)
V_sf = tanh(W_2 n)
wherein n represents the most useful feature extracted from each convolution kernel by max pooling; max denotes the maximization operation; v_wp is the word representation vector; M_1 and W_2 are linear transformation matrices of the hidden layer; tanh is the activation function; and V_sf is the sentence feature;
the Event-BTM topic learning model in step 5 specifically comprises the following steps:
step one: the probability of a single event pair b is:
P(b) = Σ_z θ_z · φ_{z,e_i} · φ_{z,e_j}
step two: the probability of the whole event-pair set B is:
P(B) = Π_{b∈B} Σ_z θ_z · φ_{z,e_i} · φ_{z,e_j}
step three: the topic distribution probability of a document d is:
P(z|d) = Σ_b P(z|b) · P(b|d)
wherein b(e_i, e_j) is an event pair consisting of the two events e_i and e_j; z is the topic of the event pair; θ_z is the topic distribution; and φ_{z,e_i} is the distribution of event e_i under topic z;
the neuroimaging Event-BTM topic learning pipeline of step 5 comprises the following specific steps:
step 5.1, pre-training a BiLSTM-CNN model, wherein an input sentence first passes through an embedding layer that maps each word or character into a word vector or character vector; the vectors are then fed into the BiLSTM layer to obtain the forward and backward vectors of the sentence, which are concatenated as the hidden state vector of the current word or character; and the CNN layer is used to extract local features of the current word;
step 5.2, using the feature vector output by the BiLSTM-CNN model as input, and identifying the role of the argument by using the PCNN model;
and 5.3, applying the result identified in the step 5.2 to a topic model Event-BTM, and learning topic results of documents.
2. The method for mining a topic learning pipeline based on neuroimaging literature of claim 1, wherein: in the data preprocessing of step 1, stop words, including "the", "a", and "an", are removed from the functional neuroimaging document data.
3. The method for mining a topic learning pipeline based on neuroimaging literature of claim 1, wherein: the LSTM-CNN model training step in the step 3 comprises the following steps:
step one: text vectorization;
v_word = [v_w, v_c, v_t, v_char]
wherein v_word is the combined vector of a word in the sentence, and v_w, v_c, v_t, and v_char are the word vector, case vector, term-dictionary vector, and character vector, respectively;
step two: event element recognition; for a sentence S = [word_1, word_2, …, word_i, …, word_n], the BiLSTM-based feature modeling process is described as follows:
wherein f_i is the word representation, word_i is the i-th word in the sentence, and h_i is the output of the LSTM hidden layer.
4. The method for mining a topic learning pipeline based on neuroimaging literature of claim 1, wherein the performance metrics of the evaluation model in step 6 are as follows:
Coherence(V) = Σ_{i=2}^{|V|} Σ_{j=1}^{i-1} log((D(v_i, v_j) + ε) / D(v_j))
KL(p‖q) = Σ_x p(x) log(p(x) / q(x))
wherein Coherence is the degree of topic aggregation; KL is the divergence; V is the word set of a topic; ε is the smoothing factor; D(v_i, v_j) is the number of documents containing both words v_i and v_j; D(v_j) is the number of documents containing v_j; p is a topic distribution and p(x) is the probability of topic word x in p; q is another topic distribution and q(x) is the probability of topic word x in q.
CN202011226838.4A 2020-11-06 2020-11-06 Method for mining topic learning pipeline based on neuroimaging literature of event Active CN112417241B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011226838.4A CN112417241B (en) 2020-11-06 2020-11-06 Method for mining topic learning pipeline based on neuroimaging literature of event


Publications (2)

Publication Number Publication Date
CN112417241A CN112417241A (en) 2021-02-26
CN112417241B true CN112417241B (en) 2024-03-12

Family

ID=74827018

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011226838.4A Active CN112417241B (en) 2020-11-06 2020-11-06 Method for mining topic learning pipeline based on neuroimaging literature of event

Country Status (1)

Country Link
CN (1) CN112417241B (en)


Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674298A (en) * 2019-09-29 2020-01-10 安徽信息工程学院 Deep learning mixed topic model construction method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A Topic Learning Pipeline for Curating Brain Cognitive Researches; YING SHENG et al.; IEEE; 2020-10-19; full text *

Also Published As

Publication number Publication date
CN112417241A (en) 2021-02-26


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant