CN106933805B

CN106933805B - Method for identifying biological event trigger words in a big data set

Info

Publication number: CN106933805B
Application number: CN201710148320.5A
Authority: CN
Inventors: 陈一飞; 刘峰; 韩冰青
Original assignee: Individual
Current assignee: Individual
Priority date: 2017-03-14
Filing date: 2017-03-14
Publication date: 2020-04-28
Anticipated expiration: 2037-03-14
Also published as: CN106933805A

Abstract

The invention relates to the technical field of methods for identifying biological event trigger words, and in particular to a method for identifying biological event trigger words in a big data set. The method is a parallel undersampling method (PUS) comprising the steps of data segmentation, boundary factor calculation, sample undersampling, boundary set merging and final pruning. It can be used to process large training data sets with a pronounced distribution skew between classes, which it corrects by reducing, in parallel, the sample instances belonging to the majority class. The method selects data on the basis of boundary factors, which measure how important the information carried by each sample instance is to the classification. The method provided by this technical scheme addresses large data volume and unbalanced sample distribution between classes at the same time, and thereby achieves better identification of biological event trigger words.

Description

Method for identifying biological event trigger words in a big data set

Technical Field

The invention relates to the technical field of methods for identifying biological event trigger words, and in particular to a method for identifying biological event trigger words in a big data set.

Background

With advances in information technology and the spread of the internet, biomedical electronic literature, as a product of scientific research, is growing exponentially, and these online resources contain a large amount of valuable biomedical knowledge urgently needed by systems biology research. Faced with this continuously growing mass of biomedical text, text mining is widely applied in the biomedical field as a technology for extracting important knowledge hidden in documents.

Biological event extraction is the process of automatically detecting descriptions of interactions between biomolecules such as genes and proteins in the massive medical research literature, so as to extract structured information for predefined event types. If biological event trigger words can be recognized accurately, the performance of event extraction improves greatly. Trigger word recognition is the first step of biological event extraction; the recognized trigger word is the basis of event element recognition and is the core of the whole event. Trigger word recognition must also determine the category of the trigger word, which is the category of the whole event. If the trigger word is recognized incorrectly, the subsequent work loses its meaning, so good trigger word recognition is the key to biomedical event extraction. Among machine learning (ML) models for event trigger word recognition, methods based on the Support Vector Machine (SVM) with rich feature representations are the most common and give the best results. In practical applications, however, two key issues arise from the complexity of the data: first, the distribution of data between classes is unbalanced; second, the training data set is very large. Many classification algorithms have significant limitations on large data sets, and their performance drops. For example, the training complexity of an SVM depends heavily on the size of the data set, so training on large data sets is time-consuming. The combination of large data sets and highly unbalanced data distribution therefore poses a great challenge for event trigger word recognition.

For large data sets, undersampling is the most efficient way to construct a balanced data set: it removes some sample instances of the majority class, which also reduces computational complexity, so undersampling remains effective on big data. Many more efficient undersampling methods have been proposed for this reason. Clustering-based undersampling addresses unbalanced data distribution by clustering the data set: the training data is divided into several clusters, representative sample instances are selected from the majority-class clusters according to a ratio, and these instances together with the minority-class instances form a balanced data set. Combining clustering-based undersampling with ensemble learning can solve the unbalanced data problem effectively. In addition, the inverse random undersampling method (IRUS) constructs composite decision boundaries between classes by heavily and randomly undersampling the majority-class data set. Although these methods alleviate the unbalanced learning problem to some extent, they still need considerable time to cluster iteratively or to find nearest-neighbor boundaries, so they are not truly efficient on large data sets.

To overcome the bottleneck of SVM training complexity on large data sets, various methods have also been proposed. For example, Sequential Minimal Optimization (SMO) decomposes a large quadratic programming (QP) problem into a series of smallest possible QP sub-problems, which allows it to handle large training sets. Another approach uses minimum enclosing ball (MEB) clustering: the training data is partitioned by the MEB method, and the cluster centers are used for SVM classification. However, these methods do not help with the classification of unbalanced data.

Existing methods therefore cannot adequately solve the combined problem of large data volume and unbalanced sample distribution between classes, and solving it is an important part of biological event trigger word recognition.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a method for identifying biological event trigger words in a big data set that simultaneously handles large data volume and unbalanced sample distribution between classes, that is, the unbalanced classification of samples in a big data set, and thereby achieves better identification of biological event trigger words.

To solve this technical problem, the invention adopts the following technical scheme: a method for identifying biological event trigger words in a big data set, which is a parallel undersampling (PUS) method comprising the following steps:

Step 1, data segmentation. Define the data set $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ as the training data set, where $x_i \in \mathbb{R}^m$ is a sample instance and $y_i \in \{0, 1, \ldots, l\}$ is the class to which the sample instance belongs, giving $1 + l$ class labels in total. Define $D_\alpha$ as the majority-class data set, containing $n_0$ sample instances belonging to class $y = 0$, and let $\alpha = n_0$. Randomly partition the majority-class data set $D_\alpha$ into $K$ mutually disjoint majority-class sub-data sets $D_\alpha^k$, $k = 1, \ldots, K$. Let $\alpha_k$ denote the number of sample instances in each majority-class sub-data set $D_\alpha^k$, so that

$$\alpha = \sum_{k=1}^{K} \alpha_k.$$

Define $D_\beta$ as the minority-class data set, i.e. $D_\beta = \bigcup_{j=1}^{l} D_j$, where $\beta$ denotes the number of sample instances in all minority-class data sets, so that

$$\beta = \sum_{j=1}^{l} n_j, \qquad \alpha > \beta;$$

Step 2, boundary factor calculation. Define each data set $S^k$ as the union of the corresponding majority-class sub-data set $D_\alpha^k$ and the minority-class data set $D_\beta$, expressed as

$$S^k = D_\alpha^k \cup D_\beta.$$

After the feature extraction step, each sample instance in $S^k$ is represented by the $m$-dimensional features $F = \{f_t\}$, $t = 1, \ldots, m$. The boundary factor of each sample is calculated from the uncertainty of its belonging to all classes, and is determined mainly by the distance $d(x, C_j)$ from each sample instance $x$ in $S^k$ to a given class $C_j$. The distance is defined as follows:

[distance formula not reproduced]

The distance is built from components that each measure the distance from sample instance $x$ to the given class $C_j$ in the $t$-th feature dimension. Since the bio-trigger recognition data set is text, each component is defined as the distance from the text vector to the centroid of class $C_j$, the centroid being the average of the term frequency $TF(f_t \mid C_j)$:

[centroid formula not reproduced]

On the basis of $d(x, C_j)$, the degree of membership $\mu_j(x)$ of each sample instance $x$ in class $C_j$ is defined as follows:

[membership formula and its accompanying condition not reproduced]

The boundary factor $\mathrm{BoundF}(x)$ of sample instance $x$ is defined as follows:

[boundary factor formula not reproduced]

Step 3, sample undersampling. Sort the calculated $\mathrm{BoundF}(x)$ values and extract the $\alpha'_k$ samples with the largest $\mathrm{BoundF}(x)$ values as boundary sample instances to form a boundary set. The number of samples is $\alpha'_k = p \times \beta$, where $p$ is a tunable parameter of the PUS algorithm;

Step 4, boundary set merging. Merge all boundary sets generated by the parallel undersampling of steps 2 and 3 to obtain a new majority-class data set $D'_\alpha$, and combine it with all minority classes to obtain a new training data set $D' = D'_\alpha \cup D_\beta$;

Step 5, pruning. Repeat steps 2 and 3 on the training data set $D'$ to obtain the final training data set $D''$, which contains the $\alpha''$ majority-class samples with the largest $\mathrm{BoundF}(x)$ values, so that the number of majority-class samples and the number of minority-class samples are balanced, i.e. $\alpha'' = \beta$.
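The five steps can be sketched end to end in Python. This is a minimal illustration under stated assumptions, not the patented implementation: the names `pus`, `top_boundary` and `toy_score` are hypothetical, the scoring function is passed in as a parameter because the boundary factor formula is not reproduced in this text, and a thread pool merely stands in for whatever parallel or distributed runtime is actually used.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def top_boundary(majority_part, minority, score, size):
    """Steps 2-3 on one set S^k: rank majority samples by score, keep the top `size`."""
    ranked = sorted(majority_part,
                    key=lambda x: score(x, majority_part, minority),
                    reverse=True)
    return ranked[:size]

def pus(majority, minority, score, k, p, seed=0):
    """Hypothetical sketch of parallel undersampling (steps 1 to 5)."""
    beta = len(minority)
    shuffled = list(majority)
    random.Random(seed).shuffle(shuffled)
    parts = [shuffled[i::k] for i in range(k)]       # step 1: K disjoint subsets
    size = int(p * beta)                             # alpha'_k = p * beta
    with ThreadPoolExecutor() as pool:               # steps 2-3, one task per subset
        boundary_sets = list(pool.map(
            lambda part: top_boundary(part, minority, score, size), parts))
    merged = [x for b in boundary_sets for x in b]   # step 4: merge boundary sets
    final_majority = top_boundary(merged, minority, score, beta)  # step 5: prune
    return final_majority, minority

def toy_score(x, majority_part, minority):
    """Assumed stand-in for BoundF: larger when x is nearer the minority mean."""
    center = sum(minority) / len(minority)
    return -abs(x - center)

majority = [float(v) for v in range(40)]   # 40 one-dimensional majority samples
minority = [50.0, 51.0, 52.0, 53.0]        # 4 minority samples
final_majority, kept_minority = pus(majority, minority, toy_score, k=4, p=2)
```

In this toy run the retained majority samples are the ones nearest the minority class, and their count equals the minority count, which mirrors the balance condition of step 5. For CPU-bound scoring in Python, a process pool or a cluster framework would be the more realistic choice.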

The method for identifying biological event trigger words in a big data set provided by this technical scheme targets the large data sets and unbalanced class distribution found in biological event recognition tasks. It provides a parallel undersampling method (PUS) and combines it with an SVM classifier to construct a PUS-SVM trigger word recognition system, which effectively improves both the performance and the efficiency of trigger word recognition. PUS reduces data imbalance by sampling on the basis of class boundaries, and the undersampling can be carried out with parallel distributed computation, which effectively reduces the computational complexity for large data sets.

Drawings

FIG. 1 is a flowchart of the method for identifying biological event trigger words in a big data set according to the present invention;

FIG. 2 is a comparison of the time consumption of the undersampling methods in the biological trigger word recognition system on the data set Data_BioNLP09;

FIG. 3 is a comparison of the time consumption of the undersampling methods in the biological trigger word recognition system on the data set Data_BioNLP11.

Detailed Description

So that the objects and advantages of the invention may be understood more clearly, the invention is described below with reference to embodiments. The following text merely illustrates one or more specific embodiments of the invention and does not strictly limit the scope of protection specifically claimed.

The technical scheme adopted by the invention is shown in FIG. 1. The method can process large training data sets with a pronounced distribution skew between classes. It is an undersampling method that adjusts the data distribution between classes by reducing the sample instances belonging to the majority class. The method selects data on the basis of boundary factors (BoundFactor, BoundF), which measure how important the information carried by each sample instance is to the classification. After undersampling, sample instances at the center of a class are discarded, while sample instances at the class boundary are retained. In the subsequent recognition of biological event trigger words, an SVM serves as the classifier; its classification hyperplane is computed from the sample instances at the classification boundary, i.e. the support vectors. The parallel undersampling method (PUS) therefore retains the sample instances most likely to contain the most classification information and to help the SVM classify.

The method for identifying biological event trigger words in a big data set disclosed by the invention is a parallel undersampling method (PUS) comprising steps 1 to 5 as set out in the Disclosure of Invention. The following details apply to those steps.

In step 1, $\alpha_k$ denotes the number of sample instances in each majority-class sub-data set $D_\alpha^k$, so that

$$\alpha = \sum_{k=1}^{K} \alpha_k.$$

$D_j$ is one of the minority-class data sets and contains $n_j$ sample instances belonging to class $y = j$, $j = 1, \ldots, l$. Data set $D$ has a significant distribution deviation, so $n_0 \gg n_j$. Let $\beta$ denote the number of samples in all minority-class data sets; then

$$\beta = \sum_{j=1}^{l} n_j$$

and $D_\beta = \bigcup_{j=1}^{l} D_j$, which gives $\alpha > \beta$. The majority-class sub-data sets $D_\alpha^k$ in step 1 are all of the same size.
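The random, equal-size partition of the majority class can be sketched as follows. This is an illustrative sketch only: `partition_majority` is a hypothetical helper name, and the shuffle-then-stride scheme is one assumed way of obtaining disjoint subsets whose sizes differ by at most one.

```python
import random

def partition_majority(majority, k, seed=0):
    """Randomly split majority-class samples into k disjoint, near-equal subsets."""
    if k < 1:
        raise ValueError("k must be at least 1")
    shuffled = list(majority)
    random.Random(seed).shuffle(shuffled)
    # Taking every k-th element of the shuffled list yields disjoint parts
    # whose sizes differ by at most one.
    return [shuffled[i::k] for i in range(k)]

majority = [("tok%d" % i, 0) for i in range(10)]   # ten samples of class y = 0
subsets = partition_majority(majority, k=3)
sizes = [len(s) for s in subsets]
```

When the majority count is not a multiple of K, the subsets cannot be exactly equal in size; the sketch keeps them within one sample of each other.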

In step 2, each data set $S^k$ is the union of the corresponding majority-class sub-data set $D_\alpha^k$ and the minority-class data set $D_\beta$, expressed as

$$S^k = D_\alpha^k \cup D_\beta.$$

The distance $d(x, C_j)$ is built from components that each measure the distance from sample instance $x$ to a given class $C_j$ in the $t$-th feature dimension. To reduce the amount of calculation, $d(x, C_j)$ is computed against the centroid of class $C_j$ rather than against all samples of $C_j$. Since the bio-trigger recognition data set is text, each component is defined as the distance from the text vector to the centroid of class $C_j$.

In formula (2), the term frequency $TF(f_t \mid C_j)$ of class $C_j$ is averaged over the $n_j$ samples of $C_j$ to give the centroid of class $C_j$. On the basis of $d(x, C_j)$, the degree of membership $\mu_j(x)$ of each sample instance $x$ in class $C_j$ is defined as follows:

[formula (3) and its accompanying condition not reproduced]

From formula (3), the smaller the distance from $x$ to the centroid, the greater the degree of membership of $x$ in $C_j$; conversely, the larger the distance from $x$ to the centroid, the smaller the degree of membership of $x$ in $C_j$. The boundary factor $\mathrm{BoundF}(x)$ of sample instance $x$ is defined as follows:

[boundary factor formula not reproduced]

The boundary factor $\mathrm{BoundF}(x)$ is the product of two parts. The first part is the entropy that expresses the uncertainty of the sample instance belonging to each class: the closer a sample instance is to the boundary of a class, the greater its membership entropy. The second part is the average distance from the sample instance to all classes: the deeper a sample instance lies inside a class, the smaller its average distance, and the closer it lies to a class boundary, the larger its average distance. From these two parts it follows that a sample instance at a class boundary has a larger $\mathrm{BoundF}(x)$ value than a sample instance inside a class and carries more classification information.
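The entropy-times-average-distance structure described above can be illustrated with a short Python sketch. Several pieces are assumptions, because the corresponding formulas are not reproduced in this text: the Euclidean distance to the centroid, the inverse-distance form of the membership (chosen only so that memberships fall as distance grows and sum to 1), and the natural logarithm in the entropy. All function names are hypothetical.

```python
import math

def centroid(samples):
    """Mean feature vector of a class, used as the class centroid."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]

def distance(x, c):
    """Euclidean distance from x to a centroid (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c)))

def memberships(dists, eps=1e-12):
    """Inverse-distance memberships normalized to sum to 1 (assumed form)."""
    inv = [1.0 / (d + eps) for d in dists]
    total = sum(inv)
    return [v / total for v in inv]

def bound_factor(x, centroids):
    """Membership entropy multiplied by the mean distance to all classes."""
    dists = [distance(x, c) for c in centroids]
    mu = memberships(dists)
    entropy = -sum(m * math.log(m) for m in mu if m > 0)
    return entropy * sum(dists) / len(dists)

c0 = centroid([[0.0, 0.0], [0.0, 2.0]])   # class 0 centroid at (0, 1)
c1 = centroid([[4.0, 0.0], [4.0, 2.0]])   # class 1 centroid at (4, 1)
inner = bound_factor([0.1, 1.0], [c0, c1])      # sample close to the class 0 centroid
boundary = bound_factor([2.0, 1.0], [c0, c1])   # sample midway between the classes
```

Under these assumptions the midway sample scores higher than the sample near a centroid, which matches the stated behaviour that boundary instances receive larger boundary factors.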

In step 3, the number of samples is $\alpha'_k = p \times \beta$, where $p$ is a tunable parameter of the PUS algorithm. If the data set contains noisy data, the noisy data is deleted before the boundary sample instances are sampled.

To compare the method for identifying biological event trigger words in a big data set provided by the invention with other methods, two corpora are used, Data_BioNLP09 and Data_BioNLP11. In the BioNLP'09 and BioNLP'11 shared tasks, trigger recognition is only an intermediate step of the biomedical event extraction task, so the test sets provide no trigger recognition results; the validation data set is therefore used as the test data set.

Table 1 List of types and numbers of trigger words and non-trigger words in Data_BioNLP09

Table 1 gives a detailed statistical analysis of the training and validation data sets of Data_BioNLP09. The trigger recognition task has 9 trigger word types (corresponding to 9 biomedical event types), which fall into three groups: simple event trigger words, binding event trigger words and complex event trigger words. Together with the negative class of non-trigger words, this gives a 10-class classification problem. The data in the table show a pronounced class imbalance: only about 3.74% of the training data belongs to one of the trigger classes, while the rest belongs to the non-trigger class.
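To make the scale of this imbalance concrete, the stated 3.74% trigger share can be turned into a majority-to-minority ratio. The sketch below uses only that rounded percentage, so the results are approximate; the variable names are illustrative.

```python
trigger_share = 0.0374                     # about 3.74% of training instances are triggers
non_trigger_share = 1.0 - trigger_share    # the remaining instances are non-triggers

imbalance_ratio = non_trigger_share / trigger_share   # roughly n_0 / beta
kept_fraction = trigger_share / non_trigger_share     # beta / n_0 once classes are balanced
removed_fraction = 1.0 - kept_fraction                # share of non-triggers discarded

print(round(imbalance_ratio, 1), round(removed_fraction, 3))
```

On these figures there are roughly 25.7 non-trigger instances per trigger instance, so balancing the two sides means discarding about 96% of the non-trigger instances.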

Table 2 List of types and numbers of trigger words and non-trigger words in Data_BioNLP11

Table 2 gives a detailed statistical analysis of the training and validation data sets of Data_BioNLP11.

Table 3 Comparison of the performance of the present invention with other existing systems

Biological trigger word recognition system	P	R	F1
PUS-SVM	69.7	69.9	68.3
System CRF	65.0	30.2	41.2
System SVM	70.2	52.6	60.1
Turku	70.5	60.6	65.2
TrigNER	69.3	57.3	62.7

On the same data set, Data_BioNLP09, the performance of the present invention was compared with that of other existing recognition systems. Table 3 lists three metrics for the present invention and four other recognition systems: P (precision), R (recall) and F1 (the F value, i.e. the harmonic mean of precision and recall). The results in Table 3 show that the PUS-SVM biological trigger word recognition system proposed in the present invention achieves the best overall performance and differs significantly from the other systems. Moreover, the improvement in system performance comes mainly from higher recall, meaning that more candidate trigger words are identified, which can further improve the performance of the next stage of event recognition.
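As a small illustration of how the F1 column relates to the other two, the sketch below recomputes F1 as the harmonic mean of precision and recall for the four comparison systems in Table 3. The helper name `f1` and the dictionary layout are illustrative choices; the numbers are the ones listed in the table.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (P, R, reported F1) for the comparison systems listed in Table 3.
baselines = {
    "System CRF": (65.0, 30.2, 41.2),
    "System SVM": (70.2, 52.6, 60.1),
    "Turku": (70.5, 60.6, 65.2),
    "TrigNER": (69.3, 57.3, 62.7),
}

recomputed = {name: round(f1(p, r), 1) for name, (p, r, _) in baselines.items()}
```

Because the harmonic mean is pulled toward the smaller of the two values, a system with low recall, such as System CRF, receives a low F1 even when its precision is comparable to the others.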

Table 4 Performance comparison of the undersampling methods in the biological trigger word recognition system on the data set Data_BioNLP09

Table 5 Performance comparison of the undersampling methods in the biological trigger word recognition system on the data set Data_BioNLP11

Tables 4 and 5 analyze the performance of SVM-based recognition systems with different undersampling methods: PUS-SVM, which uses the parallel undersampling method proposed in the present invention; SVM without sampling; RUS-SVM, which uses random undersampling; and kUS-SVM, which uses k-nearest-neighbor undersampling. The comparative experiments were carried out separately on the data sets Data_BioNLP09 and Data_BioNLP11. The parallel undersampling system proposed in the present invention achieves the best overall F1 performance on both data sets.

To further analyze the efficiency of the present invention on large data sets, the running times of the above methods (PUS-SVM, SVM without sampling, RUS-SVM with random undersampling, and kUS-SVM with k-nearest-neighbor undersampling) were compared on a machine with 8 processing cores @ 2.67 GHz and RAM. The comparative experiments were carried out separately on the data sets Data_BioNLP09 and Data_BioNLP11, and the results are shown in FIG. 2 and FIG. 3. As FIG. 2 and FIG. 3 show, the method for identifying biological event trigger words in a big data set according to the present invention obtains the best trigger word recognition results within an acceptable time.

The present invention is not limited to the above embodiments. Having learned the content of the invention, those skilled in the art can make various equivalent changes and substitutions without departing from its principle, and such equivalent changes and substitutions shall be regarded as falling within the scope of protection of the present invention.

Claims

1. A method for identifying biological event trigger words in a big data set, the method being a parallel undersampling method, characterized by comprising the following steps:

step 1, data segmentation: define the data set $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ as the training data set, where $x_i \in \mathbb{R}^m$, $i = 1, 2, \ldots, n$, is a sample instance, $\mathbb{R}^m$ denotes the $m$-dimensional real numbers, and $y_i \in \{0, 1, \ldots, l\}$, $i = 1, 2, \ldots, n$, is the class to which the sample instance belongs, with $1 + l$ class labels in total; define $D_\alpha$ as the majority-class data set, containing $n_j$, $j = 0$, sample instances belonging to class $y = 0$, and let $\alpha = n_j$, $j = 0$; randomly partition the majority-class data set $D_\alpha$ into $K$ mutually disjoint majority-class sub-data sets $D_\alpha^k$, $k = 1, \ldots, K$, with $\alpha_k$ denoting the number of sample instances in each majority-class sub-data set $D_\alpha^k$; define $D_\beta$ as the minority-class data set, where $D_j$ is one of the minority-class data sets and contains $n_j$, $j = 1, 2, \ldots, l$, sample instances belonging to class $y = j$, i.e. $D_\beta = \bigcup_{j=1}^{l} D_j$, where $\beta$ denotes the number of samples in all minority-class data sets, so that

$$\beta = \sum_{j=1}^{l} n_j,$$

namely

$$\alpha > \beta;$$

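step 2, boundary factor calculation: define each data set $S^k$ as the union of the corresponding majority-class sub-data set $D_\alpha^k$ and the minority-class data set $D_\beta$, expressed as

$$S^k = D_\alpha^k \cup D_\beta;$$

after the feature extraction step, each sample instance $x$, $x \in \mathbb{R}^m$, of data set $S^k$ is represented by the $m$-dimensional features $F = \{f_t\}$, $t = 1, \ldots, m$, where $f_t$ denotes the feature of each dimension; the boundary factor of each sample is calculated from the uncertainty of its belonging to all classes, and is determined mainly by the distance $d(x, C_j)$ from each sample instance $x$ in $S^k$ to a given class $C_j$, the distance being defined as follows:

[distance and membership formulas not reproduced]

The boundary factor $\mathrm{BoundF}(x)$ of sample instance $x$ is defined as follows:

[boundary factor formula not reproduced]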

2. The method for identifying biological event trigger words in a big data set according to claim 1, wherein the majority-class sub-data sets $D_\alpha^k$ in said step 1 are all of the same size.

3. The method for identifying biological event trigger words in a big data set according to claim 1, wherein in said step 3, if the data set contains noisy data, the noisy data is deleted before the boundary sample instances are sampled.