CN106933805B - Method for identifying biological event trigger words in big data set - Google Patents

Info

Publication number
CN106933805B
Authority
CN
China
Prior art keywords
data set
sample
class
boundary
undersampling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201710148320.5A
Other languages
Chinese (zh)
Other versions
CN106933805A (en)
Inventor
陈一飞
刘峰
韩冰青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of biological event trigger word identification, and in particular to a method for identifying biological event trigger words in a large data set. The method is a parallel undersampling (PUS) method comprising the steps of data segmentation, boundary factor calculation, sample undersampling, boundary set merging, and final pruning. It can process a large training data set with a pronounced distribution skew among classes, and does so by reducing, in parallel, the sample instances that belong to the majority class. The method selects data based on boundary factors, which measure how important the information carried by each sample instance is to classification. The method provided by this technical scheme addresses both the large data volume and the unbalanced distribution of samples among classes, and thereby achieves better identification of biological event trigger words.

Description

Method for identifying biological event trigger words in big data set
Technical Field
The invention relates to the technical field of identification methods of biological event trigger words, in particular to an identification method of biological event trigger words in a big data set.
Background
With the advance of information technology and the growing reach of the internet, biomedical electronic literature, as a product of scientific research, is growing exponentially, and online literature resources contain a large amount of valuable biomedical knowledge urgently needed by systems biology research. Faced with this continuing proliferation of biomedical text, text mining is widely applied in the biomedical field as a technique for extracting important knowledge hidden in documents.
Biological event extraction is the process of automatically detecting descriptions of interactions between biomolecules such as genes and proteins in large volumes of medical research literature, so as to extract structured information for predefined event types. If biological event trigger words can be recognized accurately, event extraction performance improves greatly. Trigger word recognition is the first step of biological event extraction: the recognized trigger word is the basis of event element recognition and the core of the whole event. Trigger word recognition must also determine the category of the trigger word, which is the category of the whole event; if a trigger word is recognized incorrectly, the subsequent work loses its meaning. Good trigger word recognition is therefore the key to biomedical event extraction. Among existing approaches, methods based on support vector machines (SVM) with rich feature representations are the most common and best-performing machine learning (ML) models for event trigger word recognition. In practical applications, however, there are two key issues concerning the complexity of the data. First, the distribution of data among classes is unbalanced. Second, the training data set is large. Many classification algorithms have significant limitations on large data sets, and their performance degrades. For example, the training complexity of an SVM depends heavily on the size of the data set, so training on a large data set is time-consuming. Large data sets and highly unbalanced data distribution therefore pose a great challenge for event trigger word recognition.
For large data sets, undersampling is the most efficient technique: it constructs a balanced data set by removing some sample instances of the majority class, which also reduces computational complexity. Undersampling therefore remains effective on big data, and many more efficient undersampling methods have been proposed. Clustering-based undersampling addresses unbalanced data distribution by clustering the data set: the training data is divided into several clusters, representative sample instances are selected from the majority-class clusters in proportion, and these are combined with the minority-class instances to form a balanced data set. Combining clustering-based undersampling with ensemble learning can effectively address the unbalanced data problem. In addition, inverse random undersampling (IRUS) constructs composite decision boundaries between classes by heavily and randomly undersampling the majority class. Although these methods alleviate the unbalanced learning problem to some extent, they still need a significant amount of time to cluster iteratively or to find nearest-neighbor boundaries, so they are not truly efficient on large data sets.
To overcome the bottleneck of SVM training complexity on large data sets, various methods have also been proposed. For example, Sequential Minimal Optimization (SMO) decomposes a large quadratic programming (QP) problem into a series of the smallest possible QP problems, which allows it to handle large training sets. Another method uses minimum enclosing ball (MEB) clustering: the training data is partitioned by the MEB method, and the cluster centers are used for SVM classification. However, these methods do not help with the classification of unbalanced data.
Existing methods thus cannot adequately solve, at the same time, the problems of large data volume and unbalanced sample distribution among classes, both of which are central to biological event trigger word recognition.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method for identifying biological event trigger words in a large data set that addresses both the large data volume and the unbalanced distribution of samples among classes, and thereby achieves better identification of biological event trigger words.
To solve this technical problem, the invention adopts the following technical scheme: a method for identifying biological event trigger words in a large data set, which is a parallel undersampling (PUS) method and comprises the following steps:
step 1, data segmentation: define the data set $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$ as the training data set, where $x_i \in R^m$ is a sample instance and $y_i \in \{0, 1, \dots, l\}$ is the class to which the sample instance belongs, giving $1 + l$ class labels in total; define $D_\alpha$ as the majority-class data set, containing $n_0$ sample instances belonging to class $y = 0$, and let $\alpha = n_0$; randomly partition the majority-class data set $D_\alpha$ into $K$ mutually disjoint majority-class sub-data sets $D_\alpha^k$, $k = 1, \dots, K$; let $\alpha_k$ denote the number of samples in each majority-class sub-data set $D_\alpha^k$, so that
$$\alpha = \sum_{k=1}^{K} \alpha_k$$
define $D_\beta$ as the minority-class data set, i.e. $D_\beta = \cup D_j$, $j = 1, 2, \dots, l$, where each minority-class data set $D_j$ contains $n_j$ sample instances and $\beta$ denotes the number of samples in all the minority-class data sets, so that
$$\beta = \sum_{j=1}^{l} n_j$$
and $\alpha > \beta$;
step 2, boundary factor calculation: define each data set $S_k$ as containing the corresponding majority-class sub-data set $D_\alpha^k$ and the minority-class data set $D_\beta$, expressed as $S_k = D_\alpha^k \cup D_\beta$; after the feature extraction step, $S_k$ is represented by $m$-dimensional features $F = \{f_t\}$, $t = 1, \dots, m$; the boundary factor of each sample is calculated from the uncertainty of its belonging to each class, which is determined mainly by the distance $d(x, C_j)$ from each sample instance $x$ in $S_k$ to a given class $C_j$; the distance is defined as follows:
Figure GDA0002281987270000031
where $d_t(x, C_j)$ is the distance component from sample instance $x$ to the given class $C_j$ in the $t$-th feature dimension; since the biological trigger word recognition data set is text, $d_t(x, C_j)$ is defined as the distance from the text vector to the centroid of class $C_j$, the centroid being the average of the term frequency $TF(f_t|C_j)$:
Figure GDA0002281987270000034
on the basis of $d(x, C_j)$, the membership degree $\mu_j(x)$ of each sample instance $x$ in class $C_j$ is defined as follows:
Figure GDA0002281987270000035
and
Figure GDA0002281987270000036
The boundary factor BoundF(x) of sample instance $x$ is defined as follows:
Figure GDA0002281987270000037
step 3, sample undersampling: sort the calculated BoundF(x) values and extract the $\alpha'_k$ samples with the largest BoundF(x) values as boundary sample instances, forming a boundary set
Figure GDA0002281987270000038
where the number of samples is $\alpha'_k = p \times \beta$, and $p$ is a tunable parameter of the PUS algorithm;
step 4, boundary set merging: merge all the boundary sets produced by the parallel undersampling of steps 2 and 3 to obtain a new majority-class data set $D'_\alpha$, and combine it with all the minority classes to obtain a new training data set $D' = D'_\alpha \cup D_\beta$;
step 5, pruning: repeat step 2 and the undersampling of step 3 on the training data set $D'$ to obtain the final training data set $D''$, which contains the $\alpha''$ majority-class samples with the largest BoundF(x) values, so that the number of majority-class samples balances the number of minority-class samples, i.e. $\alpha'' = \beta$.
The method for identifying biological event trigger words in a large data set provided by this technical scheme targets the problem that, in the biological event identification task, the data set is large and the distribution among sample classes is unbalanced. It provides a parallel undersampling (PUS) method and combines it with an SVM classifier to build a PUS-SVM trigger word identification system, which effectively improves both the performance and the efficiency of trigger word identification. The PUS method uses sampling based on class boundaries to reduce the imbalance of the data, and can carry out the undersampling with parallel distributed computation, which effectively reduces the computational cost on a large data set.
Drawings
FIG. 1 is a flowchart illustrating the operation of a method for identifying a trigger word of a biological event in a big data set according to the present invention;
FIG. 2 is a time consumption comparison graph of the undersampling methods in the biological trigger word recognition system, based on the same data set $Data_{BioNLP09}$;
FIG. 3 is a time consumption comparison graph of the undersampling methods in the biological trigger word recognition system, based on the same data set $Data_{BioNLP11}$.
Detailed Description
In order that the objects and advantages of the invention will be more clearly understood, the following description is given in conjunction with the accompanying examples. It is to be understood that the following text is merely illustrative of one or more specific embodiments of the invention and does not strictly limit the scope of the invention as specifically claimed.
The technical scheme adopted by the invention is shown in FIG. 1. The method can process a large training data set with a pronounced distribution skew between classes. It is an undersampling method: it adjusts the data distribution between classes by reducing the sample instances belonging to the majority class. The method selects data based on boundary factors (BoundFactor, BoundF), which measure how important the information carried by each sample instance is to classification. After undersampling, sample instances at the center of a class are discarded, while sample instances at the class boundary are retained. In the subsequent recognition of biological event trigger words, an SVM serves as the classifier; its classification hyperplane is computed from the sample instances at the classification boundary, i.e. the support vectors. The parallel undersampling method (PUS) therefore retains the sample instances most likely to carry the classification information that helps the SVM classify.
The invention discloses a method for identifying a biological event trigger word in a big data set, which is a parallel undersampling method (PUS), and comprises the following steps:
Step 1, data segmentation: define the data set $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$ as the training data set, where $x_i \in R^m$ is a sample instance and $y_i \in \{0, 1, \dots, l\}$ is the class to which the sample instance belongs, giving $1 + l$ class labels in total. Define $D_\alpha$ as the majority-class data set, containing $n_0$ sample instances belonging to class $y = 0$, and let $\alpha = n_0$. Randomly partition the majority-class data set $D_\alpha$ into $K$ mutually disjoint majority-class sub-data sets $D_\alpha^k$, $k = 1, \dots, K$. Let $\alpha_k$ denote the number of samples in each majority-class sub-data set $D_\alpha^k$, so that
$$\alpha = \sum_{k=1}^{K} \alpha_k$$
$D_j$ is one of the minority-class data sets and contains $n_j$ sample instances belonging to class $y = j$, $j = 1, \dots, l$. The data set $D$ has a pronounced distribution skew, so $n_0 \gg n_j$. Let $\beta$ denote the number of samples in all the minority-class data sets; then
$$\beta = \sum_{j=1}^{l} n_j$$
and $D_\beta = \cup D_j$, $j = 1, 2, \dots, l$, so that $\alpha > \beta$. In step 1, all the sub-data sets $D_\alpha^k$ are of the same size.
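The data segmentation of step 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: the function name `partition_majority`, the fixed `seed`, and the use of integers as stand-ins for sample instances are assumptions made for the example.

```python
import random

def partition_majority(majority, k, seed=0):
    """Randomly split the majority-class samples into k disjoint subsets.

    Hypothetical helper: subset sizes differ by at most one sample.
    """
    shuffled = list(majority)
    random.Random(seed).shuffle(shuffled)
    # Striding over the shuffled list keeps the subsets disjoint and near-equal in size.
    return [shuffled[i::k] for i in range(k)]

majority = list(range(10))               # stand-in for alpha = 10 majority samples
subsets = partition_majority(majority, k=3)
sizes = [len(s) for s in subsets]        # alpha_k for each sub-data set
```

The subset sizes sum to the size of the majority class, and every majority sample lands in exactly one subset, which is what allows the later steps to run on each subset independently.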
Step 2, boundary factor calculation: define each data set $S_k$ as containing the corresponding majority-class sub-data set $D_\alpha^k$ and the minority-class data set $D_\beta$, expressed as $S_k = D_\alpha^k \cup D_\beta$. After the feature extraction step, $S_k$ is represented by $m$-dimensional features $F = \{f_t\}$, $t = 1, \dots, m$. The boundary factor of each sample is calculated from the uncertainty of its belonging to each class, which is determined mainly by the distance $d(x, C_j)$ from each sample instance $x$ in $S_k$ to a given class $C_j$. The distance is defined as follows:
Figure GDA0002281987270000058
where $d_t(x, C_j)$ is the distance component from sample instance $x$ to the given class $C_j$ in the $t$-th feature dimension. To reduce the amount of computation, $d(x, C_j)$ is measured against the centroid of class $C_j$ rather than against all samples of $C_j$. Since the biological trigger word recognition data set is text, $d_t(x, C_j)$ is defined as the distance from the text vector to the centroid of class $C_j$, the centroid being the average of the term frequency $TF(f_t|C_j)$:
Figure GDA00022819872700000511
In formula (2), the term frequency $TF(f_t|C_j)$ of class $C_j$ is averaged over the $n_j$ samples of $C_j$ to give the centroid of class $C_j$. On the basis of $d(x, C_j)$, the membership degree $\mu_j(x)$ of each sample instance $x$ in class $C_j$ is defined as follows:
Figure GDA0002281987270000061
and
Figure GDA0002281987270000062
From formula (3), the smaller the distance from $x$ to the centroid, the greater the membership degree of $x$ in $C_j$; conversely, the larger the distance from $x$ to the centroid, the smaller the membership degree of $x$ in $C_j$. The boundary factor BoundF(x) of sample instance $x$ is defined as follows:
Figure GDA0002281987270000063
The boundary factor BoundF(x) is the product of two parts. The first part is an entropy that expresses the uncertainty of the sample instance belonging to each class: the closer a sample instance is to the boundary of a class, the greater the entropy of its memberships. The second part is the average distance from the sample instance to all classes: the deeper a sample instance lies inside a class, the smaller its average distance, and the closer it lies to the boundary of a class, the larger its average distance. Taken together, the two parts show that a sample instance at the boundary of a class has a larger BoundF(x) value than a sample instance inside the class, and carries more classification information.
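The following Python sketch illustrates the idea of an entropy-times-average-distance boundary factor. The exact formulas (1) to (4) appear only as images in this text, so the concrete choices here are assumptions and not the patent's definitions: Euclidean distance to the class centroid, memberships proportional to inverse distance and normalized to sum to 1, and Shannon entropy with the natural logarithm. All function names are hypothetical.

```python
import math

def centroid(samples):
    """Mean feature vector of a class, used as its centroid."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]

def distance(x, c):
    """ASSUMPTION: Euclidean distance from sample x to centroid c."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c)))

def memberships(dists, eps=1e-12):
    """ASSUMPTION: inverse-distance memberships, normalized to sum to 1."""
    inv = [1.0 / (d + eps) for d in dists]
    total = sum(inv)
    return [v / total for v in inv]

def bound_factor(x, centroids):
    """Entropy of the memberships times the mean distance to all class centroids."""
    dists = [distance(x, c) for c in centroids]
    mu = memberships(dists)
    entropy = -sum(m * math.log(m) for m in mu if m > 0)
    return entropy * sum(dists) / len(dists)

class0 = [[0.0, 0.0], [0.0, 2.0], [2.0, 0.0], [2.0, 2.0]]   # centroid (1, 1)
class1 = [[5.0, 1.0], [7.0, 1.0]]                           # centroid (6, 1)
cents = [centroid(class0), centroid(class1)]
inner = bound_factor([1.0, 1.0], cents)    # sample at the class-0 centroid
border = bound_factor([3.5, 1.0], cents)   # sample midway between the centroids
```

Under these assumptions the sample midway between the two centroids receives equal memberships, hence maximal entropy, and scores far higher than the sample sitting on a centroid, which matches the qualitative behavior described in the text.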
Step 3, sample undersampling: sort the calculated BoundF(x) values and extract the $\alpha'_k$ samples with the largest BoundF(x) values as boundary sample instances, forming a boundary set
Figure GDA0002281987270000064
where the number of samples is $\alpha'_k = p \times \beta$, and $p$ is a tunable parameter of the PUS algorithm. In step 3, if the data set contains noisy data, the noisy data is deleted before the boundary sample instances are sampled.
Step 4, boundary set merging: merge all the boundary sets produced by the parallel undersampling of steps 2 and 3 to obtain a new majority-class data set $D'_\alpha$, and combine it with all the minority classes to obtain a new training data set $D' = D'_\alpha \cup D_\beta$.
Step 5, pruning: repeat step 2 and the undersampling of step 3 on the training data set $D'$ to obtain the final training data set $D''$, which contains the $\alpha''$ majority-class samples with the largest BoundF(x) values, so that the number of majority-class samples balances the number of minority-class samples, i.e. $\alpha'' = \beta$.
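Steps 3 to 5 can be sketched as a small Python pipeline. This is a simplified illustration under stated assumptions: `score` is a fixed stand-in for BoundF(x) (the method described above recomputes the boundary factors on each $S_k$ and again on $D'$), threads stand in for whatever parallel or distributed runtime is actually used, and the names `top_by_score` and `pus` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def top_by_score(samples, score, count):
    """Keep the `count` samples with the largest score."""
    return sorted(samples, key=score, reverse=True)[:count]

def pus(subsets, minority, score, p):
    """Undersample each majority subset in parallel, merge, then prune to balance."""
    beta = len(minority)
    keep = int(p * beta)                  # alpha'_k = p * beta per subset
    with ThreadPoolExecutor() as pool:
        # Step 3: each subset is reduced independently of the others.
        boundary_sets = list(
            pool.map(lambda s: top_by_score(s, score, keep), subsets)
        )
    # Step 4: merge the boundary sets into the new majority set D'_alpha.
    merged = [x for b in boundary_sets for x in b]
    # Step 5: prune so that the majority count equals beta.
    pruned = top_by_score(merged, score, beta)
    return pruned + list(minority)

subsets = [[0, 3, 6, 9], [1, 4, 7, 10], [2, 5, 8, 11]]   # majority sub-data sets
minority = [20, 21]                                      # beta = 2
# Stand-in score: larger values are treated as closer to the class boundary.
result = pus(subsets, minority, score=lambda x: x, p=1.5)
```

In this toy run each subset keeps its top three values, the nine survivors are merged, and pruning leaves the two highest-scoring majority samples next to the two minority samples, so the final set is balanced.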
To compare the method for identifying biological event trigger words in a big data set provided by the invention with other methods, two corpora are used, namely $Data_{BioNLP09}$ and $Data_{BioNLP11}$. In the BioNLP'09 and BioNLP'11 shared tasks, trigger word recognition is only an intermediate step of the biomedical event extraction task, so the test sets provide no trigger recognition results; the validation data set is therefore used as the test data set.
Table 1. Types and numbers of trigger words and non-trigger words in $Data_{BioNLP09}$
Figure GDA0002281987270000071
Table 1 shows detailed statistics of the training and validation data sets of $Data_{BioNLP09}$. The trigger recognition task has 9 trigger word types (corresponding to 9 biomedical events), which fall into three groups: simple event trigger words, binding event trigger words, and complex event trigger words. Together with the negative class of non-trigger words, this is a 10-class classification problem. The data in the table show a pronounced class imbalance: only about 3.74% of the training data belongs to one of the trigger classes, while the rest belongs to the non-trigger class.
Table 2. Types and numbers of trigger words and non-trigger words in $Data_{BioNLP11}$
Figure GDA0002281987270000072
Table 2 shows detailed statistics of the training and validation data sets of $Data_{BioNLP11}$.
Table 3. Performance comparison of the present invention with other existing systems
Biological trigger word recognition system    P       R       F1
PUS-SVM                                       69.7    69.9    68.3
System CRF                                    65.0    30.2    41.2
System SVM                                    70.2    52.6    60.1
Turku                                         70.5    60.6    65.2
TrigNER                                       69.3    57.3    62.7
The performance of the present invention was compared with other existing recognition systems on the same data set $Data_{BioNLP09}$. Table 3 lists three metrics for the present invention and the four other recognition systems: P (precision), R (recall), and F1 (the F value, which is the harmonic mean of precision and recall). The results in Table 3 show that the PUS-SVM system for recognizing biological trigger words proposed in the present invention achieves the best overall performance and differs significantly from the other systems. Moreover, the improvement in system performance comes from higher recall, meaning that more candidate trigger words are identified, which can further improve the performance of the next stage, event recognition.
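As a quick check of how the F1 column relates to P and R, the standard F1 score is the harmonic mean of precision and recall. The short sketch below applies that formula to the precision and recall listed for the System CRF row of Table 3; the function name `f1` is just a label chosen for the example.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# P and R of the "System CRF" row in Table 3, in percent.
value = round(f1(65.0, 30.2), 1)
```

Rounded to one decimal place, the result is 41.2, the F1 value listed for that row.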
Table 4. Performance comparison of the undersampling methods in the biological trigger word recognition system, based on the same data set $Data_{BioNLP09}$
Figure GDA0002281987270000081
Table 5. Performance comparison of the undersampling methods in the biological trigger word recognition system, based on the same data set $Data_{BioNLP11}$
Figure GDA0002281987270000082
Figure GDA0002281987270000091
Tables 4 and 5 analyze the performance of SVM-based recognition systems under different undersampling methods: PUS-SVM (the parallel undersampling method proposed in the present invention), SVM without sampling, RUS-SVM with random undersampling, and kUS-SVM with k-nearest-neighbor undersampling. The comparative experiments were performed on the data sets $Data_{BioNLP09}$ and $Data_{BioNLP11}$ separately. The parallel undersampling system proposed in the present invention achieves the best overall F1 performance on both data sets.
To further analyze the efficiency of the present invention on large data sets, the run times of the above methods (PUS-SVM, SVM without sampling, RUS-SVM with random undersampling, and kUS-SVM with k-nearest-neighbor undersampling) were compared on a machine with 8 processing cores @ 2.67 GHz and RAM. The comparative experiments were performed on the data sets $Data_{BioNLP09}$ and $Data_{BioNLP11}$ separately, and the results are shown in FIG. 2 and FIG. 3. As can be seen from FIG. 2 and FIG. 3, the method for identifying biological event trigger words in a large data set according to the present invention obtains the best trigger word recognition results within an acceptable time.
The present invention is not limited to the above embodiments, and those skilled in the art can make various equivalent changes and substitutions without departing from the principle of the present invention after learning the content of the present invention, and these equivalent changes and substitutions should be considered as belonging to the protection scope of the present invention.

Claims (3)

1. A method for identifying a biological event trigger word in a big data set is a parallel undersampling method and is characterized by comprising the following steps:
step 1, data segmentation: define the data set $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$ as the training data set, where $x_i \in R^m$, $i = 1, 2, \dots, n$, is a sample instance, $R^m$ denotes the $m$-dimensional real space, and $y_i \in \{0, 1, \dots, l\}$, $i = 1, 2, \dots, n$, is the class to which the sample instance belongs, giving $1 + l$ class labels in total; define $D_\alpha$ as the majority-class data set, containing $n_j$, $j = 0$, sample instances belonging to class $y = 0$, and let $\alpha = n_j$, $j = 0$; randomly partition the majority-class data set $D_\alpha$ into $K$ mutually disjoint majority-class sub-data sets $D_\alpha^k$, $k = 1, \dots, K$; let $\alpha_k$ denote the number of sample instances in each majority-class sub-data set $D_\alpha^k$; define $D_\beta$ as the minority-class data set, where $D_j$ is one of the minority-class data sets and contains $n_j$, $j = 1, 2, \dots, l$, sample instances belonging to class $y = j$, i.e. $D_\beta = \{\cup D_j\}$, $j = 1, 2, \dots, l$, where $\beta$ denotes the number of samples in all the minority-class data sets, so that
Figure FDA0002364715100000013
namely
Figure FDA0002364715100000014
and $\alpha > \beta$;
step 2, boundary factor calculation: define each data set $S_k$ as containing the corresponding majority-class sub-data set $D_\alpha^k$ and the minority-class data set $D_\beta$, expressed as $S_k = D_\alpha^k \cup D_\beta$; after the feature extraction step, each sample instance $x$ in the data set $S_k$, $x \in R^m$, is represented by the $m$-dimensional features $F = \{f_1, \dots, f_t\}$, where $f$ denotes the feature of each dimension; the boundary factor of each sample is calculated from the uncertainty of its belonging to each class, which is determined mainly by the distance $d(x, C_j)$ from each sample instance $x$ in $S_k$ to a given class $C_j$; the distance is defined as follows:
Figure FDA0002364715100000017
where $d_t(x, C_j)$ is the distance component from sample instance $x$ to the given class $C_j$ in the $t$-th feature dimension; since the biological trigger word recognition data set is text, $d_t(x, C_j)$ is defined as the distance from the text vector to the centroid of class $C_j$, the centroid being the average of the term frequency $TF(f_t|C_j)$:
Figure FDA0002364715100000021
on the basis of $d(x, C_j)$, the membership degree $\mu_j(x)$ of each sample instance $x$ in class $C_j$ is defined as follows:
Figure FDA0002364715100000022
and
Figure FDA0002364715100000023
The boundary factor BoundF(x) of sample instance $x$ is defined as follows:
Figure FDA0002364715100000024
step 3, sample undersampling: sort the calculated BoundF(x) values and extract the $\alpha'_k$ samples with the largest BoundF(x) values as boundary sample instances, forming a boundary set
Figure FDA0002364715100000025
where the number of samples is $\alpha'_k = p \times \beta$, and $p$ is a tunable parameter of the PUS algorithm;
step 4, boundary set merging: merge all the boundary sets produced by the parallel undersampling of steps 2 and 3 to obtain a new majority-class data set $D'_\alpha$, and combine it with all the minority classes to obtain a new training data set $D' = D'_\alpha \cup D_\beta$;
step 5, pruning: repeat step 2 and the undersampling of step 3 on the training data set $D'$ to obtain the final training data set $D''$, which contains the $\alpha''$ majority-class samples with the largest BoundF(x) values, so that the number of majority-class samples balances the number of minority-class samples, i.e. $\alpha'' = \beta$.
2. The method for identifying biological event trigger words in a big data set according to claim 1, wherein all the majority-class sub-data sets $D_\alpha^k$ in said step 1 are of the same size.
3. The method for identifying trigger words of biological events in big data set according to claim 1, wherein: in said step 3, if the data set contains noise data, the noise data is deleted before sampling the boundary sample instance.
CN201710148320.5A 2017-03-14 2017-03-14 Method for identifying biological event trigger words in big data set Expired - Fee Related CN106933805B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710148320.5A CN106933805B (en) 2017-03-14 2017-03-14 Method for identifying biological event trigger words in big data set

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710148320.5A CN106933805B (en) 2017-03-14 2017-03-14 Method for identifying biological event trigger words in big data set

Publications (2)

Publication Number Publication Date
CN106933805A CN106933805A (en) 2017-07-07
CN106933805B true CN106933805B (en) 2020-04-28

Family

ID=59432925

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710148320.5A Expired - Fee Related CN106933805B (en) 2017-03-14 2017-03-14 Method for identifying biological event trigger words in big data set

Country Status (1)

Country Link
CN (1) CN106933805B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108897989B (en) * 2018-06-06 2020-05-19 大连理工大学 Biological event extraction method based on candidate event element attention mechanism

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103927874A (en) * 2014-04-29 2014-07-16 东南大学 Automatic incident detection method based on under-sampling and used for unbalanced data set
CN104965819A (en) * 2015-07-12 2015-10-07 大连理工大学 Biomedical event trigger word identification method based on syntactic word vector
CN105260361A (en) * 2015-10-28 2016-01-20 南京邮电大学 Trigger word tagging system and method for biomedical events
CN105512209A (en) * 2015-11-28 2016-04-20 大连理工大学 Biomedicine event trigger word identification method based on characteristic automatic learning

Also Published As

Publication number Publication date
CN106933805A (en) 2017-07-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
Granted publication date: 20200428
Termination date: 20210314