CN111382273A - Text classification method based on feature selection of attraction factors - Google Patents

Text classification method based on feature selection of attraction factors

Info

Publication number
CN111382273A
CN111382273A (application number CN202010158078.1A)
Authority
CN
China
Prior art keywords
attraction
texts
class
entry
factor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010158078.1A
Other languages
Chinese (zh)
Other versions
CN111382273B (en)
Inventor
周红芳
韩霜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Zhiying Wanshi Market Management Co ltd
Xi'an Huaqi Zhongxin Technology Development Co ltd
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN202010158078.1A
Publication of CN111382273A
Application granted
Publication of CN111382273B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411: Classification techniques based on the proximity to a decision surface, e.g. support vector machines
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text classification method based on feature selection with attraction factors. A data set is preprocessed for a naive Bayes classifier and a support vector machine classifier: the data set is acquired, entries that occur in more than 25% of the documents and entries that occur in fewer than 3 documents are eliminated, and a test set and a training set are divided by cross-validation. The number of feature words in the test set and training set is then set with the attraction-factor-based feature selection method to generate an optimal feature subset. The naive Bayes and support vector machine classifiers are trained in turn on the optimal feature subset of the training set to produce classifier models, and the optimal feature subset of the test set is input into the classifier models to obtain a classification result. The classification results are evaluated with two evaluation indexes, micro-average F1 and macro-average F1, to verify the performance of the method.

Description

Text classification method based on feature selection of attraction factors
Technical Field
The invention belongs to the technical field of data mining methods, and relates to a text classification method based on feature selection of attraction factors.
Background
Text classification is the task of assigning predefined categories to documents. It was traditionally performed manually by domain experts, but with the dramatic increase in the number of digital documents available on the Internet it is no longer feasible to process such a large amount of information by hand, and classification algorithms have evolved alongside IT technology. Text classification, studied in both information science and computer science, has found applications in many fields, such as information retrieval, genre classification, spam filtering and language identification. Text classification is a basic function of text information mining and a core technology for processing and organizing text data. It can effectively help people organize and sort information, largely solves the problem of information disorder, and has strong practical significance for the efficient management and effective utilization of information; text classification has therefore become one of the important research directions in the field of data mining.
Text classification is a complex system engineering task, and feature selection is one of its key technologies. Feature selection is an important problem in text classification: it can reduce the size of the feature space without sacrificing classification performance, while also avoiding overfitting. Its essence is to delete, according to some rule, the feature words that contribute little to text classification from the original high-dimensional feature space, and to select the most effective and most representative feature words to form a new feature subset. Through feature selection, feature words irrelevant to the task can be removed, greatly reducing the dimension of the text feature space and improving the efficiency and accuracy of text classification.
A distinguishing property of text classification is that even for medium-sized data sets the number of features can easily reach tens of thousands, so two problems arise in the high-dimensional case:
first, some complex algorithms cannot be used optimally in text classification; second, when most algorithms are trained on the training set, overfitting is hard to avoid, resulting in low classification accuracy. Dimension reduction has therefore long been a major research area. Meanwhile, the rapid development of text classification technology brings difficulties and challenges not met before, and substantial room for development remains for research on text classification in both theory and practice.
Disclosure of Invention
The invention aims to provide a text classification method based on feature selection of attraction factors, and solves the problem of low classification accuracy in the prior art.
The technical scheme adopted by the invention is a text classification method based on feature selection with attraction factors, which specifically comprises the following steps:
step 1: preprocessing a data set for the naive Bayes classifier NB and the support vector machine classifier SVM: acquiring a plurality of data sets that have undergone stemming and stop-word removal, eliminating entries that occur in more than 25% of the documents in the data sets as well as entries that occur in fewer than 3 documents, and dividing the data into a test set and a training set by cross-validation;
step 2: setting the number of feature words for the test set and training set obtained in step 1 using the attraction-factor-based feature selection method, to generate an optimal feature subset;
step 3: training and classifying with the naive Bayes classifier NB and the support vector machine classifier SVM in turn on the optimal feature subset of the training set obtained in step 2 to train classifier models, and inputting the optimal feature subset of the test set obtained in step 2 into the classifier models to obtain a classification result;
step 4: evaluating the classification result obtained in step 3 with the two evaluation indexes micro-average F1 and macro-average F1, to verify the performance of the attraction-factor-based feature selection method.
The invention is also characterized in that:
the data sets in step 1 are four data sets of 20Newsgroups, WebKB, K1a and K1 b.
The step 2 comprises the following specific steps:
step 2.1: calculating an attraction factor T(t_i), where the attraction factor represents the average frequency of occurrence of the term in each text of the class;
step 2.2: calculating the maximum term positive rate MT, i.e. taking the maximum of the true positive rate and the false positive rate to balance the true relevance of the terms;
step 2.3: calculating the normalized difference measure factor NDM from the true positive rate tpr and the false positive rate fpr obtained in step 2.2;
step 2.4: calculating the weight value MTFS(t_i) of each entry according to the following formula, sorting the entries, and selecting the optimal feature subset according to the number of entries,
MTFS(t_i) = MT · T(t_i) · NDM
where MT is the maximum term positive rate from step 2.2, T(t_i) is the attraction factor from step 2.1, and NDM is the normalized difference measure factor from step 2.3.
The specific steps of step 2 are as follows:
step 2.1: calculating the attraction factor T(t_i), which represents the average frequency of occurrence of the term in each text of the class:
$$T(t_i) = \frac{1}{N}\sum_{j=1}^{N} tf_{ij} \qquad (1)$$
where tf_{ij} is the number of occurrences of the term in text d_j of class C_i, and N is the total number of texts in class C_i;
step 2.2: calculating the maximum term positive rate MT, i.e. taking the maximum of the true positive rate tpr and the false positive rate fpr to balance the true relevance of the terms;
the true positive rate tpr and the false positive rate fpr are calculated as follows:
$$tpr = \frac{tp}{tp + fn} \qquad (2)$$
$$fpr = \frac{fp}{fp + tn} \qquad (3)$$
$$MT = \max(tpr, fpr)$$
where tp denotes the number of texts that contain entry t_i and belong to class C_k; fn denotes the number of texts that do not contain entry t_i but belong to class C_k; fp denotes the number of texts that contain entry t_i but do not belong to class C_k; tn denotes the number of texts that neither contain entry t_i nor belong to class C_k;
step 2.3: from the true positive rate tpr and the false positive rate fpr calculated by formulas (2) and (3) in step 2.2, the normalized difference measure factor is calculated according to the following formula,
$$NDM = \frac{|tpr - fpr|}{\min(tpr, fpr)} \qquad (4)$$
step 2.4: the weight value MTFS(t_i) of each entry is calculated according to the following formula, the entries are sorted, and the optimal feature subset is selected according to the number of entries,
$$MTFS(t_i) = MT \cdot T(t_i) \cdot NDM \qquad (5)$$
where MT is the maximum term positive rate obtained in step 2.2, T(t_i) is the attraction factor obtained in step 2.1, and NDM is the normalized difference measure factor obtained in step 2.3.
The calculation formula of the micro-average F1 in step 4 is as follows:
$$\text{micro-}F1 = \frac{2\,\bar{p}\,\bar{r}}{\bar{p} + \bar{r}} \qquad (6)$$
where the average precision $\bar{p}$ and the average recall $\bar{r}$ are the precision
$$p = \frac{tp}{tp + fp}$$
and the recall
$$r = \frac{tp}{tp + fn}$$
averaged over the test classes; tp denotes the number of texts that contain entry t_i and belong to class C_k; fn denotes the number of texts that do not contain entry t_i but belong to class C_k; fp denotes the number of texts that contain entry t_i but do not belong to class C_k;
the macroaverage-F1 calculation formula is as follows:
Figure BDA0002404802440000055
in the formula F1(k) The value of-F1 for the micro-average of the kth test class is indicated, and K indicates the total number of test classes.
The invention has the beneficial effects that:
1. The invention comprehensively considers the contribution to classification of document frequency and of the distribution of terms within and between classes. Compared with the traditional CHI, GINI, NDM and OR algorithms, it therefore has a clear advantage in classification accuracy on the data sets 20Newsgroups, WebKB, K1a and K1b, and experiments prove that the attraction-factor-based feature selection method improves classification accuracy when applied to text classification and is an effective feature selection algorithm.
2. Matched with different classifiers, the feature subsets selected by the invention and by the traditional CHI, GINI, NDM and OR algorithms were each run on the NB and SVM classifiers; the final results show that the invention performs well and achieves high classification accuracy.
Drawings
FIG. 1 is a flow chart of the text classification method based on feature selection with attraction factors of the present invention;
FIG. 2 is a line chart comparing the micro-average F1 values of the present method and the prior art when a naive Bayes classifier is used on different data sets and under different entry dimensions;
FIG. 3 is a line chart comparing the macro-average F1 values of the present method and the prior art when a naive Bayes classifier is used on different data sets and under different entry dimensions;
FIG. 4 is a line chart comparing the micro-average F1 values of the present method and the prior art when a support vector machine classifier is used on different data sets and under different entry dimensions;
FIG. 5 is a line chart comparing the macro-average F1 values of the present method and the prior art when a support vector machine classifier is used on different data sets and under different entry dimensions;
FIG. 6 is a histogram comparing the micro-average F1 values of the present method and the prior art when a naive Bayes classifier is used on different data sets and under different entry dimensions;
FIG. 7 is a histogram comparing the macro-average F1 values of the present method and the prior art when a naive Bayes classifier is used on different data sets and under different entry dimensions;
FIG. 8 is a histogram comparing the micro-average F1 values of the present method and the prior art when a support vector machine classifier is used on different data sets and under different entry dimensions;
FIG. 9 is a histogram comparing the macro-average F1 values of the present method and the prior art when a support vector machine classifier is used on different data sets and under different entry dimensions.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention relates to a text classification method based on feature selection with attraction factors, which, as shown in FIG. 1, specifically comprises the following steps:
step 1: preprocessing a data set for the naive Bayes classifier NB and the support vector machine classifier SVM: acquiring a plurality of data sets that have undergone stemming and stop-word removal, eliminating entries that occur in more than 25% of the documents in the data sets as well as entries that occur in fewer than 3 documents, and dividing the data into a test set and a training set by cross-validation;
step 2: setting the number of feature words for the test set and training set obtained in step 1 using the attraction-factor-based feature selection method, to generate an optimal feature subset;
step 3: training and classifying with the naive Bayes classifier NB and the support vector machine classifier SVM in turn on the optimal feature subset of the training set obtained in step 2 to train classifier models, and inputting the optimal feature subset of the test set obtained in step 2 into the classifier models to obtain a classification result;
step 4: evaluating the classification result obtained in step 3 with the two evaluation indexes micro-average F1 and macro-average F1, to verify the performance of the attraction-factor-based feature selection method.
The invention uses the naive Bayes (NB) and support vector machine (SVM) classification algorithms for classification. The naive Bayes algorithm is a probability-based algorithm widely applied in the field of machine learning; it mainly focuses on the probability that a text belongs to a certain category and shows good efficiency and robustness in practical applications. The support vector machine algorithm works well at mining the internal features of data and has higher accuracy than other classification algorithms; through its kernel function, classification in a high-dimensional vector space can be reduced to low-dimensional operations.
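For illustration, a minimal sketch of step 3 assuming scikit-learn stand-ins for the two classifiers (the patent does not name an implementation; function and variable names here are assumptions):

```python
# Minimal sketch of step 3 under the assumptions stated above.
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

def train_and_classify(X_train, y_train, X_test):
    """Train each classifier model on the training feature subset, then classify the test set."""
    predictions = {}
    for name, clf in (("NB", MultinomialNB()), ("SVM", LinearSVC())):
        clf.fit(X_train, y_train)                # train the classifier model
        predictions[name] = clf.predict(X_test)  # classify the test feature subset
    return predictions
```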
The data sets in step 1 are the four data sets 20Newsgroups, WebKB, K1a and K1b.
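A sketch of the step-1 preprocessing on 20Newsgroups (the only one of the four data sets bundled with scikit-learn); the 25% and 3-document thresholds follow the text, while the loader and parameter names are implementation assumptions, and stemming is assumed to have been applied beforehand:

```python
# Sketch of step 1 under the assumptions stated above: entries occurring in
# more than 25% of documents (max_df=0.25) or in fewer than 3 documents
# (min_df=3) are eliminated, then cross-validation splits the data.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold

corpus = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))
vectorizer = CountVectorizer(stop_words="english", max_df=0.25, min_df=3)
X = vectorizer.fit_transform(corpus.data)  # (texts x entries) count matrix
y = corpus.target

for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(corpus.data):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
```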
Step 2 comprises the following specific steps:
step 2.1: calculating an attraction factor T(t_i), where the attraction factor represents the average frequency of occurrence of the term in each text of the class; the larger the attraction factor, the more representative the term;
step 2.2: calculating the maximum term positive rate MT, i.e. taking the maximum of the true positive rate and the false positive rate to balance the true relevance of the terms;
step 2.3: calculating the normalized difference measure factor NDM from the true positive rate tpr and the false positive rate fpr obtained in step 2.2;
step 2.4: calculating the weight value MTFS(t_i) of each entry according to the following formula, sorting the entries, and selecting the optimal feature subset according to the number of entries,
MTFS(t_i) = MT · T(t_i) · NDM
where MT is the maximum term positive rate from step 2.2, T(t_i) is the attraction factor from step 2.1, and NDM is the normalized difference measure factor from step 2.3.
The specific steps of step 2 are as follows:
step 2.1: calculating the attraction factor T(t_i), which represents the average frequency of occurrence of the term in each text of the class:
$$T(t_i) = \frac{1}{N}\sum_{j=1}^{N} tf_{ij} \qquad (1)$$
where tf_{ij} is the number of occurrences of the term in text d_j of class C_i, and N is the total number of texts in class C_i;
step 2.2: calculating the maximum term positive rate MT, i.e. taking the maximum of the true positive rate tpr and the false positive rate fpr to balance the true relevance of the terms;
the true positive rate tpr and the false positive rate fpr are calculated as follows:
$$tpr = \frac{tp}{tp + fn} \qquad (2)$$
$$fpr = \frac{fp}{fp + tn} \qquad (3)$$
$$MT = \max(tpr, fpr)$$
where tp denotes the number of texts that contain entry t_i and belong to class C_k; fn denotes the number of texts that do not contain entry t_i but belong to class C_k; fp denotes the number of texts that contain entry t_i but do not belong to class C_k; tn denotes the number of texts that neither contain entry t_i nor belong to class C_k;
step 2.3: from the true positive rate tpr and the false positive rate fpr calculated by formulas (2) and (3) in step 2.2, the normalized difference measure factor is calculated according to the following formula,
$$NDM = \frac{|tpr - fpr|}{\min(tpr, fpr)} \qquad (4)$$
step 2.4: the weight value MTFS(t_i) of each entry is calculated according to the following formula, the entries are sorted, and the optimal feature subset is selected according to the number of entries,
$$MTFS(t_i) = MT \cdot T(t_i) \cdot NDM \qquad (5)$$
where MT is the maximum term positive rate obtained in step 2.2, T(t_i) is the attraction factor obtained in step 2.1, and NDM is the normalized difference measure factor obtained in step 2.3.
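As an illustration only (the counts below are assumed for the example, not taken from the patent), consider a term with tp = 8, fn = 2, fp = 1, tn = 9 and an average in-class frequency T(t_i) = 2.5:

$$tpr = \frac{8}{8+2} = 0.8, \qquad fpr = \frac{1}{1+9} = 0.1$$
$$MT = \max(0.8,\, 0.1) = 0.8, \qquad NDM = \frac{|0.8 - 0.1|}{\min(0.8,\, 0.1)} = 7$$
$$MTFS(t_i) = 0.8 \times 2.5 \times 7 = 14$$

A term that is frequent inside one class and rare outside it therefore receives a large weight, which is precisely the attraction the method rewards.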
The calculation formula of the micro-average F1 in step 4 is as follows:
$$\text{micro-}F1 = \frac{2\,\bar{p}\,\bar{r}}{\bar{p} + \bar{r}} \qquad (6)$$
where the average precision $\bar{p}$ and the average recall $\bar{r}$ are the precision
$$p = \frac{tp}{tp + fp}$$
and the recall
$$r = \frac{tp}{tp + fn}$$
averaged over the test classes; tp denotes the number of texts that contain entry t_i and belong to class C_k; fn denotes the number of texts that do not contain entry t_i but belong to class C_k; fp denotes the number of texts that contain entry t_i but do not belong to class C_k;
the macroaverage-F1 calculation formula is as follows:
Figure BDA0002404802440000101
in the formula F1(k) The value of-F1 for the micro-average of the kth test class is indicated, and K indicates the total number of test classes.
The higher and more stable the micro-F1 and macro-F1 values obtained in the experiments, the better the classification effect and the higher the precision.
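A short sketch of the step-4 evaluation, assuming scikit-learn's f1_score as a stand-in for equations (6) and (7):

```python
# Sketch of step 4: micro-average F1 pools tp/fp/fn over all test classes
# before computing F1 (eq. 6), while macro-average F1 averages the per-class
# F1 values (eq. 7).
from sklearn.metrics import f1_score

def evaluate(y_true, y_pred):
    return (f1_score(y_true, y_pred, average="micro"),
            f1_score(y_true, y_pred, average="macro"))

# Toy usage: three of four test texts classified correctly.
micro, macro = evaluate([0, 1, 2, 1], [0, 1, 2, 2])
print(micro, macro)
```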
In analyzing a data set in the experiments, the invention considers the distribution of terms within classes and also how to handle terms that are highly sparse across classes. The aim is to select, from the original feature space, the feature items with stronger class-distinguishing ability, and to reduce the dimension of the full feature set according to one or more evaluation criteria, generating a feature subset of lower dimension.
To validate the ability of the attraction-factor-based feature selection method, it was compared with the well-known chi-square test (CHI), the Gini coefficient (GINI) method, the normalized difference measure (NDM) and the odds ratio (OR). As can be seen from FIGS. 2, 3, 6 and 7, in the experiments with the naive Bayes classifier the method of the invention attains higher F1 values and is more stable and best on average compared with the existing methods. As can be seen from FIGS. 4, 5, 8 and 9, in the experiments with the support vector machine classifier the invention shows good results on most data sets. The experiments prove that the invention is an effective feature selection algorithm.
The pseudo code of the algorithm of the present invention appears in the original filing only as an image and is not reproduced here.
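In its place, a sketch reconstructed from the formulas of step 2 is given below; variable names and the rule for combining per-class scores (taking each entry's maximum over classes) are assumptions, not the patent's own pseudo code:

```python
# Reconstruction of the MTFS feature selection from equations (1)-(5); the
# per-class score combination by maximum is an assumption.
import numpy as np

def mtfs_scores(X, y, eps=1e-12):
    """Score every entry; X is a dense (texts x entries) count matrix, y the class labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    present = X > 0
    scores = np.zeros(X.shape[1])
    for c in np.unique(y):
        in_c = y == c
        T = X[in_c].sum(axis=0) / in_c.sum()       # attraction factor, eq. (1)
        tp = present[in_c].sum(axis=0)             # texts of class c containing the entry
        fn = in_c.sum() - tp
        fp = present[~in_c].sum(axis=0)            # texts outside class c containing it
        tn = (~in_c).sum() - fp
        tpr = tp / (tp + fn + eps)                 # eq. (2)
        fpr = fp / (fp + tn + eps)                 # eq. (3)
        MT = np.maximum(tpr, fpr)                  # maximum term positive rate
        NDM = np.abs(tpr - fpr) / (np.minimum(tpr, fpr) + eps)  # eq. (4)
        scores = np.maximum(scores, MT * T * NDM)  # eq. (5), best score over classes
    return scores

def select_features(X, y, n_entries):
    """Indices of the n_entries top-weighted entries, i.e. the optimal feature subset."""
    return np.argsort(mtfs_scores(X, y))[::-1][:n_entries]
```

With the sparse matrix from the preprocessing sketch above, select_features(X_train.toarray(), y_train, 500) would, under these assumptions, return the indices of the 500 top-weighted entries.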
The text classification method based on feature selection with attraction factors disclosed by the invention has the following beneficial effects: the invention comprehensively considers the contribution to classification of document frequency and of the distribution of terms within and between classes; compared with the traditional CHI, GINI, NDM and OR algorithms it therefore has a clear advantage in classification accuracy on the data sets 20Newsgroups, WebKB, K1a and K1b, and experiments prove that the attraction-factor-based feature selection method improves classification accuracy when applied to text classification and is an effective feature selection algorithm.

Claims (5)

1. A text classification method based on feature selection of attraction factors is characterized by comprising the following steps:
step 1: preprocessing a data set for the naive Bayes classifier NB and the support vector machine classifier SVM: acquiring a plurality of data sets that have undergone stemming and stop-word removal, eliminating entries that occur in more than 25% of the documents in the data sets as well as entries that occur in fewer than 3 documents, and dividing the data into a test set and a training set by cross-validation;
step 2: setting the number of feature words for the test set and training set obtained in step 1 using the attraction-factor-based feature selection method, to generate an optimal feature subset;
step 3: training and classifying with the naive Bayes classifier NB and the support vector machine classifier SVM in turn on the optimal feature subset of the training set obtained in step 2 to train classifier models, and inputting the optimal feature subset of the test set obtained in step 2 into the classifier models to obtain a classification result;
step 4: evaluating the classification result obtained in step 3 with the two evaluation indexes micro-average F1 and macro-average F1, to verify the performance of the attraction-factor-based feature selection method.
2. The method of claim 1, wherein the data sets in step 1 are the four data sets 20Newsgroups, WebKB, K1a and K1b.
3. The method for classifying texts based on feature selection of attraction factors according to claim 1, wherein the step 2 comprises the following specific steps:
step 2.1: calculating an attraction factor T(t_i), where the attraction factor represents the average frequency of occurrence of the term in each text of the class;
step 2.2: calculating the maximum term positive rate MT, i.e. taking the maximum of the true positive rate and the false positive rate to balance the true relevance of the terms;
step 2.3: calculating the normalized difference measure factor NDM from the true positive rate tpr and the false positive rate fpr obtained in step 2.2;
step 2.4: calculating the weight value MTFS(t_i) of each entry according to the following formula, sorting the entries, and selecting the optimal feature subset according to the number of entries,
MTFS(t_i) = MT · T(t_i) · NDM
wherein MT is the maximum term positive rate in said step 2.2, T(t_i) is the attraction factor in said step 2.1, and NDM is the normalized difference measure factor in said step 2.3.
4. The method for classifying texts based on feature selection of attraction factors according to claim 1, wherein the specific steps of the step 2 are as follows:
step 2.1: calculating the attraction factor T(t_i), which represents the average frequency of occurrence of the term in each text of the class:
$$T(t_i) = \frac{1}{N}\sum_{j=1}^{N} tf_{ij} \qquad (1)$$
where tf_{ij} is the number of occurrences of the term in text d_j of class C_i, and N is the total number of texts in class C_i;
step 2.2: calculating the maximum term positive rate MT, i.e. taking the maximum of the true positive rate tpr and the false positive rate fpr to balance the true relevance of the terms;
the true positive rate tpr and the false positive rate fpr are calculated as follows:
$$tpr = \frac{tp}{tp + fn} \qquad (2)$$
$$fpr = \frac{fp}{fp + tn} \qquad (3)$$
$$MT = \max(tpr, fpr)$$
where tp denotes the number of texts that contain entry t_i and belong to class C_k; fn denotes the number of texts that do not contain entry t_i but belong to class C_k; fp denotes the number of texts that contain entry t_i but do not belong to class C_k; tn denotes the number of texts that neither contain entry t_i nor belong to class C_k;
step 2.3: calculating the normalized difference measure factor from the true positive rate tpr and the false positive rate fpr calculated by formulas (2) and (3) in step 2.2, according to the following formula,
$$NDM = \frac{|tpr - fpr|}{\min(tpr, fpr)} \qquad (4)$$
step 2.4: the weight value MTFS(t_i) of each entry is calculated according to the following formula, the entries are sorted, and the optimal feature subset is selected according to the number of entries,
$$MTFS(t_i) = MT \cdot T(t_i) \cdot NDM \qquad (5)$$
wherein MT is the maximum term positive rate obtained in said step 2.2, T(t_i) is the attraction factor obtained in said step 2.1, and NDM is the normalized difference measure factor obtained in said step 2.3.
5. The method for classifying texts based on feature selection of attraction factors according to claim 1, wherein the micro-average F1 in step 4 is calculated as follows:
$$\text{micro-}F1 = \frac{2\,\bar{p}\,\bar{r}}{\bar{p} + \bar{r}} \qquad (6)$$
where the average precision $\bar{p}$ and the average recall $\bar{r}$ are the precision
$$p = \frac{tp}{tp + fp}$$
and the recall
$$r = \frac{tp}{tp + fn}$$
averaged over the test classes; tp denotes the number of texts that contain entry t_i and belong to class C_k; fn denotes the number of texts that do not contain entry t_i but belong to class C_k; fp denotes the number of texts that contain entry t_i but do not belong to class C_k;
the macro-average F1 is calculated as follows:
$$\text{macro-}F1 = \frac{1}{K}\sum_{k=1}^{K} F1(k) \qquad (7)$$
where F1(k) denotes the micro-average F1 value of the k-th test class and K denotes the total number of test classes.
CN202010158078.1A 2020-03-09 2020-03-09 Text classification method based on feature selection of attraction factors Active CN111382273B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010158078.1A CN111382273B (en) 2020-03-09 2020-03-09 Text classification method based on feature selection of attraction factors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010158078.1A CN111382273B (en) 2020-03-09 2020-03-09 Text classification method based on feature selection of attraction factors

Publications (2)

Publication Number Publication Date
CN111382273A (en) 2020-07-07
CN111382273B (en) 2023-04-14

Family

ID=71217271

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010158078.1A Active CN111382273B (en) 2020-03-09 2020-03-09 Text classification method based on feature selection of attraction factors

Country Status (1)

Country Link
CN (1) CN111382273B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113657106A (en) * 2021-07-05 2021-11-16 西安理工大学 Feature selection method based on normalized word frequency weight


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810264A (en) * 2014-01-27 2014-05-21 西安理工大学 Webpage text classification method based on feature selection
CN105260437A (en) * 2015-09-30 2016-01-20 陈一飞 Text classification feature selection method and application thereof to biomedical text classification
CN107273387A (en) * 2016-04-08 2017-10-20 上海市玻森数据科技有限公司 Towards higher-dimension and unbalanced data classify it is integrated
CN106383853A (en) * 2016-08-30 2017-02-08 刘勇 Realization method and system for electronic medical record post-structuring and auxiliary diagnosis
WO2018218706A1 (en) * 2017-05-27 2018-12-06 中国矿业大学 Method and system for extracting news event based on neural network
CN109376235A (en) * 2018-07-24 2019-02-22 西安理工大学 The feature selection approach to be reordered based on document level word frequency

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Yao Lifang et al., "Feature selection algorithm for hierarchical text classification using Kullback-Leibler divergence", IEEE International Conference on Cloud Computing and Big Data Analysis *
如先姑力·阿布都热西提 et al., "Text filtering method for Uyghur-language forums based on term selection and Rocchio classifier", Wanfang Data Knowledge Service Platform *


Also Published As

Publication number Publication date
CN111382273B (en) 2023-04-14

Similar Documents

Publication Publication Date Title
Georgakopoulos et al. Convolutional neural networks for toxic comment classification
CN109101477B (en) Enterprise field classification and enterprise keyword screening method
Huang et al. An improved knn based on class contribution and feature weighting
CN111709439B (en) Feature selection method based on word frequency deviation rate factor
CN109376235B (en) Feature selection method based on document layer word frequency reordering
Zhang et al. Continuous word embeddings for detecting local text reuses at the semantic level
CN104881399B (en) Event recognition method and system based on probability soft logic PSL
CN112800249A (en) Fine-grained cross-media retrieval method based on generation of countermeasure network
CN115098690B (en) Multi-data document classification method and system based on cluster analysis
CN110781333A (en) Method for processing unstructured monitoring data of cable-stayed bridge based on machine learning
CN111460161A (en) Unsupervised text theme related gene extraction method for unbalanced big data set
CN113836896A (en) Patent text abstract generation method and device based on deep learning
Zhang et al. Compact representation of high-dimensional feature vectors for large-scale image recognition and retrieval
Pristyanto et al. The effect of feature selection on classification algorithms in credit approval
CN113626604B (en) Web page text classification system based on maximum interval criterion
CN112579783B (en) Short text clustering method based on Laplace atlas
CN111382273B (en) Text classification method based on feature selection of attraction factors
Mandal et al. Unsupervised non-redundant feature selection: a graph-theoretic approach
CN110348497B (en) Text representation method constructed based on WT-GloVe word vector
CN105760471B (en) Based on the two class text classification methods for combining convex linear perceptron
Fursov et al. Sequence embeddings help to identify fraudulent cases in healthcare insurance
Wang et al. Learning based neural similarity metrics for multimedia data mining
CN115186138A (en) Comparison method and terminal for power distribution network data
CN114610884A (en) Classification method based on PCA combined feature extraction and approximate support vector machine
CN113657106A (en) Feature selection method based on normalized word frequency weight

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20230313

Address after: Room 501, No. 18, Haizhou Road, Haizhu District, Guangzhou City, Guangdong Province, 510000 (Location: Self made 01) (Office only)

Applicant after: Guangzhou Zhiying Wanshi Market Management Co.,Ltd.

Address before: 710000 No. B49, Xinda Zhongchuang space, 26th Street, block C, No. 2 Trading Plaza, South China City, international port district, Xi'an, Shaanxi Province

Applicant before: Xi'an Huaqi Zhongxin Technology Development Co.,Ltd.

Effective date of registration: 20230313

Address after: 710000 No. B49, Xinda Zhongchuang space, 26th Street, block C, No. 2 Trading Plaza, South China City, international port district, Xi'an, Shaanxi Province

Applicant after: Xi'an Huaqi Zhongxin Technology Development Co.,Ltd.

Address before: 710048 Shaanxi province Xi'an Beilin District Jinhua Road No. 5

Applicant before: XI'AN University OF TECHNOLOGY

GR01 Patent grant
GR01 Patent grant