WO2017028416A1 - Classifier training method, type identification method and device - Google Patents

Classifier training method, type identification method and device

Info

Publication number
WO2017028416A1
WO2017028416A1 (PCT/CN2015/097615)
Authority
WO
WIPO (PCT)
Prior art keywords
classifier
feature
sample
clause
words
Prior art date
Application number
PCT/CN2015/097615
Other languages
English (en)
French (fr)
Inventor
汪平仄
龙飞
张涛
Original Assignee
小米科技有限责任公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 小米科技有限责任公司 filed Critical 小米科技有限责任公司
Priority to KR1020167003870A priority Critical patent/KR101778784B1/ko
Priority to RU2016111677A priority patent/RU2643500C2/ru
Priority to MX2016003981A priority patent/MX2016003981A/es
Priority to JP2017534873A priority patent/JP2017535007A/ja
Publication of WO2017028416A1 publication Critical patent/WO2017028416A1/zh

Classifications

    • G06F40/205 Handling natural language data; Natural language analysis; Parsing
    • G06F40/216 Parsing using statistical methods
    • G06F40/268 Morphological analysis
    • G06F40/279, G06F40/284 Recognition of textual entities; Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F16/35 Information retrieval of unstructured textual data; Clustering; Classification
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G06F18/24155 Classification techniques; Bayesian classification
    • G10L15/26 Speech recognition; Speech to text systems
    • G06V30/416 Analysis of document content; Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors

Definitions

  • the present disclosure relates to the field of natural language processing, and in particular, to a classifier training method, a type identification method, and a device.
  • SMS content recognition and extraction is a practical application of natural language processing.
  • Taking the recognition of birthday-related short messages as an example, the related art provides a recognition method in which a plurality of keywords are preset, and whether a short message carries a birthday date is identified by checking whether the content of the short message includes all or some of the keywords.
  • the present disclosure provides a classifier training method, a type identification method, and a device.
  • the technical solution is as follows:
  • a classifier training method, comprising:
  • a sample clause carrying the target keyword is extracted from the sample information;
  • according to whether each sample clause belongs to the target category, the sample clauses are binary-labeled to obtain a sample training set;
  • each sample clause in the sample training set is segmented to obtain a number of words;
  • a specified feature set is extracted from the words, the specified feature set including at least one feature word;
  • a classifier is constructed according to the feature words in the specified feature set;
  • the classifier is trained based on the binary labeling results in the sample training set.
  • the specified feature set is extracted from the words, including:
  • a specified feature set is extracted from the words according to a chi-square test; or, a specified feature set is extracted from the words according to the information gain.
  • the classifier is constructed from the feature words in the specified feature set, including:
  • a naive Bayes classifier is constructed from the feature words in the specified feature set, and the feature words are independent of one another in the naive Bayes classifier.
  • the classifier is trained based on the binary labeling results in the sample training set, including:
  • for each feature word in the naive Bayes classifier, the first conditional probability that a clause carrying the feature word belongs to the target category, and the second conditional probability that a clause carrying the feature word does not belong to the target category, are counted according to the binary labeling results in the sample training set;
  • the trained naive Bayes classifier is obtained according to each feature word, the first conditional probability and the second conditional probability.
  • a type identification method, comprising:
  • a clause carrying the target keyword is extracted from the original information;
  • a feature set of the original information is generated according to the feature words in the extracted clause that belong to a specified feature set, the feature words in the specified feature set being extracted from the word segmentation results of sample clauses carrying the target keyword;
  • the feature set of the original information is input into the trained classifier for prediction, the classifier being a classifier constructed in advance according to the feature words in the specified feature set;
  • the prediction result of the classifier is obtained, the prediction result indicating that the original information belongs to the target category or does not belong to the target category.
  • the feature set of the original information is input into the trained classifier for prediction, including:
  • each feature word in the feature set of the original information is input into the trained naive Bayes classifier, and a first prediction probability that the original information belongs to the target category and a second prediction probability that the original information does not belong to the target category are calculated; whether the original information belongs to the target category is predicted according to the magnitude relationship between the first prediction probability and the second prediction probability;
  • the trained naive Bayes classifier includes a first conditional probability and a second conditional probability for each feature word, the first conditional probability being the probability that a clause carrying the feature word belongs to the target category, and the second conditional probability being the probability that a clause carrying the feature word does not belong to the target category.
  • the method further includes:
  • the target information is extracted from the original information.
  • the target information is a birthday date;
  • extracting the target information from the original information includes:
  • the birthday date is extracted from the original information by using a regular expression; or, the date of receipt of the original information is extracted as the birthday date.
  • a classifier training apparatus comprising:
  • a clause extraction module configured to extract a sample clause carrying the target keyword from the sample information
  • the clause labeling module is configured to perform binary labeling on the sample clause according to whether each sample clause belongs to the target category, and obtain a sample training set;
  • a clause segmentation module configured to segment each sample clause in the sample training set to obtain a number of words
  • a feature word extraction module configured to extract a specified feature set from the plurality of words, the specified feature set including at least one feature word
  • a classifier building module configured to construct a classifier according to a feature word in the specified feature set
  • the classifier training module is configured to train the classifier based on the binary annotation results in the sample training set.
  • the feature word extraction module is configured to extract the specified feature set from the words according to a chi-square test; or, the feature word extraction module is configured to extract the specified feature set from the words according to the information gain.
  • the classifier building module is configured to construct a naive Bayes classifier from the feature words in the specified feature set, the feature words being independent of one another in the naive Bayes classifier.
  • the classifier training module includes:
  • a statistical sub-module configured to, for each feature word in the naive Bayes classifier, count the first conditional probability that the clause carrying the characteristic word belongs to the target category according to the binary annotation result in the sample training set, and The second conditional probability that the clause carrying the characteristic word does not belong to the target category;
  • the training submodule is configured to obtain the trained naive Bayes classifier according to each feature word, the first conditional probability, and the second conditional probability.
  • a type identifying apparatus comprising:
  • the original extraction module is configured to extract a clause carrying the target keyword from the original information
  • the feature extraction module is configured to generate a feature set of the original information according to the feature words in the extracted clause that belong to the specified feature set, the feature words in the specified feature set being extracted from the word segmentation results of sample clauses carrying the target keyword;
  • a feature input module configured to input the feature set of the original information into the trained classifier for prediction, the classifier being a classifier constructed in advance according to the feature words in the specified feature set;
  • the result obtaining module is configured to obtain a prediction result of the classifier, and the prediction result indicates that the original information belongs to the target category or does not belong to the target category.
  • the feature input module includes:
  • a calculation submodule configured to input each feature word in the feature set of the original information into the trained naive Bayes classifier, and to calculate a first prediction probability that the original information belongs to the target category and a second prediction probability that the original information does not belong to the target category;
  • a prediction submodule configured to predict whether the original information belongs to the target category according to a magnitude relationship between the first prediction probability and the second prediction probability
  • the trained naive Bayes classifier includes a first conditional probability and a second conditional probability of each feature word, the first conditional probability is a probability that the clause carrying the characteristic word belongs to the target category, and the second conditional probability Is the probability that a clause carrying a characteristic word does not belong to the target category.
  • the apparatus further includes:
  • the information extraction module is configured to extract the target information from the original information when it is predicted that the original information belongs to the target category.
  • the target information is a birthday date
  • An information extraction module configured to extract a birthday date from the original information by using a regular expression
  • the information extraction module is configured to extract the date of receipt of the original information as a birthday date.
  • a classifier training apparatus comprising:
  • a processor;
  • a memory for storing instructions executable by the processor;
  • wherein the processor is configured to:
  • extract a sample clause carrying the target keyword from the sample information;
  • perform binary labeling on the sample clauses according to whether each sample clause belongs to the target category, to obtain a sample training set;
  • segment each sample clause in the sample training set to obtain a number of words;
  • extract a specified feature set from the words, the specified feature set including at least one feature word;
  • construct a classifier according to the feature words in the specified feature set;
  • train the classifier based on the binary labeling results in the sample training set.
  • a type identification apparatus comprising:
  • a processor;
  • a memory for storing instructions executable by the processor;
  • wherein the processor is configured to:
  • extract a clause carrying the target keyword from the original information;
  • generate a feature set of the original information according to the feature words in the extracted clause that belong to a specified feature set, the feature words in the specified feature set being extracted from the word segmentation results of sample clauses carrying the target keyword;
  • input the feature set of the original information into the trained classifier for prediction, the classifier being a classifier constructed in advance according to the feature words in the specified feature set;
  • obtain the prediction result of the classifier, the prediction result indicating that the original information belongs to the target category or does not belong to the target category.
  • FIG. 1 is a flowchart of a classifier training method according to an exemplary embodiment
  • FIG. 2 is a flowchart of a classifier training method according to another exemplary embodiment
  • FIG. 3 is a flowchart of a type identification method according to an exemplary embodiment
  • FIG. 4 is a flowchart of a type identification method according to another exemplary embodiment
  • FIG. 5 is a block diagram of a classifier training apparatus according to an exemplary embodiment
  • FIG. 6 is a block diagram of a classifier training apparatus according to another exemplary embodiment
  • FIG. 7 is a block diagram of a type identifying apparatus according to an exemplary embodiment
  • FIG. 8 is a block diagram of a type identifying apparatus according to another exemplary embodiment.
  • FIG. 9 is a block diagram of a classifier training device or a type identification device, according to an exemplary embodiment.
  • Because of the diversity and complexity of natural language expression, type identification that directly uses target keywords is not accurate. For example, text messages carrying the target keyword "birthday" or "birth" include:
  • SMS 1: "Xiao Min, tomorrow is not his birthday, so don't buy a cake."
  • SMS 2: "Dear, is today your birthday?"
  • SMS 3: "My son was born on this day last year."
  • SMS 4: "Babies born on May 20 all have good luck."
  • Of the above four text messages, only the third one carries a valid birthday date; the other three do not.
  • the embodiment of the present disclosure provides a classifier-based identification method.
  • the identification method comprises two phases: a first phase, a phase of training a classifier, and a second phase, a phase of class identification using a classifier.
  • the first stage: the stage of training the classifier.
  • FIG. 1 is a flowchart of a classifier training method according to an exemplary embodiment. The method includes the following steps.
  • step 101 a sample clause carrying the target keyword is extracted from the sample information.
  • the category of the sample information is any one of a short message, a mail, a microblog, or instant messaging information.
  • the embodiment of the present disclosure does not limit the category of sample information.
  • Each piece of sample information includes at least one clause.
  • the clause carrying the target keyword is a sample clause.
  • step 102 according to whether each sample clause belongs to a target category, the sample clause is binary-labeled to obtain a sample training set.
  • in step 103, each sample clause in the sample training set is segmented to obtain a number of words.
  • in step 104, a specified feature set is extracted from the words, the specified feature set including at least one feature word.
  • step 105 a classifier is constructed from the feature words in the specified feature set.
  • the classifier is a naive Bayes classifier.
  • step 106 the classifier is trained based on the binary annotation results in the sample training set.
  • In summary, the classifier training method provided by this embodiment segments each sample clause in the sample training set to obtain a number of words, extracts a specified feature set from those words, and constructs the classifier according to the feature words in the specified feature set. This solves the problem that the recognition result is inaccurate when short message category analysis relies solely on birthday keywords. Since the feature words in the specified feature set are extracted from the word segmentation results of sample clauses carrying the target keyword, the classifier can make fairly accurate predictions for clauses carrying the target keyword, so the recognition result is more accurate.
  • FIG. 2 is a flowchart of a classifier training method according to another exemplary embodiment. The method includes the following steps.
  • step 201 a plurality of pieces of sample information carrying the target keyword are acquired.
  • the target keyword is a keyword related to the target category.
  • the target category is information carrying a valid birthday date
  • the target keywords include: "birthday” and "birth”.
  • the more sample information carrying the target keyword there is, the more accurate the trained classifier becomes. When the sample information is short messages, the sample information illustratively includes: sample SMS 1 "Xiao Min, tomorrow is not his birthday, so don't buy a cake."; sample SMS 2 "Dear, is today your birthday?"; sample SMS 3 "My son was born on this day last year."; sample SMS 4 "Babies born on May 20 all have good luck."; sample SMS 5 "The day my son was born happened to be April Fools' Day, April 1."; and so on.
  • step 202 a sample clause carrying the target keyword is extracted from the sample information.
  • Each piece of sample information includes at least one clause.
  • a clause is a sentence segment that is not separated by punctuation. For example: sample clause 1 "tomorrow is not his birthday" is extracted from sample SMS 1; sample clause 2 "is today your birthday" from sample SMS 2; sample clause 3 "my son was born on this day last year" from sample SMS 3; sample clause 4 "babies born on May 20 all have good luck" from sample SMS 4; and sample clause 5 "the day my son was born" from sample SMS 5.
  • step 203 according to whether each sample clause belongs to the target category, the sample clause is binary-labeled to obtain a sample training set.
  • the label value of the binary label is 1 or 0, and the label is 1 when the sample clause belongs to the target category; and 0 when the sample clause does not belong to the target category.
  • the label of the sample clause 1 is 0, the label of the sample clause 2 is 0, the label of the sample clause 3 is 1, the label of the sample clause 4 is 0, and the label of the sample clause 5 is 1.
  • the sample training set includes multiple sample clauses.
  • each sample clause in the sample training set is segmented to obtain a number of words.
  • for example, segmenting sample clause 1 yields 5 words: "明天" (tomorrow), "不是" (is not), "他" (he), "的", "生日" (birthday); segmenting sample clause 2 yields 6 words: "今天" (today), "是" (is), "你" (you), "的", "生日" (birthday), "吗"; segmenting sample clause 3 yields 8 words: "我" (I), "儿子" (son), "是" (is), "去年" (last year), "的", "今天" (today), "出生" (born), "的"; segmenting sample clause 4 yields 7 words: "5月" (May), "20日" (20th), "出生" (born), "的", "宝宝" (baby), "都有" (all have), "好运" (good luck); and segmenting sample clause 5 yields 4 words: "我" (I), "儿子" (son), "出生" (born), "那天" (that day).
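  • As an illustration of the word-segmentation step 204 described above, the following is a minimal Python sketch. It assumes the open-source jieba segmenter is used (the disclosure does not name a particular segmentation tool); the clauses and labels are the sample clauses 1 to 5 above.

```python
# Minimal sketch of step 204, assuming the jieba segmenter (not specified in the disclosure).
import jieba

# (clause, binary label): 1 = belongs to the target category, 0 = does not.
sample_clauses = [
    ("明天不是他的生日", 0),           # sample clause 1
    ("今天是你的生日吗", 0),           # sample clause 2
    ("我儿子是去年的今天出生的", 1),   # sample clause 3
    ("5月20日出生的宝宝都有好运", 0),  # sample clause 4
    ("我儿子出生那天", 1),             # sample clause 5
]

# Segment each sample clause into words, keeping its binary label.
segmented_training_set = [(jieba.lcut(clause), label) for clause, label in sample_clauses]
for words, label in segmented_training_set:
    print(words, label)
```

  The exact token boundaries depend on the segmenter and its dictionary, so the output may differ slightly from the word lists given in the text.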
  • step 205 a specified set of features is extracted from a number of words based on a chi-square test or information gain.
  • this step can extract feature words in two different ways.
  • in the first manner, the top n words ranked by correlation with the target category are extracted from the words according to the chi-square test to form the specified feature set F.
  • the chi-square test can measure the correlation between each word and the target category: the higher the correlation, the more suitable the word is as a feature word for that target category.
  • in the second manner, the top n words ranked by information gain value are extracted from the words according to the information gain to form the specified feature set F.
  • the information gain represents the amount of information a word carries relative to the sample training set: the more information the word carries, the more suitable it is as a feature word.
  • a naive Bayes classifier is constructed based on the feature words in the specified feature set, and each feature word is independent of each other in the naive Bayes classifier.
  • the naive Bayes classifier is a classifier that makes predictions based on the first conditional probability and the second conditional probability of each feature word.
  • the first conditional probability is the probability that the clause carrying the feature word belongs to the target category
  • the second conditional probability is the probability that the clause carrying the feature word does not belong to the target category.
  • the first conditional probability and the second conditional probability of each feature word need to be calculated according to the sample training set.
  • in step 207, for each feature word in the naive Bayes classifier, the first conditional probability that a clause carrying the feature word belongs to the target category, and the second conditional probability that a clause carrying the feature word does not belong to the target category, are counted according to the binary labeling results in the sample training set;
  • step 208 the trained naive Bayes classifier is obtained according to each feature word, the first conditional probability, and the second conditional probability.
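  • A minimal sketch of steps 207 and 208 follows, under the assumption that the two conditional probabilities are estimated as simple frequencies over the binary-labeled, segmented sample clauses; the function and variable names are illustrative and not taken from the disclosure.

```python
# Minimal sketch of steps 207-208: estimate, for every feature word, the first
# conditional probability P(target | word) and the second conditional probability
# P(not target | word) from the binary-labeled, segmented sample training set.
from collections import defaultdict

def train_naive_bayes(segmented_training_set, specified_feature_set):
    clauses_with_word = defaultdict(int)    # number of clauses carrying the feature word
    positive_with_word = defaultdict(int)   # ... of which belong to the target category
    for words, label in segmented_training_set:
        for feature_word in specified_feature_set.intersection(words):
            clauses_with_word[feature_word] += 1
            positive_with_word[feature_word] += label
    classifier = {}
    for feature_word, total in clauses_with_word.items():
        first = positive_with_word[feature_word] / total   # first conditional probability
        classifier[feature_word] = (first, 1.0 - first)    # (first, second) per feature word
    return classifier
```

  For example, if 100 sample clauses carry a given feature word and 73 of them are labeled 1, the pair stored for that word is (0.73, 0.27).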
  • In summary, the classifier training method provided by this embodiment segments each sample clause in the sample training set to obtain a number of words, extracts a specified feature set from those words, and constructs the classifier according to the feature words in the specified feature set. This solves the problem that the recognition result is inaccurate when short message category analysis relies solely on birthday keywords. Since the feature words in the specified feature set are extracted from the word segmentation results of sample clauses carrying the target keyword, the classifier can make fairly accurate predictions for clauses carrying the target keyword, so the recognition result is more accurate.
  • in this embodiment, feature words are also extracted from the clauses of the sample training set by the chi-square test or the information gain, which selects feature words that contribute more to classification accuracy and thereby improves the classification accuracy of the naive Bayes classifier.
  • the second stage: the stage of using the classifier for type identification.
  • FIG. 3 is a flowchart of a type identification method according to an exemplary embodiment.
  • the classifier used in this type of identification method is the classifier trained in the embodiment of Fig. 1 or Fig. 2.
  • the method includes the following steps.
  • step 301 a clause carrying the target keyword is extracted from the original information.
  • the original information is any one of a short message, a mail, a microblog, or instant messaging information.
  • the embodiment of the present disclosure does not limit the category of the original information.
  • Each piece of original information includes at least one clause.
  • in step 302, a feature set of the original information is generated according to the feature words in the extracted clause that belong to the specified feature set, the feature words in the specified feature set being extracted from the word segmentation results of sample clauses carrying the target keyword.
  • step 303 the feature set of the original information is input into the trained classifier for prediction, and the classifier is a classifier constructed in advance according to the feature words in the specified feature set.
  • the classifier is a naive Bayes classifier.
  • step 304 a prediction result of the classifier is obtained, the prediction result indicating that the original information belongs to the target category or does not belong to the target category.
  • In summary, the type identification method provided by this embodiment uses the specified feature set to extract the feature words in a clause as the feature set of the original information, and then inputs that feature set into a trained classifier for prediction, the classifier being constructed in advance according to the feature words in the specified feature set. This solves the problem that the recognition result is inaccurate when short message category analysis relies solely on birthday keywords. Since the feature words in the specified feature set are extracted from the word segmentation results of sample clauses carrying the target keyword, the classifier can make fairly accurate predictions for clauses carrying the target keyword, so the recognition result is more accurate.
  • FIG. 4 is a flowchart of another type identification method according to another exemplary embodiment.
  • the classifier used in this type of identification method is the classifier trained in the embodiment of Fig. 1 or Fig. 2.
  • the method includes the following steps.
  • in step 401, it is detected whether the original information includes the target keyword; if it does, the process proceeds to step 402, and otherwise no further processing is performed.
  • the original information is a text message.
  • the original message is "My birthday is July 28th, today is not my birthday!”.
  • the target keyword is a keyword related to the target category.
  • the target category is information carrying a valid birthday date
  • the target keywords include: "birthday” and "birth”.
  • step 402 if the original information includes the target keyword, the clause carrying the target keyword is extracted from the original information.
  • the original information includes the target keyword "birthday”
  • the clause "My birthday is July 28th” is extracted from the original information.
  • in step 403, a feature set of the original information is generated according to the feature words in the extracted clause that belong to the specified feature set, the feature words in the specified feature set being extracted from the word segmentation results of sample clauses carrying the target keyword;
  • for example, the specified feature set includes feature words such as "明天" (tomorrow), "不是" (is not), "他" (he), "的", "生日" (birthday), "今天" (today), "是" (is), "你" (you), "吗", "我" (I), "儿子" (son), "去年" (last year), "出生" (born), and "那天" (that day).
  • the feature words in the clause "My birthday is July 28th" that belong to the specified feature set are: "我" (my), "的", "生日" (birthday), "是" (is). These 4 words are taken as the feature set of the original information.
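  • As a sketch of step 403, the clause extracted from the original message can be segmented and filtered against the specified feature set; jieba is again assumed as the segmenter, and the expected output follows the example in the text (the exact tokens may vary with the segmenter).

```python
# Minimal sketch of step 403: keep only the words of the clause that belong to
# the specified feature set; the result is the feature set of the original information.
import jieba

def build_feature_set(clause, specified_feature_set):
    return [word for word in jieba.lcut(clause) if word in specified_feature_set]

specified_feature_set = {"明天", "不是", "他", "的", "生日", "今天", "是",
                         "你", "吗", "我", "儿子", "去年", "出生", "那天"}
print(build_feature_set("我的生日是7月28日", specified_feature_set))
# Expected per the example in the text: ['我', '的', '生日', '是']
```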
  • in step 404, each feature word in the feature set of the original information is input into the trained naive Bayes classifier, and a first prediction probability that the original information belongs to the target category and a second prediction probability that the original information does not belong to the target category are calculated;
  • the trained naive Bayes classifier includes a first conditional probability and a second conditional probability of each feature word, the first conditional probability is a probability that the clause carrying the characteristic word belongs to the target category, and the second conditional probability Is the probability that a clause carrying a characteristic word does not belong to the target category.
  • the first prediction probability of the original information is equal to the product of the first conditional probabilities of the feature words in the feature set of the original information.
  • for example, suppose the first conditional probability of "我" is 0.6, the first conditional probability of "的" is 0.5, the first conditional probability of "生日" is 0.65, and the first conditional probability of "是" is 0.7; the first prediction probability is then 0.6 × 0.5 × 0.65 × 0.7 = 0.1365.
  • the second prediction probability of the original information is equal to the product of the second conditional probabilities of the feature words in the feature set of the original information.
  • correspondingly, suppose the second conditional probability of "我" is 0.4, the second conditional probability of "的" is 0.5, the second conditional probability of "生日" is 0.35, and the second conditional probability of "是" is 0.3; the second prediction probability is then 0.4 × 0.5 × 0.35 × 0.3 = 0.021.
  • in step 405, whether the original information belongs to the target category is predicted according to the magnitude relationship between the first prediction probability and the second prediction probability;
  • if the first prediction probability is greater than the second prediction probability, the prediction result is that the original information belongs to the target category, that is, the original information is information carrying a valid birthday date. In the example above, 0.1365 > 0.021, so the original information is predicted to belong to the target category.
  • if the first prediction probability is smaller than the second prediction probability, the prediction result is that the original information does not belong to the target category.
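  • A minimal sketch of steps 404 and 405 follows, using the illustrative conditional probabilities listed above; the classifier dictionary has the same (first, second) layout as the training sketch earlier, and the names are illustrative.

```python
# Minimal sketch of steps 404-405: multiply the per-word conditional probabilities
# to get the two prediction probabilities, then compare them.
def predict(feature_words, classifier):
    first_prediction, second_prediction = 1.0, 1.0
    for word in feature_words:
        first, second = classifier[word]
        first_prediction *= first      # contributes to P(original information is target category)
        second_prediction *= second    # contributes to P(original information is not target category)
    return first_prediction > second_prediction, first_prediction, second_prediction

# Illustrative conditional probabilities from the example above.
classifier = {"我": (0.6, 0.4), "的": (0.5, 0.5), "生日": (0.65, 0.35), "是": (0.7, 0.3)}
belongs, p1, p2 = predict(["我", "的", "生日", "是"], classifier)
print(belongs, p1, p2)   # True, ~0.1365, 0.021 -> carries a valid birthday date
```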
  • step 406 if it is predicted that the original information belongs to the target category, the target information is extracted from the original information.
  • This step can be implemented in any of the following ways:
  • the birthday date is extracted from the original information by a regular expression (see the sketch after this list).
  • the date of receipt of the original information is extracted as the date of the birthday.
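  • A minimal sketch of the regular-expression option follows, under the assumption that the birthday date appears in a Chinese month/day form such as "7月28日"; the pattern is an illustrative assumption, not taken from the disclosure.

```python
# Minimal sketch of extracting a birthday date with a regular expression (step 406).
import re

# Matches dates such as "7月28日", optionally preceded by a year such as "1990年".
BIRTHDAY_PATTERN = re.compile(r"(?:(\d{4})年)?(\d{1,2})月(\d{1,2})日")

def extract_birthday(original_message):
    match = BIRTHDAY_PATTERN.search(original_message)
    if match is None:
        return None                      # fall back to the receipt date of the message
    year, month, day = match.groups()
    return year, int(month), int(day)

print(extract_birthday("我的生日是7月28日，今天不是我的生日呦！"))  # (None, 7, 28)
```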
  • In summary, the type identification method provided by this embodiment uses the specified feature set to extract the feature words in a clause as the feature set of the original information, and then inputs that feature set into a trained classifier for prediction, the classifier being constructed in advance according to the feature words in the specified feature set. This solves the problem that the recognition result is inaccurate when short message category analysis relies solely on birthday keywords. Since the feature words in the specified feature set are extracted from the word segmentation results of sample clauses carrying the target keyword, the classifier can make fairly accurate predictions for clauses carrying the target keyword, so the recognition result is more accurate.
  • the type identification method provided by this embodiment further extracts the target information from the original information after predicting that the original information belongs to the target category, thereby realizing the extraction of target information such as birthday dates and travel dates and providing data support for subsequent features such as automatically generating reminder items or calendar marks.
  • the target category is information carrying a valid birthday date, but the application of the above method is not limited to this single target category.
  • the target category may also be information carrying a valid travel date, information carrying a valid holiday date, and the like.
  • FIG. 5 is a block diagram of a classifier training apparatus according to an exemplary embodiment. As shown in FIG. 5, the classifier training apparatus includes, but is not limited to:
  • the clause extraction module 510 is configured to extract a sample clause carrying the target keyword from the sample information
  • the clause labeling module 520 is configured to perform binary labeling on the sample clause according to whether each sample clause belongs to the target category, to obtain a sample training set;
  • the clause segmentation module 530 is configured to perform segmentation on each sample clause in the sample training set to obtain a plurality of words
  • the feature word extraction module 540 is configured to extract a specified feature set from the plurality of words, the specified feature set including at least one feature word;
  • a classifier building module 550 configured to construct a classifier according to the feature words in the specified feature set
  • the classifier training module 560 is configured to train the classifier based on the binary annotation results in the sample training set.
  • In summary, the classifier training apparatus provided by this embodiment segments each sample clause in the sample training set to obtain a number of words, extracts a specified feature set from those words, and constructs the classifier according to the feature words in the specified feature set. This solves the problem that the recognition result is inaccurate when short message category analysis relies solely on birthday keywords. Since the feature words in the specified feature set are extracted from the word segmentation results of sample clauses carrying the target keyword, the classifier can make fairly accurate predictions for clauses carrying the target keyword, so the recognition result is more accurate.
  • FIG. 6 is a block diagram of a classifier training apparatus according to an exemplary embodiment. As shown in FIG. 6, the classifier training apparatus includes, but is not limited to:
  • the clause extraction module 510 is configured to extract a sample clause carrying the target keyword from the sample information
  • the clause labeling module 520 is configured to perform binary labeling on the sample clause according to whether each sample clause belongs to the target category, to obtain a sample training set;
  • the clause segmentation module 530 is configured to perform segmentation on each sample clause in the sample training set to obtain a plurality of words
  • the feature word extraction module 540 is configured to extract a specified feature set from the plurality of words, the specified feature set including at least one feature word;
  • a classifier building module 550 configured to construct a classifier according to the feature words in the specified feature set
  • the classifier training module 560 is configured to train the classifier based on the binary annotation results in the sample training set.
  • the feature word extraction module 540 is configured to extract the specified feature set from the words according to a chi-square test; or, the feature word extraction module 540 is configured to extract the specified feature set from the words according to the information gain.
  • the classifier construction module 550 is configured to construct a naive Bayes classifier from the feature words in the specified feature set, the feature words being independent of one another in the naive Bayes classifier.
  • the classifier training module 560 includes:
  • the statistic sub-module 562 is configured to calculate, for each feature word in the naive Bayes classifier, a first conditional probability that the clause carrying the feature word belongs to the target category according to the binary annotation result in the sample training set. And a second conditional probability that the clause carrying the characteristic word does not belong to the target category;
  • the training sub-module 564 is configured to obtain the trained naive Bayes classifier according to each feature word, the first conditional probability, and the second conditional probability.
  • In summary, the classifier training apparatus provided by this embodiment segments each sample clause in the sample training set to obtain a number of words, extracts a specified feature set from those words, and constructs the classifier according to the feature words in the specified feature set. This solves the problem that the recognition result is inaccurate when short message category analysis relies solely on birthday keywords. Since the feature words in the specified feature set are extracted from the word segmentation results of sample clauses carrying the target keyword, the classifier can make fairly accurate predictions for clauses carrying the target keyword, so the recognition result is more accurate.
  • the feature words are extracted from the clauses of the sample training set by the chi-square test or the information gain, so feature words that contribute more to classification accuracy can be selected, thereby improving the classification accuracy of the naive Bayes classifier.
  • FIG. 7 is a block diagram of a type identifying apparatus according to an exemplary embodiment. As shown in FIG. 7, the type identifying apparatus includes, but is not limited to:
  • the original extraction module 720 is configured to extract a clause carrying the target keyword from the original information
  • the feature extraction module 740 is configured to generate a feature set of the original information according to the feature words in the extracted clause that belong to the specified feature set, the feature words in the specified feature set being extracted from the word segmentation results of sample clauses carrying the target keyword;
  • the feature input module 760 is configured to input the feature set of the original information into the trained classifier for prediction, the classifier being a classifier constructed in advance according to the feature words in the specified feature set;
  • the result obtaining module 780 is configured to acquire a prediction result of the classifier, and the prediction result indicates that the original information belongs to the target category or does not belong to the target category.
  • In summary, the type identifying apparatus provided by this embodiment uses the specified feature set to extract the feature words in a clause as the feature set of the original information, and then inputs that feature set into a trained classifier for prediction, the classifier being constructed in advance according to the feature words in the specified feature set. This solves the problem that the recognition result is inaccurate when short message category analysis relies solely on birthday keywords. Since the feature words in the specified feature set are extracted from the word segmentation results of sample clauses carrying the target keyword, the classifier can make fairly accurate predictions for clauses carrying the target keyword, so the recognition result is more accurate.
  • FIG. 8 is a block diagram of a type identifying apparatus according to an exemplary embodiment. As shown in FIG. 8, the type identifying apparatus includes, but is not limited to:
  • the original extraction module 720 is configured to extract a clause carrying the target keyword from the original information
  • the feature extraction module 740 is configured to generate a feature set of the original information according to the feature words in the extracted clause that belong to the specified feature set, the feature words in the specified feature set being extracted from the word segmentation results of sample clauses carrying the target keyword;
  • the feature input module 760 is configured to input the feature set of the original information into the trained classifier for prediction, and the classifier is a classifier constructed in advance according to the feature words in the specified feature set;
  • the result obtaining module 780 is configured to acquire a prediction result of the classifier, and the prediction result indicates that the original information belongs to the target category or does not belong to the target category.
  • the feature input module 760 includes:
  • the calculation sub-module 762 is configured to input each feature word in the feature set of the original information into the trained naive Bayes classifier, and to calculate a first prediction probability that the original information belongs to the target category and a second prediction probability that the original information does not belong to the target category;
  • the prediction sub-module 764 is configured to predict whether the original information belongs to the target category according to the magnitude relationship between the first prediction probability and the second prediction probability;
  • the trained naive Bayes classifier includes a first conditional probability and a second conditional probability of each feature word, the first conditional probability is a probability that the clause carrying the characteristic word belongs to the target category, and the second conditional probability Is the probability that a clause carrying a characteristic word does not belong to the target category.
  • the device further includes:
  • the information extraction module 790 is configured to extract the target information from the original information when it is predicted that the original information belongs to the target category.
  • the target information is a birthday date
  • the information extraction module 790 is configured to extract a birthday date from the original information by using a regular expression
  • the information extraction module 790 is configured to extract the date of receipt of the original information as a birthday date.
  • In summary, the type identifying apparatus provided by this embodiment uses the specified feature set to extract the feature words in a clause as the feature set of the original information, and then inputs that feature set into a trained classifier for prediction, the classifier being constructed in advance according to the feature words in the specified feature set. This solves the problem that the recognition result is inaccurate when short message category analysis relies solely on birthday keywords. Since the feature words in the specified feature set are extracted from the word segmentation results of sample clauses carrying the target keyword, the classifier can make fairly accurate predictions for clauses carrying the target keyword, so the recognition result is more accurate.
  • the type identification device provided by this embodiment further extracts the target information from the original information after predicting that the original information belongs to the target category, thereby realizing the extraction of target information such as birthday dates and travel dates and providing data support for subsequent features such as automatically generating reminder items or calendar marks.
  • An exemplary embodiment of the present disclosure provides a classifier training apparatus capable of implementing the classifier training method provided by the present disclosure. The classifier training apparatus includes a processor and a memory for storing processor-executable instructions, wherein the processor is configured to:
  • extract a sample clause carrying the target keyword from the sample information;
  • perform binary labeling on the sample clauses according to whether each sample clause belongs to the target category, to obtain a sample training set;
  • segment each sample clause in the sample training set to obtain a number of words;
  • extract a specified feature set from the words, the specified feature set including at least one feature word;
  • construct a classifier according to the feature words in the specified feature set;
  • train the classifier based on the binary labeling results in the sample training set.
  • An exemplary embodiment of the present disclosure provides a type identification device capable of implementing the type identification method provided by the present disclosure. The type identification device includes a processor and a memory for storing processor-executable instructions, wherein the processor is configured to:
  • extract a clause carrying the target keyword from the original information;
  • generate a feature set of the original information according to the feature words in the extracted clause that belong to a specified feature set, the feature words in the specified feature set being extracted from the word segmentation results of sample clauses carrying the target keyword;
  • input the feature set of the original information into the trained classifier for prediction, the classifier being a classifier constructed in advance according to the feature words in the specified feature set;
  • obtain the prediction result of the classifier, the prediction result indicating that the original information belongs to the target category or does not belong to the target category.
  • FIG. 9 is a block diagram of a classifier training device or type identification device, according to an exemplary embodiment.
  • device 900 can be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a gaming console, a tablet device, a medical device, a fitness device, a personal digital assistant, and the like.
  • apparatus 900 can include one or more of the following components: processing component 902, memory 904, power component 906, multimedia component 908, audio component 910, input/output (I/O) interface 912, sensor component 914, and Communication component 916.
  • Processing component 902 typically controls the overall operation of device 900, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations.
  • Processing component 902 can include one or more processors 918 to execute instructions to perform all or part of the steps of the above described methods.
  • processing component 902 can include one or more modules to facilitate interaction between component 902 and other components.
  • processing component 902 can include a multimedia module to facilitate interaction between multimedia component 908 and processing component 902.
  • Memory 904 is configured to store various types of data to support operation at device 900. Examples of such data include instructions for any application or method operating on device 900, contact data, phone book data, messages, pictures, videos, and the like.
  • the memory 904 can be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
  • Power component 906 provides power to various components of device 900.
  • Power component 906 can include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 900.
  • the multimedia component 908 includes a screen that provides an output interface between the device 900 and the user.
  • the screen can include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen can be implemented as a touch screen to receive input signals from the user.
  • the touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor can sense not only the boundaries of the touch or sliding action, but also the duration and pressure associated with the touch or slide operation.
  • the multimedia component 908 includes a front camera and/or a rear camera. When the device 900 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front and rear camera can be a fixed optical lens system or have focal length and optical zoom capabilities.
  • the audio component 910 is configured to output and/or input an audio signal.
  • audio component 910 includes a microphone (MIC) that is configured to receive an external audio signal when device 900 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode.
  • the received audio signal may be further stored in memory 904 or transmitted via communication component 916.
  • the audio component 910 also includes a speaker for outputting an audio signal.
  • the I/O interface 912 provides an interface between the processing component 902 and the peripheral interface module, which may be a keyboard, a click wheel, a button, or the like. These buttons may include, but are not limited to, a home button, a volume button, a start button, and a lock button.
  • Sensor assembly 914 includes one or more sensors for providing device 900 with various aspects of status assessment.
  • sensor component 914 can detect an open/closed state of device 900 and the relative positioning of components, such as the display and keypad of device 900; sensor component 914 can also detect a change in position of device 900 or of one of its components, the presence or absence of user contact with device 900, the orientation or acceleration/deceleration of device 900, and temperature changes of device 900.
  • Sensor assembly 914 can include a proximity sensor configured to detect the presence of nearby objects without any physical contact.
  • Sensor assembly 914 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications.
  • the sensor component 914 can also include an acceleration sensor, a gyro sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
  • Communication component 916 is configured to facilitate wired or wireless communication between device 900 and other devices.
  • the device 900 can access a wireless network based on a communication standard, such as Wi-Fi, 2G or 3G, or a combination thereof.
  • communication component 916 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel.
  • communication component 916 also includes a near field communication (NFC) module to facilitate short range communication.
  • the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
  • device 900 may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above-described classifier training method or type identification method.
  • In an exemplary embodiment, there is also provided a non-transitory computer readable storage medium including instructions, such as the memory 904 including instructions executable by the processor 918 of the apparatus 900, to perform the above-described classifier training method or type identification method.
  • the non-transitory computer readable storage medium can be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, and an optical data storage device.


Abstract

A classifier training method, a type identification method, and devices therefor, belonging to the field of natural language processing. The classifier training method includes: extracting sample clauses carrying a target keyword from sample information (101); performing binary labeling on the sample clauses according to whether each sample clause belongs to a target category, to obtain a sample training set (102); segmenting each sample clause in the sample training set to obtain a number of words (103); extracting a specified feature set from the words, the specified feature set including at least one feature word (104); constructing a classifier according to the feature words in the specified feature set (105); and training the classifier according to the binary labeling results in the sample training set (106). Since the feature words in the specified feature set are extracted from the word segmentation results of the sample clauses carrying the target keyword, the classifier can make fairly accurate predictions for clauses carrying the target keyword, so that the recognition result is relatively accurate.

Description

Classifier training method, type identification method and device
This application is based on and claims priority to Chinese patent application No. 201510511468.1, filed on August 19, 2015, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of natural language processing, and in particular to a classifier training method, a type identification method, and devices therefor.
Background
Recognition and extraction of short message content is a practical application of natural language processing.
Taking the recognition of birthday-related short messages as an example, the related art provides a recognition method in which several keywords are preset, and whether a short message carries a birthday date is identified by checking whether the content of the short message includes all or some of the keywords.
Summary
In order to solve the problem that type identification performed directly with keywords is not accurate, the present disclosure provides a classifier training method, a type identification method, and devices therefor. The technical solutions are as follows:
According to a first aspect of the embodiments of the present disclosure, a classifier training method is provided, the method including:
extracting sample clauses carrying a target keyword from sample information;
performing binary labeling on the sample clauses according to whether each sample clause belongs to a target category, to obtain a sample training set;
segmenting each sample clause in the sample training set to obtain a number of words;
extracting a specified feature set from the words, the specified feature set including at least one feature word;
constructing a classifier according to the feature words in the specified feature set;
training the classifier according to the binary labeling results in the sample training set.
In an optional embodiment, extracting the specified feature set from the words includes:
extracting the specified feature set from the words according to a chi-square test;
or,
extracting the specified feature set from the words according to information gain.
In an optional embodiment, constructing the classifier according to the feature words in the specified feature set includes:
constructing a naive Bayes classifier from the feature words in the specified feature set, the feature words being independent of one another in the naive Bayes classifier.
In an optional embodiment, training the classifier according to the binary labeling results in the sample training set includes:
for each feature word in the naive Bayes classifier, counting, according to the binary labeling results in the sample training set, a first conditional probability that a clause carrying the feature word belongs to the target category, and a second conditional probability that a clause carrying the feature word does not belong to the target category;
obtaining the trained naive Bayes classifier according to the feature words, the first conditional probabilities and the second conditional probabilities.
According to a second aspect of the present disclosure, a type identification method is provided, the method including:
extracting a clause carrying a target keyword from original information;
generating a feature set of the original information according to the feature words in the extracted clause that belong to a specified feature set, the feature words in the specified feature set being extracted from the word segmentation results of sample clauses carrying the target keyword;
inputting the feature set of the original information into a trained classifier for prediction, the classifier being a classifier constructed in advance according to the feature words in the specified feature set;
obtaining a prediction result of the classifier, the prediction result indicating that the original information belongs to a target category or does not belong to the target category.
In an optional embodiment, inputting the feature set of the original information into the trained classifier for prediction includes:
inputting each feature word in the feature set of the original information into the trained naive Bayes classifier, and calculating a first prediction probability that the original information belongs to the target category and a second prediction probability that the original information does not belong to the target category;
predicting whether the original information belongs to the target category according to the magnitude relationship between the first prediction probability and the second prediction probability;
wherein the trained naive Bayes classifier includes a first conditional probability and a second conditional probability for each feature word, the first conditional probability being the probability that a clause carrying the feature word belongs to the target category, and the second conditional probability being the probability that a clause carrying the feature word does not belong to the target category.
In an optional embodiment, the method further includes:
if it is predicted that the original information belongs to the target category, extracting target information from the original information.
In an optional embodiment, the target information is a birthday date;
extracting the target information from the original information includes:
extracting the birthday date from the original information by means of a regular expression;
or,
extracting the receipt date of the original information as the birthday date.
According to a third aspect of the present disclosure, a classifier training device is provided, the device including:
a clause extraction module configured to extract sample clauses carrying a target keyword from sample information;
a clause labeling module configured to perform binary labeling on the sample clauses according to whether each sample clause belongs to a target category, to obtain a sample training set;
a clause segmentation module configured to segment each sample clause in the sample training set to obtain a number of words;
a feature word extraction module configured to extract a specified feature set from the words, the specified feature set including at least one feature word;
a classifier construction module configured to construct a classifier according to the feature words in the specified feature set;
a classifier training module configured to train the classifier according to the binary labeling results in the sample training set.
In an optional embodiment, the feature word extraction module is configured to extract the specified feature set from the words according to a chi-square test; or, the feature word extraction module is configured to extract the specified feature set from the words according to information gain.
In an optional embodiment, the classifier construction module is configured to construct a naive Bayes classifier from the feature words in the specified feature set, the feature words being independent of one another in the naive Bayes classifier.
In an optional embodiment, the classifier training module includes:
a statistics submodule configured to count, for each feature word in the naive Bayes classifier and according to the binary labeling results in the sample training set, a first conditional probability that a clause carrying the feature word belongs to the target category, and a second conditional probability that a clause carrying the feature word does not belong to the target category;
a training submodule configured to obtain the trained naive Bayes classifier according to the feature words, the first conditional probabilities and the second conditional probabilities.
According to a fourth aspect of the present disclosure, a type identification device is provided, the device including:
an original extraction module configured to extract a clause carrying a target keyword from original information;
a feature extraction module configured to generate a feature set of the original information according to the feature words in the extracted clause that belong to a specified feature set, the feature words in the specified feature set being extracted from the word segmentation results of sample clauses carrying the target keyword;
a feature input module configured to input the feature set of the original information into a trained classifier for prediction, the classifier being a classifier constructed in advance according to the feature words in the specified feature set;
a result obtaining module configured to obtain a prediction result of the classifier, the prediction result indicating that the original information belongs to a target category or does not belong to the target category.
In an optional embodiment, the feature input module includes:
a calculation submodule configured to input each feature word in the feature set of the original information into the trained naive Bayes classifier, and to calculate a first prediction probability that the original information belongs to the target category and a second prediction probability that the original information does not belong to the target category;
a prediction submodule configured to predict whether the original information belongs to the target category according to the magnitude relationship between the first prediction probability and the second prediction probability;
wherein the trained naive Bayes classifier includes a first conditional probability and a second conditional probability for each feature word, the first conditional probability being the probability that a clause carrying the feature word belongs to the target category, and the second conditional probability being the probability that a clause carrying the feature word does not belong to the target category.
In an optional embodiment, the device further includes:
an information extraction module configured to extract target information from the original information when it is predicted that the original information belongs to the target category.
In an optional embodiment, the target information is a birthday date;
the information extraction module is configured to extract the birthday date from the original information by means of a regular expression;
or,
the information extraction module is configured to extract the receipt date of the original information as the birthday date.
According to a fifth aspect of the present disclosure, a classifier training device is provided, the device including:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to:
extract sample clauses carrying a target keyword from sample information;
perform binary labeling on the sample clauses according to whether each sample clause belongs to a target category, to obtain a sample training set;
segment each sample clause in the sample training set to obtain a number of words;
extract a specified feature set from the words, the specified feature set including at least one feature word;
construct a classifier according to the feature words in the specified feature set;
train the classifier according to the binary labeling results in the sample training set.
According to a sixth aspect of the present disclosure, a type identification device is provided, the device including:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to:
extract a clause carrying a target keyword from original information;
generate a feature set of the original information according to the feature words in the extracted clause that belong to a specified feature set, the feature words in the specified feature set being extracted from the word segmentation results of sample clauses carrying the target keyword;
input the feature set of the original information into a trained classifier for prediction, the classifier being a classifier constructed in advance according to the feature words in the specified feature set;
obtain a prediction result of the classifier, the prediction result indicating that the original information belongs to a target category or does not belong to the target category.
The technical solutions provided by the embodiments of the present disclosure may include the following beneficial effects:
Each sample clause in the sample training set is segmented to obtain a number of words, a specified feature set is extracted from these words, and a classifier is constructed according to the feature words in the specified feature set. This solves the problem that the recognition result is inaccurate when short message category analysis relies solely on birthday keywords. Since the feature words in the specified feature set are extracted from the word segmentation results of sample clauses carrying the target keyword, the classifier can make fairly accurate predictions for clauses carrying the target keyword, so the recognition result is more accurate.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory and do not limit the present disclosure.
Brief Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the present disclosure.
FIG. 1 is a flowchart of a classifier training method according to an exemplary embodiment;
FIG. 2 is a flowchart of a classifier training method according to another exemplary embodiment;
FIG. 3 is a flowchart of a type identification method according to an exemplary embodiment;
FIG. 4 is a flowchart of a type identification method according to another exemplary embodiment;
FIG. 5 is a block diagram of a classifier training device according to an exemplary embodiment;
FIG. 6 is a block diagram of a classifier training device according to another exemplary embodiment;
FIG. 7 is a block diagram of a type identification device according to an exemplary embodiment;
FIG. 8 is a block diagram of a type identification device according to another exemplary embodiment;
FIG. 9 is a block diagram of a device for classifier training or type identification according to an exemplary embodiment.
Detailed Description
Exemplary embodiments will be described in detail here, examples of which are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure; rather, they are merely examples of devices and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
Because of the diversity and complexity of natural language expression, type identification that directly uses target keywords is not accurate. For example, short messages carrying the target keyword "生日" (birthday) or "出生" (born) include:
SMS 1: "Xiao Min, tomorrow is not his birthday, so don't buy a cake."
SMS 2: "Dear, is today your birthday?"
SMS 3: "My son was born on this day last year."
SMS 4: "Babies born on May 20 all have good luck."
Of the above four short messages, only the third one carries a valid birthday date; the other three do not.
In order to identify the type of a short message accurately, an embodiment of the present disclosure provides a classifier-based identification method. The identification method includes two stages: a first stage in which the classifier is trained, and a second stage in which the classifier is used for type identification.
The two stages are described below with different embodiments.
First stage: the stage of training the classifier.
FIG. 1 is a flowchart of a classifier training method according to an exemplary embodiment. The method includes the following steps.
In step 101, sample clauses carrying a target keyword are extracted from sample information.
Optionally, the sample information is any one of short messages, e-mails, microblogs or instant messaging messages. The embodiments of the present disclosure do not limit the category of the sample information.
Each piece of sample information includes at least one clause. A clause carrying the target keyword is a sample clause.
In step 102, binary labeling is performed on the sample clauses according to whether each sample clause belongs to a target category, to obtain a sample training set.
In step 103, each sample clause in the sample training set is segmented to obtain a number of words.
In step 104, a specified feature set is extracted from the words, the specified feature set including at least one feature word.
In step 105, a classifier is constructed according to the feature words in the specified feature set.
Optionally, the classifier is a naive Bayes classifier.
In step 106, the classifier is trained according to the binary labeling results in the sample training set.
In summary, in the classifier training method provided by this embodiment, each sample clause in the sample training set is segmented to obtain a number of words, a specified feature set is extracted from these words, and a classifier is constructed according to the feature words in the specified feature set. This solves the problem that the recognition result is inaccurate when short message category analysis relies solely on birthday keywords. Since the feature words in the specified feature set are extracted from the word segmentation results of sample clauses carrying the target keyword, the classifier can make fairly accurate predictions for clauses carrying the target keyword, so the recognition result is more accurate.
图2是根据另一示例性实施例示出的一种分类器训练方法的流程图。该方法包括如下步骤。
在步骤201中,获取若干条携带有目标关键字的样本信息。
目标关键词是与目标类别有关的关键词。以目标类别是携带有有效生日日期的信息为例,目标关键词包括:“生日”和“出生”。
携带有目标关键字的样本信息越多,训练得到的分类器越准确。在样本信息的类别是短信时,示意性的,样本信息包括:
样本短信1:“小敏,明天不是他的生日,你不要买蛋糕了。”
样本短信2:“亲,今天是你的生日吗?”
样本短信3:“我儿子是去年的今天出生的。”
样本短信4:“5月20日出生的宝宝都有好运。”
样本短信5:“我儿子出生那天,正好是4月1日愚人节。”
….诸如此类,不再一一列举。
在步骤202中,从样本信息中提取携带有目标关键字的样本分句。
每条样本信息包括至少一个分句。一个分句是指未被标点符号所隔开的句子。比如:
从样本短信1中提取出样本分句1:“明天不是他的生日”;
从样本短信2中提取出样本分句2:“今天是你的生日吗”;
从样本短信3中提取出样本分句3:“我儿子是去年的今天出生的”;
从样本短信4中提取出样本分句4:“5月20日出生的宝宝都有好运”;
从样本短信5中提取出样本分句5:“我儿子出生那天”。
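为便于理解步骤201~202,下面给出一段示意性的Python代码草图,演示按标点符号把信息切分为分句并保留携带目标关键词的分句;其中的函数名、正则表达式和关键词元组均为本说明新增的假设性示例,并非对本公开实现方式的限定:

```python
import re

# 假设的目标关键词,可按具体的目标类别替换
TARGET_KEYWORDS = ("生日", "出生")

def extract_sample_clauses(message, keywords=TARGET_KEYWORDS):
    """从一条样本信息中提取携带目标关键词的样本分句(示意性实现)。"""
    # 按常见的中英文标点把信息切分为分句
    clauses = [c for c in re.split(r"[,。!?;、::,.!?;\s]+", message) if c]
    # 仅保留携带任一目标关键词的分句
    return [c for c in clauses if any(k in c for k in keywords)]

if __name__ == "__main__":
    print(extract_sample_clauses("小敏,明天不是他的生日,你不要买蛋糕了。"))
    # 预期输出:['明天不是他的生日']
```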
在步骤203中,根据每条样本分句是否属于目标类别,对样本分句进行二值标注,得到样本训练集。
可选地,二值标注的标注值是1或0,在样本分句属于目标类别时,标注为1;在样本分句不属于目标类别时,标注为0。
比如,样本分句1的标注为0、样本分句2的标注为0、样本分句3的标注为1、样本分句4的标注为0、样本分句5的标注为1。
样本训练集包括多个样本分句。
在步骤204中,对样本训练集中的每个样本分句进行分词,得到若干个词语。
比如,将样本分句1进行分词,得到“明天”、“不是”、“他”、“的”、“生日”共5个词;将样本分句2进行分词,得到“今天”、“是”、“你”、“的”、“生日”、“吗”共6个词;将样本分句3进行分词,得到“我”、“儿子”、“是”、“去年”、“的”、“今天”、“出生”、“的”共8个词;将样本分句4进行分词,得到“5月”、“20日”、“出生”、“的”、“宝宝”、“都有”、“好运”共7个词;将样本分句5进行分词,得到“我”、“儿子”、“出生”、“那天”共4个词。
也即,若干个词包括:“明天”、“不是”、“他”、“的”、“生日”、“今天”、“是”、“你”、“吗”、“我”、“儿子”、“去年”、“出生”、“5月”、“20日”、“宝宝”、“都有”、“好运”、“那天”等。
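步骤203~204的二值标注与分词过程,可以用如下示意性的Python代码草图表示。这里假设使用开源中文分词工具jieba(其lcut接口返回词语列表),实际切分结果取决于所用分词器及其词典,可能与上文列举的分词示例略有出入:

```python
import jieba  # 假设使用开源中文分词库 jieba,也可替换为其他分词工具

# 样本训练集:每个元素为(样本分句, 二值标注),1表示属于目标类别,0表示不属于
sample_training_set = [
    ("明天不是他的生日", 0),
    ("今天是你的生日吗", 0),
    ("我儿子是去年的今天出生的", 1),
    ("5月20日出生的宝宝都有好运", 0),
    ("我儿子出生那天", 1),
]

# 对样本训练集中的每个样本分句进行分词,得到若干个词语
segmented_training_set = [(jieba.lcut(clause), label)
                          for clause, label in sample_training_set]

for words, label in segmented_training_set:
    print(label, words)
```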
在步骤205中,根据卡方检验或信息增益从若干个词语中提取出指定特征集合。
由于分词得到的若干个词,有些词语的重要性较高,有些词语的重要性较低,并不是所有的词都适合作为特征词。所以本步骤可以采用两种不同的方式提取特征词。
第一种方式,根据卡方检验从若干个词语中提取出与目标类别的相关性排名前n位的特征词,形成指定特征集合F。
卡方检验可以检测出每个词语与目标类别的相关性。相关性越高,越适合作为与该目标类别对应的特征词。
示意性的,一种卡方检验提取特征词的方法如下:
1.1统计样本训练集中的样本分句总数N。
1.2统计每个词在属于目标类别的样本分句中的出现频率A、不属于目标类别的样本分句中的出现频率B、在属于目标类别的样本分句中的不出现频率C、在不属于目标类别的样本分句中的不出现频率D。
1.3计算每个词的卡方值如下:
$$\chi^2=\frac{N\,(AD-BC)^2}{(A+B)(C+D)(A+C)(B+D)}$$
1.4将每个词按照各自的卡方值由大到小进行排序,选取前n个词作为特征词。
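上述步骤1.1~1.4可以落实为如下示意性的Python代码草图,输入为前文草图得到的(分词结果, 标注)列表;其中A、B、C、D与N的含义与正文一致,函数名与具体数据结构均为本说明新增的假设:

```python
from collections import defaultdict

def chi_square_select(segmented_training_set, n):
    """根据卡方值从分词结果中选取前n个特征词(示意性实现)。"""
    N = len(segmented_training_set)                          # 1.1 样本分句总数
    pos_total = sum(1 for _, label in segmented_training_set if label == 1)
    neg_total = N - pos_total

    appear = defaultdict(lambda: [0, 0])                     # 词 -> [A, B]
    for words, label in segmented_training_set:              # 1.2 统计出现频率
        for w in set(words):
            if label == 1:
                appear[w][0] += 1                            # A:在属于目标类别的分句中出现
            else:
                appear[w][1] += 1                            # B:在不属于目标类别的分句中出现

    scores = {}
    for w, (A, B) in appear.items():
        C = pos_total - A                                    # C:在属于目标类别的分句中不出现
        D = neg_total - B                                    # D:在不属于目标类别的分句中不出现
        denom = (A + B) * (C + D) * (A + C) * (B + D)
        scores[w] = 0.0 if denom == 0 else N * (A * D - B * C) ** 2 / denom  # 1.3 卡方值

    # 1.4 按卡方值由大到小排序,选取前n个词作为特征词
    return [w for w, _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]]
```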
第二种方式,根据信息增益从若干个词语中提取出信息增益值排名前n位的特征词,形成指定特征集合F。
信息增益用于表示词语相对于样本训练集的信息量,该词语携带的信息量越多,越适合作为特征词。
示意性的,一种信息增益提取特征词的方法如下:
2.1统计属于目标类别的样本分句的个数N1、不属于目标类别的样本分句的个数N2。
2.2统计每个词在属于目标类别的样本分句中的出现频率A、不属于目标类别的样本分句中的出现频率B、在属于目标类别的样本分句中的不出现频率C、在不属于目标类别的样本分句中的不出现频率D。
2.3计算信息熵
$$H=-\frac{N_1}{N_1+N_2}\log\frac{N_1}{N_1+N_2}-\frac{N_2}{N_1+N_2}\log\frac{N_2}{N_1+N_2}$$
2.4计算每个词的信息增益值
$$IG(t)=H-\frac{A+B}{N_1+N_2}\,H(t)-\frac{C+D}{N_1+N_2}\,H(\bar{t})$$
其中 $H(t)=-\frac{A}{A+B}\log\frac{A}{A+B}-\frac{B}{A+B}\log\frac{B}{A+B}$ 为该词出现时的条件熵,$H(\bar{t})$ 由C、D按同样方式计算。
2.5将每个词按照信息增益值从大到小排序,选取前n个词作为特征词。
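同样地,步骤2.1~2.5可以写成如下示意性的Python代码草图;对数底数取2只是其中一种常见选择,本段代码中的函数名与变量命名均为本说明新增的假设:

```python
import math
from collections import defaultdict

def _entropy_term(p):
    """信息熵中的单项 -p*log2(p),约定 p=0 时取 0。"""
    return 0.0 if p <= 0 else -p * math.log2(p)

def info_gain_select(segmented_training_set, n):
    """根据信息增益值从分词结果中选取前n个特征词(示意性实现)。"""
    N1 = sum(1 for _, label in segmented_training_set if label == 1)   # 2.1
    N2 = sum(1 for _, label in segmented_training_set if label == 0)
    N = N1 + N2
    H = _entropy_term(N1 / N) + _entropy_term(N2 / N)                  # 2.3 信息熵

    appear = defaultdict(lambda: [0, 0])                               # 词 -> [A, B]
    for words, label in segmented_training_set:                        # 2.2
        for w in set(words):
            appear[w][0 if label == 1 else 1] += 1

    gains = {}
    for w, (A, B) in appear.items():
        C, D = N1 - A, N2 - B
        # 2.4 该词出现(A+B个分句)与不出现(C+D个分句)两种情况下的条件熵加权
        h_t = _entropy_term(A / (A + B)) + _entropy_term(B / (A + B))
        h_not = 0.0 if C + D == 0 else (_entropy_term(C / (C + D)) +
                                        _entropy_term(D / (C + D)))
        gains[w] = H - (A + B) / N * h_t - (C + D) / N * h_not

    # 2.5 按信息增益值从大到小排序,选取前n个词作为特征词
    return [w for w, _ in sorted(gains.items(), key=lambda kv: kv[1], reverse=True)[:n]]
```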
在步骤206中,根据指定特征集合中的特征词构建朴素贝叶斯分类器,各个特征词在朴素贝叶斯分类器中互相独立。
朴素贝叶斯分类器是一种基于每个特征词的第一条件概率和第二条件概率进行预测的分类器。对于任意一个特征词,第一条件概率是携带有特征词的分句属于目标类别的概率,第二条件概率是携带有特征词的分句不属于目标类别的概率。
训练朴素贝叶斯分类器的过程,需要根据样本训练集计算出每个特征词的第一条件概率和第二条件概率。
比如,携带有特征词“今天”的样本分句有100个,其中属于目标类别的样本分句有73个,不属于目标类别的样本分句有27个,则特征词“今天”的第一条件概率为0.73,第二条件概率为0.27。
在步骤207中,对于朴素贝叶斯分类器中的每个特征词,根据样本训练集中的二值标注结果,统计出携带有特征词的分句属于目标类别的第一条件概率,和,携带有特征词的分句不属于目标类别的第二条件概率;
在步骤208中,根据各个特征词、第一条件概率和第二条件概率,得到训练后的朴素贝叶斯分类器。
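步骤207~208的统计与训练过程,可以概括为如下示意性的Python代码草图:对每个特征词,统计携带该词的分句中属于/不属于目标类别的占比,即得到第一、第二条件概率。函数名与数据结构均为本说明新增的假设,实际实现中通常还会加入平滑等处理:

```python
def train_naive_bayes(segmented_training_set, feature_words):
    """统计每个特征词的第一、第二条件概率,得到训练后的朴素贝叶斯分类器(示意性实现)。"""
    counts = {w: [0, 0] for w in feature_words}   # 词 -> [属于目标类别的分句数, 不属于的分句数]
    for words, label in segmented_training_set:
        for w in set(words):
            if w in counts:
                counts[w][0 if label == 1 else 1] += 1

    model = {}
    for w, (pos, neg) in counts.items():
        total = pos + neg
        if total == 0:
            continue                               # 特征词在训练集中未出现时跳过(或做平滑)
        # 第一条件概率:携带该词的分句属于目标类别的占比;第二条件概率:不属于目标类别的占比
        model[w] = (pos / total, neg / total)
    return model

# 例如,携带“今天”的分句共100个,其中73个属于目标类别,则得到 {"今天": (0.73, 0.27)}
```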
综上所述,本实施例提供的分类器训练方法,通过对样本训练集中的每个样本分句进行分词得到若干个词语,从该若干个词语中提取出指定特征集合,根据指定特征集合中的特征词构建分类器;解决了单纯使用生日关键字进行短信类别分析时,识别结果不准确的问题;由于指定特征集合中的特征词是根据携带有目标关键词的样本分句的分词结果所提取得到的,所以该分类器能够对携带有目标关键词的分句做出较为准确的预测,达到了识别结果较为准确的效果。
本实施例还通过卡方检验或信息增益从样本训练集的各个分句中提取特征词,能够提取出对分类准确性有较佳作用的特征词,从而提高朴素贝叶斯分类器的分类准确性。
第二阶段,使用分类器进行类型识别的阶段。
图3是根据一示例性实施例示出的一种类型识别方法的流程图。该类型识别方法所使用的分类器是图1或图2实施例所训练得到的分类器。该方法包括如下步骤。
在步骤301中,从原始信息中提取携带有目标关键字的分句。
可选地,原始信息是短信、邮件、微博或即时通信信息中的任意一种。本公开实施例对原始信息的类别不作限定。每条原始信息包括至少一个分句。
在步骤302中,根据提取出的分句中属于指定特征集合的特征词,生成原始信息的特征集合,指定特征集合中的特征词是根据携带有目标关键词的样本分句的分词结果所提取得到的。
在步骤303中,将原始信息的特征集合输入训练后的分类器中进行预测,分类器是预先根据指定特征集合中的特征词构建的分类器。
可选地,该分类器是朴素贝叶斯分类器。
在步骤304中,获取分类器的预测结果,该预测结果表征原始信息属于目标类别或不属于目标类别。
综上所述,本实施例提供的类型识别方法,通过指定特征集合来提取分句中的特征词,作为原始信息的特征集合,然后将该特征集合输入至训练后的分类器中预测,该分类器是预先根据指定特征集合中的特征词构建的分类器;解决了单纯使用生日关键字进行短信类别分析时,识别结果不准确的问题;由于指定特征集合中的特征词是根据携带有目标关键词的样本分句的分词结果所提取得到的,所以该分类器能够对携带有目标关键词的分句做出较为准确的预测,达到了识别结果较为准确的效果。
图4是根据另一示例性实施例示出的另一种类型识别方法的流程图。该类型识别方法所使用的分类器是图1或图2实施例所训练得到的分类器。该方法包括如下步骤。
在步骤401中,检测原始信息是否包括目标关键字;
可选地,原始信息是短信。比如,原始信息是“我的生日是7月28日,今天不是我的生日呦!”。
目标关键词是与目标类别有关的关键词。以目标类别是携带有有效生日日期的信息为例,目标关键词包括:“生日”和“出生”。
检测原始信息是否包括目标关键词;若包括,则进入步骤402;若不包括,则不做后续处理。
在步骤402中,若原始信息包括目标关键字,则从原始信息中提取携带有目标关键字的分句。
比如,原始信息包括目标关键字“生日”,则从原始信息中提取出分句“我的生日是7月28日”。
在步骤403中,根据提取出的分句中属于指定特征集合的特征词,生成原始信息的特征集合,指定特征集合中的特征词是根据携带有目标关键词的样本分句的分词结果所提取得到的;
比如,指定特征集合包括:“明天”、“不是”、“他”、“的”、“生日”、“今天”、“是”、“你”、“吗”、“我”、“儿子”、“去年”、“出生”、“那天”等特征词。
分句“我的生日是7月28日”中属于指定特征集合的特征词包括:“我”、“的”、“生日”、“是”。将“我”、“的”、“生日”、“是”这4个词作为原始信息的特征集合。
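步骤401~403的处理可以用如下示意性的Python代码草图表示,其中指定特征集合直接取正文列举的示例词,分词仍假设使用jieba;函数名与集合内容均为假设性示例,代码对所有携带关键词的分句的特征词取并集:

```python
import re
import jieba

# 正文示例中的指定特征集合(实际由训练阶段的特征词提取步骤得到)
SPECIFIED_FEATURES = {"明天", "不是", "他", "的", "生日", "今天", "是", "你",
                      "吗", "我", "儿子", "去年", "出生", "那天"}

def build_feature_set(raw_message, keywords=("生日", "出生"),
                      specified_features=SPECIFIED_FEATURES):
    """提取携带目标关键词的分句,取其分词结果与指定特征集合的交集作为原始信息的特征集合。"""
    clauses = [c for c in re.split(r"[,。!?;、::,.!?;\s]+", raw_message) if c]
    feature_words = []
    for clause in clauses:
        if any(k in clause for k in keywords):
            feature_words += [w for w in jieba.lcut(clause) if w in specified_features]
    # 去重并保持出现顺序
    return list(dict.fromkeys(feature_words))

if __name__ == "__main__":
    print(build_feature_set("我的生日是7月28日,今天不是我的生日呦!"))
```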
在步骤404中,将原始信息的特征集合中的每个特征词,输入训练后的朴素贝叶斯分类器中,计算原始信息属于目标类别的第一预测概率和原始信息不属于目标类别的第二预测概率;
其中,训练后的朴素贝叶斯分类器中包括每个特征词的第一条件概率和第二条件概率,第一条件概率是携带有特征词的分句属于目标类别的概率,第二条件概率是携带有特征词的分句不属于目标类别的概率。
原始信息的第一预测概率,等于原始信息的特征集合中的各个特征词的第一条件概率的乘积。
比如,“我”的第一条件概率是0.6、“的”的第一条件概率是0.5、“生日”的第一条件概率是0.65、“是”的第一条件概率是0.7,则原始信息的第一预测概率=0.6*0.5*0.65*0.7=0.1365。
原始信息的第二预测概率,等于原始信息的特征集合中的各个特征词的第二条件概率的乘积。
比如,“我”的第二条件概率是0.4、“的”的第二条件概率是0.5、“生日”的第二条件概率是0.35、“是”的第二条件概率是0.3,则原始信息的第二预测概率=0.4*0.5*0.35*0.3=0.021。
在步骤405中,根据第一预测概率和第二预测概率的大小关系,预测原始信息是否属于目标类别;
在第一预测概率大于第二预测概率时,预测结果为原始信息属于目标类别。
比如,0.1365>0.021,所以原始信息属于目标类别,也即原始信息是携带有有效生日日期的信息。
在第二预测概率大于第一预测概率时,预测结果为原始信息不属于目标类别。
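步骤404~405中两个预测概率的计算与比较,可以用如下示意性的Python代码草图表示;model为前文训练代码得到的“词→(第一条件概率, 第二条件概率)”映射,实际实现中常改用对数概率累加以避免数值下溢:

```python
def predict(feature_set, model):
    """比较第一预测概率与第二预测概率,预测原始信息是否属于目标类别(示意性实现)。"""
    p_target, p_other = 1.0, 1.0
    for w in feature_set:
        if w in model:                       # 模型中没有的词直接跳过,也可做平滑处理
            p_target *= model[w][0]          # 第一预测概率:各特征词第一条件概率的乘积
            p_other *= model[w][1]           # 第二预测概率:各特征词第二条件概率的乘积
    return p_target > p_other, p_target, p_other

if __name__ == "__main__":
    model = {"我": (0.6, 0.4), "的": (0.5, 0.5), "生日": (0.65, 0.35), "是": (0.7, 0.3)}
    print(predict(["我", "的", "生日", "是"], model))
    # 第一预测概率约0.1365 > 第二预测概率约0.021,判定属于目标类别
```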
在步骤406中,若预测出原始信息属于目标类别,则从原始信息中提取目标信息。
本步骤可以采用如下任意一种实现方式:
第一,通过正则表达式从原始信息中提取生日日期。
第二,将原始信息的接收日期提取为生日日期。
第三,尝试通过正则表达式从原始信息中提取生日日期;若无法通过正则表达式提取出生日日期,则将原始信息的接收日期提取为生日日期。
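对于上述第三种实现方式,可参考如下示意性的Python代码草图:先用正则表达式匹配常见的日期写法,匹配不到时退回使用信息的接收日期。正则表达式、日期格式与函数名均为假设性示例,实际应用中的日期写法远比此处丰富:

```python
import re
from datetime import date

def extract_birthday(raw_message, receive_date):
    """先尝试用正则表达式提取生日日期,无法提取时取信息的接收日期(示意性实现)。"""
    m = re.search(r"(?:(\d{4})年)?(\d{1,2})月(\d{1,2})日", raw_message)
    if m:
        year = m.group(1) or str(receive_date.year)        # 未写年份时假设为接收当年
        return "{}-{:02d}-{:02d}".format(year, int(m.group(2)), int(m.group(3)))
    return receive_date.isoformat()                         # 匹配失败时退回接收日期

if __name__ == "__main__":
    print(extract_birthday("我的生日是7月28日", date(2015, 8, 19)))   # 2015-07-28
    print(extract_birthday("今天是我儿子的生日", date(2015, 8, 19)))  # 2015-08-19
```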
综上所述,本实施例提供的类型识别方法,通过指定特征集合来提取分句中的特征词,作为原始信息的特征集合,然后将该特征集合输入至训练后的分类器中预测,该分类器是预先根据指定特征集合中的特征词构建的分类器;解决了单纯使用生日关键字进行短信类别分析时,识别结果不准确的问题;由于指定特征集合中的特征词是根据携带有目标关键词的样本分句的分词结果所提取得到的,所以该分类器能够对携带有目标关键词的分句做出较为准确的预测,达到了识别结果较为准确的效果。
本实施例提供的类型识别方法,还通过在预测出原始信息属于目标类别后,从原始信息中提取目标信息,实现对生日日期、出行日期之类的目标信息的提取,为后续自动生成提醒事项、日历标记等功能提供数据支持。
需要补充说明的是,上述实施例均以目标类别是携带有有效生日日期的信息为举例说明,但是上述方法的应用不限定在这单一目标类别。目标类别还可以是携带有有效出行日期的信息、携带有有效放假日期的信息等等。
下述为本公开装置实施例,可以用于执行本公开方法实施例。对于本公开装置实施例中未披露的细节,请参照本公开方法实施例。
图5是根据一示例性实施例示出的一种分类器训练装置的框图,如图5所示,该分类器训练装置包括但不限于:
分句提取模块510,被配置为从样本信息中提取携带有目标关键字的样本分句;
分句标注模块520,被配置为根据每条样本分句是否属于目标类别,对样本分句进行二值标注,得到样本训练集;
分句分词模块530,被配置为对样本训练集中的每个样本分句进行分词,得到若干个词语;
特征词提取模块540,被配置为从若干个词语中提取出指定特征集合,指定特征集合包括至少一个特征词;
分类器构建模块550,被配置为根据指定特征集合中的特征词构建分类器;
分类器训练模块560,被配置为根据样本训练集中的二值标注结果对分类器进行训练。
综上所述,本实施例提供的分类器训练装置,通过对样本训练集中的每个样本分句进行分词得到若干个词语,从该若干个词语中提取出指定特征集合,根据指定特征集合中的特征词构建分类器;解决了单纯使用生日关键字进行短信类别分析时,识别结果不准确的问题;由于指定特征集合中的特征词是根据携带有目标关键词的样本分句的分词结果所提取得到的,所以该分类器能够对携带有目标关键词的分句做出较为准确的预测,达到了识别结果较为准确的效果。
图6是根据一示例性实施例示出的一种分类器训练装置的框图,如图6所示,该分类器训练装置包括但不限于:
分句提取模块510,被配置为从样本信息中提取携带有目标关键字的样本分句;
分句标注模块520,被配置为根据每条样本分句是否属于目标类别,对样本分句进行二值标注,得到样本训练集;
分句分词模块530,被配置为对样本训练集中的每个样本分句进行分词,得到若干个词语;
特征词提取模块540,被配置为从若干个词语中提取出指定特征集合,指定特征集合包括至少一个特征词;
分类器构建模块550,被配置为根据指定特征集合中的特征词构建分类器;
分类器训练模块560,被配置为根据样本训练集中的二值标注结果对分类器进行训练。
可选地,特征词提取模块540,被配置为根据卡方检验从若干个词语中提取出指定特征集合;或,特征词提取模块540,被配置为根据信息增益从若干个词语中提取出指定特征集合。
可选地,分类器构建模块550,被配置为根据指定特征集合中的特征词构建朴素贝叶斯分类器,各个特征词在朴素贝叶斯分类器中互相独立。
可选地,分类器训练模块560,包括:
统计子模块562,被配置为对于朴素贝叶斯分类器中的每个特征词,根据样本训练集中的二值标注结果,统计出携带有特征词的分句属于目标类别的第一条件概率,和,携带有特征词的分句不属于目标类别的第二条件概率;
训练子模块564,被配置为根据各个特征词、第一条件概率和第二条件概率,得到训练后的朴素贝叶斯分类器。
综上所述,本实施例提供的分类器训练装置,通过对样本训练集中的每个样本分句进行分词得到若干个词语,从该若干个词语中提取出指定特征集合,根据指定特征集合中的特征词构建分类器;解决了单纯使用生日关键字进行短信类别分析时,识别结果不准确的问题;由于指定特征集合中的特征词是根据携带有目标关键词的样本分句的分词结果所提取得到的,所以该分类器能够对携带有目标关键词的分句做出较为准确的预测,达到了识别结果较为准确的效果。
本实施例还通过卡方检验或信息增益从样本训练集的各个分句中提取特征词,能够提取出对分类准确性有较佳作用的特征词,从而提高朴素贝叶斯分类器的分类准确性。
图7是根据一示例性实施例示出的一种类型识别装置的框图,如图7所示,该类型识别装置包括但不限于:
原始提取模块720,被配置为从原始信息中提取携带有目标关键字的分句;
特征提取模块740,被配置为根据提取出的分句中属于指定特征集合的特征词,生成原始信息的特征集合,指定特征集合中的特征词是根据携带有目标关键词的样本分句的分词结果所提取得到的;
特征输入模块760,被配置为将原始信息的特征集合输入训练后的分类器中进行预测,分类器是预先根据指定特征集合中的特征词构建的分类器;
结果获取模块780,被配置为获取分类器的预测结果,预测结果表征原始信息属于目标类别或不属于目标类别。
综上所述,本实施例提供的类型识别装置,通过指定特征集合来提取分句中的特征词,作为原始信息的特征集合,然后将该特征集合输入至训练后的分类器中预测,该分类器是预先根据指定特征集合中的特征词构建的分类器;解决了单纯使用生日关键字进行短信类别分析时,识别结果不准确的问题;由于指定特征集合中的特征词是根据携带有目标关键词的样本分句的分词结果所提取得到的,所以该分类器能够对携带有目标关键词的分句做出较为准确的预测,达到了识别结果较为准确的效果。
图8是根据一示例性实施例示出的一种类型识别装置的框图,如图8所示,该类型识别装置包括但不限于:
原始提取模块720,被配置为从原始信息中提取携带有目标关键字的分句;
特征提取模块740,被配置为根据提取出的分句中属于指定特征集合的特征词,生成原始信息的特征集合,指定特征集合中的特征词是根据携带有目标关键词的样本分句的分词结果所提取得到的;
特征输入模块760,被配置为将原始信息的特征集合输入训练后的分类器中进行预测,分类器是预先根据指定特征集合中的特征词构建的分类器;
结果获取模块780,被配置为获取分类器的预测结果,预测结果表征原始信息属于目标类别或不属于目标类别。
可选地,特征输入模块760,包括:
计算子模块762,被配置为将原始信息的特征集合中的每个特征词,输入训练后的朴素贝叶斯分类器中,计算原始信息属于目标类别的第一预测概率和原始信息不属于目标类别的第二预测概率;
预测子模块764,被配置为根据第一预测概率和第二预测概率的大小关系,预测原始信息是否属于目标类别;
其中,训练后的朴素贝叶斯分类器中包括每个特征词的第一条件概率和第二条件概率,第一条件概率是携带有特征词的分句属于目标类别的概率,第二条件概率是携带有特征词的分句不属于目标类别的概率。
可选地,该装置还包括:
信息提取模块790,被配置为在预测出原始信息属于目标类别时,从原始信息中提取目标信息。
可选地,目标信息是生日日期;
信息提取模块790,被配置为通过正则表达式从原始信息中提取生日日期;
或,
信息提取模块790,被配置为将原始信息的接收日期提取为生日日期。
综上所述,本实施例提供的类型识别装置,通过指定特征集合来提取分句中的特征词,作为原始信息的特征集合,然后将该特征集合输入至训练后的分类器中预测,该分类器是预先根据指定特征集合中的特征词构建的分类器;解决了单纯使用生日关键字进行短信类别分析时,识别结果不准确的问题;由于指定特征集合中的特征词是根据携带有目标关键词的样本分句的分词结果所提取得到的,所以该分类器能够对携带有目标关键词的分句做出较为准确的预测,达到了识别结果较为准确的效果。
本实施例提供的类型识别装置,还通过在预测出原始信息属于目标类别后,从原始信息中提取目标信息,实现对生日日期、出行日期之类的目标信息的提取,为后续自动生成提醒事项、日历标记等功能提供数据支持。
关于上述实施例中的装置,其中各个模块执行操作的具体方式已经在有关该方法的实施例中进行了详细描述,此处将不做详细阐述说明。
本公开一示例性实施例提供了一种分类器训练装置,能够实现本公开提供的分类器训练方法,该分类器训练装置包括:处理器、用于存储处理器可执行指令的存储器;其中,处理器被配置为:
从样本信息中提取携带有目标关键字的样本分句;
根据每条样本分句是否属于目标类别,对样本分句进行二值标注,得到样本训练集;
对样本训练集中的每个样本分句进行分词,得到若干个词语;
从若干个词语中提取出指定特征集合,指定特征集合包括至少一个特征词;
根据指定特征集合中的特征词构建分类器;
根据样本训练集中的二值标注结果对分类器进行训练。
本公开一示例性实施例提供了一种类型识别装置,能够实现本公开提供的类型识别方法,该类型识别装置包括:处理器、用于存储处理器可执行指令的存储器;其中,处理器被配置为:
从原始信息中提取携带有目标关键字的分句;
根据提取出的分句中属于指定特征集合的特征词,生成原始信息的特征集合,指定特征集合中的特征词是根据携带有目标关键词的样本分句的分词结果所提取得到的;
将原始信息的特征集合输入训练后的分类器中进行预测,分类器是预先根据指定特征集合中的特征词构建的分类器;
获取分类器的预测结果,预测结果表征原始信息属于目标类别或不属于目标类别。
图9是根据一示例性实施例示出的一种用于分类器训练装置或类型识别装置的框图。例如,装置900可以是移动电话,计算机,数字广播终端,消息收发设备,游戏控制台,平板设备,医疗设备,健身设备,个人数字助理等。
参照图9,装置900可以包括以下一个或多个组件:处理组件902,存储器904,电源组件906,多媒体组件908,音频组件910,输入/输出(I/O)接口912,传感器组件914,以及通信组件916。
处理组件902通常控制装置900的整体操作,诸如与显示,电话呼叫,数据通信,相机操作和记录操作相关联的操作。处理组件902可以包括一个或多个处理器918来执行指令,以完成上述的方法的全部或部分步骤。此外,处理组件902可以包括一个或多个模块,便于处理组件902和其他组件之间的交互。例如,处理组件902可以包括多媒体模块,以方便多媒体组件908和处理组件902之间的交互。
存储器904被配置为存储各种类型的数据以支持在装置900的操作。这些数据的示例包括用于在装置900上操作的任何应用程序或方法的指令,联系人数据,电话簿数据,消息,图片,视频等。存储器904可以由任何类型的易失性或非易失性存储设备或者它们的组合实现,如静态随机存取存储器(SRAM),电可擦除可编程只读存储器(EEPROM),可擦除可编程只读存储器(EPROM),可编程只读存储器(PROM),只读存储器(ROM),磁存储器,快闪存储器,磁盘或光盘。
电源组件906为装置900的各种组件提供电力。电源组件906可以包括电源管理***,一个或多个电源,及其他与为装置900生成、管理和分配电力相关联的组件。
多媒体组件908包括在装置900和用户之间提供一个输出接口的屏幕。在一些实施例中,屏幕可以包括液晶显示器(LCD)和触摸面板(TP)。如果屏幕包括触摸面板,屏幕可以被实现为触摸屏,以接收来自用户的输入信号。触摸面板包括一个或多个触摸传感器以感测触摸、滑动和触摸面板上的手势。触摸传感器可以不仅感测触摸或滑动动作的边界,而且还检测与触摸或滑动操作相关的持续时间和压力。在一些实施例中,多媒体组件908包括一个前置摄像头和/或后置摄像头。当装置900处于操作模式,如拍摄模式或视频模式时,前置摄像头和/或后置摄像头可以接收外部的多媒体数据。每个前置摄像头和后置摄像头可以是一个固定的光学透镜***或具有焦距和光学变焦能力。
音频组件910被配置为输出和/或输入音频信号。例如,音频组件910包括一个麦克风(MIC),当装置900处于操作模式,如呼叫模式、记录模式和语音识别模式时,麦克风被配置为接收外部音频信号。所接收的音频信号可以被进一步存储在存储器904或经由通信组件916发送。在一些实施例中,音频组件910还包括一个扬声器,用于输出音频信号。
I/O接口912为处理组件902和***接口模块之间提供接口,上述***接口模块可以是键盘,点击轮,按钮等。这些按钮可包括但不限于:主页按钮、音量按钮、启动按钮和锁定按钮。
传感器组件914包括一个或多个传感器,用于为装置900提供各个方面的状态评估。例如,传感器组件914可以检测到装置900的打开/关闭状态,组件的相对定位,例如组件为装置900的显示器和小键盘,传感器组件914还可以检测装置900或装置900一个组件的位置改变,用户与装置900接触的存在或不存在,装置900方位或加速/减速和装置900的温度变化。传感器组件914可以包括接近传感器,被配置用来在没有任何的物理接触时检测附近物体的存在。传感器组件914还可以包括光传感器,如CMOS或CCD图像传感器,用于在成像应用中使用。在一些实施例中,该传感器组件914还可以包括加速度传感器,陀螺仪传感器,磁传感器,压力传感器或温度传感器。
通信组件916被配置为便于装置900和其他设备之间有线或无线方式的通信。装置900可以接入基于通信标准的无线网络,如Wi-Fi,2G或3G,或它们的组合。在一个示例性实施例中,通信组件916经由广播信道接收来自外部广播管理***的广播信号或广播相关信息。在一个示例性实施例中,通信组件916还包括近场通信(NFC)模块,以促进短程通信。例如,NFC模块可基于射频识别(RFID)技术,红外数据协会(IrDA)技术,超宽带(UWB)技术,蓝牙(BT)技术和其他技术来实现。
在示例性实施例中,装置900可以被一个或多个应用专用集成电路(ASIC)、数字信号处理器(DSP)、数字信号处理设备(DSPD)、可编程逻辑器件(PLD)、现场可编程门阵列(FPGA)、控制器、微控制器、微处理器或其他电子元件实现,用于执行上述分类器训练方法或类型识别方法。
在示例性实施例中,还提供了一种包括指令的非临时性计算机可读存储介质,例如包括指令的存储器904,上述指令可由装置900的处理器918执行以完成上述分类器训练方法或类型识别方法。例如,非临时性计算机可读存储介质可以是ROM、随机存取存储器(RAM)、CD-ROM、磁带、软盘和光数据存储设备等。
本领域技术人员在考虑说明书及实践这里公开的发明后,将容易想到本公开的其它实施方案。本申请旨在涵盖本公开的任何变型、用途或者适应性变化,这些变型、用途或者适应性变化遵循本公开的一般性原理并包括本公开未公开的本技术领域中的公知常识或惯用技术手段。说明书和实施例仅被视为示例性的,本公开的真正范围和精神由下面的权利要求指出。
应当理解的是,本公开并不局限于上面已经描述并在附图中示出的精确结构,并且可以在不脱离其范围进行各种修改和改变。本公开的范围仅由所附的权利要求来限制。

Claims (18)

  1. 一种分类器训练方法,其特征在于,所述方法包括:
    从样本信息中提取携带有目标关键字的样本分句;
    根据每条所述样本分句是否属于目标类别,对所述样本分句进行二值标注,得到样本训练集;
    对所述样本训练集中的每个所述样本分句进行分词,得到若干个词语;
    从所述若干个词语中提取出指定特征集合,所述指定特征集合包括至少一个特征词;
    根据所述指定特征集合中的所述特征词构建分类器;
    根据所述样本训练集中的二值标注结果对所述分类器进行训练。
  2. 根据权利要求1所述的方法,其特征在于,所述从所述若干个词语中提取出指定特征集合,包括:
    根据卡方检验从所述若干个词语中提取出所述指定特征集合;
    或,
    根据信息增益从所述若干个词语中提取出所述指定特征集合。
  3. 根据权利要求1所述的方法,其特征在于,所述根据所述指定特征集合中的所述特征词构建所述分类器,包括:
    将所述指定特征集合中的所述特征词构建朴素贝叶斯分类器,各个特征词在所述朴素贝叶斯分类器中互相独立。
  4. 根据权利要求3所述的方法,其特征在于,所述根据所述样本训练集中的二值标注结果对所述分类器进行训练,包括:
    对于所述朴素贝叶斯分类器中的每个所述特征词,根据所述样本训练集中的二值标注结果,统计出携带有所述特征词的分句属于所述目标类别的第一条件概率,和,携带有所述特征词的分句不属于所述目标类别的第二条件概率;
    根据各个所述特征词、所述第一条件概率和所述第二条件概率,得到训练后的所述朴素贝叶斯分类器。
  5. 一种类型识别方法,其特征在于,所述方法包括:
    从原始信息中提取携带有目标关键字的分句;
    根据提取出的所述分句中属于指定特征集合的特征词,生成所述原始信息的特征集合,所述指定特征集合中的特征词是根据携带有所述目标关键词的样本分句的分词结果所提取得到的;
    将所述原始信息的特征集合输入训练后的分类器中进行预测,所述分类器是预先根据所述指定特征集合中的所述特征词构建的分类器;
    获取所述分类器的预测结果,所述预测结果表征所述原始信息属于所述目标类别或不属于所述目标类别。
  6. 根据权利要求5所述的方法,其特征在于,所述将所述原始信息的特征集合输入训练后的分类器中进行预测,包括:
    将所述原始信息的特征集合中的每个特征词,输入训练后的朴素贝叶斯分类器中,计算所述原始信息属于所述目标类别的第一预测概率和所述原始信息不属于所述目标类别的第二预测概率;
    根据所述第一预测概率和所述第二预测概率的大小关系,预测所述原始信息是否属于所述目标类别;
    其中,所述训练后的朴素贝叶斯分类器中包括每个特征词的第一条件概率和第二条件概率,所述第一条件概率是携带有所述特征词的分句属于所述目标类别的概率,所述第二条件概率是携带有所述特征词的分句不属于所述目标类别的概率。
  7. 根据权利要求5或6所述的方法,其特征在于,所述方法还包括:
    若预测出所述原始信息属于所述目标类别,则从所述原始信息中提取目标信息。
  8. 根据权利要求7所述的方法,其特征在于,所述目标信息是生日日期;
    所述从所述原始信息中提取目标信息,包括:
    通过正则表达式从所述原始信息中提取所述生日日期;
    或,
    将所述原始信息的接收日期提取为所述生日日期。
  9. 一种分类器训练装置,其特征在于,所述装置包括:
    分句提取模块,被配置为从样本信息中提取携带有目标关键字的样本分句;
    分句标注模块,被配置为根据每条所述样本分句是否属于目标类别,对所述样本分句进行二值标注,得到样本训练集;
    分句分词模块,被配置为对所述样本训练集中的每个所述样本分句进行分词,得到若干个词语;
    特征词提取模块,被配置为从所述若干个词语中提取出指定特征集合,所述指定特征集合包括至少一个特征词;
    分类器构建模块,被配置为根据所述指定特征集合中的所述特征词构建分类器;
    分类器训练模块,被配置为根据所述样本训练集中的二值标注结果对所述分类器进行训练。
  10. 根据权利要求9所述的装置,其特征在于,
    所述特征词提取模块,被配置为根据卡方检验从所述若干个词语中提取出所述指定特征集合;
    或,
    所述特征词提取模块,被配置为根据信息增益从所述若干个词语中提取出所述指定特征集合。
  11. 根据权利要求9所述的装置,其特征在于,
    所述分类器构建模块,被配置为将所述指定特征集合中的所述特征词构建朴素贝叶斯分类器,各个特征词在所述朴素贝叶斯分类器中互相独立。
  12. 根据权利要求11所述的装置,其特征在于,所述分类器训练模块,包括:
    统计子模块,被配置为对于所述朴素贝叶斯分类器中的每个所述特征词,根据所述样本训练集中的二值标注结果,统计出携带有所述特征词的分句属于所述目标类别的第一条件概率,和,携带有所述特征词的分句不属于所述目标类别的第二条件概率;
    训练子模块,被配置为根据各个所述特征词、所述第一条件概率和所述第二条件概率,得到训练后的所述朴素贝叶斯分类器。
  13. 一种类型识别装置,其特征在于,所述装置包括:
    原始提取模块,被配置为从原始信息中提取携带有目标关键字的分句;
    特征提取模块,被配置为根据提取出的所述分句中属于指定特征集合的特征词,生成所述原始信息的特征集合,所述指定特征集合中的特征词是根据携带有所述目标关键词的样本分句的分词结果所提取得到的;
    特征输入模块,被配置为将所述原始信息的特征集合输入训练后的分类器中进行预测,所述分类器是预先根据所述指定特征集合中的所述特征词构建的分类器;
    结果获取模块,被配置为获取所述分类器的预测结果,所述预测结果表征所述原始信息属于所述目标类别或不属于所述目标类别。
  14. 根据权利要求13所述的装置,其特征在于,所述特征输入模块,包括:
    计算子模块,被配置为将所述原始信息的特征集合中的每个特征词,输入训练后的朴素贝叶斯分类器中,计算所述原始信息属于所述目标类别的第一预测概率和所述原始信息不属于所述目标类别的第二预测概率;
    预测子模块,被配置为根据所述第一预测概率和所述第二预测概率的大小关系,预测所述原始信息是否属于所述目标类别;
    其中,所述训练后的朴素贝叶斯分类器中包括每个特征词的第一条件概率和第二条件概率,所述第一条件概率是携带有所述特征词的分句属于所述目标类别的概率,所述第二条件概率是携带有所述特征词的分句不属于所述目标类别的概率。
  15. 根据权利要求13或14所述的装置,其特征在于,所述装置还包括:
    信息提取模块,被配置为在预测出所述原始信息属于所述目标类别时,从所述原始信息中提取目标信息。
  16. 根据权利要求15所述的装置,其特征在于,所述目标信息是生日日期;
    所述信息提取模块,被配置为通过正则表达式从所述原始信息中提取所述生日日期;
    或,
    所述信息提取模块,被配置为将所述原始信息的接收日期提取为所述生日日期。
  17. 一种分类器训练装置,其特征在于,所述装置包括:
    处理器;
    用于存储所述处理器可执行指令的存储器;
    其中,所述处理器被配置为:
    从样本信息中提取携带有目标关键字的样本分句;
    根据每条所述样本分句是否属于目标类别,对所述样本分句进行二值标注,得到样本训练集;
    对所述样本训练集中的每个所述样本分句进行分词,得到若干个词语;
    从所述若干个词语中提取出指定特征集合,所述指定特征集合包括至少一个特征词;
    根据所述指定特征集合中的所述特征词构建分类器;
    根据所述样本训练集中的二值标注结果对所述分类器进行训练。
  18. 一种类型识别装置,其特征在于,所述装置包括:
    处理器;
    用于存储所述处理器可执行指令的存储器;
    其中,所述处理器被配置为:
    从原始信息中提取携带有目标关键字的分句;
    根据提取出的所述分句中属于指定特征集合的特征词,生成所述原始信息的特征集合,所述指定特征集合中的特征词是根据携带有所述目标关键词的样本分句的分词结果所提取得到的;
    将所述原始信息的特征集合输入训练后的分类器中进行预测,所述分类器是预先根据所述指定特征集合中的所述特征词构建的分类器;
    获取所述分类器的预测结果,所述预测结果表征所述原始信息属于所述目标类别或不属于所述目标类别。
PCT/CN2015/097615 2015-08-19 2015-12-16 分类器训练方法、类型识别方法及装置 WO2017028416A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
KR1020167003870A KR101778784B1 (ko) 2015-08-19 2015-12-16 분류기 트레이닝, 타입 식별 방법 및 장치
RU2016111677A RU2643500C2 (ru) 2015-08-19 2015-12-16 Способ и устройство для обучения классификатора и распознавания типа
MX2016003981A MX2016003981A (es) 2015-08-19 2015-12-16 Metodo y dispositivo para capacitar un clasificador, reconocimiento de tipo.
JP2017534873A JP2017535007A (ja) 2015-08-19 2015-12-16 分類器トレーニング方法、種類認識方法及び装置

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510511468.1 2015-08-19
CN201510511468.1A CN105117384A (zh) 2015-08-19 2015-08-19 分类器训练方法、类型识别方法及装置

Publications (1)

Publication Number Publication Date
WO2017028416A1 true WO2017028416A1 (zh) 2017-02-23

Family

ID=54665378

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/097615 WO2017028416A1 (zh) 2015-08-19 2015-12-16 分类器训练方法、类型识别方法及装置

Country Status (8)

Country Link
US (1) US20170052947A1 (zh)
EP (1) EP3133532A1 (zh)
JP (1) JP2017535007A (zh)
KR (1) KR101778784B1 (zh)
CN (1) CN105117384A (zh)
MX (1) MX2016003981A (zh)
RU (1) RU2643500C2 (zh)
WO (1) WO2017028416A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992771A (zh) * 2019-03-13 2019-07-09 北京三快在线科技有限公司 一种文本生成的方法及装置
CN112529623A (zh) * 2020-12-14 2021-03-19 中国联合网络通信集团有限公司 恶意用户的识别方法、装置和设备

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105117384A (zh) * 2015-08-19 2015-12-02 小米科技有限责任公司 分类器训练方法、类型识别方法及装置
CN111277579B (zh) * 2016-05-06 2023-01-17 青岛海信移动通信技术股份有限公司 一种识别验证信息的方法和设备
CN106211165B (zh) * 2016-06-14 2020-04-21 北京奇虎科技有限公司 检测外文骚扰短信的方法、装置及相应的客户端
CN107135494B (zh) * 2017-04-24 2020-06-19 北京小米移动软件有限公司 垃圾短信识别方法及装置
CN110349572B (zh) * 2017-05-27 2021-10-22 腾讯科技(深圳)有限公司 一种语音关键词识别方法、装置、终端及服务器
CN110019782B (zh) * 2017-09-26 2021-11-02 北京京东尚科信息技术有限公司 用于输出文本类别的方法和装置
CN107704892B (zh) * 2017-11-07 2019-05-17 宁波爱信诺航天信息有限公司 一种基于贝叶斯模型的商品编码分类方法以及***
US10726204B2 (en) * 2018-05-24 2020-07-28 International Business Machines Corporation Training data expansion for natural language classification
CN109325123B (zh) * 2018-09-29 2020-10-16 武汉斗鱼网络科技有限公司 基于补集特征的贝叶斯文档分类方法、装置、设备及介质
US11100287B2 (en) * 2018-10-30 2021-08-24 International Business Machines Corporation Classification engine for learning properties of words and multi-word expressions
CN109979440B (zh) * 2019-03-13 2021-05-11 广州市网星信息技术有限公司 关键词样本确定方法、语音识别方法、装置、设备和介质
CN110083835A (zh) * 2019-04-24 2019-08-02 北京邮电大学 一种基于图和词句协同的关键词提取方法及装置
CN111339297B (zh) * 2020-02-21 2023-04-25 广州天懋信息***股份有限公司 网络资产异常检测方法、***、介质和设备
CN113688436A (zh) * 2020-05-19 2021-11-23 天津大学 一种pca与朴素贝叶斯分类融合的硬件木马检测方法
CN112925958A (zh) * 2021-02-05 2021-06-08 深圳力维智联技术有限公司 多源异构数据适配方法、装置、设备及可读存储介质
CN114281983B (zh) * 2021-04-05 2024-04-12 北京智慧星光信息技术有限公司 分层结构的文本分类方法、***、电子设备和存储介质
CN113705818B (zh) * 2021-08-31 2024-04-19 支付宝(杭州)信息技术有限公司 对支付指标波动进行归因的方法及装置
CN116094886B (zh) * 2023-03-09 2023-08-25 浙江万胜智能科技股份有限公司 一种双模模块中载波通信数据处理方法及***

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11203318A (ja) * 1998-01-19 1999-07-30 Seiko Epson Corp 文書分類方法および装置並びに文書分類処理プログラムを記録した記録媒体
US6192360B1 (en) * 1998-06-23 2001-02-20 Microsoft Corporation Methods and apparatus for classifying text and for building a text classifier
US7376635B1 (en) * 2000-07-21 2008-05-20 Ford Global Technologies, Llc Theme-based system and method for classifying documents
JP2006301972A (ja) 2005-04-20 2006-11-02 Mihatenu Yume:Kk 電子秘書装置
US7818176B2 (en) 2007-02-06 2010-10-19 Voicebox Technologies, Inc. System and method for selecting and presenting advertisements based on natural language processing of voice-based input
US8082151B2 (en) * 2007-09-18 2011-12-20 At&T Intellectual Property I, Lp System and method of generating responses to text-based messages
CN101516071B (zh) * 2008-02-18 2013-01-23 ***通信集团重庆有限公司 垃圾短消息的分类方法
US20100161406A1 (en) * 2008-12-23 2010-06-24 Motorola, Inc. Method and Apparatus for Managing Classes and Keywords and for Retrieving Advertisements
JP5346841B2 (ja) * 2010-02-22 2013-11-20 株式会社野村総合研究所 文書分類システムおよび文書分類プログラムならびに文書分類方法
US8892488B2 (en) * 2011-06-01 2014-11-18 Nec Laboratories America, Inc. Document classification with weighted supervised n-gram embedding
RU2491622C1 (ru) * 2012-01-25 2013-08-27 Общество С Ограниченной Ответственностью "Центр Инноваций Натальи Касперской" Способ классификации документов по категориям
US9910909B2 (en) * 2013-01-23 2018-03-06 24/7 Customer, Inc. Method and apparatus for extracting journey of life attributes of a user from user interactions
CN103501487A (zh) * 2013-09-18 2014-01-08 小米科技有限责任公司 分类器更新方法、装置、终端、服务器及***
CN103500195B (zh) * 2013-09-18 2016-08-17 小米科技有限责任公司 分类器更新方法、装置、***及设备
US10394953B2 (en) * 2015-07-17 2019-08-27 Facebook, Inc. Meme detection in digital chatter analysis

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7624006B2 (en) * 2004-09-15 2009-11-24 Microsoft Corporation Conditional maximum likelihood estimation of naïve bayes probability models
CN103246686A (zh) * 2012-02-14 2013-08-14 阿里巴巴集团控股有限公司 文本分类方法和装置及文本分类的特征处理方法和装置
CN103336766A (zh) * 2013-07-04 2013-10-02 微梦创科网络科技(中国)有限公司 短文本垃圾识别以及建模方法和装置
CN103885934A (zh) * 2014-02-19 2014-06-25 中国专利信息中心 一种专利文献关键短语自动提取方法
CN105117384A (zh) * 2015-08-19 2015-12-02 小米科技有限责任公司 分类器训练方法、类型识别方法及装置

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992771A (zh) * 2019-03-13 2019-07-09 北京三快在线科技有限公司 一种文本生成的方法及装置
CN109992771B (zh) * 2019-03-13 2020-05-05 北京三快在线科技有限公司 一种文本生成的方法及装置
CN112529623A (zh) * 2020-12-14 2021-03-19 中国联合网络通信集团有限公司 恶意用户的识别方法、装置和设备
CN112529623B (zh) * 2020-12-14 2023-07-11 中国联合网络通信集团有限公司 恶意用户的识别方法、装置和设备

Also Published As

Publication number Publication date
KR20170032880A (ko) 2017-03-23
RU2016111677A (ru) 2017-10-04
US20170052947A1 (en) 2017-02-23
RU2643500C2 (ru) 2018-02-01
CN105117384A (zh) 2015-12-02
KR101778784B1 (ko) 2017-09-26
EP3133532A1 (en) 2017-02-22
JP2017535007A (ja) 2017-11-24
MX2016003981A (es) 2017-04-27

Similar Documents

Publication Publication Date Title
WO2017028416A1 (zh) 分类器训练方法、类型识别方法及装置
CN107491541B (zh) 文本分类方法及装置
US10291774B2 (en) Method, device, and system for determining spam caller phone number
WO2020029966A1 (zh) 视频处理方法及装置、电子设备和存储介质
EP3173948A1 (en) Method and apparatus for recommendation of reference documents
WO2021031645A1 (zh) 图像处理方法及装置、电子设备和存储介质
WO2016050038A1 (zh) 通信消息识别方法及装置
RU2664003C2 (ru) Способ и устройство для определения ассоциированного пользователя
WO2017092122A1 (zh) 相似性确定方法、装置及终端
WO2021036382A1 (zh) 图像处理方法及装置、电子设备和存储介质
CN109002184B (zh) 一种输入法候选词的联想方法和装置
WO2019165832A1 (zh) 文字信息处理方法、装置及终端
CN111259967B (zh) 图像分类及神经网络训练方法、装置、设备及存储介质
CN105528403B (zh) 目标数据识别方法及装置
WO2018040040A1 (zh) 消息通信方法及装置
WO2021238135A1 (zh) 对象计数方法、装置、电子设备、存储介质及程序
TW202117707A (zh) 資料處理方法、電子設備和電腦可讀儲存介質
WO2018188410A1 (zh) 反馈的响应方法及装置
CN111813932B (zh) 文本数据的处理方法、分类方法、装置及可读存储介质
CN112328809A (zh) 实体分类方法、装置及计算机可读存储介质
CN109145151B (zh) 一种视频的情感分类获取方法及装置
WO2023092975A1 (zh) 图像处理方法及装置、电子设备、存储介质及计算机程序产品
WO2021082461A1 (zh) 存储和读取方法、装置、电子设备和存储介质
CN108345590B (zh) 一种翻译方法、装置、电子设备以及存储介质
CN113312475B (zh) 一种文本相似度确定方法及装置

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2017534873

Country of ref document: JP

Kind code of ref document: A

Ref document number: 20167003870

Country of ref document: KR

Kind code of ref document: A

WWE Wipo information: entry into national phase

Ref document number: MX/A/2016/003981

Country of ref document: MX

ENP Entry into the national phase

Ref document number: 2016111677

Country of ref document: RU

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15901618

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15901618

Country of ref document: EP

Kind code of ref document: A1