CN107507613B - Scene-oriented Chinese instruction identification method, device, equipment and storage medium


Info

Publication number
CN107507613B
CN107507613B, CN107507613A, application CN201710620448.7A
Authority
CN
China
Prior art keywords
prediction
test sample
sample
preset
scene
Prior art date
Legal status
Active
Application number
CN201710620448.7A
Other languages
Chinese (zh)
Other versions
CN107507613A (en)
Inventor
闫永刚
沈亮
Current Assignee
Hefei Midea Intelligent Technologies Co Ltd
Original Assignee
Hefei Midea Intelligent Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Hefei Midea Intelligent Technologies Co Ltd
Priority to CN201710620448.7A
Publication of CN107507613A
Application granted
Publication of CN107507613B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/08 - Speech classification or search
    • G10L 15/18 - Speech classification or search using natural language modelling
    • G10L 15/183 - Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 15/19 - Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L 15/197 - Probabilistic grammars, e.g. word n-grams
    • G10L 2015/223 - Execution procedure of a spoken command
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G06F 40/30 - Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a scene-oriented Chinese instruction identification method, device, equipment and storage medium. The method comprises: correcting the prediction weight of each prediction model according to a first preset formula and a sample set comprising misclassified samples, wherein the misclassified samples are test samples whose prediction class identifications do not match their actual class identifications. In this technical scheme, the prediction weight of each prediction model is trained and corrected using a sample set that includes misclassified samples, which effectively improves the accuracy of Chinese instruction recognition; scene prediction additionally saves background computing resources and raises the intelligence level of Chinese instruction recognition.

Description

Scene-oriented Chinese instruction identification method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of man-machine intelligent interaction, in particular to a scene-oriented Chinese instruction identification method, a scene-oriented Chinese instruction identification device, computer equipment and a computer-readable storage medium.
Background
A modern intelligent question-answering system generally comprises several technical links such as speech recognition, text parsing, syntactic analysis, semantic analysis, topic recognition and response parsing. Within syntactic analysis, scene-oriented Chinese instruction recognition (mainly question sentence-pattern recognition) serves as the entry validation step of the whole intelligent question-answering system.
In the related art, scene-oriented Chinese instruction identification in syntactic analysis is mainly realized by two approaches, query-word rule pattern matching and transformational-generative syntactic analysis, which have the following technical defects:
(1) Query-word rule pattern matching is very cumbersome and it is difficult to exhaust all query tables; the understanding of Chinese instructions remains shallow and recognition accuracy is low.
(2) Transformational-generative syntactic analysis requires a corresponding word library and syntactic patterns to be established in advance, demands excessive manual intervention, and has a low degree of intelligence.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art or the related art.
Therefore, the invention aims to provide a scene-oriented Chinese instruction identification method.
Another object of the present invention is to provide a scene-oriented Chinese instruction recognition apparatus.
It is a further object of the present invention to provide a computer device.
It is yet another object of the present invention to provide a computer-readable storage medium.
In order to achieve the above object, a technical solution of a first aspect of the present invention provides a scene-oriented Chinese instruction recognition method, including: correcting the prediction weight of each prediction model according to a first preset formula and a sample set comprising misclassified samples, wherein the misclassified samples are test samples whose prediction class identifications do not match their actual class identifications.
In this technical scheme, the prediction weight of each prediction model is corrected according to the first preset formula and a sample set comprising misclassified samples, that is, test samples whose prediction class identification does not match the actual class identification are used to correct the prediction weights. The prediction models can thus be trained effectively and their prediction accuracy improved, which effectively improves the accuracy of Chinese instruction identification. When the prediction class identification of a test sample does not match its actual class identification, the test sample is marked as a misclassified sample and its sampling probability is raised, so that misclassified samples are preferentially extracted both into the sample set used to correct the prediction weights and as new test samples. This reduces manual intervention to a certain extent and raises the intelligence level of prediction-model training and, in turn, of Chinese instruction recognition.
In addition, the sample set including the misclassified samples may contain all of the misclassified samples, or a part of the misclassified samples together with a part of the correctly predicted samples, provided it is large enough to serve the purpose of correcting the prediction weight of each prediction model.
In the foregoing technical solution, preferably, correcting the prediction weight of each prediction model according to the sample set comprising misclassified samples and the first preset formula specifically includes: cross-validating each prediction model on the sample set comprising misclassified samples to determine the prediction accuracy of each prediction model; and correcting the prediction weight of each prediction model according to the first preset formula and the prediction accuracy, wherein the first preset formula is:
ω_i = p_i / Σ_i p_i
where ω_i is the prediction weight of the i-th prediction model, p_i is the prediction accuracy of the i-th prediction model, and Σ_i p_i is the sum of the prediction accuracies of all prediction models.
In this technical scheme, the prediction accuracy of each prediction model is determined by cross-validating each prediction model on the sample set comprising misclassified samples. Specifically, 10-fold cross-validation can be adopted: the sample set is divided into 10 parts, 9 parts are used as training data and 1 part as test data, each test yields a corresponding accuracy, and the average accuracy of the 10 results is taken as the prediction accuracy of the prediction model. Generally, the 10-fold cross-validation is repeated several times, for example 10 times, and the results are averaged, which makes the determined prediction accuracy more reliable.
The corrected prediction weight of each prediction model is then calculated from the first preset formula and the prediction accuracy, which improves the accuracy of the determined prediction weights and, in turn, the accuracy of Chinese instruction identification.
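As an illustration only (not part of the patent text), this weight-correction step can be sketched in Python with scikit-learn-style models; the function name, the repeated 10-fold cross-validation setup and the estimator interface are assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score

def corrected_weights(models, X, y, folds=10, repeats=10):
    """Estimate each model's prediction accuracy by repeated 10-fold
    cross-validation on the sample set containing misclassified samples,
    then normalise the accuracies into prediction weights according to
    the first preset formula: w_i = p_i / sum_i(p_i)."""
    accuracies = []
    for model in models:
        scores = []
        for _ in range(repeats):
            cv = KFold(n_splits=folds, shuffle=True)
            scores.append(cross_val_score(model, X, y, cv=cv).mean())
        accuracies.append(np.mean(scores))
    accuracies = np.asarray(accuracies)
    return accuracies / accuracies.sum()   # corrected prediction weights
```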
In any one of the above technical solutions, preferably, before correcting the prediction weight of each prediction model according to the sample set comprising misclassified samples and the first preset formula, the method further includes: determining the prediction class identification of a test sample according to the prediction weight of each prediction model and a second preset formula; if the actual class identification of the test sample does not match the prediction class identification, determining the test sample as a misclassified sample; and raising the sampling probability of the misclassified sample, so that a sample set comprising misclassified samples can be extracted and misclassified samples can be extracted as new test samples, wherein the second preset formula is:
pred = Max(ω_i · n_j)
where ω_i is the prediction weight of the i-th prediction model, n_j is the number of occurrences of the j-th class identification among the outputs of all prediction models, and pred is the class identification corresponding to Max(ω_i · n_j), i.e. the prediction class identification.
In this technical scheme, the prediction class identification of a test sample is determined according to the prediction weight of each prediction model and the second preset formula, and a test sample whose prediction class identification does not match its actual class identification is marked as a misclassified sample, which tests the prediction models and facilitates their next round of training. By raising the sampling probability of misclassified samples, they can be preferentially extracted into the sample set used to correct the prediction weights and as new test samples, which reduces manual intervention to a certain extent, raises the intelligence level of prediction-model training, and further improves the accuracy of Chinese instruction recognition.
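A minimal sketch of one plausible reading of the second preset formula follows (illustrative only; the helper name and the assumption that each fitted model exposes a predict method are not from the patent): each model casts a vote, n_j counts how often class j occurs among the votes, and the winning class is the one attached to the largest product of a model's weight and its class's vote count.

```python
from collections import Counter

def ensemble_predict(models, weights, text):
    """Apply pred = Max(w_i * n_j) to one (scene-normalised) test sample."""
    votes = [model.predict([text])[0] for model in models]   # class id from each model
    counts = Counter(votes)                                   # n_j for each class j
    scored = [(w * counts[c], c) for w, c in zip(weights, votes)]
    return max(scored, key=lambda pair: pair[0])[1]           # prediction class identification
```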
In any of the above technical solutions, preferably, before determining the prediction class identification of the test sample according to the prediction weight of each prediction model and the second preset formula, the method further includes: determining whether the test sample includes vocabulary matching a preset scene vocabulary library; if the test sample does not include vocabulary matching the preset scene vocabulary library, sending a prompt signal and not determining the prediction class identification of the test sample; and if the test sample includes vocabulary matching the preset scene vocabulary library, replacing the corresponding vocabulary in the test sample with the matched vocabulary of the preset scene vocabulary library and determining the prediction class identification of the test sample.
In this technical scheme, whether the test sample includes vocabulary matching the preset scene vocabulary library is determined before its prediction class identification is determined, which realizes scene pre-judgement and makes Chinese instruction recognition scene-oriented and more targeted, so background computing resources can be effectively saved. If the test sample does not include vocabulary matching the preset scene vocabulary library, a prompt signal is sent and the prediction class identification is not determined, so irrelevant test samples are filtered out and background computing resources are further saved. When the test sample does include such vocabulary, the corresponding vocabulary in the test sample is replaced by the matched vocabulary of the library before the prediction class identification is determined, which improves the degree of standardization of the test samples entering the prediction models, helps the prediction models output prediction class identifications matching the actual class identifications, and further improves the accuracy of Chinese instruction recognition.
For example, if the scene is set as a kitchen scene, the preset scene vocabulary library may include the following vocabulary: category 1, common food materials (about 450 selected common food materials such as apple, celery and potato, and their synonyms); category 2, common recipes (about 10000 selected common recipes such as sauerkraut fish and fish-flavored shredded pork, and their synonyms); category 3, flavor (subclasses such as sour, spicy and light, and their synonyms); category 4, season and festival (subclasses such as morning and Valentine's Day, and their synonyms); category 5, nutritional efficacy (subclasses such as weight loss and insomnia, and their synonyms); category 6, special groups (subclasses such as drivers, teachers and examinees, and their synonyms); category 7, disease conditioning (subclasses such as hypertension, cold and toothache, and their synonyms); category 8, beauty and slimming (subclasses such as whitening, acne removal and freckle removal, and their synonyms); category 9, dish types (subclasses such as snacks, barbecue and overnight dishes, and their synonyms); category 10, scenes (subclasses such as being single, afternoon tea and transitions, and their synonyms).
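The scene pre-judgement step could look roughly like the following sketch (the tiny vocabulary, the synonym mapping and the function name are hypothetical placeholders, not the patent's actual library):

```python
# Hypothetical kitchen-scene vocabulary: canonical library term -> synonyms.
SCENE_VOCAB = {
    "apple": {"apple", "apples"},
    "sauerkraut fish": {"sauerkraut fish", "pickled cabbage fish"},
    "weight loss": {"weight loss", "slimming"},
}

def scene_precheck(text):
    """Return the test sample with matched words replaced by their canonical
    library form, or None (standing in for the prompt signal) when no scene
    vocabulary is present and class prediction should be skipped."""
    matched = False
    for canonical, synonyms in SCENE_VOCAB.items():
        for word in synonyms:
            if word in text:
                text = text.replace(word, canonical)
                matched = True
    return text if matched else None
```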
In any one of the above technical solutions, preferably, raising the sampling probability of a misclassified sample specifically includes: re-determining the sampling probability of the misclassified sample according to a third preset formula, wherein the third preset formula is:
W_{k+1} = W_k + 1 / Σ(y_k ≠ h(k))
where y_k is the actual class identification of test sample k, h(k) is the prediction class identification of test sample k, W_{k+1} is the re-determined sampling probability of the misclassified sample k (W_k being its sampling probability before re-determination), and Σ(y_k ≠ h(k)) is the total number of misclassified samples.
In this technical scheme, the sampling probability of a misclassified sample is re-determined by the third preset formula, so that it is raised according to a fixed rule, which favors extracting a sample set containing misclassified samples for correcting the prediction weight of each prediction model and extracting misclassified samples as new test samples. The sampling probability computed by the third preset formula increases progressively: a sample misclassified once has a higher sampling probability than an ordinary sample, and if it is extracted as a new test sample and misclassified again, its sampling probability continues to rise, so a sample misclassified twice has a higher sampling probability than one misclassified once. Multiple rounds of such training yield more appropriate prediction weights for each prediction model, which effectively improves the accuracy of Chinese instruction recognition.
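A sketch of this sampling-probability update, under the assumed form of the third preset formula given above (each misclassification adds 1 divided by the number of misclassified samples, followed by renormalisation):

```python
import numpy as np

def update_sampling_probability(probs, y_true, y_pred):
    """Raise the sampling probability of every misclassified test sample so
    that repeatedly misclassified samples become ever more likely to be
    drawn into the next sample set or as new test samples."""
    probs = np.asarray(probs, dtype=float)
    wrong = np.asarray(y_true) != np.asarray(y_pred)
    n_wrong = int(wrong.sum())
    if n_wrong:
        probs[wrong] += 1.0 / n_wrong     # assumed form of the third preset formula
    return probs / probs.sum()            # keep a valid probability distribution
```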
In any one of the above technical solutions, preferably, before correcting the prediction weight of each prediction model according to the sample set comprising misclassified samples and the first preset formula, the method further includes: constructing the prediction models from a preset corpus based on preset rules, and presetting the prediction weight of each prediction model.
In this technical scheme, the prediction models are constructed from the preset corpus based on the preset rules, and the prediction weight of each prediction model is then preset, which facilitates training of the prediction models; for example, with 4 prediction models, the prediction weight of each model can be preset to 0.25.
The preset rules are a support vector machine algorithm, a random forest algorithm, a K-nearest-neighbor (KNN) algorithm and a naive Bayes algorithm; each algorithm independently constructs a prediction model, and combining the prediction models further improves the accuracy of Chinese instruction recognition.
The preset corpus is used both to construct the prediction models and to provide corpora for training; the test samples and the sample set comprising misclassified samples are extracted from the preset corpus. Specifically, corpora of 4 types, namely interrogative, imperative, exclamatory and declarative sentences, are collected, sorted and labeled to form a prediction-model training and test set T = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where x ∈ χ, the instance space χ ⊆ R^n, and y_n belongs to the label set {1, 2, 3, 4}, whose elements correspond to the 4 class identifications of interrogative, imperative, exclamatory and declarative sentences respectively. Each class of corpus includes its related subclasses: interrogative sentences include 4 subclasses (specific questions, alternative questions, affirmative-negative questions and yes-no questions); imperative sentences include 4 subclasses (command, request, prohibition and solicitation imperatives); exclamatory sentences include 4 subclasses (exclamatory-word, noun, verbal and adverbial exclamatory sentences); declarative sentences include 2 subclasses (negative statements and affirmative statements).
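For illustration, the four prediction models could be built as follows in Python (the character n-gram TF-IDF features and the specific scikit-learn classes are assumptions; the patent only names the four algorithms and the preset weight of 0.25):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def build_models():
    """Four independently constructed prediction models (SVM, random forest,
    KNN, naive Bayes) over simple character n-gram features, each starting
    with the preset prediction weight 0.25."""
    classifiers = [LinearSVC(), RandomForestClassifier(),
                   KNeighborsClassifier(), MultinomialNB()]
    models = [make_pipeline(TfidfVectorizer(analyzer="char", ngram_range=(1, 2)), clf)
              for clf in classifiers]
    weights = [0.25] * len(models)
    return models, weights

# Label set: 1 interrogative, 2 imperative, 3 exclamatory, 4 declarative.
```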
A technical solution of a second aspect of the present invention provides a scene-oriented Chinese instruction recognition apparatus, including: a correcting unit for correcting the prediction weight of each prediction model according to a first preset formula and a sample set comprising misclassified samples, wherein the misclassified samples are test samples whose prediction class identifications do not match their actual class identifications.
In this technical scheme, the prediction weight of each prediction model is corrected according to the first preset formula and a sample set comprising misclassified samples, that is, test samples whose prediction class identification does not match the actual class identification are used to correct the prediction weights. The prediction models can thus be trained effectively and their prediction accuracy improved, which effectively improves the accuracy of Chinese instruction identification. When the prediction class identification of a test sample does not match its actual class identification, the test sample is marked as a misclassified sample and its sampling probability is raised, so that misclassified samples are preferentially extracted both into the sample set used to correct the prediction weights and as new test samples. This reduces manual intervention to a certain extent and raises the intelligence level of prediction-model training and, in turn, of Chinese instruction recognition.
In addition, the sample set including the misclassified samples may contain all of the misclassified samples, or a part of the misclassified samples together with a part of the correctly predicted samples, provided it is large enough to serve the purpose of correcting the prediction weight of each prediction model.
In the above technical solution, preferably, the apparatus further includes: a verification unit for cross-validating each prediction model on the sample set comprising misclassified samples to determine the prediction accuracy of each prediction model; the correcting unit is further configured to correct the prediction weight of each prediction model according to the first preset formula and the prediction accuracy, wherein the first preset formula is:
ω_i = p_i / Σ_i p_i
where ω_i is the prediction weight of the i-th prediction model, p_i is the prediction accuracy of the i-th prediction model, and Σ_i p_i is the sum of the prediction accuracies of all prediction models.
In this technical scheme, the prediction accuracy of each prediction model is determined by cross-validating each prediction model on the sample set comprising misclassified samples. Specifically, 10-fold cross-validation can be adopted: the sample set is divided into 10 parts, 9 parts are used as training data and 1 part as test data, each test yields a corresponding accuracy, and the average accuracy of the 10 results is taken as the prediction accuracy of the prediction model. Generally, the 10-fold cross-validation is repeated several times, for example 10 times, and the results are averaged, which makes the determined prediction accuracy more reliable.
The corrected prediction weight of each prediction model is then calculated from the first preset formula and the prediction accuracy, which improves the accuracy of the determined prediction weights and, in turn, the accuracy of Chinese instruction identification.
In any one of the above technical solutions, preferably, the apparatus further includes: a determining unit for determining the prediction class identification of a test sample according to the prediction weight of each prediction model and a second preset formula, the determining unit being further configured to determine the test sample as a misclassified sample when its actual class identification does not match the prediction class identification; and an improving unit for raising the sampling probability of the misclassified sample, so that a sample set comprising misclassified samples can be extracted and misclassified samples can be extracted as new test samples, wherein the second preset formula is:
pred = Max(ω_i · n_j)
where ω_i is the prediction weight of the i-th prediction model, n_j is the number of occurrences of the j-th class identification among the outputs of all prediction models, and pred is the class identification corresponding to Max(ω_i · n_j), i.e. the prediction class identification.
In this technical scheme, the prediction class identification of a test sample is determined according to the prediction weight of each prediction model and the second preset formula, and a test sample whose prediction class identification does not match its actual class identification is marked as a misclassified sample, which tests the prediction models and facilitates their next round of training. By raising the sampling probability of misclassified samples, they can be preferentially extracted into the sample set used to correct the prediction weights and as new test samples, which reduces manual intervention to a certain extent, raises the intelligence level of prediction-model training, and further improves the accuracy of Chinese instruction recognition.
In any one of the above technical solutions, preferably, the determining unit is further configured to determine whether the test sample includes vocabulary matching a preset scene vocabulary library; the Chinese instruction recognition apparatus further includes: a prompting unit for sending a prompt signal, without determining the prediction class identification of the test sample, when it is determined that the test sample does not include vocabulary matching the preset scene vocabulary library; and a replacing unit for replacing the corresponding vocabulary in the test sample with the matched vocabulary of the preset scene vocabulary library and determining the prediction class identification of the test sample when it is determined that the test sample includes vocabulary matching the preset scene vocabulary library.
In this technical scheme, whether the test sample includes vocabulary matching the preset scene vocabulary library is determined before its prediction class identification is determined, which realizes scene pre-judgement and makes Chinese instruction recognition scene-oriented and more targeted, so background computing resources can be effectively saved. If the test sample does not include vocabulary matching the preset scene vocabulary library, a prompt signal is sent and the prediction class identification is not determined, so irrelevant test samples are filtered out and background computing resources are further saved. When the test sample does include such vocabulary, the corresponding vocabulary in the test sample is replaced by the matched vocabulary of the library before the prediction class identification is determined, which improves the degree of standardization of the test samples entering the prediction models, helps the prediction models output prediction class identifications matching the actual class identifications, and further improves the accuracy of Chinese instruction recognition.
For example, if the scene is set as a kitchen scene, the preset scene vocabulary library may include the following vocabulary: category 1, common food materials (about 450 selected common food materials such as apple, celery and potato, and their synonyms); category 2, common recipes (about 10000 selected common recipes such as sauerkraut fish and fish-flavored shredded pork, and their synonyms); category 3, flavor (subclasses such as sour, spicy and light, and their synonyms); category 4, season and festival (subclasses such as morning and Valentine's Day, and their synonyms); category 5, nutritional efficacy (subclasses such as weight loss and insomnia, and their synonyms); category 6, special groups (subclasses such as drivers, teachers and examinees, and their synonyms); category 7, disease conditioning (subclasses such as hypertension, cold and toothache, and their synonyms); category 8, beauty and slimming (subclasses such as whitening, acne removal and freckle removal, and their synonyms); category 9, dish types (subclasses such as snacks, barbecue and overnight dishes, and their synonyms); category 10, scenes (subclasses such as being single, afternoon tea and transitions, and their synonyms).
In any one of the above technical solutions, preferably, the determining unit is further configured to re-determine the sampling probability of a misclassified sample according to a third preset formula, wherein the third preset formula is:
W_{k+1} = W_k + 1 / Σ(y_k ≠ h(k))
where y_k is the actual class identification of test sample k, h(k) is the prediction class identification of test sample k, W_{k+1} is the re-determined sampling probability of the misclassified sample k (W_k being its sampling probability before re-determination), and Σ(y_k ≠ h(k)) is the total number of misclassified samples.
In this technical scheme, the sampling probability of a misclassified sample is re-determined by the third preset formula, so that it is raised according to a fixed rule, which favors extracting a sample set containing misclassified samples for correcting the prediction weight of each prediction model and extracting misclassified samples as new test samples. The sampling probability computed by the third preset formula increases progressively: a sample misclassified once has a higher sampling probability than an ordinary sample, and if it is extracted as a new test sample and misclassified again, its sampling probability continues to rise, so a sample misclassified twice has a higher sampling probability than one misclassified once. Multiple rounds of such training yield more appropriate prediction weights for each prediction model, which effectively improves the accuracy of Chinese instruction recognition.
In any one of the above technical solutions, preferably, the apparatus further includes: a presetting unit for constructing the prediction models from a preset corpus based on preset rules and presetting the prediction weight of each prediction model.
In this technical scheme, the prediction models are constructed from the preset corpus based on the preset rules, and the prediction weight of each prediction model is then preset, which facilitates training of the prediction models; for example, with 4 prediction models, the prediction weight of each model can be preset to 0.25.
The preset rules are a support vector machine algorithm, a random forest algorithm, a K-nearest-neighbor (KNN) algorithm and a naive Bayes algorithm; each algorithm independently constructs a prediction model, and combining the prediction models further improves the accuracy of Chinese instruction recognition.
The preset corpus is used both to construct the prediction models and to provide corpora for training; the test samples and the sample set comprising misclassified samples are extracted from the preset corpus. Specifically, corpora of 4 types, namely interrogative, imperative, exclamatory and declarative sentences, are collected, sorted and labeled to form a prediction-model training and test set T = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where x ∈ χ, the instance space χ ⊆ R^n, and y_n belongs to the label set {1, 2, 3, 4}, whose elements correspond to the 4 class identifications of interrogative, imperative, exclamatory and declarative sentences respectively. Each class of corpus includes its related subclasses: interrogative sentences include 4 subclasses (specific questions, alternative questions, affirmative-negative questions and yes-no questions); imperative sentences include 4 subclasses (command, request, prohibition and solicitation imperatives); exclamatory sentences include 4 subclasses (exclamatory-word, noun, verbal and adverbial exclamatory sentences); declarative sentences include 2 subclasses (negative statements and affirmative statements).
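Purely as an illustration of how the units of the apparatus could fit together, the sketches above can be composed into one class (all names are hypothetical, and the models are assumed to have been fitted on the preset corpus beforehand):

```python
class ChineseInstructionRecognizer:
    """Loose mirror of the apparatus: presetting, verification/correcting,
    prompting/replacing and determining units become methods."""

    def __init__(self):
        self.models, self.weights = build_models()              # presetting unit

    def correct_weights(self, X, y):                             # verification + correcting units
        self.weights = corrected_weights(self.models, X, y)

    def predict_class(self, text):                               # determining unit
        normalised = scene_precheck(text)                        # prompting / replacing units
        if normalised is None:
            return None                                          # prompt signal: no scene vocabulary
        return ensemble_predict(self.models, self.weights, normalised)
```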
A technical solution of a third aspect of the present invention provides a computer device, which includes a processor configured to implement, when executing a computer program stored in a memory, the steps of the scene-oriented Chinese instruction recognition method according to any technical solution of the first aspect of the present invention.
In this technical solution, the computer device includes a processor configured to implement, when executing the computer program stored in the memory, the steps of any scene-oriented Chinese instruction recognition method proposed in the first aspect of the present invention, so that all the beneficial effects of that method are achieved and are not repeated here.
A technical solution of a fourth aspect of the present invention provides a computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of any scene-oriented Chinese instruction recognition method proposed in the first aspect of the present invention.
In this technical solution, the computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of any scene-oriented Chinese instruction recognition method proposed in the first aspect of the present invention, so that all the beneficial effects of that method are achieved and are not repeated here.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 shows a schematic flow chart of a scene-oriented Chinese instruction recognition method according to one embodiment of the present invention;
FIG. 2 shows a schematic block diagram of a scene-oriented Chinese instruction recognition apparatus according to one embodiment of the present invention;
FIG. 3 shows a schematic flow chart of a scene-oriented Chinese instruction recognition method according to another embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.
Example 1
As shown in FIG. 1, the scene-oriented Chinese instruction recognition method according to an embodiment of the present invention includes: step S102, correcting the prediction weight of each prediction model according to a first preset formula and a sample set comprising misclassified samples, wherein the misclassified samples are test samples whose prediction class identifications do not match their actual class identifications.
In this embodiment, the prediction weight of each prediction model is corrected according to the first preset formula and a sample set comprising misclassified samples, that is, test samples whose prediction class identification does not match the actual class identification are used to correct the prediction weights. The prediction models can thus be trained effectively and their prediction accuracy improved, which effectively improves the accuracy of Chinese instruction identification. When the prediction class identification of a test sample does not match its actual class identification, the test sample is marked as a misclassified sample and its sampling probability is raised, so that misclassified samples are preferentially extracted both into the sample set used to correct the prediction weights and as new test samples. This reduces manual intervention to a certain extent and raises the intelligence level of prediction-model training and, in turn, of Chinese instruction recognition.
In addition, the sample set including the misclassified samples may contain all of the misclassified samples, or a part of the misclassified samples together with a part of the correctly predicted samples, provided it is large enough to serve the purpose of correcting the prediction weight of each prediction model.
In the foregoing embodiment, preferably, correcting the prediction weight of each prediction model according to the sample set comprising misclassified samples and the first preset formula specifically includes: cross-validating each prediction model on the sample set comprising misclassified samples to determine the prediction accuracy of each prediction model; and correcting the prediction weight of each prediction model according to the first preset formula and the prediction accuracy, wherein the first preset formula is:
ω_i = p_i / Σ_i p_i
where ω_i is the prediction weight of the i-th prediction model, p_i is the prediction accuracy of the i-th prediction model, and Σ_i p_i is the sum of the prediction accuracies of all prediction models.
In this embodiment, the prediction accuracy of each prediction model is determined by cross-validating each prediction model on the sample set comprising misclassified samples. Specifically, 10-fold cross-validation can be adopted: the sample set is divided into 10 parts, 9 parts are used as training data and 1 part as test data, each test yields a corresponding accuracy, and the average accuracy of the 10 results is taken as the prediction accuracy of the prediction model. Generally, the 10-fold cross-validation is repeated several times, for example 10 times, and the results are averaged, which makes the determined prediction accuracy more reliable.
The corrected prediction weight of each prediction model is then calculated from the first preset formula and the prediction accuracy, which improves the accuracy of the determined prediction weights and, in turn, the accuracy of Chinese instruction identification.
In any of the above embodiments, preferably, before correcting the prediction weight of each prediction model according to the sample set comprising misclassified samples and the first preset formula, the method further includes: determining the prediction class identification of a test sample according to the prediction weight of each prediction model and a second preset formula; if the actual class identification of the test sample does not match the prediction class identification, determining the test sample as a misclassified sample; and raising the sampling probability of the misclassified sample, so that a sample set comprising misclassified samples can be extracted and misclassified samples can be extracted as new test samples, wherein the second preset formula is:
pred = Max(ω_i · n_j)
where ω_i is the prediction weight of the i-th prediction model, n_j is the number of occurrences of the j-th class identification among the outputs of all prediction models, and pred is the class identification corresponding to Max(ω_i · n_j), i.e. the prediction class identification.
In this embodiment, the prediction class identification of a test sample is determined according to the prediction weight of each prediction model and the second preset formula, and a test sample whose prediction class identification does not match its actual class identification is marked as a misclassified sample, which tests the prediction models and facilitates their next round of training. By raising the sampling probability of misclassified samples, they can be preferentially extracted into the sample set used to correct the prediction weights and as new test samples, which reduces manual intervention to a certain extent, raises the intelligence level of prediction-model training, and further improves the accuracy of Chinese instruction recognition.
In any of the above embodiments, preferably, before determining the prediction class identification of the test sample according to the prediction weight of each prediction model and the second preset formula, the method further includes: determining whether the test sample includes vocabulary matching a preset scene vocabulary library; if the test sample does not include vocabulary matching the preset scene vocabulary library, sending a prompt signal and not determining the prediction class identification of the test sample; and if the test sample includes vocabulary matching the preset scene vocabulary library, replacing the corresponding vocabulary in the test sample with the matched vocabulary of the preset scene vocabulary library and determining the prediction class identification of the test sample.
In this embodiment, whether the test sample includes vocabulary matching the preset scene vocabulary library is determined before its prediction class identification is determined, which realizes scene pre-judgement and makes Chinese instruction recognition scene-oriented and more targeted, so background computing resources can be effectively saved. If the test sample does not include vocabulary matching the preset scene vocabulary library, a prompt signal is sent and the prediction class identification is not determined, so irrelevant test samples are filtered out and background computing resources are further saved. When the test sample does include such vocabulary, the corresponding vocabulary in the test sample is replaced by the matched vocabulary of the library before the prediction class identification is determined, which improves the degree of standardization of the test samples entering the prediction models, helps the prediction models output prediction class identifications matching the actual class identifications, and further improves the accuracy of Chinese instruction recognition.
For example, if the scene is set as a kitchen scene, the preset scene vocabulary library may include the following vocabulary: category 1, common food materials (about 450 selected common food materials such as apple, celery and potato, and their synonyms); category 2, common recipes (about 10000 selected common recipes such as sauerkraut fish and fish-flavored shredded pork, and their synonyms); category 3, flavor (subclasses such as sour, spicy and light, and their synonyms); category 4, season and festival (subclasses such as morning and Valentine's Day, and their synonyms); category 5, nutritional efficacy (subclasses such as weight loss and insomnia, and their synonyms); category 6, special groups (subclasses such as drivers, teachers and examinees, and their synonyms); category 7, disease conditioning (subclasses such as hypertension, cold and toothache, and their synonyms); category 8, beauty and slimming (subclasses such as whitening, acne removal and freckle removal, and their synonyms); category 9, dish types (subclasses such as snacks, barbecue and overnight dishes, and their synonyms); category 10, scenes (subclasses such as being single, afternoon tea and transitions, and their synonyms).
In any of the foregoing embodiments, preferably, raising the sampling probability of a misclassified sample specifically includes: re-determining the sampling probability of the misclassified sample according to a third preset formula, wherein the third preset formula is:
W_{k+1} = W_k + 1 / Σ(y_k ≠ h(k))
where y_k is the actual class identification of test sample k, h(k) is the prediction class identification of test sample k, W_{k+1} is the re-determined sampling probability of the misclassified sample k (W_k being its sampling probability before re-determination), and Σ(y_k ≠ h(k)) is the total number of misclassified samples.
In this embodiment, the sampling probability of a misclassified sample is re-determined by the third preset formula, so that it is raised according to a fixed rule, which favors extracting a sample set containing misclassified samples for correcting the prediction weight of each prediction model and extracting misclassified samples as new test samples. The sampling probability computed by the third preset formula increases progressively: a sample misclassified once has a higher sampling probability than an ordinary sample, and if it is extracted as a new test sample and misclassified again, its sampling probability continues to rise, so a sample misclassified twice has a higher sampling probability than one misclassified once. Multiple rounds of such training yield more appropriate prediction weights for each prediction model, which effectively improves the accuracy of Chinese instruction recognition.
In any of the above embodiments, preferably, before correcting the prediction weight of each prediction model according to the sample set comprising misclassified samples and the first preset formula, the method further includes: constructing the prediction models from a preset corpus based on preset rules, and presetting the prediction weight of each prediction model.
In this embodiment, the prediction models are constructed from the preset corpus based on the preset rules, and the prediction weight of each prediction model is then preset, which facilitates training of the prediction models; for example, with 4 prediction models, the prediction weight of each model can be preset to 0.25.
The preset rules are a support vector machine algorithm, a random forest algorithm, a K-nearest-neighbor (KNN) algorithm and a naive Bayes algorithm; each algorithm independently constructs a prediction model, and combining the prediction models further improves the accuracy of Chinese instruction recognition.
The preset corpus is used both to construct the prediction models and to provide corpora for training; the test samples and the sample set comprising misclassified samples are extracted from the preset corpus. Specifically, corpora of 4 types, namely interrogative, imperative, exclamatory and declarative sentences, are collected, sorted and labeled to form a prediction-model training and test set T = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where x ∈ χ, the instance space χ ⊆ R^n, and y_n belongs to the label set {1, 2, 3, 4}, whose elements correspond to the 4 class identifications of interrogative, imperative, exclamatory and declarative sentences respectively. Each class of corpus includes its related subclasses: interrogative sentences include 4 subclasses (specific questions, alternative questions, affirmative-negative questions and yes-no questions); imperative sentences include 4 subclasses (command, request, prohibition and solicitation imperatives); exclamatory sentences include 4 subclasses (exclamatory-word, noun, verbal and adverbial exclamatory sentences); declarative sentences include 2 subclasses (negative statements and affirmative statements).
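One full training round of this embodiment might then be sketched as follows (the sample-set size, helper names and the requirement that the models are already fitted are all assumptions):

```python
import numpy as np

def training_round(models, weights, probs, texts, labels, set_size=200):
    """Predict every test sample, mark misclassifications, raise their
    sampling probability, draw a sample set biased toward misclassified
    samples, and re-correct the prediction weights."""
    preds = [ensemble_predict(models, weights, t) for t in texts]      # second preset formula
    probs = update_sampling_probability(probs, labels, preds)          # third preset formula (assumed form)
    idx = np.random.choice(len(texts), size=set_size, replace=False, p=probs)
    sample_texts = [texts[i] for i in idx]
    sample_labels = np.asarray(labels)[idx]
    weights = corrected_weights(models, sample_texts, sample_labels)   # first preset formula
    return weights, probs
```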
Example 2
As shown in FIG. 2, a scene-oriented Chinese instruction recognition apparatus 200 according to an embodiment of the present invention includes: a correcting unit 201 for correcting the prediction weight of each prediction model according to a first preset formula and a sample set comprising misclassified samples, wherein the misclassified samples are test samples whose prediction class identifications do not match their actual class identifications.
In this embodiment, the prediction weight of each prediction model is corrected according to the first preset formula and a sample set comprising misclassified samples, that is, test samples whose prediction class identification does not match the actual class identification are used to correct the prediction weights. The prediction models can thus be trained effectively and their prediction accuracy improved, which effectively improves the accuracy of Chinese instruction identification. When the prediction class identification of a test sample does not match its actual class identification, the test sample is marked as a misclassified sample and its sampling probability is raised, so that misclassified samples are preferentially extracted both into the sample set used to correct the prediction weights and as new test samples. This reduces manual intervention to a certain extent and raises the intelligence level of prediction-model training and, in turn, of Chinese instruction recognition.
In addition, the sample set including the misclassified samples may contain all of the misclassified samples, or a part of the misclassified samples together with a part of the correctly predicted samples, provided it is large enough to serve the purpose of correcting the prediction weight of each prediction model.
In the above embodiment, preferably, the apparatus further includes: a verification unit 202 for cross-validating each prediction model on the sample set comprising misclassified samples to determine the prediction accuracy of each prediction model; the correcting unit 201 is further configured to correct the prediction weight of each prediction model according to the first preset formula and the prediction accuracy, wherein the first preset formula is:
ω_i = p_i / Σ_i p_i
where ω_i is the prediction weight of the i-th prediction model, p_i is the prediction accuracy of the i-th prediction model, and Σ_i p_i is the sum of the prediction accuracies of all prediction models.
In this embodiment, the prediction accuracy of each prediction model is determined by cross-verifying each prediction model with a sample set including misclassified samples, specifically, a 10-fold cross-verification method may be adopted, that is, the sample set including misclassified samples is divided into 10 samples, 9 samples are used as training data, and 1 sample is used as test data, and the test is performed, where each test results in a corresponding accuracy, an average value of the accuracy of 10 results is used as the prediction accuracy of the prediction model, and generally, 10-fold cross-verification is performed for multiple times, for example, 10 times, and then the average value is calculated, so as to improve the accuracy of determining the prediction accuracy of the prediction model.
The corrected prediction weight of each prediction model is then obtained by calculating the weight from the first preset formula and the prediction accuracies, which improves the accuracy with which the prediction weights are determined and, in turn, the accuracy of Chinese instruction recognition.
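As a non-authoritative illustration of this correction step, the following sketch assumes scikit-learn classifiers and an in-memory labeled sample set; the function name correct_prediction_weights and the choice of repeated, shuffled 10-fold splits are illustrative, not prescribed by the embodiment.

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score

def correct_prediction_weights(models, X, y, folds=10, repeats=10):
    """Re-estimate each prediction model's weight from its cross-validated
    accuracy on a sample set that includes the misclassified samples."""
    accuracies = []
    for model in models:
        scores = []
        for _ in range(repeats):
            # A fresh shuffled 10-fold split per repetition; the repetitions
            # are averaged to stabilize the accuracy estimate.
            cv = StratifiedKFold(n_splits=folds, shuffle=True)
            scores.append(cross_val_score(model, X, y, cv=cv).mean())
        accuracies.append(np.mean(scores))
    accuracies = np.asarray(accuracies)
    # First preset formula: omega_i = p_i / sum_m(p_m)
    return accuracies / accuracies.sum()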
In any one of the above embodiments, preferably, the apparatus further includes: a determining unit 206, configured to determine the prediction class identifier of the test sample according to the prediction weight of each prediction model and a second preset formula; the determining unit 206 is further configured to determine the test sample as a misclassified sample when the actual class identifier of the test sample does not match the prediction class identifier; and an increasing unit 208, configured to increase the sampling probability of the misclassified sample, so that a sample set including the misclassified sample can be extracted and the misclassified sample can be extracted as a new test sample, where the second preset formula is:
pred = Max(ω_i · n_j)
where ω_i is the prediction weight of the i-th prediction model, n_j is the number of occurrences of the j-th class identifier among the predictions of all prediction models, and pred is the class identifier corresponding to Max(ω_i · n_j), i.e. the prediction class identifier.
In this embodiment, the prediction class identifier of the test sample is determined according to the prediction weight of each prediction model and the second preset formula, and a test sample whose prediction class identifier does not match its actual class identifier is marked as a misclassified sample, thereby testing the prediction models and facilitating their next round of training. By increasing the sampling probability of the misclassified samples, the misclassified samples can be preferentially extracted both as the sample set used to correct the prediction weight of each prediction model and as new test samples, which reduces manual intervention to a certain extent, improves the intelligence level of prediction-model training, and further improves the accuracy of Chinese instruction recognition.
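The weighted voting expressed by the second preset formula can be sketched as follows; this is one reasonable reading of pred = Max(ω_i · n_j), namely scoring each model's vote by its weight times how often that class was predicted overall, and the helper name predict_class is illustrative.

from collections import Counter

def predict_class(models, weights, text):
    """Determine the prediction class identifier of a test sample from the
    per-model predictions and the prediction weights (second preset formula)."""
    votes = [model.predict([text])[0] for model in models]
    counts = Counter(votes)  # n_j: occurrences of class j among all model predictions
    # Score each vote as omega_i * n_j and return the class of the best score.
    best_weight, best_class = max(
        zip(weights, votes), key=lambda wc: wc[0] * counts[wc[1]]
    )
    return best_class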
In any of the above embodiments, preferably, the determining unit 206 is further configured to determine whether the test sample includes a vocabulary matched with a preset scene vocabulary library. The Chinese instruction recognition device further includes: a prompting unit 210, configured to send a prompt signal when it is determined that the test sample does not include a vocabulary matched with the preset scene vocabulary library, without determining the prediction class identifier of the test sample; and a replacing unit 212, configured to, when it is determined that the test sample includes a vocabulary matched with the preset scene vocabulary library, replace the corresponding vocabulary in the test sample with the matched vocabulary in the preset scene vocabulary library and determine the prediction class identifier of the test sample.
In this embodiment, scene prejudgment is achieved by determining, before the prediction class identifier of the test sample is determined, whether the test sample includes a vocabulary matched with the preset scene vocabulary library, so that Chinese instruction recognition is scene-oriented, more targeted, and able to save background computing resources. If the test sample is determined not to include any vocabulary matched with the preset scene vocabulary library, a prompt signal is sent and the prediction class identifier is not determined, so that irrelevant test samples are filtered out and background computing resources are further saved. When the test sample is determined to include a vocabulary matched with the preset scene vocabulary library, the corresponding vocabulary in the test sample is replaced by the matched vocabulary from the library before the prediction class identifier is determined, which improves the degree of standardization of the test samples entering the prediction models, helps the prediction models output prediction class identifiers that match the actual class identifiers, and further improves the accuracy of Chinese instruction recognition.
For example, if the scene is set as a kitchen scene, the preset scene vocabulary library may include the following vocabularies: a first category, common food materials (450 selected common food materials such as apple, celery and potato, and their synonyms); a second category, common recipes (10000 selected common recipes such as sauerkraut fish and fish-flavored shredded pork, and their synonyms); a third category, flavor (multiple subclasses such as sour, spicy and light, and their synonyms); a fourth category, season (multiple subclasses such as morning and Valentine's Day, and their synonyms); a fifth category, nutritional efficacy (multiple subclasses such as weight loss and insomnia, and their synonyms); a sixth category, special groups (multiple subclasses such as drivers, teachers and examinees, and their synonyms); a seventh category, disease conditioning (multiple subclasses such as hypertension, cold and toothache, and their synonyms); an eighth category, beauty and slimming (multiple subclasses such as whitening, acne removal and freckle removal, and their synonyms); a ninth category, dish types (subclasses such as snacks, barbecue and overnight dishes, and their synonyms); and a tenth category, scenes (multiple subclasses such as being single, afternoon tea and transition, and their synonyms).
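A minimal sketch of the scene prejudgment and synonym replacement described above, assuming the jieba word-segmentation library and a toy stand-in for the preset kitchen-scene vocabulary library; the mapping SCENE_VOCABULARY and the helper scene_prejudge are illustrative, the real library is defined by the embodiment, not by this sketch.

import jieba  # a commonly used Chinese word-segmentation library (an assumed choice)

# Toy stand-in for the preset scene vocabulary library: each known word or
# synonym maps to its canonical vocabulary entry.
SCENE_VOCABULARY = {
    "苹果": "苹果",      # apple, a common food material
    "酸菜鱼": "酸菜鱼",  # sauerkraut fish, a common recipe
    "减肥": "减肥",      # weight loss, a nutritional-efficacy subclass
    "瘦身": "减肥",      # a synonym normalized to its canonical entry
}

def scene_prejudge(text):
    """Return (in_scene, tokens). If no token matches the preset scene
    vocabulary library, the caller sends a prompt signal and skips prediction;
    otherwise matched tokens are replaced by their canonical vocabulary."""
    tokens = list(jieba.cut(text))
    if not any(t in SCENE_VOCABULARY for t in tokens):
        return False, tokens
    return True, [SCENE_VOCABULARY.get(t, t) for t in tokens]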
In any of the above embodiments, preferably, the determining unit 206 is further configured to re-determine the sampling probability of the misclassified sample according to a third preset formula, where the third preset formula is:
(The third preset formula is given as an equation image in the original and is not reproduced in this text; it expresses W_{k+1} in terms of y_k, h(k) and Σ(y_k ≠ h(k)).)

Here y_k is the actual class identifier of test sample k, h(k) is the prediction class identifier of test sample k, W_{k+1} is the re-determined sampling probability of misclassified sample k, and Σ(y_k ≠ h(k)) is the total number of all misclassified samples.
In this embodiment, the sampling probability of each misclassified sample is re-determined by the third preset formula, so that the sampling probability of misclassified samples is increased according to a fixed rule. This makes it easier both to extract a sample set containing the misclassified samples for correcting the prediction weight of each prediction model and to extract the misclassified samples as new test samples. The sampling probability calculated by the third preset formula increases progressively: a sample misclassified once has a higher sampling probability than an ordinary sample, and if a misclassified sample extracted as a new test sample is misclassified again, its sampling probability continues to increase, so a sample misclassified twice has a higher sampling probability than a sample misclassified once. Through multiple rounds of such training, more appropriate prediction weights for the prediction models can be obtained, which effectively improves the accuracy of Chinese instruction recognition.
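Since the equation image of the third preset formula is not reproduced in this text, the sketch below uses an assumed additive update that matches the behaviour just described (every misclassified sample gains probability in proportion to one over the total number of misclassified samples, and gains again if it is misclassified in a later round); treat the exact update rule, and the helper name update_sampling_probabilities, as assumptions.

import numpy as np

def update_sampling_probabilities(probs, y_true, y_pred):
    """Assumed reading of the third preset formula: raise the sampling
    probability of each misclassified sample, then renormalize."""
    probs = np.asarray(probs, dtype=float)
    mis = np.asarray(y_true) != np.asarray(y_pred)   # y_k != h(k)
    n_mis = int(mis.sum())                           # sum(y_k != h(k))
    if n_mis:
        probs = probs + mis / n_mis                  # misclassified samples gain weight
    return probs / probs.sum()                       # keep a valid probability distribution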
In any one of the above embodiments, preferably, the apparatus further includes: a presetting unit 214, configured to construct the prediction models according to a preset corpus based on preset rules, and to preset the prediction weight of each prediction model.
In this embodiment, the prediction models are constructed according to the preset corpus based on the preset rules, and the prediction weight of each prediction model is then preset, which facilitates the training of the prediction models. For example, with 4 prediction models, the prediction weight of each prediction model may be preset to 0.25.
The preset rules are a support vector machine algorithm, a random forest algorithm, a K-nearest neighbor (KNN) algorithm and a naive Bayes algorithm; each algorithm independently constructs a prediction model, and combining the prediction models can further improve the accuracy of Chinese instruction recognition.
The preset corpus is used both to construct the prediction models and to provide corpora for training, and the test samples and the sample set including the misclassified samples are extracted from the preset corpus. Specifically, four types of corpora, namely interrogative sentences, imperative sentences, exclamatory sentences and declarative sentences, are collected and sorted as the preset corpus and labeled to form a prediction-model training test set T = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, where x ∈ χ, the instance space χ ⊆ R^n, and y_i belongs to the label set {1, 2, 3, 4}, whose elements correspond to the four class identifiers of interrogative, imperative, exclamatory and declarative sentences respectively. Each class of corpus includes related subclasses: the interrogative sentences include 4 subclasses (special questions, alternative questions, affirmative-negative questions and yes-no questions); the imperative sentences include 4 subclasses (such as command imperative sentences, request imperative sentences and persuasive imperative sentences); the exclamatory sentences include 4 subclasses (interjection exclamations, noun exclamations, spoken exclamations and adverb exclamations); and the declarative sentences include 2 subclasses (negative declarative sentences and affirmative declarative sentences).
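A hedged scikit-learn sketch of constructing the four prediction models over such a labeled, pre-segmented corpus and presetting equal weights of 0.25; the TF-IDF featurization, the default hyperparameters, and the name build_prediction_models are assumptions, as the embodiment does not fix a text-representation scheme.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def build_prediction_models(segmented_texts, labels):
    """Construct the four prediction models (SVM, random forest, KNN and
    naive Bayes) from space-joined segmented sentences labeled 1-4, and
    preset each model's prediction weight to 0.25."""
    models = [
        make_pipeline(TfidfVectorizer(), SVC()),
        make_pipeline(TfidfVectorizer(), RandomForestClassifier()),
        make_pipeline(TfidfVectorizer(), KNeighborsClassifier()),
        make_pipeline(TfidfVectorizer(), MultinomialNB()),
    ]
    for model in models:
        model.fit(segmented_texts, labels)
    weights = [0.25] * len(models)  # preset prediction weights
    return models, weights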
Example 3
According to a computer device of an embodiment of the present invention, the computer device includes a processor, and the processor is configured to implement, when executing a computer program stored in a memory, the steps of the scene-oriented Chinese instruction recognition method according to any one of the embodiments of the present invention set forth above.
In this embodiment, the computer device includes a processor configured to implement, when executing the computer program stored in the memory, the steps of any one of the scene-oriented Chinese instruction recognition methods proposed in the embodiments of the present invention, and therefore achieves all the beneficial effects of any one of those methods, which are not repeated here.
Example 4
The computer-readable storage medium according to an embodiment of the present invention has a computer program stored thereon, and the computer program, when executed by a processor, implements the steps of the scene-oriented Chinese instruction recognition method of any one of the embodiments of the present invention set forth above.
In this embodiment, a computer-readable storage medium stores a computer program that, when executed by a processor, implements the steps of any one of the scene-oriented Chinese instruction recognition methods provided in the embodiments of the present invention described above, and therefore achieves all the beneficial effects of any one of those methods, which are not repeated here.
Example 5
As shown in fig. 3, according to the scene-oriented Chinese instruction recognition method of an embodiment of the present invention, 4 prediction models are first constructed from a corpus through a support vector machine algorithm, a random forest algorithm, a KNN nearest-neighbor algorithm and a naive Bayes algorithm, and weights ω1, ω2, ω3 and ω4 are preset for them respectively. A test sample is then extracted from the corpus and read to obtain the text character string returned by speech recognition. At the text parsing layer, natural language processing techniques are used to perform Chinese word segmentation, stop-word filtering, dictionary customization and text deduplication on the text, yielding the processed text character-string array of the test sample. At the scene topic layer, it is judged whether the text includes words from the preset scene vocabulary library. If not, a prediction result is output indicating that the question is irrelevant to the scene. If so, the class identifier of the test text is predicted by each of the 4 prediction models, and the prediction results are integrated according to the preset weights ω1, ω2, ω3 and ω4 to obtain the prediction class identifier of the test text. Misclassification judgment is then performed: if the actual class identifier of the test text does not match the prediction class identifier, the test text is determined to be a misclassified text and the prediction weight of each prediction model is corrected; if they match, the prediction result, i.e. the prediction class identifier, which equals the actual class identifier, is output. Correcting the prediction weight of each prediction model according to the misclassified samples in this way can effectively improve the accuracy of Chinese instruction recognition.
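To tie the flow of fig. 3 together, the following sketch runs one illustrative training round using the helper functions sketched earlier in this section (all of them assumptions rather than the patented implementation); the sample-set size and the handling of out-of-scene samples are likewise assumed.

import numpy as np

def training_round(models, weights, probs, texts, labels, set_size=200):
    """One illustrative round: scene-check and predict each test sample,
    mark misclassified samples, raise their sampling probability, draw a
    correction sample set, and correct the prediction weights."""
    predicted = []
    for text in texts:
        in_scene, tokens = scene_prejudge(text)
        # Out-of-scene samples trigger a prompt signal and are not predicted.
        predicted.append(predict_class(models, weights, " ".join(tokens)) if in_scene else None)
    # Treat skipped (out-of-scene) samples as not misclassified for the update.
    y_pred = [p if p is not None else y for p, y in zip(predicted, labels)]
    probs = update_sampling_probabilities(probs, labels, y_pred)
    rng = np.random.default_rng()
    chosen = rng.choice(len(texts), size=min(set_size, len(texts)), replace=False, p=probs)
    sample_texts = [" ".join(scene_prejudge(texts[i])[1]) for i in chosen]
    sample_labels = np.asarray([labels[i] for i in chosen])
    weights = correct_prediction_weights(models, sample_texts, sample_labels)
    return weights, probs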
The technical solution of the present invention has been described in detail above with reference to the accompanying drawings. The invention provides a scene-oriented Chinese instruction recognition method, device, equipment and storage medium, in which the prediction weight of each prediction model is corrected according to a sample set including misclassified samples and a first preset formula, so that the accuracy of Chinese instruction recognition is effectively improved, background computing resources are effectively saved through scene prejudgment, and the intelligence level of Chinese instruction recognition is improved.
The steps in the method of the invention can be sequentially adjusted, combined and deleted according to actual needs.
The units in the device of the invention can be merged, divided and deleted according to actual needs.
It will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium. The storage medium includes a Read-Only Memory (ROM), a Random Access Memory (RAM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-time Programmable Read-Only Memory (OTPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disc memory, a magnetic disk memory, a tape memory, or any other computer-readable medium that can be used to carry or store data.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. A scene-oriented Chinese instruction identification method is characterized by comprising the following steps:
modifying the prediction weight of each prediction model according to a sample set including the misclassified samples and a first preset formula,
the misclassification sample is a test sample with unmatched prediction class identification and actual class identification;
the correcting the prediction weight of each prediction model according to the sample set including the misclassified samples and the first preset formula specifically includes:
cross-validating each of the prediction models according to the sample set comprising the misclassified samples to determine a prediction accuracy of each of the prediction models;
correcting the prediction weight of each prediction model according to the first preset formula and the prediction precision,
wherein the first preset formula comprises:
ω_i = p_i / Σ_{m=1}^{M} p_m

where ω_i is the prediction weight of the i-th prediction model, p_i is the prediction accuracy of the i-th prediction model, and Σ_{m=1}^{M} p_m is the sum of the prediction accuracies of all M prediction models;
before the modifying the prediction weight of each prediction model according to the sample set including the misclassified samples and the first preset formula, the method further includes:
determining a prediction class identifier of the test sample according to the prediction weight of each prediction model and a second preset formula;
if the actual class identifier of the test sample is not matched with the predicted class identifier, determining the test sample as the misclassified sample;
increasing the sampling probability of the misclassified samples to extract the sample set including the misclassified samples and to extract the misclassified samples as new test samples,
wherein the second preset formula comprises:
pred = Max(ω_i · n_j)
where ω_i is the prediction weight of the i-th prediction model, n_j is the number of occurrences of the j-th class identifier among the predictions of all prediction models, and pred is the class identifier corresponding to Max(ω_i · n_j), namely the prediction class identifier;
before the modifying the prediction weight of each prediction model according to the sample set including the misclassified samples and the first preset formula, the method further includes:
and constructing the prediction models according to a preset corpus based on a preset rule, and presetting the prediction weight of each prediction model.
2. The scene-oriented Chinese instruction recognition method of claim 1, further comprising, before the determining of the prediction class identifier of the test sample according to the preset weight of each prediction model and the second preset formula:
determining whether the test sample comprises vocabularies matched with a preset scene vocabulary library or not;
if it is determined that the test sample does not include a vocabulary matched with the preset scene vocabulary library, sending a prompt signal and not determining the prediction class identifier of the test sample;
and if it is determined that the test sample includes a vocabulary matched with the preset scene vocabulary library, replacing the corresponding vocabulary in the test sample with the matched vocabulary in the preset scene vocabulary library, and determining the prediction class identifier of the test sample.
3. The scene-oriented Chinese instruction recognition method of claim 1, wherein the increasing of the sampling probability of the misclassified samples specifically comprises:
re-determining the sampling probability of the misclassified samples according to a third preset formula,
wherein the third preset formula comprises:
(The third preset formula is given as an equation image in the original and is not reproduced in this text; it expresses W_{k+1} in terms of y_k, h(k) and Σ(y_k ≠ h(k)).)

where y_k is the actual class identifier of test sample k, h(k) is the prediction class identifier of test sample k, W_{k+1} is the re-determined sampling probability of misclassified sample k, and Σ(y_k ≠ h(k)) is the total number of all misclassified samples.
4. A scene-oriented Chinese instruction recognition device is characterized by comprising:
a correction unit for correcting the prediction weight of each prediction model according to a sample set including the misclassified samples and a first preset formula,
the misclassification sample is a test sample with unmatched prediction class identification and actual class identification;
a verification unit, configured to cross-verify each prediction model according to the sample set including the misclassified samples to determine the prediction accuracy of each prediction model;
the correction unit is further configured to: correcting the prediction weight of each prediction model according to the first preset formula and the prediction precision,
wherein the first preset formula comprises:
ω_i = p_i / Σ_{m=1}^{M} p_m

where ω_i is the prediction weight of the i-th prediction model, p_i is the prediction accuracy of the i-th prediction model, and Σ_{m=1}^{M} p_m is the sum of the prediction accuracies of all M prediction models;
the determining unit is used for determining the prediction class identification of the test sample according to the prediction weight of each prediction model and a second preset formula;
the determination unit is further configured to: when the actual class identification of the test sample is not matched with the prediction class identification, determining the test sample as the misclassified sample;
an increasing unit for increasing a sampling probability of the misclassified samples to extract the sample set including the misclassified samples and to extract the misclassified samples as new test samples,
wherein the second preset formula comprises:
pred = Max(ω_i · n_j)
where ω_i is the prediction weight of the i-th prediction model, n_j is the number of occurrences of the j-th class identifier among the predictions of all prediction models, and pred is the class identifier corresponding to Max(ω_i · n_j), namely the prediction class identifier;
and the presetting unit is used for constructing the prediction models according to a preset corpus based on a preset rule and presetting the prediction weight of each prediction model.
5. The scene-oriented Chinese instruction recognition device of claim 4,
the determination unit is further configured to: determining whether the test sample comprises vocabularies matched with a preset scene vocabulary library or not;
the Chinese instruction recognition device further comprises:
the prompting unit is used for sending a prompt signal when it is determined that the test sample does not include a vocabulary matched with the preset scene vocabulary library, without determining the prediction class identifier of the test sample;
and the replacing unit is used for replacing the corresponding vocabulary in the test sample by the matched vocabulary in the preset scene vocabulary library when the test sample is determined to comprise the vocabulary matched with the preset scene vocabulary library, and determining the prediction type identification of the test sample.
6. The scene-oriented Chinese instruction recognition device of claim 4,
the determination unit is further configured to: re-determining the sampling probability of the misclassified samples according to a third preset formula,
wherein the third preset formula comprises:
(The third preset formula is given as an equation image in the original and is not reproduced in this text; it expresses W_{k+1} in terms of y_k, h(k) and Σ(y_k ≠ h(k)).)

where y_k is the actual class identifier of test sample k, h(k) is the prediction class identifier of test sample k, W_{k+1} is the re-determined sampling probability of misclassified sample k, and Σ(y_k ≠ h(k)) is the total number of all misclassified samples.
7. A computer device, characterized in that it comprises a processor configured to implement the steps of the scene-oriented Chinese instruction recognition method according to any one of claims 1 to 3 when executing a computer program stored in a memory.
8. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the steps of the scene-oriented Chinese instruction recognition method according to any one of claims 1 to 3.
CN201710620448.7A 2017-07-26 2017-07-26 Scene-oriented Chinese instruction identification method, device, equipment and storage medium Active CN107507613B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710620448.7A CN107507613B (en) 2017-07-26 2017-07-26 Scene-oriented Chinese instruction identification method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN107507613A CN107507613A (en) 2017-12-22
CN107507613B true CN107507613B (en) 2021-03-16

Family

ID=60689769

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710620448.7A Active CN107507613B (en) 2017-07-26 2017-07-26 Scene-oriented Chinese instruction identification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN107507613B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110602307A (en) * 2018-06-12 2019-12-20 范世汶 Data processing method, device and equipment
CN110689135B (en) * 2019-09-05 2022-10-11 第四范式(北京)技术有限公司 Anti-money laundering model training method and device and electronic equipment
CN111651686B (en) * 2019-09-24 2021-02-26 北京嘀嘀无限科技发展有限公司 Test processing method and device, electronic equipment and storage medium
CN113096642A (en) * 2021-03-31 2021-07-09 南京地平线机器人技术有限公司 Speech recognition method and device, computer readable storage medium, electronic device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7831380B2 (en) * 2006-03-03 2010-11-09 Inrix, Inc. Assessing road traffic flow conditions using data obtained from mobile data sources

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104361010A (en) * 2014-10-11 2015-02-18 北京中搜网络技术股份有限公司 Automatic classification method for correcting news classification
CN104573013A (en) * 2015-01-09 2015-04-29 上海大学 Category weight combined integrated learning classifying method
CN106548210A (en) * 2016-10-31 2017-03-29 腾讯科技(深圳)有限公司 Machine learning model training method and device

Also Published As

Publication number Publication date
CN107507613A (en) 2017-12-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 230088 Building No. 198, building No. 198, Mingzhu Avenue, Anhui high tech Zone, Anhui

Applicant after: Hefei Hualing Co.,Ltd.

Address before: 230601 R & D building, No. 176, Jinxiu Road, Hefei economic and Technological Development Zone, Anhui 501

Applicant before: Hefei Hualing Co.,Ltd.

GR01 Patent grant