CN108986789A

CN108986789A - Speech recognition method, device, storage medium and electronic device

Info

Publication number: CN108986789A
Application number: CN201811060707.6A
Authority: CN
Inventors: 陈浩
Original assignee: Ctrip Travel Information Technology Shanghai Co Ltd
Current assignee: Ctrip Travel Information Technology Shanghai Co Ltd
Priority date: 2018-09-12
Filing date: 2018-09-12
Publication date: 2018-12-11

Abstract

The present invention provides a speech recognition method, device, storage medium, and electronic device. The speech recognition method comprises the following steps: obtaining a plurality of sample voice data; performing speech feature extraction on each sample voice data using Mel-frequency cepstral coefficients to obtain a feature matrix of each sample voice data; constructing the size of the feature matrix of each sample voice data according to a preset value to obtain a set of normalized feature matrices; establishing a classification model with a support vector machine algorithm based on the set of normalized feature matrices; and identifying target voice data by means of the classification model. The present invention can accurately distinguish target voice data in multiple languages, in particular voice data of failed outbound calls that contains a color ring back tone (CRBT) or a ringback tone.

Description

Speech recognition method, device, storage medium and electronic device

Technical field

The present invention relates to the field of computer technology, and in particular to a speech recognition method, device, storage medium, and electronic device.

Background art

A call center typically handles a large volume of voice call data every day, including much voice data from failed outbound calls. The failure signaling currently provided by operators is rather generic: the signaling for phone powered off, call rejected, vacant number, service suspended, ring unanswered, line busy, and so on is identical, so the true cause cannot be distinguished. This easily leads to the business repeatedly dialing invalid numbers, which reduces efficiency. A strategy is therefore needed to determine the failure cause of these failed-call recordings.

The existing approach is to use ASR (Automatic Speech Recognition), which is semantics-based. For outbound call recordings of the types phone powered off, line busy, vacant number, and service suspended, the announcement played is identical every time, so ASR identifies these failure causes well. However, ASR has two fatal drawbacks: its support for multiple languages is limited and costly, and it cannot recognize recordings that contain a CRBT (color ring back tone) or a ringback tone. As the business expands, voice data in many foreign languages and voice data containing CRBT or ringback tones are frequently encountered, and the prior art can no longer meet the speech recognition requirements in such situations.

Summary of the invention

In view of the problems of the prior art, the object of the present invention is to provide a speech recognition method, device, electronic device, and storage medium that accurately distinguish target voice data in multiple languages, in particular voice data of failed outbound calls that contains a CRBT or a ringback tone.

According to one aspect of the present invention, a speech recognition method is provided, comprising the following steps: obtaining a plurality of sample voice data; performing speech feature extraction on each sample voice data using Mel-frequency cepstral coefficients to obtain a feature matrix of each sample voice data; constructing the size of the feature matrix of each sample voice data according to a preset value to obtain a set of normalized feature matrices; establishing a classification model with a support vector machine algorithm based on the set of normalized feature matrices; and identifying target voice data by means of the classification model.

In one embodiment of the present invention, the sample voice data is divided into first voice data and second voice data, and the type of the sample voice data serves as the classification output of the classification model.

In one embodiment of the present invention, the more numerous of the first voice data and the second voice data is sampled so that the quantities of the first voice data and the second voice data are the same.

In one embodiment of the present invention, the first voice data and the second voice data are labeled as call-rejected voice data and ring-unanswered voice data, respectively.

In one embodiment of the present invention, the first voice data and the second voice data contain a CRBT or a ringback tone.

In one embodiment of the present invention, each feature matrix in the set of feature matrices is constructed to represent the last n seconds of each sample voice data, where n is an integer greater than or equal to 5 and less than or equal to 15.

In one embodiment of the present invention, the preset value is set to [1, M] based on the value of n, where M is an integer greater than or equal to 1, and the step of constructing the size of the feature matrix of each sample voice data according to a preset value comprises:

constructing the size of the feature matrix of each sample voice data as [1, M], where M is the number of columns of the feature matrix.

In one embodiment of the present invention, the step of constructing the size of the feature matrix of each sample voice data as [1, M] comprises:

if the size of the feature matrix of the sample voice data exceeds [1, M], taking the last M columns of the feature matrix of the sample voice data, so that its size is [1, M];

if the size of the feature matrix of the sample voice data is smaller than [1, M], padding the front of the feature matrix of the sample voice data with zeros, so that its size is [1, M].

In one embodiment of the present invention, n is 10 seconds and M is 17381.

In one embodiment of the present invention, the step of establishing a classification model with a support vector machine algorithm based on the set of normalized feature matrices comprises:

dividing the set of normalized feature matrices into a training data set, a validation data set, and a test data set according to a preset ratio;

establishing the classification model with the support vector machine algorithm based on the training data set, the validation data set, and the test data set;

wherein the training data set is used to train the model or determine the model parameters, the validation data set is used for model selection, and the test data set is used to test the discriminating ability of the trained model.

In one embodiment of the present invention, the preset ratio of the training data set, the validation data set, and the test data set is 6:2:2.

According to another aspect of the present invention, a speech recognition device is provided, comprising an acquisition module, a feature extraction module, a feature construction module, a model construction module, and an identification module. The acquisition module is used to obtain a plurality of sample voice data. The feature extraction module is used to perform speech feature extraction on each sample voice data using Mel-frequency cepstral coefficients to obtain a feature matrix of each sample voice data. The feature construction module is used to construct the size of the feature matrix of each sample voice data according to a preset value to obtain a set of normalized feature matrices. The model construction module is used to establish a classification model with a support vector machine algorithm based on the set of normalized feature matrices. The identification module is used to identify target voice data by means of the classification model.

According to another aspect of the present invention, a storage medium is provided on which a computer program is stored, the computer program performing the steps described above when run by a processor.

According to another aspect of the present invention, an electronic device is provided, comprising a processor and a storage medium. A computer program is stored on the storage medium, and the computer program performs the steps described above when run by the processor.

The speech recognition method proposed by the present invention first performs speech feature extraction on each sample voice data using Mel-frequency cepstral coefficients to obtain a feature matrix of each sample voice data; then constructs the size of the feature matrix of each sample voice data according to a preset value to obtain a set of normalized feature matrices; and establishes a classification model with a support vector machine algorithm based on the set of normalized feature matrices. The classification model can accurately distinguish target voice data in multiple languages, in particular voice data of failed outbound calls that contains a CRBT or a ringback tone.

In addition, the speech recognition method proposed by the present invention has the advantages of good extensibility, higher accuracy, and lower cost.

Brief description of the drawings

Other features, objects, and advantages of the present invention will become more apparent upon reading the detailed description of non-limiting embodiments with reference to the following drawings.

Fig. 1 is a flowchart of the speech recognition method in an embodiment of the present invention.

Fig. 2 is a structural schematic diagram of the speech recognition device in an embodiment of the present invention.

Fig. 3 is a structural schematic diagram of the computer-readable storage medium in an embodiment of the present invention.

Fig. 4 is a structural schematic diagram of the electronic device in an embodiment of the present invention.

Fig. 5 is a flowchart of establishing a classification model with a support vector machine algorithm based on the set of normalized feature matrices in an embodiment of the present invention.

Fig. 6 is a structural schematic diagram of the speech recognition device in another embodiment of the present invention.

Detailed description of the embodiments

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments can, however, be implemented in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.

In addition, the drawings are merely schematic illustrations of the disclosure and are not necessarily drawn to scale. Identical reference numerals in the drawings denote identical or similar parts, and repeated description of them is omitted. Some of the block diagrams shown in the drawings are functional entities and do not necessarily correspond to physically or logically independent entities. These functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

To overcome the deficiencies of the prior art and accurately distinguish target voice data in multiple languages, in particular voice data of failed outbound calls that contains a CRBT or a ringback tone, the present invention provides a speech recognition method, device, electronic device, and storage medium.

Fig. 1 is a flowchart of the speech recognition method in an embodiment of the present invention. The speech recognition method comprises the following steps:

S110: obtain a plurality of sample voice data.

Those skilled in the art will appreciate that the sample voice data obtained here may be voice data that has already been preprocessed and can be used directly. To ensure the validity of the subsequent classification model, the number of sample voice data should be sufficiently large.

In a specific embodiment of the present invention, the sample voice data may be divided into first voice data and second voice data. Specifically, the sample voice data may be labeled in step S110 to distinguish the first voice data from the second voice data.

Furthermore, the quantities of the first voice data and the second voice data in the sample voice data may differ considerably. In that case, the more numerous of the two types may be sampled so that the quantities of the first voice data and the second voice data are the same. That is, only a part of the more numerous type is extracted, so that the two quantities are level. This resolves the imbalance among the types of voice data in the sample voice data, improving the reliability of the subsequent classification model construction and, in turn, the effectiveness of the speech recognition method.
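The balancing step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `balance_by_downsampling`, the `(audio_id, label)` pair layout, the label strings, and the fixed random seed are all assumptions introduced here.

```python
import random


def balance_by_downsampling(samples, seed=0):
    """Downsample the larger class so every label has the same count.

    `samples` is a list of (audio_id, label) pairs. The pair layout and
    the seed are illustrative assumptions, not taken from the source.
    """
    by_label = {}
    for item in samples:
        by_label.setdefault(item[1], []).append(item)

    # The smallest class sets the target size for every class.
    target = min(len(group) for group in by_label.values())

    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        # Sampling without replacement; the smaller class is kept whole.
        balanced.extend(rng.sample(group, target))
    return balanced


# Hypothetical data: 8 call-rejected recordings, 3 ring-unanswered ones.
samples = [(f"call_{i}", "rejected") for i in range(8)]
samples += [(f"call_{i}", "unanswered") for i in range(8, 11)]
balanced = balance_by_downsampling(samples)
```

Downsampling the majority class, rather than duplicating the minority class, matches the text's description of extracting only a part of the more numerous type.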

S120: perform speech feature extraction on each sample voice data using Mel-frequency cepstral coefficients to obtain a feature matrix of each sample voice data.

Many speech feature extraction methods are in common use. The present invention uses Mel-frequency cepstral coefficients (MFCC) to extract features from the sample voice data, which effectively reduces the complexity of the subsequent pattern recognition system, efficiently models the principles of speech production, and is extremely widely applicable. Furthermore, the step of performing speech feature extraction on each sample voice data using Mel-frequency cepstral coefficients may be implemented with the open-source tool python_speech_features.
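A hedged sketch of this extraction step is given below. The source names python_speech_features but does not state how the audio is loaded or how the frame-by-coefficient MFCC output becomes a one-row matrix; reading the file with `scipy.io.wavfile`, using the library's default MFCC settings, and flattening frame by frame are assumptions, as are the helper names `flatten_features` and `extract_mfcc_row`. A mono WAV file is assumed.

```python
def flatten_features(feature_matrix):
    """Flatten a frames-by-coefficients matrix, frame by frame, into one
    flat list, i.e. a [1, N] row. The row-major order is an assumption."""
    return [float(value) for frame in feature_matrix for value in frame]


def extract_mfcc_row(wav_path):
    """Read a mono WAV file and return its MFCC features as one flat row.

    Requires the third-party packages scipy and python_speech_features;
    they are imported here so flatten_features works without them.
    """
    from scipy.io import wavfile
    from python_speech_features import mfcc

    sample_rate, signal = wavfile.read(wav_path)
    # Library defaults: 25 ms window, 10 ms step, 13 cepstral coefficients,
    # giving an array of shape (num_frames, 13).
    features = mfcc(signal, samplerate=sample_rate)
    return flatten_features(features)
```

Because later frames end up in later columns, keeping the last columns of the flattened row corresponds to keeping the end of the recording.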

S130: construct the size of the feature matrix of each sample voice data according to a preset value to obtain a set of normalized feature matrices.

Continuing with the example in which the sample voice data comprises two types, ring-unanswered (e.g., the first voice data) and call-rejected (e.g., the second voice data): the sample voice data in this case contains a CRBT or a ringback tone, and existing speech recognition methods cannot distinguish the two types effectively. Because the CRBT or ringback tone generally occurs at the beginning of the sample voice data, the feature matrix of the sample voice data can be constructed so as to eliminate the uninformative portion it contains, thereby obtaining the set of normalized feature matrices. That is, the CRBT or ringback portion at the beginning of the first and second voice data represented by the feature matrix is cut away, and the subsequent, distinguishable portion is retained for training the model.

Furthermore, each feature matrix in the set of feature matrices is constructed to represent the last n seconds of each sample voice data, where n is an integer greater than or equal to 5 and less than or equal to 15. Based on the value of n, the preset value is set to [1, M], where M is an integer greater than or equal to 1. The step of constructing the size of the feature matrix of each sample voice data according to a preset value comprises: constructing the size of the feature matrix of each sample voice data as [1, M], where M is the number of columns of the feature matrix (in other words, the feature matrix of each sample voice data has 1 row and M columns). The step of constructing the size of the feature matrix of each sample voice data as [1, M] further comprises: if the size of the feature matrix of the sample voice data exceeds [1, M], taking the last M columns of the feature matrix so that its size is [1, M]; if the size of the feature matrix of the sample voice data is smaller than [1, M], padding the front of the feature matrix with zeros so that its size is [1, M].

Inspection of the two types of voice data, ring-unanswered and call-rejected, shows that the recordings vary in length: the longest can reach 70 seconds and the shortest are only 3 or 4 seconds, and all of them begin with a ringback tone or CRBT. The two types differ only in the last n seconds: in the last n seconds a call-rejected recording plays an announcement such as "The number you have dialed is busy, please try again later", whereas a ring-unanswered recording contains only the ringback tone or CRBT throughout. According to testing and parameter tuning, the present invention distinguishes the two types best when n = 10. Once the value of n is fixed, the number of columns of the feature matrix is fixed as well; here M = 17381, which means that the size of the feature matrix of each sample voice data needs to be constructed as [1, 17381]. If the size of the feature matrix of the sample voice data exceeds [1, 17381], the last 17381 columns of the feature matrix are taken so that its size is [1, 17381]. If the size of the feature matrix of the sample voice data is smaller than [1, 17381], the front of the feature matrix is padded with zeros so that its size is [1, 17381].
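The truncate-or-pad rule above can be illustrated with a short sketch. The function name `fit_to_width` and the use of a plain Python list for the one-row feature matrix are assumptions; the value M = 17381 is the one given in the text for n = 10.

```python
M = 17381  # column count given in the text for n = 10 seconds


def fit_to_width(row, width=M):
    """Return a copy of `row` with exactly `width` columns.

    Longer rows keep only their last `width` columns (the end of the
    recording); shorter rows are padded with zeros at the front.
    """
    if len(row) >= width:
        return list(row[-width:])
    return [0.0] * (width - len(row)) + list(row)


# Small widths make the behaviour easy to see.
truncated = fit_to_width([1, 2, 3, 4, 5], width=3)
padded = fit_to_width([1, 2], width=4)
```

Padding at the front rather than the back keeps the end of every recording aligned in the same columns, which is where the text says the two types differ.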

S140: establish a classification model with a support vector machine algorithm based on the set of normalized feature matrices.

The types with which the sample voice data is labeled serve as the classification output of the classification model. That is, the classification model here is dedicated to identifying voice data of the two categories, the first voice data and the second voice data. Of course, the present invention is not limited to this; the types of target voice data that can be identified are the same as the types of the sample voice data.

S150: identify target voice data by means of the classification model.

In one embodiment of the present invention, the speech recognition method of the present invention is particularly suitable when the voice data to be identified is of the two types call-rejected and ring-unanswered. Specifically, a sufficient number of sample voice data is first obtained, the sample voice data being divided into the two types call-rejected (e.g., the first voice data) and ring-unanswered (e.g., the second voice data). Speech feature extraction is performed on the sample voice data using MFCC (Mel-frequency cepstral coefficients) to obtain the corresponding feature matrices; the extracted feature matrices are then normalized to form a set of feature matrices of the preset size, and a classification model is constructed with the SVM (support vector machine) algorithm based on this set. The classification model can then effectively distinguish the call-rejected voice data and ring-unanswered voice data to be identified. The support vector machine algorithm is built on the VC-dimension theory of statistical learning theory and on the structural risk minimization principle. Based on limited sample information, it seeks the best compromise between model complexity (i.e., the learning accuracy on specific training samples) and learning ability (i.e., the ability to identify arbitrary samples without error) in order to obtain the best generalization ability. The SVM algorithm therefore shows many distinctive advantages in small-sample, nonlinear, and high-dimensional pattern recognition, and can be extended to other machine learning problems such as function fitting. Because the classification model is constructed by machine learning, target voice data can be distinguished effectively as long as it belongs to one of the types in the sample voice data, regardless of the language of the target voice data. Accordingly, the present invention can also accurately distinguish target voice data in multiple languages by means of the classification model.
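For reference, the compromise between model complexity and training accuracy mentioned above is commonly written as the soft-margin support vector machine problem below. The source does not state which SVM formulation, kernel, or penalty value it uses, so this is the standard textbook form rather than the patent's own; here x_i would be a normalized feature row of length M, y_i in {+1, -1} would encode the two types of sample voice data, and the penalty parameter C is an assumed hyperparameter.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{N}\xi_i
\quad\text{subject to}\quad
y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0,\qquad i = 1,\dots,N
```

The first term limits model complexity by favoring a wide margin, while the slack variables measure training errors; C sets the balance between the two.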

Fig. 5 is a flowchart of establishing a classification model with a support vector machine algorithm based on the set of normalized feature matrices in an embodiment of the present invention. As shown in Fig. 5, the step of establishing a classification model with a support vector machine algorithm based on the set of normalized feature matrices comprises: S1410, dividing the set of normalized feature matrices into a training data set, a validation data set, and a test data set according to a preset ratio; and S1420, establishing the classification model with the support vector machine algorithm based on the training data set, the validation data set, and the test data set.

Specifically, the training data set, the validation data set, and the test data set each contain multiple groups of data, and each group of data includes the feature matrix of a sample voice data and the classification label of that sample voice data. The training data set is used to train the model or determine the model parameters, with the feature matrices of its sample voice data as the input of the classification model and the classification labels of its sample voice data as the output of the classification model. The test data set is used as follows: the feature matrices of its sample voice data are fed to the trained classification model, and the output of the classification model is compared with the classification labels of the sample voice data in the test data set to determine the discriminating ability (for example, the accuracy) of the trained classification model. Typically, when the preset ratio of the training data set, the validation data set, and the test data set is 6:2:2, the classification model recognizes the target voice data well.

The training data set is used to train the model or determine the model parameters, that is, to build the model. The validation data set is used for the final optimization and selection of the model and may be used repeatedly during model construction. The test data set is used only when the model is evaluated, to assess its accuracy; it must not be used in the model construction process, otherwise overfitting results.
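The 6:2:2 division and the accuracy check on the test data set can be sketched as follows. The function names `split_622` and `accuracy`, the shuffling before the split, and the fixed seed are illustrative assumptions; the source specifies only the ratio and the comparison of model outputs with the classification labels.

```python
import random


def split_622(items, seed=0):
    """Shuffle `items` and split them into training, validation, and
    test sets in a 6:2:2 ratio. Integer arithmetic avoids rounding
    surprises; leftover items go to the test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 6 // 10
    n_val = len(shuffled) * 2 // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


def accuracy(predicted, expected):
    """Fraction of predictions that match the classification labels."""
    matches = sum(1 for p, e in zip(predicted, expected) if p == e)
    return matches / len(expected)


train, val, test = split_622(range(10))
score = accuracy(["rejected", "unanswered", "rejected"],
                 ["rejected", "unanswered", "unanswered"])
```

Keeping the test set out of every tuning decision, as the text requires, means `accuracy` should be computed on `test` only once the model has been fixed using `train` and `val`.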

Fig. 2 is a structural schematic diagram of the speech recognition device in an embodiment of the present invention. As shown in Fig. 2, the speech recognition device comprises an acquisition module 201, a feature extraction module 202, a feature construction module 203, a model construction module 204, and an identification module 205. The acquisition module 201 is used to obtain a plurality of sample voice data. The feature extraction module 202 is used to perform speech feature extraction on each sample voice data using Mel-frequency cepstral coefficients to obtain a feature matrix of each sample voice data. The feature construction module 203 is used to construct the size of the feature matrix of each sample voice data according to a preset value to obtain a set of normalized feature matrices. The model construction module 204 is used to establish a classification model with a support vector machine algorithm based on the set of normalized feature matrices. The identification module 205 is used to identify target voice data by means of the classification model. The steps performed by each module and their principles have been described in the embodiments above and are not repeated here. Splitting and merging of the modules fall within the protection scope of the present invention provided they do not depart from the inventive concept. The speech recognition device proposed by the present invention has the advantages of good extensibility, higher accuracy, and lower cost, and can accurately distinguish target voice data in multiple languages, in particular voice data of failed outbound calls that contains a CRBT or a ringback tone.

Fig. 6 is a structural schematic diagram of the speech recognition device in another embodiment of the present invention. As shown in Fig. 6, in addition to the acquisition module 201, the feature extraction module 202, the feature construction module 203, the model construction module 204, and the identification module 205, the speech recognition device further comprises a sampling module 206. The feature construction module 203 further comprises a first feature construction module 2031 and a second feature construction module 2032. The model construction module 204 further comprises a data set construction module 2041. The acquisition module 201 is used to obtain a plurality of sample voice data. The feature extraction module 202 is used to perform speech feature extraction on each sample voice data using Mel-frequency cepstral coefficients to obtain a feature matrix of each sample voice data. The feature construction module 203 is used to construct the size of the feature matrix of each sample voice data according to a preset value to obtain a set of normalized feature matrices. The model construction module 204 is used to establish a classification model with a support vector machine algorithm based on the set of normalized feature matrices. The identification module 205 is used to identify target voice data by means of the classification model. The sampling module 206 is used to sample the more numerous type of voice data in the sample voice data so that the quantities of the various types of voice data in the sample voice data are the same or similar. The first feature construction module 2031 is used to construct feature matrices whose size exceeds the preset value. The second feature construction module 2032 is used to construct feature matrices whose size is smaller than the preset value. The data set construction module 2041 is used to divide the set of normalized feature matrices into a training data set, a validation data set, and a test data set according to a preset ratio. The steps performed by each module and their principles have been described in the embodiments above and are not repeated here. Splitting and merging of the modules fall within the protection scope of the present invention provided they do not depart from the inventive concept.

The speech recognition device proposed by the present invention has the advantages of good extensibility, higher accuracy, and lower cost, and can accurately distinguish target voice data in multiple languages, in particular voice data of failed outbound calls that contains a CRBT or a ringback tone.

In an exemplary embodiment of the present invention, a computer-readable storage medium is also provided, on which a computer program is stored; when executed by, for example, a processor, the program can implement the steps of the speech recognition method described in any of the above embodiments. In some possible embodiments, aspects of the present invention may also be implemented in the form of a program product comprising program code; when the program product is run on a terminal device, the program code causes the terminal device to perform the steps of the various exemplary embodiments of the present invention described in the speech recognition method section of this specification. The present invention thereby allows repair events reported by users to be processed automatically, simplifying the speech recognition process, relieving the workload of the back-office service team, and improving the user experience.

Fig. 3 is a structural schematic diagram of the computer-readable storage medium in an embodiment of the present invention. Fig. 3 depicts a program product 300 for implementing the above method according to an embodiment of the present invention, which may take the form of a portable compact disc read-only memory (CD-ROM) containing program code and may be run on a terminal device such as a personal computer. However, the program product of the present invention is not limited to this. In this document, a readable storage medium may be any tangible medium that contains or stores a program, which may be used by or in connection with an instruction execution system, apparatus, or device.

The program product 300 may employ any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.

The computer-readable storage medium may include a data signal propagated in baseband or as part of a carrier wave, in which readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A readable signal medium may also be any readable medium other than a readable storage medium that can send, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on a readable medium may be transmitted over any suitable medium, including but not limited to wireless, wired, optical cable, RF, or any suitable combination of the above.

The program code for carrying out the operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server. Where a remote computing device is involved, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).

In an exemplary embodiment of the present invention, a kind of electronic equipment is also provided, which may include processor, And the memory of the executable instruction for storing the processor.Wherein, the processor is configured to via described in execution Executable instruction is come the step of executing audio recognition method described in any one above-mentioned embodiment.

Person of ordinary skill in the field it is understood that various aspects of the invention can be implemented as system, method or Program product.Therefore, various aspects of the invention can be embodied in the following forms, it may be assumed that complete hardware embodiment, complete The embodiment combined in terms of full Software Implementation (including firmware, microcode etc.) or hardware and software, can unite here Referred to as circuit, " module " or " system ".

The electronic equipment 400 according to this embodiment of the present invention is described below with reference to Fig. 4. The electronic equipment 400 shown in Fig. 4 is only an example and should not impose any limitation on the function or scope of use of the embodiments of the present invention.

As shown in Fig. 4, the electronic equipment 400 takes the form of a general-purpose computing device. The components of the electronic equipment 400 may include, but are not limited to: at least one processing unit 410, at least one storage unit 420, a bus 430 connecting different system components (including the storage unit 420 and the processing unit 410), a display unit 440, and the like.

The storage unit stores program code that can be executed by the processing unit 410, so that the processing unit 410 performs the steps of the various exemplary embodiments of the present invention described in the audio recognition method section of this specification above. For example, the processing unit 410 may perform the steps shown in Fig. 1.

The storage unit 420 may include readable media in the form of volatile storage units, such as a random access memory unit (RAM) 4201 and/or a cache memory unit 4202, and may further include a read-only memory unit (ROM) 4203.

The storage unit 420 may also include a program/utility 4204 having a set of (at least one) program modules 4205. Such program modules 4205 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data. Each of these examples, or some combination of them, may include an implementation of a network environment.

The bus 430 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus structures.

The electronic equipment 400 may also communicate with one or more external devices 500 (such as a keyboard, a pointing device, a Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic equipment 400, and/or with any device (such as a router, a modem, etc.) that enables the electronic equipment 400 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 450. In addition, the electronic equipment 400 may communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 460. The network adapter 460 may communicate with the other modules of the electronic equipment 400 through the bus 430. It should be understood that, although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic equipment 400, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.

From the above description of the embodiments, those skilled in the art will readily understand that the example embodiments described herein may be implemented in software, or in software combined with the necessary hardware. Therefore, the technical solution according to the embodiments of the present invention may be embodied in the form of a software product, which may be stored on a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) or on a network, and which includes a number of instructions that cause a computing device (such as a personal computer, a server, or a network device) to perform the above audio recognition method according to the embodiments of the present invention.

The above is a further detailed description of the present invention in conjunction with specific preferred embodiments, and the specific implementation of the present invention should not be regarded as limited to these descriptions. For those of ordinary skill in the art to which the present invention belongs, a number of simple deductions or substitutions may be made without departing from the concept of the present invention, and all of these shall be regarded as falling within the protection scope of the present invention.

Claims

1. An audio recognition method, characterized by comprising the following steps:

Obtaining multiple sample voice data;

Performing speech feature extraction on each sample voice data using mel-frequency cepstral coefficients, to obtain a feature matrix of each sample voice data;

Constructing the size of the feature matrix of each sample voice data according to a preset value, to obtain a set of normalized feature matrices;

Establishing a classification model with a support vector machine algorithm based on the set of normalized feature matrices;

Identifying target voice data by means of the classification model.

2. The audio recognition method according to claim 1, characterized in that the sample voice data are divided into first voice data and second voice data, and the type of the sample voice data serves as the classification output of the classification model.

3. The audio recognition method according to claim 2, characterized in that whichever of the first voice data and the second voice data is more numerous is sampled, so that the quantities of the first voice data and the second voice data are the same.
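The balancing step of claim 3 could be sketched as follows. This is a minimal illustration only: the claim says the more numerous class is "sampled" without naming a strategy, so random sampling without replacement is an assumption, and the function name `balance_by_downsampling`, the `seed` parameter, and the file names are hypothetical.

```python
import random


def balance_by_downsampling(first, second, seed=0):
    """Return two equally sized lists by randomly sampling the larger one.

    Assumption: sampling is random and without replacement; the claim
    only states that the larger class is sampled down to equal size.
    """
    rng = random.Random(seed)
    target = min(len(first), len(second))
    if len(first) > target:
        first = rng.sample(list(first), target)
    if len(second) > target:
        second = rng.sample(list(second), target)
    return list(first), list(second)


# Hypothetical file names standing in for the two classes of voice data.
rejected = [f"rejected_{i}.wav" for i in range(10)]
unanswered = [f"unanswered_{i}.wav" for i in range(4)]
balanced_rejected, balanced_unanswered = balance_by_downsampling(rejected, unanswered)
```

After the call, both lists have the size of the smaller class, and the smaller class is returned unchanged.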

4. The audio recognition method according to claim 2, characterized in that the first voice data and the second voice data are labeled as rejected-call voice data and ring-unanswered voice data, respectively.

5. The audio recognition method according to claim 2, characterized in that the first voice data and the second voice data include a color ring back tone (CRBT) or a ringing tone.

6. The audio recognition method according to claim 1, characterized in that each feature matrix in the set of feature matrices is constructed to represent the last n seconds of voice data of each sample voice data, where n is an integer greater than or equal to 5 and less than or equal to 15.
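One way to restrict a sample to its last n seconds, as in claim 6, is to trim the raw waveform before feature extraction. The sketch below assumes that approach; the claim itself only requires that the feature matrix represent the last n seconds. The function name `last_n_seconds` and the 8000 Hz sample rate in the usage lines are hypothetical.

```python
def last_n_seconds(samples, sample_rate, n=10):
    """Keep only the final n seconds of a sequence of audio samples.

    Assumption: trimming happens on the waveform, before feature
    extraction. Clips shorter than n seconds are returned whole.
    """
    if not 5 <= n <= 15:
        raise ValueError("n must be an integer from 5 to 15")
    keep = n * sample_rate
    return list(samples[-keep:])


# Hypothetical 12-second and 3-second clips at an assumed 8000 Hz rate.
long_clip = list(range(12 * 8000))
short_clip = list(range(3 * 8000))
long_tail = last_n_seconds(long_clip, 8000)
short_tail = last_n_seconds(short_clip, 8000)
```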

7. The audio recognition method according to claim 6, characterized in that the preset value is set to [1, M] based on the value of n, where M is an integer greater than or equal to 1, and the step of constructing the size of the feature matrix of each sample voice data according to a preset value comprises:

8. The audio recognition method according to claim 7, characterized in that the step of constructing the size of the feature matrix of each sample voice data to [1, M] comprises:

If the size of the feature matrix of the sample voice data exceeds [1, M], truncating the feature matrix of the sample voice data to its last M columns, so that its size is [1, M];

If the size of the feature matrix of the sample voice data is less than [1, M], padding the front of the feature matrix of the sample voice data with 0, so that its size is [1, M].

9. The audio recognition method according to claim 7 or 8, characterized in that n is 10 seconds and M is 17381.
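The size normalization of claims 7 to 9 can be illustrated with a short sketch, assuming the feature matrix has already been flattened into a single row of values. The function name `fit_to_length` is hypothetical; the default M = 17381 is the value given in claim 9.

```python
def fit_to_length(features, m=17381):
    """Force a flattened 1-by-N feature row to exactly m values.

    Rows longer than m keep only their last m columns; shorter rows
    are zero-padded at the front, so the end of the recording stays
    aligned with the end of the row.
    """
    if m < 1:
        raise ValueError("m must be an integer greater than or equal to 1")
    if len(features) > m:
        return list(features[-m:])
    return [0.0] * (m - len(features)) + list(features)


# Illustrative rows: one too long, one too short.
long_row = [float(i) for i in range(20000)]
short_row = [1.0, 2.0, 3.0]
cropped = fit_to_length(long_row)
padded = fit_to_length(short_row)
```

Both results have exactly 17381 values, which gives the classifier a fixed-length input regardless of the original recording length.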

10. The audio recognition method according to claim 1, characterized in that the step of establishing a classification model with a support vector machine algorithm based on the set of normalized feature matrices comprises:

Dividing the set of normalized feature matrices into a training data set, a validation data set, and a test data set according to a preset ratio;

Establishing the classification model with the support vector machine algorithm based on the training data set, the validation data set, and the test data set;

Wherein the training data set is used to train the model or determine model parameters, the validation data set is used for model selection, and the test data set is used to test the discrimination ability of the trained model.

11. The audio recognition method according to claim 10, characterized in that the preset ratio of the training data set, the validation data set, and the test data set is 6:2:2.
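The 6:2:2 division of claims 10 and 11 might look like the following sketch. Shuffling before the split and the fixed `seed` are assumptions added for illustration, since the claims specify only the ratio; the function name `split_dataset` is hypothetical.

```python
import random


def split_dataset(samples, ratios=(6, 2, 2), seed=0):
    """Split samples into training, validation, and test lists.

    Assumption: samples are shuffled first. Any remainder from integer
    division falls into the test list.
    """
    items = list(samples)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# 100 placeholder sample indices split 60 / 20 / 20.
train, val, test = split_dataset(range(100))
```

The training list would then be used to fit the support vector machine, the validation list to choose among candidate models, and the test list to measure the chosen model.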

12. An audio recognition device, characterized by comprising:

An obtaining module, for obtaining multiple sample voice data;

A feature extraction module, for performing speech feature extraction on each sample voice data using mel-frequency cepstral coefficients, to obtain a feature matrix of each sample voice data;

A feature construction module, for constructing the size of the feature matrix of each sample voice data according to a preset value, to obtain a set of normalized feature matrices;

A model construction module, for establishing a classification model with a support vector machine algorithm based on the set of normalized feature matrices;

An identification module, for identifying target voice data by means of the classification model.

13. A storage medium, characterized in that a computer program is stored on the storage medium, and the computer program, when run by a processor, performs the steps according to any one of claims 1 to 11.

14. Electronic equipment, characterized in that the electronic equipment comprises:

A processor;

A storage medium on which a computer program is stored, the computer program, when run by the processor, performing the steps according to any one of claims 1 to 11.