WO2023279691A1 - Speech classification method and apparatus, model training method and apparatus, device, medium, and program

Info

Publication number: WO2023279691A1
Authority: WO (WIPO PCT)
Prior art keywords: speech, features, classified, voice, data set
Application number: PCT/CN2022/071089
Other languages: French (fr), Chinese (zh)
Inventors: 张军伟, 李�诚
Original Assignee / Applicant: 上海商汤智能科技有限公司

Classifications

    • G PHYSICS
      • G10 MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L15/00 Speech recognition
            • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
            • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
              • G10L15/063 Training
            • G10L15/08 Speech classification or search
              • G10L15/16 Speech classification or search using artificial neural networks
          • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
            • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
              • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
              • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Definitions

  • This application relates to the field of speech recognition, and involves, but is not limited to, a speech classification method, a model training method and apparatus, a device, a medium, and a program.
  • Speech recognition technology enables smart devices to understand human speech. It is a multidisciplinary field involving digital signal processing, artificial intelligence, linguistics, mathematical statistics, acoustics, emotion, and psychology. In recent years, with the rise of artificial intelligence, speech recognition technology has made great breakthroughs in both theory and application; it has begun to move from the laboratory to the market and has gradually entered daily life.
  • Speech recognition is a relatively large application field of artificial intelligence technology, and it can be divided into recognition of speech meaning and recognition of speech category.
  • For the recognition of speech categories, current artificial intelligence products that can realize speech recognition generally integrate pre-trained speech classification models. When recognition of a new category needs to be added, current solutions cannot achieve this.
  • Embodiments of the present application provide a speech classification method, a model training method and an apparatus, device, medium and program.
  • the first aspect of the embodiment of the present application provides a training method for a speech classification model.
  • The training method includes: obtaining at least one category of speech data, where speech data of the same category constitutes a speech data set; extracting the speech features of each piece of speech data in the speech data set; and using the speech features in the speech data set to train the sub-classification models in the speech classification model. The speech classification model includes at least one sub-classification model, and the sub-classification models correspond one-to-one with the speech data sets.
  • The proposed speech classification model includes sub-classification models, and each sub-classification model corresponds to one category of speech data set.
  • When training the speech classification model, the speech data of each category is obtained, the speech data of each category constitutes a speech data set, and the speech data set is used to train the corresponding sub-classification model in the speech classification model, so that the speech classification model can realize speech classification.
  • Because adding a category only requires training a new sub-classification model, the speech classification model in the embodiment of the present application can add new speech categories at any time.
  • In some embodiments, the training method also includes: determining the category features of the voice data set based on at least part of the voice data in the voice data set; and using the category features of the voice data set to process the voice features of each piece of voice data in the voice data set. Using the voice features in the voice data set to train the sub-classification models in the voice classification model then includes: using the processed voice features in the voice data set to train the sub-classification models in the voice classification model.
  • In this way, the category features of the voice data set can be obtained; the category features highlight the category of the voice data set, and processing the voice features with the category features improves the training effect, which is more conducive to the sub-classification model identifying the category.
  • the category features of the speech data set include audio loudness features and pitch change features of the speech data set.
  • the category features of speech datasets are mainly reflected in the loudness and pitch changes of speech.
  • Determining the category features of the voice data set includes: calculating the root mean square of the voice energy of at least part of the voice data in the voice data set to obtain the audio loudness features; and calculating the zero-crossing features of at least part of the voice data in the voice data set to obtain the pitch change features.
  • the root mean square of the energy of each speech data can be obtained, thereby obtaining the audio loudness feature in the category feature.
  • the audio zero-crossing feature of each speech data is obtained, so as to obtain the pitch change feature in the category feature.
  • the processing of the voice features of each voice data in the voice data set by using the category features of the voice data set includes: dividing the voice features by the audio loudness features, and adding the pitch change feature.
  • the processed speech features can be obtained based on the class features of different speech data, so as to further strengthen the distinction of different classes, which is beneficial to the subsequent training of speech classification models.
  • extracting the voice features of each voice data in the voice data set includes: extracting the voice features of each voice data in the voice data set, and performing dimensionality reduction processing on the voice features.
  • Performing dimensionality-reduction processing on the speech features can reduce the amount of calculation in subsequent training, which facilitates training the classification model on the terminal.
  • the training method includes: presenting an entry instruction, the entry instruction corresponds to the entry of a category of voice data; acquiring at least one category of voice data includes: acquiring the voice data according to the entry instruction.
  • the second aspect of the embodiment of the present application provides a voice classification method.
  • The voice classification method includes: obtaining the voice to be classified; extracting the features of the voice to be classified; and inputting the features into the voice classification model to determine the category of the voice to be classified, where the voice classification model is trained by the above-mentioned training method.
  • In this way, the speech to be classified can be recognized and classified efficiently and with high accuracy, and the categories of speech that can be recognized and classified can be trained in advance.
  • In some embodiments, the voice classification method also includes: determining the loudness feature and the tone feature of the voice to be classified; and using the loudness feature and the tone feature to process the features of the voice to be classified. Inputting the features of the voice to be classified into the voice classification model then includes: inputting the processed features of the voice to be classified into the voice classification model.
  • The loudness features and pitch features of different users' speech to be classified are different.
  • the voices of different users can be distinguished, so as to realize the extraction and optimization of the speech feature to be classified.
  • With the loudness feature and the tone feature of the voice to be classified as classification dimensions, the features of the voice to be classified are optimized to achieve accurate classification of different users.
  • extracting the to-be-classified speech features of the to-be-classified speech includes: extracting the to-be-classified speech features of the to-be-classified speech, and performing dimensionality reduction processing on the to-be-classified speech features.
  • Obtaining the voice to be classified includes: obtaining the control voice for the fan as the voice to be classified; determining the category of the voice to be classified includes: determining the category of the voice to be classified as one of start, stop, accelerate, decelerate, turn left, and turn right.
  • The third aspect of the embodiment of the present application provides a terminal device, including a memory and a processor coupled to each other; the processor is used to execute the program instructions stored in the memory, so as to implement the training method in the above first aspect and the speech classification method in the above second aspect.
  • The fourth aspect of the embodiment of the present application provides a computer-readable storage medium on which program instructions are stored; when the program instructions are executed by a processor, the training method in the above first aspect and the speech classification method in the above second aspect are implemented.
  • The fifth aspect of the embodiment of the present application provides a computer program, including computer-readable code; when the computer-readable code runs in a terminal device, the processor in the terminal device executes it to implement the training method in the above first aspect and the speech classification method in the above second aspect.
  • The speech classification model in the embodiment of the present application includes at least one sub-classification model, and the sub-classification models are set in one-to-one correspondence with the speech data sets; thus, each category's speech data set in the embodiment of the present application corresponds to a separately trained sub-classification model.
  • The training method in the embodiment of the present application is computationally light and can complete the speech classification training task on a robot with limited computing power; in the field of robot applications, it is suitable for use as an artificial intelligence teaching aid.
  • Fig. 1 is a schematic flowchart of the training method of the speech classification model of the embodiment of the present application;
  • Fig. 2 is a schematic flowchart of optimizing the speech features in the training method of the speech classification model of the embodiment of the present application;
  • Fig. 3 is a schematic flowchart of the speech classification method of the embodiment of the present application;
  • Fig. 4 is a schematic flowchart of optimizing the speech features to be classified in the speech classification method of the embodiment of the present application;
  • Fig. 5 is a schematic framework diagram of the training device of the speech classification model of the embodiment of the present application;
  • Fig. 6 is a schematic framework diagram of the speech classification device of the embodiment of the present application;
  • Fig. 7 is a schematic framework diagram of a terminal device according to an embodiment of the present application;
  • Fig. 8 is a schematic framework diagram of a computer-readable storage medium according to an embodiment of the present application.
  • Fig. 1 is a schematic flowchart of the training method of the speech classification model of the embodiment of the present application, and Fig. 2 is a schematic flowchart of optimizing the speech features in that training method; the training method is described below with reference to these figures.
  • The training method of the speech classification model of the embodiment of the present application is performed by an electronic device such as a smart device or a terminal device. The terminal device may be user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, etc.; smart devices may include intelligent educational robots, intelligent mobile robots, etc. The method can be implemented by the processor of the electronic device calling computer-readable instructions stored in the memory.
  • the embodiment of the present application provides a training method of a speech classification model, comprising the following steps:
  • Step S11 Obtain at least one category of speech data, and the same category of speech data constitutes a speech data set.
  • Categories can be based on general-domain classifications such as gender, number, and direction, and/or based on users. For example, gender classification includes the gender classification categories male and female; number classification includes the number classification categories 0-9; direction classification includes direction classification categories such as front, back, left, and right; and user classification includes user classification categories based on different users.
  • When acquiring the speech data of each category, the user may be guided to record the speech data multiple times according to instructions, and the recordings are clustered to form a speech data set.
  • Before acquiring at least one category of voice data, the method may include: presenting an input indication, where the input indication corresponds to the input of one category of voice data.
  • the device will present input instructions to guide the user to record voice data, which may be presented in the form of screen display and/or voice broadcast, and each input instruction corresponds to the input of a category of voice data.
  • For example, when the application scenario is voice control of a fan and the recording requirement is to control the fan to start, stop, accelerate, decelerate, turn left, turn right, etc., the input instruction can be displayed on the screen and/or broadcast by voice to guide the user to repeat voices such as "start the fan", "stop the fan", "increase the fan speed", "decrease the fan speed", "turn the fan to the left", and "turn the fan to the right", so as to obtain the corresponding categories of voice data.
  • In other scenarios, the input instructions can be displayed on the screen and/or broadcast by voice to guide the user to repeat voices of direction categories such as "walking forward", "walking backward", "walking left", and "walking right", voices of number categories such as "1" and "2", voices of length-unit categories such as "meter", and other desired voices, so as to obtain the voice data of the corresponding categories.
  • obtaining speech data of at least one category includes obtaining speech data according to an input instruction.
  • In some embodiments, the duration of a single piece of input voice data is 3-10 s, such as 3 s, 5 s, 8 s, or 10 s.
  • This length of speech data is conducive to extracting speech features while keeping the amount of calculation small, which speeds up subsequent data processing and thereby improves training efficiency.
  • When acquiring voice data based on user classification categories, the user can be guided to record voice data such as "hello" multiple times according to the input instructions, forming a voice data set of the user classification category associated with the user ID.
  • When acquiring voice data based on direction classification categories, the user can be guided to record "walking forward", "walking right", and similar voice data multiple times according to the input instructions, forming voice data sets of the direction classification categories corresponding to each direction.
  • Usually, voice data for the four directions "front, back, left, and right" is recorded, forming voice data sets for four direction classification categories.
  • When acquiring voice data based on number categories, the user can be guided to record number-related voice data such as "0" and "1" multiple times according to the input instructions, forming voice data sets of the corresponding number categories.
  • Usually, the voice data of the ten digits "0-9" is recorded, forming voice data sets for ten number classification categories.
  • When acquiring gender-based voice data, the user can be guided to record the instructed phrase voice data multiple times according to the instructions, and auxiliary means such as face recognition can be combined to classify the user's gender and form a voice data set of the corresponding gender classification category.
  • Each input indication corresponds to the input of speech data of one category.
  • In some embodiments, users can also be guided to record voice data such as "walking 1 meter forward" multiple times according to the input instructions; according to the different sound segments, this forms both the voice data set of the "front" direction classification category and the voice data set of the "1" number classification category, thereby reducing the amount of voice data the user must record and improving the user experience.
  • If the speech classification model only needs to be trained for general-domain recognition, there is no need to obtain speech data of user classification categories.
  • In this case, the voice data of the required general-domain categories can be obtained as needed, and a speech classification model for general-domain recognition can be trained. If the speech classification model needs to be trained for speech recognition based on each user, first obtain the speech data of the user categories to form speech data sets of user classification categories associated with user IDs; then obtain each user's speech data for the other required general-domain categories, forming the other speech data sets of those categories.
  • voice data is usually obtained by recording the user's voice.
  • Robot products generally have a built-in sound card, and the recording function can be normally realized after the sound card is configured.
  • In some cases, the voice recorded by the robot is very quiet and the user must be very close to the robot.
  • Therefore, the microphone's voice enhancement can be configured to make the pickup slightly stronger, which makes it convenient for the user to record voice data.
  • The actual enhancement parameters are adjusted according to the conditions under which the robot records the user's voice, and are not limited here.
  • voice data can also be obtained by communicating with other devices, such as by downloading from a cloud server or obtaining from other mobile devices.
  • Step S12 Extracting the speech features of each speech data in the speech data set.
  • Extracting the speech features of the speech data may be realized with Mel-frequency cepstral coefficient (MFCC) speech features.
  • the Mel-frequency cepstrum coefficients are the coefficients that make up the Mel-frequency cepstrum.
  • The difference between the cepstrum and the mel-frequency cepstrum is that the frequency band division of the mel-frequency cepstrum is equally spaced on the mel scale, which approximates the human auditory system better than the linearly spaced frequency bands used in the normal log cepstrum.
  • The mel filter bank is a preset number of nonlinearly distributed triangular bandpass filters, and the logarithmic energy output by each filter can be obtained.
  • The preset number can be 20, for example. Note that these triangular bandpass filters are evenly distributed in frequency on the "mel scale".
  • The mel frequency represents the human ear's typical sensitivity to frequency; the human ear's perception of a frequency f changes logarithmically.
  • The general process of extracting the MFCC voice features of each piece of voice data in the voice data set includes the following steps:
  • Pre-emphasis: the pre-emphasis filter mainly amplifies the high frequencies, eliminating the effect of the vocal cords and lips during vocalization, compensating the high-frequency part of the voice signal suppressed by the articulatory system, and highlighting the high-frequency formants. This can be achieved with a high-pass filter.
  • Framing: the speech signal is short-term stationary, so the feature extraction operation is usually performed within a short time-frame window; at the same time, to avoid large differences between consecutive frames, adjacent extracted frames overlap.
  • Windowing: each frame is generally multiplied by a window function, such as a Hamming window, to smooth the signal. The purpose is to increase continuity at both ends of the frame and reduce spectral leakage in subsequent operations.
  • Frequency-domain conversion: a Fourier transform is applied per frame, called the short-time Fourier transform (STFT); the purpose is to convert the signal from the time domain to the frequency domain.
  • the purpose of the Mel scale is to simulate the non-linear perception of sound by the human ear, being more discriminating at lower frequencies and less discriminating at higher frequencies.
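  • As an illustration of the pre-emphasis, framing, windowing, and frequency-domain conversion steps above, the following is a minimal NumPy sketch; the frame size, overlap, and pre-emphasis coefficient are illustrative assumptions, not values mandated by the embodiment:

```python
# Hypothetical sketch of the first MFCC steps: pre-emphasis, framing with
# overlap, Hamming windowing, and the power spectrum of the per-frame FFT.
import numpy as np

def power_spectrum(signal, frame_len=512, hop=256, alpha=0.97):
    # Pre-emphasis: high-pass filter that amplifies high frequencies.
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Framing: short overlapping windows over the short-term stationary signal.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])
    # Windowing: a Hamming window smooths both ends of each frame.
    frames = frames * np.hamming(frame_len)
    # Frequency-domain conversion (STFT) and the power spectrum.
    spectrum = np.fft.rfft(frames, axis=1)
    return (np.abs(spectrum) ** 2) / frame_len
```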
  • The filter bank coefficients calculated in the above steps are highly correlated, and the discrete cosine transform (DCT) can be applied to decorrelate them and generate a compressed representation of the filter bank. Putting the log energies obtained in the previous step into the discrete cosine transform formula yields the MFCCs:
  • C(l) = Σ_{m=1}^{M} log s(m) · cos(π l (m − 0.5) / M), l = 1, 2, …, L
  • Here s(m) is the energy value of the m-th filter obtained in the mel-scale extraction step; L is the order of the MFCC coefficients, usually 12-16; M is the number of triangular filters; and N is the size of each frame: a preset number of sampling points is combined into an observation unit called a frame, and the preset number is usually 256 or 512, i.e., the value of N is usually 256 or 512.
  • Through the above process, the MFCC speech features of each piece of speech data in the speech data set can be extracted.
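  • In practice, the whole extraction pipeline above is available as a library call. Below is a minimal sketch assuming the librosa library; the embodiment does not mandate a specific library, and the file names are hypothetical:

```python
# Hypothetical sketch: extract the MFCC matrix [n_mfcc, n_frames] for each
# recording in a category's speech data set, assuming librosa is available.
import librosa
import numpy as np

def extract_mfcc(wav_path, n_mfcc=16):
    y, sr = librosa.load(wav_path, sr=None)      # keep the native sample rate
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])   # pre-emphasis
    # librosa performs framing, windowing, STFT, mel filtering and the DCT.
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

dataset_features = [extract_mfcc(p) for p in ["rec_0.wav", "rec_1.wav"]]
```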
  • Extracting the voice features of each piece of voice data in the voice data set includes: extracting the voice features of each piece of voice data in the voice data set, and performing dimensionality-reduction processing on the voice features. The extracted original MFCC features may have different dimensions due to different audio durations, while the classification model requires the speech features of the speech data in the speech data set to have the same feature dimension when the speech data set is used to train the speech classification model; therefore, dimensionality-reduction processing needs to be performed on the speech features so that they are suitable for training the classification model.
  • Before performing dimensionality-reduction processing on the speech features, all speech data shorter than a preset duration is removed from the speech data set.
  • The preset duration is, for example, 0.5 s. This removes invalid speech data that is too short, reducing the amount of calculation and improving training accuracy and training efficiency.
  • Performing dimensionality-reduction processing on the speech features includes: the dimension of the extracted MFCC features is determined by two parts, the feature-vector dimension and the number of frames, recorded as [n_mfcc, n_frames]. According to empirical parameters, the feature-vector dimension n_mfcc can be set to 16; the number of frames n_frames is related to the audio duration, so the minimum number of frames over the data set can be taken, and the two-dimensional features are then flattened into one-dimensional features. This realizes the dimensionality reduction of the speech features and reduces the amount of computation.
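  • A minimal sketch of this dimensionality reduction, continuing from the extraction sketch above; the truncate-to-a-shared-frame-count strategy follows the description, and all names are illustrative:

```python
# Hypothetical sketch: equalize feature dimensions by truncating every
# [n_mfcc, n_frames] matrix to a shared frame count, then flattening to 1-D.
import numpy as np

def reduce_dims(mfcc_list, n_frames=None):
    if n_frames is None:                 # default: minimum over the data set
        n_frames = min(m.shape[1] for m in mfcc_list)
    return np.stack([m[:, :n_frames].flatten() for m in mfcc_list])

X = reduce_dims(dataset_features)        # shape: [num_recordings, 16 * n_frames]
```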
  • At this point, the voice features used for training the classification model can be extracted using the method provided above.
  • For voice data in user-category voice data sets, since the basic loudness of each user's voice differs, the voice data in different user categories' voice data sets have different category features. Therefore, when processing voice data in voice data sets based on user classification categories, in addition to extracting the voice features with the method provided above, the voice features in the voice data set need to be further optimized, as follows:
  • Step S121 Based on at least part of the voice data in the voice data set, determine category features of the voice data set.
  • the category features of the voice data set can be obtained, that is, the category of the voice data set can be highlighted through the category features, and the voice features can be processed by using the category features, which can make the training effect better and more efficient.
  • the category features of a voice data set composed of voice data of the same user category include: audio loudness features and pitch change features of the voice data set. Through audio loudness features and pitch change features, the voices of different users can be distinguished to realize feature extraction and optimization of speech data sets.
  • determining the category characteristics of the voice data set includes:
  • The root mean square of the speech energy of at least part of the speech data in the speech data set is calculated to obtain the audio loudness feature. According to the difference in the basic audio loudness of each category, the root mean square of the energy of each piece of voice data can be obtained, so as to obtain the audio loudness feature among the category features.
  • The zero-crossing features of at least part of the speech data in the speech data set are calculated to obtain the pitch change feature. Using the audio loudness feature and the pitch change feature as classification dimensions, the voice features are optimized to achieve accurate classification of the voice data sets of different users.
  • Step S122 Using the category features of the voice data set, process the voice features of each voice data in the voice data set.
  • the voice features of each voice data in the voice data set are processed by using the determined category features of the voice data set, that is, the audio loudness features and pitch change features obtained in the above step S121.
  • Processing the voice features of each piece of voice data in the voice data set includes: dividing the voice features of each user category by the corresponding audio loudness features, and adding the corresponding pitch change features, so as to obtain the processed speech features of each user category's speech data set.
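  • A minimal sketch of steps S121 and S122 under the same assumptions as the earlier sketches; librosa's zero_crossing_rate stands in for the zero-crossing computation, and the combination rule follows the divide-by-loudness-then-add-pitch description:

```python
# Hypothetical sketch: derive a user category's features from its raw signals
# and use them to process the flattened MFCC features from reduce_dims().
import librosa
import numpy as np

def category_features(signals):
    # Audio loudness: root mean square of the speech energy.
    loudness = np.mean([np.sqrt(np.mean(y ** 2)) for y in signals])
    # Pitch change: mean audio zero-crossing rate.
    pitch_change = np.mean(
        [librosa.feature.zero_crossing_rate(y).mean() for y in signals])
    return loudness, pitch_change

def optimize_features(X, loudness, pitch_change):
    # Step S122: divide by the loudness feature, then add the pitch feature.
    return X / loudness + pitch_change
```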
  • the speech feature extraction and optimization scheme adopted in the embodiment of the present application can obtain more generalized speech features and apply to more speech classification models.
  • Step S13 Use the voice features in the voice data sets to train the sub-classification models in the voice classification model; the voice classification model includes at least one sub-classification model, and the sub-classification models correspond one-to-one with the voice data sets.
  • the speech classification model in the embodiment of the present application includes at least one sub-classification model, and the sub-classification model is set in a one-to-one correspondence with the speech data set.
  • Thus, each category's voice data set in the embodiment of the present application corresponds to a separately trained sub-classification model.
  • A Gaussian mixture model (GMM) may be used as the speech classification model.
  • the Gaussian mixture model can be regarded as a model composed of K Gaussian sub-models, and these K single models are the hidden variables of the mixture model.
  • The number of speech data categories to be classified is K, and each sub-classification model is one Gaussian sub-model.
  • For example, for the four direction classification categories, the GMM model will train 4 Gaussian sub-models; for the ten number classification categories, the GMM model will train 10 Gaussian sub-models.
  • The embodiment of the present application adopts the Expectation-Maximization (EM) algorithm. Each iteration contains two steps: the E-step finds the expectation, and the M-step finds the maximum and calculates the model parameters of a new round of iteration.
  • E-step: γ_jk = α_k φ(x_j | θ_k) / Σ_{i=1}^{K} α_i φ(x_j | θ_i)
  • M-step: μ_k = (Σ_j γ_jk x_j) / (Σ_j γ_jk); Σ_k = (Σ_j γ_jk (x_j − μ_k)(x_j − μ_k)ᵀ) / (Σ_j γ_jk); α_k = (Σ_j γ_jk) / N
  • Here θ_k = (α_k, μ_k, Σ_k) are the model parameters of each sub-classification model, φ(x_j | θ_k) is the Gaussian density of sub-model k, X = {x_j} are the speech features, γ_jk is the expected output (the responsibility of sub-model k for the j-th sample), N is the total number of speech data in each speech data set, and j is the sequence number of each piece of speech data.
  • each sub-classification model is trained by the EM algorithm to obtain a sub-classification model that recognizes the corresponding category of speech data.
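  • A minimal training sketch for this design (one Gaussian sub-model per category), assuming scikit-learn's GaussianMixture as the EM implementation; the embodiment itself does not name a library:

```python
# Hypothetical sketch: train one Gaussian sub-model per category data set
# via the EM algorithm, assuming scikit-learn is available.
from sklearn.mixture import GaussianMixture

def train_classification_model(features_by_category):
    """features_by_category maps a category name to its feature matrix X."""
    model = {}
    for category, X in features_by_category.items():
        # Diagonal covariance keeps the per-dimension mean and variance
        # parameters that the EM training described above estimates.
        sub_model = GaussianMixture(n_components=1, covariance_type="diag")
        sub_model.fit(X)
        model[category] = sub_model
    return model
```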
  • If the speech classification model only needs to be trained for general-domain recognition, there is no need to obtain speech data of user classification categories; the speech features in each speech data set are used directly to train the corresponding sub-classification models in the speech classification model.
  • If the speech classification model is trained for identification based on both user classification categories and general-domain classification categories, it is first necessary to train the corresponding sub-classification model in the speech classification model with the processed speech features of a user category's speech data set; then the speech features in each of that user's other general-domain speech data sets are used to train each corresponding sub-classification model. The speech classification models of the other user categories are then trained in the same way.
  • In this way, each user has a corresponding sub-classification model, and the speech classification model obtained through training can specifically recognize the voices of different users, improving the accuracy of the speech classification model.
  • The voice classification model proposed in the embodiment of the present application includes sub-classification models, and one sub-classification model corresponds to one category of voice data set. When training the voice classification model, the voice data of each category is obtained, the voice data of each category constitutes a voice data set, and the voice data set is used to train the corresponding sub-classification model, so that the voice classification model can realize voice classification.
  • When a new category needs to be recognized, only a new sub-classification model needs to be trained for it, so the speech classification model in the embodiment of the present application can add new speech categories at any time, thereby reducing the amount of training, improving training efficiency, and realizing a general-purpose speech recognition scheme.
  • The training method of the embodiment of the present application is computationally light and can complete the speech classification training task on a robot with limited computing power; in the field of robot applications, it can be used as an artificial intelligence teaching aid.
  • The training method of the embodiment of the present application can implement the entire speech recognition process through Python programming.
  • Fig. 3 is a schematic flowchart of the speech classification method of the embodiment of the present application, and Fig. 4 is a schematic flowchart of optimizing the speech features to be classified in the speech classification method; the speech classification method is described below with reference to these figures.
  • The speech classification method of the embodiment of the present application is performed by an electronic device such as a smart device or a terminal device. The terminal device may be user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, etc.; smart devices may include intelligent educational robots, intelligent mobile robots, etc. The method can be implemented by the processor of the electronic device calling computer-readable instructions stored in the memory.
  • an embodiment of the present application provides a method for classifying speech, the method for classifying speech includes:
  • Step S21 Obtain the speech to be classified.
  • voices to be classified are acquired, and the voices to be classified may include wake-up voices and instruction voices.
  • the wake-up voice is used to wake up the device, and can be used by the voice classification model to identify the corresponding user, and the command voice is used to control the device.
  • In some embodiments, acquiring the voice to be classified includes: obtaining the control voice for the fan as the voice to be classified.
  • the to-be-classified speech categories recognized by the fan can be pre-set or directly obtained by the user through training on the fan, and can actually include start, stop, acceleration, deceleration, turn left, turn right, etc.
  • the above command voices are only some common command voices listed, and other command voices with similar meanings can also be used instead.
  • For example, "decelerate" may instead be "turn down" and "accelerate" may be "turn up"; "start" may be "open" and "stop" may be "close", which is not limited here.
  • Step S22 Extracting speech features of the speech to be classified.
  • the to-be-classified speech features of the to-be-classified speech may be implemented based on MFCC speech features.
  • The following is a brief introduction to MFCC speech features:
  • The mel-frequency cepstral coefficients are the coefficients that make up the mel-frequency cepstrum.
  • The difference between the cepstrum and the mel-frequency cepstrum is that the frequency band division of the mel-frequency cepstrum is equally spaced on the mel scale, which approximates the human auditory system better than the linearly spaced frequency bands used in the normal log cepstrum.
  • The mel filter bank is a preset number of nonlinearly distributed triangular bandpass filters, and the logarithmic energy output by each filter can be obtained. The preset number can be 20, for example; note that these triangular bandpass filters are evenly distributed in frequency on the "mel scale".
  • The mel frequency represents the human ear's typical sensitivity to frequency; the human ear's perception of a frequency f changes logarithmically.
  • The general process of extracting the MFCC features of the speech to be classified includes pre-emphasis, framing, windowing, frequency-domain conversion, power spectrum, mel-scale extraction, and obtaining the MFCCs; through this process, the MFCC speech features of the speech to be classified can be extracted.
  • the steps of actually extracting the speech features of the MFCCs of speech to be classified are similar to the corresponding steps in the above-mentioned embodiment.
  • extracting the speech features to be classified of the speech to be classified includes: extracting the speech features to be classified of the speech to be classified, and performing dimensionality reduction processing on the speech features to be classified, thereby reducing the amount of computation and improving recognition efficiency.
  • removing speech to be classified that is shorter than a preset duration is performed before performing dimensionality reduction processing on the speech features to be classified.
  • The preset duration is, for example, 0.5 s.
  • The dimensionality-reduction processing of the features of the speech to be classified includes: the dimension of the extracted MFCC features is determined by the feature-vector dimension and the number of frames, recorded as [n_mfcc, n_frames]. According to empirical parameters, n_mfcc can be set to 16; n_frames is related to the audio duration, so the minimum number of frames can be taken, and the two-dimensional features are then flattened into one-dimensional features, realizing the dimensionality reduction of the features of the speech to be classified and reducing the amount of calculation.
  • At this point, the features of the speech to be classified can be extracted using the method provided above.
  • For speech to be classified under user classification categories, since factors such as the basic loudness of each user's voice differ, the features of different users' speech to be classified are different. Therefore, when processing the speech to be classified, in addition to extracting its features with the methods provided above, the features need to be further optimized, as follows:
  • Step S221 Determine the loudness feature and tone feature of the speech to be classified.
  • the to-be-classified speech loudness features and to-be-classified tone features of the to-be-classified speech of different users are different.
  • the voices of different users can be distinguished through the loudness feature of the speech to be classified and the pitch feature of the speech to be classified, so as to realize the extraction and optimization of the speech feature to be classified.
  • Determining the loudness feature and the pitch feature of the speech to be classified is similar to step S121 above: the root mean square of the speech energy of the speech to be classified yields the loudness feature, and its zero-crossing feature yields the pitch feature.
  • Using the loudness feature and the pitch feature of the speech to be classified as classification dimensions, the features of the speech to be classified are optimized to realize accurate classification of different users. In other embodiments, it is also possible to distinguish different users based on other features as classification dimensions.
  • Step S222 Process the speech features to be classified by using the loudness features to be classified and the pitch features to be classified.
  • the speech features to be classified are processed by using the determined loudness features and pitch features of the speech to be classified, that is, the loudness features and pitch features to be classified obtained in the above step S221.
  • Processing the features of the speech to be classified includes: dividing each feature of the speech to be classified by the corresponding loudness feature, and adding the corresponding pitch feature, so as to obtain the processed features of each user's speech to be classified.
  • the speech feature extraction and optimization scheme adopted in the embodiment of the present application can obtain more generalized speech features to be classified, and is applicable to more speech classification models.
  • Step S23 input the features of the speech to be classified into the speech classification model, and determine the category of the speech to be classified.
  • the speech classification model of the embodiment of the present application is trained by using the training method in any of the above embodiments.
  • the speech classification model in the embodiment of the present application includes at least one sub-classification model, and each sub-classification model recognizes a class of speech features to be classified.
  • a Gaussian mixture model (GMM model) may be used as a speech classification model.
  • the Gaussian mixture model can be regarded as a model composed of K Gaussian sub-models, and these K single models are the hidden variables of the mixture model.
  • The number of categories of speech data to be classified is K, and each sub-classification model is one Gaussian sub-model.
  • For example, for four direction categories the GMM model will have trained 4 Gaussian sub-models, and for ten number categories, 10 Gaussian sub-models.
  • If the speech classification model is only used for recognition in the general domain, the speech to be classified is directly input into the speech classification model to obtain the classification result.
  • All sub-classification models in the speech classification model are called to calculate and save the probability that the speech to be classified belongs to each sub-classification model, and the category corresponding to the sub-classification model with the highest probability is selected as the classification result.
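  • A minimal inference sketch matching this rule, reusing the per-category models from the training sketch; score() is scikit-learn's log-likelihood, and all names are illustrative:

```python
# Hypothetical sketch: score one feature vector x (shape [1, dim]) against
# every sub-classification model and pick the most probable category.
def classify(model, x):
    # score() returns the log-likelihood of x under each Gaussian sub-model;
    # the category with the largest probability is the classification result.
    scores = {cat: sub.score(x) for cat, sub in model.items()}
    return max(scores, key=scores.get)
```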
  • If the speech classification model is used for identification based on both user classification categories and general-domain classification categories, it is first necessary to identify the user category to which the speech to be classified belongs. Inputting the features of the speech to be classified into the speech classification model includes: inputting the processed features into the speech classification model to obtain the user-category classification result, and then using the other sub-classification models related to that user to obtain the classification result of the speech to be classified in the general-domain categories.
  • All sub-classification models that identify user categories in the speech classification model are called to calculate and save the probability that the speech to be classified belongs to each of them, and the user category corresponding to the sub-classification model with the maximum probability is selected as the user-category classification result.
  • The other sub-classification models related to that user are then called to calculate and save the probability that the voice to be classified belongs to each of them, and the category corresponding to the sub-classification model with the largest probability is selected as the classification result.
  • the user's speech can be identified in a targeted manner, and the recognition efficiency and accuracy can be improved. Especially for users with dialects or accents, it can effectively improve the recognition accuracy and improve user experience.
  • The voice classification method in the embodiment of the present application can efficiently and accurately recognize and classify the speech to be classified, the categories of speech that can be recognized and classified can be trained in advance, and a general speech recognition and classification scheme can be realized.
  • the fan has a pre-trained voice classification model, or the user directly trains on the fan to obtain a voice classification model.
  • Determining the category of the speech to be classified includes: determining the category of the speech to be classified as one of start, stop, accelerate, decelerate, turn left, and turn right.
  • The above command voices are only some common examples; other command voices with similar meanings can also be used to train the fan's voice classification model and be used for recognition. For example, "decelerate" may be "turn down", "accelerate" may be "turn up", "start" may be "open", and "stop" may be "close", which is not limited here.
  • The voice classification method of the embodiment of the present application can also be used on other types of educational robots, such as lighting devices and walking trolleys.
  • the embodiment of the present application provides a speech classification method, which can be implemented in the following manner:
  • Audio data recording: configure the sound card and microphone to complete the audio data recording.
  • Extract MFCC (mel-frequency cepstral coefficients): based on traditional MFCC speech features, better speech recognition is achieved by optimizing the speech classifier.
  • the Mel-frequency cepstrum coefficients are the coefficients that make up the Mel-frequency cepstrum.
  • The difference between the cepstrum and the mel-frequency cepstrum is that the frequency band division of the mel-frequency cepstrum is equally spaced on the mel scale, which approximates the human auditory system more closely than the linearly spaced frequency bands in the normal logarithmic cepstrum.
  • The mel filter bank is a group of 20 nonlinearly distributed triangular bandpass filters, and the logarithmic energy output by each filter can be obtained; the 20 triangular bandpass filters are evenly distributed in frequency on the mel scale.
  • The mel frequency represents the human ear's typical sensitivity to frequency; the human ear's perception of a frequency f changes logarithmically.
  • MFCC acquisition: pre-emphasis, framing, windowing, frequency-domain conversion, power spectrum, mel-scale extraction, and obtaining the MFCCs.
  • Feature optimization: through the above steps, the audio recording has been completed and the corresponding MFCC features have been extracted for classification.
  • the original MFCC features may have different dimensions due to different audio time lengths, while most classifiers such as SVM require the same feature dimensions, so the features need to be optimized.
  • the embodiment of the present application further optimizes the original MFCC features, including the following:
  • According to empirical parameters, n_mfcc can be set to 16.
  • n_frames is related to the audio duration; the minimum number of frames can be taken, after which the two-dimensional features are flattened into one-dimensional features.
  • In view of the difference in each person's basic audio loudness, the root mean square of each person's energy is obtained, and the flattened features obtained in the above steps are divided by this root mean square.
  • the audio zero-crossing feature of each person is obtained, and this feature is superimposed on the above-mentioned features as a dimension of classification.
  • the Gaussian mixture model can be regarded as a model composed of K single Gaussian models, and these K sub-models are the hidden variables of the mixture model.
  • The number of speech classifications is K. For example, for four direction categories the GMM model will be trained to obtain four Gaussian sub-models, and for ten number categories, ten Gaussian sub-models.
  • The embodiment of the present application adopts the EM algorithm, an iterative algorithm for maximum-likelihood estimation of the parameters of a probability model containing hidden variables. Each iteration contains two parts: one finds the expectation, and the other finds the maximum and calculates the model parameters of the new round of iteration.
  • The training process of the speech recognition algorithm is: for each audio file in each category of audio data, extract the MFCC features of the audio; optimize the MFCC features; train the mean and variance parameters of each model with the EM algorithm; and save the model file of each completed training.
  • The recognition process of the speech recognition algorithm is: for an audio file, extract the MFCC features of its audio; optimize the MFCC features; for each GMM in the full set of GMMs, call the model to calculate the probability that the audio belongs to that model; save the probabilities for all models; and pick the category with the largest probability.
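  • Combining the sketches above, a minimal end-to-end run of the described training and recognition processes might look as follows; all file paths, category names, and the shared frame count are hypothetical:

```python
# Hypothetical end-to-end sketch reusing extract_mfcc, reduce_dims,
# train_classification_model, and classify from the earlier sketches.
N_FRAMES = 50   # shared frame count; recordings shorter than the preset
                # duration are assumed to have been removed beforehand

recordings = {  # illustrative per-category file lists
    "start": ["start_0.wav", "start_1.wav"],
    "stop": ["stop_0.wav", "stop_1.wav"],
}
model = train_classification_model({
    cat: reduce_dims([extract_mfcc(p) for p in paths], n_frames=N_FRAMES)
    for cat, paths in recordings.items()
})

x = reduce_dims([extract_mfcc("command.wav")], n_frames=N_FRAMES)
print(classify(model, x))   # e.g. "start"
```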
  • FIG. 5 is a schematic frame diagram of a training device for a speech classification model according to an embodiment of the present application.
  • this embodiment of the present application provides a speech classification model training device 300 , including: a speech acquisition module 31 , a feature extraction module 32 and a calculation module 33 .
  • the voice acquiring module 31 is configured to acquire at least one category of voice data, and the same category of voice data constitutes a voice data set.
  • the feature extraction module 32 is configured to extract the speech features of each speech data in the speech data set.
  • the operation module 33 is configured to use the speech features in the speech data set to train the sub-category models in the speech classification model; the speech classification model includes at least one sub-classification model, and the sub-classification models are in one-to-one correspondence with the speech data set.
  • The training device 300 of the embodiment of the present application classifies the speech data to form corresponding speech data sets, extracts and optimizes the speech features of the different categories of speech data, and uses the speech features to train the corresponding sub-classification models, thereby obtaining a speech classification model for the required categories of speech data.
  • The speech classification model includes at least one sub-classification model, and the sub-classification models are set in one-to-one correspondence with the speech data sets; thus, each category's speech data set in the embodiment of the present application corresponds to a separately trained sub-classification model.
  • The training method of the embodiment of the present application is computationally light and can complete the speech classification training task on a robot with limited computing power; in the field of robot applications, it can be used as an artificial intelligence teaching aid.
  • The training device 300 of the embodiment of the present application can implement the entire speech recognition process through Python programming.
  • In some embodiments, the training device further includes: a feature determination module configured to determine the category features of the speech data set based on at least part of the speech data in the speech data set; and a feature processing module configured to use the category features of the speech data set to process the speech features of each piece of speech data in the speech data set. The operation module includes: an operation sub-module configured to use the processed speech features in the speech data set to train the sub-classification models in the speech classification model.
  • the category features of the speech data set include audio loudness features and pitch change features of the speech data set.
  • The feature determination module includes: a first feature acquisition component configured to calculate the root mean square of the speech energy of at least part of the speech data in the speech data set to obtain the audio loudness feature; and a second feature acquisition component configured to calculate the zero-crossing features of at least part of the voice data in the voice data set to obtain the pitch change feature.
  • the feature processing module includes: a feature processing sub-module configured to divide the speech feature by the audio loudness feature, and add the pitch change feature.
  • the feature extraction module includes: a feature extraction submodule configured to extract speech features of each speech data in the speech data set, and perform dimensionality reduction processing on the speech features.
  • the training device includes: a presentation module configured to present an entry indication, the entry indication corresponding to the entry of a category of voice data;
  • The voice acquisition module includes: a voice acquisition sub-module configured to acquire the voice data according to the input indication.
  • FIG. 6 is a schematic frame diagram of a speech classification device according to an embodiment of the present application.
  • this embodiment of the present application provides a speech classification device 400 , including: a speech acquisition module 41 , a feature extraction module 42 and a classification module 43 .
  • the voice acquiring module 41 is configured to acquire the voice to be classified.
  • the feature extraction module 42 is configured to extract speech features of the speech to be classified.
  • the classification module 43 is configured to input the characteristics of the speech to be classified into the speech classification model to determine the category of the speech to be classified.
  • the speech classification model in the embodiment of the present application is trained by the training device in the above embodiment.
  • the speech classification device 400 of the embodiment of the present application has high recognition efficiency and accuracy of the speech to be classified, and the recognition and classification of speech categories to be classified can be trained in advance to realize general speech recognition and classification.
  • the speech classification device further includes: a feature determination module configured to determine the speech loudness feature to be classified and the tone feature to be classified of the speech to be classified; a feature processing module configured to use the loudness feature to be classified features and tone features to be classified, processing the speech features to be classified; the classification module includes: a first classification sub-module configured to input the processed speech features to be classified into the speech classification model.
  • the feature extraction module includes: a feature extraction submodule configured to extract speech features to be classified of the speech to be classified, and perform dimensionality reduction processing on the speech features to be classified.
  • The voice acquisition module includes: a voice acquisition sub-module configured to acquire the control voice for the fan as the voice to be classified; the classification module includes: a second classification sub-module configured to determine the category of the speech to be classified as one of start, stop, accelerate, decelerate, turn left, and turn right.
  • FIG. 7 is a schematic diagram of a framework of a terminal device according to an embodiment of the present application.
  • this embodiment of the present application provides a terminal device 700, including a memory 701 and a processor 702 coupled to each other, where the processor 702 is configured to execute the program instructions stored in the memory 701, so as to implement the training method of any of the above embodiments and the speech classification method of any of the above embodiments.
  • the terminal device 700 may include, but is not limited to: a microcomputer, a server, and mobile devices such as notebook computers and tablet computers.
  • the terminal device 700 may also include a fan, a lighting device, a walking trolley, and the like.
  • the processor 702 is configured to control itself and the memory 701 to implement the steps in any of the above embodiments of the training method, or to implement the steps in any of the above embodiments of the speech classification method.
  • the processor 702 may also be called a CPU (Central Processing Unit).
  • the processor 702 may be an integrated circuit chip with signal processing capability.
  • the processor 702 may also be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the processor 702 may also be jointly implemented by multiple integrated circuit chips.
  • FIG. 8 is a schematic frame diagram of a computer-readable storage medium according to an embodiment of the present application.
  • this embodiment of the present application provides a computer-readable storage medium 800, on which program instructions 801 are stored.
  • when the program instructions 801 are executed by a processor, the training method of any of the above embodiments and the speech classification method of any of the above embodiments are implemented.
  • thus, speech classification can be realized accurately and efficiently.
  • An embodiment of the present application also provides a computer program, the computer program includes computer readable code, and when the computer readable code is run on an electronic device or a terminal device, the methods in the foregoing embodiments are executed.
  • the embodiment of the present application also provides a computer program product, including computer-readable code or a non-volatile computer-readable storage medium carrying computer-readable code; when the computer-readable code runs in a processor of an electronic device, the processor in the electronic device executes the above method.
  • the disclosed methods and devices may be implemented in other ways.
  • the device implementations described above are only illustrative.
  • the division of modules or units is only a logical function division. In actual implementation, there may be other division methods.
  • units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • a unit described as a separate component may or may not be physically separate, and a component shown as a unit may or may not be a physical unit; that is, it may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • an integrated unit, if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium 800.
  • the storage medium 800 stores several instructions to enable a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods in the various embodiments of the present application.
  • the aforementioned storage medium 800 includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk, or an optical disc.
  • the embodiment of the present application provides a speech classification method, a model training method and apparatus, a device, a medium, and a program, where the training method includes: acquiring speech data of at least one category, where speech data of the same category constitutes one speech data set; extracting a speech feature of each piece of speech data in the speech data set; and training the sub-classification models in the speech classification model by using the speech features in the speech data sets, where the speech classification model includes at least one sub-classification model and the sub-classification models correspond one-to-one to the speech data sets.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

A speech classification method and apparatus, a model training method and apparatus (400), a device (700), a medium (800), and a program. The training method comprises: obtaining speech data of at least one category, the speech data of the same category forming one speech data set (S11); extracting a speech feature of each piece of speech data in the speech data set (S12); and training a sub-classification model in a speech classification model by using the speech features in the speech data set, the speech classification model comprising at least one sub-classification model, and the sub-classification models having a one-to-one correspondence to the speech data sets (S13). A speech data set is formed by classifying speech data by category, and a sub-classification model is trained by using the speech features, so as to obtain a speech classification model for recognizing speech data of a required category. Training with speech data of a new category alone enables the speech classification model to classify the new category.

Description

Speech classification method, model training method and apparatus, device, medium, and program
Cross-Reference to Related Applications
This application is based on, and claims priority to, Chinese patent application No. 202110762453.8, filed on July 6, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
This application belongs to the field of speech recognition and relates to, but is not limited to, a speech classification method, a model training method and apparatus, a device, a medium, and a program.
Background
Speech recognition technology enables smart devices to understand human speech. It is an interdisciplinary science involving digital signal processing, artificial intelligence, linguistics, mathematical statistics, acoustics, affective science, and psychology. In recent years, with the rise of artificial intelligence, speech recognition technology has made great breakthroughs in both theory and application; it has begun to move from the laboratory to the market and has gradually entered our daily lives.
Speech recognition is a relatively large application field of artificial intelligence and is divided into speech meaning recognition and speech type recognition. For recognizing speech categories, current artificial intelligence products capable of speech recognition generally integrate a pre-trained speech classification model; when recognition of a new category needs to be added, current solutions cannot achieve it.
Summary
Embodiments of the present application provide a speech classification method, a model training method and apparatus, a device, a medium, and a program.
A first aspect of the embodiments of the present application provides a training method for a speech classification model. The training method includes: acquiring speech data of at least one category, where speech data of the same category constitutes one speech data set; extracting a speech feature of each piece of speech data in the speech data set; and training sub-classification models in the speech classification model by using the speech features in the speech data sets, where the speech classification model includes at least one sub-classification model and the sub-classification models correspond one-to-one to the speech data sets.
Therefore, the proposed speech classification model includes sub-classification models, with one sub-classification model corresponding to one category's speech data set. When training the speech classification model, speech data of each category is acquired, the speech data of each category constitutes a speech data set, and the speech data sets are used to train the sub-classification models, so that the speech classification model can perform speech classification. Based on this training method, the speech classification model in the embodiments of the present application can add classification of new speech categories at any time.
The training method further includes: determining a category feature of the speech data set based on at least part of the speech data in the speech data set; and processing the speech feature of each piece of speech data in the speech data set by using the category feature of the speech data set. Training the sub-classification models in the speech classification model by using the speech features in the speech data set includes: training the sub-classification models by using the processed speech features in the speech data set.
Therefore, using at least part of the speech data in the speech data set, the category feature of the speech data set can be obtained; that is, the category feature highlights the category of the speech data set. Processing the speech features with the category feature can improve the training effect and make it easier for the sub-classification model to recognize the category.
The category features of the speech data set include an audio loudness feature and a pitch change feature of the speech data set.
Therefore, the category features of a speech data set are mainly reflected in the loudness and pitch changes of the speech.
Determining the category feature of the speech data set based on at least part of the speech data in the speech data set includes: calculating the root mean square of the speech energy of at least part of the speech data in the speech data set to obtain the audio loudness feature; and calculating a zero-crossing feature of at least part of the speech data in the speech data set to obtain the pitch change feature.
Therefore, given that the base audio loudness differs between categories, the root mean square of the energy of each piece of speech data can be obtained, yielding the audio loudness feature among the category features; given that the pitch changes differ between categories, the audio zero-crossing feature of each piece of speech data is obtained, yielding the pitch change feature among the category features.
Processing the speech feature of each piece of speech data in the speech data set by using the category features of the speech data set includes: dividing the speech feature by the audio loudness feature and adding the pitch change feature.
Therefore, processed speech features can be obtained based on the category features of different speech data, further strengthening the distinction between categories, which benefits subsequent training of the speech classification model.
Extracting the speech feature of each piece of speech data in the speech data set includes: extracting the speech feature of each piece of speech data in the speech data set, and performing dimensionality reduction on the speech features.
Therefore, performing dimensionality reduction on the speech features can reduce the amount of computation during subsequent training, making it feasible to train the classification model on a terminal.
The training method includes: presenting an entry indication, where the entry indication corresponds to the entry of speech data of one category; and acquiring speech data of at least one category includes: acquiring speech data entered according to the entry indication.
Therefore, it is convenient to guide the user to enter speech data.
A second aspect of the embodiments of the present application provides a speech classification method. The speech classification method includes: acquiring speech to be classified; extracting a to-be-classified speech feature of the speech to be classified; and inputting the to-be-classified speech feature into a speech classification model to determine the category of the speech to be classified, where the speech classification model is trained by the above training method.
Therefore, the speech to be classified can be recognized and classified efficiently and with high accuracy, and the categories of speech that can be recognized and classified can be trained in advance.
The speech classification method further includes: determining a to-be-classified loudness feature and a to-be-classified pitch feature of the speech to be classified; and processing the to-be-classified speech feature by using the to-be-classified loudness feature and the to-be-classified pitch feature. Inputting the to-be-classified speech feature into the speech classification model includes: inputting the processed to-be-classified speech feature into the speech classification model.
Therefore, the loudness and pitch features of the speech to be classified differ between users. Through the loudness feature and the pitch feature of the speech to be classified, the voices of different users can be distinguished, so as to extract and optimize the to-be-classified speech feature. Using the loudness feature and the pitch feature of the speech to be classified as classification dimensions, the to-be-classified speech feature is optimized to achieve accurate classification for different users.
Extracting the to-be-classified speech feature of the speech to be classified includes: extracting the to-be-classified speech feature of the speech to be classified, and performing dimensionality reduction on the to-be-classified speech feature.
Therefore, dimensionality reduction of the to-be-classified speech feature can be realized, reducing the amount of computation.
Acquiring the speech to be classified includes: acquiring control speech for a fan as the speech to be classified. Determining the category of the speech to be classified includes: determining the category of the speech to be classified as one of start, stop, accelerate, decelerate, turn left, and turn right.
Therefore, voice control of a fan can be realized.
A third aspect of the embodiments of the present application provides a terminal device, including a memory and a processor coupled to each other, where the processor is configured to execute program instructions stored in the memory, so as to implement the training method in the first aspect and the speech classification method in the second aspect.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium on which program instructions are stored, where the program instructions, when executed by a processor, implement the training method in the first aspect and the speech classification method in the second aspect.
A fifth aspect of the embodiments of the present application provides a computer program, including computer-readable code, where when the computer-readable code runs in a terminal device, a processor in the terminal device implements the training method in the first aspect and the speech classification method in the second aspect.
In the above solution, speech data is classified by category to form corresponding speech data sets, the speech features of different categories of speech data are extracted and optimized, and the speech features are used to train the corresponding sub-classification models, thereby obtaining a speech classification model that recognizes speech data of the required categories. The speech classification model in the embodiments of the present application includes at least one sub-classification model, with the sub-classification models corresponding one-to-one to the speech data sets. Thus, each category's speech data set corresponds to a separately trained sub-classification model; when the number of categories needs to be increased, there is no need to retrain the entire speech classification model, and only one additional sub-classification model needs to be trained to add a recognizable speech category. This reduces the amount of training, improves training efficiency, and realizes a general speech recognition solution. Furthermore, the training method of the embodiments of the present application has a low computational cost and can complete speech classification training tasks on robots with limited computing power; in the field of robot applications, it is suitable for use as an artificial intelligence teaching aid.
Brief Description of the Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a training method for a speech classification model according to an embodiment of the present application;
Fig. 2 is a schematic flowchart of optimizing speech features in the training method for a speech classification model according to an embodiment of the present application;
Fig. 3 is a schematic flowchart of a speech classification method according to an embodiment of the present application;
Fig. 4 is a schematic flowchart of optimizing to-be-classified speech features in the speech classification method according to an embodiment of the present application;
Fig. 5 is a schematic framework diagram of a training apparatus for a speech classification model according to an embodiment of the present application;
Fig. 6 is a schematic framework diagram of a speech classification apparatus according to an embodiment of the present application;
Fig. 7 is a schematic framework diagram of a terminal device according to an embodiment of the present application;
Fig. 8 is a schematic framework diagram of a computer-readable storage medium according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments in this application, all other embodiments obtained by persons of ordinary skill in the art without creative effort fall within the protection scope of this application.
Please refer to Fig. 1 and Fig. 2. Fig. 1 is a schematic flowchart of a training method for a speech classification model according to an embodiment of the present application; Fig. 2 is a schematic flowchart of optimizing speech features in the training method. The training method for a speech classification model of the embodiments of the present application is executed by an electronic device such as a smart device or a terminal device. The terminal device may be user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like; the smart device may include an intelligent educational robot, an intelligent mobile robot, or the like. The method may be implemented by a processor of the electronic device invoking computer-readable instructions stored in a memory.
An embodiment of the present application provides a training method for a speech classification model, including the following steps:
Step S11: Acquire speech data of at least one category, where speech data of the same category constitutes one speech data set.
Categories may be based on general-domain classification and/or user classification. General-domain classification includes gender classification, digit classification, direction classification, and the like. For example, gender classification includes the male and female categories; digit classification includes the digit categories 0-9; direction classification includes direction categories such as forward, backward, left, and right; user classification includes user categories based on individual users.
In some embodiments, when acquiring the speech data of each category, the user may be guided to record the speech data multiple times according to the instructions, and the recordings are clustered to form one speech data set.
In some embodiments, before acquiring speech data of at least one category, the method may include: presenting an entry indication, where the entry indication corresponds to the entry of speech data of one category. The device presents entry indications that guide the user to record speech data, which may be presented in the form of an on-screen display and/or a voice announcement; each entry indication corresponds to the entry of speech data of one category.
It should be noted that the actual content of the entry indication may be adjusted according to the application scenario and the recording requirements.
For example, if the application scenario is voice control of a fan and the recording requirement is to control the fan to start, stop, accelerate, decelerate, turn left, turn right, and so on, the entry indication may guide the user, via an on-screen display and/or voice announcement, to repeat phrases such as "start the fan", "stop the fan", "increase the fan speed", "decrease the fan speed", "turn the fan left", and "turn the fan right", so as to acquire speech data of the corresponding categories.
For example, if the application scenario is voice control of a walking trolley and the recording requirement is to control the trolley to move forward, backward, right, left, and so on, the entry indication may guide the user, via an on-screen display and/or voice announcement, to repeat direction-category phrases such as "walk forward", "walk backward", "walk left", and "walk right", digit-category phrases such as "1" and "2", length-unit-category phrases such as "meter", and any other required speech, so as to acquire speech data of the corresponding categories.
In some embodiments, acquiring speech data of at least one category includes acquiring speech data entered according to the entry indication. Optionally, according to the entry indication, the duration of a single entry of speech data is 3-10 s, for example 3 s, 5 s, 8 s, or 10 s. Within this range, the duration of the speech data facilitates speech feature extraction while keeping the computational load small, which speeds up subsequent data processing and thus improves training efficiency.
The following uses different categories to illustrate how to acquire speech data of at least one category:
When acquiring speech data of a user-classification category, the user may be guided to record speech data such as "hello" multiple times according to the entry indication, forming a speech data set of a user-classification category associated with a user ID.
When acquiring speech data of a direction-classification category, the user may be guided to record speech data such as "walk forward" and "walk right" multiple times according to the entry indication, forming speech data sets of the corresponding direction categories. Typically, recording speech data for the four directions forward, backward, left, and right yields speech data sets for four direction categories.
When acquiring speech data of a digit-classification category, the user may be guided to record digit-related speech data such as "0" and "1" multiple times according to the entry indication, forming speech data sets of the corresponding digit categories. Typically, recording speech data for the ten digits 0-9 yields speech data sets for ten digit categories.
When acquiring speech data of a gender-classification category, the user may be guided to record the indicated phrases multiple times according to the instructions, and auxiliary means such as face recognition may be combined to classify the user's gender, forming a speech data set of the corresponding gender category.
To reduce the training computation, each entry indication corresponds to the entry of speech data of one category. Of course, where technically feasible, the user may also be guided to record speech data such as "walk forward 1 meter" multiple times according to the entry indication, and the different sound segments may be used to form both a speech data set of the direction category "forward" and a speech data set of the digit category "1", thereby reducing the amount of speech data the user needs to record and improving the user experience.
It should be noted that if the speech classification model is trained only for general-domain recognition, there is no need to acquire speech data of user-classification categories; speech data of the required general-domain categories can be acquired on demand to train a speech classification model for general-domain recognition. If the speech classification model needs to be trained for per-user speech recognition, the speech data of the user category is acquired first to form a speech data set of a user-classification category associated with a user ID, and then the required speech data of the other general-domain categories is acquired for each user to form the speech data sets of the other categories.
In some embodiments, speech data is usually acquired by recording the user's voice. Robot products generally come with a built-in sound card, and the recording function works normally once the sound card is configured. In some cases, when the user records, the sound captured by the robot is very quiet and the user has to stay very close to the robot; in such cases, speech enhancement can be configured for the microphone to boost it slightly, making it easier for the user to enter speech data. The actual enhancement parameters are adjusted according to how the robot records the user's voice and are not limited here.
In some embodiments, speech data may also be acquired by communicating with other devices, for example by downloading it from a cloud server or obtaining it from other mobile devices.
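As an illustration only, a minimal Python sketch of such recording is given below. It assumes the third-party sounddevice and soundfile libraries; the 5 s duration, 16 kHz sample rate, and file names are hypothetical choices within the 3-10 s range described above, not part of the embodiments.

```python
import sounddevice as sd
import soundfile as sf

SAMPLE_RATE = 16000   # assumed sample rate for speech recording
DURATION_S = 5        # single entry duration, within the 3-10 s range above

def record_entry(path: str) -> None:
    # Record a single mono speech entry from the default input device.
    audio = sd.rec(int(DURATION_S * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
    sd.wait()  # block until the recording is finished
    sf.write(path, audio, SAMPLE_RATE)

# e.g. record ten repetitions of the phrase "start the fan":
# for i in range(10):
#     record_entry(f"fan_start_{i}.wav")
```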
Step S12: Extract the speech feature of each piece of speech data in the speech data set.
Most current speech recognition solutions use neural network models to complete speech training and recognition through word embedding classification. The training computation is heavy, such expensive computing operations cannot be completed on robot hardware with limited computing power, the training process is time-consuming, and training efficiency is low.
The embodiments of the present application achieve better speech recognition by optimizing the speech classification model based on the speech features of different categories of speech data. Extracting the speech features of the speech data may be implemented with Mel-frequency cepstral coefficient (MFCC) speech features. A brief introduction to MFCC speech features follows:
Mel-frequency cepstral coefficients (MFCCs) are the coefficients that make up the Mel-frequency cepstrum. The difference between the cepstrum and the Mel-frequency cepstrum is that the frequency bands of the Mel-frequency cepstrum are equally spaced on the Mel scale, which approximates the human auditory system better than the linearly spaced frequency bands used in the normal log cepstrum.
Since the energy spectrum still contains a large amount of useless information, and in particular the human ear cannot distinguish high-frequency variations, passing the spectrum through Mel filters solves this problem. A Mel filter bank is a set of a preset number of nonlinearly distributed triangular band-pass filters, from which the logarithmic energy output by each filter can be obtained. The preset number may be 20, for example. It must be noted that this preset number of triangular band-pass filters is evenly distributed over the "Mel scale" of frequencies. The Mel frequency represents the general human ear's sensitivity to frequency, from which it can also be seen that the human ear's perception of the frequency f varies logarithmically.
In some embodiments, the general flow of extracting the MFCC speech features of each piece of speech data in the speech data set includes the following steps:
Pre-emphasis
High-frequency energy is usually smaller than low-frequency energy. The pre-emphasis filter mainly amplifies high frequencies and removes the effects of the vocal cords and lips during phonation, so as to compensate for the high-frequency part of the speech signal suppressed by the articulatory system and to highlight high-frequency formants. This can be achieved with a high-pass filter.
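In practice, this high-pass filter is commonly the first-order difference y(t) = x(t) − α·x(t−1), where α is by convention around 0.95-0.97 (a typical value, not one fixed by this embodiment).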
Framing
A speech signal is short-time stationary, so feature extraction is usually performed within short frame windows. To avoid excessive differences between consecutive frames, adjacent extracted frames overlap.
Windowing
After framing, each frame is generally multiplied by a window function, such as a Hamming window, to smooth the signal. The purpose is to increase continuity at both ends of the frame and to reduce spectral leakage in subsequent operations.
Frequency-domain conversion
Frequency-domain conversion is the Fourier transform, here the short-time Fourier transform (STFT); its purpose is to convert the signal from the time domain to the frequency domain.
Power spectrum
Taking the squared magnitude of the spectrum of the speech signal yields the spectral line energy of the speech signal.
Extracting the Mel scale
A Mel filter bank is computed, and the power spectrum is passed through a set of Mel-scale triangular filters (typically 40 filters, nfilt = 40) to extract the frequency bands.
The purpose of the Mel scale is to simulate the human ear's nonlinear perception of sound: more discriminative at lower frequencies and less discriminative at higher frequencies.
Calculation method: the magnitude spectrum obtained by the fast Fourier transform (FFT) is multiplied with each filter in frequency and accumulated; the resulting value is the energy of that frame of data in the frequency band corresponding to that filter.
Obtaining MFCCs
The filter bank coefficients computed in the above steps are highly correlated. A discrete cosine transform (DCT) can be applied to decorrelate them and produce a compressed representation of the filter bank. Substituting the log energies obtained in the previous step into the DCT formula yields the MFCCs:
C(l) = Σ_{m=1}^{M} log s(m) · cos(π·l·(m − 0.5)/M),  l = 1, 2, …, L
where s(m) is the energy value of the m-th filter obtained in the Mel-scale extraction step; L is the MFCC coefficient order, usually 12-16; M is the number of triangular filters; and N is the frame size from the framing step. A preset number of sampling points is usually grouped into one observation unit called a frame; the preset number is usually 256 or 512, that is, N is usually 256 or 512.
Through the above method, the MFCC speech features of each piece of speech data in the speech data set can be extracted.
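As a hedged sketch of this extraction flow, the following Python code uses the librosa library, which performs the framing, windowing, STFT, power spectrum, Mel filtering, and DCT steps internally. The explicit pre-emphasis coefficient 0.97 and the choice of 16 coefficients are assumptions consistent with the parameters mentioned in this description, not mandated by it.

```python
import librosa
import numpy as np

def extract_mfcc(path: str, n_mfcc: int = 16) -> np.ndarray:
    # Load the audio at its native sample rate (mono).
    y, sr = librosa.load(path, sr=None, mono=True)
    # First-order pre-emphasis filter to boost high frequencies.
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])
    # librosa handles framing, windowing, STFT, power spectrum,
    # Mel filter bank, and DCT internally.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc  # shape: [n_mfcc, n_frames]
```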
In some embodiments, extracting the speech feature of each piece of speech data in the speech data set includes: extracting the speech feature of each piece of speech data, and performing dimensionality reduction on the speech features. Because the raw MFCC features may have different dimensions due to differing audio durations, and the classification model requires the speech features of the speech data in a speech data set to have the same feature dimension during training, the speech features need dimensionality reduction to be suitable for training the classification model.
In some embodiments, before performing dimensionality reduction on the speech features, all speech data shorter than a preset duration (for example, 0.5 s) is removed from the speech data set. This removes some overly short, invalid speech data, reduces the amount of computation, and improves training accuracy and efficiency.
In some embodiments, dimensionality reduction of the speech features includes the following: the dimensions of the extracted MFCC features are determined by two parts, the feature vector dimension and the number of frames, denoted [n_mfcc, n_frames]. Based on empirical parameters, the feature vector dimension n_mfcc may be set to 16; the number of frames n_frames is related to the audio duration, and the minimum number of frames can be taken. The two-dimensional feature is then flattened into a one-dimensional feature, achieving dimensionality reduction of the speech features and reducing the amount of computation.
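Continuing the sketch above (names are illustrative), the features of one speech data set could be aligned and flattened as follows; truncating every feature to the minimum frame count across the set is one illustrative reading of "taking the minimum number of frames":

```python
import numpy as np

def align_and_flatten(mfccs: list[np.ndarray]) -> np.ndarray:
    # Truncate every [n_mfcc, n_frames] feature to the smallest frame
    # count in the data set, then flatten each to one dimension.
    min_frames = min(m.shape[1] for m in mfccs)
    return np.stack([m[:, :min_frames].ravel() for m in mfccs])
    # shape: [num_utterances, n_mfcc * min_frames]
```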
For speech data in speech data sets based on general-domain categories, the method provided above can already extract the speech features used to train the classification model. In addition, for speech data in speech data sets based on user-classification categories, factors such as the base loudness of each user's voice differ, so the category features of the speech data in different user categories' speech data sets differ. Therefore, when processing speech data in speech data sets based on user-classification categories, besides extracting the speech features with the method provided above, the speech features in the speech data set need to be further optimized, as follows:
Step S121: Determine the category feature of the speech data set based on at least part of the speech data in the speech data set.
Using at least part of the speech data in the speech data set, the category feature of the speech data set can be obtained; that is, the category feature highlights the category of the speech data set. Processing the speech features with the category feature can improve the training effect and make it easier for the sub-classification model to recognize the category.
In some embodiments, the category features of a speech data set composed of speech data of the same user category include the audio loudness feature and the pitch change feature of the speech data set. Through the audio loudness feature and the pitch change feature, the voices of different users can be distinguished, so as to extract and optimize the features of the speech data set.
In some embodiments, determining the category feature of a speech data set based on at least part of the speech data in it includes:
calculating the root mean square of the speech energy of at least part of the speech data in the speech data set to obtain the audio loudness feature. Given that the base audio loudness differs between categories, the root mean square of the energy of each piece of speech data can be obtained, yielding the audio loudness feature among the category features;
calculating the zero-crossing feature of at least part of the speech data in the speech data set to obtain the pitch change feature. Given that the pitch changes differ between categories, the audio zero-crossing feature of each piece of speech data is obtained, yielding the pitch change feature among the category features.
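A minimal sketch of these two category features, again assuming librosa as the toolbox; averaging over the data set is an illustrative aggregation choice, not one specified by the embodiment:

```python
import librosa
import numpy as np

def category_features(waveforms: list[np.ndarray]) -> tuple[float, float]:
    # Audio loudness feature: root mean square of the speech energy.
    loudness = float(np.mean([np.sqrt(np.mean(y ** 2)) for y in waveforms]))
    # Pitch change feature: average zero-crossing rate.
    pitch_change = float(np.mean(
        [librosa.feature.zero_crossing_rate(y).mean() for y in waveforms]))
    return loudness, pitch_change
```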
In the above embodiments, the audio loudness feature and the pitch change feature among the category features are used as classification dimensions to optimize the speech features, achieving accurate classification of different users' speech data sets. In other embodiments, other category features may also be used as classification dimensions to classify different user categories.
Step S122: Process the speech feature of each piece of speech data in the speech data set by using the category feature of the speech data set.
The speech feature of each piece of speech data in the speech data set is processed by using the determined category features of the speech data set, that is, the audio loudness feature and the pitch change feature obtained in step S121 above.
In some embodiments, processing the speech feature of each piece of speech data in the speech data set by using the category features includes: dividing the speech features of each user category by the corresponding audio loudness feature and adding the corresponding pitch change feature, so as to obtain the speech features of each user category's speech data set.
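A sketch of this processing step, under the same assumptions as above:

```python
import numpy as np

def process_features(features: np.ndarray,
                     loudness: float, pitch_change: float) -> np.ndarray:
    # Divide by the audio loudness feature, then add the pitch change feature.
    return features / loudness + pitch_change
```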
The speech feature extraction and optimization scheme adopted in the embodiments of the present application can obtain more generalized speech features and is applicable to more speech classification models.
Step S13: Train the sub-classification models in the speech classification model by using the speech features in the speech data sets; the speech classification model includes at least one sub-classification model, and the sub-classification models correspond one-to-one to the speech data sets.
The embodiments of the present application achieve better speech recognition by optimizing the speech classification model based on the speech features of different categories of speech data, where the speech features of the speech data can be extracted through the above steps. The speech classification model in the embodiments of the present application includes at least one sub-classification model, with the sub-classification models corresponding one-to-one to the speech data sets. Thus, each category's speech data set corresponds to a separately trained sub-classification model; when the number of categories needs to be increased, there is no need to retrain the entire speech classification model, and only one additional sub-classification model needs to be trained to add a recognizable speech category. This reduces the amount of training, improves training efficiency, and realizes a general speech recognition solution.
The embodiments of the present application may use a Gaussian mixture model (GMM) as the speech classification model. A Gaussian mixture model can be viewed as a model composed of K Gaussian sub-models, and these K single models are the hidden variables of the mixture model. In training the speech classification model, the number of categories into which the speech data needs to be classified is K, and the sub-classification models are the Gaussian sub-models. For example, for the four direction categories forward, backward, left, and right, the GMM trains 4 Gaussian sub-models; for the ten digit categories 0-9, the GMM trains 10 Gaussian sub-models.
Different models may have different parameters, and the expectation-maximization (EM) algorithm may be used to determine the model parameters. The EM algorithm is an iterative algorithm for maximum likelihood estimation of the parameters of probabilistic models containing hidden variables.
Each iteration consists of two steps:
E-step: compute the expectations
E(γ_jk | X, θ) for all j = 1, 2, …, N
M-step: maximize the expected log-likelihood to compute the model parameters for the next iteration.
where θ denotes the model parameters of each sub-classification model; X denotes the speech features; γ_jk is the expected output; N is the total number of pieces of speech data in each speech data set; and j is the index of each piece of speech data.
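For a Gaussian mixture, this E-step expectation takes the standard responsibility form (written here with mixture weights π_k, which this description does not name explicitly):

γ̂_jk = π_k · N(x_j | μ_k, Σ_k) / Σ_{k'=1}^{K} π_{k'} · N(x_j | μ_{k'}, Σ_{k'})

where N(x | μ, Σ) is the Gaussian density with mean μ and covariance Σ; the M-step then re-estimates π_k, μ_k, and Σ_k from these responsibilities.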
The mean and variance parameters of each sub-classification model are trained with the EM algorithm to obtain sub-classification models that recognize the corresponding categories of speech data.
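One hedged way to realize this per-category EM training is sketched below with scikit-learn's GaussianMixture, which runs EM internally. Fitting one single-component Gaussian per category's data set is an illustrative reading of the one-to-one correspondence between sub-classification models and speech data sets, not the only possible one.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_submodels(datasets: dict[str, np.ndarray]) -> dict[str, GaussianMixture]:
    # One sub-classification model per category speech data set.
    submodels = {}
    for category, features in datasets.items():
        # EM estimates the mean and (diagonal) variance parameters.
        gmm = GaussianMixture(n_components=1, covariance_type="diag")
        gmm.fit(features)  # features: [num_utterances, feature_dim]
        submodels[category] = gmm
    return submodels
```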
It should be noted that if the speech classification model is trained only for general-domain recognition, there is no need to acquire speech data of user-classification categories; the speech features in each speech data set are used directly to train the corresponding sub-classification models in the speech classification model.
If the speech classification model is trained for recognition based on both user-classification categories and general-domain categories, the processed speech features in one user category's speech data set are first used to train the corresponding sub-classification model in the speech classification model; then the speech features in that user's other general-domain speech data sets are used to train the corresponding sub-classification models. The speech classification models of the other user categories are then trained in turn by the same method.
With the above training method, each user has a corresponding sub-classification model, and the trained speech classification model can recognize different users' speech in a targeted manner, improving the accuracy of the speech classification model.
In the embodiments of the present application, speech data is classified by category to form corresponding speech data sets, the speech features of different categories of speech data are extracted and optimized, and the speech features are used to train the corresponding sub-classification models, thereby obtaining a speech classification model that recognizes speech data of the required categories. The speech classification model proposed in the embodiments of the present application includes sub-classification models, with one sub-classification model corresponding to one category's speech data set; when training the speech classification model, speech data of each category is acquired, the speech data of each category constitutes a speech data set, and the speech data sets are used to train the sub-classification models, so that the speech classification model can perform speech classification. Based on this training method, the speech classification model in the embodiments of the present application can add classification of new speech categories at any time, which reduces the amount of training, improves training efficiency, and realizes a general speech recognition solution. The training method of the embodiments of the present application has a low computational cost and can complete speech classification training tasks on robots with limited computing power; in the field of robot applications, it is suitable for use as an artificial intelligence teaching aid. The training method of the embodiments of the present application can implement the entire speech recognition flow through Python programming.
请参阅图3和图4,图3是本申请实施例语音分类方法的流程示意图;图4是本申请实施例语音分类方法中的对待分类语音特征进行优化的流程示意图。 本申请实施例的语音分类方法由智能设备或终端设备等电子设备执行,终端设备可以为用户设备(User Equipment,UE)、移动设备、用户终端、终端、蜂窝电话、无绳电话、个人数字助理(Personal Digital Assistant,PDA)、手持设备、计算设备、车载设备、可穿戴设备等,智能设备可包括智能教育机器人、智能移动机器人等,所述方法可以通过电子设备的处理器调用存储器中存储的计算机可读指令的方式来实现。Please refer to FIG. 3 and FIG. 4, FIG. 3 is a schematic flow diagram of the speech classification method of the embodiment of the present application; FIG. 4 is a schematic flow diagram of optimizing the speech features to be classified in the speech classification method of the embodiment of the present application. The speech classification method of the embodiment of the present application is performed by electronic devices such as smart devices or terminal devices, and the terminal devices can be user equipment (User Equipment, UE), mobile devices, user terminals, terminals, cellular phones, cordless phones, personal digital assistants ( Personal Digital Assistant, PDA), handheld devices, computing devices, vehicle-mounted devices, wearable devices, etc., smart devices can include intelligent educational robots, intelligent mobile robots, etc., the method can call the computer stored in the memory through the processor of the electronic device It is implemented in the form of readable instructions.
Based on the foregoing embodiments, an embodiment of the present application provides a speech classification method, which includes:
Step S21: acquiring speech to be classified.
In some embodiments, the acquired speech to be classified may include wake-up speech and command speech. Wake-up speech is used to wake the device and allows the speech classification model to identify the corresponding user; command speech is used to control the device.
Taking a scheme that controls a fan by voice as an example, acquiring the speech to be classified includes:
acquiring the control speech directed at the fan as the speech to be classified. It should be noted that the categories of speech the fan can recognize may be preset or obtained by the user training directly on the fan, and in practice may include start, stop, speed up, slow down, turn left, turn right, and so on. The above command speech is merely a list of several common commands; other commands with similar meanings may be used instead. For example, "slow down" may also be "turn down" and "speed up" may also be "turn up"; "start" may also be "turn on" and "stop" may also be "turn off". No limitation is imposed here.
Step S22: extracting the to-be-classified speech features of the speech to be classified.
The to-be-classified speech features of the speech to be classified may be implemented based on MFCC speech features. A brief introduction to MFCC speech features follows:
Mel-frequency cepstral coefficients (MFCCs) are the coefficients that make up the mel-frequency cepstrum. The difference between the cepstrum and the mel-frequency cepstrum is that the frequency bands of the mel-frequency cepstrum are equally spaced on the mel scale, which approximates the human auditory system more closely than the linearly spaced bands used in the ordinary log cepstrum.
The energy spectrum still contains a large amount of useless information; in particular, the human ear cannot distinguish high-frequency variations, so passing the spectrum through a mel filter bank solves this problem. A mel filter bank is a preset number of nonlinearly distributed triangular band-pass filters, from which the log energy of each filter's output can be obtained. The preset number may be 20, for example. It must be noted that these triangular band-pass filters are evenly distributed on the "mel scale" of frequency. The mel frequency represents the typical human ear's sensitivity to frequency, which also shows that the ear's perception of a frequency f varies logarithmically.
In some embodiments, the general pipeline for extracting the MFCC features of the speech to be classified includes pre-emphasis, framing, windowing, frequency-domain conversion, power spectrum, mel-scale extraction, and obtaining the MFCCs; this pipeline yields the MFCC features of the speech to be classified. The actual extraction steps are similar to the corresponding steps in the foregoing embodiments.
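As an illustration only (not part of the claimed method), the pipeline above could be sketched in Python with the librosa library, which performs the framing, windowing, frequency-domain conversion, mel filtering, and cepstral steps internally; the file name and sample rate here are assumptions:

    import librosa

    # Hypothetical input file and sample rate; librosa resamples on load.
    y, sr = librosa.load("command.wav", sr=16000)
    # Pre-emphasis boosts high frequencies before framing and windowing.
    y = librosa.effects.preemphasis(y)
    # librosa.feature.mfcc covers framing, windowing, power spectrum,
    # mel filter bank, and the cepstral transform; n_mfcc=16 matches the
    # empirical setting described in this document.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=16)  # shape: (16, n_frames)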
In some embodiments, extracting the to-be-classified speech features includes extracting the features of the speech to be classified and performing dimensionality reduction on them, thereby reducing the amount of computation and improving recognition efficiency.
In some embodiments, before the dimensionality reduction, speech to be classified that is shorter than a preset duration is removed. For example, the preset duration may be 0.5 s. This removes overly short, invalid speech and avoids recognition errors.
In some embodiments, the dimensionality reduction includes: the dimensions of the extracted MFCC features are determined by two parts, the feature-vector dimension and the number of frames, denoted [n_mfcc, n_frames]. Based on empirical parameters, n_mfcc may be set to 16; n_frames is related to the audio duration, and the minimum frame count across clips may be taken, after which the two-dimensional feature is flattened into a one-dimensional feature. This accomplishes the dimensionality reduction of the to-be-classified speech features and reduces the amount of computation.
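A minimal sketch of this reduction, assuming the MFCC matrices of all clips are already available as NumPy arrays (the function and variable names are illustrative, not from the application):

    import numpy as np

    def flatten_mfcc(mfcc_list):
        # Each element has shape (n_mfcc, n_frames); n_frames varies with clip length.
        n_frames_min = min(m.shape[1] for m in mfcc_list)
        # Truncate every clip to the shortest frame count, then flatten
        # the 2-D matrix into 1-D, so every sample has the same dimension.
        return np.stack([m[:, :n_frames_min].reshape(-1) for m in mfcc_list])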
For speech to be classified under general-domain categories, the method provided above already suffices to extract the to-be-classified speech features. In addition, for speech to be classified under user categories, because factors such as the base loudness of each user's voice differ, the category features of different users' speech also differ. Therefore, when processing the speech to be classified, besides extracting the to-be-classified speech features with the method above, the features must be further optimized, as follows:
Step S221: determining the to-be-classified loudness feature and the to-be-classified pitch feature of the speech to be classified.
The loudness and pitch features of different users' speech to be classified differ. Through these features, the voices of different users can be distinguished, enabling the extraction and optimization of the to-be-classified speech features.
In some embodiments, determining the to-be-classified loudness feature and pitch feature of the speech to be classified includes:
computing the root mean square of the speech energy of the speech to be classified to obtain the loudness feature. Because the base audio loudness of each utterance differs, the root mean square of each utterance's energy can be obtained, yielding the loudness feature of the speech to be classified;
computing the zero-crossing feature of the speech to be classified to obtain the pitch feature. Because the pitch variation of each utterance differs, the audio zero-crossing feature of each utterance can be obtained, yielding the pitch feature of the speech to be classified.
In the above embodiments, the loudness and pitch features of the to-be-classified speech serve as classification dimensions along which the to-be-classified speech features are optimized, enabling accurate classification of different users. In other embodiments, other features may also serve as classification dimensions for distinguishing users.
Step S222: processing the to-be-classified speech features using the to-be-classified loudness feature and pitch feature.
The to-be-classified speech features are processed using the loudness and pitch features determined in step S221 above.
In some embodiments, this processing includes: dividing each to-be-classified speech feature by the corresponding loudness feature and adding the corresponding pitch feature, to obtain the to-be-classified speech features of each user.
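Step S222 could be sketched as follows, assuming librosa and NumPy are available; reading the loudness feature as the RMS of the raw waveform and the pitch feature as the mean zero-crossing rate is one plausible interpretation of the description above, not the only one:

    import numpy as np
    import librosa

    def optimize_feature(flat_feature, y):
        # Loudness feature: root mean square of the signal energy (step S221).
        rms = np.sqrt(np.mean(y ** 2))
        # Pitch feature: average zero-crossing rate of the clip (step S221).
        zcr = float(np.mean(librosa.feature.zero_crossing_rate(y)))
        # Divide by the loudness feature, then add the pitch feature (step S222).
        return flat_feature / rms + zcr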
The feature extraction and optimization scheme adopted in the embodiments of the present application yields more generalized to-be-classified speech features and is applicable to a wider range of speech classification models.
Step S23: inputting the to-be-classified speech features into the speech classification model and determining the category of the speech to be classified.
The speech classification model of the embodiments of the present application is obtained by training with the training method of any of the foregoing embodiments.
The speech classification model of the embodiments of the present application includes at least one sub-classification model, and each sub-classification model recognizes one category of to-be-classified speech features. A Gaussian mixture model (GMM) may be used as the speech classification model. A GMM can be viewed as a model composed of K Gaussian sub-models, these K single models being the hidden variables of the mixture model. In the GMM speech classification model, K is the number of categories into which the speech data must be classified, and the sub-classification models are the Gaussian sub-models. For example, for the four direction categories "forward, backward, left, right", the GMM would train 4 Gaussian sub-models; for the ten digit categories "0-9", it would train 10 Gaussian sub-models.
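For illustration, one per-category Gaussian sub-model could be fitted with scikit-learn's GaussianMixture, whose fit method runs the EM algorithm internally; treating each sub-model as a single diagonal Gaussian is an assumption of this sketch, not a detail fixed by the application:

    from sklearn.mixture import GaussianMixture

    def train_sub_models(datasets):
        # datasets: {category: feature matrix of shape (n_samples, n_features)}
        models = {}
        for category, feats in datasets.items():
            # One Gaussian sub-model per category, fitted by EM.
            gmm = GaussianMixture(n_components=1, covariance_type="diag",
                                  max_iter=200, random_state=0)
            gmm.fit(feats)
            models[category] = gmm
        return models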
It should be noted that if the speech classification model is used only for general-domain recognition, the speech to be classified is input directly into the speech classification model to obtain the classification result.
Optionally, all sub-classification models in the speech classification model are invoked, the probability that the speech to be classified belongs to each sub-classification model is computed and saved, and the category of the sub-classification model with the highest probability is selected as the classification result.
If the speech classification model is used for recognition based on both user categories and general-domain categories, the user category of the speech to be classified must be identified first. Inputting the to-be-classified speech features into the speech classification model then includes: inputting the processed to-be-classified speech features into the model to obtain a user-category classification result, and subsequently using the other sub-classification models associated with that user to identify the classification result in the general-domain categories. Optionally, all user-identifying sub-classification models in the speech classification model are invoked, the probability that the speech belongs to each is computed and saved, and the user category of the sub-classification model with the highest probability is selected as the user-category result. The other sub-classification models associated with that user are then invoked, the probability that the speech belongs to each is computed and saved, and the category of the sub-classification model with the highest probability is selected as the classification result.
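A sketch of this two-stage lookup, reusing the hypothetical sub-models fitted above; GaussianMixture.score returns the per-sample average log-likelihood, which serves here as the probability being compared:

    def classify(feature, user_models, command_models_by_user):
        # user_models: {user: fitted GMM}; command_models_by_user: {user: {command: fitted GMM}}
        x = feature.reshape(1, -1)
        # Stage 1: score against every user sub-model; keep the most likely user.
        user = max(user_models, key=lambda u: user_models[u].score(x))
        # Stage 2: score only against that user's command sub-models.
        commands = command_models_by_user[user]
        command = max(commands, key=lambda c: commands[c].score(x))
        return user, command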
By first identifying the user category, using it as a kind of login entry point, and then further recognizing that user's speech with the other corresponding sub-classification models, the user's speech can be recognized in a targeted way, improving recognition efficiency and accuracy. Especially for users with dialects or accents, this can effectively improve recognition accuracy and user experience. The speech classification method of the embodiments of the present application can recognize and classify the speech to be classified efficiently and accurately; the recognizable categories can be trained in advance, realizing a general-purpose speech recognition and classification scheme.
Continuing with the fan-control example, the fan carries a pre-trained speech classification model, or the user trains one directly on the fan. The speech classification model determining the category of the speech to be classified includes: determining that the category of the speech to be classified is one of start, stop, speed up, slow down, turn left, and turn right.
It should be noted that the above commands are merely several common examples; other commands with similar meanings may also be used to train the fan's speech classification model and for recognition. For example, "slow down" may also be "turn down" and "speed up" may also be "turn up"; "start" may also be "turn on" and "stop" may also be "turn off". No limitation is imposed here.
Besides fans, the speech classification method of the embodiments of the present application can also be used on other types of educational robots, such as lighting devices and walking carts.
Based on the foregoing embodiments, an embodiment of the present application further provides a speech classification method, which may be implemented as follows:
(1) Audio data recording: configure the sound card and microphone, and record the audio data.
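As a non-authoritative sketch, the recording could be done with the sounddevice package (an assumption; the application does not name a library), writing fixed-length clips to disk:

    import sounddevice as sd
    from scipy.io import wavfile

    SR = 16000        # sample rate, an assumed value
    DURATION = 2.0    # seconds per recorded command clip, also assumed

    # Record one mono clip from the default microphone and block until done.
    audio = sd.rec(int(SR * DURATION), samplerate=SR, channels=1)
    sd.wait()
    wavfile.write("sample.wav", SR, audio)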
(2) MFCC (mel-frequency cepstral coefficient) extraction: starting from traditional MFCC speech features, better speech recognition is achieved by optimizing the speech classifier. MFCCs are the coefficients that make up the mel-frequency cepstrum. The difference between the cepstrum and the mel-frequency cepstrum is that the frequency bands of the mel-frequency cepstrum are equally spaced on the mel scale, approximating the human auditory system more closely than the linearly spaced bands of the ordinary log cepstrum.
Since the energy spectrum still contains a large amount of useless information, and the human ear in particular cannot distinguish high-frequency variations, the spectrum is passed through a mel filter bank. The mel filter bank is a set of 20 nonlinearly distributed triangular band-pass filters, from which the log energy of each filter's output can be obtained; these 20 triangular band-pass filters are evenly distributed on the mel-scale frequency axis. The mel frequency represents the typical human ear's sensitivity to frequency, which also shows that the ear's perception of a frequency f varies logarithmically.
The general pipeline for obtaining MFCCs: pre-emphasis, framing, windowing, frequency-domain conversion, power spectrum, mel-scale extraction, and obtaining the MFCCs.
(3) Feature optimization: through the above steps the audio has been recorded and the corresponding MFCC features extracted for classification. Raw MFCC features may have different dimensions because audio durations differ, while most classifiers, such as SVMs, require identical feature dimensions, so the features must be optimized. Moreover, because factors such as each person's base voice loudness differ, the embodiments of the present application further optimize the raw MFCC features for these problems, as follows:
A. Normalizing the feature dimensions.
First, all audio that is too short (e.g., shorter than 0.5 seconds) is filtered out by searching. Second, the dimensions of the extracted MFCC features are determined by the feature-vector dimension and the number of frames, denoted [n_mfcc, n_frames]. Based on empirical parameters, n_mfcc may be set to 16; n_frames is related to the audio duration, and the minimum frame count may be taken, after which the two-dimensional feature is flattened into a one-dimensional feature.
B. Normalizing the features.
First, to account for each person's different base audio loudness, the root mean square of each person's energy is obtained, and the normalized feature dimensions obtained in the previous step are divided by this root mean square. Second, to account for each person's different pitch variation, each person's audio zero-crossing feature is obtained and superimposed on the above features as one classification dimension.
(4) GMM classifier: traditional speech recognition mostly uses an SVM as the classifier, which can also complete the speech recognition task in some scenarios. However, because the moment at which speech begins cannot be delimited within the speech signal, the SVM classifier performs poorly in this situation. The embodiments of the present application therefore propose completing the speech classification task with a GMM. A Gaussian mixture model can be viewed as a model composed of K single Gaussian models, these K sub-models being the hidden variables of the mixture model. In the speech recognition problem, K is the number of categories to classify. For example, for speech classification over the four directions forward, backward, left, and right, the GMM is trained to obtain four Gaussian sub-models; for recognition of the spoken digits 0 through 9, the GMM is trained to obtain 10 Gaussian sub-models.
Different models may have different parameters. The embodiments of the present application use the EM algorithm, an iterative algorithm for maximum-likelihood estimation of the parameters of probabilistic models containing hidden variables. Each iteration consists of two parts, an expectation step and a maximization step, which compute the model parameters for the next iteration.
The training flow of the speech recognition algorithm is: for each audio file in each category of audio data, extract the audio's MFCC features; optimize the MFCC features; train the mean and variance parameters of each model with the EM algorithm; and save each trained model file.
The recognition flow of the speech recognition algorithm is: for an audio file, extract its MFCC features; optimize the MFCC features; for each GMM in the set of all GMMs, invoke the model to compute the probability that the file belongs to it; save the probabilities of all models; and select the category with the highest probability.
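Combining the two flows, a hedged end-to-end sketch might persist each trained sub-model with joblib and score a new clip against every saved model; all names and file paths here are illustrative:

    import joblib
    from sklearn.mixture import GaussianMixture

    def train_and_save(datasets, path_tpl="gmm_{label}.joblib"):
        # datasets: {category: feature matrix (n_samples, n_features)}
        for label, feats in datasets.items():
            # Fit one sub-model per category (EM runs inside fit) and save it.
            gmm = GaussianMixture(n_components=1, covariance_type="diag").fit(feats)
            joblib.dump(gmm, path_tpl.format(label=label))

    def recognize(feature, labels, path_tpl="gmm_{label}.joblib"):
        x = feature.reshape(1, -1)
        # Score the optimized feature against every saved sub-model and pick
        # the category whose model gives the highest log-likelihood.
        scores = {l: joblib.load(path_tpl.format(label=l)).score(x) for l in labels}
        return max(scores, key=scores.get)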
Please refer to FIG. 5, a schematic framework diagram of the training apparatus for a speech classification model according to an embodiment of the present application.
Based on the foregoing embodiments, an embodiment of the present application provides a training apparatus 300 for a speech classification model, including a speech acquisition module 31, a feature extraction module 32, and a computation module 33. The speech acquisition module 31 is configured to acquire at least one category of speech data, the speech data of one category constituting one speech data set. The feature extraction module 32 is configured to extract the speech features of each piece of speech data in the speech data set. The computation module 33 is configured to train the sub-classification models in the speech classification model using the speech features in the speech data sets; the speech classification model includes at least one sub-classification model, and the sub-classification models correspond one-to-one with the speech data sets. The training apparatus 300 of the embodiments of the present application sorts speech data into categories to form corresponding speech data sets, extracts and optimizes the speech features of the different categories, and trains the corresponding sub-classification models with those features, thereby obtaining a speech classification model that recognizes the required categories of speech data. Because each category's speech data set trains its own sub-classification model, when the number of categories needs to grow there is no need to retrain the whole speech classification model; only one additional sub-classification model needs to be trained to add a recognizable speech category. This reduces the amount of training, improves training efficiency, and realizes a general-purpose speech recognition scheme. The training method of the embodiments of the present application has a low computational cost, so the speech classification training task can be completed on a robot with limited computing power; in robotics applications it is suitable for use as an artificial-intelligence teaching aid. The training apparatus 300 can implement the entire speech recognition pipeline through Python programming.
In some embodiments, the training apparatus further includes: a feature determination module configured to determine the category features of the speech data set based on at least part of the speech data in the set; and a feature processing module configured to process the speech features of each piece of speech data in the set using those category features. The computation module includes a computation sub-module configured to train the sub-classification models in the speech classification model using the processed speech features in the speech data set.
In some embodiments, the category features of the speech data set include the audio loudness feature and the pitch-variation feature of the set.
In some embodiments, the feature determination module includes: a first feature acquisition component configured to compute the root mean square of the speech energy of at least part of the speech data in the set to obtain the audio loudness feature; and a second feature acquisition component configured to compute the zero-crossing feature of at least part of the speech data in the set to obtain the pitch-variation feature.
In some embodiments, the feature processing module includes a feature processing sub-module configured to divide the speech features by the audio loudness feature and add the pitch-variation feature.
In some embodiments, the feature extraction module includes a feature extraction sub-module configured to extract the speech features of each piece of speech data in the set and perform dimensionality reduction on them.
In some embodiments, the training apparatus includes a presentation module configured to present an entry prompt corresponding to the entry of one category of speech data; the speech acquisition module includes a speech acquisition sub-module configured to acquire the speech data entered according to the prompt.
Please refer to FIG. 6, a schematic framework diagram of the speech classification apparatus according to an embodiment of the present application.
Based on the foregoing embodiments, an embodiment of the present application provides a speech classification apparatus 400, including a speech acquisition module 41, a feature extraction module 42, and a classification module 43. The speech acquisition module 41 is configured to acquire the speech to be classified. The feature extraction module 42 is configured to extract the to-be-classified speech features of the speech to be classified. The classification module 43 is configured to input the to-be-classified speech features into a speech classification model and determine the category of the speech to be classified; the speech classification model in the embodiments of the present application is trained by the training apparatus of the foregoing embodiments. The speech classification apparatus 400 recognizes the speech to be classified efficiently and accurately; the recognizable categories can be trained in advance, realizing general-purpose speech recognition and classification.
In some embodiments, the speech classification apparatus further includes: a feature determination module configured to determine the to-be-classified loudness feature and pitch feature of the speech to be classified; and a feature processing module configured to process the to-be-classified speech features using those loudness and pitch features. The classification module includes a first classification sub-module configured to input the processed to-be-classified speech features into the speech classification model.
In some embodiments, the feature extraction module includes a feature extraction sub-module configured to extract the to-be-classified speech features and perform dimensionality reduction on them.
In some embodiments, the speech acquisition module includes a speech acquisition sub-module configured to acquire the control speech directed at a fan as the speech to be classified; the classification module includes a second classification sub-module configured to determine that the category of the speech to be classified is one of start, stop, speed up, slow down, turn left, and turn right.
Please refer to FIG. 7, a schematic framework diagram of the terminal device according to an embodiment of the present application.
Based on the foregoing embodiments, an embodiment of the present application provides a terminal device 700, including a memory 701 and a processor 702 coupled to each other; the processor 702 executes program instructions stored in the memory 701 to implement the training method of any of the foregoing embodiments and the speech classification method of any of the foregoing embodiments. In a practical scenario, the terminal device 700 may include, but is not limited to, devices such as microcomputers, servers, and mobile devices such as notebook computers and tablet computers. In addition, the terminal device 700 may also include a fan, a lighting device, a walking cart, and the like.
The processor 702 controls itself and the memory 701 to implement the steps of any of the training method embodiments above, or the steps of any of the speech classification method embodiments above. The processor 702 may also be referred to as a CPU (Central Processing Unit). The processor 702 may be an integrated circuit chip with signal processing capability. It may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. In addition, the processor 702 may be implemented jointly by integrated circuit chips.
Through the above solution, speech classification can be realized accurately and efficiently.
Please refer to FIG. 8, a schematic framework diagram of the computer-readable storage medium according to an embodiment of the present application.
Based on the foregoing embodiments, an embodiment of the present application provides a computer-readable storage medium 800 on which program instructions 801 are stored; when executed by a processor, the program instructions 801 implement any of the training methods above and any of the speech classification methods above. Through the above solution, speech classification can be realized accurately and efficiently.
An embodiment of the present application further provides a computer program including computer-readable code; when the computer-readable code runs on an electronic device or terminal device, the methods of the foregoing embodiments are executed.
An embodiment of the present application further provides a computer program product, including computer-readable code or a non-volatile computer-readable storage medium carrying computer-readable code; when the computer-readable code runs in a processor of an electronic device, the processor of the electronic device executes the above methods.
In the several embodiments provided in this application, it should be understood that the disclosed methods and apparatuses may be implemented in other ways. For example, the apparatus implementations described above are merely illustrative. The division into modules or units is only a division by logical function; in actual implementation there may be other divisions: units or components may be combined or integrated into another system, or some features may be omitted or not executed. Moreover, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this implementation.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, may exist physically separately, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium 800. Based on this understanding, the technical solution of the embodiments of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium 800 and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods in the embodiments of the present application. The aforementioned storage medium 800 includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above are merely embodiments of the present application and do not thereby limit the patent scope of the present application. Any equivalent structural or process transformation made using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included within the patent protection scope of the present application.
Industrial Applicability
The embodiments of the present application provide a speech classification method, a model training method and apparatus, a device, a medium, and a program. The training method includes: acquiring at least one category of speech data, the speech data of one category constituting one speech data set; extracting the speech features of each piece of speech data in the speech data set; and training the sub-classification models in the speech classification model using the speech features in the speech data set, where the speech classification model includes at least one sub-classification model and the sub-classification models correspond one-to-one with the speech data sets. By sorting speech data into categories to form corresponding speech data sets and training the corresponding sub-classification models with speech features, a speech classification model that recognizes the required categories of speech data is obtained. In the embodiments of the present application, training with only the new category's speech data suffices to enable the speech classification model to classify the new category.

Claims (25)

  1. A training method for a speech classification model, wherein the training method comprises:
    acquiring at least one category of speech data, the speech data of one category constituting one speech data set;
    extracting speech features of each piece of speech data in the speech data set;
    training sub-classification models in the speech classification model using the speech features in the speech data set; the speech classification model comprises at least one sub-classification model, and the sub-classification models correspond one-to-one with the speech data sets.
  2. The training method according to claim 1, wherein the training method further comprises:
    determining category features of the speech data set based on at least part of the speech data in the speech data set;
    processing the speech features of each piece of speech data in the speech data set using the category features of the speech data set;
    wherein training the sub-classification models in the speech classification model using the speech features in the speech data set comprises:
    training the sub-classification models in the speech classification model using the processed speech features in the speech data set.
  3. The training method according to claim 2, wherein the category features of the speech data set comprise an audio loudness feature and a pitch-variation feature of the speech data set.
  4. The training method according to claim 3, wherein determining the category features of the speech data set based on at least part of the speech data in the speech data set comprises:
    computing the root mean square of the speech energy of at least part of the speech data in the speech data set to obtain the audio loudness feature;
    computing the zero-crossing feature of at least part of the speech data in the speech data set to obtain the pitch-variation feature.
  5. The training method according to claim 4, wherein processing the speech features of each piece of speech data in the speech data set using the category features of the speech data set comprises:
    dividing the speech features by the audio loudness feature, and adding the pitch-variation feature.
  6. The training method according to any one of claims 1-5, wherein extracting the speech features of each piece of speech data in the speech data set comprises:
    extracting the speech features of each piece of speech data in the speech data set, and performing dimensionality reduction on the speech features.
  7. The training method according to any one of claims 1-5, wherein the training method comprises:
    presenting an entry prompt, the entry prompt corresponding to the entry of one category of speech data;
    wherein acquiring the at least one category of speech data comprises: acquiring the speech data entered according to the entry prompt.
  8. A speech classification method, wherein the speech classification method comprises:
    acquiring speech to be classified;
    extracting to-be-classified speech features of the speech to be classified;
    inputting the to-be-classified speech features into a speech classification model to determine the category of the speech to be classified, the speech classification model being obtained by training with the training method according to any one of claims 1-7.
  9. The speech classification method according to claim 8, wherein the speech classification method further comprises:
    determining a to-be-classified loudness feature and a to-be-classified pitch feature of the speech to be classified;
    processing the to-be-classified speech features using the to-be-classified loudness feature and the to-be-classified pitch feature;
    wherein inputting the to-be-classified speech features into the speech classification model comprises:
    inputting the processed to-be-classified speech features into the speech classification model.
  10. The speech classification method according to claim 8 or 9, wherein extracting the to-be-classified speech features of the speech to be classified comprises:
    extracting the to-be-classified speech features of the speech to be classified, and performing dimensionality reduction on the to-be-classified speech features.
  11. The speech classification method according to claim 8 or 9, wherein acquiring the speech to be classified comprises:
    acquiring control speech directed at a fan as the speech to be classified;
    wherein determining the category of the speech to be classified comprises:
    determining the category of the speech to be classified to be one of start, stop, speed up, slow down, turn left, and turn right.
  12. A training apparatus for a speech classification model, wherein the training apparatus comprises:
    a speech acquisition module configured to acquire at least one category of speech data, the speech data of one category constituting one speech data set;
    a feature extraction module configured to extract speech features of each piece of speech data in the speech data set;
    a computation module configured to train sub-classification models in the speech classification model using the speech features in the speech data set; the speech classification model comprises at least one sub-classification model, and the sub-classification models correspond one-to-one with the speech data sets.
  13. The training apparatus according to claim 12, wherein the training apparatus further comprises:
    a feature determination module configured to determine category features of the speech data set based on at least part of the speech data in the speech data set;
    a feature processing module configured to process the speech features of each piece of speech data in the speech data set using the category features of the speech data set;
    wherein the computation module comprises:
    a computation sub-module configured to train the sub-classification models in the speech classification model using the processed speech features in the speech data set.
  14. The training apparatus according to claim 13, wherein the category features of the speech data set comprise an audio loudness feature and a pitch-variation feature of the speech data set.
  15. The training apparatus according to claim 14, wherein the feature determination module comprises:
    a first feature acquisition component configured to compute the root mean square of the speech energy of at least part of the speech data in the speech data set to obtain the audio loudness feature;
    a second feature acquisition component configured to compute the zero-crossing feature of at least part of the speech data in the speech data set to obtain the pitch-variation feature.
  16. The training apparatus according to claim 15, wherein the feature processing module comprises:
    a feature processing sub-module configured to divide the speech features by the audio loudness feature, and add the pitch-variation feature.
  17. The training apparatus according to any one of claims 12-16, wherein the feature extraction module comprises:
    a feature extraction sub-module configured to extract the speech features of each piece of speech data in the speech data set, and perform dimensionality reduction on the speech features.
  18. The training apparatus according to any one of claims 12-16, wherein the training apparatus comprises:
    a presentation module configured to present an entry prompt, the entry prompt corresponding to the entry of one category of speech data;
    wherein the speech acquisition module comprises:
    a speech acquisition sub-module configured to acquire the speech data entered according to the entry prompt.
  19. A speech classification apparatus, wherein the speech classification apparatus comprises:
    a speech acquisition module configured to acquire speech to be classified;
    a feature extraction module configured to extract to-be-classified speech features of the speech to be classified;
    a classification module configured to input the to-be-classified speech features into a speech classification model to determine the category of the speech to be classified, the speech classification model being obtained by training with the training method according to any one of claims 1-7.
  20. The speech classification apparatus according to claim 19, wherein the speech classification apparatus further comprises:
    a feature determination module configured to determine a to-be-classified loudness feature and a to-be-classified pitch feature of the speech to be classified;
    a feature processing module configured to process the to-be-classified speech features using the to-be-classified loudness feature and the to-be-classified pitch feature;
    wherein the classification module comprises:
    a first classification sub-module configured to input the processed to-be-classified speech features into the speech classification model.
  21. The speech classification apparatus according to claim 19 or 20, wherein the feature extraction module comprises:
    a feature extraction sub-module configured to extract the to-be-classified speech features of the speech to be classified, and perform dimensionality reduction on the to-be-classified speech features.
  22. The speech classification apparatus according to claim 19 or 20, wherein the speech acquisition module comprises:
    a speech acquisition sub-module configured to acquire control speech directed at a fan as the speech to be classified;
    wherein the classification module comprises:
    a second classification sub-module configured to determine the category of the speech to be classified to be one of start, stop, speed up, slow down, turn left, and turn right.
  23. A terminal device, wherein the terminal device comprises a memory and a processor coupled to each other, the processor being configured to execute program instructions stored in the memory to implement the method according to any one of claims 1 to 11.
  24. A computer-readable storage medium on which program data are stored, wherein the program data, when executed by a processor, implement the method according to any one of claims 1 to 11.
  25. A computer program comprising computer-readable code, wherein, when the computer-readable code runs in a terminal device, a processor in the terminal device executes the method according to any one of claims 1 to 11.
PCT/CN2022/071089 2021-07-06 2022-01-10 Speech classification method and apparatus, model training method and apparatus, device, medium, and program WO2023279691A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110762453.8A CN113539243A (en) 2021-07-06 2021-07-06 Training method of voice classification model, voice classification method and related device
CN202110762453.8 2021-07-06

Publications (1)

Publication Number Publication Date
WO2023279691A1 true WO2023279691A1 (en) 2023-01-12

Family

ID=78126826

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/071089 WO2023279691A1 (en) 2021-07-06 2022-01-10 Speech classification method and apparatus, model training method and apparatus, device, medium, and program

Country Status (2)

Country Link
CN (1) CN113539243A (en)
WO (1) WO2023279691A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113539243A (en) * 2021-07-06 2021-10-22 上海商汤智能科技有限公司 Training method of voice classification model, voice classification method and related device
CN114296589A (en) * 2021-12-14 2022-04-08 北京华录新媒信息技术有限公司 Virtual reality interaction method and device based on film watching experience

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105161092A (en) * 2015-09-17 2015-12-16 百度在线网络技术(北京)有限公司 Speech recognition method and device
US20190371301A1 (en) * 2018-05-31 2019-12-05 Samsung Electronics Co., Ltd. Speech recognition method and apparatus
CN111369982A (en) * 2020-03-13 2020-07-03 北京远鉴信息技术有限公司 Training method of audio classification model, audio classification method, device and equipment
CN112767967A (en) * 2020-12-30 2021-05-07 深延科技(北京)有限公司 Voice classification method and device and automatic voice classification method
CN113539243A (en) * 2021-07-06 2021-10-22 上海商汤智能科技有限公司 Training method of voice classification model, voice classification method and related device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108986801B (en) * 2017-06-02 2020-06-05 腾讯科技(深圳)有限公司 Man-machine interaction method and device and man-machine interaction terminal
CN108305616B (en) * 2018-01-16 2021-03-16 国家计算机网络与信息安全管理中心 Audio scene recognition method and device based on long-time and short-time feature extraction
CN108764304B (en) * 2018-05-11 2020-03-06 Oppo广东移动通信有限公司 Scene recognition method and device, storage medium and electronic equipment
CN109741747B (en) * 2019-02-19 2021-02-12 珠海格力电器股份有限公司 Voice scene recognition method and device, voice control method and device and air conditioner
CN110047517A (en) * 2019-04-24 2019-07-23 京东方科技集团股份有限公司 Speech-emotion recognition method, answering method and computer equipment

Also Published As

Publication number Publication date
CN113539243A (en) 2021-10-22

Similar Documents

Publication Publication Date Title
WO2021208287A1 (en) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
Koduru et al. Feature extraction algorithms to improve the speech emotion recognition rate
CN110289003B (en) Voiceprint recognition method, model training method and server
WO2021093449A1 (en) Wakeup word detection method and apparatus employing artificial intelligence, device, and medium
CN109243491B (en) Method, system and storage medium for emotion recognition of speech in frequency spectrum
Mukherjee et al. A lazy learning-based language identification from speech using MFCC-2 features
WO2023279691A1 (en) Speech classification method and apparatus, model training method and apparatus, device, medium, and program
WO2020034628A1 (en) Accent identification method and device, computer device, and storage medium
Pokorny et al. Detection of negative emotions in speech signals using bags-of-audio-words
CN102800316A (en) Optimal codebook design method for voiceprint recognition system based on nerve network
CN103871426A (en) Method and system for comparing similarity between user audio frequency and original audio frequency
WO2022100692A1 (en) Human voice audio recording method and apparatus
WO2022100691A1 (en) Audio recognition method and device
CN111583906A (en) Role recognition method, device and terminal for voice conversation
Fan et al. Deep neural network based environment sound classification and its implementation on hearing aid app
Chiou et al. Feature space dimension reduction in speech emotion recognition using support vector machine
CN111161713A (en) Voice gender identification method and device and computing equipment
Huang et al. Emotional speech feature normalization and recognition based on speaker-sensitive feature clustering
CN112562725A (en) Mixed voice emotion classification method based on spectrogram and capsule network
Fernandes et al. Speech emotion recognition using mel frequency cepstral coefficient and SVM classifier
Pao et al. A study on the search of the most discriminative speech features in the speaker dependent speech emotion recognition
Shah et al. Speech emotion recognition based on SVM using MATLAB
Chi et al. Robust emotion recognition by spectro-temporal modulation statistic features
CN111091809A (en) Regional accent recognition method and device based on depth feature fusion
Ahmed et al. CNN-based speech segments endpoints detection framework using short-time signal energy features

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22836451

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE