WO2023279691A1 - Speech classification method and apparatus, model training method and apparatus, device, medium and program - Google Patents

Speech classification method and apparatus, model training method and apparatus, device, medium and program

Info

Publication number
WO2023279691A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
features
classified
voice
data set
Prior art date
Application number
PCT/CN2022/071089
Other languages
English (en)
Chinese (zh)
Inventor
张军伟
李诚
Original Assignee
上海商汤智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司 filed Critical 上海商汤智能科技有限公司
Publication of WO2023279691A1

Links

Images

Classifications

    • G PHYSICS
      • G10 MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L15/00 Speech recognition
            • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
            • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
              • G10L15/063 Training
            • G10L15/08 Speech classification or search
              • G10L15/16 Speech classification or search using artificial neural networks
          • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
            • G10L25/03 characterised by the type of extracted parameters
              • G10L25/18 the extracted parameters being spectral information of each sub-band
              • G10L25/24 the extracted parameters being the cepstrum

Definitions

  • This application relates to the field of speech recognition, and in particular, but not exclusively, to a speech classification method, a model training method and apparatus, a device, a medium and a program.
  • Speech recognition technology enables smart devices to understand human speech. It is a multidisciplinary field involving digital signal processing, artificial intelligence, linguistics, mathematical statistics, acoustics, emotion and psychology. In recent years, with the rise of artificial intelligence, speech recognition technology has made great breakthroughs in both theory and application; it has begun to move from the laboratory to the market and has gradually entered daily life.
  • Speech recognition is a relatively large application field of artificial intelligence technology, which is divided into speech meaning recognition and speech type recognition.
  • For the recognition of speech categories, current artificial intelligence products capable of speech recognition generally integrate pre-trained speech classification models. When recognition of new categories needs to be added, existing solutions cannot accommodate this.
  • Embodiments of the present application provide a speech classification method, a model training method and an apparatus, device, medium and program.
  • the first aspect of the embodiment of the present application provides a training method for a speech classification model.
  • The training method includes: obtaining speech data of at least one category, where speech data of the same category constitute a speech data set; extracting the speech features of each piece of speech data in the speech data set; and using the speech features in the speech data set to train the sub-classification models in the speech classification model, where the speech classification model includes at least one sub-classification model and the sub-classification models correspond one-to-one to the speech data sets.
  • the proposed speech classification model includes sub-classification models, and a sub-classification model corresponds to a category of speech data sets.
  • In this way, the speech data of each category is obtained, the speech data of each category constitutes a speech data set, and the speech data set is used to train the corresponding sub-classification model in the speech classification model, so that the speech classification model can perform speech classification.
  • the speech classification model in the embodiment of the present application can add new classifications of speech categories at any time.
  • In some embodiments, the training method further includes: determining category features of the speech data set based on at least part of the speech data in the speech data set; and using the category features of the speech data set to process the speech features of each piece of speech data in the speech data set. Using the speech features in the speech data set to train the sub-classification models in the speech classification model then includes: using the processed speech features in the speech data set to train the sub-classification models in the speech classification model.
  • In this way, the category features of the speech data set can be obtained, that is, the category features can be used to highlight the category of the speech data set; processing the speech features with the category features improves the training effect and makes it easier for the sub-classification model to identify the category.
  • the category features of the speech data set include audio loudness features and pitch change features of the speech data set.
  • the category features of speech datasets are mainly reflected in the loudness and pitch changes of speech.
  • In some embodiments, determining the category features of the speech data set includes: calculating the root mean square of the speech energy of at least part of the speech data in the speech data set to obtain the audio loudness feature; and calculating the zero-crossing features of at least part of the speech data in the speech data set to obtain the pitch change feature.
  • the root mean square of the energy of each speech data can be obtained, thereby obtaining the audio loudness feature in the category feature.
  • the audio zero-crossing feature of each speech data is obtained, so as to obtain the pitch change feature in the category feature.
  • the processing of the voice features of each voice data in the voice data set by using the category features of the voice data set includes: dividing the voice features by the audio loudness features, and adding the pitch change feature.
  • the processed speech features can be obtained based on the class features of different speech data, so as to further strengthen the distinction of different classes, which is beneficial to the subsequent training of speech classification models.
  • extracting the voice features of each voice data in the voice data set includes: extracting the voice features of each voice data in the voice data set, and performing dimensionality reduction processing on the voice features.
  • Performing dimensionality reduction on the speech features can reduce the amount of calculation in subsequent training, which is beneficial to training the classification model on the terminal.
  • the training method includes: presenting an entry instruction, the entry instruction corresponds to the entry of a category of voice data; acquiring at least one category of voice data includes: acquiring the voice data according to the entry instruction.
  • the second aspect of the embodiment of the present application provides a voice classification method.
  • The speech classification method includes: obtaining the speech to be classified; extracting the speech features to be classified of the speech to be classified; and inputting the speech features to be classified into the speech classification model to determine the category of the speech to be classified, where the speech classification model is trained by the above-mentioned training method.
  • the speech to be classified can be recognized and classified efficiently and with high accuracy, and the class of speech to be classified that can be recognized and classified can be trained in advance.
  • In some embodiments, the speech classification method further includes: determining the loudness feature and the pitch feature of the speech to be classified; and processing the speech features to be classified by using the loudness feature and the pitch feature. Inputting the speech features to be classified into the speech classification model then includes: inputting the processed speech features to be classified into the speech classification model.
  • the to-be-classified speech loudness features and the to-be-classified pitch features of the to-be-classified speech of different users are different.
  • the voices of different users can be distinguished, so as to realize the extraction and optimization of the speech feature to be classified.
  • By taking the loudness feature and the pitch feature of the speech to be classified as classification dimensions, the speech features to be classified are optimized to achieve accurate classification of different users.
  • extracting the to-be-classified speech features of the to-be-classified speech includes: extracting the to-be-classified speech features of the to-be-classified speech, and performing dimensionality reduction processing on the to-be-classified speech features.
  • In some embodiments, obtaining the speech to be classified includes: obtaining the control voice for a fan as the speech to be classified; and determining the category of the speech to be classified includes: determining the category of the speech to be classified as one of start, stop, accelerate, decelerate, turn left and turn right.
  • The third aspect of the embodiments of the present application provides a terminal device, including a memory and a processor coupled to each other, where the processor is configured to execute the program instructions stored in the memory, so as to implement the training method of the first aspect and the speech classification method of the second aspect above.
  • The fourth aspect of the embodiments of the present application provides a computer-readable storage medium on which program instructions are stored; when the program instructions are executed by a processor, the training method of the first aspect and the speech classification method of the second aspect above are implemented.
  • The fifth aspect of the embodiments of the present application provides a computer program, including computer-readable code; when the computer-readable code runs in a terminal device, a processor in the terminal device executes it to implement the training method of the first aspect and the speech classification method of the second aspect above.
  • The speech classification model in the embodiments of the present application includes at least one sub-classification model, and the sub-classification models are set in one-to-one correspondence with the speech data sets, so that the speech data set of each category is used to train a separate sub-classification model.
  • the training method in the embodiment of the present application has a low amount of calculation, and can complete the speech classification training task on a robot with limited computing power. In the field of robot application, it can be suitable for use as an artificial intelligence teaching aid.
  • Fig. 1 is a schematic flowchart of the training method of the speech classification model according to an embodiment of the present application;
  • Fig. 2 is a schematic flowchart of optimizing the speech features in the training method of the speech classification model according to an embodiment of the present application;
  • Fig. 3 is a schematic flowchart of the speech classification method according to an embodiment of the present application;
  • Fig. 4 is a schematic flowchart of optimizing the speech features to be classified in the speech classification method according to an embodiment of the present application;
  • Fig. 5 is a schematic framework diagram of the training device of the speech classification model according to an embodiment of the present application;
  • Fig. 6 is a schematic framework diagram of the speech classification device according to an embodiment of the present application;
  • Fig. 7 is a schematic framework diagram of a terminal device according to an embodiment of the present application;
  • Fig. 8 is a schematic framework diagram of a computer-readable storage medium according to an embodiment of the present application.
  • Fig. 1 is a schematic flowchart of the training method of the speech classification model according to an embodiment of the present application, and Fig. 2 is a schematic flowchart of optimizing the speech features in the training method of the speech classification model according to an embodiment of the present application.
  • The training method of the speech classification model of the embodiments of the present application is performed by an electronic device such as a smart device or a terminal device. The terminal device may be a user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, etc.; smart devices may include intelligent educational robots, intelligent mobile robots, etc. The method may be implemented by a processor of the electronic device calling computer-readable instructions stored in a memory.
  • the embodiment of the present application provides a training method of a speech classification model, comprising the following steps:
  • Step S11 Obtain at least one category of speech data, and the same category of speech data constitutes a speech data set.
  • Categories can be based on general field classifications such as gender, number, and orientation, and/or user-based. For example, gender classification includes gender classification categories of male and female; number classification includes number classification categories of 0-9; direction classification includes direction classification categories such as front, back, left, and right; user classification includes user classification categories based on different users.
  • When acquiring the speech data of each category, the user may be guided to record the speech data multiple times according to instructions, and the recordings are clustered to form a speech data set.
  • Before acquiring at least one category of speech data, the method may include: presenting an input indication, where the input indication corresponds to the input of one category of speech data.
  • the device will present input instructions to guide the user to record voice data, which may be presented in the form of screen display and/or voice broadcast, and each input instruction corresponds to the input of a category of voice data.
  • For example, when the application scenario is voice control of a fan and the recording requirement is to control the fan to start, stop, accelerate, decelerate, turn left, turn right and so on, the input instruction can be displayed on the screen and/or broadcast by voice to guide the user to repeat voices such as "start the fan", "stop the fan", "increase the fan speed", "decrease the fan speed", "turn the fan to the left" and "turn the fan to the right", so as to obtain speech data of the corresponding categories.
  • In other scenarios, the input instructions can be displayed on the screen and/or broadcast by voice to guide the user to repeat voices of direction categories such as "walking forward", "walking backward", "walking left" and "walking right", voices of number categories such as "1" and "2", voices of length-unit categories such as "meter", and other desired voices, so as to obtain the speech data of the corresponding categories.
  • obtaining speech data of at least one category includes obtaining speech data according to an input instruction.
  • the duration of the single input voice data is 3-10s, such as 3s, 5s, 8s or 10s.
  • the length of speech data is conducive to the extraction of speech features, and the amount of calculation is kept small, which improves the speed of subsequent data processing, thereby improving training efficiency.
  • When acquiring speech data based on user classification categories, the user can be guided to record speech data such as "hello" multiple times according to the input instructions, so as to form a speech data set of the user classification category associated with a user ID.
  • When acquiring speech data based on direction classification categories, the user can be guided to record "walking forward", "walking right" and other similar speech data multiple times according to the input instructions, so as to form speech data sets of the corresponding direction classification categories.
  • Usually, speech data for the four directions of front, back, left and right are recorded, which can form speech data sets of four direction classification categories.
  • When acquiring speech data based on digit categories, the user can be guided to record speech data for numbers such as "0" and "1" multiple times according to the input instructions, so as to form speech data sets of the corresponding digit categories.
  • Usually, speech data for the ten digits 0-9 are recorded, which can constitute speech data sets of ten digit classification categories.
  • When acquiring gender-based speech data, the user can be guided to record instructional phrase speech data multiple times according to the instructions, and auxiliary means such as face recognition can be combined to determine the user's gender, so as to form a speech data set of the corresponding gender classification category.
  • each entry indicates the entry of speech data corresponding to a category.
  • Users can also be guided to record speech data similar to "walking 1 meter forward" multiple times according to the input instructions, and speech data sets of the "forward" direction classification category and of the "1" digit classification category can be formed from the different sound segments, thereby reducing the amount of speech data the user has to record and improving the user experience.
  • If the speech classification model only needs to be trained for recognition in general domains, there is no need to obtain speech data of user classification categories.
  • In this case, the speech data of the required general-domain categories can be obtained as needed, and a speech classification model for general-domain recognition can be trained. If the speech classification model needs to be trained for recognition based on each user, the speech data of a user category is first obtained to form a speech data set of the user classification category associated with a user ID; then the speech data of the other required general-domain categories of each user are obtained to constitute the other speech data sets of the various categories.
  • voice data is usually obtained by recording the user's voice.
  • Robot products generally have a built-in sound card, and the recording function can be normally realized after the sound card is configured.
  • Without such configuration, the recorded volume may be very low and the user must be very close to the robot.
  • Therefore, the microphone's voice-enhancement setting can be configured to slightly boost the microphone, which makes it convenient for the user to record speech data.
  • The actual enhancement parameters are adjusted according to the situation when the robot records the user's voice, and no limitation is imposed here.
  • voice data can also be obtained by communicating with other devices, such as by downloading from a cloud server or obtaining from other mobile devices.
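  • As an illustration of the recording step, the following is a minimal sketch, assuming the sounddevice and soundfile Python packages, a configured sound card and microphone, a 16 kHz sampling rate and 5 s clips (all of these choices are illustrative assumptions, not requirements of the embodiments):

    # Minimal recording sketch (assumed packages: sounddevice, soundfile).
    import sounddevice as sd
    import soundfile as sf

    SAMPLE_RATE = 16000   # assumed sampling rate
    DURATION = 5          # seconds, within the 3-10 s range mentioned above

    def record_clip(path: str, duration: float = DURATION) -> None:
        """Record one mono clip from the default microphone and save it as WAV."""
        audio = sd.rec(int(duration * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1)
        sd.wait()  # block until the recording is finished
        sf.write(path, audio, SAMPLE_RATE)

    # Example of guided entry for one category, repeated several times:
    # for i in range(5):
    #     input('Press Enter and say "start the fan" ...')
    #     record_clip(f"fan_start_{i}.wav")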
  • Step S12 Extracting the speech features of each speech data in the speech data set.
  • Extracting the speech features of the speech data may be implemented based on Mel-frequency cepstral coefficient (MFCC) speech features.
  • the Mel-frequency cepstrum coefficients are the coefficients that make up the Mel-frequency cepstrum.
  • The difference between the cepstrum and the mel-frequency cepstrum is that the frequency band division of the mel-frequency cepstrum is equally spaced on the mel scale, which approximates the human auditory system more closely than the linearly spaced frequency bands used in the normal log cepstrum.
  • the Mel filter is a triangular bandpass filter with a preset number of nonlinear distributions, and the logarithmic energy output by each filter can be obtained.
  • The preset number can be, for example, 20. It should be noted that these triangular bandpass filters are evenly distributed in frequency on the mel scale.
  • the Mel frequency represents the general human ear's sensitivity to frequency, and it can also be seen that the human ear's perception of frequency f changes logarithmically.
  • The general process of extracting the MFCC speech features of each piece of speech data in the speech data set includes the following steps:
  • Pre-emphasis: the pre-emphasis filter mainly amplifies the high frequencies to counteract the effect of the vocal cords and lips during vocalization, compensating the high-frequency part of the speech signal suppressed by the articulatory system and highlighting the high-frequency formants. This can be achieved with a high-pass filter.
  • Framing: the speech signal is only short-term stationary, so feature extraction is usually performed on short time-frame windows. At the same time, to avoid excessive differences between consecutive frames, adjacent extracted frames overlap.
  • Windowing: each frame is generally multiplied by a window function, such as a Hamming window, to smooth the signal. The purpose is to increase the continuity at both ends of the frame and reduce spectral leakage in subsequent operations.
  • Frequency-domain conversion and power spectrum: a Fourier transform is applied to each windowed frame; this is the short-time Fourier transform (STFT), and its purpose is to convert the signal from the time domain to the frequency domain, after which the power spectrum is computed.
  • Mel-scale extraction: the purpose of the mel scale is to simulate the human ear's non-linear perception of sound, which is more discriminating at lower frequencies and less discriminating at higher frequencies.
  • Obtaining MFCCs: the filter bank coefficients calculated in the above steps are highly correlated, and a discrete cosine transform (DCT) can be applied to decorrelate them and generate a compressed representation of the filter bank. The logarithmic energies obtained in the previous step are substituted into the discrete cosine transform formula to obtain the MFCCs (the standard form of this DCT):
  • C(l) = Σ_{m=1}^{M} log(s(m)) · cos(π · l · (m − 0.5) / M), l = 1, 2, …, L
  • where s(m) is the energy value of the m-th filter obtained in the mel-scale extraction step; L is the order of the MFCC coefficients, usually 12-16; M is the number of triangular filters; and N is the frame size: a preset number of sampling points is combined into an observation unit called a frame, and the preset number is usually 256 or 512, that is, the value of N is usually 256 or 512.
  • Through the above process, the MFCC speech features of each piece of speech data in the speech data set can be extracted.
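  • As an illustration, the following is a minimal sketch of this extraction step, assuming the librosa Python package (the embodiments do not prescribe a specific library) and a 16 kHz sampling rate:

    # Minimal MFCC extraction sketch (assumed package: librosa).
    import librosa

    def extract_mfcc(path: str, n_mfcc: int = 16):
        """Return the MFCC matrix of shape [n_mfcc, n_frames] for one recording."""
        y, sr = librosa.load(path, sr=16000)            # assumed 16 kHz sampling rate
        y = librosa.effects.preemphasis(y, coef=0.97)   # pre-emphasis (high-pass) step
        # framing, windowing, STFT, mel filtering and DCT are handled internally
        # by librosa and yield the MFCC matrix described above
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)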
  • In some embodiments, extracting the speech features of each piece of speech data in the speech data set includes: extracting the speech features of each piece of speech data in the speech data set and performing dimensionality reduction on the speech features. Since the extracted original MFCC features may have different dimensions due to different audio durations, while the classification model requires the speech features of the speech data in the speech data set to have the same feature dimension, dimensionality reduction needs to be performed on the speech features so that they are suitable for training the classification model.
  • Before performing dimensionality reduction on the speech features, all speech data shorter than a preset duration are removed from the speech data set.
  • The preset duration is, for example, 0.5 s. This removes some invalid speech data that is too short, reduces the amount of calculation, and improves training accuracy and training efficiency.
  • Performing dimensionality reduction on the speech features includes: the dimension of the extracted MFCC feature is determined by two parts, the feature vector dimension and the number of frames, denoted [n_mfcc, n_frames]. Based on empirical parameters, the feature vector dimension n_mfcc can be set to 16; the number of frames n_frames is related to the audio duration, so the minimum number of frames over the data set can be taken, and the two-dimensional features are then flattened into one-dimensional features. This realizes the dimensionality reduction of the speech features and reduces the amount of computation.
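  • The following is a minimal sketch of this dimensionality reduction, assuming NumPy arrays of shape [n_mfcc, n_frames] as produced by the extraction sketch above:

    # Minimal dimensionality-reduction sketch (assumed input: a list of MFCC matrices).
    import numpy as np

    def flatten_features(mfcc_list):
        """Truncate every MFCC matrix to the smallest frame count, then flatten to 1-D."""
        min_frames = min(m.shape[1] for m in mfcc_list)
        return np.stack([m[:, :min_frames].flatten() for m in mfcc_list])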
  • the voice features used for training the classification model can already be extracted by using the method provided in the above content.
  • For the speech data in user-category speech data sets, since the basic loudness of each user's voice differs, the speech data in the speech data sets of different user categories have different category characteristics. Therefore, when processing speech data in a speech data set based on user classification categories, in addition to extracting speech features using the method provided above, the speech features in the speech data set also need to be further optimized, as follows:
  • Step S121 Based on at least part of the voice data in the voice data set, determine category features of the voice data set.
  • the category features of the voice data set can be obtained, that is, the category of the voice data set can be highlighted through the category features, and the voice features can be processed by using the category features, which can make the training effect better and more efficient.
  • the category features of a voice data set composed of voice data of the same user category include: audio loudness features and pitch change features of the voice data set. Through audio loudness features and pitch change features, the voices of different users can be distinguished to realize feature extraction and optimization of speech data sets.
  • determining the category characteristics of the voice data set includes:
  • a root mean square of speech energy of at least a portion of the speech data in the speech data set is calculated to obtain an audio loudness feature. According to the difference in the basic audio loudness of each category, the root mean square of the energy of each voice data can be obtained, so as to obtain the audio loudness feature in the category feature.
  • Zero-crossing features of at least part of the speech data in the speech data set are calculated to obtain the pitch change feature. By taking the audio loudness feature and the pitch change feature as classification dimensions, the speech features are optimized to achieve accurate classification of the speech data sets of different users.
  • Step S122 Using the category features of the voice data set, process the voice features of each voice data in the voice data set.
  • the voice features of each voice data in the voice data set are processed by using the determined category features of the voice data set, that is, the audio loudness features and pitch change features obtained in the above step S121.
  • processing the voice features of each voice data in the voice data set includes: dividing the voice features of each user category by the corresponding audio loudness features, and adding the corresponding Pitch Change Features to obtain the speech features of each user category for the speech dataset.
  • the speech feature extraction and optimization scheme adopted in the embodiment of the present application can obtain more generalized speech features and apply to more speech classification models.
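  • The following is a minimal sketch of steps S121 and S122, assuming that y_list holds the raw waveforms of one category's data set, that feats is the flattened feature matrix from the sketch above, and that "adding" the pitch change feature is read as appending it as an extra classification dimension (one possible reading of the text):

    # Minimal category-feature sketch (assumed inputs: waveforms and flattened features).
    import numpy as np

    def category_features(y_list):
        """Return (audio loudness feature, pitch change feature) for one data set."""
        rms = np.mean([np.sqrt(np.mean(y ** 2)) for y in y_list])                    # loudness
        zcr = np.mean([np.mean(np.abs(np.diff(np.sign(y))) > 0) for y in y_list])    # zero crossings
        return rms, zcr

    def optimize_features(feats, rms, zcr):
        """Divide the features by the loudness feature and append the pitch-change feature."""
        normalized = feats / rms
        zcr_column = np.full((normalized.shape[0], 1), zcr)
        return np.hstack([normalized, zcr_column])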
  • Step S13 Using the voice features in the voice data set to train the sub-category models in the voice classification model, the voice classification model includes at least one sub-category model, and the sub-category models are in one-to-one correspondence with the voice data set.
  • the speech classification model in the embodiment of the present application includes at least one sub-classification model, and the sub-classification model is set in a one-to-one correspondence with the speech data set.
  • the voice data set of each category of the embodiment of the present application corresponds to training a sub-category model separately.
  • In some embodiments, a Gaussian mixture model (GMM) may be used as the speech classification model.
  • the Gaussian mixture model can be regarded as a model composed of K Gaussian sub-models, and these K single models are the hidden variables of the mixture model.
  • In the embodiments of the present application, the number of speech data categories to be classified is K, and the sub-classification models are the Gaussian sub-models.
  • For example, for the four direction categories of front, back, left and right, the GMM will train 4 Gaussian sub-models; for the ten digit categories 0-9, the GMM will train 10 Gaussian sub-models.
  • The Expectation-Maximization (EM) algorithm is used for training. The E-step finds the expectation: under the current parameters, the responsibility γ_jk that the j-th speech feature belongs to the k-th Gaussian sub-model is computed, in the standard GMM form γ_jk = α_k φ(x_j | θ_k) / Σ_{k'} α_{k'} φ(x_j | θ_{k'}).
  • The M-step finds the maximum and calculates the model parameters of a new round of iteration, for example μ_k = Σ_j γ_jk x_j / Σ_j γ_jk and σ_k² = Σ_j γ_jk (x_j − μ_k)² / Σ_j γ_jk.
  • Here θ_k denotes the model parameters (mean and variance) of each sub-classification model, X = {x_j} is the speech feature, γ_jk is the expected output, N is the total number of speech data in each speech data set, and j is the index of each piece of speech data.
  • each sub-classification model is trained by the EM algorithm to obtain a sub-classification model that recognizes the corresponding category of speech data.
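  • The following is a minimal training sketch, assuming scikit-learn's GaussianMixture (which is fitted with the EM algorithm), one sub-classification model per category, and a dictionary features_by_category mapping each category name to its optimized feature matrix (these names and the per-category layout are illustrative assumptions):

    # Minimal sub-model training sketch (assumed package: scikit-learn).
    from sklearn.mixture import GaussianMixture

    def train_sub_models(features_by_category, n_components: int = 1):
        """Fit one Gaussian (mixture) sub-model per speech data set."""
        models = {}
        for category, X in features_by_category.items():
            gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
            gmm.fit(X)   # EM estimates the mean and variance parameters of the sub-model
            models[category] = gmm
        return models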
  • If the speech classification model is only trained for recognition in general domains, there is no need to obtain speech data of user classification categories; the speech features in each speech data set are directly used to train the corresponding sub-classification models in the speech classification model.
  • If the speech classification model is trained for recognition based on both user classification categories and general-domain classification categories, it is first necessary to use the processed speech features in the speech data set of a user category to train the corresponding sub-classification model in the speech classification model; then the speech features in each of that user's other general-domain speech data sets are used to train the corresponding sub-classification models in the speech classification model. The speech classification models of other user categories are then trained in the same way.
  • In this way, each user has a corresponding sub-classification model, and the speech classification model obtained through training can specifically recognize the voices of different users and improve the accuracy of the speech classification model.
  • The speech classification model proposed in the embodiments of the present application includes sub-classification models, and a sub-classification model corresponds to a speech data set of one category; when training the speech classification model, the speech data of each category is obtained, the speech data of each category constitutes a speech data set, and the speech data set is used to train the corresponding sub-classification model in the speech classification model, so that the speech classification model can perform speech classification.
  • The speech classification model in the embodiments of the present application can thus add new speech categories at any time, thereby reducing the amount of training, improving training efficiency, and realizing a general-purpose speech recognition scheme.
  • the training method of the embodiment of the present application has a low amount of calculation, and can complete the speech classification training task on a robot with limited computing power. In the field of robot application, it can be used as an artificial intelligence teaching aid.
  • the training method of the embodiment of the present application can implement the entire speech recognition process through python programming.
  • FIG. 3 is a schematic flow diagram of the speech classification method of the embodiment of the present application
  • FIG. 4 is a schematic flow diagram of optimizing the speech features to be classified in the speech classification method of the embodiment of the present application.
  • The speech classification method of the embodiments of the present application is performed by an electronic device such as a smart device or a terminal device. The terminal device may be a user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, etc.; smart devices may include intelligent educational robots, intelligent mobile robots, etc. The method may be implemented by a processor of the electronic device calling computer-readable instructions stored in a memory.
  • an embodiment of the present application provides a method for classifying speech, the method for classifying speech includes:
  • Step S21 Obtain the speech to be classified.
  • voices to be classified are acquired, and the voices to be classified may include wake-up voices and instruction voices.
  • the wake-up voice is used to wake up the device, and can be used by the voice classification model to identify the corresponding user, and the command voice is used to control the device.
  • In some embodiments, obtaining the speech to be classified includes: obtaining the control voice for the fan as the speech to be classified.
  • the to-be-classified speech categories recognized by the fan can be pre-set or directly obtained by the user through training on the fan, and can actually include start, stop, acceleration, deceleration, turn left, turn right, etc.
  • the above command voices are only some common command voices listed, and other command voices with similar meanings can also be used instead.
  • For example, "deceleration" can also be expressed as "turn down" and "acceleration" as "turn up"; "start" can also be "open" and "stop" can also be "close", and no limitation is imposed here.
  • Step S22 Extracting speech features of the speech to be classified.
  • the to-be-classified speech features of the to-be-classified speech may be implemented based on MFCC speech features.
  • The following is a brief introduction to MFCC speech features:
  • the Mel-frequency cepstrum coefficients are the coefficients that make up the Mel-frequency cepstrum.
  • The difference between the cepstrum and the mel-frequency cepstrum is that the frequency band division of the mel-frequency cepstrum is equally spaced on the mel scale, which approximates the human auditory system more closely than the linearly spaced frequency bands used in the normal log cepstrum.
  • the Mel filter is a triangular bandpass filter with a preset number of nonlinear distributions, and the logarithmic energy output by each filter can be obtained.
  • The preset number can be, for example, 20. It should be noted that these triangular bandpass filters are evenly distributed in frequency on the mel scale.
  • the Mel frequency represents the general human ear's sensitivity to frequency, and it can also be seen that the human ear's perception of frequency f changes logarithmically.
  • The general process of extracting the MFCC speech features of the speech to be classified includes pre-emphasis, framing, windowing, frequency-domain conversion, power spectrum computation, mel-scale extraction and obtaining the MFCCs; through the above process, the MFCC speech features of the speech to be classified can be extracted.
  • the steps of actually extracting the speech features of the MFCCs of speech to be classified are similar to the corresponding steps in the above-mentioned embodiment.
  • extracting the speech features to be classified of the speech to be classified includes: extracting the speech features to be classified of the speech to be classified, and performing dimensionality reduction processing on the speech features to be classified, thereby reducing the amount of computation and improving recognition efficiency.
  • removing speech to be classified that is shorter than a preset duration is performed before performing dimensionality reduction processing on the speech features to be classified.
  • the preset duration is 0.5s and so on.
  • The dimensionality reduction of the speech features to be classified includes: the dimension of the extracted MFCC feature is determined by the feature vector dimension and the number of frames, denoted [n_mfcc, n_frames]. Based on empirical parameters, n_mfcc can be set to 16; n_frames is related to the audio duration, so the minimum number of frames can be taken, and the two-dimensional features are then flattened into one-dimensional features, thereby realizing the dimensionality reduction of the speech features to be classified and reducing the amount of calculation.
  • The speech features to be classified can already be extracted by using the method provided above.
  • For the speech to be classified under user classification categories, since factors such as the basic loudness of each user's voice differ, the features of the speech to be classified differ between users. Therefore, when processing the speech to be classified, in addition to extracting the speech features to be classified using the methods provided above, the speech features to be classified need to be further optimized, as follows:
  • Step S221 Determine the loudness feature and tone feature of the speech to be classified.
  • the to-be-classified speech loudness features and to-be-classified tone features of the to-be-classified speech of different users are different.
  • the voices of different users can be distinguished through the loudness feature of the speech to be classified and the pitch feature of the speech to be classified, so as to realize the extraction and optimization of the speech feature to be classified.
  • In some embodiments, determining the loudness feature and the pitch feature of the speech to be classified includes: calculating the root mean square of the speech energy of the speech to be classified to obtain the loudness feature, and calculating the zero-crossing feature of the speech to be classified to obtain the pitch feature.
  • By taking the loudness feature and the pitch feature of the speech to be classified as classification dimensions, the speech features to be classified are optimized to achieve accurate classification of different users. In other embodiments, different users may also be classified based on other features as classification dimensions.
  • Step S222 Process the speech features to be classified by using the loudness features to be classified and the pitch features to be classified.
  • the speech features to be classified are processed by using the determined loudness features and pitch features of the speech to be classified, that is, the loudness features and pitch features to be classified obtained in the above step S221.
  • processing the to-be-classified speech feature includes: dividing each to-be-classified speech feature by the corresponding to-be-classified loudness feature, and adding the corresponding Tone features to be classified to obtain speech features to be classified for each user.
  • the speech feature extraction and optimization scheme adopted in the embodiment of the present application can obtain more generalized speech features to be classified, and is applicable to more speech classification models.
  • Step S23 input the features of the speech to be classified into the speech classification model, and determine the category of the speech to be classified.
  • the speech classification model of the embodiment of the present application is trained by using the training method in any of the above embodiments.
  • the speech classification model in the embodiment of the present application includes at least one sub-classification model, and each sub-classification model recognizes a class of speech features to be classified.
  • a Gaussian mixture model (GMM model) may be used as a speech classification model.
  • the Gaussian mixture model can be regarded as a model composed of K Gaussian sub-models, and these K single models are the hidden variables of the mixture model.
  • In the embodiments of the present application, the number of categories of speech to be classified is K, and the sub-classification models are the Gaussian sub-models.
  • For example, for four direction categories the GMM will train 4 Gaussian sub-models, and for ten digit categories it will train 10 Gaussian sub-models.
  • If the speech classification model is only used for recognition in general domains, the speech to be classified is directly input into the speech classification model to obtain the classification result.
  • All sub-classification models in the speech classification model are called to calculate and save the probability that the speech to be classified belongs to each sub-classification model, and the category corresponding to the sub-classification model with the highest probability is selected as the classification result.
  • If the speech classification model is used for recognition based on both user classification categories and general-domain classification categories, it is first necessary to identify the user category to which the speech to be classified belongs; inputting the speech features to be classified into the speech classification model then includes: inputting the processed speech features to be classified into the speech classification model to obtain the user-category classification result. The other sub-classification models associated with that user are then used to determine the general-domain classification result of the speech to be classified.
  • All sub-classification models that identify user categories in the speech classification model are called to calculate and save the probability that the speech to be classified belongs to each of them, and the user category corresponding to the sub-classification model with the maximum probability is selected as the user-category classification result.
  • The other sub-classification models associated with that user are then called to calculate and save the probability that the speech to be classified belongs to each of them, and the category corresponding to the sub-classification model with the largest probability is selected as the classification result.
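  • The following is a minimal recognition sketch, assuming the per-category GaussianMixture sub-models from the training sketch above and an already optimized, flattened feature vector x (the names and the two-stage usage are illustrative assumptions):

    # Minimal classification sketch (assumed input: dict of trained sub-models).
    import numpy as np

    def classify(models, x):
        """Score the feature vector with every sub-model and return the best category."""
        scores = {cat: gmm.score(x.reshape(1, -1)) for cat, gmm in models.items()}
        return max(scores, key=scores.get)   # category with the highest log-likelihood

    # Two-stage use for user + command recognition (one reading of the text):
    # user = classify(user_models, x)
    # command = classify(command_models_by_user[user], x)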
  • the user's speech can be identified in a targeted manner, and the recognition efficiency and accuracy can be improved. Especially for users with dialects or accents, it can effectively improve the recognition accuracy and improve user experience.
  • the voice classification method in the embodiment of the present application can efficiently and accurately identify and classify the speech to be classified, and the recognized and classified speech categories to be classified can be trained in advance, and a general language recognition and classification scheme can be realized.
  • the fan has a pre-trained voice classification model, or the user directly trains on the fan to obtain a voice classification model.
  • Determining the category of the speech to be classified by the speech classification model includes: determining the category of the speech to be classified as one of start, stop, accelerate, decelerate, turn left and turn right.
  • The above command voices are only some common examples; other command voices with similar meanings can also be used to train the fan's speech classification model and be used for recognition. For example, "deceleration" can also be "turn down" and "acceleration" can also be "turn up"; "start" can also be "open" and "stop" can also be "close", which is not limited here.
  • the voice classification method of the embodiment of the present application can also be used on other types of educational robots such as lighting devices and walking cars.
  • the embodiment of the present application provides a speech classification method, which can be implemented in the following manner:
  • Audio data recording: configure the sound card and microphone to complete the audio data recording.
  • MFCC (mel-frequency cepstral coefficient) extraction: based on traditional MFCC speech features, better speech recognition is achieved by optimizing the speech classifier.
  • the Mel-frequency cepstrum coefficients are the coefficients that make up the Mel-frequency cepstrum.
  • The difference between the cepstrum and the mel-frequency cepstrum is that the frequency band division of the mel-frequency cepstrum is equally spaced on the mel scale, which approximates the human auditory system more closely than the linearly spaced frequency bands in the normal logarithmic cepstrum.
  • The mel filter is a group of 20 triangular bandpass filters with non-linear distribution, and the logarithmic energy output by each filter can be obtained; the 20 triangular bandpass filters are evenly distributed in frequency on the mel scale.
  • the Mel frequency represents the general human ear's sensitivity to frequency, and it can also be seen that the human ear's perception of frequency f changes logarithmically.
  • Obtaining MFCCs: pre-emphasis, framing, windowing, frequency-domain conversion, power spectrum, mel-scale extraction and computation of the MFCCs.
  • Feature optimization: through the above steps, audio recording has been completed and the corresponding MFCC features have been extracted for classification.
  • the original MFCC features may have different dimensions due to different audio time lengths, while most classifiers such as SVM require the same feature dimensions, so the features need to be optimized.
  • the embodiment of the present application further optimizes the original MFCC features, including the following:
  • n_mfcc can be set to 16
  • n_frame is related to the audio time length, and the minimum number of frames can be taken, and then the two-dimensional feature is flattened into a one-dimensional feature.
  • the root mean square of the energy of each person is obtained in view of the difference in the basic audio loudness of each person, and the normalized feature dimension obtained in the above steps is divided by the root mean square.
  • the audio zero-crossing feature of each person is obtained, and this feature is superimposed on the above-mentioned features as a dimension of classification.
  • the Gaussian mixture model can be regarded as a model composed of K single Gaussian models, and these K sub-models are the hidden variables of the mixture model.
  • the number of speech classifications is K.
  • the GMM model will be trained to obtain four Gaussian sub-models.
  • the GMM model will train 10 Gaussian sub-models.
  • the embodiment of the present application adopts the EM algorithm, which is an iterative algorithm for maximum likelihood estimation of the parameters of the probability model containing hidden variables (Hidden variable). Each iteration contains two parts, one is to find the expectation, and the other is to find the maximum, and calculate the model parameters of the new round of iteration.
  • The training process of the speech recognition algorithm is: for each audio file in each category of audio data, extract its MFCC features; optimize the MFCC features; train the mean and variance parameters of each model with the EM algorithm; and save the model file after each training is completed.
  • The recognition process of the speech recognition algorithm is: for an audio file, extract its MFCC features; optimize the MFCC features; for each GMM model in the set of all GMM models, call the model to calculate the probability that the audio belongs to it; save the probabilities of all models; and pick the class with the largest probability.
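  • To tie the two processes together, the following is a self-contained end-to-end sketch under several assumptions: librosa, NumPy and scikit-learn as the libraries, a file layout of data/<category>/*.wav, a fixed frame count of 60 instead of the per-data-set minimum, and, for brevity, omission of the loudness/zero-crossing optimization step (all of these choices are illustrative, not part of the embodiments):

    # End-to-end sketch of the training and recognition processes described above.
    import glob
    import numpy as np
    import librosa
    from sklearn.mixture import GaussianMixture

    N_MFCC, N_FRAMES = 16, 60   # assumed fixed feature shape shared by training and recognition

    def features_for(paths):
        """Extract, truncate/pad and flatten MFCC features for a list of WAV files."""
        feats = []
        for p in paths:
            y, sr = librosa.load(p, sr=16000)
            m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC)
            if m.shape[1] < N_FRAMES:
                m = np.pad(m, ((0, 0), (0, N_FRAMES - m.shape[1])))
            feats.append(m[:, :N_FRAMES].flatten())
        return np.stack(feats)

    def train(categories):
        """Fit one GMM sub-model per category from data/<category>/*.wav."""
        return {cat: GaussianMixture(n_components=1, covariance_type="diag")
                        .fit(features_for(glob.glob(f"data/{cat}/*.wav")))
                for cat in categories}

    def recognize(models, path):
        """Score one file against every sub-model and return the most likely category."""
        x = features_for([path])
        return max(models, key=lambda cat: models[cat].score(x))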
  • FIG. 5 is a schematic frame diagram of a training device for a speech classification model according to an embodiment of the present application.
  • this embodiment of the present application provides a speech classification model training device 300 , including: a speech acquisition module 31 , a feature extraction module 32 and a calculation module 33 .
  • the voice acquiring module 31 is configured to acquire at least one category of voice data, and the same category of voice data constitutes a voice data set.
  • the feature extraction module 32 is configured to extract the speech features of each speech data in the speech data set.
  • the operation module 33 is configured to use the speech features in the speech data set to train the sub-category models in the speech classification model; the speech classification model includes at least one sub-classification model, and the sub-classification models are in one-to-one correspondence with the speech data set.
  • The training device 300 of the embodiments of the present application classifies the speech data to form corresponding speech data sets, extracts and optimizes the speech features of different categories of speech data, and uses the speech features to train the corresponding sub-classification models, thereby obtaining a speech classification model for the required categories of speech data.
  • The speech classification model includes at least one sub-classification model, and the sub-classification models are set in one-to-one correspondence with the speech data sets, so that the speech data set of each category is used to train a separate sub-classification model.
  • the training method of the embodiment of the present application has a low amount of calculation, and can complete the speech classification training task on a robot with limited computing power. In the field of robot application, it can be used as an artificial intelligence teaching aid.
  • the training device 300 of the embodiment of the present application can implement the entire speech recognition process through python programming.
  • the training device further includes: a feature determination module configured to determine category features of the speech data set based on at least part of the speech data in the speech data set; a feature processing module configured to use the The category feature of the voice data set is used to process the voice features of each voice data in the voice data set; the operation module includes: an operation sub-module configured to use the voice features processed in the voice data set to process the voice features of the voice data set The subclassification model in the speech classification model is trained.
  • the category features of the speech data set include audio loudness features and pitch change features of the speech data set.
  • the feature determination module includes: a first feature acquisition component configured to calculate the root mean square of speech energy of at least part of the speech data in the speech data set to obtain the audio loudness feature; A feature acquisition component configured to calculate zero-crossing features of at least part of the voice data in the voice data set, so as to obtain the pitch change feature.
  • the feature processing module includes: a feature processing sub-module configured to divide the speech feature by the audio loudness feature, and add the pitch change feature.
  • the feature extraction module includes: a feature extraction submodule configured to extract speech features of each speech data in the speech data set, and perform dimensionality reduction processing on the speech features.
  • the training device includes: a presentation module configured to present an entry indication, the entry indication corresponding to the entry of a category of voice data;
  • the voice acquisition module includes: a voice acquisition sub-module configured to acquire Voice data according to the input instruction.
  • FIG. 6 is a schematic frame diagram of a speech classification device according to an embodiment of the present application.
  • this embodiment of the present application provides a speech classification device 400 , including: a speech acquisition module 41 , a feature extraction module 42 and a classification module 43 .
  • the voice acquiring module 41 is configured to acquire the voice to be classified.
  • the feature extraction module 42 is configured to extract speech features of the speech to be classified.
  • the classification module 43 is configured to input the characteristics of the speech to be classified into the speech classification model to determine the category of the speech to be classified.
  • the speech classification model in the embodiment of the present application is trained by the training device in the above embodiment.
  • the speech classification device 400 of the embodiment of the present application has high recognition efficiency and accuracy of the speech to be classified, and the recognition and classification of speech categories to be classified can be trained in advance to realize general speech recognition and classification.
  • the speech classification device further includes: a feature determination module configured to determine the speech loudness feature to be classified and the tone feature to be classified of the speech to be classified; a feature processing module configured to use the loudness feature to be classified features and tone features to be classified, processing the speech features to be classified; the classification module includes: a first classification sub-module configured to input the processed speech features to be classified into the speech classification model.
  • the feature extraction module includes: a feature extraction submodule configured to extract speech features to be classified of the speech to be classified, and perform dimensionality reduction processing on the speech features to be classified.
  • the voice acquisition module includes: a voice acquisition submodule configured to acquire the control voice for the fan as the voice to be classified; the classification module includes: a second classification submodule configured to Determine the category of the speech to be classified as one of start, stop, acceleration, deceleration, turn left, and turn right.
  • FIG. 7 is a schematic diagram of a framework of a terminal device according to an embodiment of the present application.
  • this embodiment of the present application provides a terminal device 700, including a memory 701 and a processor 702 coupled to each other, where the processor 702 is configured to execute the program instructions stored in the memory 701, so as to implement the training method of any of the above embodiments or the speech classification method of any of the above embodiments.
  • the terminal device 700 may include, but is not limited to, devices such as microcomputers, servers, notebook computers, and tablet computers.
  • the terminal device 700 may also include a fan, a lighting device, a walking trolley, and the like.
  • the processor 702 is configured to control itself and the memory 701 to implement the steps in any of the above embodiments of the training method, or to implement the steps in any of the above embodiments of the speech classification method.
  • the processor 702 may also be called a CPU (Central Processing Unit, central processing unit).
  • the processor 702 may be an integrated circuit chip with signal processing capability.
  • the processor 702 can also be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the processor 702 may be jointly implemented by integrated circuit chips.
  • FIG. 8 is a schematic frame diagram of a computer-readable storage medium according to an embodiment of the present application.
  • this embodiment of the present application provides a computer-readable storage medium 800, on which program instructions 801 are stored.
  • when the program instructions 801 are executed by a processor, the training method of any of the above embodiments or the speech classification method of any of the above embodiments is implemented.
  • speech classification can be realized accurately and efficiently.
  • An embodiment of the present application also provides a computer program, which includes computer-readable code; when the computer-readable code is run on an electronic device or a terminal device, the methods in the foregoing embodiments are executed.
  • the embodiment of the present application also provides a computer program product, including computer-readable code or a non-volatile computer-readable storage medium carrying computer-readable code; when the computer-readable code runs in a processor of an electronic device, the processor in the electronic device executes the above method.
  • the disclosed methods and devices may be implemented in other ways.
  • the device implementations described above are only illustrative.
  • the division of modules or units is only a logical function division, and there may be other division manners in actual implementation.
  • units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between devices or units may be in electrical, mechanical, or other forms.
  • a unit described as a separate component may or may not be physically separated, and a component shown as a unit may or may not be a physical unit; that is, it may be located in one place or distributed over multiple network units. Part or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium 800.
  • the storage medium 800 stores several instructions that enable a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods in the various embodiments of the present application.
  • the aforementioned storage medium 800 includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk, or an optical disc.
  • the embodiment of the present application provides a voice classification method, a model training method and apparatus, a device, a medium and a program, wherein the training method includes: acquiring voice data of at least one category, voice data of the same category constituting a voice data set; extracting the voice features of each voice data in the voice data set; and using the voice features in the voice data set to train the sub-category models in the voice classification model, wherein the voice classification model includes at least one sub-category model, and the sub-category models correspond to the voice data sets one-to-one.
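To make the one-to-one correspondence between sub-category models and speech data sets concrete, here is a minimal training sketch; the centroid sub-model and the dictionary layout are illustrative assumptions, and any trainable per-category model could take their place.

```python
import numpy as np
from typing import Dict, List

def train_sub_model(features: np.ndarray) -> np.ndarray:
    """Train one sub-category model; here it is simply the centroid of its data set's features."""
    return features.mean(axis=0)

def train_speech_classifier(datasets: Dict[str, List[np.ndarray]]) -> Dict[str, np.ndarray]:
    """Train one sub-category model per speech data set (categories and data sets map one-to-one)."""
    return {category: train_sub_model(np.vstack(feats)) for category, feats in datasets.items()}

# Adding a new category later only requires training that category's own sub-model;
# the sub-models already trained for existing categories are left untouched.
```

This locality is what allows the classifier to support a new category by training only on that new category's voice data.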

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Computation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention relates to a speech classification method and apparatus, a model training method and apparatus (400), a device (700), a medium (800), and a program. The training method comprises: acquiring voice data of at least one category, voice data of the same category forming a voice data set (S11); extracting a speech feature of each piece of voice data in the voice data set (S12); and training a sub-classification model in a speech classification model by using the speech features in the voice data set, the speech classification model comprising at least one sub-classification model, and the sub-classification models corresponding one-to-one to the voice data sets (S13). A voice data set is formed by performing category classification on voice data, and a sub-classification model is trained by using the speech features, so as to obtain a speech classification model for identifying voice data of a required category. Training with only voice data of a new category enables the speech classification model to classify the new category.
PCT/CN2022/071089 2021-07-06 2022-01-10 Procédé et appareil de classification de parole, procédé et appareil d'apprentissage de modèle, dispositif, support et programme WO2023279691A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110762453.8 2021-07-06
CN202110762453.8A CN113539243A (zh) 2021-07-06 2021-07-06 语音分类模型的训练方法、语音分类方法及相关装置

Publications (1)

Publication Number Publication Date
WO2023279691A1 true WO2023279691A1 (fr) 2023-01-12

Family

ID=78126826

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/071089 WO2023279691A1 (fr) 2021-07-06 2022-01-10 Procédé et appareil de classification de parole, procédé et appareil d'apprentissage de modèle, dispositif, support et programme

Country Status (2)

Country Link
CN (1) CN113539243A (fr)
WO (1) WO2023279691A1 (fr)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113539243A (zh) * 2021-07-06 2021-10-22 上海商汤智能科技有限公司 语音分类模型的训练方法、语音分类方法及相关装置
CN114296589A (zh) * 2021-12-14 2022-04-08 北京华录新媒信息技术有限公司 一种基于影片观看体验的虚拟现实交互方法及装置

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108986801B (zh) * 2017-06-02 2020-06-05 腾讯科技(深圳)有限公司 一种人机交互方法、装置及人机交互终端
CN108305616B (zh) * 2018-01-16 2021-03-16 国家计算机网络与信息安全管理中心 一种基于长短时特征提取的音频场景识别方法及装置
CN108764304B (zh) * 2018-05-11 2020-03-06 Oppo广东移动通信有限公司 场景识别方法、装置、存储介质及电子设备
CN109741747B (zh) * 2019-02-19 2021-02-12 珠海格力电器股份有限公司 语音场景识别方法和装置、语音控制方法和设备、空调
CN110047517A (zh) * 2019-04-24 2019-07-23 京东方科技集团股份有限公司 语音情感识别方法、问答方法及计算机设备

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105161092A (zh) * 2015-09-17 2015-12-16 百度在线网络技术(北京)有限公司 一种语音识别方法和装置
US20190371301A1 (en) * 2018-05-31 2019-12-05 Samsung Electronics Co., Ltd. Speech recognition method and apparatus
CN111369982A (zh) * 2020-03-13 2020-07-03 北京远鉴信息技术有限公司 音频分类模型的训练方法、音频分类方法、装置及设备
CN112767967A (zh) * 2020-12-30 2021-05-07 深延科技(北京)有限公司 语音分类方法、装置及自动语音分类方法
CN113539243A (zh) * 2021-07-06 2021-10-22 上海商汤智能科技有限公司 语音分类模型的训练方法、语音分类方法及相关装置

Also Published As

Publication number Publication date
CN113539243A (zh) 2021-10-22

Similar Documents

Publication Publication Date Title
WO2021208287A1 (fr) Procédé et appareil de détection d'activité vocale pour reconnaissance d'émotion, dispositif électronique et support de stockage
Koduru et al. Feature extraction algorithms to improve the speech emotion recognition rate
CN110289003B (zh) 一种声纹识别的方法、模型训练的方法以及服务器
WO2021093449A1 (fr) Procédé et appareil de détection de mot de réveil employant l'intelligence artificielle, dispositif, et support
CN109243491B (zh) 在频谱上对语音进行情绪识别的方法、***及存储介质
CN112562691B (zh) 一种声纹识别的方法、装置、计算机设备及存储介质
Mukherjee et al. A lazy learning-based language identification from speech using MFCC-2 features
WO2023279691A1 (fr) Procédé et appareil de classification de parole, procédé et appareil d'apprentissage de modèle, dispositif, support et programme
WO2020034628A1 (fr) Procédé et dispositif d'identification d'accents, dispositif informatique et support d'informations
Pokorny et al. Detection of negative emotions in speech signals using bags-of-audio-words
CN102800316A (zh) 基于神经网络的声纹识别***的最优码本设计方法
CN103871426A (zh) 对比用户音频与原唱音频相似度的方法及其***
WO2022100692A1 (fr) Procédé et appareil d'enregistrement audio de la voix humaine
WO2022100691A1 (fr) Procédé et dispositif de reconnaissance audio
CN111583906A (zh) 一种语音会话的角色识别方法、装置及终端
Fan et al. Deep neural network based environment sound classification and its implementation on hearing aid app
Chiou et al. Feature space dimension reduction in speech emotion recognition using support vector machine
CN111161713A (zh) 一种语音性别识别方法、装置及计算设备
Huang et al. Emotional speech feature normalization and recognition based on speaker-sensitive feature clustering
CN112562725A (zh) 基于语谱图和胶囊网络的混合语音情感分类方法
Fernandes et al. Speech emotion recognition using mel frequency cepstral coefficient and SVM classifier
Pao et al. A study on the search of the most discriminative speech features in the speaker dependent speech emotion recognition
Shah et al. Speech emotion recognition based on SVM using MATLAB
Chi et al. Robust emotion recognition by spectro-temporal modulation statistic features
Ahmed et al. CNN-based speech segments endpoints detection framework using short-time signal energy features

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22836451

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE