WO2020098523A1 - Speech recognition method, apparatus and computing device - Google Patents

Speech recognition method, apparatus and computing device (一种语音识别方法、装置及计算设备)

Info

Publication number
WO2020098523A1
WO2020098523A1 (PCT/CN2019/115308)
Authority
WO
WIPO (PCT)
Prior art keywords
user
voice
audio data
voiceprint
new
Prior art date
Application number
PCT/CN2019/115308
Other languages
English (en)
French (fr)
Inventor
赵情恩
索宏彬
刘刚
卓著
雷赟
张平
孙尧
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司
Publication of WO2020098523A1

Classifications

    • H04L 9/40: Cryptographic mechanisms or arrangements for secret or secure communications; network security protocols
    • G06F 16/65: Information retrieval of audio data; clustering; classification
    • G10L 15/22: Speech recognition; procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/02: Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L 17/06: Decision making techniques; pattern matching strategies
    • G10L 17/14: Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L 17/22: Interactive procedures; man-machine interfaces
    • H04L 63/102: Network architectures or protocols for network security; controlling access to devices or network resources; entity profiles

Definitions

  • the invention relates to the technical field of speech recognition, and in particular to a speech recognition method, apparatus, and computing device.
  • the terminal device may use voiceprint recognition technology to identify the user's identity.
  • Voiceprint identification, also known as speaker identification, is a biometric technology that extracts voice features from a speaker's speech signal and verifies the speaker's identity on that basis.
  • voiceprint refers to the sound wave spectrum that carries speech information in human speech.
  • like fingerprints, voiceprints are unique biological characteristics that can identify a person; they are not only distinctive but also relatively stable.
  • the speaker needs to register a voiceprint on the terminal device in advance; the terminal device then recognizes the user by the voiceprint, so that the user's behavior can be analyzed from the instructions corresponding to the user's voice in order to provide the user with personalized, customized services, such as song recommendation.
  • embodiments of the present invention provide a voice recognition method, apparatus, and computing device, to try to solve or at least alleviate at least one of the above problems.
  • a voice recognition method including the steps of: receiving audio data including a first voice; determining whether there is a user matching the first voice; when there is no user matching the first voice, storing the audio data; and clustering multiple stored pieces of audio data to determine a new user from them.
  • the user corresponds to a user profile
  • the user profile includes the user's voiceprint
  • the step of determining whether there is a user matching the first voice includes: determining whether the first voice matches the user's voiceprint, so as to determine whether there is a user matching the first voice.
  • the step of clustering the stored multiple pieces of audio data to determine a new user from them includes: dividing the multiple pieces of audio data into multiple sets based on the pairwise similarity scores between them; determining at least one target set based on the sample density and number of samples of each set, the target set corresponding to a new user; and creating a user profile for the new user corresponding to the target set, and using at least part of the audio data in the target set to generate the new user's voiceprint.
  • the step of using at least part of the audio data in the target set to generate the new user's voiceprint includes: determining the audio data in the target set used to generate the new user's voiceprint according to the distance to the centroid of the target set.
  • the user profile includes a user flag indicating whether the user is actively registered
  • the step of creating a user profile for the new user corresponding to the target set includes: setting the user flag in the user profile created for the new user corresponding to the target set to inactive registration; and the method further includes the step of: when there is a user matching the first voice and the corresponding user flag indicates that the user is not actively registered, recording the number of pieces of audio data from the user.
  • the method further includes the steps of: after recording the number of pieces of audio data from the user, determining whether the number reaches a specific amount within a specific time period; if not, deleting the user profile corresponding to this user.
  • the user profile further includes a device identification of the terminal device associated with the user.
  • the method includes the steps of: receiving the device identification of the terminal device sending the audio data; determining, based on the device identification, whether there is a user associated with the terminal device; and if not, storing the audio data.
  • the method further includes the step of: when there is a user matching the first voice, storing the instruction corresponding to the first voice in association with the user.
  • the method further includes the steps of: receiving audio data including a second voice, which is used to actively register a new user; creating a user profile for the actively registered new user, And use the audio data including the second voice to generate the voiceprint of the new user; and set the user ID in the user profile created for the actively registered new user to be actively registered.
  • the method further includes the steps of: receiving the device identification of the terminal device that sends the audio data including the second voice; and storing the device identification in association with the actively registered new user to The corresponding user profile.
  • the step of determining whether the first voice matches the user's voiceprint includes: extracting the voice features of the first voice based on the audio data including the first voice; obtaining a similarity score between the first voice and the user's voiceprint based on the voice features of the first voice; and determining whether the first voice matches the user's voiceprint according to the similarity score.
  • a user recognition method including the steps of: receiving audio data including a first voice; determining whether there is a user matching the first voice; when there is no user matching the first voice, storing the audio data; and clustering the stored multiple pieces of audio data to identify new users from them and conduct behavior analysis on the new users.
  • a voice recognition device including: a communication module adapted to receive audio data including a first voice; a voice recognition module adapted to determine whether there is a user matching the first voice and, when there is no user matching the first voice, to store the audio data to an audio storage module; the audio storage module, suitable for storing audio data; and a user discovery module, suitable for clustering multiple pieces of audio data stored in the audio storage module to determine new users from them.
  • a user recognition device including: a communication module adapted to receive audio data including a first voice; a voice recognition module adapted to determine whether there is a user matching the first voice and, when there is no user matching the first voice, to store the audio data to an audio storage module; the audio storage module, suitable for storing audio data; and a user discovery module, suitable for clustering multiple pieces of audio data stored in the audio storage module to identify new users from them and conduct behavior analysis on the new users.
  • a voice recognition system including a terminal device and a server, where the terminal device is adapted to receive the speaker's voice and send audio data including the voice to the server; the voice recognition device according to the present invention resides on the server.
  • a computing device including: at least one processor; and a memory storing program instructions, wherein the program instructions are configured to be executed by at least one processor, and the program instructions Including instructions for performing the speech recognition method according to the invention.
  • a new user is determined by clustering multiple stored pieces of audio data; the entire new-user determination process is imperceptible to the user, eliminating the user's active registration operation and improving the user experience.
  • FIG. 1 shows a schematic diagram of a speech recognition system 100 according to an embodiment of the present invention
  • FIG. 2 shows an architecture diagram of a voice recognition device 200 according to an embodiment of the present invention
  • FIG. 3 shows a schematic diagram of a computing device 300 according to an embodiment of the invention.
  • FIG. 4 shows a structural block diagram of a voice recognition method 400 according to an embodiment of the present invention.
  • FIG. 1 shows a schematic diagram of a speech recognition system 100 according to an embodiment of the present invention.
  • the voice recognition system 100 includes a terminal device 102 and a server 106.
  • the terminal device 102 is the receiver of any speaker's voice.
  • the speaker can interact with the server 106 via the terminal device 102 using voice.
  • the terminal device 102 may be a computing device coupled to the server 106 through one or more networks 105 such as a local area network (LAN) or a wide area network (WAN) such as the Internet.
  • the terminal device 102 may be a desktop computing device, a laptop computing device, a tablet computing device, a mobile phone computing device, a speaker computing device, a vehicle computing device (eg, an in-vehicle communication system, an in-vehicle entertainment system, an in-vehicle navigation system), a wearable device including a computing device (eg, a watch with a computing device, glasses with a computing device), or a home device including a computing device (eg, a speaker with a computing device, a television with a computing device, a washing machine with a computing device).
  • although it is possible for a speaker to operate multiple computing devices, for the sake of brevity, the examples in this disclosure will be directed to the speaker operating terminal device 102.
  • the terminal device 102 may operate one or more applications and / or components, which may involve providing notification to the speaker and providing various types of signals.
  • These applications and / or components may include, but are not limited to, microphone 103, output device 104, position coordinate components such as global positioning system ("GPS") components (not shown in FIG. 1), and so on.
  • one or more of these applications and / or components may run on multiple terminal devices operated by the speaker.
  • Other components of the terminal device 102 not shown in FIG. 1 include but are not limited to barometers, cameras, light sensors, presence sensors, thermometers, health sensors (eg, heart rate monitors, blood glucose meters, sphygmomanometers), accelerometers, gyroscopes, and so on.
  • the output device 104 may include one or more of a speaker (speakers), a screen, a touch screen, one or more notification lights (eg, light emitting diodes), a printer, and so on.
  • the output device 104 may be used to provide output based on one or more operations called in response to the speaker's voice (such as operations such as opening a program, playing a song, sending an email or text message, taking a picture, etc.).
  • the terminal device 102 includes one or more memories for storing data and software applications, one or more processors for accessing data and executing applications, and other components that facilitate communication over a network.
  • the terminal device 102 may be configured to sense one or more audible sounds (eg, spoken speech) using the microphone 103, for example, and may provide audio data to various other computing devices based on the sensed one or more audible sounds (also called "audio input").
  • Those other computing devices can perform various operations based on the audio data to identify matching audio data.
  • the audio data may include: one or more original recordings of the spoken speech; a compressed version of the recordings; an indication of one or more characteristics of the audio input obtained via the microphone 103 of the terminal device 102, such as pitch, tone, audio frequency, and/or volume; and/or a transcription of the audio input obtained via the microphone 103, and so on.
  • the terminal device 102 sends audio data including the speaker's voice to the server 106.
  • the voice recognition device 200 resides in the server 106.
  • the voice recognition apparatus 200 may also reside in the terminal device 102. That is, the following processing is directly performed on the terminal device 102.
  • FIG. 2 shows a structural block diagram of a voice recognition apparatus 200 according to an embodiment of the present invention.
  • the voice recognition device 200 includes a communication module 210, a voice recognition module 220, an audio storage module 230 and a user discovery module 240.
  • the communication module 210 may receive audio data including the first voice from the terminal device 102, where the first voice is generally used to instruct the terminal device 102 to perform an operation.
  • the voice recognition module 220 performs voice recognition on the audio data to obtain an instruction corresponding to the first voice. Then the voice recognition module 220 returns a response result to the instruction to the terminal device 102 via the communication module 210, so that the terminal device 102 performs a corresponding operation at least according to the response result.
  • the terminal device 102 may be implemented as a sound box with a computing device.
  • the sound box receives the voice spoken by the speaker, "Play the song Blue and White Porcelain", and sends audio data including the voice to the server 106.
  • the server 106 returns the corresponding response result, the audio file of "Blue and White Porcelain", to the sound box.
  • the sound box performs the corresponding operation according to the response result: playing the audio file.
  • the process of performing voice recognition on the audio data to obtain instructions may also be performed on the terminal device 102. That is, the terminal device 102 performs voice recognition on the audio data, and then sends the audio data and the recognized instruction to the voice recognition device 200.
  • the voice recognition module 220 also determines whether there is a user matching the first voice.
  • a user refers to a speaker whose identity is identified by a speech recognition system.
  • the user corresponds to a user profile that records data related to the user; these user profiles may be stored in a user data storage device coupled to the voice recognition apparatus 200, or in a user data storage module included in the voice recognition apparatus 200 (not shown in Figure 2).
  • the user's biometrics, such as fingerprints, voiceprints, and irises, can be used to uniquely identify the user.
  • a voiceprint may be used to uniquely identify a user, and a voiceprint refers to a sound wave spectrum carrying speech information in a speaker's voice, and may uniquely identify the speaker.
  • the voice recognition module 220 may employ various voiceprint recognition technologies to determine whether there is a user matching the first voice.
  • the user profile may include the user's voiceprint.
  • the voice recognition module 220 may determine whether there is a user matching the first voice by determining whether the first voice matches the voiceprint of the user.
  • the audio data may be subjected to different levels of pre-processing before being matched by the voice recognition module 220 to the user.
  • preprocessing may facilitate the speech recognition module 220 to perform more efficient speech recognition.
  • the preprocessing may be performed by the terminal device 102 or by another component, such as a component of the voice recognition apparatus 200.
  • the voice recognition module 220 itself can pre-process audio data.
  • audio data may be initially captured, for example, by the microphone 103 of the terminal device 102 as raw data (for example, in a "lossless” form such as a wav file or a "lossy” form such as an MP3 file).
  • raw data may be pre-processed by, for example, one or more components of the terminal device 102 or the voice recognition apparatus 200 to facilitate voice recognition.
  • the preprocessing may include: sampling; quantization; removing non-speech audio data and silent audio data; framing and windowing audio data including speech for subsequent processing, and the like.
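As a concrete illustration of the framing and windowing step above, here is a minimal numpy sketch; the 25 ms frame length, 10 ms hop, and Hamming window are common choices assumed for illustration, not values specified in the text.

```python
import numpy as np

def frame_and_window(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Split a 1-D speech signal into overlapping, Hamming-windowed frames."""
    frame_len = int(sample_rate * frame_ms / 1000)  # e.g. 400 samples at 16 kHz
    hop_len = int(sample_rate * hop_ms / 1000)      # e.g. 160 samples at 16 kHz
    num_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    window = np.hamming(frame_len)
    frames = np.stack([
        signal[i * hop_len : i * hop_len + frame_len] * window
        for i in range(num_frames)
    ])
    return frames  # shape: (num_frames, frame_len)
```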
  • the voice recognition module 220 may extract the voice features of the first voice according to the audio data including the first voice, and match the first voice with the user's voiceprint based on the voice features of the first voice.
  • the voice features may be one or a combination of features such as filter bank (FBank) features, Mel-frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP) coefficients, deep features (Deep Feature), and power-normalized cepstral coefficients (PNCC).
  • the voice recognition module 220 can also normalize the extracted voice features.
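The feature extraction and normalization just described might look as follows; this sketch assumes the librosa library and uses MFCCs with per-utterance mean/variance normalization, one of several feature choices the text lists.

```python
import librosa
import numpy as np

def extract_normalized_mfcc(wav_path, n_mfcc=13):
    """Extract MFCC features and normalize them per utterance (CMVN)."""
    y, sr = librosa.load(wav_path, sr=16000)                 # resample to 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, n_frames)
    mfcc -= mfcc.mean(axis=1, keepdims=True)                 # zero mean per coefficient
    mfcc /= mfcc.std(axis=1, keepdims=True) + 1e-8           # unit variance per coefficient
    return mfcc.T                                            # (n_frames, n_mfcc)
```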
  • the voice recognition module 220 matches the first voice against the user's voiceprint based on the voice features of the first voice to obtain a similarity score between the first voice and the user's voiceprint, and determines the user matching the first voice according to that score.
  • the user's voiceprint is described by a voiceprint model, such as a hidden Markov model (HMM model), a Gaussian mixture model (GMM model), and so on.
  • the user's voiceprint model is characterized by voice characteristics, and is obtained by training using audio data including user voice (hereinafter referred to as user audio data).
  • the voice recognition module 220 may use a matching function to calculate the similarity between the first voice and the user's voiceprint. For example, the posterior probability that the voice features of the first voice match the user's voiceprint model can be calculated as the similarity score, or the likelihood between the voice features of the first voice and the user's voiceprint model can be calculated as the similarity score.
  • however, since training a user's voiceprint model well requires a large amount of that user's audio data, the user's voiceprint model can instead be based on a user-independent universal background model and obtained by training with a small amount of the user's audio data (again using voice features as the features).
  • for example, the user-independent audio data of multiple speakers can be used first, and a universal background model (UBM) can be obtained through expectation-maximization (EM) training to characterize the user-independent feature distribution.
  • then, based on the UBM, a small amount of the user's audio data is used to train a GMM model through adaptive algorithms (such as maximum a posteriori (MAP) or maximum likelihood linear regression (MLLR)) to characterize the user's feature distribution; the GMM model thus obtained is called a GMM-UBM model.
  • the GMM-UBM model is the user's voiceprint model.
  • the voice recognition module 220 may match the first voice against the user's voiceprint model and the universal background model, respectively, based on the voice features of the first voice, to obtain a similarity score between the first voice and the user's voiceprint. For example, the likelihoods between the voice features of the first voice and the UBM model and the GMM-UBM model are calculated respectively; the two likelihoods are then divided and the logarithm taken, and the resulting value is used as the similarity score between the first voice and the user's voiceprint.
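A sketch of this GMM-UBM scheme using scikit-learn's GaussianMixture: the UBM is trained on pooled speaker-independent features, the user model is derived by MAP-adapting only the means (a common simplification), and the score is the average log-likelihood ratio. Building the adapted model by copying fitted attributes is a shortcut for illustration, not a polished implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(pooled_features, n_components=64):
    """Train the universal background model on user-independent features via EM."""
    return GaussianMixture(n_components=n_components,
                           covariance_type='diag').fit(pooled_features)

def map_adapt(ubm, user_features, relevance=16.0):
    """Derive a user GMM-UBM by MAP-adapting the UBM means with a small
    amount of the user's audio data."""
    resp = ubm.predict_proba(user_features)         # (n_frames, n_components)
    n_k = resp.sum(axis=0) + 1e-10                  # soft frame counts per component
    e_k = (resp.T @ user_features) / n_k[:, None]   # per-component data means
    alpha = (n_k / (n_k + relevance))[:, None]      # data-dependent adaptation weight
    gmm = GaussianMixture(n_components=ubm.n_components, covariance_type='diag')
    gmm.weights_ = ubm.weights_                     # reuse UBM weights and covariances
    gmm.covariances_ = ubm.covariances_
    gmm.precisions_cholesky_ = ubm.precisions_cholesky_
    gmm.means_ = alpha * e_k + (1 - alpha) * ubm.means_
    return gmm

def similarity_score(features, user_gmm, ubm):
    """Average log-likelihood ratio between the user's GMM-UBM and the UBM."""
    return float((user_gmm.score_samples(features) - ubm.score_samples(features)).mean())
```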
  • the user's voiceprint is described by a voiceprint vector, such as i-vector, d-vector, x-vector, j-vector, and so on.
  • the voice recognition module 220 may extract the voiceprint vector of the first voice based at least on the voice characteristics of the first voice.
  • the voiceprint model of the first-voice speaker may first be trained using the voice features of the first voice. Similarly to the foregoing, the voiceprint model of the first-voice speaker can be obtained by training with the voice features of the first voice on the basis of the pre-trained, user-independent universal background model.
  • the mean supervector of the first voice can be extracted according to the voiceprint model.
  • the means of the individual GMM components of the first-voice speaker's GMM-UBM model may be concatenated to obtain the mean supervector of that GMM-UBM model, that is, the mean supervector of the first speech.
  • a joint factor analysis (JFA) method or a simplified joint factor analysis method can be used to extract a low-dimensional voiceprint vector from the mean supervector of the first speech.
  • taking the i-vector as an example, after the user-independent universal background model (UBM) is trained, the mean supervector of the universal background model can be extracted and the total variability space (T) matrix estimated. Then, the i-vector of the first speech is calculated based on the mean supervector of the first speech, the T matrix, and the mean supervector of the universal background model.
  • specifically, the i-vector can be calculated according to the following formula: M_{s,h} = m_u + T·ω_{s,h}, where M_{s,h} is the mean supervector obtained from speech h of speaker s, m_u is the mean supervector of the universal background model, T is the total variability space matrix, and ω_{s,h} is the total variability factor, that is, the i-vector.
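In practice the total variability factor is not obtained by inverting this relation directly; in the standard i-vector literature (an assumption here, not spelled out in the text) it is estimated as the posterior mean of the latent factor given the utterance's Baum-Welch statistics: ω = (I + Tᵀ·Σ⁻¹·N(u)·T)⁻¹ · Tᵀ·Σ⁻¹·F̃(u), where N(u) holds the zero-order statistics, F̃(u) the centered first-order statistics, and Σ the UBM covariance supermatrix.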
  • a trained deep neural network (DNN) can also be used to obtain the voiceprint vector of the first speech.
  • taking the d-vector as an example, the DNN may include an input layer, hidden layers, and an output layer.
  • the FBank features of the first speech can first be input to the DNN input layer, and the output of the last hidden layer of the DNN is the d-vector.
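A minimal PyTorch sketch of such a d-vector extractor; the layer sizes and the frame-averaging of the last hidden layer's output are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DVectorNet(nn.Module):
    """Small DNN trained to classify training speakers; the frame-averaged
    output of the last hidden layer serves as the d-vector."""
    def __init__(self, n_fbank=40, hidden=256, n_speakers=1000):
        super().__init__()
        self.hidden_layers = nn.Sequential(
            nn.Linear(n_fbank, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),  # last hidden layer
        )
        self.classifier = nn.Linear(hidden, n_speakers)  # used only in training

    def forward(self, fbank_frames):              # (n_frames, n_fbank)
        return self.classifier(self.hidden_layers(fbank_frames))

    def d_vector(self, fbank_frames):
        with torch.no_grad():
            h = self.hidden_layers(fbank_frames)  # (n_frames, hidden)
        return h.mean(dim=0)                      # average over frames
```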
  • the voice recognition module 220 may calculate a similarity score between the first voice and the user's voiceprint based on the voiceprint vector of the first voice and the user's voiceprint vector.
  • algorithms such as support vector machines (SVM), linear discriminant analysis (LDA), probabilistic linear discriminant analysis (PLDA), likelihood, and cosine distance can be used to calculate the similarity score between the first voice and the user's voiceprint.
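The cosine-distance option, for instance, reduces to a one-liner over two voiceprint vectors:

```python
import numpy as np

def cosine_score(vec_a, vec_b):
    """Cosine similarity between two voiceprint vectors (e.g. i-vectors);
    a higher score means the voices more likely come from the same speaker."""
    return float(np.dot(vec_a, vec_b) /
                 (np.linalg.norm(vec_a) * np.linalg.norm(vec_b) + 1e-10))
```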
  • taking the PLDA algorithm as an example, assume the speech data consists of utterances from I speakers, each speaker having J different utterances, and define the j-th utterance of the i-th speaker as Y_ij. The generative model of Y_ij is then defined as: Y_ij = μ + F·h_i + G·w_ij + ε_ij, where μ is the mean of the voiceprint vectors, and F and G are spatial feature matrices representing the between-speaker (between-class) feature space and the within-class feature space, respectively. Each column of F corresponds to an eigenvector of the between-class feature space, and each column of G corresponds to an eigenvector of the within-class feature space. The vectors h_i and w_ij can be regarded as the representations of the utterance in the respective spaces, and ε_ij is the noise covariance. The greater the likelihood that two utterances have the same h_i feature, i.e., the higher the similarity score, the more likely they come from the same speaker.
  • PLDA has four model parameters, namely μ, F, G, and ε_ij, which are trained iteratively with the EM algorithm. Typically, a simplified version of the PLDA model can be used, which ignores the training of the within-class feature-space matrix G and trains only the between-class feature-space matrix F, namely: Y_ij = μ + F·h_i + ε_ij.
  • the voice recognition module 220 may obtain the h_i feature of the first voice from the voiceprint vector of the first voice with reference to the above formula. Likewise, the h_i feature of the user's voice is obtained from the user's voiceprint vector with reference to the above formula. The log-likelihood ratio or the cosine distance between the two h_i features can then be calculated as the similarity score between the first voice and the user's voiceprint.
  • the voiceprint is not limited to the above voiceprint vectors (i-vector, d-vector, x-vector, etc.) or the above voiceprint models (HMM model, GMM model, etc.), and the corresponding similarity scoring algorithm can likewise be chosen freely according to the selected voiceprint; the present invention does not limit this.
  • in various embodiments, if the obtained similarity score exceeds a similarity threshold, the voice recognition module 220 determines that the first voice matches the user's voiceprint, that is, that the first voice matches the user corresponding to the voiceprint; otherwise, the voice recognition module 220 determines that the first voice does not match the user's voiceprint.
  • the voice recognition module 220 may match the first voice against each user's voiceprint to determine whether there is a user matching the first voice. When there is a user matching the first voice, in addition to performing voice recognition on the audio data to obtain the instruction, the voice recognition module 220 may store the instruction corresponding to the first voice in association with the matched user, for example in that user's profile. In this way, the voice recognition device 200 can subsequently analyze the user's behavior preferences from all instructions from the user, thereby providing the user with personalized and customized services. For example, the user's song preferences can be analyzed from all of the user's song-playing instructions, so that songs matching those preferences can be recommended.
  • when there is no user matching the first voice, the voice recognition module 220 may store the piece of audio data including the first voice to the audio storage module 230.
  • the audio storage module 230 is adapted to store audio data.
  • the user discovery module 240 may cluster multiple pieces of audio data stored in the audio storage module 230 to determine a new user from them. In this way, for subsequently received audio data including the new user's voice, the voice recognition device 200 can match the new user and store the corresponding instructions in association with the new user, so that the new user's behavior preferences can later be analyzed from all of the new user's instructions, thereby providing personalized services for the new user. In some embodiments, the user discovery module 240 may extract the stored multiple pieces of audio data (eg, a fixed number of pieces) every predetermined period for clustering.
  • the user discovery module 240 first divides the multiple pieces of audio data into multiple sets based on the pairwise similarity scores between them. The audio data contained in each set can be considered similar to each other.
  • a clustering algorithm can be used to divide the set.
  • the user discovery module 240 determines at least one target set based on the sample characteristics of the set, and each target set corresponds to a new user.
  • the sample characteristics may include sample density, sample number, etc., and the sample refers to audio data.
  • the sample density and number of samples for that set can be calculated.
  • the set whose sample density and sample quantity meet the predetermined condition is selected as the target set.
  • the predetermined condition may be, for example, that the sample density exceeds the predetermined density; the number of samples exceeds the predetermined number, and so on.
  • the predetermined condition can be configured according to the number of target sets that need to be determined, and the present invention does not limit this.
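The text does not name a particular clustering algorithm; one plausible realization, sketched below with assumed thresholds, is density-based clustering (e.g. DBSCAN) over pairwise distances derived from the similarity scores, followed by the sample-count filter:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def find_target_sets(similarity, eps=0.4, min_samples=10, min_set_size=20):
    """Partition stored audio items into sets from a pairwise similarity matrix
    (values assumed normalized to [0, 1]) and keep dense, large sets as targets.

    Returns a list of index arrays, one per target set (candidate new user).
    """
    distance = 1.0 - similarity                        # similarity -> distance
    labels = DBSCAN(eps=eps, min_samples=min_samples,  # density condition
                    metric='precomputed').fit_predict(distance)
    targets = []
    for label in set(labels) - {-1}:                   # -1 marks noise samples
        members = np.where(labels == label)[0]
        if len(members) >= min_set_size:               # sample-count condition
            targets.append(members)
    return targets
```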
  • after determining a target set (ie, discovering a new user), the user discovery module 240 creates a user profile for the new user corresponding to the target set, and uses at least part of the audio data in the target set to generate the new user's voiceprint.
  • the voiceprint may be a voiceprint model or a voiceprint vector.
  • a GMM model or GMM-UBM model can be trained using the voice features of these audio data as the voiceprint of a new user.
  • the voiceprint vector can also be extracted based on the voice features of these audio data as the voiceprint of the new user.
  • for the specific voiceprint generation process, please refer to the previous description of voiceprints, which will not be repeated here.
  • At least part of the audio data in the target set may be randomly selected to generate a voiceprint.
  • the audio data for generating the voiceprint of the new user in the target set may also be determined according to the distance from the centroid of the target set. For example, first determine the centroid of the target set, then calculate the distance from each sample in the target set to the centroid of the target set, and select those samples with smaller distances as the audio data for generating the voiceprint of the new user.
  • the calculation of the center of mass is a conventional technique in the art, and will not be repeated here.
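A small numpy sketch of this centroid-based selection, operating on per-utterance voiceprint vectors (the choice of k is an assumption):

```python
import numpy as np

def select_enrollment_samples(cluster_vectors, k=5):
    """Return indices of the k samples closest to the cluster centroid;
    their audio is then used to generate the new user's voiceprint."""
    centroid = cluster_vectors.mean(axis=0)
    dists = np.linalg.norm(cluster_vectors - centroid, axis=1)
    return np.argsort(dists)[:k]
```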
  • if no target set is determined (ie, no new user is discovered), for example because no set satisfies the predetermined condition, the user discovery module 240 can delete these audio data, that is, the multiple pieces of audio data previously extracted from the audio storage module 230.
  • creating a user profile may be considered a user registration process.
  • the user can actively provide audio data including the user's voice (for example, send an active registration request to the server via the terminal device, and actively enter a voice for a specific text according to the corresponding registration prompt), so as to generate the user based on the actively provided audio data Voiceprint.
  • This process of user active operation can be regarded as an active registration process.
  • the process of discovering new users through clustering, creating user profiles and generating voiceprints for them is not perceived by users, so this process can be considered as an inactive registration process.
  • the user profile may further include a user flag indicating whether the user is actively registered.
  • the voice recognition device 200 may further include a user registration module 250.
  • the communication module 210 may receive audio data including a second voice, which is usually used to actively register a new user.
  • the second voice may be a voice recorded according to a registration prompt of the terminal device 102.
  • the user registration module 250 may create a user profile for the actively registered new user, use the audio data including the second voice to generate the new user's voiceprint, and set the user flag in the user profile of the actively registered new user to active registration.
  • correspondingly, for a new user discovered through clustering, when the user discovery module 240 creates a user profile for the new user corresponding to the target set, the user flag in the created user profile can be set to inactive registration.
  • the voice recognition module 220 can determine whether the user is actively registered based on the user flag in the corresponding user profile. If the user flag indicates that the user is not actively registered, the voice recognition module 220 may record the number of pieces of audio data from the user. Specifically, the user profile may include the number of pieces of audio data from the user; each time a piece of audio data from the user is received, the voice recognition module 220 increases the count by one.
  • the user discovery module 240 can set the number of audio data items from the user in the created user profile to the initial value. The initial value can usually be 0.
  • the voice recognition module 220 can also determine whether the number of pieces of audio data from the user reaches a specific amount within a specific time period (for example, within a month since registration). If not, the voice recognition module 220 may delete the user profile corresponding to the user, that is, log out the user. If it is reached, nothing further needs to be done.
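One way the profile bookkeeping described here could look; the retention period and utterance threshold are illustrative assumptions:

```python
import time
from dataclasses import dataclass, field

RETENTION_SECONDS = 30 * 24 * 3600   # e.g. one month since registration
MIN_UTTERANCES = 20                  # assumed threshold

@dataclass
class UserProfile:
    user_id: str
    voiceprint: object                  # voiceprint model or vector
    actively_registered: bool = False   # the user flag described above
    audio_count: int = 0                # pieces of audio received so far
    created_at: float = field(default_factory=time.time)

def prune_inactive_profiles(profiles):
    """Drop non-actively-registered users whose audio count failed to reach
    the threshold within the retention period (i.e. log them out)."""
    now = time.time()
    return [p for p in profiles
            if p.actively_registered
            or p.audio_count >= MIN_UTTERANCES
            or now - p.created_at < RETENTION_SECONDS]
```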
  • the user profile may further include the device identification of the terminal device associated with the user.
  • the communication module 210 may receive the device identification of the terminal device that sends the audio data including the second voice, and the user registration module 250 may store the device identification in association with the actively registered new user to the corresponding User profile.
  • when receiving the audio data, the voice recognition module 220 can also receive the device identification of the terminal device that sent it, and, before determining whether there is a user matching the first voice, determine based on the device identification whether there is a user associated with the terminal device, that is, find out whether there is a user profile including the device identification.
  • if there is no user associated with the terminal device, the voice recognition module 220 may store the audio data to the audio storage module 230. If there is a user associated with the terminal device, the voice recognition module 220 determines whether there is a user matching the first voice.
  • an embodiment of the present invention also provides a user identification device.
  • the user recognition device includes a communication module, a voice recognition module, an audio storage module, and a user discovery module.
  • the communication module receives audio data including the first voice, and the voice recognition module can determine whether there is a user matching the first voice, and store the audio data to the audio storage module if there is no user matching the first voice .
  • the audio storage module stores audio data.
  • the user discovery module may cluster multiple pieces of audio data stored in the audio storage module, so as to determine new users from the multiple pieces of audio data, and conduct behavior analysis on the new users. For example, the behavior preference of the new user can be analyzed according to the instruction corresponding to the voice of the new user, so as to provide a personalized service for the new user.
  • each module in the user recognition device may be the same as the processing of each module in the voice recognition device 200 described above with reference to FIGS. 1 and 2, for example, and can achieve similar technical effects, which will not be repeated here.
  • FIG. 3 shows a schematic diagram of a computing device 300 according to an embodiment of the invention.
  • the computing device 300 typically includes a system memory 306 and one or more processors 304.
  • the memory bus 308 may be used for communication between the processor 304 and the system memory 306.
  • the processor 304 may be any type of processor, including but not limited to: a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof.
  • the processor 304 may include one or more levels of cache, such as a level one cache 310 and a level two cache 312, a processor core 314, and registers 316.
  • the example processor core 314 may include an arithmetic logic unit (ALU), a floating point unit (FPU), a digital signal processing core (DSP core), or any combination thereof.
  • the example memory controller 318 may be used with the processor 304, or in some implementations, the memory controller 318 may be an internal part of the processor 304.
  • the system memory 306 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof.
  • the system memory 306 may include an operating system 320, one or more applications 322, and program data 324.
  • the application 322 may be arranged to be executed by the one or more processors 304 using the program data 324 on the operating system.
  • the computing device 300 may also include an interface bus 340 that facilitates communication from various interface devices (eg, output device 342, peripheral interface 344, and communication device 346) to the basic configuration 302 via the bus / interface controller 330.
  • the example output device 342 includes a graphics processing unit 348 and an audio processing unit 350. They may be configured to facilitate communication with various external devices such as displays or speakers via one or more A / V ports 352.
  • the example peripheral interface 344 may include a serial interface controller 354 and a parallel interface controller 356, which may be configured to facilitate communication, via one or more I/O ports 358, with input devices (eg, keyboard, mouse, pen, voice input devices, touch input devices) or other peripheral devices (such as printers, scanners, etc.).
  • the example communication device 346 may include a network controller 360, which may be arranged to facilitate communication with one or more other computing devices 362 via a network communication link via one or more communication ports 364.
  • the network communication link may be an example of a communication medium.
  • Communication media can generally be embodied as computer readable instructions, data structures, program modules in a modulated data signal such as a carrier wave or other transmission mechanism, and can include any information delivery media.
  • a "modulated data signal" may be a signal in which one or more of its data set or its changes can be made in such a way as to encode information in the signal.
  • the communication medium may include a wired medium such as a wired network or a dedicated line network, and various wireless media such as sound, radio frequency (RF), microwave, infrared (IR), or other wireless media.
  • the term computer readable media as used herein may include both storage media and communication media.
  • the computing device 300 may be implemented as a server, such as a database server, an application server, a WEB server, etc., or as a personal computer including a desktop computer and a notebook computer configuration.
  • the computing device 300 can also be implemented as part of a small-sized portable (or mobile) electronic device.
  • the computing device 300 is implemented as a voice recognition apparatus 200, and is configured to perform the voice recognition method 400 according to the embodiment of the present invention.
  • the application 322 of the computing device 300 includes multiple program instructions for executing the voice recognition method 400 according to an embodiment of the present invention, and the program data 324 may also store configuration information of the voice recognition system 100 and the like.
  • FIG. 4 shows a voice recognition method 400 according to an embodiment of the present invention. As shown in FIG. 4, the voice recognition method 400 starts at step S410.
  • step S410 audio data including the first voice is received.
  • the first voice is generally a voice instructing the terminal device 102 to perform an operation. Therefore, according to the embodiment of the present invention, voice recognition may be performed on the audio data to obtain an instruction corresponding to the first voice, and a response result for the instruction is then returned to the terminal device 102, so that the terminal device 102 performs the corresponding operation at least according to the response result.
  • each user corresponds to a user profile that records data related to the user.
  • These user profiles may be stored in a user data storage device coupled to the voice recognition apparatus 200, or in a user data storage module included in the voice recognition apparatus 200.
  • the user's biometrics, such as fingerprints, voiceprints, and irises, can be used to uniquely identify the user.
  • a voiceprint can be used to uniquely identify the user; the user profile includes the user's voiceprint, and it can be determined whether the first voice matches the user's voiceprint in order to determine whether there is a user matching the first voice.
  • the voice features of the first voice may be extracted based on the audio data including the first voice.
  • the voice features may be one or a combination of features such as filter bank (FBank) features, Mel-frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP) coefficients, deep features (Deep Feature), and power-normalized cepstral coefficients (PNCC).
  • a similarity score between the first voice and the voiceprint of the user is obtained based on the voice characteristics of the first voice, and it is determined whether the first voice matches the voiceprint of the user according to the similarity score. If the obtained similarity score exceeds the similarity threshold, it is determined that the first voice matches the voiceprint of the user, otherwise it is determined that the first voice does not match the voiceprint of the user.
  • if there is a user matching the first voice, the instruction corresponding to the first voice may be stored in association with the matched user. If there is no user matching the first voice, then in step S430, the piece of audio data is stored.
  • step S440 the stored multiple pieces of audio data are clustered to determine a new user from the multiple pieces of audio data.
  • multiple pieces of audio data may be divided into multiple sets based on the pairwise similarity scores between them.
  • at least one target set is determined based on the sample density and the number of samples of the set, and the target set corresponds to the new user.
  • a user profile is created for the new user corresponding to the target set, and at least part of the audio data in the target set is used to generate the voiceprint of the new user.
  • the audio data for generating the voiceprint of the new user in the target set may be determined according to the distance from the centroid of the target set. For example, first determine the centroid of the target set, and then calculate the distance from each sample in the target set to the centroid of the target set, and select those samples with smaller distances as the audio data for generating the voiceprint of the new user.
  • if no target set is determined (ie, no new user is discovered), these audio data, that is, the previously extracted pieces of audio data, can be deleted.
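Pulling steps S410-S440 together, a hedged sketch of one pass of method 400; the helper functions (extract_features, score_against_voiceprint, recognize_command) are hypothetical names, not part of the text:

```python
def handle_audio(audio, profiles, stored_audio, threshold=0.7):
    """One pass of method 400: receive audio (S410), match the voice against
    every known voiceprint (S420), store the instruction on a match, or keep
    the audio for later clustering (S430/S440)."""
    features = extract_features(audio)                       # hypothetical helper
    best_profile, best_score = None, float('-inf')
    for profile in profiles:
        score = score_against_voiceprint(features, profile.voiceprint)
        if score > best_score:
            best_profile, best_score = profile, score
    if best_profile is not None and best_score > threshold:  # S420: match found
        best_profile.instructions.append(recognize_command(audio))
    else:                                                    # S430: no match
        stored_audio.append(audio)                           # clustered later (S440)
```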
  • the user profile may further include a user flag indicating whether the user is actively registered, and when creating a user profile for a new user corresponding to the target set, the user flag in the user profile may be set to inactive registration.
  • the number of pieces of audio data from the user can also be recorded, to determine whether the number reaches a specific amount within a specific time period. If not, the user profile corresponding to the user can be deleted.
  • the voice recognition method 400 may further include the steps of: receiving audio data including a second voice, which is generally used to actively register a new user; creating a user profile for the actively registered new user and using the audio data including the second voice to generate the new user's voiceprint; and setting the user flag in the user profile of the actively registered new user to active registration.
  • the user profile may further include the device identification of the terminal device associated with the user. The voice recognition method 400 may further include the steps of: receiving the device identification of the terminal device that sends the audio data; determining, based on the device identification, whether there is a user associated with the terminal device; and if not, storing the audio data.
  • an embodiment of the present invention also provides a user recognition method, including the steps of: receiving audio data including the first voice; determining whether there is a user matching the first voice; when there is no user matching the first voice, storing the audio data; and clustering the stored multiple pieces of audio data to determine a new user from them and conduct behavior analysis on the new user.
  • the processing of each step in the user recognition method may be the same as the processing of each step in the voice recognition method 400 described above in conjunction with FIG. 4 and can achieve similar technical effects, which will not be repeated here.
  • a new user is determined and the new user's voiceprint is generated by clustering the stored multiple pieces of audio data, so that the user can subsequently be identified based on the voiceprint and the user's behavior preferences can be analyzed from the user's instructions, so as to provide the user with more accurate personalized services.
  • the entire process of new-user determination and voiceprint generation is imperceptible to the user, eliminating the user's active registration operation and improving the user experience.
  • modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in the embodiments, or alternatively may be located in one or more devices different from the device in the examples.
  • the modules in the foregoing examples may be combined into one module or, in addition, may be divided into multiple sub-modules.
  • modules in the device in the embodiment can be adaptively changed and set in one or more devices different from the embodiment.
  • the modules or units or components in the embodiments may be combined into one module or unit or component, and may furthermore be divided into multiple sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Business, Economics & Management (AREA)
  • Signal Processing (AREA)
  • Game Theory and Decision Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A speech recognition method, comprising the steps of: receiving audio data including a first voice (S410); determining whether there is a user matching the first voice (S420); when there is no user matching the first voice, storing the audio data (S430); and clustering multiple stored pieces of audio data so as to determine a new user from them (S440). A corresponding speech recognition apparatus, system, and computing device are also disclosed.

Description

Speech recognition method, apparatus and computing device
This application claims priority to Chinese patent application No. 2018113400922, filed on November 12, 2018 and entitled "Speech recognition method, apparatus and computing device" (一种语音识别方法、装置及计算设备), the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the technical field of speech recognition, and in particular to a speech recognition method, apparatus, and computing device.
Background
With the widespread use of terminal devices such as mobile terminals and smart speakers, people have become increasingly accustomed to interacting with these devices by voice. In this context, a terminal device may use voiceprint recognition technology to identify a user's identity.
Voiceprint identification, also known as speaker identification, is a biometric technology that extracts voice features from the speech signal uttered by a speaker and verifies the speaker's identity on that basis. A voiceprint is the sound-wave spectrum carrying speech information in human speech. Like a fingerprint, a voiceprint is a unique biological characteristic with an identifying role: it is not only distinctive but also relatively stable.
Typically, a speaker needs to register a voiceprint on the terminal device in advance, after which the terminal device recognizes the user by the voiceprint, so that user behavior can be analyzed from the instructions corresponding to the user's voice in order to provide the user with personalized, customized services, such as song recommendation.
Since most users of current terminal devices have not actively registered a voiceprint, users cannot be accurately identified, and consequently their behavior cannot be analyzed to provide them with personalized services, or the personalized services provided can hardly achieve good results.
Therefore, a better speech recognition solution is needed in order to serve users.
Summary
To this end, embodiments of the present invention provide a speech recognition method, apparatus, and computing device, in an effort to solve, or at least alleviate, at least one of the problems above.
According to one aspect of the embodiments of the present invention, a speech recognition method is provided, comprising the steps of: receiving audio data including a first voice; determining whether there is a user matching the first voice; when there is no user matching the first voice, storing the audio data; and clustering multiple stored pieces of audio data so as to determine a new user from them.
Optionally, in the speech recognition method according to an embodiment of the present invention, a user corresponds to a user profile that includes the user's voiceprint, and the step of determining whether there is a user matching the first voice comprises: determining whether the first voice matches the user's voiceprint, so as to determine whether there is a user matching the first voice.
Optionally, in the speech recognition method according to an embodiment of the present invention, the step of clustering the stored pieces of audio data so as to determine a new user from them comprises: dividing the pieces of audio data into multiple sets based on the pairwise similarity scores between them; determining at least one target set based on the sample density and number of samples of each set, the target set corresponding to a new user; and creating a user profile for the new user corresponding to the target set and generating the new user's voiceprint using at least part of the audio data in the target set.
Optionally, in the speech recognition method according to an embodiment of the present invention, the step of generating the new user's voiceprint using at least part of the audio data in the target set comprises: determining the audio data in the target set to be used for generating the new user's voiceprint according to the distance to the centroid of the target set.
Optionally, in the speech recognition method according to an embodiment of the present invention, the user profile includes a user flag indicating whether the user is actively registered, and the step of creating a user profile for the new user corresponding to the target set comprises: setting the user flag in the user profile created for the new user corresponding to the target set to inactive registration; and the method further comprises the step of: when there is a user matching the first voice and the corresponding user flag indicates that the user is not actively registered, recording the number of pieces of audio data from the user.
Optionally, in the speech recognition method according to an embodiment of the present invention, the method further comprises the steps of: after recording the number of pieces of audio data from the user, determining whether that number reaches a specific amount within a specific time period; and if not, deleting the user profile corresponding to the user.
Optionally, in the speech recognition method according to an embodiment of the present invention, the user profile further includes the device identifier of the terminal device associated with the user, and the method comprises the steps of: receiving the device identifier of the terminal device that sent the audio data; determining, based on the device identifier, whether there is a user associated with the terminal device; and if not, storing the audio data.
Optionally, in the speech recognition method according to an embodiment of the present invention, the method further comprises the step of: when there is a user matching the first voice, storing the instruction corresponding to the first voice in association with the user.
Optionally, in the speech recognition method according to an embodiment of the present invention, the method further comprises the steps of: receiving audio data including a second voice, the second voice being used to actively register a new user; creating a user profile for the actively registered new user and generating the new user's voiceprint using the audio data including the second voice; and setting the user flag in the user profile created for the actively registered new user to active registration.
Optionally, in the speech recognition method according to an embodiment of the present invention, the method further comprises the steps of: receiving the device identifier of the terminal device that sent the audio data including the second voice; and storing the device identifier, in association with the actively registered new user, in the corresponding user profile.
Optionally, in the speech recognition method according to an embodiment of the present invention, the step of determining whether the first voice matches the user's voiceprint comprises: extracting voice features of the first voice from the audio data including the first voice; obtaining a similarity score between the first voice and the user's voiceprint based on the voice features of the first voice; and determining, according to the similarity score, whether the first voice matches the user's voiceprint.
According to another aspect of the embodiments of the present invention, a user recognition method is provided, comprising the steps of: receiving audio data including a first voice; determining whether there is a user matching the first voice; when there is no user matching the first voice, storing the audio data; and clustering multiple stored pieces of audio data so as to determine a new user from them and perform behavior analysis on the new user.
According to another aspect of the embodiments of the present invention, a speech recognition apparatus is provided, comprising: a communication module adapted to receive audio data including a first voice; a voice recognition module adapted to determine whether there is a user matching the first voice and, when there is no user matching the first voice, to store the audio data in an audio storage module; the audio storage module, adapted to store audio data; and a user discovery module adapted to cluster multiple pieces of audio data stored in the audio storage module so as to determine a new user from them.
According to another aspect of the embodiments of the present invention, a user recognition apparatus is provided, comprising: a communication module adapted to receive audio data including a first voice; a voice recognition module adapted to determine whether there is a user matching the first voice and, when there is no user matching the first voice, to store the audio data in an audio storage module; the audio storage module, adapted to store audio data; and a user discovery module adapted to cluster multiple pieces of audio data stored in the audio storage module so as to determine a new user from them and perform behavior analysis on the new user.
According to another aspect of the embodiments of the present invention, a speech recognition system is provided, comprising a terminal device and a server, wherein the terminal device is adapted to receive a speaker's voice and send audio data including the voice to the server, and the speech recognition apparatus according to the present invention resides on the server.
According to yet another aspect of the embodiments of the present invention, a computing device is provided, comprising: at least one processor; and a memory storing program instructions, wherein the program instructions are configured to be executed by the at least one processor and include instructions for performing the speech recognition method according to the present invention.
According to the speech recognition solution of the embodiments of the present invention, a new user is determined by clustering multiple stored pieces of audio data. The entire new-user determination process is imperceptible to the user, eliminating the need for an active registration operation and improving the user experience.
Brief Description of the Drawings
To achieve the above and related objects, certain illustrative aspects are described herein in conjunction with the following description and drawings. These aspects indicate various ways in which the principles disclosed herein may be practiced, and all aspects and their equivalents are intended to fall within the scope of the claimed subject matter. The above and other objects, features, and advantages of the present disclosure will become more apparent from the following detailed description read in conjunction with the drawings. Throughout this disclosure, the same reference numerals generally refer to the same parts or elements.
FIG. 1 shows a schematic diagram of a speech recognition system 100 according to an embodiment of the present invention;
FIG. 2 shows an architecture diagram of a speech recognition apparatus 200 according to an embodiment of the present invention;
FIG. 3 shows a schematic diagram of a computing device 300 according to an embodiment of the present invention; and
FIG. 4 shows a block diagram of a speech recognition method 400 according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
FIG. 1 shows a schematic diagram of a speech recognition system 100 according to an embodiment of the present invention. As shown in FIG. 1, the speech recognition system 100 includes a terminal device 102 and a server 106.
The terminal device 102 is the receiver of any speaker's voice. A speaker can interact with the server 106 by voice via the terminal device 102. The terminal device 102 may be a computing device coupled to the server 106 through one or more networks 105, such as a local area network (LAN) or a wide area network (WAN) such as the Internet. For example, the terminal device 102 may be a desktop computing device, a laptop computing device, a tablet computing device, a mobile phone computing device, a speaker computing device, a vehicle computing device (e.g., an in-vehicle communication system, an in-vehicle entertainment system, or an in-vehicle navigation system), a wearable device including a computing device (e.g., a watch with a computing device or glasses with a computing device), or a home device including a computing device (e.g., a speaker with a computing device, a television with a computing device, or a washing machine with a computing device). Although a speaker may operate multiple computing devices, for brevity the examples in this disclosure are directed to the speaker operating the terminal device 102.
The terminal device 102 may operate one or more applications and/or components, which may involve providing notifications to the speaker and providing various types of signals. These applications and/or components may include, but are not limited to, a microphone 103, an output device 104, position-coordinate components such as a global positioning system ("GPS") component (not shown in FIG. 1), and so on. In some embodiments, one or more of these applications and/or components may run on multiple terminal devices operated by the speaker. Other components of the terminal device 102 not shown in FIG. 1 include, but are not limited to, barometers, cameras, light sensors, presence sensors, thermometers, health sensors (e.g., heart rate monitors, blood glucose meters, sphygmomanometers), accelerometers, gyroscopes, and so on.
In some embodiments, the output device 104 may include one or more of a speaker (or speakers), a screen, a touch screen, one or more notification lights (e.g., light-emitting diodes), a printer, and so on. In some embodiments, the output device 104 may be used to provide output based on one or more operations invoked in response to the speaker's voice (such as opening a program, playing a song, sending an email or text message, or taking a picture).
The terminal device 102 includes one or more memories for storing data and software applications, one or more processors for accessing data and executing applications, and other components that facilitate communication over a network. In some embodiments, the terminal device 102 may be configured to sense one or more audible sounds (e.g., speech dictated by a speaker) using, for example, the microphone 103, and may provide audio data, based on the sensed one or more audible sounds (also called "audio input"), to various other computing devices. Those other computing devices (examples of which are described in more detail below) can perform various operations based on the audio data to identify matching audio data. In various embodiments, the audio data may include: one or more original recordings of the spoken speech; a compressed version of the recordings; an indication of one or more characteristics of the audio input obtained via the microphone 103 of the terminal device 102, such as pitch, tone, audio frequency, and/or volume; and/or a transcription of the audio input obtained via the microphone 103, and so on.
In some embodiments, the terminal device 102 sends audio data including the speaker's voice to the server 106, where the speech recognition apparatus 200 resides. Of course, in other embodiments, the speech recognition apparatus 200 may also reside in the terminal device 102; that is, the processing described below is performed directly on the terminal device 102.
FIG. 2 shows a structural block diagram of a speech recognition apparatus 200 according to an embodiment of the present invention. As shown in FIG. 2, the speech recognition apparatus 200 includes a communication module 210, a voice recognition module 220, an audio storage module 230, and a user discovery module 240.
The communication module 210 may receive audio data including a first voice from the terminal device 102, where the first voice is typically used to instruct the terminal device 102 to perform an operation.
The voice recognition module 220 performs speech recognition on the audio data to obtain the instruction corresponding to the first voice, and then returns a response result for the instruction to the terminal device 102 via the communication module 210, so that the terminal device 102 performs the corresponding operation at least according to the response result.
For example, in one embodiment, the terminal device 102 may be implemented as a sound box with a computing device. The sound box receives the voice spoken by the speaker, "Play the song Blue and White Porcelain", and sends audio data including this voice to the server 106. The server 106 returns the corresponding response result, the audio file of "Blue and White Porcelain", to the sound box, and the sound box performs the corresponding operation according to the response result: playing the audio file.
Of course, the speech recognition performed on the audio data to obtain the instruction may also take place on the terminal device 102. That is, the terminal device 102 performs speech recognition on the audio data and then sends the audio data and the recognized instruction to the speech recognition apparatus 200.
The voice recognition module 220 also determines whether there is a user matching the first voice. Generally, a user refers to a speaker whose identity the speech recognition system identifies. According to one embodiment, a user corresponds to a user profile that records data related to the user; these user profiles may be stored in a user data storage device coupled to the speech recognition apparatus 200, or in a user data storage module included in the speech recognition apparatus 200 (not shown in FIG. 2).
Generally, a user's biometric features, such as fingerprints, voiceprints, and irises, can be used to uniquely identify the user. In some embodiments of the present invention, a voiceprint can be used to uniquely identify a user; a voiceprint is the sound-wave spectrum carrying speech information in a speaker's voice and can uniquely identify the speaker. The voice recognition module 220 may employ various voiceprint recognition technologies to determine whether there is a user matching the first voice.
Specifically, in various embodiments, the user profile may include the user's voiceprint. The voice recognition module 220 may determine whether there is a user matching the first voice by determining whether the first voice matches a user's voiceprint.
The process of determining whether the first voice matches a user's voiceprint is described in detail below.
In some embodiments, the audio data may undergo different levels of preprocessing before being matched to a user by the voice recognition module 220. In some embodiments, such preprocessing may help the voice recognition module 220 perform more efficient speech recognition. In various embodiments, the preprocessing may be performed by the terminal device 102 or by another component, such as a component of the speech recognition apparatus 200. In some embodiments, the voice recognition module 220 itself may preprocess the audio data.
As a non-limiting example of preprocessing, the audio data may initially be captured, for example by the microphone 103 of the terminal device 102, as raw data (e.g., in a "lossless" form such as a wav file or a "lossy" form such as an MP3 file). Such raw data may be preprocessed, for example by one or more components of the terminal device 102 or of the speech recognition apparatus 200, to facilitate speech recognition. In various embodiments, the preprocessing may include: sampling; quantization; removing non-speech and silent audio data; and framing and windowing the audio data that includes speech for subsequent processing, among others.
After preprocessing, the voice recognition module 220 may extract the voice features of the first voice from the audio data including the first voice, and match the first voice against a user's voiceprint based on the voice features of the first voice.
In some embodiments, the voice features may be one or a combination of features such as filter bank (FBank) features, Mel-frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP) coefficients, deep features, and power-normalized cepstral coefficients (PNCC). In one embodiment, the voice recognition module 220 may also normalize the extracted voice features.
The voice recognition module 220 then matches the first voice against the user's voiceprint based on the voice features of the first voice to obtain a similarity score between the first voice and the user's voiceprint, and determines the user matching the first voice according to that score.
Specifically, in some embodiments, a user's voiceprint is described by a voiceprint model, such as a hidden Markov model (HMM) or a Gaussian mixture model (GMM). The user's voiceprint model takes voice features as its features and is trained using audio data including the user's voice (hereinafter referred to as the user's audio data). The voice recognition module 220 may use a matching function to calculate the similarity between the first voice and the user's voiceprint. For example, the posterior probability that the voice features of the first voice match the user's voiceprint model can be calculated as the similarity score, or the likelihood between the voice features of the first voice and the user's voiceprint model can be calculated as the similarity score.
However, since training a user's voiceprint model well requires a large amount of that user's audio data, in some embodiments the user's voiceprint model may be derived from a user-independent universal background model using a small amount of the user's audio data (again taking voice features as the features). For example, audio data from multiple speakers, independent of the user, can first be used to train a universal background model (UBM) via the expectation-maximization (EM) algorithm, characterizing the user-independent feature distribution. Then, based on the UBM, a small amount of the user's audio data is used to train a GMM via adaptive algorithms (such as maximum a posteriori (MAP) or maximum likelihood linear regression (MLLR)) to characterize the user's feature distribution; the GMM thus obtained is called a GMM-UBM model and serves as the user's voiceprint model. In this case, the voice recognition module 220 may match the first voice against the user's voiceprint model and the universal background model, respectively, based on the voice features of the first voice, to obtain the similarity score between the first voice and the user's voiceprint. For example, the likelihoods between the voice features of the first voice and the UBM model and the GMM-UBM model are calculated respectively; the two likelihoods are then divided and the logarithm taken, and the resulting value is used as the similarity score between the first voice and the user's voiceprint.
In other implementations, a user's voiceprint is described by a voiceprint vector, for example an i-vector, d-vector, x-vector or j-vector. The speech recognition module 220 can extract the voiceprint vector of the first voice based at least on its speech features.
According to one embodiment, a voiceprint model of the first voice's speaker can first be trained from the speech features of the first voice. As before, this model can be trained from those features on top of the pre-trained, user-independent universal background model described above.
Once the voiceprint model of the first voice's speaker is obtained, the mean supervector of the first voice can be extracted from it. For example, the means of the individual GMM components of the speaker's GMM-UBM model can be concatenated to obtain the mean supervector of that GMM-UBM model, which is the mean supervector of the first voice.
Joint factor analysis (JFA), or a simplified variant of it, can then be used to extract a low-dimensional voiceprint vector from the mean supervector of the first voice.
Taking the i-vector as an example: after the user-independent universal background model (UBM) described above has been trained, its mean supervector can be extracted and the total variability space (T) matrix estimated. The i-vector of the first voice is then computed from the first voice's mean supervector, the T matrix and the UBM's mean supervector.
Specifically, the i-vector can be computed from the following formula:

M_{s,h} = m_u + T ω_{s,h}

where M_{s,h} is the mean supervector obtained from utterance h of speaker s, m_u is the mean supervector of the universal background model, T is the total variability space matrix, and ω_{s,h} is the total variability factor, i.e., the i-vector.
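As a toy illustration, the i-vector can be read off this model as the posterior mean of the total variability factor, ω = (I + TᵀΣ⁻¹T)⁻¹ TᵀΣ⁻¹ (M − m_u). The sketch below assumes a diagonal covariance supermatrix Σ and ignores per-utterance zero-order statistics; real systems estimate T with EM over large corpora, so this is a simplification, not the patent's method.

```python
import numpy as np

def extract_ivector(M: np.ndarray, m_u: np.ndarray,
                    T: np.ndarray, sigma_diag: np.ndarray) -> np.ndarray:
    """M, m_u: (D,) mean supervectors; T: (D, R) total variability matrix;
    sigma_diag: (D,) diagonal of the covariance supermatrix."""
    Tt_Sinv = T.T / sigma_diag             # T' Σ^{-1}, via row-wise broadcasting
    A = np.eye(T.shape[1]) + Tt_Sinv @ T   # (R, R) posterior precision
    b = Tt_Sinv @ (M - m_u)
    return np.linalg.solve(A, b)           # the i-vector ω
```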
According to another embodiment, a trained deep neural network (DNN) can be used to obtain the voiceprint vector of the first voice. Taking the d-vector as an example, the DNN may include an input layer, hidden layers and an output layer. The FBank features of the first voice are fed into the DNN's input layer, and the output of the DNN's last hidden layer is the d-vector.
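A minimal PyTorch sketch of this idea follows, assuming a small fully connected network trained elsewhere for speaker classification; at extraction time the last hidden layer's activations, averaged over frames, serve as the d-vector. The layer sizes are arbitrary assumptions, not values given by the patent.

```python
import torch
import torch.nn as nn

class DVectorNet(nn.Module):
    """Small fully connected speaker-classification network (illustrative sizes)."""
    def __init__(self, n_fbank: int = 40, hidden: int = 256, n_speakers: int = 1000):
        super().__init__()
        self.hidden_stack = nn.Sequential(
            nn.Linear(n_fbank, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),    # last hidden layer
        )
        self.classifier = nn.Linear(hidden, n_speakers)  # used only during training

    def forward(self, fbank_frames: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.hidden_stack(fbank_frames))

    @torch.no_grad()
    def d_vector(self, fbank_frames: torch.Tensor) -> torch.Tensor:
        h = self.hidden_stack(fbank_frames)   # (frames, hidden) activations
        return h.mean(dim=0)                  # frame-averaged embedding
```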
Once the voiceprint vector of the first voice has been obtained, the speech recognition module 220 can compute the similarity score between the first voice and a user's voiceprint based on the first voice's voiceprint vector and the user's voiceprint vector. Algorithms such as support vector machines (SVM), LDA (linear discriminant analysis), PLDA (probabilistic linear discriminant analysis), likelihoods and the cosine distance can be used to compute this score.
Taking the PLDA algorithm as an example, assume the speech data consists of utterances from I speakers, each of whom has J different utterances, and let Y_ij denote the j-th utterance of the i-th speaker. The generative model of Y_ij is then defined as:
Y_ij = μ + F h_i + G w_ij + ε_ij
where μ is the mean of the voiceprint vectors, and F and G are spatial feature matrices representing the between-speaker and within-speaker feature spaces, respectively. Each column of F is equivalent to an eigenvector of the between-class feature space, and each column of G to an eigenvector of the within-class feature space. The vectors h_i and w_ij can be regarded as the utterance's representations in the respective spaces, and ε_ij is the residual noise term. The greater the likelihood that two utterances share the same h_i feature, i.e., the higher the similarity score, the more likely they come from the same speaker.
PLDA has four model parameters, μ, F, G and ε_ij, which are trained iteratively with the EM algorithm. In practice, a simplified PLDA model is often used that skips the training of the within-class matrix G and trains only the between-class matrix F, namely:
Y_ij = μ + F h_i + ε_ij
Based on the voiceprint vector of the first voice, the speech recognition module 220 can obtain the h_i feature of the first voice from the above formula; likewise, it obtains the h_i feature of the user's voice from the user's voiceprint vector. The log-likelihood ratio or the cosine distance of the two h_i features can then be computed as the similarity score between the first voice and the user's voiceprint.
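For the cosine variant, the final scoring step can be as small as the sketch below, which assumes the two speaker representations (h_i features or raw voiceprint vectors) are plain NumPy vectors; the log-likelihood-ratio variant would instead score the pair under the PLDA same-speaker and different-speaker hypotheses.

```python
import numpy as np

def cosine_score(h_test: np.ndarray, h_enrolled: np.ndarray) -> float:
    """Cosine similarity in [-1, 1]; in practice compared against a tuned threshold."""
    num = float(h_test @ h_enrolled)
    den = float(np.linalg.norm(h_test) * np.linalg.norm(h_enrolled)) + 1e-12
    return num / den
```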
It should be noted that voiceprints are not limited to the above voiceprint vectors (i-vector, d-vector, x-vector, etc.) and voiceprint models (HMM, GMM, etc.), and the corresponding similarity scoring algorithm can be chosen freely to suit the selected voiceprint; the present invention places no restriction on this.
In various implementations, if the obtained similarity score exceeds a similarity threshold, the speech recognition module 220 determines that the first voice matches that user's voiceprint, i.e., matches the user corresponding to the voiceprint. Otherwise, the speech recognition module 220 determines that the first voice does not match that user's voiceprint.
The speech recognition module 220 can match the first voice against each user's voiceprint to determine whether there is a user matching the first voice. If such a user exists, then in addition to performing speech recognition on the audio data to obtain the command, the module can store the command corresponding to the first voice in association with the matched user, for example in that user's profile. The speech recognition apparatus 200 can subsequently analyze the user's behavioral preferences from all of the user's commands and thus provide personalized, customized services; for example, it can analyze the user's taste in music from all of the user's song-playing commands and recommend songs matching that taste.
If there is no user matching the first voice, the speech recognition module 220 can store the piece of audio data that includes the first voice in the audio storage module 230, which is adapted to store audio data.
The user discovery module 240 can cluster the multiple pieces of audio data stored in the audio storage module 230 so as to determine a new user from them. In this way, for subsequently received audio data that includes this new user's voice, the speech recognition apparatus 200 can match the new user and store the corresponding commands in association with the new user, so that the new user's behavioral preferences can later be analyzed from all of the new user's commands and personalized services provided. In some implementations, the user discovery module 240 extracts the stored pieces of audio data (e.g., a fixed number of them) for clustering at predetermined intervals.
Specifically, the user discovery module 240 first divides the multiple pieces of audio data into multiple sets based on the pairwise similarity scores among them; the audio data within each set can be regarded as mutually similar. In one embodiment, a clustering algorithm is used to perform this division.
The computation of the similarity scores has already been described in detail above in connection with the similarity score between the first voice and a user's voiceprint, and is not repeated here.
The user discovery module 240 then determines at least one target set based on the sample features of the sets, each target set corresponding to one new user. The sample features may include sample density, sample count and the like, where a sample is a piece of audio data. In one embodiment, the sample density and sample count of each set are computed, and the sets whose sample density and sample count satisfy predetermined conditions are selected as target sets. The predetermined conditions may be, for example, that the sample density exceeds a predetermined density and the sample count exceeds a predetermined number; they can be configured according to the number of target sets to be determined, and the present invention places no restriction on this.
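An illustrative sketch of this discovery step follows, assuming DBSCAN over a precomputed distance matrix derived from the pairwise similarity scores, with a mean intra-set similarity standing in for "sample density". All thresholds here (eps, min_count, min_density) are made-up assumptions rather than values specified by the patent.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def discover_new_users(sim: np.ndarray, min_count: int = 20,
                       min_density: float = 0.6) -> list[np.ndarray]:
    """sim: (n, n) matrix of pairwise similarity scores in [0, 1] for the
    stored audio items; returns index arrays, one per target set."""
    dist = 1.0 - sim                                   # similarity -> distance
    labels = DBSCAN(eps=0.3, min_samples=5,
                    metric="precomputed").fit_predict(dist)
    targets = []
    for lab in set(labels) - {-1}:                     # -1 marks unclustered noise
        idx = np.flatnonzero(labels == lab)
        density = sim[np.ix_(idx, idx)].mean()         # mean intra-set similarity
        if len(idx) >= min_count and density >= min_density:
            targets.append(idx)                        # one new user per target set
    return targets
```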
After a target set is determined (i.e., a new user is discovered), the user discovery module 240 creates a user profile for the new user corresponding to that target set and generates the new user's voiceprint using at least part of the audio data in the target set. The voiceprint may be a voiceprint model or a voiceprint vector: for example, a GMM or GMM-UBM model can be trained on the speech features of this audio data as the new user's voiceprint, or a voiceprint vector can be extracted from those features instead. For the specific voiceprint generation process, refer to the description of voiceprints above, which is not repeated here.
The audio data in the target set used to generate the new user's voiceprint may be selected at random, or determined according to the distance to the centroid of the target set. For example, the centroid of the target set is first determined, the distance of each sample in the target set to that centroid is computed, and the samples with the smaller distances are selected as the audio data used to generate the new user's voiceprint. Computing a centroid is routine in the art and is not described here.
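A sketch of the centroid-based selection, under the assumption that each audio item is represented by a fixed-length embedding vector:

```python
import numpy as np

def select_enrollment_samples(embeddings: np.ndarray, k: int = 10) -> np.ndarray:
    """embeddings: (n, d) matrix, one vector per audio item in the target set."""
    centroid = embeddings.mean(axis=0)                  # centroid of the target set
    dists = np.linalg.norm(embeddings - centroid, axis=1)
    return embeddings[np.argsort(dists)[:k]]            # the k samples nearest the centroid
```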
If no target set is determined (i.e., no new user is discovered), for example because no set satisfies the predetermined conditions, the user discovery module 240 can delete this audio data, that is, the pieces of audio data previously extracted from the audio storage module 230.
Understandably, creating a user profile can be regarded as registering the user. Normally, a user can actively provide audio data that includes the user's voice (for example, by sending an active registration request to the server via the terminal device and recording speech for a specific text following the registration prompts), so that the user's voiceprint can be generated from this actively provided audio data. This user-initiated process can be regarded as active registration. By contrast, the process of discovering a new user through clustering, creating a profile for that user and generating a voiceprint is imperceptible to the user, and can therefore be regarded as non-active registration.
According to an implementation of the present invention, the user profile may further include a user flag indicating whether the user was actively registered. As shown in FIG. 2, the speech recognition apparatus 200 may further include a user registration module 250. The communication module 210 can receive audio data that includes a second voice, where the second voice is used for the active registration of a new user, for example speech recorded following the registration prompts of the terminal device 102. The user registration module 250 then creates a user profile for the actively registered new user, generates the new user's voiceprint from the audio data that includes the second voice, and sets the user flag in that profile to active registration.
Correspondingly, for a new user discovered through clustering, the user discovery module 240 sets the user flag in the profile it creates for the new user of the target set to non-active registration.
In this way, after determining that there is a user matching the first voice, the speech recognition module 220 can check the user flag in the corresponding profile to determine whether the user was actively registered. If the flag indicates non-active registration, the speech recognition module 220 can record the number of pieces of audio data received from the user: specifically, the user profile can include this count, and the speech recognition module 220 increments it by one for each piece of audio data received from the user. Correspondingly, when the user discovery module 240 creates a profile for a non-actively registered new user, it sets the count in that profile to an initial value, typically 0.
The speech recognition module 220 can also determine whether the number of pieces of audio data from the user reaches a specific number within a specific time period (for example, within one month of registration). If it does not, the speech recognition module 220 can delete the profile corresponding to that user, that is, deregister the user; if it does, no action need be taken.
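A toy sketch of this retention rule follows: passively registered profiles that fail to accumulate enough utterances within a probation window are deleted. The field names and thresholds are illustrative assumptions, not names used by the patent.

```python
from datetime import datetime, timedelta

PROBATION = timedelta(days=30)   # the "specific time period" (assumed)
MIN_UTTERANCES = 5               # the "specific number" (assumed)

def purge_inactive_profiles(profiles: dict) -> None:
    """profiles: user_id -> profile dict with the assumed fields
    'active_registration', 'registered_at' and 'utterance_count'."""
    now = datetime.now()
    for user_id in list(profiles):
        p = profiles[user_id]
        probation_over = now - p["registered_at"] >= PROBATION
        if (not p["active_registration"] and probation_over
                and p["utterance_count"] < MIN_UTTERANCES):
            del profiles[user_id]   # deregister the passively created user
```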
According to another implementation of the present invention, the user profile may further include the device identifier of the terminal device associated with the user. For example, during active registration the communication module 210 can receive the device identifier of the terminal device that sent the audio data including the second voice, and the user registration module 250 can store this identifier in the corresponding profile in association with the actively registered new user. The speech recognition module 220 can then also receive, along with any audio data, the device identifier of the sending terminal device and, before determining whether there is a user matching the first voice, first determine from the identifier whether there is a user associated with the corresponding terminal device, that is, whether a user profile containing that device identifier exists.
If there is no user associated with the terminal device, the speech recognition module 220 can store the audio data in the audio storage module 230. If such a user exists, the speech recognition module 220 determines whether there is a user matching the first voice.
In addition, an embodiment of the present invention provides a user identification apparatus, which includes a communication module, a speech recognition module, an audio storage module and a user discovery module. The communication module receives audio data that includes a first voice; the speech recognition module determines whether there is a user matching the first voice and, if not, stores the audio data in the audio storage module; the audio storage module stores the audio data; and the user discovery module clusters the multiple pieces of audio data stored in the audio storage module so as to determine a new user from them and perform behavior analysis on that new user. For example, the new user's behavioral preferences can be analyzed from the commands corresponding to the new user's voice, so that personalized services can be provided.
The processing of the modules of the user identification apparatus can, for example, be the same as that of the corresponding modules of the speech recognition apparatus 200 described above with reference to FIG. 1 and FIG. 2, achieving similar technical effects, and is not repeated here.
The specific structures of the modules and apparatuses mentioned above, and the corresponding processing methods, are described below with reference to the drawings.
According to implementations of the present invention, the various components of the speech recognition apparatus 200 (and of the user identification apparatus), such as the modules described above, can be implemented by a computing device 300 as described below. FIG. 3 shows a schematic diagram of a computing device 300 according to an embodiment of the present invention.
As shown in FIG. 3, in a basic configuration 302, the computing device 300 typically includes a system memory 306 and one or more processors 304. A memory bus 308 can be used for communication between the processors 304 and the system memory 306.
Depending on the desired configuration, the processor 304 may be any type of processor, including but not limited to a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. The processor 304 may include one or more levels of cache, such as a level-1 cache 310 and a level-2 cache 312, a processor core 314, and registers 316. An example processor core 314 may include an arithmetic logic unit (ALU), a floating-point unit (FPU), a digital signal processing core (DSP core), or any combination thereof. An example memory controller 318 may be used with the processor 304, or in some implementations the memory controller 318 may be an internal part of the processor 304.
Depending on the desired configuration, the system memory 306 may be any type of memory, including but not limited to volatile memory (such as RAM), non-volatile memory (such as ROM or flash memory), or any combination thereof. The system memory 306 may include an operating system 320, one or more applications 322 and program data 324. In some implementations, the applications 322 may be arranged to execute instructions on the operating system, using the program data 324, via the one or more processors 304.
The computing device 300 may further include an interface bus 340 that facilitates communication from various interface devices (e.g., output devices 342, peripheral interfaces 344 and communication devices 346) to the basic configuration 302 via a bus/interface controller 330. Example output devices 342 include a graphics processing unit 348 and an audio processing unit 350, which may be configured to facilitate communication with various external devices such as a display or loudspeakers via one or more A/V ports 352. Example peripheral interfaces 344 may include a serial interface controller 354 and a parallel interface controller 356, which may be configured to facilitate communication, via one or more I/O ports 358, with external devices such as input devices (e.g., a keyboard, mouse, pen, voice input device or touch input device) or other peripherals (e.g., a printer or scanner). An example communication device 346 may include a network controller 360, which may be arranged to facilitate communication with one or more other computing devices 362 over a network communication link via one or more communication ports 364.
A network communication link may be one example of a communication medium. Communication media may typically be embodied as computer-readable instructions, data structures or program modules in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery medium. A "modulated data signal" may be a signal in which one or more of its characteristics are set or changed in such a manner as to encode information in the signal. As non-limiting examples, communication media may include wired media such as wired or dedicated-line networks, and various wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR) or other wireless media. The term computer-readable media as used herein may include both storage media and communication media.
The computing device 300 may be implemented as a server, for example a database server, an application server or a web server, or as a personal computer including desktop and notebook configurations. Of course, the computing device 300 may also be implemented as part of a small-size portable (or mobile) electronic device.
In embodiments according to the present invention, the computing device 300 is implemented as the speech recognition apparatus 200 and is configured to perform a speech recognition method 400 according to an embodiment of the present invention. The applications 322 of the computing device 300 contain a plurality of program instructions for performing the speech recognition method 400, and the program data 324 may additionally store configuration information of the speech recognition system 100 and the like.
FIG. 4 shows a speech recognition method 400 according to an embodiment of the present invention. As shown in FIG. 4, the speech recognition method 400 begins at step S410.
In step S410, audio data that includes a first voice is received. As described above, the first voice is typically speech instructing the terminal device 102 to perform an operation. Therefore, according to implementations of the present invention, speech recognition can be performed on the audio data to obtain the command corresponding to the first voice, and a response to that command is then returned to the terminal device 102 so that it can perform the corresponding operation at least according to the response.
Then, in step S420, it can be determined whether there is a user matching the first voice. According to one implementation, each user corresponds to a user profile that records data related to the user; these profiles may be stored in a user data storage device coupled to the speech recognition apparatus 200, or in a user data storage module included in the speech recognition apparatus 200.
Generally, biometric features of a user such as a fingerprint, voiceprint or iris can be used to uniquely identify the user. In one implementation of the present invention, a voiceprint uniquely identifies the user: the user profile includes the user's voiceprint, and whether there is a user matching the first voice can be determined by determining whether the first voice matches a user's voiceprint.
Specifically, the speech features of the first voice can first be extracted from the audio data that includes it. In some implementations, the speech features may be one or a combination of filter-bank features (FBank), Mel-frequency cepstral coefficients (MFCC), perceptual linear prediction coefficients (PLP), deep features (Deep Feature), power-normalized cepstral coefficients (PNCC), and the like.
A similarity score between the first voice and the user's voiceprint is then obtained based on the speech features of the first voice, and whether the first voice matches the user's voiceprint is determined from that score: if the score exceeds a similarity threshold, the first voice is determined to match the user's voiceprint; otherwise, it is determined not to match.
If there is a user matching the first voice, the command corresponding to the first voice can be stored in association with the matched user. If there is no such user, then in step S430 the piece of audio data is stored.
Then, in step S440, the stored pieces of audio data are clustered so as to determine a new user from them. Specifically, the pieces of audio data are first divided into multiple sets based on the pairwise similarity scores among them; at least one target set is then determined based on the sample density and sample count of the sets, the target set corresponding to the new user; finally, a user profile is created for the new user corresponding to the target set, and the new user's voiceprint is generated using at least part of the audio data in the target set.
In one embodiment, the audio data in the target set used to generate the new user's voiceprint can be determined according to the distance to the centroid of the target set: the centroid of the target set is first determined, the distance of each sample in the target set to that centroid is computed, and the samples with the smaller distances are selected as the audio data used to generate the new user's voiceprint.
If no target set is determined, this audio data, that is, the previously stored pieces of audio data, can be deleted.
According to an implementation of the present invention, the user profile may further include a user flag indicating whether the user was actively registered; when a user profile is created for the new user corresponding to a target set, the user flag in that profile can be set to non-active registration. When there is a user matching the first voice and the corresponding flag indicates non-active registration, the number of pieces of audio data from the user can also be recorded, and it can be determined whether this count reaches a specific number within a specific time period; if not, the profile corresponding to the user can be deleted.
According to an implementation of the present invention, the speech recognition method 400 may further include the steps of: receiving audio data that includes a second voice, the second voice being used for the active registration of a new user; creating a user profile for the actively registered new user and generating the new user's voiceprint from the audio data that includes the second voice; and setting the user flag in the actively registered new user's profile to active registration.
According to an implementation of the present invention, the user profile may further include the device identifier of the terminal device associated with the user, and the speech recognition method 400 may further include the steps of: receiving the device identifier of the terminal device that sent the audio data, and determining from the identifier whether there is a user associated with that terminal device; if not, storing the audio data.
The specific steps and embodiments of the speech recognition method 400 have been disclosed in detail in the description of the speech recognition system 100 with reference to FIGS. 1 to 3, and are not repeated here.
In addition, an embodiment of the present invention provides a user identification method, including the steps of: receiving audio data that includes a first voice; determining whether there is a user matching the first voice; if there is no user matching the first voice, storing the audio data; and clustering the stored pieces of audio data so as to determine a new user from them and perform behavior analysis on the new user. The processing of the steps of this user identification method can, for example, be the same as that of the corresponding steps of the speech recognition method 400 described above with reference to FIG. 4, achieving similar technical effects, and is not repeated here.
In summary, in the speech recognition scheme according to embodiments of the present invention, the stored pieces of audio data are clustered in order to determine a new user and generate that user's voiceprint, so that the user can subsequently be identified from the voiceprint and the user's behavioral preferences analyzed from the user's commands, enabling more precise personalized services. Moreover, the whole process of determining the new user and generating the voiceprint is imperceptible to the user, which spares the user the active registration procedure and improves the user experience.
It should be understood that, in order to streamline the present disclosure and aid understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together into a single embodiment, figure, or description thereof. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art should understand that the modules, units or components of the devices in the examples disclosed herein may be arranged in a device as described in the embodiments, or alternatively located in one or more devices different from the device in the examples. The modules in the foregoing examples may be combined into one module or, furthermore, divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules of the devices in an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units or components of an embodiment can be combined into one module, unit or component, and furthermore divided into multiple sub-modules, sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, an equivalent or a similar purpose.
Furthermore, those skilled in the art will understand that although some embodiments described herein include certain features included in other embodiments rather than others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any one of the claimed embodiments can be used in any combination.
Furthermore, some of the embodiments are described herein as methods, or combinations of method elements, that can be implemented by a processor of a computer system or by other means of carrying out the described functions. Thus, a processor with the necessary instructions for implementing such a method or method element forms a means for implementing the method or method element. Furthermore, an element of an apparatus embodiment described herein is an example of a means for carrying out the function performed by the element for the purpose of carrying out the invention.
As used herein, unless otherwise specified, the use of the ordinals "first", "second", "third", etc., to describe a common object merely indicates that different instances of like objects are being referred to, and is not intended to imply that the objects so described must be in a given sequence, whether temporally, spatially, in ranking, or in any other manner.
Although the invention has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of the above description, will appreciate that other embodiments can be devised within the scope of the invention thus described. Moreover, it should be noted that the language used in this specification has been chosen principally for readability and instructional purposes, rather than to explain or define the subject matter of the invention. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With respect to the scope of the invention, the disclosure made herein is illustrative rather than restrictive, the scope of the invention being defined by the appended claims.

Claims (27)

  1. A speech recognition method, comprising the steps of:
    receiving audio data comprising a first voice;
    determining whether there is a user matching the first voice;
    if there is no user matching the first voice, storing the audio data; and
    clustering a plurality of pieces of stored audio data so as to determine a new user from the plurality of pieces of audio data.
  2. The method of claim 1, wherein the user corresponds to a user profile, the user profile comprises a voiceprint of the user, and the step of determining whether there is a user matching the first voice comprises:
    determining whether the first voice matches the voiceprint of the user, so as to determine whether there is a user matching the first voice.
  3. The method of claim 2, wherein the step of clustering a plurality of pieces of stored audio data so as to determine a new user from the plurality of pieces of audio data comprises:
    dividing the plurality of pieces of audio data into a plurality of sets based on pairwise similarity scores among the plurality of pieces of audio data;
    determining at least one target set based on the sample density and sample count of the sets, the target set corresponding to the new user; and
    creating a user profile for the new user corresponding to the target set, and generating the new user's voiceprint using at least part of the audio data in the target set.
  4. The method of claim 3, wherein the step of generating the new user's voiceprint using at least part of the audio data in the target set comprises:
    determining the audio data in the target set used to generate the new user's voiceprint according to the distance to the centroid of the target set.
  5. The method of claim 3, wherein the user profile comprises a user flag indicating whether the user was actively registered, and the step of creating a user profile for the new user corresponding to the target set comprises:
    setting the user flag in the user profile created for the new user corresponding to the target set to non-active registration; and
    the method further comprises the step of:
    when there is a user matching the first voice and the corresponding user flag indicates that the user was not actively registered, recording the number of pieces of audio data from the user.
  6. The method of claim 5, further comprising the step of:
    after recording the number of pieces of audio data from the user, determining whether that number reaches a specific number within a specific time period, and, if not, deleting the user profile corresponding to the user.
  7. The method of claim 2, wherein the user profile further comprises the device identifier of the terminal device associated with the user, and the method comprises the steps of:
    receiving the device identifier of the terminal device that sent the audio data;
    determining, based on the device identifier, whether there is a user associated with the terminal device; and
    if not, storing the audio data.
  8. The method of claim 1, further comprising the step of:
    when there is a user matching the first voice, storing the command corresponding to the first voice in association with the user.
  9. The method of any one of claims 1-8, further comprising the steps of:
    receiving audio data comprising a second voice, the second voice being used for the active registration of a new user;
    creating a user profile for the actively registered new user, and generating the new user's voiceprint using the audio data comprising the second voice; and
    setting the user flag in the user profile created for the actively registered new user to active registration.
  10. The method of claim 9, further comprising the steps of:
    receiving the device identifier of the terminal device that sent the audio data comprising the second voice; and
    storing the device identifier in the corresponding user profile in association with the actively registered new user.
  11. The method of any one of claims 2-10, wherein the step of determining whether the first voice matches the voiceprint of the user comprises:
    extracting speech features of the first voice from the audio data comprising the first voice;
    obtaining a similarity score between the first voice and the user's voiceprint based on the speech features of the first voice; and
    determining, according to the similarity score, whether the first voice matches the user's voiceprint.
  12. A user identification method, comprising the steps of:
    receiving audio data comprising a first voice;
    determining whether there is a user matching the first voice;
    if there is no user matching the first voice, storing the audio data; and
    clustering a plurality of pieces of stored audio data so as to determine a new user from the plurality of pieces of audio data, and performing behavior analysis on the new user.
  13. A speech recognition apparatus, comprising:
    a communication module adapted to receive audio data comprising a first voice;
    a speech recognition module adapted to determine whether there is a user matching the first voice and, if there is no user matching the first voice, to store the audio data in an audio storage module;
    the audio storage module, adapted to store the audio data; and
    a user discovery module adapted to cluster a plurality of pieces of audio data stored in the audio storage module so as to determine a new user from the plurality of pieces of audio data.
  14. The apparatus of claim 13, wherein the user corresponds to a user profile, the user profile comprises a voiceprint of the user, and the speech recognition module is adapted to
    determine whether the first voice matches the voiceprint of the user, so as to determine whether there is a user matching the first voice.
  15. The apparatus of claim 14, wherein the user discovery module is adapted to
    divide the plurality of pieces of audio data into a plurality of sets based on pairwise similarity scores among the plurality of pieces of audio data;
    determine at least one target set based on the sample density and sample count of the sets, the target set corresponding to the new user; and
    create a user profile for the new user corresponding to the target set, and generate the new user's voiceprint using at least part of the audio data in the target set.
  16. The apparatus of claim 15, wherein the user discovery module is adapted to
    determine the audio data in the target set used to generate the new user's voiceprint according to the distance to the centroid of the target set.
  17. The apparatus of claim 15, wherein the user profile comprises a user flag indicating whether the user was actively registered, and the user discovery module is adapted to
    set the user flag in the user profile created for the new user corresponding to the target set to non-active registration; and
    the speech recognition module is further adapted to,
    when there is a user matching the first voice and the corresponding user flag indicates that the user was not actively registered, record the number of pieces of audio data from the user.
  18. The apparatus of claim 17, wherein the speech recognition module is adapted to,
    after recording the number of pieces of audio data from the user, determine whether that number reaches a specific number within a specific time period, and, if not, delete the user profile corresponding to the user.
  19. The apparatus of claim 14, wherein the user profile comprises the device identifier of the terminal device associated with the user,
    the communication module is further adapted to receive the device identifier of the terminal device that sent the audio data, and the speech recognition module is further adapted to
    determine, based on the device identifier, whether there is a user associated with the terminal device, and,
    if not, store the audio data in the audio storage module.
  20. The apparatus of claim 13, wherein the speech recognition module is further adapted to,
    when there is a user matching the first voice, store the command corresponding to the first voice in association with the user.
  21. The apparatus of any one of claims 13-19, wherein the communication module is further adapted to receive audio data comprising a second voice, the second voice being used for the active registration of a new user; and the apparatus further comprises:
    a user registration module adapted to create a user profile for the actively registered new user, generate the new user's voiceprint using the audio data comprising the second voice, and set the user flag in the user profile created for the actively registered new user to active registration.
  22. The apparatus of claim 21, wherein the communication module is further adapted to receive the device identifier of the terminal device that sent the audio data comprising the second voice, and the user registration module is further adapted to store the device identifier in the corresponding user profile in association with the actively registered new user.
  23. The apparatus of any one of claims 14-22, wherein the speech recognition module is further adapted to
    extract speech features of the first voice from the audio data comprising the first voice;
    obtain a similarity score between the first voice and the user's voiceprint based on the speech features of the first voice; and
    determine, according to the similarity score, whether the first voice matches the user's voiceprint.
  24. The apparatus of any one of claims 14-23, the apparatus residing in a terminal device, the terminal device being a smart speaker, a television or a washing machine.
  25. A user identification apparatus, comprising:
    a communication module adapted to receive audio data comprising a first voice;
    a speech recognition module adapted to determine whether there is a user matching the first voice and, if there is no user matching the first voice, to store the audio data in an audio storage module;
    the audio storage module, adapted to store the audio data; and
    a user discovery module adapted to cluster a plurality of pieces of audio data stored in the audio storage module so as to determine a new user from the plurality of pieces of audio data, and to perform behavior analysis on the new user.
  26. A speech recognition system, comprising a terminal device and a server, wherein
    the terminal device is adapted to receive a speaker's voice and send audio data comprising the voice to the server, and the speech recognition apparatus of any one of claims 13-24 resides in the server.
  27. A computing device, comprising:
    at least one processor; and
    a memory storing program instructions, wherein the program instructions are configured to be executed by the at least one processor and comprise instructions for performing the speech recognition method of any one of claims 1-11.
PCT/CN2019/115308 2018-11-12 2019-11-04 Speech recognition method and apparatus, and computing device WO2020098523A1

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811340092.2 2018-11-12
CN201811340092.2A CN111179940A 2018-11-12 2018-11-12 Speech recognition method and apparatus, and computing device

Publications (1)

Publication Number Publication Date
WO2020098523A1 (published in Chinese)



Also Published As

Publication number Publication date
TW202018696A 2020-05-16
CN111179940A 2020-05-19


Legal Events

121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19884582; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 19884582; Country of ref document: EP; Kind code of ref document: A1)