WO2019080639A1 - Object recognition method, computer device and computer-readable storage medium - Google Patents

Object recognition method, computer device and computer-readable storage medium

Info

Publication number
WO2019080639A1
WO2019080639A1, PCT/CN2018/103255, CN2018103255W
Authority
WO
WIPO (PCT)
Prior art keywords
information
voice
voiceprint
target object
confidence
Prior art date
Application number
PCT/CN2018/103255
Other languages
English (en)
French (fr)
Inventor
张明远
Original Assignee
Tencent Technology (Shenzhen) Company Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Company Limited
Priority to JP2020522805A priority Critical patent/JP6938784B2/ja
Priority to EP18870826.7A priority patent/EP3614377B1/en
Priority to KR1020197038790A priority patent/KR102339594B1/ko
Publication of WO2019080639A1 publication Critical patent/WO2019080639A1/zh
Priority to US16/663,086 priority patent/US11289072B2/en

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
                    • G06F21/30 Authentication, i.e. establishing the identity or authorisation of security principals
                        • G06F21/31 User authentication
                            • G06F21/32 User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
        • G10 MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L15/00 Speech recognition
                    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
                    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
                        • G10L15/063 Training
                            • G10L2015/0631 Creating reference templates; Clustering
                • G10L17/00 Speaker identification or verification techniques
                    • G10L17/04 Training, enrolment or model building
                    • G10L17/06 Decision making techniques; Pattern matching strategies
                        • G10L17/10 Multimodal systems, i.e. based on the integration of multiple recognition engines or fusion of expert systems
                • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
                    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
                        • G10L21/0208 Noise filtering
                            • G10L21/0216 Noise filtering characterised by the method used for estimating noise
                                • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
                                • G10L2021/02166 Microphone arrays; Beamforming
                • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
                    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters

Definitions

  • The present application relates to the field of computer technologies, and in particular, to an object recognition method, a computer device, and a computer-readable storage medium.
  • Voiceprint recognition can identify a particular speaker from among multiple speakers, and can also determine, from the voiceprint features of a voice, the identity of the speaker to whom that voice belongs. For example, a transcription system in a speech recognition scenario can distinguish all speakers in a scene by voiceprint (for example, distinguishing the judge from the prisoner at a trial by means of voiceprint recognition).
  • Currently, voiceprint recognition is mainly performed by matching the voiceprint features of an acoustic model (for example, intonation, dialect, rhythm, and nasal sound).
  • However, when the voiceprint features of two speakers are similar, the results of voiceprint matching differ only slightly, and it is difficult to distinguish the speakers from the matching results, which affects the accuracy of the voiceprint recognition result.
  • An object recognition method, a computer device, and a computer readable storage medium are provided in accordance with various embodiments of the present application.
  • An object recognition method is implemented on a computer device, the computer device comprising a memory and a processor, the method comprising:
  • a computer apparatus comprising a processor and a memory, the memory storing computer readable instructions, the computer readable instructions being executed by the processor such that the processor performs the following steps:
  • a non-transitory computer readable storage medium storing computer readable instructions, when executed by one or more processors, causes the one or more processors to perform the following steps:
  • FIG. 1 is a schematic diagram of a hardware architecture of an object recognition device according to an embodiment of the present application.
  • FIG. 2 is a system block diagram of an object recognition device according to an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of an object recognition method according to an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of another object recognition method according to an embodiment of the present application.
  • FIG. 5 is a schematic diagram of voice separation based on beamforming according to an embodiment of the present application.
  • FIG. 6 is a schematic flowchart of another object recognition method according to an embodiment of the present application.
  • FIG. 7 is a schematic flowchart of another object recognition method according to an embodiment of the present application.
  • FIG. 8 is a schematic flowchart of another object recognition method according to an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of an object recognition device according to an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of another object recognition device according to an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of an object information acquisition module according to an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of a confidence acquisition module according to an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of a result acquisition module according to an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of a second result obtaining unit according to an embodiment of the present application.
  • FIG. 15 is a schematic structural diagram of a terminal according to an embodiment of the present application.
  • The object recognition method provided by the embodiments of the present application can be applied to voiceprint recognition of sound source objects in a multi-sound-source environment, in which a target object is to be identified. For example, the object recognition device acquires the voice information of the target object in the current voice environment and the orientation information of the target object, then performs voiceprint feature extraction on the voice information based on a trained voiceprint matching model and acquires the voiceprint feature information corresponding to the voice information, and finally obtains the sound confidence corresponding to the voiceprint feature information and acquires the object recognition result of the target object based on the sound confidence, using the orientation information and the voiceprint feature information.
  • By obtaining the object recognition result from the orientation information or the voiceprint feature information as appropriate, the accuracy of the acquired object recognition result is increased.
  • The object recognition device may be a tablet computer, a smart phone, a palmtop computer, a mobile internet device (MID), or other terminal equipment that can integrate a microphone array, or that can receive sound source orientation information sent by a microphone array and has a voiceprint recognition function.
  • The hardware architecture of the object recognition device may be as shown in FIG. 1, where the audio processor is used for noise reduction and direction location, the system processor is used for connecting to the cloud and performing voiceprint feature analysis, and the storage system is used for storing the object recognition application.
  • The system block diagram of the object recognition device may be as shown in FIG. 2, where the microphone array can identify the voice information corresponding to sound sources in different directions and perform angular positioning of the different sound sources.
  • FIG. 3 is a schematic flowchart of an object identification method according to an embodiment of the present application.
  • the object recognition method may include the following steps S101 to S103.
  • the object recognition device may acquire voice information of the target object in the current voice environment based on the microphone array, and acquire the orientation information of the target object based on the microphone array.
  • The target object may be a valid sound source object in the current voice environment (for example, the judge, a lawyer, the defendant, or the plaintiff in a trial). It should be noted that what the object recognition device acquires in the current voice environment is a voice information set.
  • The voice information in the voice information set may be the voice information of the target object or other non-essential voice information (for example, the voice of spectators in the courtroom or noise emitted by other objects). After the object recognition device acquires the voice information set in the current voice environment, it can filter the set to obtain the voice information of the target object.
  • The microphone array can acquire the voice information of the same target object collected from different orientations through multiple microphones. Since the microphones are at different positions in the array, each microphone acquires the voice with its own phase.
  • The orientation information of the target object (that is, the position information of the target object in the current voice environment) can then be calculated by beamforming from the acquired phase information, as illustrated in the sketch below.
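  • The following is a minimal Python sketch of estimating a speaker's direction from inter-microphone phase (time) differences. The two-microphone geometry, the 8 cm spacing, and the use of GCC-PHAT are illustrative assumptions; the patent itself only specifies that orientation is derived from phase information by beamforming.

      import numpy as np

      SPEED_OF_SOUND = 343.0  # m/s at room temperature

      def gcc_phat_delay(sig, ref, fs):
          """Estimate the time delay of sig relative to ref via GCC-PHAT."""
          n = len(sig) + len(ref)
          cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
          cross /= np.abs(cross) + 1e-12        # PHAT weighting: keep phase only
          cc = np.fft.irfft(cross, n=n)
          max_shift = n // 2
          cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
          return (np.argmax(np.abs(cc)) - max_shift) / fs

      def doa_degrees(sig, ref, fs, mic_spacing=0.08):
          """Convert the inter-microphone delay into an arrival angle (far field)."""
          tau = gcc_phat_delay(sig, ref, fs)
          cos_theta = np.clip(tau * SPEED_OF_SOUND / mic_spacing, -1.0, 1.0)
          return np.degrees(np.arccos(cos_theta))

      # Toy usage: white noise reaching the second microphone two samples late.
      fs = 16000
      mic1 = np.random.default_rng(0).standard_normal(fs)
      mic2 = np.roll(mic1, 2)
      print(f"estimated angle: {doa_degrees(mic2, mic1, fs):.1f} degrees")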
  • the object recognition device may perform voiceprint feature extraction on the voice information based on the trained voiceprint matching model.
  • The voiceprint matching model may be a model established by training with a certain training algorithm (for example, a neural network method, a hidden Markov method, or a VQ clustering method) on each voiceprint training voice in a pre-acquired voiceprint training voice set and the sample feature information corresponding to each voiceprint training voice.
  • The speakers who provide the voices in the voiceprint training voice set may be random test subjects; no specific target object is defined.
  • The sample feature information corresponding to a voiceprint training voice may be the voiceprint feature information of that voiceprint training voice; a VQ-clustering variant of such training is sketched below.
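  • A minimal sketch of the VQ-clustering option named above: one codebook is learned per training speaker, and a probe utterance is matched by quantization distortion. The feature dimensionality, the codebook size, and the synthetic stand-in features are illustrative assumptions.

      import numpy as np
      from sklearn.cluster import KMeans

      def train_vq_codebooks(training_set, codebook_size=16):
          """training_set: dict speaker_id -> (n_frames, n_features) array."""
          codebooks = {}
          for speaker, feats in training_set.items():
              km = KMeans(n_clusters=codebook_size, n_init=4, random_state=0)
              km.fit(feats)
              codebooks[speaker] = km.cluster_centers_
          return codebooks

      def distortion(feats, codebook):
          """Mean distance from each frame to its nearest codeword."""
          d = np.linalg.norm(feats[:, None, :] - codebook[None, :, :], axis=2)
          return d.min(axis=1).mean()

      # Toy usage with synthetic features standing in for real voiceprints.
      rng = np.random.default_rng(0)
      train = {"spk_a": rng.normal(0.0, 1.0, (200, 13)),
               "spk_b": rng.normal(3.0, 1.0, (200, 13))}
      books = train_vq_codebooks(train)
      probe = rng.normal(3.0, 1.0, (50, 13))           # resembles spk_b
      scores = {s: distortion(probe, cb) for s, cb in books.items()}
      print(min(scores, key=scores.get))               # -> spk_b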
  • the object recognition device may acquire the voiceprint feature information corresponding to the voice information obtained after the voiceprint feature extraction.
  • The voiceprint feature information may be the distinguishing feature information in the voice information of the target object, for example, information such as the spectrum, cepstrum, formants, pitch, and reflection coefficients; a sketch of extracting such features follows.
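  • A minimal sketch of extracting two of the feature types listed above (cepstral features and pitch). The use of librosa, the 16 kHz sample rate, and the pooling into a single fixed-length vector are illustrative choices, not details from the patent.

      import numpy as np
      import librosa

      def voiceprint_features(wav_path):
          y, sr = librosa.load(wav_path, sr=16000)            # mono, 16 kHz
          mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # cepstral features
          f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)       # pitch track
          # Summarize the utterance as one fixed-length vector for matching.
          return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1),
                                 [np.nanmean(f0)]])

      # Usage (the path is hypothetical): vec = voiceprint_features("target.wav")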
  • the object recognition device may acquire the sound confidence corresponding to the voiceprint feature information.
  • the sound confidence can indicate the degree of trust of the correspondence between the voiceprint feature information and the target object. For example, when the sound confidence level is 90%, the reliability of the target object identified based on the voiceprint feature information corresponding to the voice confidence level is 90%.
  • In an embodiment, the object recognition device may match the voiceprint feature information against the sample feature information corresponding to the voiceprint training voices, obtain the matching degree value at which the feature matching degree is the highest, and then determine the sound confidence corresponding to the voiceprint feature information according to that matching degree value.
  • For example, after the voiceprint feature information is matched against the sample feature information corresponding to each voiceprint training voice in the voiceprint training voice set, if the sample feature information of voiceprint training voice A is detected to have the highest matching degree with the voiceprint feature information, and that highest value is 90%, the object recognition device can determine that the sound confidence corresponding to the voiceprint feature information is 90%, as in the sketch below.
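  • A minimal sketch of deriving the sound confidence as the highest matching degree against the enrolled sample feature vectors. Cosine similarity as the matching measure is an assumption; the patent does not fix the measure.

      import numpy as np

      def sound_confidence(probe, sample_features):
          """sample_features: dict speaker_id -> 1-D voiceprint feature vector."""
          def cosine(a, b):
              return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
          scores = {spk: cosine(probe, vec) for spk, vec in sample_features.items()}
          best = max(scores, key=scores.get)
          return best, scores[best]   # e.g. ("voice_a", 0.9) -> 90% confidence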
  • the object recognition device may generate the object recognition result for the target object using the voiceprint feature information, and the object recognition result may indicate the target object to which the voice information of the target object belongs.
  • When there are at least two target objects in the current voice environment, the object recognition device may classify the voice information of the target objects by their voiceprint feature information (for example, classifying the speech of all target objects recorded by a transcription system at a trial into the judge, the defendant, the plaintiff, and so on).
  • However, when two similar voiceprints exist in the voiceprint feature information, the object recognition device may be unable to obtain the object recognition result of the target object accurately from those two similar voiceprint features alone.
  • In an embodiment, the object recognition device can acquire the object recognition result of the target object based on the sound confidence, using the orientation information and the voiceprint feature information. Specifically, the object recognition device may determine the object identification information used to identify the target object based on the relationship between the sound confidence and preset sound confidence thresholds, and then obtain the object recognition result according to the object identification information.
  • It can be understood that the object identification information may be the orientation information, the voiceprint feature information, or both.
  • Specifically, when the sound confidence is greater than or equal to a first confidence threshold, the object recognition device may determine the voiceprint feature information as the object identification information to be used, and acquire the object recognition result of the target object according to it (that is, the voiceprint feature information is used to identify the target object, and the orientation information does not participate in the recognition and is used only for sound source localization). When the sound confidence is greater than or equal to a second confidence threshold and less than the first confidence threshold, the orientation information and the voiceprint feature information are jointly determined as the object identification information to be used (that is, the target object is identified by voiceprint using the voiceprint feature information, and further identified using the sound source direction located by the orientation information). When the sound confidence is less than the second confidence threshold, the orientation information is determined as the object identification information to be used, and the object recognition result of the target object is acquired according to it (that is, only the sound source direction located by the orientation information is used to identify the target object). This three-way decision is sketched below.
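  • A minimal sketch of the three-way decision just described. The threshold values of 0.9 and 0.5 are taken from the illustrative examples given later in this description (90% and 50%), not fixed requirements.

      def choose_identification_info(confidence, first_threshold=0.9,
                                     second_threshold=0.5):
          """Select which information identifies the target object."""
          if confidence >= first_threshold:
              return "voiceprint"                 # orientation only localizes
          if confidence >= second_threshold:
              return "voiceprint+orientation"     # both identify jointly
          return "orientation"                    # fall back to sound direction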
  • In the embodiments of the present application, the voice information of the target object in the current voice environment and the orientation information of the target object are obtained; voiceprint feature extraction is then performed on the voice information based on the trained voiceprint matching model to obtain the voiceprint feature information corresponding to the voice information; finally, the sound confidence corresponding to the voiceprint feature information is obtained, and the object recognition result of the target object is acquired based on the sound confidence, using the orientation information and the voiceprint feature information.
  • Voiceprint recognition may be used to distinguish among multiple speakers, or to confirm the identity of a particular speaker.
  • For the execution process involving speaker distinction, refer to the embodiment shown in FIG. 4 below; for the execution process involving speaker identity confirmation, refer to the embodiment shown in FIG. 8 below.
  • the object recognition method may include the following steps.
  • S201: Acquire a voiceprint training voice set, and train the established voiceprint matching model based on each voiceprint training voice in the voiceprint training voice set and the sample feature information corresponding to each voiceprint training voice, to generate a trained voiceprint matching model.
  • It can be understood that the object recognition device may acquire the voiceprint training voice set and, based on each voiceprint training voice in the set and its corresponding sample feature information, train the established voiceprint matching model to generate a trained voiceprint matching model. The object recognition device can train the voiceprint matching model by using algorithms such as neural networks, hidden Markov models, or VQ clustering.
  • The speakers who provide the voices in the voiceprint training voice set may be random test subjects, with no specific target object defined, and the sample feature information corresponding to a voiceprint training voice may be the voiceprint feature information of that voice.
  • In an embodiment, the object recognition device may acquire the voice information set in the current voice environment based on the microphone array.
  • The voice information in the voice information set may be the voice information of the target object or other non-essential voice information (for example, the voice of spectators in the courtroom or noise emitted by other objects), where the target object may be a valid sound source object in the current voice environment (for example, the judge, a lawyer, the defendant, or the plaintiff in a trial).
  • The object recognition device may perform filtering processing on the voice information set to obtain the filtered voice information of the target object.
  • The filtering processing may be filtering out noise by noise reduction, removing echoes, filtering according to the characteristics of the voice information of the target object to be processed (loudness, timbre, or other feature information) so that voices other than the target object's are filtered out, or other voice filtering processing; one such step is sketched below.
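  • A minimal sketch of one possible filtering step: keeping the speech band and discarding out-of-band noise. The Butterworth band-pass filter and the 300-3400 Hz band are illustrative choices; the patent does not prescribe a particular filter.

      from scipy.signal import butter, sosfilt

      def speech_bandpass(signal, fs=16000, low=300.0, high=3400.0, order=4):
          """Band-pass the signal to the speech band, suppressing other noise."""
          sos = butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
          return sosfilt(sos, signal)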
  • the microphone array can acquire the phase information corresponding to each voice information in the voice information set while collecting the voice information set.
  • the object recognition device may acquire phase information, and may determine orientation information of the target object based on the voice orientation indicated by the phase information.
  • The phase in the phase information indicates the position on the speech waveform of the voice information at a given instant; it is a measure of how far the waveform of the speech signal has advanced, is usually expressed in degrees (angle), and is also referred to as the phase angle.
  • The microphone array can acquire the voice information of the same target object collected from different orientations through multiple microphones. Since the microphones are at different positions in the array, each microphone acquires the voice with its own phase.
  • The orientation information of the target object (that is, the position information of the target object in the current voice environment) can be calculated by beamforming from the acquired phase information. The manner of beamforming is as shown in FIG. 5.
  • That is, voice extraction or separation may be performed by forming a sound beam toward the sound source in each direction while suppressing sounds from other directions, as in the sketch below.
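  • A minimal sketch of steering a beam toward one direction while attenuating others, using delay-and-sum beamforming on a uniform linear array. The array geometry, the spacing, and the delay-and-sum method are assumptions; the patent only describes beamforming in general.

      import numpy as np

      def delay_and_sum(mics, fs, angle_deg, spacing=0.08, c=343.0):
          """mics: (n_mics, n_samples) array; steer the beam to angle_deg."""
          n_mics, n_samples = mics.shape
          freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
          out = np.zeros(n_samples)
          for m in range(n_mics):
              # Expected extra travel time to microphone m for this direction.
              tau = m * spacing * np.cos(np.radians(angle_deg)) / c
              # Advance the signal by tau in the frequency domain to align it.
              spectrum = np.fft.rfft(mics[m]) * np.exp(2j * np.pi * freqs * tau)
              out += np.fft.irfft(spectrum, n=n_samples)
          return out / n_mics   # coherent for the steered direction only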
  • S204: Perform voiceprint feature extraction on the voice information based on the trained voiceprint matching model, and obtain the voiceprint feature information corresponding to the voice information after the voiceprint feature extraction.
  • It can be understood that the object recognition device may perform voiceprint feature extraction on the voice information based on the trained voiceprint matching model, and obtain the voiceprint feature information corresponding to the voice information after the voiceprint feature extraction.
  • The voiceprint feature information may be the distinguishing feature information in the voice information of the target object, for example, information such as the spectrum, cepstrum, formants, pitch, and reflection coefficients.
  • S205 Match the voiceprint feature information with the sample feature information corresponding to the voiceprint training voice, and obtain a matching degree value when the feature matching degree is the highest.
  • the object recognition device may match the voiceprint feature information with the sample feature information corresponding to the voiceprint training voice, and obtain the matching degree value when the feature matching degree is the highest.
  • It can be understood that different people have different voiceprint characteristics, and even the voiceprint characteristics of the same person may vary with the speaker's physical condition or environment. Therefore, when the voiceprint feature information is matched against the sample feature information corresponding to each voiceprint training voice in the voiceprint training voice set, the resulting matching degree values will be larger or smaller, but by comparing all of the matching degree values, the value at which the feature matching degree is highest can be obtained.
  • the object recognition device may determine the sound confidence corresponding to the voiceprint feature information according to the matching degree value.
  • The sound confidence can indicate the degree of trust in the correspondence between the voiceprint feature information and the target object. For example, when the sound confidence is 90%, the target object identified based on the corresponding voiceprint feature information is 90% trustworthy.
  • In an embodiment, the object recognition device may directly determine the sound confidence corresponding to the voiceprint feature information from the matching degree value. For example, after the voiceprint feature information is matched against the sample feature information corresponding to each voiceprint training voice in the voiceprint training voice set, if the sample feature information of voiceprint training voice A is detected to have the highest matching degree with the voiceprint feature information, and that highest value is 90%, the object recognition device can determine that the sound confidence corresponding to the voiceprint feature information is 90%.
  • the object recognition device may generate the object recognition result for the target object using the voiceprint feature information.
  • The object recognition result may indicate the target object to which the voice information of the target object belongs.
  • When there are at least two target objects in the current voice environment, the object recognition device may classify the voice information of the target objects by their voiceprint feature information (for example, classifying the speech of all target objects recorded by a transcription system at a trial into the judge, the defendant, the plaintiff, and so on).
  • However, when two similar voiceprints exist in the voiceprint feature information, the object recognition device may be unable to obtain the object recognition result of the target object accurately from those two similar voiceprint features alone.
  • the object recognition device may determine the adopted object identification information in the orientation information and the voiceprint feature information based on the relationship between the voice confidence level and the preset sound confidence threshold, and acquire the object recognition of the target object according to the object identification information. result.
  • The preset sound confidence thresholds may be obtained empirically from multiple recognition processes, and at least two preset sound confidence thresholds may be included.
  • the object identification information may be used to identify the target object, and may include orientation information or voiceprint feature information.
  • In an embodiment, determining the object identification information to be used from among the orientation information and the voiceprint feature information, and obtaining the object recognition result of the target object according to the object identification information, may include the following steps, as shown in FIG. 6:
  • When the sound confidence is greater than or equal to the first confidence threshold, the voiceprint feature information is determined as the object identification information to be used, and the object recognition result of the target object is acquired according to the object identification information.
  • It can be understood that in this case the object recognition device may determine the voiceprint feature information as the object identification information to be used and then identify the target object using the voiceprint feature information; the orientation information does not participate in the recognition and is used only for sound source localization.
  • In an embodiment, the first confidence threshold may be set to 90%, 95%, or another value determined according to the actual situation.
  • When the sound confidence is greater than or equal to the second confidence threshold and less than the first confidence threshold, the object recognition device may jointly determine the orientation information and the voiceprint feature information as the object identification information to be used; the voiceprint feature information is then used to perform voiceprint recognition to preliminarily identify the target object, and at the same time the sound source direction located by the orientation information is used to further identify the target object.
  • In an embodiment, the first confidence threshold may be set to 90%, 95%, or another value determined according to the actual situation, and the second confidence threshold may be set to 50%, 55%, 60%, or another value determined according to the actual situation; these values are merely representative.
  • When the sound confidence is less than the second confidence threshold, the orientation information is determined as the object identification information to be used, and the object recognition result of the target object is acquired according to the object identification information.
  • It can be understood that in this case the object recognition device may determine the orientation information as the object identification information to be used, and then identify the target object using the sound source direction located by the orientation information, thereby separating the voices of different speakers in the same voice environment. It can be understood that when the orientation information is used as the object identification information, an error within an allowable range may exist in the recognition process.
  • By determining the object identification information for object recognition according to the sound confidence, the processing of non-essential information during object recognition is avoided, and the efficiency of object recognition is improved.
  • In an embodiment, jointly determining the orientation information and the voiceprint feature information as the object identification information to be used, and obtaining the object recognition result of the target object according to the object identification information, may include the following steps, as shown in FIG. 7:
  • the orientation information and the voiceprint feature information are jointly determined as the adopted object identification information.
  • the object recognition device may jointly determine the orientation information and the voiceprint feature information as the adopted object identification information.
  • A candidate recognition result of the target object may be acquired according to the voiceprint feature information.
  • When the voiceprint features of the target objects are clearly distinguishable, the candidate recognition result may already be the final object recognition result of the target object; that is, the object recognition device can accurately classify the multiple pieces of voice information.
  • However, if, for example, the voiceprint feature information of judge A and prisoner B is very similar, then when classifying their voices the object recognition device may classify the voice information of judge A as that of prisoner B, or classify the voice information of prisoner B as that of judge A.
  • The orientation information is then used to locate the object recognition result of the target object from the candidate recognition result.
  • It can be understood that the object recognition device may use the sound source direction located by the orientation information to further pinpoint the object recognition result of the target object from the candidate recognition result; that is, the object recognition device can adjust the candidate recognition result and finally determine the object recognition result of the target object.
  • For example, where the voiceprint feature information of judge A and prisoner B is quite similar, the object recognition device can, according to the positions of judge A and prisoner B, correct the inaccurately classified voice information in the candidate recognition result and classify the voices of the two accurately, as in the sketch below.
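  • A minimal sketch of correcting a voiceprint-based candidate label with the orientation information, as in the judge-A / prisoner-B example above. The seat angles and the tolerance are hypothetical values for illustration.

      def refine_with_orientation(candidate, measured_angle, seat_angles,
                                  tolerance=15.0):
          """seat_angles: dict speaker_id -> known direction in degrees."""
          # Keep the voiceprint-based candidate if the direction agrees...
          if abs(seat_angles.get(candidate, 1e9) - measured_angle) <= tolerance:
              return candidate
          # ...otherwise reassign to the speaker whose seat is nearest in angle.
          return min(seat_angles,
                     key=lambda s: abs(seat_angles[s] - measured_angle))

      seats = {"judge_a": 30.0, "prisoner_b": 150.0}
      print(refine_with_orientation("prisoner_b", 32.0, seats))   # -> judge_a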
  • The object recognition result of the target object is determined using the orientation information and the voiceprint feature information together, which further increases the accuracy of the obtained object recognition result.
  • In the object recognition method of this embodiment, the voice information of the target object in the current voice environment and the orientation information of the target object are obtained; the voiceprint feature of the voice information is then extracted based on the trained voiceprint matching model to obtain the voiceprint feature information corresponding to the voice information; finally, the sound confidence corresponding to the voiceprint feature information is obtained, and the object recognition result of the target object is acquired based on the sound confidence, using the orientation information and the voiceprint feature information.
  • Determining the object identification information by the sound confidence avoids the processing of non-essential information during object recognition and improves the efficiency of object recognition; and determining the object recognition result of the target object from the orientation information and the voiceprint feature information together further improves the accuracy of the obtained object recognition result.
  • FIG. 8 is a schematic flowchart of another object recognition method according to an embodiment of the present application. As shown in FIG. 8, the method in this embodiment of the present application may include the following steps.
  • S501: Acquire a voiceprint training voice set that includes training voice of the target object, and train the established voiceprint matching model based on each voiceprint training voice in the voiceprint training voice set and the sample feature information corresponding to each voiceprint training voice, to generate a trained voiceprint matching model.
  • It can be understood that voiceprint recognition can confirm the identity of the speaker to whom a piece of voice information belongs. The difference from distinguishing the voice information of multiple speakers lies in the establishment process of the voiceprint matching model.
  • In an embodiment, the object recognition device may acquire the voiceprint training voice set that includes the training voice of the target object and, based on each voiceprint training voice in the set and its corresponding sample feature information, train the established voiceprint matching model to generate a trained voiceprint matching model.
  • The object recognition device can train the voiceprint matching model using an algorithm such as a neural network, hidden Markov models, or VQ clustering. Unlike in step S201, the speakers who provide the voices in the voiceprint training voice set here must include the target object, and the sample feature information corresponding to a voiceprint training voice may be the voiceprint feature information of that voice; an enrollment-and-verification sketch follows.
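  • A minimal sketch of the identity-confirmation variant: the target object's own voice is enrolled (included in the training set), and a probe is verified against the enrolled template. The mean-vector template, the cosine score, and the 0.9 acceptance threshold are assumptions.

      import numpy as np

      class SpeakerVerifier:
          def __init__(self):
              self.templates = {}                      # speaker_id -> mean vector

          def enroll(self, speaker_id, feature_vectors):
              """feature_vectors: (n_utterances, n_features) enrollment data."""
              self.templates[speaker_id] = np.mean(feature_vectors, axis=0)

          def verify(self, speaker_id, probe, threshold=0.9):
              """Return (accepted, score) for a probe voiceprint vector."""
              t = self.templates[speaker_id]
              score = float(np.dot(t, probe) /
                            (np.linalg.norm(t) * np.linalg.norm(probe)))
              return score >= threshold, score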
  • S504: Perform voiceprint feature extraction on the voice information based on the trained voiceprint matching model, and obtain the voiceprint feature information corresponding to the voice information after the voiceprint feature extraction.
  • S505: Match the voiceprint feature information against the sample feature information corresponding to the voiceprint training voices, and obtain the matching degree value at which the feature matching degree is the highest.
  • S507. Determine, according to the relationship between the voice confidence level and the preset sound confidence threshold, the object identification information used in the orientation information and the voiceprint feature information, and acquire the object recognition result of the target object according to the object identification information;
  • In an embodiment, the object recognition device may generate the object recognition result for the target object using the voiceprint feature information, and the object recognition result may indicate the identity information of the target object to which the voice information belongs.
  • When there are at least two target objects in the current voice environment, the object recognition device may use the voiceprint feature information of the target objects to determine the target object corresponding to each piece of voice information and to determine the identity information of each target object (for example, when classifying the speech of all target objects recorded by the transcription system at a trial into the judge, the lawyer, and the plaintiff, it can be determined that voice A belongs to the judge, voice B belongs to the lawyer, voice C belongs to the plaintiff, and so on).
  • However, when two similar voiceprints exist in the voiceprint feature information, the object recognition device may be unable to obtain the object recognition result of the target object accurately from those two similar voiceprint features alone.
  • the object recognition device may determine the adopted object identification information in the orientation information and the voiceprint feature information based on the relationship between the voice confidence level and the preset sound confidence threshold, and acquire the object recognition of the target object according to the object identification information. result.
  • In an embodiment, determining the object identification information to be used from among the orientation information and the voiceprint feature information, and obtaining the object recognition result of the target object according to the object identification information, may include the following steps; for details, refer to the process shown in FIG. 6.
  • When the sound confidence is greater than or equal to the first confidence threshold, the voiceprint feature information is determined as the object identification information to be used, and the object recognition result of the target object is acquired according to the object identification information.
  • It can be understood that when the sound confidence is greater than or equal to the first confidence threshold, the identity information of the target object determined from the voiceprint feature information is highly reliable, and the object recognition device may determine the voiceprint feature information as the object identification information to be used and then confirm the identity information of the target object using the voiceprint feature information; the orientation information does not participate in the identity confirmation and is used only for sound source localization.
  • In an embodiment, the first confidence threshold may be set to 90%, 95%, or another value determined according to the actual situation.
  • When the sound confidence is greater than or equal to the second confidence threshold and less than the first confidence threshold, the object recognition device may jointly determine the orientation information and the voiceprint feature information as the object identification information to be used; the voiceprint feature information is then used to perform voiceprint recognition to preliminarily determine the identity of the target object, and at the same time the sound source direction located by the orientation information is used to further confirm the identity of the target object.
  • In an embodiment, the first confidence threshold may be set to 90%, 95%, or another value determined according to the actual situation, and the second confidence threshold may be set to 50%, 55%, 60%, or another value determined according to the actual situation; these values are merely representative.
  • When the sound confidence is less than the second confidence threshold, the orientation information is determined as the object identification information to be used, and the object recognition result of the target object is acquired according to the object identification information.
  • It can be understood that when the sound confidence is less than the second confidence threshold, the identity of the target object determined from the voiceprint feature information is not sufficiently reliable; that is, the accuracy of the identity identified by the voiceprint feature information is low.
  • In this case, the object recognition device may determine the orientation information as the object identification information to be used, and then determine the identity of the target object using the sound source direction located by the orientation information, thereby separating the voices of different speakers in the same voice environment. It can be understood that when the orientation information is used as the object identification information, an error within an allowable range may exist in the recognition process.
  • It should be noted that when the orientation information is used for identification, the current voice environment needs to be a specific voice environment in which the position of each target object is fixed (for example, at a trial, the positions of the judge and the prisoner are fixed).
  • By determining the object identification information for object recognition according to the sound confidence, the processing of non-essential information during object recognition is avoided, and the efficiency of object recognition is improved.
  • In an embodiment, jointly determining the orientation information and the voiceprint feature information as the object identification information to be used, and obtaining the object recognition result of the target object according to the object identification information, may include the following steps; for details, refer to the process shown in FIG. 7:
  • the orientation information and the voiceprint feature information are jointly determined as the adopted object identification information.
  • A candidate recognition result of the target object may be acquired according to the voiceprint feature information.
  • When the voiceprint features of the target objects are clearly distinguishable, the candidate recognition result may already be the final object recognition result of the target object; that is, the object recognition device can clearly identify the voice information of each target object from the multiple pieces of voice information. When there are at least two target objects whose voiceprint feature information cannot be clearly distinguished, the correspondence between the target objects and the voice information indicated by the candidate recognition result may be inaccurate. For example, the voiceprint feature information of judge A and prisoner B may be very similar.
  • In that case, when the object recognition device recognizes the voice information of judge A among the multiple voice messages of the trial, the voice information of prisoner B may be mistaken for judge A's, or the voice information of judge A may be taken for prisoner B's.
  • The orientation information is then used to locate the object recognition result of the target object from the candidate recognition result.
  • It can be understood that the object recognition device may use the sound source direction located by the orientation information to further pinpoint the object recognition result of the target object from the candidate recognition result; that is, the object recognition device can adjust the candidate recognition result and finally determine the object recognition result of the target object.
  • For example, where the voiceprint feature information of judge A and prisoner B is quite similar and the candidate recognition result indicates that the voice information of judge A corresponds to prisoner B, the object recognition device can, based on the position information of judge A, correctly associate the voice information of judge A with judge A.
  • The object recognition result of the target object is determined using the orientation information and the voiceprint feature information together, which further increases the accuracy of the obtained object recognition result.
  • In the object recognition method of this embodiment, the voice information of the target object in the current voice environment and the orientation information of the target object are obtained; the voiceprint feature of the voice information is then extracted based on the trained voiceprint matching model to obtain the voiceprint feature information corresponding to the voice information; finally, the sound confidence corresponding to the voiceprint feature information is obtained, and the object recognition result of the target object is acquired based on the sound confidence, using the orientation information and the voiceprint feature information.
  • Determining the object identification information by the sound confidence avoids the processing of non-essential information during object recognition and improves the efficiency of object recognition; and determining the object recognition result of the target object from the orientation information and the voiceprint feature information together further improves the accuracy of the obtained object recognition result.
  • The object recognition device shown in FIG. 9 to FIG. 14 is used to perform the methods of the embodiments shown in FIG. 3 to FIG. 8. For convenience of description, only the parts related to the embodiments of the present application are shown; for specific technical details not disclosed, refer to the embodiments shown in FIG. 3 to FIG. 8 of the present application.
  • FIG. 9 is a schematic structural diagram of an object recognition device according to an embodiment of the present application.
  • the object recognition device 1 of the embodiment of the present application may include: an object information acquisition module 11 , a feature information acquisition module 12 , a confidence degree acquisition module 13 , and a result acquisition module 14 .
  • the object information obtaining module 11 is configured to acquire voice information of the target object and orientation information of the target object in the current voice environment.
  • the object information acquiring module 11 may acquire the voice information of the target object in the current voice environment based on the microphone array, and acquire the orientation information of the target object based on the microphone array.
  • The target object may be a valid sound source object in the current voice environment (for example, the judge, a lawyer, the defendant, or the plaintiff in a trial).
  • It should be noted that the voice information in the voice information set acquired by the object information acquisition module 11 in the current voice environment may be the voice information of the target object or other non-essential voice information (for example, the voice of spectators in the courtroom or noise emitted by other objects).
  • After the object information acquisition module 11 obtains the voice information set in the current voice environment, it can perform filtering processing on the voice information set to obtain the voice information of the target object.
  • The microphone array can acquire the voice information of the same target object collected from different orientations through multiple microphones. Since the microphones are at different positions in the array, each microphone acquires the voice with its own phase.
  • The orientation information of the target object (that is, the position information of the target object in the current voice environment) can then be calculated by beamforming from the acquired phase information.
  • The feature information acquisition module 12 is configured to perform voiceprint feature extraction on the voice information based on the trained voiceprint matching model, and to obtain the voiceprint feature information corresponding to the voice information after the voiceprint feature extraction.
  • In a specific implementation, the feature information acquisition module 12 may perform voiceprint feature extraction on the voice information based on the trained voiceprint matching model.
  • The voiceprint matching model may be a model established by training with a certain training algorithm (for example, a neural network method, a hidden Markov method, or a VQ clustering method) on each voiceprint training voice in a pre-acquired voiceprint training voice set and the sample feature information corresponding to each voiceprint training voice.
  • The speakers who provide the voices in the voiceprint training voice set may be random test subjects, with no specific target object defined, and the sample feature information corresponding to a voiceprint training voice may be the voiceprint feature information of that voice.
  • The feature information acquisition module 12 may acquire the voiceprint feature information corresponding to the voice information after the voiceprint feature extraction.
  • The voiceprint feature information may be the distinguishing feature information in the voice information of the target object, for example, information such as the spectrum, cepstrum, formants, pitch, and reflection coefficients.
  • the confidence acquisition module 13 is configured to obtain the sound confidence corresponding to the voiceprint feature information.
  • In a specific implementation, the confidence acquisition module 13 may obtain the sound confidence corresponding to the voiceprint feature information.
  • The sound confidence can indicate the degree of trust in the correspondence between the voiceprint feature information and the target object. For example, when the sound confidence is 90%, the target object identified based on the corresponding voiceprint feature information is 90% trustworthy.
  • In an embodiment, the confidence acquisition module 13 may match the voiceprint feature information against the sample feature information corresponding to the voiceprint training voices, obtain the matching degree value at which the feature matching degree is the highest, and then determine the sound confidence corresponding to the voiceprint feature information according to that matching degree value.
  • For example, after the voiceprint feature information is matched against the sample feature information corresponding to each voiceprint training voice in the voiceprint training voice set, if the sample feature information of voiceprint training voice A is detected to have the highest matching degree with the voiceprint feature information, and that highest value is 90%, the object recognition device can determine that the sound confidence corresponding to the voiceprint feature information is 90%.
  • the result obtaining module 14 is configured to acquire the object recognition result of the target object by using the orientation information, the voiceprint feature information, and the sound confidence.
  • the object recognition device 1 can generate the object recognition result for the target object using the voiceprint feature information, and the object recognition result can indicate the target object to which the voice information of the target object belongs.
  • When there are at least two target objects in the current voice environment, the object recognition device may classify the voice information of the target objects by their voiceprint feature information (for example, classifying the speech of all target objects recorded by the transcription system at a trial into the judge, the defendant, the plaintiff, and so on).
  • However, when two similar voiceprints exist in the voiceprint feature information, the object recognition device 1 may be unable to obtain the object recognition result of the target object accurately from those two similar voiceprint features alone.
  • the result obtaining module 14 may acquire the object recognition result of the target object based on the sound confidence and using the orientation information and the voiceprint feature information.
  • the result obtaining module 14 may determine the object identification information used to identify the object recognition result of the target object based on the relationship between the voice confidence level and the preset sound confidence threshold, and then acquire the object recognition result according to the object identification information.
  • the object identification information may be orientation information or voiceprint feature information.
  • In an embodiment, when the sound confidence is greater than or equal to the first confidence threshold, the result acquisition module 14 may determine the voiceprint feature information as the object identification information to be used, and acquire the object recognition result of the target object according to the object identification information (that is, the voiceprint feature information is used to identify the target object, and the orientation information does not participate in the recognition and is used only for sound source localization).
  • When the sound confidence is greater than or equal to the second confidence threshold and less than the first confidence threshold, the orientation information and the voiceprint feature information are jointly determined as the object identification information to be used, and the object recognition result of the target object is obtained according to the object identification information (that is, the voiceprint feature information is used to identify the target object preliminarily, and the sound source direction located by the orientation information is used to identify it further).
  • When the sound confidence is less than the second confidence threshold, the orientation information is determined as the object identification information to be used, and the object recognition result of the target object is obtained according to the object identification information (that is, only the sound source direction located by the orientation information is used to identify the target object).
  • In the embodiment of the present application, the voice information of the target object in the current voice environment and the orientation information of the target object are obtained; voiceprint feature extraction is then performed based on the trained voiceprint matching model to obtain the voiceprint feature information corresponding to the voice information; finally, the sound confidence corresponding to the voiceprint feature information is obtained, and the object recognition result of the target object is acquired based on the sound confidence, using the orientation information and the voiceprint feature information.
  • Voiceprint recognition may be used to distinguish among multiple speakers, or to confirm the identity of a particular speaker.
  • For the execution process involving speaker distinction, refer to the embodiment shown in the figure below.
  • FIG. 10 is a schematic structural diagram of another object identification device according to an embodiment of the present application.
  • As shown in FIG. 10, the object recognition device 1 of the embodiment of the present application may include: an object information acquisition module 11, a feature information acquisition module 12, a confidence acquisition module 13, a result acquisition module 14, and a model generation module 15.
  • the model generation module 15 is configured to obtain a voiceprint training voice set, and to train the established voiceprint matching model based on each voiceprint training voice in the set and the sample feature information corresponding to each voiceprint training voice, so as to generate a trained voiceprint matching model.
  • In a specific implementation, the model generation module 15 may acquire the voiceprint training voice set and train the established voiceprint matching model based on each voiceprint training voice in the set and its corresponding sample feature information, thereby generating the trained voiceprint matching model. It can be understood that the model generation module 15 may train the voiceprint matching model using a neural network, a hidden Markov model, or a VQ clustering algorithm. The speakers who provide the voices in the voiceprint training voice set may be random experimental subjects and are not limited to a specific target object, and the sample feature information corresponding to a voiceprint training voice may be the voiceprint feature information of that voice.
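  • As a concrete illustration, the following is a minimal sketch of how such a voiceprint matching model could be trained. It uses one Gaussian mixture model per speaker as a stand-in for the neural-network, hidden-Markov, or VQ-clustering options the text mentions; the function name, the shape of the training set, and the GMM parameters are assumptions made for the example, not details from the disclosure.

```python
# Hypothetical sketch: train one Gaussian mixture model per training speaker.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_voiceprint_models(training_set):
    """training_set: dict mapping speaker_id -> list of feature matrices,
    where each matrix has shape (num_frames, num_features)."""
    models = {}
    for speaker_id, feature_matrices in training_set.items():
        features = np.vstack(feature_matrices)  # pool all frames for this speaker
        gmm = GaussianMixture(n_components=8, covariance_type="diag",
                              random_state=0)
        gmm.fit(features)  # the trained "voiceprint matching model" for this speaker
        models[speaker_id] = gmm
    return models
```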
  • the object information obtaining module 11 is configured to acquire voice information of the target object and orientation information of the target object in the current voice environment.
  • the object information obtaining module 11 may acquire the voice information of the target object and the orientation information of the target object in the current voice environment.
  • FIG. 11 is a schematic structural diagram of an object information acquiring module according to an embodiment of the present application.
  • the object information obtaining module 11 may include:
  • the information acquiring unit 111 is configured to acquire a voice information set in the current voice environment based on the microphone array, and to perform filtering processing on the voice information set to obtain the voice information of the target object after the filtering.
  • In a specific implementation, the information acquiring unit 111 may acquire the voice information set in the current voice environment based on the microphone array. It can be understood that the voice information in the set may be the voice information of the target object or other non-essential voice information (for example, the voices of spectators in a courtroom or noise emitted by other objects), where the target object may be a valid sound source object in the current voice environment (for example, a judge, a lawyer, a defendant, or a plaintiff in a court trial).
  • Because not all of the voice information in the set belongs to the target object, the information acquiring unit 111 may perform filtering processing on the voice information set to obtain the voice information of the target object after the filtering. The filtering processing may specifically be filtering out noise through noise reduction, removing echoes, or filtering out the voices of non-target objects according to features of the target object's voice information (such as loudness, timbre, or other feature information), or it may be other voice filtering processing.
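  • As a rough illustration of the filtering step, the sketch below keeps only frames whose short-time energy exceeds a threshold, a simple energy-based voice activity gate. The threshold and frame parameters are assumed values for the example, standing in for the noise-reduction or feature-based filtering the text describes.

```python
import numpy as np

def energy_gate(signal, frame_len=512, hop=256, threshold=1e-3):
    """Hypothetical energy-based filter: drop low-energy frames
    (silence or faint background) and keep likely speech frames."""
    kept = []
    for start in range(0, len(signal) - frame_len, hop):
        frame = signal[start:start + frame_len]
        if np.mean(frame ** 2) > threshold:  # short-time energy test
            kept.append(frame)
    return np.concatenate(kept) if kept else np.array([])
```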
  • the information determining unit 112 is configured to acquire the phase information of the microphone array when the voice information set is collected, and to determine the orientation information of the target object based on the voice orientation indicated by the phase information.
  • It can be understood that the microphone array can acquire the phase information corresponding to each piece of voice information in the set while collecting the voice information set. In a specific implementation, the information determining unit 112 may acquire the phase information and determine the orientation information of the target object based on the voice orientation indicated by the phase information. The phase in the phase information can indicate the position on the speech waveform of the voice information at a certain moment; it is a metric describing the change of the speech signal waveform, is usually expressed in degrees (angle), and is also called a phase angle.
  • In one embodiment, the microphone array can acquire, through multiple microphones, the voice information of the same target object collected from different orientations. Because the microphones are at different positions in the array, each microphone can acquire the phase information of the target object according to the magnitude of the sound, and the orientation information of the target object is then calculated from the obtained phase information by beamforming (that is, the position information of the target object in the current voice environment is determined). The beamforming manner is shown in FIG. 5: voice extraction or separation may be performed by forming pickup beams toward sound sources in different directions while suppressing sounds from other directions.
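  • For illustration, the following sketch estimates a source direction from the time difference of arrival between two microphones using cross-correlation, which is one simple way the phase relationship described above can be turned into an angle. The microphone spacing, sample rate, and function name are assumed values for the example rather than details taken from the disclosure.

```python
import numpy as np

def estimate_direction(mic_a, mic_b, sample_rate=16000, mic_distance=0.1,
                       speed_of_sound=343.0):
    """Hypothetical two-microphone direction-of-arrival estimate.
    Cross-correlate the two channels to find the time delay, then
    convert the delay to a bearing relative to the array axis."""
    corr = np.correlate(mic_a, mic_b, mode="full")
    lag = np.argmax(corr) - (len(mic_b) - 1)      # delay in samples
    tau = lag / sample_rate                        # delay in seconds
    # Clamp to the physically possible range before taking arcsin.
    sin_theta = np.clip(tau * speed_of_sound / mic_distance, -1.0, 1.0)
    return np.degrees(np.arcsin(sin_theta))       # bearing in degrees
```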
  • the feature information acquiring module 12 is configured to perform voiceprint feature extraction on the voice information based on the trained voiceprint matching model, and to obtain the voiceprint feature information corresponding to the voice information after the extraction.
  • In a specific implementation, the feature information acquiring module 12 may perform voiceprint feature extraction on the voice information based on the trained voiceprint matching model and acquire the resulting voiceprint feature information. It can be understood that the voiceprint feature information may be the distinguishing feature information in the voice information of the target object, for example, information such as the spectrum, cepstrum, formants, pitch, and reflection coefficients.
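  • As one concrete possibility, cepstral features such as MFCCs are a common realization of the spectrum/cepstrum features listed above. The snippet below extracts them with librosa, under the assumption that MFCCs are an acceptable stand-in for the otherwise unspecified voiceprint features; the parameter values are illustrative only.

```python
# Hypothetical feature extraction: MFCCs as an example of the
# cepstrum-style voiceprint features mentioned in the text.
import librosa

def extract_features(wav_path, n_mfcc=20):
    signal, sr = librosa.load(wav_path, sr=16000)          # mono, 16 kHz
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T   # shape (num_frames, n_mfcc), one feature vector per frame
```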
  • the confidence acquisition module 13 is configured to obtain the voice confidence corresponding to the voiceprint feature information.
  • In a specific implementation, the confidence acquisition module 13 may obtain the voice confidence corresponding to the voiceprint feature information.
  • FIG. 12 is a schematic structural diagram of a confidence acquisition module according to an embodiment of the present application.
  • the confidence acquisition module 13 may include:
  • the matching degree value obtaining unit 131 is configured to match the voiceprint feature information against the sample feature information corresponding to the voiceprint training voices, and to obtain the matching degree value at the highest feature matching degree.
  • In a specific implementation, the matching degree value obtaining unit 131 may match the voiceprint feature information against the sample feature information corresponding to each voiceprint training voice and obtain the matching degree value at the highest feature matching degree. It can be understood that the voiceprint characteristics of different people differ, and even the voiceprint characteristics of the same person may vary with the speaker's physical condition or the environment the speaker is in. Therefore, the matching degree values obtained may be larger or smaller, but the matching degree value at the highest feature matching degree can be obtained by comparing all of the matching degree values.
  • the confidence determination unit 132 is configured to determine the voice confidence corresponding to the voiceprint feature information according to the matching degree value.
  • In a specific implementation, the confidence determination unit 132 may determine the voice confidence corresponding to the voiceprint feature information based on the matching degree value. It can be understood that the voice confidence can indicate the degree of trust in the correspondence between the voiceprint feature information and the target object; for example, a voice confidence of 90% can represent that the target object identified from the corresponding voiceprint feature information is 90% trustworthy.
  • In one embodiment, the confidence determination unit 132 may directly use the matching degree value as the voice confidence corresponding to the voiceprint feature information. For example, after the voiceprint feature information is matched against the sample feature information corresponding to each voiceprint training voice in the voiceprint training voice set, if the sample feature information of voiceprint training voice A is detected to have the highest matching degree with the voiceprint feature information, and that highest value is 90%, the object recognition device can determine that the voice confidence corresponding to the voiceprint feature information is 90%.
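  • The sketch below shows one plausible way to compute such a matching degree and confidence, scoring the extracted features against each per-speaker GMM from the earlier training sketch and normalizing the best score. The softmax-style normalization is an assumption made for the example, since the disclosure does not fix a particular formula.

```python
import numpy as np

def match_confidence(features, models):
    """Hypothetical matching step: score the utterance's features against
    every trained speaker model and return the best match plus a
    confidence derived from the normalized scores."""
    speaker_ids = list(models)
    # Average per-frame log-likelihood under each speaker's GMM.
    scores = np.array([models[s].score(features) for s in speaker_ids])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                          # normalize to a distribution
    best = int(np.argmax(probs))
    return speaker_ids[best], float(probs[best])  # (candidate, confidence)
```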
  • the result obtaining module 14 is specifically configured to determine, based on the relationship between the voice confidence and the preset confidence thresholds, the adopted object identification information from among the orientation information and the voiceprint feature information, and to acquire the object recognition result of the target object according to the object identification information.
  • In a specific implementation, the object recognition device 1 may generate the object recognition result for the target object using the voiceprint feature information. The object recognition result may indicate which target object the voice information belongs to. For example, when there are at least two target objects in the current voice environment, the object recognition device may classify the voice information of the target objects by their voiceprint feature information (for example, in a transcription system for a court trial, classifying the speech of all target objects as belonging to the judge, the defendant, the plaintiff, and so on).
  • However, when two similar voiceprint features exist in the voiceprint feature information, the object recognition device may be unable to accurately obtain the object recognition result of the target object from those two similar voiceprint features alone.
  • For the above situation, the result obtaining module 14 may determine the adopted object identification information from among the orientation information and the voiceprint feature information based on the relationship between the voice confidence and the preset confidence thresholds, and acquire the object recognition result of the target object according to the object identification information.
  • It can be understood that the preset confidence thresholds may be obtained from experience over multiple recognition processes and may include at least two preset confidence thresholds. The object identification information is used to identify the target object and may include the orientation information, the voiceprint feature information, or both.
  • the result obtaining module 14 may include the following units, as shown in FIG. 13:
  • the first result obtaining unit 141 is configured to determine the voiceprint feature information as the adopted object identification information when the voice confidence is greater than or equal to the first confidence threshold, and to acquire the object recognition result of the target object according to the object identification information.
  • Specifically, when the voice confidence is greater than or equal to the first confidence threshold, the degree of trust in the correspondence between the voiceprint feature information and the target object is high. The first result obtaining unit 141 may determine the voiceprint feature information as the adopted object identification information and then use the voiceprint feature information to identify the target object, while the orientation information does not participate in the recognition and is used only for sound source localization.
  • In one embodiment, the first confidence threshold may be set to 90%, 95%, or another value determined according to the actual situation.
  • the second result obtaining unit 142 is configured to jointly determine the orientation information and the voiceprint feature information as the adopted object identification information when the voice confidence is greater than or equal to the second confidence threshold and less than the first confidence threshold, and to acquire the object recognition result of the target object according to the object identification information.
  • Specifically, when the voice confidence falls in this range, the degree of trust in the correspondence between the voiceprint feature information and the target object is only average. To identify the target object more accurately, the second result obtaining unit 142 may jointly determine the orientation information and the voiceprint feature information as the adopted object identification information, use the voiceprint feature information to perform voiceprint recognition and preliminarily identify the target object, and at the same time use the sound source direction located by the orientation information to further identify the target object.
  • In one embodiment, the first confidence threshold may be set to 90%, 95%, or another value determined according to the actual situation, and the second confidence threshold may be set to 50%, 55%, 60%, or another value, determined according to the actual situation, that can represent an average level of trust.
  • the third result obtaining unit 143 is configured to determine the orientation information as the adopted object identification information when the voice confidence is less than the second confidence threshold, and to acquire the object recognition result of the target object according to the object identification information.
  • Specifically, when the voice confidence is less than the second confidence threshold, the degree of trust in the correspondence between the voiceprint feature information and the target object is low, and the accuracy of a target object identified by the voiceprint feature information alone would also be low. The third result obtaining unit 143 may therefore determine the orientation information as the adopted object identification information and identify the target object by the direction obtained from sound source localization, thereby realizing the separation of human voices in the same voice environment. It can be understood that when the orientation information is used as the object identification information, an error within an allowable range may exist in the recognition process.
  • In the above manner, the object identification information used for object recognition is determined by the voice confidence, which avoids processing non-essential information during object recognition and improves the efficiency of object recognition.
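  • A minimal sketch of the three-way selection just described is given below. The threshold values mirror the examples in the text (90% and 50%), while the function name, the candidate list, and the rule of letting the located direction pick among the voiceprint candidates in the middle band are assumptions made for illustration.

```python
FIRST_THRESHOLD = 0.90   # example value from the text (90%)
SECOND_THRESHOLD = 0.50  # example value from the text (50%)

def choose_identification(confidence, voiceprint_candidates, orientation_id):
    """Hypothetical selection of the adopted object identification
    information, mirroring the three branches of units 141-143.
    voiceprint_candidates: speaker ids ordered by matching degree."""
    if confidence >= FIRST_THRESHOLD:
        # High trust: the best voiceprint match alone identifies the object.
        return voiceprint_candidates[0]
    if confidence >= SECOND_THRESHOLD:
        # Average trust: keep the voiceprint candidates, but let the
        # located sound direction pick among them when possible.
        if orientation_id in voiceprint_candidates:
            return orientation_id
        return voiceprint_candidates[0]
    # Low trust: identify by the located direction alone.
    return orientation_id
```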
  • In one embodiment, the second result obtaining unit 142 may include the following subunits, as shown in FIG. 14.
  • the information determining sub-unit 1421 is configured to jointly determine the orientation information and the voiceprint feature information as the adopted object identification information when the voice confidence is greater than or equal to the second confidence threshold and less than the first confidence threshold.
  • It can be understood that a voice confidence in this range indicates that the degree of trust in the correspondence between the voiceprint feature information and the target object is average, that is, the trustworthiness of an object recognition result determined from the voiceprint feature information alone is only moderate. In this case, the information determining sub-unit 1421 may jointly determine the orientation information and the voiceprint feature information as the adopted object identification information.
  • the candidate result obtaining sub-unit 1422 is configured to acquire a candidate recognition result of the target object according to the voiceprint feature information.
  • In a specific implementation, the candidate result obtaining sub-unit 1422 may acquire the candidate recognition result of the target object according to the voiceprint feature information. It can be understood that when the voiceprint feature information of the target objects differs significantly, the candidate recognition result may already be the final object recognition result, that is, the object recognition device can accurately classify the multiple pieces of voice information. When there are at least two target objects whose voiceprint feature information is not clearly distinguishable, the classification of the voice information corresponding to the candidate recognition result may be inaccurate.
  • For example, if the voiceprint feature information of judge A and prisoner B is very similar, the object recognition device may, when classifying the voice information of the two, classify the voice information of judge A as that of prisoner B, or classify the voice information of prisoner B as that of judge A.
  • the result obtaining sub-unit 1423 is configured to locate the object recognition result of the target object from the candidate recognition result by using the orientation information.
  • In a specific implementation, while the candidate recognition result is preliminarily identified from the voiceprint feature information, the result obtaining sub-unit 1423 may use the sound source direction located by the orientation information to further locate the object recognition result of the target object from the candidate recognition result; that is, the result obtaining sub-unit 1423 can adjust the candidate recognition result and finally determine the object recognition result of the target object.
  • For example, if the voiceprint feature information of judge A and prisoner B is relatively similar, the object recognition device can, according to the positions of judge A and prisoner B, accurately reclassify the two speakers' voice information from the candidate recognition result, that is, from the inaccurately classified voice information.
  • By identifying the object recognition result of the target object with the orientation information and the voiceprint feature information at the same time, the accuracy of the obtained object recognition result is further increased.
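  • One way to realize this refinement step is sketched below: each candidate speaker is assumed to have a known seating bearing (as in the courtroom example), and the candidate whose bearing is closest to the measured direction wins. The bearing table, tolerance, and function name are invented for the example.

```python
def refine_with_orientation(candidates, measured_bearing, known_bearings,
                            tolerance_deg=15.0):
    """Hypothetical refinement: among the voiceprint candidates, pick the
    one whose known position best matches the located sound direction.
    candidates: list of speaker ids from the voiceprint step.
    known_bearings: dict speaker_id -> expected bearing in degrees."""
    best, best_err = None, float("inf")
    for speaker in candidates:
        err = abs(known_bearings[speaker] - measured_bearing)
        if err < best_err:
            best, best_err = speaker, err
    # Only trust the correction when the direction match is close enough.
    return best if best_err <= tolerance_deg else candidates[0]
```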
  • In the solution of this embodiment, the voice information of the target object and the orientation information of the target object in the current voice environment are obtained, voiceprint feature extraction is performed on the voice information based on the trained voiceprint matching model to obtain the corresponding voiceprint feature information, and finally the voice confidence corresponding to the voiceprint feature information is obtained and the object recognition result of the target object is acquired based on the voice confidence and by using the orientation information and the voiceprint feature information. Determining the object identification information by the voice confidence avoids processing non-essential information during object recognition and improves its efficiency, and identifying the object recognition result with the orientation information and the voiceprint feature information at the same time further improves the accuracy of the obtained object recognition result.
  • In the embodiment shown in FIG. 10, the model generation module 15 is specifically configured to acquire a voiceprint training voice set containing the training voice of the target object, and to train the established voiceprint matching model based on each voiceprint training voice in the set and the sample feature information corresponding to each voiceprint training voice, so as to generate a trained voiceprint matching model.
  • It can be understood that voiceprint recognition can confirm the identity information of the speaker corresponding to a piece of voice information; the difference from distinguishing a target speaker among multiple pieces of voice information lies in the establishment process of the voiceprint matching model.
  • In a specific implementation, the model generation module 15 may acquire a voiceprint training voice set containing the training voice of the target object, and train the established voiceprint matching model based on each voiceprint training voice in the set and its corresponding sample feature information to generate the trained model. The model generation module 15 may likewise use a neural network, a hidden Markov model, or a VQ clustering algorithm for training. Unlike in the first implementation of the module described above, the speakers who provide the voices in the voiceprint training voice set here must include the target object, and the sample feature information corresponding to a voiceprint training voice may be the voiceprint feature information of that voice.
  • the object information obtaining module 11 is configured to acquire voice information of the target object and orientation information of the target object in the current voice environment.
  • the object information obtaining module 11 may acquire the voice information of the target object and the orientation information of the target object in the current voice environment.
  • FIG. 11 is a schematic structural diagram of an object information acquiring module according to an embodiment of the present application.
  • the object information obtaining module 11 may include:
  • the information acquiring unit 111 is configured to acquire a voice information set in the current voice environment based on the microphone array, and perform a filtering process on the voice information set to obtain voice information of the target object after the filtering process.
  • the detailed process of acquiring the voice information of the target object by the information acquiring unit 111 may refer to the description in the foregoing method embodiment, and details are not described herein again.
  • the information determining unit 112 is configured to acquire phase information of the microphone array when collecting the voice information set, and determine the orientation information of the target object based on the voice orientation indicated by the phase information.
  • the detailed process of the information determining unit 112 for acquiring the orientation information of the target object may refer to the description in the foregoing method embodiment, and details are not described herein again.
  • the feature information acquiring module 12 is configured to perform voiceprint feature extraction on the voice information based on the trained voiceprint matching model, and to obtain the voiceprint feature information corresponding to the voice information after the extraction.
  • the detailed process of acquiring the voiceprint feature information by the feature information acquiring module 12 may refer to the description in the foregoing method embodiment, and details are not described herein again.
  • the confidence acquisition module 13 is configured to obtain the voice confidence corresponding to the voiceprint feature information.
  • the confidence acquiring module 13 may obtain the voice confidence corresponding to the voiceprint feature information.
  • FIG. 12 is a schematic structural diagram of a confidence acquisition module according to an embodiment of the present application.
  • the confidence acquisition module 13 may include:
  • the matching degree value obtaining unit 131 is configured to match the voiceprint feature information with the sample feature information corresponding to the voiceprint training voice, and obtain the matching degree value when the feature matching degree is the highest.
  • the detailed process of the matching degree value obtaining unit 131 for obtaining the matching degree value may refer to the description in the foregoing method embodiment, and details are not described herein again.
  • the confidence determination unit 132 is configured to determine the voice confidence corresponding to the voiceprint feature information according to the matching degree value.
  • the result obtaining module 14 is specifically configured to determine, based on the relationship between the voice confidence and the preset confidence thresholds, the adopted object identification information from among the orientation information and the voiceprint feature information, and to acquire the object recognition result of the target object according to the object identification information.
  • It can be understood that the object recognition device 1 can generate the object recognition result for the target object by using the voiceprint feature information, and the object recognition result can indicate the identity information of the target object to which the voice information corresponds. For example, if there are at least two target objects in the current voice environment, the object recognition device 1 may determine, by their voiceprint feature information, the target object corresponding to each piece of voice information and determine the identity information of that target object (for example, after the speech of all target objects in a trial transcription system is classified as judge, defendant, or plaintiff, it can be determined that voice A belongs to the judge, voice B belongs to the defendant, voice C belongs to the plaintiff, and so on).
  • However, when two similar voiceprint features exist in the voiceprint feature information, the object recognition device 1 may be unable to accurately obtain the object recognition result of the target object from those two similar voiceprint features.
  • For the above situation, the result obtaining module 14 may determine the adopted object identification information from among the orientation information and the voiceprint feature information based on the relationship between the voice confidence and the preset confidence thresholds, and acquire the object recognition result of the target object according to the object identification information.
  • In one embodiment, the result obtaining module 14 may include the following units, as shown in FIG. 13:
  • the first result obtaining unit 141 is configured to determine the voiceprint feature information as the adopted object identification information when the voice confidence is greater than or equal to the first confidence threshold, and to acquire the object recognition result of the target object according to the object identification information.
  • Specifically, when the voice confidence is greater than or equal to the first confidence threshold, the identity information of the target object confirmed from the voiceprint feature information is highly trustworthy. The first result obtaining unit 141 may determine the voiceprint feature information as the adopted object identification information and use it to identify the identity information of the target object, while the orientation information does not participate in the identity confirmation and is used only for sound source localization.
  • In one embodiment, the first confidence threshold may be set to 90%, 95%, or another value determined according to the actual situation.
  • the second result obtaining unit 142 is configured to jointly determine the orientation information and the voiceprint feature information as the adopted object identification information when the voice confidence is greater than or equal to the second confidence threshold and less than the first confidence threshold, and to acquire the object recognition result of the target object according to the object identification information.
  • Specifically, when the voice confidence falls in this range, the trustworthiness of the identity information confirmed from the voiceprint feature information is only average. To identify the target object's identity more accurately, the second result obtaining unit 142 may jointly determine the orientation information and the voiceprint feature information as the adopted object identification information, use the voiceprint feature information to perform voiceprint recognition and preliminarily determine the identity of the target object, and at the same time use the sound source direction located by the orientation information to further identify that identity.
  • In one embodiment, the first confidence threshold may be set to 90%, 95%, or another value determined according to the actual situation, and the second confidence threshold may be set to 50%, 55%, 60%, or another value, determined according to the actual situation, that can represent an average level of trust.
  • the third result obtaining unit 143 is configured to determine the orientation information as the adopted object identification information when the voice confidence is less than the second confidence threshold, and to acquire the object recognition result of the target object according to the object identification information.
  • Specifically, when the voice confidence is less than the second confidence threshold, the identity of the target object confirmed from the voiceprint feature information has low trustworthiness, and the accuracy of an identity identified by the voiceprint feature information alone is low. The third result obtaining unit 143 may therefore determine the orientation information as the adopted object identification information and determine the identity of the target object by the direction obtained from sound source localization, thereby realizing the separation of human voices in the same voice environment. It can be understood that when the orientation information is used as the object identification information, an error within an allowable range may exist in the recognition process. It should be noted that in this case the current voice environment needs to be a specific voice environment in which the positions of the target objects are determined (for example, in a court trial, the positions of the judge and the prisoner are fixed).
  • In the above manner, the object identification information used for object recognition is determined by the voice confidence, which avoids processing non-essential information during object recognition and improves the efficiency of object recognition.
  • In one embodiment, the second result obtaining unit 142 may include the following subunits, as shown in FIG. 14:
  • the information determining sub-unit 1421 is configured to jointly determine the orientation information and the voiceprint feature information as the used object identification information when the voice confidence level is greater than or equal to the second confidence threshold and less than the first confidence threshold.
  • the detailed process of determining the object identification information by the information determining sub-unit 1421 can refer to the description in the foregoing method embodiment, and details are not described herein again.
  • the candidate result obtaining sub-unit 1422 is configured to acquire a candidate recognition result of the target object according to the voiceprint feature information.
  • In a specific implementation, the candidate result obtaining sub-unit 1422 can acquire the candidate recognition result of the target object according to the voiceprint feature information. It can be understood that when the voiceprint feature information of the target objects differs significantly, the candidate recognition result may already be the final object recognition result, that is, the object recognition device can clearly identify the target object's voice information among the multiple pieces of voice information. When there are at least two target objects whose voiceprint feature information is not clearly distinguishable, the correspondence between target objects and voice information indicated by the candidate recognition result may be inaccurate.
  • For example, if the voiceprint feature information of judge A and prisoner B is very similar, the object recognition device may, when identifying judge A's voice information among the multiple pieces of voice information from the trial, mistake prisoner B's voice information for judge A's, or mistake judge A's voice information for prisoner B's.
  • the result obtaining sub-unit 1423 is configured to locate the object recognition result of the target object from the candidate recognition result by using the orientation information.
  • In a specific implementation, the result obtaining sub-unit 1423 may use the sound source direction located by the orientation information to further locate the object recognition result of the target object from the candidate recognition result; that is, the result obtaining sub-unit 1423 can adjust the candidate recognition result and finally determine the object recognition result of the target object.
  • For example, if the voiceprint feature information of judge A and prisoner B is relatively similar and the candidate recognition result indicates that judge A's voice information corresponds to prisoner B, the object recognition device can, combined with judge A's position information, associate judge A's voice information with judge A.
  • the object recognition result of the target object is simultaneously recognized by the orientation information and the voiceprint feature information, thereby further increasing the accuracy of the obtained object recognition result.
  • In the solution of this embodiment, the voice information of the target object and the orientation information of the target object in the current voice environment are obtained, voiceprint feature extraction is performed on the voice information based on the trained voiceprint matching model to obtain the corresponding voiceprint feature information, and finally the voice confidence corresponding to the voiceprint feature information is obtained and the object recognition result of the target object is acquired based on the voice confidence and by using the orientation information and the voiceprint feature information. Determining the object identification information by the voice confidence avoids processing non-essential information during object recognition and improves its efficiency, and identifying the object recognition result with the orientation information and the voiceprint feature information at the same time further improves the accuracy of the obtained object recognition result.
  • In one embodiment, a computer device is provided, including a memory and a processor, the memory storing computer readable instructions which, when executed by the processor, cause the processor to perform the following steps: acquiring the voice information of a target object and the orientation information of the target object in the current voice environment; performing voiceprint feature extraction on the voice information based on a trained voiceprint matching model, and obtaining the voiceprint feature information corresponding to the voice information after the extraction; obtaining the voice confidence corresponding to the voiceprint feature information; and acquiring the object recognition result of the target object based on the voice confidence and by using the orientation information and the voiceprint feature information.
  • In one embodiment, when the computer readable instructions cause the processor to perform the step of acquiring the voice information of the target object and the orientation information of the target object in the current voice environment, the processor performs the following steps: acquiring a voice information set in the current voice environment based on a microphone array; filtering the voice information set to obtain the voice information of the target object after the filtering; obtaining the phase information of the microphone array when the voice information set is collected; and determining the orientation information of the target object based on the voice orientation indicated by the phase information.
  • In one embodiment, the computer readable instructions further cause the processor to perform the following steps before the step of acquiring the voice information of the target object and the orientation information of the target object in the current voice environment: acquiring a voiceprint training voice set; and training the established voiceprint matching model based on each voiceprint training voice in the set and the sample feature information corresponding to each voiceprint training voice, to generate the trained voiceprint matching model.
  • In one embodiment, when the computer readable instructions cause the processor to perform the step of obtaining the voice confidence corresponding to the voiceprint feature information, the processor performs the following steps: matching the voiceprint feature information with the sample feature information corresponding to the voiceprint training voices to obtain the matching degree value at the highest feature matching degree; and determining the voice confidence corresponding to the voiceprint feature information according to the matching degree value.
  • In one embodiment, when the computer readable instructions cause the processor to perform the step of acquiring the object recognition result of the target object based on the voice confidence and by using the orientation information and the voiceprint feature information, the processor performs the following steps: determining, based on the relationship between the voice confidence and the preset confidence thresholds, the adopted object identification information from among the orientation information and the voiceprint feature information; and acquiring the object recognition result of the target object based on the object identification information.
  • In one embodiment, when the computer readable instructions cause the processor to perform the step of determining the adopted object identification information from among the orientation information and the voiceprint feature information based on the relationship between the voice confidence and the preset confidence thresholds, the processor performs the following steps: when the voice confidence is greater than or equal to the first confidence threshold, determining the voiceprint feature information as the adopted object identification information; when the voice confidence is greater than or equal to the second confidence threshold and less than the first confidence threshold, jointly determining the orientation information and the voiceprint feature information as the adopted object identification information; and when the voice confidence is less than the second confidence threshold, determining the orientation information as the adopted object identification information.
  • In one embodiment, when the computer readable instructions cause the processor to perform the step of acquiring the object recognition result of the target object according to the object identification information, the processor performs the following steps: acquiring a candidate recognition result of the target object according to the voiceprint feature information; and locating the object recognition result of the target object from the candidate recognition result by using the orientation information.
  • The above computer device obtains the voice information of the target object and the orientation information of the target object in the current voice environment, performs voiceprint feature extraction on the voice information based on the trained voiceprint matching model to obtain the corresponding voiceprint feature information, and finally obtains the voice confidence corresponding to the voiceprint feature information and acquires the object recognition result of the target object based on the voice confidence and by using the orientation information and the voiceprint feature information.
  • In one embodiment, a non-transitory computer readable storage medium is provided, storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps: acquiring the voice information of a target object and the orientation information of the target object in the current voice environment; performing voiceprint feature extraction on the voice information based on a trained voiceprint matching model, and obtaining the voiceprint feature information corresponding to the voice information after the extraction; obtaining the voice confidence corresponding to the voiceprint feature information; and acquiring the object recognition result of the target object based on the voice confidence and by using the orientation information and the voiceprint feature information.
  • In one embodiment, when the computer readable instructions cause the processor to perform the step of acquiring the voice information of the target object and the orientation information of the target object in the current voice environment, the processor performs the following steps: acquiring a voice information set in the current voice environment based on a microphone array; filtering the voice information set to obtain the voice information of the target object after the filtering; obtaining the phase information of the microphone array when the voice information set is collected; and determining the orientation information of the target object based on the voice orientation indicated by the phase information.
  • In one embodiment, the computer readable instructions further cause the processor to perform the following steps before the step of acquiring the voice information of the target object and the orientation information of the target object in the current voice environment: acquiring a voiceprint training voice set; and training the established voiceprint matching model based on each voiceprint training voice in the set and the sample feature information corresponding to each voiceprint training voice, to generate the trained voiceprint matching model.
  • In one embodiment, when the computer readable instructions cause the processor to perform the step of obtaining the voice confidence corresponding to the voiceprint feature information, the processor performs the following steps: matching the voiceprint feature information with the sample feature information corresponding to the voiceprint training voices to obtain the matching degree value at the highest feature matching degree; and determining the voice confidence corresponding to the voiceprint feature information according to the matching degree value.
  • In one embodiment, when the computer readable instructions cause the processor to perform the step of acquiring the object recognition result of the target object based on the voice confidence and by using the orientation information and the voiceprint feature information, the processor performs the following steps: determining, based on the relationship between the voice confidence and the preset confidence thresholds, the adopted object identification information from among the orientation information and the voiceprint feature information; and acquiring the object recognition result of the target object based on the object identification information.
  • In one embodiment, when the computer readable instructions cause the processor to perform the step of determining the adopted object identification information from among the orientation information and the voiceprint feature information based on the relationship between the voice confidence and the preset confidence thresholds, the processor performs the following steps: when the voice confidence is greater than or equal to the first confidence threshold, determining the voiceprint feature information as the adopted object identification information; when the voice confidence is greater than or equal to the second confidence threshold and less than the first confidence threshold, jointly determining the orientation information and the voiceprint feature information as the adopted object identification information; and when the voice confidence is less than the second confidence threshold, determining the orientation information as the adopted object identification information.
  • In one embodiment, when the computer readable instructions cause the processor to perform the step of acquiring the object recognition result of the target object according to the object identification information, the processor performs the following steps: acquiring a candidate recognition result of the target object according to the voiceprint feature information; and locating the object recognition result of the target object from the candidate recognition result by using the orientation information.
  • With the above computer readable storage medium, the voice information of the target object and the orientation information of the target object in the current voice environment are obtained, voiceprint feature extraction is performed on the voice information based on the trained voiceprint matching model to obtain the corresponding voiceprint feature information, and finally the voice confidence corresponding to the voiceprint feature information is obtained and the object recognition result of the target object is acquired based on the voice confidence and by using the orientation information and the voiceprint feature information.
  • FIG. 15 is a schematic structural diagram of a terminal according to an embodiment of the present application.
  • the terminal 1000 may include at least one processor 1001, such as a CPU, at least one network interface 1004, a user interface 1003, a memory 1005, and at least one communication bus 1002.
  • the communication bus 1002 is used to implement connection communication between these components.
  • the user interface 1003 can include a display and a keyboard.
  • Optionally, the user interface 1003 may also include a standard wired interface and a wireless interface.
  • the network interface 1004 can optionally include a standard wired interface, a wireless interface (such as a WI-FI interface).
  • the memory 1005 may be a high speed RAM memory or a non-volatile memory such as at least one disk memory.
  • Optionally, the memory 1005 may also be at least one storage device located remotely from the aforementioned processor 1001. As shown in FIG. 15, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and an object recognition application.
  • In the terminal 1000 shown in FIG. 15, the user interface 1003 is mainly used to provide an input interface for the user and acquire the data input by the user; the network interface 1004 is used for data communication with a user terminal; and the processor 1001 may be used to invoke the object recognition application stored in the memory 1005 and specifically execute the object recognition method described above.
  • In the embodiment of the present application, the voice information of the target object and the orientation information of the target object in the current voice environment are obtained, voiceprint feature extraction is performed on the voice information based on the trained voiceprint matching model to obtain the corresponding voiceprint feature information, and finally the voice confidence corresponding to the voiceprint feature information is obtained and the object recognition result of the target object is acquired based on the voice confidence and by using the orientation information and the voiceprint feature information. Determining the object identification information by the voice confidence avoids processing non-essential information during object recognition and improves its efficiency, and identifying the object recognition result with the orientation information and the voiceprint feature information at the same time further improves the accuracy of the obtained object recognition result.
  • the storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computer Security & Cryptography (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

An object recognition method includes the following steps: acquiring voice information of a target object and orientation information of the target object in a current voice environment; performing voiceprint feature extraction on the voice information based on a trained voiceprint matching model, and obtaining voiceprint feature information corresponding to the voice information after the voiceprint feature extraction; obtaining a voice confidence corresponding to the voiceprint feature information; and acquiring an object recognition result of the target object based on the voice confidence and by using the orientation information and the voiceprint feature information.

Description

Object recognition method, computer device, and computer-readable storage medium
Cross-reference to related application
This application claims priority to Chinese Patent Application No. 201710992605.7, entitled "Object recognition method and device, storage medium, and terminal thereof", filed with the China Patent Office on October 23, 2017, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of computer technologies, and in particular, to an object recognition method, a computer device, and a computer-readable storage medium.
Background
With the continuous development of technology, voiceprint recognition, as a biometric technology, has become increasingly mature. Through voiceprint recognition, a certain speaker can be distinguished from multiple speakers, and the identity of the speaker corresponding to a piece of speech can also be determined by recognizing the voiceprint features of that speech. For example, the transcription system in a speech recognition system can distinguish all speakers in a certain scene by their voiceprints (for example, distinguishing the judge and the prisoner in a court trial scene through the voiceprint recognition technology in the transcription system).
In the conventional technology, voiceprint recognition is mainly performed by matching the voiceprint features of an acoustic model (for example, intonation, dialect, rhythm, and nasality). However, when voiceprint features with high similarity exist, the differences between voiceprint matching results tend to be small, making it difficult to distinguish the speakers from the voiceprint matching results, which affects the accuracy of the voiceprint recognition results.
Summary
According to various embodiments of the present application, an object recognition method, a computer device, and a computer-readable storage medium are provided.
An object recognition method is performed by a computer device including a memory and a processor, and the method includes:
acquiring voice information of a target object and orientation information of the target object in a current voice environment;
performing voiceprint feature extraction on the voice information based on a trained voiceprint matching model, and obtaining voiceprint feature information corresponding to the voice information after the voiceprint feature extraction;
obtaining a voice confidence corresponding to the voiceprint feature information; and
acquiring an object recognition result of the target object based on the voice confidence and by using the orientation information and the voiceprint feature information.
A computer device includes a processor and a memory, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the following steps:
acquiring voice information of a target object and orientation information of the target object in a current voice environment;
performing voiceprint feature extraction on the voice information based on a trained voiceprint matching model, and obtaining voiceprint feature information corresponding to the voice information after the voiceprint feature extraction;
obtaining a voice confidence corresponding to the voiceprint feature information; and
acquiring an object recognition result of the target object based on the voice confidence and by using the orientation information and the voiceprint feature information.
A non-transitory computer-readable storage medium stores computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
acquiring voice information of a target object and orientation information of the target object in a current voice environment;
performing voiceprint feature extraction on the voice information based on a trained voiceprint matching model, and obtaining voiceprint feature information corresponding to the voice information after the voiceprint feature extraction;
obtaining a voice confidence corresponding to the voiceprint feature information; and
acquiring an object recognition result of the target object based on the voice confidence and by using the orientation information and the voiceprint feature information.
Details of one or more embodiments of the present application are set forth in the following drawings and description. Other features, objects, and advantages of the present application will become apparent from the specification, the drawings, and the claims.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the accompanying drawings required for the embodiments. Apparently, the accompanying drawings in the following description show merely some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from these drawings without creative effort.
FIG. 1 is a schematic diagram of a hardware architecture of an object recognition device according to an embodiment of the present application;
FIG. 2 is a system block diagram of an object recognition device according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of an object recognition method according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of another object recognition method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of voice separation based on beamforming according to an embodiment of the present application;
FIG. 6 is a schematic flowchart of another object recognition method according to an embodiment of the present application;
FIG. 7 is a schematic flowchart of another object recognition method according to an embodiment of the present application;
FIG. 8 is a schematic flowchart of another object recognition method according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of an object recognition device according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of another object recognition device according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of an object information acquisition module according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a confidence acquisition module according to an embodiment of the present application;
FIG. 13 is a schematic structural diagram of a result acquisition module according to an embodiment of the present application;
FIG. 14 is a schematic structural diagram of a second result acquisition unit according to an embodiment of the present application;
FIG. 15 is a schematic structural diagram of a terminal according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings in the embodiments of the present application. Apparently, the described embodiments are merely some rather than all of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
The object recognition method provided in the embodiments of the present application may be applied to scenarios in which voiceprint recognition is performed on sound source objects in a multi-sound-source environment to distinguish a target object. For example, the object recognition device acquires the voice information of the target object and the orientation information of the target object in the current voice environment, then performs voiceprint feature extraction on the voice information based on a trained voiceprint matching model to obtain the voiceprint feature information corresponding to the voice information, and finally obtains the voice confidence corresponding to the voiceprint feature information and acquires the object recognition result of the target object based on the voice confidence and by using the orientation information and the voiceprint feature information. By analyzing the adjusting role of the voice confidence in obtaining the object recognition result, the object recognition result is obtained from the orientation information or the voiceprint feature information, which increases the accuracy of the obtained object recognition result.
The object recognition device in the embodiments of the present application may be a tablet computer, a smartphone, a palmtop computer, a mobile Internet device (MID), or another terminal device that can integrate a microphone array, or that can receive sound source orientation information sent by a microphone array, and that has a voiceprint recognition function. The hardware architecture of the object recognition device may be as shown in FIG. 1, where the audio processor is used for noise reduction and direction localization, the system processor is used to connect to the cloud and analyze voiceprint features, and the storage system is used to store the object recognition application. The system block diagram of the object recognition device may be as shown in FIG. 2, where the microphone array can recognize the voice information corresponding to sound sources in different directions and perform angular localization of the different sound sources.
The object recognition method provided in the embodiments of the present application is described in detail below with reference to FIG. 3 to FIG. 8.
Referring to FIG. 3, a schematic flowchart of an object recognition method is provided for an embodiment of the present application. As shown in FIG. 3, in one embodiment the object recognition method may include the following steps S101 to S104.
S101: Acquire voice information of a target object and orientation information of the target object in a current voice environment.
Specifically, the object recognition device may acquire the voice information of the target object in the current voice environment based on a microphone array, and acquire the orientation information of the target object based on the microphone array.
In one embodiment, the target object may be a valid sound source object in the current voice environment (for example, the judge, a lawyer, the defendant, or the plaintiff in a court trial). It should be noted that the voice information in the voice information set acquired by the object recognition device in the current voice environment may be the voice information of the target object or other non-essential voice information (for example, the voices of spectators in the courtroom or noise emitted by other objects). After acquiring the voice information set in the current voice environment, the object recognition device may filter the set to obtain the voice information of the target object.
In one embodiment, the microphone array can acquire, through multiple microphones, the voice information of the same target object collected from different orientations. Because the microphones are at different positions in the array, each microphone can acquire the phase information of the target object according to the magnitude of the sound, and the orientation information of the target object is calculated from the obtained phase information by beamforming (that is, the position information of the target object in the current voice environment is determined).
S102: Perform voiceprint feature extraction on the voice information based on a trained voiceprint matching model, and obtain voiceprint feature information corresponding to the voice information after the extraction.
Specifically, the object recognition device may perform voiceprint feature extraction on the voice information based on the trained voiceprint matching model.
In one embodiment, the voiceprint matching model may be a model established after training, using a certain training algorithm (for example, a neural network method, a hidden Markov method, or a VQ clustering method), on each voiceprint training voice in a pre-collected voiceprint training voice set and the sample feature information corresponding to each voiceprint training voice.
In one embodiment, the speakers who provide the voices in the voiceprint training voice set may be random experimental subjects and are not limited to a specific target object. The sample feature information corresponding to a voiceprint training voice may be the voiceprint feature information of that voice.
In one embodiment, the object recognition device may obtain the voiceprint feature information corresponding to the voice information after the voiceprint feature extraction. It can be understood that the voiceprint feature information may be the distinguishing feature information in the voice information of the target object, for example, information such as the spectrum, cepstrum, formants, pitch, and reflection coefficients.
S103: Obtain a voice confidence corresponding to the voiceprint feature information.
Specifically, the object recognition device may obtain the voice confidence corresponding to the voiceprint feature information. It can be understood that the voice confidence can indicate the degree of trust in the correspondence between the voiceprint feature information and the target object. For example, a voice confidence of 90% can represent that the target object identified from the corresponding voiceprint feature information is 90% trustworthy.
In one embodiment, the object recognition device may match the voiceprint feature information with the sample feature information corresponding to the voiceprint training voices, obtain the matching degree value at the highest feature matching degree, and then determine the voice confidence corresponding to the voiceprint feature information according to the matching degree value. For example, after the voiceprint feature information is matched against the sample feature information corresponding to each voiceprint training voice in the set, if the sample feature information of voiceprint training voice A has the highest matching degree with the voiceprint feature information and that highest value is 90%, the object recognition device can determine that the voice confidence corresponding to the voiceprint feature information is 90%.
S104: Acquire an object recognition result of the target object based on the voice confidence and by using the orientation information and the voiceprint feature information.
Specifically, the object recognition device may use the voiceprint feature information to generate the object recognition result for the target object, and the object recognition result may indicate the target object to which the voice information belongs. For example, when there are at least two target objects in the current voice environment, the object recognition device may classify the voice information of the target objects by their voiceprint feature information (for example, classifying the speech of all target objects in the recording system at a trial as belonging to the judge, the defendant, the plaintiff, and so on).
In one embodiment, when two similar voiceprint features exist in the voiceprint feature information, the object recognition device may be unable to accurately obtain the object recognition result of the target object from those two similar voiceprint features.
For the above situation, the object recognition device may acquire the object recognition result of the target object based on the voice confidence and by using the orientation information and the voiceprint feature information. Specifically, the object recognition device may determine, based on the relationship between the voice confidence and preset confidence thresholds, the object identification information used to identify the object recognition result of the target object, and then acquire the object recognition result according to the object identification information. It can be understood that the object identification information may be the orientation information or the voiceprint feature information.
In one embodiment, when the voice confidence is greater than or equal to a first confidence threshold, the object recognition device may determine the voiceprint feature information as the adopted object identification information and acquire the object recognition result of the target object according to it (that is, the target object is identified by the voiceprint feature information, while the orientation information does not participate in the recognition and is used only for sound source localization); when the voice confidence is greater than or equal to a second confidence threshold and less than the first confidence threshold, the orientation information and the voiceprint feature information are jointly determined as the adopted object identification information and the object recognition result of the target object is acquired according to them (that is, the voiceprint feature information is used to identify the target object, and at the same time the sound source direction located by the orientation information is used to further identify the target object); when the voice confidence is less than the second confidence threshold, the orientation information is determined as the adopted object identification information and the object recognition result of the target object is acquired according to it (that is, the target object is identified only by the direction obtained from sound source localization).
In the above embodiment, the voice information of the target object and the orientation information of the target object in the current voice environment are obtained, voiceprint feature extraction is performed on the voice information based on the trained voiceprint matching model to obtain the corresponding voiceprint feature information, and finally the voice confidence corresponding to the voiceprint feature information is obtained and the object recognition result of the target object is acquired based on the voice confidence and by using the orientation information and the voiceprint feature information. By analyzing the adjusting role of the voice confidence in obtaining the object recognition result, the object recognition result is obtained from the orientation information or the voiceprint feature information, which increases the accuracy of the obtained object recognition result.
It should be noted that voiceprint recognition may be used either to distinguish among multiple speakers or to confirm the identity of a certain speaker. For the execution process involving speaker discrimination, refer to the embodiment shown in FIG. 4 below; for the execution process involving speaker identity confirmation, refer to the embodiment shown in FIG. 8 below.
请参见图4,为本申请实施例提供了另一种对象识别方法的流程示意图。如图4所示,在一个实施例中,对象识别方法可以包括以下步骤。
S201,获取声纹训练语音集合,基于声纹训练语音集合中各声纹训练语音和声纹训练语音对应的样本特征信息,对建立的声纹匹配模型进行训练生成训练后的声纹匹配模型。
具体地,在进行声纹识别之前,对象识别设备可以获取声纹训练语音集合,并基于声纹训练语音集合中各声纹训练语音和声纹训练语音对应的样本特征信息,对建立的声纹匹配模型进行训练生成训练后的声纹匹配模型。可以理解的是,对象识别设备可以采用神经网络、隐马尔可夫或者VQ聚类等算法对声纹匹配模型进行训练。声纹训练语音集合中的语音对应的语音采集者可以是随机的实验对象,并不限定特定的目标对象,声纹训练语音对应的样本特征信息可以是声纹训练语音的声纹特征信息。
S202,基于麦克风阵列获取当前语音环境中语音信息集合,并对语音信息集合进行筛选处理,获取经筛选处理后的目标对象的语音信息。
具体的,对象识别设备可以基于麦克风阵列获取当前语音环境中语音信息集合.可以理解的是,语音信息集合中的语音信息可以是目标对象的语音信息也可以是其他非必要的语音信息(例如,庭审案件时庭下听众的语音信息或者其他物体发出的噪音等),其中目标对象可以是当前语音环境中的有效声源对象(例如,庭审案件时的法官、律师、被告以及原告)。
在一个实施例中,由于语音信息集合中的语音信息并不全是目标对象的语音信息,对象识别设备可以对语音信息集合进行筛选处理,获取经筛选处理后的目标对象的语音信息。其中筛选处理可以是通过降噪处理滤除噪音、去除回音或者根据待处理的目标对象的语音信息的特征(声音响度、音色或其他特征信息)滤除非目标对象的语音,也可以是其他的语音过滤处理。
S203,获取麦克风阵列在采集语音信息集合时的相位信息,基于相位信息所指示的语音方位确定目标对象的方位信息。
可以理解的是,麦克风阵列在采集语音信息集合的同时可以获取到语音信息集合中各语音信息对应的相位信息。具体的,对象识别设备可以获取相位信息,并可以基于相位信息所指示的语音方位确定目标对象的方位信息。在一个实施例中,相位信息中的相位可以指示语音信息的语音波形在某一时刻的标度,可以描述语音信号波形变化的度量,通常以度(角度)作为单位,也称作相角。
在一个实施例中,麦克风阵列可以通过多个麦克风获取从不同方位采集的同一个目标对象的语音信息,由于多个麦克风处于麦克风阵列中的不同位置,因此每个麦克风可以依据声音的大小获取该目标对象的相位信息,根据所获得的相位信息通过波束形成的方式计算出该目标对象的方位信息(即确定该目标对象在当前语音环境中的位置信息)。其中,波束形成的方式如图5所示,可以是通过向不同方向的声源分别形成拾音波束,并且抑制其他方向的声音,来进行语音提取或分离。
S204,基于训练后的声纹匹配模型对语音信息进行声纹特征提取,获取经声纹特征提取后语音信息对应的声纹特征信息。
具体的,对象识别设备可以基于训练后的声纹匹配模型对语音信息进行声纹特征提取,获取经声纹特征提取后语音信息对应的声纹特征信息。可以理解的是,声纹特征信息可以是目标对象的语音信息中的区别特征信息,例如,可以是频谱、倒频谱、共振峰、基音、反射系数等信息。
S205,将声纹特征信息与声纹训练语音对应的样本特征信息进行匹配,获取特征匹配度最高时的匹配度值。
具体的,对象识别设备可以将声纹特征信息与声纹训练语音对应的样本特征信息进行匹配,获取特征匹配度最高时的匹配度值。
在一个实施例中,不同人的声纹特征是不一样的,即使是同一个人的声纹特征也会随说话人自身的身体状况或所处的环境而不同。因此,在将声纹特征信息的声纹特征与声纹训练语音集合中的各声纹训练语音对应的样本特征信息进行匹配时,所得到的匹配度值也会有大有小,但可以通过比较所有匹配度值从中获取特征匹配度最高时的匹配度值。
S206,根据匹配度值确定声纹特征信息对应的声音置信度。
具体的,对象识别设备可以根据匹配度值确定声纹特征信息对应的声音置信度。可以理解的是,声音置信度可以指示声纹特征信息与目标对象间的对应关系的可信程度,例如,当声音置信度为90%时,可以代表根据该声音置信度对应的声纹特征信息识别出的目标对象的可信程度为90%。
在一个实施例中,对象识别设备可以直接将匹配度值确定声纹特征信息对应的声音置信度。例如,声纹特征信息与声纹训练语音集合中的各声纹训练语音对应的样本特征信息进行匹配后,检测到声纹训练语音A的样本特征信息与声纹特征信息的匹配度最高,且最高值为90%,则对象识别设备可以确定声纹特征信息对应的声音置信度为90%。
S207,基于声音置信度和预设声音置信度阈值的关系,在方位信息和声纹特征信息中确定所采用的对象识别信息,并根据对象识别信息获取目标对象的对象识别结果。
在一个实施例中,对象识别设备可以采用声纹特征信息生成对目标对象的对象识别结果。其中,对象识别结果可以指示目标对象的语音信息是属于目标对象的。例如,当前语音环境中存在至少两个目标对象,对象识别设备可以通过至少两个目标对象的声纹特征信息将至少两个目标对象的语音信息进行归类(例如,将庭审时将录音***中所有目标对象的语音分类为法官、被告和原告等)。
在一个实施例中,当声纹特征信息中存在两个相似的声纹特征时,对象识别设备可能存在不能准确地通过对上述两个相似的声纹特征得出目标对象的对象识别结果的情况。
对于上述情况,对象识别设备可以基于声音置信度和预设声音置信度阈值的关系,在方位信息和声纹特征信息中确定所采用的对象识别信息,并根据对象识别信息获取目标对象的对象识别结果。可以理解的是,预设声音置信度值可以是根据多次识别过程中的经验所得,可以包括至少两个预设的声音置信度阈值。对象识别信息可以用于识别目标对象,可以包括方位信息或声纹特征信息。
在一个实施例中,基于声音置信度和预设声音置信度阈值的关系,在方位信息和声纹特征信息中确定所采用的对象识别信息,并根据对象识别信息获取目标对象的对象识别结果可以包括以下几个步骤,如图6所示:
S301,当声音置信度大于或等于第一置信度阈值时,将声纹特征信息确 定为所采用的对象识别信息,并根据对象识别信息获取目标对象的对象识别结果。
具体的,当声音置信度大于或等于第一置信度阈值时,可以代表声纹特征信息与目标对象间的对应关系的可信程度较大,对象识别设备可以将声纹特征信息确定为所采用的对象识别信息,然后采用声纹特征信息辨别目标对象,此时的方位信息不参与识别仅用作声源定位。
在一个实施例中,可以将第一置信度阈值设为90%、95%或者其他根据实际情况所确定的值。
S302,当声音置信度大于或等于第二置信度阈值且小于第一置信度阈值时,将方位信息和声纹特征信息共同确定为所采用的对象识别信息,并根据对象识别信息获取目标对象的对象识别结果。
具体的,当声音置信度大于或等于第二置信度阈值且小于第一置信度阈值时,可以代表声纹特征信息与目标对象间的对应关系的可信程度处于平均水平。为更准确的识别目标对象,对象识别设备可以将方位信息和声纹特征信息共同确定为所采用的对象识别信息,然后采用声纹特征信息进行声纹识别初步识别目标对象,同时采用方位信息定位的声源方向进一步识别目标对象。
在一个实施例中,可以将第一置信度阈值设为90%、95%或者其他根据实际情况所确定的值,可以将第二置信度阈值设置为50%、55%、60%或者其他根据实际情况所确定的可以代表平均值的数据。
S303,当声音置信度小于第二置信度阈值时,将述方位信息确定为所采用的对象识别信息,并根据对象识别信息获取目标对象的对象识别结果。
具体的,当声音置信度小于第二置信度阈值时,可以代表声纹特征信息与目标对象间的对应关系的可信程度较低,采用声纹特征信息所识别的目标对象的准确率较低,对象识别设备可以将述方位信息确定为所采用的对象识别信息,然后采用方位信息声源定位后的定位方向辨别目标对象,实现同一语音环境下的人声分离。可以理解的是,采用方位信息作为对象识别信息时, 在识别的过程中可以存在允许范围内的误差。
上述实施例中,通过声音置信度确定用于对象识别的对象识别信息,避免了在对象识别的过程中对非必要信息的识别过程,提高了对象识别的效率。
In one embodiment, when the voice confidence is greater than or equal to the second confidence threshold and less than the first confidence threshold, determining the direction information and the voiceprint feature information jointly as the object recognition information to be used and obtaining the object recognition result of the target object according to the object recognition information may include the following steps, as shown in FIG. 7:
S401: When the voice confidence is greater than or equal to the second confidence threshold and less than the first confidence threshold, determine the direction information and the voiceprint feature information jointly as the object recognition information to be used.
It can be understood that when the voice confidence is greater than or equal to the second confidence threshold and less than the first confidence threshold, the credibility of the correspondence between the voiceprint feature information and the target object is at an average level; that is, when the object recognition result is identified from the voiceprint feature information alone, the credibility of the determined result is moderate. In this case, the object recognition device may determine the direction information and the voiceprint feature information jointly as the object recognition information to be used.
S402: Obtain a candidate recognition result of the target object according to the voiceprint feature information.
Specifically, after determining the direction information and the voiceprint feature information jointly as the object recognition information, the object recognition device may obtain a candidate recognition result of the target object according to the voiceprint feature information. In one embodiment, when the voiceprint feature information of the target objects is clearly distinguishable, the candidate recognition result may already be the final object recognition result; that is, the object recognition device can classify the pieces of voice information accurately. When at least two target objects have voiceprint feature information that is not clearly distinguishable, the classification of the voice information of the target objects corresponding to the candidate recognition result is inaccurate. For example, if the voiceprint feature information of judge A and prisoner B is very similar, the object recognition device may, when classifying their voice information, assign judge A's voice information to prisoner B, or prisoner B's voice information to judge A.
S403: Locate the object recognition result of the target object from the candidate recognition result using the direction information.
Specifically, while the object recognition device preliminarily identifies the candidate recognition result of the target object according to the voiceprint feature information, it may use the sound source direction located by the direction information to further pinpoint the object recognition result of the target object from the candidate recognition result; that is, the object recognition device may adjust the candidate recognition result and finally determine the object recognition result of the target object. For example, when the voiceprint feature information of judge A and prisoner B is quite similar, the object recognition device may, according to the positions of judge A and prisoner B, reclassify their voice information accurately from the inaccurately classified candidate recognition result.
In the foregoing embodiment, the object recognition result of the target object is identified using the direction information and the voiceprint feature information together, which further increases the accuracy of the obtained object recognition result.
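One illustrative reading of S401 to S403, with invented names and a simple nearest-angle rule that this application does not prescribe, corrects ambiguous voiceprint-based candidate labels with the estimated direction:

    def refine_with_direction(candidates, angles, seat_angles, ambiguous):
        """Adjust voiceprint-based candidate labels using direction (S403).

        candidates  -- candidate label per utterance from voiceprint matching
        angles      -- estimated arrival angle per utterance (degrees)
        seat_angles -- known angle of each speaker's position
        ambiguous   -- labels whose voiceprints are too similar to trust
        """
        refined = []
        for label, angle in zip(candidates, angles):
            if label in ambiguous:
                # Reassign to whichever ambiguous speaker sits closest in angle.
                label = min(ambiguous, key=lambda s: abs(seat_angles[s] - angle))
            refined.append(label)
        return refined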
With the foregoing object recognition method, the voice information of the target object in the current voice environment and the direction information of the target object are acquired; voiceprint feature extraction is then performed on the voice information based on the trained voiceprint matching model to obtain the voiceprint feature information corresponding to the voice information after the extraction; finally, the voice confidence corresponding to the voiceprint feature information is acquired, and the object recognition result of the target object is obtained based on the voice confidence and using the direction information and the voiceprint feature information. By analyzing the regulating role of the voice confidence in obtaining the object recognition result, the object recognition result can be obtained from the direction information or the voiceprint feature information, which increases the accuracy of the obtained object recognition result; determining the object recognition information used for object recognition by the voice confidence avoids processing unnecessary information during object recognition and improves the efficiency of object recognition; and identifying the object recognition result of the target object with the direction information and the voiceprint feature information together further increases the accuracy of the obtained object recognition result.
Referring to FIG. 8, a schematic flowchart of another object recognition method is provided in an embodiment of this application. As shown in FIG. 8, the method in this embodiment may include the following steps.
S501: Acquire a voiceprint training voice set containing training voices of the target object, and train the established voiceprint matching model based on each voiceprint training voice in the voiceprint training voice set and the sample feature information corresponding to the voiceprint training voices, to generate a trained voiceprint matching model.
It can be understood that voiceprint recognition can confirm the identity information of the speaker corresponding to a piece of voice information; what distinguishes this from using voiceprint recognition to distinguish a target speaker among multiple pieces of voice information is the process of establishing the voiceprint matching model.
Specifically, the object recognition device may acquire a voiceprint training voice set containing training voices of the target object, and train the established voiceprint matching model based on each voiceprint training voice in the set and the corresponding sample feature information, to generate the trained voiceprint matching model. It can be understood that the object recognition device may train the voiceprint matching model with algorithms such as neural networks, hidden Markov models, or VQ clustering. Unlike in step S201, the speakers whose voices make up the voiceprint training voice set must here include the target object, and the sample feature information corresponding to a voiceprint training voice may be the voiceprint feature information of that voice.
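As one classical realization of such a per-speaker model (the application names only algorithm families such as neural networks, hidden Markov models, and VQ clustering, so the Gaussian mixture model below is a substituted, related technique), each enrolled speaker can be modeled over frame-level features. The sketch assumes scikit-learn and frame matrices of shape (frames, features) from the MFCC front end sketched earlier:

    from sklearn.mixture import GaussianMixture

    def train_voiceprint_models(training_frames, n_components=8):
        """Fit one GMM per enrolled speaker as a stand-in voiceprint model.

        training_frames maps a speaker label to an array of shape
        (n_frames, n_features) taken from that speaker's training voices.
        """
        models = {}
        for speaker, frames in training_frames.items():
            gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
            models[speaker] = gmm.fit(frames)
        return models

    def score_utterance(models, frames):
        """Return the best-scoring speaker and its average log-likelihood."""
        scores = {spk: m.score(frames) for spk, m in models.items()}
        best = max(scores, key=scores.get)
        return best, scores[best]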
S502: Acquire a voice information set in the current voice environment based on a microphone array, filter the voice information set, and obtain the filtered voice information of the target object.
S503: Acquire the phase information captured by the microphone array while collecting the voice information set, and determine the direction information of the target object based on the voice direction indicated by the phase information.
S504: Perform voiceprint feature extraction on the voice information based on the trained voiceprint matching model, and obtain the voiceprint feature information corresponding to the voice information after the voiceprint feature extraction.
S505: Match the voiceprint feature information against the sample feature information corresponding to the voiceprint training voices, and obtain the matching-degree value of the highest feature match.
S506: Determine the voice confidence corresponding to the voiceprint feature information according to the matching-degree value.
S507: Based on the relationship between the voice confidence and the preset voice confidence thresholds, determine, from the direction information and the voiceprint feature information, the object recognition information to be used, and obtain the object recognition result of the target object according to the object recognition information.
It can be understood that the object recognition device may generate the object recognition result of the target object using the voiceprint feature information, and the object recognition result may indicate the identity information of the target object corresponding to the voice information. For example, when at least two target objects exist in the current voice environment, the object recognition device may determine, from their voiceprint feature information, the target object corresponding to each piece of voice information and determine the identity information of that target object (for example, after classifying the voices of all target objects in the recording system of a court hearing as judge, defendant, and plaintiff, it can be determined that voice A belongs to the judge, voice B to the defendant, voice C to the plaintiff, and so on).
In one embodiment, when two similar voiceprint features exist in the voiceprint feature information, the object recognition device may be unable to derive an accurate object recognition result of the target objects from the two similar voiceprint features.
For this situation, the object recognition device may, based on the relationship between the voice confidence and the preset voice confidence thresholds, determine the object recognition information to be used from the direction information and the voiceprint feature information, and obtain the object recognition result of the target object according to the object recognition information.
In one embodiment, this determination and the subsequent obtaining of the object recognition result may include the following steps; for details, see the process shown in FIG. 6.
S301: When the voice confidence is greater than or equal to the first confidence threshold, determine the voiceprint feature information as the object recognition information to be used, and obtain the object recognition result of the target object according to the object recognition information.
Specifically, when the voice confidence is greater than or equal to the first confidence threshold, confirming the identity information of the target object from the voiceprint feature information can be regarded as highly credible. The object recognition device may determine the voiceprint feature information as the object recognition information to be used and then identify the identity information of the target object by the voiceprint feature information; in this case, the direction information does not take part in identity confirmation and is used only for sound source localization.
In one embodiment, the first confidence threshold may be set to 90%, 95%, or another value determined according to the actual situation.
S302: When the voice confidence is greater than or equal to the second confidence threshold and less than the first confidence threshold, determine the direction information and the voiceprint feature information jointly as the object recognition information to be used, and obtain the object recognition result of the target object according to the object recognition information.
Specifically, when the voice confidence is greater than or equal to the second confidence threshold and less than the first confidence threshold, the credibility of confirming the identity information of the target object from the voiceprint feature information is at an average level. To identify the identity of the target object more accurately, the object recognition device may determine the direction information and the voiceprint feature information jointly as the object recognition information to be used, perform voiceprint recognition with the voiceprint feature information to preliminarily determine the identity of the target object, and at the same time use the sound source direction located by the direction information to further identify the identity of the target object.
In one embodiment, the first confidence threshold may be set to 90%, 95%, or another value determined according to the actual situation, and the second confidence threshold may be set to 50%, 55%, 60%, or another value, determined according to the actual situation, that can represent an average level.
S303: When the voice confidence is less than the second confidence threshold, determine the direction information as the object recognition information to be used, and obtain the object recognition result of the target object according to the object recognition information.
Specifically, when the voice confidence is less than the second confidence threshold, the credibility of confirming the identity information of the target object from the voiceprint feature information is low, and an identity identified from the voiceprint feature information would have low accuracy. The object recognition device may determine the direction information as the object recognition information to be used and then determine the identity of the target object by the direction obtained from sound source localization based on the direction information, thereby separating the speakers in the same voice environment. It can be understood that when the direction information is used as the object recognition information, an error within an allowed range may exist during recognition. It should be noted that in this case the current voice environment needs to be a specific voice environment in which the positions of the target objects are fixed (for example, in a court hearing, the positions of the judge and the prisoner are fixed).
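In such a fixed-seating environment, the direction-only branch reduces to a lookup from the estimated angle to the identity seated there. A toy sketch with an invented seating layout and tolerance (both assumptions, not values from this application):

    COURTROOM_SEATS = {"judge": 0.0, "defendant": -60.0, "plaintiff": 60.0}

    def identity_from_direction(angle, seats=COURTROOM_SEATS, tolerance=20.0):
        """Map an arrival angle (degrees) to the nearest seated identity (S303)."""
        role = min(seats, key=lambda r: abs(seats[r] - angle))
        # Accept the match only within the allowed error range.
        return role if abs(seats[role] - angle) <= tolerance else "unknown"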
In the foregoing embodiment, the object recognition information used for object recognition is determined by the voice confidence, which avoids processing unnecessary information during object recognition and improves the efficiency of object recognition.
In one embodiment, when the voice confidence is greater than or equal to the second confidence threshold and less than the first confidence threshold, determining the direction information and the voiceprint feature information jointly as the object recognition information to be used and obtaining the object recognition result of the target object according to the object recognition information may include the following steps; for details, see the process shown in FIG. 7.
S401: When the voice confidence is greater than or equal to the second confidence threshold and less than the first confidence threshold, determine the direction information and the voiceprint feature information jointly as the object recognition information to be used.
S402: Obtain a candidate recognition result of the target object according to the voiceprint feature information.
Specifically, after determining the direction information and the voiceprint feature information jointly as the object recognition information, the object recognition device may obtain a candidate recognition result of the target object according to the voiceprint feature information. In one embodiment, when the voiceprint feature information of the target objects is clearly distinguishable, the candidate recognition result may already be the final object recognition result; that is, the object recognition device can unambiguously identify the target object's voice information among multiple pieces of voice information. When at least two target objects have voiceprint feature information that is not clearly distinguishable, the correspondence between target objects and voice information indicated by the candidate recognition result may be inaccurate. For example, if the voiceprint feature information of judge A and prisoner B is very similar, the object recognition device may, when identifying judge A's voice information among the recordings of a hearing, mistake prisoner B's voice information for judge A's, or judge A's voice information for prisoner B's.
S403: Locate the object recognition result of the target object from the candidate recognition result using the direction information.
Specifically, while the object recognition device preliminarily identifies the candidate recognition result of the target object according to the voiceprint feature information, it may use the sound source direction located by the direction information to further pinpoint the object recognition result from the candidate recognition result; that is, the object recognition device may adjust the candidate recognition result and finally determine the object recognition result of the target object. For example, when the voiceprint feature information of judge A and prisoner B is quite similar and the candidate recognition result maps judge A's voice information to prisoner B, the object recognition device can, by combining judge A's position information, map judge A's voice information back to judge A.
In the foregoing embodiment, the object recognition result of the target object is identified using the direction information and the voiceprint feature information together, which further increases the accuracy of the obtained object recognition result.
With the foregoing object recognition method, the voice information of the target object in the current voice environment and the direction information of the target object are acquired; voiceprint feature extraction is then performed on the voice information based on the trained voiceprint matching model to obtain the voiceprint feature information corresponding to the voice information after the extraction; finally, the voice confidence corresponding to the voiceprint feature information is acquired, and the object recognition result of the target object is obtained based on the voice confidence and using the direction information and the voiceprint feature information. By analyzing the regulating role of the voice confidence in obtaining the object recognition result, the object recognition result can be obtained from the direction information or the voiceprint feature information, which increases the accuracy of the obtained object recognition result; determining the object recognition information used for object recognition by the voice confidence avoids processing unnecessary information during object recognition and improves the efficiency of object recognition; and identifying the object recognition result of the target object with the direction information and the voiceprint feature information together further increases the accuracy of the obtained object recognition result.
The object recognition device provided in the embodiments of this application is described in detail below with reference to FIG. 9 to FIG. 14. It should be noted that the devices shown in FIG. 9 to FIG. 14 are configured to perform the methods of the embodiments shown in FIG. 3 to FIG. 8 of this application. For ease of description, only the parts related to the embodiments of this application are shown; for undisclosed technical details, refer to the embodiments shown in FIG. 3 to FIG. 8 of this application.
Referring to FIG. 9, a schematic structural diagram of an object recognition device is provided in an embodiment of this application. As shown in FIG. 9, the object recognition device 1 of this embodiment may include an object information acquiring module 11, a feature information acquiring module 12, a confidence acquiring module 13, and a result acquiring module 14.
The object information acquiring module 11 is configured to acquire the voice information of the target object in the current voice environment and the direction information of the target object.
In a specific implementation, the object information acquiring module 11 may acquire the voice information of the target object in the current voice environment based on a microphone array, and acquire the direction information of the target object based on the microphone array. It can be understood that the target object may be a valid sound source object in the current voice environment (for example, the judge, lawyers, defendant, and plaintiff in a court hearing). It should be noted that the voice information in the voice information set acquired by the object information acquiring module 11 in the current voice environment may be the voice information of the target object or other unnecessary voice information (for example, the voices of the audience in the courtroom or noise from other objects). After acquiring the voice information set in the current voice environment, the object information acquiring module 11 may filter the voice information set to obtain the voice information of the target object.
In this embodiment of this application, the microphone array may collect the voice information of the same target object from different directions through multiple microphones. Because the microphones are located at different positions in the array, each microphone can obtain phase information of the target object from the sound it receives, and the direction information of the target object (that is, the position information of the target object in the current voice environment) can be calculated from the obtained phase information by beamforming.
The feature information acquiring module 12 is configured to perform voiceprint feature extraction on the voice information based on the trained voiceprint matching model, and obtain the voiceprint feature information corresponding to the voice information after the voiceprint feature extraction.
In one embodiment, the feature information acquiring module 12 may perform voiceprint feature extraction on the voice information based on the trained voiceprint matching model. It can be understood that the voiceprint matching model may be a model established by training, with a certain training algorithm (for example, a neural network method, a hidden Markov method, or a VQ clustering method), on each voiceprint training voice in a pre-collected voiceprint training voice set and the sample feature information corresponding to the voiceprint training voices. It can be understood that the speakers whose voices make up the voiceprint training voice set may be random experimental subjects and are not limited to specific target objects, and the sample feature information corresponding to a voiceprint training voice may be the voiceprint feature information of that voice.
Further, the feature information acquiring module 12 may obtain the voiceprint feature information corresponding to the voice information after the voiceprint feature extraction. It can be understood that the voiceprint feature information may be distinguishing feature information in the voice information of the target object, for example, the spectrum, the cepstrum, formants, pitch, or reflection coefficients.
The confidence acquiring module 13 is configured to acquire the voice confidence corresponding to the voiceprint feature information.
In one embodiment, the confidence acquiring module 13 may acquire the voice confidence corresponding to the voiceprint feature information. It can be understood that the voice confidence may indicate the credibility of the correspondence between the voiceprint feature information and the target object. For example, a voice confidence of 90% may represent that a target object identified from the voiceprint feature information corresponding to that confidence is 90% credible.
In one embodiment, the confidence acquiring module 13 may match the voiceprint feature information against the sample feature information corresponding to the voiceprint training voices, obtain the matching-degree value of the highest feature match, and then determine the voice confidence corresponding to the voiceprint feature information according to the matching-degree value. For example, after the voiceprint feature information is matched against the sample feature information corresponding to each voiceprint training voice in the set, if the sample feature information of voiceprint training voice A is detected to have the highest matching degree with the voiceprint feature information, with a maximum value of 90%, the object recognition device may determine that the voice confidence corresponding to the voiceprint feature information is 90%.
The result acquiring module 14 is configured to obtain the object recognition result of the target object using the direction information, the voiceprint feature information, and the voice confidence.
It can be understood that the object recognition device 1 may generate the object recognition result of the target object using the voiceprint feature information, and the object recognition result may indicate the target object to which the voice information belongs. For example, when at least two target objects exist in the current voice environment, the object recognition device may classify the voice information of the target objects by their voiceprint feature information (for example, in a recorded court hearing, classifying the voices of all target objects in the recording system as judge, defendant, plaintiff, and so on).
In one embodiment, when two similar voiceprint features exist in the voiceprint feature information, the object recognition device 1 may be unable to derive an accurate object recognition result of the target objects from the two similar voiceprint features.
For this situation, the result acquiring module 14 may obtain the object recognition result of the target object based on the voice confidence and using the direction information and the voiceprint feature information. In a specific implementation, the result acquiring module 14 may, based on the relationship between the voice confidence and the preset voice confidence thresholds, determine the object recognition information used to identify the object recognition result of the target object, and then obtain the object recognition result according to the object recognition information. It can be understood that the object recognition information may be the direction information or the voiceprint feature information.
In a specific implementation of this application, when the voice confidence is greater than or equal to the first confidence threshold, the result acquiring module 14 may determine the voiceprint feature information as the object recognition information to be used and obtain the object recognition result accordingly (that is, distinguish the target object by the voiceprint feature information, with the direction information used only for sound source localization rather than for recognition). When the voice confidence is greater than or equal to the second confidence threshold and less than the first confidence threshold, it may determine the direction information and the voiceprint feature information jointly as the object recognition information to be used and obtain the object recognition result accordingly (that is, distinguish the target object by voiceprint recognition with the voiceprint feature information while using the sound source direction located by the direction information to further identify the target object). When the voice confidence is less than the second confidence threshold, it may determine the direction information as the object recognition information to be used and obtain the object recognition result accordingly (that is, distinguish the target object only by the direction obtained from sound source localization based on the direction information).
In this embodiment of this application, the voice information of the target object in the current voice environment and the direction information of the target object are acquired; voiceprint feature extraction is then performed on the voice information based on the trained voiceprint matching model to obtain the corresponding voiceprint feature information; finally, the voice confidence corresponding to the voiceprint feature information is acquired, and the object recognition result of the target object is obtained based on the voice confidence and using the direction information and the voiceprint feature information. By analyzing the regulating role of the voice confidence in obtaining the object recognition result, the object recognition result can be obtained from the direction information or the voiceprint feature information, which increases the accuracy of the obtained object recognition result.
It should be noted that voiceprint recognition may distinguish among multiple speakers or confirm the identity of a particular speaker. For the execution process involving speaker distinction, see the first implementation of the embodiment shown in FIG. 10 below; for the execution process involving speaker identity confirmation, see the second implementation of the embodiment shown in FIG. 10 below.
Referring to FIG. 10, a schematic structural diagram of another object recognition device is provided in an embodiment of this application. As shown in FIG. 10, the object recognition device 1 of this embodiment may include an object information acquiring module 11, a feature information acquiring module 12, a confidence acquiring module 13, a result acquiring module 14, and a model generating module 15. In the first implementation of the embodiment shown in FIG. 10:
The model generating module 15 is configured to acquire a voiceprint training voice set, and train the established voiceprint matching model based on each voiceprint training voice in the voiceprint training voice set and the sample feature information corresponding to the voiceprint training voices, to generate a trained voiceprint matching model.
In one embodiment, before voiceprint recognition is performed, the model generating module 15 may acquire the voiceprint training voice set and train the established voiceprint matching model based on each voiceprint training voice in the set and the corresponding sample feature information, to generate the trained voiceprint matching model. It can be understood that the model generating module 15 may train the voiceprint matching model with algorithms such as neural networks, hidden Markov models, or VQ clustering; the speakers whose voices make up the voiceprint training voice set may be random experimental subjects and are not limited to specific target objects, and the sample feature information corresponding to a voiceprint training voice may be the voiceprint feature information of that voice.
The object information acquiring module 11 is configured to acquire the voice information of the target object in the current voice environment and the direction information of the target object.
In one embodiment, the object information acquiring module 11 may acquire the voice information of the target object in the current voice environment and the direction information of the target object.
Referring also to FIG. 11, a schematic structural diagram of the object information acquiring module is provided in an embodiment of this application. As shown in FIG. 11, the object information acquiring module 11 may include:
an information acquiring unit 111, configured to acquire a voice information set in the current voice environment based on the microphone array, filter the voice information set, and obtain the filtered voice information of the target object.
In one embodiment, the information acquiring unit 111 may acquire the voice information set in the current voice environment based on the microphone array. It can be understood that the voice information in the set may be the voice information of the target object or other unnecessary voice information (for example, the voices of the audience during a court hearing or noise from other objects), where the target object may be a valid sound source object in the current voice environment (for example, the judge, lawyers, defendant, and plaintiff in a court hearing).
In one embodiment, because not all voice information in the voice information set belongs to the target object, the information acquiring unit 111 may filter the voice information set to obtain the filtered voice information of the target object. The filtering may specifically be removing noise through noise reduction, removing echo, filtering out non-target voices according to characteristics of the target object's voice information to be processed (loudness, timbre, or other feature information), or another voice filtering process.
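A crude stand-in for such filtering, keeping only the louder frames on the assumption that the valid sound source objects dominate the recording (a real system would add noise suppression and echo cancellation; the function name and threshold are invented):

    import numpy as np

    def keep_loud_frames(signal, fs=16000, frame_ms=30, rel_threshold=0.1):
        """Drop frames whose energy falls far below the loudest frame."""
        frame = int(fs * frame_ms / 1000)
        n = len(signal) // frame
        frames = signal[: n * frame].reshape(n, frame)
        energy = (frames ** 2).mean(axis=1)
        return frames[energy >= rel_threshold * energy.max()].reshape(-1)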
an information determining unit 112, configured to acquire the phase information captured by the microphone array while collecting the voice information set, and determine the direction information of the target object based on the voice direction indicated by the phase information.
In one embodiment, while collecting the voice information set, the microphone array can simultaneously obtain the phase information corresponding to each piece of voice information in the set. In a specific implementation, the information determining unit 112 may acquire the phase information and determine the direction information of the target object based on the voice direction indicated by the phase information. It can be understood that the phase in the phase information may indicate the scale of the voice waveform of the voice information at a given moment; it describes a metric of the change of the voice signal waveform, is usually measured in degrees (as an angle), and is also referred to as the phase angle.
In one embodiment, the microphone array may collect the voice information of the same target object from different directions through multiple microphones. Because the microphones are located at different positions in the array, each microphone can obtain phase information of the target object from the sound it receives, and the direction information of the target object (that is, the position information of the target object in the current voice environment) can be calculated from the obtained phase information by beamforming. As shown in FIG. 5, the beamforming may extract or separate voices by forming pickup beams toward the sound sources in different directions while suppressing sounds from the other directions.
The feature information acquiring module 12 is configured to perform voiceprint feature extraction on the voice information based on the trained voiceprint matching model, and obtain the voiceprint feature information corresponding to the voice information after the voiceprint feature extraction.
In one embodiment, the feature information acquiring module 12 may perform voiceprint feature extraction on the voice information based on the trained voiceprint matching model and obtain the voiceprint feature information corresponding to the voice information after the extraction. It can be understood that the voiceprint feature information may be distinguishing feature information in the voice information of the target object, for example, the spectrum, the cepstrum, formants, pitch, or reflection coefficients.
The confidence acquiring module 13 is configured to acquire the voice confidence corresponding to the voiceprint feature information.
In a specific implementation, the confidence acquiring module 13 may acquire the voice confidence corresponding to the voiceprint feature information.
Referring also to FIG. 12, a schematic structural diagram of the confidence acquiring module is provided in an embodiment of this application. As shown in FIG. 12, the confidence acquiring module 13 may include:
a matching-degree acquiring unit 131, configured to match the voiceprint feature information against the sample feature information corresponding to the voiceprint training voices, and obtain the matching-degree value of the highest feature match.
In one embodiment, the matching-degree acquiring unit 131 may match the voiceprint feature information against the sample feature information corresponding to the voiceprint training voices and obtain the matching-degree value of the highest feature match. It can be understood that different people have different voiceprint features, and even the voiceprint features of the same person vary with the speaker's physical condition and environment; therefore, when the voiceprint feature information is matched against the sample feature information corresponding to each voiceprint training voice in the set, the resulting matching-degree values vary, but the value of the highest feature match can be obtained by comparing all the matching-degree values.
a confidence determining unit 132, configured to determine the voice confidence corresponding to the voiceprint feature information according to the matching-degree value.
In one embodiment, the confidence determining unit 132 may determine the voice confidence corresponding to the voiceprint feature information according to the matching-degree value. It can be understood that the voice confidence may indicate the credibility of the correspondence between the voiceprint feature information and the target object. For example, a voice confidence of 90% may represent that a target object identified from the voiceprint feature information corresponding to that confidence is 90% credible.
In one embodiment, the confidence determining unit 132 may directly use the matching-degree value as the voice confidence corresponding to the voiceprint feature information. For example, after the voiceprint feature information is matched against the sample feature information corresponding to each voiceprint training voice in the set, if the sample feature information of voiceprint training voice A is detected to have the highest matching degree with the voiceprint feature information, with a maximum value of 90%, the object recognition device may determine that the voice confidence corresponding to the voiceprint feature information is 90%.
The result acquiring module 14 is specifically configured to, based on the relationship between the voice confidence and the preset voice confidence thresholds, determine the object recognition information to be used from the direction information and the voiceprint feature information, and obtain the object recognition result of the target object according to the object recognition information.
In one embodiment, the object recognition device 1 may generate the object recognition result of the target object using the voiceprint feature information. The object recognition result may indicate the target object to which the voice information belongs. For example, when at least two target objects exist in the current voice environment, the object recognition device may classify the voice information of the target objects by their voiceprint feature information (for example, in a recorded court hearing, classifying the voices of all target objects in the recording system as judge, defendant, plaintiff, and so on).
In one embodiment, when two similar voiceprint features exist in the voiceprint feature information, the object recognition device may be unable to derive an accurate object recognition result of the target objects from the two similar voiceprint features.
For this situation, the result acquiring module 14 may, based on the relationship between the voice confidence and the preset voice confidence thresholds, determine the object recognition information to be used from the direction information and the voiceprint feature information, and obtain the object recognition result of the target object according to the object recognition information. It can be understood that the preset voice confidence thresholds may be derived from experience over many recognition runs and may include at least two preset voice confidence thresholds. The object recognition information is used to identify the target object and may include the direction information or the voiceprint feature information.
In a specific implementation of the embodiments of this application, the result acquiring module 14 may include the following units, as shown in FIG. 13:
a first result acquiring unit 141, configured to, when the voice confidence is greater than or equal to the first confidence threshold, determine the voiceprint feature information as the object recognition information to be used, and obtain the object recognition result of the target object according to the object recognition information.
In one embodiment, when the voice confidence is greater than or equal to the first confidence threshold, the correspondence between the voiceprint feature information and the target object can be regarded as highly credible. The first result acquiring unit 141 may determine the voiceprint feature information as the object recognition information to be used and then distinguish the target object by the voiceprint feature information; in this case, the direction information does not take part in recognition and is used only for sound source localization.
In one embodiment, the first confidence threshold may be set to 90%, 95%, or another value determined according to the actual situation.
a second result acquiring unit 142, configured to, when the voice confidence is greater than or equal to the second confidence threshold and less than the first confidence threshold, determine the direction information and the voiceprint feature information jointly as the object recognition information to be used, and obtain the object recognition result of the target object according to the object recognition information.
In one embodiment, when the voice confidence is greater than or equal to the second confidence threshold and less than the first confidence threshold, the credibility of the correspondence between the voiceprint feature information and the target object is at an average level. To identify the target object more accurately, the second result acquiring unit 142 may determine the direction information and the voiceprint feature information jointly as the object recognition information to be used, perform voiceprint recognition with the voiceprint feature information for a preliminary identification, and at the same time use the sound source direction located by the direction information to further identify the target object.
In one embodiment, the first confidence threshold may be set to 90%, 95%, or another value determined according to the actual situation, and the second confidence threshold may be set to 50%, 55%, 60%, or another value, determined according to the actual situation, that can represent an average level.
a third result acquiring unit 143, configured to, when the voice confidence is less than the second confidence threshold, determine the direction information as the object recognition information to be used, and obtain the object recognition result of the target object according to the object recognition information.
In one embodiment, when the voice confidence is less than the second confidence threshold, the credibility of the correspondence between the voiceprint feature information and the target object is low, and a target object identified from the voiceprint feature information would have low accuracy. The third result acquiring unit 143 may determine the direction information as the object recognition information to be used and then distinguish the target object by the direction obtained from sound source localization based on the direction information, thereby separating the speakers in the same voice environment. It can be understood that when the direction information is used as the object recognition information, an error within an allowed range may exist during recognition.
In this embodiment of this application, the object recognition information used for object recognition is determined by the voice confidence, which avoids processing unnecessary information during object recognition and improves the efficiency of object recognition.
In one embodiment, the second result acquiring unit 142 may include the following subunits, as shown in FIG. 14:
an information determining subunit 1421, configured to, when the voice confidence is greater than or equal to the second confidence threshold and less than the first confidence threshold, determine the direction information and the voiceprint feature information jointly as the object recognition information to be used.
In one embodiment, when the voice confidence is greater than or equal to the second confidence threshold and less than the first confidence threshold, the credibility of the correspondence between the voiceprint feature information and the target object is at an average level; that is, when the object recognition result is identified from the voiceprint feature information alone, the credibility of the determined result is moderate. In this case, the information determining subunit 1421 may determine the direction information and the voiceprint feature information jointly as the object recognition information to be used.
a candidate result acquiring subunit 1422, configured to obtain a candidate recognition result of the target object according to the voiceprint feature information.
In one embodiment, after the information determining subunit 1421 determines the direction information and the voiceprint feature information jointly as the object recognition information, the candidate result acquiring subunit 1422 may obtain a candidate recognition result of the target object according to the voiceprint feature information. It can be understood that, when the voiceprint feature information of the target objects is clearly distinguishable, the candidate recognition result may already be the final object recognition result; that is, the object recognition device can classify the pieces of voice information accurately. When at least two target objects have voiceprint feature information that is not clearly distinguishable, the classification of the voice information of the target objects corresponding to the candidate recognition result is inaccurate. For example, if the voiceprint feature information of judge A and prisoner B is very similar, the object recognition device may, when classifying their voice information, assign judge A's voice information to prisoner B, or prisoner B's voice information to judge A.
a result acquiring subunit 1423, configured to locate the object recognition result of the target object from the candidate recognition result using the direction information.
In one embodiment, while the candidate result acquiring subunit 1422 preliminarily identifies the candidate recognition result of the target object according to the voiceprint feature information, the result acquiring subunit 1423 may use the sound source direction located by the direction information to further pinpoint the object recognition result of the target object from the candidate recognition result; that is, the result acquiring subunit 1423 may adjust the candidate recognition result and finally determine the object recognition result of the target object. For example, when the voiceprint feature information of judge A and prisoner B is quite similar, the object recognition device may, according to the positions of judge A and prisoner B, reclassify their voice information accurately from the inaccurately classified candidate recognition result.
In this embodiment of this application, the object recognition result of the target object is identified using the direction information and the voiceprint feature information together, which further increases the accuracy of the obtained object recognition result.
In this embodiment of this application, the voice information of the target object in the current voice environment and the direction information of the target object are acquired; voiceprint feature extraction is then performed on the voice information based on the trained voiceprint matching model to obtain the corresponding voiceprint feature information; finally, the voice confidence corresponding to the voiceprint feature information is acquired, and the object recognition result of the target object is obtained based on the voice confidence and using the direction information and the voiceprint feature information. By analyzing the regulating role of the voice confidence in obtaining the object recognition result, the object recognition result can be obtained from the direction information or the voiceprint feature information, which increases the accuracy of the obtained object recognition result; determining the object recognition information used for object recognition by the voice confidence avoids processing unnecessary information during object recognition and improves the efficiency of object recognition; and identifying the object recognition result of the target object with the direction information and the voiceprint feature information together further increases the accuracy of the obtained object recognition result.
In the second implementation of the embodiment shown in FIG. 10:
The model generating module 15 is specifically configured to acquire a voiceprint training voice set containing training voices of the target object, and train the established voiceprint matching model based on each voiceprint training voice in the voiceprint training voice set and the sample feature information corresponding to the voiceprint training voices, to generate a trained voiceprint matching model.
It can be understood that voiceprint recognition can confirm the identity information of the speaker corresponding to a piece of voice information; what distinguishes this from using voiceprint recognition to distinguish a target speaker among multiple pieces of voice information is the process of establishing the voiceprint matching model.
In one embodiment, the model generating module 15 may acquire a voiceprint training voice set containing training voices of the target object and train the established voiceprint matching model based on each voiceprint training voice in the set and the corresponding sample feature information, to generate the trained voiceprint matching model. It can be understood that the model generating module 15 may train the voiceprint matching model with algorithms such as neural networks, hidden Markov models, or VQ clustering. Unlike the model generating module 15 in the first implementation of the embodiment shown in FIG. 10, the speakers whose voices make up the voiceprint training voice set must here include the target object, and the sample feature information corresponding to a voiceprint training voice may be the voiceprint feature information of that voice.
The object information acquiring module 11 is configured to acquire the voice information of the target object in the current voice environment and the direction information of the target object.
In one embodiment, the object information acquiring module 11 may acquire the voice information of the target object in the current voice environment and the direction information of the target object.
Referring also to FIG. 11, a schematic structural diagram of the object information acquiring module is provided in an embodiment of this application. As shown in FIG. 11, the object information acquiring module 11 may include:
an information acquiring unit 111, configured to acquire a voice information set in the current voice environment based on the microphone array, filter the voice information set, and obtain the filtered voice information of the target object.
In one embodiment, for the detailed process by which the information acquiring unit 111 obtains the voice information of the target object, refer to the description in the foregoing method embodiments; details are not repeated here.
an information determining unit 112, configured to acquire the phase information captured by the microphone array while collecting the voice information set, and determine the direction information of the target object based on the voice direction indicated by the phase information.
In a specific implementation, for the detailed process by which the information determining unit 112 obtains the direction information of the target object, refer to the description in the foregoing method embodiments; details are not repeated here.
The feature information acquiring module 12 is configured to perform voiceprint feature extraction on the voice information based on the trained voiceprint matching model, and obtain the voiceprint feature information corresponding to the voice information after the voiceprint feature extraction.
In a specific implementation, for the detailed process by which the feature information acquiring module 12 obtains the voiceprint feature information, refer to the description in the foregoing method embodiments; details are not repeated here.
The confidence acquiring module 13 is configured to acquire the voice confidence corresponding to the voiceprint feature information.
In a specific implementation, the confidence acquiring module 13 may acquire the voice confidence corresponding to the voiceprint feature information.
Referring also to FIG. 12, a schematic structural diagram of the confidence acquiring module is provided in an embodiment of this application. As shown in FIG. 12, the confidence acquiring module 13 may include:
a matching-degree acquiring unit 131, configured to match the voiceprint feature information against the sample feature information corresponding to the voiceprint training voices, and obtain the matching-degree value of the highest feature match.
In a specific implementation, for the detailed process by which the matching-degree acquiring unit 131 obtains the matching-degree value, refer to the description in the foregoing method embodiments; details are not repeated here.
a confidence determining unit 132, configured to determine the voice confidence corresponding to the voiceprint feature information according to the matching-degree value.
In a specific implementation, for the detailed process by which the confidence determining unit 132 determines the voice confidence, refer to the description in the foregoing method embodiments; details are not repeated here.
The result acquiring module 14 is specifically configured to, based on the relationship between the voice confidence and the preset voice confidence thresholds, determine the object recognition information to be used from the direction information and the voiceprint feature information, and obtain the object recognition result of the target object according to the object recognition information.
It can be understood that the object recognition device 1 may generate the object recognition result of the target object using the voiceprint feature information, and the object recognition result may indicate the identity information of the target object corresponding to the voice information. For example, when at least two target objects exist in the current voice environment, the object recognition device 1 may determine, from their voiceprint feature information, the target object corresponding to each piece of voice information and determine the identity information of that target object (for example, after classifying the voices of all target objects in the recording system of a court hearing as judge, defendant, and plaintiff, it can be determined that voice A belongs to the judge, voice B to the defendant, voice C to the plaintiff, and so on).
In one embodiment, when two similar voiceprint features exist in the voiceprint feature information, the object recognition device 1 may be unable to derive an accurate object recognition result of the target objects from the two similar voiceprint features.
For this situation, the result acquiring module 14 may, based on the relationship between the voice confidence and the preset voice confidence thresholds, determine the object recognition information to be used from the direction information and the voiceprint feature information, and obtain the object recognition result of the target object according to the object recognition information.
In a specific implementation of the embodiments of this application, the result acquiring module 14 may include the following units, as shown in FIG. 13:
a first result acquiring unit 141, configured to, when the voice confidence is greater than or equal to the first confidence threshold, determine the voiceprint feature information as the object recognition information to be used, and obtain the object recognition result of the target object according to the object recognition information.
In one embodiment, when the voice confidence is greater than or equal to the first confidence threshold, confirming the identity information of the target object from the voiceprint feature information can be regarded as highly credible. The first result acquiring unit 141 may determine the voiceprint feature information as the object recognition information to be used and then identify the identity information of the target object by the voiceprint feature information; in this case, the direction information does not take part in identity confirmation and is used only for sound source localization.
In one embodiment, the first confidence threshold may be set to 90%, 95%, or another value determined according to the actual situation.
a second result acquiring unit 142, configured to, when the voice confidence is greater than or equal to the second confidence threshold and less than the first confidence threshold, determine the direction information and the voiceprint feature information jointly as the object recognition information to be used, and obtain the object recognition result of the target object according to the object recognition information.
In one embodiment, when the voice confidence is greater than or equal to the second confidence threshold and less than the first confidence threshold, the credibility of confirming the identity information of the target object from the voiceprint feature information is at an average level. To identify the identity of the target object more accurately, the second result acquiring unit 142 may determine the direction information and the voiceprint feature information jointly as the object recognition information to be used, perform voiceprint recognition with the voiceprint feature information to preliminarily determine the identity of the target object, and at the same time use the sound source direction located by the direction information to further identify the identity of the target object.
In one embodiment, the first confidence threshold may be set to 90%, 95%, or another value determined according to the actual situation, and the second confidence threshold may be set to 50%, 55%, 60%, or another value, determined according to the actual situation, that can represent an average level.
a third result acquiring unit 143, configured to, when the voice confidence is less than the second confidence threshold, determine the direction information as the object recognition information to be used, and obtain the object recognition result of the target object according to the object recognition information.
In one embodiment, when the voice confidence is less than the second confidence threshold, the credibility of confirming the identity information of the target object from the voiceprint feature information is low, and an identity identified from the voiceprint feature information would have low accuracy. The third result acquiring unit 143 may determine the direction information as the object recognition information to be used and then determine the identity of the target object by the direction obtained from sound source localization based on the direction information, thereby separating the speakers in the same voice environment. It can be understood that when the direction information is used as the object recognition information, an error within an allowed range may exist during recognition. It should be noted that in this case the current voice environment needs to be a specific voice environment in which the positions of the target objects are fixed (for example, in a court hearing, the positions of the judge and the prisoner are fixed).
In this embodiment of this application, the object recognition information used for object recognition is determined by the voice confidence, which avoids processing unnecessary information during object recognition and improves the efficiency of object recognition.
In a specific implementation of the embodiments of this application, the result acquiring module 14 may include the following subunits, as shown in FIG. 14:
an information determining subunit 1421, configured to, when the voice confidence is greater than or equal to the second confidence threshold and less than the first confidence threshold, determine the direction information and the voiceprint feature information jointly as the object recognition information to be used.
In a specific implementation, for the detailed process by which the information determining subunit 1421 determines the object recognition information, refer to the description in the foregoing method embodiments; details are not repeated here.
a candidate result acquiring subunit 1422, configured to obtain a candidate recognition result of the target object according to the voiceprint feature information.
In a specific implementation, after the information determining subunit 1421 determines the direction information and the voiceprint feature information jointly as the object recognition information, the candidate result acquiring subunit 1422 may obtain a candidate recognition result of the target object according to the voiceprint feature information. It can be understood that, when the voiceprint feature information of the target objects is clearly distinguishable, the candidate recognition result may already be the final object recognition result; that is, the object recognition device can unambiguously identify the target object's voice information among multiple pieces of voice information. When at least two target objects have voiceprint feature information that is not clearly distinguishable, the correspondence between target objects and voice information indicated by the candidate recognition result may be inaccurate. For example, if the voiceprint feature information of judge A and prisoner B is very similar, the object recognition device may, when identifying judge A's voice information among the recordings of a hearing, mistake prisoner B's voice information for judge A's, or judge A's voice information for prisoner B's.
a result acquiring subunit 1423, configured to locate the object recognition result of the target object from the candidate recognition result using the direction information.
In one embodiment, while the candidate result acquiring subunit 1422 preliminarily identifies the candidate recognition result of the target object according to the voiceprint feature information, the result acquiring subunit 1423 may use the sound source direction located by the direction information to further pinpoint the object recognition result of the target object from the candidate recognition result; that is, the result acquiring subunit 1423 may adjust the candidate recognition result and finally determine the object recognition result of the target object. For example, when the voiceprint feature information of judge A and prisoner B is quite similar and the candidate recognition result maps judge A's voice information to prisoner B, the object recognition device can, by combining judge A's position information, map judge A's voice information back to judge A.
In this embodiment of this application, the object recognition result of the target object is identified using the direction information and the voiceprint feature information together, which further increases the accuracy of the obtained object recognition result.
In this embodiment of this application, the voice information of the target object in the current voice environment and the direction information of the target object are acquired; voiceprint feature extraction is then performed on the voice information based on the trained voiceprint matching model to obtain the corresponding voiceprint feature information; finally, the voice confidence corresponding to the voiceprint feature information is acquired, and the object recognition result of the target object is obtained based on the voice confidence and using the direction information and the voiceprint feature information. By analyzing the regulating role of the voice confidence in obtaining the object recognition result, the object recognition result can be obtained from the direction information or the voiceprint feature information, which increases the accuracy of the obtained object recognition result; determining the object recognition information used for object recognition by the voice confidence avoids processing unnecessary information during object recognition and improves the efficiency of object recognition; and identifying the object recognition result of the target object with the direction information and the voiceprint feature information together further increases the accuracy of the obtained object recognition result.
In one embodiment, a computer device is provided, including a memory and a processor. The memory stores computer-readable instructions which, when executed by the processor, cause the processor to perform the following steps: acquiring the voice information of a target object in the current voice environment and the direction information of the target object; performing voiceprint feature extraction on the voice information based on the trained voiceprint matching model, and obtaining the voiceprint feature information corresponding to the voice information after the voiceprint feature extraction; acquiring the voice confidence corresponding to the voiceprint feature information; and obtaining the object recognition result of the target object based on the voice confidence and using the direction information and the voiceprint feature information.
In one embodiment, the computer-readable instructions, when executed by the processor, cause the processor, when performing the step of acquiring the voice information of the target object in the current voice environment and the direction information of the target object, to perform the following steps: acquiring a voice information set in the current voice environment based on a microphone array; filtering the voice information set to obtain the filtered voice information of the target object; acquiring the phase information captured by the microphone array while collecting the voice information set; and determining the direction information of the target object based on the voice direction indicated by the phase information.
In one embodiment, the computer-readable instructions, when executed by the processor, cause the processor, before performing the step of acquiring the voice information of the target object in the current voice environment and the direction information of the target object, to further perform the following steps: acquiring a voiceprint training voice set; and training the established voiceprint matching model based on each voiceprint training voice in the voiceprint training voice set and the sample feature information corresponding to the voiceprint training voices, to generate the trained voiceprint matching model.
In one embodiment, the computer-readable instructions, when executed by the processor, cause the processor, when performing the step of acquiring the voice confidence corresponding to the voiceprint feature information, to perform the following steps: matching the voiceprint feature information against the sample feature information corresponding to the voiceprint training voices, and obtaining the matching-degree value of the highest feature match; and determining the voice confidence corresponding to the voiceprint feature information according to the matching-degree value.
In one embodiment, the computer-readable instructions, when executed by the processor, cause the processor, when performing the step of obtaining the object recognition result of the target object based on the voice confidence and using the direction information and the voiceprint feature information, to perform the following steps: determining, based on the relationship between the voice confidence and the preset voice confidence thresholds, the object recognition information to be used from the direction information and the voiceprint feature information; and obtaining the object recognition result of the target object according to the object recognition information.
In one embodiment, the computer-readable instructions, when executed by the processor, cause the processor, when performing the step of determining, based on the relationship between the voice confidence and the preset voice confidence thresholds, the object recognition information to be used from the direction information and the voiceprint feature information, to perform the following steps: when the voice confidence is greater than or equal to a first confidence threshold, determining the voiceprint feature information as the object recognition information to be used; when the voice confidence is greater than or equal to a second confidence threshold and less than the first confidence threshold, determining the direction information and the voiceprint feature information jointly as the object recognition information to be used; and when the voice confidence is less than the second confidence threshold, determining the direction information as the object recognition information to be used.
In one embodiment, when the voice confidence is greater than or equal to the second confidence threshold and less than the first confidence threshold, the direction information and the voiceprint feature information are jointly determined as the object recognition information to be used, and the computer-readable instructions, when executed by the processor, cause the processor, when performing the step of obtaining the object recognition result of the target object according to the object recognition information, to perform the following steps: obtaining a candidate recognition result of the target object according to the voiceprint feature information; and locating the object recognition result of the target object from the candidate recognition result using the direction information.
With the foregoing computer device, the voice information of the target object in the current voice environment and the direction information of the target object are acquired; voiceprint feature extraction is then performed on the voice information based on the trained voiceprint matching model to obtain the corresponding voiceprint feature information; finally, the voice confidence corresponding to the voiceprint feature information is acquired, and the object recognition result of the target object is obtained based on the voice confidence and using the direction information and the voiceprint feature information. By analyzing the regulating role of the voice confidence in obtaining the object recognition result, the object recognition result can be obtained from the direction information or the voiceprint feature information, which increases the accuracy of the obtained object recognition result.
A non-volatile computer-readable storage medium storing computer-readable instructions is provided. The computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps: acquiring the voice information of a target object in the current voice environment and the direction information of the target object; performing voiceprint feature extraction on the voice information based on the trained voiceprint matching model, and obtaining the voiceprint feature information corresponding to the voice information after the voiceprint feature extraction; acquiring the voice confidence corresponding to the voiceprint feature information; and obtaining the object recognition result of the target object based on the voice confidence and using the direction information and the voiceprint feature information.
In one embodiment, the computer-readable instructions, when executed by the processor, cause the processor, when performing the step of acquiring the voice information of the target object in the current voice environment and the direction information of the target object, to perform the following steps: acquiring a voice information set in the current voice environment based on a microphone array; filtering the voice information set to obtain the filtered voice information of the target object; acquiring the phase information captured by the microphone array while collecting the voice information set; and determining the direction information of the target object based on the voice direction indicated by the phase information.
In one embodiment, the computer-readable instructions, when executed by the processor, cause the processor, before performing the step of acquiring the voice information of the target object in the current voice environment and the direction information of the target object, to further perform the following steps: acquiring a voiceprint training voice set; and training the established voiceprint matching model based on each voiceprint training voice in the voiceprint training voice set and the sample feature information corresponding to the voiceprint training voices, to generate the trained voiceprint matching model.
In one embodiment, the computer-readable instructions, when executed by the processor, cause the processor, when performing the step of acquiring the voice confidence corresponding to the voiceprint feature information, to perform the following steps: matching the voiceprint feature information against the sample feature information corresponding to the voiceprint training voices, and obtaining the matching-degree value of the highest feature match; and determining the voice confidence corresponding to the voiceprint feature information according to the matching-degree value.
In one embodiment, the computer-readable instructions, when executed by the processor, cause the processor, when performing the step of obtaining the object recognition result of the target object based on the voice confidence and using the direction information and the voiceprint feature information, to perform the following steps: determining, based on the relationship between the voice confidence and the preset voice confidence thresholds, the object recognition information to be used from the direction information and the voiceprint feature information; and obtaining the object recognition result of the target object according to the object recognition information.
In one embodiment, the computer-readable instructions, when executed by the processor, cause the processor, when performing the step of determining, based on the relationship between the voice confidence and the preset voice confidence thresholds, the object recognition information to be used from the direction information and the voiceprint feature information, to perform the following steps: when the voice confidence is greater than or equal to a first confidence threshold, determining the voiceprint feature information as the object recognition information to be used; when the voice confidence is greater than or equal to a second confidence threshold and less than the first confidence threshold, determining the direction information and the voiceprint feature information jointly as the object recognition information to be used; and when the voice confidence is less than the second confidence threshold, determining the direction information as the object recognition information to be used.
In one embodiment, when the voice confidence is greater than or equal to the second confidence threshold and less than the first confidence threshold, the direction information and the voiceprint feature information are jointly determined as the object recognition information to be used, and the computer-readable instructions, when executed by the processor, cause the processor, when performing the step of obtaining the object recognition result of the target object according to the object recognition information, to perform the following steps: obtaining a candidate recognition result of the target object according to the voiceprint feature information; and locating the object recognition result of the target object from the candidate recognition result using the direction information.
With the foregoing computer-readable storage medium, the voice information of the target object in the current voice environment and the direction information of the target object are acquired; voiceprint feature extraction is then performed on the voice information based on the trained voiceprint matching model to obtain the corresponding voiceprint feature information; finally, the voice confidence corresponding to the voiceprint feature information is acquired, and the object recognition result of the target object is obtained based on the voice confidence and using the direction information and the voiceprint feature information. By analyzing the regulating role of the voice confidence in obtaining the object recognition result, the object recognition result can be obtained from the direction information or the voiceprint feature information, which increases the accuracy of the obtained object recognition result.
Referring to FIG. 15, a schematic structural diagram of a terminal is provided in an embodiment of this application. As shown in FIG. 15, the terminal 1000 may include at least one processor 1001 (for example, a CPU), at least one network interface 1004, a user interface 1003, a memory 1005, and at least one communication bus 1002. The communication bus 1002 is configured to implement connection and communication among these components. The user interface 1003 may include a display and a keyboard; optionally, the user interface 1003 may further include standard wired and wireless interfaces. The network interface 1004 may optionally include standard wired and wireless interfaces (such as a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, for example, at least one disk memory. Optionally, the memory 1005 may also be at least one storage device located away from the processor 1001. As shown in FIG. 15, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and an object recognition application.
In the terminal 1000 shown in FIG. 15, the user interface 1003 is mainly configured to provide an input interface for the user and obtain data entered by the user; the network interface 1004 is configured to perform data communication with a user terminal; and the processor 1001 may be configured to call the object recognition application stored in the memory 1005 and specifically perform the foregoing object recognition method.
In this embodiment of this application, the voice information of the target object in the current voice environment and the direction information of the target object are acquired; voiceprint feature extraction is then performed on the voice information based on the trained voiceprint matching model to obtain the corresponding voiceprint feature information; finally, the voice confidence corresponding to the voiceprint feature information is acquired, and the object recognition result of the target object is obtained based on the voice confidence and using the direction information and the voiceprint feature information. By analyzing the regulating role of the voice confidence in obtaining the object recognition result, the object recognition result can be obtained from the direction information or the voiceprint feature information, which increases the accuracy of the obtained object recognition result; determining the object recognition information used for object recognition by the voice confidence avoids processing unnecessary information during object recognition and improves the efficiency of object recognition; and identifying the object recognition result of the target object with the direction information and the voiceprint feature information together further increases the accuracy of the obtained object recognition result.
A person of ordinary skill in the art can understand that all or some of the processes of the foregoing method embodiments may be implemented by a computer program instructing relevant hardware. The program may be stored in a computer-readable storage medium, and when executed, may include the processes of the foregoing method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
What is disclosed above is merely a preferred embodiment of the present invention, which certainly cannot be used to limit the scope of rights of the present invention. Therefore, equivalent changes made according to the claims of the present invention still fall within the scope covered by the present invention.

Claims (20)

  1. An object recognition method, performed by a computer device comprising a memory and a processor, the method comprising:
    acquiring voice information of a target object in a current voice environment and direction information of the target object;
    performing voiceprint feature extraction on the voice information based on a trained voiceprint matching model, and obtaining voiceprint feature information corresponding to the voice information after the voiceprint feature extraction;
    acquiring a voice confidence corresponding to the voiceprint feature information; and
    obtaining an object recognition result of the target object based on the voice confidence and using the direction information and the voiceprint feature information.
  2. The method according to claim 1, wherein the acquiring voice information of a target object in a current voice environment and direction information of the target object comprises:
    acquiring a voice information set in the current voice environment based on a microphone array;
    filtering the voice information set to obtain the filtered voice information of the target object;
    acquiring phase information captured by the microphone array while collecting the voice information set; and
    determining the direction information of the target object based on a voice direction indicated by the phase information.
  3. The method according to claim 1, wherein before the acquiring voice information of a target object in a current voice environment and direction information of the target object, the method further comprises:
    acquiring a voiceprint training voice set; and
    training an established voiceprint matching model based on each voiceprint training voice in the voiceprint training voice set and sample feature information corresponding to the voiceprint training voices, to generate the trained voiceprint matching model.
  4. The method according to claim 3, wherein the acquiring a voice confidence corresponding to the voiceprint feature information comprises:
    matching the voiceprint feature information against the sample feature information corresponding to the voiceprint training voices, and obtaining a matching-degree value of the highest feature match; and
    determining the voice confidence corresponding to the voiceprint feature information according to the matching-degree value.
  5. The method according to claim 1, wherein the obtaining an object recognition result of the target object based on the voice confidence and using the direction information and the voiceprint feature information comprises:
    determining, based on a relationship between the voice confidence and preset voice confidence thresholds, object recognition information to be used from the direction information and the voiceprint feature information; and
    obtaining the object recognition result of the target object according to the object recognition information.
  6. The method according to claim 5, wherein the determining, based on a relationship between the voice confidence and preset voice confidence thresholds, object recognition information to be used from the direction information and the voiceprint feature information comprises:
    determining the voiceprint feature information as the object recognition information to be used when the voice confidence is greater than or equal to a first confidence threshold;
    determining the direction information and the voiceprint feature information jointly as the object recognition information to be used when the voice confidence is greater than or equal to a second confidence threshold and less than the first confidence threshold; and
    determining the direction information as the object recognition information to be used when the voice confidence is less than the second confidence threshold.
  7. The method according to claim 6, wherein, when the voice confidence is greater than or equal to the second confidence threshold and less than the first confidence threshold, the direction information and the voiceprint feature information are jointly determined as the object recognition information to be used, and the obtaining the object recognition result of the target object according to the object recognition information comprises:
    obtaining a candidate recognition result of the target object according to the voiceprint feature information; and
    locating the object recognition result of the target object from the candidate recognition result using the direction information.
  8. A computer device, comprising a processor and a memory, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the following steps:
    acquiring voice information of a target object in a current voice environment and direction information of the target object;
    performing voiceprint feature extraction on the voice information based on a trained voiceprint matching model, and obtaining voiceprint feature information corresponding to the voice information after the voiceprint feature extraction;
    acquiring a voice confidence corresponding to the voiceprint feature information; and
    obtaining an object recognition result of the target object based on the voice confidence and using the direction information and the voiceprint feature information.
  9. The computer device according to claim 8, wherein the computer-readable instructions, when executed by the processor, cause the processor, when performing the step of acquiring voice information of a target object in a current voice environment and direction information of the target object, to perform the following steps:
    acquiring a voice information set in the current voice environment based on a microphone array;
    filtering the voice information set to obtain the filtered voice information of the target object;
    acquiring phase information captured by the microphone array while collecting the voice information set; and
    determining the direction information of the target object based on a voice direction indicated by the phase information.
  10. The computer device according to claim 8, wherein the computer-readable instructions, when executed by the processor, cause the processor, before performing the step of acquiring voice information of a target object in a current voice environment and direction information of the target object, to further perform the following steps:
    acquiring a voiceprint training voice set; and
    training an established voiceprint matching model based on each voiceprint training voice in the voiceprint training voice set and sample feature information corresponding to the voiceprint training voices, to generate the trained voiceprint matching model.
  11. The computer device according to claim 10, wherein the computer-readable instructions, when executed by the processor, cause the processor, when performing the step of acquiring a voice confidence corresponding to the voiceprint feature information, to perform the following steps:
    matching the voiceprint feature information against the sample feature information corresponding to the voiceprint training voices, and obtaining a matching-degree value of the highest feature match; and
    determining the voice confidence corresponding to the voiceprint feature information according to the matching-degree value.
  12. The computer device according to claim 8, wherein the computer-readable instructions, when executed by the processor, cause the processor, when performing the step of obtaining an object recognition result of the target object based on the voice confidence and using the direction information and the voiceprint feature information, to perform the following steps:
    determining, based on a relationship between the voice confidence and preset voice confidence thresholds, object recognition information to be used from the direction information and the voiceprint feature information; and
    obtaining the object recognition result of the target object according to the object recognition information.
  13. The computer device according to claim 12, wherein the computer-readable instructions, when executed by the processor, cause the processor, when performing the step of determining, based on a relationship between the voice confidence and preset voice confidence thresholds, object recognition information to be used from the direction information and the voiceprint feature information, to perform the following steps:
    determining the voiceprint feature information as the object recognition information to be used when the voice confidence is greater than or equal to a first confidence threshold;
    determining the direction information and the voiceprint feature information jointly as the object recognition information to be used when the voice confidence is greater than or equal to a second confidence threshold and less than the first confidence threshold; and
    determining the direction information as the object recognition information to be used when the voice confidence is less than the second confidence threshold.
  14. The computer device according to claim 13, wherein, when the voice confidence is greater than or equal to the second confidence threshold and less than the first confidence threshold, the direction information and the voiceprint feature information are jointly determined as the object recognition information to be used, and the computer-readable instructions, when executed by the processor, cause the processor, when performing the step of obtaining the object recognition result of the target object according to the object recognition information, to perform the following steps:
    obtaining a candidate recognition result of the target object according to the voiceprint feature information; and
    locating the object recognition result of the target object from the candidate recognition result using the direction information.
  15. A non-volatile computer-readable storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
    acquiring voice information of a target object in a current voice environment and direction information of the target object;
    performing voiceprint feature extraction on the voice information based on a trained voiceprint matching model, and obtaining voiceprint feature information corresponding to the voice information after the voiceprint feature extraction;
    acquiring a voice confidence corresponding to the voiceprint feature information; and
    obtaining an object recognition result of the target object based on the voice confidence and using the direction information and the voiceprint feature information.
  16. The computer-readable storage medium according to claim 15, wherein the computer-readable instructions, when executed by the processor, cause the processor, when performing the step of acquiring voice information of a target object in a current voice environment and direction information of the target object, to perform the following steps:
    acquiring a voice information set in the current voice environment based on a microphone array;
    filtering the voice information set to obtain the filtered voice information of the target object;
    acquiring phase information captured by the microphone array while collecting the voice information set; and
    determining the direction information of the target object based on a voice direction indicated by the phase information.
  17. The computer-readable storage medium according to claim 15, wherein the computer-readable instructions, when executed by the processor, cause the processor, before performing the step of acquiring voice information of a target object in a current voice environment and direction information of the target object, to further perform the following steps:
    acquiring a voiceprint training voice set; and
    training an established voiceprint matching model based on each voiceprint training voice in the voiceprint training voice set and sample feature information corresponding to the voiceprint training voices, to generate the trained voiceprint matching model.
  18. The computer-readable storage medium according to claim 17, wherein the computer-readable instructions, when executed by the processor, cause the processor, when performing the step of acquiring a voice confidence corresponding to the voiceprint feature information, to perform the following steps:
    matching the voiceprint feature information against the sample feature information corresponding to the voiceprint training voices, and obtaining a matching-degree value of the highest feature match; and
    determining the voice confidence corresponding to the voiceprint feature information according to the matching-degree value.
  19. The computer-readable storage medium according to claim 15, wherein the computer-readable instructions, when executed by the processor, cause the processor, when performing the step of obtaining an object recognition result of the target object based on the voice confidence and using the direction information and the voiceprint feature information, to perform the following steps:
    determining, based on a relationship between the voice confidence and preset voice confidence thresholds, object recognition information to be used from the direction information and the voiceprint feature information; and
    obtaining the object recognition result of the target object according to the object recognition information.
  20. The computer-readable storage medium according to claim 19, wherein the computer-readable instructions, when executed by the processor, cause the processor, when performing the step of determining, based on a relationship between the voice confidence and preset voice confidence thresholds, object recognition information to be used from the direction information and the voiceprint feature information, to perform the following steps:
    determining the voiceprint feature information as the object recognition information to be used when the voice confidence is greater than or equal to a first confidence threshold;
    determining the direction information and the voiceprint feature information jointly as the object recognition information to be used when the voice confidence is greater than or equal to a second confidence threshold and less than the first confidence threshold; and
    determining the direction information as the object recognition information to be used when the voice confidence is less than the second confidence threshold.
PCT/CN2018/103255 2017-10-23 2018-08-30 Object recognition method, computer device and computer-readable storage medium WO2019080639A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
JP2020522805A JP6938784B2 (ja) 2017-10-23 2018-08-30 Object identification method, computer device, and computer-device-readable storage medium
EP18870826.7A EP3614377B1 (en) 2017-10-23 2018-08-30 Object recognition method, computer device and computer readable storage medium
KR1020197038790A KR102339594B1 (ko) 2017-10-23 2018-08-30 Object recognition method, computer device and computer-readable storage medium
US16/663,086 US11289072B2 (en) 2017-10-23 2019-10-24 Object recognition method, computer device, and computer-readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710992605.7 2017-10-23
CN201710992605.7A CN108305615B (zh) 2017-10-23 2017-10-23 Object recognition method and device, storage medium, and terminal thereof

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/663,086 Continuation US11289072B2 (en) 2017-10-23 2019-10-24 Object recognition method, computer device, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2019080639A1 true WO2019080639A1 (zh) 2019-05-02

Family

ID=62869914

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/103255 WO2019080639A1 (zh) 2017-10-23 2018-08-30 Object recognition method, computer device and computer-readable storage medium

Country Status (6)

Country Link
US (1) US11289072B2 (zh)
EP (1) EP3614377B1 (zh)
JP (1) JP6938784B2 (zh)
KR (1) KR102339594B1 (zh)
CN (1) CN108305615B (zh)
WO (1) WO2019080639A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111951809A (zh) * 2019-05-14 2020-11-17 深圳子丸科技有限公司 Multi-person voiceprint identification method and system
CN112507294A (zh) * 2020-10-23 2021-03-16 重庆交通大学 English teaching system and teaching method based on human-computer interaction

Families Citing this family (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108305615B (zh) 2017-10-23 2020-06-16 腾讯科技(深圳)有限公司 Object recognition method and device, storage medium, and terminal thereof
CN107945815B (zh) 2017-11-27 2021-09-07 歌尔科技有限公司 Voice signal noise reduction method and device
CN108197449A (zh) 2017-12-27 2018-06-22 廖晓曦 Interrogation record device and system based on a mobile terminal, and record method thereof
CN112425157A (zh) 2018-07-24 2021-02-26 索尼公司 Information processing device and method, and program
CN110782622A (zh) 2018-07-25 2020-02-11 杭州海康威视数字技术股份有限公司 Security monitoring system, security detection method and apparatus, and electronic device
CN109256147B (zh) 2018-10-30 2022-06-10 腾讯音乐娱乐科技(深圳)有限公司 Audio beat detection method and apparatus, and storage medium
CN111199741A (zh) 2018-11-20 2020-05-26 阿里巴巴集团控股有限公司 Voiceprint recognition method, voiceprint verification method, apparatus, computing device, and medium
CN109346083A (zh) 2018-11-28 2019-02-15 北京猎户星空科技有限公司 Intelligent voice interaction method and apparatus, related device, and storage medium
CN111292733A (zh) 2018-12-06 2020-06-16 阿里巴巴集团控股有限公司 Voice interaction method and apparatus
CN109410956B (zh) 2018-12-24 2021-10-08 科大讯飞股份有限公司 Object recognition method, apparatus, and device for audio data, and storage medium
CN109903522A (zh) 2019-01-24 2019-06-18 珠海格力电器股份有限公司 Monitoring method and apparatus, storage medium, and household appliance
CN110058892A (zh) 2019-04-29 2019-07-26 Oppo广东移动通信有限公司 Electronic device interaction method and apparatus, electronic device, and storage medium
CN110082723B (zh) 2019-05-16 2022-03-15 浙江大华技术股份有限公司 Sound source localization method, apparatus, and device, and storage medium
CN110505504B (zh) 2019-07-18 2022-09-23 平安科技(深圳)有限公司 Video program processing method and apparatus, computer device, and storage medium
CN110491411B (zh) 2019-09-25 2022-05-17 上海依图信息技术有限公司 Method for separating speakers by combining microphone sound source angle and voice feature similarity
CN110767226B (zh) 2019-10-30 2022-08-16 山西见声科技有限公司 High-accuracy sound source localization method and apparatus, voice recognition method and system, storage device, and terminal
US11664033B2 (en) 2020-06-15 2023-05-30 Samsung Electronics Co., Ltd. Electronic apparatus and controlling method thereof
CN111916101B (zh) 2020-08-06 2022-01-21 大象声科(深圳)科技有限公司 Deep-learning noise reduction method and system fusing bone vibration sensor and dual-microphone signals
CN111904424B (zh) 2020-08-06 2021-08-24 苏州国科医工科技发展(集团)有限公司 Sleep monitoring and regulation system based on a phased-array microphone
CN111988426B (zh) 2020-08-31 2023-07-18 深圳康佳电子科技有限公司 Voiceprint-recognition-based communication method and apparatus, intelligent terminal, and storage medium
CN112233694B (zh) 2020-10-10 2024-03-05 中国电子科技集团公司第三研究所 Target recognition method and apparatus, storage medium, and electronic device
CN112530452B (zh) 2020-11-23 2024-06-28 北京海云捷迅科技股份有限公司 Post-filtering compensation method, apparatus, and system
CN112885370B (zh) 2021-01-11 2024-05-31 广州欢城文化传媒有限公司 Method and apparatus for detecting validity of a sound card
CN112820300B (zh) 2021-02-25 2023-12-19 北京小米松果电子有限公司 Audio processing method and apparatus, terminal, and storage medium
CN113113044B (zh) 2021-03-23 2023-05-09 北京小米松果电子有限公司 Audio processing method and apparatus, terminal, and storage medium
US11996087B2 (en) 2021-04-30 2024-05-28 Comcast Cable Communications, Llc Method and apparatus for intelligent voice recognition
CN113707173B (zh) 2021-08-30 2023-12-29 平安科技(深圳)有限公司 Audio-segmentation-based voice separation method, apparatus, and device, and storage medium
CN114863932A (zh) 2022-03-29 2022-08-05 青岛海尔空调器有限总公司 Working mode setting method and apparatus
CN114694635A (zh) 2022-03-29 2022-07-01 青岛海尔空调器有限总公司 Sleep scene setting method and apparatus
CN114999472A (zh) 2022-04-27 2022-09-02 青岛海尔空调器有限总公司 Air conditioner control method and apparatus, and air conditioner
CN115331673B (zh) 2022-10-14 2023-01-03 北京师范大学 Voiceprint-recognition household appliance control method and apparatus for complex sound scenes
CN116299179B (zh) 2023-05-22 2023-09-12 北京边锋信息技术有限公司 Sound source localization method, sound source localization apparatus, and readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009210956A (ja) * 2008-03-06 2009-09-17 National Institute Of Advanced Industrial & Technology Operation method, and operation device and program therefor
CN105280183A (zh) * 2015-09-10 2016-01-27 百度在线网络技术(北京)有限公司 Voice interaction method and system
CN106503513A (zh) * 2016-09-23 2017-03-15 北京小米移动软件有限公司 Voiceprint recognition method and apparatus
CN107221331A (zh) * 2017-06-05 2017-09-29 深圳市讯联智付网络有限公司 Voiceprint-based identity recognition method and device
CN107862060A (zh) * 2017-11-15 2018-03-30 吉林大学 Semantic recognition apparatus and recognition method for tracking a target person
CN108305615A (zh) * 2017-10-23 2018-07-20 腾讯科技(深圳)有限公司 Object recognition method and device, storage medium, and terminal thereof

Family Cites Families (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2991144B2 (ja) * 1997-01-29 1999-12-20 日本電気株式会社 Speaker recognition device
FR2761848B1 (fr) * 1997-04-04 2004-09-17 Parrot Sa Voice control device for a radiotelephone, in particular for use in a motor vehicle
US6751590B1 (en) * 2000-06-13 2004-06-15 International Business Machines Corporation Method and apparatus for performing pattern-specific maximum likelihood transformations for speaker recognition
JP2005122128A (ja) * 2003-09-25 2005-05-12 Fuji Photo Film Co Ltd Speech recognition system and program
JP4595364B2 (ja) * 2004-03-23 2010-12-08 ソニー株式会社 Information processing device and method, program, and recording medium
US20070219801A1 (en) * 2006-03-14 2007-09-20 Prabha Sundaram System, method and computer program product for updating a biometric model based on changes in a biometric feature of a user
JP4730404B2 (ja) * 2008-07-08 2011-07-20 ソニー株式会社 Information processing device, information processing method, and computer program
US8442824B2 (en) * 2008-11-26 2013-05-14 Nuance Communications, Inc. Device, system, and method of liveness detection utilizing voice biometrics
JP2010165305A (ja) * 2009-01-19 2010-07-29 Sony Corp Information processing device, information processing method, and program
US8265341B2 (en) * 2010-01-25 2012-09-11 Microsoft Corporation Voice-body identity correlation
US8606579B2 (en) * 2010-05-24 2013-12-10 Microsoft Corporation Voice print identification for identifying speakers
CN102270451B (zh) * 2011-08-18 2013-05-29 安徽科大讯飞信息科技股份有限公司 Speaker recognition method and system
US20130162752A1 (en) * 2011-12-22 2013-06-27 Advanced Micro Devices, Inc. Audio and Video Teleconferencing Using Voiceprints and Face Prints
US9401058B2 (en) * 2012-01-30 2016-07-26 International Business Machines Corporation Zone based presence determination via voiceprint location awareness
US9800731B2 (en) * 2012-06-01 2017-10-24 Avaya Inc. Method and apparatus for identifying a speaker
CN102930868A (zh) * 2012-10-24 2013-02-13 北京车音网科技有限公司 Identity recognition method and apparatus
EP2797078B1 (en) * 2013-04-26 2016-10-12 Agnitio S.L. Estimation of reliability in speaker recognition
US9711148B1 (en) * 2013-07-18 2017-07-18 Google Inc. Dual model speaker identification
US9922667B2 (en) * 2014-04-17 2018-03-20 Microsoft Technology Licensing, Llc Conversation, presence and context detection for hologram suppression
US20150302856A1 (en) * 2014-04-17 2015-10-22 Qualcomm Incorporated Method and apparatus for performing function by speech input
CN105321520A (zh) * 2014-06-16 2016-02-10 丰唐物联技术(深圳)有限公司 Voice control method and apparatus
US9384738B2 (en) * 2014-06-24 2016-07-05 Google Inc. Dynamic threshold for speaker verification
CN104219050B (zh) * 2014-08-08 2015-11-11 腾讯科技(深圳)有限公司 Voiceprint verification method, server, client, and system
US10262655B2 (en) * 2014-11-03 2019-04-16 Microsoft Technology Licensing, Llc Augmentation of key phrase user recognition
US10397220B2 (en) * 2015-04-30 2019-08-27 Google Llc Facial profile password to modify user account data for hands-free transactions
CN104935819B (zh) * 2015-06-11 2018-03-02 广东欧珀移动通信有限公司 Method for controlling camera shooting, and terminal
US10178301B1 (en) * 2015-06-25 2019-01-08 Amazon Technologies, Inc. User identification based on voice and face
US20180018973A1 (en) * 2016-07-15 2018-01-18 Google Inc. Speaker verification
US10026403B2 (en) * 2016-08-12 2018-07-17 Paypal, Inc. Location based voice association system
US20190182176A1 (en) * 2016-12-21 2019-06-13 Facebook, Inc. User Authentication with Voiceprints on Online Social Networks
CN106898355B (zh) * 2017-01-17 2020-04-14 北京华控智加科技有限公司 Speaker recognition method based on secondary modeling
CN106961418A (zh) * 2017-02-08 2017-07-18 北京捷通华声科技股份有限公司 Identity authentication method and identity authentication system
US10467510B2 (en) * 2017-02-14 2019-11-05 Microsoft Technology Licensing, Llc Intelligent assistant
CN107123421A (zh) * 2017-04-11 2017-09-01 广东美的制冷设备有限公司 Voice control method and apparatus, and household electrical appliance
US11250844B2 (en) * 2017-04-12 2022-02-15 Soundhound, Inc. Managing agent engagement in a man-machine dialog

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009210956A (ja) * 2008-03-06 2009-09-17 National Institute Of Advanced Industrial & Technology Operation method, and operation device and program therefor
CN105280183A (zh) * 2015-09-10 2016-01-27 百度在线网络技术(北京)有限公司 Voice interaction method and system
CN106503513A (zh) * 2016-09-23 2017-03-15 北京小米移动软件有限公司 Voiceprint recognition method and apparatus
CN107221331A (zh) * 2017-06-05 2017-09-29 深圳市讯联智付网络有限公司 Voiceprint-based identity recognition method and device
CN108305615A (zh) * 2017-10-23 2018-07-20 腾讯科技(深圳)有限公司 Object recognition method and device, storage medium, and terminal thereof
CN107862060A (zh) * 2017-11-15 2018-03-30 吉林大学 Semantic recognition apparatus and recognition method for tracking a target person

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3614377A4 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111951809A (zh) * 2019-05-14 2020-11-17 深圳子丸科技有限公司 Multi-person voiceprint identification method and system
CN112507294A (zh) * 2020-10-23 2021-03-16 重庆交通大学 English teaching system and teaching method based on human-computer interaction

Also Published As

Publication number Publication date
JP2021500616A (ja) 2021-01-07
JP6938784B2 (ja) 2021-09-22
CN108305615A (zh) 2018-07-20
US11289072B2 (en) 2022-03-29
CN108305615B (zh) 2020-06-16
KR20200012963A (ko) 2020-02-05
KR102339594B1 (ko) 2021-12-14
EP3614377B1 (en) 2022-02-09
EP3614377A4 (en) 2020-12-30
US20200058293A1 (en) 2020-02-20
EP3614377A1 (en) 2020-02-26

Similar Documents

Publication Publication Date Title
WO2019080639A1 (zh) Object recognition method, computer device, and computer-readable storage medium
Sahidullah et al. Introduction to voice presentation attack detection and recent advances
US10593336B2 (en) Machine learning for authenticating voice
US10957339B2 (en) Speaker recognition method and apparatus, computer device and computer-readable medium
US9899025B2 (en) Speech recognition system adaptation based on non-acoustic attributes and face selection based on mouth motion using pixel intensities
WO2020181824A1 (zh) Voiceprint recognition method, apparatus, and device, and computer-readable storage medium
US10242677B2 (en) Speaker dependent voiced sound pattern detection thresholds
JP2007156422A (ja) Biometric recognition method, biometric recognition system, and program
JP2006079079A (ja) Distributed speech recognition system and method therefor
US11222641B2 (en) Speaker recognition device, speaker recognition method, and recording medium
CN109215634A (zh) Method and system for multi-word voice control of an on-off device
US9953633B2 (en) Speaker dependent voiced sound pattern template mapping
WO2018095167A1 (zh) Voiceprint recognition method and voiceprint recognition system
KR20230116886A (ko) Self-supervised speech representation for fake audio detection
CN111081223A (zh) Voice recognition method, apparatus, device, and storage medium
EP3816996B1 (en) Information processing device, control method, and program
WO2017113370A1 (zh) Voiceprint detection method and apparatus
US11763806B1 (en) Speaker recognition adaptation
GB2576960A (en) Speaker recognition
CN112992175B (zh) Voice distinguishing method and voice recording device thereof
WO2022166220A1 (zh) Voice analysis method and voice recording device thereof
CN113077784B (zh) Intelligent voice device for role recognition
JP7511374B2 (ja) Speech segment detection device, speech recognition device, speech segment detection system, speech segment detection method, and speech segment detection program
CN114512133A (zh) Speaking object recognition method and apparatus, server, and storage medium
Tahliramani et al. Performance Analysis of Speaker Identification System With and Without Spoofing Attack of Voice Conversion

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18870826

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2018870826

Country of ref document: EP

Effective date: 20191119

ENP Entry into the national phase

Ref document number: 20197038790

Country of ref document: KR

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 2020522805

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE