WO2021159688A1 - Voiceprint recognition method and apparatus, and storage medium and electronic apparatus - Google Patents


Info

Publication number
WO2021159688A1
Authority
WO
WIPO (PCT)
Application number
PCT/CN2020/111370
Other languages
French (fr)
Chinese (zh)
Inventor
郜开开
吴信朝
周宝
陈远旭
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2021159688A1


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
        • G10L17/06: Decision making techniques; Pattern matching strategies
        • G10L17/14: Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L15/00: Speech recognition
        • G10L15/08: Speech classification or search
        • G10L15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress-induced speech
        • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
            • G10L2015/225: Feedback of the input speech

Definitions

  • This application relates to the field of artificial intelligence, and specifically to a voiceprint recognition method, device, storage medium, and electronic device.
  • A voiceprint is the sound-wave spectrum, displayed by an electro-acoustic instrument, that carries verbal information.
  • The vocal organs used in speech (tongue, teeth, throat, lungs, and nasal cavity) vary greatly in size and shape from person to person, so no two people's voiceprint patterns are identical.
  • The acoustic characteristics of each person's voice are both relatively stable and variable; they are not absolute or static.
  • Voiceprint recognition is also known as speaker recognition.
  • Speaker recognition takes two forms: speaker identification and speaker verification.
  • The former determines which of several people uttered a given speech segment (a "multiple-choice" problem), while the latter confirms whether a given speech segment was spoken by a designated person (a "one-to-one discrimination" problem).
  • Identification techniques may be needed, for example, to narrow the scope of a criminal investigation, while verification techniques are needed for bank transactions.
  • The recognition process of a typical voice recognition system generally involves the following steps: voice-signal collection and quantization, preprocessing, signal feature extraction, and template-matching recognition.
  • Existing voiceprint recognition applications are mostly intelligent security and public-security systems; the technology has not been applied to robots dynamically recognizing the people they interact with.
  • The embodiments of this application provide a voiceprint recognition method, device, storage medium, and electronic device, to at least solve the prior-art technical problem that, in scenes with strong interference such as multi-person conversation, the dialogue between a robot and the speaker issuing an instruction is interrupted or aborted.
  • According to one aspect of the embodiments, a voiceprint recognition method is provided, including: monitoring in real time whether a wake-up word voice is received; when it is determined that the wake-up word voice is received, extracting the voiceprint feature of the wake-up word voice and recording it into a voiceprint library; extracting the voiceprint feature of the current voice signal monitored in real time; comparing the voiceprint feature of the current voice signal with each voiceprint feature stored in the voiceprint library; and, if an identical voiceprint feature is matched, performing semantic recognition on the current voice signal and giving feedback.
  • According to another aspect, a voiceprint recognition device is provided, including: a monitoring module for monitoring in real time whether a wake-up word voice is received; a first extraction module for extracting the voiceprint feature of the wake-up word voice when it is determined that the wake-up word voice is received, and recording the voiceprint feature into the voiceprint library; a second extraction module for extracting the voiceprint feature of the current voice signal monitored in real time; a comparison module for comparing whether the voiceprint feature of the current voice signal is identical to any voiceprint feature stored in the voiceprint library; and a recognition module for performing semantic recognition on the current voice signal and giving feedback if an identical voiceprint feature is matched.
  • According to another aspect, a storage medium is provided in which a computer program is stored, the computer program being configured to perform the following steps when run.
  • According to another aspect, an electronic device is provided, including a memory and a processor, the memory storing a computer program and the processor being configured to run the computer program to perform the following steps.
  • The steps are: monitoring in real time whether the wake-up word voice is received; when it is determined that the wake-up word voice is received, extracting the voiceprint feature of the wake-up word voice and recording it into the voiceprint library; extracting the voiceprint feature of the current voice signal monitored in real time; comparing whether the voiceprint feature of the current voice signal is identical to any voiceprint feature stored in the voiceprint library; and, if an identical voiceprint feature is matched, performing semantic recognition on the current voice signal and giving feedback. This solves the prior-art technical problem that the dialogue between a robot and the speaker issuing an instruction is interrupted or aborted in scenes with strong interference such as multi-person conversation, and achieves the technical effect that the dialogue with the speaker issuing the instruction can still be maintained in scenes with strong background-sound interference.
  • Fig. 1 is a flowchart of a voiceprint recognition method according to an embodiment of the present application.
  • Fig. 2 is a schematic diagram of the dynamic time warping path of the voiceprint recognition method according to an embodiment of the present application.
  • Fig. 3 is a schematic diagram of a voiceprint recognition device according to an embodiment of the present application.
  • Fig. 4 is a block diagram of the hardware structure of an electronic device according to an embodiment of the present application.
  • This embodiment provides a voiceprint recognition method that can be applied to electronic devices with a voice receiver, such as mobile terminals (mobile phones and tablet computers), computer equipment, and smart home appliances, to recognize the speaker's identity.
  • The voiceprint recognition method can be applied in the field of Artificial Intelligence (AI), for example in application scenarios such as smart home appliances, robots, and voice assistants, to identify the speaker.
  • Semantic recognition can then be performed on the specific content of the speaker's utterance in order to carry out the corresponding interaction.
  • FIG. 1 is a schematic flowchart of an optional voiceprint recognition method provided in this embodiment. As shown in FIG. 1, the voiceprint recognition method provided in this embodiment includes the following steps:
  • Step 101: Monitor in real time whether the wake-up word voice is received.
  • The executing party of this embodiment has a sound receiver that monitors received sound, including human voices, in real time, converts the voice into text, and then judges whether the converted text includes the specified wake-up word.
  • Voice-to-text is a speech recognition procedure that converts speech into text; speech-to-text conversion in the related prior art can generally recognize an ordinary person's voice and convert it into text.
  • Speech recognition can convert all received sound signals to determine the specific natural-language content corresponding to the received sound. If a person speaks the wake-up word and recognition succeeds, the voice is converted into text, yielding the wake-up word text; the executing party of this embodiment thereby determines that the wake-up word text, i.e. the wake-up word voice, has been received.
  • For example, the wake-up word can be pre-designated as "voice assistant".
  • When the executing party of this embodiment performs voice recognition on the received sound and determines that the text corresponding to the voice is "voice assistant", it determines that the wake-up word voice has been received.
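A minimal sketch of the Step 101 wake-word check. The ASR front end is assumed to exist and is not shown here; the function name is illustrative, and the wake word "voice assistant" comes from the example above.

```python
WAKE_WORD = "voice assistant"  # pre-designated wake word from the example above

def contains_wake_word(transcript: str, wake_word: str = WAKE_WORD) -> bool:
    """Return True when the recognised text includes the wake-up word."""
    return wake_word in transcript.lower()

# The speech-to-text step is assumed; mock transcripts stand in for it here.
print(contains_wake_word("Hey Voice Assistant, what's the weather?"))  # True
print(contains_wake_word("just background chatter"))                   # False
```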
  • Step 102: When it is determined that the wake-up word voice is received, extract the voiceprint feature of the wake-up word voice and record it into the voiceprint library.
  • The voiceprint feature can be represented by a vector; the voiceprint feature vector corresponding to a sound can thus be obtained.
  • The voiceprint feature vector is obtained by the voiceprint feature extraction process.
  • Acoustic features at different times can be extracted from the voice to be recognized to form a feature vector sequence, which constitutes the speaker's voiceprint feature.
  • An optional implementation manner for extracting voiceprint features includes the following steps:
  • Step 11: Preprocess the sound signal containing the wake-up word speech, for example by normalization, pre-emphasis, endpoint detection, windowing, and framing, using preprocessing methods from the related prior art.
  • Endpoint detection can likewise adopt relevant existing techniques.
  • Step 12: Extract the acoustic features from the preprocessed voice signal to form the first speaker's feature vector sequence for storage.
  • Step 13: Store the feature vector sequence in the voiceprint library.
  • Here, a feature vector sequence includes multiple vectors in an ordered arrangement, where each vector can be multi-dimensional.
  • The feature vector sequence expresses the acoustic features in a language the machine can recognize (numeric vectors).
  • Acoustic features can be extracted using extraction methods from the related prior art, for example by modeling with a Hidden Markov Model (HMM), or with a Gaussian Mixture Model (GMM) and Universal Background Model (UBM), to obtain the feature vector sequence.
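Steps 11 and 12 can be sketched as follows. This is a toy illustration: the pre-emphasis and framing match the preprocessing named above, but the per-frame features used here (log-energy and zero-crossing rate) are simple stand-ins for the HMM or GMM-UBM features the text describes; all function names are ours.

```python
import math

def pre_emphasis(signal, alpha=0.97):
    """Boost high frequencies: y[n] = x[n] - alpha * x[n-1]."""
    return [signal[0]] + [signal[i] - alpha * signal[i - 1]
                          for i in range(1, len(signal))]

def frame_signal(signal, frame_len, hop):
    """Split the signal into overlapping frames of frame_len samples."""
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]

def frame_features(frames):
    """Toy per-frame features: (log-energy, zero-crossing rate).
    A real system would extract MFCCs or GMM/UBM statistics instead."""
    feats = []
    for f in frames:
        energy = sum(s * s for s in f) + 1e-12
        zcr = sum(1 for a, b in zip(f, f[1:]) if a * b < 0) / len(f)
        feats.append((math.log(energy), zcr))
    return feats

# 100 ms of a 440 Hz tone at 8 kHz as a stand-in for the wake-word audio.
signal = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(800)]
frames = frame_signal(pre_emphasis(signal), frame_len=200, hop=100)
features = frame_features(frames)
print(len(frames), len(features[0]))  # 7 2
```

The resulting `features` list is the "feature vector sequence" of Step 13, ready to be stored in the voiceprint library.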
  • For example, the executing party of this embodiment extracts and saves speaker A's voiceprint feature and, after subsequent voices are received, uses the saved voiceprint feature of speaker A to distinguish whether a received voice comes from speaker A; if it does, the corresponding voice command is executed or a dialogue is held with speaker A.
  • The voiceprint feature extracted from the wake-up word voice is stored as the voiceprint feature of the first speaker (that is, a speaker expected to input a voice command or dialogue).
  • The voiceprint library stores the voiceprint feature of any speaker who utters the wake-up word voice, to serve as the comparison basis for subsequently received voice signals. If a speaker's voiceprint feature in the library is not matched within a preset time (for example, 20 s), the speaker is assumed to have ended the dialogue with no further instructions or dialogue, so the voiceprint feature of any speaker from whom no voice signal is received again within the preset time is deleted.
  • An optional implementation is that, if no voice signal from the first speaker (referring to any given speaker) is received within the preset time, the stored voiceprint feature of the first speaker is deleted.
  • For example, after speaker A utters the wake-up word, speaker A's voiceprint feature is stored in the voiceprint library. If speaker A makes no further sound within 20 s, i.e. speaker A's voice cannot be recognized in the received audio, speaker A's voiceprint feature is deleted from the library. If, within those 20 s, another speaker B utters the wake-up word, speaker B's voiceprint feature is also stored; at that moment the library holds at least the voiceprint features of both speaker A and speaker B. The rationale is that any speaker who utters the wake-up word is assumed to want to interact with the executing party of this embodiment.
  • Therefore, the voiceprint feature of a speaker who utters the wake-up word is temporarily stored in the voiceprint library. If the voiceprint feature of a received voice is successfully matched with that of any speaker in the library, the speaker's identity is matched and a corresponding response can be made; otherwise, the voice is not processed. Because the interaction should be relatively continuous, if no dialogue from a speaker is received for a long time, the speaker is assumed to have ended the dialogue, and that speaker's voiceprint feature is deleted from the library. If the speaker utters the wake-up word again, the speaker's voiceprint feature is extracted and stored in the library once more.
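The temporary voiceprint library with expiry described above might be organized like this. A hedged sketch: the class and method names are ours, the 20 s TTL comes from the example above, and Euclidean distance stands in for the real voiceprint matching score.

```python
import time

class VoiceprintLibrary:
    """Temporary store of speaker voiceprints; an entry expires after
    `ttl` seconds of silence (the example above uses 20 s)."""

    def __init__(self, ttl=20.0):
        self.ttl = ttl
        self._entries = {}  # speaker_id -> (feature_vector, last_heard)

    def enroll(self, speaker_id, feature, now=None):
        """Store (or refresh) a speaker's voiceprint after a wake word."""
        self._entries[speaker_id] = (feature,
                                     now if now is not None else time.time())

    def purge_expired(self, now=None):
        """Drop speakers not heard within the last `ttl` seconds."""
        now = now if now is not None else time.time()
        self._entries = {k: v for k, v in self._entries.items()
                         if now - v[1] <= self.ttl}

    def match(self, feature, threshold, now=None):
        """Return the closest enrolled speaker within `threshold`,
        refreshing that speaker's last-heard time; None if no match."""
        self.purge_expired(now)
        best_id, best_dist = None, threshold
        for sid, (ref, _) in self._entries.items():
            dist = sum((a - b) ** 2 for a, b in zip(feature, ref)) ** 0.5
            if dist < best_dist:
                best_id, best_dist = sid, dist
        if best_id is not None:
            self.enroll(best_id, self._entries[best_id][0], now)
        return best_id

lib = VoiceprintLibrary(ttl=20.0)
lib.enroll("A", [1.0, 0.0], now=0.0)   # speaker A says the wake word
lib.enroll("B", [0.0, 1.0], now=5.0)   # speaker B says it too
print(lib.match([0.9, 0.1], threshold=0.5, now=10.0))  # A
print(lib.match([0.9, 0.1], threshold=0.5, now=40.0))  # None (A expired)
```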
  • Step 103: Extract the voiceprint feature of the current voice signal monitored in real time.
  • The voice signal received by the executing party of this embodiment may be from a speaker who has uttered the wake-up word and wishes to hold a dialogue or issue a voice instruction, or it may be other surrounding speech.
  • the voiceprint feature of the received voice signal is extracted to compare with the voiceprint feature stored in the voiceprint library.
  • the specific method of voiceprint feature extraction is the same as the specific method used in step 102, which will not be repeated here.
  • the voiceprint feature of the current voice signal can be extracted to obtain the voiceprint feature vector of the current voice signal.
  • Step 104: Compare whether the voiceprint feature of the current voice signal is identical to any voiceprint feature stored in the voiceprint library.
  • Once the voiceprint feature of the current voice signal is determined, it is matched one by one against the stored voiceprint features of all speakers to determine whether it matches any speaker's voiceprint feature.
  • For example, the log-likelihood score (or likelihood score) of the feature vector sequence of the current speech signal against each speaker's voiceprint model can be calculated, and a match with the corresponding speaker's voiceprint model is judged by whether the score exceeds a preset threshold.
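The log-likelihood scoring just described can be illustrated with a single diagonal Gaussian per speaker; this is a deliberate simplification of the GMM or GMM-UBM models mentioned earlier, and the model parameters, frame values, and threshold below are all invented for the example.

```python
import math

def diag_gauss_loglik(x, mean, var):
    """Log-likelihood of one frame vector under a diagonal Gaussian."""
    ll = 0.0
    for xi, mi, vi in zip(x, mean, var):
        ll += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return ll

def average_loglik(frames, mean, var):
    """Average per-frame log-likelihood of a feature vector sequence
    against one speaker's (single-Gaussian) voiceprint model."""
    return sum(diag_gauss_loglik(f, mean, var) for f in frames) / len(frames)

# Toy voiceprint model for one speaker (a real system would fit a GMM).
mean, var = [0.0, 0.0], [1.0, 1.0]
close = [[0.1, -0.2], [0.0, 0.3]]   # frames resembling the model
far = [[4.0, 4.0], [5.0, -4.0]]     # frames from a different voice
threshold = -5.0
print(average_loglik(close, mean, var) > threshold)  # True  -> match
print(average_loglik(far, mean, var) > threshold)    # False -> no match
```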
  • Alternatively, a feature vector method combined with dynamic time warping (DTW) may be used.
  • The basic principle of dynamic time warping is to use dynamic programming to decompose a complex global optimization problem into multiple simple local optimization problems, making decisions step by step. It mainly addresses the problem that feature-parameter vector sequences are misaligned in time because the duration of each phoneme varies across utterances of the sound signal.
  • The DTW method warps the pronunciation of each phoneme in time, compressing or stretching the vector to be compared to the same length as the template.
  • the method of comparing two voiceprint features by means of dynamic time warping includes the following steps:
  • Step 21: Recognize each pronunciation phoneme in the current voice signal with a speech recognition method.
  • A pronunciation phoneme is the basic unit of pronunciation and the smallest phonetic unit divided according to the natural attributes of speech. It is analyzed according to the articulatory actions within a syllable, one action constituting one phoneme. Phonemes fall into two categories, vowels and consonants. For Chinese, for example, the pronunciation phonemes may include each initial and each final. Using speech recognition, each pronunciation phoneme in the speech signal can be determined.
  • Step 22: Extract the feature vector sequence of the voiceprint signal corresponding to each pronunciation phoneme in the current speech signal.
  • The feature vector sequence here is based on time frames: the sound segment whose voiceprint features are to be extracted is divided into multiple time frames at a fixed interval, the voiceprint feature vector is extracted from each frame's signal, and the resulting sequence contains one feature vector per time interval.
  • Specifically, the voiceprint signal within the start and end times of a pronunciation phoneme is cut out of the current speech signal to obtain the voiceprint signal of that phoneme as uttered by the speaker. The phoneme is then divided into multiple frames at a preset time interval, the feature vector of each frame is extracted, and a feature vector sequence for the pronunciation phoneme is obtained.
  • The feature vector sequence includes the feature vectors of all frames, sorted in time.
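The per-phoneme slicing and framing of Step 22 could look like the following sketch, assuming the phoneme's start and end times are already known from Step 21's recognition; the 25 ms frame and 10 ms hop are conventional illustrative values, not taken from the text.

```python
def phoneme_frames(signal, sample_rate, start_s, end_s,
                   frame_s=0.025, hop_s=0.010):
    """Cut the voiceprint signal of one pronunciation phoneme out of the
    utterance by its start/end time, then split it into overlapping frames."""
    seg = signal[int(start_s * sample_rate):int(end_s * sample_rate)]
    frame_len = int(frame_s * sample_rate)
    hop = int(hop_s * sample_rate)
    frames = []
    start = 0
    while start + frame_len <= len(seg):
        frames.append(seg[start:start + frame_len])
        start += hop
    return frames

# 1 s of audio at 8 kHz; a phoneme assumed to span 0.10 s to 0.30 s.
signal = [0.0] * 8000
frames = phoneme_frames(signal, 8000, 0.10, 0.30)
print(len(frames), len(frames[0]))  # 18 200
```

Each frame would then be passed through the feature extractor to produce the phoneme's feature vector sequence.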
  • Step 23: Calculate the minimum distance between the feature vector sequence of each pronunciation phoneme of the current voice signal and the feature vector sequence of the corresponding pronunciation phoneme of the first speaker's voice signal.
  • The feature vector sequences of the first speaker's different pronunciation phonemes are stored in the voiceprint library and retrieved when a comparison is needed.
  • The method of generating the feature vector sequences of the first speaker's different pronunciation phonemes is the same as the above method of determining the feature vector sequences of the current speech signal's pronunciation phonemes, and is not repeated here.
  • Finding the minimum distance between the feature vector sequence of the current speech signal (the voice to be determined) and a stored sequence is equivalent to minimizing a function of the distance between two vector sequences, where the distance between two vector sequences can be computed by calculating the distances between individual frame vectors xi and yj and summing them. For example, x1 through x5 are each compared with y3; if the vector distance between x2 and y3 is closest, x2 is determined to align with y3, and it is then judged which element of sequence X aligns with y4. Note that, because the voice signal is continuous, the order of utterance is the same even though durations differ.
  • In Fig. 2, the abscissa represents the feature vector of each frame of Y, and the ordinate represents the feature vector of each frame of the reference template X.
  • Horizontal and vertical lines are drawn to form a grid, where each intersection represents the distance between one frame vector of Y and one frame vector of the reference template.
  • The DTW algorithm finds a path through the intersections of this grid such that the total distance between X and Y is smallest (the bending line in Fig. 2). The bending path is, of course, not chosen arbitrarily.
  • The phonemes of a sound may be uttered quickly or slowly, but their order does not change, so the path must start at the lower-left corner and end at the upper-right corner.
  • Furthermore, the slope of the path cannot be arbitrary.
  • The maximum slope of the path can be determined by comparing the durations of the two sound signals. If the slope were unlimited, alignment errors could occur, for example a later element of Y being aligned with an earlier element of the X sequence. Limiting the slope of the path avoids this problem; for example, the maximum slope can be set to 2 and the minimum to 0.5.
  • This yields the diamond-shaped search range shown in Fig. 2.
  • Step 24: Determine whether the minimum distance is less than a preset threshold; if so, determine that the current voice signal is a voice signal from the first speaker.
  • In this way, the duration of each pronunciation phoneme in the sound signal is compressed or stretched along the time dimension to the duration of the first speaker's corresponding pronunciation phoneme; that is, the phoneme is warped in time so that its duration equals that of the first speaker's phoneme.
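The DTW distance at the heart of Steps 21 to 24 can be sketched with the minimal textbook recurrence below. For brevity it uses the unconstrained step pattern; a production system would additionally restrict the path slope (for example between 0.5 and 2) as described above, and the scalar sequences here stand in for per-frame feature vectors.

```python
def dtw_distance(seq_x, seq_y, dist=lambda a, b: abs(a - b)):
    """Dynamic time warping distance between two sequences.
    The warping path runs from the lower-left corner (start of both
    sequences) to the upper-right corner (end of both sequences)."""
    n, m = len(seq_x), len(seq_y)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(seq_x[i - 1], seq_y[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # stretch X in time
                                 D[i][j - 1],      # stretch Y in time
                                 D[i - 1][j - 1])  # diagonal step
    return D[n][m]

# A template and the same "utterance" spoken more slowly:
template = [1.0, 3.0, 2.0]
slow = [1.0, 1.0, 3.0, 3.0, 2.0]
print(dtw_distance(template, slow))  # 0.0 - DTW absorbs the tempo change
```

The returned minimum distance would then be compared against the preset threshold of Step 24.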
  • Step 105: If an identical voiceprint feature is matched, perform semantic recognition on the current voice signal and give feedback.
  • Optionally, the receiving time of the current speech signal is stored, and feedback is given on the semantic content of the current speech signal.
  • The stored receiving time serves as the basis for judging whether the interval before the next round of voice signals exceeds the preset duration.
  • The method of the above embodiment can be implemented by software plus the necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation.
  • The technical solution of this application, essentially or in the part that contributes beyond the existing technology, can be embodied as a software product stored on a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and including several instructions that enable a terminal device (a mobile phone, computer, server, network device, etc.) to execute the method described in each embodiment of this application.
  • a voiceprint recognition device is also provided, which is used to implement the above-mentioned embodiment 1 and its preferred embodiments.
  • A module, as used below, is a combination of software and/or hardware that can implement predetermined functions.
  • Although the devices described in the following embodiments are preferably implemented in software, implementations in hardware or a combination of software and hardware are also conceivable.
  • FIG. 3 is a schematic diagram of a voiceprint recognition device according to an embodiment of the present application. As shown in FIG. 3, the device includes: a monitoring module 10, a first extraction module 20, a second extraction module 30, a comparison module 40, and a recognition module 50.
  • the monitoring module is used to monitor whether the wake-up word voice is received in real time;
  • the first extraction module is used to extract the voiceprint feature of the wake-up word voice when it is determined that the wake-up word voice is received, and to record the voiceprint feature into the voiceprint library;
  • the second extraction module is used to extract the voiceprint feature of the current voice signal monitored in real time;
  • the comparison module is used to compare whether the voiceprint feature of the current voice signal is the same as any voiceprint feature stored in the voiceprint library;
  • the recognition module is used to perform semantic recognition and feedback on the current voice signal if the same voiceprint feature is matched.
  • Optionally, the device further includes: a judging module for judging whether, for each voiceprint feature in the library, the corresponding speaker has spoken again within a preset time after last speaking; a first deletion module for deleting the corresponding voiceprint feature if not; and an update module for updating the time at which the speaker corresponding to a voiceprint feature last spoke.
  • Optionally, the first extraction module includes: a preprocessing unit configured to preprocess the sound signal containing the wake-up word speech; an extraction unit configured to extract the acoustic features to obtain a feature vector sequence representing the voiceprint feature; and a storage unit configured to store the feature vector sequence in the voiceprint library.
  • Optionally, the voiceprint feature is represented by a feature vector sequence based on time frames.
  • Optionally, the comparison module includes: a recognition unit for recognizing each pronunciation phoneme in the current voice signal; a second extraction unit for extracting the feature vector sequence of each pronunciation phoneme in the current voice signal; a calculation unit for calculating the minimum distance between the feature vector sequence of each pronunciation phoneme of the current voice signal and the feature vector sequence of the corresponding pronunciation phoneme of the first voiceprint feature stored in the voiceprint library; and a determination unit for determining whether the minimum distance is less than a preset threshold and, if so, determining that the current voice signal matches the first voiceprint feature.
  • each of the above modules can be implemented by software or hardware.
  • Optionally, the above modules may be implemented in the following manner, without being limited to it: all of the modules are located in the same processor; or the modules, in any combination, are located in different processors.
  • The modules or steps of this application can be implemented by a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network of multiple computing devices.
  • They can be implemented with program code executable by a computing device, so that the code can be stored in a storage device and executed by the computing device; in some cases, the steps may be executed in a different order than described here.
  • the embodiment of the present application also provides a computer-readable storage medium in which a computer program is stored, wherein the computer program is configured to execute the steps in any one of the foregoing method embodiments when running.
  • The foregoing computer-readable storage medium may include, but is not limited to: USB flash drives, Read-Only Memory (ROM), Random Access Memory (RAM), removable hard disks, magnetic disks, optical discs, and other media that can store a computer program.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the embodiment of the present application also provides an electronic device, including a memory and a processor, the memory is stored with a computer program, and the processor is configured to run the computer program to execute the steps in any of the foregoing method embodiments.
  • the aforementioned electronic device may further include a transmission device and an input-output device, wherein the transmission device is connected to the aforementioned processor, and the input-output device is connected to the aforementioned processor.
  • FIG. 4 is a hardware structure block diagram of an electronic device according to an embodiment of the present application.
  • The electronic device may include one or more processors 302 (only one is shown in Fig. 4; the processor 302 may include, but is not limited to, a processing device such as a microcontroller (MCU) or a field-programmable gate array (FPGA)) and a memory 304 for storing data.
  • the above electronic device may also include a transmission device 306 and an input/output device 308 for communication functions.
  • the structure shown in FIG. 4 is only for illustration, and does not limit the structure of the above electronic device.
  • the electronic device may also include more or fewer components than shown in FIG. 4, or have a different configuration from that shown in FIG.
  • The memory 304 may be used to store computer programs, for example software programs and modules of application software, such as the computer program corresponding to the voiceprint recognition method in the embodiments of the present application. By running the computer programs stored in the memory 304, the processor 302 performs various functional applications and data processing, that is, implements the above method.
  • the memory 304 may include a high-speed random access memory, and may also include a non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
  • the memory 304 may further include a memory remotely provided with respect to the processor 302, and these remote memories may be connected to the electronic device through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
  • the transmission device 306 is used to receive or send data via a network.
  • the above-mentioned specific examples of the network may include a wireless network provided by a communication provider of an electronic device.
  • In one example, the transmission device 306 includes a Network Interface Controller (NIC), which can connect to other network devices through a base station to communicate with the Internet.
  • In one example, the transmission device 306 may be a Radio Frequency (RF) module, which is used to communicate with the Internet wirelessly.


Abstract

Disclosed are a voiceprint recognition method and apparatus, and a storage medium and an electronic apparatus. The method comprises: detecting, in real time, whether a wake-up word voice is received (101); when it is determined that the wake-up word voice is received, extracting a voiceprint feature of the wake-up word voice and entering the voiceprint feature into a voiceprint library (102); extracting a voiceprint feature of the current voice signal detected in real time (103); comparing whether the voiceprint feature of the current voice signal is identical to any voiceprint feature stored in the voiceprint library (104); and, if an identical voiceprint feature is matched, performing semantic recognition on the current voice signal and giving feedback (105). This solves the technical problem in the prior art that, in scenarios with strong interference such as multi-person conversation, the dialogue between a robot and the speaker issuing an instruction is interrupted or suspended, and achieves the technical effect that the dialogue with the speaker issuing the instruction can still be maintained in scenarios with strong background-sound interference.

Description

Voiceprint recognition method, apparatus, storage medium, and electronic apparatus
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on February 13, 2020, with application number 202010090868.0 and titled "Voiceprint recognition method, apparatus, storage medium, and electronic apparatus", the entire content of which is incorporated herein by reference.
Technical field
This application relates to the field of artificial intelligence, and in particular to a voiceprint recognition method, apparatus, storage medium, and electronic apparatus.
Background
A voiceprint is a spectrum of sound waves carrying speech information, as displayed by an electro-acoustic instrument. The vocal organs used in speech (the tongue, teeth, larynx, lungs, and nasal cavity) differ greatly in size and shape from person to person, so no two people have identical voiceprint patterns. The acoustic characteristics of each person's voice are relatively stable yet also variable; they are neither absolute nor immutable.
Voiceprint recognition, also known as speaker recognition, comes in two types: speaker identification and speaker verification. The former determines which of several people uttered a given segment of speech, a "one-of-many" problem; the latter confirms whether a given segment of speech was uttered by a designated person, a "one-to-one" decision. Different tasks and applications use different voiceprint recognition techniques; for example, identification may be needed to narrow the scope of a criminal investigation, while verification is needed for bank transactions. The recognition process of a typical speaker recognition system generally involves the following steps: acquisition and quantization of the sound signal, preprocessing, extraction of signal features, and template matching.
Existing voiceprint recognition applications are mostly found in intelligent security and public security systems; the technology has not yet been applied to a robot's dynamic recognition of its interlocutor. The inventors realized that in current scenarios a robot frequently runs into two problems: (1) when bystanders chat, if they are close enough or loud enough for the robot to detect their speech, the robot starts an interactive chat with the person at that sound source; (2) while the robot is conversing with a person, if sound from another source is recognized, the conversation is disrupted, interrupted, or even terminated.
Summary of the invention
The embodiments of this application provide a voiceprint recognition method, apparatus, storage medium, and electronic apparatus, so as to at least solve the technical problem in the prior art that, in scenarios with strong interference such as multi-person conversation, the dialogue between a robot and the speaker issuing an instruction is interrupted or terminated.
According to an embodiment of this application, a voiceprint recognition method is provided, including: monitoring in real time whether a wake-up word voice is received; when it is determined that the wake-up word voice is received, extracting the voiceprint feature of the wake-up word voice and recording the voiceprint feature into a voiceprint library; extracting the voiceprint feature of the current voice signal monitored in real time; comparing whether the voiceprint feature of the current voice signal is identical to any voiceprint feature stored in the voiceprint library; and, if an identical voiceprint feature is matched, performing semantic recognition on the current voice signal and giving feedback.
According to another embodiment of this application, a voiceprint recognition apparatus is provided, including: a monitoring module, configured to monitor in real time whether a wake-up word voice is received; a first extraction module, configured to extract the voiceprint feature of the wake-up word voice when it is determined that the wake-up word voice is received, and to record the voiceprint feature into a voiceprint library; a second extraction module, configured to extract the voiceprint feature of the current voice signal monitored in real time; a comparison module, configured to compare whether the voiceprint feature of the current voice signal is identical to any voiceprint feature stored in the voiceprint library; and a recognition module, configured to perform semantic recognition on the current voice signal and give feedback if an identical voiceprint feature is matched.
According to yet another embodiment of this application, a storage medium is further provided, the storage medium storing a computer program, where the computer program is configured to perform the following steps when run:
Monitoring in real time whether a wake-up word voice is received;
When it is determined that the wake-up word voice is received, extracting the voiceprint feature of the wake-up word voice, and recording the voiceprint feature into a voiceprint library;
Extracting the voiceprint feature of the current voice signal monitored in real time;
Comparing whether the voiceprint feature of the current voice signal is identical to any voiceprint feature stored in the voiceprint library;
If an identical voiceprint feature is matched, performing semantic recognition on the current voice signal and giving feedback.
According to yet another embodiment of this application, an electronic apparatus is further provided, including a memory and a processor, the memory storing a computer program, where the processor is configured to run the following steps:
Monitoring in real time whether a wake-up word voice is received;
When it is determined that the wake-up word voice is received, extracting the voiceprint feature of the wake-up word voice, and recording the voiceprint feature into a voiceprint library;
Extracting the voiceprint feature of the current voice signal monitored in real time;
Comparing whether the voiceprint feature of the current voice signal is identical to any voiceprint feature stored in the voiceprint library;
If an identical voiceprint feature is matched, performing semantic recognition on the current voice signal and giving feedback.
Through this application, by monitoring in real time whether a wake-up word voice is received; extracting, when it is determined that the wake-up word voice is received, the voiceprint feature of the wake-up word voice and recording it into a voiceprint library; extracting the voiceprint feature of the current voice signal monitored in real time; comparing whether the voiceprint feature of the current voice signal is identical to any voiceprint feature stored in the voiceprint library; and performing semantic recognition on the current voice signal and giving feedback if an identical voiceprint feature is matched, the technical problem in the prior art that the dialogue between a robot and the speaker issuing an instruction is interrupted or terminated in scenarios with strong interference such as multi-person conversation is solved, achieving the technical effect that the dialogue with the speaker issuing the instruction can still be maintained in scenarios with strong background-sound interference.
Description of the drawings
Fig. 1 is a flowchart of a voiceprint recognition method according to an embodiment of this application;
Fig. 2 is a schematic diagram of a dynamic time warping path in the voiceprint recognition method according to an embodiment of this application;
Fig. 3 is a schematic diagram of a voiceprint recognition apparatus according to an embodiment of this application;
Fig. 4 is a block diagram of the hardware structure of an electronic apparatus according to an embodiment of this application.
Detailed description of the embodiments
To help those skilled in the art better understand the solutions of this application, the technical solutions in the embodiments of this application are described below clearly and completely with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of this application; where there is no conflict, the embodiments of this application and the features in the embodiments may be combined with one another. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in this application without creative effort shall fall within the protection scope of this application.
It should be noted that the terms "first", "second", and the like in the specification, the claims, and the above drawings of this application are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of this application described herein can be implemented in an order other than that illustrated or described herein. In addition, the terms "including" and "having" and any variants thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to the steps or units clearly listed, but may include other steps or units not clearly listed or inherent to that process, method, product, or device.
Embodiment 1
This embodiment provides a voiceprint recognition method that can be applied to an electronic apparatus with a sound receiver, for example, a mobile terminal such as a mobile phone or tablet computer, computer equipment, or a smart home appliance, for recognizing the identity of a speaker. It should be noted that running on different computing devices differs only in the executing entity of the solution; those skilled in the art can foresee that running on different computing devices produces the same technical effect. Optionally, the voiceprint recognition method provided in this embodiment can be applied in the field of artificial intelligence (AI), for example, in application scenarios such as smart home appliances, robots, and voice assistants, to identify the speaker so as to determine the speaker's related information or permissions. After recognition by the voiceprint recognition method provided in this embodiment, semantic recognition can be performed on the specific content of the speaker's utterance to carry out the corresponding interaction.
Fig. 1 is a schematic flowchart of an optional voiceprint recognition method provided in this embodiment. As shown in Fig. 1, the voiceprint recognition method provided in this embodiment includes the following steps:
Step 101: monitor in real time whether a wake-up word voice is received.
The executing entity of this embodiment has a sound receiver and can monitor received sound, including human speech, in real time, convert the speech to text, and then judge whether the converted text includes the designated wake-up word. Speech-to-text conversion is a speech recognition procedure that converts speech into text; speech-to-text conversion in the related prior art can generally recognize an ordinary person's voice and convert it into text. In this embodiment, speech recognition can convert all received sound signals to determine the specific natural-language content of the received sound. If a person speaks the wake-up word and the recognition is correct, the speech is converted into text to obtain the wake-up word text, whereby the executing entity of this embodiment determines that the wake-up word text has been received, that is, the wake-up word voice has been received.
For example, in a scenario where this embodiment is applied to a voice assistant, the wake-up word can be pre-designated as "voice assistant". In actual use, if the executing entity of this embodiment performs speech recognition on the received sound and determines that the text corresponding to the speech is "voice assistant", it determines that the wake-up word voice has been received.
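The wake-up word check described above reduces to transcribing each detected utterance and testing the transcript for the designated word. A minimal sketch, where `transcribe` is a hypothetical stand-in for whatever speech-to-text engine is actually used (not a function named in this application):

```python
# Sketch of the wake-word monitoring check from steps 101-102.
# `transcribe` is an assumed speech-to-text callable; any real ASR
# engine could be plugged in here.

WAKE_WORD = "voice assistant"  # the pre-designated wake-up word

def contains_wake_word(audio_chunk, transcribe):
    """Return True if the transcript of this audio chunk contains the wake word."""
    text = transcribe(audio_chunk)  # speech-to-text conversion
    return WAKE_WORD in text.lower()

# Toy transcriber standing in for a real ASR engine: the "audio" is already text.
def fake_transcribe(chunk):
    return chunk

assert contains_wake_word("hey Voice Assistant, hello", fake_transcribe)
assert not contains_wake_word("just chatting", fake_transcribe)
```

In a real system this check would run inside the device's continuous listening loop, with each detected utterance passed through the ASR engine before the substring test.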
Step 102: when it is determined that the wake-up word voice is received, extract the voiceprint feature of the wake-up word voice, and record the voiceprint feature into the voiceprint library.
A voiceprint is a spectrum of sound waves carrying speech information, displayed by an electro-acoustic instrument. A voiceprint feature can be represented by vectors, yielding the voiceprint feature vector corresponding to the sound. The feature vectors of a voiceprint are obtained by a voiceprint-feature-extraction procedure: when extracting the voiceprint feature, acoustic features at different times can be extracted from a sentence of sound to be recognized to form a feature vector sequence, which constitutes the speaker's voiceprint feature.
An optional implementation of extracting the voiceprint feature includes the following steps:
Step 11: perform preprocessing on the sound signal containing the wake-up word voice, for example, using preprocessing methods from the related prior art such as normalization, pre-emphasis, endpoint detection, and windowed framing, where endpoint detection may use the dual-threshold method of short-time energy and short-time zero-crossing rate from the related prior art.
Step 12: extract the acoustic features from the preprocessed sound signal to form the feature vector sequence of the first speaker for storage.
Step 13: store the feature vector sequence in the voiceprint library.
A feature vector sequence comprises multiple vectors in an ordered arrangement, where each vector may itself be multi-dimensional. The feature vector sequence expresses the acoustic features in a form the machine can process (numeric vectors). Acoustic features may be extracted using extraction methods from the related prior art, for example, modeling with a Hidden Markov Model (HMM), or with a Gaussian Mixture Model (GMM) combined with a Universal Background Model (UBM), to obtain the feature vector sequence.
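Steps 11 through 13 above can be sketched as follows. This is an illustrative implementation of pre-emphasis, framing, and dual-threshold (short-time energy plus zero-crossing rate) endpoint detection only; the 16 kHz sample rate, frame sizes, and threshold values are assumptions made for the example, not values given in this application.

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """Boost high frequencies: y[t] = x[t] - alpha * x[t-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_signal(signal, frame_len=400, hop=160):
    """Split the signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop: i * hop + frame_len] for i in range(n)])

def short_time_energy(frames):
    return (frames ** 2).sum(axis=1)

def zero_crossing_rate(frames):
    return (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)

def detect_endpoints(frames, energy_thr, zcr_thr):
    """Dual-threshold endpoint detection: a frame counts as speech if its
    short-time energy OR zero-crossing rate exceeds its threshold.
    Returns (first_voiced_frame, last_voiced_frame) or None."""
    voiced = (short_time_energy(frames) > energy_thr) | \
             (zero_crossing_rate(frames) > zcr_thr)
    idx = np.flatnonzero(voiced)
    return (idx[0], idx[-1]) if idx.size else None
```

The detected speech frames would then be passed to the acoustic-feature extractor (e.g. an HMM- or GMM-UBM-based front end) to produce the feature vector sequence stored in the voiceprint library.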
When it is determined that the wake-up word voice is received, it is determined that there is a speaker A who expects to input a voice instruction or dialogue. The environment in which speaker A is located may be noisy, for example, a scene where several people are chatting. To prevent interference from other speakers' voices and avoid misrecognizing other speakers' speech content, the executing entity of this embodiment extracts and stores speaker A's voiceprint feature; after subsequently receiving speech, it uses the stored voiceprint feature of speaker A to determine whether the received sound was uttered by speaker A, and only if it was does it execute the corresponding voice instruction or converse with speaker A. Accordingly, in this step, the voiceprint feature extracted from the wake-up word voice is stored as the voiceprint feature of the first speaker (that is, the speaker who expects to input a voice instruction or dialogue).
In this embodiment, the voiceprint library stores the voiceprint feature of any speaker who utters the wake-up word voice, to serve as the basis of comparison for subsequently received voice signals. If the voiceprint feature of any speaker in the voiceprint library is not matched within a preset duration (for example, within 20 s), this indicates that the speaker intends to end the dialogue and will not continue issuing instructions or conversing; therefore, the voiceprint feature of a speaker whose voice signal is not received again within the preset duration is deleted. An optional implementation is: if no voice signal from the first speaker (used to refer to any speaker) is received within the preset duration, delete the stored voiceprint feature of the first speaker.
For example, if speaker A utters the wake-up word, speaker A's voiceprint feature is stored in the voiceprint library. If speaker A makes no further sound within 20 s and the executing entity of this embodiment does not recognize speaker A's voice in the received sound, speaker A's voiceprint feature is deleted from the voiceprint library. If within those 20 s another speaker B utters the wake-up word, speaker B's voiceprint feature is also stored in the voiceprint library; at this point, the voiceprint library stores at least the voiceprint features of speakers A and B. The rationale is that any speaker who utters the wake-up word is assumed to want to interact with the executing entity of this embodiment, so the voiceprint features of all speakers who utter the wake-up word are temporarily stored in the voiceprint library. If the voiceprint feature of a received sound successfully matches the voiceprint feature of any speaker in the voiceprint library, the speaker's identity is matched and a corresponding response can be made; otherwise, the sound is not processed. Since the interaction process should be fairly continuous, if no dialogue is received from a speaker for a relatively long time, that speaker is considered to have ended the dialogue, and the corresponding voiceprint feature is deleted from the voiceprint library. If that speaker later utters the wake-up word again, the speaker's voiceprint feature is again extracted and stored in the voiceprint library.
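The enrol-and-expire behaviour described above (store a voiceprint on each wake-up word, drop it after the preset silence, e.g. 20 s, refresh it on each successful match) can be sketched as a small in-memory library. The Euclidean match metric, the `threshold` value, and the class interface are assumptions made for the example; the application itself describes likelihood scoring or DTW for the actual comparison.

```python
import time

class VoiceprintLibrary:
    """Toy voiceprint library: entries expire if a speaker stays silent
    longer than ttl seconds. Features here are plain vectors and the
    distance metric is a placeholder, not the patent's exact model."""

    def __init__(self, ttl=20.0):
        self.ttl = ttl
        self.entries = {}  # speaker_id -> (feature, last_seen_timestamp)

    def enroll(self, speaker_id, feature, now=None):
        self.entries[speaker_id] = (feature,
                                    now if now is not None else time.time())

    def purge_expired(self, now=None):
        now = now if now is not None else time.time()
        self.entries = {sid: (f, t) for sid, (f, t) in self.entries.items()
                        if now - t <= self.ttl}

    def match(self, feature, now=None, threshold=0.1):
        """Return the id of the closest enrolled speaker within threshold,
        refreshing that speaker's timestamp, or None if no one matches."""
        self.purge_expired(now)
        best = None
        for sid, (f, t) in self.entries.items():
            dist = sum((a - b) ** 2 for a, b in zip(feature, f)) ** 0.5
            if dist < threshold and (best is None or dist < best[1]):
                best = (sid, dist)
        if best:
            self.entries[best[0]] = (self.entries[best[0]][0],
                                     now if now is not None else time.time())
            return best[0]
        return None
```

With `ttl=20.0`, a speaker matched at t = 10 s remains enrolled at t = 25 s (the match refreshed the timestamp), but disappears after 20 s of continuous silence, mirroring the speaker A / speaker B example above.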
Step 103: extract the voiceprint feature of the current voice signal monitored in real time.
Voice signals continue to be received, and the voiceprint feature of the current voice signal is extracted. Since the speaker's environment may be noisy, the voice signal received by the executing entity of this embodiment may be speech from a speaker who has uttered the wake-up word and expects a dialogue or voice instruction, or it may be speech from other people nearby. Therefore, voiceprint feature extraction is performed on the received voice signal for comparison with the voiceprint features stored in the voiceprint library. The specific extraction method is the same as that used in step 102 and is not repeated here; the voiceprint feature of the current voice signal can be extracted to obtain its voiceprint feature vector.
Step 104: compare whether the voiceprint feature of the current voice signal is identical to any voiceprint feature stored in the voiceprint library.
After the voiceprint feature of the current voice signal is determined, it is matched one by one against the stored voiceprint features of all speakers, to judge whether the voiceprint feature of the current voice signal matches the voiceprint feature of any speaker.
Specifically, when matching voiceprint features, the log-likelihood score (or likelihood score) of the feature vector sequence of the current voice signal's voiceprint feature can be computed against each speaker's voiceprint model, and whether it matches the corresponding speaker voiceprint model is judged by whether the log-likelihood score exceeds a preset threshold.
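As an illustration of the threshold test just described, the sketch below scores feature frames against a single diagonal-covariance Gaussian per speaker. This one-component model is a stand-in assumption for the GMM / GMM-UBM speaker models mentioned earlier, and the threshold value is arbitrary; only the accept/reject logic mirrors the text.

```python
import numpy as np

def log_likelihood(frames, mean, var):
    """Average per-frame log-likelihood of feature frames under a
    diagonal-covariance Gaussian speaker model (a one-component
    stand-in for a full GMM)."""
    frames = np.asarray(frames, dtype=float)
    ll = -0.5 * (np.log(2 * np.pi * var) + (frames - mean) ** 2 / var)
    return ll.sum(axis=1).mean()

def matches_speaker(frames, model, threshold):
    """Accept the speaker if the average log-likelihood exceeds the threshold."""
    return log_likelihood(frames, model["mean"], model["var"]) > threshold
```

Frames close to the model mean score near the Gaussian's entropy floor, while frames from a different speaker fall far below it, so a single scalar threshold separates the two cases.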
Optionally, when comparing the voiceprint feature of the current voice signal with the voiceprint feature of the first speaker, a feature vector method combined with dynamic time warping (DTW) may be used. The basic principle of dynamic time warping is to use dynamic programming to gradually transform a complex global optimization problem into multiple simple local optimization problems and make decisions step by step. It mainly solves the problem that the feature-parameter vector sequences of sound signals are misaligned in time because the durations of the individual phonemes vary between utterances. For feature vector groups, comparing corresponding feature vectors, and hence the global distortion, is meaningful only when the vector lengths are the same. Therefore, the DTW method is used to warp the individual phonemes of the utterance in time, compressing or stretching the vectors to be compared to the same length as the template.
Specifically, the method of comparing two voiceprint features by means of dynamic time warping includes the following steps:
Step 21: recognize each pronounced phoneme in the current voice signal by a speech recognition method. In this embodiment, the phoneme is the basic unit of pronunciation, the smallest speech unit divided according to the natural attributes of speech; it is analyzed according to the articulatory actions within a syllable, with one action constituting one phoneme. Phonemes fall into two categories: vowels and consonants. For example, for Chinese, the phonemes may include the initials and finals. Using a speech recognition method, it is possible to determine exactly what each phoneme in the voice signal is.
Step 22: extract the feature vector sequence of the voiceprint signal corresponding to each phoneme in the current voice signal.
It should be noted that, in this embodiment, the feature vector sequence is based on time frames; that is, the segment of sound from which voiceprint features are to be extracted is divided into multiple time frames at a set interval, and a feature vector is extracted from the signal of each frame. The resulting feature vector sequence includes multiple feature vectors, each being the feature vector of the voiceprint signal within the corresponding time interval.
After a phoneme is recognized, the voiceprint signal within the start and end times of that phoneme is cut out of the current voice signal, to obtain the voiceprint signal of the speaker uttering that phoneme. The phoneme segment is then divided into multiple frames at a preset interval, the feature vector of each frame is extracted, and a feature vector sequence for that phoneme is obtained; the sequence includes the feature vectors of all frames in time order.
Step 23: calculate the minimum distance between the feature vector sequence of each phoneme of the current voice signal and the feature vector sequence of the first speaker's voice signal for the corresponding phoneme.
It should be noted that the feature vector sequences of the first speaker's different phonemes are stored in the voiceprint library and retrieved when a comparison is needed. They are generated in the same way as described above for the feature vector sequences of the phonemes of the current voice signal, which is not repeated here.
For example, suppose the feature vector sequence of the voiceprint feature of phoneme b in the first speaker's voice signal (serving as the reference template) is X = {x1, x2, ..., xn}, and the feature vector sequence of the voiceprint feature of phoneme b in the current voice signal (the sound to be judged) is Y = {y1, y2, ..., ym}, where n and m are the sequence lengths. Determining the minimum distance between the two vector sequences amounts to minimizing a function of the distance between them, where the distance between the two sequences can be viewed as computing and summing the distances between the frame vectors xi and yj. For example, compare the vector distances between x1 through x5 and y3; if the distance between x2 and y3 is the smallest, then x2 is aligned with y3, and the next decision is which element of sequence X should align with y4. It should be noted that because a voice signal is continuous, even though durations differ the order of articulation is the same; therefore, when comparing the two feature vector sequences, if the element y(j-1) preceding some element yj of the sequence Y to be judged is aligned with an element xi of the reference template X, then yj can only be aligned with xi or an element after xi. Finding the minimum distance between the two vector sequences can also be viewed as determining the minimum distortion between the two voice signals.
As shown in Figure 2, the horizontal axis represents the frame-level feature vectors of Y and the vertical axis represents the frame-level feature vectors of the reference template X. Drawing horizontal and vertical lines through these integer frame-number coordinates forms a grid, where each intersection represents the distance between a frame vector of Y and a frame vector of the reference template. The DTW algorithm finds a path through the intersections of this grid that minimizes the total distance between X and Y (the bent line in Figure 2). Of course, the warping path is not chosen arbitrarily. First, although individual phonemes may be spoken faster or slower, their order never changes, so the path must start at the lower-left corner and end at the upper-right corner. Second, the slope of the path cannot be arbitrary either; the maximum slope of the path can be determined from the ratio of the durations of the two sound signals. Without a slope constraint, alignment errors may occur, for example aligning a late element of the Y sequence with an early element of the X sequence. Constraining the slope avoids this problem; for instance, the maximum slope can be set to 2 and the minimum slope to 0.5, yielding the diamond-shaped search region shown in Figure 2.
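The minimum-distance computation described above is essentially dynamic time warping. The following is a minimal textbook-style DTW sketch (an illustration only, not the application's own implementation; the slope constraint from Figure 2 is omitted for brevity, and `numpy` is assumed to be available):

```python
import numpy as np

def dtw_min_distance(X, Y):
    """Classic DTW between two feature-vector sequences X (n x d) and Y (m x d).

    Returns the minimum accumulated frame-to-frame Euclidean distance under
    monotonic alignment: once y(j-1) is aligned with xi, yj may only align
    with xi or a later element, matching the continuity argument above.
    """
    n, m = len(X), len(Y)
    # Local frame distances d(i, j) = ||xi - yj|| via broadcasting.
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    # Accumulated cost over the grid; D[i, j] is the best path cost
    # from (0, 0) to (i, j).
    D = np.full((n, m), np.inf)
    D[0, 0] = d[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            prev = min(
                D[i - 1, j] if i > 0 else np.inf,                 # stretch Y
                D[i, j - 1] if j > 0 else np.inf,                 # compress Y
                D[i - 1, j - 1] if i > 0 and j > 0 else np.inf,   # diagonal
            )
            D[i, j] = d[i, j] + prev
    return D[n - 1, m - 1]
```

With the slope constraint added, grid cells outside the diamond-shaped region would simply be left at infinity so the path cannot pass through them.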
Step 24: Determine whether the minimum distance is less than a preset threshold; if it is, determine that the current speech signal was uttered by the first speaker.
In the above embodiment, computing the minimum distance compresses or stretches the durations of the pronunciation phonemes of the sound signal along the time dimension to match the durations of the corresponding phonemes of the first speaker. In other words, the pronunciation phonemes of the sound signal are warped in time so that they are equal in duration to the first speaker's pronunciation phonemes.
Step 105: If a matching voiceprint feature is found, perform semantic recognition on the current speech signal and return feedback.
Optionally, before returning feedback, first determine whether the time interval between the current speech signal and the first speaker's preceding speech signal (measured from the end point of the previous signal) exceeds a preset duration. If it does not, store the reception time of the current speech signal and return feedback based on the semantic content of the current speech signal. The stored reception time serves as the basis for judging, in the next round, whether the interval of the following speech signal exceeds the preset duration.
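This interval check amounts to simple session state. A minimal sketch follows; the class name, method names, and the 30-second preset duration are illustrative assumptions, not taken from the application:

```python
PRESET_DURATION = 30.0  # hypothetical preset duration, in seconds

class DialogSession:
    """Tracks the end-point time of the previous speech signal so that
    feedback is only returned when the gap to the current signal does
    not exceed the preset duration."""

    def __init__(self, timeout=PRESET_DURATION):
        self.timeout = timeout
        self.last_end_time = None  # end-point time of the previous signal

    def should_respond(self, signal_end_time):
        # The first utterance of a session always receives a response.
        if (self.last_end_time is None
                or signal_end_time - self.last_end_time <= self.timeout):
            # Store the reception time as the basis for the next round.
            self.last_end_time = signal_end_time
            return True
        return False
```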
It should be noted that the steps shown in the flowcharts of the accompanying drawings can be executed in a computer system, such as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from the one given here.
From the description of the above implementations, those skilled in the art can clearly understand that the method of the above embodiment can be implemented by software plus the necessary general-purpose hardware platform, and of course also by hardware, though in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. That computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of this application.
Embodiment 2
This embodiment further provides a voiceprint recognition apparatus for implementing Embodiment 1 above and its preferred implementations. For terms or implementations not detailed in this embodiment, refer to the related descriptions in Embodiment 1; matters already explained are not repeated here.
The term "module" as used below refers to a combination of software and/or hardware that can implement a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, implementations in hardware, or in a combination of software and hardware, are also conceivable.
Figure 3 is a schematic diagram of a voiceprint recognition apparatus according to an embodiment of this application. As shown in Figure 3, the apparatus includes: a monitoring module 10, a first extraction module 20, a second extraction module 30, a comparison module 40, and a recognition module 50.
The monitoring module monitors in real time whether a wake-up word utterance is received. The first extraction module extracts the voiceprint feature of the wake-up word utterance when it is determined that one has been received, and records the voiceprint feature into a voiceprint library. The second extraction module extracts the voiceprint feature of the current speech signal monitored in real time. The comparison module compares whether the voiceprint feature of the current speech signal is the same as any voiceprint feature stored in the voiceprint library. The recognition module performs semantic recognition on the current speech signal and returns feedback if a matching voiceprint feature is found.
Optionally, the apparatus further includes: a judging module for judging whether the speaker of each voiceprint feature in the voiceprint library has spoken again within a preset duration after the moment of last speaking; a first deletion module for deleting the corresponding voiceprint feature if not; and an update module for updating the moment of last speaking of the speaker corresponding to that voiceprint feature if so.
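A minimal sketch of that expiry logic follows; the class, method names, and the 600-second preset duration are illustrative assumptions, not taken from the application:

```python
EXPIRY_SECONDS = 600.0  # hypothetical preset duration

class VoiceprintLibrary:
    """A voiceprint whose speaker has not spoken again within the preset
    duration is deleted; otherwise its last-spoken time is refreshed."""

    def __init__(self, expiry=EXPIRY_SECONDS):
        self.expiry = expiry
        self.features = {}  # speaker_id -> (feature, last_spoken_time)

    def enroll(self, speaker_id, feature, now):
        self.features[speaker_id] = (feature, now)

    def on_speech(self, speaker_id, now):
        # The speaker spoke again: update the moment of last speaking.
        feature, _ = self.features[speaker_id]
        self.features[speaker_id] = (feature, now)

    def purge(self, now):
        # Delete features whose speakers stayed silent past the expiry.
        expired = [sid for sid, (_, t) in self.features.items()
                   if now - t > self.expiry]
        for sid in expired:
            del self.features[sid]
        return expired
```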
Optionally, the first extraction module includes: a preprocessing unit for preprocessing the sound signal containing the wake-up word utterance; a first extraction unit for extracting acoustic features from the preprocessed sound signal containing the wake-up word utterance to obtain a feature vector sequence representing the voiceprint feature; and a storage unit for storing the feature vector sequence in the voiceprint library.
Optionally, the voiceprint feature is represented by a feature vector sequence based on time frames, and the comparison module includes: a recognition unit for recognizing each pronunciation phoneme in the current speech signal; a second extraction unit for extracting the feature vector sequence of the voiceprint signal corresponding to each pronunciation phoneme in the current speech signal; a calculation unit for calculating the minimum distance between the feature vector sequence of each pronunciation phoneme of the current speech signal and the feature vector sequence of the corresponding pronunciation phoneme of a first voiceprint feature stored in the voiceprint library; and a judging unit for judging whether the minimum distance is less than a preset threshold, where, if it is, the current speech signal is determined to match the first voiceprint feature.
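The calculation and judging units above could be combined as follows. This is a hedged sketch: requiring every shared phoneme to fall below the preset threshold is an assumption, since the application does not specify how per-phoneme results are aggregated, and `dtw_distance` stands in for any routine returning the minimum distance between two feature vector sequences:

```python
def matches_first_speaker(phoneme_seqs_current, phoneme_seqs_ref,
                          dtw_distance, threshold):
    """Per-phoneme minimum distances against the first voiceprint feature
    are compared to the preset threshold.

    phoneme_seqs_current / phoneme_seqs_ref: dicts mapping a phoneme label
    to its feature vector sequence.
    """
    # Only phonemes present in both signals can be compared.
    shared = set(phoneme_seqs_current) & set(phoneme_seqs_ref)
    if not shared:
        return False
    return all(
        dtw_distance(phoneme_seqs_current[p], phoneme_seqs_ref[p]) < threshold
        for p in shared
    )
```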
It should be noted that each of the above modules can be implemented in software or hardware. For the latter, this can be achieved in, but is not limited to, the following manner: all of the above modules are located in the same processor; or the above modules are located, in any combination, in different processors.
Obviously, those skilled in the art should understand that the modules or steps of this application described above can be implemented with a general-purpose computing device. They can be concentrated on a single computing device or distributed over a network composed of multiple computing devices. Optionally, they can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases the steps shown or described can be executed in an order different from the one given here, or they can be fabricated into individual integrated circuit modules, or multiple modules or steps among them can be fabricated into a single integrated circuit module. In this way, this application is not limited to any specific combination of hardware and software.
Embodiment 3
An embodiment of this application further provides a computer-readable storage medium in which a computer program is stored, where the computer program is configured to execute, when run, the steps of any of the foregoing method embodiments.
Optionally, in this embodiment, the above computer-readable storage medium may include, but is not limited to: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium that can store a computer program. The computer-readable storage medium may be non-volatile or volatile.
Embodiment 4
An embodiment of this application further provides an electronic apparatus, including a memory and a processor. A computer program is stored in the memory, and the processor is configured to run the computer program to execute the steps of any of the foregoing method embodiments.
Optionally, the above electronic apparatus may further include a transmission device and an input/output device, both connected to the processor. Taking an electronic apparatus as an example, Figure 4 is a hardware structure block diagram of an electronic apparatus according to an embodiment of this application. As shown in Figure 4, the electronic apparatus may include one or more processors 302 (only one is shown in Figure 4; the processor 302 may include, but is not limited to, a processing device such as a microcontroller (MCU) or a programmable logic device (FPGA)) and a memory 304 for storing data. Optionally, the electronic apparatus may also include a transmission device 306 for communication functions and an input/output device 308. A person of ordinary skill in the art can understand that the structure shown in Figure 4 is only illustrative and does not limit the structure of the above electronic apparatus. For example, the electronic apparatus may include more or fewer components than shown in Figure 4, or have a configuration different from that shown in Figure 4.
The memory 304 can be used to store computer programs, for example, software programs and modules of application software, such as the computer program corresponding to the voiceprint recognition method in the embodiments of this application. By running the computer programs stored in the memory 304, the processor 302 executes various functional applications and data processing, that is, implements the above method. The memory 304 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 304 may further include memory located remotely relative to the processor 302, and such remote memory may be connected to the electronic apparatus through a network. Examples of the above network include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 306 is used to receive or send data via a network. Specific examples of the above network may include a wireless network provided by the communication provider of the electronic apparatus. In one example, the transmission device 306 includes a network interface controller (NIC), which can be connected to other network devices through a base station so as to communicate with the Internet. In another example, the transmission device 306 may be a radio frequency (RF) module, which is used to communicate with the Internet wirelessly.
The above are only specific implementations of this application, but the protection scope of this application is not limited to them. Any person skilled in the art can easily conceive of changes or substitutions within the technical scope disclosed in this application, and these should all be covered within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (20)

  1. A voiceprint recognition method, wherein the method comprises:
    monitoring in real time whether a wake-up word utterance is received;
    in a case where it is determined that the wake-up word utterance is received, extracting a voiceprint feature of the wake-up word utterance, and recording the voiceprint feature into a voiceprint library;
    extracting a voiceprint feature of a current speech signal monitored in real time;
    comparing whether the voiceprint feature of the current speech signal is the same as any voiceprint feature stored in the voiceprint library;
    if the same voiceprint feature is matched, performing semantic recognition on the current speech signal and returning feedback.
  2. The method according to claim 1, wherein, after recording the voiceprint feature into the voiceprint library, the method further comprises:
    judging whether the speaker of each voiceprint feature in the voiceprint library has spoken again within a preset duration after the moment of last speaking;
    if not, deleting the corresponding voiceprint feature;
    if so, updating the moment of last speaking of the speaker corresponding to the voiceprint feature.
  3. The method according to claim 1, wherein recording the voiceprint feature into the voiceprint library comprises:
    preprocessing a sound signal comprising the wake-up word utterance;
    extracting acoustic features from the preprocessed sound signal comprising the wake-up word utterance to obtain a feature vector sequence for representing the voiceprint feature;
    storing the feature vector sequence in the voiceprint library.
  4. The method according to claim 3, wherein the voiceprint feature is represented by a feature vector sequence based on time frames, and comparing whether the voiceprint feature of the current speech signal is the same as any voiceprint feature stored in the voiceprint library comprises:
    recognizing each pronunciation phoneme in the current speech signal;
    extracting the feature vector sequence of the voiceprint signal corresponding to each pronunciation phoneme in the current speech signal;
    calculating a minimum distance between the feature vector sequence of each pronunciation phoneme of the current speech signal and the feature vector sequence of the corresponding pronunciation phoneme of a first voiceprint feature stored in the voiceprint library;
    judging whether the minimum distance is less than a preset threshold, wherein, if it is less than the preset threshold, it is determined that the current speech signal matches the first voiceprint feature.
  5. The method according to claim 3, wherein extracting the acoustic features from the preprocessed sound signal comprising the wake-up word utterance to obtain the feature vector sequence for representing the voiceprint feature comprises:
    obtaining the feature vector sequence corresponding to the voiceprint feature based on a hidden Markov model, or a Gaussian mixture model-universal background model.
  6. The method according to claim 1, wherein comparing whether the voiceprint feature of the current speech signal is the same as any voiceprint feature stored in the voiceprint library comprises:
    calculating a log-likelihood score of the feature vector sequence of the voiceprint feature of the current speech signal against each speaker's voiceprint model, and judging, according to whether the log-likelihood score exceeds a preset threshold, whether the current speech signal matches the corresponding speaker's voiceprint model.
  7. The method according to any one of claims 1-6, wherein performing the semantic recognition on the current speech signal and returning feedback comprises:
    judging whether a time interval between the current speech signal and a preceding speech signal of the speaker exceeds a preset duration;
    if the interval between the current speech signal and the preceding speech signal of the speaker does not exceed the preset duration, storing a reception time of the current speech signal, and returning feedback based on the semantic content of the current speech signal.
  8. A voiceprint recognition apparatus, wherein the apparatus comprises:
    a monitoring module, configured to monitor in real time whether a wake-up word utterance is received;
    a first extraction module, configured to extract a voiceprint feature of the wake-up word utterance in a case where it is determined that the wake-up word utterance is received, and record the voiceprint feature into a voiceprint library;
    a second extraction module, configured to extract a voiceprint feature of a current speech signal monitored in real time;
    a comparison module, configured to compare whether the voiceprint feature of the current speech signal is the same as any voiceprint feature stored in the voiceprint library;
    a recognition module, configured to perform semantic recognition on the current speech signal and return feedback if the same voiceprint feature is matched.
  9. An electronic apparatus, wherein the electronic apparatus comprises a memory and a processor connected to each other, the memory is configured to store a computer program, the computer program is configured to be executed by the processor, and the computer program is configured to execute a voiceprint recognition method,
    wherein the method comprises:
    monitoring in real time whether a wake-up word utterance is received;
    in a case where it is determined that the wake-up word utterance is received, extracting a voiceprint feature of the wake-up word utterance, and recording the voiceprint feature into a voiceprint library;
    extracting a voiceprint feature of a current speech signal monitored in real time;
    comparing whether the voiceprint feature of the current speech signal is the same as any voiceprint feature stored in the voiceprint library;
    if the same voiceprint feature is matched, performing semantic recognition on the current speech signal and returning feedback.
  10. The electronic apparatus according to claim 9, wherein, after recording the voiceprint feature into the voiceprint library, the method further comprises:
    judging whether the speaker of each voiceprint feature in the voiceprint library has spoken again within a preset duration after the moment of last speaking;
    if not, deleting the corresponding voiceprint feature;
    if so, updating the moment of last speaking of the speaker corresponding to the voiceprint feature.
  11. The electronic apparatus according to claim 9, wherein recording the voiceprint feature into the voiceprint library comprises:
    preprocessing a sound signal comprising the wake-up word utterance;
    extracting acoustic features from the preprocessed sound signal comprising the wake-up word utterance to obtain a feature vector sequence for representing the voiceprint feature;
    storing the feature vector sequence in the voiceprint library.
  12. The electronic apparatus according to claim 11, wherein the voiceprint feature is represented by a feature vector sequence based on time frames, and comparing whether the voiceprint feature of the current speech signal is the same as any voiceprint feature stored in the voiceprint library comprises:
    recognizing each pronunciation phoneme in the current speech signal;
    extracting the feature vector sequence of the voiceprint signal corresponding to each pronunciation phoneme in the current speech signal;
    calculating a minimum distance between the feature vector sequence of each pronunciation phoneme of the current speech signal and the feature vector sequence of the corresponding pronunciation phoneme of a first voiceprint feature stored in the voiceprint library;
    judging whether the minimum distance is less than a preset threshold, wherein, if it is less than the preset threshold, it is determined that the current speech signal matches the first voiceprint feature.
  13. The electronic apparatus according to claim 11, wherein extracting the acoustic features from the preprocessed sound signal comprising the wake-up word utterance to obtain the feature vector sequence for representing the voiceprint feature comprises:
    obtaining the feature vector sequence corresponding to the voiceprint feature based on a hidden Markov model, or a Gaussian mixture model-universal background model.
  14. The electronic apparatus according to claim 9, wherein comparing whether the voiceprint feature of the current speech signal is the same as any voiceprint feature stored in the voiceprint library comprises:
    calculating a log-likelihood score of the feature vector sequence of the voiceprint feature of the current speech signal against each speaker's voiceprint model, and judging, according to whether the log-likelihood score exceeds a preset threshold, whether the current speech signal matches the corresponding speaker's voiceprint model.
  15. The electronic apparatus according to any one of claims 9-14, wherein performing the semantic recognition on the current speech signal and returning feedback comprises:
    judging whether a time interval between the current speech signal and a preceding speech signal of the speaker exceeds a preset duration;
    if the interval between the current speech signal and the preceding speech signal of the speaker does not exceed the preset duration, storing a reception time of the current speech signal, and returning feedback based on the semantic content of the current speech signal.
  16. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, implements a voiceprint recognition method comprising the following steps:
    monitoring in real time whether a wake-up word utterance is received;
    in a case where it is determined that the wake-up word utterance is received, extracting a voiceprint feature of the wake-up word utterance, and recording the voiceprint feature into a voiceprint library;
    extracting a voiceprint feature of a current speech signal monitored in real time;
    comparing whether the voiceprint feature of the current speech signal is the same as any voiceprint feature stored in the voiceprint library;
    if the same voiceprint feature is matched, performing semantic recognition on the current speech signal and returning feedback.
  17. The computer-readable storage medium according to claim 16, wherein, after recording the voiceprint feature into the voiceprint library, the method further comprises:
    judging whether the speaker of each voiceprint feature in the voiceprint library has spoken again within a preset duration after the moment of last speaking;
    if not, deleting the corresponding voiceprint feature;
    if so, updating the moment of last speaking of the speaker corresponding to the voiceprint feature.
  18. The computer-readable storage medium according to claim 16, wherein recording the voiceprint feature into the voiceprint library comprises:
    preprocessing a sound signal comprising the wake-up word utterance;
    extracting acoustic features from the preprocessed sound signal comprising the wake-up word utterance to obtain a feature vector sequence for representing the voiceprint feature;
    storing the feature vector sequence in the voiceprint library.
  19. The computer-readable storage medium according to claim 18, wherein the voiceprint feature is represented by a feature vector sequence based on time frames, and comparing whether the voiceprint feature of the current speech signal is the same as any voiceprint feature stored in the voiceprint library comprises:
    recognizing each pronunciation phoneme in the current speech signal;
    extracting the feature vector sequence of the voiceprint signal corresponding to each pronunciation phoneme in the current speech signal;
    calculating a minimum distance between the feature vector sequence of each pronunciation phoneme of the current speech signal and the feature vector sequence of the corresponding pronunciation phoneme of a first voiceprint feature stored in the voiceprint library;
    judging whether the minimum distance is less than a preset threshold, wherein, if it is less than the preset threshold, it is determined that the current speech signal matches the first voiceprint feature.
  20. The computer-readable storage medium according to claim 18, wherein extracting the acoustic features from the preprocessed sound signal comprising the wake-up word speech, so as to obtain a feature vector sequence representing the voiceprint feature, comprises:
    Acquiring the feature vector sequence corresponding to the voiceprint feature based on a hidden Markov model, or a Gaussian mixture model-universal background model (GMM-UBM).
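One minimal way to picture the GMM-UBM alternative named in claim 20: a speaker-specific Gaussian mixture model and a universal background model each assign a likelihood to the feature vector sequence, and their difference scores how speaker-like the utterance is. The diagonal-covariance mixture below and all its parameters are invented for illustration; a real system would train both models on speech data (e.g. via EM) rather than hard-code them.

```python
import math

def gmm_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood of a feature vector sequence
    under a diagonal-covariance Gaussian mixture model."""
    total = 0.0
    for x in frames:
        frame_lik = 0.0
        for w, mu, var in zip(weights, means, variances):
            log_det = sum(math.log(2 * math.pi * v) for v in var)
            maha = sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mu, var))
            frame_lik += w * math.exp(-0.5 * (log_det + maha))
        total += math.log(frame_lik + 1e-300)
    return total / len(frames)

# Toy speaker model vs. universal background model (invented parameters)
speaker = dict(weights=[0.5, 0.5], means=[(0.0, 0.0), (1.0, 1.0)],
               variances=[(0.5, 0.5), (0.5, 0.5)])
ubm = dict(weights=[1.0], means=[(5.0, 5.0)], variances=[(4.0, 4.0)])

frames = [(0.1, -0.1), (0.9, 1.1), (0.0, 0.2)]
score = gmm_log_likelihood(frames, **speaker) - gmm_log_likelihood(frames, **ubm)
print(score > 0)  # frames near the speaker model score higher than under the UBM
```

The log-likelihood ratio (speaker minus UBM) is the standard GMM-UBM verification score; the HMM alternative in the claim would additionally model the temporal ordering of the wake-up word's phonemes.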
PCT/CN2020/111370 2020-02-13 2020-08-26 Voiceprint recognition method and apparatus, and storage medium and electronic apparatus WO2021159688A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010090868.0 2020-02-13
CN202010090868.0A CN111341325A (en) 2020-02-13 2020-02-13 Voiceprint recognition method and device, storage medium and electronic device

Publications (1)

Publication Number Publication Date
WO2021159688A1 true WO2021159688A1 (en) 2021-08-19

Family

ID=71185194

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/111370 WO2021159688A1 (en) 2020-02-13 2020-08-26 Voiceprint recognition method and apparatus, and storage medium and electronic apparatus

Country Status (2)

Country Link
CN (1) CN111341325A (en)
WO (1) WO2021159688A1 (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111341325A (en) * 2020-02-13 2020-06-26 平安科技(深圳)有限公司 Voiceprint recognition method and device, storage medium and electronic device
CN111653283B (en) * 2020-06-28 2024-03-01 讯飞智元信息科技有限公司 Cross-scene voiceprint comparison method, device, equipment and storage medium
CN111833869B (en) * 2020-07-01 2022-02-11 中关村科学城城市大脑股份有限公司 Voice interaction method and system applied to urban brain
CN112309406A (en) * 2020-09-21 2021-02-02 北京沃东天骏信息技术有限公司 Voiceprint registration method, voiceprint registration device and computer-readable storage medium
CN112499016A (en) * 2020-11-17 2021-03-16 苏州中科先进技术研究院有限公司 Garbage recycling method and device of intelligent garbage can and intelligent garbage can
CN112562671A (en) * 2020-12-10 2021-03-26 上海雷盎云智能技术有限公司 Voice control method and device for service robot
CN112562685A (en) * 2020-12-10 2021-03-26 上海雷盎云智能技术有限公司 Voice interaction method and device for service robot
CN112735403B (en) * 2020-12-18 2022-04-08 宁波向往智汇科技有限公司 Intelligent home control system based on intelligent sound equipment
CN112700782A (en) * 2020-12-25 2021-04-23 维沃移动通信有限公司 Voice processing method and electronic equipment
CN113241059B (en) * 2021-04-27 2022-11-08 标贝(北京)科技有限公司 Voice wake-up method, device, equipment and storage medium
CN113515952B (en) * 2021-08-18 2023-09-12 内蒙古工业大学 Combined modeling method, system and equipment for Mongolian dialogue model
CN113948091A (en) * 2021-12-20 2022-01-18 山东贝宁电子科技开发有限公司 Air-ground communication voice recognition engine for civil aviation passenger plane and application method thereof
CN115064176B (en) * 2022-06-22 2023-06-16 广州市迪声音响有限公司 Voiceprint screen system and method
CN117055744B (en) * 2023-10-09 2024-01-26 深圳市英菲克电子有限公司 Household Internet of things mouse, terminal, server and system
CN117894321B (en) * 2024-03-15 2024-05-17 富迪科技(南京)有限公司 Voice interaction method, voice interaction prompting system and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9098467B1 (en) * 2012-12-19 2015-08-04 Rawles Llc Accepting voice commands based on user identity
CN106653021A (en) * 2016-12-27 2017-05-10 上海智臻智能网络科技股份有限公司 Voice wake-up control method and device and terminal
CN108899033A (en) * 2018-05-23 2018-11-27 出门问问信息科技有限公司 A kind of method and device of determining speaker characteristic
CN110491373A (en) * 2019-08-19 2019-11-22 Oppo广东移动通信有限公司 Model training method, device, storage medium and electronic equipment
CN111341325A (en) * 2020-02-13 2020-06-26 平安科技(深圳)有限公司 Voiceprint recognition method and device, storage medium and electronic device

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006017936A (en) * 2004-06-30 2006-01-19 Sharp Corp Telephone communication device, relay processor, communication authentication system, control method of telephone communication device, control program of telephone communication device, and recording medium recorded with control program of telephone communication device
US8417525B2 (en) * 2010-02-09 2013-04-09 International Business Machines Corporation Adaptive voice print for conversational biometric engine
KR102246900B1 (en) * 2014-07-29 2021-04-30 삼성전자주식회사 Electronic device for speech recognition and method thereof
CN105575395A (en) * 2014-10-14 2016-05-11 中兴通讯股份有限公司 Voice wake-up method and apparatus, terminal, and processing method thereof
CN104616655B (en) * 2015-02-05 2018-01-16 北京得意音通技术有限责任公司 The method and apparatus of sound-groove model automatic Reconstruction
CN106782564B (en) * 2016-11-18 2018-09-11 百度在线网络技术(北京)有限公司 Method and apparatus for handling voice data
CN107293293A (en) * 2017-05-22 2017-10-24 深圳市搜果科技发展有限公司 A kind of voice instruction recognition method, system and robot
CN108154371A (en) * 2018-01-12 2018-06-12 平安科技(深圳)有限公司 Electronic device, the method for authentication and storage medium
CN108986825A (en) * 2018-07-02 2018-12-11 北京百度网讯科技有限公司 Context acquisition methods and equipment based on interactive voice
CN110083392B (en) * 2019-03-20 2022-06-14 深圳趣唱科技有限公司 Audio awakening pre-recording method, storage medium, terminal and Bluetooth headset thereof


Also Published As

Publication number Publication date
CN111341325A (en) 2020-06-26

Similar Documents

Publication Publication Date Title
WO2021159688A1 (en) Voiceprint recognition method and apparatus, and storage medium and electronic apparatus
US11887582B2 (en) Training and testing utterance-based frameworks
US11545147B2 (en) Utterance classifier
US11361763B1 (en) Detecting system-directed speech
CN108962255B (en) Emotion recognition method, emotion recognition device, server and storage medium for voice conversation
US10074363B2 (en) Method and apparatus for keyword speech recognition
US8775191B1 (en) Efficient utterance-specific endpointer triggering for always-on hotwording
CN108766441B (en) Voice control method and device based on offline voiceprint recognition and voice recognition
US9466286B1 (en) Transitioning an electronic device between device states
US11862153B1 (en) System for recognizing and responding to environmental noises
EP3989217B1 (en) Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium
CN111210829A (en) Speech recognition method, apparatus, system, device and computer readable storage medium
CN110364178B (en) Voice processing method and device, storage medium and electronic equipment
JP6915637B2 (en) Information processing equipment, information processing methods, and programs
CN109215634A (en) A kind of method and its system of more word voice control on-off systems
US20230005480A1 (en) Voice Filtering Other Speakers From Calls And Audio Messages
JP2023553451A (en) Hot phrase trigger based on sequence of detections
CN114385800A (en) Voice conversation method and device
CN111862943B (en) Speech recognition method and device, electronic equipment and storage medium
Reich et al. A real-time speech command detector for a smart control room
Këpuska Wake-up-word speech recognition
JP2996019B2 (en) Voice recognition device
CN114121022A (en) Voice wake-up method and device, electronic equipment and storage medium
KR20180057315A (en) System and method for classifying spontaneous speech
CN111354358B (en) Control method, voice interaction device, voice recognition server, storage medium, and control system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20918361

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20918361

Country of ref document: EP

Kind code of ref document: A1