WO2020199384A1

WO2020199384A1 - Audio recognition method, apparatus and device, and storage medium

Info

Publication number: WO2020199384A1
Application number: PCT/CN2019/093398
Authority: WO
Inventors: 鲁霄
Original assignee: 腾讯音乐娱乐科技（深圳）有限公司
Priority date: 2019-04-04
Filing date: 2019-06-27
Publication date: 2020-10-08
Also published as: CN110047515B; CN110047515A

Abstract

Disclosed are an audio recognition method, apparatus and device, and a storage medium. The method comprises: extracting an audio fingerprint of audio to be recognized to serve as a standard fingerprint, and calculating the similarity between the standard fingerprint and audio fingerprints in a pre-set fingerprint database (101); according to the similarity, screening out, from the fingerprint database, a candidate fingerprint set (102); selecting, from the candidate fingerprint set, a reference fingerprint, and acquiring a same-audio fingerprint of the reference fingerprint (103); and selecting, from audios corresponding to the reference fingerprint and to the same-audio fingerprint of the reference fingerprint, a target audio corresponding to the audio to be recognized (104).

Description

Audio recognition method, apparatus, device, and storage medium

Technical field

This application relates to the field of communication technology, and in particular to an audio recognition method, apparatus, device, and storage medium.

Background

Song recognition gives music lovers a very convenient way to search: a user only needs to record the music playing in the environment, or hum a fragment of a song, and input it into the application to find out which song it is. Current song recognition mainly searches a massive music library using the feature information of the input song and selects the song most similar to the input.

In researching and practicing the prior art, the inventor of this application found that an audio clip uploaded by a user may correspond to multiple versions of the audio. The audio recognition process of current music platforms is coarse and does not take the differences between versions into account, so the song that a platform selects based on the user's clip may not be the true source of the clip and may not be what the user actually wants. Current audio recognition is therefore not accurate enough.

Technical problem

The embodiments of this application provide an audio recognition method, apparatus, device, and storage medium that can improve the accuracy of audio recognition.

Technical solution

An embodiment of this application provides an audio recognition method, including:

extracting an audio fingerprint of the audio to be recognized as a standard fingerprint, and calculating the similarity between the standard fingerprint and the audio fingerprints in a preset fingerprint library;

screening out a candidate fingerprint set from the fingerprint library according to the similarity between the standard fingerprint and the audio fingerprints in the fingerprint library;

selecting a reference fingerprint from the candidate fingerprint set, and obtaining the same-audio fingerprints of the reference fingerprint; and

selecting, from the audio corresponding to the reference fingerprint and its same-audio fingerprints, the target audio corresponding to the audio to be recognized.

In addition, an embodiment of this application provides an audio recognition apparatus, including:

a fingerprint unit, configured to extract an audio fingerprint of the audio to be recognized as a standard fingerprint, and to calculate the similarity between the standard fingerprint and the audio fingerprints in a preset fingerprint library;

a candidate unit, configured to screen out a candidate fingerprint set from the fingerprint library according to the similarity between the standard fingerprint and the audio fingerprints in the fingerprint library;

a same-audio unit, configured to select a reference fingerprint from the candidate fingerprint set and to obtain the same-audio fingerprints of the reference fingerprint; and

an audio unit, configured to select, from the audio corresponding to the reference fingerprint and its same-audio fingerprints, the target audio corresponding to the audio to be recognized.

In addition, an embodiment of this application provides an audio recognition device. The audio recognition device includes a memory, a processor, and an audio recognition program that is stored in the memory and can run on the processor. When the audio recognition program is executed by the processor, the steps of any audio recognition method provided in the embodiments of this application are implemented.

In some embodiments, the audio recognition device further includes an audio collection apparatus, which is used to collect the audio to be recognized.

In addition, an embodiment of this application provides a storage medium that stores a plurality of instructions. The instructions are suitable for being loaded by a processor to execute the steps of any audio recognition method provided in the embodiments of this application.

Beneficial effects

In the embodiments of this application, an audio fingerprint of the audio to be recognized is extracted as a standard fingerprint, and the similarity between the standard fingerprint and the audio fingerprints in a preset fingerprint library is calculated; a candidate fingerprint set is screened out from the fingerprint library according to that similarity; a reference fingerprint is selected from the candidate fingerprint set, and the same-audio fingerprints of the reference fingerprint are obtained; and the target audio corresponding to the audio to be recognized is selected from the audio corresponding to the reference fingerprint and its same-audio fingerprints. After candidate fingerprints similar to the standard fingerprint have been retrieved, the candidates do match the standard fingerprint, but uncertainty may remain, for example because of which version the audio to be recognized comes from. The solution therefore goes on to select a reference fingerprint from the candidate fingerprint set, and then uses the degree of overlap to select same-audio fingerprints from the other candidate fingerprints in the set, which screens the candidate fingerprints further.

The reference fingerprint and its same-audio fingerprints, obtained after these rounds of screening, are the audio fingerprints that are closest to the standard fingerprint of the audio to be recognized and whose corresponding audio is the same or can be regarded as the same. The target audio selected from the audio corresponding to the reference fingerprint and its same-audio fingerprints is therefore the best version of the audio and can serve as the true source of the audio to be recognized. This safeguards the accuracy of both the content and the version of the target audio, and improves the overall efficiency of audio recognition and the user experience. By screening the audio fingerprints in the fingerprint library layer by layer, the solution refines the granularity of audio recognition and retrieves a more accurate target audio.

Description of the drawings

To describe the technical solutions in the embodiments of this application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of this application; those skilled in the art can obtain other drawings from them without creative work.

Figure 1a is a schematic diagram of a scenario of an information interaction system provided by an embodiment of this application;

Figure 1b is a schematic flowchart of an audio recognition method provided by an embodiment of this application;

Figure 2a is a schematic diagram of an audio recognition scenario provided by an embodiment of this application;

Figure 2b is a schematic diagram of a candidate fingerprint set provided by an embodiment of this application;

Figure 2c is a schematic diagram of a recognition result display interface provided by an embodiment of this application;

Figure 3 is a schematic structural diagram of an audio recognition apparatus provided by an embodiment of this application;

Figure 4a is a schematic structural diagram of an audio recognition device provided by an embodiment of this application;

Figure 4b is a schematic structural diagram of another audio recognition device provided by an embodiment of this application.

Embodiments of the invention

The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of this application, not all of them. All other embodiments that those skilled in the art obtain from the embodiments in this application without creative work fall within the protection scope of this application.

The embodiments of this application provide an audio recognition method, apparatus, device, and storage medium.

An embodiment of this application provides an audio recognition method, including:

In some embodiments, obtaining the same-audio fingerprints of the reference fingerprint includes:

calculating the degree of overlap between the reference fingerprint and the other candidate fingerprints in the candidate fingerprint set; and

selecting, according to the degree of overlap, the same-audio fingerprints of the reference fingerprint from the other candidate fingerprints.

In some embodiments, calculating the degree of overlap between the reference fingerprint and the other candidate fingerprints in the candidate fingerprint set includes:

obtaining the longest common subsequence of the reference fingerprint and each other candidate fingerprint in the candidate fingerprint set, and counting the length of the longest common subsequence; and

calculating, according to the length of the longest common subsequence, the degree of overlap between the reference fingerprint and the other candidate fingerprints.

In some embodiments, selecting the same-audio fingerprints of the reference fingerprint from the other candidate fingerprints according to the degree of overlap includes:

screening out, from the other candidate fingerprints, the candidate fingerprints whose degree of overlap with the reference fingerprint is greater than or equal to a preset threshold, as the same-audio fingerprints of the reference fingerprint.

In some embodiments, the method further includes:

if no candidate fingerprint whose degree of overlap with the reference fingerprint is greater than or equal to the preset threshold is found, determining the audio corresponding to the reference fingerprint as the target audio corresponding to the audio to be recognized.
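
The threshold screening and the fallback described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the dictionary layout, and the idea that the degree of overlap arrives as a precomputed number per candidate are all hypothetical, and the text does not say how the degree of overlap is scaled.

```python
def select_same_audio(overlaps, threshold):
    """Return the IDs of candidates whose degree of overlap with the
    reference fingerprint is greater than or equal to the threshold.

    overlaps: dict mapping candidate ID -> degree of overlap with the
    reference fingerprint (the reference itself is not included).
    This layout is an assumption made for the sketch.
    """
    return [cid for cid, overlap in overlaps.items() if overlap >= threshold]


def audio_pool(reference_id, overlaps, threshold):
    """IDs of the audio from which the target audio is chosen."""
    same_audio = select_same_audio(overlaps, threshold)
    # Fallback: when no candidate reaches the threshold, the pool holds only
    # the reference, so its audio becomes the target audio.
    return [reference_id] + same_audio
```

For instance, `audio_pool("ref", {"x": 0.9, "y": 0.2}, 0.8)` keeps `"ref"` and `"x"`, while a threshold no candidate reaches leaves only `"ref"`.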

In some embodiments, selecting a reference fingerprint from the candidate fingerprint set includes:

determining the candidate fingerprint in the candidate fingerprint set that has the largest similarity value with the standard fingerprint as the reference fingerprint.

In some embodiments, calculating the similarity between the standard fingerprint and the audio fingerprints in the preset fingerprint library includes:

counting, for each audio fingerprint in the preset fingerprint library, the number of hash values that it has in common with the standard fingerprint; and

calculating, according to the number of identical hash values, the similarity between the standard fingerprint and each audio fingerprint in the fingerprint library.

In some embodiments, selecting, from the audio corresponding to the reference fingerprint and its same-audio fingerprints, the target audio corresponding to the audio to be recognized includes:

taking the audio corresponding to the reference fingerprint and its same-audio fingerprints as same-audio audio items, and obtaining the version information of the same-audio audio items;

determining the version priority of the same-audio audio items according to the version information; and

taking the same-audio audio item with the highest version priority as the target audio corresponding to the audio to be recognized.
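
A small Python sketch of the version-priority selection follows. The text above does not say what the version information contains or how priorities are assigned, so the version labels and the priority table here are purely hypothetical placeholders.

```python
# Hypothetical priority table: a larger number means a higher priority.
# The labels are invented for illustration and do not come from the source.
VERSION_PRIORITY = {"original": 3, "live": 2, "cover": 1}


def pick_target(same_audio_items):
    """Return the ID of the item whose version has the highest priority.

    same_audio_items: list of (audio_id, version_label) pairs covering the
    reference fingerprint's audio and its same-audio audio items.
    Unknown labels get priority 0; ties go to the earliest item in the list.
    """
    best = max(same_audio_items, key=lambda item: VERSION_PRIORITY.get(item[1], 0))
    return best[0]
```

With the table above, `pick_target([("s1", "cover"), ("s2", "original"), ("s3", "live")])` selects `"s2"`.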

In some embodiments, the reference fingerprint and the other candidate fingerprints in the candidate fingerprint set are all represented as hash sequences;

obtaining the longest common subsequence of the reference fingerprint and the other candidate fingerprints in the candidate fingerprint set includes: using dynamic programming to calculate the length of the longest common subsequence of the hash sequence of the reference fingerprint and the hash sequence of each other candidate fingerprint.

An embodiment of this application provides an information interaction system. The system includes any audio recognition apparatus provided by the embodiments of this application, and the audio recognition apparatus may be integrated in a device such as a server. The system may also include other devices, such as a client. The client may be a terminal, a personal computer (PC), or a similar device, and is used to collect the audio to be recognized and/or upload it to the server.

Referring to Figure 1a, the client sends a recording or a local audio file to the server as the audio to be recognized and requests audio recognition. The server receives the audio to be recognized, extracts its audio fingerprint as the standard fingerprint, and calculates the similarity between the standard fingerprint and the audio fingerprints in the preset fingerprint library. According to that similarity, the server screens out a candidate fingerprint set from the fingerprint library. It then selects a reference fingerprint from the candidate fingerprint set, obtains the same-audio fingerprints of the reference fingerprint, and selects the target audio corresponding to the audio to be recognized from the audio corresponding to the reference fingerprint and its same-audio fingerprints.

After candidate fingerprints similar to the standard fingerprint have been retrieved, the candidates do match the standard fingerprint, but uncertainty may remain, for example because of which version the audio to be recognized comes from. The solution therefore goes on to select a reference fingerprint from the candidate fingerprint set, and then uses the degree of overlap to select same-audio fingerprints from the other candidate fingerprints in the set, which screens the candidate fingerprints further. The reference fingerprint and its same-audio fingerprints, obtained after these rounds of screening, are the audio fingerprints that are closest to the standard fingerprint of the audio to be recognized and whose corresponding audio is the same or can be regarded as the same. The target audio selected from the audio corresponding to the reference fingerprint and its same-audio fingerprints is therefore the best version of the audio and can serve as the true source of the audio to be recognized. This safeguards the accuracy of both the content and the version of the target audio, and improves the overall efficiency of audio recognition and the user experience. By screening the audio fingerprints in the fingerprint library layer by layer, the solution refines the granularity of audio recognition and retrieves a more accurate target audio.

Detailed descriptions are given below.

This embodiment is described from the perspective of the audio recognition apparatus. The apparatus may be integrated in a network device, which may be a terminal, a server, or a similar device; the terminal may be a mobile phone, a tablet computer, a notebook computer, a personal computer (PC), or the like.

As shown in Figure 1b, the specific flow of the audio recognition method may be as follows:

101. Obtain the audio fingerprint of the audio to be recognized as the standard fingerprint, and calculate the similarity between the standard fingerprint and the audio fingerprints in a preset fingerprint library.

The preset fingerprint library stores the audio fingerprint of each audio in the audio library, together with the mapping between the audio fingerprints and the audio in the audio library. For example, the audio recognition apparatus may extract an audio fingerprint from each audio in the audio library in advance, store the extracted fingerprints in the fingerprint library, and record the mapping between each audio and its audio fingerprint.

For example, the audio recognition apparatus obtains the audio to be recognized, extracts its audio fingerprint, and uses that fingerprint as the standard fingerprint for querying the closest or most similar audio fingerprints.

In some embodiments, the audio recognition apparatus may receive an audio recognition request and obtain the audio to be recognized; it then performs audio fingerprint extraction on the audio to be recognized to obtain a hash sequence, and uses the hash sequence as the standard fingerprint.

For example, a user may enter an audio recognition request through the client. After receiving the request, the audio recognition apparatus notifies the client to start audio collection, so that the user's humming or the sound in the environment is recorded to obtain the audio to be recognized for this request. Alternatively, the user may upload audio that is stored locally on the client or downloaded from the network to the audio recognition apparatus, which thereby obtains the audio recognition request and its corresponding audio to be recognized.

The client may be a recording device with an audio collection function, or a terminal device such as a mobile phone, a tablet, or a personal computer.

The audio recognition apparatus then performs audio fingerprint extraction on the audio signal of the audio to be recognized to obtain its audio fingerprint, which contains the audio feature information of the audio to be recognized. Audio fingerprint extraction may specifically include framing, windowing, FFT (Fast Fourier Transform) frequency-domain transformation, local peak extraction, and conversion into a hash sequence.

Specifically, after obtaining the audio to be recognized, the audio recognition apparatus frames and windows its audio signal. Framing cuts the whole audio signal into multiple segments according to a preset rule, each segment being one frame, so that the signal is stationary on a short time scale and later signal processing receives a stationary input. The audio recognition apparatus then applies a preset window function, such as a Hamming window, to each frame, which makes the framed audio signal more continuous and gives it the characteristics of a periodic function.
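
As a rough illustration of framing and windowing, the sketch below uses only the Python standard library. The frame length and hop size are hypothetical parameters (the text only says frames are cut according to a preset rule), and the Hamming window uses the common 0.54 and 0.46 coefficients.

```python
import math


def frame_signal(samples, frame_len, hop):
    """Cut a list of samples into frames of frame_len samples,
    advancing by hop samples each time. A trailing partial frame is dropped."""
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]


def hamming(n):
    """Hamming window coefficients for a frame of n samples (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def window_frames(frames):
    """Multiply every frame, sample by sample, by a Hamming window."""
    if not frames:
        return []
    window = hamming(len(frames[0]))
    return [[s * w for s, w in zip(frame, window)] for frame in frames]
```

A hop smaller than the frame length produces overlapping frames, which is a common choice but not something the text above prescribes.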

The audio recognition apparatus then applies an FFT frequency-domain transformation to each frame of the audio signal to obtain a spectrum containing frequency-domain information. It extracts the local peaks in the spectrum and converts them into a hash sequence; this hash sequence is the audio fingerprint of the audio to be recognized. Note that the hash sequence may contain multiple hash values.
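
The following sketch shows one way the spectrum, peak, and hash steps could fit together. Two things are assumptions and not taken from the text: a naive O(n²) DFT stands in for the FFT so that the example needs no third-party library (it yields the same spectrum, only more slowly), and the hash construction, which packs the strongest peak bins of two consecutive frames into one integer, is invented for illustration because the text does not specify how peaks become hash values.

```python
import cmath


def magnitude_spectrum(frame):
    """Magnitudes of the first n // 2 DFT bins of a frame (naive DFT)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2)]


def local_peaks(spectrum):
    """Indices of bins that are strictly larger than both neighbours."""
    return [k for k in range(1, len(spectrum) - 1)
            if spectrum[k] > spectrum[k - 1] and spectrum[k] > spectrum[k + 1]]


def hash_sequence(frames):
    """Hypothetical hash: for each pair of consecutive frames, combine the
    strongest local-peak bin of each frame as a * 4096 + b (bins < 4096).
    Frames without any local peak are skipped."""
    strongest = []
    for frame in frames:
        spectrum = magnitude_spectrum(frame)
        peaks = local_peaks(spectrum)
        if peaks:
            strongest.append(max(peaks, key=lambda k: spectrum[k]))
    return [a * 4096 + b for a, b in zip(strongest, strongest[1:])]
```

Real fingerprinting schemes typically hash several peaks per frame together with their time offsets; the single-peak pairing here is kept deliberately small.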

The audio recognition apparatus uses the audio fingerprint of the audio to be recognized as the standard fingerprint and calculates the similarity between the standard fingerprint and the audio fingerprints in the preset fingerprint library, thereby retrieving or matching audio fingerprints.

In some embodiments, the standard fingerprint and the audio fingerprints in the fingerprint library are all represented as hash sequences, and the step "calculating the similarity between the standard fingerprint and the audio fingerprints in the preset fingerprint library" may include: counting, for each audio fingerprint in the preset fingerprint library, the number of hash values that it has in common with the standard fingerprint; and calculating, according to the number of identical hash values, the similarity between the standard fingerprint and each audio fingerprint in the fingerprint library.

Taking any one audio fingerprint in the fingerprint library as an example, the audio recognition apparatus compares the hash values in the hash sequence of the standard fingerprint one by one with the hash values in the hash sequence of that audio fingerprint and counts the number of identical hash values. It uses this count as the similarity between the standard fingerprint and that audio fingerprint. In this way, the audio recognition apparatus calculates the similarity between the standard fingerprint and each audio fingerprint in the fingerprint library.
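
The count-based similarity can be sketched as below. The text does not say whether a hash value that occurs several times counts once or several times, so this sketch assumes counting with multiplicity; the function names and the dictionary-shaped library are also illustrative assumptions.

```python
from collections import Counter


def similarity(standard, candidate):
    """Number of hash values shared by two hash sequences.

    Repeated values are counted with multiplicity (an assumption):
    a value occurring twice in both sequences contributes 2.
    """
    common = Counter(standard) & Counter(candidate)
    return sum(common.values())


def similarities(standard, library):
    """Similarity of the standard fingerprint to every library fingerprint.

    library: dict mapping audio ID -> hash sequence (hypothetical layout).
    """
    return {audio_id: similarity(standard, fp) for audio_id, fp in library.items()}
```

Replacing the `Counter` intersection with `len(set(standard) & set(candidate))` would give the count-once variant.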

102. According to the similarity between the standard fingerprint and the audio fingerprints in the fingerprint library, screen out a candidate fingerprint set from the fingerprint library.

For example, the audio recognition apparatus may use a preset similarity threshold to screen out the audio fingerprints in the fingerprint library whose similarity value with the standard fingerprint is greater than the threshold, and treat them as candidate fingerprints that match the standard fingerprint.

Note that a candidate fingerprint matching the standard fingerprint means that its corresponding audio is the same as the audio to be recognized or can be regarded as the same, for example the same song, or the same song in a different arrangement.

The audio recognition apparatus then places the screened candidate fingerprints in one set to obtain the candidate fingerprint set. The candidate fingerprint set thus includes one or more candidate fingerprints that match the standard fingerprint.

103. Select a reference fingerprint from the candidate fingerprint set, and obtain the same-audio fingerprints of the reference fingerprint.

The reference fingerprint is the candidate fingerprint most similar to the standard fingerprint. For example, the audio recognition apparatus may determine the candidate fingerprint in the candidate fingerprint set that has the largest similarity value with the standard fingerprint as the reference fingerprint.
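
Steps 102 and 103 up to this point, threshold screening followed by picking the most similar candidate, can be sketched as follows. The function names and the score dictionary are hypothetical, and returning `None` for an empty candidate set is a choice made for the sketch; the text does not describe that case.

```python
def screen_candidates(scores, similarity_threshold):
    """Keep fingerprints whose similarity to the standard fingerprint
    is strictly greater than the preset similarity threshold.

    scores: dict mapping audio ID -> similarity value (hypothetical layout).
    """
    return {audio_id: s for audio_id, s in scores.items() if s > similarity_threshold}


def pick_reference(candidates):
    """ID of the candidate with the largest similarity value,
    or None when the candidate set is empty."""
    if not candidates:
        return None
    return max(candidates, key=candidates.get)
```

For example, with scores `{"a": 12, "b": 40, "c": 33}` and a threshold of 20, the candidate set holds `"b"` and `"c"`, and `"b"` becomes the reference fingerprint.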

The audio recognition apparatus then selects the same-audio fingerprints of the reference fingerprint. A same-audio fingerprint is one whose corresponding audio is the same as the audio corresponding to the reference fingerprint or can be regarded as the same. For example, the music library of a music platform may contain several audio items that have different numbers but are in fact the same song, such as different versions of the same song, covers by different singers, or the same song included in different albums or radio stations. Multiple audio items that belong to the same song are defined as same-audio audio items, and their audio fingerprints are same-audio fingerprints.

In some embodiments, the step "obtaining the same-audio fingerprints of the reference fingerprint" may include: calculating the degree of overlap between the reference fingerprint and the other candidate fingerprints in the candidate fingerprint set; and selecting, according to the degree of overlap, the same-audio fingerprints of the reference fingerprint from the other candidate fingerprints.

The other candidate fingerprints may be the candidate fingerprints in the candidate fingerprint set other than the reference fingerprint.

The degree of overlap between the reference fingerprint and another candidate fingerprint can be calculated by means such as correlation or the longest common subsequence. For correlation, the variance between the hash sequence of the reference fingerprint and that of the other candidate fingerprint may be calculated, and the variance value used as the degree of overlap between the two. The audio recognition apparatus then takes the other candidate fingerprints whose variance values meet a preset requirement as the same-audio fingerprints of the reference fingerprint.

Taking the longest common subsequence (LCS) as an example, the step "calculating the degree of overlap between the reference fingerprint and the other candidate fingerprints in the candidate fingerprint set" may include: obtaining the longest common subsequence of the reference fingerprint and each other candidate fingerprint in the candidate fingerprint set, and counting the length of the longest common subsequence; and calculating, according to the length of the longest common subsequence, the degree of overlap between the reference fingerprint and the other candidate fingerprints.

The reference fingerprint and the other candidate fingerprints in the candidate fingerprint set are all represented as hash sequences.

A hash sequence is a specific kind of sequence. A subsequence of it is a sequence obtained by removing zero or more elements without changing the relative order of the remaining elements. A sequence that is a subsequence of several hash sequences at once is a common subsequence of those hash sequences, and the longest common subsequence is the longest of these. The length of the longest common subsequence is the number of elements it contains.

For example, dynamic programming (DP) can be used to calculate the length of the longest common subsequence of the hash sequences of the reference fingerprint and another candidate fingerprint. In this embodiment, the length is calculated as follows:

nlcs = LCS(res[i].hash_seq, res[0].hash_seq)

where nlcs is the length of the longest common subsequence, LCS is the dynamic-programming function that calculates the longest common subsequence length, res[i].hash_seq is the hash sequence of the i-th candidate fingerprint, and res[0].hash_seq is the hash sequence of the reference fingerprint.

For example, let the hash sequence of the reference fingerprint be X = {A, B, C, B, D, A, B} and the hash sequence of any other candidate fingerprint be Y = {B, D, C, A, B, A}. Sequences such as {A, B} and {B, C, B, A} are subsequences of both X and Y, and are therefore common subsequences of X and Y. The common subsequences of X and Y are not listed exhaustively here. Among them, {B, C, B, A} contains 4 elements, so its length is 4, and it is a longest common subsequence of X and Y.
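The dynamic-programming calculation can be sketched as follows. This is a minimal illustration only, assuming the hash sequences are ordinary Python lists; the function name lcs_length is hypothetical and stands in for the LCS function named in the formula.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    # prev[j] is the LCS length of the already processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Matching elements extend the LCS of the two shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop one element from either sequence.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


X = list("ABCBDAB")  # hash sequence of the reference fingerprint in the example
Y = list("BDCABA")   # hash sequence of another candidate fingerprint
nlcs = lcs_length(X, Y)  # 4, the length of {B, C, B, A}
```

Only two rows of the table are kept at a time, so memory grows with the length of one sequence while the running time remains proportional to the product of both lengths.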

Taking any one of the other candidate fingerprints as an example, after obtaining the length of the longest common subsequence between it and the reference fingerprint, the audio recognition device calculates the degree of coincidence between the reference fingerprint and that candidate fingerprint. For example, the following formula can be used:

sim = nlcs/hash_seq_cnt×100%

where sim is the similarity (that is, the degree of coincidence) between the reference fingerprint and the other candidate fingerprint, nlcs is the length of the longest common subsequence, and hash_seq_cnt is the length of the hash sequence of the reference fingerprint. In some embodiments, the formula can be coded as int sim = nlcs*1.0/hash_seq_cnt*100.
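As a rough sketch of this formula, the following mirrors the quoted code, including its truncation to an integer percentage. The function name coincidence_percent and the guard against an empty reference sequence are assumptions added for illustration.

```python
def coincidence_percent(nlcs, hash_seq_cnt):
    """sim = nlcs / hash_seq_cnt * 100, truncated to an integer percentage."""
    if hash_seq_cnt <= 0:
        # Assumed guard: an empty reference hash sequence has no defined ratio.
        raise ValueError("hash_seq_cnt must be positive")
    return int(nlcs * 1.0 / hash_seq_cnt * 100)


# With the example sequences: LCS length 4, reference hash sequence length 7.
sim = coincidence_percent(4, 7)  # 57
```

Because the result is truncated rather than rounded, a fingerprint whose ratio falls just below a threshold is not promoted to it.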

In this way, the audio recognition device calculates the degree of coincidence between the reference fingerprint and each of the other candidate fingerprints.

The audio recognition device can then select the homophonic fingerprints of the reference fingerprint from among the other candidate fingerprints.

For example, the audio recognition device may take the other candidate fingerprint with the largest degree of coincidence as the homophonic fingerprint of the reference fingerprint; alternatively, it may sort the other candidate fingerprints by degree of coincidence in descending order and take those ranked within a preset number of top positions as the homophonic fingerprints of the reference fingerprint.
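Both selection rules reduce to ranking the other candidates by degree of coincidence and keeping the first k. A minimal sketch, assuming the coincidence values are held in a dictionary keyed by a candidate identifier (the name top_k_by_coincidence and the sample values are hypothetical):

```python
def top_k_by_coincidence(coincidence, k=1):
    """Return the ids of the k other candidates with the highest coincidence.

    coincidence maps a candidate id to its degree of coincidence with the
    reference fingerprint; the reference fingerprint itself is not included.
    """
    ranked = sorted(coincidence, key=coincidence.get, reverse=True)
    return ranked[:k]


coincidence = {"a": 57, "b": 12, "c": 80}  # made-up values for illustration
best = top_k_by_coincidence(coincidence)        # ["c"]
top_two = top_k_by_coincidence(coincidence, 2)  # ["c", "a"]
```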

In some embodiments, the step of "selecting, according to the degree of coincidence, the homophonic fingerprints of the reference fingerprint from among the other candidate fingerprints" may include: screening out, from the other candidate fingerprints, the candidate fingerprints whose degree of coincidence with the reference fingerprint is greater than or equal to a preset threshold, as the homophonic fingerprints of the reference fingerprint.

The preset threshold can be adjusted flexibly according to actual needs, for example 25%.

In this way, the audio recognition device screens the other candidate fingerprints in the candidate fingerprint set to obtain the homophonic fingerprints of the reference fingerprint.

By calculating similarity, this embodiment saves the labor and time cost of manually tagging homophonic audio in the audio library and avoids delays in manual data entry. In addition, no extra manual tagging or classification of homophonic audio is needed when audio is added to the library, which eliminates the risk of erroneous or missing records and reduces maintenance costs. This embodiment therefore improves the accuracy and efficiency of recognizing homophonic fingerprints and homophonic audio.

In some embodiments, if no candidate fingerprint whose degree of coincidence with the reference fingerprint is greater than or equal to the preset threshold is found, the audio corresponding to the reference fingerprint is determined as the target audio corresponding to the audio to be recognized.

That is, when the audio recognition device cannot find a homophonic fingerprint of the reference fingerprint, it determines that no other candidate fingerprint in the candidate fingerprint set is very similar to the reference fingerprint. The audio recognition device therefore determines the audio corresponding to the reference fingerprint according to the mapping between audio fingerprints and audio in the fingerprint library, and determines that audio as the target audio corresponding to the audio to be recognized.
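The threshold rule and its fallback can be sketched together. This is an illustration under assumptions: coincidence values are integer percentages held in a dictionary, and the function name select_homophonic and the identifiers are hypothetical.

```python
def select_homophonic(reference_id, coincidence, threshold=25):
    """Return (homophonic_ids, ids_whose_audio_is_considered).

    coincidence maps each other candidate id to its degree of coincidence
    (in percent) with the reference fingerprint.
    """
    homophonic = [cid for cid, sim in coincidence.items() if sim >= threshold]
    # The reference fingerprint is always kept. When nothing passes the
    # threshold, it is the only entry and its audio becomes the target audio.
    return homophonic, [reference_id] + homophonic


found, pool = select_homophonic("ref", {"a": 57, "b": 12, "c": 25})
# found == ["a", "c"], pool == ["ref", "a", "c"]
none_found, fallback = select_homophonic("ref", {"b": 12})
# none_found == [], fallback == ["ref"]
```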

104. From the audio corresponding to the reference fingerprint and its homophonic fingerprints, select the target audio corresponding to the audio to be recognized.

After obtaining the reference fingerprint and its homophonic fingerprints, the audio recognition device determines the audio corresponding to them according to the mapping between audio fingerprints and audio in the fingerprint library.

The audio recognition device then selects the target audio from the audio corresponding to the reference fingerprint and its homophonic fingerprints. For example, it may take all of that audio as the target audio corresponding to the audio to be recognized. This avoids missing, because of version differences, audio that is substantially the same as the audio to be recognized, and improves the accuracy of audio fingerprint matching.

In some embodiments, the audio corresponding to the reference fingerprint and its homophonic fingerprints can also be screened according to actual needs. The step of "selecting, from the audio corresponding to the reference fingerprint and its homophonic fingerprints, the target audio corresponding to the audio to be recognized" may include: taking the audio corresponding to the reference fingerprint and its homophonic fingerprints as homophonic audio, and obtaining the version information of the homophonic audio; determining the version priority of the homophonic audio according to the version information; and taking the homophonic audio with the highest version priority as the target audio corresponding to the audio to be recognized.

The version information includes information such as the source of the audio, the singer, and the listing time and/or release time, and may be preset information carried by the audio itself. Homophonic audio may be audio that differs in source, version, or other version information.

For example, based on the source information of the homophonic audio, the audio recognition device sets the version priority of audio whose source is an album to the highest and that of audio whose source is a radio station to the lowest. The audio recognition device thus determines the homophonic audio whose source is an album as the target audio.

As another example, based on the listing time of the homophonic audio, the audio recognition device orders the versions chronologically, setting the version with the earliest listing time to the highest priority and the version with the latest listing time to the lowest. The audio recognition device thus determines the homophonic audio with the earliest listing time as the target audio.
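The two priority rules above could be expressed as a sort key. The sketch below is one possible arrangement, not the embodiment's own: it assumes each homophonic audio is a dictionary with hypothetical "source" and "listed" fields, ranks by source first, and uses the listing time only to break ties. Combining the two rules in this way is an assumption; the text presents them as separate examples.

```python
from datetime import date

# Assumed ranking: lower number means higher version priority.
SOURCE_PRIORITY = {"album": 0, "radio": 2}
DEFAULT_PRIORITY = 1  # any source not listed above


def pick_target(homophonic_audio):
    """Return the homophonic audio with the highest version priority."""
    return min(
        homophonic_audio,
        key=lambda a: (SOURCE_PRIORITY.get(a["source"], DEFAULT_PRIORITY),
                       a["listed"]),
    )


audios = [
    {"id": 1, "source": "radio", "listed": date(2015, 1, 1)},
    {"id": 2, "source": "album", "listed": date(2018, 5, 1)},
    {"id": 3, "source": "album", "listed": date(2016, 3, 1)},
]
target = pick_target(audios)  # the album version listed earliest, id 3
```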

The target audio is thus the audio that is most similar to the audio to be recognized and whose version is the most accurate.

As can be seen from the above, the embodiments of this application can extract the audio fingerprint of the audio to be recognized as a standard fingerprint and calculate the similarity between the standard fingerprint and the audio fingerprints in a preset fingerprint library; screen out a candidate fingerprint set from the fingerprint library according to that similarity, the candidate fingerprint set including audio fingerprints similar to the standard fingerprint; then select a reference fingerprint from the candidate fingerprint set and obtain the homophonic fingerprints of the reference fingerprint; and select, from the audio corresponding to the reference fingerprint and its homophonic fingerprints, the target audio corresponding to the audio to be recognized. After candidate fingerprints similar to the standard fingerprint have been retrieved, a candidate fingerprint matches the standard fingerprint but may still be uncertain because of, for example, the version of the audio to be recognized. The solution therefore goes on to select a reference fingerprint from the candidate fingerprint set and then, by calculating the degree of coincidence, selects homophonic fingerprints from the other candidate fingerprints in the set, thereby screening the candidate fingerprints further. The reference fingerprint and its homophonic fingerprints obtained through these screenings comprise the audio fingerprints that are most similar to the standard fingerprint of the audio to be recognized and whose corresponding audio is the same or can be regarded as the same. The target audio selected from the corresponding audio is therefore the best version and can serve as the true origin or source of the audio to be recognized, which ensures the accuracy of both the content and the version of the target audio and improves the overall efficiency of audio recognition and the user experience. By screening the audio fingerprints in the fingerprint library layer by layer, the solution refines the granularity of audio recognition and retrieves a more accurate target audio.

Based on the method described in the foregoing embodiments, a further detailed example is given below.

For example, referring to FIG. 2a, this embodiment is described with the audio recognition device integrated in a server cluster. The server cluster includes a feature extraction server, leaf servers and a root server. The system may include one or more feature extraction servers, leaf servers and root servers. In this embodiment, the system includes one feature extraction server, multiple leaf servers and one root server.

(1) The client uploads the audio to be recognized.

The user can upload recorded audio or locally stored audio to the feature extraction server through audio recognition software, music software or the like installed on the client.

(2) Extract the audio fingerprint.

The feature extraction server extracts the audio fingerprint of the audio to be recognized as the standard fingerprint. The feature extraction server then sends the standard fingerprint to each leaf server for audio fingerprint matching.

(3) Fingerprint matching.

Each leaf server extracts a portion of the audio fingerprints from the fingerprint library for audio fingerprint matching. For example, each leaf server can extract its assigned audio fingerprints from the fingerprint library according to a preset distribution rule, so that the massive data is split up and processed in parallel, which speeds up audio recognition.

Any one leaf server is taken as an example.

The leaf server calculates the similarity between the standard fingerprint and each audio fingerprint in the fingerprint library. For example, the leaf server can count, for each audio fingerprint in the fingerprint library, the number of identical hash values that it shares with the standard fingerprint, and use that number as the similarity between the standard fingerprint and that audio fingerprint.

The leaf server then determines the audio fingerprints whose similarity to the standard fingerprint is greater than a preset similarity threshold as candidate fingerprints, and sends the candidate fingerprints to the root server.
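The leaf server's matching step can be sketched as below. The text does not say whether repeated hash values are counted once or several times; this sketch assumes a multiset intersection, so a value occurring twice in both fingerprints counts twice. The function names and the layout of the shard (a dictionary from audio id to hash sequence) are likewise assumptions.

```python
from collections import Counter


def shared_hash_count(standard, other):
    """Number of hash values two fingerprints have in common (as multisets)."""
    common = Counter(standard) & Counter(other)
    return sum(common.values())


def leaf_candidates(standard, library_shard, sim_threshold):
    """Return the entries of this leaf's shard that qualify as candidates."""
    results = []
    for audio_id, hashes in library_shard.items():
        score = shared_hash_count(standard, hashes)
        if score > sim_threshold:  # strictly greater than, as in the text
            results.append({"id": audio_id, "score": score, "hash_seq": hashes})
    return results


standard = [1, 2, 3, 3, 4]
shard = {10: [3, 3, 5, 1], 11: [9, 8], 12: [4]}
candidates = leaf_candidates(standard, shard, sim_threshold=1)
# Only audio 10 qualifies: it shares the values 1, 3 and 3, giving a score of 3.
```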

(4) Homophonic recognition.

After obtaining the candidate fingerprints sent by the leaf servers, the root server places them in a candidate fingerprint set, and then selects the reference fingerprint and its homophonic fingerprints from the candidate fingerprint set.

For example, the root server takes the candidate fingerprint in the candidate fingerprint set with the largest similarity to the standard fingerprint as the reference fingerprint.
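A minimal sketch of how the root server might merge the leaf results and pick the reference fingerprint, assuming each leaf returns a list of dictionaries with hypothetical "id" and "score" fields. After sorting, index 0 holds the reference fingerprint, which matches the res[0] convention used in the LCS formula.

```python
def build_candidate_set(leaf_results):
    """Merge per-leaf candidate lists and rank them by score, highest first."""
    merged = [c for leaf in leaf_results for c in leaf]
    merged.sort(key=lambda c: c["score"], reverse=True)
    return merged


leaf_results = [
    [{"id": 101, "score": 12}, {"id": 102, "score": 30}],  # from one leaf
    [{"id": 201, "score": 18}],                            # from another leaf
]
res = build_candidate_set(leaf_results)
reference = res[0]  # candidate most similar to the standard fingerprint
others = res[1:]    # the other candidate fingerprints
```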

The root server then calculates the degree of coincidence between the reference fingerprint and each of the other candidate fingerprints in the candidate fingerprint set. In one implementation, the root server obtains the longest common subsequence of the reference fingerprint and each other candidate fingerprint in the candidate fingerprint set, counts its length, and uses that length as the degree of coincidence between the reference fingerprint and the corresponding candidate fingerprint.

The root server then selects the homophonic fingerprints of the reference fingerprint from the other candidate fingerprints according to the degree of coincidence. In one implementation, the root server screens out, from the other candidate fingerprints, the candidate fingerprints whose degree of coincidence with the reference fingerprint is greater than or equal to a preset threshold, as the homophonic fingerprints of the reference fingerprint.

The root server thus identifies the homophonic fingerprints.

For example, in FIG. 2b, idx is the rank of a candidate fingerprint by its similarity to the standard fingerprint, where the audio fingerprint with idx 0 has the largest similarity to the standard fingerprint; id is the number of the audio corresponding to the candidate fingerprint, by which the corresponding audio can be found; score is the similarity between the candidate fingerprint and the standard fingerprint, a larger value indicating higher similarity; and lcs is the length of the longest common subsequence of the candidate fingerprint and the reference fingerprint, which serves as their degree of coincidence.

Taking FIG. 2b as an example, with a similarity threshold of 9, the candidate fingerprint set assembled by the root server contains 35 candidate fingerprints in total; that is, the similarity score of each of these 35 candidate fingerprints to the standard fingerprint is greater than 9.

The audio fingerprint with idx 0 has the largest similarity to the standard fingerprint and is taken as the reference fingerprint, so its lcs with itself is 100. The root server calculates the lcs length between each candidate fingerprint in the candidate fingerprint set (idx 0 to 34) and the reference fingerprint as the degree of coincidence. If the preset threshold is 25, the root server takes all candidate fingerprints with a degree of coincidence of 25 or above as homophonic fingerprints of the reference fingerprint.

(5) Audio screening.

After obtaining the reference fingerprint and its homophonic fingerprints, the root server selects, from the audio corresponding to them, the target audio corresponding to the audio to be recognized.

For example, the root server takes the audio corresponding to the reference fingerprint and its homophonic fingerprints as homophonic audio and obtains the version information of the homophonic audio; determines the version priority of the homophonic audio according to the version information; and takes the homophonic audio with the highest version priority as the target audio corresponding to the audio to be recognized.

Taking FIG. 2b above as an example, if the root server determines that the audio corresponding to the candidate fingerprint with idx 26 is the target audio, it outputs that audio's id.

(6) Result output.

The root server returns the target audio obtained by screening to the client, so that the client can play it to the user.

For example, in FIG. 2c, the client obtains the audio id returned by the root server, retrieves the target audio corresponding to that number from the audio library, and presents it to the user on the recognition result display interface. The display interface can also show information such as the name of the target audio, the singer and the source (for example, an album), and provide a play button for the user.

As can be seen from the above, the user can upload the audio to be recognized to the server cluster, and the server cluster performs fingerprint matching in parallel on the leaf servers, which speeds up audio retrieval. The root server further screens the matching results of the leaf servers to select the target audio whose content is closest to the audio to be recognized and whose version best matches the user's needs, which improves audio recognition efficiency and the user experience.

To better implement the above method, the embodiments of this application may also provide an audio recognition device. The audio recognition device may be integrated in a network device, which may be a terminal, a server or a similar device.

The embodiments of this application provide an audio recognition device, which may include:

In some embodiments, the homophonic unit is configured to: calculate the degree of coincidence between the reference fingerprint and each of the other candidate fingerprints in the candidate fingerprint set; and select, according to the degree of coincidence, the homophonic fingerprints of the reference fingerprint from the other candidate fingerprints.

In some embodiments, the homophonic unit is configured to: obtain the longest common subsequence of the reference fingerprint and each other candidate fingerprint in the candidate fingerprint set, and count the length of the longest common subsequence; and calculate, according to that length, the degree of coincidence between the reference fingerprint and the other candidate fingerprint.

In some embodiments, the homophonic unit is configured to: screen out, from the other candidate fingerprints, the candidate fingerprints whose degree of coincidence with the reference fingerprint is greater than or equal to a preset threshold, as the homophonic fingerprints of the reference fingerprint.

In some embodiments, the audio unit is further configured to: if no candidate fingerprint whose degree of coincidence with the reference fingerprint is greater than or equal to the preset threshold is found, determine the audio corresponding to the reference fingerprint as the target audio corresponding to the audio to be recognized.

In some embodiments, the homophonic unit is configured to: determine the candidate fingerprint in the candidate fingerprint set with the largest similarity to the standard fingerprint as the reference fingerprint.

In some embodiments, the fingerprint unit is configured to: count, for each audio fingerprint in the preset fingerprint library, the number of identical hash values that it shares with the standard fingerprint; and calculate, according to that number, the similarity between the standard fingerprint and each audio fingerprint in the fingerprint library.

In some embodiments, the audio unit is configured to: take the audio corresponding to the reference fingerprint and its homophonic fingerprints as homophonic audio, and obtain the version information of the homophonic audio; determine the version priority of the homophonic audio according to the version information; and take the homophonic audio with the highest version priority as the target audio corresponding to the audio to be recognized.

The homophonic unit is configured to: use dynamic programming to calculate the length of the longest common subsequence of the hash sequences of the reference fingerprint and the other candidate fingerprints.

For example, as shown in FIG. 3, the audio recognition device may include a fingerprint unit 301, a candidate unit 302, a homophonic unit 303 and an audio unit 304, as follows:

(1) Fingerprint unit 301;

The fingerprint unit 301 is configured to extract the audio fingerprint of the audio to be recognized as a standard fingerprint, and to calculate the similarity between the standard fingerprint and the audio fingerprints in a preset fingerprint library.

For example, the fingerprint unit 301 obtains the audio to be recognized, extracts its audio fingerprint, and uses that audio fingerprint as the standard fingerprint for finding the audio fingerprints that are closest or most similar to it.

In some embodiments, the fingerprint unit 301 may receive an audio recognition request and obtain the audio to be recognized; and perform audio fingerprint extraction on the audio to be recognized to obtain a hash sequence, which is used as the standard fingerprint.

For example, the user can enter an audio recognition request on the client. After receiving the request, the fingerprint unit 301 notifies the client to start audio capture, so that the user's humming or the sound in the environment is recorded; the recording is the audio to be recognized for this audio recognition request. The user can also upload to the fingerprint unit 301 audio stored locally on the client or downloaded from the network. In either case, the fingerprint unit 301 obtains the audio recognition request and its corresponding audio to be recognized.

The fingerprint unit 301 then performs audio fingerprint extraction on the audio signal of the audio to be recognized to obtain its audio fingerprint, which contains the audio feature information of the audio to be recognized. Audio fingerprint extraction may specifically include framing the audio signal, windowing, FFT (Fast Fourier Transform) frequency-domain transformation, extraction of local peaks, and conversion into a hash sequence.

Specifically, after obtaining the audio to be recognized, the fingerprint unit 301 frames and windows its audio signal. Framing cuts the whole audio signal into segments according to a preset rule, each segment being one frame, so that the signal is stationary within each short frame and a stationary signal is available for subsequent processing. The fingerprint unit 301 then applies a preset window function, such as a Hamming window, to each frame, so that the framed audio signal is more continuous and exhibits the characteristics of a periodic function.

The fingerprint unit 301 then applies the FFT frequency-domain transformation to each frame of the audio signal to obtain a spectrum containing frequency-domain information. It then extracts the local peaks in the spectrum and converts them into a hash sequence, which is the audio fingerprint of the audio to be recognized. Note that the hash sequence may include multiple hash values.
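The extraction steps can be illustrated with a deliberately simplified sketch. Several choices here are assumptions rather than details from the text: a naive discrete Fourier transform stands in for the FFT (same spectrum, slower), only the single strongest bin of each frame is kept instead of a set of local peaks, and the hash that packs two consecutive peak bins into one integer is hypothetical. The frame length and hop size are also arbitrary.

```python
import cmath
import math


def split_frames(signal, frame_len, hop):
    """Framing: cut the signal into fixed-length, possibly overlapping frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def magnitude_spectrum(frame):
    """Magnitudes of the first n/2 DFT bins (naive O(n^2) DFT, not an FFT)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2)]


def fingerprint(signal, frame_len=64, hop=32):
    """Return a hash sequence built from the strongest bin of each frame."""
    window = hamming(frame_len)
    peaks = []
    for frame in split_frames(signal, frame_len, hop):
        spectrum = magnitude_spectrum([s * w for s, w in zip(frame, window)])
        peaks.append(max(range(len(spectrum)), key=spectrum.__getitem__))
    # Hypothetical hash: pack each pair of consecutive peak bins into one int.
    return [(p << 16) | q for p, q in zip(peaks, peaks[1:])]


# A pure tone with 8 cycles per 64-sample frame peaks at bin 8 in every frame.
tone = [math.sin(2 * math.pi * 8 * t / 64) for t in range(256)]
hash_seq = fingerprint(tone)
```

A production system would use an FFT routine and a more robust peak-pairing scheme; the point of the sketch is only the order of the steps: framing, windowing, transform, peak extraction, hashing.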

The fingerprint unit 301 uses the audio fingerprint of the audio to be recognized as the standard fingerprint and calculates the similarity between the standard fingerprint and the audio fingerprints in the preset fingerprint library, thereby retrieving or matching audio fingerprints.

In some embodiments, the standard fingerprint and the audio fingerprints in the fingerprint library are all represented by hash sequences, and the fingerprint unit 301 may be configured to: count, for each audio fingerprint in the preset fingerprint library, the number of identical hash values that it shares with the standard fingerprint; and calculate, according to that number, the similarity between the standard fingerprint and each audio fingerprint in the fingerprint library.

Taking any audio fingerprint in the fingerprint library as an example, the fingerprint unit 301 compares the hash values in the hash sequence of the standard fingerprint one by one with the hash values in the hash sequence of that audio fingerprint, counts the number of identical hash values, and uses that number as the similarity between the standard fingerprint and the audio fingerprint. In this way, the fingerprint unit 301 calculates the similarity between the standard fingerprint and each audio fingerprint in the fingerprint library.

(2) Candidate unit 302;

The candidate unit 302 is configured to screen out a candidate fingerprint set from the fingerprint library according to the similarity between the standard fingerprint and the audio fingerprints in the fingerprint library.

For example, the candidate unit 302 may, according to a preset similarity threshold, screen out the audio fingerprints in the fingerprint library whose similarity to the standard fingerprint is greater than the similarity threshold, as candidate fingerprints that match the standard fingerprint.

The candidate unit 302 then places the candidate fingerprints obtained by screening into one set to obtain the candidate fingerprint set. The candidate fingerprint set thus includes one or more candidate fingerprints that match the standard fingerprint.
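As a sketch of the threshold screening, assuming the similarities have already been computed and are held in a dictionary keyed by a hypothetical fingerprint identifier:

```python
def screen_candidates(similarities, threshold):
    """Keep the fingerprints whose similarity is strictly greater than threshold."""
    return {fid: sim for fid, sim in similarities.items() if sim > threshold}


# Hypothetical similarity values and threshold.
similarities = {"track_a": 2, "track_b": 41, "track_c": 37, "track_d": 5}
candidate_set = screen_candidates(similarities, threshold=30)
```

The comparison is strict because the text says "greater than" the similarity threshold; the threshold value itself is not given in the text.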

(3) Same-audio unit 303;

The same-audio unit 303 is configured to select a reference fingerprint from the candidate fingerprint set and obtain the same-audio fingerprints of the reference fingerprint.

The reference fingerprint is the candidate fingerprint most similar to the standard fingerprint. For example, the same-audio unit 303 may determine the candidate fingerprint in the candidate fingerprint set that has the largest similarity to the standard fingerprint as the reference fingerprint.
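Selecting the reference fingerprint then reduces to an argmax over the candidate set. The sketch below assumes the candidate set is a dictionary from a hypothetical fingerprint identifier to its similarity; how ties are broken is not specified in the text, so this version simply keeps the first maximum encountered.

```python
def pick_reference(candidate_set):
    """Return the id of the candidate with the largest similarity, or None if empty."""
    if not candidate_set:
        return None
    return max(candidate_set, key=candidate_set.get)


reference_id = pick_reference({"track_b": 41, "track_c": 37})
```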

Then, the same-audio unit 303 selects the same-audio fingerprints of the reference fingerprint. Note that a same-audio fingerprint is one whose corresponding audio is the same as, or can be regarded as the same as, the audio corresponding to the reference fingerprint. For example, the music library of a music platform may contain multiple audios that have different numbers but are actually the same song, such as different versions of the same song, versions covered by different singers, or the same song included in different albums or radio stations. Multiple audios belonging to the same song are defined as same-audio audio, and their audio fingerprints are same-audio fingerprints.

In some embodiments, the same-audio unit 303 may be specifically configured to: calculate the degree of coincidence between the reference fingerprint and the other candidate fingerprints in the candidate fingerprint set; and select, according to the degree of coincidence, the same-audio fingerprints of the reference fingerprint from the other candidate fingerprints.

The degree of coincidence between the reference fingerprint and the other candidate fingerprints may be calculated by means such as correlation or the longest common subsequence. For correlation, the variance between the hash sequences of the reference fingerprint and another candidate fingerprint may be calculated, and the variance value used as the degree of coincidence between them. The same-audio unit 303 then takes the other candidate fingerprints whose variance values meet a preset requirement as same-audio fingerprints of the reference fingerprint.

Taking the longest common subsequence (LCS) as an example, the same-audio unit 303 may be configured to: obtain the longest common subsequence of the reference fingerprint and each other candidate fingerprint in the candidate fingerprint set, and count the length of the longest common subsequence; and calculate, according to the length of the longest common subsequence, the degree of coincidence between the reference fingerprint and the other candidate fingerprint.

For example, dynamic programming (DP) may be used to calculate the length of the longest common subsequence of the hash sequences of the reference fingerprint and another candidate fingerprint. In this embodiment, this length is calculated as follows:
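The formula itself is not reproduced in this text. For orientation only, the textbook LCS recurrence is given below; whether the embodiment uses exactly this form is an assumption. The symbols are illustrative: A = (a_1, ..., a_m) is the hash sequence of the reference fingerprint, B = (b_1, ..., b_n) is that of the other candidate fingerprint, and c(i, j) is the LCS length of the first i hashes of A and the first j hashes of B.

```latex
c(i, j) =
\begin{cases}
0, & i = 0 \ \text{or} \ j = 0, \\
c(i-1,\, j-1) + 1, & i, j > 0 \ \text{and} \ a_i = b_j, \\
\max\bigl(c(i-1,\, j),\ c(i,\, j-1)\bigr), & i, j > 0 \ \text{and} \ a_i \neq b_j.
\end{cases}
```

Under this reading, the longest common subsequence length of the two hash sequences is c(m, n).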

Taking any other candidate fingerprint as an example, after obtaining the length of the longest common subsequence between the reference fingerprint and that candidate fingerprint, the same-audio unit 303 calculates the degree of coincidence between the two. For example, the following formula may be used:

sim = nlcs / hash_seq_cnt × 100%;

In this way, the same-audio unit 303 can calculate the degree of coincidence between the reference fingerprint and each of the other candidate fingerprints.
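The LCS-based degree of coincidence can be sketched in Python as follows. This section does not define `hash_seq_cnt`; the sketch assumes it is the number of hash values in the reference fingerprint's hash sequence and takes `nlcs` to be the LCS length. The function names and sample sequences are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def coincidence_degree(reference_fp, other_fp):
    """sim = nlcs / hash_seq_cnt * 100 (a percentage).

    Assumption: hash_seq_cnt is len(reference_fp).
    """
    if not reference_fp:
        return 0.0
    return 100.0 * lcs_length(reference_fp, other_fp) / len(reference_fp)


# Hypothetical hash sequences.
reference = [11, 22, 33, 44, 55]
other = [11, 33, 44, 99, 55]
sim = coincidence_degree(reference, other)
```

Keeping only two rows of the DP table bounds the memory by the length of the shorter dimension, which matters when hash sequences are long; the running time remains proportional to the product of the two sequence lengths.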

Then, the same-audio unit 303 may select the same-audio fingerprints of the reference fingerprint from the other candidate fingerprints.

For example, the same-audio unit 303 may take the other candidate fingerprint with the largest degree of coincidence as the same-audio fingerprint of the reference fingerprint; alternatively, the audio recognition apparatus may sort the degrees of coincidence in descending order and take the other candidate fingerprints ranked within a preset number of top positions as the same-audio fingerprints of the reference fingerprint.

In some embodiments, the same-audio unit 303 may be configured to screen out, from the other candidate fingerprints, the candidate fingerprints whose degree of coincidence with the reference fingerprint is greater than or equal to a preset threshold, as the same-audio fingerprints of the reference fingerprint.

In this way, the same-audio unit 303 obtains the same-audio fingerprints of the reference fingerprint by screening the other candidate fingerprints in the candidate fingerprint set.
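The selection strategies described above (largest value, top preset positions, or threshold) can be sketched as two small functions. The degrees of coincidence are assumed to be percentages held in a dictionary keyed by a hypothetical fingerprint identifier; the names and values are illustrative only.

```python
def select_top_k(degrees, k):
    """Return the ids of the k candidates with the largest degree of coincidence."""
    ranked = sorted(degrees, key=degrees.get, reverse=True)
    return ranked[:k]


def select_by_threshold(degrees, threshold):
    """Return the ids whose degree of coincidence is >= threshold."""
    return [fid for fid, degree in degrees.items() if degree >= threshold]


# Hypothetical degrees of coincidence with the reference fingerprint.
degrees = {"track_c": 92.0, "track_e": 40.0, "track_f": 85.5}
best_only = select_top_k(degrees, 1)
top_two = select_top_k(degrees, 2)
above = select_by_threshold(degrees, 80.0)
```

With `k = 1`, `select_top_k` covers the "largest degree of coincidence" case. Only the threshold variant can return an empty list for a non-empty input.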

Thus, by calculating similarity, the same-audio unit 303 saves the labor and time cost of marking same-audio audio in the audio library and avoids delays in manual information entry. Moreover, when audio is added to the library, no additional manual marking or classification of same-audio audio is needed, which eliminates the risk of erroneous or missing records and reduces maintenance cost. This embodiment therefore improves the accuracy and efficiency of identifying same-audio fingerprints and same-audio audio.

In some embodiments, if no candidate fingerprint whose degree of coincidence with the reference fingerprint is greater than or equal to the preset threshold is found, the audio unit 304 determines the audio corresponding to the reference fingerprint as the target audio corresponding to the audio to be recognized.

Thus, when no same-audio fingerprint of the reference fingerprint can be found, the same-audio unit 303 determines that the candidate fingerprint set contains no other candidate fingerprint that is very similar to the reference fingerprint. The audio unit 304 therefore determines the audio corresponding to the reference fingerprint according to the mapping between the audio fingerprints and the audio in the fingerprint library, and determines that audio as the target audio corresponding to the audio to be recognized.

(4) Audio unit 304;

The audio unit 304 is configured to select the target audio corresponding to the audio to be recognized from the audio corresponding to the reference fingerprint and its same-audio fingerprints.

After the reference fingerprint and its same-audio fingerprints are obtained, the audio unit 304 determines the audio corresponding to them according to the mapping between the audio fingerprints and the audio in the fingerprint library.

Then, the audio unit 304 selects the target audio from the audio corresponding to the reference fingerprint and its same-audio fingerprints. For example, the audio unit 304 takes all of the audio corresponding to the reference fingerprint and its same-audio fingerprints as the target audio corresponding to the audio to be recognized.

In some embodiments, the audio corresponding to the reference fingerprint and its same-audio fingerprints may also be filtered according to actual needs. The audio unit 304 may be specifically configured to: take the audio corresponding to the reference fingerprint and its same-audio fingerprints as same-audio audio, and obtain version information of the same-audio audio; determine the version priority of the same-audio audio according to the version information; and take the same-audio audio with the highest version priority as the target audio corresponding to the audio to be recognized.

For example, according to the source information of the same-audio audio, the audio unit 304 sets the version priority of audio whose source is an album to the highest and that of audio whose source is a radio station to the lowest. The audio unit 304 thus determines the same-audio audio whose source is an album as the target audio.

As another example, according to the release time of the same-audio audio, the audio unit 304 sets the version priority in chronological order, with the earliest-released version having the highest priority and the latest-released version the lowest. The audio unit 304 thus determines the same-audio audio with the earliest release time as the target audio.
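The two priority rules can be sketched as a sort key. The text presents source and release time as separate examples; combining them, with the source rule applied first and release time as a tie-breaker, is an assumption made here for illustration. The `Track` record, its field names, and the rank table are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Track:
    track_id: str
    source: str          # e.g. "album" or "radio"
    release_date: date


# Lower rank means higher priority; sources not listed fall in between (rank 1).
SOURCE_RANK = {"album": 0, "radio": 2}


def pick_target(tracks):
    """Pick the same-audio track with the highest version priority."""
    return min(tracks, key=lambda t: (SOURCE_RANK.get(t.source, 1), t.release_date))


tracks = [
    Track("r1", "radio", date(2015, 1, 1)),
    Track("a2", "album", date(2018, 5, 1)),
    Track("a1", "album", date(2016, 3, 1)),
]
target = pick_target(tracks)
```

Here the radio version is released earliest but ranks lowest by source, so the earlier of the two album versions is chosen.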

As can be seen from the above, in the embodiments of this application the fingerprint unit 301 extracts the audio fingerprint of the audio to be recognized as the standard fingerprint and calculates the similarity between the standard fingerprint and the audio fingerprints in the preset fingerprint library; the candidate unit 302 screens out a candidate fingerprint set from the fingerprint library according to that similarity, the candidate fingerprint set including audio fingerprints similar to the standard fingerprint; the same-audio unit 303 then selects a reference fingerprint from the candidate fingerprint set and obtains the same-audio fingerprints of the reference fingerprint; and the audio unit 304 selects the target audio corresponding to the audio to be recognized from the audio corresponding to the reference fingerprint and its same-audio fingerprints. After candidate fingerprints similar to the standard fingerprint are retrieved, the candidate fingerprints match the standard fingerprint, but uncertainty may remain, for example because of the version of the audio to be recognized. The solution therefore further selects a reference fingerprint from the candidate fingerprint set and then, by calculating the degree of coincidence, selects same-audio fingerprints from the other candidate fingerprints in the set, thereby screening the candidate fingerprints further. The reference fingerprint and its same-audio fingerprints obtained through these successive screenings comprise the audio fingerprints that are most similar to the standard fingerprint of the audio to be recognized and whose corresponding audio is the same or can be regarded as the same. The target audio selected from the corresponding audio is therefore the optimal version and can be taken as the true origin or source of the audio to be recognized, which ensures the accuracy of both the content and the version of the target audio and improves the overall efficiency and user experience of audio recognition. By screening the audio fingerprints in the fingerprint library layer by layer, the solution refines the granularity of audio recognition and thereby retrieves a more accurate target audio.
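The layered screening summarized above can be put together in one compact Python sketch. It is an illustration under stated assumptions rather than the embodiment's implementation: similarity is a multiset count of shared hash values, the degree of coincidence is the LCS length divided by the length of the reference hash sequence (as a percentage), and all names, thresholds, and data are hypothetical.

```python
from collections import Counter


def shared_hashes(a, b):
    """Number of hash values common to both sequences (multiset intersection)."""
    return sum((Counter(a) & Counter(b)).values())


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def recognize(standard_fp, library, sim_threshold, coincidence_threshold):
    """Return the ids of the reference fingerprint and its same-audio fingerprints."""
    # Steps 1 and 2: similarity against every library fingerprint, then screening.
    sims = {fid: shared_hashes(standard_fp, fp) for fid, fp in library.items()}
    candidates = {fid: s for fid, s in sims.items() if s > sim_threshold}
    if not candidates:
        return []
    # Step 3: the reference fingerprint is the most similar candidate.
    ref_id = max(candidates, key=candidates.get)
    ref_fp = library[ref_id]
    ref_len = max(len(ref_fp), 1)  # guard against an empty sequence
    same_audio = [
        fid for fid in candidates
        if fid != ref_id
        and 100.0 * lcs_length(ref_fp, library[fid]) / ref_len >= coincidence_threshold
    ]
    # If same_audio is empty, only the reference fingerprint is returned.
    return [ref_id] + same_audio


library = {
    "album_cut": [1, 2, 3, 4, 5, 6],
    "radio_cut": [1, 2, 3, 4, 5, 7],
    "other_song": [1, 2, 9, 9, 9, 9],
    "unrelated": [8, 8, 8],
}
result = recognize([1, 2, 3, 4, 5, 6], library, sim_threshold=1, coincidence_threshold=80.0)
```

The final step, choosing one target audio among the returned identifiers by version priority, is left out of this sketch.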

An embodiment of this application further provides an audio recognition device. FIG. 4a shows a schematic structural diagram of the audio recognition device involved in the embodiments of this application. Specifically:

The audio recognition device may include components such as a processor 401 with one or more processing cores, a memory 402 with one or more computer-readable storage media, a power supply 403, and an input unit 404. Those skilled in the art will understand that the structure shown in FIG. 4a does not limit the audio recognition device, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components. In particular:

The processor 401 is the control center of the audio recognition device. It connects the various parts of the device through various interfaces and lines, and performs the device's functions and processes data by running or executing the software programs and/or modules stored in the memory 402 and calling the data stored in the memory 402, thereby monitoring the device as a whole. Optionally, the processor 401 may include one or more processing cores. Preferably, the processor 401 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may alternatively not be integrated into the processor 401.

The memory 402 may be used to store software programs and modules. The processor 401 executes various functional applications and performs data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area. The program storage area may store the operating system and the application programs required by at least one function (such as the audio recognition function); the data storage area may store data created through use of the audio recognition device. In addition, the memory 402 may include a high-speed random access memory and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage devices. Correspondingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.

The audio recognition device further includes a power supply 403 that supplies power to the components. Preferably, the power supply 403 may be logically connected to the processor 401 through a power management system, so that functions such as charging, discharging, and power consumption management are handled by the power management system. The power supply 403 may also include any components such as one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, and a power status indicator.

The audio recognition device may further include an input unit 404, which may be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical, or trackball signal input related to user settings and function control.

In addition, referring to FIG. 4b, the audio recognition device may further include an audio collection apparatus 405 configured to collect the audio to be recognized. For example, the audio collection apparatus 405 may collect the audio to be recognized by recording or similar means.

Although not shown, the audio recognition device may also include a display unit and the like, which are not described further here. Specifically, in this embodiment, the processor 401 in the audio recognition device loads the executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and runs the application programs stored in the memory 402 to implement various functions, as follows:

Extract the audio fingerprint of the audio to be recognized as the standard fingerprint, and calculate the similarity between the standard fingerprint and the audio fingerprints in the preset fingerprint library; screen out a candidate fingerprint set from the fingerprint library according to the similarity between the standard fingerprint and the audio fingerprints in the fingerprint library; select a reference fingerprint from the candidate fingerprint set, and obtain the same-audio fingerprints of the reference fingerprint; and select the target audio corresponding to the audio to be recognized from the audio corresponding to the reference fingerprint and its same-audio fingerprints.

In some embodiments, the processor 401 may also run the application programs stored in the memory 402 to implement the following functions:

Calculate the degree of coincidence between the reference fingerprint and the other candidate fingerprints in the candidate fingerprint set; and select, according to the degree of coincidence, the same-audio fingerprints of the reference fingerprint from the other candidate fingerprints.

Obtain the longest common subsequence of the reference fingerprint and each other candidate fingerprint in the candidate fingerprint set, and count the length of the longest common subsequence; and calculate, according to the length of the longest common subsequence, the degree of coincidence between the reference fingerprint and the other candidate fingerprint.

Count, for each audio fingerprint in the preset fingerprint library, the number of identical hash values that it shares with the standard fingerprint; and calculate, according to the number of identical hash values, the similarity between the standard fingerprint and each audio fingerprint in the fingerprint library.

Take the audio corresponding to the reference fingerprint and its same-audio fingerprints as same-audio audio, and obtain version information of the same-audio audio; determine the version priority of the same-audio audio according to the version information; and take the same-audio audio with the highest version priority as the target audio corresponding to the audio to be recognized.

In some embodiments, the reference fingerprint and the other candidate fingerprints in the candidate fingerprint set are all represented by hash sequences; the processor 401 may also run the application programs stored in the memory 402 to implement the following function:

Use dynamic programming to calculate the length of the longest common subsequence of the hash sequences of the reference fingerprint and the other candidate fingerprints.

For the specific implementation of each of the above operations, refer to the preceding embodiments; details are not repeated here.

A person of ordinary skill in the art will understand that all or some of the steps of the methods in the foregoing embodiments may be completed by instructions, or by instructions controlling the relevant hardware. The instructions may be stored in a computer-readable storage medium and loaded and executed by a processor.

To this end, an embodiment of this application provides a storage medium storing a plurality of instructions that can be loaded by a processor to execute the steps of any audio recognition method provided in the embodiments of this application. For example, the instructions may perform the following steps:

In some embodiments, the instructions may also perform the following steps:

In some embodiments, the reference fingerprint and the other candidate fingerprints in the candidate fingerprint set are all represented by hash sequences; the instructions may also perform the following steps:

The storage medium may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.

Because the instructions stored in the storage medium can execute the steps of any audio recognition method provided in the embodiments of this application, they can achieve the beneficial effects of any such method. For details, see the preceding embodiments; they are not repeated here.

The audio recognition method, apparatus, device, and storage medium provided in the embodiments of this application have been described in detail above. Specific examples are used herein to explain the principles and implementations of this application, and the description of the above embodiments is intended only to help in understanding the method of this application and its core idea. At the same time, those skilled in the art may, based on the idea of this application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting this application.

Claims

An audio recognition method, comprising:

extracting an audio fingerprint of audio to be recognized as a standard fingerprint, and calculating the similarity between the standard fingerprint and audio fingerprints in a preset fingerprint library;

screening out a candidate fingerprint set from the fingerprint library according to the similarity between the standard fingerprint and the audio fingerprints in the fingerprint library;

selecting a reference fingerprint from the candidate fingerprint set, and obtaining same-audio fingerprints of the reference fingerprint; and

selecting, from the audio corresponding to the reference fingerprint and its same-audio fingerprints, target audio corresponding to the audio to be recognized.
The method according to claim 1, wherein obtaining the same-audio fingerprints of the reference fingerprint comprises:

calculating the degree of coincidence between the reference fingerprint and the other candidate fingerprints in the candidate fingerprint set; and

selecting, according to the degree of coincidence, the same-audio fingerprints of the reference fingerprint from the other candidate fingerprints.
The method according to claim 2, wherein calculating the degree of coincidence between the reference fingerprint and the other candidate fingerprints in the candidate fingerprint set comprises:

obtaining the longest common subsequence of the reference fingerprint and each other candidate fingerprint in the candidate fingerprint set, and counting the length of the longest common subsequence; and

calculating, according to the length of the longest common subsequence, the degree of coincidence between the reference fingerprint and the other candidate fingerprint.
The method according to claim 2, wherein selecting, according to the degree of coincidence, the same-audio fingerprints of the reference fingerprint from the other candidate fingerprints comprises:

screening out, from the other candidate fingerprints, the candidate fingerprints whose degree of coincidence with the reference fingerprint is greater than or equal to a preset threshold, as the same-audio fingerprints of the reference fingerprint.
The method according to claim 4, wherein the method further comprises:

if no candidate fingerprint whose degree of coincidence with the reference fingerprint is greater than or equal to the preset threshold is found, determining the audio corresponding to the reference fingerprint as the target audio corresponding to the audio to be recognized.
The method according to claim 1, wherein selecting a reference fingerprint from the candidate fingerprint set comprises:

determining the candidate fingerprint in the candidate fingerprint set that has the largest similarity to the standard fingerprint as the reference fingerprint.
The method according to claim 1, wherein calculating the similarity between the standard fingerprint and the audio fingerprints in the preset fingerprint library comprises:

counting, for each audio fingerprint in the preset fingerprint library, the number of identical hash values that it shares with the standard fingerprint; and

calculating, according to the number of identical hash values, the similarity between the standard fingerprint and each audio fingerprint in the fingerprint library.
The method according to claim 1, wherein selecting, from the audio corresponding to the reference fingerprint and its same-audio fingerprints, the target audio corresponding to the audio to be recognized comprises:

taking the audio corresponding to the reference fingerprint and its same-audio fingerprints as same-audio audio, and obtaining version information of the same-audio audio;

determining the version priority of the same-audio audio according to the version information; and

taking the same-audio audio with the highest version priority as the target audio corresponding to the audio to be recognized.
The method according to claim 3, wherein the reference fingerprint and the other candidate fingerprints in the candidate fingerprint set are all represented by hash sequences; and

obtaining the longest common subsequence of the reference fingerprint and each other candidate fingerprint in the candidate fingerprint set comprises: using dynamic programming to calculate the length of the longest common subsequence of the hash sequences of the reference fingerprint and the other candidate fingerprints.
An audio recognition apparatus, comprising:

a fingerprint unit, configured to extract an audio fingerprint of audio to be recognized as a standard fingerprint, and calculate the similarity between the standard fingerprint and audio fingerprints in a preset fingerprint library;

a candidate unit, configured to screen out a candidate fingerprint set from the fingerprint library according to the similarity between the standard fingerprint and the audio fingerprints in the fingerprint library;

a same-audio unit, configured to select a reference fingerprint from the candidate fingerprint set and obtain same-audio fingerprints of the reference fingerprint; and

an audio unit, configured to select, from the audio corresponding to the reference fingerprint and its same-audio fingerprints, target audio corresponding to the audio to be recognized.
The apparatus according to claim 10, wherein the homophonic unit is configured to: calculate the degree of coincidence between the reference fingerprint and the other candidate fingerprints in the candidate fingerprint set; and select, from the other candidate fingerprints, the homophonic fingerprints of the reference fingerprint according to the degree of coincidence.
The apparatus according to claim 11, wherein the homophonic unit is configured to: obtain the longest common subsequence of the reference fingerprint and the other candidate fingerprints in the candidate fingerprint set, and count the length of the longest common subsequence; and calculate the degree of coincidence between the reference fingerprint and the other candidate fingerprints according to the length of the longest common subsequence.
The apparatus according to claim 11, wherein the homophonic unit is configured to: screen out, from the other candidate fingerprints, the candidate fingerprints whose degree of coincidence with the reference fingerprint is greater than or equal to a preset threshold, as the homophonic fingerprints of the reference fingerprint.
The apparatus according to claim 14, wherein the audio unit is further configured to: if no candidate fingerprint whose degree of coincidence with the reference fingerprint is greater than or equal to the preset threshold is found, determine the audio corresponding to the reference fingerprint as the target audio corresponding to the audio to be recognized.
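The coincidence calculation, threshold screening and fallback described for the homophonic unit and the audio unit could be sketched as below. Two points are assumptions rather than claim language: the degree of coincidence is taken here as the longest-common-subsequence length divided by the length of the shorter hash sequence, and the default threshold of 0.8 is an arbitrary placeholder. All function names are hypothetical.

```python
def lcs_length(a, b):
    """Longest-common-subsequence length via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y
                        else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def coincidence(ref, cand):
    """Assumed normalisation: LCS length over the shorter sequence length."""
    shortest = min(len(ref), len(cand))
    return lcs_length(ref, cand) / shortest if shortest else 0.0


def homophonic_fingerprints(ref_id, candidates, threshold=0.8):
    """IDs of the other candidates whose coincidence meets the threshold."""
    ref = candidates[ref_id]
    return [cid for cid, seq in candidates.items()
            if cid != ref_id and coincidence(ref, seq) >= threshold]


def fingerprints_for_selection(ref_id, candidates, threshold=0.8):
    """Reference first, then its homophonic fingerprints.

    When no candidate meets the threshold the list holds only the
    reference, so its audio becomes the target audio by default.
    """
    return [ref_id] + homophonic_fingerprints(ref_id, candidates, threshold)
```

Returning the reference as the first element keeps the fallback case free of special handling: a one-element list means the reference fingerprint's audio is the only option.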
The apparatus according to claim 10, wherein the homophonic unit is configured to: determine, as the reference fingerprint, the candidate fingerprint in the candidate fingerprint set that has the largest similarity value with respect to the standard fingerprint.
The apparatus according to claim 10, wherein the fingerprint unit is configured to: count, for each audio fingerprint in the preset fingerprint library, the number of identical hash values that the standard fingerprint and that audio fingerprint both contain; and calculate the similarity between the standard fingerprint and each audio fingerprint in the fingerprint library according to the number of identical hash values.
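The hash-counting similarity and the choice of the reference fingerprint could look roughly like this. It is a sketch under stated assumptions: identical hash values are counted as a multiset intersection, and the count is normalised by the length of the standard fingerprint. The claims only say that similarity is calculated from the count, so both choices, along with every name below, are hypothetical.

```python
from collections import Counter


def shared_hash_count(standard, stored):
    """Number of hash values common to both fingerprints (with multiplicity)."""
    return sum((Counter(standard) & Counter(stored)).values())


def similarity(standard, stored):
    """Assumed normalisation: shared count over the standard fingerprint length."""
    if not standard:
        return 0.0
    return shared_hash_count(standard, stored) / len(standard)


def pick_reference(standard, candidate_set):
    """ID of the candidate fingerprint with the largest similarity value."""
    return max(candidate_set,
               key=lambda fid: similarity(standard, candidate_set[fid]))
```

Here `candidate_set` is assumed to map a fingerprint ID to its hash sequence; a production system would more likely look hashes up through an inverted index than compare against every stored fingerprint.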
The apparatus according to claim 10, wherein the audio unit is configured to: take the audio corresponding to the reference fingerprint and its homophonic fingerprints as homophonic audio, and acquire version information of the homophonic audio; determine a version priority of the homophonic audio according to the version information; and take the homophonic audio with the highest version priority as the target audio corresponding to the audio to be recognized.
The apparatus according to claim 12, wherein the reference fingerprint and the other candidate fingerprints in the candidate fingerprint set are each represented by a hash sequence; and

the homophonic unit is configured to: use dynamic programming to calculate the length of the longest common subsequence of the hash sequence of the reference fingerprint and the hash sequence of each of the other candidate fingerprints.
An audio recognition device, comprising: a memory, a processor, and an audio recognition program stored in the memory and executable on the processor, wherein the audio recognition program, when executed by the processor, implements the steps of the method according to any one of claims 1 to 9.
A storage medium, wherein the storage medium stores a plurality of instructions, and the instructions are adapted to be loaded by a processor to perform the steps of the audio recognition method according to any one of claims 1 to 9.