CN117219110A - Speaker separation method suitable for recording tablet - Google Patents


Info

Publication number
CN117219110A
CN117219110A
Authority
CN
China
Prior art keywords
speaker
voice
text
clustering
separation method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202311262137.XA
Other languages
Chinese (zh)
Inventor
张波 (Zhang Bo)
罗卓伟 (Luo Zhuowei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Light Vector Intelligent Technology Co ltd
Original Assignee
Suzhou Light Vector Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Light Vector Intelligent Technology Co ltd
Priority to CN202311262137.XA
Publication of CN117219110A
Legal status: Withdrawn

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the technical field of artificial intelligence, and in particular to a speaker separation method suitable for a recording tablet. The method comprises the steps of extracting voiceprints and estimating the number of speakers, clustering, splitting the recording into single-speaker and overlapped speech, converting the speech to text, correcting the text with an error-correction step, and recursing over the overlapped speech content until a text meeting the accuracy requirement is obtained.

Description

Speaker separation method suitable for recording tablet
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a speaker separation method suitable for recording tablets.
Background
Recording tablets are increasingly used in offline sales scenarios, where recording the conversation preserves the speech of the sales process.
In recent years, as AI technology has gradually been applied to recording tablets, intelligent analysis of the sales-process audio they capture has become possible, helping sales staff improve their order conversion rate. The audio of an offline sales process contains salespeople, customers, background noise and so on. Especially when salespeople and customers interact one-to-many or many-to-many, separating the audio of salespeople and customers, so as to extract the customers' questions, comments, objections and other information, is of great significance for improving sales service quality, sales performance and product design.
For these reasons, the application proposes a speaker separation method suitable for recording tablets, which achieves speaker separation from a practical standpoint, accurately separates each customer's speech, and in particular handles customer speech separation when speakers overlap, broadening the method's application prospects.
Disclosure of Invention
The application aims to overcome the defects of the prior art by providing a speaker separation method suitable for a recording tablet, which achieves speaker separation from a practical standpoint, accurately separates each customer's speech, and in particular handles customer speech separation when speakers overlap, giving the method broad application prospects.
To achieve the above object, the application provides a speaker separation method suitable for a recording tablet, comprising the following steps:
S1, dividing the recording captured by the recording tablet into segments of fixed duration, and extracting the voiceprint features of each segment with a pre-trained voiceprint model;
S2, clustering the voiceprints extracted in S1 with a clustering algorithm, determining from the clusters the number of speakers contained in the recording, and extracting each speaker's voiceprint from that speaker's speech segments;
S3, according to the speaker count and voiceprint clustering result from S2, finely segmenting the recording into single-speaker speech and overlapped multi-speaker speech; converting the single-speaker speech into text with automatic speech recognition (ASR), and extracting the sales information contained in that text;
S4, separating the overlapped speech from S3 into individual speakers, using the speaker count determined by the clustering in S2 together with a blind source separation algorithm;
S5, converting the multi-speaker speech separated in S4 into text with the ASR algorithm, and extracting the speakers' sales information contained in the text;
S6, using the text error-correction capability of semantic understanding on the text from S5, determining the error rate with the aid of industry background knowledge, feeding the result back to S4, and further refining the blind source separation and correcting the text;
S7, recursively executing S4 to S6 until the parsing error rate falls below a threshold.
The segmentation in S1 uses 2-second segments.
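As a concrete sketch of the fixed-duration segmentation, assuming a mono PCM signal held in a NumPy array (the function and variable names below are illustrative, not from the patent):

```python
import numpy as np

def split_fixed(signal: np.ndarray, sample_rate: int, seg_seconds: float = 2.0):
    """Split a mono signal into equal-length segments.

    The trailing partial segment, if any, is dropped for simplicity.
    """
    seg_len = int(sample_rate * seg_seconds)
    n_segments = len(signal) // seg_len
    return signal[: n_segments * seg_len].reshape(n_segments, seg_len)

# 10 s of audio at 16 kHz -> five 2 s segments of 32000 samples each
audio = np.zeros(16000 * 10)
segments = split_fixed(audio, sample_rate=16000)
print(segments.shape)  # (5, 32000)
```

Each row of `segments` is then fed independently to the voiceprint model of S1.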
The pre-trained voiceprint model in S1 is based on mel-frequency cepstral coefficients (MFCC), linear predictive coding (LPCC), a convolutional neural network (CNN) or a recurrent neural network (RNN).
The clustering algorithm in S2 is the HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) clustering algorithm.
The speakers' sales information in S5 is contained in the overlapped speech.
Blind source separation (BSS) in S4 includes principal component analysis (PCA) or independent component analysis (ICA).
Compared with the prior art, the method analyzes the number of speakers in the recording with a pre-trained model and a clustering algorithm, separates and distinguishes the speakers with blind source separation, and performs conversion and error correction with a speech recognition algorithm until the error falls below a threshold, achieving efficient and accurate extraction of customer speech and supporting follow-up sales service.
Drawings
FIG. 1 is a schematic flow chart of the method of the present application.
Description of the embodiments
The application will now be further described with reference to the accompanying drawings.
Referring to fig. 1, the application provides a speaker separation method suitable for a recording tablet, comprising the following steps:
step (1): the voice recorded by the recording industry board is subjected to equal-time length segmentation (such as a 2S section), and the voiceprint characteristics of each section of voice are extracted by using a pre-trained voiceprint recognition model (such as a Mel Frequency Cepstrum Coefficient (MFCC), a linear predictive coding (LPCC), a Convolutional Neural Network (CNN), a cyclic neural network (RNN) and the like). Voiceprint recognition includes two processes, feature extraction and matching recognition, where only feature extraction is used.
Step (2): The features extracted from each segment in step (1) are cluster-analyzed with a clustering algorithm. The HDBSCAN clustering algorithm is adopted here; other clustering algorithms may be used in practice. From the clustering result, the number of speakers contained in the recording can be determined, and each speaker's voiceprint can be extracted from that speaker's speech segments.
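HDBSCAN itself requires a third-party package; to keep the illustration self-contained, the sketch below substitutes a much simpler greedy cosine-threshold clustering (my assumption, not the patent's algorithm) just to show how the speaker count falls out of clustering the per-segment voiceprints:

```python
import numpy as np

def greedy_cluster(voiceprints: np.ndarray, sim_threshold: float = 0.9):
    """Greedy stand-in for HDBSCAN: assign each unit-norm voiceprint to the
    nearest existing cluster if its centroid similarity exceeds the
    threshold, otherwise open a new cluster.  Returns per-segment labels
    and the inferred number of speakers."""
    centroids, labels = [], []
    for v in voiceprints:
        sims = [float(v @ c / (np.linalg.norm(c) + 1e-12)) for c in centroids]
        if sims and max(sims) >= sim_threshold:
            k = int(np.argmax(sims))
            centroids[k] = centroids[k] + v   # grow the (unnormalised) centroid
        else:
            k = len(centroids)
            centroids.append(v.copy())
        labels.append(k)
    return np.array(labels), len(centroids)

# two well-separated synthetic "speakers"
a = np.array([1.0, 0.0]); b = np.array([0.0, 1.0])
prints = np.stack([a, a, b, a, b])
labels, n_speakers = greedy_cluster(prints)
print(n_speakers)   # 2
```

HDBSCAN additionally handles noise points (background-noise segments get label -1), which this toy version does not model.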
Step (3): Using the speaker count and voiceprints extracted in step (2), the recording is finely segmented into single-speaker speech and overlapped multi-speaker speech. The single-speaker speech is converted directly into text with speech recognition (ASR), and the sales information contained in the text is extracted. A large amount of sales information is also contained in the overlapped speech segments.
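The patent does not spell out how a segment is judged single-speaker versus overlapped. One plausible rule (an assumption for illustration only): a segment whose voiceprint is strongly similar to exactly one speaker centroid is single-speaker, while a segment similar to two or more centroids is treated as overlapped.

```python
import numpy as np

def split_single_vs_overlapped(voiceprints, centroids, sim_threshold=0.6):
    """Heuristic (not from the patent): count how many speaker centroids a
    segment resembles; more than one resemblance marks it as overlapped."""
    single, overlapped = [], []
    norm_c = [c / (np.linalg.norm(c) + 1e-12) for c in centroids]
    for i, v in enumerate(voiceprints):
        v = v / (np.linalg.norm(v) + 1e-12)
        hits = sum(float(v @ c) >= sim_threshold for c in norm_c)
        (single if hits <= 1 else overlapped).append(i)
    return single, overlapped

a = np.array([1.0, 0.0]); b = np.array([0.0, 1.0])
mixed = (a + b) / np.sqrt(2)        # resembles both speakers
single, overlapped = split_single_vs_overlapped([a, b, mixed], [a, b])
print(single, overlapped)           # [0, 1] [2]
```

Single-speaker indices go straight to ASR; overlapped indices proceed to the blind source separation of step (4).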
Step (4): Using the speaker count determined in step (2), the overlapped speech is separated into multiple voices with a blind source separation algorithm (e.g., PCA, ICA, etc.). Because not everyone speaks at the same time, and because of background noise, the separated multi-speaker speech contains many errors.
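A minimal symmetric FastICA in NumPy (tanh nonlinearity) sketches the ICA branch of the blind source separation; a production system would use a tested library implementation rather than this hand-rolled version:

```python
import numpy as np

def fastica(X, n_iter=200, seed=0):
    """Minimal symmetric FastICA with a tanh nonlinearity.
    X: (n_mixtures, n_samples) array of zero-mean mixed signals.
    Returns the recovered sources (up to permutation and scaling)."""
    X = X - X.mean(axis=1, keepdims=True)
    # whiten the mixtures
    d, E = np.linalg.eigh(np.cov(X))
    Z = (E @ np.diag(d ** -0.5) @ E.T) @ X
    n = Z.shape[0]
    W = np.linalg.qr(np.random.default_rng(seed).normal(size=(n, n)))[0]
    for _ in range(n_iter):
        G = np.tanh(W @ Z)
        W_new = (G @ Z.T) / Z.shape[1] - np.diag((1 - G ** 2).mean(axis=1)) @ W
        U, _, Vt = np.linalg.svd(W_new)   # symmetric decorrelation
        W = U @ Vt
    return W @ Z

# two toy "speakers": a sine and a sawtooth, mixed by a fixed matrix
t = np.linspace(0, 8, 4000)
S = np.stack([np.sin(2 * np.pi * 5 * t), 2 * (t % 1) - 1])
X = np.array([[0.6, 0.4], [0.35, 0.65]]) @ S
recovered = fastica(X)
```

Real overlapped speech from a single microphone is harder than this toy two-mixture case (ICA classically needs at least as many channels as sources), which is one reason the patent follows separation with the text-level error-correction feedback of step (6).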
Step (5): The multi-speaker speech separated in step (4) is converted into text using speech recognition (ASR), and the speakers' sales information contained in the text is extracted.
Step (6): Based on the speaker text obtained in step (5), the text is corrected during semantic understanding, the error rate of the final text parsing is determined with the aid of industry background knowledge, and the result is fed back to step (4) to further refine the blind source separation (BSS) result.
Step (7): Steps (4) to (6) are executed recursively until the text parsing error rate falls below the threshold.
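The recursive (4)-(6) feedback can be sketched as a generic refine-until-threshold driver. The error estimator and refinement step below are stubs (assumptions standing in for the semantic error-rate estimate and the re-run of BSS plus ASR):

```python
def refine_until(state, estimate_error, refine, threshold, max_rounds=20):
    """Drive the step (4)-(6) loop: keep re-separating and re-transcribing
    (refine) until the estimated text-parsing error rate drops below the
    threshold, with a round cap as a safety net."""
    rounds = 0
    while estimate_error(state) >= threshold and rounds < max_rounds:
        state = refine(state)
        rounds += 1
    return state, rounds

# stub behaviour: each refinement round halves a synthetic error rate
final, rounds = refine_until(
    state=1.0,                   # stand-in for (separated audio, text)
    estimate_error=lambda s: s,  # stand-in for the semantic error-rate estimate
    refine=lambda s: s / 2,      # stand-in for re-running BSS + ASR
    threshold=0.1,
)
print(final, rounds)   # 0.0625 4
```

The `max_rounds` cap is my addition: the patent only states the threshold condition, but a cap guards against recordings where blind source separation never reaches the required error rate.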
The above is only a preferred embodiment of the present application, intended only to help understand its method and core idea; the scope of the application is not limited to the above example, and all technical solutions within the concept of the application fall within its scope. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the application, and such modifications also fall within its scope.
The application solves a problem in the prior art: in a complex conversational environment, the audio of salespeople and customers could not be separated, so customers' questions, comments, objections and other information could not be extracted, hindering improvement of subsequent sales service. The application analyzes the number of speakers with a pre-trained model and a clustering algorithm, separates and distinguishes speakers with blind source separation, and performs conversion and error correction with a speech recognition algorithm until the error falls below a threshold, achieving efficient and accurate extraction of customer speech and supporting follow-up sales service.

Claims (6)

1. A speaker separation method suitable for a recording tablet, characterized by comprising the following steps:
S1, dividing the recording captured by the recording tablet into segments of fixed duration, and extracting the voiceprint features of each segment with a pre-trained voiceprint model;
S2, clustering the voiceprints extracted in S1 with a clustering algorithm, determining from the clusters the number of speakers contained in the recording, and extracting each speaker's voiceprint from that speaker's speech segments;
S3, according to the speaker count and voiceprint clustering result from S2, finely segmenting the recording into single-speaker speech and overlapped multi-speaker speech; converting the single-speaker speech into text with automatic speech recognition (ASR), and extracting the sales information contained in that text;
S4, separating the overlapped speech from S3 into individual speakers, using the speaker count determined by the clustering in S2 together with a blind source separation algorithm;
S5, converting the multi-speaker speech separated in S4 into text with the ASR algorithm, and extracting the speakers' sales information contained in the text;
S6, using the text error-correction capability of semantic understanding on the text from S5, determining the error rate with the aid of industry background knowledge, feeding the result back to S4, and further refining the blind source separation and correcting the text;
S7, recursively executing S4 to S6 until the parsing error rate falls below a threshold.
2. The speaker separation method for a recording tablet according to claim 1, wherein the segmentation in S1 uses 2-second segments.
3. The speaker separation method for a recording tablet according to claim 1, wherein the pre-trained voiceprint model in S1 is based on mel-frequency cepstral coefficients (MFCC), linear predictive coding (LPCC), a convolutional neural network (CNN) or a recurrent neural network (RNN).
4. The speaker separation method for a recording tablet according to claim 1, wherein the clustering algorithm in S2 is the HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) clustering algorithm.
5. The speaker separation method for a recording tablet according to claim 1, wherein the speakers' sales information in S5 is contained in the overlapped speech.
6. The speaker separation method for a recording tablet according to claim 1, wherein the blind source separation (BSS) in S4 includes principal component analysis (PCA) or independent component analysis (ICA).
CN202311262137.XA 2023-09-27 2023-09-27 Speaker separation method suitable for recording tablet Withdrawn CN117219110A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311262137.XA CN117219110A (en) 2023-09-27 2023-09-27 Speaker separation method suitable for recording tablet

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311262137.XA CN117219110A (en) 2023-09-27 2023-09-27 Speaker separation method suitable for recording tablet

Publications (1)

Publication Number Publication Date
CN117219110A 2023-12-12

Family

ID=89046058

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311262137.XA Withdrawn CN117219110A (en) 2023-09-27 2023-09-27 Speaker separation method suitable for recording tablet

Country Status (1)

Country Link
CN (1) CN117219110A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20231212