CN117219110A - Speaker separation method suitable for recording tablet - Google Patents


Info

Publication number
CN117219110A
CN117219110A
Authority
CN
China
Prior art keywords
speaker
voice
text
clustering
separation method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202311262137.XA
Other languages
Chinese (zh)
Inventor
张波 (Zhang Bo)
罗卓伟 (Luo Zhuowei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Light Vector Intelligent Technology Co ltd
Original Assignee
Suzhou Light Vector Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Light Vector Intelligent Technology Co ltd
Priority to CN202311262137.XA
Publication of CN117219110A
Legal status: Withdrawn

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the technical field of artificial intelligence, and in particular to a speaker separation method suitable for a recording tablet. The method comprises the steps of extracting voiceprints and estimating the number of speakers, clustering, splitting the recording into single-speaker and overlapped speech, converting the speech to text, correcting the text with an error-correction step, and recursing over the overlapped speech content until a text meeting the accuracy requirement is obtained.

Description

Speaker separation method suitable for recording tablet
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a speaker separation method suitable for recording tablets.
Background
Recording tablets are increasingly used in offline sales scenarios, where recording the conversation preserves the speech of the sales process.
In recent years, as AI technology has gradually been applied to recording tablets, intelligent analysis of the sales-process audio they capture has become possible, helping sales staff improve their order conversion rate. The audio of an offline sales process contains salespeople, customers, background noise and so on. Especially when salespeople and customers interact one-to-many or many-to-many, separating the audio of salespeople and customers, so as to extract the customers' questions, comments, objections and other information, is of great significance for improving sales service quality, sales performance and product design.
For these reasons, the application proposes a speaker separation method suitable for recording tablets, which achieves speaker separation from a practical standpoint, accurately separates each customer's speech, and in particular handles customer speech separation when speakers overlap, broadening the method's application prospects.
Disclosure of Invention
The application aims to overcome the defects of the prior art by providing a speaker separation method suitable for a recording tablet, which achieves speaker separation from a practical standpoint, accurately separates each customer's speech, and in particular handles customer speech separation when speakers overlap, giving the method broad application prospects.
To achieve the above object, the application provides a speaker separation method suitable for a recording tablet, comprising the following steps:
S1, dividing the recording captured by the recording tablet into segments of fixed duration, and extracting the voiceprint features of each segment with a pre-trained voiceprint model;
S2, clustering the voiceprints extracted in S1 with a clustering algorithm, determining from the clusters the number of speakers contained in the recording, and extracting each speaker's voiceprint from that speaker's speech segments;
S3, according to the speaker count and voiceprint clustering result from S2, finely segmenting the recording into single-speaker speech and overlapped multi-speaker speech; converting the single-speaker speech into text with automatic speech recognition (ASR), and extracting the sales information contained in that text;
S4, separating the overlapped speech from S3 into individual speakers, using the speaker count determined by the clustering in S2 together with a blind source separation algorithm;
S5, converting the multi-speaker speech separated in S4 into text with the ASR algorithm, and extracting the speakers' sales information contained in the text;
S6, using the text error-correction capability of semantic understanding on the text from S5, determining the error rate with the aid of industry background knowledge, feeding the result back to S4, and further refining the blind source separation and correcting the text;
S7, recursively executing S4 to S6 until the parsing error rate falls below a threshold.
The segmentation in S1 uses 2-second segments.
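As a concrete sketch of the fixed-duration segmentation, assuming a mono PCM signal held in a NumPy array (the function and variable names below are illustrative, not from the patent):

```python
import numpy as np

def split_fixed(signal: np.ndarray, sample_rate: int, seg_seconds: float = 2.0):
    """Split a mono signal into equal-length segments.

    The trailing partial segment, if any, is dropped for simplicity.
    """
    seg_len = int(sample_rate * seg_seconds)
    n_segments = len(signal) // seg_len
    return signal[: n_segments * seg_len].reshape(n_segments, seg_len)

# 10 s of audio at 16 kHz -> five 2 s segments of 32000 samples each
audio = np.zeros(16000 * 10)
segments = split_fixed(audio, sample_rate=16000)
print(segments.shape)  # (5, 32000)
```

Each row of `segments` is then fed independently to the voiceprint model of S1.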
The pre-trained voiceprint model in S1 is based on mel-frequency cepstral coefficients (MFCC), linear predictive coding (LPCC), a convolutional neural network (CNN) or a recurrent neural network (RNN).
The clustering algorithm in S2 is the HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) clustering algorithm.
The speakers' sales information in S5 is contained in the overlapped speech.
Blind source separation (BSS) in S4 includes principal component analysis (PCA) or independent component analysis (ICA).
Compared with the prior art, the method analyzes the number of speakers in the recording with a pre-trained model and a clustering algorithm, separates and distinguishes the speakers with blind source separation, and performs conversion and error correction with a speech recognition algorithm until the error falls below a threshold, achieving efficient and accurate extraction of customer speech and supporting follow-up sales service.
Drawings
FIG. 1 is a schematic flow chart of the method of the present application.
Description of the embodiments
The application will now be further described with reference to the accompanying drawings.
Referring to fig. 1, the application provides a speaker separation method suitable for a recording tablet, comprising the following steps:
step (1): the voice recorded by the recording industry board is subjected to equal-time length segmentation (such as a 2S section), and the voiceprint characteristics of each section of voice are extracted by using a pre-trained voiceprint recognition model (such as a Mel Frequency Cepstrum Coefficient (MFCC), a linear predictive coding (LPCC), a Convolutional Neural Network (CNN), a cyclic neural network (RNN) and the like). Voiceprint recognition includes two processes, feature extraction and matching recognition, where only feature extraction is used.
Step (2): The features extracted from each segment in step (1) are cluster-analyzed with a clustering algorithm. The HDBSCAN clustering algorithm is adopted here; other clustering algorithms may be used in practice. From the clustering result, the number of speakers contained in the recording can be determined, and each speaker's voiceprint can be extracted from that speaker's speech segments.
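HDBSCAN itself requires a third-party package; to keep the illustration self-contained, the sketch below substitutes a much simpler greedy cosine-threshold clustering (my assumption, not the patent's algorithm) just to show how the speaker count falls out of clustering the per-segment voiceprints:

```python
import numpy as np

def greedy_cluster(voiceprints: np.ndarray, sim_threshold: float = 0.9):
    """Greedy stand-in for HDBSCAN: assign each unit-norm voiceprint to the
    nearest existing cluster if its centroid similarity exceeds the
    threshold, otherwise open a new cluster.  Returns per-segment labels
    and the inferred number of speakers."""
    centroids, labels = [], []
    for v in voiceprints:
        sims = [float(v @ c / (np.linalg.norm(c) + 1e-12)) for c in centroids]
        if sims and max(sims) >= sim_threshold:
            k = int(np.argmax(sims))
            centroids[k] = centroids[k] + v   # grow the (unnormalised) centroid
        else:
            k = len(centroids)
            centroids.append(v.copy())
        labels.append(k)
    return np.array(labels), len(centroids)

# two well-separated synthetic "speakers"
a = np.array([1.0, 0.0]); b = np.array([0.0, 1.0])
prints = np.stack([a, a, b, a, b])
labels, n_speakers = greedy_cluster(prints)
print(n_speakers)   # 2
```

HDBSCAN additionally handles noise points (background-noise segments get label -1), which this toy version does not model.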
Step (3): Using the speaker count and voiceprints extracted in step (2), the recording is finely segmented into single-speaker speech and overlapped multi-speaker speech. The single-speaker speech is converted directly into text with speech recognition (ASR), and the sales information contained in the text is extracted. A large amount of sales information is also contained in the overlapped speech segments.
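The patent does not spell out how a segment is judged single-speaker versus overlapped. One plausible rule (an assumption for illustration only): a segment whose voiceprint is strongly similar to exactly one speaker centroid is single-speaker, while a segment similar to two or more centroids is treated as overlapped.

```python
import numpy as np

def split_single_vs_overlapped(voiceprints, centroids, sim_threshold=0.6):
    """Heuristic (not from the patent): count how many speaker centroids a
    segment resembles; more than one resemblance marks it as overlapped."""
    single, overlapped = [], []
    norm_c = [c / (np.linalg.norm(c) + 1e-12) for c in centroids]
    for i, v in enumerate(voiceprints):
        v = v / (np.linalg.norm(v) + 1e-12)
        hits = sum(float(v @ c) >= sim_threshold for c in norm_c)
        (single if hits <= 1 else overlapped).append(i)
    return single, overlapped

a = np.array([1.0, 0.0]); b = np.array([0.0, 1.0])
mixed = (a + b) / np.sqrt(2)        # resembles both speakers
single, overlapped = split_single_vs_overlapped([a, b, mixed], [a, b])
print(single, overlapped)           # [0, 1] [2]
```

Single-speaker indices go straight to ASR; overlapped indices proceed to the blind source separation of step (4).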
Step (4): Using the speaker count determined in step (2), the overlapped speech is separated into multiple voices with a blind source separation algorithm (e.g., PCA, ICA, etc.). Because not everyone speaks at the same time, and because of background noise, the separated multi-speaker speech contains many errors.
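A minimal symmetric FastICA in NumPy (tanh nonlinearity) sketches the ICA branch of the blind source separation; a production system would use a tested library implementation rather than this hand-rolled version:

```python
import numpy as np

def fastica(X, n_iter=200, seed=0):
    """Minimal symmetric FastICA with a tanh nonlinearity.
    X: (n_mixtures, n_samples) array of zero-mean mixed signals.
    Returns the recovered sources (up to permutation and scaling)."""
    X = X - X.mean(axis=1, keepdims=True)
    # whiten the mixtures
    d, E = np.linalg.eigh(np.cov(X))
    Z = (E @ np.diag(d ** -0.5) @ E.T) @ X
    n = Z.shape[0]
    W = np.linalg.qr(np.random.default_rng(seed).normal(size=(n, n)))[0]
    for _ in range(n_iter):
        G = np.tanh(W @ Z)
        W_new = (G @ Z.T) / Z.shape[1] - np.diag((1 - G ** 2).mean(axis=1)) @ W
        U, _, Vt = np.linalg.svd(W_new)   # symmetric decorrelation
        W = U @ Vt
    return W @ Z

# two toy "speakers": a sine and a sawtooth, mixed by a fixed matrix
t = np.linspace(0, 8, 4000)
S = np.stack([np.sin(2 * np.pi * 5 * t), 2 * (t % 1) - 1])
X = np.array([[0.6, 0.4], [0.35, 0.65]]) @ S
recovered = fastica(X)
```

Real overlapped speech from a single microphone is harder than this toy two-mixture case (ICA classically needs at least as many channels as sources), which is one reason the patent follows separation with the text-level error-correction feedback of step (6).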
Step (5): The multi-speaker speech separated in step (4) is converted into text using speech recognition (ASR), and the speakers' sales information contained in the text is extracted.
Step (6): Based on the speaker text obtained in step (5), the text is corrected during semantic understanding, the error rate of the final text parsing is determined with the aid of industry background knowledge, and the result is fed back to step (4) to further refine the blind source separation (BSS) result.
Step (7): Steps (4) to (6) are executed recursively until the text parsing error rate falls below the threshold.
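The recursive (4)-(6) feedback can be sketched as a generic refine-until-threshold driver. The error estimator and refinement step below are stubs (assumptions standing in for the semantic error-rate estimate and the re-run of BSS plus ASR):

```python
def refine_until(state, estimate_error, refine, threshold, max_rounds=20):
    """Drive the step (4)-(6) loop: keep re-separating and re-transcribing
    (refine) until the estimated text-parsing error rate drops below the
    threshold, with a round cap as a safety net."""
    rounds = 0
    while estimate_error(state) >= threshold and rounds < max_rounds:
        state = refine(state)
        rounds += 1
    return state, rounds

# stub behaviour: each refinement round halves a synthetic error rate
final, rounds = refine_until(
    state=1.0,                   # stand-in for (separated audio, text)
    estimate_error=lambda s: s,  # stand-in for the semantic error-rate estimate
    refine=lambda s: s / 2,      # stand-in for re-running BSS + ASR
    threshold=0.1,
)
print(final, rounds)   # 0.0625 4
```

The `max_rounds` cap is my addition: the patent only states the threshold condition, but a cap guards against recordings where blind source separation never reaches the required error rate.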
The above is only a preferred embodiment of the present application, intended only to help understand its method and core idea; the scope of the application is not limited to the above example, and all technical solutions within the concept of the application fall within its scope. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the application, and such modifications also fall within its scope.
The application solves a problem in the prior art: in a complex conversational environment, the audio of salespeople and customers could not be separated, so customers' questions, comments, objections and other information could not be extracted, hindering improvement of subsequent sales service. The application analyzes the number of speakers with a pre-trained model and a clustering algorithm, separates and distinguishes speakers with blind source separation, and performs conversion and error correction with a speech recognition algorithm until the error falls below a threshold, achieving efficient and accurate extraction of customer speech and supporting follow-up sales service.

Claims (6)

1. A speaker separation method suitable for a recording tablet, characterized by comprising the following steps:
S1, dividing the recording captured by the recording tablet into segments of fixed duration, and extracting the voiceprint features of each segment with a pre-trained voiceprint model;
S2, clustering the voiceprints extracted in S1 with a clustering algorithm, determining from the clusters the number of speakers contained in the recording, and extracting each speaker's voiceprint from that speaker's speech segments;
S3, according to the speaker count and voiceprint clustering result from S2, finely segmenting the recording into single-speaker speech and overlapped multi-speaker speech; converting the single-speaker speech into text with automatic speech recognition (ASR), and extracting the sales information contained in that text;
S4, separating the overlapped speech from S3 into individual speakers, using the speaker count determined by the clustering in S2 together with a blind source separation algorithm;
S5, converting the multi-speaker speech separated in S4 into text with the ASR algorithm, and extracting the speakers' sales information contained in the text;
S6, using the text error-correction capability of semantic understanding on the text from S5, determining the error rate with the aid of industry background knowledge, feeding the result back to S4, and further refining the blind source separation and correcting the text;
S7, recursively executing S4 to S6 until the parsing error rate falls below a threshold.
2. The speaker separation method for a recording tablet according to claim 1, wherein the segmentation in S1 uses 2-second segments.
3. The speaker separation method for a recording tablet according to claim 1, wherein the pre-trained voiceprint model in S1 is based on mel-frequency cepstral coefficients (MFCC), linear predictive coding (LPCC), a convolutional neural network (CNN) or a recurrent neural network (RNN).
4. The speaker separation method for a recording tablet according to claim 1, wherein the clustering algorithm in S2 is the HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) clustering algorithm.
5. The speaker separation method for a recording tablet according to claim 1, wherein the speakers' sales information in S5 is contained in the overlapped speech.
6. The speaker separation method for a recording tablet according to claim 1, wherein the blind source separation (BSS) in S4 includes principal component analysis (PCA) or independent component analysis (ICA).
CN202311262137.XA 2023-09-27 2023-09-27 Speaker separation method suitable for recording tablet Withdrawn CN117219110A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311262137.XA CN117219110A (en) 2023-09-27 2023-09-27 Speaker separation method suitable for recording tablet

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311262137.XA CN117219110A (en) 2023-09-27 2023-09-27 Speaker separation method suitable for recording tablet

Publications (1)

Publication Number Publication Date
CN117219110A 2023-12-12

Family

ID=89046058

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311262137.XA Withdrawn CN117219110A (en) 2023-09-27 2023-09-27 Speaker separation method suitable for recording tablet

Country Status (1)

Country Link
CN (1) CN117219110A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20231212