WO2020056995A1 - Method and device for determining speech fluency degree, computer apparatus, and readable storage medium - Google Patents

Method and device for determining speech fluency degree, computer apparatus, and readable storage medium Download PDF

Info

Publication number
WO2020056995A1
WO2020056995A1 · PCT/CN2018/124442 · CN2018124442W
Authority
WO
WIPO (PCT)
Prior art keywords
speech
fluency
detected
frame sequence
customer service
Prior art date
Application number
PCT/CN2018/124442
Other languages
French (fr)
Chinese (zh)
Inventor
蔡元哲
程宁
王健宗
肖京
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2020056995A1 publication Critical patent/WO2020056995A1/en

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals

Definitions

  • the present application relates to the technical field of data processing, and in particular, to a method, an apparatus, a computer device, and a readable storage medium for identifying a speech fluency degree.
  • the customer service agent is a position in a company's call center or customer service department that usually provides business consultation or guidance to incoming customers by voice. In this process, the speech fluency of the customer service agent directly affects the incoming customer's impression of the company or enterprise, so the speech fluency of customer service agents is a very important metric for the company or enterprise, and quality inspection of customer service speech is an essential job in the service industry.
  • quality inspection plays a supervisory role over customer service calls; on the other hand, it can also quickly locate problems and improve the quality of customer service.
  • traditional quality inspection has the disadvantages of low efficiency, small coverage, and untimely feedback; the emergence of intelligent quality inspection solves these problems.
  • with speech recognition and natural language processing, customer service speech can be quality-inspected quickly and efficiently.
  • the present application provides a speech fluency recognition method, device, computer equipment, and readable storage medium, so as to quantify and quality-inspect customer service speech fluency by constructing a training model with a deep learning neural network, identifying the speech fluency of customer service staff more accurately and comprehensively.
  • the present application provides a method for recognizing speech fluency.
  • the method includes:
  • when the speech fluency degrees obtained for the continuous speech frame sequences determined in the speech to be detected differ, the lower of those speech fluency degrees is determined as the fluency of the speech to be detected.
  • the present application also relates to a voice fluency recognition device for customer service, the device includes:
  • An input module configured to preprocess the speech to be detected to obtain a continuous speech frame sequence, and input the continuous speech frame sequence into the speech recognition model;
  • a determining module configured to determine a speech fluency degree corresponding to the continuous speech frame sequence according to the speech recognition model;
  • a detection module configured to detect whether the speech fluency degrees obtained for the continuous speech frame sequences in the speech to be detected are the same;
  • a first output module configured to, when the speech fluency degrees obtained for the continuous speech frame sequences in the speech to be detected are the same, determine that speech fluency degree as the fluency of the customer service staff corresponding to the speech to be detected;
  • a second output module configured to, when the speech fluency degrees determined for the continuous speech frame sequences in the speech to be detected differ, determine the lower of those speech fluency degrees as the fluency of the speech to be detected.
  • the present application also relates to a computer device including a memory, a processor, and a computer program stored on the memory and executable on the processor.
  • the processor implements the steps of the speech fluency recognition method when executing the computer program.
  • when the speech fluency degrees obtained for the continuous speech frame sequences determined in the speech to be detected differ, the lower of those speech fluency degrees is determined as the fluency of the speech to be detected.
  • the present application also provides a computer-readable storage medium on which a computer program is stored; the computer program, when executed by a processor, implements the steps of the speech fluency recognition method:
  • when the speech fluency degrees obtained for the continuous speech frame sequences determined in the speech to be detected differ, the lower of those speech fluency degrees is determined as the fluency of the speech to be detected.
  • by constructing a speech recognition model using a deep-learning-based RNN (recurrent neural network) and continuously training it, the present application realizes quality inspection through sequence analysis of the speech to be inspected, quickly recognizing and judging customer service speech.
  • the recognition accuracy can also improve on its own, solving the current problem of manually recognizing and quality-inspecting customer service speech and realizing a smarter, more accurate fluency judgment of customer service speech based on a deep learning neural network.
  • Fig. 1 is a schematic flowchart of a speech fluency recognition method according to an exemplary embodiment.
  • Fig. 2 is a schematic structural diagram of a speech recognition model according to an exemplary embodiment.
  • Fig. 3 is a schematic diagram of a pre-processing flow of a method for recognizing speech fluency according to an exemplary embodiment.
  • Fig. 4 is a schematic diagram showing training and learning of a speech recognition model according to an exemplary embodiment.
  • Fig. 5 is a schematic block diagram of a speech fluency recognition device according to an exemplary embodiment.
  • Fig. 6 is a block diagram showing a computer device according to an exemplary embodiment.
  • the present application relates to a voice fluency recognition method, device, computer equipment, and readable storage medium, which are mainly used in scenarios where quality inspection and fluency judgment of customer service voices are required.
  • the basic idea is: during monitoring, the voice of each customer service agent, or at least a part of it, is obtained, and speech recognition is realized through sequence analysis. Before this, a speech recognition system needs to be constructed.
  • a deep-learning-based RNN (recurrent neural network) is used to build the data model from the original speech and to carry out the learning process on the input training data.
  • a speech frame sequence of any length from the speech to be judged is then input to the deep learning model, and the corresponding speech fluency degree is obtained, achieving a smarter and more accurate fluency judgment of customer service speech based on a deep learning neural network.
  • This embodiment may apply to the case where customer service speech fluency recognition is performed on a smart terminal with a deep learning model. The method may be executed by a deep-learning-model device, which may be implemented in software and/or hardware, and can generally be integrated in a server or the cloud, or controlled by a central control module in the terminal.
  • FIG. 1 is a schematic flowchart of the speech fluency recognition method of the present application. The method includes the following steps:
  • step 110 a speech recognition model is constructed through a sequence-to-sequence deep learning network
  • the core of the speech recognition model is a sequence-to-sequence RNN network; using the long short-term memory capability of RNNs (recurrent neural networks), it can recognize the fluency of speech or speech fragments of any length.
  • FIG. 2 is a schematic diagram of the RNN network structure of the present application. The RNN neural network is implemented with a six-layer encoding-decoding structure, which enables the RNN to process and classify an input sequence of arbitrary length and mainly includes an encoder, a decoder, and a fully connected layer.
  • a speech recognition model is established based on this structure.
  • the encoder consists of three layers: two bidirectional recurrent layers of 128 and 64 neurons, and a unidirectional recurrent layer of 32 neurons.
  • the encoder is set to handle any sequence up to a set maximum length. All recurrent neurons in the encoder are GRUs (gated recurrent units), which have a simpler structure: the degree of dependence on the previous state is determined by the update and reset gates, which handle long-distance dependencies and information from far in the past well.
  • the last layer of the encoder output is a 32-neuron activation layer with fixed parameters, which is used to initialize the decoder.
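The gating behavior described above can be sketched as a single GRU step. This is a minimal NumPy illustration under assumed toy dimensions and random weights, not the patent's actual implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, params):
    """One GRU step: the update gate z and reset gate r control how much
    of the previous hidden state h_prev is kept versus overwritten."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    z = sigmoid(Wz @ x + Uz @ h_prev + bz)               # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev + br)               # reset gate
    h_cand = np.tanh(Wh @ x + Uh @ (r * h_prev) + bh)    # candidate state
    return (1.0 - z) * h_prev + z * h_cand               # new hidden state

# toy sizes: 4-dim input frames, 3-dim hidden state
rng = np.random.default_rng(0)
shapes = [(3, 4), (3, 3), (3,)] * 3                      # W, U, b for z, r, h
params = [0.1 * rng.standard_normal(s) for s in shapes]
h = np.zeros(3)
for x in rng.standard_normal((5, 4)):                    # a 5-step toy sequence
    h = gru_step(x, h, params)
print(h.shape)  # (3,)
```

Because the new state is a gate-weighted mix of the old state and a bounded candidate, information from long ago can pass through many steps largely unchanged when the update gate stays near zero.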
  • Decoder: it consists of a single recurrent layer with 64 long short-term memory (LSTM) units, combined with an attention mechanism.
  • the attention mechanism enables the network to focus on the salient parts of the input features and ultimately improves classification performance.
  • the input features are two or more features drawn from a group including, but not limited to: speech features, phonemes, linguistic features, context features, semantic features, environment features, and scene features.
  • the decoder is set to output a single classification label for each input sequence, namely one of the five speech fluency levels (1-5).
  • Fully connected layer: after the decoder, a fully connected layer with 256 ReLU neurons is set, which maps the learned "distributed feature representation" to the sample label space, combining the multiple learned features to obtain the overall characteristics of speech fluency.
  • the final classification layer uses softmax to output a classification label.
  • the softmax function maps its inputs to values in (0, 1) that can be understood as probabilities; the result with the largest probability is then selected as the classification result (one of the 1-5 speech fluency levels).
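A minimal sketch of this final classification step, using hypothetical logits rather than outputs of the actual model:

```python
import numpy as np

def softmax(logits):
    """Map raw scores to probabilities in (0, 1) that sum to 1."""
    e = np.exp(logits - logits.max())  # subtract max for numerical stability
    return e / e.sum()

# hypothetical scores from the fully connected layer, one per fluency level 1-5
logits = np.array([2.1, 0.3, -1.0, 0.5, 0.2])

probs = softmax(logits)
predicted_level = int(np.argmax(probs)) + 1  # levels are numbered 1-5
print(predicted_level)  # -> 1
```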
  • a database of 2000 customer service voice records may be created first, and the fluency of each customer service voice is manually labeled, with fluency labeled in order from level 1 to level 5.
  • levels 1 to 5 represent very unfluent, unfluent, barely fluent, basically fluent, and very fluent. It can be recognized that the level-1 to level-5 labels may take various other forms and are not limited to the foregoing implementation.
  • step 120 the speech to be detected is pre-processed to obtain a continuous speech frame sequence, and the continuous speech frame sequence is input into the speech recognition model;
  • the recording module of the telephone platform records the conversation between the customer service agent and the customer. Because the telephone platform's recording is two-channel, the agent's voice part can be extracted. Background information, noise floor, and silence appear in the voice information during recording and extraction, so the speech must be preprocessed to obtain relatively pure, denoised speech. This process further ensures the accuracy of the speech source used for speech fluency recognition.
  • low-energy windows are detected and removed.
  • the denoised speech is converted into a sequence of several frequency components per frame, and these sequences and their corresponding labels (one of the 1-5 speech fluency levels) are input into the speech recognition model as the data for training the RNN.
  • step 130 the speech fluency corresponding to the continuous speech frame sequence is determined according to the speech recognition model
  • in the initial stage, the fluency degree can be obtained by matching the speech frame sequence against the manually labeled speech frames. Further, as the speech recognition model continues to learn and deepen, once it has captured an overall feature (such as a pause in the middle of the semantics), a classification label can be assigned directly: for example, speech to be detected that contains such pauses can be given the label level 1, very unfluent, and all subsequent customer service speech exhibiting the same pause phenomenon can quickly be determined to have speech fluency level 1 (very unfluent).
  • step 140 the continuous speech frame sequences in the speech to be detected are checked to determine whether the fluency degrees obtained for each are the same; if the same, step 150 is performed, and if different, step 160 is performed.
  • a voice segment of a voice to be detected may include several consecutive pre-processed voice frame sequences.
  • the recognition of speech fluency includes not only recognition of individual speech frame sequences, but must also rise to the level of a speech segment, multiple consecutive speech segments, or even a whole segment of speech. When recognizing the fluency of multiple speech segments, the recognition and labeling result of a single speech segment cannot reflect the overall fluency level of the customer service staff.
  • step 150 when the speech fluency degrees obtained for the continuous speech frame sequences in the speech to be detected are the same, that speech fluency degree is determined as the fluency of the customer service staff corresponding to the speech segment;
  • step 160 when the speech fluency degrees obtained for the continuous speech frame sequences in the speech to be detected differ, the lower of those speech fluency degrees is determined as the fluency of the speech segment.
  • the impact of a speech segment whose fluency differs from that of the other segments on the fluency of the entire speech can be determined according to a fluency calculation algorithm; different fluency calculation algorithms yield different overall fluency results for the customer service staff.
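The decision rule of steps 150 and 160 reduces to taking the minimum over the per-sequence levels, since identical levels and "take the lower level" both yield the smallest value. A minimal sketch (the function name is illustrative, not from the patent):

```python
def overall_fluency(sequence_levels):
    """Aggregate per-frame-sequence fluency levels (1 = very unfluent,
    5 = very fluent) into a single level for the whole utterance."""
    if not sequence_levels:
        raise ValueError("no fluency levels to aggregate")
    return min(sequence_levels)  # covers both the "same" and "different" cases

print(overall_fluency([4, 4, 4]))  # all identical -> 4
print(overall_fluency([5, 3, 4]))  # levels differ -> lowest, 3
```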
  • the method of the present application constructs a deep learning model, selecting sequence-to-sequence deep learning recurrent neural networks (RNNs) to monitor customer service fluency, and continuously trains on the original voice data and the training voice data, so that after the acquired speech to be judged is preprocessed, a speech frame sequence of any length can be input to the deep learning model to obtain the corresponding speech fluency. This realizes a smarter and more accurate fluency judgment of customer service speech based on a deep learning neural network, further improving the effectiveness of intelligent voice quality inspection of customer service.
  • determining the speech fluency corresponding to the continuous speech frame sequence according to the speech recognition model includes: obtaining features of the input speech frame sequence, such as speech features, phonemes, phonetic features, context features, semantic features, environment features, and scene features; and, combined with the attention mechanism, outputting through the decoder in the speech recognition model a corresponding single label for each input speech frame sequence as its classification label, the decoder being set to output a single classification label for each input sequence, that is, a speech fluency level from 1 to 5.
  • the method further includes a process of learning through fully connected layer mapping, which mainly includes:
  • the meaning of each feature is independent: it does not change regardless of how the features other than it are changed.
  • the obtained distributed feature representation is mapped against the database, so that the speech recognition model learns and captures the content in this distributed feature representation that is used for speech fluency judgment.
  • after obtaining the overall features of the classification label, the method of the present application can judge and evaluate customer service speech faster and more accurately, greatly improving quality inspection efficiency.
  • before constructing the speech recognition model, the method further includes a process of constructing a database, to help build a speech recognition model from category-labeled customer service speech. This process may include the following steps:
  • a database with 2000 customer service records is created; the fluency of each customer service voice is manually labeled in order from level 1 to level 5, where levels 1 to 5 represent very unfluent, unfluent, barely fluent, basically fluent, and very fluent.
  • the manual labeling operation over a large number of customer service voice records makes basic, category-labeled customer service voice data available when constructing the deep learning neural network, making the results obtained in subsequent quality inspection more accurate.
  • a feasible implementation scenario of the exemplary embodiment of the present application further includes a process of preprocessing the acquired customer service voice.
  • the recording module of the telephone platform records the conversation between the customer service agent and the customer. Because the telephone platform's recording is two-channel, the agent's voice part can be extracted; the extracted customer service voice inevitably contains noise generated during transmission in the electronic devices, such as the noise floor. Therefore, as shown in FIG. 3 and in conjunction with the schematic flowchart of speech recognition in FIG. 4, this process may include the following steps:
  • step 310 the speech to be detected is denoised.
  • Irrelevant data generated during the signal transmission can be removed by detecting low-energy windows.
  • the signal conditioning circuit can be designed to amplify the target speech signal and completely eliminate environmental signal interference, achieving denoising.
  • step 320 segment the to-be-detected speech after denoising processing, and each segment includes frame data of a preset frame length;
  • the sound data stream is segmented into frames 4 milliseconds long.
  • step 330 performing sequence conversion on the frame data to obtain the voice frame sequence.
  • each window after denoising is converted into a sequence of 64 frequency components per frame, and these sequences and their corresponding labels (one of the 1-5 speech fluency levels) are used as the data for training the RNN.
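Steps 310-330 might be sketched as below. The 16 kHz sample rate and the FFT-based conversion are assumptions for illustration, as the patent specifies only 4 ms frames and 64 frequency components per frame:

```python
import numpy as np

def speech_to_frame_sequence(samples, sample_rate=16000,
                             frame_ms=4, n_components=64):
    """Split a denoised waveform into 4 ms frames and convert each frame
    into 64 frequency-magnitude components (illustrative FFT front end)."""
    frame_len = int(sample_rate * frame_ms / 1000)        # samples per frame
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    spectrum = np.abs(np.fft.rfft(frames, n=2 * n_components, axis=1))
    return spectrum[:, :n_components]                     # (n_frames, 64)

# one second of a toy 440 Hz tone standing in for denoised speech
t = np.arange(16000) / 16000.0
seq = speech_to_frame_sequence(np.sin(2 * np.pi * 440.0 * t))
print(seq.shape)  # -> (250, 64)
```

The resulting sequence, paired with its manually assigned fluency label, would then form one training example for the RNN.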
  • FIG. 5 is a schematic structural diagram of a speech fluency recognition device according to an embodiment of the present application.
  • the device may be implemented in software and/or hardware, is generally integrated on a server, and can execute the speech fluency recognition method. As shown in FIG. 5, this embodiment, which may be based on the first embodiment, provides a speech fluency recognition device mainly including a building module 510, an input module 520, a determination module 530, a frame detection module 540, a first output module 550, and a second output module 560.
  • the construction module 510 is configured to construct a speech recognition model through a sequence-to-sequence deep learning network
  • the input module 520 is configured to preprocess the speech to be detected to obtain a continuous speech frame sequence, and input the continuous speech frame sequence into the speech recognition model;
  • the determining module 530 is configured to determine a speech fluency degree corresponding to the continuous speech frame sequence according to the speech recognition model;
  • the frame detection module 540 is configured to detect whether the speech fluency obtained by the continuous speech frame sequence in the speech to be detected is the same;
  • the first output module 550 is configured to, when the speech fluency degrees obtained for the continuous speech frame sequences in the speech to be detected are the same, determine that speech fluency degree as the fluency of the customer service staff corresponding to the speech to be detected;
  • the second output module 560 is configured to, when the speech fluency degrees determined for the continuous speech frame sequences in the speech to be detected differ, determine the lower of those speech fluency degrees as the fluency of the speech to be detected.
  • the apparatus further includes:
  • An acquisition module for acquiring customer service voices in several customer service records and creating a voice database
  • a manual labeling module configured to manually label the customer service voices in the several customer service records, setting a classification label for each customer service voice.
  • the input module includes:
  • a conversion submodule is configured to perform sequence conversion on the frame data to obtain the voice frame sequence.
  • the speech fluency recognition device provided in the foregoing embodiment can execute the speech fluency recognition method provided in any embodiment of the present application, and has the corresponding function modules and beneficial effects for executing the method; for details not described here, refer to the foregoing method embodiments.
  • this application also extends to computer programs suitable for putting this application into practice, in particular computer programs on or in a carrier.
  • the program may be in the form of source code, object code, a form intermediate between source and object code such as a partially compiled form, or any other form suitable for use in implementing the method according to the present application.
  • programs may have many different architectural designs. For example, program code that implements the functionality of a method or system according to the present application may be subdivided into one or more subroutines.
  • Subroutines can be stored together in an executable file to form a self-contained program.
  • Such executable files may include computer-executable instructions, such as processor instructions and / or interpreter instructions (eg, Java interpreter instructions).
  • one or more or all of the subroutines may be stored in at least one external library file and linked with the main program either statically or dynamically (e.g., at runtime).
  • the main program contains at least one call to at least one of the subroutines.
  • Subroutines can also include function calls to each other.
  • Embodiments involving a computer program product include computer-executable instructions corresponding to each of the processing steps of at least one of the illustrated methods. These instructions may be subdivided into subroutines and / or stored in one or more files that may be statically or dynamically linked.
  • This embodiment also provides a computer device, such as a smart phone, tablet computer, notebook computer, desktop computer, rack server, blade server, tower server, or cabinet server (a stand-alone server, or a server cluster consisting of multiple servers).
  • the computer device 20 of this embodiment includes, but is not limited to, a memory 21 and a processor 22 that can be communicatively connected to each other through a system bus, as shown in FIG. 6. It should be noted that FIG. 6 only shows the computer device 20 with components 21-22, but it should be understood that it is not required to implement all the illustrated components, and more or fewer components may be implemented instead.
  • the memory 21 (ie, a readable storage medium) includes a flash memory, a hard disk, a multimedia card, a card-type memory (for example, SD or DX memory, etc.), a random access memory (RAM), a static random access memory (SRAM), Read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, etc.
  • the memory 21 may be an internal storage unit of the computer device 20, such as a hard disk or a memory of the computer device 20.
  • the memory 21 may also be an external storage device of the computer device 20, for example, a plug-in hard disk, a smart media card (SMC), and a secure digital (Secure Digital, SD) card, flash card, etc.
  • the memory 21 may also include both the internal storage unit of the computer device 20 and its external storage device.
  • the memory 21 is generally used to store an operating system and various application software installed on the computer device 20, such as program codes of the RNNs neural network of the first embodiment.
  • the memory 21 may also be used to temporarily store various types of data that have been output or are to be output.
  • the processor 22 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or other data processing chip in some embodiments.
  • the processor 22 is generally used to control the overall operation of the computer device 20.
  • the processor 22 is configured to run program code or process data stored in the memory 21, for example, to implement each layer structure of a deep learning model to implement the speech fluency recognition method of the first embodiment.
  • This embodiment also provides a computer-readable storage medium, such as flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, servers, app stores, and so on, on which a computer program is stored that, when executed by a processor, realizes the corresponding function.
  • the computer-readable storage medium of this embodiment is used to store a computer program which, when executed by a processor, implements the speech fluency recognition method of the first embodiment.
  • Another embodiment involving a computer program product includes computer-executable instructions corresponding to each of the devices of at least one of the illustrated systems and / or products. These instructions may be subdivided into subroutines and / or stored in one or more files that may be statically or dynamically linked.
  • the carrier of a computer program may be any entity or device capable of carrying the program.
  • the carrier may contain a storage medium such as (ROM such as CDROM or semiconductor ROM) or magnetic recording medium (such as a floppy disk or hard disk).
  • the carrier may be a transmissible carrier, such as an electrical or optical signal, which may be transmitted via an electrical or optical cable, or by radio or other means.
  • the carrier may be composed of such a cable or device.
  • the carrier may be an integrated circuit having a program embedded therein, said integrated circuit being adapted to perform, or to be used in the performance of, the related method.
  • the different functions discussed herein may be performed in a different order and / or concurrently with each other. Furthermore, if desired, one or more of the functions described above may be optional or may be combined.
  • the steps discussed above are not limited to the order of execution in the embodiments, and different steps may be performed in different orders and / or concurrently with each other.
  • one or more of the steps described above may be optional or may be combined.
  • each module of the device in the embodiments of the present application may be implemented by a general-purpose computing device, and the modules may be concentrated in a single computing device or distributed across a network of computing devices.
  • the device corresponds to the method in the foregoing embodiments and may be implemented by executable program code or by integrated circuits; therefore, the present application is not limited to any specific hardware or software or combination thereof.
  • each module of the device in the embodiments of the present application may likewise be implemented by a general-purpose mobile terminal, and the modules may be concentrated in a single mobile terminal or distributed across a combination of mobile terminals.
  • the device corresponds to the method in the foregoing embodiments and may be implemented by editing executable program code or by means of integrated circuits; therefore, the present application is not limited to any specific hardware or software or combination thereof.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Quality & Reliability (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A method and device for determining a speech fluency degree, a computer apparatus, and a readable storage medium. The method comprises: constructing a speech recognition model; preprocessing speech under test to acquire consecutive speech frame sequences, and inputting the consecutive speech frame sequences into the speech recognition model (120); determining, according to the speech recognition model, the speech fluency degrees corresponding to the consecutive speech frame sequences (130); determining whether the speech fluency degrees determined for the consecutive speech frame sequences obtained from the speech under test are identical (140); if so, determining that speech fluency degree to be the speech fluency degree of the customer service representative corresponding to the speech under test (150); and if not, determining the lowest of the speech fluency degrees to be the speech fluency degree of the speech under test (160). The present application intelligently and accurately determines the speech fluency degree of a customer service representative on the basis of a deep learning neural network.

Description

Speech fluency recognition method and device, computer equipment, and readable storage medium

This application claims priority to the Chinese patent application No. CN 201811093169.0, filed on September 19, 2018 and entitled "Speech fluency recognition method, device, computer equipment and readable storage medium", the entire contents of which are incorporated herein by reference.
Technical field

The present application relates to the field of data processing technology, and in particular to a speech fluency recognition method and device, a computer device, and a readable storage medium.

Background

A customer service agent works in a company's call center or customer service department, typically providing business consultation or guidance to incoming customers by voice. In this process, the agent's speech fluency directly shapes the incoming customer's impression of the company, so the speech fluency of customer service agents is an important indicator for the company, and quality inspection of customer service speech is an essential task in the service industry.

Quality inspection supervises customer service calls on the one hand, and on the other hand can quickly locate problems and thereby improve service quality. Traditional quality inspection suffers from low efficiency, limited coverage, and untimely feedback. Intelligent quality inspection solves these problems: through speech recognition, natural language processing, and related techniques, customer service speech can be inspected quickly and efficiently. During inspection, however, it remains difficult for a system to determine whether an agent speaks fluently.

Traditional speech fluency assessment methods consider the fluency quality level only at the level of recognized features. With the development of speech data, fluency is no longer a single metric of pronunciation standards but needs to be recognized comprehensively, and existing approaches do not meet the requirements of the current stage of speech recognition. At present, no method or device satisfactorily solves the above problems in the field of financial services.
Summary of the invention

To overcome the problems in the related art, the present application provides a speech fluency recognition method and device, a computer device, and a readable storage medium, so that a training model built with a deep learning neural network can perform quality inspection on customer service speech and recognize the speech fluency of customer service staff more accurately and more comprehensively.

To achieve the above objective, the present application provides a speech fluency recognition method, the method comprising:

constructing a speech recognition model through a sequence-to-sequence deep learning network;

preprocessing the speech to be detected to obtain continuous speech frame sequences, and inputting the continuous speech frame sequences into the speech recognition model;

determining, according to the speech recognition model, the speech fluency corresponding to the continuous speech frame sequences;

detecting whether the speech fluency levels determined for the continuous speech frame sequences of the speech to be detected are identical;

when the speech fluency levels determined for the continuous speech frame sequences of the speech to be detected are identical, determining that speech fluency as the fluency of the customer service agent corresponding to the speech to be detected;

when the speech fluency levels determined for the continuous speech frame sequences of the speech to be detected are not identical, determining the lowest of those speech fluency levels as the fluency of the speech to be detected.
To achieve the above objective, the present application further provides a customer service speech fluency recognition device, the device comprising:

a construction module, configured to construct a speech recognition model through a sequence-to-sequence deep learning network;

an input module, configured to preprocess the speech to be detected to obtain continuous speech frame sequences, and to input the continuous speech frame sequences into the speech recognition model;

a determining module, configured to determine, according to the speech recognition model, the speech fluency corresponding to the continuous speech frame sequences;

a detection module, configured to detect whether the speech fluency levels determined for the continuous speech frame sequences of the speech to be detected are identical;

a first output module, configured to, when the speech fluency levels determined for the continuous speech frame sequences of the speech to be detected are identical, determine that speech fluency as the fluency of the customer service agent corresponding to the speech to be detected;

a second output module, configured to, when the speech fluency levels determined for the continuous speech frame sequences of the speech to be detected are not identical, determine the lowest of those speech fluency levels as the fluency of the speech to be detected.
To achieve the above objective, the present application further provides a computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the speech fluency recognition method:

constructing a speech recognition model through a sequence-to-sequence deep learning network;

preprocessing the speech to be detected to obtain continuous speech frame sequences, and inputting the continuous speech frame sequences into the speech recognition model;

determining, according to the speech recognition model, the speech fluency corresponding to the continuous speech frame sequences;

detecting whether the speech fluency levels determined for the continuous speech frame sequences of the speech to be detected are identical;

when the speech fluency levels determined for the continuous speech frame sequences of the speech to be detected are identical, determining that speech fluency as the fluency of the customer service agent corresponding to the speech to be detected;

when the speech fluency levels determined for the continuous speech frame sequences of the speech to be detected are not identical, determining the lowest of those speech fluency levels as the fluency of the speech to be detected.
To achieve the above objective, the present application further provides a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the speech fluency recognition method:

constructing a speech recognition model through a sequence-to-sequence deep learning network;

preprocessing the speech to be detected to obtain continuous speech frame sequences, and inputting the continuous speech frame sequences into the speech recognition model;

determining, according to the speech recognition model, the speech fluency corresponding to the continuous speech frame sequences;

detecting whether the speech fluency levels determined for the continuous speech frame sequences of the speech to be detected are identical;

when the speech fluency levels determined for the continuous speech frame sequences of the speech to be detected are identical, determining that speech fluency as the fluency of the customer service agent corresponding to the speech to be detected;

when the speech fluency levels determined for the continuous speech frame sequences of the speech to be detected are not identical, determining the lowest of those speech fluency levels as the fluency of the speech to be detected.
By constructing a speech recognition model based on recurrent neural networks (RNNs), the present application analyzes speech as sequences for quality inspection and recognizes and judges customer service speech quickly. As the deep learning RNN keeps training and learning, its recognition accuracy improves by itself. This solves the current reliance on manual quality inspection of customer service speech and achieves smarter, more accurate fluency judgment of customer service speech based on a deep learning neural network.

It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present application.
Brief description of the drawings

Fig. 1 is a schematic flowchart of a speech fluency recognition method according to an exemplary embodiment.

Fig. 2 is a schematic structural diagram of a speech recognition model according to an exemplary embodiment.

Fig. 3 is a schematic diagram of the preprocessing flow of a speech fluency recognition method according to an exemplary embodiment.

Fig. 4 is a schematic diagram of the training and learning of a speech recognition model according to an exemplary embodiment.

Fig. 5 is a schematic block diagram of a speech fluency recognition device according to an exemplary embodiment.

Fig. 6 is a block diagram of a computer device according to an exemplary embodiment.

Detailed description
The present application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present application, not to limit it. It should also be noted that, for convenience of description, the drawings show only the parts related to the present application rather than the entire structure.

Before discussing the exemplary embodiments in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart describes the steps as sequential, many of the steps can be performed in parallel, concurrently, or simultaneously. In addition, the order of the steps may be rearranged, and the processing may be terminated when its operations are completed, though there may also be additional steps not included in the drawings. The processing may correspond to a method, a function, a procedure, a subroutine, a subprogram, and so on.

The present application relates to a speech fluency recognition method and device, a computer device, and a readable storage medium, mainly applied in scenarios requiring quality inspection and fluency judgment of customer service speech. The basic idea is as follows: when monitoring customer service speech fluency, obtain each agent's speech, or at least part of it, as segments, and recognize the speech through sequence analysis. Beforehand, a speech recognition model must be constructed: a deep learning recurrent neural network (RNN) builds the model from the raw speech data and learns from the input training data. After the acquired speech to be judged is preprocessed, speech frame sequences of any length are input into the deep learning model, which outputs the corresponding speech fluency, achieving smarter and more accurate fluency judgment of customer service speech based on a deep learning neural network.

Embodiment 1

This embodiment is applicable to the case where an intelligent terminal with a deep learning model performs deep-learning-based customer service speech fluency recognition. The method may be executed by a device hosting the deep learning model, which may be implemented in software and/or hardware, and may generally be integrated in a server or in the cloud, or controlled by a central control module in a terminal. Fig. 1 is a basic flowchart of a speech fluency recognition method of the present application; the method includes the following steps:
In step 110, a speech recognition model is constructed through a sequence-to-sequence deep learning network.

The core of the speech recognition model is a sequence-to-sequence RNN. Using the long short-term memory capability of recurrent neural networks (RNNs, also called recursive neural networks), fluency recognition can be performed on speech or speech segments of any length.

In a feasible implementation scenario of the present application, Fig. 2 shows the RNN network structure of the present application, which implements the RNN with a six-layer encoder-decoder architecture mainly comprising an encoder, a decoder, and a fully connected layer. The speech recognition model is built on this structure, which enables the RNN to process and classify input sequences of any length.
The encoder consists of three layers: two bidirectional recurrent layers of 128 and 64 neurons respectively, and a unidirectional layer of 32 recurrent neurons. The encoder is configured to process any sequence up to a set maximum length. All recurrent neurons in the encoder are GRUs (Gated Recurrent Units), which have a relatively simple structure: an update gate and a reset gate decide how much to depend on previous states, which handles long-range dependencies and information from long ago well.

Fixed encoding layer: the last layer of the encoder is a fixed-parameter activation layer of 32 neurons, used to initialize the decoder.

Decoder: a single recurrent layer with 64 long short-term memory (LSTM) units, combined with an attention mechanism. The attention mechanism makes the network focus on the salient parts of the input features and ultimately improves classification performance. The input features comprise two or more features from a group including, but not limited to: linguistic features, phonemes, phonetic features, context features, semantic features, environment features, and scene features. The decoder is set to output a single classification label for each input sequence, i.e., one of speech fluency levels 1-5.

Fully connected layer: after the decoder, a fully connected layer with 256 ReLU neurons maps the learned "distributed feature representation" to the sample label space and combines the multiple learned features to obtain the overall feature of speech fluency.

Classification: the final classification layer uses softmax to output a classification label. The softmax function maps its input to values in (0, 1); interpreting these values as probabilities, the result with the largest probability is selected as the classification result (one of speech fluency levels 1-5).
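The encoder-decoder structure described above can be sketched roughly as follows. This is an illustrative assumption rather than the patent's actual implementation: the framework (PyTorch), the 40-dimensional input features, and the omission of the attention mechanism and the fixed 32-neuron initialization layer are all simplifications for brevity.

```python
import torch
import torch.nn as nn

class FluencyClassifier(nn.Module):
    """Sketch of the sequence-to-sequence fluency model (illustrative only)."""
    def __init__(self, n_features=40, n_classes=5):
        super().__init__()
        # Encoder: two bidirectional GRU layers (128 and 64 units)
        # followed by a unidirectional GRU layer (32 units).
        self.gru1 = nn.GRU(n_features, 128, bidirectional=True, batch_first=True)
        self.gru2 = nn.GRU(256, 64, bidirectional=True, batch_first=True)
        self.gru3 = nn.GRU(128, 32, batch_first=True)
        # Decoder: a single LSTM layer with 64 units (attention omitted here).
        self.decoder = nn.LSTM(32, 64, batch_first=True)
        # Fully connected layer with 256 ReLU neurons, then a 5-way output.
        self.fc = nn.Sequential(nn.Linear(64, 256), nn.ReLU(),
                                nn.Linear(256, n_classes))

    def forward(self, x):  # x: (batch, time, n_features), any sequence length
        h, _ = self.gru1(x)
        h, _ = self.gru2(h)
        h, _ = self.gru3(h)
        h, _ = self.decoder(h)
        logits = self.fc(h[:, -1, :])      # one label per input sequence
        return logits.softmax(dim=-1)      # probabilities over the 5 levels

model = FluencyClassifier()
probs = model(torch.randn(2, 50, 40))      # two sequences of 50 frames
level = probs.argmax(dim=-1) + 1           # predicted fluency level in 1..5
```

A real system would add the attention mechanism over the encoder outputs, initialize the decoder from the fixed 32-neuron layer, and train on the labeled database of customer service recordings.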
In a feasible implementation scenario of the exemplary embodiments of the present application, when building the database for the deep learning network used for fluency recognition, a database of 2000 customer service voice records may first be created, with the fluency of each record labeled manually on a scale of 1 to 5, where levels 1 to 5 denote very unfluent, unfluent, barely fluent, basically fluent, and very fluent respectively. It will be appreciated that the labels for levels 1 to 5 may take various other forms and are not limited to the above implementation.
In step 120, the speech to be detected is preprocessed to obtain continuous speech frame sequences, and the continuous speech frame sequences are input into the speech recognition model.

In a feasible implementation scenario of the exemplary embodiments of the present application, during quality inspection of customer service speech, the recording module of the telephone platform records the conversation between agent and customer. Because the platform records speech on two channels, the agent's part can be extracted. Background noise, level noise, and silence appear in the speech during recording and extraction, so preprocessing, generally denoising, is needed to obtain relatively clean speech segments. This further ensures the accuracy of fluency recognition on the obtained speech source.

Irrelevant data produced during signal transmission, such as silence and background noise, are removed by detecting low-energy windows.

The denoised speech is converted into sequences with several frequency components per frame; these sequences and their corresponding labels (one of speech fluency levels 1-5) are input into the speech recognition model as training data for the RNN.
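The low-energy-window removal and frame-to-frequency conversion described above might be sketched as follows. The sample rate, frame length, hop size, and energy threshold are illustrative assumptions, as the patent does not specify them:

```python
import numpy as np

def remove_low_energy_frames(signal, rate=8000, frame_ms=25, hop_ms=10,
                             threshold=1e-4):
    """Drop frames whose short-time energy falls below a threshold --
    a simple stand-in for the low-energy-window detection described above."""
    frame_len = int(rate * frame_ms / 1000)
    hop = int(rate * hop_ms / 1000)
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    kept = [f for f in frames if np.mean(f ** 2) >= threshold]
    return np.array(kept)

def frames_to_spectra(frames):
    """Convert each kept frame into its frequency components via an FFT,
    yielding the per-frame sequences fed to the recognition model."""
    return np.abs(np.fft.rfft(frames, axis=1))

# A one-second toy signal: speech-like noise followed by silence.
rng = np.random.default_rng(0)
sig = np.concatenate([rng.normal(0, 0.1, 4000), np.zeros(4000)])
frames = remove_low_energy_frames(sig)      # silent half is discarded
spectra = frames_to_spectra(frames)         # (n_frames, n_frequency_bins)
```

In practice the threshold would be tuned (or made adaptive) to separate line noise and silence from actual speech on the telephone channel.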
In step 130, the speech fluency corresponding to the continuous speech frame sequences is determined according to the speech recognition model.

Running the deep learning model yields the classification result for the speech to be detected, which is one of the preset fluency levels 1 to 5.

In a feasible specific implementation, in the initial stage the fluency may be obtained by matching the speech frame sequences against manually labeled speech frames. Further, as the speech recognition model keeps learning and deepening, once the overall feature of a classification label has been obtained (such as pauses in the middle of an utterance), speech to be detected that exhibits pauses can be given the "very unfluent" classification, i.e., label level 1, and all subsequently received customer service speech exhibiting pauses can be quickly judged as fluency level 1, very unfluent.
In step 140, it is detected whether the fluency levels determined for the continuous speech frame sequences of the speech to be detected are identical; if identical, step 150 is executed, and if not, step 160 is executed.

A speech segment of the speech to be detected may include several continuous preprocessed speech frame sequences. Fluency recognition involves not only the recognition of individual frame sequences, but also overall recognition at the level of a speech segment, several consecutive segments, and an entire recording. When recognizing the fluency of multiple segments, the classification result of any single segment cannot reflect the overall fluency level of the corresponding agent.

In a feasible implementation scenario of the exemplary embodiments of the present application, an entire recording may be denoted "A", each speech segment "A1", "A2", "A3", "A4", "A5", ..., and the speech frame sequences within each segment "A11", "A12", "A13", ..., "A21", "A22", "A23", ..., "A31", "A32", "A33", ..., and so on.
In step 150, when the fluency levels determined for the continuous speech frame sequences of the speech to be detected are identical, that fluency level is determined as the fluency of the agent corresponding to the speech segment.

When the fluency of "A11", "A12", "A13", ... is classified as level 5, that of "A21", "A22", "A23", ... as level 5, and that of "A31", "A32", "A33" also as level 5, the continuous speech frame sequences in the segment yield the same fluency, and the fluency of the segment is determined to be level 5, "very fluent".

In step 160, when the fluency levels determined for the continuous speech frame sequences of the speech to be detected are not identical, the lowest of those fluency levels is determined as the fluency of the speech segment.

When the fluency of "A11", "A12", "A13", ... is classified as level 5, that of "A21", "A22", "A23", ... as level 5, but that of "A31", "A32", "A33" as level 4, the continuous speech frame sequences in the segment yield different fluency levels and further processing is required: level 4 is taken as the fluency classification of the segment, and this level-4 result affects the fluency of the entire recording.
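The decision rule of steps 140-160 can be sketched as a small helper; representing the per-sequence results as a plain list of integer levels is an assumption for illustration:

```python
def segment_fluency(frame_levels):
    """Combine the fluency levels of the frame sequences in one segment:
    if all levels agree, return that level; otherwise return the lowest
    level present, following the rule of steps 140-160."""
    if len(set(frame_levels)) == 1:
        return frame_levels[0]
    return min(frame_levels)

# Example from the description: "A1x" and "A2x" sequences rated 5,
# "A3x" sequences rated 4 -> the segment is rated 4.
all_agree = segment_fluency([5, 5, 5])   # -> 5
disagree = segment_fluency([5, 5, 4])    # -> 4
```

The same rule can then be applied one level up, combining segment results into the fluency of the whole recording.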
The effect on the entire recording of a fluency level that differs from those of the other segments can be determined by a fluency calculation algorithm; different algorithms yield different final fluency results for the agent.

In the method of the present application, a sequence-to-sequence deep learning recurrent neural network (RNN) is selected to build a deep learning model that monitors customer service speech fluency, trained continuously on the input of raw speech data and training speech data. After the acquired speech to be judged is preprocessed, speech frame sequences of any length are input into the deep learning model to obtain the corresponding speech fluency, achieving smarter and more accurate fluency judgment of customer service speech based on a deep learning neural network and further improving the effectiveness of intelligent quality inspection of customer service speech.
In a feasible implementation scenario of the exemplary embodiments of the present application, determining the speech fluency corresponding to the continuous speech frame sequences according to the speech recognition model includes: obtaining the features of the input speech frame sequences, such as linguistic features, phonemes, phonetic features, context features, semantic features, environment features, and scene features; combined with the attention mechanism, outputting through the decoder of the speech recognition model a single label for each input speech frame sequence; and taking that single label as the classification of the speech frame sequence, so that the decoder finally outputs one classification label per input sequence, i.e., one of speech fluency levels 1-5.

In a feasible implementation scenario of the exemplary embodiments of the present application, after the customer-service-speech-to-label mapping of the speech recognition model is obtained, the method further includes a mapping and learning process through the fully connected layer, which mainly includes:

obtaining a distributed feature representation through the speech recognition model and mapping it to the database;

In a distributed feature representation, the meaning of each feature stands on its own: it does not change no matter how the other features change. The obtained distributed feature representation is mapped to the database, so that the speech recognition model learns and captures the content of the representation that bears on speech fluency judgment.
对所述分布式特征进行组合得到各分类标注的整体特征;Combining the distributed features to obtain the overall features of each classification;
根据所述整体特征对客服语音进行检测。Detecting customer service voice according to the overall characteristics.
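A minimal sketch of this combine-and-detect process under stated assumptions: the specification does not fix a concrete combination operator or distance measure, so the element-wise averaging, the dict standing in for "the database", and the nearest-feature rule below are all illustrative choices:

```python
def combine_features(feature_vectors):
    # Combine the distributed features gathered for one classification
    # annotation into its overall feature by element-wise averaging.
    n = len(feature_vectors)
    dim = len(feature_vectors[0])
    return [sum(v[d] for v in feature_vectors) / n for d in range(dim)]

def build_label_database(labeled_features):
    # labeled_features: {fluency_level: [feature_vector, ...]}
    # Map each label to the overall feature learned for it.
    return {label: combine_features(vecs)
            for label, vecs in labeled_features.items()}

def detect(feature, database):
    # Detect customer service speech by the nearest overall feature
    # (squared Euclidean distance), returning its fluency label.
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(database, key=lambda label: dist(feature, database[label]))
```

Once the overall features are in place, detection of a new utterance reduces to a single lookup against them, which is what makes the later judgments fast.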
In the continuous learning and deepening process of the speech recognition model, after the overall feature of a classification annotation is obtained (for example, a pause in the middle of a semantic unit), such as the overall feature of the "very unfluent" classification annotation, that is, label level 1, all subsequently obtained customer service speech exhibiting such pauses can be judged relatively quickly as having a speech fluency of level 1, very unfluent.
With the method of the present application, after the overall features of the classification annotations (the labels) are obtained, faster and more accurate judgment and evaluation of customer service speech can be achieved, greatly improving quality inspection efficiency.
In a feasible implementation of the present application, before the speech recognition model is constructed, the method further includes a process of constructing a database, in order to help construct the customer service speech-classification annotations of the speech recognition model. This process may include the following steps:
obtaining customer service speech from several customer service records and creating a speech database; and
manually marking the customer service speech in the several customer service records, setting a classification annotation label for each customer service speech.
In an exemplary embodiment of the present application, a database with 2000 customer service records is created. The fluency of the customer service speech is manually marked and labeled in order from level 1 to level 5, where levels 1 to 5 respectively represent very unfluent, unfluent, barely fluent, basically fluent, and very fluent.
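The manual labeling scheme can be sketched as a simple data structure. The five-level scale and the 2000-record size come from the embodiment; the record fields and function names are illustrative:

```python
# Fluency scale from the embodiment:
# level 1 (very unfluent) .. level 5 (very fluent).
FLUENCY_LABELS = {
    1: "very unfluent",
    2: "unfluent",
    3: "barely fluent",
    4: "basically fluent",
    5: "very fluent",
}

def create_database(records):
    # records: iterable of (audio_id, manually_assigned_level) pairs,
    # e.g. the 2000 customer service records of the embodiment.
    database = []
    for audio_id, level in records:
        if level not in FLUENCY_LABELS:
            raise ValueError(f"label must be 1-5, got {level}")
        database.append({"audio": audio_id, "level": level,
                         "label": FLUENCY_LABELS[level]})
    return database
```

Rejecting out-of-range levels at ingestion time keeps the training labels consistent with the five-level judgment standard before the network ever sees them.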
In the early stage, manual labeling is performed on the customer service speech from a large number of customer service records, so that the basic customer service speech-classification annotation data learned when constructing the deep learning neural network better conforms to the set judgment standard, and the results obtained in the subsequent quality inspection of customer service speech are more accurate.
A feasible implementation scenario of the exemplary embodiments of the present application further includes a process of preprocessing the acquired customer service speech. In an actual quality inspection process, the recording module of the telephone platform records the conversation between the customer service agent and the customer. Because the telephone platform records speech in two channels, the customer service agent's part of the speech can be extracted. The extracted customer service speech inevitably contains interference such as the noise floor produced during transmission in electronic devices. Therefore, as shown in FIG. 3, in conjunction with the schematic flowchart of speech recognition in FIG. 4, this process may include the following steps:
In step 310, denoising is performed on the speech to be detected.
Irrelevant data produced during signal transmission, such as silence and background noise, is removed by detecting low-energy windows. In practice, a signal conditioning circuit can be designed so that the sensor can amplify the heart-rate signal and completely eliminate environmental signal interference, thereby implementing the denoising processing.
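The low-energy-window detection can be sketched as follows; the window length of 64 samples and the energy threshold are arbitrary illustrative choices, not values from the specification:

```python
def remove_low_energy_windows(samples, window=64, threshold=0.01):
    # Drop silence/background-noise windows whose mean energy falls
    # below the threshold; keep only the informative speech windows.
    kept = []
    for start in range(0, len(samples), window):
        win = samples[start:start + window]
        energy = sum(s * s for s in win) / len(win)
        if energy >= threshold:
            kept.extend(win)
    return kept
```

A run of quiet samples between two spoken passages is removed entirely, so downstream framing only ever sees windows that carry speech energy.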
In step 320, the denoised speech to be detected is segmented, each segment including frame data of a preset frame length.
In the preprocessing process, the sound wave data stream is segmented into frames of 4 milliseconds each.
In step 330, sequence conversion is performed on the frame data to obtain the speech frame sequence.
After denoising, each window is converted into a sequence with 64 frequency components per frame. These sequences and their corresponding labels (one of speech fluency levels 1-5) are used as the data for training the RNNs.
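Steps 320 and 330 can be sketched together as below. The 16 kHz sampling rate is an assumption (it makes a 4 ms frame exactly 64 samples, so a 64-point DFT yields the 64 frequency components per frame); the naive DFT magnitude stands in for whatever spectral front end is actually used:

```python
import math

def frames_of(samples, rate=16000, frame_ms=4):
    # Step 320: segment the stream into 4 ms frames
    # (64 samples at the assumed 16 kHz sampling rate).
    n = rate * frame_ms // 1000
    return [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]

def dft_magnitudes(frame):
    # Naive DFT: one magnitude per frequency component
    # (64 components for a 64-sample frame).
    n = len(frame)
    mags = []
    for k in range(n):
        re = sum(x * math.cos(-2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = sum(x * math.sin(-2 * math.pi * k * i / n) for i, x in enumerate(frame))
        mags.append(math.hypot(re, im))
    return mags

def to_frame_sequence(samples):
    # Step 330: the speech frame sequence fed to the RNNs,
    # 64 frequency components per frame.
    return [dft_magnitudes(f) for f in frames_of(samples)]
```

Pairing each such sequence with its manually assigned level 1-5 yields exactly the (sequence, label) training data described above.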
Embodiment 2
FIG. 5 is a schematic structural diagram of a speech fluency recognition device according to an embodiment of the present application. The device may be implemented by software and/or hardware, is generally integrated on the server side, and may be implemented through the speech fluency recognition method. As shown in the figure, this embodiment may be based on Embodiment 1 above and provides a speech fluency recognition device. As shown in FIG. 5, it mainly includes a construction module 510, an input module 520, a determination module 530, a frame detection module 540, a first output module 550, and a second output module 560.
The construction module 510 is configured to construct a speech recognition model through a sequence-to-sequence deep learning network.
The input module 520 is configured to preprocess the speech to be detected to obtain a continuous speech frame sequence, and input the continuous speech frame sequence into the speech recognition model.
The determination module 530 determines the speech fluency corresponding to the continuous speech frame sequence according to the speech recognition model.
The frame detection module 540 is configured to detect whether the speech fluency levels determined for the continuous speech frame sequences in the speech to be detected are the same.
The first output module 550 is configured to, when the speech fluency levels determined for the continuous speech frame sequences in the speech to be detected are the same, determine that speech fluency as the fluency of the customer corresponding to the speech to be detected.
The second output module 560 is configured to, when the speech fluency levels determined for the continuous speech frame sequences in the speech to be detected are not the same, determine the lower of the speech fluency levels as the fluency of the speech to be detected.
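The decision rule implemented by modules 540-560 can be sketched as a single function (the function and parameter names are hypothetical):

```python
def overall_fluency(segment_levels):
    # segment_levels: fluency level (1-5) determined for each continuous
    # speech frame sequence of the speech to be detected.
    if not segment_levels:
        raise ValueError("no speech frame sequences were scored")
    # Module 540: check whether all determined levels are the same.
    if all(level == segment_levels[0] for level in segment_levels):
        return segment_levels[0]   # module 550: the shared level
    return min(segment_levels)     # module 560: the lower level
```

For example, segments scored [4, 4, 4] yield 4, while [4, 2, 5] yield 2: taking the lower level is the conservative choice that keeps a single disfluent passage from being averaged away in quality inspection.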
In a feasible implementation scenario of the exemplary embodiments of the present application, the device further includes:
an acquisition module, configured to obtain customer service speech from several customer service records and create a speech database; and
a manual marking module, configured to manually mark the customer service speech in the several customer service records and set a classification annotation label for each customer service speech.
In a feasible implementation scenario of the exemplary embodiments of the present application, the input module includes:
a denoising submodule, configured to denoise the speech to be detected;
a segmentation submodule, configured to segment the denoised speech to be detected, each segment including frame data of a preset frame length; and
a conversion submodule, configured to perform sequence conversion on the frame data to obtain the speech frame sequence.
The speech fluency recognition device provided in the above embodiment can execute the speech fluency recognition method provided in any embodiment of the present application, and has the corresponding functional modules and beneficial effects for executing the method. For technical details not described in detail in the above embodiment, refer to the speech fluency recognition method provided in any embodiment of the present application.
It will be appreciated that the present application also extends to computer programs suitable for putting the present application into practice, in particular computer programs on or in a carrier. The program may be in the form of source code, object code, a code intermediate source and object code such as in a partially compiled form, or in any other form suitable for use in the implementation of the method according to the present application. It will also be noted that such programs may have many different architectural designs. For example, program code implementing the functionality of the method or system according to the present application may be subdivided into one or more subroutines.
Many different ways of distributing the functionality among these subroutines will be apparent to the skilled person. The subroutines may be stored together in one executable file to form a self-contained program. Such an executable file may comprise computer-executable instructions, for example processor instructions and/or interpreter instructions (e.g., Java interpreter instructions). Alternatively, one or more or all of the subroutines may be stored in at least one external library file and linked with a main program either statically or dynamically, e.g., at run-time. The main program contains at least one call to at least one of the subroutines. The subroutines may also comprise function calls to each other. An embodiment relating to a computer program product comprises computer-executable instructions corresponding to each processing step of at least one of the methods set forth. These instructions may be subdivided into subroutines and/or stored in one or more files that may be linked statically or dynamically.
This embodiment further provides a computer device capable of executing a program, such as a smartphone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server, or a cabinet server (including an independent server, or a server cluster composed of multiple servers). The computer device 20 of this embodiment at least includes, but is not limited to, a memory 21 and a processor 22 that can be communicatively connected to each other through a system bus, as shown in FIG. 6. It should be noted that FIG. 6 only shows the computer device 20 with components 21-22, but it should be understood that not all of the illustrated components are required to be implemented; more or fewer components may be implemented instead.
In this embodiment, the memory 21 (i.e., a readable storage medium) includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 21 may be an internal storage unit of the computer device 20, such as a hard disk or memory of the computer device 20. In other embodiments, the memory 21 may also be an external storage device of the computer device 20, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device 20. Of course, the memory 21 may also include both the internal storage unit of the computer device 20 and its external storage device. In this embodiment, the memory 21 is generally used to store the operating system and various application software installed on the computer device 20, such as the program code of the RNNs neural network of Embodiment 1. In addition, the memory 21 may also be used to temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor 22 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 22 is generally used to control the overall operation of the computer device 20. In this embodiment, the processor 22 is configured to run the program code or process the data stored in the memory 21, for example to implement each layer structure of the deep learning model, so as to implement the speech fluency recognition method of Embodiment 1.
This embodiment further provides a computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an app store, and the like, on which a computer program is stored; when the program is executed by a processor, the corresponding functions are implemented. The computer-readable storage medium of this embodiment is used to store a financial applet, and when executed by a processor, implements the speech fluency recognition method of Embodiment 1.
Another embodiment relating to a computer program product comprises computer-executable instructions corresponding to each means of at least one of the systems and/or products set forth. These instructions may be subdivided into subroutines and/or stored in one or more files that may be linked statically or dynamically.
The carrier of a computer program may be any entity or device capable of carrying the program. For example, the carrier may include a storage medium, such as a ROM (e.g., a CD-ROM or a semiconductor ROM) or a magnetic recording medium (e.g., a floppy disk or a hard disk). Further, the carrier may be a transmissible carrier such as an electrical or optical signal, which may be conveyed via an electrical or optical cable, or by radio or other means. When the program is embodied in such a signal, the carrier may be constituted by such a cable or device. Alternatively, the carrier may be an integrated circuit in which the program is embedded, the integrated circuit being adapted to perform, or for use in the performance of, the relevant method.
It should be noted that the above-mentioned embodiments illustrate rather than limit the present application, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. Use of the verb "comprise" and its conjugations does not exclude the presence of elements or steps other than those stated in a claim. The article "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present application may be implemented by means of hardware comprising several distinct components, and by means of a suitably programmed computer. In a device claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the functions described above may be optional or may be combined.
If desired, the steps discussed above are not limited to the order of execution in the embodiments; different steps may be performed in a different order and/or concurrently with each other. Furthermore, in other embodiments, one or more of the steps described above may be optional or may be combined.
Although various aspects of the present application are set out in the independent claims, other aspects of the present application comprise combinations of features from the described embodiments and/or of the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.
It is also noted herein that while the above describes example embodiments of the present application, these descriptions should not be viewed in a limiting sense. Rather, several variations and modifications may be made without departing from the scope of the present application as defined in the appended claims.
Those of ordinary skill in the art should understand that each module in the device of the embodiments of the present application may be implemented by a general-purpose computing device, and the modules may be concentrated in a single computing device or in a network group composed of computing devices. The device in the embodiments of the present application corresponds to the method in the foregoing embodiments; it may be implemented by executable program code or by a combination of integrated circuits. Therefore, the present application is not limited to specific hardware or software or a combination thereof.
Those of ordinary skill in the art should understand that each module in the device of the embodiments of the present application may be implemented by a general-purpose mobile terminal, and the modules may be concentrated in a single mobile terminal or in a device combination composed of mobile terminals. The device in the embodiments of the present application corresponds to the method in the foregoing embodiments; it may be implemented by editing executable program code or by a combination of integrated circuits. Therefore, the present application is not limited to specific hardware or software or a combination thereof.
Note that the above are merely exemplary embodiments of the present application and the technical principles applied. Those skilled in the art will understand that the present application is not limited to the specific embodiments described herein, and that various obvious changes, readjustments, and substitutions can be made by those skilled in the art without departing from the protection scope of the present application. Therefore, although the present application has been described in some detail through the above embodiments, the present application is not limited to the above embodiments, and may include more other equivalent embodiments without departing from the concept of the present application; the scope of the present application is determined by the scope of the appended claims.

Claims (20)

  1. A speech fluency recognition method, characterized in that the method comprises:
    constructing a speech recognition model through a sequence-to-sequence deep learning network;
    preprocessing speech to be detected to obtain a continuous speech frame sequence, and inputting the continuous speech frame sequence into the speech recognition model;
    determining, according to the speech recognition model, the speech fluency corresponding to the continuous speech frame sequence;
    detecting the continuous speech frame sequences in the speech to be detected, and determining whether the obtained speech fluency levels are the same;
    when the speech fluency levels determined from the continuous speech frame sequences in the speech to be detected are the same, determining the speech fluency as the fluency of the customer corresponding to the speech to be detected; and
    when the speech fluency levels determined from the continuous speech frame sequences in the speech to be detected are not the same, determining the lower of the speech fluency levels as the fluency of the speech to be detected.
  2. The method according to claim 1, characterized in that, before constructing the speech recognition model through the sequence-to-sequence deep learning network, the method further comprises:
    obtaining customer service speech from several customer service records and creating a speech database; and
    manually marking the customer service speech in the several customer service records, and setting a classification annotation label for each customer service speech.
  3. The method according to claim 1, characterized in that preprocessing the speech to be detected to obtain the continuous speech frame sequence comprises:
    denoising the speech to be detected;
    segmenting the denoised speech to be detected, each segment comprising frame data of a preset frame length; and
    performing sequence conversion on the frame data to obtain the speech frame sequence.
  4. The method according to claim 1, characterized in that determining, according to the speech recognition model, the speech fluency corresponding to the continuous speech frame sequence comprises:
    obtaining characteristics of the input speech frame sequence;
    in combination with an attention mechanism, outputting a corresponding single label for each input speech frame sequence through a decoder in the speech recognition model; and
    using the single label as the classification annotation of the speech frame sequence.
  5. The method according to claim 4, characterized in that the method further comprises:
    obtaining the customer service speech-classification annotations of the speech recognition model;
    obtaining a distributed feature representation of the customer service speech-classification annotations through the speech recognition model, and mapping it to the database;
    combining the distributed features to obtain the overall feature of each classification annotation; and
    detecting customer service speech according to the overall features.
  6. A customer service speech fluency recognition device, characterized in that the device comprises:
    a construction module, configured to construct a speech recognition model through a sequence-to-sequence deep learning network;
    an input module, configured to preprocess speech to be detected to obtain a continuous speech frame sequence, and input the continuous speech frame sequence into the speech recognition model;
    a determination module, configured to determine, according to the speech recognition model, the speech fluency corresponding to the continuous speech frame sequence;
    a detection module, configured to detect whether the speech fluency levels determined from the continuous speech frame sequences in the speech to be detected are the same;
    a first output module, configured to, when the speech fluency levels determined from the continuous speech frame sequences in the speech to be detected are the same, determine the speech fluency as the fluency of the customer corresponding to the speech to be detected; and
    a second output module, configured to, when the speech fluency levels determined from the continuous speech frame sequences in the speech to be detected are not the same, determine the lower of the speech fluency levels as the fluency of the speech to be detected.
  7. The device according to claim 6, characterized in that the device further comprises:
    an acquisition module, configured to obtain customer service speech from several customer service records and create a speech database; and
    a manual marking module, configured to manually mark the customer service speech in the several customer service records, and set a classification annotation label for each customer service speech.
  8. The device according to claim 6, characterized in that the input module comprises:
    a denoising submodule, configured to denoise the speech to be detected;
    a segmentation submodule, configured to segment the denoised speech to be detected, each segment comprising frame data of a preset frame length; and
    a conversion submodule, configured to perform sequence conversion on the frame data to obtain the speech frame sequence.
  9. The device according to claim 6, characterized in that the determination module comprises:
    a characteristic acquisition submodule, configured to obtain characteristics of the input speech frame sequence;
    a single-label output submodule, configured to, in combination with an attention mechanism, output a corresponding single label for each input speech frame sequence through a decoder in the speech recognition model; and
    a classification submodule, configured to use the single label as the classification annotation of the speech frame sequence.
  10. The device according to claim 9, characterized in that the device further comprises:
    a classification annotation acquisition module, configured to obtain the customer service speech-classification annotations of the speech recognition model;
    a mapping module, configured to obtain a distributed feature representation of the customer service speech-classification annotations through the speech recognition model, and map it to the database;
    a combination module, configured to combine the distributed features to obtain the overall feature of each classification annotation; and
    an overall feature detection module, configured to detect customer service speech according to the overall features.
  11. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of a speech fluency recognition method:
    constructing a speech recognition model through a sequence-to-sequence deep learning network;
    preprocessing speech to be detected to obtain continuous speech frame sequences, and inputting the continuous speech frame sequences into the speech recognition model;
    determining, according to the speech recognition model, the speech fluency corresponding to each continuous speech frame sequence;
    detecting the continuous speech frame sequences in the speech to be detected, and determining whether the speech fluency values obtained for them are the same;
    when the speech fluency values determined for the continuous speech frame sequences in the speech to be detected are the same, determining that speech fluency as the fluency of the customer corresponding to the speech to be detected; and
    when the speech fluency values determined for the continuous speech frame sequences in the speech to be detected are not the same, determining the lower-level speech fluency among them as the fluency of the speech to be detected.
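The aggregation rule in the last three steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the numeric level ordering (lower number = less fluent) is an assumption:

```python
def aggregate_fluency(segment_levels):
    """Combine per-segment fluency levels into one utterance-level result.

    Levels are assumed ordered from least to most fluent,
    e.g. 0 = "not fluent", 1 = "fair", 2 = "fluent".
    """
    if not segment_levels:
        raise ValueError("no speech frame sequences were scored")
    unique = set(segment_levels)
    if len(unique) == 1:
        # All frame sequences agree: that level is the customer's fluency.
        return segment_levels[0]
    # Frame sequences disagree: fall back to the lower fluency level.
    return min(unique)
```

For example, segment levels `[2, 2, 2]` yield fluency `2`, while `[2, 1, 2]` fall back to `1`.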
  12. The computer device according to claim 11, wherein before the constructing of a speech recognition model through a sequence-to-sequence deep learning network, the method further comprises:
    acquiring customer service speech from a number of customer service records and creating a speech database; and
    manually marking the customer service speech in the customer service records, and setting a classification label for each customer service speech.
  13. The computer device according to claim 11, wherein the preprocessing of the speech to be detected to obtain continuous speech frame sequences comprises:
    denoising the speech to be detected;
    segmenting the denoised speech to be detected, each segment comprising frame data of a preset frame length; and
    performing sequence conversion on the frame data to obtain the speech frame sequences.
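The preprocessing steps of claim 13 can be sketched as below. This is a toy sketch under assumed parameters (`frame_len`, `noise_floor` and the thresholding denoiser are illustrative choices, not specified by the patent):

```python
import numpy as np

def preprocess(speech, frame_len=400, noise_floor=1e-3):
    """Denoise, segment into fixed-length frames, and stack the frames
    into a continuous frame sequence ready for the model."""
    x = np.asarray(speech, dtype=np.float32)
    # Crude denoising: zero out samples below an assumed noise floor.
    x = np.where(np.abs(x) < noise_floor, 0.0, x)
    # Segment into frames of the preset length, dropping any remainder.
    n_frames = len(x) // frame_len
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)
    # Each row is one frame; the row order preserves the time sequence.
    return frames
```

A 1000-sample signal with `frame_len=400` thus yields a sequence of 2 frames, the trailing 200 samples being discarded.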
  14. The computer device according to claim 11, wherein the determining, according to the speech recognition model, of the speech fluency corresponding to the continuous speech frame sequence comprises:
    acquiring characteristics of the input speech frame sequence;
    outputting, in combination with an attention mechanism, a corresponding single label for each input speech frame sequence through the decoder of the speech recognition model; and
    using the single label as the classification label of the speech frame sequence.
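One common way to realize "attention mechanism plus decoder emitting a single label per sequence" is attention pooling over the encoder states followed by a linear classifier. The sketch below illustrates that pattern only; the weights, dimensions, and pooling choice are assumptions, not the patent's trained model:

```python
import numpy as np

def attention_pool_classify(encoder_states, query, class_weights):
    """Emit one classification label for a whole frame sequence.

    encoder_states: (T, d) array of per-frame encoder outputs.
    query:          (d,) decoder query vector (illustrative).
    class_weights:  (n_classes, d) linear output layer (illustrative).
    """
    scores = encoder_states @ query            # (T,) attention scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax attention weights
    context = weights @ encoder_states         # (d,) pooled context vector
    logits = class_weights @ context           # (n_classes,) class scores
    return int(np.argmax(logits))              # the single label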
  15. The computer device according to claim 14, wherein the method further comprises:
    acquiring customer service speech-classification label pairs of the speech recognition model;
    obtaining, through the speech recognition model, a distributed feature representation of the customer service speech-classification label pairs and mapping it to the database;
    combining the distributed features to obtain an overall feature for each classification label; and
    detecting customer service speech according to the overall features.
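The "combination" step of claim 15 can be realized, for example, by averaging the distributed feature vectors that share a label into one overall feature per classification label. The patent does not fix the combination operator, so the mean below is an assumed choice:

```python
import numpy as np

def overall_class_features(embeddings, labels):
    """Group distributed feature vectors by classification label and
    average each group into one overall per-label feature vector."""
    by_label = {}
    for vec, lab in zip(embeddings, labels):
        by_label.setdefault(lab, []).append(np.asarray(vec, dtype=float))
    return {lab: np.mean(vecs, axis=0) for lab, vecs in by_label.items()}
```

A new customer service utterance could then be checked against these per-label overall features, e.g. by nearest-centroid comparison.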
  16. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of a speech fluency recognition method:
    constructing a speech recognition model through a sequence-to-sequence deep learning network;
    preprocessing speech to be detected to obtain continuous speech frame sequences, and inputting the continuous speech frame sequences into the speech recognition model;
    determining, according to the speech recognition model, the speech fluency corresponding to each continuous speech frame sequence;
    detecting the continuous speech frame sequences in the speech to be detected, and determining whether the speech fluency values obtained for them are the same;
    when the speech fluency values determined for the continuous speech frame sequences in the speech to be detected are the same, determining that speech fluency as the fluency of the customer corresponding to the speech to be detected; and
    when the speech fluency values determined for the continuous speech frame sequences in the speech to be detected are not the same, determining the lower-level speech fluency among them as the fluency of the speech to be detected.
  17. The computer-readable storage medium according to claim 16, wherein before the constructing of a speech recognition model through a sequence-to-sequence deep learning network, the method further comprises:
    acquiring customer service speech from a number of customer service records and creating a speech database; and
    manually marking the customer service speech in the customer service records, and setting a classification label for each customer service speech.
  18. The computer-readable storage medium according to claim 16, wherein the preprocessing of the speech to be detected to obtain continuous speech frame sequences comprises:
    denoising the speech to be detected;
    segmenting the denoised speech to be detected, each segment comprising frame data of a preset frame length; and
    performing sequence conversion on the frame data to obtain the speech frame sequences.
  19. The computer-readable storage medium according to claim 16, wherein the determining, according to the speech recognition model, of the speech fluency corresponding to the continuous speech frame sequence comprises:
    acquiring characteristics of the input speech frame sequence;
    outputting, in combination with an attention mechanism, a corresponding single label for each input speech frame sequence through the decoder of the speech recognition model; and
    using the single label as the classification label of the speech frame sequence.
  20. The computer-readable storage medium according to claim 19, wherein the method further comprises:
    acquiring customer service speech-classification label pairs of the speech recognition model;
    obtaining, through the speech recognition model, a distributed feature representation of the customer service speech-classification label pairs and mapping it to the database;
    combining the distributed features to obtain an overall feature for each classification label; and
    detecting customer service speech according to the overall features.
PCT/CN2018/124442 2018-09-19 2018-12-27 Method and device for determining speech fluency degree, computer apparatus, and readable storage medium WO2020056995A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811093169.0A CN109087667B (en) 2018-09-19 2018-09-19 Voice fluency recognition method and device, computer equipment and readable storage medium
CN201811093169.0 2018-09-19

Publications (1)

Publication Number Publication Date
WO2020056995A1 true WO2020056995A1 (en) 2020-03-26

Family

ID=64842144

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/124442 WO2020056995A1 (en) 2018-09-19 2018-12-27 Method and device for determining speech fluency degree, computer apparatus, and readable storage medium

Country Status (2)

Country Link
CN (1) CN109087667B (en)
WO (1) WO2020056995A1 (en)


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109087667B (en) * 2018-09-19 2023-09-26 平安科技(深圳)有限公司 Voice fluency recognition method and device, computer equipment and readable storage medium
CN109602421A (en) * 2019-01-04 2019-04-12 平安科技(深圳)有限公司 Health monitor method, device and computer readable storage medium
CN112951270B (en) * 2019-11-26 2024-04-19 新东方教育科技集团有限公司 Voice fluency detection method and device and electronic equipment
CN112599122B (en) * 2020-12-10 2022-10-14 平安科技(深圳)有限公司 Voice recognition method and device based on self-attention mechanism and memory network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102237081A (en) * 2010-04-30 2011-11-09 国际商业机器公司 Method and system for estimating rhythm of voice
CN102509483A (en) * 2011-10-31 2012-06-20 苏州思必驰信息科技有限公司 Distributive automatic grading system for spoken language test and method thereof
CN105741832A (en) * 2016-01-27 2016-07-06 广东外语外贸大学 Spoken language evaluation method based on deep learning and spoken language evaluation system
US20180166066A1 (en) * 2016-12-14 2018-06-14 International Business Machines Corporation Using long short-term memory recurrent neural network for speaker diarization segmentation
CN109087667A (en) * 2018-09-19 2018-12-25 平安科技(深圳)有限公司 The recognition methods of voice fluency, device, computer equipment and readable storage medium storing program for executing

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101740024B (en) * 2008-11-19 2012-02-08 中国科学院自动化研究所 Method for automatic evaluation of spoken language fluency based on generalized fluency
KR101609473B1 (en) * 2014-10-14 2016-04-05 충북대학교 산학협력단 System and method for automatic fluency evaluation of english speaking tests


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112185380A (en) * 2020-09-30 2021-01-05 深圳供电局有限公司 Method for converting speech recognition into text for power supply intelligent client
CN116032662A (en) * 2023-03-24 2023-04-28 中瑞科技术有限公司 Interphone data encryption transmission system
CN116032662B (en) * 2023-03-24 2023-06-16 中瑞科技术有限公司 Interphone data encryption transmission system

Also Published As

Publication number Publication date
CN109087667A (en) 2018-12-25
CN109087667B (en) 2023-09-26

Similar Documents

Publication Publication Date Title
WO2020056995A1 (en) Method and device for determining speech fluency degree, computer apparatus, and readable storage medium
CN110377911B (en) Method and device for identifying intention under dialog framework
US10339935B2 (en) Context-aware enrollment for text independent speaker recognition
CN109933662B (en) Model training method, information generation method, device, electronic equipment and computer readable medium
WO2021128741A1 (en) Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium
CN111883115B (en) Voice flow quality inspection method and device
CN110363220B (en) Behavior class detection method and device, electronic equipment and computer readable medium
CN109063587B (en) Data processing method, storage medium and electronic device
CN110648691B (en) Emotion recognition method, device and system based on energy value of voice
US20170185913A1 (en) System and method for comparing training data with test data
CN114494935B (en) Video information processing method and device, electronic equipment and medium
US10733537B2 (en) Ensemble based labeling
CN110490304B (en) Data processing method and device
CN114898466A (en) Video motion recognition method and system for smart factory
CN112509561A (en) Emotion recognition method, device, equipment and computer readable storage medium
WO2021012495A1 (en) Method and device for verifying speech recognition result, computer apparatus, and medium
CN109408175B (en) Real-time interaction method and system in general high-performance deep learning calculation engine
CN110782916B (en) Multi-mode complaint identification method, device and system
CN116844573A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
CN110910905A (en) Mute point detection method and device, storage medium and electronic equipment
CN113160823B (en) Voice awakening method and device based on impulse neural network and electronic equipment
CN111899718B (en) Method, apparatus, device and medium for recognizing synthesized speech
CN114266941A (en) Method for rapidly detecting annotation result data of image sample
CN115238805B (en) Training method of abnormal data recognition model and related equipment
CN110795940B (en) Named entity identification method, named entity identification system and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 18934352; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 18934352; Country of ref document: EP; Kind code of ref document: A1)