CN113674766A - Voice evaluation method and device, computer equipment and storage medium - Google Patents

Voice evaluation method and device, computer equipment and storage medium

Info

Publication number
CN113674766A
CN113674766A (application number CN202110949121.0A)
Authority
CN
China
Prior art keywords
voice
agent
speech
data
voice data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110949121.0A
Other languages
Chinese (zh)
Inventor
杨万强
吴贵丹
王胜煜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Fu Shen Lan Software Co ltd
Original Assignee
Shanghai Fu Shen Lan Software Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Fu Shen Lan Software Co ltd filed Critical Shanghai Fu Shen Lan Software Co ltd
Priority to CN202110949121.0A priority Critical patent/CN113674766A/en
Publication of CN113674766A publication Critical patent/CN113674766A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/26 Speech to text systems
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/60 Speech or voice analysis techniques for measuring the quality of voice signals
    • G10L25/63 Speech or voice analysis techniques for estimating an emotional state

Abstract

The invention relates to the field of artificial intelligence and discloses a voice evaluation method and device, computer equipment, and a storage medium. The method comprises the following steps: acquiring voice data of an agent; analyzing the voice data through a recognition system to generate an utterance text; extracting voice features of the agent from the voice data through a preset parsing algorithm; processing the utterance text through a preset semantic analysis model to generate semantic analysis data; calculating the phoneme likelihood of the voice data according to the voice features; and generating a voice evaluation index of the agent according to the semantic analysis data and the phoneme likelihood. The invention can analyze the agent's voice evaluation index so that the agent can know his or her own speaking state in real time, adjust his or her emotions accordingly, and improve his or her communication ability.

Description

Voice evaluation method and device, computer equipment and storage medium
Technical Field
The invention relates to the field of artificial intelligence, in particular to a voice evaluation method and device, computer equipment and a storage medium.
Background
At present, many voice interaction tools on the market can realize voice interaction between humans and machines. However, such tools are either customer-oriented or serve as functional assistants; they do not improve the personal abilities of professional service staff. Professional service personnel, such as insurance agents, often need to communicate with customers without relying on a voice interaction tool.
Therefore, a voice evaluation method is needed that can analyze an agent's voice evaluation index, so that the agent can know his or her speaking state in real time, adjust his or her emotions, and improve his or her communication ability.
Disclosure of Invention
In view of the above, it is desirable to provide a voice evaluation method, apparatus, computer device, and storage medium for solving the above-mentioned problems.
A speech assessment method comprising:
acquiring voice data of an agent;
analyzing the voice data through a recognition system to generate an utterance text; extracting voice characteristics of the agent from the voice data through a preset analysis algorithm;
processing the utterance text through a preset semantic analysis model to generate semantic analysis data; calculating the phoneme likelihood of the voice data according to the voice features;
and generating a voice evaluation index of the agent according to the semantic analysis data and the phoneme likelihood.
A speech evaluation device comprising:
the acquisition module is used for acquiring voice data of the agent;
the characteristic analysis module is used for analyzing the voice data through a recognition system to generate an utterance text; extracting voice characteristics of the agent from the voice data through a preset analysis algorithm;
the analysis result obtaining module is used for processing the utterance text through a preset semantic analysis model to generate semantic analysis data; calculating the phoneme likelihood of the voice data according to the voice features;
and the evaluation index generation module is used for generating the voice evaluation index of the agent according to the semantic analysis data and the phoneme likelihood.
A computer device comprising a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, the processor implementing the speech assessment method when executing the computer readable instructions.
One or more readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the speech assessment method described above.
In the voice evaluation method, device, computer equipment, and storage medium, the voice data of the agent is first acquired, providing the data from which the voice evaluation index is analyzed. The voice data is analyzed through a recognition system to generate an utterance text, and the voice features of the agent are extracted from the voice data through a preset parsing algorithm; that is, the voice data is converted into the utterance text for semantic analysis on the one hand, and voice features are extracted from it for voice analysis on the other hand. The utterance text is processed through a preset semantic analysis model to generate semantic analysis data, and the phoneme likelihood of the voice data is calculated according to the voice features; through semantic analysis and phoneme likelihood calculation, the semantics expressed by the agent can be quickly obtained, and the agent's emotion and speaking speed can be recognized. Finally, the voice evaluation index of the agent is generated according to the semantic analysis data and the phoneme likelihood; the voice evaluation index includes the agent's speaking speed and emotion. The invention can analyze the agent's voice evaluation index so that the agent can know his or her own speaking state in real time, adjust his or her emotions accordingly, and improve his or her communication ability.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a schematic diagram of an application environment of a speech evaluation method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a speech evaluation method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a voice evaluation apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a computer device according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The voice evaluation method provided by this embodiment can be applied to the application environment shown in fig. 1, in which the client communicates with the server. The client includes, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. The server can be implemented by an independent server or a server cluster composed of a plurality of servers.
The voice evaluation method provided by this embodiment can be applied to an intelligent training (sparring) system for insurance agents. Through the method provided by this embodiment, the system can learn the agent's voice expression ability and train the agent in a targeted manner.
In an embodiment, as shown in fig. 2, a speech evaluation method is provided, which is described by taking the method applied to the server in fig. 1 as an example, and includes the following steps S10-S40.
And S10, acquiring voice data of the agent.
Understandably, an agent may refer to an insurance agent or a practitioner of other service industries. Other service industries herein include, but are not limited to, the financial industry, the consumer goods industry, the educational industry, the property management industry.
Voice data may refer to voice data generated while an agent trains by himself or interacts with a customer. In some cases, the voice data may also be data used specifically for training; in that case the speaker of the voice data is not necessarily the agent himself, but someone playing the role of the agent.
In some examples, the voice data may be data generated by preprocessing. The preprocessing steps include, but are not limited to, screening, denoising, and amplification.
S20, analyzing the voice data through a recognition system to generate an utterance text; and extracting the voice characteristics of the agent from the voice data through a preset analysis algorithm.
Understandably, the recognition system is used to convert the voice data into text data, i.e., the utterance text. Several proofreading tools are preset in the recognition system to reduce incorrectly transcribed characters in the text data.
The preset analysis algorithm can be set according to actual needs. The pre-set parsing algorithm may convert the voice data into more easily processed voice features. Herein, speech features include, but are not limited to, MFCC (Mel-Frequency Cepstral Coefficients) features.
S30, processing the utterance text through a preset semantic analysis model to generate semantic analysis data; and calculating the phonon likelihood of the voice data according to the voice characteristics.
Understandably, the preset semantic analysis model can be set according to actual needs. In one example, the preset semantic analysis model may compute several kinds of semantic information from the utterance text, forming the semantic analysis data. Here, the semantic analysis data includes, but is not limited to, preset keywords and their occurrence frequency, the business scenario, the business process, and the commendatory or derogatory sense of words or sentences.
The phoneme likelihood refers to the product of the probability values of several speech segments. Typically, the voice data is divided into a large number of speech segments. If the product of the probability values is calculated directly, the product becomes smaller and smaller, resulting in numerical underflow. Therefore, the phoneme likelihood can be expressed as a sum of logarithms.
And S40, generating the voice evaluation index of the agent according to the semantic analysis data and the phoneme likelihood.
Understandably, the voice evaluation index of the agent can be obtained by combining the phoneme likelihood and the semantic analysis data. The voice evaluation index includes, but is not limited to, the agent's speaking speed and emotion. The voice evaluation index can be displayed visually to the agent through the front-end interface, so that the agent can know his or her speaking state in real time from the feedback result (the voice evaluation index) and adjust his or her emotions in time.
In some examples, a deep learning (DL) neural network may be used to learn from the agent's voice data, capturing the intrinsic patterns and representation hierarchy of the data. The neural network supports the addition and modification of voice data. The voice data does not grow merely because agents keep advancing through the "insurance agent intelligent training system"; instead, new real conversations can be added to the voice data, which greatly improves its extensibility. Newly added voice data only needs simple data labeling before it can be used to train the neural network, so the accuracy of the intelligent training system for insurance agents is continuously improved.
Besides improving the accuracy of the system by adding voice data, the method can also add new data categories by extending the types of voice data and, when new performance requirements arise, modify the types of output models in the algorithm, thereby producing new voice evaluation indexes. This embodiment allows the system to be modified for different environments and different requirements, which greatly improves the extensibility of the system's algorithms and facilitates system development.
In steps S10-S40, the voice data of the agent is acquired, providing the data from which the voice evaluation index is analyzed. The voice data is analyzed through a recognition system to generate an utterance text, and the voice features of the agent are extracted from the voice data through a preset parsing algorithm; that is, the voice data is converted into the utterance text for semantic analysis on the one hand, and voice features are extracted from it for voice analysis on the other hand. The utterance text is processed through a preset semantic analysis model to generate semantic analysis data, and the phoneme likelihood of the voice data is calculated according to the voice features; through semantic analysis and phoneme likelihood calculation, the semantics expressed by the agent can be quickly obtained, and the agent's emotion and speaking speed can be recognized. Finally, the voice evaluation index of the agent is generated according to the semantic analysis data and the phoneme likelihood; the voice evaluation index includes the agent's speaking speed and emotion. This embodiment can analyze the agent's voice evaluation index so that the agent can know his or her own speaking state in real time, adjust his or her emotions accordingly, and improve his or her communication ability.
Optionally, after step S40, that is, after generating the voice evaluation index of the agent according to the semantic analysis data and the phoneme likelihood, the method further includes:
s50, sending the voice evaluation index to an output device associated with the agent, so that the agent receives the voice evaluation index through the output device;
s60, generating a customer response parameter according to the voice evaluation index;
and S70, generating a simulated customer voice for interacting with the agent according to the customer response parameters.
Understandably, the output device associated with the agent may be a display screen connected to a personal computer, a mobile phone screen, or a speaker. Through the output device, the voice evaluation index can be presented as text, a picture, or speech and then received by the agent.
The customer response parameters may be generated based on the agent's voice evaluation index. Here, the customer response parameters include, but are not limited to, a dialog content parameter and a dialog voice parameter. The dialog content parameter is used to set the dialog content of the simulated customer voice. The dialog voice parameter is used to set voice parameters such as the speed and tone of the simulated customer voice.
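As an illustrative sketch only (the field names and the mapping rule below are assumptions, not part of the disclosure), the customer response parameters could be represented and derived from the voice evaluation index as follows:

```python
from dataclasses import dataclass

@dataclass
class CustomerResponseParams:
    """Hypothetical container for the customer response parameters."""
    dialog_content: str   # dialog content parameter: what the simulated customer says
    speech_rate: float    # dialog voice parameter: speaking speed of the simulated voice
    pitch: float          # dialog voice parameter: tone/pitch of the simulated voice

def build_response_params(agent_speech_rate: float, agent_emotion: str) -> CustomerResponseParams:
    """Toy mapping from the agent's voice evaluation index to response parameters."""
    # If the agent sounds rushed or negative, the simulated customer slows down
    # and asks a clarifying question; otherwise it lets the conversation continue.
    if agent_emotion == "negative" or agent_speech_rate > 0.8:
        return CustomerResponseParams("Could you explain that more slowly?", 0.9, 0.95)
    return CustomerResponseParams("That sounds good, please continue.", 1.0, 1.0)
```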
Optionally, in step S20, the extracting, by using a preset parsing algorithm, the voice feature of the agent from the voice data includes:
s201, processing the voice data into voice fragments with preset specifications;
s202, pre-emphasis and windowing are carried out on the voice fragment to obtain a windowed signal; calculating an energy coefficient of the voice segment;
s203, performing discrete Fourier transform on the windowed signal to obtain a transform result;
s204, processing the transformation result through a Mel filter to obtain a Mel sound spectrum;
s205, performing inverse Fourier transform on the Mel sound spectrum to obtain a cepstrum coefficient;
s206, calculating a first order difference cepstrum coefficient and a second order difference cepstrum coefficient of the voice data according to the cepstrum coefficient; calculating a first order difference energy coefficient and a second order difference energy coefficient of the voice data according to the energy coefficient;
s207, generating the voice feature according to the cepstrum coefficient, the first-order difference cepstrum coefficient, the second-order difference cepstrum coefficient, the energy coefficient, the first-order difference energy coefficient and the second-order difference energy coefficient.
Understandably, the preset specification refers to a time frame of a speech segment, typically 10ms, 15ms or 20 ms. The acoustic waveform of the voice data is sampled according to the preset specification of the voice segment, and corresponding voice spectrum characteristics (namely voice characteristics) are generated. The windows of each time frame are represented by vectors, each of which contains about 39 features to represent information of the sound spectrum and information of the energy magnitude and the spectral variation.
Pre-emphasis of the speech segments is required, i.e., the energy in the high-frequency band is boosted. Due to the characteristics of the glottal pulse, the spectrum tilts downward as the high-frequency energy in the sound falls off. Emphasizing the energy at the high-frequency end makes the information in the higher formants more useful to the acoustic model and improves the accuracy of phone detection. The filter used for pre-emphasis may be a first-order high-pass filter.
The speech segments are then windowed. Because the speech data is a non-stationary signal, the change of the sound spectrum is very fast in the whole speech or conversation, and it is relatively difficult to extract the sound spectrum feature from the whole speech. Thus, windowing may be employed to extract the spectral features. It is assumed that the speech signal within this window is stationary. In MFCC extraction, a Hamming window may be used. The hamming window shrinks the signal values to zero at the boundaries of the window, thereby avoiding signal discontinuities.
Assuming that the window length is L frames, the Hamming window formula is as follows:
w[n] = 0.54 - 0.46 cos(2πn/L), for 0 ≤ n ≤ L-1; w[n] = 0 otherwise
where n is the time index, w[n] is the window value at time n, and L is the window length.
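The following Python sketch illustrates pre-emphasis followed by framing and Hamming windowing. It is a minimal illustration of the step described above; the pre-emphasis coefficient 0.97, the frame length, and the hop size are common choices assumed here rather than values given in the text.

```python
import numpy as np

def pre_emphasize(x: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """First-order high-pass pre-emphasis: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def hamming_window(L: int) -> np.ndarray:
    """Hamming window w[n] = 0.54 - 0.46 * cos(2*pi*n / L), n = 0..L-1."""
    n = np.arange(L)
    return 0.54 - 0.46 * np.cos(2.0 * np.pi * n / L)

def frame_and_window(signal: np.ndarray, frame_len: int, hop: int) -> np.ndarray:
    """Split the signal into overlapping frames and apply the Hamming window.

    Assumes len(signal) >= frame_len; returns one windowed frame per row.
    """
    window = hamming_window(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    return frames * window

# Example: 16 kHz audio, 20 ms frames (320 samples) with a 10 ms hop (160 samples)
# windowed = frame_and_window(pre_emphasize(audio), frame_len=320, hop=160)
```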
A discrete Fourier transform is then performed on the windowed signal to obtain the transform result. Specifically, the windowed signal x[n] is transformed into a complex number X[k] for each of N discrete frequency bands. The calculation process of the discrete Fourier transform can be found in the existing literature and is not described in detail here; it makes use of Euler's formula from Fourier analysis.
The formula for the discrete fourier transform is as follows:
X[k] = Σ_{n=0}^{N-1} x[n] e^{-j2πkn/N}, k = 0, 1, 2, ..., N-1
wherein X[k] is the complex value giving the amplitude and phase of the k-th frequency component of the original signal after the discrete Fourier transform;
x[n] is the windowed signal at time n;
j is the imaginary unit.
Euler's formula is as follows:
e^{jθ} = cos θ + j sin θ
where θ is any real number and j is the imaginary unit.
The transform result may then be passed through a mel filter bank to obtain a mel spectrum. Specifically, the transform result gives the amount of energy in each frequency band. By definition, two sounds that are perceptually equidistant in pitch are separated by the same number of mels. Below 1000 Hz, the mapping between frequency in Hz and the mel scale is linear; above 1000 Hz, the mapping is logarithmic.
The mel-scale value can be calculated from the raw acoustic frequency:
mel(f) = 1127 ln(1 + f/700)
wherein mel(f) is the frequency on the mel scale;
f is the raw frequency in Hz.
In the calculation, a filter bank may be established to implement this mapping. The filter bank collects energy from each frequency band: the 10 filters covering the bands below 1000 Hz are spaced linearly, while the filters above 1000 Hz are spaced logarithmically. Finally, the logarithm of each filter output is taken to obtain the mel-scale spectral values (i.e., the mel spectrum).
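The sketch below shows one common way to build such a mel filter bank in Python and apply it to the DFT power spectrum of a windowed frame. For simplicity it spaces the triangular filters uniformly on the mel scale (which is approximately linear below 1000 Hz and logarithmic above it) rather than implementing the exact 10-linear-filter split described above; the filter count and FFT size are assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 1127.0 * np.log(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (np.exp(m / 1127.0) - 1.0)

def mel_filterbank(n_filters: int, n_fft: int, sample_rate: int) -> np.ndarray:
    """Triangular filters whose centres are equally spaced on the mel scale."""
    low, high = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    hz_points = mel_to_hz(np.linspace(low, high, n_filters + 2))  # filter edge frequencies
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)
    return fbank

# Usage on one windowed frame (windowed_frame is a hypothetical frame from the previous step):
# power_spectrum = np.abs(np.fft.rfft(windowed_frame, n=512)) ** 2
# log_mel = np.log(mel_filterbank(26, 512, 16000) @ power_spectrum + 1e-10)
```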
An inverse Fourier transform may then be performed on the mel spectrum to obtain the cepstral coefficients; the cepstrum is computed from the log spectrum. To reduce computational complexity, the effects of pre-emphasis and mel warping can be ignored at this stage, and only the first 12 cepstral coefficients are kept. In this embodiment the cepstral coefficients are used not only to recognize speech but also to recognize the agent's emotion; therefore the higher-order cepstral coefficients, which carry pitch information, can also be used.
For a windowed frame x[n] of the speech data, the cepstral coefficients are:
c[n] = Σ_{k=0}^{N-1} log(|X[k]|) e^{j2πkn/N}, n = 0, 1, 2, ..., N-1
wherein c[n] is the inverse Fourier transform of the log magnitude spectrum of the windowed frame x[n], i.e., the cepstral coefficient;
X[k] is the discrete Fourier transform of x[n], the sampled (windowed) signal;
j is the imaginary unit.
When the cepstrum is extracted using the inverse discrete Fourier transform, there are 12 cepstral coefficients per frame. An energy coefficient may be added to characterize the energy of the frame. The energy coefficient is correlated with the identification of phones, and it may also be used to detect the agent's emotions.
The energy coefficient refers to the sum of sample powers of a certain frame in a certain time period, and the formula is as follows:
Energy = Σ_{t=t1}^{t2} x[t]²
where x is the signal and the sum is taken over the window from time t1 to time t2.
The speech signal is not constant from one frame to the next. Changes such as the change in formant slope at a transition, or the change from a stop closure to its release burst, can provide useful cues for phone detection. Therefore, features that capture the temporal variation of the cepstral features can be added.
A Delta feature and a double Delta feature are added for each of the 13 features. The Delta feature is computed as a difference between neighbouring frames. For a given time t, the formula is as follows:
d(t) = (c(t+1) - c(t-1)) / 2
wherein d(t) is the Delta feature (i.e., the first-order difference cepstral coefficient) at time t;
c(t+1) is the cepstral coefficient at time t+1;
c(t-1) is the cepstral coefficient at time t-1.
The double Delta characteristic is a second-order difference cepstrum coefficient.
In one example, the generated voice features comprise 39 MFCC features: 12 cepstral coefficients, 12 first-order difference cepstral coefficients (Delta features), 12 second-order difference cepstral coefficients (double Delta features), 1 energy coefficient, 1 first-order difference energy coefficient (Delta energy coefficient), and 1 second-order difference energy coefficient (double Delta energy coefficient).
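A minimal Python sketch of how the 39-dimensional feature vector could be assembled from the per-frame cepstra and energy computed above; the edge padding used in the delta computation is an implementation assumption.

```python
import numpy as np

def delta(feats: np.ndarray) -> np.ndarray:
    """First-order difference d(t) = (c(t+1) - c(t-1)) / 2, with edge frames repeated."""
    padded = np.pad(feats, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

def assemble_mfcc_39(cepstra: np.ndarray, energy: np.ndarray) -> np.ndarray:
    """Stack 12 cepstra + 1 energy with their deltas and double deltas -> 39 dims per frame.

    cepstra: array of shape (n_frames, 12), cepstral coefficients
    energy:  array of shape (n_frames,),    per-frame energy coefficient
    """
    base = np.hstack([cepstra, energy[:, None]])  # 13 base features
    d1 = delta(base)                              # 13 Delta features
    d2 = delta(d1)                                # 13 double-Delta features
    return np.hstack([base, d1, d2])              # 39-dimensional voice feature
```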
In some examples, to facilitate debugging during training, feature extraction may be performed on the entire data set in advance, and the extracted voice features are then stored on additional hard disk space. When the model is trained, the extracted voice feature files are read directly. This approach saves a great deal of time during the training process. Furthermore, when several models are trained for comparison, feature extraction does not have to be repeated, and when a problem occurs it is easy to check whether the problem was caused by feature extraction.
The voice features can be normalized by first extracting features from a small portion of the data, computing a mean vector and a standard-deviation vector from those features, and storing the two vectors; when features are extracted from the complete data set, these two vectors are used to normalize the voice features.
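A small sketch of this normalization step in Python; the file names used for storing the two vectors are assumptions.

```python
import numpy as np

def fit_normalizer(subset_features: np.ndarray):
    """Compute and store the mean and standard-deviation vectors from a small data subset."""
    mean = subset_features.mean(axis=0)
    std = subset_features.std(axis=0) + 1e-8   # guard against zero variance
    np.save("feature_mean.npy", mean)
    np.save("feature_std.npy", std)
    return mean, std

def normalize(features: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Apply the stored vectors when extracting features from the complete data set."""
    return (features - mean) / std
```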
In other examples, data augmentation may be used to increase the sample size of the voice data. Data augmentation includes, but is not limited to, the following four methods. 1. Change the volume of the voice signal, so that the preset analysis algorithm is more robust to signals of different volumes. 2. Change the sampling rate of the voice signal; considering that traditional telephone lines transmit signals at a sampling rate of 8000 Hz, changing the sampling rate of the training data allows the model to work stably in more scenarios. 3. Perturb the speed of the voice signal to simulate different speaking speeds. 4. Perturb the fundamental frequency of the voice signal.
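One possible way to realize these four augmentations in Python, using librosa as the audio toolkit; the library choice and the perturbation ranges are assumptions for illustration.

```python
import numpy as np
import librosa

def augment(y: np.ndarray, sr: int) -> dict:
    """Produce four augmented variants of a speech signal y sampled at sr Hz."""
    return {
        # 1. change the volume of the signal
        "volume": y * np.random.uniform(0.5, 1.5),
        # 2. change the sampling rate, e.g. simulate an 8000 Hz telephone line
        "resampled": librosa.resample(y, orig_sr=sr, target_sr=8000),
        # 3. perturb the speed to simulate different speaking speeds
        "speed": librosa.effects.time_stretch(y, rate=float(np.random.uniform(0.9, 1.1))),
        # 4. perturb the fundamental frequency (pitch)
        "pitch": librosa.effects.pitch_shift(y, sr=sr, n_steps=float(np.random.uniform(-2, 2))),
    }
```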
Optionally, in step S30, the calculating the phoneme likelihood of the speech data according to the speech features includes:
S301, processing the speech features through a phoneme likelihood calculation model to generate the phoneme likelihood, wherein the phoneme likelihood calculation model comprises:
log b_j(o_t) = -1/2 Σ_{d=1}^{D} [ log(2π σ_jd²) + (o_td - μ_jd)² / σ_jd² ]
wherein log b_j(o_t) represents the logarithm of the probability value of the feature vector o_t at a particular state j;
d indexes the dimensions of the feature vector, D being the total number of dimensions;
σ_jd² represents the variance of the particular state j in dimension d;
μ_jd represents the mean value of the particular state j in dimension d.
Understandably, calculating the likelihood of a complete sentence requires multiplying many small probability values, and multiplying many probabilities produces an ever smaller result, leading to numerical underflow. Therefore, the phoneme likelihood is calculated using logarithms. When log probabilities are used, the probabilities are no longer multiplied; their logarithms are added instead, which also speeds up the calculation.
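A minimal Python sketch of the log-likelihood computation described above, assuming a diagonal-covariance Gaussian for each state; the per-frame log values are summed so that no product of small probabilities is ever formed.

```python
import numpy as np

def log_phone_likelihood(o_t: np.ndarray, mu: np.ndarray, var: np.ndarray) -> float:
    """log b_j(o_t) for one frame: feature vector o_t, state mean mu, state variance var (all D-dimensional)."""
    return float(-0.5 * np.sum(np.log(2.0 * np.pi * var) + (o_t - mu) ** 2 / var))

def utterance_log_likelihood(frames: np.ndarray, mu: np.ndarray, var: np.ndarray) -> float:
    """Sum the per-frame log likelihoods; adding logs avoids the underflow of multiplying probabilities."""
    return sum(log_phone_likelihood(f, mu, var) for f in frames)
```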
Optionally, in step S30, the processing the utterance text by the preset semantic analysis model to generate semantic analysis data includes:
s302, recognizing a plurality of prompt phrases in the utterance text;
s303, dividing the utterance text into a plurality of utterance sections according to the plurality of prompt phrases;
s304, analyzing the association relation among the continuous speaking segments according to the prompt phrase, and determining the category of the association relation;
s305, generating the semantic analysis data according to the incidence relation and the category.
Understandably, several keywords can be preset, and the prompt phrases (a prompt phrase is one of the keywords or a synonym of a keyword) can be identified by keyword matching. The utterance text can then be divided into several utterance segments according to the text positions of the prompt phrases in the utterance text. The association between consecutive utterance segments can be resolved from the prompt phrases. In some examples, the prompt phrases may be logical conjunctions; for example, the prompt phrase of a first utterance segment may include "although" and the prompt phrase of a second utterance segment may include "but", in which case the association between the first and second utterance segments is an adversative (turning) relationship. The categories of association relationships can be set according to actual needs; for example, they may be divided based on the commendatory or derogatory sense of words, or based on the business scenario. The semantic analysis data finally generated includes the above associations and their categories.
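The following Python sketch illustrates the prompt-phrase matching, segmentation, and relation labelling described above. The phrase list and category names are illustrative assumptions; in practice they would come from the preset keyword configuration.

```python
import re

# Hypothetical prompt phrases mapped to the relation category they signal.
PROMPT_PHRASES = {
    "although": "adversative", "but": "adversative",
    "because": "causal", "therefore": "causal",
}

def split_into_segments(text: str):
    """Split the utterance text at each prompt phrase, keeping the phrase that opens each segment."""
    pattern = r"\b(" + "|".join(map(re.escape, PROMPT_PHRASES)) + r")\b"
    parts = re.split(pattern, text, flags=re.IGNORECASE)
    segments, current_phrase = [], None
    for part in parts:
        if part.lower() in PROMPT_PHRASES:
            current_phrase = part.lower()
        elif part.strip():
            segments.append((current_phrase, part.strip()))
    return segments

def analyse_relations(segments):
    """Label the association between each pair of consecutive utterance segments."""
    relations = []
    for (_, prev_text), (phrase, curr_text) in zip(segments, segments[1:]):
        relations.append({
            "previous": prev_text,
            "current": curr_text,
            "category": PROMPT_PHRASES.get(phrase, "unspecified"),
        })
    return relations

# Example: analyse_relations(split_into_segments(
#     "Although the premium is higher, but the coverage is much broader"))
# -> one relation with category "adversative"
```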
Optionally, in step S10, the obtaining voice data of the agent includes:
s101, acquiring original voice data of an agent;
s102, recognizing the original voice data through a human voice detection model to obtain a first recognition result of the original voice data;
s103, recognizing the first recognition result as original voice data containing human voice through a voiceprint recognition model to obtain a second recognition result;
s104, screening the voice data from the original voice data according to the second recognition result.
Understandably, the original voice data refers to voice data that has not been preprocessed. Its quality is generally poor, and it contains a fair amount of erroneous data, such as blank audio or audio that does not match the identity of the agent.
The voice detection model is used for checking whether the original voice data contains voice or not and generating a first recognition result. In the first recognition result, the original voice data is distinguished into "containing a voice" and "not containing a voice".
To prevent mislabeled data from entering the voice data, the original voice data identified as containing human voice may be filtered with a voiceprint recognition model. In the set of original voice data containing human voice, an embedding is computed for every audio clip of each speaker, and the embedding center of all clips belonging to the same individual is then found. For the same speaker, based on the distance between each clip's embedding and the embedding center, clips whose distance exceeds a preset embedding threshold are labeled "not self" and clips whose distance does not exceed the threshold are labeled "self"; this constitutes the second recognition result. The data carrying the "self" label is then screened out of the original voice data, giving the preprocessed voice data.
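A minimal Python sketch of this screening step, assuming speaker embeddings have already been extracted by some voiceprint model (the embedding extraction itself is outside the scope of this sketch):

```python
import numpy as np

def screen_by_voiceprint(embeddings: np.ndarray, clips: list, threshold: float):
    """Keep only clips whose embedding is close to the speaker's embedding center.

    embeddings: (n_clips, dim) array, one voiceprint embedding per clip of the same nominal speaker
    clips:      the corresponding audio clips (any payload)
    threshold:  preset embedding distance threshold
    """
    center = embeddings.mean(axis=0)                          # embedding center for this speaker
    distances = np.linalg.norm(embeddings - center, axis=1)
    labels = ["self" if d <= threshold else "not self" for d in distances]  # second recognition result
    kept = [clip for clip, label in zip(clips, labels) if label == "self"]  # screened voice data
    return kept, labels
```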
Optionally, the voice evaluation index includes at least one of a speed of speech, an emotion and a voice accuracy.
Understandably, the voice evaluation indexes generated based on the semantic analysis data and the phoneme likelihood include, but are not limited to, speaking speed, emotion, and voice accuracy. These voice evaluation indexes objectively reflect the agent's vocal expression, so that the agent can know his or her speaking state in real time, adjust his or her emotions, and improve his or her communication ability.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In one embodiment, a voice evaluation apparatus is provided, which corresponds to the voice evaluation method in the above embodiments one to one. As shown in fig. 3, the speech evaluation apparatus includes an acquisition module 10, a feature analysis module 20, an analysis result obtaining module 30, and an evaluation index generating module 40. The functional modules are explained in detail as follows:
the acquisition module 10 is used for acquiring voice data of the agent;
a feature analysis module 20, configured to analyze the voice data through a recognition system to generate an utterance text; extracting voice characteristics of the agent from the voice data through a preset analysis algorithm;
an analysis result obtaining module 30, configured to process the utterance text through a preset semantic analysis model to generate semantic analysis data, and to calculate the phoneme likelihood of the voice data according to the voice features;
and the evaluation index generation module 40 is used for generating the voice evaluation index of the agent according to the semantic analysis data and the phoneme likelihood.
Optionally, the speech evaluation apparatus further includes:
the output index module is used for sending the voice evaluation index to output equipment associated with the agent so that the agent receives the voice evaluation index through the output equipment;
the response parameter module is used for generating a customer response parameter according to the voice evaluation index;
and the simulated voice module is used for generating simulated customer voice for interacting with the agent according to the customer response parameters.
Optionally, the feature analysis module 20 includes:
the segment segmentation unit is used for processing the voice data into voice segments with preset specifications;
the segment processing unit is used for carrying out pre-emphasis and windowing processing on the voice segment to obtain a windowed signal; calculating an energy coefficient of the voice segment;
the transforming unit is used for carrying out discrete Fourier transform on the windowed signal to obtain a transform result;
a mel filtering unit for processing the transform result by a mel filter to obtain a mel sound spectrum;
the inverse transformation unit is used for carrying out inverse Fourier transform on the Mel sound spectrum to obtain a cepstrum coefficient;
the coefficient calculation unit is used for calculating a first-order difference cepstrum coefficient and a second-order difference cepstrum coefficient of the voice data according to the cepstrum coefficient; calculating a first order difference energy coefficient and a second order difference energy coefficient of the voice data according to the energy coefficient;
and the voice feature generation unit is used for generating the voice feature according to the cepstrum coefficient, the first-order difference cepstrum coefficient, the second-order difference cepstrum coefficient, the energy coefficient, the first-order difference energy coefficient and the second-order difference energy coefficient.
Optionally, the evaluation index generation module 40 includes:
a likelihood calculation unit configured to process the speech features through a phoneme likelihood calculation model to generate the phoneme likelihood, the phoneme likelihood calculation model including:
log b_j(o_t) = -1/2 Σ_{d=1}^{D} [ log(2π σ_jd²) + (o_td - μ_jd)² / σ_jd² ]
wherein log b_j(o_t) represents the logarithm of the probability value of the feature vector o_t at a particular state j;
d indexes the dimensions of the feature vector, D being the total number of dimensions;
σ_jd² represents the variance of the particular state j in dimension d;
μ_jd represents the mean value of the particular state j in dimension d.
Optionally, the evaluation index generation module 40 includes:
the phrase recognition unit is used for recognizing a plurality of prompt phrases in the utterance text;
the utterance section dividing unit is used for dividing the utterance text into a plurality of utterance sections according to the plurality of prompt phrases;
a relation and category determining unit for analyzing the incidence relation between the continuous speech segments according to the prompt phrase and determining the category of the incidence relation;
and the semantic analysis data generating unit is used for generating the semantic analysis data according to the incidence relation and the category.
Optionally, the obtaining module 10 includes:
the system comprises an original data acquisition unit, a voice recognition unit and a voice recognition unit, wherein the original data acquisition unit is used for acquiring original voice data of an agent;
the first recognition unit is used for recognizing the original voice data through a human voice detection model and acquiring a first recognition result of the original voice data;
the second identification unit is used for identifying the original voice data of which the first identification result contains the voice through a voiceprint identification model to obtain a second identification result;
and the screening unit is used for screening the voice data from the original voice data according to the second recognition result.
Optionally, the voice evaluation index includes at least one of a speed of speech, an emotion and a voice accuracy.
For the specific limitations of the speech evaluation device, reference may be made to the above limitations of the speech evaluation method, which are not described herein again. The modules in the voice evaluation device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 4. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a readable storage medium and an internal memory. The readable storage medium stores an operating system, computer readable instructions, and a database. The internal memory provides an environment for the operating system and execution of computer-readable instructions in the readable storage medium. The database of the computer device is used for storing data related to the voice evaluation method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer readable instructions, when executed by a processor, implement a speech assessment method. The readable storage media provided by the present embodiment include nonvolatile readable storage media and volatile readable storage media.
In one embodiment, a computer device is provided, comprising a memory, a processor, and computer readable instructions stored on the memory and executable on the processor, the processor when executing the computer readable instructions implementing the steps of:
acquiring voice data of an agent;
analyzing the voice data through a recognition system to generate an utterance text; extracting voice characteristics of the agent from the voice data through a preset analysis algorithm;
processing the utterance text through a preset semantic analysis model to generate semantic analysis data; calculating the phoneme likelihood of the voice data according to the voice features;
and generating a voice evaluation index of the agent according to the semantic analysis data and the phoneme likelihood.
In one embodiment, one or more computer-readable storage media storing computer-readable instructions are provided, the readable storage media provided by the embodiments including non-volatile readable storage media and volatile readable storage media. The readable storage medium has stored thereon computer readable instructions which, when executed by one or more processors, perform the steps of:
acquiring voice data of an agent;
analyzing the voice data through a recognition system to generate an utterance text; extracting voice characteristics of the agent from the voice data through a preset analysis algorithm;
processing the utterance text through a preset semantic analysis model to generate semantic analysis data; calculating the phoneme likelihood of the voice data according to the voice features;
and generating a voice evaluation index of the agent according to the semantic analysis data and the phoneme likelihood.
It will be understood by those of ordinary skill in the art that all or part of the processes of the methods of the above embodiments may be implemented by hardware related to computer readable instructions, which may be stored in a non-volatile readable storage medium or a volatile readable storage medium, and when executed, the computer readable instructions may include processes of the above embodiments of the methods. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A speech evaluation method, comprising:
acquiring voice data of an agent;
analyzing the voice data through a recognition system to generate an utterance text; extracting voice characteristics of the agent from the voice data through a preset analysis algorithm;
processing the utterance text through a preset semantic analysis model to generate semantic analysis data; calculating the phoneme likelihood of the voice data according to the voice features;
and generating a voice evaluation index of the agent according to the semantic analysis data and the phoneme likelihood.
2. The speech assessment method of claim 1, wherein after generating the speech assessment index for the agent based on the semantic analysis data and the phoneme likelihood, further comprising:
sending the voice evaluation index to an output device associated with the agent so that the agent receives the voice evaluation index through the output device;
generating a customer response parameter according to the voice evaluation index;
generating a simulated customer voice for interacting with the agent based on the customer response parameters.
3. The speech assessment method of claim 1, wherein said extracting the speech features of the agent from the speech data by a predetermined parsing algorithm comprises:
processing the voice data into voice fragments with preset specifications;
pre-emphasis and windowing are carried out on the voice segments to obtain windowed signals; calculating an energy coefficient of the voice segment;
performing discrete Fourier transform on the windowed signal to obtain a transform result;
processing the transformation result through a Mel filter to obtain a Mel sound spectrum;
performing inverse Fourier transform on the Mel sound spectrum to obtain cepstrum coefficients;
calculating a first order difference cepstrum coefficient and a second order difference cepstrum coefficient of the voice data according to the cepstrum coefficient; calculating a first order difference energy coefficient and a second order difference energy coefficient of the voice data according to the energy coefficient;
and generating the voice features according to the cepstrum coefficients, the first-order difference cepstrum coefficients, the second-order difference cepstrum coefficients, the energy coefficients, the first-order difference energy coefficients and the second-order difference energy coefficients.
4. The speech assessment method of claim 1, wherein said calculating a phoneme likelihood for the speech data from the speech features comprises:
processing the speech features through a phoneme likelihood computation model to generate the phoneme likelihood, the phoneme likelihood computation model including:
log b_j(o_t) = -1/2 Σ_{d=1}^{D} [ log(2π σ_jd²) + (o_td - μ_jd)² / σ_jd² ]
wherein log b_j(o_t) represents the logarithm of the probability value of the feature vector o_t at a particular state j;
d indexes the dimensions of the feature vector, D being the total number of dimensions;
σ_jd² represents the variance of the particular state j in dimension d;
μ_jd represents the mean value of the particular state j in dimension d.
5. The speech evaluation method of claim 1, wherein the processing the utterance text by a preset semantic analysis model to generate semantic analysis data comprises:
recognizing a plurality of prompt phrases in the speech text;
dividing the utterance text into a plurality of utterance sections according to the plurality of prompt phrases;
analyzing the incidence relation between continuous speech segments according to the prompt phrase and determining the category of the incidence relation;
and generating the semantic analysis data according to the incidence relation and the category.
6. The voice evaluation method of claim 1, wherein said obtaining voice data of the agent comprises:
acquiring original voice data of an agent;
recognizing the original voice data through a human voice detection model to obtain a first recognition result of the original voice data;
recognizing the first recognition result as original voice data containing human voice through a voiceprint recognition model to obtain a second recognition result;
and screening out the voice data from the original voice data according to the second recognition result.
7. The speech assessment method of claim 1, wherein the speech assessment indicators comprise at least one of speech rate, mood, and speech accuracy.
8. A speech evaluation device, comprising:
the acquisition module is used for acquiring voice data of the agent;
the characteristic analysis module is used for analyzing the voice data through a recognition system to generate an utterance text; extracting voice characteristics of the agent from the voice data through a preset analysis algorithm;
the analysis result obtaining module is used for processing the utterance text through a preset semantic analysis model to generate semantic analysis data; calculating the phoneme likelihood of the voice data according to the voice features;
and the evaluation index generation module is used for generating the voice evaluation index of the agent according to the semantic analysis data and the phoneme likelihood.
9. A computer device comprising a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer readable instructions, implements the speech assessment method of any one of claims 1 to 7.
10. One or more readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the speech evaluation method of any of claims 1-7.
CN202110949121.0A 2021-08-18 2021-08-18 Voice evaluation method and device, computer equipment and storage medium Pending CN113674766A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110949121.0A CN113674766A (en) 2021-08-18 2021-08-18 Voice evaluation method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110949121.0A CN113674766A (en) 2021-08-18 2021-08-18 Voice evaluation method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113674766A true CN113674766A (en) 2021-11-19

Family

ID=78543551

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110949121.0A Pending CN113674766A (en) 2021-08-18 2021-08-18 Voice evaluation method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113674766A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020065649A1 (en) * 2000-08-25 2002-05-30 Yoon Kim Mel-frequency linear prediction speech recognition apparatus and method
CN102509483A (en) * 2011-10-31 2012-06-20 苏州思必驰信息科技有限公司 Distributive automatic grading system for spoken language test and method thereof
CN103236260A (en) * 2013-03-29 2013-08-07 京东方科技集团股份有限公司 Voice recognition system
CN104834847A (en) * 2014-02-11 2015-08-12 腾讯科技(深圳)有限公司 Identity verification method and device
CN106847263A (en) * 2017-01-13 2017-06-13 科大讯飞股份有限公司 Speech level evaluation method and apparatus and system
CN106782521A (en) * 2017-03-22 2017-05-31 海南职业技术学院 A kind of speech recognition system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination