CN116564351B - Voice dialogue quality evaluation method and system and portable electronic equipment - Google Patents

Voice dialogue quality evaluation method and system and portable electronic equipment

Info

Publication number
CN116564351B
CN116564351B (Application CN202310345168.5A)
Authority
CN
China
Prior art keywords
voice
sub
segment
duration
module
Prior art date
Legal status
Active
Application number
CN202310345168.5A
Other languages
Chinese (zh)
Other versions
CN116564351A (en)
Inventor
秦思
Current Assignee
HUBEI UNIVERSITY OF ECONOMICS
Original Assignee
HUBEI UNIVERSITY OF ECONOMICS
Priority date
Filing date
Publication date
Application filed by HUBEI UNIVERSITY OF ECONOMICS filed Critical HUBEI UNIVERSITY OF ECONOMICS
Priority to CN202310345168.5A priority Critical patent/CN116564351B/en
Publication of CN116564351A publication Critical patent/CN116564351A/en
Application granted granted Critical
Publication of CN116564351B publication Critical patent/CN116564351B/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Quality & Reliability (AREA)
  • General Health & Medical Sciences (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a voice dialogue quality evaluation method and system and a portable electronic device, belonging to the technical field of voice quality evaluation. The method comprises the following steps: S100: analyzing the voice dialogue to be evaluated to obtain a first interaction attribute; S200: dividing the voice dialogue to be evaluated into a plurality of voice segment groups; S300: obtaining a second duration attribute of each voice sub-segment in each voice segment group; S400: determining at least one candidate sub-segment from each voice segment group; S500: training and updating at least one existing voice quality evaluation model to obtain an updated voice quality evaluation model; S600: inputting the voice sub-segments except the candidate sub-segments in each voice segment group into the updated voice quality evaluation model to obtain the voice quality evaluation score of the target person corresponding to each voice segment group. The invention can accurately produce no-reference quality scores for multi-person voice dialogues.

Description

Voice dialogue quality evaluation method and system and portable electronic equipment
Technical Field
The present invention relates to the field of speech quality evaluation technologies, and in particular, to a speech dialogue quality evaluation method, system, and portable electronic device.
Background
Sound is one of the main ways in which humans know and perceive the world. With the wide popularity of networks, network audio services have grown rapidly. When audio quality is poor, enhancement processing is needed to improve it. In the audio field, most currently popular audio evaluation platforms simply rely on one or two parameters of the audio as the criterion for evaluating audio quality. This is not reasonable in practice, because audio quality is related to the human auditory system, is influenced by many factors, and cannot be measured solely on the basis of one or two simple parameters.
As comprehensive quality evaluation systems have gradually matured, two methods for evaluating audio quality have evolved: subjective evaluation and objective evaluation. In the subjective evaluation method, testers are organized to listen to a series of audio sequences according to the International Telecommunication Union Telecommunication Standardization Sector (ITU-T) P.800 standard; the testers' ratings of the voice quality are averaged, and the final audio quality evaluation result is expressed as a Mean Opinion Score (MOS): the higher the MOS value, the better the audio quality. However, the subjective evaluation method suffers from long experimental periods and high economic cost.
Objective assessment methods are widely used to assess audio quality and are divided into reference and no-reference audio quality assessment models. A reference audio quality evaluation method works by comparing the processed voice with the lossless voice: the two signals are first aligned and their deviation is found; after each small segment of processed voice is aligned with the corresponding lossless segment, both are fed independently into an auditory model to observe the loss on each frequency band and the generation of extra frequency components, and to judge whether the increase or decrease of frequency components is noticeable to human hearing; finally, the per-segment voice impairment is smoothed and weight-averaged over the whole time domain and mapped to a single voice quality score. The ITU-T has historically introduced three well-known models, namely PSQM (P.861), PESQ (P.862) and POLQA (P.863), of which POLQA is currently the most widely accepted. The PSQM and PESQ models are only applicable to audio below 16 kHz. The POLQA model can be applied to 48 kHz audio signals, but the algorithm is still protected, not public, and costly to use. Moreover, a reference audio quality assessment model requires reference audio to be provided and cannot assess audio quality in scenarios where no reference audio is available. No-reference audio quality evaluation models are mostly realized with deep learning; a representative method is MOSNet, which adopts a CNN plus BLSTM network architecture, with training data derived from the Voice Conversion Challenge (VCC) 2018, and its evaluation indexes are at the forefront of the industry. The time sequence convolution network (TCN) has been successful in fields such as machine translation, traffic prediction and sound event detection, and has the potential to exceed LSTM networks. In addition, the VCC2018 dataset is limited in size, and how to fully mine the internal information of small-scale data so as to optimize audio evaluation performance is a problem to be solved in the industry.
In practical applications, the deep learning models used by current no-reference voice quality evaluation techniques adapt poorly to voices with different timbres: the deep neural networks used in the prior art apply the same evaluation model and the same processing to voices from different sources by default, and do not consider the adaptability of voice dialogue quality evaluation in multi-voice environments, especially multi-target-person multi-voice dialogue environments, nor the problem of model update training.
Disclosure of Invention
The invention aims to provide a voice dialogue quality evaluation method, a voice dialogue quality evaluation system and a portable electronic device, which can accurately produce no-reference quality scores for multi-person voice dialogues, so that the results are more targeted and adaptive.
In order to achieve the above object, the present invention provides the following solutions:
A voice dialogue quality evaluation method, comprising the following steps:
S100: analyzing a voice dialogue to be evaluated to obtain a first interaction attribute of the voice dialogue to be evaluated;
S200: dividing the voice dialogue to be evaluated into a plurality of voice segment groups based on the first interaction attribute, wherein each voice segment group comprises at least one voice sub-segment, and the voice sub-segments located in the same voice segment group belong to the same target person;
S300: analyzing each voice sub-segment contained in each voice segment group to obtain a second duration attribute of each voice sub-segment in each voice segment group;
S400: determining at least one candidate sub-segment from each voice segment group to obtain a plurality of candidate sub-segments, wherein the second duration attributes of the plurality of candidate sub-segments are the same;
S500: taking the candidate sub-segments as training samples, training and updating at least one existing voice quality evaluation model to obtain an updated voice quality evaluation model;
S600: inputting the voice sub-segments except the candidate sub-segments in each voice segment group into the updated voice quality evaluation model to obtain the voice quality evaluation score of the target person corresponding to each voice segment group;
the voice quality evaluation model is a convolution-time sequence convolution network model.
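To make the overall flow of steps S100 to S600 concrete before the refinements that follow, here is a minimal Python sketch. The callables diarize, pick_candidates (a concrete candidate-selection helper is sketched later, after the step S400 example in the detailed embodiment), fine_tune and score are hypothetical placeholders introduced only for illustration and are not interfaces defined by the patent; averaging the remaining sub-segment scores into one score per target person is likewise an assumption.

```python
# Minimal structural sketch of steps S100-S600; all callables passed in by
# the caller are hypothetical placeholders, not patent-defined interfaces.
from collections import defaultdict

def evaluate_dialogue(audio, base_model, diarize, pick_candidates,
                      fine_tune, score, tolerance=1.0):
    # S100/S200: determine the speakers and group sub-segments per person.
    # diarize(audio) is assumed to yield (speaker_id, segment, duration_s).
    groups = defaultdict(list)
    for speaker, segment, duration in diarize(audio):
        groups[speaker].append((segment, duration))

    # S300/S400: one candidate sub-segment per group, with matched durations.
    candidates = pick_candidates(groups, tolerance)

    # S500: update the existing no-reference model on the candidate segments.
    model = fine_tune(base_model, [seg for seg, _ in candidates.values()])

    # S600: score every non-candidate sub-segment; averaging them into one
    # score per target person is an assumption made for this sketch.
    scores = {}
    for speaker, segments in groups.items():
        rest = [seg for seg, _ in segments if seg is not candidates[speaker][0]]
        scores[speaker] = sum(score(model, s) for s in rest) / max(len(rest), 1)
    return scores
```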
Further, the first interaction attribute in the step S100 is used to characterize a first number of different target characters involved in the speech dialogue to be evaluated;
the step S200 specifically includes: and dividing the voice dialogue to be evaluated into a first number of voice fragment groups based on the first interaction attribute.
Further, the second duration attribute in the step S300 is used to characterize the duration of each voice sub-segment.
Further, the second duration attributes of the candidate sub-segments in the step S400 are the same, including one of the following cases:
the duration of the plurality of candidate sub-fragments is the same;
the absolute value of the difference value of the duration of the plurality of candidate sub-segments is smaller than a preset upper limit value.
Further, the convolution-time sequence convolution network model comprises a preamble convolution module and a time sequence convolution module;
the preamble convolution module performs feature extraction on the input voice sub-segment;
the time sequence convolution module comprises n expansion convolution modules, wherein the expansion factor of each expansion convolution is 2^(n-1).
The invention also provides a voice dialogue quality evaluation system, which comprises:
a voice receiving unit: used for receiving the voice dialogue to be evaluated;
a voice analysis unit: comprising a first interaction analysis subunit and a second duration analysis subunit; the first interaction analysis subunit is used for analyzing the voice dialogue to be evaluated, obtaining a first interaction attribute of the voice dialogue to be evaluated, and sending the first interaction attribute to the voice grouping unit; the second duration analysis subunit is used for analyzing each voice sub-segment contained in each voice segment group, obtaining a second duration attribute of each voice sub-segment in each voice segment group, and sending the second duration attribute to the voice candidate unit;
a voice grouping unit: used for dividing the voice dialogue to be evaluated into a first number of voice segment groups based on the first interaction attribute, wherein each voice segment group comprises at least one voice sub-segment, and the voice sub-segments located in the same voice segment group belong to the same target person;
a voice candidate unit: used for determining at least one candidate sub-segment from each voice segment group to obtain a plurality of candidate sub-segments, wherein the second duration attributes of the plurality of candidate sub-segments are the same;
a model training unit: used for training and updating at least one existing voice quality evaluation model by taking the candidate sub-segments as training samples, to obtain an updated voice quality evaluation model;
an evaluation output unit: used for inputting the voice sub-segments except the candidate sub-segments in each voice segment group into the updated voice quality evaluation model, and outputting the voice quality evaluation score of the target person corresponding to each voice segment group;
the voice receiving unit, the first interaction analysis subunit, the voice grouping unit, the voice candidate unit, the model training unit and the evaluation output unit are connected in sequence; the second duration analysis subunit is connected with the voice grouping unit and the voice candidate unit;
the speech quality evaluation model is a reference-free speech quality evaluation model.
Further, the first interaction attribute is used to characterize a first number of different target persons involved in the speech dialogue to be evaluated.
Further, the second duration attribute is used to characterize a duration of each voice sub-segment.
Further, the second duration attributes of the plurality of candidate sub-segments are the same, including one of:
the duration of the plurality of candidate sub-fragments is the same;
the absolute value of the difference value of the duration of the plurality of candidate sub-segments is smaller than a preset upper limit value.
The invention also provides a portable electronic device, which comprises a voice receiving unit, a memory, a processor and a display unit, wherein computer executable program instructions are stored in the memory and the voice receiving unit is used for receiving a voice dialogue; the executable program instructions are executed by the processor to implement the voice dialogue quality evaluation method and display the evaluation results on the display unit.
According to the specific embodiments provided by the invention, the invention discloses the following technical effects. In the voice dialogue quality evaluation method provided by the invention, the voice dialogue to be evaluated is first analyzed to obtain a first interaction attribute; the voice dialogue to be evaluated is then divided into a plurality of voice segment groups, and a second duration attribute of each voice sub-segment in each voice segment group is obtained; at least one candidate sub-segment is then determined from each voice segment group; at least one existing voice quality evaluation model is trained and updated to obtain an updated voice quality evaluation model; finally, the voice sub-segments except the candidate sub-segments in each voice segment group are input into the updated voice quality evaluation model to obtain the voice quality evaluation score of the target person corresponding to each voice segment group, so that no-reference quality scores of multi-person voice dialogues can be produced accurately and the results are more targeted and adaptive. In addition, the invention utilizes a convolution-time sequence convolution (CNN-TCN) network architecture and adopts a label distribution learning (LDL) method to improve network audio evaluation performance; the voice is pre-processed and segmented according to the scene characteristics of multi-person voice dialogues, the no-reference voice quality evaluation model is update-trained in a targeted manner according to the grouped sample data, and the evaluation is output, so that the results are more targeted, suitable and accurate, and the method is applicable to multi-voice environments, especially multi-target-person multi-voice dialogue environments.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a voice dialogue quality evaluation method according to an embodiment of the invention;
fig. 2 is a block diagram of a plurality of voice segment groups of a voice conversation quality evaluation method in an embodiment of the present invention;
FIG. 3 (a) is a schematic diagram of a no-reference speech dialog quality assessment model of the present invention;
FIG. 3 (b) is a schematic diagram of the structure of a time-series convolution module in the model for quality assessment of a reference-less speech dialogue of the present invention;
FIG. 3 (c) is a schematic diagram of an expanded convolution module of the no-reference speech dialogue quality assessment model of the present invention;
FIG. 4 is a schematic structural diagram of an electronic device according to a voice conversation quality evaluation method of the present invention;
FIG. 5 is a schematic diagram of a voice conversation quality evaluation system in accordance with one embodiment of the present invention;
FIG. 6 is a schematic diagram of a voice dialog quality assessment system in accordance with a preferred embodiment of the present invention;
FIG. 7 (a) is an effect diagram corresponding to sentence level indicators according to the present invention;
fig. 7 (b) is an effect diagram corresponding to the system level index according to the embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention aims to provide a voice dialogue quality evaluation method, a voice dialogue quality evaluation system and portable electronic equipment, which can accurately realize the non-reference quality score output of a multi-character voice dialogue, so that the result is more targeted and adaptive.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
As shown in fig. 1, a method for evaluating quality of a voice conversation according to an embodiment of the present invention includes:
s100: analyzing a voice dialogue to be evaluated to obtain a first interaction attribute of the voice dialogue to be evaluated;
specifically, the first interaction attribute in step S100 is used to characterize a first number of different target characters involved in the speech dialogue to be evaluated;
s200: dividing the voice dialogue to be evaluated into a plurality of voice segment groups based on the first interaction attribute, wherein each voice segment group comprises at least one voice sub-segment, and the plurality of voice sub-segments positioned in the same voice segment group belong to the same target person;
the step S200 specifically includes:
and dividing the voice dialogue to be evaluated into a first number of voice fragment groups based on the first interaction attribute.
S300: analyzing each voice sub-segment contained in each voice segment group to obtain a second duration attribute of each voice sub-segment in each voice segment group;
the second duration attribute in step S300 is used to characterize the duration of each voice sub-segment;
s400: determining at least one candidate sub-segment from each voice segment group to obtain a plurality of candidate sub-segments, wherein the second duration attributes of the plurality of candidate sub-segments are the same;
the second duration attributes of the candidate sub-segments in the step S400 are the same, including one of the following cases:
the duration of the plurality of candidate sub-fragments is the same;
the absolute value of the difference value of the duration of the plurality of candidate sub-segments is smaller than a preset upper limit value.
S500: taking the candidate sub-segments as training samples, training and updating at least one existing voice quality evaluation model to obtain an updated voice quality evaluation model;
the speech quality evaluation model in the step S500 is a no-reference speech quality evaluation model, and the no-reference speech quality evaluation model is a convolution-time sequence convolution network model.
S600: inputting the voice sub-fragments except the candidate sub-fragments in each voice fragment group into the updated voice quality evaluation model to obtain the voice quality evaluation score of the target person corresponding to each voice fragment group.
In one embodiment of the present invention, fig. 2 shows a grouping schematic diagram of a plurality of voice segment groups; the voice dialogue quality evaluation method of the present invention specifically includes:
s100: analyzing a voice dialogue to be evaluated to obtain a first interaction attribute of the voice dialogue to be evaluated;
in this embodiment, as shown in the leftmost part of fig. 2, the speech dialogue to be evaluated includes speech segments A1, B1, A2, C1, C2, B2, C3 and A3; the voice dialogue is an English voice dialogue, such as a spoken English dialogue; this segment of voice dialogue comes from at least three different people, assumed to be person A, person B and person C, wherein the voices from person A are A1, A2 and A3; the voices from person B are B1 and B2; and the voices from person C are C1, C2 and C3; in the scenario of FIG. 2, the character conversations are interleaved;
in this embodiment, when the step S100 is performed, the first interaction attribute is used to characterize a first number of different target characters related to the speech dialogue to be evaluated.
S200: dividing the voice dialogue to be evaluated into a plurality of voice fragment groups based on the first interaction attribute;
in this embodiment, the voice dialogue to be evaluated is divided into 3 voice segment groups based on the first interaction attribute, which is still represented by A, B, C;
specifically, each voice segment group contains at least one voice sub-segment, and a plurality of voice sub-segments located in the same voice segment group belong to the same target person; for example, the speech segment group a contains speech sub-segments A1, A2 and A3, the speech segment group B contains speech sub-segments B1, B2, and the speech segment group C contains speech sub-segments C1, C2 and C3.
S300: analyzing each voice sub-segment contained in each voice segment group to obtain a second duration attribute of each voice sub-segment in each voice segment group;
the second duration attribute in step S300 is used to characterize the duration of each voice sub-segment;
in the present embodiment, it is assumed that the durations of the voice sub-segments A1, A2 and A3 contained in voice segment group A are 18 s, 30 s and 25 s, respectively (s represents seconds); the durations of the voice sub-segments B1 and B2 contained in voice segment group B are 26 s and 18 s, respectively; and the durations of the voice sub-segments C1, C2 and C3 contained in voice segment group C are 25 s, 31 s and 18 s, respectively.
S400: determining at least one candidate sub-segment from each voice segment group to obtain a plurality of candidate sub-segments, wherein the second duration attributes of the plurality of candidate sub-segments are the same;
in this embodiment, the criteria for determining at least one candidate sub-segment from each speech segment group specifically includes one or a combination of the following:
(1) the durations of the plurality of candidate sub-segments are the same;
(2) the absolute value of the difference of the durations of the plurality of candidate sub-segments is smaller than a preset upper limit value.
If criterion (1) is applied, A1, B2 and C3 may be selected as the plurality of candidate sub-segments, since their second duration attributes are the same (all 18 s);
if criterion (2) is applied and the preset upper limit value is set to 1 s in advance, A3, B1 and C1 may be selected as the plurality of candidate sub-segments, since the absolute values of the differences of their durations are within the preset upper limit value;
of course, other options are possible, and the above is merely an illustrative example.
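The following runnable sketch illustrates the candidate selection of step S400 using the example durations above. The exhaustive search over one sub-segment per group and the use of "differences not exceeding the tolerance" are illustrative choices made for this sketch, not the patent's prescribed algorithm.

```python
# Sketch of step S400: pick one sub-segment per group so that all picked
# durations match within `tolerance` (0.0 reproduces criterion (1)).
from itertools import product

def pick_equal_duration_candidates(groups, tolerance=0.0):
    """groups: {speaker: [(segment_id, duration_s), ...]}.
    Returns one (segment_id, duration_s) per speaker, or None if no
    admissible combination exists."""
    speakers = list(groups)
    for combo in product(*(groups[s] for s in speakers)):
        durations = [d for _, d in combo]
        if max(durations) - min(durations) <= tolerance:
            return dict(zip(speakers, combo))
    return None

groups = {
    "A": [("A1", 18), ("A2", 30), ("A3", 25)],
    "B": [("B1", 26), ("B2", 18)],
    "C": [("C1", 25), ("C2", 31), ("C3", 18)],
}
print(pick_equal_duration_candidates(groups, tolerance=0))
# -> {'A': ('A1', 18), 'B': ('B2', 18), 'C': ('C3', 18)}
```

With tolerance 0 the exactly matching set A1/B2/C3 (all 18 s) is returned; a non-zero tolerance also admits near-matching sets such as A3/B1/C1 from the example above.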
S500: taking the candidate sub-segments as training samples, training and updating at least one existing voice quality evaluation model to obtain an updated voice quality evaluation model;
specifically, the existing at least one speech quality assessment model is a no-reference speech quality assessment model.
Specifically, as still another improved embodiment of the present invention, the reference-free speech quality assessment model is a convolution-time series convolution network model.
Fig. 3 (a) - (c) are schematic diagrams illustrating principles and components of a model for quality evaluation of a non-reference voice conversation according to an embodiment of the present invention.
The embodiment provides a new deep learning-based reference-free audio quality evaluation method, which utilizes a convolution-time sequence convolution (CNN-TCN) network architecture and adopts a Label Distribution Learning (LDL) method to improve network audio evaluation performance.
The input signal, processed by the corresponding steps (S100 to S400) of the method shown in FIG. 1, first passes through the convolution module; its output is fed to the time sequence convolution module, whose output in turn is fed to the fully connected module. The output of the fully connected module comprises two branches: the first branch a outputs the frame-level MOS value; the second branch b (the branch mapped through global average pooling) performs sentence-level label distribution processing and then outputs the sentence-level MOS value, as shown in fig. 3 (a).
Next, the re-update training principle of the non-reference voice dialogue quality evaluation model used in the present invention will be described.
1. Introduction to data set
VCC2018 contains audio with MOS labels generated by various voice synthesis systems; each audio clip was given a MOS rating (1 to 5 points) by four raters, yielding 13,580 training samples, 3,000 validation samples and 4,000 test samples. Because the training set is small, this embodiment adds label distribution learning to improve model performance.
The process of retraining and updating an existing model belongs to the prior art and is not elaborated further in this embodiment.
2. Signal input
The original audio in VCC2018 is subjected to an STFT to obtain the frequency-domain input signal of the network; optionally, a Hamming window is applied, with window length 512, window shift 256 and 512 frequency points. The dimensions of the input signal are [B, N, F], where B is the batch size, N is the number of frames and F is the number of single-sided frequency points; when the number of frequency points is 512, F is 257. The input is reshaped to obtain input signal 1 with dimensions [B, N, 257, 1].
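A short sketch of this signal-input step follows, using librosa for the STFT (the patent names no library). The window length, hop and number of frequency points come from the text above; feeding the magnitude spectrum, rather than some other frequency-domain representation, is an assumption.

```python
# 512-point STFT with a Hamming window and hop 256, giving 257 one-sided
# frequency bins per frame; output shaped [B=1, N, 257, 1] as input signal 1.
import numpy as np
import librosa

def make_input(wav_path):
    y, sr = librosa.load(wav_path, sr=None)           # keep the original rate
    spec = librosa.stft(y, n_fft=512, hop_length=256,
                        win_length=512, window="hamming")
    mag = np.abs(spec).T                               # [N, 257] magnitude
    return mag[np.newaxis, :, :, np.newaxis]           # [1, N, 257, 1]
```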
3. Convolution module
The input signal first undergoes feature extraction by the convolution module. The convolution module is the same as the convolution module in MOSNet, with 12 layers in total; the parameters of each layer are shown in Table 1.
TABLE 1 convolution module layer parameters
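The body of Table 1 is not reproduced in this text. Since the description states the convolution module is the same as MOSNet's, the sketch below assumes the published MOSNet layout (four blocks of three 3x3 convolutions with 16/32/64/128 channels, the last convolution of each block striding by 3 along the frequency axis), which is consistent with the [B, N, 4, 128] output quoted in the next section; it is an assumption, not Table 1 itself.

```python
# Sketch of the convolution front end, assuming the MOSNet layout.
import torch
import torch.nn as nn

class ConvFrontEnd(nn.Module):
    def __init__(self):
        super().__init__()
        layers, in_ch = [], 1
        for out_ch in (16, 32, 64, 128):               # four blocks
            layers += [
                nn.Conv2d(in_ch, out_ch, 3, stride=(1, 1), padding=1), nn.ReLU(),
                nn.Conv2d(out_ch, out_ch, 3, stride=(1, 1), padding=1), nn.ReLU(),
                nn.Conv2d(out_ch, out_ch, 3, stride=(1, 3), padding=1), nn.ReLU(),
            ]                                          # stride 3 on frequency
            in_ch = out_ch
        self.net = nn.Sequential(*layers)

    def forward(self, x):             # x: [B, N, 257, 1]
        x = x.permute(0, 3, 1, 2)     # -> [B, 1, N, 257]
        x = self.net(x)               # 257 -> 86 -> 29 -> 10 -> 4 frequency bins
        return x.permute(0, 2, 3, 1)  # -> [B, N, 4, 128]

x = torch.randn(1, 100, 257, 1)
print(ConvFrontEnd()(x).shape)        # torch.Size([1, 100, 4, 128])
```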
4. Sequential convolution module
After the features are extracted by the convolution module, the dimensions are [B, N, 4, 128]; these features are reshaped to obtain the time sequence convolution module input signal 2, with dimensions [B, N, 512].
The sequential convolution module is formed by a plurality of expansion convolution modules, and preferably n is 4. The dilation factor of each dilated convolution is 2^(n-1), where n is the number of expansion convolution modules, as shown in FIG. 3 (b).
Taking the first expansion convolution module as an example of the residual structure: the input signal 2 passes through a one-dimensional dilated convolution, channel normalization, a ReLU activation function and a Dropout layer, and this sequence is repeated once; the number of output channels of the one-dimensional dilated convolution is preferably set to 128, the convolution kernel size is 3, the stride is 1, the dilation factor is 2^(n-1), the padding is "same", and the Dropout rate is 0.3. The input signal 2 is also subjected to a 1 x 1 one-dimensional convolution, preferably with 128 output channels, and the two obtained signals are added to obtain the input signal 3 of the fully connected module, with dimensions [B, N, 128], as shown in fig. 3 (c).
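A PyTorch sketch of one expansion convolution module and of stacking n = 4 of them is given below. Using GroupNorm as the "channel normalization" and giving the i-th module dilation 2^(i-1) (1, 2, 4, 8) are assumptions where the text is ambiguous; the channel counts, kernel size, padding and Dropout rate follow the description above.

```python
# One dilated-convolution residual module: (Conv1d -> norm -> ReLU -> Dropout)
# twice, plus a 1x1 convolution on the input, the two paths summed.
import torch
import torch.nn as nn

class DilatedConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch=128, kernel=3, dilation=1, p_drop=0.3):
        super().__init__()
        pad = (kernel - 1) * dilation // 2             # "same" padding
        self.body = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel, dilation=dilation, padding=pad),
            nn.GroupNorm(1, out_ch), nn.ReLU(), nn.Dropout(p_drop),
            nn.Conv1d(out_ch, out_ch, kernel, dilation=dilation, padding=pad),
            nn.GroupNorm(1, out_ch), nn.ReLU(), nn.Dropout(p_drop))
        self.skip = nn.Conv1d(in_ch, out_ch, 1)        # 1x1 convolution branch

    def forward(self, x):                              # x: [B, C_in, N]
        return self.body(x) + self.skip(x)

# n = 4 modules; the i-th module (i = 1..4) is assumed to use dilation 2**(i-1).
tcn = nn.Sequential(*[DilatedConvBlock(512 if i == 0 else 128, dilation=2 ** i)
                      for i in range(4)])
x = torch.randn(1, 512, 100)          # signal 2 transposed to [B, 512, N]
print(tcn(x).shape)                   # torch.Size([1, 128, 100]), i.e. signal 3
```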
5. Network output processing
The input signal 3 passes through a fully connected layer, preferably with 128 output channels, a ReLU activation function and a Dropout rate of 0.3, resulting in fully connected module output 4 with dimensions [B, N, 128]. Signal 4 passes through a fully connected layer with 1 output channel, giving signal 5, the frame-level MOS value, with dimensions [B, N, 1]. Signal 4 also passes through a fully connected layer with 101 output channels and a softmax activation, giving fully connected output 6 with dimensions [B, N, 101]; signal 6 is globally pooled to obtain the sentence-level label distribution signal 7 with dimensions [B, 1, 101]; signal 8, the sentence-level MOS value with dimensions [B, 1], is obtained by mapping signal 7. When B is 1, the mapping function is:
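The mapping formula itself is not reproduced in this text. A standard expected-value mapping consistent with the surrounding description (a 101-component distribution over bins l_k = 0, ..., 100, rescaled to the 0-5 MOS range) would be

y = Σ_{k=0}^{100} (l_k / 20) · x_k

with x_k the k-th component of signal 7; this reconstruction is an assumption, not necessarily the patent's exact formula.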
where y represents signal 8 and x represents signal 7. The calculation range of the above formula is 0 to 5.
6. Loss function
The average avg_MOS of the four MOS ratings of each audio clip in VCC2018 is used as the label for both the frame-level MOS of signal 5 (frame_MOS) and the sentence-level MOS value of signal 8 (utterance_MOS). The variance var_MOS of each audio clip's MOS ratings is also calculated, and the label distribution (a Gaussian distribution) is calculated from the following equation,
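The equation for the label distribution likewise does not survive in this text. Under the quantities just defined, a Gaussian label distribution over the 101 bins would typically take the form

p(l_k) ∝ exp( −(l_k / 20 − avg_MOS)² / (2 · var_MOS) ),  normalized so that Σ_k p(l_k) = 1,

where the rescaling of l_k by 20 and the normalization are assumptions rather than the patent's exact formula.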
where l_k takes values in [0, 100].
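A small numpy sketch of building such a 101-bin Gaussian label distribution from one clip's four MOS ratings is shown below; it follows the hedged formula above, so the /20 rescaling and the normalization are assumptions.

```python
# Build a 101-bin Gaussian label distribution from a clip's MOS ratings.
import numpy as np

def label_distribution(ratings, n_bins=101, eps=1e-6):
    avg_mos = np.mean(ratings)                 # avg_MOS: frame/utterance label
    var_mos = np.var(ratings) + eps            # var_MOS, kept strictly positive
    mos_axis = np.arange(n_bins) / 20.0        # l_k / 20 spans the 0-5 MOS range
    p = np.exp(-(mos_axis - avg_mos) ** 2 / (2.0 * var_mos))
    return p / p.sum()                         # normalise to a distribution

dist = label_distribution([3, 4, 4, 5])
print(dist.argmax() / 20.0)                    # peaks near avg_MOS = 4.0
```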
All or part of the steps of the various embodiments described above may also be implemented as computer program instructions, implemented by a portable electronic device.
Referring to fig. 4, the present invention may also be implemented as a portable electronic device, where the electronic device includes a voice receiving unit, a memory, a processor, and a display unit, where the memory stores computer executable program instructions, and the receiving unit is configured to receive an English voice dialogue;
the executable program instructions are executed by the processor to implement all or part of the steps of the voice dialog quality assessment method described in fig. 1 and display the assessment results on the display unit.
Fig. 5-6 are schematic structural diagrams of a voice conversation quality evaluation system according to various embodiments of the present invention.
As shown in fig. 5, the present embodiment provides a voice dialogue quality evaluation system, which includes a voice receiving module, a voice analyzing module, a voice grouping module, a voice candidate module, a model training module, and an evaluation output module, for implementing the method described in fig. 1.
As shown in fig. 6, in this embodiment, the voice parsing module includes a first interaction parsing subunit and a second duration parsing subunit;
in this embodiment, the first interaction analysis subunit is configured to analyze a voice dialogue to be evaluated, obtain a first interaction attribute of the voice dialogue to be evaluated, and send the first interaction attribute to a voice grouping module; the second duration analysis subunit is configured to analyze each voice sub-segment included in each voice segment group, obtain a second duration attribute of each voice sub-segment in each voice segment group, and send the second duration attribute to a voice candidate module;
in this embodiment, the voice receiving module is configured to receive a voice dialogue to be evaluated;
in this embodiment, the voice grouping module is configured to segment the voice dialog to be evaluated into a first number of voice segment groups based on the first interaction attribute, where each voice segment group includes at least one voice sub-segment, and a plurality of voice sub-segments located in the same voice segment group belong to the same target person;
in this embodiment, the voice candidate module is configured to determine at least one candidate sub-segment from each voice segment group, so as to obtain a plurality of candidate sub-segments, where second duration attributes of the plurality of candidate sub-segments are the same;
in this embodiment, the model training module is configured to train and update an existing at least one speech quality evaluation model with the plurality of candidate sub-segments as training samples, to obtain an updated speech quality evaluation model;
in this embodiment, the evaluation output module is configured to input the speech sub-segments except the candidate sub-segments in each speech segment group into the updated speech quality evaluation model, and output a speech quality evaluation score of the target person corresponding to each speech segment group;
in this embodiment, the voice receiving module, the first interaction analysis subunit, the voice grouping module, the voice candidate module, the model training module, and the evaluation output module are sequentially connected; the second duration analysis subunit is connected with the voice grouping module and the voice candidate module;
in this embodiment, the speech quality evaluation model is a no-reference speech quality evaluation model, and the no-reference speech quality evaluation model is a convolution-time sequence convolution network model.
In this embodiment, the first interaction attribute is used to characterize a first number of different target persons involved in the speech dialogue to be evaluated.
In this embodiment, the second duration attribute is used to characterize the duration of each voice sub-segment.
In this embodiment, the second duration attributes of the candidate sub-segments are the same, including one of the following cases:
the duration of the plurality of candidate sub-fragments is the same;
the absolute value of the difference value of the duration of the plurality of candidate sub-segments is smaller than a preset upper limit value.
As a further preferred aspect, the time-series convolution module is composed of n expansion convolution modules, each expansion convolution having an expansion factor of 2^(n-1), where n is the number of expansion convolution modules.
Specifically, to obtain a more beneficial training model and training parameters, it has been verified that the value of n is related to the first number K of different target persons involved in the voice dialogue to be evaluated and to the numbers {num1, num2, ..., numK} of voice sub-segments contained in the K voice segment groups, where numi represents the number of voice sub-segments contained in the i-th (i = 1, 2, 3, ..., K) voice segment group.
Specifically, the number n of the expansion convolution modules is determined as follows:
if K is less than or equal to 4, n=4;
if K>4, then
where min{ } denotes taking the smaller value, max{num1, num2, ..., numK} denotes taking the largest value among {num1, num2, ..., numK}, and ⌈ ⌉ denotes rounding up.
As shown in fig. 7, the effect diagrams corresponding to the sentence-level indicators and to the system-level indicators of the technical scheme of the present invention are presented.
In this embodiment, the comparison index is selected: linear Correlation Coefficient (LCC), spearman Rank Correlation Coefficient (SRCC), and Mean Square Error (MSE).
It can be seen from Table 2, which compares the method of this embodiment with MOSNet, that every indicator of the present method is superior to MOSNet.
TABLE 2 Comparison of the method of this embodiment with MOSNet

Index | The invention    | MOSNet
LCC   | (0.6684, 0.9643) | (0.642, 0.957)
SRCC  | (0.6342, 0.9282) | (0.589, 0.888)
MSE   | (0.4642, 0.0434) | (0.538, 0.084)

Note: in each pair (A, B), A is the sentence-level index and B is the system-level index.
In the technical scheme of the invention, a convolution-time sequence convolution (CNN-TCN) network architecture is utilized and a label distribution learning (LDL) method is adopted to improve network audio evaluation performance; the voice is pre-processed and segmented according to the scene characteristics of multi-person voice dialogues, the no-reference voice quality evaluation model is update-trained in a targeted manner according to the grouped sample data, and the evaluation is output, so that the results are more targeted and adaptive.
The technical scheme of the invention firstly analyzes the voice dialogue to be evaluated to obtain a first interaction attribute; dividing the voice dialogue to be evaluated into a plurality of voice fragment groups to obtain a second duration attribute of each voice sub-fragment in each voice fragment group; then determining at least one candidate sub-segment from each of the speech segment groups; training and updating at least one existing voice quality evaluation model to obtain an updated voice quality evaluation model; and finally, inputting the voice sub-fragments except the candidate sub-fragments in each voice fragment group into the updated voice quality evaluation model to obtain the voice quality evaluation score of the target person corresponding to each voice fragment group, so that the non-reference quality score output of the multi-person voice dialogue can be accurately realized, and the result is more targeted and adaptive.
For the other technical features of the embodiments, those skilled in the art can choose flexibly according to actual conditions to meet different specific practical requirements. In the above description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be apparent to one of ordinary skill in the art that such specific details are not necessary to practice the invention. In other instances, well-known compositions, structures and techniques, such as specific construction details and operating conditions, have not been described in detail in order to avoid obscuring the present invention. Modifications and variations which do not depart from the spirit and scope of the invention are intended to be within the scope of the invention as defined by the appended claims.
The principles and embodiments of the present invention have been described herein with reference to specific examples, the description of which is intended only to assist in understanding the methods of the present invention and the core ideas thereof; also, it is within the scope of the present invention to be modified by those of ordinary skill in the art in light of the present teachings. In view of the foregoing, this description should not be construed as limiting the invention.

Claims (10)

1. A method for evaluating the quality of a voice conversation, comprising:
s100: analyzing a voice dialogue to be evaluated to obtain a first interaction attribute of the voice dialogue to be evaluated;
s200: dividing the voice dialogue to be evaluated into a plurality of voice segment groups based on the first interaction attribute, wherein each voice segment group comprises at least one voice sub-segment, and the plurality of voice sub-segments positioned in the same voice segment group belong to the same target person;
s300: analyzing each voice sub-segment contained in each voice segment group to obtain a second duration attribute of each voice sub-segment in each voice segment group;
s400: determining at least one candidate sub-segment from each voice segment group to obtain a plurality of candidate sub-segments, wherein the second duration attributes of the plurality of candidate sub-segments are the same;
s500: taking the candidate sub-segments as training samples, training and updating at least one existing voice quality evaluation model to obtain an updated voice quality evaluation model;
s600: inputting the voice sub-fragments except the candidate sub-fragments in each voice fragment group into the updated voice quality evaluation model to obtain the voice quality evaluation score of the target person corresponding to each voice fragment group;
the voice quality evaluation model is a convolution-time sequence convolution network model.
2. The method for evaluating the quality of a voice conversation according to claim 1, wherein,
the first interaction attribute in the step S100 is used to characterize a first number of different target characters involved in the speech dialogue to be evaluated;
the step S200 specifically includes: and dividing the voice dialogue to be evaluated into a first number of voice fragment groups based on the first interaction attribute.
3. The method for evaluating the quality of a voice conversation according to claim 1, wherein,
the second duration attribute in step S300 is used to characterize the duration of each voice sub-segment.
4. The method for evaluating the quality of a voice conversation according to claim 1, wherein,
in the step S400, the second duration attributes of the candidate sub-segments are the same, including one of the following cases:
the duration of the plurality of candidate sub-fragments is the same;
the absolute value of the difference value of the duration of the plurality of candidate sub-segments is smaller than a preset upper limit value.
5. The method for evaluating the quality of a voice conversation according to claim 1, wherein,
the convolution-time sequence convolution network model comprises a preamble convolution module and a time sequence convolution module;
the preamble convolution module performs feature extraction on the input voice sub-segment;
the time sequence convolution module comprises n expansion convolution modules, wherein the expansion factor of each expansion convolution module is 2^(n-1).
6. A speech dialog quality assessment system comprising:
a voice receiving module: for receiving a voice dialog to be evaluated;
and a voice analysis module: the method comprises a first interaction analysis subunit and a second duration analysis subunit; the first interaction analysis subunit is used for analyzing the voice dialogue to be evaluated, obtaining a first interaction attribute of the voice dialogue to be evaluated, and sending the first interaction attribute to the voice grouping module; the second duration analysis subunit is configured to analyze each voice sub-segment included in each voice segment group, obtain a second duration attribute of each voice sub-segment in each voice segment group, and send the second duration attribute to a voice candidate module;
and a voice grouping module: dividing the voice dialogue to be evaluated into a first number of voice segment groups based on the first interaction attribute, wherein each voice segment group comprises at least one voice sub-segment, and a plurality of voice sub-segments positioned in the same voice segment group belong to the same target person;
a voice candidate module: the method comprises the steps of determining at least one candidate sub-segment from each voice segment group to obtain a plurality of candidate sub-segments, wherein second duration attributes of the plurality of candidate sub-segments are identical;
model training module: the method comprises the steps of training and updating at least one existing voice quality evaluation model by taking the candidate sub-segments as training samples to obtain an updated voice quality evaluation model;
and an evaluation output module: the voice sub-segments except the candidate sub-segments in each voice segment group are input into the updated voice quality evaluation model, and the voice quality evaluation score of the target person corresponding to each voice segment group is output;
the voice receiving module, the first interaction analysis subunit, the voice grouping module, the voice candidate module, the model training module and the evaluation output module are sequentially connected; the second duration analysis subunit is also connected with the voice grouping module and the voice candidate module;
the voice quality evaluation model is a convolution-time sequence convolution network model.
7. The speech dialog quality assessment system of claim 6, wherein the first interaction attribute is used to characterize a first number of different target persons involved in the speech dialog to be assessed.
8. The speech dialogue quality assessment system of claim 6 wherein,
the second duration attribute is used to characterize the duration of each speech sub-segment.
9. The speech dialog quality evaluation system of claim 6 wherein the second duration attributes of the plurality of candidate sub-segments are the same, comprising:
the duration of the plurality of candidate sub-fragments is the same;
or, the absolute value of the difference value of the duration time of the plurality of candidate sub-segments is smaller than a preset upper limit value.
10. The portable electronic equipment is characterized by comprising a voice receiving unit, a memory, a processor and a display unit, wherein the memory is stored with computer executable program instructions, and the receiving unit is used for receiving voice conversations;
the executable program instructions are executed by the processor to implement the speech dialog quality assessment method of any of claims 1-5 and display the assessment results on the display unit.
CN202310345168.5A 2023-04-03 2023-04-03 Voice dialogue quality evaluation method and system and portable electronic equipment Active CN116564351B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310345168.5A CN116564351B (en) 2023-04-03 2023-04-03 Voice dialogue quality evaluation method and system and portable electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310345168.5A CN116564351B (en) 2023-04-03 2023-04-03 Voice dialogue quality evaluation method and system and portable electronic equipment

Publications (2)

Publication Number Publication Date
CN116564351A CN116564351A (en) 2023-08-08
CN116564351B true CN116564351B (en) 2024-01-23

Family

ID=87485165

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310345168.5A Active CN116564351B (en) 2023-04-03 2023-04-03 Voice dialogue quality evaluation method and system and portable electronic equipment

Country Status (1)

Country Link
CN (1) CN116564351B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9711167B2 (en) * 2012-03-13 2017-07-18 Nice Ltd. System and method for real-time speaker segmentation of audio interactions

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108346434A (en) * 2017-01-24 2018-07-31 ***通信集团安徽有限公司 A kind of method and apparatus of speech quality evaluation
CN110401622A (en) * 2018-04-25 2019-11-01 ***通信有限公司研究院 A kind of speech quality assessment method, device, electronic equipment and storage medium
CN108564968A (en) * 2018-04-26 2018-09-21 广州势必可赢网络科技有限公司 A kind of method and device of evaluation customer service
CN111429938A (en) * 2020-03-06 2020-07-17 江苏大学 Single-channel voice separation method and device and electronic equipment
WO2022103290A1 (en) * 2020-11-12 2022-05-19 "Stc"-Innovations Limited" Method for automatic quality evaluation of speech signals using neural networks for selecting a channel in multimicrophone systems
CN112750465A (en) * 2020-12-29 2021-05-04 昆山杜克大学 Cloud language ability evaluation system and wearable recording terminal
CN112885377A (en) * 2021-02-26 2021-06-01 平安普惠企业管理有限公司 Voice quality evaluation method and device, computer equipment and storage medium
CN114220419A (en) * 2021-12-31 2022-03-22 科大讯飞股份有限公司 Voice evaluation method, device, medium and equipment
CN115512718A (en) * 2022-09-14 2022-12-23 中科猷声(苏州)科技有限公司 Voice quality evaluation method, device and system for stock voice file

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Automatic Assessment of Speech Impairment in Cantonese-Speaking People with Aphasia; Ying Qin; IEEE Journal of Selected Topics in Signal Processing; full text *
Research on Multi-label Classification Methods for Environmental Audio Based on Deep Learning; Ma Wen; China Master's Theses Full-text Database; full text *

Also Published As

Publication number Publication date
CN116564351A (en) 2023-08-08

Similar Documents

Publication Publication Date Title
CN109599093B (en) Intelligent quality inspection keyword detection method, device and equipment and readable storage medium
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
CN110782872A (en) Language identification method and device based on deep convolutional recurrent neural network
CN108564942A (en) One kind being based on the adjustable speech-emotion recognition method of susceptibility and system
CN107818797B (en) Voice quality evaluation method, device and system
CN111243602A (en) Voiceprint recognition method based on gender, nationality and emotional information
CN113012720B (en) Depression detection method by multi-voice feature fusion under spectral subtraction noise reduction
CN111429948A (en) Voice emotion recognition model and method based on attention convolution neural network
CN109408660B (en) Music automatic classification method based on audio features
US9786300B2 (en) Single-sided speech quality measurement
CN109979486B (en) Voice quality assessment method and device
CN102789779A (en) Speech recognition system and recognition method thereof
CN111933113B (en) Voice recognition method, device, equipment and medium
CN111599339B (en) Speech splicing synthesis method, system, equipment and medium with high naturalness
CN111508505A (en) Speaker identification method, device, equipment and storage medium
CN116564351B (en) Voice dialogue quality evaluation method and system and portable electronic equipment
CN115497455B (en) Intelligent evaluating method, system and device for oral English examination voice
CN116884427A (en) Embedded vector processing method based on end-to-end deep learning voice re-etching model
Novotney et al. Analysis of low-resource acoustic model self-training
CN116230018A (en) Synthetic voice quality evaluation method for voice synthesis system
Duong Development of accent recognition systems for Vietnamese speech
CN113035236B (en) Quality inspection method and device for voice synthesis data
CN115881156A (en) Multi-scale-based multi-modal time domain voice separation method
CN111785236A (en) Automatic composition method based on motivational extraction model and neural network
CN112767968B (en) Voice objective evaluation optimal feature group screening method based on discriminative complementary information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant