CN116504269A - Pronunciation evaluation method and device, readable medium and electronic equipment - Google Patents

Pronunciation evaluation method and device, readable medium and electronic equipment

Info

Publication number
CN116504269A
CN116504269A (application CN202310460583.5A)
Authority
CN
China
Prior art keywords
sample
target
evaluation
phoneme
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310460583.5A
Other languages
Chinese (zh)
Inventor
李亮亮
李伟
高绍钧
田霄海
付凯奇
马泽君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Lemon Inc Cayman Island
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Lemon Inc Cayman Island
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd, Lemon Inc Cayman Island filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202310460583.5A priority Critical patent/CN116504269A/en
Publication of CN116504269A publication Critical patent/CN116504269A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/26: Speech to text systems
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques using neural networks
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques for comparison or discrimination
    • G10L2015/025: Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The disclosure relates to a pronunciation evaluation method, a pronunciation evaluation device, a readable medium and an electronic device, wherein the pronunciation evaluation method comprises the following steps: acquiring target voice to be evaluated and a target text corresponding to the target voice; determining a plurality of target phoneme features corresponding to the target voice according to the target voice and the target text; inputting the multiple target phoneme features and at least one target evaluation task into a pre-generated pronunciation evaluation model to obtain an evaluation value corresponding to each target evaluation task output by the pronunciation evaluation model; the pronunciation evaluation model is obtained by training a target neural network model through a plurality of sample sets, wherein the sample sets comprise sample phoneme information of each sample phoneme of a sample text, sample phoneme characteristics of each sample phoneme in sample voice, a plurality of sample evaluation tasks and sample evaluation values corresponding to each sample evaluation task, and the sample phoneme information comprises sample phoneme identification and sample position information.

Description

Pronunciation evaluation method and device, readable medium and electronic equipment
Technical Field
The disclosure relates to the technical field of voice recognition, in particular to a pronunciation evaluation method and device, a readable medium and electronic equipment.
Background
With the development of the internet, internet-based language learning applications have also developed rapidly. Speech evaluation is an important technology for helping autonomous language learners: it can evaluate a learner's pronunciation in terms of accuracy, fluency, integrity, prosody, and the like.
In the related art, pronunciation is evaluated by evaluation models, with different evaluation models evaluating pronunciation from different dimensions; for example, the accuracy of pronunciation is evaluated by an accuracy evaluation model, and the integrity of pronunciation is evaluated by an integrity evaluation model. Because each such model evaluates a single dimension in isolation, the feature information shared among different evaluation dimensions is not exploited, which limits evaluation accuracy. Therefore, how to improve the accuracy of pronunciation evaluation is a problem that urgently needs to be solved.
Disclosure of Invention
This section is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This section is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a pronunciation assessment method, including:
acquiring target voice to be evaluated and a target text corresponding to the target voice;
determining a plurality of target phoneme features corresponding to the target voice according to the target voice and the target text, wherein the target phoneme features are used for representing, in the target voice, the phoneme features of target phonemes in the target text, and different target phonemes correspond to different target phoneme features;
inputting a plurality of target phoneme features and at least one target evaluation task into a pre-generated pronunciation evaluation model to obtain an evaluation value corresponding to each target evaluation task output by the pronunciation evaluation model;
the pronunciation evaluation model is obtained by training a target neural network model through a plurality of sample sets, the sample sets comprise sample phoneme information of each sample phoneme of a sample text, sample phoneme characteristics of each sample phoneme in sample voice, a plurality of sample evaluation tasks and sample evaluation values corresponding to each sample evaluation task, the sample phoneme information comprises sample phoneme identification and sample position information, and the sample position information is used for representing positions of the sample phonemes in sample phoneme sequences corresponding to the sample text.
In a second aspect, the present disclosure provides a pronunciation assessment device, comprising:
the first acquisition module is used for acquiring target voice to be evaluated and target text corresponding to the target voice;
the determining module is used for determining a plurality of target phoneme features corresponding to the target voice according to the target voice and the target text, wherein the target phoneme features are used for representing the phoneme features of target phonemes in the target voice, and different target phonemes correspond to different target phoneme features;
the second acquisition module is used for inputting a plurality of target phoneme features and at least one target evaluation task into a pre-generated pronunciation evaluation model so as to acquire an evaluation value corresponding to each target evaluation task output by the pronunciation evaluation model;
the pronunciation evaluation model is obtained by training a target neural network model through a plurality of sample sets, the sample sets comprise sample phoneme information of each sample phoneme of a sample text, sample phoneme characteristics of each sample phoneme in sample voice, a plurality of sample evaluation tasks and sample evaluation values corresponding to each sample evaluation task, the sample phoneme information comprises sample phoneme identification and sample position information, and the sample position information is used for representing positions of the sample phonemes in sample phoneme sequences corresponding to the sample text.
In a third aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which when executed by a processing device performs the steps of the method of the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides an electronic device comprising:
a storage device having a computer program stored thereon;
processing means for executing said computer program in said storage means to carry out the steps of the method of the first aspect of the disclosure.
Through the technical scheme, the target voice to be evaluated and the target text corresponding to the target voice are obtained; a plurality of target phoneme features corresponding to the target voice are determined according to the target voice and the target text, wherein the target phoneme features are used for representing, in the target voice, the phoneme features of target phonemes in the target text, and different target phonemes correspond to different target phoneme features; the plurality of target phoneme features and at least one target evaluation task are input into a pre-generated pronunciation evaluation model to obtain an evaluation value corresponding to each target evaluation task output by the pronunciation evaluation model; the pronunciation evaluation model is obtained by training a target neural network model through a plurality of sample sets, the sample sets comprise sample phoneme information of each sample phoneme of a sample text, sample phoneme characteristics of each sample phoneme in sample voice, a plurality of sample evaluation tasks and sample evaluation values corresponding to each sample evaluation task, the sample phoneme information comprises sample phoneme identification and sample position information, and the sample position information is used for representing the position of the sample phoneme in the sample phoneme sequence corresponding to the sample text. That is, the present disclosure inputs a plurality of target phoneme features corresponding to a target speech and at least one target evaluation task simultaneously into a pronunciation evaluation model, and performs pronunciation evaluation for at least one target evaluation task through the pronunciation evaluation model, so that the feature information of different evaluation tasks can complement one another and the accuracy of pronunciation evaluation is improved.
Additional features and advantages of the present disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flow chart illustrating a pronunciation assessment method according to an exemplary embodiment of the present disclosure;
FIG. 2 is a flow chart illustrating a model training method according to an exemplary embodiment of the present disclosure;
FIG. 3 is a schematic diagram illustrating a model training method according to an exemplary embodiment of the present disclosure;
FIG. 4 is a flowchart illustrating a model training step according to an exemplary embodiment of the present disclosure;
FIG. 5 is a block diagram of a pronunciation assessment device according to an exemplary embodiment of the present disclosure;
FIG. 6 is a block diagram of another pronunciation assessment device, according to an exemplary embodiment of the present disclosure;
fig. 7 is a block diagram of an electronic device, according to an exemplary embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "a", "an", and "a plurality" in this disclosure are illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
All actions in this disclosure to obtain signals, information or data are performed in compliance with the corresponding data protection legislation policies of the country of location and to obtain authorization granted by the owner of the corresponding device.
It will be appreciated that, before the technical solutions disclosed in the embodiments of the present disclosure are used, the user should be informed, in an appropriate manner according to relevant laws and regulations, of the type, scope of use, usage scenarios, etc. of the personal information involved in the present disclosure, and the user's authorization should be obtained.
For example, in response to receiving an active request from a user, prompt information is sent to the user to explicitly remind the user that the operation being requested will require the acquisition and use of the user's personal information. The user can thus autonomously choose, according to the prompt information, whether to provide personal information to the software or hardware, such as an electronic device, application program, server, or storage medium, that executes the operations of the technical scheme of the present disclosure.
As an alternative but non-limiting implementation, in response to receiving an active request from the user, the prompt information may be sent to the user by means of, for example, a popup window, in which the prompt information may be presented as text. In addition, the popup window may carry a selection control for the user to choose whether to "agree" or "disagree" to provide personal information to the electronic device.
It will be appreciated that the above-described notification and user authorization process is merely illustrative and not limiting of the implementations of the present disclosure, and that other ways of satisfying relevant legal regulations may be applied to the implementations of the present disclosure.
Meanwhile, it can be understood that the data (including but not limited to the data itself, the acquisition or the use of the data) related to the technical scheme should conform to the requirements of the corresponding laws and regulations and related regulations.
The following detailed description of specific embodiments of the present disclosure refers to the accompanying drawings.
FIG. 1 is a flow chart of a pronunciation assessment method, as shown in FIG. 1, according to an exemplary embodiment of the present disclosure, which may include:
s101, acquiring target voice to be evaluated and target text corresponding to the target voice.
In this step, the target text can be displayed through the electronic equipment, a voice signal input by the user for the target text is collected through a microphone of the electronic equipment, and the collected voice signal is used as the target voice. After the voice signal is collected, it may also be preprocessed, for example by noise reduction, to obtain the target voice, where the preprocessing may use any prior-art signal processing method, which is not limited in this disclosure.
S102, determining a plurality of target phoneme features corresponding to the target voice according to the target voice and the target text.
The target phoneme features may be used to represent, in the target speech, the phoneme features of the target phonemes in the target text, where different target phonemes correspond to different target phoneme features. The target phoneme feature may be, for example, a GoP (Goodness of Pronunciation) feature.
In this step, after the target speech and the target text corresponding to the target speech are obtained, the target phoneme feature of each target phoneme of the target text in the target speech may be determined according to the target speech and the target text. Taking the target phoneme feature as a GoP feature as an example, if the target text is "Its Name", the target phoneme sequence corresponding to the target text is "IH T S | N EY M". The target phoneme sequence contains 6 target phonemes, and the plurality of target phoneme features corresponding to the target speech can be expressed as "GoP[IH], GoP[T], GoP[S], GoP[N], GoP[EY], GoP[M]".
In one possible implementation, the target speech and the target text may be input into a pre-generated phoneme feature acquisition model to acquire the plurality of target phoneme features output by the phoneme feature acquisition model. The phoneme feature acquisition model may be, for example, a prior-art ASR (Automatic Speech Recognition) acoustic model.
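As a concrete illustration of the GoP features named above, the following is a minimal sketch of one common way to derive a per-phoneme GoP value from the frame-level phone posteriors of an ASR acoustic model and a forced alignment; the phone inventory, the alignment format, and the scalar GoP definition are assumptions for illustration, not the implementation specified by this disclosure.

```python
import numpy as np

PHONES = ["IH", "T", "S", "N", "EY", "M"]  # canonical phones of "Its Name"

def gop_per_phoneme(posteriors: np.ndarray,
                    alignment: list[tuple[int, int, int]]) -> np.ndarray:
    """posteriors: (frames, phones) frame-level phone posteriors.
    alignment: (phone_index, start_frame, end_frame) triples from forced alignment."""
    feats = []
    for phone_idx, start, end in alignment:
        seg = posteriors[start:end]                       # frames aligned to this phone
        log_canonical = np.log(seg[:, phone_idx]).mean()  # avg log posterior of expected phone
        log_best = np.log(seg.max(axis=1)).mean()         # avg log posterior of best phone
        feats.append(log_canonical - log_best)            # classic scalar GoP score
    return np.asarray(feats)

# Toy usage: random posteriors over 60 frames, 10 frames per phone.
rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(len(PHONES)), size=60)
align = [(i, i * 10, (i + 1) * 10) for i in range(len(PHONES))]
print(gop_per_phoneme(post, align))  # one GoP value per target phoneme
```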
S103, inputting a plurality of target phoneme features and at least one target evaluation task into a pre-generated pronunciation evaluation model to obtain an evaluation value corresponding to each target evaluation task output by the pronunciation evaluation model.
The pronunciation evaluation model can be obtained by training a target neural network model through a plurality of sample sets, wherein the sample sets comprise sample phoneme information of each sample phoneme of a sample text, sample phoneme characteristics of each sample phoneme in sample speech, a plurality of sample evaluation tasks and sample evaluation values corresponding to each sample evaluation task, the sample phoneme information comprises sample phoneme identification and sample position information, and the sample position information is used for representing positions of the sample phonemes in sample phoneme sequences corresponding to the sample text.
In this step, after the plurality of target phoneme features corresponding to the target speech are determined, at least one target evaluation task for which the target speech needs to be evaluated may be determined. In one possible implementation, the target evaluation tasks may be a plurality of preset evaluation tasks, i.e., the evaluation tasks for which the pronunciation evaluation model is capable of performing pronunciation evaluation. For example, the preset evaluation tasks may include phoneme evaluation, word evaluation, and sentence evaluation; the phoneme evaluation and the word evaluation may include accuracy evaluation, and the sentence evaluation may include fluency evaluation, integrity evaluation, prosody evaluation, and the like, which is not limited in the present disclosure.
In another possible implementation, at least one target evaluation task may be determined from the plurality of preset evaluation tasks according to the number of phonemes the user selects in the target text. Continuing with the target text of step S102: if the user selects one target phoneme in the target text, the target evaluation task may be phoneme evaluation, that is, evaluating the pronunciation accuracy of that target phoneme in the target speech; for example, if the user selects the target phoneme "S", the target evaluation task is to evaluate the accuracy of the target phoneme "S". If the user selects the target word "Name", the target evaluation task is to evaluate the accuracy of the target word "Name". If the user selects all the target phonemes in the target text, the target evaluation task may be sentence evaluation, that is, evaluating the fluency, integrity, and prosody of the entire sentence of the target speech.
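This selection rule can be sketched as a small helper; the task names and the span convention below are assumptions for illustration only.

```python
# Hypothetical mapping from the user's selected phoneme span to target
# evaluation tasks, following the rule described above.
def select_tasks(selected: range, total_phonemes: int) -> list[str]:
    if len(selected) == 1:
        return ["phoneme_accuracy"]                 # a single phoneme was selected
    if len(selected) == total_phonemes:
        return ["fluency", "integrity", "prosody"]  # the whole sentence was selected
    return ["word_accuracy"]                        # a word-sized span was selected

print(select_tasks(range(2, 3), 6))  # phoneme "S"   -> ['phoneme_accuracy']
print(select_tasks(range(3, 6), 6))  # word "Name"   -> ['word_accuracy']
print(select_tasks(range(0, 6), 6))  # full sentence -> sentence-level tasks
```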
After at least one target evaluation task is determined, the plurality of target phoneme features and the at least one target evaluation task can be input into the pronunciation evaluation model. For each target evaluation task, the pronunciation evaluation model performs feature extraction on the plurality of target phoneme features to obtain the target phoneme evaluation features corresponding to the target evaluation task, pools the target phoneme evaluation features according to the target evaluation task to obtain the target pooling features corresponding to the target evaluation task, and determines the evaluation value corresponding to the target evaluation task according to the target pooling features.
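A self-contained PyTorch sketch of this forward pipeline (per-task feature extraction, task-dependent pooling, and scoring) follows; the Transformer encoder, the layer sizes, the task embedding, and the linear scoring head are illustrative assumptions, not the exact architecture disclosed here.

```python
import torch
import torch.nn as nn

class PronunciationEvaluator(nn.Module):
    def __init__(self, feat_dim=32, n_tasks=3, d_model=64):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)        # embed the phoneme (GoP) features
        self.task_emb = nn.Embedding(n_tasks, d_model)  # embed the evaluation task
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.score = nn.Linear(d_model, 1)              # evaluation value for the task

    def forward(self, phone_feats, task_id, span):
        # phone_feats: (1, n_phonemes, feat_dim); span: (start, end) phoneme range
        x = self.proj(phone_feats)
        t = self.task_emb(task_id).unsqueeze(1)            # prepend a task token
        h = self.encoder(torch.cat([t, x], dim=1))[:, 1:]  # per-phoneme evaluation features
        pooled = h[:, span[0]:span[1] + 1].mean(dim=1)     # pool over the task's range
        return self.score(pooled).squeeze(-1)

model = PronunciationEvaluator()
feats = torch.randn(1, 6, 32)                   # 6 target phonemes of "Its Name"
print(model(feats, torch.tensor([2]), (0, 5)))  # sentence-level evaluation value
```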
By adopting the method, the plurality of target phoneme features corresponding to the target speech and at least one target evaluation task are input into the pronunciation evaluation model at the same time, and pronunciation evaluation is performed for at least one target evaluation task through the pronunciation evaluation model, so that the feature information of different evaluation tasks can complement one another and the accuracy of pronunciation evaluation is improved.
FIG. 2 is a flow chart of a model training method, as shown in FIG. 2, according to an exemplary embodiment of the present disclosure, which may include:
s21, acquiring a plurality of sample sets, and determining a current sample set from the plurality of sample sets.
In this step, a plurality of sample texts and sample voices corresponding to each sample text may be acquired, and the sample voices may be acquired by a plurality of users for the sample texts. For each sample text, sample phoneme information of each sample phoneme in the sample text, sample phoneme characteristics of each sample phoneme in sample speech, a plurality of sample evaluation tasks, and a sample evaluation value corresponding to each sample evaluation task may be determined. After obtaining the plurality of sample sets, any one of the plurality of sample sets may be taken as the current sample set.
Taking any sample set as an example, the sample text of the sample set and the sample voice corresponding to the sample text may be input into the phoneme feature acquisition model to obtain a plurality of sample phoneme features output by the phoneme feature acquisition model. For each sample phoneme in the sample text, the sample phoneme identification of the sample phoneme can be determined through a preset identification association relationship, where the identification association relationship includes the correspondence between different phonemes and phoneme identifications.
Fig. 3 is a schematic diagram of a model training method according to an exemplary embodiment of the present disclosure. As shown in fig. 3, if the sample text is "Its Name", the sample phoneme sequence corresponding to the sample text is "IH T S | N EY M", and the sample phoneme sequence includes 6 sample phonemes. The sample text and the sample voice are input into the phoneme feature acquisition model to obtain a plurality of sample phoneme features corresponding to the sample voice: GoP[IH], GoP[T], GoP[S], GoP[N], GoP[EY], GoP[M]. The sample phoneme identification of the sample phoneme "IH" may be denoted as "Phn[IH]", that of the sample phoneme "T" as "Phn[T]", that of the sample phoneme "S" as "Phn[S]", that of the sample phoneme "N" as "Phn[N]", that of the sample phoneme "EY" as "Phn[EY]", and that of the sample phoneme "M" as "Phn[M]". For each sample phoneme in the sample text, the sample position information of the sample phoneme may be determined according to the position of the sample phoneme in the sample phoneme sequence of the sample text: the sample position information of the sample phoneme "IH" may be represented as "Pos[0]", that of "T" as "Pos[1]", that of "S" as "Pos[2]", that of "N" as "Pos[3]", that of "EY" as "Pos[4]", and that of "M" as "Pos[5]".
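The construction of the sample phoneme identifications and sample position information can be sketched as follows; the phone-to-identifier table standing in for the preset identification association relationship, and the string forms, are assumptions.

```python
# Illustrative identification association relationship: phoneme -> identifier.
PHONE_ID = {p: i for i, p in enumerate(["IH", "T", "S", "N", "EY", "M"])}

def phoneme_info(sequence: list[str]) -> list[dict]:
    """Build sample phoneme information: identifier plus position in the sequence."""
    return [
        {"phn": f"Phn[{p}]", "pos": f"Pos[{i}]", "id": PHONE_ID[p]}
        for i, p in enumerate(sequence)
    ]

for row in phoneme_info(["IH", "T", "S", "N", "EY", "M"]):
    print(row)  # e.g. {'phn': 'Phn[IH]', 'pos': 'Pos[0]', 'id': 0}
```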
The plurality of sample evaluation tasks in different sample sets may be the same or different, which is not limited in the present disclosure. For each sample set, the plurality of sample evaluation tasks of the sample voice of the sample set can be labeled manually, so that the sample evaluation value corresponding to each sample evaluation task is obtained.
S22, cyclically performing a model training step according to the current sample set until it is determined, according to a plurality of sample evaluation values and a plurality of current predicted evaluation values of the current sample set, that the trained target neural network model meets a preset stop iteration condition, and taking the trained target neural network model as the pronunciation evaluation model.
The current prediction evaluation value is an evaluation value corresponding to the sample evaluation task output after the current sample set is input into the trained target neural network model. The preset stop iteration condition may be any stop iteration condition in the prior art, which is not limited by the present disclosure.
In this step, the model training step may be performed according to the current sample set, the current predicted evaluation value corresponding to each sample evaluation task is output through the target neural network model, and whether the target neural network model meets the preset stop iteration condition is determined according to the plurality of sample evaluation values and the plurality of current predicted evaluation values. If the target neural network model meets the preset stop iteration condition, the target neural network model can be used as the pronunciation evaluation model; if it is determined that the target neural network model does not meet the preset stop iteration condition, a new current sample set can be determined from the plurality of sample sets, and the model training step is continued according to the new current sample set until it is determined that the trained target neural network model meets the preset stop iteration condition.
FIG. 4 is a flowchart illustrating a model training step, as shown in FIG. 4, according to an exemplary embodiment of the present disclosure, which may include:
s1, acquiring a current predicted evaluation value corresponding to each sample evaluation task in the current sample set through the target neural network model.
For example, the current sample set may be input into the target neural network model to obtain a current predicted evaluation value corresponding to each of the sample evaluation tasks output by the target neural network model.
In one possible implementation manner, for each sample evaluation task, feature extraction may be performed on a plurality of current sample phoneme features of the current sample set and a plurality of current sample phoneme information of the current sample set through the target neural network model according to the sample evaluation task, so as to obtain phoneme evaluation features corresponding to the sample evaluation task, and a current predicted evaluation value corresponding to the sample evaluation task is determined according to the phoneme evaluation features and the sample evaluation task.
The target neural network model may include a feature extraction sub-model. For each current sample phoneme of the current sample set, the current sample phoneme feature of the current sample phoneme and the current sample phoneme information of the current sample phoneme may be spliced to obtain current sample splicing information; feature extraction is then performed on the current sample splicing information through the feature extraction sub-model according to the sample evaluation task to obtain the phoneme evaluation features corresponding to the sample evaluation task.
For example, continuing with the model training schematic shown in fig. 3, the current sample phoneme sequence corresponding to the current sample text is "IH T S | N EY M". For the current sample phoneme "IH" in the current sample phoneme sequence, the current sample phoneme feature "GoP[IH]", the current sample phoneme identification "Phn[IH]", and the current sample position information "Pos[0]" of the current sample phoneme may be spliced to obtain the current sample splicing information of the current sample phoneme "IH". The other current sample phonemes in the current sample phoneme sequence can be processed in the same manner, so that the current sample splicing information corresponding to each current sample phoneme in the current sample phoneme sequence is obtained, i.e., 6 pieces of current sample splicing information in total.
After the current sample splicing information corresponding to each current sample phoneme of the current sample set is determined, for each sample evaluation task of the current sample set, the sample evaluation task and the plurality of pieces of current sample splicing information may be spliced to obtain current sample evaluation information, which is then input into the feature extraction sub-model for feature extraction to obtain the phoneme evaluation features corresponding to the sample evaluation task.
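A minimal sketch of this splicing step follows, assuming the GoP feature is a vector and the phoneme identification and position are looked up in embedding tables (the table and feature sizes are arbitrary assumptions); prepending the evaluation task, shown in the earlier sketch, is omitted here.

```python
import torch

gop = torch.randn(6, 32)                    # GoP[IH] ... GoP[M] feature vectors
phn_table = torch.nn.Embedding(40, 16)      # lookup by sample phoneme identification
pos_table = torch.nn.Embedding(64, 16)      # lookup by sample position information
phn_ids = torch.tensor([0, 1, 2, 3, 4, 5])  # Phn[IH], Phn[T], ...
pos_ids = torch.tensor([0, 1, 2, 3, 4, 5])  # Pos[0], Pos[1], ...

# One spliced (concatenated) vector per current sample phoneme.
splice = torch.cat([gop, phn_table(phn_ids), pos_table(pos_ids)], dim=-1)
print(splice.shape)  # torch.Size([6, 64]): 6 pieces of current sample splicing information
```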
In one possible implementation, the target neural network model includes a pooling layer and an evaluation sub-model, wherein the output of the pooling layer is coupled to the input of the evaluation sub-model. For each sample evaluation task, after the phoneme evaluation features corresponding to the sample evaluation task are determined, a target sample phoneme may be determined from the plurality of current sample phonemes of the current sample set according to the sample evaluation task; the phoneme evaluation features are pooled according to the target sample phoneme to obtain the sample pooling features corresponding to the sample evaluation task; and the current predicted evaluation value corresponding to the sample evaluation task is determined through the evaluation sub-model according to the sample pooling features.
For example, a sample phoneme range corresponding to the sample evaluation task may be determined through a preset range association relationship, where the range association relationship includes the correspondence between different evaluation tasks and phoneme ranges; the target sample phoneme is then determined from the plurality of current sample phonemes according to the sample phoneme range.
Continuing to take the current sample text 'Its Name' as an example, if the sample evaluation task is phoneme evaluation, the sample phoneme range can be [0,0], [1,1], [2,2], [3,3], [4,4], [5,5], which indicates that the accuracy of each current sample phoneme of the current sample speech is evaluated; if the sample evaluation task is word evaluation, the sample phoneme range can be [0,2], [3,5], which represents that the accuracy of each word of the current sample voice is evaluated; if the sample evaluation task is sentence evaluation, the sample phoneme range may be [0,5], which indicates that the fluency, completeness, prosody of the entire sentence of the current sample speech is evaluated.
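Written out as a plain mapping, the preset range association relationship for "Its Name" might look like this; the key names are assumptions.

```python
# Correspondence between evaluation tasks and sample phoneme ranges.
RANGE_ASSOCIATION = {
    "phoneme_evaluation":  [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5)],
    "word_evaluation":     [(0, 2), (3, 5)],  # "Its" -> IH T S, "Name" -> N EY M
    "sentence_evaluation": [(0, 5)],          # the whole sample phoneme sequence
}
print(RANGE_ASSOCIATION["word_evaluation"])
```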
After the sample phoneme range corresponding to each sample evaluation task is determined, the boundary range corresponding to the sample evaluation task may be determined according to the sample phoneme range, and pooling processing may be performed on the phoneme evaluation features corresponding to the sample evaluation task according to the boundary range, for example by means of prior-art boundary pooling or average pooling, to obtain the sample pooling features corresponding to the sample evaluation task. Scoring is then performed through the evaluation sub-model according to the sample pooling features to obtain the current predicted evaluation value corresponding to the sample evaluation task. The evaluation sub-model may be, for example, a linear mapping layer.
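A hedged sketch of this pooling and scoring step, using average pooling over each sample phoneme range and a linear mapping layer as the evaluation sub-model; the feature size is an arbitrary assumption.

```python
import torch

h = torch.randn(6, 64)               # phoneme evaluation features for "Its Name"
score_head = torch.nn.Linear(64, 1)  # the evaluation sub-model (linear mapping layer)

def pooled_scores(ranges):
    # Average-pool the features inside each boundary range, then score.
    return [score_head(h[s:e + 1].mean(dim=0)).item() for s, e in ranges]

print(pooled_scores([(0, 2), (3, 5)]))  # one predicted value per word of "Its Name"
```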
S2, in a case where it is determined, according to the plurality of current predicted evaluation values and the plurality of sample evaluation values, that the target neural network model does not meet the preset stop iteration condition: determining a target loss value according to the plurality of current predicted evaluation values and the plurality of sample evaluation values, updating the parameters of the target neural network model according to the target loss value to obtain a trained target neural network model, taking the trained target neural network model as a new target neural network model, and determining a new current sample set from the plurality of sample sets.
For example, after the plurality of current predicted evaluation values are determined, the target loss value may be determined according to the plurality of current predicted evaluation values and the plurality of sample evaluation values. For each sample evaluation task, a task loss value corresponding to the sample evaluation task may be determined according to the current predicted evaluation value and the sample evaluation value corresponding to the sample evaluation task; then, the average value of the plurality of task loss values may be taken as the target loss value, or the maximum task loss value among the plurality of task loss values may be taken as the target loss value, or different weights may be set for different sample evaluation tasks and the target loss value determined according to the weights and the task loss values. The above manners of determining the target loss value are merely examples, and the disclosure is not limited thereto.
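The three aggregation options above, together with the threshold check of the following step, can be sketched as below; the per-task squared-error loss, the weights, and the threshold value are assumptions.

```python
import torch

pred = {"phoneme": torch.tensor(0.7), "word": torch.tensor(0.6), "sentence": torch.tensor(0.8)}
gold = {"phoneme": torch.tensor(0.9), "word": torch.tensor(0.8), "sentence": torch.tensor(0.7)}
weights = {"phoneme": 0.5, "word": 0.3, "sentence": 0.2}

task_losses = {k: (pred[k] - gold[k]).pow(2) for k in pred}     # per-task loss values
mean_loss = torch.stack(list(task_losses.values())).mean()      # option 1: average
max_loss = torch.stack(list(task_losses.values())).max()        # option 2: maximum
weighted = sum(w * task_losses[k] for k, w in weights.items())  # option 3: weighted

target_loss = mean_loss
stop = target_loss.item() <= 0.01  # assumed preset loss value threshold
print(float(mean_loss), float(max_loss), float(weighted), stop)
```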
After the target loss value is determined, whether the target loss value is smaller than or equal to a preset loss value threshold can be determined. If the target loss value is smaller than or equal to the preset loss value threshold, it may be determined that the target neural network model meets the preset stop iteration condition; if the target loss value is greater than the preset loss value threshold, it may be determined that the target neural network model does not meet the preset stop iteration condition, the parameters of the target neural network model are updated according to the target loss value to obtain a trained target neural network model, the trained target neural network model is taken as a new target neural network model, a new current sample set is determined from the plurality of sample sets, and the steps S1 to S2 are continued according to the new current sample set.
By adopting the model training method, in the process of training the pronunciation evaluation model, the feature extraction sub-model not only extracts the feature information of phonemes, but also extracts the feature information of evaluation tasks, so that the feature extraction sub-model can utilize the feature information among different evaluation tasks to realize complementation of the feature information among different evaluation tasks, and the accuracy of the pronunciation evaluation model is improved; in addition, the pronunciation evaluation model can fully utilize the characteristic information among different evaluation tasks, so that the dependence on the number of training samples is reduced, and a pronunciation evaluation model with higher accuracy can be generated under the condition that the number of training samples is limited.
Fig. 5 is a block diagram of a pronunciation assessment device according to an exemplary embodiment of the present disclosure, as shown in fig. 5, the device may include:
a first obtaining module 501, configured to obtain a target voice to be evaluated and a target text corresponding to the target voice;
a determining module 502, configured to determine, according to the target speech and the target text, a plurality of target phoneme features corresponding to the target speech, where the target phoneme features are used to represent, in the target speech, the phoneme features of target phonemes in the target text, and different target phonemes correspond to different target phoneme features;
A second obtaining module 503, configured to input a plurality of the target phoneme features and at least one target evaluation task into a pre-generated pronunciation evaluation model, so as to obtain an evaluation value corresponding to each target evaluation task output by the pronunciation evaluation model;
the pronunciation evaluation model is obtained by training a target neural network model through a plurality of sample sets, the sample sets comprise sample phoneme information of each sample phoneme of a sample text, sample phoneme characteristics of each sample phoneme in sample voice, a plurality of sample evaluation tasks and sample evaluation values corresponding to each sample evaluation task, the sample phoneme information comprises sample phoneme identification and sample position information, and the sample position information is used for representing positions of the sample phonemes in sample phoneme sequences corresponding to the sample text.
Optionally, fig. 6 is a block diagram of another pronunciation assessment device according to an exemplary embodiment of the present disclosure, and as shown in fig. 6, the device further includes:
model training module 504 for:
acquiring a plurality of sample sets, and determining a current sample set from the plurality of sample sets;
according to the current sample set, a model training step is circularly executed until a trained target neural network model is determined to meet a preset stopping iteration condition according to a plurality of sample evaluation values and a plurality of current prediction evaluation values of the current sample set, the trained target neural network model is used as the pronunciation evaluation model, and the current prediction evaluation value is an evaluation value corresponding to the sample evaluation task output after the current sample set is input into the trained target neural network model;
The model training step comprises the following steps:
acquiring a current predicted evaluation value corresponding to each sample evaluation task in the current sample set through the target neural network model;
in a case where it is determined, according to the plurality of current predicted evaluation values and the plurality of sample evaluation values, that the target neural network model does not meet the preset stop iteration condition, determining a target loss value according to the plurality of current predicted evaluation values and the plurality of sample evaluation values, updating parameters of the target neural network model according to the target loss value to obtain a trained target neural network model, taking the trained target neural network model as a new target neural network model, and determining a new current sample set from the plurality of sample sets.
Optionally, the model training module 504 is further configured to:
for each sample evaluation task, extracting features of a plurality of current sample phoneme features of the current sample set and a plurality of current sample phoneme information of the current sample set through the target neural network model according to the sample evaluation task to obtain phoneme evaluation features corresponding to the sample evaluation task, and determining a current predicted evaluation value corresponding to the sample evaluation task according to the phoneme evaluation features and the sample evaluation task.
Optionally, the target neural network model includes a feature extraction sub-model, and the model training module 504 is further configured to:
aiming at each current sample phoneme of the current sample set, carrying out splicing processing on the current sample phoneme characteristics of the current sample phoneme and the current sample phoneme information of the current sample phoneme to obtain current sample splicing information;
and carrying out feature extraction on the current sample splicing information through the feature extraction sub-model according to the sample evaluation task to obtain phoneme evaluation features corresponding to the sample evaluation task.
Optionally, the target neural network model includes a pooling layer and an evaluation submodel, an output of the pooling layer being coupled to an input of the evaluation submodel, the model training module 504 being further configured to:
determining a target sample phoneme from a plurality of the current sample phonemes of the current sample set according to the sample evaluation task;
according to the target sample phonemes, pooling processing is carried out on the phoneme evaluation characteristics to obtain sample pooling characteristics corresponding to the sample evaluation tasks;
and according to the sample pooling characteristics, determining the current predicted evaluation value corresponding to the sample evaluation task through the evaluation sub-model.
Optionally, the model training module 504 is further configured to:
determining a sample phoneme range corresponding to the sample evaluation task through a preset range association relationship, wherein the range association relationship comprises the correspondence relationship between different evaluation tasks and phoneme ranges;
the target sample phone is determined from a plurality of the current sample phones based on the sample phone range.
Optionally, the determining module 502 is further configured to:
the target speech and the target text are input into a pre-generated phoneme feature acquisition model to acquire a plurality of the target phoneme features output by the phoneme feature acquisition model.
Through the device, the plurality of target phoneme features corresponding to the target speech and at least one target evaluation task are input into the pronunciation evaluation model at the same time, and pronunciation evaluation is performed for at least one target evaluation task through the pronunciation evaluation model, so that the feature information of different evaluation tasks can complement one another and the accuracy of pronunciation evaluation is improved.
The specific manner in which the various modules perform the operations in the apparatus of the above embodiments have been described in detail in connection with the embodiments of the method, and will not be described in detail herein.
Referring now to fig. 7, a schematic diagram of an electronic device 700 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 7 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 7, the electronic device 700 may include a processing means (e.g., a central processor, a graphics processor, etc.) 701, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage means 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the electronic device 700 are also stored. The processing device 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
In general, the following devices may be connected to the I/O interface 705: input devices 706 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; an output device 707 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 708 including, for example, magnetic tape, hard disk, etc.; and a communication device 709. The communication means 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. While fig. 7 shows an electronic device 700 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via communication device 709, or installed from storage 708, or installed from ROM 702. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 701.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some implementations, the clients and servers may communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), internetworks (e.g., the internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire target voice to be evaluated and a target text corresponding to the target voice; determine a plurality of target phoneme features corresponding to the target voice according to the target voice and the target text, wherein the target phoneme features are used for representing, in the target voice, the phoneme features of target phonemes in the target text, and different target phonemes correspond to different target phoneme features; input the plurality of target phoneme features and at least one target evaluation task into a pre-generated pronunciation evaluation model to obtain an evaluation value corresponding to each target evaluation task output by the pronunciation evaluation model; the pronunciation evaluation model is obtained by training a target neural network model through a plurality of sample sets, the sample sets comprise sample phoneme information of each sample phoneme of a sample text, sample phoneme characteristics of each sample phoneme in sample voice, a plurality of sample evaluation tasks and sample evaluation values corresponding to each sample evaluation task, the sample phoneme information comprises sample phoneme identification and sample position information, and the sample position information is used for representing the position of the sample phoneme in the sample phoneme sequence corresponding to the sample text.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages or combinations thereof, including, but not limited to, object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented in software or hardware. The name of the module is not limited to the module itself in some cases, and for example, the first acquisition module may also be described as "a module for acquiring a target voice to be evaluated and a target text corresponding to the target voice".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, example 1 provides a pronunciation assessment method, comprising: acquiring a target voice to be evaluated and a target text corresponding to the target voice; determining, according to the target voice and the target text, a plurality of target phoneme features corresponding to the target voice, wherein each target phoneme feature represents the phoneme feature, in the target voice, of a target phoneme in the target text, and different target phonemes correspond to different target phoneme features; and inputting the plurality of target phoneme features and at least one target evaluation task into a pre-generated pronunciation evaluation model to obtain an evaluation value corresponding to each target evaluation task output by the pronunciation evaluation model. The pronunciation evaluation model is obtained by training a target neural network model with a plurality of sample sets. Each sample set comprises sample phoneme information of each sample phoneme of a sample text, a sample phoneme feature of each sample phoneme in a sample voice, a plurality of sample evaluation tasks, and a sample evaluation value corresponding to each sample evaluation task. The sample phoneme information comprises a sample phoneme identifier and sample position information, and the sample position information represents the position of the sample phoneme in a sample phoneme sequence corresponding to the sample text.
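For illustration only, the inference flow of example 1 can be sketched in a few lines of Python (PyTorch). Every name below (PronunciationScorer, extract_phoneme_features, the task names, the feature size) is an assumption made for this sketch; the disclosure does not prescribe a concrete architecture or API.

```python
import torch
import torch.nn as nn

FEATURE_DIM = 32                  # assumed size of one target phoneme feature
TASKS = ["accuracy", "fluency"]   # illustrative target evaluation tasks

class PronunciationScorer(nn.Module):
    """Toy stand-in for the pre-generated pronunciation evaluation model."""
    def __init__(self, feature_dim, tasks):
        super().__init__()
        # One small scoring head per evaluation task.
        self.heads = nn.ModuleDict({t: nn.Linear(feature_dim, 1) for t in tasks})

    def forward(self, phoneme_feats, task):
        # Average over the target phonemes, then apply the per-task head
        # to obtain one evaluation value for that task.
        pooled = phoneme_feats.mean(dim=0)
        return self.heads[task](pooled)

def extract_phoneme_features(speech, text):
    # Placeholder for the phoneme feature acquisition model of example 7:
    # it would map (target voice, target text) to one feature vector per
    # target phoneme. Here we fake 10 phonemes of random features.
    return torch.randn(10, FEATURE_DIM)

model = PronunciationScorer(FEATURE_DIM, TASKS)
feats = extract_phoneme_features(speech=None, text="hello world")
with torch.no_grad():
    scores = {task: model(feats, task).item() for task in TASKS}
print(scores)   # one evaluation value per target evaluation task
```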
In accordance with one or more embodiments of the present disclosure, example 2 provides the method of example 1, wherein the pronunciation evaluation model is pre-generated by: acquiring a plurality of sample sets and determining a current sample set from the plurality of sample sets; and executing a model training step in a loop according to the current sample set until it is determined, according to a plurality of sample evaluation values and a plurality of current predicted evaluation values of the current sample set, that a trained target neural network model meets a preset iteration stopping condition, the trained target neural network model then being taken as the pronunciation evaluation model, wherein the current predicted evaluation values are the evaluation values corresponding to the sample evaluation tasks that are output after the current sample set is input into the trained target neural network model. The model training step comprises: acquiring, through the target neural network model, a current predicted evaluation value corresponding to each sample evaluation task in the current sample set; and, in a case where it is determined according to the plurality of current predicted evaluation values and the plurality of sample evaluation values that the target neural network model does not meet the preset iteration stopping condition, determining a target loss value according to the plurality of current predicted evaluation values and the plurality of sample evaluation values, updating parameters of the target neural network model according to the target loss value to obtain a trained target neural network model, taking the trained target neural network model as a new target neural network model, and determining a new current sample set from the plurality of sample sets.
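The training loop of example 2 can likewise be sketched concisely. Mean-squared error as the target loss, round-robin selection of the current sample set, and a loss threshold plus step budget as the preset iteration stopping condition are all assumptions here; the disclosure specifies none of them. The sketch reuses PronunciationScorer, FEATURE_DIM and TASKS from the sketch after example 1.

```python
import torch
import torch.nn as nn

def train(model, sample_sets, tasks, max_steps=1000, loss_threshold=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for step in range(max_steps):
        # Determine the current sample set (round-robin here, for brevity).
        feats, labels = sample_sets[step % len(sample_sets)]
        # Current predicted evaluation value for every sample evaluation task.
        preds = torch.stack([model(feats, t).squeeze() for t in tasks])
        target = torch.stack([labels[t] for t in tasks])
        loss = loss_fn(preds, target)            # target loss value
        if loss.item() < loss_threshold:         # assumed stopping condition
            break
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()   # the updated model becomes the new target network
    return model           # serves as the pronunciation evaluation model

# Illustrative data: 4 sample sets of 10 phonemes each, with one scalar
# sample evaluation value per sample evaluation task.
sample_sets = [(torch.randn(10, FEATURE_DIM),
                {t: torch.rand(1).squeeze(0) for t in TASKS}) for _ in range(4)]
model = train(PronunciationScorer(FEATURE_DIM, TASKS), sample_sets, TASKS)
```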
According to one or more embodiments of the present disclosure, example 3 provides the method of example 2, wherein acquiring, through the target neural network model, the current predicted evaluation value corresponding to each sample evaluation task in the current sample set comprises: for each sample evaluation task, performing, through the target neural network model and according to the sample evaluation task, feature extraction on a plurality of current sample phoneme features of the current sample set and a plurality of pieces of current sample phoneme information of the current sample set to obtain a phoneme evaluation feature corresponding to the sample evaluation task, and determining a current predicted evaluation value corresponding to the sample evaluation task according to the phoneme evaluation feature and the sample evaluation task.
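Example 3 makes the feature extraction itself conditional on the sample evaluation task. One plausible realization (an assumption for illustration, not something the disclosure mandates) is to prepend a learned per-task token to the phoneme sequence before a shared encoder:

```python
import torch
import torch.nn as nn

D_MODEL, N_TASKS = 48, 2
task_tokens = nn.Embedding(N_TASKS, D_MODEL)   # one learned token per task
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=4, batch_first=True),
    num_layers=2)

def extract(spliced, task_id):
    # spliced: [n, D_MODEL] current sample splicing information, one row
    # per current sample phoneme (see example 4 for the splicing itself).
    token = task_tokens(torch.tensor([task_id]))            # [1, D_MODEL]
    seq = torch.cat([token, spliced], dim=0).unsqueeze(0)   # [1, n+1, D_MODEL]
    out = encoder(seq).squeeze(0)
    return out[1:]   # phoneme evaluation features, one per phoneme

print(extract(torch.randn(10, D_MODEL), task_id=0).shape)   # [10, 48]
```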
According to one or more embodiments of the present disclosure, example 4 provides the method of example 3, wherein the target neural network model comprises a feature extraction sub-model, and performing the feature extraction to obtain the phoneme evaluation feature corresponding to the sample evaluation task comprises: for each current sample phoneme of the current sample set, performing splicing processing on the current sample phoneme feature of the current sample phoneme and the current sample phoneme information of the current sample phoneme to obtain current sample splicing information; and performing feature extraction on the current sample splicing information through the feature extraction sub-model according to the sample evaluation task to obtain the phoneme evaluation feature corresponding to the sample evaluation task.
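The splicing processing of example 4 is, in essence, a per-phoneme concatenation of the phoneme feature with its phoneme information (an identifier plus a position). A minimal sketch, with illustrative sizes and an assumed embedding for the phoneme identifier:

```python
import torch
import torch.nn as nn

NUM_PHONEME_IDS, ID_DIM, FEATURE_DIM = 50, 8, 32   # illustrative sizes
id_embedding = nn.Embedding(NUM_PHONEME_IDS, ID_DIM)

def splice(phoneme_feats, phoneme_ids, positions):
    # phoneme_feats: [n, FEATURE_DIM]; phoneme_ids, positions: [n]
    id_vecs = id_embedding(phoneme_ids)          # embedded phoneme identifier
    pos_vecs = positions.unsqueeze(-1).float()   # position in the sequence
    # Current sample splicing information: [n, FEATURE_DIM + ID_DIM + 1]
    return torch.cat([phoneme_feats, id_vecs, pos_vecs], dim=-1)

feats = torch.randn(10, FEATURE_DIM)
ids = torch.randint(0, NUM_PHONEME_IDS, (10,))
pos = torch.arange(10)   # position in the sample phoneme sequence
print(splice(feats, ids, pos).shape)   # torch.Size([10, 41])
```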
According to one or more embodiments of the present disclosure, example 5 provides the method of example 3, wherein the target neural network model comprises a pooling layer and an evaluation sub-model, an output of the pooling layer is coupled to an input of the evaluation sub-model, and determining the current predicted evaluation value corresponding to the sample evaluation task according to the phoneme evaluation feature and the sample evaluation task comprises: determining a target sample phoneme from a plurality of current sample phonemes in the current sample set according to the sample evaluation task; performing pooling processing on the phoneme evaluation features according to the target sample phonemes to obtain sample pooling features corresponding to the sample evaluation task; and determining, through the evaluation sub-model and according to the sample pooling features, a current predicted evaluation value corresponding to the sample evaluation task.
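Example 5 pools the phoneme evaluation features over only the phonemes selected for the task, then scores the pooled vector. Below is a sketch using masked mean pooling and a linear evaluation sub-model, both of which are assumptions; the disclosure names neither the pooling operator nor the sub-model's form.

```python
import torch
import torch.nn as nn

EVAL_DIM = 41   # size of one phoneme evaluation feature (illustrative)
evaluation_submodel = nn.Linear(EVAL_DIM, 1)   # assumed evaluation sub-model

def predict(eval_feats, target_mask):
    # eval_feats: [n, EVAL_DIM] phoneme evaluation features
    # target_mask: [n] booleans marking the target sample phonemes
    pooled = eval_feats[target_mask].mean(dim=0)   # sample pooling feature
    return evaluation_submodel(pooled)             # current predicted value

eval_feats = torch.randn(10, EVAL_DIM)
mask = torch.zeros(10, dtype=torch.bool)
mask[2:5] = True   # e.g. the phonemes of one word, for a word-level task
print(predict(eval_feats, mask).item())
```

In this sketch, the mask would come from the range lookup of example 6, so that an utterance-level task pools over every phoneme while a word- or phoneme-level task pools over a narrow span.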
According to one or more embodiments of the present disclosure, example 6 provides the method of example 5, wherein determining the target sample phoneme from the plurality of current sample phonemes of the current sample set according to the sample evaluation task comprises: determining a sample phoneme range corresponding to the sample evaluation task through a preset range association relationship, wherein the range association relationship comprises correspondences between different evaluation tasks and phoneme ranges; and determining the target sample phonemes from the plurality of current sample phonemes according to the sample phoneme range.
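The preset range association relationship of example 6 can be as simple as a lookup from evaluation task to a span of positions in the sample phoneme sequence. An illustrative mapping follows; the task names and spans are invented for the sketch.

```python
import torch

# Each evaluation task maps to a phoneme range, expressed here as a
# half-open index span into the sample phoneme sequence.
RANGE_ASSOCIATION = {
    "sentence_accuracy": (0, 10),   # the whole utterance
    "word_2_stress":     (3, 6),    # phonemes of an assumed second word
}

def target_phoneme_mask(task, num_phonemes):
    start, end = RANGE_ASSOCIATION[task]   # sample phoneme range for the task
    mask = torch.zeros(num_phonemes, dtype=torch.bool)
    mask[start:end] = True                 # marks the target sample phonemes
    return mask

print(target_phoneme_mask("word_2_stress", 10))
```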
In accordance with one or more embodiments of the present disclosure, example 7 provides the method of any one of examples 1-6, wherein determining, according to the target voice and the target text, the plurality of target phoneme features corresponding to the target voice comprises: inputting the target voice and the target text into a pre-generated phoneme feature acquisition model to acquire a plurality of target phoneme features output by the phoneme feature acquisition model.
According to one or more embodiments of the present disclosure, example 8 provides a pronunciation assessment device, comprising: a first acquisition module for acquiring a target voice to be evaluated and a target text corresponding to the target voice; a determining module for determining, according to the target voice and the target text, a plurality of target phoneme features corresponding to the target voice, wherein each target phoneme feature represents the phoneme feature of a target phoneme in the target voice, and different target phonemes correspond to different target phoneme features; and a second acquisition module for inputting the plurality of target phoneme features and at least one target evaluation task into a pre-generated pronunciation evaluation model to acquire an evaluation value corresponding to each target evaluation task output by the pronunciation evaluation model. The pronunciation evaluation model is obtained by training a target neural network model with a plurality of sample sets. Each sample set comprises sample phoneme information of each sample phoneme of a sample text, a sample phoneme feature of each sample phoneme in a sample voice, a plurality of sample evaluation tasks, and a sample evaluation value corresponding to each sample evaluation task. The sample phoneme information comprises a sample phoneme identifier and sample position information, and the sample position information represents the position of the sample phoneme in a sample phoneme sequence corresponding to the sample text.
Example 9 provides the apparatus of example 8, according to one or more embodiments of the disclosure, further comprising: a model training module for acquiring a plurality of sample sets and determining a current sample set from the plurality of sample sets, and for executing a model training step in a loop according to the current sample set until it is determined, according to a plurality of sample evaluation values and a plurality of current predicted evaluation values of the current sample set, that a trained target neural network model meets a preset iteration stopping condition, the trained target neural network model then being taken as the pronunciation evaluation model, wherein the current predicted evaluation values are the evaluation values corresponding to the sample evaluation tasks that are output after the current sample set is input into the trained target neural network model. The model training step comprises: acquiring, through the target neural network model, a current predicted evaluation value corresponding to each sample evaluation task in the current sample set; and, in a case where it is determined according to the plurality of current predicted evaluation values and the plurality of sample evaluation values that the target neural network model does not meet the preset iteration stopping condition, determining a target loss value according to the plurality of current predicted evaluation values and the plurality of sample evaluation values, updating parameters of the target neural network model according to the target loss value to obtain a trained target neural network model, taking the trained target neural network model as a new target neural network model, and determining a new current sample set from the plurality of sample sets.
Example 10 provides the apparatus of example 9, according to one or more embodiments of the disclosure, the model training module being further configured to: for each sample evaluation task, perform, through the target neural network model and according to the sample evaluation task, feature extraction on a plurality of current sample phoneme features of the current sample set and a plurality of pieces of current sample phoneme information of the current sample set to obtain a phoneme evaluation feature corresponding to the sample evaluation task, and determine a current predicted evaluation value corresponding to the sample evaluation task according to the phoneme evaluation feature and the sample evaluation task.
In accordance with one or more embodiments of the present disclosure, example 11 provides the apparatus of example 10, the target neural network model comprising a feature extraction sub-model, the model training module being further configured to: for each current sample phoneme of the current sample set, perform splicing processing on the current sample phoneme feature of the current sample phoneme and the current sample phoneme information of the current sample phoneme to obtain current sample splicing information; and perform feature extraction on the current sample splicing information through the feature extraction sub-model according to the sample evaluation task to obtain the phoneme evaluation feature corresponding to the sample evaluation task.
In accordance with one or more embodiments of the present disclosure, example 12 provides the apparatus of example 10, the target neural network model comprising a pooling layer and an evaluation sub-model, an output of the pooling layer being coupled to an input of the evaluation sub-model, the model training module being further configured to: determine a target sample phoneme from a plurality of current sample phonemes in the current sample set according to the sample evaluation task; perform pooling processing on the phoneme evaluation features according to the target sample phonemes to obtain sample pooling features corresponding to the sample evaluation task; and determine, through the evaluation sub-model and according to the sample pooling features, a current predicted evaluation value corresponding to the sample evaluation task.
Example 13 provides the apparatus of example 12, according to one or more embodiments of the disclosure, the model training module being further configured to: determine a sample phoneme range corresponding to the sample evaluation task through a preset range association relationship, wherein the range association relationship comprises correspondences between different evaluation tasks and phoneme ranges; and determine the target sample phonemes from the plurality of current sample phonemes according to the sample phoneme range.
In accordance with one or more embodiments of the present disclosure, example 14 provides the apparatus of any one of examples 8-13, the determining module being further configured to: input the target voice and the target text into a pre-generated phoneme feature acquisition model to acquire a plurality of target phoneme features output by the phoneme feature acquisition model.
Example 15 provides a computer-readable medium having stored thereon a computer program that, when executed by a processing device, performs the steps of the method of any of examples 1-7, according to one or more embodiments of the present disclosure.
Example 16 provides an electronic device according to one or more embodiments of the present disclosure, comprising: a storage device having a computer program stored thereon; and a processing means for executing the computer program in the storage device to implement the steps of the method of any one of examples 1-7.
The foregoing description is merely a description of the preferred embodiments of the present disclosure and of the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the specific combinations of features described above, and also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, solutions formed by mutually replacing the above features with technical features having similar functions disclosed in the present disclosure (but not limited thereto).
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims. The specific manner in which the various modules perform operations in the apparatus of the above embodiments has been described in detail in connection with the embodiments of the method and will not be described in detail herein.

Claims (10)

1. A pronunciation assessment method, comprising:
acquiring a target voice to be evaluated and a target text corresponding to the target voice;
determining a plurality of target phoneme features corresponding to the target voice according to the target voice and the target text, wherein each target phoneme feature represents the phoneme feature, in the target voice, of a target phoneme in the target text, and different target phonemes correspond to different target phoneme features;
inputting a plurality of target phoneme features and at least one target evaluation task into a pre-generated pronunciation evaluation model to obtain an evaluation value corresponding to each target evaluation task output by the pronunciation evaluation model;
wherein the pronunciation evaluation model is obtained by training a target neural network model through a plurality of sample sets, each sample set comprises sample phoneme information of each sample phoneme of a sample text, a sample phoneme feature of each sample phoneme in a sample voice, a plurality of sample evaluation tasks and a sample evaluation value corresponding to each sample evaluation task, the sample phoneme information comprises a sample phoneme identifier and sample position information, and the sample position information represents the position of the sample phoneme in a sample phoneme sequence corresponding to the sample text.
2. The method of claim 1, wherein the pronunciation evaluation model is pre-generated by:
acquiring a plurality of sample sets, and determining a current sample set from the plurality of sample sets;
executing a model training step in a loop according to the current sample set until it is determined, according to a plurality of sample evaluation values and a plurality of current predicted evaluation values of the current sample set, that a trained target neural network model meets a preset iteration stopping condition, and taking the trained target neural network model as the pronunciation evaluation model, wherein the current predicted evaluation values are the evaluation values corresponding to the sample evaluation tasks that are output after the current sample set is input into the trained target neural network model;
the model training step comprises the following steps:
acquiring a current predicted evaluation value corresponding to each sample evaluation task in the current sample set through the target neural network model;
and in a case where it is determined, according to the plurality of current predicted evaluation values and the plurality of sample evaluation values, that the target neural network model does not meet the preset iteration stopping condition, determining a target loss value according to the plurality of current predicted evaluation values and the plurality of sample evaluation values, updating parameters of the target neural network model according to the target loss value to obtain a trained target neural network model, taking the trained target neural network model as a new target neural network model, and determining a new current sample set from the plurality of sample sets.
3. The method of claim 2, wherein the obtaining, by the target neural network model, a current predicted evaluation value corresponding to each of the sample evaluation tasks in the current sample set comprises:
for each sample evaluation task, performing, through the target neural network model and according to the sample evaluation task, feature extraction on a plurality of current sample phoneme features of the current sample set and a plurality of pieces of current sample phoneme information of the current sample set to obtain a phoneme evaluation feature corresponding to the sample evaluation task, and determining a current predicted evaluation value corresponding to the sample evaluation task according to the phoneme evaluation feature and the sample evaluation task.
4. The method of claim 3, wherein the target neural network model comprises a feature extraction sub-model, and performing the feature extraction to obtain the phoneme evaluation feature corresponding to the sample evaluation task comprises:
for each current sample phoneme of the current sample set, performing splicing processing on the current sample phoneme feature of the current sample phoneme and the current sample phoneme information of the current sample phoneme to obtain current sample splicing information; and
performing feature extraction on the current sample splicing information through the feature extraction sub-model according to the sample evaluation task to obtain the phoneme evaluation feature corresponding to the sample evaluation task.
5. The method of claim 3, wherein the target neural network model comprises a pooling layer and an evaluation sub-model, an output of the pooling layer is coupled to an input of the evaluation sub-model, and determining the current predicted evaluation value corresponding to the sample evaluation task according to the phoneme evaluation feature and the sample evaluation task comprises:
determining a target sample phoneme from a plurality of current sample phonemes in the current sample set according to the sample evaluation task;
performing pooling processing on the phoneme evaluation features according to the target sample phonemes to obtain sample pooling features corresponding to the sample evaluation task; and
determining, through the evaluation sub-model and according to the sample pooling features, a current predicted evaluation value corresponding to the sample evaluation task.
6. The method of claim 5, wherein determining the target sample phoneme from the plurality of current sample phonemes of the current sample set according to the sample evaluation task comprises:
determining a sample phoneme range corresponding to the sample evaluation task through a preset range association relationship, wherein the range association relationship comprises correspondences between different evaluation tasks and phoneme ranges; and
determining the target sample phonemes from the plurality of current sample phonemes according to the sample phoneme range.
7. The method of any of claims 1-6, wherein determining a plurality of target phoneme features corresponding to the target voice according to the target voice and the target text comprises:
inputting the target voice and the target text into a pre-generated phoneme feature acquisition model to acquire a plurality of target phoneme features output by the phoneme feature acquisition model.
8. A pronunciation evaluation device, comprising:
the first acquisition module is used for acquiring a target voice to be evaluated and a target text corresponding to the target voice;
the determining module is used for determining a plurality of target phoneme features corresponding to the target voice according to the target voice and the target text, wherein the target phoneme features are used for representing the phoneme features of target phonemes in the target voice, and different target phonemes correspond to different target phoneme features;
The second acquisition module is used for inputting a plurality of target phoneme features and at least one target evaluation task into a pre-generated pronunciation evaluation model so as to acquire an evaluation value corresponding to each target evaluation task output by the pronunciation evaluation model;
wherein the pronunciation evaluation model is obtained by training a target neural network model through a plurality of sample sets, each sample set comprises sample phoneme information of each sample phoneme of a sample text, a sample phoneme feature of each sample phoneme in a sample voice, a plurality of sample evaluation tasks and a sample evaluation value corresponding to each sample evaluation task, the sample phoneme information comprises a sample phoneme identifier and sample position information, and the sample position information represents the position of the sample phoneme in a sample phoneme sequence corresponding to the sample text.
9. A computer readable medium on which a computer program is stored, characterized in that the program, when executed by a processing device, carries out the steps of the method according to any one of claims 1-7.
10. An electronic device, comprising:
a storage device having at least one computer program stored thereon;
at least one processing means for executing said at least one computer program in said storage device to carry out the steps of the method according to any one of claims 1-7.