CN114360537A - Spoken question and answer scoring method, spoken question and answer training method, computer equipment and storage medium - Google Patents


Info

Publication number
CN114360537A
Authority
CN
China
Prior art keywords
model
text
feature vector
answering audio
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111618214.1A
Other languages
Chinese (zh)
Inventor
王豫丰
李�浩
吴奎
盛志超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202111618214.1A priority Critical patent/CN114360537A/en
Publication of CN114360537A publication Critical patent/CN114360537A/en
Pending legal-status Critical Current

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The embodiment of the application provides a scoring method, a training method, computer equipment and a storage medium for spoken question answering, wherein the scoring method comprises the following steps: inputting the answering audio into a preset speech recognition model to obtain a speech feature vector and a spoken recognition text; inputting the spoken recognition text and the question text into a preset semantic extraction model to obtain a text feature vector; inputting the answering audio into a preset acoustic model to obtain an acoustic feature vector; and, based on a preset scoring model, obtaining a prediction score corresponding to the answering audio from the speech feature vector, the text feature vector and the acoustic feature vector. Because the speech feature vector is extracted by the speech recognition model, the text feature vector by the semantic extraction model, and the acoustic feature vector by the acoustic model, with scoring finally performed on all of the extracted vectors, the influence of speech recognition errors on scoring in existing spoken question-answer scoring schemes can be reduced or avoided, making scoring fairer and more stable.

Description

Spoken question and answer scoring method, spoken question and answer training method, computer equipment and storage medium
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a scoring method, a training method, computer equipment and a storage medium for spoken questions and answers.
Background
With the deepening reform of the education system, more and more attention is being paid to improving students' spoken-language proficiency, such as spoken English, and oral tests have been introduced into high-school and college entrance examinations in many regions. Question answering, as the most representative form of daily interaction and verbal communication, is frequently examined. Meanwhile, as artificial-intelligence technology matures, machine-assisted scoring is becoming mainstream. However, question-and-answer items are relatively open, both pronunciation and semantic understanding must be considered, and fair machine scoring is therefore challenging.
The existing spoken question-answer scoring scheme works as follows: a speech recognition model recognizes the student's answering audio; the similarity between the spoken recognition text and a manually preset standard answer is calculated; the speech is segmented according to the spoken recognition text, and phoneme-level pronunciation features such as GOP (Goodness of Pronunciation) are computed from the segmentation boundaries; finally, the extracted speech and semantic features are regressed to obtain the student's score. However, this scheme has certain limitations and low accuracy.
Disclosure of Invention
The embodiment of the application provides a scoring method, a training method, computer equipment and a storage medium for spoken questions and answers, which can accurately score spoken questions and answers.
In a first aspect, the present application provides a method for scoring spoken questions and answers, the method comprising:
acquiring a question text;
acquiring the answering audio of a student's answer;
inputting the answering audio into a preset voice recognition model to obtain a voice feature vector and a spoken language recognition text corresponding to the answering audio;
inputting the spoken language identification text and the question text into a preset semantic extraction model to obtain a text feature vector corresponding to the answering audio;
inputting the answering audio into a preset acoustic model to obtain an acoustic feature vector corresponding to the answering audio;
and obtaining a prediction score corresponding to the answering audio according to the voice feature vector, the text feature vector and the acoustic feature vector corresponding to the answering audio based on a preset scoring model.
In a second aspect, the application provides a training method of a spoken question-answer scoring model, where the spoken question-answer scoring model includes a speech recognition model, a semantic extraction model, an acoustic model, and a scoring model;
the training method comprises the following steps:
acquiring a question text;
acquiring answering audio of student answers and corresponding labeled scores;
inputting the answering audio into a pre-trained voice recognition model to obtain a voice characteristic vector and a spoken language recognition text corresponding to the answering audio;
inputting the spoken language identification text and the question text into a pre-trained semantic extraction model to obtain a text feature vector corresponding to the answering audio;
inputting the answering audio into a pre-trained acoustic model to obtain an acoustic feature vector corresponding to the answering audio;
based on a preset scoring model, obtaining a prediction score corresponding to the answering audio according to the voice feature vector, the text feature vector and the acoustic feature vector corresponding to the answering audio;
and adjusting model parameters of at least one model of the voice recognition model, the semantic extraction model, the acoustic model and the scoring model according to the prediction score and the labeling score corresponding to the answering audio.
In a third aspect, the present application provides a computer device comprising a memory and a processor; the memory is used for storing a computer program; the processor is configured to execute the computer program and to implement the steps of any of the above methods when executing the computer program.
In a fourth aspect, the present application provides a computer-readable storage medium storing a computer program for implementing the steps of any one of the above methods when the computer program is executed by a processor.
The application discloses a scoring method, a training method, computer equipment and a storage medium for spoken question answering, wherein the scoring method comprises the following steps: inputting the answering audio into a preset speech recognition model to obtain a speech feature vector and a spoken recognition text corresponding to the answering audio; inputting the spoken recognition text and the question text into a preset semantic extraction model to obtain a text feature vector corresponding to the answering audio; inputting the answering audio into a preset acoustic model to obtain an acoustic feature vector corresponding to the answering audio; and, based on a preset scoring model, obtaining a prediction score corresponding to the answering audio from the speech feature vector, the text feature vector and the acoustic feature vector corresponding to the answering audio. Because the speech feature vector is extracted by the speech recognition model, the text feature vector by the semantic extraction model, and the acoustic feature vector by the acoustic model, with scoring finally performed on all of the extracted vectors, the influence of speech recognition errors on scoring in existing spoken question-answer scoring schemes can be reduced or avoided, making scoring fairer and more stable.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flow chart illustrating a method for scoring spoken questions and answers according to an embodiment of the present application;
FIG. 2 is a diagram illustrating an application scenario of a scoring method according to an embodiment;
FIG. 3 is a diagram illustrating the scoring performed by the spoken question-answer scoring model according to an embodiment;
FIG. 4 is a diagram illustrating extraction of speech feature vectors according to an embodiment;
FIG. 5 is a diagram illustrating extraction of text feature vectors according to an embodiment;
FIG. 6 is a diagram illustrating extraction of acoustic feature vectors according to one embodiment;
FIG. 7 is a schematic flow chart illustrating a method for training a spoken question-answer scoring model according to another embodiment of the present application;
fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
The embodiment of the application provides a scoring method, a training method, computer equipment and a storage medium for spoken question answering, so as to score according to the question text to be answered and the audio of the student's answer, i.e., to score spoken question answering.
In current spoken question-answer scoring, a speech recognition model is generally used to recognize the student's answering audio, and the similarity between the spoken recognition text and the manually preset standard answer is then calculated. Similarity metrics include the longest common substring (LCS), BLEU (Bilingual Evaluation Understudy), METEOR (Metric for Evaluation of Translation with Explicit ORdering), and so on. The speech is then segmented according to the spoken recognition text, phoneme-level pronunciation features such as GOP (Goodness of Pronunciation) are computed from the segmentation boundaries, and the extracted speech and semantic features are regressed to obtain the student's score.
Current spoken question-answer scoring suffers from at least one of the following drawbacks:
Firstly, speech recognition is performed to obtain a text, the obtained text is used to calculate similarity, and all features are finally regressed into a score; this strategy produces cascading errors: speech recognition errors strongly affect text feature extraction, and feature extraction errors in turn strongly affect scoring.
Secondly, text similarity calculation easily falls short of requirements: in many student answers, a slight change in wording can alter the semantics and hence the score, and existing text similarity measures struggle to reflect the influence of such slight perturbations.
Third, for questions with open answers, standard answers are difficult to enumerate manually or by rules, so existing schemes struggle to score some open answers fairly, and excellent divergent answers are often judged as low scores.
Fourthly, speech recognition training data lacks domain adaptation to spoken-test scenarios, so recognition of audio from secondary-school students speaking a second language is limited to some degree, and common grammar errors and pronunciation flaws are difficult to recognize accurately.
Based on this, the inventor of the present application improved the scoring method of spoken question and answer to solve at least one of the above-mentioned shortcomings.
The scoring method for spoken question answering provided by the embodiment of the application can be applied to a terminal or a server. The terminal can be an electronic device such as a mobile phone, tablet computer, notebook computer, desktop computer or personal digital assistant; the server may be an independent server or a server cluster. For ease of understanding, the following embodiments are described with the method applied to a server.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a schematic flowchart of a scoring method for spoken questions and answers according to an embodiment of the present disclosure.
In some embodiments, as shown in fig. 2, the server obtains the question text and the answering audio of the student answering from the terminal, generates a prediction score corresponding to the answering audio according to a scoring method of spoken question answering, and sends the prediction score to the terminal.
In some optional embodiments, the question text is a locally stored text of the apparatus for implementing the scoring method for spoken question-answering, a text acquired by the apparatus from a network, a text acquired by the apparatus from an input device connected thereto, a text acquired by the apparatus from other electronic devices, a text converted by the apparatus from voice information, and the like. The answering audio is locally stored audio of the device for implementing the scoring method for spoken question answering, audio acquired by the device from a network, audio acquired by the device from an input device connected thereto, audio acquired by the device from other electronic devices, and the like.
Referring to fig. 1 and 3, the scoring method for spoken questions and answers includes the following steps S110 to S160.
And step S110, acquiring a question text.
For example, the question text is represented as a = {a1, a2, …, am}, where m is the number of words in the question text.
And step S120, acquiring the answering audio of the student answering.
Illustratively, when a student starts answering, the student triggers recording to start through an operation such as pressing a key, and triggers recording to end after answering, thereby producing the answering audio. The method is certainly not limited to this; for example, recording may be started and ended by voice instructions, such as "start" and "answer finished", when the student answers.
Step S130, inputting the answering audio into a preset voice recognition model to obtain a voice feature vector and a spoken language recognition text corresponding to the answering audio.
Speech recognition, also referred to as automatic speech recognition (ASR), converts human speech into text.
In some embodiments, the speech recognition model may extract speech feature vectors from the answering audio and recognize the spoken language recognition text from the speech feature vectors. Illustratively, the answering audio is subjected to voice recognition, a spoken language recognition text is obtained, and an intermediate vector in the voice recognition process is taken out to serve as a voice feature vector.
Illustratively, the speech recognition model includes an Encoder (Encoder) submodel and a Decoder (Decoder) submodel, and the speech recognition model may be an Encoder-Decoder model. The encoder submodel is used for encoding the answering audio to obtain a voice feature vector, and the voice feature vector is a hidden layer vector; the decoder submodel is used for decoding the voice characteristic vector to obtain a spoken language identification text.
Optionally, as shown in fig. 4, a vector obtained by encoding the response audio of the encoder sub-model may be processed through a preset neural network, for example, full-connection processing, and the vector processed by the neural network is used as the speech feature vector. The vector obtained by the encoder sub-model coding answer audio is processed through the neural network, more useful information can be extracted, and the scoring accuracy is improved.
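The optional fully-connected processing of the encoder output can be sketched as a single linear layer applied to each frame's hidden vector. This is a minimal illustration; the toy dimensions, weights, and bias below are assumptions, since the patent does not fix them:

```python
def fully_connected(hidden, weights, bias):
    """Apply one fully-connected layer to each T x S encoder hidden vector,
    yielding the speech feature vector used for scoring. Pure-Python matrix
    multiply for illustration only."""
    out = []
    for vec in hidden:  # one hidden vector per audio frame
        row = []
        for j in range(len(bias)):
            row.append(sum(v * weights[i][j] for i, v in enumerate(vec)) + bias[j])
        out.append(row)
    return out

# toy 2-frame, 3-dim encoder output projected down to 2 dims
h = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.5]
proj = fully_connected(h, w, b)
# proj == [[3.0, 2.5], [1.0, 2.5]]
```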
In some embodiments, referring to fig. 4, step S130 inputs the response audio into a preset speech recognition model to obtain a speech feature vector and a spoken language recognition text corresponding to the response audio, which includes the following steps S131 to S133.
S131, inputting the answering audio into a first feature extraction model of the voice recognition model to obtain a first voice feature of the answering audio; step S132, inputting the first voice characteristic into an encoder sub-model of the voice recognition model to obtain a voice characteristic vector corresponding to the answering audio; and S133, inputting the voice feature vector into a decoder submodel of the voice recognition model to obtain a spoken language recognition text corresponding to the answering audio.
Illustratively, the first feature extraction model is used to extract fbank (filterbank) features of the answering audio, i.e. the first speech feature may be an fbank feature. Obtaining fbank features of a speech signal generally comprises: pre-emphasis, framing, windowing, short-time Fourier transform (STFT), Mel filtering, mean normalization, etc. For example, when extracting fbank features, the window length is 25 ms and the frame shift is 10 ms, and the dimension of the first speech feature is T × 40, where T is the number of frames of the answering audio.
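The front-end steps above can be sketched in a few lines. This is a minimal illustration of the pre-emphasis, framing, and windowing arithmetic only (the 0.97 pre-emphasis coefficient and the Hamming window are common conventions, not specified by the patent); the STFT and Mel filtering are omitted:

```python
import math

def frame_signal(samples, sample_rate=16000, win_ms=25, shift_ms=10):
    """Split a waveform into overlapping frames (25 ms window, 10 ms shift)
    as in the fbank front-end described above."""
    win = int(sample_rate * win_ms / 1000)      # 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)  # 160 samples at 16 kHz
    # pre-emphasis: y[t] = x[t] - 0.97 * x[t-1]
    emph = [samples[0]] + [samples[t] - 0.97 * samples[t - 1]
                           for t in range(1, len(samples))]
    frames = []
    for start in range(0, len(emph) - win + 1, shift):
        chunk = emph[start:start + win]
        # Hamming window applied before the short-time Fourier transform
        frames.append([c * (0.54 - 0.46 * math.cos(2 * math.pi * i / (win - 1)))
                       for i, c in enumerate(chunk)])
    return frames  # T frames; Mel filtering would then map each frame to 40 dims

# one second of 16 kHz audio -> T = 98 frames of 400 samples each
frames = frame_signal([0.0] * 16000)
```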
Illustratively, the encoder sub-model comprises multiple layers, such as 12 Transformer (multi-head self-attention) layers; the output of each Transformer layer serves as the input of the next, and each layer processes its input vectors with a multi-head attention mechanism.
The input of the encoder sub-model is the T × 40 fbank feature, and its output is a T × S vector, where S is the hidden-vector dimension of the Transformer layers; the output of the encoder sub-model's last Transformer layer can be used as the speech feature vector corresponding to the answering audio and participate in the subsequent scoring task.
Illustratively, the decoder submodel includes multiple layers, such as 6 layers of transformers, that decode speech feature vectors to obtain spoken language identification text.
In some embodiments, the speech recognition model is trained on pre-training data comprising a large amount of high-scoring read-aloud speech from domestic students, and thus has strong modeling capability for students' spoken-language characteristics.
Illustratively, for a speech recognition model in the spoken-language domain, a large number of high-scoring read-aloud recordings are selected for coarse warm-up training in the pre-training stage, after which a large amount of finely labeled spoken transcription data is used for training. In the training stage, the speech recognition model first recognizes the text, and the output of the encoder's last layer is then used as the speech feature vector in the subsequent scoring task.
For example, the spoken recognition text corresponding to the answering audio is represented as g = {g1, g2, …, gl}, where l is the number of words in the spoken recognition text.
And S140, inputting the spoken language identification text and the question text into a preset semantic extraction model to obtain a text feature vector corresponding to the answering audio.
The text feature vector is used for indicating semantic features of the spoken language identification text and the question text. In some embodiments, the semantic extraction model may be referred to as a text feature extraction module that inputs the spoken language identification text and the question text and outputs an extracted text feature vector, i.e., semantic features.
In some embodiments, the step S140 inputs the spoken language identification text and the question text into a preset semantic extraction model to obtain a text feature vector corresponding to the answering audio, and includes the following steps S141 to S142.
Step S141, inputting a preset initial character (such as [ CLS ]), the spoken language identification text, the question text and a preset interval character (such as [ SEP ]) between the spoken language identification text and the question text into an embedding sub-model of the semantic extraction model to obtain an embedding vector.
The target text is obtained by concatenating the spoken recognition text and the question text: the spoken recognition text forms the first part of the target text, with the start character, such as [CLS], prepended at the beginning of the sentence; the question text forms the second part, with a separator character, such as [SEP], inserted between the two parts.
For example, referring to fig. 5, the question text is denoted a = {a1, a2, …, am} and the spoken recognition text g = {g1, g2, …, gl}; after adding the start character and the separator character, the target text contains l + m + 2 tokens. Each word in the target text is first converted into a word vector, so the embedding vector comprises l + m + 2 word vectors, e.g. H0 = [e1, e2, …, e(l+m+2)], where ei is the word vector of the i-th token in the target text: the 1st word vector e1 is the word vector of the start character, and the (l+2)-th word vector e(l+2) is the word vector of the separator character. Optionally, an end character, which may also be [SEP], may be added after the question text, in which case the embedding vector comprises l + m + 3 word vectors. Alternatively, the target text may be represented as [a; q].
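Assembling the target text fed to the embedding sub-model can be sketched as follows. The token lists and the literal "[CLS]"/"[SEP]" strings are illustrative stand-ins for a real tokenizer:

```python
def build_target_text(spoken_tokens, question_tokens):
    """Concatenate the spoken recognition text g and the question text a into
    the token sequence [CLS] g1 ... gl [SEP] a1 ... am,
    giving l + m + 2 tokens in total."""
    return ["[CLS]"] + spoken_tokens + ["[SEP]"] + question_tokens

g = ["i", "like", "reading"]          # spoken recognition text, l = 3
a = ["what", "do", "you", "like"]     # question text, m = 4
target = build_target_text(g, a)
# len(target) == 3 + 4 + 2 == 9; target[0] is [CLS], target[l + 1] is [SEP]
```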
And S142, inputting the embedded vector into a multi-head self-attention sub-model of the semantic extraction model to obtain a corresponding text feature vector.
In some embodiments, the multi-head self-attention sub-model is a BERT model (Bidirectional Encoder Representations from Transformers), optionally a pre-trained BERT model. The multi-head self-attention sub-model is not limited to BERT, however, and may also be a recurrent neural network model, a convolutional neural network model, or a combination of models/networks. BERT is a bidirectional language representation model. It uses the Transformer network (a self-attention-based neural network) as its unit module and is pre-trained on large-scale corpora with two pre-training tasks, masked language modeling (MLM) and next-sentence coherence classification (NSP); compared with recurrent and convolutional neural networks, it has stronger semantic modeling capability, and only simple fine-tuning is needed to achieve good results on downstream tasks. By using a multi-head self-attention sub-model such as a pre-trained BERT model, deeper semantic modeling can be performed.
Referring to fig. 5: next-sentence coherence classification is one of the BERT pre-training tasks and predicts whether the second sentence follows the first. In a conventional QA (question answering) task, the question Q is often concatenated with the answer A as the input of the BERT model, which predicts whether answer A can answer question Q. In the embodiment of the application, the BERT model performs semantic modeling on the target text to obtain the text feature vector corresponding to the spoken recognition text and the question text.
Illustratively, text features are extracted with a pre-trained BERT model to characterize the coherence between the question text and the spoken recognition text.
The BERT model is a deep language model with 12 Transformer (multi-head self-attention) layers that do not share parameters; the output of each Transformer layer serves as the input of the next. The output of the last Transformer layer can be expressed as:

output12 = [h0, h1, …, h(l+1), …, h(l+m+1)]

wherein h0 is the vector corresponding to the start character [CLS], h1 is the vector corresponding to the first word in the spoken recognition text, h(l+1) is the vector corresponding to the separator character [SEP], and h(l+m+1) is the vector corresponding to the last word in the question text.
output12, the output of the last Transformer layer, contains a hidden-layer representation of every token in the target text. Illustratively, after the spoken recognition text and the question text pass through the 12 Transformer layers of the pre-trained BERT model, the last layer outputs a vector of dimension (l + m + 2) × 768, which can be used as the text feature vector corresponding to the spoken recognition text and the question text, but this is not limiting.
Illustratively, referring to fig. 5, the hidden-layer representation h0 of the start character [CLS] may be selected as the text feature vector, since the hidden-layer representations of all tokens in the target text (the start character, the spoken recognition text, the separator character, and the question text) contribute to it.
Optionally, referring to fig. 5, the text feature vector corresponding to the spoken recognition text and the question text at least includes the vector in output12 corresponding to the start character [CLS], and may also include the average of the vectors corresponding to the other tokens, MEAN(output12[1:]). That is, the text feature vector may be represented as:

Hidden = [output12[0]; MEAN(output12[1:])]
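The pooling Hidden = [output12[0]; MEAN(output12[1:])] can be sketched directly; the toy token vectors below are illustrative:

```python
def pool_text_features(output12):
    """Concatenate the [CLS] vector with the element-wise mean of the
    remaining token vectors: Hidden = [output12[0]; MEAN(output12[1:])]."""
    cls_vec = output12[0]   # hidden vector of the start character [CLS]
    rest = output12[1:]     # hidden vectors of all other tokens
    mean_vec = [sum(col) / len(rest) for col in zip(*rest)]
    return cls_vec + mean_vec

# toy last-layer output for 3 tokens with 2-dim hidden vectors
out = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
hidden = pool_text_features(out)
# hidden == [1.0, 2.0, 4.0, 5.0]
```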
in some embodiments, the semantic extraction model is pre-trained in two stages, in the first stage, a large number of confrontational samples are generated in an unsupervised manner, and the semantic extraction model is pre-trained through a weakly-changed text, so that the semantic extraction model can simultaneously take sensitivity and robustness into consideration for spoken language data. Illustratively, a first training sample is obtained, wherein the first training sample comprises a spoken language identification text; based on a preset data enhancement rule, performing data enhancement on the spoken language identification text to obtain a first training sample after data enhancement; and pre-training the semantic extraction model according to the first training sample. The training data comprises spoken language identification texts of students, and data enhancement is carried out on the spoken language identification texts, and the data enhancement method comprises the following steps: and performing part-of-speech recognition on the corpus, and replacing the specific word by a near-sense word and an anti-sense word with the replacement proportion of 40% for example, so as to generate a sample with unchanged semantics. The pre-training method of the semantic extraction model is, for example, consistent with the pre-training method of the original Bert model.
The second training stage of the semantic extraction model can use massive dialogue samples for dialogue-coherence modeling, so that answers to the questions need not be written manually, achieving fair scoring. Illustratively, a second training sample comprising question texts and corresponding answer texts is obtained; the semantic extraction model already pre-trained on the first training sample is further pre-trained on the second training sample to obtain the final pre-trained semantic extraction model. The second stage is question-answer coherence pre-training: the corpus is question-answer data, associated question-answer pairs serve as positive samples, and a randomly sampled question and answer form a negative sample.
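The positive/negative pair construction for question-answer coherence pre-training can be sketched as follows; this is a minimal illustration, not the patent's actual sampling code:

```python
import random

def make_qa_pairs(qa_data, rng=None):
    """Build training pairs for question-answer coherence pre-training:
    each real (question, answer) pair is a positive sample (label 1), and a
    randomly re-paired question and answer form a negative sample (label 0)."""
    rng = rng or random.Random(0)
    answers = [ans for _, ans in qa_data]
    pairs = []
    for q, ans in qa_data:
        pairs.append((q, ans, 1))                       # associated pair
        neg = rng.choice([a for a in answers if a != ans])
        pairs.append((q, neg, 0))                       # random re-pairing
    return pairs

data = [("What do you like?", "I like reading."),
        ("Where are you from?", "I am from Hefei.")]
pairs = make_qa_pairs(data, rng=random.Random(0))
```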
Illustratively, the two-stage pre-training improves the model's robustness and sensitivity in semantic understanding of the spoken recognition text, while pre-training on a large amount of spoken question-answer data also enables the model to better model question-answer coherence.
And S150, inputting the answering audio into a preset acoustic model to obtain an acoustic feature vector corresponding to the answering audio.
The speech recognition model of step S130 focuses on the speech content, and the features it produces lack information relevant to scoring dimensions such as prosody and pitch accuracy; the acoustic model extracts acoustic features directly from the answering audio, and these features preserve the original acoustic information of the answering audio, diversifying the scoring information.
In some embodiments, referring to fig. 6, step S150 inputs the answering audio into a preset acoustic model to obtain an acoustic feature vector corresponding to the answering audio, and includes the following steps S151 to S152.
Step S151, obtaining a second voice feature of the answering audio based on a second feature extraction model of the acoustic model; and step S152, inputting the second voice feature into an acoustic information extraction submodel of the acoustic model to obtain an acoustic feature vector corresponding to the answering audio.
In some embodiments, the second feature extraction model is used to extract MFCC (Mel-Frequency Cepstral Coefficients) features of the answering audio, i.e., the second voice feature may be an MFCC feature. For example, the second feature extraction model may perform DCT (discrete cosine transform) cepstral processing on the fbank features obtained by the first feature extraction model to obtain the MFCC features. The essence of the DCT is to remove correlation between the dimensions of the signal and map the signal into a lower-dimensional space, so that the MFCC features are more discriminative.
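As an illustrative sketch of this fbank-to-MFCC step (not the patent's exact implementation), the following applies an unnormalized type-II DCT along the filter axis using plain numpy and keeps the leading cepstral coefficients; production code would typically use a library DCT routine:

```python
import numpy as np

def fbank_to_mfcc(log_fbank, num_ceps=13):
    """Map (frames, n_filters) log filter-bank features to MFCCs by a
    type-II DCT along the filter axis, keeping `num_ceps` coefficients."""
    frames, n = log_fbank.shape
    k = np.arange(n)
    # DCT-II basis: basis[k, m] = cos(pi/N * (m + 0.5) * k)
    basis = np.cos(np.pi / n * (k[None, :] + 0.5) * k[:, None])
    # Each output frame is the DCT of the corresponding input frame.
    return (log_fbank @ basis.T)[:, :num_ceps]
```

Decorrelating the filter-bank dimensions this way is what gives the lower-dimensional MFCC features their discriminability.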
For example, referring to fig. 6, the acoustic information extraction submodel of the acoustic model includes 3 convolution modules and 3 bidirectional LSTM (Long Short-Term Memory) modules, which perform feature extraction on the second voice feature and finally output an acoustic feature vector, where the acoustic feature vector is, for example, a vector of dimension T × S.
In some embodiments, the voice recognition model and the acoustic model may constitute a speech processing module that takes student audio as input and outputs the recognized text, the voice feature vector, and the acoustic feature vector.
And step S160, based on a preset scoring model, obtaining a prediction score corresponding to the answering audio according to the voice feature vector, the text feature vector and the acoustic feature vector corresponding to the answering audio.
The scoring model may also be referred to as a feature fusion and scoring module, which takes the voice feature vector, the text feature vector, and the acoustic feature vector as input and outputs a final score. By fusing the voice feature vector, the text feature vector, and the acoustic feature vector, the scoring process fully considers voice information, textual semantic information, and acoustic features; voice recognition, semantic understanding, and acoustic features are integrated, so that scoring accuracy is higher.
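The patent does not specify the fusion mechanism; a minimal stand-in is concatenating the three vectors and applying a linear layer, as sketched below (the weights `w` and bias `b` are placeholders that would be learned in practice):

```python
import numpy as np

def fuse_and_score(speech_vec, text_vec, acoustic_vec, w, b):
    """Concatenate the three feature vectors and map them to a scalar score
    with a linear layer -- a minimal illustrative stand-in for the patent's
    'feature fusion and scoring module'."""
    fused = np.concatenate([speech_vec, text_vec, acoustic_vec])
    return float(w @ fused + b)
```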
In some implementations, the speech recognition model, the semantic extraction model, the acoustic model, and the scoring model can constitute a spoken question-answer scoring model.
According to the scoring method, the voice feature vector is extracted through the voice recognition model, the text feature vector is extracted through the semantic extraction model, the acoustic feature vector is extracted through the acoustic model, and scoring is carried out according to the multiple extracted vectors, so that the influence of voice recognition errors on the scoring effect in existing spoken question-answer scoring schemes can be alleviated or avoided, and the scoring is fairer and more stable.
In some embodiments, the acoustic model, the voice recognition model, and the semantic extraction model are pre-trained using a large amount of student spoken-language data, so that each model fits the student spoken-language domain, making the scoring fairer and more stable.
In some embodiments, calibration data from the official examination can be used to fine-tune the whole spoken question-answer scoring model, so that joint training of the voice recognition model, the semantic extraction model, and the scoring model is realized, cascading errors are alleviated, and the influence of voice recognition on text feature extraction is reduced.
Illustratively, the training process of the spoken question-answer scoring model is divided into two stages. The first stage is modular pre-training, in which massive data from the spoken question-answer field is used to pre-train the voice recognition model and the semantic extraction model. The second stage is calibration-scenario fine-tuning: given a small number of student answers with annotated scores, the semantic extraction model, the encoder sub-model of the voice recognition model, and the acoustic model are fine-tuned using the back-propagation algorithm to obtain the trained spoken question-answer scoring model. Joint training of the individual models within the spoken question-answer scoring model can alleviate cascading errors, for example the influence of voice recognition errors on the scoring effect, so that the scoring effect is better. It can be understood that the training of the semantic extraction model may include, in addition to pre-training, joint training with the voice recognition model and the acoustic model during the calibration fine-tuning stage.
In some embodiments, the training method for the spoken question-answer scoring model comprises calibration fine-tuning of the spoken question-answer scoring model, and may comprise the following steps: acquiring a question text; acquiring answering audio of student answers and corresponding annotation scores; inputting the answering audio into a pre-trained voice recognition model to obtain a voice feature vector and a spoken language identification text corresponding to the answering audio; inputting the spoken language identification text and the question text into a pre-trained semantic extraction model to obtain a text feature vector corresponding to the answering audio; inputting the answering audio into a pre-trained acoustic model to obtain an acoustic feature vector corresponding to the answering audio; based on a preset scoring model, obtaining a prediction score corresponding to the answering audio according to the voice feature vector, the text feature vector, and the acoustic feature vector corresponding to the answering audio; and adjusting model parameters of at least one of the voice recognition model, the semantic extraction model, the acoustic model, and the scoring model according to the prediction score and the annotation score corresponding to the answering audio.
Illustratively, based on a preset loss function, determining a model loss value according to the prediction score and the annotation score corresponding to the answering audio, and adjusting model parameters of at least one model of the voice recognition model, the semantic extraction model, the acoustic model and the scoring model according to the model loss value.
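The "preset loss function" is left unspecified; one common choice for score regression, shown here as an assumption rather than the patent's actual choice, is the mean squared error between the prediction score and the annotation score:

```python
import numpy as np

def mse_loss(pred, label):
    """Mean squared error between predicted and annotated scores;
    its gradient would drive the parameter adjustment described above."""
    pred = np.asarray(pred, dtype=float)
    label = np.asarray(label, dtype=float)
    return float(np.mean((pred - label) ** 2))
```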
The scoring method for spoken questions and answers provided by the embodiments of the present application comprises: inputting the answering audio into a preset voice recognition model to obtain a voice feature vector and a spoken language identification text corresponding to the answering audio; inputting the spoken language identification text and the question text into a preset semantic extraction model to obtain a text feature vector corresponding to the answering audio; inputting the answering audio into a preset acoustic model to obtain an acoustic feature vector corresponding to the answering audio; and, based on a preset scoring model, obtaining a prediction score corresponding to the answering audio according to the voice feature vector, the text feature vector, and the acoustic feature vector corresponding to the answering audio. The voice feature vector is extracted through the voice recognition model, the text feature vector is extracted through the semantic extraction model, the acoustic feature vector is extracted through the acoustic model, and scoring is finally carried out according to the multiple extracted vectors, so that the influence of voice recognition errors on the scoring effect in existing spoken question-answer scoring schemes can be alleviated or avoided, and the scoring is fairer and more stable.
In some embodiments, when faced with highly divergent open question-answer questions, traditional methods, which score by matching the similarity between manually produced reference answers and student answers, find it difficult to guarantee scoring fairness. The spoken question-answer scoring method provided by the embodiments of the present application can extract a text feature vector from the spoken language identification text and the question text based on the semantic extraction model, so that dialogue coherence is well modeled and scoring accuracy is improved.
In some embodiments, existing text representations struggle with the high sensitivity of text, for example the phenomenon that a slight word change greatly alters the meaning of a sentence. According to the spoken question-answer scoring method provided by the embodiments of the present application, a pre-trained BERT model is used to extract a text feature vector from the spoken language identification text and the question text, and various kinds of perturbed data are added in the pre-training stage, which increases the sensitivity of the pre-trained BERT model to language changes and improves scoring accuracy.
Referring to fig. 7 in conjunction with the foregoing embodiments, an embodiment of the present application further provides a method for training a spoken question-answer scoring model. As shown in fig. 3, the spoken question-answer scoring model includes a speech recognition model, a semantic extraction model, an acoustic model, and a scoring model.
Referring to fig. 7, the training method includes steps S210 to S250.
Step S210, obtaining a question text;
s220, acquiring answering audio of student answering and corresponding annotation scores;
step S230, inputting the answering audio into a pre-trained voice recognition model to obtain a voice feature vector and a spoken language recognition text corresponding to the answering audio;
step S240, inputting the spoken language identification text and the question text into a pre-trained semantic extraction model to obtain a text feature vector corresponding to the answering audio;
s250, inputting the answering audio into a pre-trained acoustic model to obtain an acoustic feature vector corresponding to the answering audio;
step S260, based on a preset scoring model, obtaining a prediction score corresponding to the answering audio according to the voice feature vector, the text feature vector and the acoustic feature vector corresponding to the answering audio;
and step S270, adjusting model parameters of at least one model of the voice recognition model, the semantic extraction model, the acoustic model and the scoring model according to the prediction score and the labeling score corresponding to the answering audio.
In some embodiments, the training method further comprises:
acquiring a first training sample, wherein the first training sample comprises a spoken language identification text;
based on a preset data enhancement rule, performing data enhancement on the spoken language identification text to obtain a first training sample after data enhancement;
pre-training the semantic extraction model according to the first training sample;
acquiring a second training sample, wherein the second training sample comprises a question text and a corresponding answer text;
and according to the second training sample, pre-training a semantic extraction model pre-trained on the basis of the first training sample to obtain the pre-trained semantic extraction model.
Illustratively, the semantic extraction model is pre-trained in two stages. In the first stage, a large number of adversarial samples are generated in an unsupervised manner, and the semantic extraction model is pre-trained on weakly perturbed text, so that the semantic extraction model achieves both sensitivity and robustness on spoken language data. The second-stage training can use massive dialogue samples to model dialogue-consistency evaluation, so that reference answers to the questions to be answered need not be produced manually, thereby achieving fairer scoring.
In some embodiments, the inputting the answering audio into a pre-trained speech recognition model to obtain a speech feature vector and a spoken language recognition text corresponding to the answering audio includes:
inputting the answering audio into a first feature extraction model of the voice recognition model to obtain a first voice feature of the answering audio;
inputting the first voice characteristic into an encoder sub-model of the voice recognition model to obtain a voice characteristic vector corresponding to the answering audio;
and inputting the voice feature vector into a decoder sub-model of the voice recognition model to obtain a spoken language recognition text corresponding to the answering audio.
The specific principle and implementation manner of the training method for the spoken language question-answer scoring model provided in the embodiment of the application are similar to those of the spoken language question-answer scoring method in the foregoing embodiment, and are not described herein again.
The methods of the present application are operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
Illustratively, the above-described method may be implemented in the form of a computer program that is executable on a computer device as shown in fig. 8.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure. The computer device may be a server or a terminal.
Referring to fig. 8, the computer device includes a processor, a memory, and a network interface connected through a system bus, wherein the memory may include a nonvolatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program comprises program instructions which, when executed, cause a processor to perform the steps of any of the methods described above.
The processor is used for providing calculation and control capability and supporting the operation of the whole computer equipment.
The internal memory provides an environment for the execution of a computer program on a non-volatile storage medium, which when executed by a processor causes the processor to perform the steps of any of the methods described above.
The network interface is used for network communication, such as sending assigned tasks and the like. Those skilled in the art will appreciate that the configuration of the computer apparatus is merely a block diagram of a portion of the configuration associated with aspects of the present application and is not intended to limit the computer apparatus to which aspects of the present application may be applied, and that a particular computer apparatus may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
It should be understood that the Processor may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
Wherein, in one embodiment, the processor is configured to execute a computer program stored in the memory to implement the steps of:
acquiring a question text;
acquiring a response audio frequency of a student response;
inputting the answering audio into a preset voice recognition model to obtain a voice feature vector and a spoken language recognition text corresponding to the answering audio;
inputting the spoken language identification text and the question text into a preset semantic extraction model to obtain a text feature vector corresponding to the answering audio;
inputting the answering audio into a preset acoustic model to obtain an acoustic feature vector corresponding to the answering audio;
and obtaining a prediction score corresponding to the answering audio according to the voice feature vector, the text feature vector and the acoustic feature vector corresponding to the answering audio based on a preset scoring model.
Wherein, in one embodiment, the processor is configured to execute a computer program stored in the memory to implement the steps of:
acquiring a question text;
acquiring answering audio of student answers and corresponding labeled scores;
inputting the answering audio into a pre-trained voice recognition model to obtain a voice characteristic vector and a spoken language recognition text corresponding to the answering audio;
inputting the spoken language identification text and the question text into a pre-trained semantic extraction model to obtain a text feature vector corresponding to the answering audio;
inputting the answering audio into a pre-trained acoustic model to obtain an acoustic feature vector corresponding to the answering audio;
based on a preset scoring model, obtaining a prediction score corresponding to the answering audio according to the voice feature vector, the text feature vector and the acoustic feature vector corresponding to the answering audio;
and adjusting model parameters of at least one model of the voice recognition model, the semantic extraction model, the acoustic model and the scoring model according to the prediction score and the labeling score corresponding to the answering audio.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solution of the present application, in essence, or the part thereof contributing to the prior art, may be embodied in the form of a software product, which may be stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods of the embodiments or some parts of the embodiments of the present application, such as:
a computer-readable storage medium, which stores a computer program, where the computer program includes program instructions, and the processor executes the program instructions to implement the steps of any one of the scoring methods for spoken questions and answers provided in the embodiments of the present application.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device.
While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A scoring method for spoken questions and answers is characterized by comprising the following steps:
acquiring a question text;
acquiring a response audio frequency of a student response;
inputting the answering audio into a preset voice recognition model to obtain a voice feature vector and a spoken language recognition text corresponding to the answering audio;
inputting the spoken language identification text and the question text into a preset semantic extraction model to obtain a text feature vector corresponding to the answering audio;
inputting the answering audio into a preset acoustic model to obtain an acoustic feature vector corresponding to the answering audio;
and obtaining a prediction score corresponding to the answering audio according to the voice feature vector, the text feature vector and the acoustic feature vector corresponding to the answering audio based on a preset scoring model.
2. The scoring method according to claim 1, wherein the inputting the answering audio into a preset speech recognition model to obtain a speech feature vector and a spoken language recognition text corresponding to the answering audio comprises:
inputting the answering audio into a first feature extraction model of the voice recognition model to obtain a first voice feature of the answering audio;
inputting the first voice characteristic into an encoder sub-model of the voice recognition model to obtain a voice characteristic vector corresponding to the answering audio;
and inputting the voice feature vector into a decoder sub-model of the voice recognition model to obtain a spoken language recognition text corresponding to the answering audio.
3. A scoring method as claimed in claim 2, wherein said inputting the answering audio into a preset acoustic model to obtain an acoustic feature vector corresponding to the answering audio comprises:
obtaining a second voice feature of the answering audio based on a second feature extraction model of the acoustic model;
and inputting the second voice characteristic into an acoustic information extraction submodel of the acoustic model to obtain an acoustic characteristic vector corresponding to the answering audio.
4. A scoring method according to claim 3, wherein the first speech feature is an fbank feature and the second speech feature is an MFCC feature.
5. A scoring method according to any one of claims 1-4, wherein said entering the spoken language identification text and the question text into a preset semantic extraction model to obtain a text feature vector corresponding to the answering audio comprises:
inputting a preset initial character, the spoken language identification text, the question text and a preset interval character between the spoken language identification text and the question text into an embedding sub-model of the semantic extraction model to obtain an embedding vector;
and inputting the embedded vector into a multi-head self-attention sub-model of the semantic extraction model to obtain a corresponding text feature vector.
6. A training method of a spoken question-answer scoring model is characterized in that the spoken question-answer scoring model comprises a voice recognition model, a semantic extraction model, an acoustic model and a scoring model;
the training method comprises the following steps:
acquiring a question text;
acquiring answering audio of student answers and corresponding labeled scores;
inputting the answering audio into a pre-trained voice recognition model to obtain a voice characteristic vector and a spoken language recognition text corresponding to the answering audio;
inputting the spoken language identification text and the question text into a pre-trained semantic extraction model to obtain a text feature vector corresponding to the answering audio;
inputting the answering audio into a pre-trained acoustic model to obtain an acoustic feature vector corresponding to the answering audio;
based on a preset scoring model, obtaining a prediction score corresponding to the answering audio according to the voice feature vector, the text feature vector and the acoustic feature vector corresponding to the answering audio;
and adjusting model parameters of at least one model of the voice recognition model, the semantic extraction model, the acoustic model and the scoring model according to the prediction score and the labeling score corresponding to the answering audio.
7. The training method of claim 6, further comprising:
acquiring a first training sample, wherein the first training sample comprises a spoken language identification text;
based on a preset data enhancement rule, performing data enhancement on the spoken language identification text to obtain a first training sample after data enhancement;
pre-training the semantic extraction model according to the first training sample;
acquiring a second training sample, wherein the second training sample comprises a question text and a corresponding answer text;
and according to the second training sample, pre-training a semantic extraction model pre-trained on the basis of the first training sample to obtain the pre-trained semantic extraction model.
8. The training method as claimed in claim 6 or 7, wherein the inputting the answering audio into the pre-trained speech recognition model to obtain the corresponding speech feature vector of the answering audio and the spoken language recognition text comprises:
inputting the answering audio into a first feature extraction model of the voice recognition model to obtain a first voice feature of the answering audio;
inputting the first voice characteristic into an encoder sub-model of the voice recognition model to obtain a voice characteristic vector corresponding to the answering audio;
and inputting the voice feature vector into a decoder sub-model of the voice recognition model to obtain a spoken language recognition text corresponding to the answering audio.
9. A computer device, wherein the computer device comprises a memory and a processor;
the memory is used for storing a computer program;
the processor is used for executing the computer program and realizing the following when the computer program is executed:
a step of the scoring method of spoken questions and answers according to any one of claims 1 to 5; and/or
The steps of the training method of the spoken question-answer scoring model according to any one of claims 6 to 8.
10. A computer-readable storage medium storing a computer program, wherein if the computer program is executed by a processor, the computer program implements:
a step of the scoring method of spoken questions and answers according to any one of claims 1 to 5; and/or
The steps of the training method of the spoken question-answer scoring model according to any one of claims 6 to 8.
CN202111618214.1A 2021-12-27 2021-12-27 Spoken question and answer scoring method, spoken question and answer training method, computer equipment and storage medium Pending CN114360537A (en)

Publications (1)

Publication Number Publication Date
CN114360537A true CN114360537A (en) 2022-04-15


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115827879A (en) * 2023-02-15 2023-03-21 Shandong Shanda Ouma Software Co., Ltd. Low-resource text intelligent review method and device based on sample enhancement and self-training
CN115827879B (en) * 2023-02-15 2023-05-26 Shandong Shanda Ouma Software Co., Ltd. Low-resource text intelligent review method and device based on sample enhancement and self-training


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination