CN109272992B

CN109272992B - Spoken language evaluation method and device for generating spoken language evaluation model

Info

Publication number: CN109272992B
Application number: CN201811422784.1A
Authority: CN
Inventors: 夏龙; 高强; 吴凡; 陈冬晓; 马楠; 崔成鹏; 郭常圳
Original assignee: Beijing Ape Power Future Technology Co Ltd
Current assignee: Beijing Ape Power Future Technology Co Ltd
Priority date: 2018-11-27
Filing date: 2018-11-27
Publication date: 2022-03-18
Anticipated expiration: 2038-11-27
Also published as: CN109272992A

Abstract

The application provides a spoken language evaluation method, a device and a device for generating a spoken language evaluation model, wherein the spoken language evaluation method is applied to a data processing end of a spoken language evaluation system and comprises the following steps: acquiring comment voice data to be detected; extracting the spoken language attribute feature information of the comment voice data to be detected; and obtaining a scoring result according to the spoken language attribute characteristic information of the speech data to be evaluated based on the pre-generated spoken language evaluation model.

Description

Spoken language evaluation method and device for generating spoken language evaluation model

Technical Field

The present application relates to the field of language learning technologies, and in particular, to a method and an apparatus for spoken language evaluation, and an apparatus for generating a spoken language evaluation model, a computing device, and a computer-readable storage medium.

Background

With the development of global economy integration, english, as an international common language, increasingly exhibits its important role. Business, cultural exchange, cross-country tourism and other activities are increasingly frequent, more and more people need to learn a foreign language, and thus the improvement of the spoken language communication capacity becomes an urgent need for foreign language learning.

Most of the existing spoken language evaluation technologies are specific to adults, and the speech training data are basically specific to adults. Because the vocal tract of the infant is not developed completely and the labeled infant training data are few, the models hardly have good performance on the oral evaluation of the infant. Other models require a large amount of sample audio and the flexibility of the model is low.

Disclosure of Invention

In view of this, embodiments of the present application provide a spoken language evaluation method, an apparatus for generating a spoken language evaluation model, a computing device, and a computer-readable storage medium, so as to solve technical defects in the prior art.

The embodiment of the application discloses a spoken language evaluation method, which is applied to a data processing end of a spoken language evaluation system and comprises the following steps:

acquiring comment voice data to be detected;

extracting the spoken language attribute feature information of the speech data to be evaluated;

and obtaining a scoring result according to the spoken language attribute characteristic information of the speech data to be evaluated based on a pre-generated spoken language evaluation model.

Optionally, after the obtaining of the speech data to be tested, before the extracting of the spoken language attribute feature information of the speech data to be tested includes:

carrying out voice ending detection on the to-be-detected comment voice data;

carrying out sound channel length normalization processing on the to-be-detected comment sound data after the voice is over detected;

performing voice recognition on the to-be-tested speech data subjected to the sound channel length normalization processing to obtain a speech frame sequence of the to-be-tested speech data and a conversion text corresponding to the speech frame sequence;

and carrying out forced alignment processing on the converted text and the reference text, and generating forced alignment information, wherein the forced alignment information comprises the corresponding relation between the voice frame sequence of the converted text and the voice frame sequence of the reference text.

Optionally, the performing voice end detection on the to-be-detected comment voice data includes:

identifying the to-be-detected comment voice data, and converting the identified to-be-detected comment voice data into to-be-finished detection voice texts containing time information;

matching the reference text with the voice text to be finished and detected to obtain a matched text time point;

and determining the ending time point of the to-be-tested comment voice data according to a preset voice activity detection method and the matched text time point.

Optionally, the spoken language attribute feature information includes:

pronunciation attribute feature information;

tone attribute feature information;

and fluency attribute feature information.

Optionally, the spoken language attribute feature information further includes: marking feature information on words;

after the generating of the forced alignment information, further comprising:

and judging whether the converted text is matched with the reference text or not according to the forced alignment information, and if not, adding word labeling characteristic information to the words which are not matched in the reference text.

Optionally, the obtained to-be-tested speech data in the to-be-tested speech data is fragmented and compressed speech data;

the acquiring of the comment voice data to be detected comprises the following steps:

receiving the segmented and compressed speech data to be tested;

and decompressing the segmented and compressed comment voice data to be detected.

Optionally, in the process of the forced alignment processing, a decoding graph with a clin closure is used for filtering out speech data repeatedly appearing in the speech data of the comment to be detected according to the reference text.

Optionally, the obtaining the speech data to be tested, and the extracting the spoken language attribute feature information of the speech data to be tested includes:

acquiring initial voice data;

extracting target voice data from the initial voice data by adopting a voice separation method based on deep clustering as to-be-detected comment voice data;

and extracting the spoken language attribute feature information of the comment voice data to be detected.

In another aspect, the present application further provides a method for generating a spoken language evaluation model, including:

generating a spoken language training sample set according to a spoken language training sample group, wherein each spoken language training sample group comprises spoken language training samples, and each spoken language training sample comprises spoken language attribute feature information and a scoring result corresponding to the spoken language attribute feature information;

and training a spoken language evaluation model according to the spoken language training sample set, wherein the spoken language evaluation model enables the spoken language attribute feature information of each spoken language training sample to be associated with the corresponding scoring result.

Optionally, before training the spoken language evaluation model according to the spoken language training sample set, the method includes:

and carrying out vocal tract length normalization processing on the voice data of the spoken language training samples so as to unify the vocal tract lengths of the spoken language training samples in all the spoken language training sample groups.

Optionally, the spoken language attribute feature information includes:

pronunciation attribute feature information;

tone attribute feature information;

and fluency attribute feature information.

On the other hand, this application still provides a spoken language evaluation device, includes:

the voice data acquisition module is configured to acquire the to-be-tested comment voice data;

the attribute feature extraction module is configured to extract the spoken language attribute feature information of the comment voice data to be detected;

and the scoring result generating module is configured to obtain a scoring result according to the spoken language attribute feature information of the to-be-evaluated speech data based on a pre-generated spoken language evaluation model.

Optionally, the apparatus further comprises:

the voice ending detection module is configured to carry out voice ending detection on the to-be-detected comment voice data;

the sound channel degree normalization processing module is configured to perform sound channel length normalization processing on the to-be-detected comment sound data after the voice end detection;

the voice recognition module is configured to perform voice recognition on the to-be-detected comment voice data subjected to the sound channel length normalization processing to obtain a conversion text of the to-be-detected comment voice data;

and the forced alignment module is configured to perform forced alignment processing on the converted text and the reference text and generate forced alignment information, wherein the forced alignment information comprises a corresponding relation between a voice frame sequence of the converted text and a voice frame sequence of the reference text.

Optionally, the end of utterance detection module includes:

the recognition unit is configured to recognize the to-be-detected comment voice data and convert the recognized to-be-detected comment voice data into to-be-finished detection voice texts containing time information;

the matching unit is configured to match the reference text with the voice text to be detected to obtain a matching text time point;

and the determining unit is configured to determine the ending time point of the to-be-tested speech data according to a preset speech activity detection method and the matched text time point.

Optionally, the apparatus further comprises:

and the marking module is configured to judge whether the converted text is matched with the reference text or not according to the forced alignment information, and if not, adding word marking characteristic information to the words which are not matched in the reference text.

Optionally, the voice data obtaining module includes:

the first acquisition unit is configured to acquire the voice data after the fragment compression;

and the decompression unit is configured to decompress the segmented and compressed comment voice data to be tested.

Optionally, the forced alignment module is configured to use a decoding graph with a clin closure to filter out repeatedly appearing speech data in the speech data to be evaluated according to the reference text.

Optionally, the voice data obtaining module includes:

a second acquisition unit configured to acquire initial voice data;

and the extraction unit is configured to extract target voice data from the initial voice data as the to-be-evaluated voice data by adopting a voice separation method based on deep clustering.

In another aspect, the present application further provides an apparatus for generating a spoken language evaluation model, including:

the training sample generating module is configured to generate a spoken language training sample set according to spoken language training sample sets, each spoken language training sample set comprises spoken language training samples, and each spoken language training sample comprises spoken language attribute feature information and a scoring result corresponding to the spoken language attribute feature information;

and the model generation module is configured to train a spoken language evaluation model according to the spoken language training sample set to obtain the spoken language evaluation model, and the spoken language evaluation model enables the spoken language attribute feature information of each spoken language training sample to be associated with the corresponding scoring result.

Optionally, the apparatus further comprises: and the vocal tract length normalization processing module is configured to perform vocal tract length normalization processing on the voice data of the spoken language training samples so as to unify the vocal tract lengths of the spoken language training samples in all the spoken language training sample groups.

In another aspect, the present application further provides a computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, the processor implementing the method for spoken language assessment or the steps of the method for generating a spoken language assessment model when executing the instructions.

In another aspect, the present application further provides a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the spoken language evaluation method or the method of generating a spoken language evaluation model.

According to the oral evaluation method and device, the device for generating the oral evaluation model, the computing equipment and the computer readable storage medium, the oral evaluation method and device can be used for carrying out oral evaluation on different crowds by carrying out sound channel length normalization processing and model training on oral training samples, and obtaining accurate oral evaluation results, so that the problem of oral evaluation of infants is solved. Meanwhile, the pronunciation ending detection is used in the spoken language assessment method, so that the fluency of the whole spoken language assessment is improved, and the use threshold of a user is reduced. In addition, the problem of repeated follow-up reading is solved by using the Klin closure in the method; the voice separation method based on the deep clustering effectively avoids the phenomenon of superposition or alternation of voice data to be detected.

Drawings

FIG. 1 is a block diagram of a computing device according to an embodiment of the present application;

FIG. 2 is a schematic flow chart diagram of a spoken language assessment method according to an embodiment of the present application;

FIG. 3 is a schematic flow chart of a spoken language evaluation method according to an embodiment of the present application

FIG. 4 is a schematic flow chart diagram of generating a spoken language assessment model according to an embodiment of the present application;

fig. 5 is a schematic structural diagram of a TDNN network according to an embodiment of the present application;

fig. 6 is a schematic partial structure diagram of a TDNN network structure according to an embodiment of the present application;

fig. 7 is a schematic structural diagram of a spoken language evaluation device according to an embodiment of the present application;

fig. 8 is a schematic structural diagram of an apparatus for generating a spoken language evaluation model according to an embodiment of the present application.

Detailed Description

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. This application is capable of implementation in many different ways than those herein set forth and of similar import by those skilled in the art without departing from the spirit of this application and is therefore not limited to the specific implementations disclosed below.

In the present application, a spoken language evaluation method, an apparatus for generating a spoken language evaluation model, a computing device, and a computer-readable storage medium are provided, which are described in detail one by one in the following embodiments.

FIG. 1 shows a block diagram of a computing device 100, according to an embodiment of the present description. The components of the computing device 100 include, but are not limited to, memory 110 and processor 120. The processor 120 is coupled to the memory 110 via a bus 130, the database 150 is used for storing data, and the network 160 is used for receiving data stored in the database 150.

Computing device 100 also includes access device 140, access device 140 enabling computing device 100 to communicate via one or more networks 160. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. Access device 140 may include one or more of any type of network interface (e.g., a Network Interface Card (NIC)) whether wired or wireless, such as an IEEE802.11 Wireless Local Area Network (WLAN) wireless interface, a worldwide interoperability for microwave access (Wi-MAX) interface, an ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a bluetooth interface, a Near Field Communication (NFC) interface, and so forth.

In an embodiment of the present description, the other components of the computing device 100 described above and not shown in fig. 1 may also be connected to each other, for example, by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 1 is for purposes of example only and is not limiting as to the scope of the description. Those skilled in the art may add or replace other components as desired.

Computing device 100 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), a mobile phone (e.g., smartphone), a wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 100 may also be a mobile or stationary server.

Wherein the processor 120 may perform the steps of the method shown in fig. 2. Fig. 2 is a schematic flow chart of a spoken language evaluation method according to an embodiment of the present application, which is applied to a data processing end of a spoken language evaluation system and includes steps 202 to 206:

step 202: and acquiring the speech data to be evaluated.

Step 204: and extracting the spoken language attribute feature information of the comment voice data to be detected.

Wherein the spoken language attribute feature information includes: pronunciation attribute feature information, intonation attribute feature information and fluency attribute feature information;

step 206: and obtaining a scoring result according to the spoken language attribute characteristic information of the speech data to be evaluated based on a pre-generated spoken language evaluation model.

After the to-be-tested comment voice data is obtained, the spoken language evaluation model can obtain a scoring result according to the spoken language attribute feature information of the to-be-tested comment voice data, the steps are simple, and the time for a user to wait for the result can be greatly shortened. And the spoken language evaluation model is trained in advance, so that the data of the speech to be evaluated can be scored without storing a large amount of sample audios, and the application range of the method is expanded.

The content of the spoken language attribute characteristic information reflects the scoring direction of the spoken language evaluation model, namely, the spoken language evaluation model scores the pronunciation, intonation and fluency of the speech data to be evaluated, and gives a score to evaluate the accuracy of the speech data to be evaluated.

For example, the given score of the spoken language evaluation model is given in percentage, and after a section of speech is input into the spoken language evaluation model, the spoken language evaluation model scores the section of speech in three aspects of pronunciation, intonation and fluency, and gives a score result of 98.

Referring to fig. 3, fig. 3 shows a schematic flow diagram of a spoken language assessment method according to an embodiment of the present application; the spoken language evaluation method comprises steps 302 to 322:

step 302: and acquiring the speech data to be tested after the fragmentation compression, wherein the speech data to be tested is initial speech data.

Under the condition that the data processing end is a server end, if the server end directly obtains uncompressed comment voice data to be tested from a client, a user needs to consume large network flow, large delay is brought, and pressure is brought to a server network broadband. And carrying out fragment compression processing on the to-be-tested comment voice data and carrying out fragment transmission.

The size of the to-be-tested comment voice data fragment influences the performance of the whole spoken language evaluation, the fragment is too short, the transmission delay is low, but the utilization rate of a CPU/GPU at the server side is low. Fragmentation is too long, although the CPU/GPU utilization is high, the network latency is large.

In actual use, the optimal segmentation length of each piece of speech evaluation data to be tested can be determined through a large number of experiments.

The audio compression algorithm is divided into lossless and lossy compression, the lossless compression has no information loss but has a low compression ratio, and the lossy compression has a part of information loss but has a higher compression ratio and has little influence on the speech recognition result. There are many mature methods for compression coding, and one embodiment of the present application uses aac (advanced Audio coding) format compression coding.

Step 304: decompression processing (inverse transformation of AAC) is performed on the piece-compressed speech data to be measured.

And decompressing the segmented and compressed speech data to be evaluated, so that the subsequent processing of the speech data to be evaluated is facilitated.

Step 306: and extracting target voice data from the initial voice data as the to-be-detected comment voice data by adopting a voice separation method based on deep clustering.

Under the scene that the child has parents to accompany and read, the phenomenon of voice superposition or alternation may occur. The voice data of the infant can be accurately extracted from the voice data with parents for accompanying reading by adopting a deep clustering voice separation method, so that the accuracy of spoken language assessment is improved.

Step 308: and carrying out voice ending detection on the to-be-detected comment voice data.

In this step, the method specifically comprises the following steps:

matching the reference text and the voice text to be detected to obtain a matched text time point, wherein the matched text time point is a time point in the time information of the reference text and a time point after the time point in the time information of the voice text to be detected is matched;

The reference text is a text containing user Voice content and time information, and the preset Voice Activity Detection method may include, but is not limited to, Voice Activity Detection (VAD). In the voice finish detection process, whether the user finishes reading can be accurately judged by determining the finishing time point of the to-be-detected comment voice data by using the VAD and the matched text time point, so that the finishing time point of the user voice is automatically detected under the condition of no manual intervention, the fluency of the whole spoken language evaluation is further improved, the use threshold of the method is reduced, and the method is particularly suitable for the scene of independent use of infants under the condition of no accompanying of parents.

The traditional VAD technology based on the dominant frequency component can generate a large amount of false positive false positives in a scene with high environmental noise. Because the spoken language training samples in the embodiment of the application contain various real noise scenes and the texts read by the user are preset, the voice recognition-based VAD is more robust, and the false positive rate is greatly reduced.

Step 310: and carrying out sound channel length normalization processing on the to-be-detected comment sound data after the voice is over and detected.

In the step, a VTLN (Vocal track Length normalization) technology is adopted to eliminate the recognition performance influence caused by the shape of the Vocal Tract, and the application range of the method is expanded under the condition of ensuring the evaluation accuracy.

Step 312: and carrying out voice recognition on the to-be-tested speech data subjected to the sound channel length normalization processing to obtain a speech frame sequence of the to-be-tested speech data and a converted text corresponding to the speech frame sequence.

The real to-be-tested speech data can be obtained by performing speech recognition on the to-be-tested speech data subjected to the sound channel length normalization processing, and the scoring accuracy can be improved conveniently.

Step 314: and carrying out forced alignment processing on the converted text and the reference text, and generating forced alignment information, wherein the forced alignment information comprises the corresponding relation between the voice frame sequence of the converted text and the voice frame sequence of the reference text.

In this step, a Viterbi algorithm is adopted, which is a dynamic programming algorithm, and the optimal substructure thereof is as follows:

α(t，j)＝b_j(o_t)min_iα(t-1，i)a_ijformula (1)

Wherein alpha (t, j) represents the probability of the optimal path with the hidden state j until the time t; b_j(o_t) Indicating the generation of o in a given hidden state j_tThe transmission probability of (a); a is_ijRepresenting the transition probability from the hidden state i to j; o_tRepresenting the observed value at time t.

The forced alignment information provides detailed alignment information of the user voice and the reference text voice frame sequence and provides basic information for subsequent voice scoring.

In the process of the forced alignment processing, a decoding graph with a clin closure is adopted for filtering out repeated voice data in the comment voice data to be detected according to a reference text, wherein the clin closure is widely used in voice recognition based on weighted finite-state-machine (WFST) decoding and is used for solving the problem of repeated reading.

Step 316: and extracting the spoken language attribute feature information of the comment voice data to be detected.

Wherein the spoken language attribute feature information includes: pronunciation attribute feature information;

tone attribute feature information;

and fluency attribute feature information.

The content of the spoken language attribute characteristic information reflects the scoring direction of the spoken language evaluation model, namely scoring the pronunciation, the intonation and the fluency of the speech data to be evaluated.

The pronunciation attribute feature information consists of voice recognition prediction probabilities corresponding to each word, syllable and phoneme in the speech data to be evaluated; the tone attribute feature information is determined by the duration and the fundamental tone frequency corresponding to each word, syllable and phoneme in the comment voice data to be detected; the fluency attribute feature information is determined by the duration, the fundamental tone frequency and the speech speed corresponding to each word, syllable and phoneme in the comment voice data to be tested.

Step 318: and obtaining a scoring result according to the spoken language attribute characteristic information of the speech data to be evaluated based on a pre-generated spoken language evaluation model.

A series of preprocessing such as decompression processing, deep clustering voice separation, forced alignment processing and the like are carried out on the voice data to be evaluated, the accuracy of the scoring of the spoken language evaluation model can be improved, the application range of the method is expanded, different crowds can use the method to carry out spoken language evaluation, and accurate scoring results are obtained.

Referring to fig. 4, fig. 4 shows a schematic flow chart of generating a spoken language evaluation model according to an embodiment of the present application, which is obtained through steps 3091 to 3093:

step 3092: generating a spoken language training sample set according to a spoken language training sample group, wherein each spoken language training sample group comprises spoken language training samples, and each spoken language training sample comprises spoken language attribute feature information and a scoring result corresponding to the spoken language attribute feature information.

With the use of the user and the accumulation of the spoken language training samples, an embodiment of the application continuously expands the spoken language training sample set and iterates the model by adopting an unsupervised method. This allows the user speech to be evaluated without providing example audio, making the method more versatile and robust.

Step 3094: and carrying out vocal tract length normalization processing on the voice data of the spoken language training samples so as to unify the vocal tract lengths of the spoken language training samples in all the spoken language training sample groups.

In the step, a VTLN (vocal tract length normalization) technology is adopted to eliminate the recognition performance influence caused by different vocal tract shapes of the spoken language training sample set, so that the accuracy of the spoken language evaluation model for evaluating the users with different vocal tract shapes is ensured.

Step 3096: and training a spoken language evaluation model according to the spoken language training sample set, wherein the spoken language evaluation model enables the spoken language attribute feature information of each spoken language training sample to be associated with the corresponding scoring result.

The spoken language evaluation model adopts a Time Delay Neural Network (TDNN) based on sub-sampling, the TDNN is a special convolution network, can efficiently model a long-dependence time sequence relationship, and has the advantages of small operand, good recognition effect, suitability for real-time speech recognition scenes and the like. A number of experimental results show that TDNN using symmetric context is superior to asymmetric TDNN. A specific schematic diagram of the TDNN network structure is shown in fig. 5, where 1 to 6 are hidden layers and 7 is an output layer.

Referring to fig. 6, fig. 6 shows a partial structure 500 of the TDNN network structure in fig. 5, which is spliced by a splicing layer after speech frames are extracted from an input layer, so as to splice multiple paths of input; then the spliced voice frames are conveyed to a full connection layer by the splicing layer, and the full connection layer performs full connection on the spliced voice frames; and then, the voice frames after full connection are transmitted to a pooling layer, and the output size of the voice frames after full connection is reduced by the pooling layer.

Step 320: the spoken language attribute feature information further includes: marking feature information on words;

and judging whether the converted text is matched with the reference text or not according to the forced alignment information, and if not, adding word labeling characteristic information to the words which are not matched in the reference text. And judging whether the converted text is matched with the reference text or not according to the forced alignment information, if not, judging that the content of the reference text corresponding to the voice key frame is an unreadable part and the rest unmatched content in the reference text is an misreading part if the voice key frame sequence in the voice frame sequence corresponding to the reference text is aligned with the voice blank frame sequence in the voice frame sequence corresponding to the converted text.

According to the word marking characteristic information, the word which is not read or read by the user can be determined, so that the spoken language evaluation model can score according to the pronunciation, tone and fluency of the speech data to be evaluated and the word which is not read or read by the user, and compared with the spoken language evaluation model which scores according to the pronunciation, tone and fluency of the speech data to be evaluated, the scoring result is more accurate.

In addition, the word marking characteristic information can mark the corresponding words differently according to different error modes, namely misreading and misreading, of the words, so that the misreading and misreading words can be displayed in different modes subsequently; wherein the different ways include, but are not limited to, adding different font colors, font sizes or/and font types to the missed and misread words, respectively, to distinguish which are the missed words and which are the misread words.

Step 322: and displaying the unmatched words according to the word labeling characteristic information, and displaying the scoring result.

And displaying the unmatched words and displaying the scoring results, so that the user can know not only the final scoring result but also the word which is missed or wrongly read by the user. Through the sound channel length normalization processing, the oral evaluation can be performed on different crowds, an accurate oral evaluation result can be obtained, and the problem of oral evaluation of infants is particularly solved.

According to the oral evaluation method, the vocal tract length normalization processing and the model training are carried out on the oral training samples, so that the oral evaluation method can be used for carrying out oral evaluation on different crowds, an accurate oral evaluation result is obtained, and the problem of oral evaluation of infants is particularly solved. Meanwhile, the pronunciation ending detection is used in the spoken language assessment method, so that the fluency of the whole spoken language assessment is improved, and the use threshold of a user is reduced. In addition, the problem of repeated follow-up reading is solved by using the Klin closure in the method; the voice separation method based on the deep clustering effectively avoids the phenomenon of superposition or alternation of voice data to be detected.

An embodiment of the present disclosure further provides a spoken language evaluation device, and referring to fig. 7, fig. 7 is a schematic structural diagram of a spoken language evaluation device 700 according to an embodiment of the present disclosure, including:

a voice data acquisition module 702 configured to acquire the to-be-tested comment voice data;

an attribute feature extraction module 704 configured to extract spoken language attribute feature information of the speech data to be evaluated;

a scoring result generation module 706 configured to obtain a scoring result according to the spoken language attribute feature information of the speech data to be evaluated based on a pre-generated spoken language evaluation model;

after the to-be-tested comment voice data is obtained, the attribute feature extraction module can extract the spoken language attribute feature information of the to-be-tested comment voice data, so that the scoring result generation module uses the spoken language evaluation model to obtain a scoring result according to the spoken language attribute feature information of the to-be-tested comment voice data. And the spoken language evaluation model is trained in advance, so that the data of the speech to be evaluated can be scored without storing a large amount of sample audios, and the application range of the device is expanded.

An embodiment of the present specification further provides a spoken language assessment apparatus, including:

In an optional embodiment, the spoken language assessment apparatus further includes:

the voice recognition module is configured to perform voice recognition on the to-be-tested comment voice data subjected to the sound channel length normalization processing to obtain a conversion text of the to-be-tested comment voice data;

The spoken language evaluation device can carry out spoken language evaluation on different crowds by carrying out vocal tract length normalization processing and model training on the spoken language training sample, obtain an accurate spoken language evaluation result and particularly solve the problem of infant spoken language evaluation. Meanwhile, the speech termination detection is used in the spoken language assessment device, so that the smoothness of the whole spoken language assessment is improved, and the use threshold of a user is reduced. In addition, the problem of repeated follow-up reading is solved by using the Klin closure in the method; the voice separation method based on the deep clustering effectively avoids the phenomenon of superposition or alternation of voice data to be detected.

In an optional embodiment, the end of utterance detection module includes:

In the voice end detection process, whether the end time point of the to-be-detected comment voice data can be accurately judged whether the user finishes reading or not by matching the text time point, so that the completion of pronunciation can be automatically detected without manual intervention, the smoothness of the whole spoken language assessment is improved, the use threshold of the device is reduced, and the device is particularly suitable for a scene in which an infant is independently used under the condition that parents do not accompany.

In an optional embodiment, the spoken language attribute feature information includes:

pronunciation attribute feature information, intonation attribute feature information, fluency attribute feature information, and word tagging feature information.

and the marking module is used for judging whether the converted text is matched with the reference text or not according to the forced alignment information, and if not, adding word marking characteristic information to the words which are not matched in the reference text.

The marking module can determine the word which is not read or read by the user according to the word marking characteristic information, so that the spoken language assessment model can score according to the pronunciation, tone and fluency of the speech data to be assessed and the word which is not read or read by the user, and compared with the spoken language assessment model which scores according to the pronunciation, tone and fluency of the speech data to be assessed, the scoring result is more accurate.

In addition, the marking module can also mark words corresponding to different error modes differently according to the word marking characteristic information, so that words corresponding to misreading and missed reading can be displayed in different modes subsequently; wherein the different ways include, but are not limited to, adding different font colors, font sizes, or/and font types to the missed and misread words, respectively.

and the display module is configured to display the unmatched words and display the scoring result according to the word labeling characteristic information.

The display module displays the unmatched words and displays the scoring results, so that the user can know not only the scoring results but also the words which are missed and wrongly read by the user. Through the sound channel length normalization processing, the oral evaluation can be performed on different crowds, an accurate oral evaluation result can be obtained, and the problem of oral evaluation of infants is particularly solved.

In an optional embodiment, the voice data obtaining module includes:

The first acquisition unit can acquire the voice data after the fragment compression, so that the spoken language evaluation device can accept the voice data after the fragment compression.

The decompression unit can decompress the to-be-detected comment voice data after the fragment compression, so that the to-be-detected comment voice data can be conveniently processed subsequently.

In an optional embodiment, the forced alignment module employs a decoding graph with a clin closure, and is configured to filter out speech data that repeatedly appears in the speech data to be evaluated according to the reference text.

The forced alignment module adopts a decoding graph with a Clin closure to filter the repeated voice data in the to-be-tested comment voice data so as to solve the problem of repeated follow-up reading.

In an optional embodiment, the voice data obtaining module includes:

a second acquisition unit configured to acquire initial voice data;

The second acquisition unit can acquire the initial voice data, so that the extraction unit can extract the target voice data from the initial voice data as the comment voice data to be tested, and then can make a true and reliable score on the comment voice data to be tested.

An embodiment of the present specification further provides an apparatus for generating a spoken language evaluation model, and referring to fig. 8, fig. 8 is a schematic structural diagram of an apparatus for generating a spoken language evaluation model according to an embodiment of the present application, including:

a training sample generation module 802 configured to generate a spoken language training sample set according to spoken language training sample sets, each spoken language training sample set including spoken language training samples, each spoken language training sample including spoken language attribute feature information and a scoring result corresponding to the spoken language attribute feature information;

the model generation module 804 is configured to train a spoken language evaluation model according to the spoken language training sample set to obtain the spoken language evaluation model, and the spoken language evaluation model associates the spoken language attribute feature information of each spoken language training sample with a corresponding scoring result.

In the training stage of the spoken language evaluation model, the spoken language attribute feature information of each spoken language training sample is associated with the corresponding scoring result, so that the spoken language evaluation model can accurately score the speech to be evaluated according to the spoken language attribute feature information. The spoken language evaluation model can be realized by adopting various neural network models, and the spoken language evaluation model in the embodiment can be a TDNN (time domain neural network) model.

In an optional embodiment, the apparatus for generating a spoken language evaluation model further comprises: and the vocal tract length normalization processing module is configured to perform vocal tract length normalization processing on the voice data of the spoken language training samples so as to unify the vocal tract lengths of the spoken language training samples in all the spoken language training sample groups.

pronunciation attribute feature information;

tone attribute feature information;

and fluency attribute feature information.

And the content of the spoken language attribute characteristic information is expanded, so that the evaluation direction diversity of the spoken language evaluation model is enlarged, and the scoring result of the spoken language evaluation model is more real and reliable.

An embodiment of the present specification also provides a computing device comprising a memory, a processor and computer instructions stored on the memory and executable on the processor, the processor implementing the steps of the spoken language evaluation method or the method of generating a spoken language evaluation model when executing the instructions.

An embodiment of the present specification also provides a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the spoken language evaluation method or the method of generating a spoken language evaluation model.

The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium is the same as the technical solution of the above-mentioned spoken language evaluation method or the above-mentioned method for generating a spoken language evaluation model, and details of the technical solution of the storage medium, which are not described in detail, can be referred to the description of the technical solution of the above-mentioned spoken language evaluation method.

The computer instructions comprise computer program code which may be in the form of source code, object code, an executable file or some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.

It should be noted that, for the sake of simplicity, the above-mentioned method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.

In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

The preferred embodiments of the present application disclosed above are intended only to aid in the explanation of the application. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the application and the practical application, to thereby enable others skilled in the art to best understand and utilize the application. The application is limited only by the claims and their full scope and equivalents.

Claims

1. A spoken language evaluation method is applied to a data processing end of a spoken language evaluation system and is characterized by comprising the following steps:

acquiring comment voice data to be detected;

carrying out voice ending detection on the to-be-detected comment voice data;

carrying out forced alignment processing on the converted text and the reference text, and generating forced alignment information, wherein the forced alignment information comprises a corresponding relation between a voice frame sequence of the converted text and a voice frame sequence of the reference text;

obtaining a scoring result according to the spoken language attribute characteristic information of the speech data to be evaluated based on a pre-generated spoken language evaluation model;

acquiring initial voice data; the initial voice data is voice data with voice superposition or alternation;

and extracting target voice data from the initial voice data as the to-be-detected comment voice data by adopting a voice separation method based on deep clustering.

2. The spoken language assessment method according to claim 1, wherein the end-of-speech detection of the speech data to be assessed comprises:

3. The spoken language evaluation method according to claim 2,

the spoken language attribute feature information includes:

pronunciation attribute feature information;

tone attribute feature information;

and fluency attribute feature information.

4. The spoken language assessment method according to claim 3, wherein said spoken language attribute feature information further comprises: marking feature information on words;

after the generating of the forced alignment information, further comprising:

5. The spoken language assessment method according to claim 1, wherein said obtaining speech data to be assessed comprises:

acquiring voice data after fragment compression;

and decompressing the voice data after the fragment compression to obtain the comment voice data to be detected.

6. The spoken language assessment method according to claim 1, wherein during the forced alignment process, repeated speech data in the speech data to be assessed is filtered using a decoding graph with a clin closure.

7. A method of generating a spoken language assessment model, comprising:

generating a spoken language training sample set according to a spoken language training sample set, wherein each spoken language training sample set comprises spoken language training samples, each spoken language training sample comprises spoken language attribute feature information and a scoring result corresponding to the spoken language attribute feature information, and the spoken language attribute feature information is obtained by using the spoken language assessment method according to any one of claims 1 to 6;

8. The method of generating a spoken language assessment model according to claim 7, wherein said training of a spoken language assessment model according to said set of spoken training samples comprises:

9. A spoken language assessment device, comprising:

a forced alignment module configured to perform forced alignment processing on the converted text and a reference text and generate forced alignment information, where the forced alignment information includes a correspondence between a sequence of voice frames of the converted text and a sequence of voice frames of the reference text;

the scoring result generation module is configured to obtain a scoring result according to the spoken language attribute feature information of the to-be-evaluated speech data based on a pre-generated spoken language evaluation model;

the voice data acquisition module comprises:

a second acquisition unit configured to acquire initial voice data; the initial voice data is voice data with voice superposition or alternation;

10. The apparatus of claim 9, wherein the end of utterance detection module comprises:

11. The apparatus of claim 9, further comprising:

12. The apparatus of claim 9, wherein the voice data acquisition module comprises:

and the decompression unit is configured to decompress the voice data after the fragment compression to obtain the comment voice data to be detected.

13. The apparatus of claim 9, wherein the forced alignment module employs a decoding graph with a clin closure for filtering out repeated speech data in the speech data to be evaluated according to the reference text.

14. An apparatus for generating a spoken language assessment model, comprising:

a training sample generation module configured to generate a spoken language training sample set according to a spoken language training sample set, each spoken language training sample set including spoken language training samples, each spoken language training sample including spoken language attribute feature information and a scoring result corresponding to the spoken language attribute feature information, wherein the spoken language attribute feature information is obtained by the spoken language assessment method according to any one of claims 1 to 6;

15. The apparatus of claim 14, further comprising:

and the vocal tract length normalization processing module is configured to perform vocal tract length normalization processing on the voice data of the spoken language training samples so as to unify the vocal tract lengths of the spoken language training samples in all the spoken language training sample groups.

16. A computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the processor implements the steps of the method of any one of claims 1-8 when executing the instructions.

17. A computer-readable storage medium storing computer instructions, which when executed by a processor, perform the steps of the method of any one of claims 1 to 8.