CN113571043B - Dialect simulation force evaluation method and device, electronic equipment and storage medium - Google Patents

Dialect simulation force evaluation method and device, electronic equipment and storage medium

Info

Publication number
CN113571043B
CN113571043B (application CN202110850935.9A)
Authority
CN
China
Prior art keywords
voice
dialect
comment
predicted
frequency spectrum
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110850935.9A
Other languages
Chinese (zh)
Other versions
CN113571043A (en)
Inventor
马金龙
熊佳
王伟喆
曾锐鸿
罗箫
焦南凯
盘子圣
徐志坚
谢睿
陈光尧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Huancheng Culture Media Co ltd
Original Assignee
Guangzhou Huancheng Culture Media Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Huancheng Culture Media Co ltd
Priority to CN202110850935.9A
Publication of CN113571043A
Application granted
Publication of CN113571043B
Legal status: Active


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/005 - Language recognition
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a dialect simulation force evaluation method and device, an electronic device and a storage medium, which are used to solve the technical problems of high algorithm complexity, limited model generalization capability and low calculation efficiency in existing imitation evaluation schemes. The method comprises the following steps: receiving a voice signal input by a user; extracting predicted evaluation speech features from the voice signal; matching a target reference template for the predicted evaluation speech features from preset dialect speech feature reference templates; and calculating a simulation force score of the predicted evaluation speech features according to the target reference template.

Description

Dialect simulation force evaluation method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of speech evaluation, and in particular to a dialect simulation force evaluation method and device (where "simulation force" denotes the fidelity of a user's dialect imitation), an electronic device, and a storage medium.
Background
With the rapid development of 5G and artificial intelligence and the rise of mass-entertainment products such as live streaming and short video, new forms of social-interaction gameplay keep emerging, which has given rise to many voice features built on regional characteristics, such as unlocking IP-based location to find nearby play partners, recommending content based on a user's registered home region, and creating region-based rooms. In addition, dialect imitation can be used to match a user with platform users who speak the same dialect, increasing social connections among users.
Current mainstream imitation evaluation schemes fall into two categories: word-level spoken-language evaluation algorithms based on speech recognition, and speech distortion evaluation methods based on traditional speech signal processing.
However, a word-level spoken-language evaluation algorithm based on speech recognition requires a complete speech recognition system, so this scheme needs a large dialect data set for model training as well as a client/server interaction system; the algorithm complexity is high, which is unfavorable for implementation on terminals. The speech distortion evaluation method based on traditional speech signal processing places strict consistency requirements on the reference speech and the speech under test: the two must be aligned and of equal duration, so its interference resistance and robustness are poor. It suits devices and scenarios with specific, periodic utterances and is not well suited to unspecified speakers.
Disclosure of Invention
The invention provides a dialect simulation force evaluation method and device, an electronic device and a storage medium, to solve the technical problems of high algorithm complexity, limited model generalization capability and low calculation efficiency in existing imitation evaluation schemes.
The invention provides a dialect simulation force evaluation method, which comprises the following steps:
receiving a voice signal input by a user;
extracting predicted evaluation speech features from the voice signal;
matching a target reference template for the predicted evaluation speech features from preset dialect speech feature reference templates;
and calculating a simulation force score of the predicted evaluation speech features according to the target reference template.
Optionally, the step of extracting the predicted evaluation speech features from the voice signal includes:
preprocessing the voice signal to obtain a preprocessed signal;
and extracting the predicted evaluation speech features from the preprocessed signal.
Optionally, the step of extracting the predicted evaluation speech features from the preprocessed signal includes:
performing fast Fourier transform on the preprocessed voice signal to obtain a frequency spectrum of the preprocessed voice signal;
obtaining the square value of the frequency spectrum to obtain a short-time energy spectrum;
Acquiring an amplitude spectrum of the frequency spectrum, and converting the amplitude spectrum into a Mel frequency spectrum;
obtaining the logarithm of the Mel frequency spectrum according to the short-time energy spectrum and the Mel frequency spectrum;
and performing discrete cosine transform on the logarithm to obtain Mel frequency cepstrum coefficients, the Mel frequency cepstrum coefficients being taken as the predicted evaluation speech features of the preprocessed voice signal.
Optionally, the step of matching the target reference template for the predicted evaluation speech features from the preset dialect speech feature reference templates includes:
performing dynamic time warping calculation between the predicted evaluation speech features and the preset dialect speech feature reference templates, and matching the target reference template for the predicted evaluation speech features among the preset dialect speech feature reference templates according to the calculation results.
The invention also provides a dialect simulation force evaluation device, which comprises:
the voice signal receiving module is used for receiving a voice signal input by a user;
the predicted evaluation speech feature extraction module is used for extracting predicted evaluation speech features from the voice signal;
the target reference template matching module is used for matching a target reference template for the predicted evaluation speech features from preset dialect speech feature reference templates;
and the simulation force score calculation module is used for calculating a simulation force score of the predicted evaluation speech features according to the target reference template.
Optionally, the predicted evaluation speech feature extraction module includes:
the preprocessing sub-module is used for preprocessing the voice signal to obtain a preprocessed signal;
and the predicted evaluation speech feature extraction sub-module is used for extracting the predicted evaluation speech features from the preprocessed signal.
Optionally, the predicted evaluation speech feature extraction sub-module includes:
The frequency spectrum calculation unit is used for carrying out fast Fourier transform on the preprocessed voice signal to obtain the frequency spectrum of the preprocessed voice signal;
the short-time energy spectrum calculation unit is used for obtaining the square value of the frequency spectrum to obtain a short-time energy spectrum;
the Mel frequency spectrum calculation unit is used for acquiring the amplitude spectrum of the frequency spectrum and converting the amplitude spectrum into Mel frequency spectrum;
a logarithmic calculation unit for calculating the logarithm of the mel frequency spectrum according to the short-time energy spectrum and the mel frequency spectrum;
and the predicted evaluation speech feature calculation unit is used for performing discrete cosine transform on the logarithm to obtain Mel frequency cepstrum coefficients, which are taken as the predicted evaluation speech features of the preprocessed voice signal.
Optionally, the target reference template matching module includes:
the target reference template matching sub-module is used for performing dynamic time warping calculation between the predicted evaluation speech features and the preset dialect speech feature reference templates, and matching the target reference template for the predicted evaluation speech features among the preset dialect speech feature reference templates according to the calculation results.
The invention also provides an electronic device comprising a processor and a memory:
The memory is used for storing program codes and transmitting the program codes to the processor;
The processor is configured to execute the dialect simulation force evaluation method described above according to the instructions in the program code.
The present invention also provides a computer-readable storage medium for storing program code for executing the dialect simulation force evaluation method as described in any one of the above.
According to the above technical solutions, the invention has the following advantages: the invention receives a voice signal input by a user, extracts predicted evaluation speech features from the voice signal, matches a target reference template for the predicted evaluation speech features from preset dialect speech feature reference templates, and calculates a simulation force score of the predicted evaluation speech features according to the target reference template. An efficient, real-time dialect evaluation result is thereby provided to the user, which improves the user's enthusiasm for practicing.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the invention or in the prior art, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart illustrating steps of a method for evaluating a dialect simulation force according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating steps of a dialect simulation force evaluation method according to another embodiment of the present invention;
FIG. 3 is a frequency-response chart of the Mel filter bank according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an MFCC extraction process according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a DTW algorithm according to an embodiment of the present invention;
Fig. 6 is an Android integration flow chart of a dialect simulation force evaluation method according to an embodiment of the present invention;
Fig. 7 is a block diagram of a dialect simulation force evaluation device according to an embodiment of the present invention.
Detailed Description
The embodiments of the invention provide a dialect simulation force evaluation method and device, an electronic device and a storage medium, to solve the technical problems of high algorithm complexity, limited model generalization capability and low calculation efficiency in existing imitation evaluation schemes.
In order to make the objects, features and advantages of the present invention more comprehensible, the technical solutions in the embodiments of the present invention are described in detail below with reference to the accompanying drawings, and it is apparent that the embodiments described below are only some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, fig. 1 is a flowchart illustrating steps of a method for evaluating a dialect simulation force according to an embodiment of the present invention.
The invention provides a dialect simulation force evaluation method, which specifically comprises the following steps:
step 101, receiving a voice signal input by a user;
In the embodiment of the invention, the voice signal may be a live human voice signal produced while the user practices a dialect, or a prerecorded audio signal; the present invention is not particularly limited in this respect.
Step 102, extracting predicted evaluation speech features from the voice signal;
The purpose of extracting the predicted evaluation speech features from the voice signal is to isolate the identifiable components of the signal and remove interfering information (such as background noise and emotion), thereby reducing both the disturbance to subsequent speech analysis and the computation required by the whole dialect simulation force evaluation process.
Step 103, matching a target reference template for the predicted evaluation speech features from preset dialect speech feature reference templates;
After the predicted evaluation speech features are extracted, the corresponding target reference template can be matched among the preset dialect speech feature reference templates.
It should be noted that, to implement the matching of the target reference template, speech feature reference templates of different dialects need to be stored in advance. These speech feature reference templates can be obtained by training on speech signals of the different dialects.
Step 104, calculating a simulation force score of the predicted evaluation speech features according to the target reference template.
After the target reference template is obtained, the simulation force score may be determined based on the similarity between the predicted evaluation speech features and the target reference template.
In an embodiment of the invention, the simulation force score characterizes how similar the voice signal input by the user is to the dialect template used as reference: the higher the score, the more accurate the user's imitation of the dialect.
In summary, the invention receives a voice signal input by a user, extracts predicted evaluation speech features from it, matches a target reference template for those features from preset dialect speech feature reference templates, and calculates a simulation force score according to the target reference template, thereby providing the user with an efficient, real-time dialect evaluation result and improving the user's enthusiasm for practicing.
Referring to fig. 2, fig. 2 is a flowchart illustrating a method for evaluating a dialect simulation force according to another embodiment of the present invention. The method specifically comprises the following steps:
step 201, receiving a voice signal input by a user;
Step 202, preprocessing a voice signal to obtain a preprocessed signal;
Before a voice signal is analyzed and processed, it must be preprocessed to eliminate the effects on signal quality of aliasing, higher-harmonic distortion, high-frequency loss and other factors introduced by the human vocal organs themselves and by the equipment that captures the signal, so that subsequent speech processing works on as uniform and smooth a signal as possible.
In embodiments of the present invention, preprocessing may include framing, windowing and pre-emphasis. Framing cuts the voice signal into segments on the basis of its short-time stationarity; the frame length is generally 20 ms and the frame shift 10 ms. Windowing generally uses a Hamming or a Hanning window: since the main-lobe width of a window corresponds to frequency resolution (the wider the main lobe, the lower the resolution), a window function should concentrate the energy in the main lobe, or keep the relative height of the largest side lobe as small as possible; the Hamming window has larger side-lobe attenuation in its amplitude-frequency characteristic and reduces the Gibbs effect, so it is generally chosen for windowing voice signals. Because the voice signal is affected by glottal excitation and oral-nasal radiation, its frequency components above about 800 Hz fall off at 6 dB/octave, so the energy of the high-frequency part must be raised by pre-emphasis to compensate for the high-frequency loss; pre-emphasis is generally realized with the first-order high-pass filter 1 − 0.9375z⁻¹. In addition, preprocessing may also include anti-aliasing filtering.
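As an illustration of this preprocessing chain, the following Python sketch applies the pre-emphasis, framing and Hamming-windowing just described (a minimal sketch assuming 16 kHz sampling, 20 ms frames and a 10 ms shift; the function name and signature are ours, not the patent's):

import numpy as np

def preprocess(signal: np.ndarray, sr: int = 16000,
               frame_ms: float = 20.0, shift_ms: float = 10.0,
               alpha: float = 0.9375) -> np.ndarray:
    # Pre-emphasis: y[n] = x[n] - 0.9375*x[n-1], i.e. the 1 - 0.9375 z^-1 filter
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Framing by short-time stationarity: 20 ms frames (320 samples at 16 kHz), 10 ms shift
    frame_len = int(sr * frame_ms / 1000)
    frame_shift = int(sr * shift_ms / 1000)
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift  # assumes signal >= one frame
    frames = np.stack([emphasized[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(n_frames)])
    # Hamming window on every frame (large side-lobe attenuation, reduced Gibbs effect)
    return frames * np.hamming(frame_len)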
Step 203, extracting predicted evaluation speech features from the preprocessed signal;
In practical applications, feature extraction may cover time-domain and frequency-domain feature parameters: the time-domain feature parameters include the short-time zero-crossing rate, short-time energy spectrum and pitch period, and the frequency-domain feature parameters include LPCC (linear predictive cepstral coefficients), ΔLPCC (first-order differential LPCC), MFCC (Mel-frequency cepstral coefficients) and ΔMFCC (first-order differential MFCC).
In one example, taking MFCCs as the predicted evaluation speech features, the step of extracting them from the preprocessed signal may include:
s31, performing fast Fourier transform on the preprocessed voice signal to obtain a frequency spectrum of the preprocessed voice signal;
In the embodiment of the invention, the fast Fourier transform (FFT) of the preprocessed voice signal, which yields its frequency spectrum, is shown in the following formula:
X(k) = Σ_{n=0}^{N−1} x(n)·e^{−j2πnk/N},  k = 0, 1, …, N−1
wherein X(k) is the frequency spectrum of the preprocessed voice signal; N is the number of FFT points, typically 256 or 512; n is the sample position within the voice signal (a frame is usually 320 samples, so 0 ≤ n < 320); x(n) is the input speech signal; and k is the index of the k-th frequency point.
S32, obtaining a square value of a frequency spectrum to obtain a short-time energy spectrum;
In the embodiment of the invention, the short-time energy spectrum is calculated as follows:
E(k) = |X(k)|²
S33, in a specific implementation, the amplitude spectrum can be converted into the Mel frequency spectrum with a Mel filter bank H(k, m), a set of M triangular band-pass filters:
H(k, m) = 0, for f(k) < f_c(m−1) or f(k) ≥ f_c(m+1);
H(k, m) = (f(k) − f_c(m−1)) / (f_c(m) − f_c(m−1)), for f_c(m−1) ≤ f(k) < f_c(m);
H(k, m) = (f_c(m+1) − f(k)) / (f_c(m+1) − f_c(m)), for f_c(m) ≤ f(k) < f_c(m+1)
where f(k) represents the actual frequency of the k-th FFT point, f_c(·) represents the center frequency of a filter, and m is the Mel filter index.
In one example, the frequency response of the Mel filter bank is shown in fig. 3.
S34, calculating the logarithm of the Mel frequency spectrum according to the short-time energy spectrum and the Mel frequency spectrum;
In the embodiment of the invention, logarithmic transformation is performed on the Mel frequency spectrum, and the resulting logarithm X′(m) of the Mel frequency spectrum can be represented by the following formula:
X′(m) = ln( Σ_{k=0}^{N−1} |X(k)|²·H(k, m) ),  0 ≤ m < M
s35, discrete cosine transform (Discrete Cosine Transform, DCT) is carried out on the logarithm, the Mel frequency cepstrum coefficient is obtained through calculation, and the Mel frequency cepstrum coefficient is used as the predicted comment characteristic of the preprocessed voice signal.
In the embodiment of the invention, the Mel frequency cepstrum coefficients mfcc(r) can be calculated by the following formula:
mfcc(r) = Σ_{m=1}^{M} X′(m)·cos( πr(m − 0.5) / M ),  r = 1, 2, …, L
where L is the number of cepstral coefficients retained.
It should be noted that the MFCC features obtained through the above calculation steps are static parameters: they reflect the static characteristics of speech well but do not make full use of its dynamic characteristics. In an alternative example, therefore, the first-order and second-order differential parameters of the MFCC may be added, on top of the MFCC features, as predicted evaluation speech features, so as to better describe the time-varying characteristics of the voice signal.
In one example, the MFCC extraction flow is shown in fig. 4.
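Steps S31 to S35 can be made concrete with the short numpy/scipy sketch below (our illustration, not code from the patent; the triangular filter-bank construction and the defaults of 512 FFT points, 26 filters and 13 coefficients are common choices assumed here, consistent with the formulas above):

import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frames: np.ndarray, sr: int = 16000, n_fft: int = 512,
         n_filters: int = 26, n_ceps: int = 13) -> np.ndarray:
    # S31: FFT of each windowed frame; S32: short-time energy spectrum |X(k)|^2
    energy = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Triangular Mel filter bank H(k, m) with centers equally spaced on the Mel scale
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        H[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        H[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    # S33: Mel spectrum; S34: its logarithm X'(m)
    log_mel = np.log(energy @ H.T + 1e-10)
    # S35: DCT of the log-Mel spectrum gives the MFCCs
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_ceps]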
Step 204, matching a target reference template for the predicted evaluation speech features from the preset dialect speech feature reference templates;
it should be noted that, in order to implement the matching process of the target reference templates, the speech feature reference templates of different dialects need to be stored in advance.
In a specific implementation, each reference speech can be put through the same processing as steps 201-203 above, that is, preprocessing followed by feature extraction, to generate the speech feature reference templates, which are then stored locally as the situation requires. When the simulation force evaluation is actually performed on a user's voice signal, the reference templates can be loaded directly and their feature parameters read to determine the target reference template.
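For the template store described here, a minimal sketch could cache each dialect's features with numpy, reusing the preprocess and mfcc sketches above (the file layout and function names are assumptions for illustration):

import numpy as np

def build_template(samples: np.ndarray, sr: int, out_path: str) -> None:
    # Same pipeline as the test speech: preprocess, then extract MFCC features
    np.save(out_path, mfcc(preprocess(samples, sr), sr))  # e.g. "templates/sichuan.npy"

def load_templates(paths: dict) -> dict:
    # paths maps a dialect name to the .npy file written by build_template
    return {name: np.load(path) for name, path in paths.items()}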
In one example, the step of matching the target reference template for the predicted evaluation speech features from the preset dialect speech feature reference templates may include:
performing dynamic time warping calculation between the predicted evaluation speech features and the preset dialect speech feature reference templates, and matching the target reference template for the predicted evaluation speech features among the preset dialect speech feature reference templates according to the calculation results.
Dynamic time warping (DTW) is a nonlinear warping technique that combines time warping with distance-measure computation. It finds a warping function i_m = φ(i_n) that nonlinearly maps the time axis n of the test vector (the vector of predicted evaluation speech features) onto the time axis m of the dialect speech feature reference template, such that the function satisfies:
D = min_{φ(i_n)} Σ_{i_n=1}^{N} d( T(i_n), R(φ(i_n)) )
wherein D is the distance between the two vectors under the optimal time warping; T(i_n) is the feature vector under test; R(φ(i_n)) is the reference template feature vector; φ(i_n) is the warping function of i_n; and i_n is the feature frame number.
Since DTW continuously computes the distance between the two vectors to find the optimal matching path, the warping function obtained corresponds to the smallest cumulative distance when the two vectors are matched, which guarantees the maximum acoustic similarity between them. In essence, the DTW algorithm uses the idea of dynamic programming to automatically find, through locally optimal steps, a path along which the cumulative distortion between the two feature vectors is minimal, thereby avoiding the errors that differing durations might cause. The DTW algorithm requires the reference template and the test template to use the same type of feature vector, the same frame length, the same window function and the same frame shift.
The principle of the DTW algorithm is shown in fig. 5. In a two-dimensional rectangular coordinate system, the frame numbers n = 1 to N of the predicted evaluation features are marked on the horizontal axis and the frames m = 1 to M of the dialect speech feature reference template on the vertical axis; drawing lines through the integer coordinates that represent the frame numbers forms a grid, in which each intersection (t_i, r_j) represents the pairing of a frame of the predicted evaluation features with a frame of the reference template. The DTW algorithm is carried out in two steps: first, the frame-by-frame distances between the predicted evaluation speech features and the dialect speech feature reference template are computed, giving the frame-matching distance matrix; second, the optimal path through the frame-matching distance matrix is found, which determines the target reference template.
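The two-step procedure can be sketched in Python as follows (our sketch: the Euclidean frame distance follows the description, while the normalization by N + M is an assumption so that distances remain comparable across templates of different lengths):

import numpy as np

def dtw_distance(T: np.ndarray, R: np.ndarray) -> float:
    # T: (N, d) test MFCC frames; R: (M, d) reference template frames
    N, M = len(T), len(R)
    # Step 1: frame-matching distance matrix d(T(i_n), R(i_m)), Euclidean distance
    dist = np.linalg.norm(T[:, None, :] - R[None, :, :], axis=2)
    # Step 2: dynamic programming for the minimal cumulative-distance path
    D = np.full((N + 1, M + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            D[i, j] = dist[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[N, M] / (N + M)  # length-normalized cumulative distance

def match_template(test: np.ndarray, templates: dict) -> str:
    # templates maps a dialect name to its (M, d) reference feature array;
    # the template with the smallest DTW distance is the target reference template
    return min(templates, key=lambda name: dtw_distance(test, templates[name]))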
Step 205, calculating the simulation force score of the predicted evaluation speech features according to the target reference template.
In a specific implementation, suppose the target reference template has M frame vectors {R(1), R(2), …, R(M)} and the predicted evaluation features have N frame vectors {T(1), T(2), …, T(N)}; d(T(i_n), R(i_m)) then denotes the distance between the i_n-th frame feature in T and the i_m-th frame feature in R, typically expressed as the Euclidean distance. Finally, the result is quantified into a percentage according to the Euclidean distance to obtain the score output; the following polynomial fitting curve function was obtained through repeated tests:
SCORE = −0.08 · d(T(i_n), R(i_m))² + 100
wherein SCORE is the simulation force score; the simulation force score of the voice signal can be calculated through this polynomial fitting curve function.
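As code, the fitted mapping is a one-liner (the clamp to [0, 100] is our assumption, since the quadratic turns negative once the DTW distance exceeds about 35.4):

def simulation_force_score(d: float) -> float:
    # SCORE = -0.08 * d^2 + 100, clamped to the 0..100 percentage range
    return max(0.0, min(100.0, -0.08 * d * d + 100.0))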
In summary, the invention receives a voice signal input by a user, extracts predicted evaluation speech features from the voice signal, matches a target reference template for the predicted evaluation speech features from preset dialect speech feature reference templates, and calculates a simulation force score according to the target reference template, thereby providing the user with an efficient, real-time dialect evaluation result and improving the user's enthusiasm for practicing.
In order to facilitate understanding, embodiments of the present invention are described below by way of specific examples.
Referring to fig. 6, fig. 6 is a schematic diagram of the Android integration flow of a dialect simulation force evaluation method according to an embodiment of the present invention. The specific implementation steps are as follows:
The operator prepares a dialect data set comprising template voices of dialects such as Wuhan, Shandong, Northeast, Hakka, Guangdong and Sichuan;
The Android client connects to the DIE (Dialect Imitation Evaluate, dialect simulation force evaluation) platform: it captures microphone data through the built-in voice capture framework, generates a complete sound clip and saves it locally, and calls the init interface to initialize the dialect simulation force evaluation software, setting the sampling rate, the frame length, the path of the required reference voice, and so on;
It then calls the process interface to read the saved sound-clip file and compute its similarity to the reference voice, which includes preprocessing the saved clip, extracting multiple features, and loading and comparing the templates, and finally calls the stop interface to obtain the imitation force score calculated by DTW. After obtaining the result, the client can respond according to how difficult the dialect is to imitate: for example, strong imitation yields a high score, rewarded with an applause animation, while a low score triggers a prompt encouraging the user to practice more. When the user leaves the dialect simulation force scene, the destory interface must be called to release resources and end the evaluation.
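A Python analogue of this call sequence, tied to the sketches above, might look as follows (hypothetical: the interface names init, process, stop and destory come from the description, but this wrapper class, its signatures and the file paths are invented for illustration; the real interfaces are Android-side):

import numpy as np

class DieEngine:
    def init(self, sample_rate: int, ref_paths: dict) -> None:
        # Set the sampling rate and load the reference templates (earlier sketch)
        self.sr = sample_rate
        self.templates = load_templates(ref_paths)
    def process(self, samples: np.ndarray) -> None:
        # Preprocess, extract features, and run DTW against the references
        feats = mfcc(preprocess(samples, self.sr), self.sr)
        name = match_template(feats, self.templates)
        self.d = dtw_distance(feats, self.templates[name])
    def stop(self) -> float:
        # Return the imitation (simulation force) score from the DTW distance
        return simulation_force_score(self.d)
    def destory(self) -> None:  # sic: named as in the description above
        self.templates = None   # release resources when the user leaves the scene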
Referring to fig. 7, fig. 7 is a block diagram of a dialect simulation force evaluation apparatus according to an embodiment of the present invention.
The embodiment of the invention provides a dialect simulation force evaluation device, which comprises:
a voice signal receiving module 701, configured to receive a voice signal input by a user;
a predicted evaluation speech feature extraction module 702, configured to extract predicted evaluation speech features from the voice signal;
a target reference template matching module 703, configured to match a target reference template for the predicted evaluation speech features from preset dialect speech feature reference templates;
and a simulation force score calculation module 704, configured to calculate a simulation force score of the predicted evaluation speech features according to the target reference template.
In an embodiment of the present invention, the predicted evaluation speech feature extraction module 702 includes:
the preprocessing sub-module is used for preprocessing the voice signal to obtain a preprocessed signal;
and the predicted evaluation speech feature extraction sub-module is used for extracting the predicted evaluation speech features from the preprocessed signal.
In an embodiment of the present invention, the predicted evaluation speech feature extraction sub-module includes:
The frequency spectrum calculation unit is used for carrying out fast Fourier transform on the preprocessed voice signal to obtain the frequency spectrum of the preprocessed voice signal;
The short-time energy spectrum calculation unit is used for obtaining the square value of the frequency spectrum to obtain a short-time energy spectrum;
The Mel frequency spectrum calculation unit is used for obtaining the amplitude spectrum of the frequency spectrum and converting the amplitude spectrum into Mel frequency spectrum;
the logarithmic calculation unit is used for calculating the logarithm of the Mel frequency spectrum according to the short-time energy spectrum and the Mel frequency spectrum;
and the predicted evaluation speech feature calculation unit is used for performing discrete cosine transform on the logarithm to obtain Mel frequency cepstrum coefficients, which are taken as the predicted evaluation speech features of the preprocessed voice signal.
In an embodiment of the present invention, the target reference template matching module 703 includes:
the target reference template matching sub-module is used for performing dynamic time warping calculation between the predicted evaluation speech features and the preset dialect speech feature reference templates, and matching the target reference template for the predicted evaluation speech features among the preset dialect speech feature reference templates according to the calculation results.
The embodiment of the invention also provides electronic equipment, which comprises a processor and a memory:
the memory is used for storing the program codes and transmitting the program codes to the processor;
The processor is used for executing the dialect simulation force evaluation method of the embodiments of the invention according to the instructions in the program code.
The embodiment of the invention also provides a computer readable storage medium, which is used for storing program codes, and the program codes are used for executing the dialect simulation force evaluation method of the embodiment of the invention.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described by differences from other embodiments, and identical and similar parts between the embodiments are all enough to be referred to each other.
It will be apparent to those skilled in the art that embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the invention may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it is further noted that relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or terminal device that comprises the element.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (6)

1. A dialect simulation force evaluation method, characterized by comprising:
receiving a voice signal input by a user;
extracting predicted evaluation speech features from the voice signal;
matching a target reference template for the predicted evaluation speech features from preset speech feature reference templates of different dialects;
calculating a simulation force score of the predicted evaluation speech features according to the target reference template;
wherein the step of extracting the predicted evaluation speech features from the voice signal comprises:
preprocessing the voice signal to obtain a preprocessed signal, the preprocessing comprising framing, windowing and pre-emphasis;
extracting the predicted evaluation speech features from the preprocessed signal;
wherein the step of extracting the predicted evaluation speech features from the preprocessed signal comprises:
performing fast Fourier transform on the preprocessed voice signal to obtain a frequency spectrum of the preprocessed voice signal;
obtaining the square value of the frequency spectrum to obtain a short-time energy spectrum;
Acquiring an amplitude spectrum of the frequency spectrum, and converting the amplitude spectrum into a Mel frequency spectrum;
obtaining the logarithm of the Mel frequency spectrum according to the short-time energy spectrum and the Mel frequency spectrum;
and performing discrete cosine transform on the logarithm to obtain Mel frequency cepstrum coefficients, the Mel frequency cepstrum coefficients being taken as the predicted evaluation speech features of the preprocessed voice signal.
2. The method of claim 1, wherein the step of matching the target reference template for the predicted evaluation speech features from the preset dialect speech feature reference templates comprises:
performing dynamic time warping calculation between the predicted evaluation speech features and the preset dialect speech feature reference templates, and matching the target reference template for the predicted evaluation speech features among the preset dialect speech feature reference templates according to the calculation results.
3. A dialect-simulating force evaluation device, characterized by comprising:
the voice signal receiving module is used for receiving a voice signal input by a user;
the predicted evaluation speech feature extraction module is used for extracting predicted evaluation speech features from the voice signal;
the target reference template matching module is used for matching a target reference template for the predicted evaluation speech features from preset speech feature reference templates of different dialects;
the simulation force score calculation module is used for calculating a simulation force score of the predicted evaluation speech features according to the target reference template;
wherein the predicted evaluation speech feature extraction module comprises:
the preprocessing sub-module, used for preprocessing the voice signal to obtain a preprocessed signal, the preprocessing comprising framing, windowing and pre-emphasis;
the predicted evaluation speech feature extraction sub-module, used for extracting the predicted evaluation speech features from the preprocessed signal;
wherein the predicted evaluation speech feature extraction sub-module comprises:
The frequency spectrum calculation unit is used for carrying out fast Fourier transform on the preprocessed voice signal to obtain the frequency spectrum of the preprocessed voice signal;
the short-time energy spectrum calculation unit is used for obtaining the square value of the frequency spectrum to obtain a short-time energy spectrum;
the Mel frequency spectrum calculation unit is used for acquiring the amplitude spectrum of the frequency spectrum and converting the amplitude spectrum into Mel frequency spectrum;
a logarithmic calculation unit for calculating the logarithm of the mel frequency spectrum according to the short-time energy spectrum and the mel frequency spectrum;
and the predicted evaluation speech feature calculation unit is used for performing discrete cosine transform on the logarithm to obtain Mel frequency cepstrum coefficients, which are taken as the predicted evaluation speech features of the preprocessed voice signal.
4. The apparatus of claim 3, wherein the target reference template matching module comprises:
the target reference template matching sub-module is used for performing dynamic time warping calculation between the predicted evaluation speech features and the preset dialect speech feature reference templates, and matching the target reference template for the predicted evaluation speech features among the preset dialect speech feature reference templates according to the calculation results.
5. An electronic device, the device comprising a processor and a memory:
The memory is used for storing program codes and transmitting the program codes to the processor;
The processor is configured to execute the dialect simulation force evaluation method according to any one of claims 1-2 according to the instructions in the program code.
6. A computer readable storage medium storing program code for performing the dialect simulation force evaluation method according to any one of claims 1-2.
CN202110850935.9A 2021-07-27 2021-07-27 Dialect simulation force evaluation method and device, electronic equipment and storage medium Active CN113571043B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110850935.9A CN113571043B (en) 2021-07-27 2021-07-27 Dialect simulation force evaluation method and device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN113571043A (en) 2021-10-29
CN113571043B (en) 2024-06-04

Family

ID=78167949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110850935.9A Active CN113571043B (en) 2021-07-27 2021-07-27 Dialect simulation force evaluation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113571043B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1302427A (en) * 1997-11-03 2001-07-04 T-Netix Inc. Model adaptation system and method for speaker verification
CN101246685A (en) * 2008-03-17 2008-08-20 清华大学 Pronunciation quality evaluation method of computer auxiliary language learning system
CN102354496A (en) * 2011-07-01 2012-02-15 中山大学 PSM-based (pitch scale modification-based) speech identification and restoration method and device thereof
CN102543073A (en) * 2010-12-10 2012-07-04 Shanghai Shangda Hairun Information System Co ltd Shanghai dialect phonetic recognition information processing method
CN102982803A (en) * 2012-12-11 2013-03-20 华南师范大学 Isolated word speech recognition method based on HRSF and improved DTW algorithm
CN103928023A (en) * 2014-04-29 2014-07-16 广东外语外贸大学 Voice scoring method and system
CN103985391A (en) * 2014-04-16 2014-08-13 柳超 Phonetic-level low power consumption spoken language evaluation and defect diagnosis method without standard pronunciation
CN104103272A (en) * 2014-07-15 2014-10-15 无锡中星微电子有限公司 Voice recognition method and device and blue-tooth earphone
JP2015068897A (en) * 2013-09-27 2015-04-13 国立大学法人 東京大学 Evaluation method and device for utterance and computer program for evaluating utterance
CN109147818A (en) * 2018-10-30 2019-01-04 Oppo广东移动通信有限公司 Acoustic feature extracting method, device, storage medium and terminal device
CN112233651A (en) * 2020-10-10 2021-01-15 深圳前海微众银行股份有限公司 Dialect type determining method, dialect type determining device, dialect type determining equipment and storage medium
CN112951274A (en) * 2021-02-07 2021-06-11 脸萌有限公司 Voice similarity determination method and device, and program product

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11238845B2 (en) * 2018-11-21 2022-02-01 Google Llc Multi-dialect and multilingual speech recognition

Also Published As

Publication number Publication date
CN113571043A (en) 2021-10-29


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant