CN117894064A - Mouth shape alignment method based on training of traversing initial consonants, vowels and integral pronunciation - Google Patents

Mouth shape alignment method based on training of traversing initial consonants, vowels and integral pronunciation

Info

Publication number
CN117894064A
CN117894064A (application CN202311690218.XA)
Authority
CN
China
Prior art keywords
pronunciation
mouth shape
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311690218.XA
Other languages
Chinese (zh)
Inventor
赵海涛 (Zhao Haitao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
King Channels Digital Technology Beijing Co ltd
Original Assignee
King Channels Digital Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by King Channels Digital Technology Beijing Co ltd filed Critical King Channels Digital Technology Beijing Co ltd
Priority to CN202311690218.XA priority Critical patent/CN117894064A/en
Publication of CN117894064A publication Critical patent/CN117894064A/en
Pending legal-status Critical Current


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06: Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10: Transforming into visible information
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172: Classification, e.g. identification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Complex Calculations (AREA)

Abstract

The invention relates to the field of methods for aligning a digital human's pronunciation with its mouth shapes, and provides a mouth shape alignment method based on traversal training of initials, finals and whole pronunciations. The method comprises an original pronunciation acquisition and mouth shape calculation process and a digital human mouth shape acquisition process, both of which comprise a calculation module. The mouth shape with the greatest similarity is found through pronunciation waveform similarity, which solves the prior-art problem that every pronunciation of a digital human must be trained at huge training cost; at the same time, training a large knowledge graph model with AI knowledge saves a great deal of manual labeling, and the model's learning ability may be more accurate than human labeling.

Description

Mouth shape alignment method based on training of traversing initial consonants, vowels and integral pronunciation
Technical Field
The invention relates to the field of methods for aligning a digital human's pronunciation with its mouth shapes, and in particular to a mouth shape alignment method based on traversal training of initials, finals and whole pronunciations.
Background
Human pronunciation refers to the human ability to express language through sound. Humans use organs such as the vocal cords, tongue, teeth and lips to make sounds by adjusting the flow and vibration of the air stream; these sounds vary in tone, volume and intonation to express language, ideas, emotion and intention. This mode and ability of pronunciation is not possessed by other animals and is the basis of human language and communication; through pronunciation, humans can communicate verbally, exchange information, and share knowledge and culture.
The diversity of human pronunciation is determined by the shape, size, structure and position of the vocal organs, and is also influenced by language habits and cultural background: different regions and different language communities produce different pronunciation characteristics and accents. A digital human is a virtual character image constructed with computer technology, and the key to realistically simulating human verbal communication lies in speech synthesis and speech recognition technology. Because human pronunciation is diverse and complex, training every pronunciation so that the digital human can display the complete range of mouth shape changes would incur huge training cost.
In view of the above, the invention provides a mouth shape alignment method based on traversal training of initials, finals and whole pronunciations.
Disclosure of Invention
The invention provides a mouth shape alignment method based on traversal training of initials, finals and whole pronunciations, which finds the mouth shape with the greatest similarity through pronunciation waveform similarity, so as to solve the prior-art problem that every pronunciation of a digital human must be trained at huge training cost.
The technical scheme of the invention is as follows:
The mouth shape alignment method based on traversal training of initials, finals and whole pronunciations comprises an original pronunciation acquisition and mouth shape calculation process and a digital human mouth shape acquisition process, both of which comprise a calculation module and a database.
As a technical scheme of the invention, the original pronunciation acquisition and mouth shape calculation process comprises the following steps:
S1: mouth shape positioning: collect a mouth shape image of the person while silent with the acquisition equipment, apply grid processing, and convert the image into spatial coordinates to obtain the person's original mouth shape data;
S2: audio acquisition: collect the pronunciation audio of all of the person's initials and finals, together with the corresponding mouth shape change images, with the acquisition equipment to obtain the person's original pronunciation data;
S3: pronunciation waveform acquisition and storage: perform waveform analysis on the collected pronunciation audio to obtain the waveform diagram of the audio, and store the waveform diagram in the database;
S4: waveform value calculation: the calculation module performs numerical calculation on the waveform diagram, extracts mouth-shape-related features, and stores the feature data in the database;
S5: human mouth shape replication: compare and analyze the stored initial and final pronunciation audio and mouth shape change images, and convert the human mouth shape into digital human mouth shape coordinates through spatial coordinate mapping, as sketched below.
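The patent does not give an implementation, but the following Python sketch illustrates one possible shape for steps S3-S5; the names (database, waveform_features, register_phoneme) and the per-frame RMS descriptor are illustrative assumptions, not the patent's own.

```python
import numpy as np

database = {}  # phoneme -> {"features": ..., "mouth_coords": ...}

def waveform_features(samples: np.ndarray, frame: int = 512) -> np.ndarray:
    """Reduce a waveform to per-frame RMS values (step S4)."""
    n = len(samples) // frame
    frames = samples[: n * frame].reshape(n, frame)
    return np.sqrt((frames ** 2).mean(axis=1))

def register_phoneme(name: str, samples: np.ndarray, mouth_coords: np.ndarray) -> None:
    """Steps S3-S5: store waveform features with mouth grid coordinates."""
    database[name] = {
        "features": waveform_features(samples),
        "mouth_coords": mouth_coords,  # e.g. {x, y, z} grids at several capture times
    }
```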
As a technical solution of the present invention, the digital human mouth shape acquisition process comprises the following steps:
S1: collection of a new human utterance: select a text segment of human pronunciation for which no mouth shape spatial coordinate mapping has been recorded;
S2: speech waveform analysis: acquire the pronunciation audio of the text segment with the audio acquisition equipment, and perform waveform analysis in the calculation module to obtain the audio waveform diagram;
S3: waveform similarity: match the calculated waveform diagram against the stored initial and final pronunciation waveform diagrams to find the best-matching initial and final pronunciation;
S4: mouth shape spatial mapping: according to the matching result, retrieve the previously stored spatial coordinates of the mouth shape image of the corresponding initial and final, and display the digital human's mouth shape using these coordinates, as in the sketch below.
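A matching-side sketch under the same assumptions as above; best_match and mouth_coords_for are hypothetical names, and the minimum-difference rule is the one described below and in claim 7.

```python
import numpy as np

def waveform_features(samples: np.ndarray, frame: int = 512) -> np.ndarray:
    """Per-frame RMS values, as in the acquisition sketch above."""
    n = len(samples) // frame
    return np.sqrt((samples[: n * frame].reshape(n, frame) ** 2).mean(axis=1))

def best_match(samples: np.ndarray, database: dict) -> str:
    """Step S3: the stored initial/final whose features differ least."""
    feats = waveform_features(samples)
    def dist(key: str) -> float:
        stored = database[key]["features"]
        m = min(len(feats), len(stored))
        return float(np.mean(np.abs(feats[:m] - stored[:m])))
    return min(database, key=dist)

def mouth_coords_for(samples: np.ndarray, database: dict):
    """Step S4: map the matched phoneme to its stored mouth coordinates."""
    return database[best_match(samples, database)]["mouth_coords"]
```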
As a technical scheme of the invention, the acquisition equipment is a camera and an audio recorder, and the mouth shape image may be a series of continuous image frames or discrete key frames, so that changes in the human mouth shape are captured more accurately and the dynamic change process of the digital human's mouth shape is restored more faithfully.
As a technical scheme of the invention, the waveform diagram is represented as a two-dimensional histogram that is vertically symmetric about the x-axis, and the waveform values are calculated statistically from changes in the histogram's coverage.
As a technical solution of the present invention, the mouth-shape-related features include the spectral information and the energy information of the audio. The spectral information refers to the distribution of sound energy across different frequencies, and the energy information refers to the intensity or amplitude of the sound, reflecting the energy of the sound signal. In the mouth shape alignment method, the energy information may be used to calculate the waveform values, from which mouth-shape-related features are extracted.
As a technical scheme of the invention, the calculation module computes the numeric difference between two waveform values; the smaller the difference, the greater the similarity, with the minimum difference indicating the highest degree of matching, as sketched below.
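A one-function sketch of this rule, assuming scalar waveform values and hypothetical names:

```python
def most_similar(value: float, stored: dict) -> str:
    """Return the key whose stored waveform value is numerically closest."""
    return min(stored, key=lambda k: abs(stored[k] - value))

# most_similar(0.42, {"a": 0.90, "o": 0.40}) -> "o"
```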
Compared with the prior art, the invention has the following beneficial effects:
1. The method finds the mouth shape with the greatest similarity through pronunciation waveform similarity, which solves the prior-art problem that every pronunciation of a digital human must be trained at huge training cost; at the same time, training a large knowledge graph model with AI knowledge saves a great deal of manual labeling, and the model's learning ability may be more accurate than human labeling.
Drawings
FIG. 1 is a schematic diagram of human audio and mouth shape model acquisition according to the present invention;
FIG. 2 is a flowchart of the mouth shape and pronunciation alignment procedure according to the present invention.
Detailed Description
Embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples are illustrative of the invention but are not intended to limit the scope of the invention.
As shown in FIGS. 1-2, the invention provides a mouth shape alignment method based on traversal training of initials, finals and whole pronunciations, comprising an original pronunciation acquisition and mouth shape calculation process and a digital human mouth shape acquisition process, both of which comprise a calculation module.
Embodiment one:
As shown in FIGS. 1-2, in this embodiment, a mouth shape image of a person while silent is acquired by the camera, and the image is then grid-processed and converted into spatial coordinates that represent the position of the mouth shape in space. Meanwhile, the pronunciation audio of each initial and final uttered by the person is collected by the recording device. To record the spatial coordinates of the mouth shape at different pronunciation stages, the whole utterance is split into four equal parts and images are captured at the three interior boundaries between them. For example, if a pronunciation takes 1 second, the utterance is divided into four parts of 0.25 seconds each, and images are captured at 0.25 s, 0.5 s and 0.75 s. In this way, mouth shape images at different time points are collected and the spatial coordinates of the mouth shape at different pronunciation stages are recorded; a small sketch of this sampling rule follows.
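The sampling rule generalizes to any utterance length. A minimal sketch, assuming only that the utterance duration is known (the function name capture_times is illustrative):

```python
def capture_times(duration_s: float) -> list[float]:
    """Timestamps of the three interior boundaries when an utterance
    is split into four equal parts (0.25T, 0.5T, 0.75T)."""
    quarter = duration_s / 4.0
    return [quarter, 2.0 * quarter, 3.0 * quarter]

print(capture_times(1.0))  # [0.25, 0.5, 0.75]
```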
It should be noted that the mouth shape image may be a series of continuous image frames or discrete key frames. The grid represents spatial coordinates: each grid point carries {x, y, z} coordinates, and the finer the grid, the better the effect.
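Purely as an illustration of such a grid (the resolution and the array layout are assumptions, not the patent's specification):

```python
import numpy as np

H, W = 16, 24                     # grid resolution; a finer grid gives better results
mouth_grid = np.zeros((H, W, 3))  # mouth_grid[i, j] = (x, y, z) of one grid point
```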
Waveform analysis is performed on the acquired pronunciation audio using the fast Fourier transform (FFT) to obtain how the audio varies over time, yielding the waveform diagram of the audio. By performing numerical calculation on the waveform diagram, mouth-shape-related features such as the spectral information and the energy information of the audio can be extracted.
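A minimal NumPy sketch of such an analysis, assuming a mono signal at a 16 kHz sample rate; the patent names no library, and the synthetic 220 Hz tone merely stands in for recorded audio:

```python
import numpy as np

sr = 16000                                      # assumed sample rate in Hz
t = np.arange(sr) / sr                          # one second of timestamps
samples = 0.6 * np.sin(2 * np.pi * 220.0 * t)   # stand-in for recorded audio

spectrum = np.fft.rfft(samples)                 # frequency-domain signal
freqs = np.fft.rfftfreq(len(samples), 1.0 / sr) # frequency of each bin in Hz
magnitude = np.abs(spectrum)                    # spectral information
print(freqs[np.argmax(magnitude)])              # ~220.0, the dominant frequency
```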
It should be noted that the Fourier transform converts the time-domain waveform into a frequency-domain signal, yielding the spectral information. The Fourier transform is a generalization of the Fourier series and can decompose a non-periodic function into a sum of sine and cosine functions. The formula of the Fourier transform is as follows:
F(ω) = ∫ f(t) · e^(-iωt) dt
where f(t) is the non-periodic function, F(ω) is its representation in the frequency domain, e^(-iωt) is a complex exponential, and ω is the angular frequency.
The physical meaning of the Fourier transform is that any non-periodic function can be expressed as a sum of sine and cosine functions of many different frequencies; the frequencies of these component functions are continuous and can take arbitrary real values.
The waveform is typically represented as a two-dimensional histogram that is vertically symmetric about the x-axis, and the waveform values are calculated statistically from the height variations of the histogram.
The spectral information refers to the distribution of sound energy across different frequencies, and the energy information refers to the intensity or amplitude of the sound, which reflects the energy level of the sound signal. In the mouth shape alignment method, the energy information may be used to calculate the waveform values, from which mouth-shape-related features are extracted, as illustrated below.
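One plausible reading of this energy information is short-time energy per frame; the sketch below assumes that interpretation (the frame length is arbitrary):

```python
import numpy as np

def short_time_energy(samples: np.ndarray, frame: int = 512) -> np.ndarray:
    """Sum of squared amplitudes per frame: the signal's energy over time."""
    n = len(samples) // frame
    return (samples[: n * frame].reshape(n, frame) ** 2).sum(axis=1)
```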
By comparing and analyzing the initial and final pronunciation audio and mouth shape change images stored in the database, the human mouth shape is converted into digital human mouth shape coordinates through spatial coordinate mapping. Mouth shape information can thus be represented numerically, enabling deeper study and analysis.
A text segment of human pronunciation is selected and its pronunciation audio is recorded for waveform analysis. RMSE (root mean square error) is then used to measure the difference between this audio and the previously stored initial and final pronunciation waveform diagrams.
It should be noted that the root mean square error (RMSE) is a common indicator for measuring the error between predicted values and actual observed values. It is the standard deviation of the prediction errors and reflects the average magnitude of the deviation between the two sequences. The RMSE is calculated as follows:
RMSE = sqrt((1/N) × Σ(i=1 to N) (x(i) - y(i))^2)
where x and y represent the sample values of the two waveform sequences, respectively, and N is the total number of sample points.
According to this formula, the smaller the calculated RMSE value, the more similar the two waveforms. By computing the RMSE against each stored waveform, the best-matching initial and final pronunciation can be determined, as in the sketch below.
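A direct transcription of this formula into Python; truncating the two sequences to a common length is an assumption, since the patent leaves that detail open:

```python
import numpy as np

def rmse(x: np.ndarray, y: np.ndarray) -> float:
    """Root mean square error between two waveform sequences."""
    n = min(len(x), len(y))
    return float(np.sqrt(np.mean((x[:n] - y[:n]) ** 2)))

# The stored initial/final whose waveform gives the smallest RMSE
# against the new utterance is taken as the best match.
```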
Then, according to the matching result, the previously stored spatial coordinates of the corresponding mouth shape image are retrieved and mapped into the coordinate system of the digital human's mouth shape, so that the digital human's mouth is rendered.
While embodiments of the present invention have been shown and described above for purposes of illustration and description, it will be understood that the above embodiments are illustrative and are not to be construed as limiting the invention; those of ordinary skill in the art may make changes, modifications, substitutions and variations to the above embodiments within the scope of the invention.

Claims (7)

1. A mouth shape alignment method based on traversal training of initials, finals and whole pronunciations, characterized by comprising an original pronunciation acquisition and mouth shape calculation process and a digital human mouth shape acquisition process, wherein both processes comprise a calculation module and a database.
2. The mouth shape alignment method based on traversal training of initials, finals and whole pronunciations according to claim 1, characterized in that the original pronunciation acquisition and mouth shape calculation process comprises the following steps:
S1: mouth shape positioning: collect a mouth shape image of the person while silent with the acquisition equipment, apply grid processing, and convert the image into spatial coordinates to obtain the person's original mouth shape data;
S2: audio acquisition: collect the pronunciation audio of all of the person's initials and finals, together with the corresponding mouth shape change images, with the acquisition equipment to obtain the person's original pronunciation data;
S3: pronunciation waveform acquisition and storage: perform waveform analysis on the collected pronunciation audio to obtain the waveform diagram of the audio, and store the waveform diagram in the database;
S4: waveform value calculation: the calculation module performs numerical calculation on the waveform diagram, extracts mouth-shape-related features, and stores the feature data in the database;
S5: human mouth shape replication: compare and analyze the stored initial and final pronunciation audio and mouth shape change images, convert the human mouth shape into digital human mouth shape coordinates through spatial coordinate mapping, and store the digital human mouth shape coordinates.
3. The mouth shape alignment method based on traversal training of initials, finals and whole pronunciations according to claim 1, characterized in that the digital human mouth shape acquisition process comprises the following steps:
S1: collection of a new human utterance: select a text segment of human pronunciation for which no mouth shape spatial coordinate mapping has been recorded;
S2: speech waveform analysis: acquire the pronunciation audio of the text segment with the audio acquisition equipment, and perform waveform analysis in the calculation module to obtain the audio waveform diagram;
S3: waveform similarity: match the calculated waveform diagram against the initial and final pronunciation waveform diagrams stored during the original pronunciation acquisition and mouth shape calculation process, and find the best-matching initial and final pronunciation;
S4: mouth shape spatial mapping: according to the matching result, retrieve the previously stored spatial coordinates of the mouth shape image of the corresponding initial and final, and display the digital human's mouth shape using these coordinates.
4. The mouth shape alignment method based on traversal training of initials, finals and whole pronunciations according to claim 2, characterized in that the acquisition equipment is a camera and an audio recorder, and the mouth shape image is a series of continuous image frames or discrete key frames.
5. The mouth shape alignment method based on traversal training of initials, finals and whole pronunciations according to claim 2, characterized in that the waveform diagram is represented as a two-dimensional histogram that is vertically symmetric about the x-axis, and the waveform values are calculated statistically from changes in the histogram's coverage.
6. The mouth shape alignment method based on traversal training of initials, finals and whole pronunciations according to claim 2, characterized in that the mouth-shape-related features include the spectral information and the energy information of the audio, the energy information referring to the intensity or amplitude of the sound.
7. The mouth shape alignment method based on traversal training of initials, finals and whole pronunciations according to claim 3, characterized in that the calculation module computes the numeric difference between two waveform values; the minimum difference indicates the greatest similarity, i.e. the highest degree of matching.
CN202311690218.XA 2023-12-11 2023-12-11 Mouth shape alignment method based on training of traversing initial consonants, vowels and integral pronunciation Pending CN117894064A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311690218.XA CN117894064A (en) 2023-12-11 2023-12-11 Mouth shape alignment method based on training of traversing initial consonants, vowels and integral pronunciation


Publications (1)

Publication Number Publication Date
CN117894064A true CN117894064A (en) 2024-04-16

Family

ID=90645637

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311690218.XA Pending CN117894064A (en) 2023-12-11 2023-12-11 Mouth shape alignment method based on training of traversing initial consonants, vowels and integral pronunciation

Country Status (1)

Country Link
CN (1) CN117894064A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108763190A (en) * 2018-04-12 2018-11-06 平安科技(深圳)有限公司 Voice-based mouth shape cartoon synthesizer, method and readable storage medium storing program for executing
KR102035596B1 (en) * 2018-05-25 2019-10-23 주식회사 데커드에이아이피 System and method for automatically generating virtual character's facial animation based on artificial intelligence
CN114581567A (en) * 2022-05-06 2022-06-03 成都市谛视无限科技有限公司 Method, device and medium for driving mouth shape of virtual image by sound
CN115511994A (en) * 2022-10-14 2022-12-23 厦门靠谱云股份有限公司 Method for quickly cloning real person into two-dimensional virtual digital person
CN116665695A (en) * 2023-07-28 2023-08-29 腾讯科技(深圳)有限公司 Virtual object mouth shape driving method, related device and medium
CN116994600A (en) * 2023-09-28 2023-11-03 中影年年(北京)文化传媒有限公司 Method and system for driving character mouth shape based on audio frequency



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination