CN117894064A - Mouth shape alignment method based on training of traversing initial consonants, vowels and integral pronunciation - Google Patents

Mouth shape alignment method based on training of traversing initial consonants, vowels and integral pronunciation

Info

Publication number
CN117894064A
CN117894064A (application CN202311690218.XA)
Authority
CN
China
Prior art keywords
pronunciation
mouth shape
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311690218.XA
Other languages
Chinese (zh)
Inventor
赵海涛 (Zhao Haitao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
King Channels Digital Technology Beijing Co ltd
Original Assignee
King Channels Digital Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by King Channels Digital Technology Beijing Co ltd filed Critical King Channels Digital Technology Beijing Co ltd
Priority to CN202311690218.XA priority Critical patent/CN117894064A/en
Publication of CN117894064A publication Critical patent/CN117894064A/en
Pending legal-status Critical Current


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06: Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10: Transforming into visible information
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172: Classification, e.g. identification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Complex Calculations (AREA)

Abstract

The invention relates to the field of methods for aligning a digital human's pronunciation with its mouth shapes, and provides a mouth shape alignment method based on traversal training of initials, finals and whole pronunciations. The method comprises an original pronunciation acquisition and mouth shape calculation process and a digital human mouth shape acquisition process, both of which comprise a calculation module. The mouth shape with the greatest similarity is found through pronunciation waveform similarity, which solves the prior-art problem that every pronunciation of a digital human must be trained at huge training cost; at the same time, training a large knowledge graph model with AI knowledge saves a great deal of manual labeling, and the model's learning ability may be more accurate than human labeling.

Description

Mouth shape alignment method based on training of traversing initial consonants, vowels and integral pronunciation
Technical Field
The invention relates to the field of methods for aligning a digital human's pronunciation with its mouth shapes, and in particular to a mouth shape alignment method based on traversal training of initials, finals and whole pronunciations.
Background
Human pronunciation refers to the human ability to express language through sound. Humans use organs such as the vocal cords, tongue, teeth and lips to make sounds by adjusting the flow and vibration of the air stream; these sounds vary in tone, volume and intonation to express language, ideas, emotion and intention. This mode and ability of pronunciation is not possessed by other animals and is the basis of human language and communication; through pronunciation, humans can communicate verbally, exchange information, and share knowledge and culture.
The diversity of human pronunciation is determined by the shape, size, structure and position of the vocal organs, and is also influenced by language habits and cultural background: different regions and different language communities produce different pronunciation characteristics and accents. A digital human is a virtual character image constructed with computer technology, and the key to realistically simulating human verbal communication lies in speech synthesis and speech recognition technology. Because human pronunciation is diverse and complex, training every pronunciation so that the digital human can display the complete range of mouth shape changes would incur huge training cost.
In view of the above, the invention provides a mouth shape alignment method based on traversal training of initials, finals and whole pronunciations.
Disclosure of Invention
The invention provides a mouth shape alignment method based on traversal training of initials, finals and whole pronunciations, which finds the mouth shape with the greatest similarity through pronunciation waveform similarity, so as to solve the prior-art problem that every pronunciation of a digital human must be trained at huge training cost.
The technical scheme of the invention is as follows:
The mouth shape alignment method based on traversal training of initials, finals and whole pronunciations comprises an original pronunciation acquisition and mouth shape calculation process and a digital human mouth shape acquisition process, both of which comprise a calculation module and a database.
As a technical scheme of the invention, the original pronunciation acquisition and mouth shape calculation process comprises the following steps:
S1: mouth shape positioning: collect a mouth shape image of the person while silent with the acquisition equipment, apply grid processing, and convert the image into spatial coordinates to obtain the person's original mouth shape data;
S2: audio acquisition: collect the pronunciation audio of all of the person's initials and finals, together with the corresponding mouth shape change images, with the acquisition equipment to obtain the person's original pronunciation data;
S3: pronunciation waveform acquisition and storage: perform waveform analysis on the collected pronunciation audio to obtain the waveform diagram of the audio, and store the waveform diagram in the database;
S4: waveform value calculation: the calculation module performs numerical calculation on the waveform diagram, extracts mouth-shape-related features, and stores the feature data in the database;
S5: human mouth shape replication: compare and analyze the stored initial and final pronunciation audio and mouth shape change images, and convert the human mouth shape into digital human mouth shape coordinates through spatial coordinate mapping, as sketched below.
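The patent does not give an implementation, but the following Python sketch illustrates one possible shape for steps S3-S5; the names (database, waveform_features, register_phoneme) and the per-frame RMS descriptor are illustrative assumptions, not the patent's own.

```python
import numpy as np

database = {}  # phoneme -> {"features": ..., "mouth_coords": ...}

def waveform_features(samples: np.ndarray, frame: int = 512) -> np.ndarray:
    """Reduce a waveform to per-frame RMS values (step S4)."""
    n = len(samples) // frame
    frames = samples[: n * frame].reshape(n, frame)
    return np.sqrt((frames ** 2).mean(axis=1))

def register_phoneme(name: str, samples: np.ndarray, mouth_coords: np.ndarray) -> None:
    """Steps S3-S5: store waveform features with mouth grid coordinates."""
    database[name] = {
        "features": waveform_features(samples),
        "mouth_coords": mouth_coords,  # e.g. {x, y, z} grids at several capture times
    }
```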
As a technical solution of the present invention, the digital human mouth shape acquisition process comprises the following steps:
S1: collection of a new human utterance: select a text segment of human pronunciation for which no mouth shape spatial coordinate mapping has been recorded;
S2: speech waveform analysis: acquire the pronunciation audio of the text segment with the audio acquisition equipment, and perform waveform analysis in the calculation module to obtain the audio waveform diagram;
S3: waveform similarity: match the calculated waveform diagram against the stored initial and final pronunciation waveform diagrams to find the best-matching initial and final pronunciation;
S4: mouth shape spatial mapping: according to the matching result, retrieve the previously stored spatial coordinates of the mouth shape image of the corresponding initial and final, and display the digital human's mouth shape using these coordinates, as in the sketch below.
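A matching-side sketch under the same assumptions as above; best_match and mouth_coords_for are hypothetical names, and the minimum-difference rule is the one described below and in claim 7.

```python
import numpy as np

def waveform_features(samples: np.ndarray, frame: int = 512) -> np.ndarray:
    """Per-frame RMS values, as in the acquisition sketch above."""
    n = len(samples) // frame
    return np.sqrt((samples[: n * frame].reshape(n, frame) ** 2).mean(axis=1))

def best_match(samples: np.ndarray, database: dict) -> str:
    """Step S3: the stored initial/final whose features differ least."""
    feats = waveform_features(samples)
    def dist(key: str) -> float:
        stored = database[key]["features"]
        m = min(len(feats), len(stored))
        return float(np.mean(np.abs(feats[:m] - stored[:m])))
    return min(database, key=dist)

def mouth_coords_for(samples: np.ndarray, database: dict):
    """Step S4: map the matched phoneme to its stored mouth coordinates."""
    return database[best_match(samples, database)]["mouth_coords"]
```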
As a technical scheme of the invention, the acquisition equipment is a camera and an audio recorder, and the mouth shape image may be a series of continuous image frames or discrete key frames, so that changes in the human mouth shape are captured more accurately and the dynamic change process of the digital human's mouth shape is restored more faithfully.
As a technical scheme of the invention, the waveform diagram is represented as a two-dimensional histogram that is vertically symmetric about the x-axis, and the waveform values are calculated statistically from changes in the histogram's coverage.
As a technical solution of the present invention, the mouth-shape-related features include the spectral information and the energy information of the audio. The spectral information refers to the distribution of sound energy across different frequencies, and the energy information refers to the intensity or amplitude of the sound, reflecting the energy of the sound signal. In the mouth shape alignment method, the energy information may be used to calculate the waveform values, from which mouth-shape-related features are extracted.
As a technical scheme of the invention, the calculation module computes the numeric difference between two waveform values; the smaller the difference, the greater the similarity, with the minimum difference indicating the highest degree of matching, as sketched below.
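A one-function sketch of this rule, assuming scalar waveform values and hypothetical names:

```python
def most_similar(value: float, stored: dict) -> str:
    """Return the key whose stored waveform value is numerically closest."""
    return min(stored, key=lambda k: abs(stored[k] - value))

# most_similar(0.42, {"a": 0.90, "o": 0.40}) -> "o"
```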
Compared with the prior art, the invention has the following beneficial effects:
1. The method finds the mouth shape with the greatest similarity through pronunciation waveform similarity, which solves the prior-art problem that every pronunciation of a digital human must be trained at huge training cost; at the same time, training a large knowledge graph model with AI knowledge saves a great deal of manual labeling, and the model's learning ability may be more accurate than human labeling.
Drawings
FIG. 1 is a schematic diagram of human audio and mouth shape model acquisition according to the present invention;
FIG. 2 is a flowchart of the mouth shape and pronunciation alignment procedure according to the present invention.
Detailed Description
Embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples are illustrative of the invention but are not intended to limit the scope of the invention.
As shown in FIGS. 1-2, the invention provides a mouth shape alignment method based on traversal training of initials, finals and whole pronunciations, comprising an original pronunciation acquisition and mouth shape calculation process and a digital human mouth shape acquisition process, both of which comprise a calculation module.
Embodiment one:
As shown in FIGS. 1-2, in this embodiment, a mouth shape image of a person while silent is acquired by the camera, and the image is then grid-processed and converted into spatial coordinates that represent the position of the mouth shape in space. Meanwhile, the pronunciation audio of each initial and final uttered by the person is collected by the recording device. To record the spatial coordinates of the mouth shape at different pronunciation stages, the whole utterance is split into four equal parts and images are captured at the three interior boundaries between them. For example, if a pronunciation takes 1 second, the utterance is divided into four parts of 0.25 seconds each, and images are captured at 0.25 s, 0.5 s and 0.75 s. In this way, mouth shape images at different time points are collected and the spatial coordinates of the mouth shape at different pronunciation stages are recorded; a small sketch of this sampling rule follows.
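The sampling rule generalizes to any utterance length. A minimal sketch, assuming only that the utterance duration is known (the function name capture_times is illustrative):

```python
def capture_times(duration_s: float) -> list[float]:
    """Timestamps of the three interior boundaries when an utterance
    is split into four equal parts (0.25T, 0.5T, 0.75T)."""
    quarter = duration_s / 4.0
    return [quarter, 2.0 * quarter, 3.0 * quarter]

print(capture_times(1.0))  # [0.25, 0.5, 0.75]
```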
It should be noted that the mouth shape image may be a series of continuous image frames or discrete key frames. The grid represents spatial coordinates: each grid point carries {x, y, z} coordinates, and the finer the grid, the better the effect.
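Purely as an illustration of such a grid (the resolution and the array layout are assumptions, not the patent's specification):

```python
import numpy as np

H, W = 16, 24                     # grid resolution; a finer grid gives better results
mouth_grid = np.zeros((H, W, 3))  # mouth_grid[i, j] = (x, y, z) of one grid point
```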
Waveform analysis is performed on the acquired pronunciation audio using the fast Fourier transform (FFT) to obtain how the audio varies over time, yielding the waveform diagram of the audio. By performing numerical calculation on the waveform diagram, mouth-shape-related features such as the spectral information and the energy information of the audio can be extracted.
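A minimal NumPy sketch of such an analysis, assuming a mono signal at a 16 kHz sample rate; the patent names no library, and the synthetic 220 Hz tone merely stands in for recorded audio:

```python
import numpy as np

sr = 16000                                      # assumed sample rate in Hz
t = np.arange(sr) / sr                          # one second of timestamps
samples = 0.6 * np.sin(2 * np.pi * 220.0 * t)   # stand-in for recorded audio

spectrum = np.fft.rfft(samples)                 # frequency-domain signal
freqs = np.fft.rfftfreq(len(samples), 1.0 / sr) # frequency of each bin in Hz
magnitude = np.abs(spectrum)                    # spectral information
print(freqs[np.argmax(magnitude)])              # ~220.0, the dominant frequency
```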
It should be noted that the Fourier transform converts the time-domain waveform into a frequency-domain signal, yielding the spectral information. The Fourier transform is a generalization of the Fourier series and can decompose a non-periodic function into a sum of sine and cosine functions. The formula of the Fourier transform is as follows:
F(ω) = ∫ f(t) · e^(-iωt) dt
where f(t) is the non-periodic function, F(ω) is its representation in the frequency domain, e^(-iωt) is a complex exponential, and ω is the angular frequency.
The physical meaning of the Fourier transform is that any non-periodic function can be expressed as a sum of sine and cosine functions of many different frequencies; the frequencies of these component functions are continuous and can take arbitrary real values.
The waveform is typically represented as a two-dimensional histogram that is vertically symmetric about the x-axis, and the waveform values are calculated statistically from the height variations of the histogram.
The spectral information refers to the distribution of sound energy across different frequencies, and the energy information refers to the intensity or amplitude of the sound, which reflects the energy level of the sound signal. In the mouth shape alignment method, the energy information may be used to calculate the waveform values, from which mouth-shape-related features are extracted, as illustrated below.
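One plausible reading of this energy information is short-time energy per frame; the sketch below assumes that interpretation (the frame length is arbitrary):

```python
import numpy as np

def short_time_energy(samples: np.ndarray, frame: int = 512) -> np.ndarray:
    """Sum of squared amplitudes per frame: the signal's energy over time."""
    n = len(samples) // frame
    return (samples[: n * frame].reshape(n, frame) ** 2).sum(axis=1)
```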
By comparing and analyzing the initial and final pronunciation audio and mouth shape change images stored in the database, the human mouth shape is converted into digital human mouth shape coordinates through spatial coordinate mapping. Mouth shape information can thus be represented numerically, enabling deeper study and analysis.
A text segment of human pronunciation is selected and its pronunciation audio is recorded for waveform analysis. RMSE (root mean square error) is then used to measure the difference between this audio and the previously stored initial and final pronunciation waveform diagrams.
It should be noted that the root mean square error (RMSE) is a common indicator for measuring the error between predicted values and actual observed values. It is the standard deviation of the prediction errors and reflects the average magnitude of the deviation between the two sequences. The RMSE is calculated as follows:
RMSE = sqrt((1/N) × Σ(i=1 to N) (x(i) - y(i))^2)
where x and y represent the sample values of the two waveform sequences, respectively, and N is the total number of sample points.
According to this formula, the smaller the calculated RMSE value, the more similar the two waveforms. By computing the RMSE against each stored waveform, the best-matching initial and final pronunciation can be determined, as in the sketch below.
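A direct transcription of this formula into Python; truncating the two sequences to a common length is an assumption, since the patent leaves that detail open:

```python
import numpy as np

def rmse(x: np.ndarray, y: np.ndarray) -> float:
    """Root mean square error between two waveform sequences."""
    n = min(len(x), len(y))
    return float(np.sqrt(np.mean((x[:n] - y[:n]) ** 2)))

# The stored initial/final whose waveform gives the smallest RMSE
# against the new utterance is taken as the best match.
```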
Then, according to the matching result, the previously stored spatial coordinates of the corresponding mouth shape image are retrieved and mapped into the coordinate system of the digital human's mouth shape, so that the digital human's mouth is rendered.
While embodiments of the present invention have been shown and described above for purposes of illustration and description, it will be understood that the above embodiments are illustrative and are not to be construed as limiting the invention; those of ordinary skill in the art may make changes, modifications, substitutions and variations to the above embodiments within the scope of the invention.

Claims (7)

1. A mouth shape alignment method based on traversal training of initials, finals and whole pronunciations, characterized by comprising an original pronunciation acquisition and mouth shape calculation process and a digital human mouth shape acquisition process, wherein both processes comprise a calculation module and a database.
2. The mouth shape alignment method based on traversal training of initials, finals and whole pronunciations according to claim 1, characterized in that the original pronunciation acquisition and mouth shape calculation process comprises the following steps:
S1: mouth shape positioning: collect a mouth shape image of the person while silent with the acquisition equipment, apply grid processing, and convert the image into spatial coordinates to obtain the person's original mouth shape data;
S2: audio acquisition: collect the pronunciation audio of all of the person's initials and finals, together with the corresponding mouth shape change images, with the acquisition equipment to obtain the person's original pronunciation data;
S3: pronunciation waveform acquisition and storage: perform waveform analysis on the collected pronunciation audio to obtain the waveform diagram of the audio, and store the waveform diagram in the database;
S4: waveform value calculation: the calculation module performs numerical calculation on the waveform diagram, extracts mouth-shape-related features, and stores the feature data in the database;
S5: human mouth shape replication: compare and analyze the stored initial and final pronunciation audio and mouth shape change images, convert the human mouth shape into digital human mouth shape coordinates through spatial coordinate mapping, and store the digital human mouth shape coordinates.
3. The mouth shape alignment method based on traversal training of initials, finals and whole pronunciations according to claim 1, characterized in that the digital human mouth shape acquisition process comprises the following steps:
S1: collection of a new human utterance: select a text segment of human pronunciation for which no mouth shape spatial coordinate mapping has been recorded;
S2: speech waveform analysis: acquire the pronunciation audio of the text segment with the audio acquisition equipment, and perform waveform analysis in the calculation module to obtain the audio waveform diagram;
S3: waveform similarity: match the calculated waveform diagram against the initial and final pronunciation waveform diagrams stored during the original pronunciation acquisition and mouth shape calculation process, and find the best-matching initial and final pronunciation;
S4: mouth shape spatial mapping: according to the matching result, retrieve the previously stored spatial coordinates of the mouth shape image of the corresponding initial and final, and display the digital human's mouth shape using these coordinates.
4. The mouth shape alignment method based on traversal training of initials, finals and whole pronunciations according to claim 2, characterized in that the acquisition equipment is a camera and an audio recorder, and the mouth shape image is a series of continuous image frames or discrete key frames.
5. The mouth shape alignment method based on traversal training of initials, finals and whole pronunciations according to claim 2, characterized in that the waveform diagram is represented as a two-dimensional histogram that is vertically symmetric about the x-axis, and the waveform values are calculated statistically from changes in the histogram's coverage.
6. The mouth shape alignment method based on traversal training of initials, finals and whole pronunciations according to claim 2, characterized in that the mouth-shape-related features include the spectral information and the energy information of the audio, the energy information referring to the intensity or amplitude of the sound.
7. The mouth shape alignment method based on traversal training of initials, finals and whole pronunciations according to claim 3, characterized in that the calculation module computes the numeric difference between two waveform values; the minimum difference indicates the greatest similarity, i.e. the highest degree of matching.
CN202311690218.XA 2023-12-11 2023-12-11 Mouth shape alignment method based on training of traversing initial consonants, vowels and integral pronunciation Pending CN117894064A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311690218.XA CN117894064A (en) 2023-12-11 2023-12-11 Mouth shape alignment method based on training of traversing initial consonants, vowels and integral pronunciation


Publications (1)

Publication Number Publication Date
CN117894064A true CN117894064A (en) 2024-04-16

Family

ID=90645637

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311690218.XA Pending CN117894064A (en) 2023-12-11 2023-12-11 Mouth shape alignment method based on training of traversing initial consonants, vowels and integral pronunciation

Country Status (1)

Country Link
CN (1) CN117894064A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108763190A (en) * 2018-04-12 2018-11-06 平安科技(深圳)有限公司 Voice-based mouth shape cartoon synthesizer, method and readable storage medium storing program for executing
KR102035596B1 (en) * 2018-05-25 2019-10-23 주식회사 데커드에이아이피 System and method for automatically generating virtual character's facial animation based on artificial intelligence
CN114581567A (en) * 2022-05-06 2022-06-03 成都市谛视无限科技有限公司 Method, device and medium for driving mouth shape of virtual image by sound
CN115511994A (en) * 2022-10-14 2022-12-23 厦门靠谱云股份有限公司 Method for quickly cloning real person into two-dimensional virtual digital person
CN116665695A (en) * 2023-07-28 2023-08-29 腾讯科技(深圳)有限公司 Virtual object mouth shape driving method, related device and medium
CN116994600A (en) * 2023-09-28 2023-11-03 中影年年(北京)文化传媒有限公司 Method and system for driving character mouth shape based on audio frequency



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination