CN114863898A - Vehicle karaoke audio processing method and system and storage medium

Info

Publication number: CN114863898A
Application number: CN202110153880.6A
Authority: CN
Inventors: 李景俊; 邓胜; 谢鹏鹤; 覃小艺; 张剑锋; 尹苍穹
Original assignee: Guangzhou Automobile Group Co Ltd
Current assignee: Guangzhou Automobile Group Co Ltd
Priority date: 2021-02-04
Filing date: 2021-02-04
Publication date: 2022-08-05

Abstract

The invention relates to a vehicle karaoke audio processing method and system, and a storage medium. The method comprises: acquiring voiceprint parameters of a singer collected by a vehicle-mounted sound collecting device; acquiring consecutive mouth-shape frame images of the singer captured by a vehicle-mounted camera device, and recognizing these images to obtain mouth-shape acoustic parameters; obtaining corresponding singing content parameters according to the mouth-shape acoustic parameters; generating a first audio signal according to the voiceprint parameters and the singing content parameters; acquiring a second audio signal corresponding to the accompaniment of the current song; and mixing the first audio signal and the second audio signal to obtain a third audio signal, and sending the third audio signal to a vehicle-mounted audio playback device so that the device plays it. The invention allows the singer to still perform the song even when the singer forgets the words or sings them incorrectly, thereby improving the in-vehicle karaoke user experience.

Description

Vehicle karaoke audio processing method and system and storage medium

Technical Field

The invention relates to the technical field of audio processing, in particular to a vehicle karaoke audio processing method and system and a computer readable storage medium.

Background

Currently, in-vehicle karaoke mainly mixes the vocal signal input by the singer with the accompaniment audio signal and then plays the mixed audio signal. In practice, however, the singer may sing too quietly, forget the words or sing them incorrectly, and in such cases the in-vehicle karaoke user experience is poor.

Disclosure of Invention

The invention aims to provide a vehicle karaoke audio processing method and system and a computer-readable storage medium, so that a singer can still perform a song even when singing too quietly, forgetting the words or singing them incorrectly, thereby improving the in-vehicle karaoke user experience.

In a first aspect, the invention provides a vehicle karaoke audio processing method, which comprises the following steps:

acquiring voiceprint parameters of a singer collected by a vehicle-mounted sound collecting device;

acquiring consecutive mouth-shape frame images of the singer captured by a vehicle-mounted camera device, and recognizing the consecutive mouth-shape frame images by using a pre-trained deep learning network model to obtain mouth-shape acoustic parameters;

obtaining corresponding singing content parameters according to the mouth shape acoustic parameters;

generating a first audio signal according to the voiceprint parameters and the singing content parameters;

acquiring a second audio signal corresponding to the current song accompaniment music;

and carrying out sound mixing processing on the first audio signal and the second audio signal to obtain a third audio signal, and sending the third audio signal to vehicle-mounted audio playing equipment so that the vehicle-mounted audio playing equipment plays the third audio signal.

Optionally, the mouth-shape acoustic parameters include a mouth-shape confidence parameter and a degree-of-match parameter between the mouth shape and the lyrics of the current song;

the mouth-shape acoustic parameters of each mouth-shape action comprise a mouth-shape confidence parameter and a degree-of-match parameter between the mouth shape and the lyrics of the current song.

Optionally, the obtaining of the corresponding singing content parameter according to the mouth-type acoustic parameter includes:

determining, according to the mouth-shape confidence parameter and the degree-of-match parameter between the mouth shape and the lyrics of the current song, whether to retain or correct the lyric content corresponding to each mouth-shape action, wherein the correction comprises replacing the lyric content with the correct lyrics of the current song, or adjusting the lyric content so that its similarity to the correct lyrics of the current song is greater than a preset threshold;

or determining, according to the mouth-shape confidence parameter and the degree-of-match parameter between the mouth shape and the lyrics of the current song, whether to retain or correct a line of lyric content corresponding to a plurality of mouth-shape actions, wherein the correction comprises replacing the lyric content with the correct line of lyrics of the current song, or adjusting the lyric content so that its similarity to the correct line of lyrics of the current song is greater than a preset threshold.

Optionally, the mouth-shaped continuous frame images are acquired at the same time as the voiceprint parameters.

Optionally, the voiceprint parameters include a fundamental frequency parameter, a formant parameter, a harmonic amplitude parameter, and a harmonic-to-noise ratio parameter.

In a second aspect, the present invention provides a vehicle karaoke audio processing system, including:

a voiceprint acquisition unit, used for acquiring voiceprint parameters of the singer collected by the vehicle-mounted sound collecting device;

the acoustic parameter acquisition unit is used for acquiring mouth type continuous frame images of singers acquired by the vehicle-mounted camera equipment and identifying the mouth type continuous frame images by utilizing a pre-trained deep learning network model to acquire mouth type acoustic parameters;

the singing content acquisition unit is used for acquiring corresponding singing content parameters according to the mouth shape acoustic parameters;

the first audio acquisition unit is used for generating a first audio signal according to the voiceprint parameter and the singing content parameter;

the second audio acquisition unit is used for acquiring a second audio signal corresponding to the accompaniment of the current song; and

and the third audio acquisition unit is used for carrying out audio mixing processing on the first audio signal and the second audio signal to obtain a third audio signal and sending the third audio signal to vehicle-mounted audio playing equipment so that the vehicle-mounted audio playing equipment plays the third audio signal.

Optionally, the acoustic parameter obtaining unit is specifically configured to:

Or, determining, according to the mouth-shape confidence parameter and the degree-of-match parameter between the mouth shape and the lyrics of the current song, whether to retain or correct a line of lyric content corresponding to a plurality of mouth-shape actions, wherein the correction comprises replacing the lyric content with the correct line of lyrics of the current song, or adjusting the lyric content so that its similarity to the correct line of lyrics of the current song is greater than a preset threshold.

A third aspect of the present invention proposes a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the vehicle karaoke audio processing method of the first aspect.

The aspects of the present invention thus provide a vehicle karaoke audio processing method and system and a computer-readable storage medium, which, when implemented, have at least the following advantages:

The singing content to be played is obtained by intelligently recognizing the consecutive mouth-shape frame images of the singer; this content may be a correction or adjustment of what the singer actually sang. Combining it with the singer's unique voiceprint characteristics yields the singing content as the singer would have sung it in an ideal state. Finally, this singing content is mixed with the accompaniment and output for playback, so that the singer can still perform the song even when singing too quietly, forgetting the words or singing them incorrectly, which improves the in-vehicle karaoke user experience.

Additional features and advantages of the invention will be set forth in the description which follows.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a vehicle karaoke audio processing method according to an embodiment of the present invention.

Fig. 2 is a schematic diagram of a vehicle karaoke audio processing system according to another embodiment of the present invention.

Detailed Description

Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In addition, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present invention. It will be understood by those skilled in the art that the present invention may be practiced without some of these specific details. In some instances, well known means have not been described in detail so as not to obscure the present invention.

Referring to Fig. 1, an embodiment of the present invention provides a vehicle karaoke audio processing method, including the following steps S1 to S6:

Step S1, acquiring voiceprint parameters of the singer collected by the vehicle-mounted sound collecting device;

specifically, the voiceprint parameters characterize the voice of the singer; in a specific example, they comprise a fundamental frequency parameter, a formant parameter, a harmonic amplitude parameter and a harmonic-to-noise ratio parameter of the singer's voice;
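As an illustration of one of these parameters, the sketch below estimates the fundamental frequency of a voiced frame by autocorrelation. The patent does not state how the voiceprint parameters are extracted, so this is a swapped-in textbook technique; the names `VoiceprintParams` and `estimate_f0` and the 80 to 500 Hz search range are assumptions made for illustration only.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class VoiceprintParams:
    """Hypothetical container for the four voiceprint parameters named in the text."""
    fundamental_hz: float = 0.0
    formants_hz: List[float] = field(default_factory=list)
    harmonic_amplitudes: List[float] = field(default_factory=list)
    harmonic_to_noise_db: float = 0.0


def estimate_f0(samples: Sequence[float], sample_rate: int,
                fmin: float = 80.0, fmax: float = 500.0) -> float:
    """Estimate the fundamental frequency by picking the autocorrelation peak.

    The search range (fmin, fmax) is an assumed range for a singing voice.
    Returns 0.0 when no positive autocorrelation peak is found.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(samples) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation of the frame at this lag.
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


# Usage: a synthetic 200 Hz tone sampled at 8 kHz stands in for the microphone signal.
rate = 8000
frame = [math.sin(2 * math.pi * 200 * n / rate) for n in range(800)]
params = VoiceprintParams(fundamental_hz=estimate_f0(frame, rate))
```

The remaining fields (formants, harmonic amplitudes, harmonic-to-noise ratio) are left unfilled here; they would need their own estimators, which the source does not describe.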

step S2, acquiring mouth-shaped continuous frame images of singers acquired by vehicle-mounted camera equipment, and recognizing the mouth-shaped continuous frame images by using a pre-trained deep learning network model to acquire mouth-shaped acoustic parameters;

specifically, the mouth-shape acoustic parameters characterize the vocal content that the singer's mouth shape is meant to express;

in a specific example, the mouth-shape acoustic parameters include, but are not limited to, a mouth-shape confidence parameter and a degree-of-match parameter between the mouth shape and the lyrics of the current song; one mouth-shape action spanning multiple consecutive frame images corresponds to one lyric content, and the mouth-shape acoustic parameters of each mouth-shape action comprise a mouth-shape confidence parameter and a degree-of-match parameter between the mouth shape and the lyrics of the current song;

it can be understood that a singer needs a certain amount of time to complete one mouth-shape action, during which the vehicle-mounted camera device captures a plurality of consecutive images; one mouth-shape action therefore corresponds to a plurality of consecutive images, and one mouth-shape action actually corresponds to one lyric content, such as "I", "you" or "he";

the mouth-shape confidence parameter indicates how reliable the recognized mouth-shape action is: if the mouth-shape action is indistinct, the confidence is relatively low, and if it is distinct, the confidence is relatively high; specifically, the mouth-shape confidence parameter is expressed as a value from 0 to 100%, and the higher the value, the higher the confidence;

the degree-of-match parameter between the mouth shape and the lyrics of the current song can be determined as follows: the lyric corresponding to the mouth shape is obtained, the lyric of the music being played is determined according to the timestamp of the image frames corresponding to the mouth shape, and the two lyrics are then matched against each other; specifically, the degree-of-match parameter is expressed as a value from 0 to 100%, and the higher the value, the higher the degree of match;
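The timestamp lookup and matching described above can be sketched as follows. This is a minimal illustration under stated assumptions: the lyric timeline format (a sorted list of start time and text pairs), the function names `lyric_at` and `match_degree`, and the use of `difflib.SequenceMatcher` as the matching measure are not taken from the patent, which does not say how the two lyrics are compared.

```python
import bisect
from difflib import SequenceMatcher
from typing import List, Tuple

# Hypothetical timeline: (start time in seconds, lyric text), sorted by start time.
Timeline = List[Tuple[float, str]]


def lyric_at(timeline: Timeline, timestamp: float) -> str:
    """Return the lyric being played at the given timestamp, or '' before the first entry."""
    starts = [start for start, _ in timeline]
    idx = bisect.bisect_right(starts, timestamp) - 1
    return timeline[idx][1] if idx >= 0 else ""


def match_degree(recognized: str, timeline: Timeline, frame_timestamp: float) -> float:
    """Degree of match (0 to 100) between the lip-read lyric and the scheduled lyric."""
    expected = lyric_at(timeline, frame_timestamp)
    if not expected:
        return 0.0
    # SequenceMatcher.ratio() is a stand-in similarity measure in [0, 1].
    return 100.0 * SequenceMatcher(None, recognized, expected).ratio()


# Usage: the frames of one mouth-shape action are stamped at 1.5 s.
timeline = [(0.0, "you"), (1.0, "are"), (2.0, "my")]
score = match_degree("are", timeline, 1.5)
```

Any other string similarity measure could be substituted, as long as it is scaled to the 0 to 100% range used for the parameter.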

it should be noted that a deep learning network model is a tool that can be used for image frame recognition and achieves its recognition purpose through training; only the input layer and the output layer of an existing deep learning network model need to be adjusted, so that the input layer corresponds to the consecutive mouth-shape frame images of this embodiment and the output layer corresponds to the mouth-shape acoustic parameters of this embodiment; given training samples, the model can then be trained to achieve the recognition required by this embodiment;

step S3, obtaining corresponding singing content parameters according to the mouth shape acoustic parameters;

in a specific example, the obtaining of the corresponding singing content parameter according to the mouth-type acoustic parameter includes:

determining, according to the mouth-shape confidence parameter and the degree-of-match parameter between the mouth shape and the lyrics of the current song, whether to retain or correct the lyric content corresponding to each mouth-shape action; wherein the correction comprises replacing the lyric content with the correct lyrics of the current song, or adjusting the lyric content so that its similarity to the correct lyrics of the current song is greater than a preset threshold;

specifically, whether the lyric content corresponding to a mouth-shape action is retained or corrected is determined by comparing the mouth-shape confidence parameter and the degree-of-match parameter with preset thresholds; for example, if the confidence parameter of the mouth-shape action is greater than the confidence threshold and the degree-of-match parameter is greater than the degree-of-match threshold, the lyric content corresponding to that mouth-shape action is retained; otherwise, it is corrected;
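The retain-or-correct rule for a single mouth-shape action can be sketched as below. The `MouthAction` type, the function name `resolve_lyric` and the threshold values 60 and 70 are hypothetical; the source only states that preset thresholds exist, not what they are. The correction shown is the first of the two options in the text (replacement with the correct lyric).

```python
from dataclasses import dataclass


@dataclass
class MouthAction:
    """Hypothetical record for one recognized mouth-shape action."""
    recognized: str     # lyric content read from the mouth shape
    confidence: float   # mouth-shape confidence parameter, 0 to 100
    match: float        # degree of match with the current song lyric, 0 to 100


def resolve_lyric(action: MouthAction, correct_lyric: str,
                  confidence_threshold: float = 60.0,
                  match_threshold: float = 70.0) -> str:
    """Retain the recognized lyric only if both parameters exceed their thresholds.

    Otherwise correct it by substituting the correct lyric of the current song.
    Both default thresholds are assumed values, not taken from the source.
    """
    if action.confidence > confidence_threshold and action.match > match_threshold:
        return action.recognized
    return correct_lyric


# Usage: a clear, matching action is kept; an indistinct one is replaced.
kept = resolve_lyric(MouthAction("you", 92.0, 95.0), "you")
fixed = resolve_lyric(MouthAction("yew", 35.0, 40.0), "you")
```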

more specifically, the similarity between lyrics may be computed as a text distance, the lyric content being adjusted so that the distance between the two words is smaller than a preset threshold; the distance may be a Euclidean distance, a Manhattan distance, or the like;
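A minimal sketch of such a text distance is given below. The patent names Euclidean and Manhattan distances but does not say how text is turned into vectors; representing each lyric as a vector of character counts is an assumption made here, as are the function names and the example threshold.

```python
import math
from collections import Counter


def manhattan_distance(a: str, b: str) -> float:
    """Manhattan distance between character-count vectors of two lyrics."""
    ca, cb = Counter(a), Counter(b)
    return float(sum(abs(ca[ch] - cb[ch]) for ch in set(ca) | set(cb)))


def euclidean_distance(a: str, b: str) -> float:
    """Euclidean distance between character-count vectors of two lyrics."""
    ca, cb = Counter(a), Counter(b)
    return math.sqrt(sum((ca[ch] - cb[ch]) ** 2 for ch in set(ca) | set(cb)))


def is_close_enough(candidate: str, correct: str, max_distance: float = 2.0) -> bool:
    """True when the distance is below the (assumed) preset threshold."""
    return manhattan_distance(candidate, correct) < max_distance


# Usage: "live" differs from "love" by one substituted character.
d_manhattan = manhattan_distance("love", "live")
d_euclidean = euclidean_distance("love", "live")
```

Character counts ignore character order, so anagrams have distance zero; a practical system would likely use an order-aware representation.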

in another specific example, the obtaining of the corresponding singing content parameter according to the mouth-type acoustic parameter includes:

determining, according to the mouth-shape confidence parameter and the degree-of-match parameter between the mouth shape and the lyrics of the current song, whether to retain or correct a line of lyric content corresponding to a plurality of mouth-shape actions; wherein the correction comprises replacing the lyric content with the correct line of lyrics of the current song, or adjusting the lyric content so that its similarity to the correct line of lyrics of the current song is greater than a preset threshold;

specifically, whether the line of lyric content corresponding to the mouth-shape actions is retained or corrected is determined by comparing the mouth-shape confidence parameter and the degree-of-match parameter with preset thresholds; for example, if the confidence parameter of the mouth-shape actions is greater than the confidence threshold and the degree-of-match parameter is greater than the degree-of-match threshold, the line of lyric content corresponding to those mouth-shape actions is retained; otherwise, it is corrected;

more specifically, the similarity between lyrics may be computed as a text distance, the lyric content being adjusted so that the distance between the two lines is smaller than a preset threshold; the distance may be a Euclidean distance, a Manhattan distance, or the like.

Step S4, generating a first audio signal according to the voiceprint parameter and the singing content parameter;

specifically, the first audio signal may be understood as singing content sung by a singer in an ideal state, thereby improving the user experience effect of karaoke in a car;

step S5, acquiring a second audio signal corresponding to the current song accompaniment music;

step S6, performing sound mixing processing on the first audio signal and the second audio signal to obtain a third audio signal, and sending the third audio signal to a vehicle-mounted audio playing device so that the vehicle-mounted audio playing device plays the third audio signal.

Specifically, steps S5 to S6 are conventional karaoke mixing processing. The method of this embodiment mainly improves how the singer's vocal audio signal is obtained, so that the singer can still perform the song even when singing too quietly, forgetting the words or singing them incorrectly, which improves the in-vehicle karaoke user experience.
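The mixing in step S6 can be illustrated with a simple gain-weighted sum of samples. This is only a sketch of conventional mixing: the function name `mix`, the normalized sample range of -1.0 to 1.0, the default gains and the hard clipping are assumptions, since the source does not describe the mixing procedure in detail.

```python
from typing import List, Sequence


def mix(first: Sequence[float], second: Sequence[float],
        first_gain: float = 1.0, second_gain: float = 0.8) -> List[float]:
    """Mix the generated vocal signal with the accompaniment, sample by sample.

    Samples are assumed to be floats in [-1.0, 1.0]; the shorter signal is
    padded with silence, and the sum is hard-clipped to stay in range.
    """
    length = max(len(first), len(second))
    mixed = []
    for i in range(length):
        vocal = first[i] if i < len(first) else 0.0
        accompaniment = second[i] if i < len(second) else 0.0
        sample = first_gain * vocal + second_gain * accompaniment
        mixed.append(max(-1.0, min(1.0, sample)))
    return mixed


# Usage: the third audio signal is the mix of the first and second signals.
third = mix([0.5, 0.5], [0.5, 1.0, -1.0], first_gain=1.0, second_gain=1.0)
```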

In one embodiment, the continuous frame images of the mouth are acquired at the same time as the voiceprint parameters, so that the voiceprint of the singer corresponds to the singing content.

Further, when the singer only mouths the words without making a sound, only the mouth-shape images can be acquired and the singer's voiceprint parameters cannot. This indicates that the singer may be singing too quietly or may have forgotten the words; in this case, the previously identified voiceprint parameters are used as the current singer's voiceprint parameters for subsequent audio signal processing.
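The fallback described above amounts to remembering the most recent valid voiceprint. A minimal sketch follows; the class name `VoiceprintTracker`, the use of `None` to signal that no voiceprint could be measured, and the dictionary representation of the parameters are all assumptions for illustration.

```python
from typing import Dict, Optional

Voiceprint = Dict[str, float]  # hypothetical representation of the parameters


class VoiceprintTracker:
    """Keeps the last successfully identified voiceprint for silent passages."""

    def __init__(self) -> None:
        self._last: Optional[Voiceprint] = None

    def current(self, measured: Optional[Voiceprint]) -> Optional[Voiceprint]:
        """Return the measured voiceprint, or the previous one if none was measured.

        Returns None only if no voiceprint has ever been identified.
        """
        if measured is not None:
            self._last = measured
        return self._last


# Usage: the second frame is silent, so the earlier voiceprint is reused.
tracker = VoiceprintTracker()
first_frame = tracker.current({"fundamental_hz": 220.0})
silent_frame = tracker.current(None)
```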

Referring to Fig. 2, another embodiment of the present invention provides a vehicle karaoke audio processing system, including:

a voiceprint acquisition unit 1, configured to acquire voiceprint parameters of a singer collected by a vehicle-mounted sound collecting device;

an acoustic parameter acquisition unit 2, configured to acquire consecutive mouth-shape frame images of the singer captured by a vehicle-mounted camera device, and to recognize the consecutive mouth-shape frame images by using a pre-trained deep learning network model to obtain mouth-shape acoustic parameters;

a singing content acquisition unit 3, configured to obtain corresponding singing content parameters according to the mouth-shape acoustic parameters;

a first audio acquisition unit 4, configured to generate a first audio signal according to the voiceprint parameters and the singing content parameters;

a second audio acquisition unit 5, configured to acquire a second audio signal corresponding to the accompaniment of the current song; and

a third audio acquisition unit 6, configured to mix the first audio signal and the second audio signal to obtain a third audio signal, and to send the third audio signal to a vehicle-mounted audio playback device so that the device plays the third audio signal.
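The data flow between the six units can be sketched as a chain of callables. This is only an illustration of how the units depend on one another: the function name `run_units`, the parameter names and the stub implementations in the usage example are hypothetical, and real units would wrap the microphone, camera, recognition model and audio output.

```python
from typing import Any, Callable


def run_units(acquire_voiceprint: Callable[[], Any],            # unit 1
              acquire_mouth_params: Callable[[], Any],          # unit 2
              derive_singing_content: Callable[[Any], Any],     # unit 3
              generate_first_audio: Callable[[Any, Any], Any],  # unit 4
              acquire_second_audio: Callable[[], Any],          # unit 5
              mix_and_send: Callable[[Any, Any], Any]) -> Any:  # unit 6
    """Run one pass through the units in the order given in the embodiment."""
    voiceprint = acquire_voiceprint()
    mouth_params = acquire_mouth_params()
    content = derive_singing_content(mouth_params)
    first_audio = generate_first_audio(voiceprint, content)
    second_audio = acquire_second_audio()
    return mix_and_send(first_audio, second_audio)


# Usage with stand-in stubs; every lambda below is a placeholder, not a real unit.
third_audio = run_units(
    lambda: {"fundamental_hz": 200.0},
    lambda: [("you", 90.0, 95.0)],
    lambda params: [word for word, _, _ in params],
    lambda voiceprint, content: [0.1] * len(content),
    lambda: [0.2],
    lambda first, second: [a + b for a, b in zip(first, second)],
)
```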

In a specific example, the mouth type acoustic parameters comprise a mouth type credibility parameter, a matching degree parameter of the mouth type and the lyrics of the current song;

In a specific example, the acoustic parameter obtaining unit 2 is specifically configured to:

In a specific example, the continuous frame images of the mouth are acquired at the same time as the voiceprint parameters.

In a specific example, the voiceprint parameters include a fundamental frequency parameter, a formant parameter, a harmonic amplitude parameter, and a harmonic-to-noise ratio parameter.

The above-described system embodiments are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

It should be noted that the system described in the foregoing embodiment corresponds to the method described in the foregoing embodiment, and therefore, parts of the system described in the foregoing embodiment that are not described in detail may be obtained by referring to the content of the method described in the foregoing embodiment, that is, the specific step content of the method described in the foregoing embodiment may be understood as the functions that can be implemented by the system of this embodiment, and will not be described again here.

In addition, when the car karaoke audio processing system according to the above embodiment is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer readable storage medium.

Another embodiment of the present invention provides a computer-readable storage medium having a computer program stored thereon, where the computer program, when executed by a processor, implements the steps of the vehicular karaoke audio processing method according to the above-described embodiment.

Specifically, the computer-readable storage medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like.

Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. A vehicle karaoke audio processing method, comprising:

2. The vehicle karaoke audio processing method as claimed in claim 1, wherein the mouth-shape acoustic parameters comprise a mouth-shape confidence parameter and a degree-of-match parameter between the mouth shape and the lyrics of the current song;

3. The vehicular karaoke audio processing method according to claim 2, wherein said obtaining corresponding singing content parameters from said mouth-type acoustic parameters comprises:

4. The vehicular karaoke audio processing method according to claim 2, wherein said obtaining corresponding singing content parameters from said mouth-type acoustic parameters comprises:

5. The vehicle karaoke audio processing method as claimed in claim 2, wherein the mouth-shaped continuous frame image is acquired at the same time as the voiceprint parameters.

6. The vehicle karaoke audio processing method as claimed in claim 2, wherein said voiceprint parameters comprise a fundamental frequency parameter, a formant parameter, a harmonic amplitude parameter, and a harmonic-to-noise ratio parameter.

7. A vehicle karaoke audio processing system, comprising:

the acoustic parameter acquisition unit is used for acquiring mouth type continuous frame images of the singer, which are acquired by the vehicle-mounted camera equipment, and recognizing the mouth type continuous frame images by using a pre-trained deep learning network model to acquire mouth type acoustic parameters;

the singing content acquisition unit is used for acquiring corresponding singing content parameters according to the mouth type acoustic parameters;

a first audio obtaining unit, configured to generate a first audio signal according to the voiceprint parameter and the singing content parameter;

8. The vehicle karaoke audio processing system of claim 7, wherein the mouth-shape acoustic parameters comprise a mouth-shape confidence parameter and a degree-of-match parameter between the mouth shape and the lyrics of the current song;

9. The vehicle karaoke audio processing system according to claim 8, wherein said acoustic parameter acquisition unit is specifically configured to:

determining, according to the mouth-shape confidence parameter and the degree-of-match parameter between the mouth shape and the lyrics of the current song, whether to retain or correct the lyric content corresponding to each mouth-shape action, wherein the correction comprises replacing the lyric content with the correct lyrics of the current song, or adjusting the lyric content so that its similarity to the correct lyrics of the current song is greater than a preset threshold;

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the vehicle karaoke audio processing method according to any one of claims 1 to 6.