CN112614481A

CN112614481A - Voice tone customization method and system for automobile prompt tone

Info

Publication number: CN112614481A
Application number: CN202011443075.9A
Authority: CN
Inventors: 李俊杰; 辛慧玉
Original assignee: Zhejiang Hozon New Energy Automobile Co Ltd
Current assignee: Zhejiang Hozon New Energy Automobile Co Ltd
Priority date: 2020-12-08
Filing date: 2020-12-08
Publication date: 2021-04-06

Abstract

The invention relates to the technical field of vehicle voice, and in particular to a method and system for customizing the voice timbre of automobile prompt tones. The method comprises the following steps: step S1, inputting a sound with a designated timbre; step S2, storing the input sound data; step S3, extracting the timbre of the input sound data and synthesizing the extracted timbre with the original voice prompt tone data to generate customized voice prompt tone data corresponding to the designated timbre; and step S4, storing and outputting the customized voice prompt tone data. With this method and system, the user records a preferred sound, and the timbre of that sound is imitated in subsequent TTS voice broadcast prompts. The structure is simple; it provides the sense of technology that voice interaction brings while retaining the function of traditional voice broadcast, and it improves the friendliness and individuality of the vehicle during driving.

Description

Voice tone customization method and system for automobile prompt tone

Technical Field

The invention relates to the technical field of vehicle voice, in particular to a voice tone customization method and system of automobile prompt tones.

Background

Vehicle-mounted voice control systems have become popular in recent years as a replacement for traditional in-vehicle control systems.

Through software-based voice control, such a system offers a simpler mode of interaction and can provide functions that traditional interaction modes such as physical buttons cannot, which also enhances the vehicle's sense of technology and luxury.

However, the voice interaction functions of existing vehicle-mounted voice control systems generally fall into the following types:

1) No voice control; only one-way voice broadcast is supported.

2) Simple voice control, such as turning on the air conditioner.

3) In addition to voice control, the broadcast voice can be selected, for example a male or a female voice.

Fig. 1 is a flow chart of the prior-art voice broadcast method used by the third type of system, which lets the user choose the broadcast timbre. Fixed voice files in several timbres are stored in a memory; the user selects a preferred timbre through the relevant setting items on the user interface (UI), and voice broadcasts are output in the selected timbre.

The solution shown in fig. 1 has the following drawbacks:

Voice files in several timbres generally have to be stored in advance, which places high demands on the storage hardware at the vehicle end.

Even when voice files in several timbres are provided in advance, the timbre cannot be customized on demand, so the specific preferences of users are largely unmet.

Disclosure of Invention

The invention aims to provide a method and a system for customizing the voice timbre of an automobile prompt tone, which solve the problem in the prior art that the timbre of automobile prompt tones is difficult for a user to record and customize individually.

To achieve this aim, the invention provides a method for customizing the voice timbre of an automobile prompt tone, comprising the following steps:

step S1, inputting a sound with a designated timbre;

step S2, storing the input sound data;

step S3, extracting the timbre of the input sound data and synthesizing the extracted timbre with the original voice prompt tone data to generate customized voice prompt tone data corresponding to the designated timbre;

and step S4, storing and outputting the customized voice prompt tone data.

In an embodiment, step S3 further includes:

step S31, analyzing the sound spectrum through Fourier transform and extracting the timbre features of the input sound data;

step S32, extracting the content feature information of the original voice prompt tone data;

step S33, synthesizing the timbre features and the content feature information to generate voice prompt tone data corresponding to the designated timbre.

In an embodiment, step S31 further includes:

step S311, decomposing the input sound data into frames;

step S312, calculating the periodogram estimate of the power spectrum for the audio of each frame;

step S313, applying the mel filter bank to the power spectrum and calculating the energy sum of each mel filter;

step S314, calculating the logarithm of each energy sum;

step S315, performing a discrete cosine transform on the logarithmic energies;

and step S316, retaining coefficients 2 through 13 of the discrete cosine transform result as the timbre features and discarding the remaining coefficients.
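Steps S311 to S316 follow the same outline as the usual computation of mel-frequency cepstral coefficients (MFCCs). The following is a minimal NumPy sketch under stated assumptions: the frame length, hop size, FFT size, filter count, Hamming window and the 2595/700 mel constants are common defaults and are not given in this document, and all function names are hypothetical.

```python
import numpy as np


def hz_to_mel(f):
    # Common mel-scale convention (an assumption, not taken from this document).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale.

    Returns an array of shape (n_filters, n_fft // 2 + 1).
    """
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):          # rising edge of the triangle
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling edge of the triangle
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank


def mfcc_timbre_features(signal, sample_rate=16000, frame_len=400, hop=160,
                         n_fft=512, n_filters=26):
    """Return one row of 12 cepstral coefficients per frame.

    The signal must contain at least frame_len samples.
    """
    signal = np.asarray(signal, dtype=float)
    # S311: split the signal into overlapping, windowed frames.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # S312: periodogram estimate of the power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft
    # S313: energy sum under each mel filter.
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    # S314: logarithm of each energy sum (small offset avoids log(0)).
    log_energies = np.log(energies + 1e-10)
    # S315: DCT-II of the log energies, written out explicitly.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(n, n + 0.5) / n_filters)
    cepstra = log_energies @ basis.T
    # S316: keep coefficients 2..13 (1-based), i.e. indices 1..12.
    return cepstra[:, 1:13]
```

Calling `mfcc_timbre_features(samples)` on one second of 16 kHz audio yields a 98 x 12 feature matrix with these defaults; a production system would more likely rely on an established audio library than on hand-written filters.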

In an embodiment, step S33 further includes:

step S331, classifying the extracted timbre feature information according to the frequency spectrum;

step S332, expanding the timbre feature information as a series and taking its main part;

step S333, sorting the content feature information and combining it with the timbre feature information to generate voice spectrum data corresponding to the designated timbre;

and step S334, performing an inverse frequency-domain transform on the voice spectrum data corresponding to the designated timbre and outputting the voice prompt tone data corresponding to the designated timbre.
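Steps S331 to S334 are described only at a high level, so the following sketch is one possible reading and not the method of the invention. It assumes that the "main part" of the series expansion can be approximated by the low-order cepstral coefficients (a smoothed spectral envelope), imposes the envelope of a frame of the designated voice on a frame of the original prompt, and returns to the time domain with an inverse FFT. All names and parameters are hypothetical.

```python
import numpy as np


def spectral_envelope(frame, n_fft=512, n_keep=13):
    """Smoothed log-magnitude envelope from the first n_keep cepstral terms."""
    log_mag = np.log(np.abs(np.fft.rfft(frame, n=n_fft)) + 1e-10)
    cepstrum = np.fft.irfft(log_mag, n=n_fft)      # real cepstrum, length n_fft
    lifter = np.zeros(n_fft)
    lifter[:n_keep] = 1.0                          # keep the low-order terms
    if n_keep > 1:
        lifter[-(n_keep - 1):] = 1.0               # and their mirrored copies
    return np.fft.rfft(cepstrum * lifter, n=n_fft).real


def swap_envelope(content_frame, timbre_frame, n_fft=512, n_keep=13):
    """Give content_frame the spectral envelope of timbre_frame."""
    content_frame = np.asarray(content_frame, dtype=float)
    timbre_frame = np.asarray(timbre_frame, dtype=float)
    spectrum = np.fft.rfft(content_frame, n=n_fft)
    env_content = spectral_envelope(content_frame, n_fft, n_keep)
    env_timbre = spectral_envelope(timbre_frame, n_fft, n_keep)
    # Combine: remove the prompt's own envelope, apply the designated one.
    new_spectrum = spectrum * np.exp(env_timbre - env_content)
    # Inverse frequency-domain transform back to samples.
    return np.fft.irfft(new_spectrum, n=n_fft)[: len(content_frame)]
```

The sketch works on a single frame; processing a whole prompt would additionally need windowing and overlap-add, which are omitted here for brevity.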

In an embodiment, step S33 further includes:

synthesizing the timbre features and the content feature information after training through a deep neural network algorithm.

To achieve the above aim, the invention further provides a system for customizing the voice timbre of an automobile prompt tone, comprising a user end, a vehicle end and a server:

the user end is connected with the vehicle end, inputs the sound with the designated timbre and outputs the customized voice prompt tone;

the vehicle end is connected with the server, receives and stores the input sound data and sends it to the server, sends the original voice prompt tone data to the server, and receives and stores the customized voice prompt tone data and sends it to the user end;

and the server extracts the timbre of the input sound data and synthesizes it with the original voice prompt tone data to generate customized voice prompt tone data corresponding to the designated timbre.
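The division of work among the three parts can be illustrated with a small sketch. It is only an illustration under assumptions: the class names, method names and the `convert` callable are hypothetical, audio is represented as a plain list of samples, and network transport is replaced by direct method calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Audio = List[float]


@dataclass
class ServerEnd:
    """Server: extracts the timbre and synthesizes the customized prompt."""
    convert: Callable[[Audio, Audio], Audio]   # hypothetical conversion routine

    def customize(self, recorded: Audio, original_prompt: Audio) -> Audio:
        return self.convert(recorded, original_prompt)


@dataclass
class VehicleEnd:
    """Vehicle end: stores data and relays it between user end and server."""
    server: ServerEnd
    original_prompt: Audio
    stored_input: Optional[Audio] = None
    stored_custom: Optional[Audio] = None

    def receive_recording(self, recorded: Audio) -> Audio:
        self.stored_input = recorded                       # store input sound
        self.stored_custom = self.server.customize(        # send both to server
            recorded, self.original_prompt)
        return self.stored_custom                          # send to user end


@dataclass
class UserEnd:
    """User end: records the designated sound and plays the customized prompt."""
    vehicle: VehicleEnd

    def record_and_play(self, recorded: Audio) -> Audio:
        return self.vehicle.receive_recording(recorded)
```

In this arrangement the vehicle end never performs signal processing itself, which matches the description of the server as the component that extracts and synthesizes.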

In an embodiment, the server analyzes the sound spectrum through Fourier transform, extracts the timbre features of the input sound data, extracts the content feature information of the original voice prompt tone data, and synthesizes the timbre features and the content feature information to generate the voice prompt tone data corresponding to the designated timbre.

In an embodiment, the server decomposes the input sound data into frames, calculates the periodogram estimate of the power spectrum for the audio of each frame, applies the mel filter bank to the power spectrum, calculates the energy sum of each mel filter, calculates the logarithm of each energy sum, performs a discrete cosine transform on the logarithmic energies, retains coefficients 2 through 13 of the discrete cosine transform result as the timbre features, and discards the remaining coefficients.

In an embodiment, the server classifies the extracted timbre feature information according to the frequency spectrum, expands the timbre feature information as a series and takes its main part, sorts the content feature information and combines it with the timbre feature information to generate voice spectrum data corresponding to the designated timbre, performs an inverse frequency-domain transform on that voice spectrum data, and outputs the voice prompt tone data corresponding to the designated timbre.

In an embodiment, the server synthesizes the timbre features and the content feature information after training through a deep neural network algorithm.

With the method and system for customizing the voice timbre of an automobile prompt tone provided by the invention, the user records a preferred sound, and the timbre of that sound is imitated in subsequent TTS voice broadcast prompts. The structure is simple; it provides the sense of technology that voice interaction brings while retaining the function of traditional voice broadcast, and it improves the friendliness and individuality of the vehicle during driving.

Drawings

The above and other features, properties and advantages of the present invention will become more apparent from the following description of the embodiments with reference to the accompanying drawings in which like reference numerals denote like features throughout the several views, wherein:

Fig. 1 is a flow chart of a voice broadcast method in the prior art;

Fig. 2 is a flow chart of a method for customizing the voice timbre of an automobile prompt tone according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of a system for customizing the voice timbre of an automobile prompt tone according to an embodiment of the present invention.

The meanings of the reference symbols in the figures are as follows:

100 user end;

200 vehicle end;

300 server.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Fig. 2 is a flow chart of a method for customizing the voice timbre of an automobile prompt tone according to an embodiment of the present invention. The method shown in Fig. 2 includes the following steps:

Step S1, inputting a sound with a designated timbre;

The sound the user is interested in is input directly through a microphone, using artificial intelligence technology, and the timbre of that sound serves as the designated timbre.

Step S2, storing the input sound data;

The vehicle end stores the input sound data file and transmits it to the server through the T-Box.

The T-Box (telematics box), also called the vehicle-mounted intelligent terminal, is the only networked control unit on the vehicle body and is responsible for monitoring and controlling the state of the vehicle body. It mainly collects vehicle-related information, including position information, attitude information and vehicle state information (obtained through the vehicle's CAN bus), and transmits this information to the TSP platform over wireless communication.

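Step S3, extracting the timbre of the input sound data and synthesizing the extracted timbre with the original voice prompt tone data to generate customized voice prompt tone data corresponding to the designated timbre;

The server extracts the timbre information through a frequency-domain transform algorithm and synthesizes it with the content of the original voice prompt tone.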

Step S4, storing and outputting the customized voice prompt tone data.

The server transmits the synthesized voice prompt tone data to the vehicle end through the T-Box; the timbre of the synthesized voice prompt tone is the timbre of the sound the user is interested in.

When the user interacts with the vehicle end through voice commands, the vehicle end outputs voice prompt tones in the timbre the user wants.

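Further, step S3 includes the following steps:

step S31, analyzing the sound spectrum through Fourier transform and extracting the timbre features of the input sound data;

step S32, extracting the content feature information of the original voice prompt tone data;

step S33, synthesizing the timbre features and the content feature information to generate voice prompt tone data corresponding to the designated timbre.

The key points of the present invention are the two operations within step S3: extracting the timbre feature information and synthesizing the sound.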

In step S31, the algorithm for extracting the timbre feature information further includes:

step S311, decomposing the input sound data into frames;

step S312, calculating the periodogram estimate of the power spectrum for the audio of each frame;

step S313, applying the mel filter bank to the power spectrum and calculating the energy sum of each mel filter;

The pitch perceived by the human ear is not linearly related to frequency, so a set of triangular filters, the mel filter bank, is constructed to decompose the signal sparsely.

step S314, calculating the logarithm of each energy sum;

step S315, performing a discrete cosine transform (DCT) on the logarithmic energies;

and step S316, retaining coefficients 2 through 13 of the discrete cosine transform result as the timbre features and discarding the remaining coefficients.
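For reference, the mel mapping and the discrete cosine transform named in these steps are commonly written as follows. The constants 2595 and 700 are a widely used convention and an assumption here, since the document gives no formula; the symbols are also introduced only for illustration: $f$ is frequency in Hz, $E_n$ is the energy sum of the $n$-th of $N$ mel filters, and $c_k$ is the $k$-th cepstral coefficient.

$$
m(f) = 2595 \,\log_{10}\!\left(1 + \frac{f}{700}\right)
$$

$$
c_k = \sum_{n=0}^{N-1} \log(E_n)\,\cos\!\left[\frac{\pi k}{N}\left(n + \tfrac{1}{2}\right)\right], \qquad k = 0, 1, \dots, N-1
$$

With this zero-based indexing, retaining coefficients 2 through 13 (counted from one) corresponds to keeping $c_1, \dots, c_{12}$.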

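In step S33, the algorithm for sound synthesis further includes:

step S331, classifying the extracted timbre feature information according to the frequency spectrum;

step S332, expanding the timbre feature information as a series and taking its main part;

step S333, sorting the content feature information and combining it with the timbre feature information to generate voice spectrum data corresponding to the designated timbre;

and step S334, performing an inverse frequency-domain transform on the voice spectrum data corresponding to the designated timbre and outputting the voice prompt tone data corresponding to the designated timbre.

Furthermore, the timbre feature information and the content feature information can be synthesized after training with a deep neural network algorithm; continually training on samples further improves the accuracy and precision of the sound synthesis.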

Fig. 3 is a schematic diagram of a system for customizing the voice timbre of an automobile prompt tone according to an embodiment of the present invention. The system shown in Fig. 3 includes a user end 100, a vehicle end 200 and a server 300:

the user end 100 is connected with the vehicle end 200, inputs the sound with the designated timbre and outputs the customized voice prompt tone;

the vehicle end 200 is connected with the server 300, receives and stores the input sound data and sends it to the server 300, sends the original voice prompt tone data to the server 300, and receives and stores the customized voice prompt tone data and sends it to the user end 100;

the server 300 extracts the timbre of the input sound data and synthesizes it with the original voice prompt tone data to generate customized voice prompt tone data corresponding to the designated timbre.

In the embodiment shown in Fig. 3, the user end 100 inputs the sound the user is interested in directly through a microphone, using artificial intelligence technology, and the timbre of that sound serves as the designated timbre.

When the user interacts with the vehicle end 200 through voice commands at the user end 100, the vehicle end 200 outputs voice prompt tones in the timbre the user wants to the user end 100.

In the embodiment shown in Fig. 3, the vehicle end 200 is an infotainment head unit (IHU).

The vehicle end 200 stores the input sound data file and transmits it to the server 300 through the T-Box.

Optionally, the vehicle end 200 is an SoC (system-on-chip) terminal. An SoC is an integrated circuit chip that can effectively reduce the development cost of electronic and information system products, shorten the development cycle and improve product competitiveness, and it is expected to become the dominant product development approach in industry.

In the embodiment shown in Fig. 3, the server 300 is a cloud processor that analyzes the uploaded input sound file into a spectrogram through Fourier transform and performs sound synthesis and timbre replacement.
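Turning a sound file into a spectrogram by Fourier transform can be sketched as a short-time Fourier transform. This is a minimal illustration that assumes 16 kHz mono samples already loaded into an array; the function name, frame length, hop size, FFT size and Hann window are assumptions, not values from this document.

```python
import numpy as np


def spectrogram(signal, frame_len=400, hop=160, n_fft=512):
    """Magnitude spectrogram: one row per frame, one column per frequency bin."""
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    # Cut the signal into overlapping frames and taper each one.
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Fourier transform of each frame; keep only the magnitude.
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))


# Usage: a 1 kHz test tone sampled at 16 kHz.
t = np.arange(16000) / 16000.0
tone_spectrogram = spectrogram(np.sin(2 * np.pi * 1000.0 * t))
```

Each column spans 16000 / 512 = 31.25 Hz with these settings, so the energy of the 1 kHz test tone concentrates in bin 32 of every frame.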

Further, the server 300 analyzes the sound spectrum through Fourier transform, extracts the timbre features of the input sound data, extracts the content feature information of the original voice prompt tone data, and synthesizes the timbre features and the content feature information to generate the voice prompt tone data corresponding to the designated timbre.

Further, the extraction of the timbre feature information by the server 300 includes:

decomposing the input sound data into frames, calculating the periodogram estimate of the power spectrum for the audio of each frame, applying the mel filter bank to the power spectrum, calculating the energy sum of each mel filter, calculating the logarithm of each energy sum, performing a discrete cosine transform on the logarithmic energies, retaining coefficients 2 through 13 of the discrete cosine transform result as the timbre features, and discarding the remaining coefficients.

Further, the sound synthesis performed by the server 300 includes:

classifying the extracted timbre feature information according to the frequency spectrum, expanding the timbre feature information as a series and taking its main part, sorting the content feature information and combining it with the timbre feature information to generate voice spectrum data corresponding to the designated timbre, performing an inverse frequency-domain transform on that voice spectrum data, and outputting the voice prompt tone data corresponding to the designated timbre.

Furthermore, the server 300 synthesizes the timbre features and the content feature information after training through a deep neural network algorithm.

While, for purposes of simplicity of explanation, the methodologies are shown and described as a series of acts, it is to be understood and appreciated that the methodologies are not limited by the order of acts, as some acts may, in accordance with one or more embodiments, occur in different orders and/or concurrently with other acts from that shown and described herein or not shown and described herein, as would be understood by one skilled in the art.

Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

As used in this application and the appended claims, the terms "a," "an," and/or "the" do not refer specifically to the singular and may also include the plural, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; these steps and elements do not form an exclusive list, and a method or apparatus may include other steps or elements.

The embodiments described above are provided to enable persons skilled in the art to make or use the invention. Persons skilled in the art may modify or vary these embodiments without departing from the inventive concept of the present invention, so the scope of protection of the present invention is not limited by the embodiments described above but should be accorded the widest scope consistent with the innovative features set forth in the claims.

Claims

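1. A method for customizing the voice timbre of an automobile prompt tone, characterized by comprising the following steps:

step S1, inputting a sound with a designated timbre;

step S2, storing the input sound data;

step S3, extracting the timbre of the input sound data and synthesizing the extracted timbre with the original voice prompt tone data to generate customized voice prompt tone data corresponding to the designated timbre;

and step S4, storing and outputting the customized voice prompt tone data.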

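2. The method for customizing the voice timbre of an automobile prompt tone according to claim 1, wherein step S3 further comprises:

step S31, analyzing the sound spectrum through Fourier transform and extracting the timbre features of the input sound data;

step S32, extracting the content feature information of the original voice prompt tone data;

step S33, synthesizing the timbre features and the content feature information to generate voice prompt tone data corresponding to the designated timbre.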

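3. The method for customizing the voice timbre of an automobile prompt tone according to claim 2, wherein step S31 further comprises:

step S311, decomposing the input sound data into frames;

step S312, calculating the periodogram estimate of the power spectrum for the audio of each frame;

step S313, applying the mel filter bank to the power spectrum and calculating the energy sum of each mel filter;

step S314, calculating the logarithm of each energy sum;

step S315, performing a discrete cosine transform on the logarithmic energies;

and step S316, retaining coefficients 2 through 13 of the discrete cosine transform result as the timbre features and discarding the remaining coefficients.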

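4. The method for customizing the voice timbre of an automobile prompt tone according to claim 3, wherein step S33 further comprises:

step S331, classifying the extracted timbre feature information according to the frequency spectrum;

step S332, expanding the timbre feature information as a series and taking its main part;

step S333, sorting the content feature information and combining it with the timbre feature information to generate voice spectrum data corresponding to the designated timbre;

and step S334, performing an inverse frequency-domain transform on the voice spectrum data corresponding to the designated timbre and outputting the voice prompt tone data corresponding to the designated timbre.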

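5. The method for customizing the voice timbre of an automobile prompt tone according to claim 2, wherein step S33 further comprises:

synthesizing the timbre features and the content feature information after training through a deep neural network algorithm.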

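6. A system for customizing the voice timbre of an automobile prompt tone, characterized by comprising a user end, a vehicle end and a server:

the user end is connected with the vehicle end, inputs the sound with the designated timbre and outputs the customized voice prompt tone;

the vehicle end is connected with the server, receives and stores the input sound data and sends it to the server, sends the original voice prompt tone data to the server, and receives and stores the customized voice prompt tone data and sends it to the user end;

and the server extracts the timbre of the input sound data and synthesizes it with the original voice prompt tone data to generate customized voice prompt tone data corresponding to the designated timbre.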

7. The system of claim 6, wherein the server analyzes the sound spectrum through Fourier transform, extracts the timbre features of the input sound data, extracts the content feature information of the original voice prompt tone data, and synthesizes the timbre features and the content feature information to generate the voice prompt tone data corresponding to the designated timbre.

8. The system of claim 7, wherein the server decomposes the input sound data into frames, calculates the periodogram estimate of the power spectrum for the audio of each frame, applies the mel filter bank to the power spectrum, calculates the energy sum of each mel filter, calculates the logarithm of each energy sum, performs a discrete cosine transform on the logarithmic energies, retains coefficients 2 through 13 of the discrete cosine transform result as the timbre features, and discards the remaining coefficients.

9. The system of claim 8, wherein the server classifies the extracted timbre feature information according to the frequency spectrum, expands the timbre feature information as a series and takes its main part, sorts the content feature information and combines it with the timbre feature information to generate voice spectrum data corresponding to the designated timbre, performs an inverse frequency-domain transform on that voice spectrum data, and outputs the voice prompt tone data corresponding to the designated timbre.

10. The system of claim 7, wherein the server synthesizes the timbre features and the content feature information after training through a deep neural network algorithm.