CN110992970A - Audio synthesis method and related device

Info

Publication number: CN110992970A
Application number: CN201911289583.3A
Authority: CN (China)
Prior art keywords: audio, time period, accompaniment, clip, right channel
Legal status: Granted; currently Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN110992970B (en)
Inventor: 闫震海 (Yan Zhenhai)
Current Assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd

Events:
• Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd, with priority to CN201911289583.3A
• Publication of CN110992970A
• Application granted; publication of CN110992970B
• Anticipated expiration

Classifications

    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/007 Changing voice quality, characterised by the process used
    • G10L21/0272 Voice signal separating (speech enhancement)
    • H04S1/00 Two-channel systems
    • H04S7/305 Electronic adaptation of stereophonic audio signals to reverberation of the listening space
    • H04S7/306 Electronic adaptation for headphones
    • G10H2210/005 Musical accompaniment, i.e. complete instrumental rhythm synthesis added to a performed melody
    • G10H2210/301 Soundscape or sound field simulation, reproduction or control for musical purposes, e.g. surround or 3D sound

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Quality & Reliability (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Stereophonic System (AREA)

Abstract

The embodiment of the application discloses an audio synthesis method and a related device. The method includes: determining at least one first time period and at least one second time period within the playing duration of a target accompaniment according to the timestamp of the target accompaniment, wherein there is at least one first time period that does not coincide with the at least one second time period; determining the left channel audio and the right channel audio of a first audio; determining the left channel audio and the right channel audio of a second audio; mixing the left channel audio and the right channel audio of the first audio into the left channel and the right channel of the target accompaniment respectively in the at least one first time period; and mixing the left channel audio and the right channel audio of the second audio into the left channel and the right channel of the target accompaniment respectively in the at least one second time period, to obtain the synthesized audio. The audio synthesis method provided by the embodiment of the application can efficiently synthesize music in the binaural double-tone form and improves its listening effect.

Description

Audio synthesis method and related device
Technical Field
The present application relates to the field of audio processing, and in particular, to an audio synthesis method and related apparatus.
Background
With the development of the times, people's material standard of living keeps improving, and their spiritual and cultural demands grow ever higher. Music is one of the most popular art forms and has become part of everyday life, and people's enthusiasm for listening to music keeps rising. For example, according to QQ Music statistics, "Waiting For You After Class", sung by Jay Chou (Zhou Jielun) and Gary Yang (Yang Ruidai), exceeded one hundred million plays on its first day online, and was subsequently covered in dozens of versions in English, French, German, Spanish and other languages on platforms such as QQ Music, YouTube and Bilibili.
With the development of computer technology and the continuous improvement of people's music appreciation, traditional music effects can no longer satisfy listeners, so diversified music forms need to be produced, and a series of problems arising in that process urgently need to be solved. For example, binaural double-tone music is an emerging form in which two different pieces of music are played in the left channel and the right channel respectively, so that a user can listen to two different pieces of music at the same time. However, binaural double-tone music is complex to produce, and the synthesized result sounds hollow in the middle, which makes it difficult to meet users' needs.
Disclosure of Invention
The embodiment of the application discloses an audio synthesis method and a related device, which can efficiently synthesize music in the binaural double-tone form and improve the listening effect of music in the binaural double-tone form.
In a first aspect, an embodiment of the present application provides an audio synthesis method, including: determining at least one first time period and at least one second time period within the playing duration of the target accompaniment according to the timestamp of the target accompaniment; wherein there is at least one first time period that does not coincide with the at least one second time period;
determining a left channel audio of the first audio and a right channel audio of the first audio;
determining a left channel audio of the second audio and a right channel audio of the second audio;
mixing the left channel audio of the first audio into the left channel of the target accompaniment in at least one first time period, and mixing the right channel audio of the first audio into the right channel of the target accompaniment in at least one first time period; and mixing the left channel audio of the second audio into the left channel of the target accompaniment in at least one second time period, and mixing the right channel audio of the second audio into the right channel of the target accompaniment in at least one second time period to obtain the synthesized audio.
Existing binaural double-tone music is typically produced by simply taking two pieces of music as mono audio, placing the first piece in the left channel and the other piece in the right channel, and synthesizing the two into one binaural double-tone track. In the synthesized music, the sound sources of the left and right channels are independent: because the right channel contains none of the first piece of music, the sound image of the first piece is concentrated at the far left; similarly, the sound image of the right-channel music is concentrated at the far right. The middle position sounds hollow, and the listening experience is poor. With the audio synthesis method provided by the embodiment of the application, the first time period and the second time period can be determined according to the timestamp of the target accompaniment, and the first audio and the second audio are mixed into the left and right channels of the target accompaniment according to these time periods. During synthesis, the first audio can be modulated to obtain a left channel audio and a right channel audio, which are mixed into the left and right channels of the accompaniment respectively in the first time period; similarly, the left and right channel audio of the second audio are mixed into the left and right channels of the accompaniment respectively in the second time period. Audio in the binaural double-tone form is thus synthesized efficiently, and its listening effect is improved. Furthermore, since the first and second time periods used for mixing are determined according to the timestamp of the target accompaniment, the tempos of the music synthesized into the left and right channels correspond to each other, avoiding the problem of misaligned tempos and further improving the listening effect.
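As a concrete illustration of this mixing step, below is a minimal Python/NumPy sketch. It relies on conventions the application does not fix: float32 stereo arrays of shape (num_samples, 2) already aligned to the accompaniment's timeline, time periods given as (start, end) pairs in seconds, and simple additive mixing with clipping protection; the function name is illustrative, not the patent's mandated implementation.

```python
import numpy as np

def mix_into_accompaniment(accomp, first_lr, second_lr,
                           first_periods, second_periods, sr=44100):
    """Mix the left/right channel audio of the first and second audio into
    the left/right channels of the target accompaniment, each audio only
    within its own time periods."""
    out = accomp.astype(np.float32)  # astype returns a copy; original untouched
    for periods, vocal in ((first_periods, first_lr), (second_periods, second_lr)):
        for start, end in periods:
            a, b = int(start * sr), int(end * sr)
            out[a:b, 0] += vocal[a:b, 0]  # mix into the accompaniment's left channel
            out[a:b, 1] += vocal[a:b, 1]  # mix into the accompaniment's right channel
    return np.clip(out, -1.0, 1.0)        # guard against clipping after summation
```

Note that only the samples inside the given time periods are touched, which also reflects the segment-wise mixing described below.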
In a possible implementation manner of the first aspect, the mixing the left channel audio of the first audio into the left channel of the target accompaniment in at least one first time period and mixing the right channel audio of the first audio into the right channel of the target accompaniment in at least one first time period includes:
mixing a first audio clip into a left channel of a first accompaniment clip in at least one first time period, mixing a second audio clip into a right channel of the first accompaniment clip in the at least one first time period, wherein the first audio clip is an audio clip corresponding to the at least one first time period in the left channel audio of the first audio, the second audio clip is an audio clip corresponding to the at least one first time period in the right channel audio of the first audio, and the first accompaniment clip is a part of the target accompaniment corresponding to the at least one first time period;
the mixing of the left channel audio of the second audio into the left channel of the target accompaniment in at least one second time period and the mixing of the right channel audio of the second audio into the right channel of the target accompaniment in at least one second time period includes:
and mixing a third audio clip into the left channel of the second accompaniment clip in at least one second time period, mixing a fourth audio clip into the right channel of the second accompaniment clip in at least one second time period, wherein the third audio clip is the audio clip corresponding to at least one second time period in the left channel audio of the second audio, the second accompaniment clip is the part of the target accompaniment corresponding to at least one second time period, and the fourth audio clip is the audio clip corresponding to at least one second time period in the right channel audio of the second audio.
Because the first audio and the second audio each have a corresponding relationship with the target accompaniment, in order to ensure the synthesis effect when mixing, the audio synthesis apparatus first extracts, according to the first time period, the audio sequence corresponding to the first time period from the first audio and merges it into the part of the target accompaniment corresponding to the first time period; similarly, the segment of the second audio corresponding to the second time period is mixed into the part of the accompaniment corresponding to that segment. In this way, only the segments into which audio is mixed need to be synthesized, and the whole accompaniment does not need to be re-synthesized, which reduces the computational load and saves system resources.
In yet another possible implementation of the first aspect, the target accompaniment comprises at least two pieces of lyrics, and the determining at least one first time period and at least one second time period within the playing duration of the target accompaniment according to the timestamp of the target accompaniment includes:
determining the first time period according to one part of the at least two pieces of lyrics;
and determining the second time period according to another part of the at least two pieces of lyrics.
It can be seen that by dividing the lyrics into different paragraphs and mixing the first audio and the second audio into different lyric paragraphs, the effect of different singers singing one song together can be achieved. For example, if the user mixes his or her own cover and the original singer's version into the original accompaniment in different singing paragraphs, the user's voice and the singer's voice appear alternately, making the chorus effect more vivid. Furthermore, the lines sung by the user can be scored: lines whose scores are higher than or equal to a preset threshold are mixed into the original accompaniment using the user's audio, while lines whose scores are below the threshold are mixed into the original accompaniment using the original singer's audio, so that the synthesized audio is more pleasant to hear and the listening experience is improved.
In a possible implementation manner of the first aspect, the target accompaniment corresponds to N first lyric fragments and M second lyric fragments, where the first lyric fragments are verse lyric fragments and the second lyric fragments are chorus lyric fragments, or the first lyric fragments are lyric fragments sung by a first singer and the second lyric fragments are lyric fragments sung by a second singer, M being an integer greater than or equal to 1 and N being an integer greater than or equal to 1; the determining at least one first time period and at least one second time period within the playing duration of the target accompaniment according to the timestamp of the target accompaniment includes:
determining the at least one first time period according to the start timestamps and the end timestamps of the N first lyric fragments;
and determining the at least one second time period according to the start timestamps and the end timestamps of the M second lyric fragments, or according to the start timestamps and the end timestamps of the N first lyric fragments together with those of the M second lyric fragments.
It can be seen that the verse and the chorus can serve as the basis for dividing the lyric fragments, or the lyrics sung by different singers can serve as that basis. For example, since the verse of a song tends to be less familiar while the user knows the chorus well, the first audio sung by the original singer can be mixed into the verse part of the target accompaniment and the second audio sung by the user can be mixed into the chorus part. For another example, if a song is sung by multiple singers, the first audio may be mixed into the lyric fragments sung by one singer and the second audio into the fragments sung by another, creating an antiphonal effect between the two audios and making the synthesized audio more diverse.
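A minimal sketch of deriving the time periods from such fragments, assuming each lyric fragment carries start and end timestamps in seconds (the concrete lyric-file format is not specified by the application, and the numbers are hypothetical):

```python
# Hypothetical lyric fragments as (start_s, end_s) timestamp pairs.
verse_fragments = [(12.0, 45.5), (80.0, 110.0)]    # N first (verse) fragments
chorus_fragments = [(46.0, 79.5), (111.0, 150.0)]  # M second (chorus) fragments

# First time periods come from the first fragments' start/end timestamps.
first_time_periods = list(verse_fragments)

# Second time periods come from the second fragments alone, or, in the
# variant where the second audio covers both fragment types, from all
# fragments in playback order.
second_time_periods = list(chorus_fragments)
second_time_periods_full = sorted(verse_fragments + chorus_fragments)
```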
In a possible implementation manner of the first aspect, the determining at least one first time period and at least one second time period within the playing time duration of the target accompaniment according to the timestamp of the target accompaniment includes:
receiving input first information, wherein the first information indicates at least one first time period and at least one second time period;
at least one first time period and at least one second time period are determined from the first information.
Therefore, the user can set the mixing time of the first audio and the second audio according to the preference of the user, so that the personalized listening effect is constructed, and the interestingness of listening experience is increased.
In a possible implementation manner of the first aspect, the determining the left channel audio of the first audio and the right channel audio of the first audio includes:
convolving the first audio with the head-related transfer functions from the position of the sound source of the first audio to the left ear and to the right ear, to obtain the left channel audio of the first audio and the right channel audio of the first audio; the position of the sound source of the first audio is a preset sound image position or a position indicated by a received first operation instruction;
the determining the left channel audio of the second audio and the right channel audio of the second audio includes:
convolving the second audio with the head-related transfer functions from the position of the sound source of the second audio to the left ear and to the right ear, to obtain the left channel audio of the second audio and the right channel audio of the second audio; the position of the sound source of the second audio is a preset sound image position or a position indicated by a received second operation instruction.
It can be seen that when determining the left channel audio and the right channel audio, sound-image modulation can be performed using head-related transfer functions, so that when the modulated left and right channel audio are played, the listener perceives the audio as coming from the position of the sound source, avoiding an extreme sound image. Compared with methods such as time delay or gain panning, modulating the audio with head-related transfer functions improves the realism of the sound image, enriches the music components at all angles, and improves the user's listening experience.
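A minimal sketch of this modulation, assuming time-domain head-related impulse responses (HRIRs) are available for the desired source position, e.g. from a public dataset such as MIT KEMAR (the application does not name a dataset, and the function name is illustrative):

```python
import numpy as np
from scipy.signal import fftconvolve

def binauralize(mono, hrir_left, hrir_right):
    """Convolve a mono source with the left-ear and right-ear impulse
    responses measured from the sound source position, yielding the left
    channel audio and right channel audio of that source."""
    left = fftconvolve(mono, hrir_left)[:len(mono)]
    right = fftconvolve(mono, hrir_right)[:len(mono)]
    return np.stack([left, right], axis=1)  # shape (num_samples, 2)
```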
In one possible implementation of the first aspect, the volume of the second audio is less than or equal to the volume of the first audio.
When synthesizing music in the binaural double-tone form, the volumes of the left channel audio and the right channel audio need to match each other: audio that is too loud can damage the listener's hearing, while audio that is too quiet prevents the listener from perceiving the intended effect, both of which degrade the listening experience. The method provided by the application can adjust the volume of the second audio according to the volume of the first audio, preventing the second audio from being so loud that it hurts the listener's ears and from being so quiet that its effect is inaudible, so that the volumes of the first and second audio are coordinated and the user's listening experience is improved.
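A minimal sketch of one way to enforce this constraint, assuming volume is compared by RMS level (the application does not specify a loudness measure):

```python
import numpy as np

def cap_second_volume(second, first):
    """Attenuate the second audio so its RMS volume does not exceed that
    of the first audio; the second audio is never amplified."""
    rms = lambda x: float(np.sqrt(np.mean(np.square(x), dtype=np.float64)))
    r_first, r_second = rms(first), rms(second)
    if r_second > r_first > 0.0:
        second = second * (r_first / r_second)  # scale down to match
    return second
```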
In one possible implementation manner of the first aspect, the volume of the first audio is a volume indicated by a third operation instruction, and the volume of the second audio is a volume indicated by a fourth operation instruction.
Therefore, the user can set the volumes of the first audio and the second audio according to the preference of the user, so that the personalized listening effect is constructed, and the interestingness and the flexibility of the synthesized music are increased.
In a second aspect, an embodiment of the present application provides an audio synthesizing apparatus, including:
the determining unit is used for determining at least one first time period and at least one second time period within the playing duration of the target accompaniment according to the timestamp of the target accompaniment; wherein there is at least one first time period that does not coincide with the at least one second time period;
a modulation unit for determining a left channel audio of the first audio and a right channel audio of the first audio;
the modulation unit is further used for determining the left channel audio of the second audio and the right channel audio of the second audio;
a synthesizing unit, for mixing the left channel audio of the first audio into the left channel of the target accompaniment in at least one first time period, and mixing the right channel audio of the first audio into the right channel of the target accompaniment in at least one first time period; and mixing the left channel audio of the second audio into the left channel of the target accompaniment in at least one second time period, and mixing the right channel audio of the second audio into the right channel of the target accompaniment in at least one second time period to obtain the synthesized audio.
Existing binaural double-tone music is typically produced by simply taking two pieces of music as mono audio, placing the first piece in the left channel and the other piece in the right channel, and synthesizing the two into one binaural double-tone track. In the synthesized music, the sound sources of the left and right channels are independent: because the right channel contains none of the first piece of music, the sound image of the first piece is concentrated at the far left; similarly, the sound image of the right-channel music is concentrated at the far right. The middle position sounds hollow, and the listening experience is poor. The audio synthesis apparatus provided by the embodiment of the application can determine the first time period and the second time period according to the timestamp of the target accompaniment, and mix the first audio and the second audio into the left and right channels of the target accompaniment according to these time periods. During synthesis, the first audio can be modulated to obtain a left channel audio and a right channel audio, which are mixed into the left and right channels of the accompaniment respectively in the first time period; similarly, the left and right channel audio of the second audio are mixed into the left and right channels of the accompaniment respectively in the second time period. Audio in the binaural double-tone form is thus synthesized efficiently, and its listening effect is improved. Furthermore, since the first and second time periods used for mixing are determined according to the timestamp of the target accompaniment, the tempos of the music synthesized into the left and right channels correspond to each other, avoiding the problem of misaligned tempos and further improving the listening effect.
In a possible implementation manner of the second aspect, the synthesizing unit is configured to mix a left channel audio of the first audio into a left channel of the target accompaniment in at least one first time period, and mix a right channel audio of the first audio into a right channel of the target accompaniment in at least one first time period, specifically:
mixing a first audio clip into a left channel of a first accompaniment clip in at least one first time period, mixing a second audio clip into a right channel of the first accompaniment clip in the at least one first time period, wherein the first audio clip is an audio clip corresponding to the at least one first time period in the left channel audio of the first audio, the second audio clip is an audio clip corresponding to the at least one first time period in the right channel audio of the first audio, and the first accompaniment clip is a part of the target accompaniment corresponding to the at least one first time period;
the synthesizing unit is further configured to mix a left channel audio of the second audio into a left channel of the target accompaniment in at least one second time period, and mix a right channel audio of the second audio into a right channel of the target accompaniment in at least one second time period, specifically:
and mixing a third audio clip into the left channel of the second accompaniment clip in at least one second time period, mixing a fourth audio clip into the right channel of the second accompaniment clip in at least one second time period, wherein the third audio clip is the audio clip corresponding to at least one second time period in the left channel audio of the second audio, the second accompaniment clip is the part of the target accompaniment corresponding to at least one second time period, and the fourth audio clip is the audio clip corresponding to at least one second time period in the right channel audio of the second audio.
Because the first audio and the second audio each have a corresponding relationship with the target accompaniment, in order to ensure the synthesis effect when mixing, the audio sequence corresponding to the first time period is first extracted from the first audio according to the first time period and merged into the part of the target accompaniment corresponding to the first time period; similarly, the segment of the second audio corresponding to the second time period is mixed into the part of the accompaniment corresponding to that segment. In this way, only the segments into which audio is mixed need to be synthesized, and the whole accompaniment does not need to be re-synthesized, which reduces the computational load and saves system resources.
In one possible embodiment of the second aspect, the target accompaniment comprises at least two pieces of lyrics; the determining unit is configured to determine at least one first time period and at least one second time period within a playing duration of the target accompaniment according to the timestamp of the target accompaniment, and specifically includes:
determining a first time period according to one part of the at least two pieces of lyrics;
a second time period is determined based on another portion of the at least two pieces of lyrics.
It can be seen that by dividing the lyrics into different paragraphs and mixing the first audio and the second audio into different lyric paragraphs, the effect of different singers singing one song together can be achieved. For example, if the user mixes his or her own cover and the original singer's version into the original accompaniment in different singing paragraphs, the user's voice and the singer's voice appear alternately, making the chorus effect more vivid. Furthermore, the lines sung by the user can be scored: lines whose scores are higher than or equal to a preset threshold are mixed into the original accompaniment using the user's audio, while lines whose scores are below the threshold are mixed into the original accompaniment using the original singer's audio, so that the synthesized audio is more pleasant to hear and the listening experience is improved.
In a possible implementation manner of the second aspect, the target accompaniment corresponds to N first lyric fragments and M second lyric fragments, where the first lyric fragments are verse lyric fragments and the second lyric fragments are chorus lyric fragments, or the first lyric fragments are lyric fragments sung by a first singer and the second lyric fragments are lyric fragments sung by a second singer, M being an integer greater than or equal to 1 and N being an integer greater than or equal to 1; the determining unit is configured to determine at least one first time period and at least one second time period within the playing duration of the target accompaniment according to the timestamp of the target accompaniment, specifically:
determining the at least one first time period according to the start timestamps and the end timestamps of the N first lyric fragments;
and determining the at least one second time period according to the start timestamps and the end timestamps of the M second lyric fragments, or according to the start timestamps and the end timestamps of the N first lyric fragments together with those of the M second lyric fragments.
It can be seen that the verse and the chorus can serve as the basis for dividing the lyric fragments, or the lyrics sung by different singers can serve as that basis. For example, since the verse of a song tends to be less familiar while the user knows the chorus well, the first audio sung by the original singer can be mixed into the verse part of the target accompaniment and the second audio sung by the user can be mixed into the chorus part. For another example, if a song is sung by multiple singers, the first audio may be mixed into the lyric fragments sung by one singer and the second audio into the fragments sung by another, creating an antiphonal effect between the two audios and making the synthesized audio more diverse.
In a possible implementation manner of the second aspect, the apparatus further includes:
an input unit for receiving input first information, wherein the first information indicates at least one first time period and at least one second time period;
the determining unit is configured to determine at least one first time period and at least one second time period within a playing duration of the target accompaniment according to the timestamp of the target accompaniment, and specifically includes:
at least one first time period and at least one second time period are determined from the first information.
Therefore, the user can set the mixing times of the first audio and the second audio according to his or her own needs and preferences, constructing a personalized listening effect and making the listening experience more interesting.
In a possible implementation manner of the second aspect, the modulation unit is configured to determine the left channel audio of the first audio and the right channel audio of the first audio, specifically:
convolving the first audio with the head-related transfer functions from the position of the sound source of the first audio to the left ear and to the right ear, to obtain the left channel audio of the first audio and the right channel audio of the first audio; the position of the sound source of the first audio is a preset sound image position or a position indicated by a received first operation instruction;
the modulation unit is further configured to determine the left channel audio of the second audio and the right channel audio of the second audio, specifically:
convolving the second audio with the head-related transfer functions from the position of the sound source of the second audio to the left ear and to the right ear, to obtain the left channel audio of the second audio and the right channel audio of the second audio; the position of the sound source of the second audio is a preset sound image position or a position indicated by a received second operation instruction.
It can be seen that when determining the left channel audio and the right channel audio, sound-image modulation can be performed using head-related transfer functions, so that when the modulated left and right channel audio are played, the listener perceives the audio as coming from the position of the sound source, avoiding an extreme sound image. Compared with methods such as time delay or gain panning, modulating the audio with head-related transfer functions improves the realism of the sound image, enriches the music components at all angles, and improves the user's listening experience.
In one possible implementation of the second aspect, the volume of the second audio is less than or equal to the volume of the first audio.
When synthesizing music in the binaural double-tone form, the volumes of the left channel audio and the right channel audio need to match each other: audio that is too loud can damage the listener's hearing, while audio that is too quiet prevents the listener from perceiving the intended effect, both of which degrade the listening experience. The apparatus provided by the application can adjust the volume of the second audio according to the volume of the first audio, preventing the second audio from being so loud that it hurts the listener's ears and from being so quiet that its effect is inaudible, so that the volumes of the first and second audio are coordinated and the user's listening experience is improved.
In one possible implementation manner of the second aspect, the volume of the first audio is a volume indicated by a third operation instruction, and the volume of the second audio is a volume indicated by a fourth operation instruction.
Therefore, the user can set the volumes of the first audio and the second audio according to the preference of the user, so that the personalized listening effect is constructed, and the interestingness and the flexibility of the synthesized music are increased.
In a third aspect, an embodiment of the present application provides an audio synthesizing apparatus, including: the device comprises a processor and a memory, wherein the memory stores a computer program, and the processor is used for calling the computer program to execute the method provided by the first aspect of the embodiments of the present application or any implementation manner of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium, where a computer program is stored, and when the computer program runs on one or more processors, the computer program performs the method provided by the first aspect of the present application or any one implementation manner of the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer program product, where the computer program product includes: a computer readable storage medium, which in turn contains computer readable program code, which when executed by one or more processors, is configured to perform the method provided by the first aspect of the embodiments of the present application or any one of the implementations of the first aspect.
It is to be understood that the audio synthesis apparatus provided by the second aspect, the audio synthesis apparatus provided by the third aspect, the computer storage medium provided by the fourth aspect, and the computer program product provided by the fifth aspect are all configured to execute the audio synthesis method provided by the first aspect; for the beneficial effects they achieve, reference may be made to those of the method of the first aspect, which are not repeated here.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the embodiments of the present application or the background art will be briefly described below.
Fig. 1 is a schematic diagram of an architecture of an audio synthesis system according to an embodiment of the present application;
fig. 2 is a schematic view of an operation scenario of an audio synthesis system according to an embodiment of the present application;
fig. 3 is a schematic flowchart of an audio synthesizing method provided by an embodiment of the present application;
fig. 4 is a schematic diagram of a method for acquiring audio according to an embodiment of the present application;
fig. 5 is a schematic diagram of a method for determining time according to an embodiment of the present application;
fig. 6 is a schematic diagram of another method for determining time provided by an embodiment of the present application;
fig. 7 is a schematic diagram of another method for determining time provided by an embodiment of the present application;
fig. 8 is a schematic diagram of an audio mixing method provided in an embodiment of the present application;
fig. 9 is a schematic diagram of a method for modulating audio-visual signals according to an embodiment of the present application;
fig. 10 is a schematic diagram of a method for controlling volume according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of an audio synthesizing apparatus according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of another audio synthesizing apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
The term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1, fig. 1 is a schematic diagram illustrating a possible audio synthesis system according to an embodiment of the present application. The audio synthesis system may include an electronic device 101, a recording device 102, an audio playback device 103, and a network 104.
The electronic device 101 is a device with data processing capability, and may receive data transmitted by the recording device 102 and the audio playback device 103 via the network 104, or may transmit data to them. In particular implementations, the electronic devices described in this application include, but are not limited to, portable devices with data processing capabilities such as mobile phones, laptop computers, or tablet computers. It should also be understood that in some embodiments the electronic device is not a portable communication device, but a desktop computer or server with data processing capabilities. Of course, the electronic device may also be an in-vehicle device (e.g., a device mounted in an automobile, a bicycle, an electric vehicle, an airplane, a ship, etc.), a wearable device (e.g., a smart watch such as an iWatch, a smart bracelet, a pedometer, etc.), a smart home device (e.g., a refrigerator, a television, an air conditioner, an electric meter, etc.), a smart robot, or a system device for singing (e.g., a complete set of singing equipment in a KTV or in a small singing booth).
The recording device 102 is a device for collecting audio, such as a microphone, a microphone integrated module, and the like, and can record sound signals on a medium, so that the sound signals can form various audio file formats. The first audio and/or the second audio described later in this application may be acquired by the recording apparatus. When the first audio and the second audio are both collected by the recording device 102, the first audio and the second audio may be collected by one recording device, or the first audio and the second audio may be collected by two recording devices respectively.
The playback device 103 is a playback device such as a wired headphone, a Bluetooth headphone, or a stereo set, or a terminal integrated with a playback device, and can play back signals recorded on a medium. The playback device may be connected to the electronic device 101 via the network 104.
The network 104 may be a medium that provides a communication link between the electronic device 101, the recording device 102, and the audio playback device 103, or may be the internet that includes network devices and transmission media, but is not limited thereto. Network 104 may include various types of connection media, such as, but not limited to, wired links, wireless links (e.g., WIFI, bluetooth), fiber optic links, and the like.
Optionally, the function of recording the first audio and/or the second audio of the recording device may also be implemented by the electronic device, and similarly, the function of playing the audio by the playing device may also be implemented by the electronic device. For example, taking the electronic device 101 as a smart phone as an example, a microphone module may be integrated in the smart phone, so that the function of recording audio may be completed. The electronic equipment can also be connected with a wired earphone through an earphone interface to complete the function of playing audio.
It is understood that the number of the electronic devices 101, the recording devices 102, and the playing devices 103 in the architecture shown in fig. 1 is only an example, and in a specific implementation, the network architecture of the audio synthesis system may include any number of electronic devices, recording devices, and playing devices, for example, the electronic device 101 may be a server, or may be a server cluster composed of a plurality of servers.
Referring to fig. 2, fig. 2 is a schematic view of a possible audio synthesis system in operation according to an embodiment of the present application, including an electronic device 201, a recording device 202, and an audio playing device 203. During operation, the audio synthesis system processes an original vocal audio VOL1, a target accompaniment BGM, and a cover audio VOL2; the electronic device 201 involves the following units in its processing: a time determination unit 204, a channel separation unit 205, a left channel mixing unit 206, a right channel mixing unit 207, and an output unit 208.
It is to be understood that the above units and modules are functional modules divided according to function; in a specific implementation, some of them may be subdivided into smaller functional modules and some may be combined into one functional module, but whether subdivided or combined, the general flow the electronic device performs during audio synthesis is the same. For example, the time determination unit 204 performs the function of determining time and may simply be expressed as the determining unit in a specific implementation; likewise, the functions performed by the channel separation unit 205, the left channel mixing unit 206, the right channel mixing unit 207, and the output unit 208 can be performed by one synthesizing unit, and are divided into smaller functional modules here only for convenience of description. Generally, each functional module corresponds to its own program code (or program instructions); when the program code of these functional modules runs on a processor, the modules execute the corresponding procedures to implement the corresponding functions.
As an alternative embodiment, the recording device 202 records the cover audio VOL2 and sends it to the electronic device 201 over the network. Accordingly, the electronic device 201 receives the cover audio VOL2 sent by the recording device 202, and may obtain the original vocal audio VOL1 and the target accompaniment BGM in other manners (data search, music separation, and the like).
The electronic device 201 can determine, by the time determination unit 204, the time period in which the original vocal audio VOL1 is mixed into the target accompaniment BGM, denoted as the first time period for convenience of description. Likewise, the electronic device 201 can determine, by the time determination unit 204, the time period in which the cover audio VOL2 is mixed into the target accompaniment BGM, denoted as the second time period. The electronic device 201 can separate the left channel audio BGM_L and the right channel audio BGM_R of the target accompaniment BGM by the channel separation unit 205, mix the original vocal audio VOL1 into the left channel audio BGM_L of the target accompaniment according to the first time period by the left channel mixing unit 206 to obtain the left channel synthesized audio M_L, and similarly mix the cover audio VOL2 into the right channel audio BGM_R of the target accompaniment according to the second time period by the right channel mixing unit 207 to obtain the right channel synthesized audio M_R. The electronic device 201 outputs the left channel synthesized audio and the right channel synthesized audio as one synthesized audio file, i.e. the target audio, through the output unit 208, and sends it to the playing device 203. A minimal sketch of this flow follows.
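Under the assumptions of this walkthrough (VOL1 mixed only into the left channel during the first time period, VOL2 only into the right channel during the second), the unit chain can be sketched as follows; array shapes, sample rate, and the function name are illustrative, not part of the application:

```python
import numpy as np

def synthesize_target_audio(bgm, vol1, vol2, first_period, second_period, sr=44100):
    """bgm: stereo accompaniment of shape (n, 2); vol1/vol2: mono vocals of
    length n on the same timeline; periods are (start_s, end_s) tuples."""
    bgm_l, bgm_r = bgm[:, 0].copy(), bgm[:, 1].copy()  # channel separation unit
    a, b = (int(t * sr) for t in first_period)
    bgm_l[a:b] += vol1[a:b]                            # left channel mixing unit -> M_L
    a, b = (int(t * sr) for t in second_period)
    bgm_r[a:b] += vol2[a:b]                            # right channel mixing unit -> M_R
    return np.stack([bgm_l, bgm_r], axis=1)            # output unit: target audio
```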
Correspondingly, after receiving the target audio sent by the electronic device 201, the playing device 203 can play it: during the first time period the original vocal can be heard in the left channel, and during the second time period the cover vocal can be heard in the right channel, presenting a listening experience in the binaural double-tone form. Optionally, if the first time period and the second time period overlap, the original vocal and the cover vocal are heard in the left ear and the right ear respectively at the same time, presenting singing effects such as chorus or antiphonal singing.
Referring to fig. 3, fig. 3 is an audio synthesizing method provided by an embodiment of the present application, which may be implemented based on the audio synthesizing system shown in fig. 1, and the method includes, but is not limited to, the following steps:
s301: the electronic device acquires a target accompaniment, a first audio and a second audio.
Specifically, the target accompaniment may be the audio of the accompaniment part of a piece of target music, and the target music may be a music signal such as a song or a recorded master tape. Taking the song "Waiting For You After Class" as the target music, the target accompaniment may be its accompaniment part. The first audio is a piece of audio sung to the target accompaniment; for example, the first audio may be the audio of the original singer's vocal part in "Waiting For You After Class".
The electronic device (computer, mobile phone, etc.) may acquire the target accompaniment and the first audio by the following optional methods:
in the first mode, the first audio and the target accompaniment are extracted according to the target music. The electronic equipment acquires a data file of the target music, and extracts a voice part and an accompaniment part of the target music through a correlation algorithm for segmenting the voice part and the background part. For example, the electronic device obtains the audio frequency of the target music, wherein the audio frequency of the target music conforms to a center/Side (Mid/Sid, M/S) system, the vocal part belongs to a Mid channel, and the accompaniment part belongs to a Side channel, so that the audio frequency of the Mid channel is obtained as the first audio frequency, and the audio frequency of the Side channel is obtained as the target accompaniment.
In the second mode, the vocal part and the accompaniment part of the target music are found by searching: the vocal part can be determined as the first audio and the accompaniment part as the target accompaniment. For example, the electronic device stores the vocal audio or accompaniment audio of one or more pieces of music, or can acquire them through a network service. The user inputs keywords such as the name of the target music, the singer, or the album, or can search for vocal and accompaniment audio using methods such as song recognition (listening to a song to identify it).
Referring to fig. 4, fig. 4 is a schematic diagram of a possible method for acquiring the first audio and the target accompaniment provided by the embodiment of the present application; the user interface of the electronic device includes a search box component 401, a list information component 402, and a selection component 403. The user can input keywords such as the name of the target music or the singer through the search box component 401; for example, the user enters "Waiting For You After Class" as the search keyword, the electronic device sends a music search request to the server over the network to find the relevant vocal and accompaniment audio, and the results are displayed through the list information component 402, e.g. a piece of music named "Waiting For You After Class", sung by Jay Chou and Gary Yang, released in 2018. The electronic device may receive the user's selection operation through the selection component 403 and take the vocal audio indicated by the selection as the first audio. Optionally, the electronic device may pre-store an audio file of the first audio, or download the corresponding data file over the network to obtain the first audio. Similarly, the electronic device receives a user selection operation for the target accompaniment and takes the accompaniment indicated by that selection as the target accompaniment; it may pre-store an audio file of the target accompaniment, or download the corresponding data file over the network.
The second audio is another piece of audio sung to the target accompaniment; for example, the English cover vocal of "Waiting For You After Class" (English version) is a piece of audio sung to the accompaniment of "Waiting For You After Class". It should be understood that although the second audio is sung to the target accompaniment, the accompaniment actually underlying the second audio is not limited to the target accompaniment itself. For example, the accompaniment part of "Waiting For You After Class" (English version) is not exactly the same as the accompaniment of the original song, but the singing rhythm of the English version is adapted to the original musical accompaniment, so it can still be regarded as a piece of audio sung to the accompaniment of "Waiting For You After Class".
The electronic device may obtain the second audio by the following alternatives:
in the first aspect, the electronic device may obtain, through a microphone or the like, the input second audio, which may be an audio of a vocal part of a singer singing according to an accompaniment of the target music. For example, the electronic apparatus provides the singer with the audio of the accompaniment part of "wait you go to class", the singer performs singing with the microphone according to the accompaniment, and the electronic apparatus can acquire the audio of the singer singing "wait you go to class" with the microphone and use the audio as the second audio. The singer can be a person who can sing audio according to the accompaniment, such as a common user or a professional singer, and the person singing is represented as the singer for convenience of description, and the identity of the singer is not limited in this embodiment.
In the second manner, the electronic device acquires the second audio according to keywords such as the name or the singer. For example, multiple audios are stored in the electronic device, or the electronic device can obtain one or more audios through a network service; the user may input keywords such as the name, singer, or album, or search using methods such as song recognition. For example, the user enters "Waiting For You After Class (English version)" in the search box, the electronic device sends a music search request to the server over the network, presents the user with several audio options, receives the user's selection operation for the second audio, and takes the audio indicated by that selection as the second audio. Optionally, the electronic device may pre-store an audio file of the second audio, or download the corresponding data file over the network to obtain the second audio.
In the third scheme, the second audio is obtained from a second piece of music whose accompaniment is related to the target accompaniment. The electronic device acquires a data file of the second music, and extracts the vocal part and the accompaniment part of the second music through a correlation algorithm for separating the vocal part from the background part. For example, the electronic device obtains the audio of "Waiting For You After Class (English version)" as the second music, and takes the vocal part obtained by the vocal/accompaniment separation algorithm as the second audio.
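The separation algorithm itself is not specified in this embodiment. As a rough illustration only, the sketch below uses the classic center-channel cancellation trick, under the assumption that the lead vocal is mixed to the center of a stereo recording; the function name and array layout are illustrative assumptions, and a production system would more likely use a trained source-separation model.

```python
import numpy as np

def split_vocal_accompaniment(stereo):
    """Very rough vocal/accompaniment split for a stereo array of shape (n, 2).

    Assumes the lead vocal is panned to the center (identical in both
    channels), so the side signal (L - R) cancels it and serves as a crude
    accompaniment estimate, while the mid signal (L + R) keeps the vocal.
    """
    left, right = stereo[:, 0], stereo[:, 1]
    accompaniment_est = 0.5 * (left - right)  # center-panned vocal cancels out
    vocal_est = 0.5 * (left + right)          # vocal plus any centered instruments
    return vocal_est, accompaniment_est
```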
S302: the electronic device determines at least one first time period and at least one second time period.
Specifically, the electronic device determines a start time stamp and an end time stamp at which the first audio is mixed into the target accompaniment, and determines the time period between the start time stamp and the end time stamp as a first time period, wherein a time stamp (timestamp) is a piece of time data, usually a character sequence, that identifies a time position within the song. Accordingly, the electronic device determines a start time stamp and an end time stamp at which the second audio is mixed into the target accompaniment, and determines the time period between them as a second time period. The electronic device may determine the at least one first time period and the at least one second time period in any of the following optional ways:
In the first way, the at least one first time period and the at least one second time period are determined according to the lyric fragments corresponding to the target accompaniment. Specifically, the target accompaniment may have corresponding lyrics in which at least one first lyric fragment and at least one second lyric fragment are preset; for example, the first lyric fragment is a verse fragment and the second lyric fragment is a chorus fragment, or the first lyric fragment is sung by a first singer and the second lyric fragment is sung by a second singer. Since the target lyrics correspond to the target accompaniment, the time stamps of the lyrics correspond to the time stamps of the accompaniment. Thus, the electronic device determines a first time period from the start time stamp and the end time stamp of a first lyric fragment, indicating that the first audio is to be mixed into the target accompaniment within that fragment. Similarly, a second time period is determined from the start time stamp and the end time stamp of a second lyric fragment, indicating that the second audio is to be mixed into the target accompaniment within that fragment. Optionally, a second time period may be determined from the time periods corresponding to both a first lyric fragment and a second lyric fragment, indicating that the second audio is mixed into the target accompaniment in both fragments; similarly, a first time period may be determined from the time periods corresponding to both fragments, indicating that the first audio is mixed into the target accompaniment in both fragments. The specific implementation can be seen in the following cases:
Case 1: the target accompaniment corresponds to one or more verse lyric fragments and one or more chorus lyric fragments; the time period corresponding to a verse fragment is determined as a first time period, and the time period corresponding to a chorus fragment is determined as a second time period. Optionally, the electronic device divides the lyrics corresponding to the target accompaniment into verse fragments and chorus fragments, takes the time periods corresponding to the verse fragments as first time periods, and takes the time periods corresponding to the chorus fragments as second time periods. Referring to fig. 5, fig. 5 is a schematic diagram of a possible method for determining time periods according to an embodiment of the present application, which involves the lyrics 501 of the target music, the vocal part 502 of the target music, and the accompaniment part 503 of the target music. The electronic device may divide the lyrics 501 into verse and chorus fragments itself, or have them divided by a lyric dividing device. For example, if the target song is "Waiting For You After Class" sung by Zhou Jielun and Yang Ruidai, it can be seen from the lyrics that the verse runs from the moment the line "in the lane where you live" starts to be sung (time t1) to the moment the line "what you hear over your earphones, you cannot tell me" ends (time t2), and the chorus runs from the moment the line "lying on the playground of your school, watching the starry sky" starts (time t3) to the moment the line "it also means that I have gone far away" ends (time t4). Therefore, the time period between the start time t1 and the end time t2 of the verse can be determined as a first time period, and the time period between the start time t3 and the end time t4 of the chorus can be determined as a second time period.
Case 2: the lyrics of the target music are divided into a plurality of lyric fragments according to the singers; for example, the time periods of the fragments sung by the first singer are taken as first time periods, and the time periods of the fragments sung by the second singer are taken as second time periods. A singer may sing one or more lyric fragments. For example, referring to fig. 5, if the target music is "Waiting For You After Class" sung together by Zhou Jielun and Yang Ruidai, it can be seen from the lyrics that the fragment from the line "in the lane where you live" (starting at time t1) to the line "what you hear over your earphones, you cannot tell me" (ending at time t2) is sung by the first singer (Zhou Jielun), and the fragment from the line "lying on the playground of your school, watching the starry sky" (starting at time t3) to the line "it also means that I have gone far away" (ending at time t4) is sung together by the first singer and the second singer (Yang Ruidai). Therefore, the time period between the start time t1 and the end time t2 is determined as one first time period, the time period between the start time t3 and the end time t4 is determined as another first time period for the first singer, and, since the second singer also sings within it, the time period between t3 and t4 is also determined as a second time period.
In the second way, the lyrics of the target accompaniment comprise two parts: the time periods of the target accompaniment corresponding to the first part of the lyrics are determined as first time periods, and the time periods corresponding to the other part are determined as second time periods. For example, the lyrics of the target accompaniment are divided into a plurality of lyric sentences, and the first audio and the second audio are mixed in alternately according to the time periods corresponding to the lyric sentences. For another example, the electronic device may score the second audio sentence by sentence, determine the lyrics whose scores are lower than a preset threshold as the first part of the lyrics, and determine the lyrics whose scores are higher than or equal to the preset threshold as the second part of the lyrics; the time periods corresponding to the first part are then determined as first time periods, and the time periods corresponding to the second part as second time periods. The specific implementation can be seen in the following cases:
case 3, referring to fig. 6, fig. 6 is a schematic diagram of a possible method for determining time according to an embodiment of the present application, which involves processing of lyrics 601 of target music, vocal part 602 of the target music, and accompaniment part 603 of the target music. For example, the "wait you get lesson" sung together by zhou jen and yanui is used as the target music, and the lyrics may include the time when the lyrics start. The electronic device divides the lyrics of your class waiting into a plurality of lyrics, the first lyric is 'I rent an apartment in your lane', the starting time of the lyrics is known from '00: 13.89', the lyrics start singing within 13.89 seconds (time t5) of the music of 'your class waiting' and end singing before the starting of the next lyric (time t6), so the time period between the starting time of the first lyric sentence (time t5) and the starting time of the next lyric (time t6) is determined as the first time period. Similarly, the second lyric sentence is "to want to meet you without any chance", and it is known from the time of starting the lyric sentence "[ 00:22.88 ]" that the lyric starts being a lyric which starts singing 22.88 seconds (time t6) of the music of "class you equal", and ends singing before the starting time (time t 7) of the next lyric, so that the time period between the time of starting the second lyric sentence (time t6) and the time of starting the next lyric (time t 7) is determined as the second time period. It is understood that the electronic device further determines a time period (from time t7 to time t 8) during which the lyrics of the third sentence ("three high decades why i do not read well") sing as the first time period, and so on, so that the first audio and the second audio are mixed alternately according to the rhythm of each lyric, thereby forming a antiphonal singing effect and increasing the interest and interactivity of music.
Case 4: the electronic device determines, according to the singing score of the second audio, the time periods corresponding to the lyrics whose scores are lower than a preset threshold as first time periods, and the time periods corresponding to the lyrics whose scores are higher than or equal to the preset threshold as second time periods. Specifically, the electronic device extracts, sentence by sentence, the audio sequences in the second audio in which the lyric sentences are sung. For example, taking the first audio as the original vocal audio, the electronic device may score the audio sequence of the first lyric sentence in the second audio through a scoring algorithm, according to the pitch and rhythm of the same sentence in the first audio, and record the user's scoring data. To ensure that the mixing effect better meets the user's expectations, a score threshold may be set: if the score of the first lyric sentence is not lower than the preset threshold, the time period of that sentence is determined as a second time period, that is, the singer's own voice is mixed in for that sentence. According to the scoring data, the time periods of the lyrics whose scores are lower than the preset threshold are determined as first time periods, that is, those lyrics are sung by the original vocal.
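Sketched below under the assumption that a per-sentence score is already available (the scoring algorithm itself is not detailed in this embodiment); the helper name and data shapes are illustrative.

```python
def periods_by_score(periods, scores, threshold):
    """Case 4: sentences the user sang well (score >= threshold) keep the
    user's voice (second time periods); the rest fall back to the original
    vocal (first time periods)."""
    first, second = [], []
    for period, score in zip(periods, scores):
        (second if score >= threshold else first).append(period)
    return first, second

# Example: with threshold 80, sentences scored [90, 60, 85] give one first
# time period (the 60) and two second time periods.
first_periods, second_periods = periods_by_score(
    [(13.89, 22.88), (22.88, 31.50), (31.50, 40.00)], [90, 60, 85], 80)
```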
In the third way, the electronic device may receive a user selection operation for the first time periods and, similarly, a user selection operation for the second time periods, and generate selection information according to these operations; for convenience of description, the information generated by the user selection operations is referred to as the first information, and the at least one first time period and/or the at least one second time period are determined according to the first information. Referring to fig. 7, fig. 7 is a schematic diagram of yet another possible method for determining time periods according to an embodiment of the present application, including an electronic device 70, a first audio track 701, a second audio track 702, a third audio track 703, a fourth audio track 704, and a time selection control 705, where an audio track represents a parallel "track" in which a piece of audio is placed. The electronic device may place the first audio in the first audio track, place the left channel and the right channel of the target accompaniment in the second audio track and the third audio track, respectively, and place the second audio in the fourth audio track. When determining the mixing times, the user may select the time periods for mixing via the control 705. Optionally, the user may cut the first audio into a plurality of audio sequences and drag the cut sequences along the time axis to determine the first time periods. Similarly, the user may cut the second audio into a plurality of audio sequences and drag the cut sequences along the time axis to determine the second time periods.
Optionally, the electronic device divides the first audio and/or the second audio into a plurality of audio sequences according to the sentences in the lyrics. For each lyric sentence, the electronic device receives the user's choice of which audio to mix in: after the user clicks a lyric sentence, the electronic device determines whether the time period of that sentence should receive the audio sequence of the first audio singing it, the audio sequence of the second audio singing it, or both mixed in at the same time. After receiving the user's selection operations, the electronic device determines the first time periods and/or the second time periods. Optionally, the electronic device may let the user audition the audio sequences before deciding which of them to mix in at the time period of each lyric sentence, so as to hand the time allocation over to the user as completely as possible: the chorus pattern is then entirely decided by the user.
Optionally, after the time periods of the first audio and the second audio are determined, the first audio and the second audio remain two independent audio tracks, and their audio data is zeroed out at the times where they should not appear. Any of the above schemes for determining the time periods may be selected by the user, and one of them may be set as the default option.
S303: the electronic device determines left channel audio of the first audio and right channel audio of the first audio, and determines left channel audio and right channel audio of the second audio.
Specifically, the electronic device may determine the left channel audio and the right channel audio of the first audio in the following optional ways:
In the first mode, the first audio already comprises left and right channels, and its left channel audio and right channel audio are obtained by channel separation.
In the second mode, the left channel audio and the right channel audio of the first audio are determined by performing sound image modulation on the first audio. Specifically, the electronic device performs sound image modulation on the first audio to obtain its left channel audio and right channel audio, and performs sound image modulation on the second audio to obtain its left channel audio and right channel audio. For the sound image modulation, the electronic device may use a time delay method, an intensity difference method or a head related transfer function, so that the modulated sound is heard as if it were coming from a certain position.
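As one concrete instance of the intensity difference method, the constant-power pan law below places a mono source between the two ears purely by gain; this is a simplified sketch for illustration, not the HRTF approach the embodiment describes next.

```python
import numpy as np

def pan_mono(mono, pan):
    """Intensity-difference (constant-power) panning of a mono signal.

    pan runs from -1.0 (hard left) to +1.0 (hard right); cosine/sine gains
    keep the total power roughly constant as the image moves.
    """
    theta = (pan + 1.0) * np.pi / 4.0   # map [-1, 1] onto [0, pi/2]
    return np.cos(theta) * mono, np.sin(theta) * mono  # (left, right)
```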
Optionally, the electronic device may modulate the first audio and the second audio using head related transfer functions (HRTFs). The head related transfer function, also known as the anatomical transfer function (ATF), underlies an audio localization algorithm that can generate stereo audio by using the interaural time difference (ITD), the interaural amplitude difference (IAD) and the filtering effect of the pinna, so that a listener perceives surround audio when the sound is transmitted to the pinna, ear canal and eardrum. Hearing is the result of sound propagating through space: the sound changes on its way from the source to the eardrum, this change can be regarded as the filtering effect of the two ears on the sound, and HRTF-processed audio can simulate this filtering effect. That is, a listener can judge the position of the sound source from HRTF-processed audio. The electronic device can perform sound image modulation with head related transfer functions through a sound image modulation platform; a number of open-source HRTF libraries are available, such as the MIT HRTF database of the Massachusetts Institute of Technology, the CIPIC HRTF database of the University of California, Davis, the Microsoft HRTF database and the Peking University HRTF database. Head related transfer functions can also be obtained through HRTF modeling and calculation.
Specifically, when performing sound image modulation on the first audio, the electronic device first determines the position of the sound source of the audio, obtains the head related transfer functions corresponding to that position, and convolves the audio with the head related transfer functions from the source position to the left ear and to the right ear, respectively, to obtain the two-channel audio. For convenience of description, the user operation instruction for the position of the sound source of the first audio is referred to as the first operation instruction, and the user operation instruction for the position of the sound source of the second audio as the second operation instruction.
Referring to fig. 8, fig. 8 is a schematic diagram of the effect of sound image modulation provided by an embodiment of the present application, and includes a first position 801, a second position 802, and a listener 803. The position of a sound source can be represented by three-dimensional coordinates, for example [azimuth, elevation, distance]; the first position is the position of the sound source of the first audio, and the second position is the position of the sound source of the second audio. Taking the CIPIC library as an example, the electronic device takes the head related transfer function of the first position [30, 15, 1.5], whose values from the first position to the left and right ears are denoted H_1L and H_1R, respectively. The electronic device takes the head related transfer function of the second position [-30, 16, 1.6], whose values from the second position to the left and right ears are denoted H_2L and H_2R, respectively. The electronic device convolves the first audio with the transfer functions of the first position (H_1L and H_1R) to obtain the left channel audio and the right channel audio of the first audio, so that, on playback, the listener perceives the first audio as coming from the first position. Similarly, the second audio is convolved with the transfer functions of the second position (H_2L and H_2R) to obtain the left channel audio and the right channel audio of the second audio, so that, on playback, the listener perceives the second audio as coming from the second position.
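A minimal sketch of this convolution step, assuming the head-related impulse responses for the chosen position (e.g. [30, 15, 1.5]) have already been loaded from a database such as CIPIC; loading is database-specific and omitted here, and the function name is an illustrative assumption.

```python
import numpy as np
from scipy.signal import fftconvolve

def binauralize(mono, hrir_left, hrir_right):
    """Convolve a mono source with the left-ear and right-ear impulse
    responses measured from the source position (H_1L/H_1R in the text),
    yielding a two-channel signal perceived as coming from that position."""
    left = fftconvolve(mono, hrir_left)[: len(mono)]
    right = fftconvolve(mono, hrir_right)[: len(mono)]
    return np.stack([left, right], axis=1)  # shape (n, 2)
```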
S303: the electronic equipment mixes the left channel audio and the right channel audio of the first audio into the left channel of the target accompaniment respectively in at least one first time period, and mixes the left channel audio and the right channel audio of the second audio into the right channel of the target accompaniment respectively in at least one second time period to obtain the synthetic audio.
Specifically, the electronic device may divide the target accompaniment into the left channel audio of the target accompaniment and the right channel audio of the target accompaniment; the target accompaniment may be in a stereo format, in which the separated left channel audio and right channel audio are not identical. It is to be understood that the function of separating the target accompaniment may be performed by a channel separation unit in the electronic device, by a module in the synthesis unit, or by another device, which is not limited herein.
The electronic device mixes the left channel audio of the first audio into the left channel audio of the target accompaniment and the right channel audio of the first audio into the right channel audio of the target accompaniment according to the first time period, and mixes the left channel audio of the second audio into the left channel audio of the target accompaniment and the right channel audio of the second audio into the right channel audio of the target accompaniment according to the second time period, thereby obtaining the synthesized audio. The process of mixing audio into the target accompaniment can proceed in the following two optional cases.
In the first case, in the first time period, the left channel audio of the first audio is mixed into the left channel of the target accompaniment and the right channel audio of the first audio is mixed into the right channel of the target accompaniment; in the second time period, the left channel audio of the second audio is mixed into the left channel of the target accompaniment and the right channel audio of the second audio is mixed into the right channel of the target accompaniment, thereby obtaining the synthesized audio.
For example, referring to fig. 5, when the first time period and the second time period are determined by the method described in case 1, t1 to t2 is the time period in which the first lyric fragment is sung, i.e. the first time period, and t3 to t4 is the time period in which the second lyric fragment is sung, i.e. the second time period.
For convenience of description, the first audio, the left channel of the target accompaniment, the right channel of the target accompaniment, and the second audio may be placed in the first track, the second track, the third track, and the fourth track, respectively. When mixing, the electronic device may synthesize the audio of the multiple tracks into one audio file according to the shared time axis of the tracks. When the target accompaniment reaches t1, the left channel audio and the right channel audio of the portion of the first audio corresponding to time t1 are mixed into the left channel and the right channel of the target accompaniment, respectively, and the mixing of the first audio stops when the target accompaniment reaches t2. Accordingly, when the target accompaniment reaches t3, the left channel audio and the right channel audio of the corresponding portion of the second audio are mixed into the left channel and the right channel of the target accompaniment, respectively, and the mixing of the second audio stops when the target accompaniment reaches t4, thereby obtaining the synthesized audio.
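A sketch of this first case, under the assumption that the accompaniment and both binauralized vocals are (n, 2) arrays on one shared time axis; gating simply zeroes each vocal outside its time periods before the tracks are summed. Names and shapes are illustrative.

```python
import numpy as np

def mix_by_periods(accomp, first_lr, second_lr,
                   first_periods, second_periods, sr):
    """accomp, first_lr, second_lr: arrays of shape (n, 2) on one time axis;
    periods: lists of (start_s, end_s) tuples; sr: sample rate in Hz."""
    def gate(audio, periods):
        mask = np.zeros(len(audio))
        for start, end in periods:
            mask[int(start * sr): int(end * sr)] = 1.0
        return audio * mask[:, None]  # zero outside the time periods

    return (accomp
            + gate(first_lr, first_periods)
            + gate(second_lr, second_periods))
```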
In the second case, the electronic device first acquires the portion of the first audio corresponding to the first time period, the portion of the second audio corresponding to the second time period, and the portions of the target accompaniment corresponding to the two time periods. For convenience of description, the left channel audio of the portion of the first audio corresponding to the first time period is referred to as the first audio clip, the right channel audio of that portion as the second audio clip, the left channel audio of the portion of the second audio corresponding to the second time period as the third audio clip, the right channel audio of that portion as the fourth audio clip, the portion of the target accompaniment corresponding to the first time period as the first accompaniment clip, and the portion of the target accompaniment corresponding to the second time period as the second accompaniment clip.
During synthesis, the first audio clip is mixed into the left channel of the first accompaniment clip, the second audio clip into the right channel of the first accompaniment clip, the third audio clip into the left channel of the second accompaniment clip, and the fourth audio clip into the right channel of the second accompaniment clip, so that the synthesized audio is obtained.
For example, referring to fig. 9, fig. 9 is a schematic diagram of a possible mixing method provided by an embodiment of the present application, and includes a first audio clip 901, a second audio clip 902, the left channel audio 903 of the target accompaniment, the right channel audio 904 of the target accompaniment, a third audio clip 905 and a fourth audio clip 906. The first audio clip 901 and the second audio clip 902 are the left channel audio and the right channel audio of the portion of the first audio in the time period from t1 to t2, respectively, and the third audio clip 905 and the fourth audio clip 906 are the left channel audio and the right channel audio of the portion of the second audio in the time period from t3 to t4, respectively. The electronic device mixes the first audio clip 901 into the left channel of the first accompaniment clip, the second audio clip 902 into the right channel of the first accompaniment clip, the third audio clip 905 into the left channel of the second accompaniment clip, and the fourth audio clip 906 into the right channel of the second accompaniment clip, to obtain the synthesized audio. With this method, the electronic device only needs to process the clips of the target accompaniment into which audio is actually mixed; the other clips of the target accompaniment need not be re-synthesized, which reduces the processing load of the electronic device.
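This second case can be sketched as in-place addition over just the affected sample ranges, which is where the saving comes from: samples outside the accompaniment clips are never touched. The helper below is an illustrative assumption, not code from the embodiment.

```python
import numpy as np

def mix_clip(accomp, vocal_lr, start_s, end_s, sr):
    """Add a two-channel vocal clip into the matching accompaniment clip
    only (e.g. the first/second audio clips into the first accompaniment
    clip over t1..t2); the rest of the accompaniment is left untouched."""
    start, end = int(start_s * sr), int(end_s * sr)
    accomp[start:end] += vocal_lr[start:end]  # left and right columns at once
    return accomp
```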
Optionally, before mixing, the first audio may be modulated to obtain its left channel audio and right channel audio, and correspondingly the second audio may be modulated to obtain its left channel audio and right channel audio. For example, the first audio may also be separated into left channel audio and right channel audio by channel separation.
Optionally, when mixing, the electronic device may adjust the volume of the audios, so as to avoid the mixed sound being too loud or too quiet. The electronic device can adjust the volume in the following optional ways:
In the first mode, the volume of the second audio is adjusted according to the volume of the first audio, such that the adjusted volume of the second audio is less than or equal to the volume of the first audio. Specifically, after determining the first audio, the electronic device may calculate its volume and adjust the volume of the second audio accordingly, for example so that the two volumes keep a fixed ratio or difference. Optionally, the volume of an audio may be represented by its root mean square (RMS) value, and the electronic device may adjust the volume of the second audio according to the RMS of the first audio. For example, the first audio may be the original vocal data1 of the target music and the second audio a piece of vocal data2 sung by the user; if the volume of the second audio before adjustment is Vol0 and the calculated volume of the first audio is Vol1, the volume of the second audio may be adjusted to Vol2 such that Vol2 satisfies the following equation:
Vol2 = min(Vol0, Vol1)
It can be seen that, after adjustment, the volume of the second audio is less than or equal to the volume of the first audio.
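A sketch of this RMS-based rule; rms() and match_volume() are illustrative names, and the audio is assumed to be a numpy array.

```python
import numpy as np

def rms(audio):
    """Root mean square value, used here as the volume measure."""
    return float(np.sqrt(np.mean(np.square(audio))))

def match_volume(second, first):
    """Scale the second audio so its volume becomes Vol2 = min(Vol0, Vol1),
    i.e. never louder than the first audio."""
    vol0, vol1 = rms(second), rms(first)
    if vol0 == 0.0:
        return second  # silent input; nothing to scale
    return second * (min(vol0, vol1) / vol0)
```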
In the second mode, an input third operation instruction and an input fourth operation instruction are received, the volume of the first audio is adjusted according to the information indicated by the third operation instruction, and the volume of the second audio is adjusted according to the information indicated by the fourth operation instruction. For example, the electronic device receives the operation instructions through an input device such as a touch screen or a keyboard; for convenience of description, the user operation instruction for the volume of the first audio is referred to as the third operation instruction, and the user operation instruction for the volume of the second audio as the fourth operation instruction. Referring to fig. 10, fig. 10 is a schematic diagram of a possible interface for adjusting volume provided in the present application, which includes an electronic device 100, a first track 1001, a second track 1002, a third track 1003, a fourth track 1004, a volume input box 1005 and a volume slide control 1006. The electronic device 100 places the first audio, the left channel audio of the target accompaniment, the right channel audio of the target accompaniment, and the second audio in the first, second, third and fourth tracks, respectively, and the user can adjust the volume of each track through the volume input box 1005 and the volume slide control 1006; for example, the user may slide the control to set the volume of the first audio in the first track to 40. Optionally, the electronic device may provide multiple options for the volume ratio between the first audio and the target accompaniment, or between the second audio and the target accompaniment, so that the user can choose the ratio between voice and accompaniment as desired.
Optionally, the electronic device may output the synthesized audio to an audio playing module in the electronic device for playback. Optionally, the electronic device may also output the synthesized audio as a target audio file, that is, the target two-channel audio, and transmit it to a playing device. The playing device may be a module integrated in the electronic device, or a device connected through a wired interface or a wireless interface (Bluetooth, WiFi, etc.). When the playing device plays the target two-channel audio, the first audio can be heard from the left during the first time periods and, correspondingly, the second audio from the right during the second time periods, presenting a listening experience in the binaural double-tone form. Optionally, if a first time period and a second time period overlap, the original vocal and the cover vocal are heard in the left ear and the right ear respectively, presenting a chorus effect.
In the method shown in fig. 3, the electronic device may determine the first time periods and the second time periods according to the time stamps of the target accompaniment, and mix the first audio and the second audio into the left and right channels of the target accompaniment according to those time periods. During synthesis, the first audio can be modulated to obtain a left channel audio and a right channel audio, which are mixed into the left channel and the right channel of the accompaniment, respectively, during the first time periods; similarly, the left channel audio and the right channel audio of the second audio are mixed into the left channel and the right channel of the accompaniment, respectively, during the second time periods. Audio in the binaural double-tone form is thereby synthesized efficiently, and the listening effect of music in this form is improved. Furthermore, because the first time periods and the second time periods for mixing are determined according to the time stamps of the target accompaniment, the rhythms of the music synthesized into the left and right channels correspond to each other, the problem of misaligned rhythms is avoided, and the listening effect of the music in the binaural double-tone form is further improved.
The method of the embodiments of the present application is set forth above in detail and the apparatus of the embodiments of the present application is provided below.
Referring to fig. 11, fig. 11 is a schematic structural diagram of an audio synthesizing apparatus 110 according to an embodiment of the present application, where the audio synthesizing apparatus 110 may include a determining unit 1101, a modulating unit 1102 and a synthesizing unit 1103, where details of each unit are as follows:
a determining unit 1101, configured to determine at least one first time period and at least one second time period within the playing duration of the target accompaniment according to the timestamp of the target accompaniment; wherein at least one of the first time periods does not coincide with the at least one second time period;
a modulation unit 1102 for determining a left channel audio of the first audio and a right channel audio of the first audio;
the modulation unit 1102 is further configured to determine a left channel audio of the second audio and a right channel audio of the second audio;
a synthesizing unit 1103 configured to mix a left channel audio of the first audio into a left channel of the target accompaniment for at least one first time period, and mix a right channel audio of the first audio into a right channel of the target accompaniment for at least one first time period; and mixing the left channel audio of the second audio into the left channel of the target accompaniment in at least one second time period, and mixing the right channel audio of the second audio into the right channel of the target accompaniment in at least one second time period to obtain the synthesized audio.
In existing binaural double-tone music, two pieces of music are simply set as mono audio: the first piece is placed in the left channel, the other in the right channel, and the two are synthesized into binaural double-tone music. The sound sources of the left channel and the right channel of such synthesized music are independent: because the right channel contains none of the first piece of music, the sound image of the first piece is concentrated at the extreme left; similarly, the sound image of the right channel music is concentrated at the extreme right. The middle position is hollow, and the listening experience is poor. The audio synthesis apparatus provided by the embodiment of the present application can determine the first time periods and the second time periods according to the time stamps of the target accompaniment, and mix the first audio and the second audio into the left and right channels of the target accompaniment according to those time periods. During synthesis, the first audio can be modulated to obtain a left channel audio and a right channel audio, which are mixed into the left channel and the right channel of the accompaniment, respectively, during the first time periods; similarly, the left channel audio and the right channel audio of the second audio are mixed into the left channel and the right channel of the accompaniment, respectively, during the second time periods. Audio in the binaural double-tone form is thereby synthesized efficiently, and the listening effect of music in this form is improved. Furthermore, because the first time periods and the second time periods for mixing are determined according to the time stamps of the target accompaniment, the rhythms of the music synthesized into the left and right channels correspond to each other, the problem of misaligned rhythms is avoided, and the listening effect of the music in the binaural double-tone form is further improved.
In a possible implementation manner, the synthesizing unit 1103 is configured to mix the left channel audio of the first audio into the left channel of the target accompaniment in at least one first time period, and mix the right channel audio of the first audio into the right channel of the target accompaniment in at least one first time period, specifically:
mixing a first audio clip into a left channel of a first accompaniment clip in at least one first time period, mixing a second audio clip into a right channel of the first accompaniment clip in the at least one first time period, wherein the first audio clip is an audio clip corresponding to the at least one first time period in the left channel audio of the first audio, the second audio clip is an audio clip corresponding to the at least one first time period in the right channel audio of the first audio, and the first accompaniment clip is a part of the target accompaniment corresponding to the at least one first time period;
the synthesizing unit 1103 is further configured to mix a left channel audio of the second audio into a left channel of the target accompaniment in at least one second time period, and mix a right channel audio of the second audio into a right channel of the target accompaniment in at least one second time period, specifically:
and mixing a third audio clip into the left channel of the second accompaniment clip in at least one second time period, mixing a fourth audio clip into the right channel of the second accompaniment clip in at least one second time period, wherein the third audio clip is the audio clip corresponding to at least one second time period in the left channel audio of the second audio, the second accompaniment clip is the part of the target accompaniment corresponding to at least one second time period, and the fourth audio clip is the audio clip corresponding to at least one second time period in the right channel audio of the second audio.
Because the first audio and the second audio correspond to the target accompaniment, when mixing, in order to ensure the synthesis effect, the audio sequence of the first audio corresponding to the first time period is first extracted according to the first time period and merged into the part of the target accompaniment corresponding to the first time period; similarly, the segment of the second audio corresponding to the second time period is mixed into the part of the accompaniment corresponding to that segment. Thus, when mixing, only the segments of the target accompaniment that actually receive audio need to be synthesized, and the whole accompaniment need not be re-synthesized, which reduces the computational load and saves system resources.
In one possible embodiment, the target accompaniment includes at least two pieces of lyrics; the determining unit 1101 is configured to determine at least one first time period and at least one second time period within the playing duration of the target accompaniment according to the timestamp of the target accompaniment, and specifically includes:
determining a first time period according to a part of the lyrics in the at least two lyrics;
a second time period is determined based on another portion of the at least two pieces of lyrics.
It can be seen that, by dividing the lyrics into different paragraphs and mixing the first audio and the second audio into different lyric paragraphs, the effect of different singers singing a song together can be achieved. For example, the user mixes both his or her own version and the original singer's version into the original accompaniment, at different singing paragraphs, so that the user's voice and the singer's voice appear in turn and the chorus effect is more vivid. Furthermore, the sentences sung by the user can be scored: the sentences whose scores are higher than or equal to a preset threshold are mixed into the original accompaniment as sung by the user, while the remaining sentences whose scores are lower than the threshold are covered by the original singer's audio, so that the synthesized audio is more pleasant to listen to and the listening experience is improved.
In a possible embodiment, the target accompaniment corresponds to N first lyric fragments and M second lyric fragments, where the first lyric fragments are verse lyric fragments and the second lyric fragments are chorus lyric fragments, or the first lyric fragments are lyric fragments sung by a first singer and the second lyric fragments are lyric fragments sung by a second singer, M being an integer greater than or equal to 1 and N being an integer greater than or equal to 1; the determining unit 1101 is configured to determine at least one first time period and at least one second time period within the playing duration of the target accompaniment according to the timestamp of the target accompaniment, specifically:
determining the at least one first time period according to the start time stamps and the end time stamps of the N first lyric fragments;
determining the at least one second time period according to the start time stamps and the end time stamps of the M second lyric fragments, or according to the start time stamps and the end time stamps of the N first lyric fragments together with the start time stamps and the end time stamps of the M second lyric fragments.
It can be seen that the verse and the chorus can serve as the basis for dividing the lyric fragments, or the lyrics sung by different singers can serve as that basis. For example, since the melody of the verse is less familiar while the user knows the chorus well, the first audio of the original vocal is mixed into the verse of the target accompaniment and the second audio sung by the user is mixed into the chorus. For another example, if a song is sung by multiple singers, the first audio may be mixed into the lyric fragments sung by one singer and the second audio into the fragments sung by another, forming an antiphonal effect between the two audios and diversifying the form of the synthesized audio.
In a possible implementation manner of the second aspect, the apparatus further includes:
an input unit 1104 for receiving input first information, wherein the first information indicates at least one first time period and at least one second time period;
the determining unit 1101 is configured to determine at least one first time period and at least one second time period within the playing duration of the target accompaniment according to the timestamp of the target accompaniment, and specifically includes:
at least one first time period and at least one second time period are determined from the first information.
Therefore, the user can determine the mixing times of the first audio and the second audio according to his or her own needs and preferences, so that a personalized listening effect is constructed and the listening experience becomes more interesting.
In a possible implementation manner, the modulation unit 1102 is configured to determine a left channel audio of the first audio and a right channel audio of the first audio, and specifically:
convolving the first audio with a first head related transfer function from the position of a sound source of the first audio to the left ear and the right ear, to obtain the left channel audio of the first audio and the right channel audio of the first audio; the position of the sound source of the first audio is a preset sound image position or a position indicated by a received first operation instruction;
the modulation unit 1102 is further configured to determine a left channel audio of the second audio and a right channel audio of the second audio, specifically:
convolving the second audio with a second head related transfer function from the position of the sound source of the second audio to the left ear and the right ear, to obtain the left channel audio of the second audio and the right channel audio of the second audio; the position of the sound source of the second audio is a preset sound image position or a position indicated by a received second operation instruction.
It can be seen that, when determining the left channel audio and the right channel audio, sound image modulation can be performed with head related transfer functions, so that when the modulated left channel audio and right channel audio are played, the listener perceives the audio as coming from the position of the sound source, avoiding an extreme sound image. Compared with the time delay method, the gain method and the like, modulating the audio with head related transfer functions improves the realism of the sound image, enriches the music components at all angles, and improves the listening experience of the user.
In one possible implementation, the volume of the second audio is less than or equal to the volume of the first audio.
When music in the binaural double-tone form is synthesized, the volumes of the left channel audio and the right channel audio need to match each other: if an audio is too loud it can damage the listener's hearing, and if it is too quiet the listener cannot perceive its effect, which harms the listening experience. The method provided by the present application can adjust the volume of the second audio according to the volume of the first audio, preventing the second audio from being so loud as to hurt the listener's ears and also from being so quiet that its effect is inaudible, so that the volumes of the first audio and the second audio are coordinated and the user's listening experience is improved.
In one possible implementation, the volume of the first audio is a volume indicated by a third operation instruction, and the volume of the second audio is a volume indicated by a fourth operation instruction. Therefore, the user can set the volumes of the first audio and the second audio according to the preference of the user, so that the personalized listening effect is constructed, and the interestingness and the flexibility of the synthesized music are increased.
It should be noted that the implementation of each operation may also correspond to the corresponding description of the method embodiment shown in fig. 3. The audio synthesizing apparatus 110 is an electronic device in the embodiment of the method shown in fig. 3.
Referring to fig. 12, fig. 12 is a schematic structural diagram of another audio synthesis apparatus 120 provided in the embodiment of the present application, where the audio synthesis apparatus may include a memory 1201, a processor 1202, and an input device 1203, where the memory 1201, the processor 1202, and the input device 1203 may be connected by a bus 1204 or in another manner, and the embodiment of the present application takes the bus connection as an example, and details of each unit are described below.
The memory 1201 is a storage device in the audio synthesizing apparatus that stores programs and data. It is understood that the memory 1201 here may include a built-in memory of the audio synthesizing apparatus, and may also include an extended memory supported by the audio synthesizing apparatus. The memory 1201 provides storage space that stores the operating system of the audio synthesizing apparatus and other data; the operating system may include, but is not limited to, the Android system, the iOS system, the Windows Phone system, etc., which is not limited in this application.
The processor 1202 (or Central Processing Unit (CPU)) is a computing core and a control core of the audio synthesis apparatus, and can parse various types of instructions in the audio synthesis apparatus and process various types of data of the audio synthesis apparatus, such as: the CPU may transmit various types of interactive data between the internal structures of the audio synthesizing apparatus, and the like.
The input device 1203 may be a device for recording audio, such as a microphone or an audio recording module, or a data acquisition module such as a keyboard, a mouse or a touch-sensitive display screen, which is not limited here. Optionally, the input device 1203 may be integrated into the apparatus 120, or connected to the apparatus through a data interface or a network interface. The audio synthesis apparatus may be an independent device, or may be integrated in a terminal such as a mobile phone or a computer, or in a server or a server cluster.
The memory 1201 may store a computer program, and the processor 1202 may be configured to call the computer program stored in the memory 1201 to execute the method provided by the embodiment shown in fig. 3.
It should be noted that the specific operations performed by the audio synthesis apparatus may also correspond to the corresponding descriptions with reference to the method embodiment shown in fig. 3. The audio synthesizing apparatus 120 is an electronic device in the embodiment of the method shown in fig. 3.
Embodiments of the present application further provide a computer-readable storage medium in which computer instructions are stored; when the instructions are executed on a processor, the operations performed by the electronic device in the embodiment shown in fig. 3 are implemented.
Embodiments of the present application further provide a computer program product, which when executed on a processor, implements the operations performed by the electronic device in the embodiment shown in fig. 3. Those skilled in the art will appreciate that the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. Such as Applications (APPs) for audio synthesis, plug-ins, etc.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the application to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in or transmitted over a computer-readable storage medium. The computer instructions may be transmitted from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

Claims (19)

1. A method of audio synthesis, wherein the audio to be synthesized comprises at least a first audio and a second audio, the method comprising:
determining at least one first time period and at least one second time period within the playing time of the target accompaniment according to the timestamp of the target accompaniment; wherein there is at least one of the first time periods that does not coincide with the at least one second time period;
determining a left channel audio of the first audio and a right channel audio of the first audio;
determining a left channel audio of the second audio and a right channel audio of the second audio;
mixing a left channel audio of the first audio into a left channel of the target accompaniment for the at least one first time period, and mixing a right channel audio of the first audio into a right channel of the target accompaniment for the at least one first time period; and mixing the left channel audio of the second audio into the left channel of the target accompaniment in the at least one second time period, and mixing the right channel audio of the second audio into the right channel of the target accompaniment in the at least one second time period to obtain the synthesized audio.
2. The method of claim 1, wherein mixing the left channel audio of the first audio into the left channel of the target accompaniment for the at least one first time period and mixing the right channel audio of the first audio into the right channel of the target accompaniment for the at least one first time period comprises:
mixing a first audio clip into a left channel of a first accompaniment clip within the at least one first time period, mixing a second audio clip into a right channel of the first accompaniment clip within the at least one first time period, wherein the first audio clip is an audio clip corresponding to the at least one first time period in the left channel audio of the first audio, the second audio clip is an audio clip corresponding to the at least one first time period in the right channel audio of the first audio, and the first accompaniment clip is a part of the target accompaniment corresponding to the at least one first time period;
the mixing the left channel audio of the second audio into the left channel of the target accompaniment within the at least one second time period, and mixing the right channel audio of the second audio into the right channel of the target accompaniment within the at least one second time period, comprises:
mixing a third audio clip into a left channel of a second accompaniment clip within the at least one second time period, and mixing a fourth audio clip into a right channel of the second accompaniment clip within the at least one second time period, wherein the third audio clip is an audio clip corresponding to the at least one second time period in the left channel audio of the second audio, the second accompaniment clip is a part of the target accompaniment corresponding to the at least one second time period, and the fourth audio clip is an audio clip corresponding to the at least one second time period in the right channel audio of the second audio.
3. The method of claim 1, wherein the target accompaniment comprises at least two pieces of lyrics, and the determining at least one first time period and at least one second time period according to the timestamps of the target accompaniment comprises:
determining the at least one first time period according to one part of the at least two pieces of lyrics;
and determining the at least one second time period according to another part of the at least two pieces of lyrics.
4. The method of claim 1, wherein the target accompaniment corresponds to N first lyric fragments and M second lyric fragments; the first lyric fragments are verse lyric fragments and the second lyric fragments are chorus lyric fragments, or the first lyric fragments are lyric fragments sung by a first singer and the second lyric fragments are lyric fragments sung by a second singer, where N and M are integers greater than or equal to 1; and the determining at least one first time period and at least one second time period within the playing duration of the target accompaniment according to the timestamps of the target accompaniment comprises:
determining the at least one first time period according to the start timestamps and end timestamps of the N first lyric fragments;
and determining the at least one second time period according to the start timestamps and end timestamps of the M second lyric fragments, or according to the start timestamps and end timestamps of the N first lyric fragments and the start timestamps and end timestamps of the M second lyric fragments.
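(Illustrative note, not part of the claims: under the alternative in claim 4 where both sets of fragment timestamps are used, one plausible reading is that the second time periods fill the gaps the first fragments leave within the playing duration. The helper below is a hypothetical sketch of that reading.)

```python
def complement_periods(duration_sec, periods):
    # Collect the gaps between sorted (start, end) periods within
    # [0, duration_sec]; overlapping periods are merged by the cursor.
    gaps, cursor = [], 0.0
    for start, end in sorted(periods):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < duration_sec:
        gaps.append((cursor, duration_sec))
    return gaps

# e.g. complement_periods(180.0, [(10.0, 40.0), (70.0, 100.0)])
# -> [(0.0, 10.0), (40.0, 70.0), (100.0, 180.0)]
```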
5. The method of claim 1, wherein the determining at least one first time period and at least one second time period within the playing duration of the target accompaniment according to the timestamps of the target accompaniment comprises:
receiving input first information, wherein the first information indicates the at least one first time period and the at least one second time period;
determining the at least one first time period and the at least one second time period according to the first information.
6. The method of any one of claims 1 to 5, wherein the determining the left channel audio of the first audio and the right channel audio of the first audio comprises:
convolving the first audio with a first head-related transfer function from the position of the sound source of the first audio to the left ear and the right ear, to obtain the left channel audio of the first audio and the right channel audio of the first audio, wherein the position of the sound source of the first audio is a preset sound image position or a position indicated by a received first operation instruction;
and the determining the left channel audio of the second audio and the right channel audio of the second audio comprises:
convolving the second audio with a second head-related transfer function from the position of the sound source of the second audio to the left ear and the right ear, to obtain the left channel audio of the second audio and the right channel audio of the second audio, wherein the position of the sound source of the second audio is a preset sound image position or a position indicated by a received second operation instruction.
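(Illustrative note, not part of the claims: the convolution in claim 6 can be sketched as below, assuming the head-related transfer functions are available as time-domain impulse responses (HRIRs) already selected for the chosen sound-image position; the lookup itself and all names here are hypothetical.)

```python
from scipy.signal import fftconvolve

def binauralize(mono, hrir_left, hrir_right):
    # Convolve the mono source with the impulse response from the
    # sound-image position to each ear, trimming back to source length.
    left = fftconvolve(mono, hrir_left)[:len(mono)]
    right = fftconvolve(mono, hrir_right)[:len(mono)]
    return left, right
```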
7. The method of any of claims 1-6, wherein a volume of the second audio is less than or equal to a volume of the first audio.
8. The method according to any one of claims 1 to 6, wherein the volume of the first audio is a volume indicated by a third operation instruction, and the volume of the second audio is a volume indicated by a fourth operation instruction.
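(Illustrative note, not part of the claims: claims 7 and 8 constrain only the relative and user-indicated volumes. A minimal sketch combining both constraints, assuming simple linear gains; all names are hypothetical.)

```python
def apply_volumes(first_lr, second_lr, first_gain, second_gain):
    # Clamp the second audio's gain so it never exceeds the first's
    # (claim 7), then scale each channel pair by its indicated volume
    # (claim 8).
    second_gain = min(second_gain, first_gain)
    (f_left, f_right), (s_left, s_right) = first_lr, second_lr
    return ((f_left * first_gain, f_right * first_gain),
            (s_left * second_gain, s_right * second_gain))
```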
9. An audio synthesizing apparatus, wherein the synthesized audio includes at least a first audio and a second audio, the apparatus comprising:
a determining unit, configured to determine at least one first time period and at least one second time period within the playing duration of the target accompaniment according to the timestamps of the target accompaniment, wherein at least one first time period does not coincide with the at least one second time period;
a modulation unit, configured to determine the left channel audio of the first audio and the right channel audio of the first audio;
the modulation unit is further configured to determine the left channel audio of the second audio and the right channel audio of the second audio;
a synthesizing unit, configured to mix a left channel audio of the first audio into a left channel of the target accompaniment in the at least one first time period, and mix a right channel audio of the first audio into a right channel of the target accompaniment in the at least one first time period; and mixing the left channel audio of the second audio into the left channel of the target accompaniment in the at least one second time period, and mixing the right channel audio of the second audio into the right channel of the target accompaniment in the at least one second time period to obtain the synthesized audio.
10. The apparatus according to claim 9, wherein the synthesizing unit is configured to mix a left channel audio of the first audio into a left channel of the target accompaniment in the at least one first time period, and mix a right channel audio of the first audio into a right channel of the target accompaniment in the at least one first time period, specifically:
mixing a first audio clip into a left channel of a first accompaniment clip within the at least one first time period, mixing a second audio clip into a right channel of the first accompaniment clip within the at least one first time period, wherein the first audio clip is an audio clip corresponding to the at least one first time period in the left channel audio of the first audio, the second audio clip is an audio clip corresponding to the at least one first time period in the right channel audio of the first audio, and the first accompaniment clip is a part of the target accompaniment corresponding to the at least one first time period;
the synthesizing unit is further configured to mix the left channel audio of the second audio into the left channel of the target accompaniment in the at least one second time period, and mix the right channel audio of the second audio into the right channel of the target accompaniment in the at least one second time period, specifically:
mixing a third audio clip into a left channel of a second accompaniment clip within the at least one second time period, and mixing a fourth audio clip into a right channel of the second accompaniment clip within the at least one second time period, wherein the third audio clip is the audio clip corresponding to the at least one second time period in the left channel audio of the second audio, the fourth audio clip is the audio clip corresponding to the at least one second time period in the right channel audio of the second audio, and the second accompaniment clip is the part of the target accompaniment corresponding to the at least one second time period.
11. The apparatus of claim 9, wherein the target accompaniment comprises at least two pieces of lyrics; the determining unit is configured to determine the at least one first time period and the at least one second time period within the playing duration of the target accompaniment according to the timestamps of the target accompaniment, specifically:
determining the at least one first time period according to one part of the at least two pieces of lyrics;
and determining the at least one second time period according to another part of the at least two pieces of lyrics.
12. The apparatus of claim 9, wherein the target accompaniment corresponds to N first lyric fragments and M second lyric fragments; the first lyric fragments are verse lyric fragments and the second lyric fragments are chorus lyric fragments, or the first lyric fragments are lyric fragments sung by a first singer and the second lyric fragments are lyric fragments sung by a second singer, where N and M are integers greater than or equal to 1; and the determining unit is configured to determine the at least one first time period and the at least one second time period within the playing duration of the target accompaniment according to the timestamps of the target accompaniment, specifically:
determining the at least one first time period according to the start timestamps and end timestamps of the N first lyric fragments;
and determining the at least one second time period according to the start timestamps and end timestamps of the M second lyric fragments, or according to the start timestamps and end timestamps of the N first lyric fragments and the start timestamps and end timestamps of the M second lyric fragments.
13. The apparatus of claim 9, further comprising:
an input unit, configured to receive input first information, wherein the first information indicates the at least one first time period and the at least one second time period;
the determining unit is configured to determine the at least one first time period and the at least one second time period within the playing duration of the target accompaniment according to the timestamps of the target accompaniment, specifically:
determining the at least one first time period and the at least one second time period according to the first information.
14. The apparatus according to any one of claims 9 to 13, wherein the modulation unit is configured to determine the left channel audio of the first audio and the right channel audio of the first audio, specifically:
convolving the first audio with a first head-related transfer function from the position of the sound source of the first audio to the left ear and the right ear, to obtain the left channel audio of the first audio and the right channel audio of the first audio, wherein the position of the sound source of the first audio is a preset sound image position or a position indicated by a received first operation instruction;
and the modulation unit is further configured to determine the left channel audio of the second audio and the right channel audio of the second audio, specifically:
convolving the second audio with a second head-related transfer function from the position of the sound source of the second audio to the left ear and the right ear, to obtain the left channel audio of the second audio and the right channel audio of the second audio, wherein the position of the sound source of the second audio is a preset sound image position or a position indicated by a received second operation instruction.
15. The apparatus of any of claims 9-14, wherein a volume of the second audio is less than or equal to a volume of the first audio.
16. The apparatus according to any one of claims 9 to 15, wherein the volume of the first audio is a volume indicated by a third operation instruction, and the volume of the second audio is a volume indicated by a fourth operation instruction.
17. An audio synthesizing apparatus, comprising a processor and a memory, wherein the memory is configured to store a computer program and the processor is configured to invoke the computer program to perform the method of any one of claims 1 to 8.
18. A computer-readable storage medium storing a computer program which, when executed by one or more processors, performs the method according to any one of claims 1 to 8.
19. A computer program product for audio synthesis, the computer program product comprising:
a computer-readable storage medium having a computer-readable program embodied thereon, the computer-readable program being executable by one or more processors to perform the method of any one of claims 1 to 8.
CN201911289583.3A 2019-12-13 2019-12-13 Audio synthesis method and related device Active CN110992970B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911289583.3A CN110992970B (en) 2019-12-13 2019-12-13 Audio synthesis method and related device

Publications (2)

Publication Number Publication Date
CN110992970A (en) 2020-04-10
CN110992970B (en) 2022-05-31

Family

ID=70094043

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911289583.3A Active CN110992970B (en) 2019-12-13 2019-12-13 Audio synthesis method and related device

Country Status (1)

Country Link
CN (1) CN110992970B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2476113A1 (en) * 2009-09-11 2012-07-18 Nokia Corp. Method, apparatus and computer program product for audio coding
CN105045578A (en) * 2015-06-29 2015-11-11 广州酷狗计算机科技有限公司 Method and apparatus for audio synthesis
CN106412672A (en) * 2015-07-29 2017-02-15 王泰来 Two-channel audio playing method and playing apparatus using same
CN106486128A (en) * 2016-09-27 2017-03-08 腾讯科技(深圳)有限公司 A kind of processing method and processing device of double-tone source audio data

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111615046A (en) * 2020-05-11 2020-09-01 腾讯音乐娱乐科技(深圳)有限公司 Audio signal processing method and device and computer readable storage medium
CN112037738A (en) * 2020-08-31 2020-12-04 腾讯音乐娱乐科技(深圳)有限公司 Music data processing method and device and computer storage medium
CN112037738B (en) * 2020-08-31 2024-05-28 腾讯音乐娱乐科技(深圳)有限公司 Music data processing method and device and computer storage medium
CN112967705A (en) * 2021-02-24 2021-06-15 腾讯音乐娱乐科技(深圳)有限公司 Mixed sound song generation method, device, equipment and storage medium
CN112967705B (en) * 2021-02-24 2023-11-28 腾讯音乐娱乐科技(深圳)有限公司 Method, device, equipment and storage medium for generating mixed song
CN113094541A (en) * 2021-04-16 2021-07-09 网易(杭州)网络有限公司 Audio playing method, electronic equipment and storage medium
CN113192486A (en) * 2021-04-27 2021-07-30 腾讯音乐娱乐科技(深圳)有限公司 Method, equipment and storage medium for processing chorus audio
WO2022228220A1 (en) * 2021-04-27 2022-11-03 腾讯音乐娱乐科技(深圳)有限公司 Method and device for processing chorus audio, and storage medium
CN113192486B (en) * 2021-04-27 2024-01-09 腾讯音乐娱乐科技(深圳)有限公司 Chorus audio processing method, chorus audio processing equipment and storage medium
CN114143599A (en) * 2021-11-19 2022-03-04 湖南快乐阳光互动娱乐传媒有限公司 Sound information processing method and related equipment
CN114466242A (en) * 2022-01-27 2022-05-10 海信视像科技股份有限公司 Display device and audio processing method
CN114615534A (en) * 2022-01-27 2022-06-10 海信视像科技股份有限公司 Display device and audio processing method

Also Published As

Publication number Publication date
CN110992970B (en) 2022-05-31

Similar Documents

Publication Publication Date Title
CN110992970B (en) Audio synthesis method and related device
WO2016188322A1 (en) Karaoke processing method, apparatus and system
WO2021103314A1 (en) Listening scene constructing method and related device
US9326082B2 (en) Song transition effects for browsing
CN111916039B (en) Music file processing method, device, terminal and storage medium
CN112037738B (en) Music data processing method and device and computer storage medium
WO2016188211A1 (en) Audio processing method, apparatus and system
CN113823250B (en) Audio playing method, device, terminal and storage medium
CN110915240B (en) Method for providing interactive music composition to user
CN112967705B (en) Method, device, equipment and storage medium for generating mixed song
CN111724757A (en) Audio data processing method and related product
CN111512648A (en) Enabling rendering of spatial audio content for consumption by a user
US20220386062A1 (en) Stereophonic audio rearrangement based on decomposed tracks
CN114242025A (en) Method and device for generating accompaniment and storage medium
CN113821189A (en) Audio playing method and device, terminal equipment and storage medium
WO2023061330A1 (en) Audio synthesis method and apparatus, and device and computer-readable storage medium
CN113039815B (en) Sound generating method and device for executing the same
JP6596903B2 (en) Information providing system and information providing method
Nazemi et al. Sound design: a procedural communication model for VE
US20230421981A1 (en) Reproducing device, reproducing method, information processing device, information processing method, and program
Walls Giving Bach The Spatial Treatment.
WO2023217003A1 (en) Audio processing method and apparatus, device, and storage medium
WO2023160713A1 (en) Music generation methods and apparatuses, device, storage medium, and program
CN115767407A (en) Sound generating method and device for executing the same
JP2008216681A (en) Karaoke device wherein recorded singer's singing can strictly be compared with model singing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant