CN113192486A - Method, equipment and storage medium for processing chorus audio - Google Patents

Method, equipment and storage medium for processing chorus audio

Info

Publication number
CN113192486A
Authority
CN
China
Prior art keywords
audio
chorus
sound
dry
processing
Prior art date
Legal status
Granted
Application number
CN202110460280.4A
Other languages
Chinese (zh)
Other versions
CN113192486B (en)
Inventor
张超鹏
陈灏
武文昊
罗辉
李革委
姜涛
胡鹏
Current Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202110460280.4A
Publication of CN113192486A
Priority to PCT/CN2022/087784
Application granted
Publication of CN113192486B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/0335 Pitch control
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 Architecture of speech synthesisers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30 Control circuits for electronic adaptation of the sound field
    • H04S7/305 Electronic adaptation of stereophonic audio signals to reverberation of the listening space
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/13 Aspects of volume control, not necessarily automatic, in stereophonic sound systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Stereophonic System (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

The application discloses a method for processing chorus audio, comprising the following steps: respectively obtaining dry sound audio of a plurality of singers singing the same target song; performing time alignment processing on the obtained plurality of dry sound audio, then performing virtual sound image localization to localize the plurality of dry sound audio onto a plurality of virtual sound images; generating chorus audio; and, when lead vocal audio sung based on the target song is obtained, synthesizing the lead vocal audio, the chorus audio and the corresponding accompaniment, and outputting grand chorus effect audio. By applying the technical scheme provided by the application, the virtual sound images surround the listener's ears and the dry sound audio is localized onto them, so the output grand chorus effect audio has a surrounding sound field; the in-head effect produced when the sound field gathers at the center of the listener's head is effectively avoided, and the sound field is wider. The application also discloses an apparatus, a device and a storage medium for processing chorus audio, with corresponding technical effects.

Description

Method, equipment and storage medium for processing chorus audio
Technical Field
The present application relates to the field of computer application technologies, and in particular, to a method and an apparatus for processing chorus audio, and a storage medium.
Background
With the rapid development of computer technology, audio, video, office and other types of software have steadily multiplied, bringing great convenience to people's lives. Using audio software, a user can listen to songs, sing songs, and so on.
At present, to give users the auditory experience of a chorus at a concert, the singing data of multiple people are simply superimposed. Aurally, however, the sound field of audio obtained by such simple superposition is concentrated at the center of the listener's head, producing an in-head effect; the sound field is not wide enough and the listening experience is poor.
Disclosure of Invention
The aim of the application is to provide a method and device for processing chorus audio, and a storage medium, so as to avoid the in-head effect caused by the sound field gathering at the center of the listener's head, widen the sound field, and improve the auditory experience.
In order to solve the technical problem, the application provides the following technical scheme:
a method of processing chorus audio, comprising:
respectively obtaining the dry sound frequency of singing the same target song by a plurality of singers;
performing time alignment processing on the obtained plurality of the dry audio;
performing virtual sound image localization on the plurality of the dry sound audios subjected to the time alignment processing to localize the plurality of the dry sound audios onto a plurality of virtual sound images; the virtual sound images are positioned in a pre-established virtual sound image coordinate system, the virtual sound image coordinate system takes a human head as a center and takes the middle point of a straight line where left and right ears are positioned as a coordinate origin, the positive direction of a first coordinate axis represents the front of the human head, the positive direction of a second coordinate axis represents the side of the human head from the left ear to the right ear, the positive direction of a third coordinate axis represents the front of the human head, the distance between each virtual sound image and the coordinate origin is within a set distance range, and the pitch angle of each virtual sound image relative to a plane formed by the first coordinate axis and the second coordinate axis is within a set angle range;
generating chorus audio based on the plurality of the dry sound audios subjected to the virtual sound image localization;
and under the condition of acquiring the main singing audio based on the singing of the target song, synthesizing the main singing audio, the chorus audio and the corresponding accompaniment, and outputting the audio with the grand chorus effect.
In one embodiment of the present application, the performing time alignment processing on the obtained plurality of dry sound audio includes:
determining a reference audio corresponding to the target song;
for each obtained dry sound audio, respectively extracting audio features of the current dry sound audio and of the reference audio, the audio features being fingerprint features or fundamental frequency features;
determining the time corresponding to the maximum of the audio feature similarity between the current dry sound audio and the reference audio as the audio alignment time;
and performing time alignment processing on the current dry sound audio based on the audio alignment time.
In one embodiment of the present application, the method further includes:
respectively performing band-pass filtering processing on the obtained plurality of dry sound audio to obtain a plurality of pieces of bass data;
correspondingly, the generating chorus audio based on the plurality of dry sound audio subjected to the virtual sound image localization includes:
generating chorus audio based on the plurality of dry sound audio subjected to the virtual sound image localization and the plurality of pieces of bass data.
In one embodiment of the present application, the method further includes:
respectively performing reverberation simulation processing on the obtained plurality of dry sound audio;
correspondingly, the generating chorus audio based on the plurality of dry sound audio subjected to the virtual sound image localization includes:
generating chorus audio based on the plurality of dry sound audio subjected to the virtual sound image localization and the plurality of dry sound audio subjected to the reverberation simulation processing.
In one embodiment of the present application, the respectively performing reverberation simulation processing on the obtained plurality of dry sound audio includes:
respectively performing reverberation simulation processing on the obtained plurality of dry sound audio by using a cascade of comb filters and all-pass filters.
In one embodiment of the present application, after the performing virtual sound image localization on the plurality of dry sound audio subjected to the time alignment processing, the method further includes:
respectively performing reverberation simulation processing on the plurality of dry sound audio subjected to the virtual sound image localization;
correspondingly, the generating chorus audio based on the plurality of dry sound audio subjected to the virtual sound image localization includes:
generating chorus audio based on the plurality of dry sound audio subjected to the virtual sound image localization and the reverberation simulation processing.
In one embodiment of the present application, the method further includes:
respectively performing binaural simulation processing on the obtained plurality of dry sound audio;
correspondingly, the generating chorus audio based on the plurality of dry sound audio subjected to the virtual sound image localization includes:
generating chorus audio based on the plurality of dry sound audio subjected to the virtual sound image localization and the plurality of dry sound audio subjected to the binaural simulation processing.
In an embodiment of the present application, after the respectively performing binaural simulation processing on the obtained plurality of dry sound audio, the method further includes:
performing reverberation simulation processing on the plurality of dry sound audio subjected to the binaural simulation processing;
correspondingly, the generating chorus audio based on the plurality of dry sound audio subjected to the virtual sound image localization and the plurality of dry sound audio subjected to the binaural simulation processing includes:
generating chorus audio based on the plurality of dry sound audio subjected to the virtual sound image localization and the plurality of dry sound audio subjected to the binaural simulation processing and the reverberation simulation processing.
In one embodiment of the present application, the performing virtual sound image localization on the plurality of dry sound audio subjected to the time alignment processing includes:
grouping the obtained plurality of dry sound audio subjected to the time alignment processing according to the number of the virtual sound images, the number of groups being the same as the number of virtual sound images;
localizing each group of dry sound audio onto its corresponding virtual sound image, with different groups of dry sound audio corresponding to different virtual sound images.
In one embodiment of the present application,
the elevation angle, relative to the plane formed by the first and second coordinate axes, of a virtual sound image located behind the human head is greater than that of a virtual sound image located in front of the human head;
or, alternatively,
the virtual sound images are uniformly distributed around the plane formed by the first coordinate axis and the second coordinate axis.
In a specific embodiment of the present application, the synthesizing the lead vocal audio, the chorus audio and the corresponding accompaniment includes:
respectively adjusting the volume of the lead vocal audio and the chorus audio, and/or performing reverberation simulation processing on the lead vocal audio and the chorus audio;
and synthesizing the volume-adjusted and/or reverberation-processed lead vocal audio and chorus audio with the corresponding accompaniment.
An apparatus for processing chorus audio, comprising:
a dry sound audio obtaining module, configured to respectively obtain dry sound audio of a plurality of singers singing the same target song;
an alignment processing module, configured to perform time alignment processing on the obtained plurality of dry sound audio;
a virtual sound image localization module, configured to perform virtual sound image localization on the plurality of dry sound audio subjected to the time alignment processing, so as to localize the plurality of dry sound audio onto a plurality of virtual sound images; the virtual sound images are located in a pre-established virtual sound image coordinate system, which is centered on the human head and takes the midpoint of the line connecting the left and right ears as the coordinate origin, with the positive direction of a first coordinate axis representing the front of the head, the positive direction of a second coordinate axis representing the side of the head from the left ear to the right ear, and the positive direction of a third coordinate axis representing the top of the head; the distance between each virtual sound image and the coordinate origin is within a set distance range, and the pitch angle of each virtual sound image relative to the plane formed by the first and second coordinate axes is within a set angle range;
a chorus audio generation module, configured to generate chorus audio based on the plurality of dry sound audio subjected to the virtual sound image localization;
and a grand chorus effect audio output module, configured to, when lead vocal audio sung based on the target song is obtained, synthesize the lead vocal audio, the chorus audio and the corresponding accompaniment, and output grand chorus effect audio.
A device for processing chorus audio, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the method of processing chorus audio of any one of the above when executing the computer program.
A computer readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of processing chorus audio of any of the above.
By applying the technical scheme provided by the embodiments of the present application, after the dry sound audio of a plurality of singers singing the same target song is respectively obtained, time alignment processing is performed on the obtained dry sound audio, and virtual sound image localization is performed on the aligned dry sound audio so as to localize it onto a plurality of virtual sound images. The virtual sound images are located in a virtual sound image coordinate system centered on the human head, with their distances from the coordinate origin within a set distance range, so they surround the listener's ears. Chorus audio is generated based on the dry sound audio subjected to the virtual sound image localization, and, when lead vocal audio sung based on the target song is obtained, the lead vocal audio, the chorus audio and the corresponding accompaniment are synthesized to obtain and output grand chorus effect audio. Because the dry sound audio is localized onto virtual sound images surrounding the listener's ears, the generated chorus audio has a surrounding sound field; aurally, the in-head effect produced when the sound field of the final grand chorus effect audio gathers at the center of the listener's head is effectively avoided, and the sound field is wider.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a flowchart illustrating an implementation of a method for processing chorus audio in an embodiment of the present application;
FIG. 2 is a schematic diagram of a virtual sound image localization coordinate system showing sound image orientations in an embodiment of the present application;
FIG. 3 is a schematic view of a virtual sound image localization in an embodiment of the present application;
fig. 4 is a schematic diagram of a localized virtual sound image in an embodiment of the present application;
FIG. 5 is a schematic diagram of a spatial sound field process in an embodiment of the present application;
FIG. 6 is a schematic diagram of a cascade of a comb filter and an all-pass filter according to an embodiment of the present application;
FIG. 7 is a diagram illustrating a reverberation pulse response according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a binaural simulation process according to an embodiment of the present application;
FIG. 9 is a block diagram of a chorus audio processing system according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a chorus audio processing system according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a chorus audio processing apparatus according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a chorus audio processing device in an embodiment of the present application.
Detailed Description
The core of the application is to provide a method for processing chorus audio: respectively obtain dry sound audio of a plurality of singers singing the same target song; perform time alignment processing on the obtained plurality of dry sound audio; perform virtual sound image localization on the aligned dry sound audio so as to localize it onto a plurality of virtual sound images, the virtual sound images being located in a virtual sound image coordinate system centered on the human head and surrounding the listener's ears within a set distance range from the coordinate origin; generate chorus audio based on the dry sound audio subjected to the virtual sound image localization; and, when lead vocal audio sung based on the target song is obtained, synthesize the lead vocal audio, the chorus audio and the corresponding accompaniment to obtain and output grand chorus effect audio. Because the dry sound audio is localized onto virtual sound images surrounding the listener's ears, the generated chorus audio has a surrounding sound field; aurally, the in-head effect produced when the sound field of the final grand chorus effect audio gathers at the center of the listener's head is effectively avoided, and the sound field is wider.
In practical application, the method provided by the embodiment of the application can be applied to various scenes in which a chorus sound effect is desired to be obtained, and the implementation of a specific scheme can be carried out through the interaction of the server and the client.
For example, in scenario 1, the server may obtain in advance the dry sound audio of a plurality of singers, such as singers 1, 2, 3, 4 …, singing the same target song, perform time alignment processing on the obtained dry sound audio, perform virtual sound image localization on the aligned dry sound audio so as to localize it onto a plurality of virtual sound images that surround the listener's ears, and generate chorus audio based on the dry sound audio subjected to the virtual sound image localization. When a user X wants a song he sings to have the grand chorus sound effect, he can sing the target song through the client; the server obtains the lead vocal audio sung by user X through the client, synthesizes the lead vocal audio, the chorus audio and the corresponding accompaniment to obtain grand chorus effect audio, and outputs it through the client, so that user X can experience the grand chorus sound effect.
In scenario 2, several friends (users 1, 2, 3, 4 and 5) sing a target song in the same time period but in different places and want to achieve the grand chorus sound effect. From the perspective of any one user, the current user may be taken as the lead singer. From user 1's perspective, for example, the server may obtain the dry sound audio of users 2, 3, 4 and 5 singing the target song, perform time alignment processing on it, localize the aligned dry sound audio onto a plurality of virtual sound images surrounding the listener's ears, and generate chorus audio based on the dry sound audio subjected to the virtual sound image localization. When the server obtains the lead vocal audio sung by user 1 through the client based on the target song, it synthesizes the lead vocal audio, the chorus audio and the corresponding accompaniment to obtain grand chorus effect audio and outputs it to user 1 through the client, so that user 1 can experience the grand chorus sound effect.
The application scenarios above are only examples; in practice the technical solution of the present application can be applied to more scenarios, such as sound effect processing for multi-person choruses, small bands, and the like.
In order that those skilled in the art will better understand the disclosure, the following detailed description will be given with reference to the accompanying drawings. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, there is shown a flowchart of an implementation of a method for processing chorus audio provided in an embodiment of the present application, where the method may include the following steps:
s110: and respectively obtaining the dry sound frequency for singing the same target song by a plurality of singers.
In the embodiment of the present application, a plurality of dry audio can be obtained according to actual needs. The multiple dry audio may be audio data obtained by different singers singing the same target song, and the different singers may be in the same or different environments.
S120: perform time alignment processing on the obtained plurality of dry sound audio.
Because the dry sound audio of the same target song may be sung by different singers at different times, misalignments such as delays may exist among them. To achieve a better chorus sound effect later, time alignment processing can be performed on the obtained plurality of dry sound audio so that no aligned dry sound audio seriously rushes ahead of or drags behind the beat, for example by being more than 1 second early or late. Specifically, an alignment tool can be used to align the obtained pieces of audio so that their starting positions coincide as closely as possible.
In a specific embodiment of the present application, before the time alignment processing, the obtained dry sound audio may be preliminarily screened, for example with tools such as sound quality detection, rejecting audio of poor quality: audio with noise, accompaniment bleed-through, too short a length, too little energy, and so on. The time alignment processing and subsequent steps are then performed on the dry sound audio retained after screening.
S130: perform virtual sound image localization on the plurality of dry sound audio subjected to the time alignment processing, so as to localize the plurality of dry sound audio onto a plurality of virtual sound images.
The virtual sound images are located in a pre-established virtual sound image coordinate system, which is centered on the human head and takes the midpoint of the line connecting the left and right ears as the coordinate origin, with the positive direction of a first coordinate axis representing the front of the head, the positive direction of a second coordinate axis representing the side of the head from the left ear to the right ear, and the positive direction of a third coordinate axis representing the top of the head; the distance between each virtual sound image and the coordinate origin is within a set distance range, and the pitch angle of each virtual sound image relative to the plane formed by the first and second coordinate axes is within a set angle range.
In the embodiment of the present application, a virtual sound image coordinate system may be established in advance to express sound image orientations. The virtual sound image coordinate system may be a Cartesian coordinate system. As shown in fig. 2, it may be centered on the human head, with the midpoint of the line connecting the left and right ears as the coordinate origin; the positive direction of the x-axis (the first coordinate axis) represents the front of the head, the positive direction of the y-axis (the second coordinate axis) represents the side of the head from the left ear to the right ear, and the positive direction of the z-axis (the third coordinate axis) represents the top of the head. A sound image in space has an azimuth angle φ and an elevation angle θ, and its distance from the coordinate origin is denoted r, so a sound image position can be written as (θ, φ, r).
An ordinary sound signal is a single-channel (mono) signal and can be regarded as a sound image located at some position (θ, φ, r). To obtain a virtual sound image at a desired position, the localization operation can be implemented by convolving the data with an HRTF (Head Related Transfer Function). A schematic diagram of virtual sound image localization is shown in fig. 3, where X represents a real sound source (a mono signal), Y_L and Y_R represent the sound signals received at the left and right ears, and the HRTFs represent the transfer functions of the transmission paths from the sound source position to the two ears. Based on HRTF technology, a real sound source (mono signal) at a position (θ, φ, r) can be filtered with the left-ear and right-ear HRTFs of that position to obtain a two-channel sound signal.
The frequency-domain characteristics of the sound signals received at the left and right ears can be expressed as:
Y_L(ω) = H_L(ω) · X(ω),  Y_R(ω) = H_R(ω) · X(ω)
where H_L and H_R denote the left-ear and right-ear HRTFs.
it can be simply considered that the acoustic signal heard by the human ear is the result of HRTF filtering of the sound source X. Therefore, in performing virtual sound image localization, the acoustic signal may be filtered through the HRTF of the corresponding position. A plurality of virtual sound images may be set in the virtual sound image coordinate system, each of which may be located at a distance from the origin of coordinates within a set distance range, such as a 1-meter range, and a pitch angle of each of the virtual sound images with respect to a plane formed by the first coordinate axis and the second coordinate axis of the virtual sound image coordinate system may be within a set angle range, such as a 10 ° range, so that the plurality of virtual sound images surround the human ear.
Specifically, the plurality of virtual sound images may be uniformly distributed around the plane formed by the first coordinate axis and the second coordinate axis, i.e., spaced at equal angular intervals around the horizontal plane of the listener's ears. The interval angle may be set according to the actual situation or analysis of historical data, for example to 30°. With a 30° interval around the horizontal plane, 12 virtual sound images can be localized, all with an elevation angle of 0° and with azimuth angles of 0°, 30°, 60°, …, 330°. Of course, the interval angle may also be set to other values, such as 15° or 60°.
In another embodiment, among the plurality of virtual sound images, the elevation angle, relative to the plane formed by the first and second coordinate axes, of a virtual sound image located behind the head may be greater than that of a virtual sound image located in front of the head. That is, the virtual sound images behind the head may have larger elevation angles than those in front of it. This strengthens the localization effect and reduces front-back confusion of the virtual sound images. For example, the elevation angle of the virtual sound images behind the head may be raised by 10°: the elevation angle θ of a virtual sound image in front of the head is 0°, and that of a virtual sound image behind the head is 10°.
As shown in fig. 4, the plurality of virtual sound images surround the human ear horizontal plane at intervals of 30 °, and the elevation angle θ of the virtual sound image located in front of the human head is 0 ° and the elevation angle θ of the virtual sound image located behind the human head is 10 °.
It should be noted that the positions of the virtual sound images in the virtual sound image coordinate system are not limited to those above and may be set according to actual needs; it is only necessary that each virtual sound image's distance from the coordinate origin be within the set distance range and its pitch angle relative to the plane formed by the first and second coordinate axes be within the set angle range. For example, one part of the virtual sound images may circle the horizontal plane of the ears at 30° intervals with an elevation angle of 0°, while another part circles it at 60° intervals with an elevation angle of 10°; the distances of the two parts from the coordinate origin may be the same or different, as long as they are within the set distance range. This can enhance the surround effect of the subsequently generated chorus audio.
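A short sketch of the placements described above (12 images at 30° intervals, elevation 0° in front and 10° behind); treating azimuths strictly between 90° and 270° as "behind the head" is an assumption made here for illustration:

```python
def virtual_image_positions(spacing_deg=30, distance=1.0,
                            front_elev=0.0, rear_elev=10.0):
    """Return (azimuth deg, elevation deg, distance) tuples for the images."""
    positions = []
    for azimuth in range(0, 360, spacing_deg):
        # Raise the elevation of images behind the head, per the text.
        elevation = rear_elev if 90 < azimuth < 270 else front_elev
        positions.append((azimuth, elevation, distance))
    return positions

print(virtual_image_positions())  # 12 images around the ear-level plane
```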
After virtual sound image localization has been performed on the plurality of dry sound audio subjected to the time alignment processing, localizing them onto the plurality of virtual sound images, the subsequent steps can be carried out.
S140: generate chorus audio based on the plurality of dry sound audio subjected to the virtual sound image localization.
After the dry sound audio of the singers has been obtained, time-aligned, and localized onto the plurality of virtual sound images, each dry sound audio can be HRTF-filtered at its corresponding virtual sound image position, yielding corresponding audio data at each virtual sound image position. Chorus audio may then be generated based on the plurality of dry sound audio subjected to the virtual sound image localization. Specifically, the chorus audio may be obtained by superimposing, or weighting and superimposing, the audio data obtained after HRTF filtering at the virtual sound image positions. The resulting chorus audio has a three-dimensional sound field.
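A minimal sketch of this superposition, assuming each localized dry sound has already been rendered to a two-channel array of a common length; the equal default weights stand in for whatever weighting is chosen in practice:

```python
import numpy as np

def mix_localized(localized, weights=None):
    """Superimpose per-image binaural signals into one chorus track.

    `localized` is a list of (samples, 2) NumPy arrays, one per virtual
    sound image, padded to the same length beforehand.
    """
    if weights is None:
        weights = [1.0 / len(localized)] * len(localized)  # plain average
    return sum(w * s for w, s in zip(weights, localized))
```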
S150: when lead vocal audio sung based on the target song is obtained, synthesize the lead vocal audio, the chorus audio and the corresponding accompaniment, and output grand chorus effect audio.
In an application scenario of the embodiment of the present application, after the chorus audio is generated, it can be stored in a database and used as needed. For example, if a user wants a song he sings himself to have a grand chorus effect, the stored chorus audio can be used to achieve it.
The audio of the current user singing the target song can be acquired and taken as the lead vocal audio; the lead vocal audio, the chorus audio and the corresponding accompaniment are then synthesized to obtain grand chorus effect audio, which is output so that the current user can enjoy the grand chorus sound effect.
The synthesis of the lead vocal audio, the chorus audio and the corresponding accompaniment can be implemented in various ways: synthesize the lead vocal audio with the corresponding accompaniment first and then add the chorus audio; synthesize the chorus audio with the corresponding accompaniment first and then add the lead vocal audio; or synthesize the lead vocal audio with the chorus audio first and then add the corresponding accompaniment, for example by balance-adjusting the lead vocal audio and the chorus audio and then superimposing the accompaniment at a set vocal-to-accompaniment ratio. Different implementations give different grand chorus sound effects, and the specific implementation can be chosen according to the actual situation.
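One of the orderings above, balancing the lead vocal and chorus first and then overlaying the accompaniment at a set vocal-to-accompaniment ratio, might look like the following sketch; all gain values are illustrative assumptions:

```python
def synthesize_grand_chorus(lead, chorus, accompaniment,
                            lead_gain=1.0, chorus_gain=0.7,
                            accompaniment_ratio=0.6):
    """Mix equal-length NumPy arrays into grand chorus effect audio."""
    vocals = lead_gain * lead + chorus_gain * chorus  # balance the voices
    return vocals + accompaniment_ratio * accompaniment
```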
By applying the method provided by the embodiments of the present application, after the dry sound audio of a plurality of singers singing the same target song is respectively obtained, time alignment processing is performed on the obtained dry sound audio, and virtual sound image localization is performed on the aligned dry sound audio so as to localize it onto the plurality of virtual sound images; the virtual sound images are located in a virtual sound image coordinate system centered on the human head, with their distances from the coordinate origin within a set distance range, so they surround the listener's ears. Chorus audio is generated based on the dry sound audio subjected to the virtual sound image localization, and, when lead vocal audio sung based on the target song is obtained, the lead vocal audio, the chorus audio and the corresponding accompaniment are synthesized to obtain and output grand chorus effect audio. Because the dry sound audio is localized onto virtual sound images surrounding the listener's ears, the generated chorus audio has a surrounding sound field; aurally, the in-head effect produced when the sound field of the final grand chorus effect audio gathers at the center of the listener's head is effectively avoided, and the sound field is wider.
In an embodiment of the present application, the time alignment processing of step S120 may include the following steps:
the first step: determine the reference audio corresponding to the target song;
the second step: for each obtained dry sound audio, respectively extract the audio features of the current dry sound audio and of the reference audio, the audio features being fingerprint features or fundamental frequency features;
the third step: determine the time corresponding to the maximum of the audio feature similarity between the current dry sound audio and the reference audio as the audio alignment time;
the fourth step: perform time alignment processing on the current dry sound audio based on the audio alignment time.
For convenience of description, the above steps are combined for illustration.
In the embodiment of the present application, after the dry sound audio of the singers is obtained, the reference audio corresponding to the target song may be determined for the time alignment processing. Specifically, one of the obtained dry sound audio with relatively good sound quality may be selected as the reference audio, or the original singer's dry vocal audio of the target song may be used as the reference audio.
For each obtained dry sound audio, the audio features of the current dry sound audio and of the reference audio can be extracted respectively; the audio features are fingerprint features or fundamental frequency features. For example, Mel band information, Bark band information, or ERB band power can be extracted through multi-band filtering, and the fingerprint features then obtained through half-wave rectification, binary judgment and the like. Alternatively, the fundamental frequency features can be extracted with fundamental frequency extraction tools such as pyin, deep and harvest. The audio features of the reference audio can be extracted once, stored, and called directly when needed.
The audio features of the current dry sound audio and the reference audio are then compared, which can be represented by a similarity curve or the like; the time corresponding to the maximum similarity can be determined as the audio alignment time, and time alignment processing is then performed on the current dry sound audio based on that time.
Comparing each obtained dry sound audio against the audio features of the reference audio yields a corresponding audio alignment time for each, and performing the time alignment processing yields the plurality of time-aligned dry sound audio.
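As a sketch of this procedure, the similarity curve can be approximated by cross-correlating the per-frame feature sequences (for example, a fundamental frequency track) of the dry sound audio and the reference audio. The patent leaves the similarity computation open, so the correlation-based measure and frame-hop handling below are assumptions:

```python
import numpy as np

def alignment_offset_s(feat_dry, feat_ref, hop_s):
    """Time shift (seconds) at which the two feature sequences best match."""
    a = feat_dry - np.mean(feat_dry)
    b = feat_ref - np.mean(feat_ref)
    corr = np.correlate(a, b, mode="full")     # similarity vs. lag
    lag = int(np.argmax(corr)) - (len(b) - 1)  # > 0: dry lags the reference
    return lag * hop_s

def time_align(dry, offset_s, sr):
    """Trim or zero-pad the dry audio so it lines up with the reference."""
    shift = int(round(offset_s * sr))
    if shift > 0:
        return dry[shift:]                     # dry lags: drop the lead-in
    return np.concatenate([np.zeros(-shift), dry])  # dry leads: pad start
```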
In one embodiment of the present application, the method may further comprise the steps of:
respectively performing band-pass filtering processing on the obtained plurality of dry sound audio to obtain a plurality of pieces of bass data;
correspondingly, the generating chorus audio based on the plurality of dry sound audio subjected to the virtual sound image localization includes:
generating chorus audio based on the plurality of dry sound audio subjected to the virtual sound image localization and the plurality of pieces of bass data.
In the embodiment of the present application, after the dry sound audio of the singers is obtained, band-pass filtering processing may be performed on it, for example with cutoff frequencies of [33, 523] Hz, to obtain the plurality of pieces of bass data.
Chorus audio may be generated based on the plurality of dry sound audio subjected to the virtual sound image localization and the plurality of pieces of bass data. Specifically, the obtained bass data may be superimposed, or weighted and superimposed, onto the dry sound audio subjected to the virtual sound image localization to generate the chorus audio. Superimposing the bass signal enhances the fullness and weight of the sound signal.
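A sketch of this band-pass step with the [33, 523] Hz cutoffs given above; the Butterworth design and the filter order are assumptions, since the patent specifies only the band:

```python
from scipy.signal import butter, sosfilt

def extract_bass(dry, sr, band_hz=(33.0, 523.0), order=4):
    """Band-pass a dry sound signal to obtain its bass data."""
    sos = butter(order, band_hz, btype="bandpass", fs=sr, output="sos")
    return sosfilt(sos, dry)
```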
In one embodiment of the present application, the method may further comprise the steps of:
respectively performing reverberation simulation processing on the obtained plurality of dry sound audio;
correspondingly, the generating chorus audio based on the plurality of dry sound audio subjected to the virtual sound image localization includes:
generating chorus audio based on the plurality of dry sound audio subjected to the virtual sound image localization and the plurality of dry sound audio subjected to the reverberation simulation processing.
Generally, a sound signal emitted by a source in a sound field goes through direct sound, reflection, reverberation and other processes. Fig. 5 is a schematic diagram of a typical spatial sound field process: the component with the largest amplitude is the direct sound; next come the early reflections, produced when the sound wave bounces off the objects closest to the listener, which still have clear directivity; the dense signals that follow are the reverberant sound, the superposition of sound waves after many reflections off surrounding objects, i.e., a superposition of a great number of reflections from different directions, with no directivity.
From known room impulse response characteristics, reverberant sound is a superposition of many reflected paths; it has weak energy and no directivity. Because it superimposes a large number of late reflections from different directions and has a high echo density, reverberation can be used to produce a surround effect with a sense of envelopment.
In the embodiment of the present application, after the dry sound audio of the singers is obtained, reverberation simulation processing may be performed on it. Specifically, the obtained plurality of dry sound audio may be processed with a cascade of comb filters and all-pass filters.
Fig. 6 shows a cascade of four comb filters and two all-pass filters: the four comb filters are connected in parallel, and their output is then passed in series through the two all-pass filters. The reverberation impulse response obtained by an actual simulation is shown in fig. 7.
It should be noted that fig. 6 shows only one specific form, and in practical applications, there may be other more forms, and the number and the cascading manner of the comb filters and the all-pass filters may be adjusted according to practical needs.
After the reverberation simulation processing and the virtual sound image localization have been performed on the obtained dry sound audio, and the dry sound audio has been localized onto the plurality of virtual sound images, chorus audio can be generated based on the plurality of dry sound audio subjected to the virtual sound image localization and the plurality of dry sound audio subjected to the reverberation simulation processing. Specifically, the two sets of signals may be superimposed, or weighted and superimposed, to generate the chorus audio. This enhances the spatial effect of the sound signal, further suppresses the in-head effect, and widens the sound field.
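The structure of fig. 6 (four parallel comb filters feeding two series all-pass stages) is the classic Schroeder reverberator; a sketch follows, with delay and gain values chosen as illustrative assumptions since the patent does not fix them:

```python
import numpy as np

def comb(x, delay, g):
    """Feedback comb filter: y[n] = x[n] + g * y[n - delay]."""
    y = x.astype(float).copy()
    for n in range(delay, len(y)):
        y[n] += g * y[n - delay]
    return y

def allpass(x, delay, g):
    """All-pass filter: y[n] = -g*x[n] + x[n-delay] + g*y[n-delay]."""
    y = np.zeros_like(x, dtype=float)
    for n in range(len(x)):
        x_d = x[n - delay] if n >= delay else 0.0
        y_d = y[n - delay] if n >= delay else 0.0
        y[n] = -g * x[n] + x_d + g * y_d
    return y

def reverb_sim(x, sr):
    """Four parallel combs, then two series all-pass stages (fig. 6)."""
    comb_ms = [29.7, 37.1, 41.1, 43.7]            # assumed comb delays
    wet = sum(comb(x, int(sr * t / 1000), 0.77) for t in comb_ms) / 4.0
    for t_ms, g in [(5.0, 0.7), (1.7, 0.7)]:      # assumed all-pass params
        wet = allpass(wet, int(sr * t_ms / 1000), g)
    return wet
```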
In an embodiment of the present application, after the virtual sound image localization is performed on the plurality of time-aligned dry sound audio, the method may further include the following steps:
respectively performing reverberation simulation processing on the plurality of dry sound audio subjected to the virtual sound image localization;
correspondingly, the generating chorus audio based on the plurality of dry sound audio subjected to the virtual sound image localization includes:
generating chorus audio based on the plurality of dry sound audio subjected to the virtual sound image localization and the reverberation simulation processing.
In this embodiment, after the dry sound audio of the singers has been obtained, time-aligned and localized onto the virtual sound images, reverberation simulation processing may further be performed on the localized dry sound audio; the reverberation simulation process is the same as in the previous embodiment and is not repeated here.
Chorus audio can then be generated based on the plurality of dry sound audio subjected to the virtual sound image localization and the reverberation simulation processing, specifically by superimposing, or weighting and superimposing, those signals.
Performing reverberation simulation processing on the dry sound audio after the virtual sound image processing enhances the spatial effect of the sound signal, further suppresses the in-head effect, and widens the sound field.
In one embodiment of the present application, the method may further comprise the steps of:
respectively performing binaural simulation processing on the obtained plurality of dry sound audio;
correspondingly, the generating chorus audio based on the plurality of dry sound audio subjected to the virtual sound image localization includes:
generating chorus audio based on the plurality of dry sound audio subjected to the virtual sound image localization and the plurality of dry sound audio subjected to the binaural simulation processing.
In the embodiment of the present application, after the dry sound audio of the singers has been obtained and time-aligned, binaural simulation processing may be performed on each dry sound audio. Delays are used to reduce the correlation between the two channel signals, widening the sound field as much as possible and yielding a two-channel output.
As shown in fig. 8, the binaural simulation may use 8 sets of delays and weights for each of the left and right channels, where d denotes a delay and g denotes a weight. Since a room impulse response typically has a reverberation time of about 80 ms, 16 distinct values between 21 ms and 79 ms can be chosen as the delay parameters. The energy loss of sound waves caused by reflection is modeled by amplitude attenuation, which reduces the correlation of the two paths of ambience information. The dry sound audio is duplicated to obtain two identical, fully correlated signals; attenuating them with different delays and amplitudes reduces their correlation and yields a pseudo-stereo signal.
It should be noted that fig. 8 is only a specific example; the binaural simulation can be implemented with fewer or more delays according to actual needs.
Chorus audio can be generated based on the plurality of dry sound audio subjected to the virtual sound image localization and the plurality of dry sound audio subjected to the binaural simulation processing, specifically by superimposing, or weighting and superimposing, the two sets of signals.
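A sketch of this pseudo-stereo step; the specific tap delays and gains below are assumptions spread over the 21-79 ms range mentioned above, with different values per channel so that the two copies decorrelate:

```python
import numpy as np

TAPS_LEFT = [(21, 0.5), (33, 0.4), (47, 0.3), (59, 0.25),
             (63, 0.2), (69, 0.15), (73, 0.1), (79, 0.05)]
TAPS_RIGHT = [(23, 0.5), (31, 0.4), (43, 0.3), (53, 0.25),
              (61, 0.2), (67, 0.15), (71, 0.1), (77, 0.05)]

def binaural_sim(dry, sr, taps_left=TAPS_LEFT, taps_right=TAPS_RIGHT):
    """Duplicate the dry audio and decorrelate the copies with delay taps."""
    def apply(x, taps):
        y = x.astype(float).copy()
        for delay_ms, gain in taps:           # (delay in ms, weight g)
            d = int(sr * delay_ms / 1000)
            y[d:] += gain * x[:len(x) - d]    # attenuated, delayed copy
        return y
    return np.stack([apply(dry, taps_left), apply(dry, taps_right)], axis=-1)
```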
In an embodiment of the present application, after the binaural simulation processing is performed on the obtained plurality of dry sound audio, the method may further include the following steps:
performing reverberation simulation processing on the plurality of dry sound audio subjected to the binaural simulation processing;
correspondingly, the generating chorus audio based on the plurality of dry sound audio subjected to the virtual sound image localization and the plurality of dry sound audio subjected to the binaural simulation processing includes:
generating chorus audio based on the plurality of dry sound audio subjected to the virtual sound image localization and the plurality of dry sound audio subjected to the binaural simulation processing and the reverberation simulation processing.
In the embodiment of the present application, after the dry sound audio of the singers has been obtained, time-aligned and binaurally simulated, reverberation simulation processing may further be performed on the binaurally simulated dry sound audio to enhance the spatial effect of the sound signal, suppress the in-head effect, and widen the sound field.
After the virtual sound image localization has localized the dry sound audio onto the plurality of virtual sound images, chorus audio may be generated based on the plurality of dry sound audio subjected to the virtual sound image localization and the plurality of dry sound audio subjected to the binaural simulation processing and the reverberation simulation processing, specifically by superimposing, or weighting and superimposing, those signals.
In practical application, after the dry sound audio of a plurality of singers singing the same target song is obtained, the obtained plurality of dry sound audio may be time-aligned, and the time-aligned dry sound audio may then be subjected to virtual sound image localization, bass enhancement, reverberation simulation, binaural simulation and the like.
Fig. 9 is a schematic diagram of a system framework for processing the plurality of time-aligned dry sound audio, comprising a bass enhancement unit, a virtual sound image localization unit, a binaural simulation unit and a reverberation simulation unit. The bass enhancement unit performs band-pass filtering on the dry sound audio to obtain bass data; the virtual sound image localization unit performs virtual sound image localization to localize the dry sound audio onto the virtual sound images; the binaural simulation unit performs binaural simulation processing on the dry sound audio; and the reverberation simulation unit performs reverberation simulation processing on the dry sound audio. The virtual sound image localization unit and the binaural simulation unit can each be connected to a reverberation simulation unit: after the virtual sound image localization, reverberation simulation processing can further be applied, and likewise after the binaural simulation processing. Finally, the audio data processed by these units can be weighted and superimposed to obtain the chorus audio.
Fig. 10 shows a specific example of processing a plurality of dry sound audio, where H denotes the transfer function of the HRTF filtering; through this processing, virtual sound image localization is performed on the dry sound audio, localizing it onto 12 virtual sound images around the horizontal plane of the ears. REV denotes a reverberation simulation unit, BASS denotes a bass enhancement unit, and REF denotes a binaural simulation unit. The reverberation simulation units can share the same parameters, or different parameters can be configured for different reverberation simulation units according to actual requirements to obtain flexible reverberation shaping.
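Composing the sketches above into the fig. 9 flow might look as follows; the per-unit weights and the round-robin assignment of dry sound audio to images are assumptions, and the helpers (localize_mono, extract_bass, binaural_sim, reverb_sim) are the ones defined in the earlier sketches:

```python
import numpy as np

def _pad(x, n):
    """Zero-pad a mono or stereo array to n samples."""
    widths = [(0, n - x.shape[0])] + [(0, 0)] * (x.ndim - 1)
    return np.pad(x, widths)

def _stereo(x):
    return np.stack([x, x], axis=-1) if x.ndim == 1 else x

def chorus_from_dry(dry_list, sr, hrirs,
                    w_img=1.0, w_bass=0.25, w_bin=0.2, w_rev=0.2):
    """Weight-sum the four units' outputs for every aligned dry sound audio."""
    parts = []
    for i, dry in enumerate(dry_list):
        h_l, h_r = hrirs[i % len(hrirs)]      # round-robin image assignment
        outs = [w_img * localize_mono(dry, h_l, h_r),
                w_bass * _stereo(extract_bass(dry, sr)),
                w_bin * binaural_sim(dry, sr),
                w_rev * _stereo(reverb_sim(dry, sr))]
        n = max(o.shape[0] for o in outs)
        parts.append(sum(_pad(o, n) for o in outs))
    m = max(p.shape[0] for p in parts)
    return sum(_pad(p, m) for p in parts)     # the chorus audio
```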
The chorus effect of the chorus audio finally generated by the embodiments of the present application is closer to the chorus heard at a real concert. In practical application, adding the accompaniment on top of the lead vocal audio and mixing in the chorus audio gives the user the auditory experience of being at a live concert, with a more striking, immersive surround sound field.
In an embodiment of the present application, the virtual sound image localization of the plurality of trunk audio signals after the time alignment process may include the steps of:
the method comprises the following steps: grouping the obtained multiple trunk sound images subjected to time alignment processing according to the number of the virtual sound images, wherein the number of the groups is the same as the number of the virtual sound images;
Step two: localizing each group of dry vocal audio onto its corresponding virtual sound image, different groups of dry vocal audio corresponding to different virtual sound images.
For ease of description, the two steps above are described together.
In the embodiment of the present application, after the dry vocal audio of a plurality of singers singing the same target song is obtained and subjected to time alignment processing, the multiple pieces of time-aligned dry vocal audio may be grouped according to the number of virtual sound images, the number of groups obtained being the same as the number of virtual sound images, with several pieces of dry vocal audio in the same group. If the number of obtained pieces of dry vocal audio is large, each piece may belong to only one group; if the number is small, the same piece may be placed into multiple groups, so that a fuller chorus effect is achieved.
After the multiple pieces of dry vocal audio are grouped, each group can be localized onto its corresponding virtual sound image, with different groups corresponding to different virtual sound images. Virtual sound image localization of the multiple pieces of dry vocal audio is thereby realized, and the chorus sound effect is enhanced.
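A minimal sketch of this grouping rule, assuming a simple round-robin assignment when tracks are plentiful and cyclic reuse when they are scarce; both policies are illustrative assumptions, since the text leaves the exact rule open.

```python
def group_tracks(tracks, num_images):
    """Split time-aligned dry vocal tracks into num_images groups,
    one group per virtual sound image."""
    groups = [[] for _ in range(num_images)]
    if len(tracks) >= num_images:
        for i, track in enumerate(tracks):
            groups[i % num_images].append(track)       # each track in one group only
    else:
        for g in range(num_images):
            groups[g].append(tracks[g % len(tracks)])  # reuse tracks to fill all groups
    return groups
```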
In one embodiment of the present application, synthesizing the lead vocal audio, the chorus audio and the corresponding accompaniment may include the following steps:
adjusting the volumes of the lead vocal audio and the chorus audio respectively, and/or performing reverberation simulation processing on the lead vocal audio and the chorus audio;
and synthesizing the lead vocal audio, the chorus audio and the corresponding accompaniment after the volume adjustment and/or reverberation simulation processing.
In the embodiment of the application, after the lead vocal audio sung based on the target song is acquired, the volumes of the lead vocal audio and the chorus audio can be adjusted respectively, so that the two are comparable in volume or the lead vocal audio is louder than the chorus audio. Reverberation simulation processing can also be performed on the lead vocal audio and the chorus audio to obtain an enveloping surround effect.
The lead vocal audio, the chorus audio and the corresponding accompaniment are then synthesized after the volume adjustment and/or reverberation simulation processing, so that the finally output grand chorus effect audio gives the user a better listening experience.
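A minimal mixing sketch under these rules; the gain values are illustrative assumptions, chosen only so that the lead vocal is no quieter than the chorus.

```python
import numpy as np

def mix(lead, chorus, accompaniment, lead_gain=1.0, chorus_gain=0.7, acc_gain=0.8):
    """Weighted superposition of lead vocal, chorus and accompaniment (1-D float arrays)."""
    n = min(len(lead), len(chorus), len(accompaniment))
    out = lead_gain * lead[:n] + chorus_gain * chorus[:n] + acc_gain * accompaniment[:n]
    peak = np.max(np.abs(out))
    return out / peak if peak > 1.0 else out  # normalize only if the sum would clip
```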
Corresponding to the above method embodiments, the present application further provides an apparatus for processing chorus audio; the apparatus described below and the method described above may be referred to in correspondence with each other.
Referring to fig. 11, the apparatus may include the following modules:
a dry vocal audio obtaining module 1110, configured to obtain the dry vocal audio of a plurality of singers singing the same target song;
a time alignment processing module 1120, configured to perform time alignment processing on the obtained multiple pieces of dry vocal audio;
a virtual sound image localization module 1130, configured to perform virtual sound image localization on the multiple pieces of time-aligned dry vocal audio so as to localize them onto a plurality of virtual sound images; the virtual sound images are located in a pre-established virtual sound image coordinate system centered on the human head, with the midpoint of the line connecting the left and right ears as the coordinate origin, the positive direction of a first coordinate axis pointing directly in front of the human head, the positive direction of a second coordinate axis pointing from the left ear toward the right ear, and the positive direction of a third coordinate axis pointing directly above the human head; the distance between each virtual sound image and the coordinate origin is within a set distance range, and the pitch angle of each virtual sound image relative to the plane formed by the first and second coordinate axes is within a set angle range;
a chorus audio generation module 1140, configured to generate chorus audio based on the multiple pieces of dry vocal audio subjected to virtual sound image localization;
a grand chorus effect audio obtaining module 1150, configured to, in a case where lead vocal audio sung based on the target song is obtained, synthesize the lead vocal audio, the chorus audio and the corresponding accompaniment, and output the resulting grand chorus effect audio.
The device provided by the embodiment of the application first obtains the dry vocal audio of a plurality of singers singing the same target song, performs time alignment processing on the obtained pieces of dry vocal audio, and performs virtual sound image localization on the aligned pieces of dry vocal audio so that they are localized onto a plurality of virtual sound images. The virtual sound images lie in a head-centered virtual sound image coordinate system, with their distances from the coordinate origin within a set range, surrounding the human ears. Chorus audio is generated based on the localized pieces of dry vocal audio, and, when lead vocal audio sung based on the target song is obtained, the lead vocal audio, the chorus audio and the corresponding accompaniment are synthesized to obtain and output the grand chorus effect audio. Because the pieces of dry vocal audio are localized onto virtual sound images surrounding the human ears, the generated chorus audio has an enveloping sound field; acoustically, the in-head effect that would arise if the sound field of the finally output grand chorus effect audio gathered at the center of the listener's head is effectively avoided, and the sound field is wider.
In a specific embodiment of the present application, the time alignment processing module 1120 is configured to:
determining the reference audio corresponding to the target song;
for each obtained piece of dry vocal audio, extracting the audio features of the current dry vocal audio and of the reference audio respectively, the audio features being fingerprint features or fundamental frequency features;
determining the time corresponding to the maximum of the audio feature similarity between the current dry vocal audio and the reference audio as the audio alignment time;
and performing time alignment processing on the current dry vocal audio based on the audio alignment time.
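A minimal alignment sketch using fundamental-frequency (F0) contours as the feature; fingerprint features would follow the same correlation scheme. The frame-based F0 inputs and the hop size are assumptions for illustration.

```python
import numpy as np
from scipy.signal import correlate

def alignment_time(dry_f0, ref_f0, hop_seconds):
    """Both inputs are per-frame F0 contours sampled every hop_seconds.
    Returns the offset (seconds) maximizing similarity with the reference."""
    corr = correlate(ref_f0 - ref_f0.mean(), dry_f0 - dry_f0.mean(), mode="full")
    lag = int(np.argmax(corr)) - (len(dry_f0) - 1)  # positive: delay the dry track
    return lag * hop_seconds
```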
In an embodiment of the present application, the apparatus further includes a bass data obtaining module, configured to:
respectively performing band-pass filtering processing on the obtained pieces of dry vocal audio to obtain multiple pieces of bass data;
accordingly, the chorus audio generation module 1140 is configured to:
and generating the chorus audio based on the multiple pieces of dry vocal audio subjected to virtual sound image localization and the multiple pieces of bass data.
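A minimal band-pass sketch for the bass data obtaining module above; the 60–150 Hz pass band and the filter order are illustrative assumptions, as the text does not fix them.

```python
from scipy.signal import butter, sosfilt

def bass_band(dry, sample_rate, low_hz=60.0, high_hz=150.0):
    """Band-pass filter one dry vocal track to obtain its bass data."""
    sos = butter(4, [low_hz, high_hz], btype="bandpass", fs=sample_rate, output="sos")
    return sosfilt(sos, dry)
```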
In a specific embodiment of the present application, the apparatus further includes a reverberation simulation processing module, configured to:
respectively performing reverberation simulation processing on the obtained pieces of dry vocal audio;
accordingly, the chorus audio generation module 1140 is configured to:
and generating the chorus audio based on the multiple pieces of dry vocal audio subjected to virtual sound image localization and the multiple pieces of dry vocal audio subjected to reverberation simulation processing.
In a specific embodiment of the present application, the reverberation simulation processing module is configured to:
and performing reverberation simulation processing on each of the obtained pieces of dry vocal audio by using a cascade of comb filters and all-pass filters.
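A minimal sketch of such a comb plus all-pass cascade, in the classic Schroeder arrangement that this combination suggests; the delay lengths, gains and dry/wet ratio are conventional illustrative values, not parameters taken from the patent.

```python
import numpy as np

def comb(x, delay, g):
    """Feedback comb filter: y[n] = x[n] + g * y[n - delay]."""
    y = np.copy(x).astype(float)
    for n in range(delay, len(y)):
        y[n] += g * y[n - delay]
    return y

def allpass(x, delay, g):
    """Schroeder all-pass: y[n] = -g*x[n] + x[n-delay] + g*y[n-delay]."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        xd = x[n - delay] if n >= delay else 0.0
        yd = y[n - delay] if n >= delay else 0.0
        y[n] = -g * x[n] + xd + g * yd
    return y

def reverb(x):
    wet = sum(comb(x, d, 0.84) for d in (1116, 1188, 1277, 1356))  # parallel combs
    for d, g in ((556, 0.7), (441, 0.7)):                          # series all-passes
        wet = allpass(wet, d, g)
    return 0.7 * np.asarray(x, dtype=float) + 0.3 * wet            # dry/wet mix
```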
In a specific embodiment of the present application, the reverberation simulation processing module is further configured to:
after virtual sound image localization is performed on the multiple pieces of time-aligned dry vocal audio, respectively performing reverberation simulation processing on the pieces of dry vocal audio subjected to virtual sound image localization;
accordingly, the chorus audio generation module 1140 is configured to:
and generating the chorus audio based on the multiple pieces of dry vocal audio subjected to virtual sound image localization and reverberation simulation processing.
In an embodiment of the present application, the apparatus further includes a binaural simulation processing module, configured to:
respectively performing binaural simulation processing on the obtained pieces of dry vocal audio;
accordingly, the chorus audio generation module 1140 is configured to:
and generating the chorus audio based on the multiple pieces of dry vocal audio subjected to virtual sound image localization and the multiple pieces of dry vocal audio subjected to binaural simulation processing.
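The patent does not spell out the binaural simulation internals at this point, so the following is only an assumed stand-in: a two-channel rendering built from an interaural time difference and level difference.

```python
import numpy as np

def binaural(dry, sample_rate, itd_ms=0.6, ild=0.8):
    """Render a mono dry vocal as a crude two-channel (binaural) signal."""
    delay = int(sample_rate * itd_ms / 1000.0)          # interaural time difference
    near = np.concatenate([dry, np.zeros(delay)])
    far = np.concatenate([np.zeros(delay), dry]) * ild  # delayed, attenuated ear
    return np.stack([near, far])                        # shape (2, N + delay)
```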
In a specific embodiment of the present application, the reverberation simulation processing module is further configured to:
after binaural simulation processing is respectively performed on the obtained pieces of dry vocal audio, performing reverberation simulation processing on the pieces of dry vocal audio subjected to binaural simulation processing;
accordingly, the chorus audio generation module 1140 is configured to:
and generating the chorus audio based on the multiple pieces of dry vocal audio subjected to virtual sound image localization and the multiple pieces of dry vocal audio subjected to binaural simulation processing and reverberation simulation processing.
In one embodiment of the present application, the virtual sound image localization module 1130 is configured to:
grouping the obtained multiple pieces of time-aligned dry vocal audio according to the number of virtual sound images, the number of groups being the same as the number of virtual sound images;
and localizing each group of dry vocal audio onto its corresponding virtual sound image, different groups of dry vocal audio corresponding to different virtual sound images.
In one embodiment of the present application, the elevation angle of a virtual sound image located behind the human head relative to the plane formed by the first and second coordinate axes is larger than the elevation angle of a virtual sound image located in front of the human head relative to that plane; alternatively, the virtual sound images are uniformly distributed about the plane formed by the first and second coordinate axes.
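A minimal sketch that generates virtual sound image positions in the head-centered coordinate system under the first rule; the distance, the 12-image count and the specific elevation angles are illustrative assumptions.

```python
import numpy as np

def virtual_images(radius=1.0, front_elev_deg=0.0, rear_elev_deg=15.0, count=12):
    """Positions (x, y, z): x points in front of the head, y from left ear to
    right ear, z directly above; rear images get the larger elevation."""
    positions = []
    for az_deg in np.arange(0.0, 360.0, 360.0 / count):
        az = np.radians(az_deg)
        elev = np.radians(front_elev_deg if np.cos(az) >= 0 else rear_elev_deg)
        x = radius * np.cos(elev) * np.cos(az)
        y = radius * np.cos(elev) * np.sin(az)
        z = radius * np.sin(elev)
        positions.append((x, y, z))
    return positions
```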
In one embodiment of the present application, the grand chorus effect audio obtaining module 1150 is configured to:
adjusting the volumes of the lead vocal audio and the chorus audio respectively, and/or performing reverberation simulation processing on the lead vocal audio and the chorus audio;
and synthesizing the lead vocal audio, the chorus audio and the corresponding accompaniment after the volume adjustment and/or reverberation simulation processing.
Corresponding to the above method embodiment, an embodiment of the present application further provides a device for processing chorus audio, including:
a memory for storing a computer program;
and a processor for implementing the steps of the above method for processing chorus audio when executing the computer program.
Fig. 12 is a schematic diagram of the composition structure of a chorus audio processing device, which may include: a processor 10, a memory 11, a communication interface 12 and a communication bus 13. The processor 10, the memory 11 and the communication interface 12 communicate with one another through the communication bus 13.
In the embodiment of the present application, the processor 10 may be a central processing unit (CPU), an application-specific integrated circuit, a digital signal processor, a field-programmable gate array, another programmable logic device, or the like.
The processor 10 may call a program stored in the memory 11, and in particular, the processor 10 may perform operations in an embodiment of a method of processing chorus audio.
The memory 11 is used for storing one or more programs, which may include program code comprising computer operation instructions. In this embodiment, the memory 11 stores at least a program implementing the following functions:
respectively obtaining the dry vocal audio of a plurality of singers singing the same target song;
performing time alignment processing on the obtained multiple pieces of dry vocal audio;
performing virtual sound image localization on the multiple pieces of time-aligned dry vocal audio so as to localize them onto a plurality of virtual sound images; the virtual sound images are located in a pre-established virtual sound image coordinate system centered on the human head, with the midpoint of the line connecting the left and right ears as the coordinate origin, the positive direction of a first coordinate axis pointing directly in front of the human head, the positive direction of a second coordinate axis pointing from the left ear toward the right ear, and the positive direction of a third coordinate axis pointing directly above the human head; the distance between each virtual sound image and the coordinate origin is within a set distance range, and the pitch angle of each virtual sound image relative to the plane formed by the first and second coordinate axes is within a set angle range;
generating chorus audio based on the multiple pieces of dry vocal audio subjected to virtual sound image localization;
and, in a case where lead vocal audio sung based on the target song is obtained, synthesizing the lead vocal audio, the chorus audio and the corresponding accompaniment and then outputting the grand chorus effect audio.
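For orientation, a sketch of the order in which a stored program might compose the functions listed above, reusing the illustrative helpers defined earlier (group_tracks, localize, reverb, mix); the whole pipeline, including the assumption of equal-length, already time-aligned tracks, is illustrative rather than the patent's own code.

```python
import numpy as np

def grand_chorus(dry_tracks, lead, accompaniment, hrirs, num_images=12):
    """dry_tracks: equal-length, time-aligned 1-D float arrays."""
    groups = group_tracks(dry_tracks, num_images)           # one group per image
    beds = [np.mean(g, axis=0) for g in groups]             # one bed signal per group
    chorus_lr = localize(beds, hrirs)                       # virtual sound image localization
    chorus_lr = np.stack([reverb(ch) for ch in chorus_lr])  # reverberation simulation
    return np.stack([mix(lead, ch, accompaniment) for ch in chorus_lr])  # stereo output
```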
In one possible implementation, the memory 11 may include a program storage area and a data storage area. The program storage area may store an operating system and the application programs required for at least one function (such as an audio playing function and an audio synthesizing function); the data storage area may store data created during use, such as sound image localization data and audio synthesis data.
Further, the memory 11 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device or another non-volatile solid-state storage device.
The communication interface 12 may be an interface of a communication module for connecting with other devices or systems.
Of course, it should be noted that the structure shown in fig. 12 does not limit the chorus audio processing device in the embodiment of the present application; in practical applications, the chorus audio processing device may include more or fewer components than those shown in fig. 12, or some components may be combined.
Corresponding to the above method embodiments, the present application further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the above method for processing chorus audio are implemented.
The embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or a combination of both. To clearly illustrate this interchangeability of hardware and software, the various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The principles and implementations of the present application are explained herein by using specific examples, and the above description of the embodiments is only intended to help in understanding the technical solution and the core idea of the present application. It should be noted that those skilled in the art may make several improvements and modifications to the present application without departing from its principles, and such improvements and modifications also fall within the scope of the claims of the present application.

Claims (13)

1. A method for processing chorus audio, comprising:
respectively obtaining the dry vocal audio of a plurality of singers singing the same target song;
performing time alignment processing on the obtained plurality of pieces of dry vocal audio;
performing virtual sound image localization on the plurality of pieces of time-aligned dry vocal audio so as to localize them onto a plurality of virtual sound images; wherein the virtual sound images are located in a pre-established virtual sound image coordinate system centered on the human head, with the midpoint of the line connecting the left and right ears as the coordinate origin, the positive direction of a first coordinate axis pointing directly in front of the human head, the positive direction of a second coordinate axis pointing from the left ear toward the right ear, and the positive direction of a third coordinate axis pointing directly above the human head; the distance between each virtual sound image and the coordinate origin is within a set distance range, and the pitch angle of each virtual sound image relative to the plane formed by the first coordinate axis and the second coordinate axis is within a set angle range;
generating chorus audio based on the plurality of pieces of dry vocal audio subjected to the virtual sound image localization;
and, in a case where lead vocal audio sung based on the target song is acquired, synthesizing the lead vocal audio, the chorus audio and the corresponding accompaniment, and outputting the resulting grand chorus effect audio.
2. The method for processing chorus audio according to claim 1, wherein the performing time alignment processing on the obtained plurality of pieces of dry vocal audio comprises:
determining the reference audio corresponding to the target song;
for each obtained piece of dry vocal audio, respectively extracting the audio features of the current dry vocal audio and of the reference audio, wherein the audio features are fingerprint features or fundamental frequency features;
determining the time corresponding to the maximum of the audio feature similarity between the current dry vocal audio and the reference audio as the audio alignment time;
and performing time alignment processing on the current dry vocal audio based on the audio alignment time.
3. The method for processing chorus audio according to claim 1, further comprising:
respectively performing band-pass filtering processing on the obtained plurality of pieces of dry vocal audio to obtain a plurality of pieces of bass data;
correspondingly, the generating chorus audio based on the plurality of pieces of dry vocal audio subjected to the virtual sound image localization comprises:
generating the chorus audio based on the plurality of pieces of dry vocal audio subjected to the virtual sound image localization and the plurality of pieces of bass data.
4. The method for processing chorus audio according to claim 1, further comprising:
respectively performing reverberation simulation processing on the obtained plurality of pieces of dry vocal audio;
correspondingly, the generating chorus audio based on the plurality of pieces of dry vocal audio subjected to the virtual sound image localization comprises:
generating the chorus audio based on the plurality of pieces of dry vocal audio subjected to the virtual sound image localization and the plurality of pieces of dry vocal audio subjected to the reverberation simulation processing.
5. The method for processing chorus audio according to claim 4, wherein the respectively performing reverberation simulation processing on the obtained plurality of pieces of dry vocal audio comprises:
respectively performing reverberation simulation processing on the obtained plurality of pieces of dry vocal audio by using a cascade of comb filters and all-pass filters.
6. The method for processing chorus audio according to claim 1, further comprising, after the performing virtual sound image localization on the plurality of pieces of time-aligned dry vocal audio:
respectively performing reverberation simulation processing on the plurality of pieces of dry vocal audio subjected to the virtual sound image localization;
correspondingly, the generating chorus audio based on the plurality of pieces of dry vocal audio subjected to the virtual sound image localization comprises:
generating the chorus audio based on the plurality of pieces of dry vocal audio subjected to the virtual sound image localization and the reverberation simulation processing.
7. The method for processing chorus audio according to claim 1, further comprising:
respectively performing binaural simulation processing on the obtained plurality of pieces of dry vocal audio;
correspondingly, the generating chorus audio based on the plurality of pieces of dry vocal audio subjected to the virtual sound image localization comprises:
generating the chorus audio based on the plurality of pieces of dry vocal audio subjected to the virtual sound image localization and the plurality of pieces of dry vocal audio subjected to the binaural simulation processing.
8. The method for processing chorus audio according to claim 7, further comprising, after the respectively performing binaural simulation processing on the obtained plurality of pieces of dry vocal audio:
performing reverberation simulation processing on the plurality of pieces of dry vocal audio subjected to the binaural simulation processing;
correspondingly, the generating chorus audio based on the plurality of pieces of dry vocal audio subjected to the virtual sound image localization and the plurality of pieces of dry vocal audio subjected to the binaural simulation processing comprises:
generating the chorus audio based on the plurality of pieces of dry vocal audio subjected to the virtual sound image localization and the plurality of pieces of dry vocal audio subjected to the binaural simulation processing and the reverberation simulation processing.
9. The method for processing chorus audio according to claim 1, wherein the performing virtual sound image localization on the plurality of pieces of time-aligned dry vocal audio comprises:
grouping the obtained plurality of pieces of time-aligned dry vocal audio according to the number of the virtual sound images, wherein the number of groups is the same as the number of the virtual sound images;
and localizing each group of dry vocal audio onto its corresponding virtual sound image, wherein different groups of dry vocal audio correspond to different virtual sound images.
10. The method for processing chorus audio according to claim 1, wherein
the elevation angle of a virtual sound image located behind the human head relative to the plane formed by the first coordinate axis and the second coordinate axis is larger than the elevation angle of a virtual sound image located in front of the human head relative to the plane formed by the first coordinate axis and the second coordinate axis;
or,
the virtual sound images are uniformly distributed about the plane formed by the first coordinate axis and the second coordinate axis.
11. The method for processing chorus audio according to any one of claims 1 to 10, wherein the synthesizing of the lead vocal audio, the chorus audio and the corresponding accompaniment comprises:
respectively adjusting the volumes of the lead vocal audio and the chorus audio, and/or performing reverberation simulation processing on the lead vocal audio and the chorus audio;
and synthesizing the lead vocal audio, the chorus audio and the corresponding accompaniment after the volume adjustment and/or reverberation simulation processing.
12. A device for processing chorus audio, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the method of processing chorus audio according to any one of claims 1 to 11 when executing said computer program.
13. A computer-readable storage medium, characterized in that a computer program is stored thereon, which computer program, when being executed by a processor, carries out the steps of the method of processing chorus audio according to any one of claims 1 to 11.
CN202110460280.4A 2021-04-27 2021-04-27 Chorus audio processing method, chorus audio processing equipment and storage medium Active CN113192486B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110460280.4A CN113192486B (en) 2021-04-27 2021-04-27 Chorus audio processing method, chorus audio processing equipment and storage medium
PCT/CN2022/087784 WO2022228220A1 (en) 2021-04-27 2022-04-20 Method and device for processing chorus audio, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110460280.4A CN113192486B (en) 2021-04-27 2021-04-27 Chorus audio processing method, chorus audio processing equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113192486A true CN113192486A (en) 2021-07-30
CN113192486B CN113192486B (en) 2024-01-09

Family

ID=76979435

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110460280.4A Active CN113192486B (en) 2021-04-27 2021-04-27 Chorus audio processing method, chorus audio processing equipment and storage medium

Country Status (2)

Country Link
CN (1) CN113192486B (en)
WO (1) WO2022228220A1 (en)


Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000333297A (en) * 1999-05-14 2000-11-30 Sound Vision:Kk Stereophonic sound generator, method for generating stereophonic sound, and medium storing stereophonic sound
CN105208039B (en) * 2015-10-10 2018-06-08 广州华多网络科技有限公司 The method and system of online concert cantata
CN106331977B (en) * 2016-08-22 2018-06-12 北京时代拓灵科技有限公司 A kind of virtual reality panorama acoustic processing method of network K songs
CN107422862B (en) * 2017-08-03 2021-01-15 嗨皮乐镜(北京)科技有限公司 Method for virtual image interaction in virtual reality scene
CN110379401A (en) * 2019-08-12 2019-10-25 黑盒子科技(北京)有限公司 A kind of music is virtually chorused system and method
CN113192486B (en) * 2021-04-27 2024-01-09 腾讯音乐娱乐科技(深圳)有限公司 Chorus audio processing method, chorus audio processing equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009044261A (en) * 2007-08-06 2009-02-26 Yamaha Corp Device for forming sound field
CN108269560A (en) * 2017-01-04 2018-07-10 北京酷我科技有限公司 A kind of speech synthesizing method and system
WO2020177190A1 (en) * 2019-03-01 2020-09-10 腾讯音乐娱乐科技(深圳)有限公司 Processing method, apparatus and device
CN111028818A (en) * 2019-11-14 2020-04-17 北京达佳互联信息技术有限公司 Chorus method, apparatus, electronic device and storage medium
CN110992970A (en) * 2019-12-13 2020-04-10 腾讯音乐娱乐科技(深圳)有限公司 Audio synthesis method and related device

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022228220A1 (en) * 2021-04-27 2022-11-03 腾讯音乐娱乐科技(深圳)有限公司 Method and device for processing chorus audio, and storage medium
WO2023109278A1 (en) * 2021-12-14 2023-06-22 腾讯音乐娱乐科技(深圳)有限公司 Accompaniment generation method, device, and storage medium
CN114363793A (en) * 2022-01-12 2022-04-15 厦门市思芯微科技有限公司 System and method for converting dual-channel audio into virtual surround 5.1-channel audio
CN114363793B (en) * 2022-01-12 2024-06-11 厦门市思芯微科技有限公司 System and method for converting double-channel audio into virtual surrounding 5.1-channel audio
CN114630145A (en) * 2022-03-17 2022-06-14 腾讯音乐娱乐科技(深圳)有限公司 Multimedia data synthesis method, equipment and storage medium
CN116170613A (en) * 2022-09-08 2023-05-26 腾讯音乐娱乐科技(深圳)有限公司 Audio stream processing method, computer device and computer program product

Also Published As

Publication number Publication date
CN113192486B (en) 2024-01-09
WO2022228220A1 (en) 2022-11-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant