CN115019753A - Audio processing method and device, electronic equipment and computer readable storage medium

Audio processing method and device, electronic equipment and computer readable storage medium

Info

Publication number
CN115019753A
Authority
CN
China
Prior art keywords
pitch
audio
signal frame
limit
signal
Prior art date
Legal status
Pending
Application number
CN202210614338.0A
Other languages
Chinese (zh)
Inventor
范欣悦
郑羲光
张晨
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202210614338.0A
Publication of CN115019753A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H: ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00: Details of electrophonic musical instruments
    • G10H1/02: Means for controlling the tone frequencies, e.g. attack or decay; Means for producing special musical effects, e.g. vibratos or glissandos
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H: ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00: Details of electrophonic musical instruments
    • G10H1/36: Accompaniment arrangements
    • G10H1/361: Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

The disclosure relates to an audio processing method and device, an electronic device and a computer readable storage medium. The audio processing method comprises the following steps: determining a pitch limit of each note in a human voice signal based on the fundamental frequency of each signal frame in the human voice signal in the audio to be processed; determining an initial pitch of each pitch limit based on the frequencies of the signal frames within that pitch limit; determining a target pitch of each pitch limit according to the scale pitches of the original audio corresponding to the audio to be processed; and performing sound modification processing on the signal frames in each pitch limit based on the relation between the target pitch and the initial pitch of that pitch limit, so as to obtain target audio. The present disclosure thereby addresses the problem of inaccurate sound correction in the related art.

Description

Audio processing method and device, electronic equipment and computer readable storage medium
Technical Field
The present disclosure relates to the field of audio and video processing, and in particular, to an audio processing method and apparatus, an electronic device, and a computer-readable storage medium.
Background
Sound modification (pitch correction) mainly refers to modifying the pitch of the voice in collected audio so that the pitch of the processed voice is more accurate than before processing, while masking some singing flaws such as voice cracks, unsteady breath and off-key singing.
Some audio processing software can automatically modify audio. Automatic sound modification depends strongly on MIDI (Musical Instrument Digital Interface) reference information: generally, the pitch of each signal frame of the audio is modified toward the reference pitch that the MIDI reference information assigns to that frame, so as to obtain the modified audio. In the absence of MIDI reference information corresponding to the audio requiring sound modification, however, it is difficult to obtain an accurate result.
Disclosure of Invention
The present disclosure provides an audio processing method and apparatus, an electronic device, and a computer-readable storage medium, so as to at least solve the problem of inaccurate sound correction in the related art.
According to a first aspect of embodiments of the present disclosure, there is provided an audio processing method, including: determining a pitch limit of each note in the human voice signal based on the fundamental frequency of each signal frame in the human voice signal in the audio to be processed; determining an initial pitch for each pitch boundary based on the frequency of the signal frame within each pitch boundary; determining a target pitch of each pitch boundary according to the scale pitch of the original audio corresponding to the audio to be processed; and respectively carrying out sound modification processing on the signal frames in each pitch limit based on the relation between the target pitch of each pitch limit and the initial pitch to obtain target audio.
Optionally, determining a pitch limit of each note in the human voice signal based on the fundamental frequency of each signal frame in the human voice signal in the audio to be processed, including: determining at least two signal frame pairs with fundamental frequency variation larger than a first preset value in a human voice signal in the audio to be processed, wherein each signal frame pair comprises a front adjacent signal frame and a rear adjacent signal frame in the human voice signal in the audio to be processed; the pitch limit of each note in the human voice signal is determined based on the time at which at least two signal frame pairs are located.
Optionally, determining at least two signal frame pairs in which the variation of the fundamental frequency in the human voice signal in the audio to be processed is greater than a preset value includes: for each signal frame in the human voice signal in the audio to be processed, carrying out differential processing on the fundamental frequency of the current signal frame and the fundamental frequency of the last signal frame of the current signal frame to obtain a differential processing result of the current signal frame; and taking the current signal frame which is greater than the first preset value in the difference processing result and the last signal frame of the current signal frame as one signal frame pair of at least two signal frame pairs.
Optionally, determining a target pitch for each pitch boundary according to the scale pitch of the original audio corresponding to the audio to be processed, includes: acquiring the scale pitch of an original audio corresponding to the audio to be processed; selecting the scale pitches with the difference degree from the initial pitch of each pitch limit smaller than a second preset value in the scale pitches, and determining the target pitch of each pitch limit.
Optionally, the obtaining of the scale pitch of the original audio corresponding to the audio to be processed includes: acquiring the tonal information of an original audio corresponding to the audio to be processed from a server; based on the tonal information, the scale pitch of the original audio is obtained.
Optionally, determining an initial pitch for each pitch boundary based on the frequency of the signal frame within each pitch boundary comprises: acquiring the average value of the frequencies of all the signal frames in each pitch limit as the frequency of each pitch limit; the frequency of each pitch limit is converted to the initial pitch of each pitch limit.
Optionally, based on a relationship between the target pitch and the initial pitch of each pitch boundary, performing a trimming process on the signal frames in each pitch boundary, respectively, to obtain a target audio, including: determining a rate of change of the target pitch from the initial pitch for each pitch boundary; and performing sound modification treatment on the pitches of all the signal frames in each pitch limit according to the change rate of each pitch limit to obtain the target audio.
According to a second aspect of the embodiments of the present disclosure, there is provided an audio processing apparatus including: a pitch limit determination unit configured to determine a pitch limit of each note in the human voice signal based on the fundamental frequency of each signal frame in the human voice signal in the audio to be processed; an initial pitch determination unit configured to determine an initial pitch for each pitch boundary in dependence on the frequency of the signal frame within each pitch boundary; a target pitch determination unit configured to determine a target pitch for each pitch limit according to a scale pitch of an original audio corresponding to the audio to be processed; and the processing unit is configured to respectively perform sound modification processing on the signal frames in each pitch limit based on the relation between the target pitch of each pitch limit and the initial pitch to obtain target audio.
Optionally, the pitch limit determining unit is further configured to determine at least two signal frame pairs in which the fundamental frequency variation in the human voice signal in the audio to be processed is greater than a first preset value, where each signal frame pair includes two adjacent signal frames in front and behind of the human voice signal in the audio to be processed; the pitch limit of each note in the human voice signal is determined based on the time at which at least two signal frame pairs are located.
Optionally, the pitch limit determining unit is further configured to, for each signal frame in the human voice signal in the audio to be processed, perform differential processing on the fundamental frequency of the current signal frame and the fundamental frequency of a previous signal frame of the current signal frame to obtain a differential processing result of the current signal frame; and taking the current signal frame which is greater than the first preset value in the difference processing result and the last signal frame of the current signal frame as one signal frame pair of at least two signal frame pairs.
Optionally, the target pitch determining unit is further configured to obtain a scale pitch of the original audio corresponding to the audio to be processed; selecting a scale pitch with a difference degree smaller than a second preset value from the initial pitch of each pitch limit in the scale pitches, and determining a target pitch of each pitch limit.
Optionally, the target pitch determining unit is further configured to obtain tonality information of the original audio corresponding to the audio to be processed from the server; based on the tonal information, the scale pitch of the original audio is obtained.
Optionally, the initial pitch determination unit is further configured to obtain an average value of the frequencies of all signal frames within each pitch limit as the frequency of each pitch limit; the frequency of each pitch limit is converted to the initial pitch of each pitch limit.
Optionally, the processing unit is further configured to determine a rate of change of the target pitch from the initial pitch for each pitch boundary; and modifying the pitches of all the signal frames in each pitch limit according to the change rate of each pitch limit to obtain the target audio.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement the audio processing method according to the present disclosure.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions, when executed by at least one processor, cause the at least one processor to perform an audio processing method as described above according to the present disclosure.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by a processor, implement an audio processing method according to the present disclosure.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
compared with the related art, in which a target pitch is set for each signal frame and each signal frame is adjusted independently, the audio processing method and device, the electronic equipment and the computer readable storage medium of the disclosure determine the target pitch based on the pitch limit of each note; that is, the disclosure adjusts the signal frames in units of pitch limits, thereby greatly reducing the processing traces typical of electronic sound effects, preserving the original jitter of the human voice and making the sound modification more natural.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is an implementation scenario diagram illustrating an audio processing method according to an exemplary embodiment of the present disclosure;
FIG. 2 is a flow diagram illustrating a method of audio processing according to an exemplary embodiment;
FIG. 3 is a diagram illustrating pitch of a segment of a human voice signal, according to an exemplary embodiment;
FIG. 4 is a system architecture diagram illustrating a method of audio processing in accordance with an exemplary embodiment;
FIG. 5 is a block diagram illustrating an audio processing device according to an exemplary embodiment;
fig. 6 is a block diagram of an electronic device 600 according to an embodiment of the disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The embodiments described in the following examples do not represent all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
It should be noted that the expression "at least one of the items" in the present disclosure covers three parallel cases: "any one of the items", "a combination of any plural ones of the items", and "all of the items". For example, "including at least one of A and B" covers the following three parallel cases: (1) including A; (2) including B; (3) including A and B. For another example, "performing at least one of step one and step two" covers the following three parallel cases: (1) performing step one; (2) performing step two; (3) performing step one and step two.
The current automatic sound modification functions depend strongly on a MIDI reference signal, so automatic sound modification cannot be performed when a song has no MIDI reference signal. In addition, although an electronic sound effect can correct the human voice to a certain extent, it brings many unnatural processing traces. The unnaturalness caused by the electronic sound effect mainly has two sources: the human voice has natural pitch jitter that varies strongly with the breath, and adjacent pitches are connected very tightly, which leads to fast gliding effects. Furthermore, with the advent of Karaoke software, the public can now experience Karaoke at home and share their songs on a platform for others to enjoy; however, not every user has the singing ability of a professional singer, so problems such as going off key, pitch drift, vibrato and voice cracks caused by unsteady breath are common, and for users lacking professional knowledge, manual sound modification is impractical.
In order to solve the above problems, the disclosure provides an audio processing method which can alleviate problems such as off-key singing, voice cracks and jitter caused by unsteady breath while preserving the naturalness of the human voice. The following description takes, as an example, a scene in which a recorded adapted song A is tuned.
Fig. 1 is a schematic diagram of an implementation scenario of an audio processing method according to an exemplary embodiment of the present disclosure. As shown in Fig. 1, the implementation scenario includes a server 100, a user terminal 110 and a user terminal 120, where the number of user terminals is not limited to two and the terminals include, but are not limited to, mobile phones, personal computers and the like. A user terminal may install an application program for recording sound. The server may be a single server, a server cluster composed of several servers, a cloud computing platform or a virtualization center.
The user terminal 110 or the user terminal 120 records the adapted song A sung by the singer and uploads the recording to the server 100. The server 100 determines a pitch limit of each note in the vocal signal based on the fundamental frequency of each signal frame in the vocal signal of the adapted song A; determines an initial pitch of each pitch limit based on the frequencies of the signal frames within that pitch limit; determines a target pitch of each pitch limit based on the scale pitches of the original song A corresponding to the adapted song A; and performs sound modification processing on the signal frames in each pitch limit based on the relation between the target pitch and the initial pitch of that pitch limit, so as to obtain target audio. After every signal frame of the vocal signal in the adapted song A has been processed, the tuned adapted song A is obtained.
Hereinafter, an audio processing method and apparatus according to an exemplary embodiment of the present disclosure will be described in detail with reference to the accompanying drawings.
Fig. 2 is a flowchart illustrating an audio processing method according to an exemplary embodiment, as shown in fig. 2, the audio processing method including the steps of:
in step S201, a pitch limit of each note in the human voice signal is determined based on the fundamental frequency of each signal frame in the human voice signal in the audio to be processed.
Specifically, taking the audio to be processed as recorded audio as an example, before determining the pitch limit (pitch boundary) of each note in the human voice signal, the fundamental frequency of each signal frame in the human voice signal of the recorded audio needs to be acquired. The fundamental frequency can be obtained with the YIN algorithm, whose basic principle is to search for the smallest positive period of the waveform, i.e. the shift at which the shifted signal best overlaps the original signal. It should be noted that the Sawtooth Waveform Inspired Pitch Estimator (SWIPE) algorithm, the Convolutional Representation for Pitch Estimation (CREPE) algorithm and the like may also be used to obtain the fundamental frequency, and the disclosure does not limit this.
The fundamental frequency obtained by the YIN algorithm is described as an example.
First, the original human voice signal is shifted by a shift amount τ to obtain a shifted signal; the shifted signal is then subtracted from the original signal, and the difference is squared and summed:
d_t(τ) = Σ_{i=1}^{W} (x_i − x_{i−τ})²    (1)
The above formula is called the difference function, where x_i is the original human voice signal, x_{i−τ} is the signal obtained by shifting the human voice signal by the shift amount τ, and W is the number of sampling points in one frame of the human voice signal; the shift amount τ corresponding to a valley of the difference function can represent the period at time t. To avoid interference, a cumulative mean normalized difference function (CMNDF) is defined on the basis of the difference function:
d′_t(τ) = 1, if τ = 0;  d′_t(τ) = d_t(τ) / [ (1/τ) Σ_{j=1}^{τ} d_t(j) ], otherwise    (2)
This is equivalent to normalizing the value of the difference function at τ by the average of the difference function to the left of τ, where d_t(τ) is determined according to equation (1) above. Usually, the lag of the first deep valley of the cumulative mean normalized difference function is taken as the period P; however, because real signals are not perfectly periodic, the YIN algorithm selects several valleys as candidates for each signal frame and models the transition behavior of the fundamental frequency with a Hidden Markov Model (HMM), so that the pitch track is as smooth as possible and octave errors (frequency doubling or halving) produced by individual frames are eliminated. Finally, the frequency of the waveform, i.e. the fundamental frequency, is obtained from the relationship between the period P and the sampling rate fs:
f0 = fs / P    (3)
To obtain the fundamental frequency sequence of the whole human voice, the voice is divided into frames and the fundamental frequency is detected frame by frame. The fundamental frequency of the human voice usually lies within 70 hertz (Hz) to 1400 hertz (Hz), and frequencies outside this range are regarded as noise.
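For illustration, the following is a minimal NumPy sketch of the frame-wise YIN-style estimation described above, assuming equations (1) to (3) and the 70 Hz to 1400 Hz voicing range; the function name, the CMNDF threshold of 0.1 and the omission of the HMM smoothing step are assumptions made for this sketch and are not taken from the patent.

```python
import numpy as np

def yin_f0(frame: np.ndarray, fs: int, f_min: float = 70.0, f_max: float = 1400.0,
           threshold: float = 0.1) -> float:
    """Estimate the fundamental frequency of one signal frame, or 0.0 if unvoiced."""
    w = len(frame) // 2                       # W: sampling points compared per lag
    tau_max = min(w, int(fs / f_min))         # longest period of interest
    tau_min = max(1, int(fs / f_max))         # shortest period of interest

    # Difference function, Eq. (1): d_t(tau) = sum_i (x_i - x_{i-tau})^2
    d = np.zeros(tau_max + 1)
    for tau in range(1, tau_max + 1):
        diff = frame[tau:tau + w] - frame[:w]
        d[tau] = float(np.dot(diff, diff))

    # Cumulative mean normalized difference function, Eq. (2)
    cmndf = np.ones(tau_max + 1)
    running_sum = 0.0
    for tau in range(1, tau_max + 1):
        running_sum += d[tau]
        cmndf[tau] = d[tau] * tau / running_sum if running_sum > 0 else 1.0

    # First deep valley below the threshold gives the period P (HMM smoothing omitted)
    for tau in range(tau_min, tau_max + 1):
        if cmndf[tau] < threshold:
            while tau + 1 <= tau_max and cmndf[tau + 1] < cmndf[tau]:
                tau += 1
            return fs / tau                   # Eq. (3): f0 = fs / P
    return 0.0                                # no period in the 70-1400 Hz range: treated as noise
```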
According to an exemplary embodiment of the present disclosure, determining a pitch limit of each note in a human voice signal based on a fundamental frequency of each signal frame in the human voice signal in audio to be processed may be implemented by: determining at least two signal frame pairs with fundamental frequency variation larger than a first preset value in a human voice signal in the audio to be processed, wherein each signal frame pair comprises a front adjacent signal frame and a rear adjacent signal frame in the human voice signal in the audio to be processed; the pitch limit of each note in the human voice signal is determined based on the time at which at least two signal frame pairs are located. The first preset value may be set according to a user requirement, which is not limited in the present disclosure. Because the fundamental frequency difference between the two notes is large, the place with excessively large fundamental frequency change in the human voice signal is used as the demarcation point of the pitch limit of the notes, and the relatively accurate pitch limit can be obtained.
For example, after the fundamental frequency sequence has been detected, the pitch limits can be found from it. The main melody of the human voice is composed of notes (i.e. the notes in the above embodiment), and each note lasts for a certain duration within which the pitch of the voice is relatively stable; the boundary of each note in the vocal signal therefore defines a pitch limit. As shown in fig. 3, the green line represents the actual pitch of the human voice and each red square represents the pitch limit of one note. Correcting the pitch of the signal frames within each pitch limit as a batch preserves the normal jitter of the human voice while keeping the pitch accurate to the ear. The present disclosure may use, as a demarcation point of a pitch limit, the time of either of two adjacent signal frames whose fundamental frequencies change excessively: for example, the time of the previous signal frame, the time of the next signal frame, or the midpoint between the two, which is not limited in the present disclosure.
According to an exemplary embodiment of the present disclosure, at least two signal frame pairs in which the fundamental frequency variation is greater than a preset value in a human voice signal in audio to be processed may be determined as follows: for each signal frame in the human voice signal in the audio to be processed, carrying out differential processing on the fundamental frequency of the current signal frame and the fundamental frequency of the last signal frame of the current signal frame to obtain a differential processing result of the current signal frame; and taking the current signal frame which is greater than the first preset value in the difference processing result and the last signal frame of the current signal frame as one signal frame pair of at least two signal frame pairs. According to the embodiment, two adjacent signal frames with greatly changed fundamental frequencies can be quickly and accurately found through differential processing of the signal frames.
In particular, the pitch limit of each note in the human voice signal can be determined by first performing a first-order difference on the detected fundamental frequencies f_n (it should be noted that the present disclosure is not limited to the first-order difference):
Δf_n = f_n − f_{n−1}    (4)
where f_n is the fundamental frequency of the current signal frame and f_{n−1} is the fundamental frequency of the last signal frame of the current signal frame. The value of |Δf_n| is then examined: when the fundamental frequency changes greatly and exceeds the first preset value, i.e. when the note changes at that moment, the corresponding time (e.g. the time of the current signal frame or the time of the last signal frame) is used as a demarcation point of the pitch limits. In order to speed up the determination, the times at which |Δf_n| peaks above the first preset value δ can be detected directly and taken as the demarcation points:
PB = { n : |Δf_n| > δ }    (5)
In the above equation, PB denotes the detected pitch limits, i.e. the set of their demarcation points.
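As a sketch of this step, the code below applies the first-order difference of equation (4) to a frame-wise fundamental frequency sequence and splits it at the frames where |Δf_n| exceeds the first preset value, as in equation (5); the function name and the example threshold of 15 Hz are assumptions for illustration only.

```python
import numpy as np

def detect_pitch_boundaries(f0: np.ndarray, delta: float = 15.0) -> list:
    """Return (start_frame, end_frame) index pairs, one pair per detected pitch limit."""
    diff = np.abs(np.diff(f0))                        # Eq. (4): first-order difference
    change_points = np.where(diff > delta)[0] + 1     # Eq. (5): demarcation frames PB

    boundaries = []
    start = 0
    for cp in change_points:
        if cp > start:
            boundaries.append((start, int(cp)))       # frames [start, cp) form one pitch limit
        start = int(cp)
    if start < len(f0):
        boundaries.append((start, len(f0)))
    return boundaries
```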
Returning to FIG. 2, in step S202, the initial pitch for each pitch limit is determined based on the frequency of the signal frame within each pitch limit.
According to an exemplary embodiment of the present disclosure, the average of the frequencies of all signal frames within each pitch limit may be obtained as the frequency of that pitch limit, and the frequency of each pitch limit is then converted into the initial pitch of that pitch limit. According to this embodiment, determining the frequency of each pitch limit as the average of the frequencies of all its signal frames and converting it into the pitch of the pitch limit speeds up obtaining the pitch of each pitch limit and reduces the difficulty of doing so.
It should be noted that the median of the frequencies of all signal frames within the current pitch limit may also be obtained as the frequency of the current pitch limit; of course, other manners are also possible, and the disclosure is not limited thereto.
For example, after the pitch limits are obtained, the average of the frequencies within each pitch limit may be taken as the frequency f of that pitch limit. Each frequency f can then be converted into a MIDI pitch, i.e. the initial pitch of the pitch limit, with the following conversion formula:
M = 69 + 12 × log2(f / 440)    (6)
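A short sketch of step S202 under the above description follows: the frequencies of the frames inside one pitch limit are averaged and the average is converted to a MIDI pitch with equation (6). The helper name and the choice to skip unvoiced frames (fundamental frequency 0) are assumptions made for this sketch.

```python
import numpy as np

def boundary_initial_pitch(f0: np.ndarray, boundary: tuple) -> float:
    """Initial pitch of one pitch limit: mean frame frequency converted via Eq. (6)."""
    start, end = boundary
    segment = f0[start:end]
    voiced = segment[segment > 0]              # ignore frames detected as unvoiced/noise
    if voiced.size == 0:
        return float("nan")                    # no usable frames in this pitch limit
    f = float(np.mean(voiced))                 # frequency of the pitch limit
    return 69.0 + 12.0 * np.log2(f / 440.0)    # Eq. (6): MIDI pitch M
```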
Returning to fig. 2, in step S203, the target pitch of each pitch limit is determined according to the scale pitches of the original audio corresponding to the audio to be processed.
According to an exemplary embodiment of the present disclosure, the scale pitches of the original audio corresponding to the audio to be processed may be acquired, and, from among the scale pitches, the scale pitch whose degree of difference from the initial pitch of each pitch limit is smaller than a second preset value is selected and determined as the target pitch of that pitch limit. According to this embodiment, selecting the scale pitch with the smaller difference as the target pitch of the pitch limit ensures the accuracy of the target pitch to the greatest extent.
For example, the degree of difference may be a difference value; for convenience, the absolute value of the difference may also be used, and the disclosure is not limited thereto. Taking the absolute difference as the degree of difference, the scale pitch of the original audio with the smallest absolute difference from the initial pitch of each pitch limit can be used as the target pitch of that pitch limit.
According to an exemplary embodiment of the present disclosure, acquiring a scale pitch of original audio corresponding to audio to be processed may include: the method comprises the steps of obtaining tonal information of original audio corresponding to audio to be processed from a server; based on the tonal information, the scale pitch of the original audio is obtained. According to the embodiment, the relatively accurate scale pitch can be acquired through the tonal information.
For example, taking the audio to be processed as recorded audio, before determining the target pitch of the current pitch limit based on the scale pitches of the original audio corresponding to the recorded audio, the scale pitches of the original audio may be obtained based on the tonality information and/or mode information of that original audio. Specifically, the tonality information of the original audio can be obtained from the server. Tonality, simply put, comprises 24 keys: each of the twelve tones within an octave can serve as the tonic of a key, giving twelve major keys and twelve minor keys, twenty-four keys in total. Each key has its fixed scale; for example, the scale of the C natural major key is arranged as C, D, E, F, G, A, B, and the scale of the A natural minor key is arranged as A, B, C, D, E, F, G, the A natural minor containing the same tones as the C natural major. The MIDI pitches of the corresponding scale from c1 to c6 are used as reference pitches and compared with the initial pitches of the pitch limits obtained from the vocal signal, and the reference pitch nearest to the initial pitch of each pitch limit is taken as the target pitch of that pitch limit.
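The sketch below illustrates step S203 as described above: the reference MIDI pitches of the key's scale are enumerated over roughly c1 to c6 and the one nearest to a pitch limit's initial pitch is chosen as its target pitch. The natural major and minor scale intervals, the MIDI range 24 to 84 and the function names are assumptions used only for illustration.

```python
import numpy as np

NATURAL_MAJOR = [0, 2, 4, 5, 7, 9, 11]          # scale degrees of a natural major key
NATURAL_MINOR = [0, 2, 3, 5, 7, 8, 10]          # scale degrees of a natural minor key

def scale_pitches(tonic_midi: int, minor: bool = False,
                  low: int = 24, high: int = 84) -> np.ndarray:
    """All reference MIDI pitches of the key within [low, high] (about c1 to c6)."""
    steps = NATURAL_MINOR if minor else NATURAL_MAJOR
    return np.array([p for p in range(low, high + 1)
                     if (p - tonic_midi) % 12 in steps], dtype=float)

def target_pitch(initial_pitch: float, reference: np.ndarray) -> float:
    """Reference scale pitch with the smallest absolute difference from the initial pitch."""
    return float(reference[np.argmin(np.abs(reference - initial_pitch))])
```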
Returning to fig. 2, in step S204, based on the relationship between the target pitch and the initial pitch of each pitch limit, the signal frames in each pitch limit are subjected to sound modification processing, so as to obtain the target audio.
According to an exemplary embodiment of the present disclosure, the rate of change from the initial pitch to the target pitch of each pitch limit may be determined first, and the pitches of all the signal frames in each pitch limit are then modified according to the change rate of that pitch limit to obtain the target audio. According to this embodiment, the signal frames are adjusted in units of pitch limits, so the processing traces typical of electronic sound effects can be greatly reduced and the original jitter of the human voice is preserved, increasing the naturalness of the sound modification.
For example, the change rate may be the ratio of the frequency of the target pitch to the frequency of the initial pitch, the ratio of the target pitch to the initial pitch, or the difference between the target pitch and the initial pitch, which is not limited in this disclosure. Taking the frequency ratio as the change rate, after the ratio between the frequencies of the target pitch and the initial pitch is calculated, the human voice signal is processed according to this ratio: specifically, the pitch of every signal frame within the pitch limit is adjusted by the same ratio, so that the frequencies in the human voice signal are moved to the target frequency, yielding a pitch-shifted human voice signal. The pitch shifting without time-scale modification may be implemented by a phase vocoder or by Pitch Synchronous Overlap-Add (PSOLA), or in other ways, which is not limited in this disclosure. For example, the ratio may be as follows:
ratio(n) = Target_frequency(n) / Boundary_frequency(n)    (7)
where n is the frame index, Boundary_frequency(n) is the frequency of the initial pitch, and Target_frequency(n) is the frequency of the target pitch.
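The following sketch applies the per-pitch-limit correction of equation (7): every frame inside a pitch limit is shifted by the same amount so that the limit's initial pitch lands on its target pitch. librosa's phase-vocoder based pitch_shift is used here as a stand-in for the phase vocoder or PSOLA processing mentioned above; the hop size, the frame-to-sample mapping and the function name are assumptions of this sketch.

```python
import numpy as np
import librosa

def tune_boundary(y: np.ndarray, sr: int, boundary: tuple,
                  initial_pitch: float, target: float, hop: int = 512) -> np.ndarray:
    """Shift all samples of one pitch limit so its initial pitch reaches the target pitch."""
    start = boundary[0] * hop
    end = min(boundary[1] * hop, len(y))
    if end <= start:
        return y
    # Semitone shift; equivalent to the frequency ratio of Eq. (7): 2 ** (n_steps / 12)
    n_steps = target - initial_pitch
    shifted = librosa.effects.pitch_shift(y[start:end], sr=sr, n_steps=n_steps)
    out = y.copy()
    out[start:end] = shifted[:end - start]     # pitch_shift preserves the segment length
    return out
```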
Fig. 4 is a system architecture diagram of an audio processing method according to an exemplary embodiment, still taking the adapted song A as the audio to be processed. As shown in fig. 4, first, the pitch of the human voice signal in the adapted song A (i.e. the audio in fig. 4), namely its time-varying frequency sequence, is obtained by a fundamental frequency detection algorithm. Second, after the vocal frequency sequence is obtained, the pitch limits are detected from the frequencies. Third, the average frequency of all signal frames in each pitch limit is computed as the initial frequency of that pitch limit and converted into the corresponding initial pitch. The mode and key of the original song A are obtained from the server, the scale pitches corresponding to the song are determined, and the scale pitch closest to the initial pitch of each pitch limit is searched for as the target pitch. Then, the ratio between the frequencies of the target pitch and the initial pitch is calculated, and the pitch of the human voice signal is adjusted according to this ratio using Pitch Synchronous Overlap-Add (PSOLA) or a similar method, so as to obtain the tuned adapted song A, i.e. the target audio in fig. 4, achieving a natural correction effect. The target pitch is determined in units of the pitch limits of the notes rather than per signal frame; that is, the present disclosure adjusts the signal frames in units of pitch limits instead of individually, so the vocal jitter within each pitch limit is preserved and the correction effect is much more natural than processing each signal frame separately.
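Putting the pieces together, a compact end-to-end sketch of the pipeline in fig. 4 might look as follows, composing the hypothetical helpers sketched above (yin_f0, detect_pitch_boundaries, boundary_initial_pitch, scale_pitches, target_pitch, tune_boundary); the frame and hop sizes and the assumption that the key's tonic is already known as a MIDI number are illustrative choices, not values from the patent.

```python
import numpy as np

def autotune(y: np.ndarray, sr: int, tonic_midi: int, minor: bool = False,
             frame: int = 2048, hop: int = 512) -> np.ndarray:
    """Tune a vocal track y toward the scale of the given key, pitch limit by pitch limit."""
    # 1) frame-wise fundamental frequency of the vocal signal
    f0 = np.array([yin_f0(y[i:i + frame], sr)
                   for i in range(0, max(len(y) - frame, 1), hop)])
    # 2) pitch limits from large fundamental-frequency changes
    reference = scale_pitches(tonic_midi, minor)       # scale pitches of the original key
    out = y.copy()
    for b in detect_pitch_boundaries(f0):
        init = boundary_initial_pitch(f0, b)           # 3) initial pitch of the limit
        if np.isnan(init):
            continue                                   # nothing voiced to correct here
        tgt = target_pitch(init, reference)            # 4) nearest scale pitch as target
        out = tune_boundary(out, sr, b, init, tgt, hop=hop)   # 5) shift the whole limit
    return out
```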
In summary, the present disclosure obtains different target pitches from the key of different songs delivered by the server and combines them with the singer's performance, so the pitch of the human voice signal can be adjusted in real time without a MIDI reference. To preserve the realism and naturalness of the human voice, pitch limits are detected and the signal frames are pitch-adjusted in units of pitch limits rather than being absolutely calibrated frame by frame. This greatly reduces the processing traces typical of electronic sound effects, preserves the original jitter of the human voice to increase the naturalness of the correction, alleviates defects such as off-key singing, voice cracks and unsteady breath, and remains compatible with creative adaptations sung by users with musical ability.
Fig. 5 is a block diagram illustrating an audio processing device according to an example embodiment. Referring to fig. 5, the apparatus includes:
a pitch limit determining unit 50 configured to determine a pitch limit of each note in the human voice signal based on the fundamental frequency of each signal frame in the human voice signal in the audio to be processed; an initial pitch determination unit 52 configured to determine an initial pitch for each pitch limit from the frequency of the signal frame within each pitch limit; a target pitch determination unit 54 configured to determine a target pitch for each pitch limit from the scale pitch of the original audio corresponding to the audio to be processed; and the processing unit 56 is configured to perform sound modification processing on the signal frames in each pitch limit respectively based on the relationship between the target pitch of each pitch limit and the initial pitch, so as to obtain the target audio.
According to the embodiment of the present disclosure, the pitch limit determining unit 50 is further configured to determine at least two signal frame pairs of which the fundamental frequency change is greater than a first preset value in the human voice signal in the audio to be processed, wherein each signal frame pair includes two adjacent signal frames in front of and behind the human voice signal in the audio to be processed; the pitch limit of each note in the human voice signal is determined based on the time at which at least two signal frame pairs are located.
According to the embodiment of the present disclosure, the pitch limit determining unit 50 is further configured to perform, for each signal frame in the human voice signal in the audio to be processed, a difference processing on the fundamental frequency of the current signal frame and the fundamental frequency of the last signal frame of the current signal frame to obtain a difference processing result of the current signal frame; and taking the current signal frame which is greater than the first preset value in the difference processing result and the last signal frame of the current signal frame as one signal frame pair of at least two signal frame pairs.
According to an embodiment of the present disclosure, the target pitch determination unit 54 is further configured to obtain a scale pitch of the original audio corresponding to the audio to be processed; selecting a scale pitch with a difference degree smaller than a second preset value from the initial pitch of each pitch limit in the scale pitches, and determining a target pitch of each pitch limit.
According to an embodiment of the present disclosure, the target pitch determining unit 54 is further configured to obtain tonality information of the original audio corresponding to the audio to be processed from the server; based on the tonal information, the scale pitch of the original audio is obtained.
According to an embodiment of the present disclosure, the initial pitch determination unit 52 is further configured to obtain an average value of the frequencies of all signal frames within each pitch limit as the frequency of each pitch limit, and to convert the frequency of each pitch limit into the initial pitch of each pitch limit.
According to an embodiment of the present disclosure, the processing unit 56 is further configured to determine a rate of change of the target pitch from the initial pitch for each pitch boundary; and modifying the pitches of all the signal frames in each pitch limit according to the change rate of each pitch limit to obtain the target audio.
According to an embodiment of the present disclosure, an electronic device may be provided. Fig. 6 is a block diagram of an electronic device 600 according to an embodiment of the disclosure. The electronic device 600 includes at least one memory 601 and at least one processor 602, and the at least one memory 601 stores a set of computer-executable instructions which, when executed by the at least one processor 602, perform an audio processing method according to an embodiment of the disclosure.
By way of example, the electronic device 600 may be a PC computer, tablet device, personal digital assistant, smartphone, or other device capable of executing the above set of instructions. The electronic device 600 need not be a single electronic device, and can be any collection of devices or circuits that can execute the above instructions (or instruction sets) individually or in combination. The electronic device 600 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In the electronic device 600, the processor 602 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processor 602 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, or the like.
The processor 602 may execute instructions or code stored in memory, where the memory 601 may also store data. The instructions and data may also be transmitted or received over a network via a network interface device, which may employ any known transmission protocol.
The memory 601 may be integrated with the processor 602, for example, with RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, memory 601 may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The memory 601 and the processor 602 may be operatively coupled or may communicate with each other, e.g., through I/O ports, network connections, etc., such that the processor 602 can read files stored in the memory 601.
Further, the electronic device 600 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device may be connected to each other via a bus and/or a network.
According to an embodiment of the present disclosure, there may also be provided a computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by at least one processor, cause the at least one processor to perform the audio processing method of the embodiments of the present disclosure. Examples of the computer-readable storage medium here include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, a hard disk drive (HDD), a solid-state drive (SSD), card-type memory (such as a multimedia card, a Secure Digital (SD) card or an eXtreme Digital (XD) card), magnetic tape, a floppy disk, a magneto-optical data storage device, an optical data storage device, and any other device configured to store a computer program and any associated data, data files and data structures in a non-transitory manner and to provide them to a processor or computer so that the processor or computer can execute the computer program. The computer program in the computer-readable storage medium described above can run in an environment deployed in computer equipment such as a client, a host, a proxy device or a server. Further, in one example, the computer program and any associated data, data files and data structures are distributed across networked computer systems so that the computer program and any associated data, data files and data structures are stored, accessed and executed in a distributed fashion by one or more processors or computers.
According to an embodiment of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by a processor, implement the audio processing method of the embodiment of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. An audio processing method, comprising:
determining a pitch limit of each note in a human voice signal based on the fundamental frequency of each signal frame in the human voice signal in the audio to be processed;
determining an initial pitch for each of the pitch limits based on the frequency of the signal frame within each of the pitch limits;
determining a target pitch of each pitch limit according to the scale pitch of the original audio corresponding to the audio to be processed;
and respectively carrying out sound modifying treatment on the signal frames in each pitch limit based on the relation between the target pitch of each pitch limit and the initial pitch to obtain target audio.
2. The audio processing method of claim 1, wherein determining a pitch bound for each note in a human voice signal in the audio to be processed based on a fundamental frequency of each signal frame in the human voice signal comprises:
determining at least two signal frame pairs of which the fundamental frequency change is larger than a first preset value in the human voice signal in the audio to be processed, wherein each signal frame pair comprises a front adjacent signal frame and a rear adjacent signal frame in the human voice signal in the audio to be processed;
determining a pitch limit for each note in the vocal signal based on the time at which the at least two signal frame pairs are located.
3. The audio processing method according to claim 2, wherein the determining at least two signal frame pairs in which the variation of the fundamental frequency in the human voice signal in the audio to be processed is greater than a preset value comprises:
for each signal frame in the human voice signals in the audio to be processed, carrying out differential processing on the fundamental frequency of the current signal frame and the fundamental frequency of the last signal frame of the current signal frame to obtain a differential processing result of the current signal frame;
and taking the current signal frame which is greater than the first preset value in the difference processing result and the last signal frame of the current signal frame as one signal frame pair of the at least two signal frame pairs.
4. The audio processing method of claim 1, wherein said determining a target pitch for each of the pitch limits from a scale pitch of an original audio corresponding to the audio to be processed comprises:
acquiring the scale pitch of the original audio corresponding to the audio to be processed;
selecting the scale pitches with the difference degree of the initial pitch of each pitch limit smaller than a second preset value from the scale pitches, and determining the target pitch of each pitch limit.
5. The audio processing method of claim 4, wherein the obtaining of the scale pitch of the original audio corresponding to the audio to be processed comprises:
acquiring the tonal information of the original audio corresponding to the audio to be processed from a server;
and acquiring the scale pitch of the original audio based on the tonal information.
6. The audio processing method of claim 1, wherein said determining an initial pitch of each of said pitch limits based on the frequency of the signal frame within each of said pitch limits comprises:
obtaining an average value of the frequencies of all signal frames within each pitch limit as the frequency of each pitch limit;
converting the frequency of each of the pitch limits to the initial pitch of each of the pitch limits.
7. An audio processing apparatus, comprising:
a pitch limit determination unit configured to determine a pitch limit of each note in a human voice signal in audio to be processed based on a fundamental frequency of each signal frame in the human voice signal;
an initial pitch determination unit configured to determine an initial pitch for each of the pitch limits based on the frequency of the signal frame within each of the pitch limits;
a target pitch determination unit configured to determine a target pitch for each of the pitch limits according to a scale pitch of an original audio corresponding to the audio to be processed;
and the processing unit is configured to respectively perform sound modifying processing on the signal frames in each pitch limit based on the relation between the target pitch of each pitch limit and the initial pitch to obtain target audio.
8. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the audio processing method of any of claims 1 to 6.
9. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by at least one processor, cause the at least one processor to perform the audio processing method of any of claims 1 to 6.
10. A computer program product comprising computer instructions, characterized in that the computer instructions, when executed by a processor, implement the audio processing method according to any of claims 1 to 6.
CN202210614338.0A 2022-05-31 2022-05-31 Audio processing method and device, electronic equipment and computer readable storage medium Pending CN115019753A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210614338.0A CN115019753A (en) 2022-05-31 2022-05-31 Audio processing method and device, electronic equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210614338.0A CN115019753A (en) 2022-05-31 2022-05-31 Audio processing method and device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN115019753A true CN115019753A (en) 2022-09-06

Family

ID=83071342

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210614338.0A Pending CN115019753A (en) 2022-05-31 2022-05-31 Audio processing method and device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN115019753A (en)

Similar Documents

Publication Publication Date Title
US10923089B2 (en) Method and apparatus for generating digital score file of song, and storage medium
US9672800B2 (en) Automatic composer
JP6669883B2 (en) Audio data processing method and apparatus
MX2011012749A (en) System and method of receiving, analyzing, and editing audio to create musical compositions.
WO2022105221A1 (en) Method and apparatus for aligning human voice with accompaniment
JP2019219638A (en) Music synthesis method, system, terminal and computer-readable storage medium
WO2020015411A1 (en) Method and device for training adaptation level evaluation model, and method and device for evaluating adaptation level
CN114073854A (en) Game method and system based on multimedia file
CN112309409A (en) Audio correction method and related device
CN111667803A (en) Audio processing method and related product
CN109410972B (en) Method, device and storage medium for generating sound effect parameters
CN112365868B (en) Sound processing method, device, electronic equipment and storage medium
CN113223485B (en) Training method of beat detection model, beat detection method and device
CN114897157A (en) Training and beat-to-beat joint detection method of beat-to-beat joint detection model
US20230186782A1 (en) Electronic device, method and computer program
CN113593594A (en) Training method and device of voice enhancement model and voice enhancement method and device
CN115019753A (en) Audio processing method and device, electronic equipment and computer readable storage medium
CN114664277A (en) Audio evaluation method and device
CN114038481A (en) Lyric timestamp generation method, apparatus, device and medium
CN112687247A (en) Audio alignment method and device, electronic equipment and storage medium
CN113555031A (en) Training method and device of voice enhancement model and voice enhancement method and device
CN114242094A (en) Audio processing method and device
CN111782868A (en) Audio processing method, device, equipment and medium
US20240221707A1 (en) Systems, methods, and computer program products for generating deliberate sequences of moods in musical compositions
CN114822492B (en) Speech synthesis method and device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination