CN107833581B

CN107833581B - Method, device and readable storage medium for extracting fundamental tone frequency of sound

Info

Publication number: CN107833581B
Application number: CN201710989739.3A
Authority: CN
Inventors: 劳振锋
Original assignee: Guangzhou Kugou Computer Technology Co Ltd
Current assignee: Guangzhou Kugou Computer Technology Co Ltd
Priority date: 2017-10-20
Filing date: 2017-10-20
Publication date: 2021-04-13
Anticipated expiration: 2037-10-20
Also published as: CN107833581A

Abstract

The invention discloses a method and a device for extracting the pitch frequency of a sound, and a readable storage medium. First, a sound signal to be detected is acquired and converted from the time domain to the frequency domain by a short-time Fourier transform. The frequency band range of the sound signal is then determined in the frequency domain, and the maximum harmonic number of the sound signal is determined from that range. Energy intensity detection is performed on each frequency point in the frequency band range, and the frequency point a with the maximum energy intensity is determined from the detection result. Finally, the frequency point a and the maximum harmonic number are used to judge whether frequency points to be detected that are maximum points exist; if so, such a frequency point may be the pitch frequency of the sound signal or a harmonic component of the pitch frequency, and the pitch frequency is extracted from the sound signal accordingly. The method achieves higher accuracy with lower algorithmic complexity.

Description

Method, device and readable storage medium for extracting fundamental tone frequency of sound

Technical Field

The present invention relates to the field of audio signal technology, and in particular, to a method, an apparatus, and a readable storage medium for extracting a pitch frequency of a sound.

Background

The fundamental tone (pitch) frequency is called the fundamental frequency for short. When a sounding body vibrates, the sound it produces can generally be decomposed into a number of pure sine waves; natural sounds are essentially composed of many sine waves of different frequencies, of which the sine wave with the lowest frequency is the fundamental tone and the sine waves with higher frequencies are harmonics. The pitch frequency is a basic feature that reflects the pitch of the human voice: for example, whether a singer's intonation is correct is generally judged from the pitch obtained by extracting the pitch frequency of the voice.

Existing pitch frequency detection methods include the time-domain autocorrelation method, the frequency-domain cepstrum method and the frequency-domain discrete wavelet transform method, but these methods suffer from complex algorithms, low detection accuracy and similar drawbacks. The pitch frequency detection method of the invention achieves higher accuracy with lower algorithmic complexity.

Disclosure of Invention

The main object of the invention is to provide a method, a device and a readable storage medium for extracting the pitch frequency of a sound, so as to solve the high algorithmic complexity and low detection accuracy of existing pitch frequency detection methods.

To achieve the above object, the present invention provides a method for extracting a pitch frequency of a voice, the method comprising the steps of:

acquiring a sound signal to be detected, and converting the sound signal to be detected from a time domain to a frequency domain through short-time Fourier transform;

determining the frequency band range of the sound signal to be detected from the frequency domain, and determining the maximum harmonic number of the sound signal to be detected according to the frequency band range;

respectively carrying out energy intensity detection on each frequency point in the frequency band range, and determining the frequency point a with the maximum energy intensity according to the intensity detection result;

and extracting the pitch frequency from the sound signal to be detected according to the frequency point a and the maximum harmonic number.

Preferably, the extracting of the pitch frequency from the sound signal to be detected according to the frequency point a and the maximum harmonic number specifically includes:

setting a variable n to the maximum harmonic number;

calculating a frequency point to be detected corresponding to the frequency point a according to the variable n;

judging whether each frequency point to be detected meets a first preset condition or not;

and when the frequency points to be detected do not meet the first preset condition, decrementing the variable n by 1 and returning to the step of calculating the frequency points to be detected corresponding to the frequency point a according to the variable n, until the frequency points to be detected meet the first preset condition, and then taking the quotient of the frequency point a and the variable n as the pitch frequency of the sound signal to be detected.

Preferably, the calculating the frequency point to be measured corresponding to the frequency point a according to the variable n specifically includes:

setting a variable m to 1;

calculating a frequency point f to be detected corresponding to the frequency point a according to formula (1);

increasing the variable m by 1 and calculating the frequency point to be detected corresponding to the frequency point a again according to formula (1), until m is equal to n-1, and taking all the calculated frequency points as the frequency points to be detected corresponding to the frequency point a;

wherein formula (1) is

$$f = \frac{a \times m}{n}$$

Preferably, after the frequency point to be measured corresponding to the frequency point a is calculated according to the variable n, the method further includes:

rounding each frequency point to be detected to the nearest integer.

Preferably, after decrementing the variable n by 1, the method further includes:

when the variable n is 2 and the frequency points to be detected do not meet the first preset condition, taking the absolute frequency value of the frequency point a as the pitch frequency of the sound signal to be detected.

Preferably, the determining whether each frequency point to be detected meets a first preset condition specifically includes:

comparing the absolute frequency values of the frequency points to be detected, and acquiring frequency domain energy corresponding to the frequency points to be detected when the comparison result meets a first preset state;

judging whether the frequency domain energy corresponding to each frequency point to be detected is a maximum value point or not;

when the frequency domain energy corresponding to each frequency point to be detected is a maximum point, selecting the frequency point f_min with the minimum absolute frequency value from the frequency points to be detected;

judging whether the frequency domain energy corresponding to the frequency point f_min is greater than a preset energy threshold; if so, determining that the frequency points to be detected meet the first preset condition, and if not, determining that the frequency points to be detected do not meet the first preset condition.

Preferably, the first preset state is:

the absolute frequency values of the frequency points to be detected increase as m increases, each of them is smaller than the absolute frequency value of the frequency point a, and each of them is greater than 1.

In addition, to achieve the above object, the present invention also provides an apparatus for extracting the pitch frequency of a sound, the apparatus comprising: a sound sensor for acquiring a sound signal to be detected, a memory, a processor, and a program for extracting the pitch frequency of a sound that is stored in the memory and executable on the processor, the program being configured to implement the steps of the method for extracting the pitch frequency of a sound as described above.

Furthermore, to achieve the above object, the present invention also provides a readable storage medium on which a program for extracting the pitch frequency of a sound is stored, the program, when executed by a processor, implementing the steps of the method for extracting the pitch frequency of a sound as described above.

In the invention, a sound signal to be detected is first acquired and converted from the time domain to the frequency domain by a short-time Fourier transform; the frequency band range of the sound signal to be detected is then determined from the frequency domain, and the maximum harmonic number of the sound signal to be detected is determined according to the frequency band range; energy intensity detection is carried out on each frequency point in the frequency band range, and the frequency point a with the maximum energy intensity is determined according to the intensity detection result; finally, the pitch frequency is extracted from the sound signal to be detected according to the frequency point a and the maximum harmonic number, so that the pitch frequency is extracted with higher accuracy at lower algorithmic complexity.

Drawings

FIG. 1 is a schematic diagram of an apparatus in a hardware operating environment according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating a first embodiment of a method for extracting the pitch frequency of a sound according to the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Referring to fig. 1, fig. 1 is a schematic device structure diagram of a hardware operating environment according to an embodiment of the present invention.

As shown in FIG. 1, the apparatus may include: a processor 1001 (such as a CPU), a communication bus 1002, a user interface 1003, a sound sensor 1004, and a memory 1005. The communication bus 1002 enables connection and communication between these components. The user interface 1003 may comprise a display screen, and may optionally also comprise a standard wired interface and a wireless interface. The sound sensor 1004 is used to acquire the sound signal to be detected. The memory 1005 may be a high-speed RAM or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.

Those skilled in the art will appreciate that the configuration shown in FIG. 1 does not limit the apparatus described herein, which may include more or fewer components than those shown, may combine some components, or may arrange the components differently.

As shown in FIG. 1, the memory 1005, as a storage medium, may include an operating system, a sound signal acquiring module, a user interface module, and a program for extracting the pitch frequency of a sound.

In the apparatus of the present invention, the processor 1001 calls the program for extracting the pitch frequency of a sound stored in the memory 1005 and performs the following operations:

Further, the processor 1001 may call the program for extracting the pitch frequency of a sound stored in the memory 1005 and also perform the following operations:

setting a variable n to the maximum harmonic number;

and when the frequency points to be detected do not meet the first preset condition, decrementing the variable n by 1 and returning to the step of calculating the frequency points to be detected corresponding to the frequency point a according to the variable n, until the frequency points to be detected meet the first preset condition, and then taking the quotient of the frequency point a and the variable n as the pitch frequency of the sound signal to be detected.

setting a variable m to 1;

wherein formula (1) is

$$f = \frac{a \times m}{n}$$

rounding each frequency point to be detected to the nearest integer.

and when the variable n is 2 and the frequency points to be detected do not meet the first preset condition, taking the absolute frequency value of the frequency point a as the pitch frequency of the sound signal to be detected.

judging whether the frequency domain energy of each frequency point to be detected is a maximum point;

judging whether the frequency domain energy corresponding to the frequency point f_min is greater than a preset energy threshold; if so, determining that the frequency points to be detected meet the first preset condition, and if not, determining that the frequency points to be detected do not meet the first preset condition.

In this embodiment, a sound signal to be detected is first acquired and converted from the time domain to the frequency domain by a short-time Fourier transform; the frequency band range of the sound signal to be detected is then determined from the frequency domain, and the maximum harmonic number of the sound signal to be detected is determined according to the frequency band range; energy intensity detection is carried out on each frequency point in the frequency band range, and the frequency point a with the maximum energy intensity is determined according to the intensity detection result; finally, the frequency point a and the maximum harmonic number are used to judge whether frequency points to be detected that are maximum points exist; if so, such a frequency point may be the pitch frequency of the sound signal to be detected or a harmonic component of the pitch frequency, and the pitch frequency is extracted from the sound signal to be detected accordingly.

Based on the hardware structure, the embodiment of the method for extracting the pitch frequency of the voice is provided.

Referring to fig. 2, fig. 2 is a flowchart illustrating a first embodiment of a method for extracting a pitch frequency of a voice according to the present invention.

In this embodiment, the method includes the steps of:

step S10: acquiring a sound signal to be detected, and converting the sound signal to be detected from a time domain to a frequency domain through short-time Fourier transform;

It should be noted that, in this embodiment, the processor of the above apparatus is the executing entity.

In a specific implementation of this embodiment, the sound signal to be detected is a digital audio signal processed in frames of 1024 points with a step of 512 points. That is, a 1024-point short-time Fourier transform is first applied to the acquired human voice signal, which yields 512 effective frequency values, and the index of each point corresponds to a frequency value. The human voice frequency band is typically 80-1200 Hz; for example, when the sampling rate of the audio signal is 44100 Hz, each index spans 44100/1024 ≈ 43.07 Hz, so the corresponding frequency point index range is 2-27. In this embodiment, the sound signal to be detected is preferably converted from the time domain to the frequency domain by a short-time Fourier transform, so that the signal within each frame is relatively stable.
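As a rough illustration of the framing and index arithmetic described above, the following Python sketch computes the frame start positions for 1024-point frames with a step of 512, and the index range covering 80-1200 Hz. The function names, and the choice to round the lower band edge up and the upper band edge down, are assumptions made for illustration; with those assumptions the sketch reproduces the 2-27 range quoted for a 44100 Hz sampling rate.

```python
import math


def frame_starts(num_samples, frame_len=1024, step=512):
    """Start offsets of successive analysis frames (hypothetical helper)."""
    return list(range(0, num_samples - frame_len + 1, step))


def voice_index_range(sample_rate, frame_len=1024, f_low=80.0, f_high=1200.0):
    """Index range of the voice band for a frame_len-point transform.

    Assumption: index k corresponds to k * sample_rate / frame_len Hz; the
    lower edge is rounded up and the upper edge is rounded down.
    """
    resolution = sample_rate / frame_len  # Hz per index
    return math.ceil(f_low / resolution), math.floor(f_high / resolution)


print(frame_starts(2048))        # [0, 512, 1024]
print(voice_index_range(44100))  # (2, 27)
```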

Step S20: determining the frequency band range of the sound signal to be detected from the frequency domain, and determining the maximum harmonic number of the sound signal to be detected according to the frequency band range;

Understandably, the frequency band range (i.e., the frequency band) of the sound signal to be detected is determined first, and the maximum harmonic number is determined from it. For example, the human voice frequency band is generally 80-1200 Hz; the index range corresponding to this band is determined according to the sampling rate of the audio signal, and the maximum allowable harmonic order (i.e., the maximum harmonic number) of the sound signal can be determined according to the index range. Since there are generally at most 4 harmonics within the human voice index range, the maximum harmonic number in this embodiment is 4.

Step S30: respectively carrying out energy intensity detection on each frequency point in the frequency band range, and determining the frequency point a with the maximum energy intensity according to an intensity detection result;

In a specific implementation, the frequency point a corresponding to the maximum energy value is found within the human voice index range; this frequency point is either the fundamental frequency or one of its harmonic components. It can be understood that natural sounds are essentially composed of many sine waves of different frequencies, of which the sine wave with the lowest frequency is the fundamental tone and the sine waves with higher frequencies are harmonics. Performing energy intensity detection on each frequency point in the frequency band range and determining the frequency point a with the maximum energy intensity therefore narrows the search to a range close to the true pitch value that is finally extracted.
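The search for the frequency point a can be sketched as a simple maximum over the voice index range. The helper name `strongest_index` and the toy energy values are hypothetical, and treating the energy intensity as a per-index number such as the squared spectral magnitude is an assumption, since the text does not define it further.

```python
def strongest_index(energy, low, high):
    """Index with the largest energy inside [low, high] (inclusive)."""
    return max(range(low, high + 1), key=lambda k: energy[k])


# Toy spectrum: one energy value per index (assumed to be e.g. |X[k]|**2).
energy = [0.0] * 32
energy[7], energy[14], energy[30] = 2.0, 5.0, 50.0  # index 30 lies outside the band
a = strongest_index(energy, 2, 27)
print(a)  # 14
```

Restricting the search to the band keeps a strong component outside 80-1200 Hz (index 30 in the toy data) from being selected.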

Step S40: extracting the pitch frequency from the sound signal to be detected according to the frequency point a and the maximum harmonic number.

It can be understood that the frequency point a corresponding to the maximum energy value is found in the frequency band of the sound to be detected, and the frequency point a is first assumed to be the n-th harmonic component of the fundamental frequency (for example, with n = 4, the frequency point a is assumed to be the 4th harmonic component of the pitch frequency). It is then checked whether maximum points (i.e., wave peaks or wave troughs of the waveform) that meet the first preset condition exist at 1/n, 2/n, ..., (n-1)/n of the frequency point a. It should be noted that the points at 1/n, 2/n, ..., (n-1)/n of the frequency point a are collectively referred to as the frequency points to be detected of the frequency point a. If they exist, the frequency point to be detected at 1/n of the frequency point a is the true fundamental frequency (i.e., the pitch frequency), and the frequency point a is the n-th harmonic of the fundamental frequency. Otherwise, the frequency point a is assumed to be the (n-1)-th harmonic of the fundamental frequency, and whether the fundamental frequency point can be found is judged in the same way; if no fundamental frequency point has been found by the time n = 2 has been tested, the frequency point a itself is judged to be the true fundamental frequency.

In a specific implementation, this embodiment preferably uses a double loop to extract the pitch frequency from the sound signal to be detected, in which the loop over the variable n is the outer loop. Step S40 can be divided into three sub-steps:

Sub-step one: setting a variable n to the maximum harmonic number;

Sub-step two: calculating the frequency points to be detected corresponding to the frequency point a according to the variable n, and judging whether the frequency points to be detected meet a first preset condition;

In a specific implementation, the frequency point a is assumed to be the n-th harmonic component of the fundamental frequency, where n is a variable whose initial loop value is the maximum harmonic number. Generally there are at most 4 harmonics within the human voice index range, so the maximum harmonic number in this embodiment is 4, and the frequency point a is initially assumed to be the 4th harmonic component of the fundamental frequency.

It should be noted that judging whether the frequency points to be detected meet the first preset condition is the inner loop, which uses m as its loop variable.

In a specific implementation, the frequency point a is assumed to be the 4th harmonic component of the fundamental frequency (since the maximum harmonic number is 4 in this embodiment), so m takes the values 1, 2 and 3. The frequency points to be detected a_nm corresponding to the frequency point a are then found: when m = 1 and n = 4, the frequency point to be detected is denoted a41; when m = 2 and n = 4, it is denoted a42; when m = 3 and n = 4, it is denoted a43. The absolute frequency value f of each frequency point to be detected is calculated according to formula (1):

$$f = \frac{a \times m}{n} \qquad (1)$$

Preferably, to make the measurement result more accurate, formula (1) is further refined into formula (2):

$$a_{nm} = \mathrm{round}\!\left(\frac{a \times m}{n}\right) \qquad (2)$$

wherein round(·) rounds (a × m)/n to the nearest integer, and in this embodiment, when n is 4, m takes the values 1, 2 and 3.
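The candidate calculation, that is, each frequency point to be detected taken as the rounded value of m × a / n for m = 1 to n-1, can be sketched as follows. The function name `candidate_points` is hypothetical, and rounding halves upward is an assumption: the text only says the values are rounded, and Python's built-in `round()` would instead round halves to the nearest even integer.

```python
def candidate_points(a, n):
    """Frequency points to be detected a_nm = round(m * a / n), m = 1..n-1.

    Assumption: halves are rounded up. Integer arithmetic is used so the
    result does not depend on floating-point representation.
    """
    return [(2 * a * m + n) // (2 * n) for m in range(1, n)]


print(candidate_points(20, 4))  # [5, 10, 15]
print(candidate_points(27, 4))  # [7, 14, 20]
```

For a = 27 and n = 4 the unrounded values are 6.75, 13.5 and 20.25, which shows why the rounding rule matters for the middle candidate.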

The frequency points to be detected calculated according to formula (2) are a41 = round(a/4), a42 = round(2 × a/4) and a43 = round(3 × a/4). The absolute frequency values of the frequency points to be detected are compared, and when the comparison result meets the first preset state (in which the absolute frequency values of the frequency points to be detected of the frequency point a increase as m increases, are each smaller than the absolute frequency value of the frequency point a, and are each greater than 1), the frequency domain energies s(a41), s(a42) and s(a43) corresponding to the frequency points to be detected are obtained. That is, the comparison result should satisfy a > a43 > a42 > a41, a41 > 1, a42 > 1 and a43 > 1 (the fundamental frequency point should lie within the human voice band). It is then judged whether a41, a42 and a43 are maximum points, that is, whether the corresponding frequency domain energies s(a41), s(a42) and s(a43) satisfy the following model:

If the frequency domain energies s(a41), s(a42) and s(a43) satisfy the above model, a41, a42 and a43 are shown to be maximum points, and the frequency point a can be presumed to be a harmonic of the fundamental frequency. The frequency point f_min with the minimum absolute frequency value is then selected from the frequency points to be detected (here a41); if s(a41) is greater than the preset energy threshold, the frequency point a41 to be detected is determined to be the fundamental frequency point.
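The check of the first preset condition can be sketched in Python as below. This is only an illustration under stated assumptions: the maximum-point model is not reproduced in the text, so `is_maximum_point` assumes the simplest test (the energy at an index exceeds both neighbouring indices), and the function names, toy energies and threshold values are hypothetical.

```python
def is_maximum_point(energy, k):
    """Assumed test: energy[k] exceeds the energy at both neighbouring indices."""
    return 0 < k < len(energy) - 1 and energy[k - 1] < energy[k] > energy[k + 1]


def meets_first_condition(energy, a, points, threshold):
    """Check the first preset condition for the frequency points to be detected."""
    # First preset state: strictly increasing with m, each one in (1, a).
    increasing = all(p < q for p, q in zip(points, points[1:]))
    if not increasing or not all(1 < p < a for p in points):
        return False
    # Every frequency point to be detected must be a maximum point.
    if not all(is_maximum_point(energy, p) for p in points):
        return False
    # The lowest point f_min must exceed the preset energy threshold.
    return energy[min(points)] > threshold


energy = [0.0] * 32
for index, value in ((5, 3.0), (10, 4.0), (15, 5.0), (20, 9.0)):
    energy[index] = value

print(meets_first_condition(energy, 20, [5, 10, 15], threshold=1.0))  # True
print(meets_first_condition(energy, 20, [5, 10, 15], threshold=3.5))  # False
```

In the second call all three points are still maximum points, but the energy at f_min = 5 is below the threshold, so the condition fails.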

Sub-step three: when the frequency points to be detected do not meet the first preset condition, decrementing the variable n by 1 and returning to the step of calculating the frequency points to be detected corresponding to the frequency point a according to the variable n, until the frequency points to be detected meet the first preset condition, and then taking the quotient of the frequency point a and n as the true pitch frequency of the sound signal to be detected.

It can be understood that, in sub-step three, if s(a41) is determined to be smaller than the preset energy threshold, the outer loop (over the variable n) proceeds: the frequency point a is next assumed to be the 3rd harmonic (n = 3), and whether the true fundamental frequency point can be found is judged in the same way. If no fundamental frequency point has been found even after a is assumed to be the 2nd harmonic (n = 2), a is directly determined to be the true fundamental frequency point.
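Putting the three sub-steps together, the double loop can be sketched as follows. This is a minimal sketch under the same assumptions as the description leaves open: the maximum-point test (energy larger than both neighbours), rounding halves upward, the function name `extract_pitch_index`, and the default threshold are all hypothetical choices rather than values given in the text. The result is returned as an index; multiplying by the sampling rate divided by the frame length would convert it to Hz.

```python
def extract_pitch_index(energy, low, high, max_harmonic=4, threshold=0.0):
    """Return the estimated pitch as a (possibly fractional) spectrum index."""

    def is_peak(k):  # assumed maximum-point test
        return 0 < k < len(energy) - 1 and energy[k - 1] < energy[k] > energy[k + 1]

    # Frequency point a: strongest index inside the band.
    a = max(range(low, high + 1), key=lambda k: energy[k])

    for n in range(max_harmonic, 1, -1):  # outer loop: n = max_harmonic, ..., 2
        # Inner loop over m = 1..n-1: frequency points to be detected.
        points = [(2 * a * m + n) // (2 * n) for m in range(1, n)]
        ok = (
            all(p < q for p, q in zip(points, points[1:]))
            and all(1 < p < a for p in points)
            and all(is_peak(p) for p in points)
            and energy[points[0]] > threshold  # points[0] is f_min
        )
        if ok:
            return a / n  # a is the n-th harmonic of the pitch
    return float(a)  # no lower harmonic structure found: a is the pitch


harmonic = [0.0] * 32
for index, value in ((5, 3.0), (10, 4.0), (15, 5.0), (20, 9.0)):
    harmonic[index] = value
print(extract_pitch_index(harmonic, 2, 27, threshold=1.0))  # 5.0

single = [0.0] * 32
single[12] = 9.0
print(extract_pitch_index(single, 2, 27, threshold=1.0))  # 12.0
```

In the first toy spectrum the strongest index 20 is accepted as the 4th harmonic, giving 5.0; in the second no candidate is a maximum point for n = 4, 3 or 2, so the strongest index itself is returned.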

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of software products stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk), and including instructions for causing a device (e.g., a mobile phone, a server, an air conditioner, or a network device) to perform the methods according to the embodiments of the present invention.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A method of extracting a pitch frequency of a sound, the method comprising:

extracting the pitch frequency from the sound signal to be detected according to the frequency point a and the maximum harmonic number;

wherein the step of extracting the pitch frequency from the sound signal to be detected according to the frequency point a and the maximum harmonic number comprises:

acquiring all frequency points to be detected corresponding to the frequency point a, wherein the frequency points to be detected are m/n times the frequency point a, and the value range of the variable m is [1, n-1];

judging whether maximum points meeting a first preset condition exist among the frequency points to be detected;

if so, taking the frequency point to be detected that is 1/n of the frequency point a as the pitch frequency;

if not, subtracting one from the variable n and returning to the step of acquiring all the frequency points to be detected corresponding to the frequency point a, and if the pitch frequency has not been obtained by the time n is 2, taking the frequency point a as the pitch frequency.

2. The method according to claim 1, wherein said extracting a pitch frequency from the audio signal to be detected according to the frequency point a and the maximum harmonic number specifically comprises:

setting a variable n to the maximum harmonic number;

when the frequency points to be detected do not meet the first preset condition, decrementing the variable n by 1 and returning to the step of calculating the frequency points to be detected corresponding to the frequency point a according to the variable n, until the frequency points to be detected meet the first preset condition, and then taking the quotient of the frequency point a and the variable n as the pitch frequency of the sound signal to be detected;

the step of calculating the frequency point to be measured corresponding to the frequency point a according to the variable n comprises the following steps:

calculating all frequency points to be measured corresponding to the frequency point a according to a formula (1), wherein the formula (1) is as follows:

$f = (a \times m)/n$, where the variable m has a value range of [1, n-1];

the first preset condition is that the absolute frequency value of each frequency point to be detected is in an increasing state along with the increasing of m, the frequency domain energy corresponding to each frequency point to be detected is a maximum value point, and the frequency domain energy corresponding to the frequency point with the minimum absolute frequency value in each frequency point to be detected is larger than a preset energy threshold value.

3. The method according to claim 2, wherein the calculating the frequency point to be measured corresponding to the frequency point a according to the variable n specifically comprises:

setting a variable m to 1;

wherein formula (1) is $f = (a \times m)/n$;

after the frequency points to be detected corresponding to the frequency point a are calculated according to the variable n, the method further comprises:

rounding each frequency point to be detected to the nearest integer.

4. The method of claim 2, wherein after the self-decreasing the variable n by 1, the method further comprises:

5. The method according to claim 3 or 4, wherein the determining whether each frequency point to be detected meets a first preset condition specifically comprises:

judging whether the frequency domain energy corresponding to the frequency point f_min is greater than a preset energy threshold; if so, determining that the frequency points to be detected meet the first preset condition, and if not, determining that the frequency points to be detected do not meet the first preset condition;

the first preset state is as follows: and the absolute frequency value of each frequency point to be measured presents an increasing state along with the increment of m.

6. The method according to claim 4, wherein the first preset state specifically comprises: and the absolute frequency value of each frequency point to be detected is smaller than that of the frequency point a, and the absolute frequency value of each frequency point to be detected is greater than 1.

7. An apparatus for extracting a pitch frequency of a sound, the apparatus comprising: a sound sensor for acquiring a sound signal to be detected, a memory, a processor and a program for extracting a pitch frequency of a sound stored on the memory and executable on the processor, the program for extracting a pitch frequency of a sound being configured to implement the steps of the method for extracting a pitch frequency of a sound according to any one of claims 1 to 6.

8. A readable storage medium having stored thereon a program for extracting the pitch frequency of a sound, the program, when executed by a processor, implementing the steps of the method for extracting the pitch frequency of a sound according to any one of claims 1 to 6.