CN112863541A - Audio cutting method and system based on clustering and median convergence - Google Patents

Info

Publication number
CN112863541A
Authority
CN
China
Prior art keywords
clustering
result
difference vector
order difference
median
Prior art date
Legal status
Granted
Application number
CN202011614821.6A
Other languages
Chinese (zh)
Other versions
CN112863541B (en)
Inventor
刘培
王颖蕊
陈坚
陈绪水
何魁伟
Current Assignee
Fuzhou Institute Of Data Technology Co ltd
Original Assignee
Fuzhou Institute Of Data Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Fuzhou Institute Of Data Technology Co ltd
Priority to CN202011614821.6A
Publication of CN112863541A
Application granted
Publication of CN112863541B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 characterised by the analysis technique
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations
    • G06F17/14: Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G06F17/141: Discrete Fourier transforms
    • G06F17/142: Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G06F18/232: Non-hierarchical techniques
    • G06F18/2321: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213: Non-hierarchical techniques using statistics or function optimisation with a fixed number of clusters, e.g. K-means clustering
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 characterised by the type of extracted parameters
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 specially adapted for particular use

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Mathematical Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Discrete Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The invention relates to the technical field of audio and video processing, and in particular to an audio cutting method and system based on clustering and median convergence. The method comprises the following steps: filtering an input audio signal; calculating a spectrogram matrix of the filtered audio signal; clustering the spectrogram matrix into two classes with K-means to obtain cluster labels; and performing convergence correction on the cluster labels and outputting the cut-point sequence coordinates. Throughout, the method is self-contained and does not depend on data from external systems; its computational load is small, the cut points are placed accurately, it tolerates interference well, it is easy to engineer, and it can cut dynamically as the period of the audio changes.

Description

Audio cutting method and system based on clustering and median convergence
Technical Field
The invention relates to the technical field of audio and video processing, and in particular to an audio cutting method and system based on clustering and median convergence.
Background
In the fields of health monitoring and wind-turbine (fan) fault detection, audio often needs to be cut into segments. Application No. 201911121998.X, entitled "Audio segmentation method based on signal energy spike identification", describes a method that includes: performing a short-time Fourier transform on the input audio signal to convert it into a power-spectrum matrix; extracting mid-frequency energy features from the power spectrum; identifying spikes in the extracted mid-frequency energy features; correcting mis-segmentations in the spike-identified signal; and outputting the time coordinates of the cut points of the audio signal. That segmentation method, however, requires the rated rotational speed rs of the turbine blades, obtained in real time from other systems, as an input condition, and is therefore tightly coupled to those systems.
Disclosure of Invention
Therefore, an audio cutting method based on clustering and median convergence is needed, to solve the technical problem that existing audio cutting must rely on input from other systems to obtain a good cutting result and is therefore tightly coupled to them. The specific technical scheme is as follows:
An audio cutting method based on clustering and median convergence comprises the following steps:
filtering an input audio signal;
calculating a spectrogram matrix of the filtered audio signal;
clustering the spectrogram matrix into two classes with K-means to obtain cluster labels;
and performing convergence correction on the cluster labels and outputting the cut-point sequence coordinates.
Further, the "calculating a spectrogram matrix of the filtered audio signal" specifically includes the steps of:
pre-emphasis, framing and windowing are carried out on the filtered audio signal to obtain a first result;
performing fast Fourier transform on the first result to obtain a second result;
carrying out absolute value or square value calculation on the second result to obtain a third result;
performing triangular band-pass filtering processing on the third result to obtain a fourth result;
carrying out logarithmic energy processing on the fourth result to obtain a fifth result;
and performing dynamic characteristic calculation on the fifth result to obtain a spectrogram matrix.
Further, the "clustering the spectrogram matrix into two classes through K _ means to obtain clustering labels" specifically includes the following steps:
intercepting the middle and high dimensionality in the spectrogram matrix, and inputting the spectrogram matrix subjected to dimensionality reduction to K _ means to obtain a tag sequence;
and filtering out burrs and peaks of the label sequence.
Further, the "performing convergence correction on the clustering label and outputting a segmentation point sequence coordinate" specifically includes the steps of:
the sequence coordinates of the cluster tag are identified as V ═ V1, V2, …, vn ];
step 1, calculating a first-order difference vector of V;
step 2, carrying out median operation on the differential vector;
step 3, traversing the first-order difference vector from the head, judging whether the first-order difference vector is in a preset interval range, and outputting a division point sequence coordinate if the first-order difference vector is in the preset interval range;
if the current time is not within the preset interval range, executing preset operation.
Further, the step of executing a preset operation if the current time is not within the preset interval range specifically includes the steps of:
if the first-order difference vector is not in the range of the preset interval, judging whether the first-order difference vector is larger than a first preset threshold value or not, and if the first-order difference vector is larger than the first preset threshold value, inserting a new value into the sequence coordinate;
and judging whether the first-order difference vector is smaller than a second preset threshold value, and if so, deleting the current first-order difference vector.
To solve the same technical problem, an audio cutting system based on clustering and median convergence is also provided; the specific technical scheme is as follows:
An audio cutting system based on clustering and median convergence comprises: a filtering module, a spectrogram matrix generation module, a cluster label generation module and a sequence coordinate generation module;
the filtering module is configured to filter an input audio signal;
the spectrogram matrix generation module is configured to calculate a spectrogram matrix of the filtered audio signal;
the cluster label generation module is configured to cluster the spectrogram matrix into two classes with K-means to obtain cluster labels;
and the sequence coordinate generation module is configured to perform convergence correction on the cluster labels and output the cut-point sequence coordinates.
Further, the spectrogram matrix generation module is further configured to:
perform pre-emphasis, framing and windowing on the filtered audio signal to obtain a first result;
perform a fast Fourier transform on the first result to obtain a second result;
take the absolute value or squared magnitude of the second result to obtain a third result;
perform triangular band-pass filtering on the third result to obtain a fourth result;
perform logarithmic-energy processing on the fourth result to obtain a fifth result;
and perform dynamic-feature calculation on the fifth result to obtain the spectrogram matrix.
Further, the cluster label generation module is further configured to:
intercept the middle and high dimensions of the spectrogram matrix, and feed the dimension-reduced matrix into K-means to obtain a label sequence;
and filter burrs and spikes out of the label sequence.
Further, the sequence coordinate generation module is further configured to:
denote the sequence coordinates of the cluster labels as V = [v1, v2, …, vn];
step 1, calculate the first-order difference vector of V;
step 2, perform a median operation on the difference vector;
step 3, traverse the first-order difference vector from the head and judge whether each value lies within a preset interval; if it does, output the cut-point sequence coordinates;
and if it does not lie within the preset interval, execute a preset operation.
Further, the sequence coordinate generation module is further configured to:
judge, when a value does not lie within the preset interval, whether the first-order difference value is larger than a first preset threshold, and if so, insert a new value into the sequence coordinates;
and judge whether the first-order difference value is smaller than a second preset threshold, and if so, delete the current value.
The invention has the following beneficial effects: an input audio signal is filtered; a spectrogram matrix of the filtered signal is calculated; the spectrogram matrix is clustered into two classes with K-means to obtain cluster labels; and the cluster labels undergo convergence correction to output the cut-point sequence coordinates. Throughout, the method is self-contained and does not depend on data from external systems; its computational load is small, the cut points are placed accurately, it tolerates interference well, it is easy to engineer, and it can cut dynamically as the period of the audio changes.
Drawings
FIG. 1 is a flow chart of an audio slicing method based on clustering and median convergence according to an embodiment;
FIG. 2 is a flowchart illustrating the calculation of a spectrogram matrix of a filtered audio signal according to an embodiment;
FIG. 3 is a block diagram of an audio slicing system based on clustering and median convergence according to an embodiment;
FIG. 4 is a schematic time-frequency diagram according to an embodiment;
FIG. 5 is a diagram illustrating an audio spectrogram according to an embodiment;
FIG. 6 is a schematic diagram of the magnified label sequence according to an embodiment;
FIG. 7 is a schematic diagram of the cut lines before convergence correction of the cluster labels according to an embodiment;
FIG. 8 is a schematic diagram of the cut lines after convergence correction of the cluster labels according to an embodiment.
Description of reference numerals:
300: audio cutting system based on clustering and median convergence;
301: filtering module;
302: spectrogram matrix generation module;
303: cluster label generation module;
304: sequence coordinate generation module.
Detailed Description
To explain the technical content, structural features, objects and effects of the technical solution in detail, the following description is given with reference to the accompanying drawings in conjunction with specific embodiments.
Referring to figs. 1 to 2, this embodiment provides an audio cutting method based on clustering and median convergence, applicable to an audio cutting system based on clustering and median convergence, which includes: a filtering module, a spectrogram matrix generation module, a cluster label generation module and a sequence coordinate generation module.
Referring to fig. 1, an embodiment of the invention proceeds as follows:
step S101: the input audio signal is filtered. The method specifically comprises the following steps: the sound collector works in a complex outdoor environment, and collected audio signals generally contain a large amount of noise, such as bird calls, wind sounds, human voices, noise caused by other fans and the like. At present, the starting condition of the fan is generally that the average wind speed is not less than 3.5m/s, and the signal collected by the sound sensor necessarily contains wind noise, so compared with other background noise, the wind noise has a larger influence. As seen from the time-frequency diagram of fig. 4, the spectral energy of the wind noise is concentrated in the low frequency region (bright region in the diagram) below 350 Hz. A filter is required to filter out low frequency wind noise. In the present embodiment, a low-frequency filter is used for preprocessing.
Step S102: a spectrogram matrix of the filtered audio signal is calculated. The extraction of the sound signal is one of the core parts of the application, and the extraction of effective and reliable characteristics can improve the accuracy and effectiveness of the result and reduce the complexity of processing. The research shows that the spectrogram can well represent the audio characteristics of the fan blade. Fig. 5 shows an audio spectrogram.
A Spectrogram, also known as a time spectrum (english: Spectrogram), also known as a spectral waterfall (spectral waterfall), acoustic fingerprint (voiceprint), sonogram (voicegram) or Spectrogram, is a heat map that describes how fluctuating frequency components change over time. The conventional 2-dimensional spectrum obtained by fourier transform can show how complex fluctuations are decomposed into a superposition of simple waves (into a spectrum) in proportion, but cannot simultaneously reflect their changes over time. A common mathematical method that can analyze both the time variation of the fluctuation and the frequency distribution is the short-time fourier transform, but it is not convenient to observe and analyze on a paper surface if the 3-dimensional image is directly drawn. The time spectrum is represented by 3 d values in the form of heat maps with shades of color on the basis of time-frequency analysis methods.
As shown in fig. 2, "calculating a spectrogram matrix of the filtered audio signal" specifically includes the steps of:
step S201, performing pre-emphasis, framing and windowing on the filtered audio signal to obtain a first result;
step S202, performing a fast Fourier transform on the first result to obtain a second result;
step S203, taking the absolute value or squared magnitude of the second result to obtain a third result;
step S204, performing triangular band-pass filtering on the third result to obtain a fourth result;
step S205, performing logarithmic-energy processing on the fourth result to obtain a fifth result;
and step S206, performing dynamic-feature calculation on the fifth result to obtain the spectrogram matrix. In detail:
First, the audio input is pre-emphasised with a high-pass filter
H(z) = 1 - μz⁻¹,
where μ = 0.79 in this embodiment.
Framing divides the speech signal into short frames according to a frame length and a frame step.
Windowing multiplies each frame by a Hamming window to improve the continuity at the two ends of each frame.
Fast Fourier transform: a fast Fourier transform is applied to each framed and windowed frame signal to obtain the spectrum of each frame.
Taking the absolute value or square value: the magnitude of the speech spectrum is squared to obtain the power spectrum.
Triangular band-pass filtering: the energy spectrum is passed through a bank of M Mel-scale triangular filters. This smooths the spectrum, removes the effect of harmonics, highlights the formants and reduces the amount of computation.
Taking logarithmic energy: the logarithmic energy of each filter bank is calculated.
Dynamic characteristics: a difference spectrum of the static features is calculated.
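The steps S201-S206 can be sketched end to end as follows. Only the step order and μ = 0.79 come from the text; the frame length, frame step, filter count and the exact Mel filter-bank construction are illustrative assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """M triangular filters spaced evenly on the Mel scale (S204)."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fb[m - 1, k] = (k - l) / max(c - l, 1)   # rising edge
        for k in range(c, r):
            fb[m - 1, k] = (r - k) / max(r - c, 1)   # falling edge
    return fb

def spectrogram_matrix(signal, sample_rate, frame_len=400, step=160,
                       mu=0.79, n_filters=26):
    # S201a: pre-emphasis H(z) = 1 - mu*z^-1, mu = 0.79 per the text
    emph = np.append(signal[0], signal[1:] - mu * signal[:-1])
    # S201b/c: framing + Hamming window (frame_len/step are assumptions)
    n_frames = 1 + (len(emph) - frame_len) // step
    idx = np.arange(frame_len)[None, :] + step * np.arange(n_frames)[:, None]
    frames = emph[idx] * np.hamming(frame_len)
    # S202/S203: FFT, then squared magnitude -> power spectrum
    power = np.abs(np.fft.rfft(frames, frame_len)) ** 2 / frame_len
    # S204: triangular (Mel) band-pass filtering
    energies = power @ mel_filterbank(n_filters, frame_len, sample_rate).T
    # S205: log energy of each filter output
    log_e = np.log(np.maximum(energies, 1e-10))
    # S206: dynamic features = first-order difference along time
    delta = np.diff(log_e, axis=0, prepend=log_e[:1])
    return np.hstack([log_e, delta])   # frames x (2 * n_filters)
```

With the assumed 25 ms frames and 10 ms steps at 16 kHz, one second of audio yields a 98 x 52 matrix (26 log energies plus 26 deltas per frame).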
Step S103: and clustering the spectrogram matrix into two types through K _ means to obtain a clustering label. The method specifically comprises the following steps:
intercepting the middle and high dimensionality in the spectrogram matrix, and inputting the spectrogram matrix subjected to dimensionality reduction to K _ means to obtain a tag sequence; in the present embodiment, one fifth or more of the dimensions are set as the medium-high dimensions;
and filter burrs and spikes out of the label sequence. In more detail:
In the k-means algorithm, k denotes the number of clusters and "means" denotes the mean of the data objects within a cluster (the mean is a description of the cluster centre); hence the name. k-means is a partition-based clustering algorithm that uses distance as the measure of similarity between data objects: the smaller the distance between two objects, the higher their similarity and the more likely they belong to the same cluster. Many distance measures exist; k-means generally uses the Euclidean distance.
First, the middle and high dimensions of the spectrogram are intercepted, and the dimension-reduced matrix is fed into the two-class K-means algorithm.
This yields a clustered 0/1 label sequence. For ease of visualisation, fig. 6 shows the label sequence magnified: the K-means clustering separates the wind-sweep sound from the quiet intervals well, so the sweep periods can already be roughly divided.
Burrs and spikes are then filtered out of the label sequence.
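A minimal sketch of this clustering step, under stated assumptions: the text only says that two-class K-means runs on the middle-high dimensions and that burrs and spikes are filtered out, so the deterministic min/max initialisation and the run-length despiking rule below are illustrative choices, not the patent's.

```python
import numpy as np

def two_means_labels(spec_matrix, n_iter=50):
    """Two-class K-means over the middle/high feature dimensions.

    Keeps the dimensions from one fifth of the range upward (per the
    embodiment); the deterministic min/max centre initialisation is an
    illustrative choice for reproducibility.
    """
    X = spec_matrix[:, spec_matrix.shape[1] // 5:]
    centers = np.stack([X.min(axis=0), X.max(axis=0)]).astype(float)
    for _ in range(n_iter):
        d0 = ((X - centers[0]) ** 2).sum(axis=1)   # squared Euclidean
        d1 = ((X - centers[1]) ** 2).sum(axis=1)
        labels = (d1 < d0).astype(int)
        for k in (0, 1):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels

def despike(labels, min_run=3):
    """Filter out 'burrs': runs shorter than min_run are merged into the
    preceding run (a simple heuristic standing in for the unspecified
    spike filter)."""
    out = np.asarray(labels).copy()
    i = 0
    while i < len(out):
        j = i
        while j < len(out) and out[j] == out[i]:
            j += 1
        if j - i < min_run and i > 0:
            out[i:j] = out[i - 1]
        i = j
    return out
```

On two well-separated groups of frames, this assigns each group one label; a lone flipped frame inside a run is smoothed away by `despike`.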
Step S104: and carrying out convergence correction on the clustering label and outputting a segmentation point sequence coordinate. The method specifically comprises the following steps:
the sequence coordinates of the cluster tag are identified as V ═ V1, V2, …, vn ];
step 1, calculating a first-order difference vector of V;
step 2, carrying out median operation on the differential vector;
step 3, traversing the first-order difference vector from the head, judging whether the first-order difference vector is in a preset interval range, and outputting a division point sequence coordinate if the first-order difference vector is in the preset interval range;
if the current time is not within the preset interval range, executing preset operation.
If the current time is not within the preset interval range, executing preset operation, and specifically comprising the following steps of:
if the first-order difference vector is not in the range of the preset interval, judging whether the first-order difference vector is larger than a first preset threshold value or not, and if the first-order difference vector is larger than the first preset threshold value, inserting a new value into the sequence coordinate;
and judging whether the first-order difference vector is smaller than a second preset threshold value, and if so, deleting the current first-order difference vector. The method specifically comprises the following steps:
The points where the labels jump from a run of 0s to a run of 1s are taken as coordinate points; in coordinates the sequence is
V = [v1, v2, …, vn]
where vi, i ∈ [1, 2, …, N], denotes the ith peak on the time axis of the spectrogram. The labels are produced by the K-means algorithm, which is sensitive to outliers, so the deviating values must be corrected on this basis.
The convergence correction algorithm is given below:
1. Denote the sequence coordinates as V = [v1, v2, …, vn].
2. Calculate the first-order difference vector Diffv of V: Diffv(i) = V(i+1) - V(i), where i ∈ [1, 2, …, N-1].
3. Perform a median operation on the difference vector: Median(Diffv(i)).
4. Traverse the first-order difference vector Diffv from the head. If Diffv(i) > 1.3·Median(Diffv(i)), insert a new value vi + Median(Diffv(i)) after vi. If Diffv(i) < 0.7·Median(Diffv(i)), delete vi.
5. Apply step 1 to the new sequence, and repeat until, in step 4, every value satisfies 0.7·Median(Diffv(i)) < Diffv(i) < 1.3·Median(Diffv(i)); then exit the loop.
In the median-convergence iteration, any bias in the initial median does not matter: over the subsequent iterations the value converges towards the median in the true statistical sense.
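The five steps above can be sketched as follows. Inserting vi + Median(Diffv) on a too-large gap and deleting vi on a too-small one follow step 4 literally; restarting the scan after each edit is one reasonable reading of step 5.

```python
import numpy as np

def median_converge(v, lo=0.7, hi=1.3):
    """Convergence-correct a cut-point sequence (steps 1-5 above):
    insert v_i + Median(Diffv) after v_i when a gap exceeds hi * median,
    delete v_i when a gap falls below lo * median, and repeat until every
    gap lies strictly inside (lo, hi) times the median of the differences."""
    v = list(v)
    while True:
        diff = np.diff(v)                      # step 2: first-order difference
        med = float(np.median(diff))           # step 3: median operation
        changed = False
        for i, d in enumerate(diff):           # step 4: traverse from the head
            if d > hi * med:
                v.insert(i + 1, v[i] + round(med))   # missing cut point
                changed = True
                break
            if d < lo * med:
                del v[i]                             # spurious cut point
                changed = True
                break
        if not changed:                        # step 5: all gaps in range
            return v
```

Run on the example coordinate sequence of the text below, this inserts the value 299 after 269 and then exits the loop.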
The convergence of the median over a coordinate sequence is now illustrated:
1. V = [0, 27, 59, 89, 118, 149, 178, 209, 239, 269, 331, 361, 391, 421, 451, 482, 511, 541, 568, 602, 631, 661].
2. Calculating the first-order difference of V gives
Diffv(i) = [27, 32, 30, 29, 31, 29, 31, 30, 30, 62, 30, 30, 30, 30, 31, 29, 30, 27, 34, 29, 30].
3. The median operation on the difference vector gives Median(Diffv(i)) = 30.
4. Traverse Diffv from the head; insert vi + Median(Diffv(i)) after vi where Diffv(i) > 1.3·Median(Diffv(i)), and delete vi where Diffv(i) < 0.7·Median(Diffv(i)).
Here, at i = 10, Diffv(10) = 62 > 1.3 × 30 = 39, so the value 299 = 269 + 30 is inserted after v10 and the subsequent values shift back:
V = [0, 27, 59, 89, 118, 149, 178, 209, 239, 269, 299, 331, 361, 391, 421, 451, 482, 511, 541, 568, 602, 631, 661].
This corresponds to the change of the cutting lines from fig. 7 to fig. 8.
5. The new sequence is put through step 1 again; since every difference now satisfies 0.7·Median(Diffv(i)) < Diffv(i) < 1.3·Median(Diffv(i)), the example vector V exits the iteration loop.
In summary: an input audio signal is filtered; a spectrogram matrix of the filtered signal is calculated; the spectrogram matrix is clustered into two classes with K-means to obtain cluster labels; and the cluster labels undergo convergence correction to output the cut-point sequence coordinates. Throughout, the method is self-contained and does not depend on data from external systems; its computational load is small, the cut points are placed accurately, it tolerates interference well, it is easy to engineer, and it can cut dynamically as the period of the audio changes.
Referring to figs. 3 to 8, an embodiment of an audio cutting system 300 based on clustering and median convergence is as follows:
The audio cutting system 300 based on clustering and median convergence comprises: a filtering module 301, a spectrogram matrix generation module 302, a cluster label generation module 303 and a sequence coordinate generation module 304.
The filtering module 301 is configured to filter the input audio signal. Specifically: the sound collector works in a complex outdoor environment, so the collected audio generally contains a large amount of noise, such as bird calls, wind, human voices and noise from other turbines. The start-up condition of a wind turbine is generally an average wind speed of at least 3.5 m/s, so the signal picked up by the sound sensor necessarily contains wind noise, whose influence is larger than that of the other background noises. As the time-frequency diagram of fig. 4 shows, the spectral energy of the wind noise is concentrated in the low-frequency region below 350 Hz (the bright region in the diagram). A filter is therefore needed to remove the low-frequency wind noise; in this embodiment, a filter that suppresses the low frequencies is used for preprocessing.
The spectrogram matrix generation module 302 is configured to calculate the spectrogram matrix of the filtered audio signal. Feature extraction from the sound signal is one of the core parts of this application: effective and reliable features improve the accuracy and reliability of the result and reduce the complexity of processing. Experiments show that the spectrogram represents the audio characteristics of the turbine blades well. Fig. 5 shows an audio spectrogram.
A spectrogram, also called a time spectrum, spectral waterfall, voiceprint or sonogram, is a heat map describing how the frequency components of a fluctuating signal change over time. The conventional 2-dimensional spectrum obtained by the Fourier transform shows how a complex fluctuation decomposes into a weighted superposition of simple waves, but it cannot simultaneously show how they change over time. The short-time Fourier transform is the common mathematical tool that analyses both the time variation and the frequency distribution of a fluctuation, but drawing its result directly as a 3-dimensional image is inconvenient to inspect on paper. The time spectrum therefore renders the third dimension as colour intensity in a heat map built on this time-frequency analysis.
Further, the spectrogram matrix generation module 302 is further configured to:
perform pre-emphasis, framing and windowing on the filtered audio signal to obtain a first result;
perform a fast Fourier transform on the first result to obtain a second result;
take the absolute value or squared magnitude of the second result to obtain a third result;
perform triangular band-pass filtering on the third result to obtain a fourth result;
perform logarithmic-energy processing on the fourth result to obtain a fifth result;
and perform dynamic-feature calculation on the fifth result to obtain the spectrogram matrix. In detail:
First, the audio input is pre-emphasised with a high-pass filter
H(z) = 1 - μz⁻¹,
where μ = 0.79 in this embodiment.
Framing divides the speech signal into short frames according to a frame length and a frame step.
Windowing multiplies each frame by a Hamming window to improve the continuity at the two ends of each frame.
Fast Fourier transform: a fast Fourier transform is applied to each framed and windowed frame signal to obtain the spectrum of each frame.
Taking the absolute value or square value: the magnitude of the speech spectrum is squared to obtain the power spectrum.
Triangular band-pass filtering: the energy spectrum is passed through a bank of M Mel-scale triangular filters. This smooths the spectrum, removes the effect of harmonics, highlights the formants and reduces the amount of computation.
Taking logarithmic energy: the logarithmic energy of each filter bank is calculated.
Dynamic characteristics: a difference spectrum of the static features is calculated.
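The eight steps above can be strung together in a minimal NumPy sketch. The coefficient μ = 0.79 and the Hamming window follow the text; the frame length, frame step, FFT length, and the number of filters M = 26 are illustrative assumptions, not values from the patent:

```python
import numpy as np

def spectrogram_matrix(x, fs=16000, frame_len=400, frame_step=160,
                       n_fft=512, n_filters=26, mu=0.79):
    """Sketch of the pipeline: pre-emphasis -> framing -> windowing ->
    FFT -> power spectrum -> Mel filter bank -> log energy -> delta."""
    # 1. Pre-emphasis with the high-pass filter H(z) = 1 - mu * z^-1
    x = np.append(x[0], x[1:] - mu * x[:-1])

    # 2. Framing: slice into overlapping frames of frame_len samples.
    n_frames = 1 + (len(x) - frame_len) // frame_step
    idx = (np.arange(frame_len)[None, :] +
           frame_step * np.arange(n_frames)[:, None])
    frames = x[idx]

    # 3. Windowing: multiply each frame by a Hamming window.
    frames = frames * np.hamming(frame_len)

    # 4./5. FFT of each frame, then squared magnitude (power spectrum).
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # 6. Mel-scale triangular filter bank with M = n_filters filters.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    energies = power @ fbank.T

    # 7. Logarithmic energy of each filter output.
    log_e = np.log(np.maximum(energies, 1e-10))

    # 8. Dynamic features: first-order difference along the time axis.
    delta = np.diff(log_e, axis=0, prepend=log_e[:1])
    return np.hstack([log_e, delta])  # shape: frames x (2 * n_filters)
```

Each row of the returned matrix is one frame; each column is one static or dynamic filter-bank dimension, so the matrix can be fed directly to the clustering stage.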
The cluster label generating module 303 is configured to cluster the spectrogram matrix into two classes through K_means to obtain clustering labels, which further comprises the steps of:
intercepting the mid-to-high dimensions of the spectrogram matrix and inputting the dimension-reduced spectrogram matrix to K_means to obtain a label sequence; in the present embodiment, the dimensions from one fifth upward are taken as the mid-to-high dimensions;
and filtering out burrs and peaks from the label sequence. The method specifically comprises the following steps:
In the k-means algorithm, k denotes the number of clusters and "means" denotes the mean of the data objects within each cluster (the mean is a description of the cluster center), hence the name. k-means is a partition-based clustering algorithm that uses distance as the similarity measure between data objects: the smaller the distance between two objects, the higher their similarity and the more likely they belong to the same cluster. Many distance measures between data objects are possible; the k-means algorithm generally uses the Euclidean distance.
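The Euclidean-distance assignment at the heart of k-means can be shown in a few lines. The centers and points below are toy values for illustration only:

```python
import numpy as np

# Two cluster centers and a handful of 2-D points (toy values).
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
points = np.array([[1.0, 0.5], [9.0, 11.0], [0.2, 1.1], [10.5, 9.0]])

# Each point is assigned to the center with the smallest Euclidean distance.
dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
assignment = dists.argmin(axis=1)
print(assignment.tolist())  # nearest-center index per point
```

A full k-means run simply alternates this assignment step with recomputing each center as the mean of its assigned points until the labels stop changing.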
First, the mid-to-high dimensions of the spectrogram are intercepted, and the dimension-reduced matrix is input to the K_means two-class algorithm, which produces a clustered 0/1 label sequence. For ease of visualization, fig. 6 shows the label sequence with label 1 amplified; the K_means clustering separates the wind-sweeping segments from the quiet segments well, so the periods after each wind sweep can be roughly divided.
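The clustering step can be sketched with scikit-learn. The one-fifth cutoff for the mid-to-high dimensions follows the embodiment; the synthetic "spectrogram" and its event location are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic "spectrogram": 200 frames x 40 dimensions. Frames 50-100
# imitate a loud event (wind sweep) with raised mid/high-band energy.
S = rng.normal(0.0, 0.1, size=(200, 40))
S[50:100, 8:] += 3.0

# Keep only the mid-to-high dimensions: everything from one fifth of
# the dimensions upward, as in the embodiment.
mid_high = S[:, S.shape[1] // 5:]

# Two-class K-means over the reduced matrix yields a 0/1 label per frame.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(mid_high)

# Make label 1 the high-energy (event) cluster for readability.
if labels[:50].mean() > 0.5:
    labels = 1 - labels
```

The resulting `labels` array is the 0/1 label sequence the text describes: runs of 1s mark the wind-sweeping segments and runs of 0s mark the quiet segments.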
Further, the cluster label generating module 303 is further configured to:
intercepting the mid-to-high dimensions of the spectrogram matrix and inputting the dimension-reduced spectrogram matrix to K_means to obtain a label sequence; in the present embodiment, the dimensions from one fifth upward are taken as the mid-to-high dimensions;
and filtering out burrs and peaks from the label sequence.
The sequence coordinate generation module 304 is configured to perform convergence correction on the clustering labels and output the segmentation point sequence coordinates. The method specifically comprises the following steps:
the sequence coordinates of the cluster labels are denoted V = [v1, v2, …, vn];
step 1, calculating the first-order difference vector of V;
step 2, taking the median of the difference vector;
step 3, traversing the first-order difference vector from the head and judging whether each element lies within a preset interval; if it does, outputting the segmentation point sequence coordinates;
if it does not lie within the preset interval, executing a preset operation.
Further, the sequence coordinate generating module 304 is further configured to:
if the first-order difference vector is not in the range of the preset interval, judging whether the first-order difference vector is larger than a first preset threshold value or not, and if the first-order difference vector is larger than the first preset threshold value, inserting a new value into the sequence coordinate;
and judging whether the first-order difference vector is smaller than a second preset threshold value, and if so, deleting the current first-order difference vector.
The method specifically comprises the following steps:
The coordinate points are the positions where the label jumps from a run of 0s to a run of 1s; the sequence coordinates are:
V=[v1,v2,…,vn]
where vi, i ∈ [1, 2, …, N], represents the i-th peak on the time axis of the spectrogram. The labels are obtained by the K_means algorithm, which is sensitive to outliers, so the outliers must be corrected on this basis.
The convergence correction algorithm is given below:
1. Denote the sequence coordinates V = [v1, v2, …, vn].
2. Calculate the first-order difference vector Diffv of V:
Diffv(i) = V(i+1) - V(i), where i ∈ {1, 2, …, N-1}.
3. Take the median of the difference vector, Median(Diffv).
4. Traverse the first-order difference vector Diffv from the head. If Diffv(i) > 1.3 · Median(Diffv), insert a new value Vi + Median(Diffv) after Vi. If Diffv(i) < 0.7 · Median(Diffv), delete the point Vi.
5. Apply step 1 to the new sequence, repeating until in step 4 every element satisfies 0.7 · Median(Diffv) < Diffv(i) < 1.3 · Median(Diffv), then exit the loop.
In the median convergence iteration, any deviation in the initial median does not matter; the subsequent iterations converge toward the median in the true statistical sense.
The convergence of the median over a coordinate sequence is now illustrated with an example:
1. V = [0, 27, 59, 89, 118, 149, 178, 209, 239, 269, 331, 361, 391, 421, 451, 482, 511, 541, 568, 602, 631, 661].
2. Calculating the first-order difference of V gives
Diffv = [27, 32, 30, 29, 31, 29, 31, 30, 30, 62, 30, 30, 30, 30, 31, 29, 30, 27, 34, 29, 30].
3. Taking the median of the difference vector gives Median(Diffv) = 30.
4. Traverse the first-order difference vector Diffv from the head. If Diffv(i) > 1.3 · Median(Diffv), insert a new value Vi + Median(Diffv) after Vi; if Diffv(i) < 0.7 · Median(Diffv), delete the point Vi.
Here, at i = 10, Diffv(10) = 62 > 1.3 × 30. The value 299 is inserted at position V10 and the subsequent values are shifted back:
V = [0, 27, 59, 89, 118, 149, 178, 209, 239, 269, 299, 331, 361, 391, 421, 451, 482, 511, 541, 568, 602, 631, 661]
This corresponds to the change of the cutting vertical lines from fig. 7 to fig. 8.
5. Apply step 1 to the new sequence; since every element now satisfies 0.7 · Median(Diffv) < Diffv(i) < 1.3 · Median(Diffv), the example vector V exits the iteration loop.
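Steps 1 to 5 above can be sketched in Python as follows. The text does not spell out every detail, so treat the insert/delete mechanics (inserting after Vi, deleting the following point) as one plausible reading of the worked example rather than the definitive implementation:

```python
import numpy as np

def converge(v, lo=0.7, hi=1.3):
    """Median-convergence correction of cut-point coordinates.
    Repeats until every first-order difference lies strictly within
    (lo * median, hi * median) of the differences."""
    v = list(v)
    while True:
        diff = np.diff(v)                 # step 1: first-order differences
        med = float(np.median(diff))      # step 2: median of the differences
        for i, d in enumerate(diff):      # step 3/4: traverse from the head
            if d > hi * med:              # gap too wide: insert v[i] + median
                v.insert(i + 1, v[i] + med)
                break
            if d < lo * med:              # gap too narrow: drop the next point
                del v[i + 1]
                break
        else:
            return v                      # step 5: all gaps in range, done

# The coordinate sequence from the worked example in the text.
V = [0, 27, 59, 89, 118, 149, 178, 209, 239, 269, 331, 361, 391,
     421, 451, 482, 511, 541, 568, 602, 631, 661]
print(converge(V))
```

On this input the first pass finds the oversized gap 62 between 269 and 331, inserts 269 + 30 = 299, and the second pass finds all gaps within (21, 39) and terminates, matching the corrected sequence in the text.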
The filtering module 301 filters the input audio signal; the spectrogram matrix generating module 302 calculates the spectrogram matrix of the filtered audio signal; the clustering label generating module 303 clusters the spectrogram matrix into two classes through K_means to obtain clustering labels; and the sequence coordinate generation module 304 performs convergence correction on the clustering labels and outputs the segmentation point sequence coordinates. The whole system does not depend on external system data: the method is self-contained, the amount of computation is small, the cutting point positions are correct, the accuracy is high, the interference tolerance is strong, and the method is easy to engineer, achieving dynamic cutting according to the periodic variation of the audio.
It should be noted that, although the above embodiments have been described herein, the invention is not limited thereto. Therefore, based on the innovative concepts of the present invention, the technical solutions of the present invention can be directly or indirectly applied to other related technical fields by making changes and modifications to the embodiments described herein, or by using equivalent structures or equivalent processes performed in the content of the present specification and the attached drawings, which are included in the scope of the present invention.

Claims (10)

1. An audio cutting method based on clustering and median convergence is characterized by comprising the following steps:
filtering an input audio signal;
calculating a spectrogram matrix of the filtered audio signal;
clustering the spectrogram matrix into two types through K _ means to obtain clustering labels;
and carrying out convergence correction on the clustering label and outputting a segmentation point sequence coordinate.
2. The audio cutting method based on clustering and median convergence according to claim 1, wherein the calculating of the spectrogram matrix of the filtered audio signal specifically comprises the steps of:
pre-emphasis, framing and windowing are carried out on the filtered audio signal to obtain a first result;
performing fast Fourier transform on the first result to obtain a second result;
carrying out absolute value or square value calculation on the second result to obtain a third result;
performing triangular band-pass filtering processing on the third result to obtain a fourth result;
carrying out logarithmic energy processing on the fourth result to obtain a fifth result;
and performing dynamic characteristic calculation on the fifth result to obtain a spectrogram matrix.
3. The audio cutting method based on clustering and median convergence according to claim 1, wherein the "clustering the spectrogram matrix into two classes through K _ means to obtain clustering labels" further comprises the steps of:
intercepting the middle and high dimensionality in the spectrogram matrix, and inputting the spectrogram matrix subjected to dimensionality reduction to K _ means to obtain a tag sequence;
and filtering out burrs and peaks of the label sequence.
4. The audio cutting method based on clustering and median convergence according to claim 1, wherein the step of performing convergence correction on the clustering label and outputting a segmentation point sequence coordinate comprises the steps of:
the sequence coordinates of the cluster labels are denoted V = [v1, v2, …, vn];
step 1, calculating a first-order difference vector of V;
step 2, carrying out median operation on the differential vector;
step 3, traversing the first-order difference vector from the head, judging whether the first-order difference vector is in a preset interval range, and outputting a division point sequence coordinate if the first-order difference vector is in the preset interval range;
if it is not within the preset interval range, executing a preset operation.
5. The audio cutting method based on clustering and median convergence according to claim 4, wherein the step of "executing a preset operation if not within the preset interval range" comprises the steps of:
if the first-order difference vector is not in the range of the preset interval, judging whether the first-order difference vector is larger than a first preset threshold value or not, and if the first-order difference vector is larger than the first preset threshold value, inserting a new value into the sequence coordinate;
and judging whether the first-order difference vector is smaller than a second preset threshold value, and if so, deleting the current first-order difference vector.
6. An audio cutting system based on clustering and median convergence, comprising: the system comprises a filtering module, a spectrogram matrix generating module, a clustering label generating module and a sequence coordinate generating module;
the filtering module is used for: filtering an input audio signal;
the spectrogram matrix generation module is used for: calculating a spectrogram matrix of the filtered audio signal;
the cluster label generation module is configured to: clustering the spectrogram matrix into two types through K _ means to obtain clustering labels;
the sequence coordinate generation module is configured to: and carrying out convergence correction on the clustering label and outputting a segmentation point sequence coordinate.
7. The audio segmentation system based on clustering and median convergence according to claim 6, wherein the spectrogram matrix generation module is further configured to:
pre-emphasis, framing and windowing are carried out on the filtered audio signal to obtain a first result;
performing fast Fourier transform on the first result to obtain a second result;
carrying out absolute value or square value calculation on the second result to obtain a third result;
performing triangular band-pass filtering processing on the third result to obtain a fourth result;
carrying out logarithmic energy processing on the fourth result to obtain a fifth result;
and performing dynamic characteristic calculation on the fifth result to obtain a spectrogram matrix.
8. The audio cutting system based on clustering and median convergence according to claim 6, wherein the cluster label generation module is further configured to:
intercepting the middle and high dimensionality in the spectrogram matrix, and inputting the spectrogram matrix subjected to dimensionality reduction to K _ means to obtain a tag sequence;
and filtering out burrs and peaks of the label sequence.
9. The audio slicing system based on clustering and median convergence according to claim 6, wherein the sequence coordinate generating module is further configured to:
the sequence coordinates of the cluster labels are denoted V = [v1, v2, …, vn];
step 1, calculating a first-order difference vector of V;
step 2, carrying out median operation on the differential vector;
step 3, traversing the first-order difference vector from the head, judging whether the first-order difference vector is in a preset interval range, and outputting a division point sequence coordinate if the first-order difference vector is in the preset interval range;
if it is not within the preset interval range, executing a preset operation.
10. The audio slicing system based on clustering and median convergence according to claim 9, wherein the sequence coordinate generating module is further configured to:
if the first-order difference vector is not in the range of the preset interval, judging whether the first-order difference vector is larger than a first preset threshold value or not, and if the first-order difference vector is larger than the first preset threshold value, inserting a new value into the sequence coordinate;
and judging whether the first-order difference vector is smaller than a second preset threshold value, and if so, deleting the current first-order difference vector.
CN202011614821.6A 2020-12-31 2020-12-31 Audio cutting method and system based on clustering and median convergence Active CN112863541B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011614821.6A CN112863541B (en) 2020-12-31 2020-12-31 Audio cutting method and system based on clustering and median convergence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011614821.6A CN112863541B (en) 2020-12-31 2020-12-31 Audio cutting method and system based on clustering and median convergence

Publications (2)

Publication Number Publication Date
CN112863541A true CN112863541A (en) 2021-05-28
CN112863541B CN112863541B (en) 2024-02-09

Family

ID=75998694

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011614821.6A Active CN112863541B (en) 2020-12-31 2020-12-31 Audio cutting method and system based on clustering and median convergence

Country Status (1)

Country Link
CN (1) CN112863541B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003001508A1 (en) * 2001-06-25 2003-01-03 Universitat Pompeu-Fabra Method for multiple access and transmission in a point-to-multipoint system on an electric network
CN102431136A (en) * 2011-09-16 2012-05-02 广州市香港科大***研究院 Multi-phase batch process phase dividing method based on multiway principal component analysis method
US20150261845A1 (en) * 2014-03-14 2015-09-17 Xiaomi Inc. Clustering method and device
CN109599120A (en) * 2018-12-25 2019-04-09 哈尔滨工程大学 One kind being based on large-scale farming field factory mammal abnormal sound monitoring method


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TAO Huawei: "Research on Several Problems of Speech Emotion Recognition Based on Spectrogram Features", Doctoral Electronic Journal, no. 2018, pages 136 - 22 *

Also Published As

Publication number Publication date
CN112863541B (en) 2024-02-09

Similar Documents

Publication Publication Date Title
KR102635469B1 (en) Method and apparatus for recognition of sound events based on convolutional neural network
CN105474311A (en) Speech signal separation and synthesis based on auditory scene analysis and speech modeling
CN111724770B (en) Audio keyword identification method for generating confrontation network based on deep convolution
CN111986699B (en) Sound event detection method based on full convolution network
CN104795064A (en) Recognition method for sound event under scene of low signal to noise ratio
JP2020140193A (en) Voice feature extraction algorithm based on dynamic division of cepstrum coefficient of inverse discrete cosine transform
CN105845126A (en) Method for automatic English subtitle filling of English audio image data
CN112542174A (en) VAD-based multi-dimensional characteristic parameter voiceprint identification method
CN111145726B (en) Deep learning-based sound scene classification method, system, device and storage medium
CN116524939A (en) ECAPA-TDNN-based automatic identification method for bird song species
CN116229991A (en) Motor fault diagnosis method based on MFCC voice feature extraction and machine learning
Labied et al. An overview of automatic speech recognition preprocessing techniques
CN114295195B (en) Abnormality judgment method and system for optical fiber sensing vibration signals based on feature extraction
CN117116290A (en) Method and related equipment for positioning defects of numerical control machine tool parts based on multidimensional characteristics
CN112863541B (en) Audio cutting method and system based on clustering and median convergence
CN112151067A (en) Passive detection method for digital audio tampering based on convolutional neural network
CN107919136B (en) Digital voice sampling frequency estimation method based on Gaussian mixture model
CN112309404B (en) Machine voice authentication method, device, equipment and storage medium
CN115328661A (en) Computing power balance execution method and chip based on voice and image characteristics
Qian et al. Application of local binary patterns for SVM based stop consonant detection
JP7304301B2 (en) Acoustic diagnostic method, acoustic diagnostic system, and acoustic diagnostic program
Patil et al. Audio environment identification
Shinde et al. Speech processing for isolated Marathi word recognition using MFCC and DTW features
Kumar et al. Analysis of audio visual feature extraction techniques for AVSR system
CN116343812B (en) Voice processing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant