CN112863541A - Audio cutting method and system based on clustering and median convergence - Google Patents

Info

Publication number
CN112863541A
Authority
CN
China
Prior art keywords
clustering
result
difference vector
order difference
median
Prior art date
Legal status
Granted
Application number
CN202011614821.6A
Other languages
Chinese (zh)
Other versions
CN112863541B (en)
Inventor
刘培
王颖蕊
陈坚
陈绪水
何魁伟
Current Assignee
Fuzhou Institute Of Data Technology Co ltd
Original Assignee
Fuzhou Institute Of Data Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Fuzhou Institute Of Data Technology Co ltd
Priority to CN202011614821.6A
Publication of CN112863541A
Application granted
Publication of CN112863541B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 characterised by the analysis technique
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations
    • G06F17/14: Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G06F17/141: Discrete Fourier transforms
    • G06F17/142: Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G06F18/232: Non-hierarchical techniques
    • G06F18/2321: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213: Non-hierarchical techniques using statistics or function optimisation with a fixed number of clusters, e.g. K-means clustering
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 characterised by the type of extracted parameters
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 specially adapted for particular use

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Mathematical Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Discrete Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The invention relates to the technical field of audio and video processing, and in particular to an audio cutting method and system based on clustering and median convergence. The method comprises the following steps: filtering an input audio signal; calculating a spectrogram matrix of the filtered audio signal; clustering the spectrogram matrix into two classes with K-means to obtain cluster labels; and performing convergence correction on the cluster labels and outputting the cut-point sequence coordinates. Throughout, the method is self-contained and does not depend on data from external systems; its computational load is small, the cut points are placed accurately, it tolerates interference well, it is easy to engineer, and it can cut dynamically as the period of the audio changes.

Description

Audio cutting method and system based on clustering and median convergence
Technical Field
The invention relates to the technical field of audio and video processing, and in particular to an audio cutting method and system based on clustering and median convergence.
Background
In the fields of health monitoring and wind-turbine (fan) fault detection, audio often needs to be cut into segments. Application No. 201911121998.X, entitled "Audio segmentation method based on signal energy spike identification", describes a method that includes: performing a short-time Fourier transform on the input audio signal to convert it into a power-spectrum matrix; extracting mid-frequency energy features from the power spectrum; identifying spikes in the extracted mid-frequency energy features; correcting mis-segmentations in the spike-identified signal; and outputting the time coordinates of the cut points of the audio signal. That segmentation method, however, requires the rated rotational speed rs of the turbine blades, obtained in real time from other systems, as an input condition, and is therefore tightly coupled to those systems.
Disclosure of Invention
Therefore, an audio cutting method based on clustering and median convergence is needed, to solve the technical problem that existing audio cutting must rely on input from other systems to obtain a good cutting result and is therefore tightly coupled to them. The specific technical scheme is as follows:
An audio cutting method based on clustering and median convergence comprises the following steps:
filtering an input audio signal;
calculating a spectrogram matrix of the filtered audio signal;
clustering the spectrogram matrix into two classes with K-means to obtain cluster labels;
and performing convergence correction on the cluster labels and outputting the cut-point sequence coordinates.
Further, the "calculating a spectrogram matrix of the filtered audio signal" specifically includes the steps of:
pre-emphasis, framing and windowing are carried out on the filtered audio signal to obtain a first result;
performing fast Fourier transform on the first result to obtain a second result;
carrying out absolute value or square value calculation on the second result to obtain a third result;
performing triangular band-pass filtering processing on the third result to obtain a fourth result;
carrying out logarithmic energy processing on the fourth result to obtain a fifth result;
and performing dynamic characteristic calculation on the fifth result to obtain a spectrogram matrix.
Further, the "clustering the spectrogram matrix into two classes through K _ means to obtain clustering labels" specifically includes the following steps:
intercepting the middle and high dimensionality in the spectrogram matrix, and inputting the spectrogram matrix subjected to dimensionality reduction to K _ means to obtain a tag sequence;
and filtering out burrs and peaks of the label sequence.
Further, the "performing convergence correction on the clustering label and outputting a segmentation point sequence coordinate" specifically includes the steps of:
the sequence coordinates of the cluster tag are identified as V ═ V1, V2, …, vn ];
step 1, calculating a first-order difference vector of V;
step 2, carrying out median operation on the differential vector;
step 3, traversing the first-order difference vector from the head, judging whether the first-order difference vector is in a preset interval range, and outputting a division point sequence coordinate if the first-order difference vector is in the preset interval range;
if the current time is not within the preset interval range, executing preset operation.
Further, the step of executing a preset operation if the current time is not within the preset interval range specifically includes the steps of:
if the first-order difference vector is not in the range of the preset interval, judging whether the first-order difference vector is larger than a first preset threshold value or not, and if the first-order difference vector is larger than the first preset threshold value, inserting a new value into the sequence coordinate;
and judging whether the first-order difference vector is smaller than a second preset threshold value, and if so, deleting the current first-order difference vector.
To solve the same technical problem, an audio cutting system based on clustering and median convergence is also provided; the specific technical scheme is as follows:
An audio cutting system based on clustering and median convergence comprises: a filtering module, a spectrogram matrix generation module, a cluster label generation module and a sequence coordinate generation module;
the filtering module is configured to filter an input audio signal;
the spectrogram matrix generation module is configured to calculate a spectrogram matrix of the filtered audio signal;
the cluster label generation module is configured to cluster the spectrogram matrix into two classes with K-means to obtain cluster labels;
and the sequence coordinate generation module is configured to perform convergence correction on the cluster labels and output the cut-point sequence coordinates.
Further, the spectrogram matrix generation module is further configured to:
perform pre-emphasis, framing and windowing on the filtered audio signal to obtain a first result;
perform a fast Fourier transform on the first result to obtain a second result;
take the absolute value or squared magnitude of the second result to obtain a third result;
perform triangular band-pass filtering on the third result to obtain a fourth result;
perform logarithmic-energy processing on the fourth result to obtain a fifth result;
and perform dynamic-feature calculation on the fifth result to obtain the spectrogram matrix.
Further, the cluster label generation module is further configured to:
intercept the middle and high dimensions of the spectrogram matrix, and feed the dimension-reduced matrix into K-means to obtain a label sequence;
and filter burrs and spikes out of the label sequence.
Further, the sequence coordinate generation module is further configured to:
denote the sequence coordinates of the cluster labels as V = [v1, v2, …, vn];
step 1, calculate the first-order difference vector of V;
step 2, perform a median operation on the difference vector;
step 3, traverse the first-order difference vector from the head and judge whether each value lies within a preset interval; if it does, output the cut-point sequence coordinates;
and if it does not lie within the preset interval, execute a preset operation.
Further, the sequence coordinate generation module is further configured to:
judge, when a value does not lie within the preset interval, whether the first-order difference value is larger than a first preset threshold, and if so, insert a new value into the sequence coordinates;
and judge whether the first-order difference value is smaller than a second preset threshold, and if so, delete the current value.
The invention has the following beneficial effects: an input audio signal is filtered; a spectrogram matrix of the filtered signal is calculated; the spectrogram matrix is clustered into two classes with K-means to obtain cluster labels; and the cluster labels undergo convergence correction to output the cut-point sequence coordinates. Throughout, the method is self-contained and does not depend on data from external systems; its computational load is small, the cut points are placed accurately, it tolerates interference well, it is easy to engineer, and it can cut dynamically as the period of the audio changes.
Drawings
FIG. 1 is a flow chart of an audio slicing method based on clustering and median convergence according to an embodiment;
FIG. 2 is a flowchart illustrating the calculation of a spectrogram matrix of a filtered audio signal according to an embodiment;
FIG. 3 is a block diagram of an audio slicing system based on clustering and median convergence according to an embodiment;
FIG. 4 is a schematic time-frequency diagram according to an embodiment;
FIG. 5 is a diagram illustrating an audio spectrogram according to an embodiment;
FIG. 6 is a schematic diagram of the magnified label sequence according to an embodiment;
FIG. 7 is a schematic diagram of the cut lines before convergence correction of the cluster labels according to an embodiment;
FIG. 8 is a schematic diagram of the cut lines after convergence correction of the cluster labels according to an embodiment.
Description of reference numerals:
300: audio cutting system based on clustering and median convergence;
301: filtering module;
302: spectrogram matrix generation module;
303: cluster label generation module;
304: sequence coordinate generation module.
Detailed Description
To explain the technical content, structural features, objects and effects of the technical solution in detail, the following description is given with reference to the accompanying drawings in conjunction with specific embodiments.
Referring to figs. 1 to 2, this embodiment provides an audio cutting method based on clustering and median convergence, applicable to an audio cutting system based on clustering and median convergence, which includes: a filtering module, a spectrogram matrix generation module, a cluster label generation module and a sequence coordinate generation module.
Referring to fig. 1, an embodiment of the invention proceeds as follows:
step S101: the input audio signal is filtered. The method specifically comprises the following steps: the sound collector works in a complex outdoor environment, and collected audio signals generally contain a large amount of noise, such as bird calls, wind sounds, human voices, noise caused by other fans and the like. At present, the starting condition of the fan is generally that the average wind speed is not less than 3.5m/s, and the signal collected by the sound sensor necessarily contains wind noise, so compared with other background noise, the wind noise has a larger influence. As seen from the time-frequency diagram of fig. 4, the spectral energy of the wind noise is concentrated in the low frequency region (bright region in the diagram) below 350 Hz. A filter is required to filter out low frequency wind noise. In the present embodiment, a low-frequency filter is used for preprocessing.
Step S102: a spectrogram matrix of the filtered audio signal is calculated. The extraction of the sound signal is one of the core parts of the application, and the extraction of effective and reliable characteristics can improve the accuracy and effectiveness of the result and reduce the complexity of processing. The research shows that the spectrogram can well represent the audio characteristics of the fan blade. Fig. 5 shows an audio spectrogram.
A Spectrogram, also known as a time spectrum (english: Spectrogram), also known as a spectral waterfall (spectral waterfall), acoustic fingerprint (voiceprint), sonogram (voicegram) or Spectrogram, is a heat map that describes how fluctuating frequency components change over time. The conventional 2-dimensional spectrum obtained by fourier transform can show how complex fluctuations are decomposed into a superposition of simple waves (into a spectrum) in proportion, but cannot simultaneously reflect their changes over time. A common mathematical method that can analyze both the time variation of the fluctuation and the frequency distribution is the short-time fourier transform, but it is not convenient to observe and analyze on a paper surface if the 3-dimensional image is directly drawn. The time spectrum is represented by 3 d values in the form of heat maps with shades of color on the basis of time-frequency analysis methods.
As shown in fig. 2, "calculating a spectrogram matrix of the filtered audio signal" specifically includes the steps of:
step S201, performing pre-emphasis, framing and windowing on the filtered audio signal to obtain a first result;
step S202, performing a fast Fourier transform on the first result to obtain a second result;
step S203, taking the absolute value or squared magnitude of the second result to obtain a third result;
step S204, performing triangular band-pass filtering on the third result to obtain a fourth result;
step S205, performing logarithmic-energy processing on the fourth result to obtain a fifth result;
and step S206, performing dynamic-feature calculation on the fifth result to obtain the spectrogram matrix. In detail:
First, the audio input is pre-emphasised with a high-pass filter
H(z) = 1 - μz⁻¹,
where μ = 0.79 in this embodiment.
Framing divides the speech signal into short frames according to a frame length and a frame step.
Windowing multiplies each frame by a Hamming window to improve the continuity at the two ends of each frame.
Fast Fourier transform: a fast Fourier transform is applied to each framed and windowed frame signal to obtain the spectrum of each frame.
Taking the absolute value or square value: the magnitude of the speech spectrum is squared to obtain the power spectrum.
Triangular band-pass filtering: the energy spectrum is passed through a bank of M Mel-scale triangular filters. This smooths the spectrum, removes the effect of harmonics, highlights the formants and reduces the amount of computation.
Taking logarithmic energy: the logarithmic energy of each filter bank is calculated.
Dynamic characteristics: a difference spectrum of the static features is calculated.
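The steps S201-S206 can be sketched end to end as follows. Only the step order and μ = 0.79 come from the text; the frame length, frame step, filter count and the exact Mel filter-bank construction are illustrative assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """M triangular filters spaced evenly on the Mel scale (S204)."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fb[m - 1, k] = (k - l) / max(c - l, 1)   # rising edge
        for k in range(c, r):
            fb[m - 1, k] = (r - k) / max(r - c, 1)   # falling edge
    return fb

def spectrogram_matrix(signal, sample_rate, frame_len=400, step=160,
                       mu=0.79, n_filters=26):
    # S201a: pre-emphasis H(z) = 1 - mu*z^-1, mu = 0.79 per the text
    emph = np.append(signal[0], signal[1:] - mu * signal[:-1])
    # S201b/c: framing + Hamming window (frame_len/step are assumptions)
    n_frames = 1 + (len(emph) - frame_len) // step
    idx = np.arange(frame_len)[None, :] + step * np.arange(n_frames)[:, None]
    frames = emph[idx] * np.hamming(frame_len)
    # S202/S203: FFT, then squared magnitude -> power spectrum
    power = np.abs(np.fft.rfft(frames, frame_len)) ** 2 / frame_len
    # S204: triangular (Mel) band-pass filtering
    energies = power @ mel_filterbank(n_filters, frame_len, sample_rate).T
    # S205: log energy of each filter output
    log_e = np.log(np.maximum(energies, 1e-10))
    # S206: dynamic features = first-order difference along time
    delta = np.diff(log_e, axis=0, prepend=log_e[:1])
    return np.hstack([log_e, delta])   # frames x (2 * n_filters)
```

With the assumed 25 ms frames and 10 ms steps at 16 kHz, one second of audio yields a 98 x 52 matrix (26 log energies plus 26 deltas per frame).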
Step S103: and clustering the spectrogram matrix into two types through K _ means to obtain a clustering label. The method specifically comprises the following steps:
intercepting the middle and high dimensionality in the spectrogram matrix, and inputting the spectrogram matrix subjected to dimensionality reduction to K _ means to obtain a tag sequence; in the present embodiment, one fifth or more of the dimensions are set as the medium-high dimensions;
and filter burrs and spikes out of the label sequence. In more detail:
In the k-means algorithm, k denotes the number of clusters and "means" denotes the mean of the data objects within a cluster (the mean is a description of the cluster centre); hence the name. k-means is a partition-based clustering algorithm that uses distance as the measure of similarity between data objects: the smaller the distance between two objects, the higher their similarity and the more likely they belong to the same cluster. Many distance measures exist; k-means generally uses the Euclidean distance.
First, the middle and high dimensions of the spectrogram are intercepted, and the dimension-reduced matrix is fed into the two-class K-means algorithm.
This yields a clustered 0/1 label sequence. For ease of visualisation, fig. 6 shows the label sequence magnified: the K-means clustering separates the wind-sweep sound from the quiet intervals well, so the sweep periods can already be roughly divided.
Burrs and spikes are then filtered out of the label sequence.
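A minimal sketch of this clustering step, under stated assumptions: the text only says that two-class K-means runs on the middle-high dimensions and that burrs and spikes are filtered out, so the deterministic min/max initialisation and the run-length despiking rule below are illustrative choices, not the patent's.

```python
import numpy as np

def two_means_labels(spec_matrix, n_iter=50):
    """Two-class K-means over the middle/high feature dimensions.

    Keeps the dimensions from one fifth of the range upward (per the
    embodiment); the deterministic min/max centre initialisation is an
    illustrative choice for reproducibility.
    """
    X = spec_matrix[:, spec_matrix.shape[1] // 5:]
    centers = np.stack([X.min(axis=0), X.max(axis=0)]).astype(float)
    for _ in range(n_iter):
        d0 = ((X - centers[0]) ** 2).sum(axis=1)   # squared Euclidean
        d1 = ((X - centers[1]) ** 2).sum(axis=1)
        labels = (d1 < d0).astype(int)
        for k in (0, 1):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels

def despike(labels, min_run=3):
    """Filter out 'burrs': runs shorter than min_run are merged into the
    preceding run (a simple heuristic standing in for the unspecified
    spike filter)."""
    out = np.asarray(labels).copy()
    i = 0
    while i < len(out):
        j = i
        while j < len(out) and out[j] == out[i]:
            j += 1
        if j - i < min_run and i > 0:
            out[i:j] = out[i - 1]
        i = j
    return out
```

On two well-separated groups of frames, this assigns each group one label; a lone flipped frame inside a run is smoothed away by `despike`.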
Step S104: and carrying out convergence correction on the clustering label and outputting a segmentation point sequence coordinate. The method specifically comprises the following steps:
the sequence coordinates of the cluster tag are identified as V ═ V1, V2, …, vn ];
step 1, calculating a first-order difference vector of V;
step 2, carrying out median operation on the differential vector;
step 3, traversing the first-order difference vector from the head, judging whether the first-order difference vector is in a preset interval range, and outputting a division point sequence coordinate if the first-order difference vector is in the preset interval range;
if the current time is not within the preset interval range, executing preset operation.
If the current time is not within the preset interval range, executing preset operation, and specifically comprising the following steps of:
if the first-order difference vector is not in the range of the preset interval, judging whether the first-order difference vector is larger than a first preset threshold value or not, and if the first-order difference vector is larger than the first preset threshold value, inserting a new value into the sequence coordinate;
and judging whether the first-order difference vector is smaller than a second preset threshold value, and if so, deleting the current first-order difference vector. The method specifically comprises the following steps:
The points where the labels jump from a run of 0s to a run of 1s are taken as coordinate points; in coordinates the sequence is
V = [v1, v2, …, vn]
where vi, i ∈ [1, 2, …, N], denotes the ith peak on the time axis of the spectrogram. The labels are produced by the K-means algorithm, which is sensitive to outliers, so the deviating values must be corrected on this basis.
The convergence correction algorithm is given below:
1. Denote the sequence coordinates as V = [v1, v2, …, vn].
2. Calculate the first-order difference vector Diffv of V: Diffv(i) = V(i+1) - V(i), where i ∈ [1, 2, …, N-1].
3. Perform a median operation on the difference vector: Median(Diffv(i)).
4. Traverse the first-order difference vector Diffv from the head. If Diffv(i) > 1.3·Median(Diffv(i)), insert a new value vi + Median(Diffv(i)) after vi. If Diffv(i) < 0.7·Median(Diffv(i)), delete vi.
5. Apply step 1 to the new sequence, and repeat until, in step 4, every value satisfies 0.7·Median(Diffv(i)) < Diffv(i) < 1.3·Median(Diffv(i)); then exit the loop.
In the median-convergence iteration, any bias in the initial median does not matter: over the subsequent iterations the value converges towards the median in the true statistical sense.
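The five steps above can be sketched as follows. Inserting vi + Median(Diffv) on a too-large gap and deleting vi on a too-small one follow step 4 literally; restarting the scan after each edit is one reasonable reading of step 5.

```python
import numpy as np

def median_converge(v, lo=0.7, hi=1.3):
    """Convergence-correct a cut-point sequence (steps 1-5 above):
    insert v_i + Median(Diffv) after v_i when a gap exceeds hi * median,
    delete v_i when a gap falls below lo * median, and repeat until every
    gap lies strictly inside (lo, hi) times the median of the differences."""
    v = list(v)
    while True:
        diff = np.diff(v)                      # step 2: first-order difference
        med = float(np.median(diff))           # step 3: median operation
        changed = False
        for i, d in enumerate(diff):           # step 4: traverse from the head
            if d > hi * med:
                v.insert(i + 1, v[i] + round(med))   # missing cut point
                changed = True
                break
            if d < lo * med:
                del v[i]                             # spurious cut point
                changed = True
                break
        if not changed:                        # step 5: all gaps in range
            return v
```

Run on the example coordinate sequence of the text below, this inserts the value 299 after 269 and then exits the loop.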
The convergence of the median over a coordinate sequence is now illustrated:
1. V = [0, 27, 59, 89, 118, 149, 178, 209, 239, 269, 331, 361, 391, 421, 451, 482, 511, 541, 568, 602, 631, 661].
2. Calculating the first-order difference of V gives
Diffv(i) = [27, 32, 30, 29, 31, 29, 31, 30, 30, 62, 30, 30, 30, 30, 31, 29, 30, 27, 34, 29, 30].
3. The median operation on the difference vector gives Median(Diffv(i)) = 30.
4. Traverse Diffv from the head; insert vi + Median(Diffv(i)) after vi where Diffv(i) > 1.3·Median(Diffv(i)), and delete vi where Diffv(i) < 0.7·Median(Diffv(i)).
Here, at i = 10, Diffv(10) = 62 > 1.3 × 30 = 39, so the value 299 = 269 + 30 is inserted after v10 and the subsequent values shift back:
V = [0, 27, 59, 89, 118, 149, 178, 209, 239, 269, 299, 331, 361, 391, 421, 451, 482, 511, 541, 568, 602, 631, 661].
This corresponds to the change of the cutting lines from fig. 7 to fig. 8.
5. The new sequence is put through step 1 again; since every difference now satisfies 0.7·Median(Diffv(i)) < Diffv(i) < 1.3·Median(Diffv(i)), the example vector V exits the iteration loop.
In summary: an input audio signal is filtered; a spectrogram matrix of the filtered signal is calculated; the spectrogram matrix is clustered into two classes with K-means to obtain cluster labels; and the cluster labels undergo convergence correction to output the cut-point sequence coordinates. Throughout, the method is self-contained and does not depend on data from external systems; its computational load is small, the cut points are placed accurately, it tolerates interference well, it is easy to engineer, and it can cut dynamically as the period of the audio changes.
Referring to figs. 3 to 8, an embodiment of an audio cutting system 300 based on clustering and median convergence is as follows:
The audio cutting system 300 based on clustering and median convergence comprises: a filtering module 301, a spectrogram matrix generation module 302, a cluster label generation module 303 and a sequence coordinate generation module 304.
The filtering module 301 is configured to filter the input audio signal. Specifically: the sound collector works in a complex outdoor environment, so the collected audio generally contains a large amount of noise, such as bird calls, wind, human voices and noise from other turbines. The start-up condition of a wind turbine is generally an average wind speed of at least 3.5 m/s, so the signal picked up by the sound sensor necessarily contains wind noise, whose influence is larger than that of the other background noises. As the time-frequency diagram of fig. 4 shows, the spectral energy of the wind noise is concentrated in the low-frequency region below 350 Hz (the bright region in the diagram). A filter is therefore needed to remove the low-frequency wind noise; in this embodiment, a filter that suppresses the low frequencies is used for preprocessing.
The spectrogram matrix generation module 302 is configured to calculate the spectrogram matrix of the filtered audio signal. Feature extraction from the sound signal is one of the core parts of this application: effective and reliable features improve the accuracy and reliability of the result and reduce the complexity of processing. Experiments show that the spectrogram represents the audio characteristics of the turbine blades well. Fig. 5 shows an audio spectrogram.
A spectrogram, also called a time spectrum, spectral waterfall, voiceprint or sonogram, is a heat map describing how the frequency components of a fluctuating signal change over time. The conventional 2-dimensional spectrum obtained by the Fourier transform shows how a complex fluctuation decomposes into a weighted superposition of simple waves, but it cannot simultaneously show how they change over time. The short-time Fourier transform is the common mathematical tool that analyses both the time variation and the frequency distribution of a fluctuation, but drawing its result directly as a 3-dimensional image is inconvenient to inspect on paper. The time spectrum therefore renders the third dimension as colour intensity in a heat map built on this time-frequency analysis.
Further, the spectrogram matrix generation module 302 is further configured to:
perform pre-emphasis, framing and windowing on the filtered audio signal to obtain a first result;
perform a fast Fourier transform on the first result to obtain a second result;
take the absolute value or squared magnitude of the second result to obtain a third result;
perform triangular band-pass filtering on the third result to obtain a fourth result;
perform logarithmic-energy processing on the fourth result to obtain a fifth result;
and perform dynamic-feature calculation on the fifth result to obtain the spectrogram matrix. In detail:
First, the audio input is pre-emphasised with a high-pass filter
H(z) = 1 - μz⁻¹,
where μ = 0.79 in this embodiment.
Framing divides the speech signal into short frames according to a frame length and a frame step.
Windowing multiplies each frame by a Hamming window to improve the continuity at the two ends of each frame.
Fast Fourier transform: a fast Fourier transform is applied to each framed and windowed frame signal to obtain the spectrum of each frame.
Taking the absolute value or square value: the magnitude of the speech spectrum is squared to obtain the power spectrum.
Triangular band-pass filtering: the energy spectrum is passed through a bank of M Mel-scale triangular filters. This smooths the spectrum, removes the effect of harmonics, highlights the formants and reduces the amount of computation.
Taking logarithmic energy: the logarithmic energy of each filter bank is calculated.
Dynamic characteristics: a difference spectrum of the static features is calculated.
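The eight steps above can be strung together in a minimal NumPy sketch. The coefficient μ = 0.79 and the Hamming window follow the text; the frame length, frame step, FFT length, and the number of filters M = 26 are illustrative assumptions, not values from the patent:

```python
import numpy as np

def spectrogram_matrix(x, fs=16000, frame_len=400, frame_step=160,
                       n_fft=512, n_filters=26, mu=0.79):
    """Sketch of the pipeline: pre-emphasis -> framing -> windowing ->
    FFT -> power spectrum -> Mel filter bank -> log energy -> delta."""
    # 1. Pre-emphasis with the high-pass filter H(z) = 1 - mu * z^-1
    x = np.append(x[0], x[1:] - mu * x[:-1])

    # 2. Framing: slice into overlapping frames of frame_len samples.
    n_frames = 1 + (len(x) - frame_len) // frame_step
    idx = (np.arange(frame_len)[None, :] +
           frame_step * np.arange(n_frames)[:, None])
    frames = x[idx]

    # 3. Windowing: multiply each frame by a Hamming window.
    frames = frames * np.hamming(frame_len)

    # 4./5. FFT of each frame, then squared magnitude (power spectrum).
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # 6. Mel-scale triangular filter bank with M = n_filters filters.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    energies = power @ fbank.T

    # 7. Logarithmic energy of each filter output.
    log_e = np.log(np.maximum(energies, 1e-10))

    # 8. Dynamic features: first-order difference along the time axis.
    delta = np.diff(log_e, axis=0, prepend=log_e[:1])
    return np.hstack([log_e, delta])  # shape: frames x (2 * n_filters)
```

Each row of the returned matrix is one frame; each column is one static or dynamic filter-bank dimension, so the matrix can be fed directly to the clustering stage.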
The cluster label generating module 303 is configured to cluster the spectrogram matrix into two classes through K_means to obtain clustering labels, which further comprises the steps of:
intercepting the mid-to-high dimensions of the spectrogram matrix and inputting the dimension-reduced spectrogram matrix to K_means to obtain a label sequence; in the present embodiment, the dimensions from one fifth upward are taken as the mid-to-high dimensions;
and filtering out burrs and peaks from the label sequence. The method specifically comprises the following steps:
In the k-means algorithm, k denotes the number of clusters and "means" denotes the mean of the data objects within each cluster (the mean is a description of the cluster center), hence the name. k-means is a partition-based clustering algorithm that uses distance as the similarity measure between data objects: the smaller the distance between two objects, the higher their similarity and the more likely they belong to the same cluster. Many distance measures between data objects are possible; the k-means algorithm generally uses the Euclidean distance.
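The Euclidean-distance assignment at the heart of k-means can be shown in a few lines. The centers and points below are toy values for illustration only:

```python
import numpy as np

# Two cluster centers and a handful of 2-D points (toy values).
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
points = np.array([[1.0, 0.5], [9.0, 11.0], [0.2, 1.1], [10.5, 9.0]])

# Each point is assigned to the center with the smallest Euclidean distance.
dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
assignment = dists.argmin(axis=1)
print(assignment.tolist())  # nearest-center index per point
```

A full k-means run simply alternates this assignment step with recomputing each center as the mean of its assigned points until the labels stop changing.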
First, the mid-to-high dimensions of the spectrogram are intercepted, and the dimension-reduced matrix is input to the K_means two-class algorithm, which produces a clustered 0/1 label sequence. For ease of visualization, fig. 6 shows the label sequence with label 1 amplified; the K_means clustering separates the wind-sweeping segments from the quiet segments well, so the periods after each wind sweep can be roughly divided.
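The clustering step can be sketched with scikit-learn. The one-fifth cutoff for the mid-to-high dimensions follows the embodiment; the synthetic "spectrogram" and its event location are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic "spectrogram": 200 frames x 40 dimensions. Frames 50-100
# imitate a loud event (wind sweep) with raised mid/high-band energy.
S = rng.normal(0.0, 0.1, size=(200, 40))
S[50:100, 8:] += 3.0

# Keep only the mid-to-high dimensions: everything from one fifth of
# the dimensions upward, as in the embodiment.
mid_high = S[:, S.shape[1] // 5:]

# Two-class K-means over the reduced matrix yields a 0/1 label per frame.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(mid_high)

# Make label 1 the high-energy (event) cluster for readability.
if labels[:50].mean() > 0.5:
    labels = 1 - labels
```

The resulting `labels` array is the 0/1 label sequence the text describes: runs of 1s mark the wind-sweeping segments and runs of 0s mark the quiet segments.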
Further, the cluster label generating module 303 is further configured to:
intercepting the mid-to-high dimensions of the spectrogram matrix and inputting the dimension-reduced spectrogram matrix to K_means to obtain a label sequence; in the present embodiment, the dimensions from one fifth upward are taken as the mid-to-high dimensions;
and filtering out burrs and peaks from the label sequence.
The sequence coordinate generation module 304 is configured to perform convergence correction on the clustering labels and output the segmentation point sequence coordinates. The method specifically comprises the following steps:
the sequence coordinates of the cluster labels are denoted V = [v1, v2, …, vn];
step 1, calculating the first-order difference vector of V;
step 2, taking the median of the difference vector;
step 3, traversing the first-order difference vector from the head and judging whether each element lies within a preset interval; if it does, outputting the segmentation point sequence coordinates;
if it does not lie within the preset interval, executing a preset operation.
Further, the sequence coordinate generating module 304 is further configured to:
if the first-order difference vector is not in the range of the preset interval, judging whether the first-order difference vector is larger than a first preset threshold value or not, and if the first-order difference vector is larger than the first preset threshold value, inserting a new value into the sequence coordinate;
and judging whether the first-order difference vector is smaller than a second preset threshold value, and if so, deleting the current first-order difference vector.
The method specifically comprises the following steps:
The coordinate points are the positions where the label jumps from a run of 0s to a run of 1s; the sequence coordinates are:
V=[v1,v2,…,vn]
where vi, i ∈ [1, 2, …, N], represents the i-th peak on the time axis of the spectrogram. The labels are obtained by the K_means algorithm, which is sensitive to outliers, so the outliers must be corrected on this basis.
The convergence correction algorithm is given below:
1. Denote the sequence coordinates V = [v1, v2, …, vn].
2. Calculate the first-order difference vector Diffv of V:
Diffv(i) = V(i+1) - V(i), where i ∈ {1, 2, …, N-1}.
3. Take the median of the difference vector, Median(Diffv).
4. Traverse the first-order difference vector Diffv from the head. If Diffv(i) > 1.3 · Median(Diffv), insert a new value Vi + Median(Diffv) after Vi. If Diffv(i) < 0.7 · Median(Diffv), delete the point Vi.
5. Apply step 1 to the new sequence, repeating until in step 4 every element satisfies 0.7 · Median(Diffv) < Diffv(i) < 1.3 · Median(Diffv), then exit the loop.
In the median convergence iteration, any deviation in the initial median does not matter; the subsequent iterations converge toward the median in the true statistical sense.
The convergence of the median over a coordinate sequence is now illustrated with an example:
1. V = [0, 27, 59, 89, 118, 149, 178, 209, 239, 269, 331, 361, 391, 421, 451, 482, 511, 541, 568, 602, 631, 661].
2. Calculating the first-order difference of V gives
Diffv = [27, 32, 30, 29, 31, 29, 31, 30, 30, 62, 30, 30, 30, 30, 31, 29, 30, 27, 34, 29, 30].
3. Taking the median of the difference vector gives Median(Diffv) = 30.
4. Traverse the first-order difference vector Diffv from the head. If Diffv(i) > 1.3 · Median(Diffv), insert a new value Vi + Median(Diffv) after Vi; if Diffv(i) < 0.7 · Median(Diffv), delete the point Vi.
Here, at i = 10, Diffv(10) = 62 > 1.3 × 30. The value 299 is inserted at position V10 and the subsequent values are shifted back:
V = [0, 27, 59, 89, 118, 149, 178, 209, 239, 269, 299, 331, 361, 391, 421, 451, 482, 511, 541, 568, 602, 631, 661]
This corresponds to the change of the cutting vertical lines from fig. 7 to fig. 8.
5. Apply step 1 to the new sequence; since every element now satisfies 0.7 · Median(Diffv) < Diffv(i) < 1.3 · Median(Diffv), the example vector V exits the iteration loop.
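Steps 1 to 5 above can be sketched in Python as follows. The text does not spell out every detail, so treat the insert/delete mechanics (inserting after Vi, deleting the following point) as one plausible reading of the worked example rather than the definitive implementation:

```python
import numpy as np

def converge(v, lo=0.7, hi=1.3):
    """Median-convergence correction of cut-point coordinates.
    Repeats until every first-order difference lies strictly within
    (lo * median, hi * median) of the differences."""
    v = list(v)
    while True:
        diff = np.diff(v)                 # step 1: first-order differences
        med = float(np.median(diff))      # step 2: median of the differences
        for i, d in enumerate(diff):      # step 3/4: traverse from the head
            if d > hi * med:              # gap too wide: insert v[i] + median
                v.insert(i + 1, v[i] + med)
                break
            if d < lo * med:              # gap too narrow: drop the next point
                del v[i + 1]
                break
        else:
            return v                      # step 5: all gaps in range, done

# The coordinate sequence from the worked example in the text.
V = [0, 27, 59, 89, 118, 149, 178, 209, 239, 269, 331, 361, 391,
     421, 451, 482, 511, 541, 568, 602, 631, 661]
print(converge(V))
```

On this input the first pass finds the oversized gap 62 between 269 and 331, inserts 269 + 30 = 299, and the second pass finds all gaps within (21, 39) and terminates, matching the corrected sequence in the text.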
The filtering module 301 filters the input audio signal; the spectrogram matrix generating module 302 calculates the spectrogram matrix of the filtered audio signal; the clustering label generating module 303 clusters the spectrogram matrix into two classes through K_means to obtain clustering labels; and the sequence coordinate generation module 304 performs convergence correction on the clustering labels and outputs the segmentation point sequence coordinates. The whole system does not depend on external system data: the method is self-contained, the amount of computation is small, the cutting point positions are correct, the accuracy is high, the interference tolerance is strong, and the method is easy to engineer, achieving dynamic cutting according to the periodic variation of the audio.
It should be noted that, although the above embodiments have been described herein, the invention is not limited thereto. Therefore, based on the innovative concepts of the present invention, the technical solutions of the present invention can be directly or indirectly applied to other related technical fields by making changes and modifications to the embodiments described herein, or by using equivalent structures or equivalent processes performed in the content of the present specification and the attached drawings, which are included in the scope of the present invention.

Claims (10)

1. An audio cutting method based on clustering and median convergence is characterized by comprising the following steps:
filtering an input audio signal;
calculating a spectrogram matrix of the filtered audio signal;
clustering the spectrogram matrix into two types through K _ means to obtain clustering labels;
and carrying out convergence correction on the clustering label and outputting a segmentation point sequence coordinate.
2. The audio cutting method based on clustering and median convergence according to claim 1, wherein the calculating of the spectrogram matrix of the filtered audio signal specifically comprises the steps of:
pre-emphasis, framing and windowing are carried out on the filtered audio signal to obtain a first result;
performing fast Fourier transform on the first result to obtain a second result;
carrying out absolute value or square value calculation on the second result to obtain a third result;
performing triangular band-pass filtering processing on the third result to obtain a fourth result;
carrying out logarithmic energy processing on the fourth result to obtain a fifth result;
and performing dynamic characteristic calculation on the fifth result to obtain a spectrogram matrix.
3. The audio cutting method based on clustering and median convergence according to claim 1, wherein the "clustering the spectrogram matrix into two classes through K _ means to obtain clustering labels" further comprises the steps of:
intercepting the middle and high dimensionality in the spectrogram matrix, and inputting the spectrogram matrix subjected to dimensionality reduction to K _ means to obtain a tag sequence;
and filtering out burrs and peaks of the label sequence.
4. The audio cutting method based on clustering and median convergence according to claim 1, wherein the step of performing convergence correction on the clustering label and outputting a segmentation point sequence coordinate comprises the steps of:
the sequence coordinates of the cluster labels are denoted V = [v1, v2, …, vn];
step 1, calculating a first-order difference vector of V;
step 2, carrying out median operation on the differential vector;
step 3, traversing the first-order difference vector from the head, judging whether the first-order difference vector is in a preset interval range, and outputting a division point sequence coordinate if the first-order difference vector is in the preset interval range;
if it is not within the preset interval range, executing a preset operation.
5. The audio cutting method based on clustering and median convergence according to claim 4, wherein the step of "executing a preset operation if not within the preset interval range" comprises the steps of:
if the first-order difference vector is not in the range of the preset interval, judging whether the first-order difference vector is larger than a first preset threshold value or not, and if the first-order difference vector is larger than the first preset threshold value, inserting a new value into the sequence coordinate;
and judging whether the first-order difference vector is smaller than a second preset threshold value, and if so, deleting the current first-order difference vector.
6. An audio cutting system based on clustering and median convergence, comprising: the system comprises a filtering module, a spectrogram matrix generating module, a clustering label generating module and a sequence coordinate generating module;
the filtering module is used for: filtering an input audio signal;
the spectrogram matrix generation module is used for: calculating a spectrogram matrix of the filtered audio signal;
the cluster label generation module is configured to: clustering the spectrogram matrix into two types through K _ means to obtain clustering labels;
the sequence coordinate generation module is configured to: and carrying out convergence correction on the clustering label and outputting a segmentation point sequence coordinate.
7. The audio segmentation system based on clustering and median convergence according to claim 6, wherein the spectrogram matrix generation module is further configured to:
pre-emphasis, framing and windowing are carried out on the filtered audio signal to obtain a first result;
performing fast Fourier transform on the first result to obtain a second result;
carrying out absolute value or square value calculation on the second result to obtain a third result;
performing triangular band-pass filtering processing on the third result to obtain a fourth result;
carrying out logarithmic energy processing on the fourth result to obtain a fifth result;
and performing dynamic characteristic calculation on the fifth result to obtain a spectrogram matrix.
8. The audio cutting system based on clustering and median convergence according to claim 6, wherein the cluster label generation module is further configured to:
intercepting the middle and high dimensionality in the spectrogram matrix, and inputting the spectrogram matrix subjected to dimensionality reduction to K _ means to obtain a tag sequence;
and filtering out burrs and peaks of the label sequence.
9. The audio slicing system based on clustering and median convergence according to claim 6, wherein the sequence coordinate generating module is further configured to:
the sequence coordinates of the cluster labels are denoted V = [v1, v2, …, vn];
step 1, calculating a first-order difference vector of V;
step 2, carrying out median operation on the differential vector;
step 3, traversing the first-order difference vector from the head, judging whether the first-order difference vector is in a preset interval range, and outputting a division point sequence coordinate if the first-order difference vector is in the preset interval range;
if it is not within the preset interval range, executing a preset operation.
10. The audio slicing system based on clustering and median convergence according to claim 9, wherein the sequence coordinate generating module is further configured to:
if the first-order difference vector is not in the range of the preset interval, judging whether the first-order difference vector is larger than a first preset threshold value or not, and if the first-order difference vector is larger than the first preset threshold value, inserting a new value into the sequence coordinate;
and judging whether the first-order difference vector is smaller than a second preset threshold value, and if so, deleting the current first-order difference vector.
CN202011614821.6A 2020-12-31 2020-12-31 Audio cutting method and system based on clustering and median convergence Active CN112863541B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011614821.6A CN112863541B (en) 2020-12-31 2020-12-31 Audio cutting method and system based on clustering and median convergence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011614821.6A CN112863541B (en) 2020-12-31 2020-12-31 Audio cutting method and system based on clustering and median convergence

Publications (2)

Publication Number Publication Date
CN112863541A true CN112863541A (en) 2021-05-28
CN112863541B CN112863541B (en) 2024-02-09

Family

ID=75998694

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011614821.6A Active CN112863541B (en) 2020-12-31 2020-12-31 Audio cutting method and system based on clustering and median convergence

Country Status (1)

Country Link
CN (1) CN112863541B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003001508A1 (en) * 2001-06-25 2003-01-03 Universitat Pompeu-Fabra Method for multiple access and transmission in a point-to-multipoint system on an electric network
CN102431136A (en) * 2011-09-16 2012-05-02 广州市香港科大***研究院 Multi-phase batch process phase dividing method based on multiway principal component analysis method
US20150261845A1 (en) * 2014-03-14 2015-09-17 Xiaomi Inc. Clustering method and device
CN109599120A (en) * 2018-12-25 2019-04-09 哈尔滨工程大学 One kind being based on large-scale farming field factory mammal abnormal sound monitoring method


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TAO Huawei: "Research on Several Problems of Speech Emotion Recognition Based on Spectrogram Features", Doctoral Electronic Journal, no. 2018, pages 136 - 22 *

Also Published As

Publication number Publication date
CN112863541B (en) 2024-02-09

Similar Documents

Publication Publication Date Title
KR102635469B1 (en) Method and apparatus for recognition of sound events based on convolutional neural network
CN105474311A (en) Speech signal separation and synthesis based on auditory scene analysis and speech modeling
CN111724770B (en) Audio keyword identification method for generating confrontation network based on deep convolution
CN111986699B (en) Sound event detection method based on full convolution network
CN104795064A (en) Recognition method for sound event under scene of low signal to noise ratio
JP2020140193A (en) Voice feature extraction algorithm based on dynamic division of cepstrum coefficient of inverse discrete cosine transform
CN105845126A (en) Method for automatic English subtitle filling of English audio image data
CN112542174A (en) VAD-based multi-dimensional characteristic parameter voiceprint identification method
CN111145726B (en) Deep learning-based sound scene classification method, system, device and storage medium
CN116524939A (en) ECAPA-TDNN-based automatic identification method for bird song species
CN116229991A (en) Motor fault diagnosis method based on MFCC voice feature extraction and machine learning
Labied et al. An overview of automatic speech recognition preprocessing techniques
CN114295195B (en) Abnormality judgment method and system for optical fiber sensing vibration signals based on feature extraction
CN117116290A (en) Method and related equipment for positioning defects of numerical control machine tool parts based on multidimensional characteristics
CN112863541B (en) Audio cutting method and system based on clustering and median convergence
CN112151067A (en) Passive detection method for digital audio tampering based on convolutional neural network
CN107919136B (en) Digital voice sampling frequency estimation method based on Gaussian mixture model
CN112309404B (en) Machine voice authentication method, device, equipment and storage medium
CN115328661A (en) Computing power balance execution method and chip based on voice and image characteristics
Qian et al. Application of local binary patterns for SVM based stop consonant detection
JP7304301B2 (en) Acoustic diagnostic method, acoustic diagnostic system, and acoustic diagnostic program
Patil et al. Audio environment identification
Shinde et al. Speech processing for isolated Marathi word recognition using MFCC and DTW features
Kumar et al. Analysis of audio visual feature extraction techniques for AVSR system
CN116343812B (en) Voice processing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant