CN117727309A - Automatic identification method for bird song species based on TDNN structure

Automatic identification method for bird song species based on TDNN structure

Info

Publication number: CN117727309A (granted as CN117727309B)
Application number: CN202410179331.XA
Authority: CN (China)
Prior art keywords: frame, frequency, subgraph, spectrum, value
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN117727309B
Inventors: 高树会 (Gao Shuhui), 李可扬 (Li Keyang)
Current/Original Assignee: Bainiao Data Technology Beijing Co., Ltd.
Application filed by Bainiao Data Technology Beijing Co., Ltd.; priority to CN202410179331.XA; publication of CN117727309A, then grant and publication of CN117727309B.

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention relates to the technical field of audio processing and provides a method for automatically identifying bird song species based on a TDNN structure, comprising the following steps: collecting mixed bird song data in an ecological area; determining the bird song syllable coverage of each frame based on the periodicity of the signal energy of each frame and the stability of the bird song pitch; determining a time-frequency masking probability according to the bird song syllable coverage and the information approximation coefficients of the frames in the cluster containing each frame's single-frame energy vector; determining a spectrum masking value based on the bird song syllable coverage and time-frequency masking probability of each frame; constructing a time-frequency masking map of each spectral subgraph based on the spectrum masking values of all frames; taking the result of multiplying each spectral subgraph by its time-frequency masking map as the enhanced bird song feature map of the subgraph; and determining the species identification result based on the enhanced bird song feature maps by means of a TDNN identification model. By masking the spectral subgraphs of the spectrogram, the invention improves the quality of the recognition model's training samples and the accuracy of bird song species identification.

Description

Automatic identification method for bird song species based on TDNN structure
Technical Field
The invention relates to the technical field of audio processing, and in particular to a method for automatically identifying bird song species based on a TDNN structure.
Background
Birds are an important part of the food chain and food web in the ecosystem. By preying on insects, small mammals, and other small organisms, birds help control populations and maintain the balance and stability of the ecosystem. Bird song is an essential behavior in bird survival, and its pitch and duration vary with different bird activities. Identifying bird song information to understand bird activity is therefore of great importance for maintaining the balance and stability of the entire ecosystem.
Birds are small and easily hidden in trees, shrubs, and other concealed spots in the forests of an ecological area, but their calls carry far and are distinguishable between species, so identifying birds by their song is currently a common approach. Deep learning is widely applied to bird song recognition because of its strong ability to learn from and train on large amounts of data, identifying birds by training models on features extracted from bird song data. Since mixed bird song data is multi-frequency, multi-tone mixed audio, the model must handle time-series signals well. The time-delay neural network (TDNN, Time-Delay Neural Network) was one of the earliest networks for processing audio signals, but the TDNN places limits on the length of the input sequence, which complicates the processing of bird song signals of varying lengths; in addition, the TDNN has limited memory capacity, so each neuron can only access input information within a limited number of time steps, weakening its ability to distinguish different bird song features within mixed bird song data.
Disclosure of Invention
The invention provides a method for automatically identifying bird song species based on a TDNN structure, which aims to solve the low recognition rate on mixed bird song data caused by the TDNN's limits on input sequence length and memory capacity, and adopts the following technical scheme:
the invention discloses an automatic identification method of a bird song species based on a TDNN structure, which comprises the following steps:
collecting mixed bird song data in an ecological area;
dividing a spectrogram of each mixed bird song data into frequency spectrum subgraphs with equal scales; determining a bird song syllable coverage of each frame on each spectral subgraph based on the periodicity of the signal energy at each frame on each spectral subgraph and the stability of the bird song pitch;
determining the time-frequency masking probability of each frame on each frequency spectrum subgraph according to the bird song syllable coverage rate and the information approximation coefficient of each frame in the cluster where the single-frame energy vector of each frame on each frequency spectrum subgraph is located;
determining a spectrum masking value of each frame on each spectrum subgraph based on the bird song syllable coverage rate and the time-frequency masking probability of each frame on each spectrum subgraph; constructing a time-frequency masking map of each spectrum sub-graph based on the spectrum masking values of all frames on each spectrum sub-graph;
and determining a species identification result corresponding to each mixed bird song data based on each frequency spectrum subgraph and the time-frequency masking graph thereof by adopting a TDNN identification model.
Preferably, the method for dividing the spectrogram of each mixed bird song data into the spectrum subgraphs with equal scales comprises the following steps:
and sliding the spectrogram of each mixed bird song data along the time sequence according to a preset moving step length by utilizing a time window with a preset scale, and taking the spectrogram in the sliding position of each time window as a frequency spectrum subgraph.
Preferably, the method for determining the coverage rate of the bird song syllables of each frame on each frequency spectrum subgraph based on the periodicity of the signal energy at each frame on each frequency spectrum subgraph and the stability of the bird song pitch is as follows:
determining the energy period stability of each frequency point on each frame based on the energy values and the predicted energy values of all frequency points on each frame on each frequency spectrum subgraph;
determining a pitch predictability coefficient of each frequency point on each frame based on the time differences between the different frames where frequency points of equal pitch are located on each spectral subgraph;
respectively taking vectors formed by energy period stability and pitch predictability coefficients of all frequency points on each frame on each frequency spectrum subgraph according to the ascending order of the frequencies as energy stability evaluation vectors and pitch predictability vectors of each frame;
taking the similarity measurement result between the energy stability evaluation vector of each frame and that of any other frame as the numerator;
taking the sum of the similarity measurement result between the pitch predictability vector of each frame and that of any other frame and a preset parameter as the denominator;
taking the mean of the ratios of numerator to denominator accumulated over all remaining frames of each spectral subgraph as the bird song syllable coverage of each frame.
Preferably, the method for determining the energy period stability of each frequency point on each frame based on the energy values and the predicted energy values of all frequency points on each frame on each spectrum subgraph comprises the following steps:
respectively taking a set formed by all frequencies of the maximum value and the minimum value of energy in each frame of signal on each frequency spectrum subgraph as a maximum frequency set and a minimum frequency set of each frame; determining the predicted energy value of each frequency point on each frame based on the maximum frequency set and the minimum frequency set of each frame by adopting a data prediction algorithm;
taking the Hurst exponent of the sequence formed by the differences between the energy value of each frequency point and the energy values of the remaining frequency points on each frame as the numerator;
taking the sum of the absolute difference between the energy value and the predicted energy value of the frequency point and a preset parameter as the denominator; the ratio of the numerator to the denominator is taken as the energy period stability of each frequency point on each frame.
Preferably, the method for determining the pitch predictability coefficient of each frequency point on each frame based on the time differences between the different frames where frequency points of equal pitch are located on each spectral subgraph is as follows:
taking any frame on each spectral subgraph whose pitch equals that of each frequency point on each frame as an equal-pitch frame of that frequency point; taking the variance of the elements of the set formed by the time differences between each frame and all equal-pitch frames of each of its frequency points as the numerator;
taking the difference between the maximum pitch over all frequency points on each frame of each spectral subgraph and the maximum pitch over all frequency points on any other frame as a first pitch difference; taking the sum of the first pitch differences accumulated over all other frames of the spectral subgraph and a preset parameter as the denominator;
the ratio of the numerator to the denominator is taken as the pitch predictability coefficient for each frequency point on each frame.
Preferably, the method for determining the time-frequency masking probability of each frame on each spectrum subgraph according to the bird song syllable coverage rate and the information approximation coefficient of each frame in the cluster where the single-frame energy vector of each frame on each spectrum subgraph is located is as follows:
taking a sequence formed by energy values of all frequency points on each frame on each frequency spectrum subgraph according to the ascending order of the frequencies as a single-frame energy vector of each frame; taking the average value of all elements in the single-frame energy vector of each frame as the energy average value of each frame;
taking single-frame energy vectors of all frames in each spectrum subgraph as input, and adopting a clustering algorithm to obtain a cluster where the single-frame energy vectors of each frame are located;
taking the ratio of the short-time zero-crossing rate of each frame to its energy mean as a first scale factor; taking the ratio of the short-time zero-crossing rate of the frame corresponding to the maximum of all energy means in each spectral subgraph to that maximum energy mean as a maximum scale factor; taking the difference between the first scale factor and the maximum scale factor as the information approximation coefficient of each frame;
taking the difference between the maximum bird song syllable coverage over all frames on each spectral subgraph and the bird song syllable coverage of each frame as the numerator; taking the sum of the information approximation coefficient of each frame and a preset parameter as the denominator, and taking the ratio of the numerator to the denominator as the single-frame masking probability of each frame;
and taking the average value of the single-frame masking probabilities of the frames corresponding to all elements in the cluster where the single-frame energy vector of each frame is located as the time-frequency masking probability of each frame.
Preferably, the method for determining the spectrum masking value of each frame on each spectrum subgraph based on the bird song syllable coverage rate and the time-frequency masking probability of each frame on each spectrum subgraph comprises the following steps:
taking the average value of the syllable coverage rate and the time-frequency masking probability of the bird song of all frames on each frequency spectrum subgraph as a first average value and a second average value respectively;
the syllable decision value of any frame with the bird song syllable coverage rate larger than or equal to or smaller than the first mean value on each frequency spectrum subgraph is respectively set to be 1 and 0;
setting the masking decision value of any frame with the time-frequency masking probability being more than or equal to the second average value and less than the second average value on each spectrum subgraph to be 1 and 0 respectively;
and taking the syllable decision value and the sum operation result of the masking decision value of each frame on each spectrum subgraph as the spectrum masking value of each frame.
Preferably, the method for constructing the time-frequency mask of each spectrum sub-graph based on the spectrum mask values of all frames on each spectrum sub-graph comprises the following steps:
setting the masking value of all frequency points on any frame with the frequency spectrum masking value of 1 on each frequency spectrum subgraph to be 1;
for any frame with syllable decision value and masking decision value of 0 on each frequency spectrum subgraph, determining the masking value of each frequency point on each frame based on the noise component contributions of all frequencies on each frame;
and taking a binary image constructed by masking values of all frequency points on each frame on each frequency spectrum subgraph according to the positions of the frequency points on each frequency spectrum graph as a time-frequency masking image of each frequency spectrum subgraph.
Preferably, the method for determining the masking value of each frequency point on each frame based on the noise component contributions of all frequencies on each frame is as follows:
the ratio of the energy period stability of each frequency point on each frame to the pitch predictability coefficient of that frequency point is taken as the noise component contribution of the frequency point;
taking the contribution of the noise components of all frequency points on each frame as input, and acquiring a segmentation threshold value of the contribution of the noise components on each frame by using a threshold segmentation algorithm;
the masking value of any frequency point whose noise component contribution is greater than the division threshold is set to 0, and the masking value of any frequency point whose noise component contribution is less than the division threshold is set to 1.
Preferably, the method for determining the species identification result corresponding to each mixed bird song data based on the enhanced bird song feature maps of the spectral subgraphs by using the TDNN identification model includes:
taking the result of multiplying each spectral subgraph by its time-frequency masking map as the enhanced bird song feature map of that subgraph; and taking the enhanced bird song feature maps of all spectral subgraphs in the spectrograms of all mixed bird song data as input, determining the species identification result corresponding to each mixed bird song data by means of a TDNN identification model.
The beneficial effects of the invention are as follows: the invention estimates the probability that each frame of each mixed bird song data contains bird song syllables by constructing the bird song syllable coverage from the energy periodicity and the degree of pitch confusion of each frame on the spectral subgraph; secondly, it determines the time-frequency masking probability of each frame by analyzing the frequency oscillation characteristics at each frame, where the masking probability accounts for the phenomenon of energy overlap between frames within the spectral subgraph, reducing the influence on the per-frame evaluation of partial energy on some frames being covered by the energy of noise frames; finally, it determines the time-frequency masking map of each spectral subgraph from the spectrum masking value of each frame, thereby enhancing the mixed bird song data so that the species identification results of the subsequent TDNN network are more accurate.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are only some embodiments of the invention, and that other drawings can be obtained according to these drawings without inventive faculty for a person skilled in the art.
FIG. 1 is a schematic flow chart of a method for identifying bird song species automatically based on a TDNN structure according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a time-frequency mask for each spectral subgraph according to an embodiment of the present invention;
fig. 3 is a flowchart of an implementation of a method for identifying a bird song species based on a TDNN structure according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, a flowchart of a method for automatically identifying a bird song species based on a TDNN structure according to an embodiment of the present invention is shown, and the method includes the following steps:
and S001, collecting mixed bird song data in the ecological area.
AI voiceprint sensing devices deployed in the ecological area are used; each device is a microphone array composed of four sound pickups that can collect audio within a 200 m range, and the collected bird song data are transmitted back to a data center in real time through the Internet of Things, where the subsequent mixed bird song recognition is performed. In the present invention, the sampling frequency of the bird song data is set to 22.05 kHz, and the sampling period of each bird song datum is set to 5 s.
While the AI voiceprint sensing device acquires bird song data, device noise, environmental noise in the ecological area, calls of other species, and the like may be present, for example the rustling of leaves blown by the wind; such noise and the calls of other species are captured together with the target bird song and transmitted to the data center. Following the sampling duration, the data center likewise treats each 5 s audio segment as one mixed bird song datum.
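As a concrete illustration, the following minimal Python sketch segments a field recording into the 5 s mixed bird song data described above; the file name and function name are illustrative assumptions, not part of the patent.

    import numpy as np
    import librosa

    SR = 22050        # 22.05 kHz sampling frequency from the description
    CLIP_SEC = 5      # 5 s sampling period per mixed bird song datum

    def split_into_clips(path: str) -> list:
        """Load a field recording and cut it into 5 s mixed bird song clips."""
        y, _ = librosa.load(path, sr=SR, mono=True)   # resample to 22.05 kHz
        n = SR * CLIP_SEC
        return [y[i:i + n] for i in range(0, len(y) - n + 1, n)]

    clips = split_into_clips("ecoarea_recording.wav")  # hypothetical file name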
So far, mixed bird song data in the ecological area are obtained and used for calculating the syllable coverage rate of subsequent bird song.
Step S002, determining the bird song syllable coverage rate of each frame on each spectrum sub-graph based on the periodicity of the signal energy at each frame on each spectrum sub-graph and the stability of the bird song pitch.
In the mixed bird song data of each AI voiceprint device, calls from different birds act as audio from different sound sources in the mixture, that is, each singing bird is a sound source, and different birds sing with different timbres; moreover, the same bird expresses different information when singing, and factors such as age cause further differences between bird song signals. Therefore, the invention considers clustering the bird song signals and classifying bird song syllables by the similarity of their audio characteristics, so as to distinguish the bird song data of different birds.
Each mixed bird song datum received by the data center contains various environmental noises. Natural sounds such as rain and wind can be attributed to stationary noise, broadband noise distributed roughly uniformly across the full frequency range, whereas during song the bird's resonant cavity vibrates at the same frequency for a short time under glottal excitation, forming periodic variation. That is, the interval between adjacent syllables in each bird song signal contained in the mixed data is very short, each syllable has a spectral centroid, the energy of each syllable goes from weak to strong and back to weak, and the energy is concentrated near the spectral centroid.
Specifically, a spectrogram of each mixed bird song datum is obtained; with 2 s as the time-window length and 0.25 s as the moving step between adjacent windows (the first window covers 0 s to 2 s, the second 0.25 s to 2.25 s, and so on), the spectrogram of each mixed bird song datum can be divided into 13 equal-scale spectral subgraphs. Because the noise in the mixed data has no definite duration or onset time, this division lets the subsequent per-frame analysis of each spectral subgraph capture as many frames containing bird song frequencies as possible. Note that the practitioner may set the time-window length to a suitable value according to the acquisition duration of the mixed bird song data.
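A minimal sketch of this division, assuming an STFT magnitude spectrogram; n_fft and hop_length are illustrative choices the patent does not fix, so the exact frame counts (and hence whether exactly 13 subgraphs result) depend on the STFT framing.

    import numpy as np
    import librosa

    def spectral_subgraphs(clip, sr=22050, win_s=2.0, step_s=0.25,
                           n_fft=1024, hop_length=256):
        """Slide a 2 s window in 0.25 s steps over the clip's spectrogram and
        return the equal-scale spectral subgraphs (bins x frames each)."""
        S = np.abs(librosa.stft(clip, n_fft=n_fft, hop_length=hop_length))
        win = int(round(win_s * sr / hop_length))     # frames per time window
        step = int(round(step_s * sr / hop_length))   # frames per moving step
        return [S[:, s:s + win] for s in range(0, S.shape[1] - win + 1, step)]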
Further, in each mixed bird song datum the vocal organ structure of each bird differs and produces a different fundamental frequency, and when the same bird sings at different pitches the fundamental frequency of its sound also differs; that is, the pitch differs between different birds' data. Each mixed bird song datum is taken as input, and the YIN algorithm is used to obtain the pitch of each frequency on each frame; the YIN algorithm is a known technique, and the specific process is not repeated. Next, the frequencies of the energy maxima and minima within each frame's signal on each spectral subgraph are obtained, and the sets formed by all frequencies of the energy maxima and of the energy minima are taken as the maximum frequency set and the minimum frequency set of each frame, respectively.
Further, for each spectral subgraph, take the kth spectral subgraph in the spectrogram of the a-th mixed bird song data, written $F_a^k$, and its ith frame as an example. The frequencies in the maximum frequency set and the minimum frequency set of the ith frame, together with the energy value corresponding to each frequency, are input in time order, and an autoregressive integrated moving average ARIMA (Autoregressive Integrated Moving Average) model is used to obtain the predicted energy value of each frequency point on the ith frame of $F_a^k$; the application of the ARIMA model is a known technique, and the specific process is not repeated.
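The following sketch shows one hedged reading of this prediction step: the energies at a frame's extrema frequencies are fed to statsmodels' ARIMA, and the fitted values, interpolated over the remaining bins, serve as predicted energy values. The order (1, 1, 1) and the interpolation are illustrative assumptions.

    import numpy as np
    from scipy.signal import argrelextrema
    from statsmodels.tsa.arima.model import ARIMA

    def predicted_energies(frame_energy: np.ndarray) -> np.ndarray:
        """frame_energy: energy per frequency bin of one frame on a subgraph."""
        maxima = argrelextrema(frame_energy, np.greater)[0]  # maximum frequency set
        minima = argrelextrema(frame_energy, np.less)[0]     # minimum frequency set
        idx = np.sort(np.concatenate([maxima, minima]))
        fit = ARIMA(frame_energy[idx], order=(1, 1, 1)).fit()
        # interpolate the fitted extrema energies over all frequency bins
        return np.interp(np.arange(frame_energy.size), idx,
                         np.asarray(fit.fittedvalues))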
Based on the above analysis, a bird song syllable coverage is constructed here to characterize the likelihood that each frame in each mixed bird song data contains bird song syllables. The bird song syllable coverage of the ith frame of $F_a^k$ is calculated as:

$$w_c^i=\frac{H\left(R_c^i\right)}{\left|E_c^i-\hat{E}_c^i\right|+\mu},\qquad g_c^i=\frac{\operatorname{Var}\left(T_c^i\right)}{\sum_{j\neq i}\left|p_{\max}^{i}-p_{\max}^{j}\right|+\mu},\qquad S_i=\frac{1}{m-1}\sum_{j\neq i}\frac{\cos\left(W_i,W_j\right)}{\cos\left(G_i,G_j\right)+\mu}$$

where $w_c^i$ is the energy period stability of frequency $c$ in the ith frame; $R_c^i$ is the sequence of differences between the energy value of frequency $c$ and the energy values of the remaining frequencies on the ith frame, and $H(R_c^i)$ is the Hurst exponent of that sequence; $E_c^i$ and $\hat{E}_c^i$ are respectively the energy value and the predicted energy value of frequency $c$ in the ith frame; $\mu$ is a parameter adjustment factor of size 0.01 used to prevent the denominator from being 0; the calculation of the Hurst exponent is a known technique, and the specific process is not repeated;
$g_c^i$ is the pitch predictability coefficient of frequency $c$ in the ith frame; $T_c^i$ is the set of time differences between the ith frame and the other frames of $F_a^k$ whose pitch equals that of frequency $c$, and $\operatorname{Var}(T_c^i)$ is the variance of the elements of that set; $m$ is the number of frames contained in $F_a^k$, and $j$ indexes its jth frame; $p_{\max}^{i}$ and $p_{\max}^{j}$ are respectively the maximum pitch on the ith frame and on the jth frame;
$S_i$ is the bird song syllable coverage of the ith frame of $F_a^k$; $W_i$ and $W_j$ are the energy stability evaluation vectors composed of the energy period stabilities of all frequencies on the ith and jth frames in ascending frequency order; $G_i$ and $G_j$ are the pitch predictability vectors composed likewise of the pitch predictability coefficients; $\cos(W_i,W_j)$ and $\cos(G_i,G_j)$ are the cosine similarities between $W_i$ and $W_j$ and between $G_i$ and $G_j$; cosine similarity is a known technique, and the specific process is not repeated.
The more the ith frame of $F_a^k$ conforms to the law of periodic energy change, the stronger the regularity of the distribution of differences between the energy value of frequency $c$ and the energy values of the remaining frequencies, and the larger the value of $H(R_c^i)$; the more predictable the energy value, the smaller the difference between the predicted and actual energy values, the smaller $|E_c^i-\hat{E}_c^i|$, and the larger the value of $w_c^i$. The more frames of $F_a^k$ contain actual bird song information, the larger the pitch fluctuation between adjacent frames, the more scattered the time intervals between equal-pitch frames, and the larger $\operatorname{Var}(T_c^i)$; the higher the probability that the ith frame contains a bird song syllable, the closer the maximum pitch on the ith frame is to the maximum pitch on the remaining frames, the smaller the first pitch difference $|p_{\max}^{i}-p_{\max}^{j}|$, the smaller the denominator, and the larger the value of $g_c^i$. The stronger the periodicity of the energy variation between the ith frame and the remaining frames within the subgraph, the higher the similarity between the energy stability evaluation vectors and the larger $\cos(W_i,W_j)$; the more unstable the pitch variation between the ith frame and the remaining frames, the worse the periodicity of pitch between adjacent frames and the smaller $\cos(G_i,G_j)$. That is, the larger the value of $S_i$, the greater the likelihood that the ith frame contains a bird song syllable.
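To make the formulas concrete, here is a hedged NumPy sketch of $S_i$ given precomputed per-frame matrices W (energy period stabilities) and G (pitch predictability coefficients); hurst() is a crude rescaled-range estimate standing in for the Hurst exponent the description treats as known, and all array shapes are illustrative assumptions.

    import numpy as np

    MU = 0.01  # parameter adjustment factor from the description

    def hurst(x: np.ndarray) -> float:
        """Crude single-window rescaled-range (R/S) Hurst estimate."""
        x = x - x.mean()
        z = np.cumsum(x)
        rs = (z.max() - z.min()) / (x.std() + 1e-12)
        return np.log(rs + 1e-12) / np.log(len(x))

    def cos_sim(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def syllable_coverage(W: np.ndarray, G: np.ndarray) -> np.ndarray:
        """W, G: (m frames, C bins); row i is the frame-i energy stability
        evaluation vector / pitch predictability vector. Returns S_i per frame."""
        m = W.shape[0]
        S = np.empty(m)
        for i in range(m):
            ratios = [cos_sim(W[i], W[j]) / (cos_sim(G[i], G[j]) + MU)
                      for j in range(m) if j != i]
            S[i] = np.mean(ratios)
        return S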
Thus, the coverage rate of the bird song syllables of each frame is obtained and is used for subsequently determining the time-frequency masking probability of each frame.
Step S003, determining the time-frequency masking probability of each frame on each frequency spectrum subgraph according to the bird song syllable coverage rate of each frame in the cluster where the single-frame energy vector of each frame is located; and determining the spectrum masking value of each frame on each spectrum subgraph based on the bird song syllable coverage rate and the time-frequency masking probability of each frame on each spectrum subgraph.
Further, when the AI voiceprint device collects mixed bird song data, environmental and device noise is often mixed into each datum. Noise changes the period and spectral flatness of the original signal, so bird song cannot be fully detected through the bird song syllable coverage alone, and syllables may be missed.
Similar to human pronunciation, on audio frames where bird song appears the short-time zero-crossing rate is small and the energy is large. Moreover, for mixed bird song data the signal energy is mostly sparse, and the energy values are distinguishable. Accordingly, the spectrogram of each mixed bird song datum contains many regions with energy value 0, which generally are not regions from which bird song features can be extracted. The invention therefore further judges each frame on each spectral subgraph by means of time-frequency masking.
In particular, for each spectral subgraph in each mixed signal, take $F_a^k$ as an example: the energy values corresponding to all frequencies on each frame of $F_a^k$ are arranged in ascending frequency order, the resulting vector is taken as the single-frame energy vector of that frame, and the mean of all elements of each frame's single-frame energy vector is taken as the frame's energy mean. Next, the number of frequencies with energy value 0 on each frame of $F_a^k$ and the short-time zero-crossing rate of each frame are computed; the calculation of the short-time zero-crossing rate of an audio signal is a known technique, and the specific process is not repeated.
Further, the single-frame energy vectors of all frames of $F_a^k$ are taken as input and clustered with the k-means algorithm to obtain the cluster containing each frame's single-frame energy vector; k-means clustering is a known technique, and the specific process is not repeated. The purpose of clustering the single-frame energy vectors is to reduce the influence on the per-frame evaluation of the phenomenon, arising when inter-frame energy overlap occurs within $F_a^k$, that part of the energy on some frames is covered by the energy of noise frames.
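A hedged sketch of these per-frame quantities follows; the cluster count k and the zero-crossing-rate framing parameters are illustrative assumptions (the description does not fix them), chosen here to align with the STFT framing assumed earlier.

    import numpy as np
    import librosa
    from sklearn.cluster import KMeans

    def frame_statistics(sub: np.ndarray, frame_signal: np.ndarray, k: int = 4):
        """sub: (C bins, m frames) magnitude subgraph; frame_signal: time-domain
        samples covered by the subgraph. Returns single-frame energy vectors,
        cluster labels, per-frame energy means, zero-crossing rates, and
        zero-energy bin counts."""
        energy = (sub ** 2).T                  # (m, C): single-frame energy vectors
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(energy)
        e_mean = energy.mean(axis=1)           # energy mean of each frame
        zcr = librosa.feature.zero_crossing_rate(
            frame_signal, frame_length=1024, hop_length=256)[0]
        n_zero = (energy == 0).sum(axis=1)     # count of zero-energy frequencies
        return energy, labels, e_mean, zcr, n_zero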
Based on the above analysis, a time-frequency masking probability is constructed here to characterize the likelihood that each frame in each spectrogram is masked. The time-frequency masking probability of the ith frame of $F_a^k$ is calculated as:

$$\gamma_i=\frac{Z_i}{\bar{E}_i}-\frac{Z_{\max}}{\bar{E}_{\max}},\qquad q_i=\frac{S_{\max}-S_i}{\gamma_i+\mu},\qquad P_i=\frac{1}{K}\sum_{j=1}^{K}q_j$$

where $\gamma_i$ is the information approximation coefficient of the ith frame; $Z_i$ and $\bar{E}_i$ are respectively the short-time zero-crossing rate and the energy mean of the ith frame; $\bar{E}_{\max}$ is the maximum of the element means over all single-frame energy vectors in the subgraph, and $Z_{\max}$ is the short-time zero-crossing rate of the frame corresponding to $\bar{E}_{\max}$;
$q_i$ is the single-frame masking probability of the ith frame; $S_{\max}$ is the maximum bird song syllable coverage over all frames of $F_a^k$, and $S_i$ is the bird song syllable coverage of the ith frame; $\mu$ is a parameter adjustment factor of size 0.01 used to prevent the denominator from being 0;
$P_i$ is the time-frequency masking probability of the ith frame of $F_a^k$; $K$ is the number of single-frame energy vectors in the cluster containing the single-frame energy vector of the ith frame, $j$ indexes the jth single-frame energy vector in that cluster, and $q_j$ is the single-frame masking probability of the corresponding frame.
The greater the probability that bird song occurs on the ith frame of $F_a^k$, the closer the oscillation frequency of its local region is to the oscillation frequency at the other bird song frames of the subgraph: the short-time zero-crossing rate $Z_i$ is smaller, the energy values and hence the energy mean $\bar{E}_i$ are larger, the first scale factor $Z_i/\bar{E}_i$ is smaller and closer to the maximum scale factor $Z_{\max}/\bar{E}_{\max}$, and $\gamma_i$ is smaller. The more disordered the energy distribution in the local region of the ith frame, the lower the likelihood that it contains a bird song syllable, the smaller $S_i$, the larger $S_{\max}-S_i$, and the larger $q_i$. That is, the larger the value of $P_i$, the lower the probability of bird song on the ith frame, the sparser its energy, and the greater the probability that it is time-frequency masked.
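Continuing the sketch, the masking probability $P_i$ follows directly from the formulas above given the coverage S, zero-crossing rates Z, energy means E, and the k-means labels from the previous snippet:

    import numpy as np

    MU = 0.01

    def tf_masking_probability(S, Z, E, labels):
        """Information approximation coefficient, single-frame masking probability,
        and cluster-averaged time-frequency masking probability per frame."""
        j = int(np.argmax(E))                 # frame with the maximum energy mean
        gamma = Z / E - Z[j] / E[j]           # information approximation coefficient
        q = (S.max() - S) / (gamma + MU)      # single-frame masking probability
        P = np.empty_like(q)
        for c in np.unique(labels):           # average q over each k-means cluster
            P[labels == c] = q[labels == c].mean()
        return P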
According to the above steps, the bird song syllable coverage and the time-frequency masking probability of all frames in the spectral subgraph are obtained. The spectrum masking value of each frame is then determined from its bird song syllable coverage and time-frequency masking probability; for the ith frame of $F_a^k$:

$$u_i=\begin{cases}1, & S_i\ge\bar{S}\\ 0, & S_i<\bar{S}\end{cases}\qquad v_i=\begin{cases}1, & P_i\ge\bar{P}\\ 0, & P_i<\bar{P}\end{cases}\qquad M_i=u_i\wedge v_i$$

where $u_i$ is the syllable decision value of the ith frame of $F_a^k$, $S_i$ is its bird song syllable coverage, and $\bar{S}$ is the mean bird song syllable coverage over all frames of the subgraph;
$v_i$ is the masking decision value of the ith frame, $P_i$ is its time-frequency masking probability, and $\bar{P}$ is the mean time-frequency masking probability over all frames of the subgraph;
$M_i$ is the spectrum masking value of the ith frame, and $\wedge$ is the AND operator: when both decision values are 1, $M_i$ is 1, and the masking value of every frequency point on the ith frame of $F_a^k$ is set to 1.
When both decision values are 0, $M_i$ is 0; in order to preserve more information, whether the masking value of each frequency point on the ith frame of $F_a^k$ is 1 or 0 is judged from the energy period stability and the pitch predictability coefficient of that frequency point. Specifically, the noise component contribution of each frequency is calculated separately on every frame whose spectrum masking value is 0; the noise component contributions of all frequencies on all such frames of $F_a^k$ are taken as input to the Otsu threshold algorithm to obtain a division threshold; the masking value of a frequency point whose noise component contribution is greater than the division threshold is set to 0, and that of a frequency point whose contribution is smaller is set to 1. Taking frequency $c$ on the ith frame of $F_a^k$ as an example, its noise component contribution $N_c^i$ is calculated as:

$$N_c^i=\frac{w_c^i}{g_c^i}$$

where $w_c^i$ and $g_c^i$ are respectively the energy period stability and the pitch predictability coefficient of frequency $c$ in the ith frame.
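A hedged sketch of the masking decisions; skimage's threshold_otsu stands in for the Otsu threshold algorithm named above, and frames whose two decision values differ (a case the description leaves implicit) are left fully masked here as an assumption.

    import numpy as np
    from skimage.filters import threshold_otsu

    def time_frequency_mask(S, P, W, G):
        """S, P: (m,) coverage / masking probability; W, G: (m, C) energy period
        stability / pitch predictability per bin. Returns the (m, C) mask."""
        u = (S >= S.mean()).astype(int)           # syllable decision value
        v = (P >= P.mean()).astype(int)           # masking decision value
        mask = np.zeros(W.shape, dtype=int)
        mask[(u & v) == 1, :] = 1                 # spectrum masking value 1: keep all bins
        both0 = (u == 0) & (v == 0)
        if both0.any():
            noise = W[both0] / (G[both0] + 1e-12) # noise component contribution
            th = threshold_otsu(noise)            # division threshold
            mask[both0] = (noise < th).astype(int)
        return mask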
So far, the masking value of each frequency point on each frame is obtained and is used for subsequently determining the time-frequency masking diagram of each frequency spectrum subgraph.
Step S004, constructing a time-frequency masking diagram of each frequency spectrum subgraph based on the frequency spectrum masking values of the frequency points on all frames on each frequency spectrum subgraph; and determining a species identification result corresponding to each mixed bird song data based on each frequency spectrum subgraph and the time-frequency masking graph thereof by adopting a TDNN identification model.
According to the above steps, the spectrum masking value of each frame of $F_a^k$ is obtained, from which the time-frequency masking map of $F_a^k$ is constructed: the masking values of all frequency points on each frame are arranged according to the positions of the frequency points in the subgraph, giving a binary image, as shown in Fig. 2, where black and white squares represent masking values 0 and 1, respectively. Next, the result of multiplying the time-frequency masking map of $F_a^k$ element-wise with $F_a^k$ is taken as the enhanced bird song feature map of $F_a^k$.
Further, the enhanced bird song feature maps of all spectral subgraphs in each mixed bird song datum are obtained, and the recognition result of the bird song species in the ecological area is derived from them. The enhanced bird song feature maps of all spectral subgraphs of all mixed bird song data are taken as the input of the TDNN network, with the Adam algorithm as the optimization algorithm and the cross-entropy function as the loss function; the output of the TDNN network is the bird identification result corresponding to each mixed bird song datum. Training of the neural network is a known technique, and the specific process is not repeated.
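Finally, a minimal PyTorch sketch of a TDNN classifier consistent with step S004; the description fixes only the Adam optimizer and the cross-entropy loss, so the layer widths, kernel sizes, dilations, and class count here are illustrative assumptions. Each enhanced feature map (bins x frames) is fed as a 1-D sequence over time.

    import torch
    import torch.nn as nn

    class TDNN(nn.Module):
        """Dilated 1-D convolutions over the time axis: the classic TDNN layout."""
        def __init__(self, n_bins: int, n_species: int):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(n_bins, 512, kernel_size=5, dilation=1), nn.ReLU(),
                nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
                nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),          # pool over time to one vector
            )
            self.fc = nn.Linear(512, n_species)

        def forward(self, x):                      # x: (batch, n_bins, n_frames)
            return self.fc(self.net(x).squeeze(-1))

    model = TDNN(n_bins=513, n_species=50)         # 513 bins matches n_fft=1024
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam, as stated
    criterion = nn.CrossEntropyLoss()              # cross-entropy, as stated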
Further, each mixed bird song datum collected by the AI voiceprint devices and its corresponding species identification result are stored in the data center, where the data center's managers and the ecological area's managers carry out subsequent evaluation of species behavior in the ecological area.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. The foregoing description of the preferred embodiments of the present invention is not intended to be limiting, but rather, any modifications, equivalents, improvements, etc. that fall within the principles of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. Automatic identification method of bird song species based on TDNN structure, which is characterized by comprising the following steps:
collecting mixed bird song data in an ecological area;
dividing a spectrogram of each mixed bird song data into frequency spectrum subgraphs with equal scales; determining a bird song syllable coverage of each frame on each spectral subgraph based on the periodicity of the signal energy at each frame on each spectral subgraph and the stability of the bird song pitch;
determining the time-frequency masking probability of each frame on each frequency spectrum subgraph according to the bird song syllable coverage rate and the information approximation coefficient of each frame in the cluster where the single-frame energy vector of each frame on each frequency spectrum subgraph is located;
determining a spectrum masking value of each frame on each spectrum subgraph based on the bird song syllable coverage rate and the time-frequency masking probability of each frame on each spectrum subgraph;
constructing a time-frequency masking map of each spectrum sub-graph based on the spectrum masking values of all frames on each spectrum sub-graph; and determining a species identification result corresponding to each mixed bird song data based on each frequency spectrum subgraph and the time-frequency masking graph thereof by adopting a TDNN identification model.
2. The automatic identification method of a bird song species based on a TDNN structure according to claim 1, wherein the method for dividing the spectrogram of each mixed bird song data into the spectrum subgraphs with equal scales is as follows:
and sliding the spectrogram of each mixed bird song data along the time sequence according to a preset moving step length by utilizing a time window with a preset scale, and taking the spectrogram in the sliding position of each time window as a frequency spectrum subgraph.
3. The automatic identification method of a bird song species based on a TDNN structure according to claim 1, wherein the method for determining the coverage rate of a bird song syllable of each frame on each spectrum subgraph based on the periodicity of signal energy at each frame on each spectrum subgraph and the stability of the bird song pitch is as follows:
determining the energy period stability of each frequency point on each frame based on the energy values and the predicted energy values of all frequency points on each frame on each frequency spectrum subgraph;
determining a pitch predictability coefficient of each frequency point on each frame based on the time differences between the different frames where frequency points of equal pitch are located on each spectral subgraph;
respectively taking vectors formed by energy period stability and pitch predictability coefficients of all frequency points on each frame on each frequency spectrum subgraph according to the ascending order of the frequencies as energy stability evaluation vectors and pitch predictability vectors of each frame;
taking the similarity measurement result between the energy stability evaluation vector of each frame and that of any other frame as the numerator;
taking the sum of the similarity measurement result between the pitch predictability vector of each frame and that of any other frame and a preset parameter as the denominator;
taking the mean of the ratios of numerator to denominator accumulated over all remaining frames of each spectral subgraph as the bird song syllable coverage of each frame.
4. The automatic identification method of a bird song species based on a TDNN structure according to claim 3, wherein the method for determining the energy period stability of each frequency point on each frame based on the energy values and the predicted energy values of all frequency points on each frame on each spectrum subgraph is as follows:
respectively taking a set formed by all frequencies of the maximum value and the minimum value of energy in each frame of signal on each frequency spectrum subgraph as a maximum frequency set and a minimum frequency set of each frame; determining the predicted energy value of each frequency point on each frame based on the maximum frequency set and the minimum frequency set of each frame by adopting a data prediction algorithm;
taking the Hurst exponent of the sequence formed by the differences between the energy value of each frequency point and the energy values of the remaining frequency points on each frame as the numerator;
taking the sum of the absolute difference between the energy value and the predicted energy value of the frequency point and a preset parameter as the denominator; the ratio of the numerator to the denominator is taken as the energy period stability of each frequency point on each frame.
5. The automatic identification method of a bird song species based on a TDNN structure according to claim 3, wherein the method for determining the pitch predictability coefficient of each frequency point on each frame based on the time difference between different frames where the frequency points with equal pitches on each spectrum subgraph are located is as follows:
taking any frame on each spectral subgraph whose pitch equals that of each frequency point on each frame as an equal-pitch frame of that frequency point; taking the variance of the elements of the set formed by the time differences between each frame and all equal-pitch frames of each of its frequency points as the numerator;
taking the difference between the maximum pitch over all frequency points on each frame of each spectral subgraph and the maximum pitch over all frequency points on any other frame as a first pitch difference; taking the sum of the first pitch differences accumulated over all other frames of the spectral subgraph and a preset parameter as the denominator;
the ratio of the numerator to the denominator is taken as the pitch predictability coefficient for each frequency point on each frame.
6. The automatic recognition method of a bird song species based on a TDNN structure according to claim 1, wherein the method for determining the time-frequency masking probability of each frame on each spectrum subgraph according to the bird song syllable coverage rate and the information approximation coefficient of each frame in the cluster where the single frame energy vector of each frame on each spectrum subgraph is located is as follows:
taking a sequence formed by energy values of all frequency points on each frame on each frequency spectrum subgraph according to the ascending order of the frequencies as a single-frame energy vector of each frame; taking the average value of all elements in the single-frame energy vector of each frame as the energy average value of each frame;
taking single-frame energy vectors of all frames in each spectrum subgraph as input, and adopting a clustering algorithm to obtain a cluster where the single-frame energy vectors of each frame are located;
taking the ratio of the short-time zero-crossing rate of each frame to its energy mean as a first scale factor; taking the ratio of the short-time zero-crossing rate of the frame corresponding to the maximum of all energy means in each spectral subgraph to that maximum energy mean as a maximum scale factor; taking the difference between the first scale factor and the maximum scale factor as the information approximation coefficient of each frame;
taking the difference between the maximum bird song syllable coverage over all frames on each spectral subgraph and the bird song syllable coverage of each frame as the numerator; taking the sum of the information approximation coefficient of each frame and a preset parameter as the denominator, and taking the ratio of the numerator to the denominator as the single-frame masking probability of each frame;
and taking the average value of the single-frame masking probabilities of the frames corresponding to all elements in the cluster where the single-frame energy vector of each frame is located as the time-frequency masking probability of each frame.
7. The automatic recognition method of a bird song species based on a TDNN structure according to claim 1, wherein the method for determining the spectrum masking value of each frame on each spectrum subgraph based on the bird song syllable coverage rate and the time-frequency masking probability of each frame on each spectrum subgraph is as follows:
taking the average value of the syllable coverage rate and the time-frequency masking probability of the bird song of all frames on each frequency spectrum subgraph as a first average value and a second average value respectively;
setting the syllable decision value of any frame on each spectral subgraph to 1 when its bird song syllable coverage is greater than or equal to the first mean and to 0 when it is smaller;
setting the masking decision value of any frame on each spectral subgraph to 1 when its time-frequency masking probability is greater than or equal to the second mean and to 0 when it is smaller;
and taking the result of the AND operation between the syllable decision value and the masking decision value of each frame on each spectral subgraph as the spectrum masking value of that frame.
8. The automatic identification method of a bird song species based on a TDNN structure according to claim 1, wherein the method for constructing the time-frequency mask of each spectrum subgraph based on the spectrum mask values of all frames on each spectrum subgraph is as follows:
setting the masking value of all frequency points on any frame with the frequency spectrum masking value of 1 on each frequency spectrum subgraph to be 1;
for any frame with syllable decision value and masking decision value of 0 on each frequency spectrum subgraph, determining the masking value of each frequency point on each frame based on the noise component contributions of all frequencies on each frame;
and taking a binary image constructed by masking values of all frequency points on each frame on each frequency spectrum subgraph according to the positions of the frequency points on each frequency spectrum graph as a time-frequency masking image of each frequency spectrum subgraph.
9. The automatic identification method of a bird song species based on a TDNN structure according to claim 8, wherein the method of determining the masking value of each frequency point on each frame based on the noise component contributions of all frequencies on each frame is:
the ratio of the energy period stability of each frequency point on each frame to the pitch predictability coefficient of that frequency point is taken as the noise component contribution of the frequency point;
taking the contribution of the noise components of all frequency points on each frame as input, and acquiring a segmentation threshold value of the contribution of the noise components on each frame by using a threshold segmentation algorithm;
the masking value of any frequency point whose noise component contribution is greater than the division threshold is set to 0, and the masking value of any frequency point whose noise component contribution is less than the division threshold is set to 1.
10. The automatic identification method for bird song species based on TDNN structure according to claim 1, wherein the method for determining the species identification result corresponding to each mixed bird song data based on the enhanced bird song feature map of the spectral subgraph by using the TDNN identification model is as follows:
taking the result of multiplying each spectral subgraph by its time-frequency masking map as the enhanced bird song feature map of that subgraph; and taking the enhanced bird song feature maps of all spectral subgraphs in the spectrograms of all mixed bird song data as input, determining the species identification result corresponding to each mixed bird song data by means of a TDNN identification model.
CN202410179331.XA, priority and filing date 2024-02-18: Automatic identification method for bird song species based on TDNN structure. Active. Granted as CN117727309B (en).

Priority Applications (1)

CN202410179331.XA (priority and filing date 2024-02-18): Automatic identification method for bird song species based on TDNN structure


Publications (2)

CN117727309A, published 2024-03-19
CN117727309B, granted and published 2024-04-26

Family

Family ID: 90209272

Family Applications (1)

CN202410179331.XA (Active), priority and filing date 2024-02-18: Automatic identification method for bird song species based on TDNN structure

Country Status (1)

CN: CN117727309B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050049876A1 (en) * 2003-08-28 2005-03-03 Ian Agranat Method and apparatus for automatically identifying animal species from their vocalizations
CN109886214A (en) * 2019-02-26 2019-06-14 中南民族大学 A kind of chirm characteristic strengthening method based on image procossing
CN113707158A (en) * 2021-08-02 2021-11-26 南昌大学 Power grid harmful bird seed singing recognition method based on VGGish migration learning network
CN114863937A (en) * 2022-05-17 2022-08-05 武汉工程大学 Hybrid birdsong identification method based on deep migration learning and XGboost
CN116524939A (en) * 2023-04-23 2023-08-01 南京理工大学 ECAPA-TDNN-based automatic identification method for bird song species

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHI Dailong, "Research on spectrograms based on factor decomposition" (基于因子分解的声谱图研究), China Strategic Emerging Industry (中国战略新兴产业), no. 44, 28 November 2017 (2017-11-28) *

Also Published As

CN117727309B, published 2024-04-26

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant