CN116206612A - Bird voice recognition method, model training method, device and electronic equipment - Google Patents

Publication number
CN116206612A
Authority
CN
China
Prior art keywords: bird, data, sound, voice recognition, bird sound
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310216766.2A
Other languages
Chinese (zh)
Inventor
郭慧敏
鉴海防
王洪昌
朱文旗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Semiconductors of CAS
Original Assignee
Institute of Semiconductors of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Institute of Semiconductors of CAS filed Critical Institute of Semiconductors of CAS
Priority to CN202310216766.2A priority Critical patent/CN116206612A/en
Publication of CN116206612A publication Critical patent/CN116206612A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/04: Training, enrolment or model building
    • G10L17/06: Decision making techniques; Pattern matching strategies
    • G10L17/14: Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L17/18: Artificial neural networks; Connectionist approaches
    • G10L17/26: Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices


Abstract

The disclosure provides a bird voice recognition method, a model training method, a device and electronic equipment based on dual-path time-frequency joint modeling and multiple auxiliary branches. The bird voice recognition model is an N-layer cascade structure, N > 1, and each layer of the cascade structure comprises a feature extraction layer, a dual-path time-frequency joint modeling unit, a jump connection layer and an auxiliary branch classifier connected in sequence. The model training method comprises: preprocessing the acquired bird sound sample data to obtain preprocessed bird sound sample data and corresponding tag data; inputting the preprocessed bird sound sample data into the bird voice recognition model to obtain N classification results; obtaining a bird classification result from the N classification results; calculating a loss value from the bird classification result and the tag data to obtain a loss result; and iteratively adjusting the parameters of the bird voice recognition model using the loss result to obtain a trained bird voice recognition model.

Description

Bird voice recognition method, model training method, device and electronic equipment
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to a bird voice recognition method, a model training method, a device and electronic equipment, and more particularly to a multi-auxiliary-branch bird voice recognition method, model training method, device and electronic equipment based on dual-path time-frequency joint modeling.
Background
Bird song carries rich biological information and is one of the most effective means by which birds exchange information. By analyzing sound data, one can not only obtain the species information contained in a target bird's call, but also analyze changes in the local ecological environment from long-term monitoring data.
However, current analysis of bird sound data still depends largely on manual processing by conservation workers, which requires considerable manpower and material resources. Although some studies have attempted to build bird voice recognition models based on deep learning to automate the analysis of target bird sounds, these methods neither consider characteristics of the bird sound spectrogram such as its local time-frequency structure and its global time and frequency dependencies, nor fully account for the influence of feature information at different levels on the final bird classification result. As a consequence, existing bird voice recognition models suffer from low accuracy.
Disclosure of Invention
In view of the above, the present disclosure provides a multi-auxiliary-branch bird voice recognition method, model training method, device and electronic apparatus based on dual-path time-frequency joint modeling, so as to at least partially solve the above technical problems.
According to a first aspect of the present disclosure, a training method of a multi-auxiliary-branch bird voice recognition model based on dual-path time-frequency joint modeling is provided. The bird voice recognition model is an N-layer cascade structure, N > 1, and each layer of the cascade structure includes a feature extraction layer, a dual-path time-frequency joint modeling unit, a jump connection layer and an auxiliary branch classifier connected in sequence. The method includes:
preprocessing the acquired bird sound sample data to obtain preprocessed bird sound sample data and corresponding tag data, the bird sound sample data including bird sound data and environmental sound data;
performing the following operations in the order of the N-layer cascade structure until N classification results are obtained:
inputting the preprocessed bird sound sample data into the feature extraction layer of the (i-1)-th layer cascade structure and outputting first feature data; inputting the first feature data into the dual-path time-frequency joint modeling unit and outputting second feature data; inputting the second feature data into the jump connection layer and outputting third feature data; and inputting the third feature data into the auxiliary branch classifier and outputting the (i-1)-th classification result, where i ≤ N;
inputting the third feature data output by the jump connection layer in the (i-1)-th layer cascade structure into the i-th layer cascade structure and outputting the i-th classification result;
obtaining a bird classification result from the N classification results;
calculating a loss value from the bird classification result and the tag data to obtain a loss result;
and iteratively adjusting the parameters of the bird voice recognition model using the loss result to obtain a trained bird voice recognition model.
According to an embodiment of the present disclosure, the dual-path time-frequency joint modeling unit includes a time convolution-attention mechanism unit and a frequency convolution-attention mechanism unit;
the dual-path time-frequency joint modeling unit in each layer of the cascade structure performs the following operations:
performing a first reshaping operation on the first feature data to obtain first reshaped feature data, inputting the first reshaped feature data into the time convolution-attention mechanism unit, and outputting time-path first reshaped feature data;
performing a first residual connection on the time-path first reshaped feature data and the first reshaped feature data to obtain first residual feature data;
performing a second reshaping operation on the first residual feature data to obtain second reshaped feature data, inputting the second reshaped feature data into the frequency convolution-attention mechanism unit, and outputting frequency-path second reshaped feature data;
performing a second residual connection on the frequency-path second reshaped feature data and the second reshaped feature data to obtain second residual feature data;
and performing a third reshaping operation on the second residual feature data to obtain the second feature data.
According to an embodiment of the present disclosure, the N classification results differ from one another, and each independently captures one of the following feature types:
bird voiceprint amplitude, bird voiceprint edges, bird voiceprint gradient, bird voiceprint shape, bird voiceprint texture, first bird voiceprint semantic information, bird voiceprint structure information, and second bird voiceprint semantic information.
According to an embodiment of the present disclosure, preprocessing the acquired bird sound sample data to obtain preprocessed bird sound sample data and corresponding tag data includes:
extracting valid bird sound segments from the acquired bird sound sample data;
filtering, segmenting and labeling the valid bird sound segments to obtain first bird sound sample data and tag data;
framing, windowing, Fourier transforming and mel filtering the first bird sound sample data to obtain bird mel spectrogram sample data;
and performing data enhancement processing on the bird mel spectrogram sample data to obtain the preprocessed bird sound sample data.
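The framing, windowing, Fourier transform and mel filtering steps above can be sketched in plain NumPy. This is an illustrative reconstruction, not the patent's implementation; the sample rate, frame length, hop size and number of mel bands below are assumed values.

```python
import numpy as np

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular mel filterbank (a common textbook construction)."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):           # rising slope
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):          # falling slope
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_spectrogram(signal, sr=32000, frame_len=1024, hop=512, n_mels=64):
    # framing + Hann windowing
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # power spectrum via FFT, then mel filtering and log compression
    spec = np.abs(np.fft.rfft(frames, n=frame_len, axis=1)) ** 2
    mel = spec @ mel_filterbank(n_mels, frame_len, sr).T
    return np.log(mel + 1e-10)                  # shape: (n_frames, n_mels)

t = np.linspace(0, 1, 32000, endpoint=False)
chirp = np.sin(2 * np.pi * 3000 * t * (1 + t))  # synthetic bird-like sweep
m = log_mel_spectrogram(chirp)
```

The resulting (time, frequency) matrix is the "bird mel spectrogram sample data" that the subsequent data enhancement operates on.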
According to an embodiment of the present disclosure, the data enhancement processing performed on the bird mel spectrogram sample data to obtain the preprocessed bird sound sample data includes at least one of the following modes:
scaling the input bird mel spectrogram sample data in the horizontal or vertical direction with a random stretching ratio, to simulate changes of pitch and rhythm during bird vocalization;
shifting the input bird mel spectrogram sample data in the horizontal or vertical direction with a random rolling ratio, to simulate changes of bird song pitch and vocalization time;
warping the input bird mel spectrogram sample data along a given grid, to simulate distortions encountered during bird sound collection;
masking the input bird mel spectrogram sample data in the horizontal or vertical direction with a random masking ratio, to simulate information loss during bird sound collection.
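The shifting and masking modes can be illustrated on a mel spectrogram with a generic SpecAugment-style sketch. The mask and roll fractions below are assumed parameters, not values stated in the patent.

```python
import numpy as np

def mask_spec(spec, axis, max_frac=0.15, rng=None):
    """Zero one random band along `axis` (0 = time, 1 = frequency)
    to simulate information loss during sound collection."""
    if rng is None:
        rng = np.random.default_rng(0)
    size = spec.shape[axis]
    width = int(size * max_frac * rng.random())
    start = int(rng.integers(0, max(size - width, 1)))
    out = spec.copy()
    sl = [slice(None)] * spec.ndim
    sl[axis] = slice(start, start + width)
    out[tuple(sl)] = 0.0
    return out

def roll_spec(spec, axis, max_frac=0.1, rng=None):
    """Circularly shift along `axis` to simulate changes in pitch
    (frequency axis) or vocalization time (time axis)."""
    if rng is None:
        rng = np.random.default_rng(0)
    shift = int(spec.shape[axis] * max_frac * (2 * rng.random() - 1))
    return np.roll(spec, shift, axis=axis)

spec = np.random.default_rng(1).random((61, 64))  # toy mel spectrogram (T, F)
aug = mask_spec(mask_spec(spec, axis=0), axis=1)  # one time mask + one freq mask
shifted = roll_spec(spec, axis=1)                 # random pitch shift
```

Scaling and grid warping would require interpolation and are omitted here for brevity.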
In a second aspect of the present disclosure, a multi-auxiliary-branch bird voice recognition method based on dual-path time-frequency joint modeling is provided, including:
preprocessing the acquired target bird sound data to obtain preprocessed target bird sound data, the target bird sound data including target bird sound data and environmental sound data;
inputting the preprocessed target bird sound data into the trained bird voice recognition model and outputting a prediction result;
comparing the prediction result with a preset output threshold, and determining the target bird information in the target bird sound data when the prediction result is greater than the preset output threshold;
wherein the trained bird voice recognition model is obtained by the training method of the multi-auxiliary-branch bird voice recognition model based on dual-path time-frequency joint modeling described above.
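The comparison against a preset output threshold might look like the sketch below. The softmax normalization, the class names and the threshold value of 0.5 are illustrative assumptions; the patent does not specify them.

```python
import numpy as np

def predict_species(logits, class_names, threshold=0.5):
    """Softmax the model output and report a species only when the
    top probability exceeds the preset output threshold."""
    e = np.exp(logits - np.max(logits))       # numerically stable softmax
    probs = e / e.sum()
    k = int(np.argmax(probs))
    if probs[k] > threshold:
        return class_names[k], float(probs[k])
    return None, float(probs[k])              # below threshold: no confident match

species, p = predict_species(np.array([0.2, 3.1, 0.4]),
                             ["sparrow", "cuckoo", "thrush"])
```

When no class clears the threshold, returning `None` lets the caller treat the clip as environmental sound rather than forcing a species label.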
According to an embodiment of the present disclosure, preprocessing the acquired target bird sound data to obtain preprocessed target bird sound data includes:
extracting valid target bird sound segments from the acquired target bird sound data;
filtering and segmenting the valid target bird sound segments to obtain first target bird sound data;
and framing, windowing, Fourier transforming and mel filtering the first target bird sound data to obtain the preprocessed target bird sound data.
In a third aspect of the present disclosure, a training device for a multi-auxiliary-branch bird voice recognition model based on dual-path time-frequency joint modeling is provided. The bird voice recognition model is an N-layer cascade structure, N > 1, and each layer of the cascade structure includes a feature extraction layer, a dual-path time-frequency joint modeling unit, a jump connection layer and an auxiliary branch classifier connected in sequence. The device includes:
a bird sound sample preprocessing module, configured to preprocess the acquired bird sound sample data to obtain preprocessed bird sound sample data and corresponding tag data, the bird sound sample data including bird sound data and environmental sound data;
a model training module, configured to perform the following operations in the order of the N-layer cascade structure until N classification results are obtained:
inputting the preprocessed bird sound sample data into the feature extraction layer of the (i-1)-th layer cascade structure and outputting first feature data; inputting the first feature data into the dual-path time-frequency joint modeling unit and outputting second feature data; inputting the second feature data into the jump connection layer and outputting third feature data; and inputting the third feature data into the auxiliary branch classifier and outputting the (i-1)-th classification result, where i ≤ N;
inputting the third feature data output by the jump connection layer in the (i-1)-th layer cascade structure into the i-th layer cascade structure and outputting the i-th classification result;
obtaining a bird classification result from the N classification results;
a calculation module, configured to calculate a loss value from the bird classification result and the tag data to obtain a loss result;
and an adjustment module, configured to iteratively adjust the parameters of the bird voice recognition model using the loss result to obtain a trained bird voice recognition model.
In a fourth aspect of the present disclosure, there is provided a multi-auxiliary branch bird voice recognition apparatus based on dual path time-frequency joint modeling, comprising:
the target bird sound preprocessing module is used for preprocessing the acquired target bird sound data to obtain preprocessed target bird sound data, wherein the target bird sound data comprises target bird sound data and environment sound data;
the prediction module is used for inputting the preprocessed target bird sound data into the bird sound recognition model after training, and outputting a prediction result;
the determining output module is used for comparing the prediction result with a preset output threshold value, and determining target bird information in the target bird sound data under the condition that the prediction result is larger than the preset output threshold value;
wherein the trained bird voice recognition model is obtained by the model training method provided in the above embodiments.
In a fifth aspect of the present disclosure, there is provided an electronic device, including:
one or more processors;
a memory for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods of the embodiments described above.
In a sixth aspect of the present disclosure, there is provided a computer readable storage medium storing computer executable instructions that when executed are configured to implement the method of the above embodiments.
In a seventh aspect of the present disclosure, a computer program product is provided, comprising computer executable instructions which, when executed, implement the method in the above embodiments.
According to an embodiment of the present disclosure, the bird voice recognition model provided by the present disclosure is an N-layer cascade structure, N > 1, in which each layer includes a feature extraction layer, a dual-path time-frequency joint modeling unit, a jump connection layer and an auxiliary branch classifier connected in sequence. The preprocessed bird sound sample data is input into this N-layer cascade model for training to obtain N classification results, and the bird classification result is then obtained from the N classification results. Specifically: the preprocessed bird sound sample data is input into the feature extraction layer, which outputs first feature data; the first feature data is input into the dual-path time-frequency joint modeling unit, which extracts local time-frequency structure features while also building global time and frequency dependencies, yielding second feature data. The second feature data is input into the jump connection layer, and the third feature data output by the jump connection layer is fed both to the auxiliary branch classifier for classification and to the next layer of the cascade for further feature extraction and classification. The influence of classification results at different levels of the cascade on the final bird classification result is thus fully taken into account, effectively improving the accuracy of the bird voice recognition model in recognizing target bird sounds.
A loss value is then calculated from the bird classification result and the tag data to obtain a loss result, and the parameters of the bird voice recognition model are finally adjusted iteratively according to the loss result, yielding a trained bird voice recognition model. A model trained in this way attends to the local feature information of bird sounds as well as to global feature information and classification information at different levels, which helps produce more accurate bird voice recognition results.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be more apparent from the following description of embodiments of the disclosure with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates a system architecture diagram of a multi-auxiliary branch bird voice recognition method, model training method, apparatus based on dual-path time-frequency joint modeling in an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a training method of a multi-auxiliary branch bird voice recognition model based on dual path time-frequency joint modeling in an embodiment of the present disclosure;
FIG. 3 schematically illustrates a simplified structure of the feature extraction layer in each layer of the cascade structure in an embodiment of the disclosure;
FIG. 4 schematically illustrates a structural diagram of the dual-path time-frequency joint modeling unit in each layer of the cascade structure in an embodiment of the disclosure;
FIG. 5 schematically illustrates a simplified structure of the jump connection layer in each layer of the cascade structure in an embodiment of the present disclosure;
FIG. 6 schematically illustrates a structural diagram of the auxiliary branch classifier in each layer of the cascade structure in an embodiment of the disclosure;
FIG. 7 schematically illustrates a block diagram of a training method of a bird voice recognition model based on multiple auxiliary branches of dual path time-frequency joint modeling in an embodiment of the present disclosure;
FIG. 8 schematically illustrates a flow chart of a multi-auxiliary branch bird voice recognition method based on dual path time-frequency joint modeling in accordance with an embodiment of the present disclosure;
FIG. 9 schematically illustrates a block diagram of a training apparatus for a multi-auxiliary branch bird voice recognition model based on dual path time-frequency joint modeling in accordance with an embodiment of the present disclosure;
FIG. 10 schematically illustrates a block diagram of a multi-auxiliary branch bird voice recognition device based on dual path time-frequency joint modeling in accordance with an embodiment of the present disclosure; and
fig. 11 schematically illustrates a block diagram of an electronic device adapted to implement a multi-auxiliary branch bird voice recognition method based on dual path time-frequency joint modeling, and a training method of the model, in accordance with an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where an expression like "at least one of A, B and C" is used, it should generally be interpreted according to its commonly understood meaning (e.g., "a system having at least one of A, B and C" includes, but is not limited to, systems having A alone, B alone, C alone, A and B together, A and C together, B and C together, or A, B and C together).
To address the shortcomings of existing deep-learning-based bird voice recognition methods, which cannot capture local features and global dependencies at the same time and struggle to fully exploit feature information at different levels, the present invention provides a training method and device for a multi-auxiliary-branch bird voice recognition model based on dual-path time-frequency joint modeling.
Fig. 1 schematically illustrates a system architecture diagram of a multi-auxiliary branch bird voice recognition method, a model training method, and a device based on dual-path time-frequency joint modeling in an embodiment of the disclosure.
As shown in fig. 1, an application scenario 100 according to this embodiment may include a bird sound collection device 101, a terminal device 102, a server 103, and a network 104. The network 104 is the medium used to provide communication links between the terminal device 102 and the server 103. The network 104 may include various connection types, such as wired or wireless communication links, or fiber-optic cables.
A user may use the bird sound collection device 101 and the terminal device 102 to interact with the server 103 through the network 104, for example to receive or send messages. The bird sound collection device 101 may be a recorder, a recording pen, a radio, a mobile phone or another device with a recording function (by way of example only). Various client applications may be installed on the terminal device 102, such as shopping applications, web browsers, search applications, recording tools, instant messaging tools, mailbox clients and social platform software (by way of example only).
Terminal device 102 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 103 may be a server providing various services, such as a background management server (for example only) providing support for websites browsed by the user using the terminal device 102. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that, the bird voice recognition method and the model training method based on the multi-auxiliary branch of the dual-path time-frequency joint modeling provided in the embodiments of the present disclosure may be generally executed by the server 103. Accordingly, the bird voice recognition device and the model training device based on the multi-auxiliary branch of the dual-path time-frequency joint modeling provided by the embodiments of the present disclosure may be generally disposed in the server 103. The bird voice recognition method and model training method based on the multi-auxiliary branch of the dual-path time-frequency joint modeling provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 103 and is capable of communicating with the terminal device 102 and/or the server 103. Accordingly, the bird voice recognition device and the model training device based on the multi-auxiliary branch of the dual-path time-frequency joint modeling provided by the embodiments of the present disclosure may also be provided in a server or a server cluster different from the server 103 and capable of communicating with the terminal device 102 and/or the server 103.
In addition, the bird voice recognition method and model training method based on the multi-auxiliary branch of the dual-path time-frequency joint modeling provided by the embodiments of the present disclosure may also be generally executed by the terminal device 102. Accordingly, the bird voice recognition device and the model training device based on the multi-auxiliary branch of the dual-path time-frequency joint modeling provided by the embodiments of the present disclosure may be generally disposed in the terminal device 102.
It should be understood that the number of terminal devices, bird sound collection devices, networks and servers in fig. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
The following describes a training method of the bird voice recognition model based on the multi-auxiliary branch of the dual path time-frequency joint modeling in the embodiment of the present disclosure with reference to fig. 2 to 8 based on the scenario described in fig. 1.
FIG. 2 schematically illustrates a flow chart of a training method of a multi-auxiliary branch bird voice recognition model based on dual path time-frequency joint modeling in an embodiment of the present disclosure.
In an embodiment of the present disclosure, a training method of a multi-auxiliary-branch bird voice recognition model based on dual-path time-frequency joint modeling is provided. The bird voice recognition model is an N-layer cascade structure, N > 1, and each layer of the cascade structure includes a feature extraction layer, a dual-path time-frequency joint modeling unit, a jump connection layer and an auxiliary branch classifier connected in sequence. The training method 200 of the bird voice recognition model includes operations S210-S240.
In operation S210, the acquired bird sound sample data is preprocessed to obtain preprocessed bird sound sample data and corresponding tag data, the bird sound sample data including bird sound data and environmental sound data.
In operation S220, the following operations are performed in the order of the N-layer cascade structure until N classification results are obtained:
the preprocessed bird sound sample data is input into the feature extraction layer of the (i-1)-th layer cascade structure, which outputs first feature data; the first feature data is input into the dual-path time-frequency joint modeling unit, which outputs second feature data; the second feature data is input into the jump connection layer, which outputs third feature data; and the third feature data is input into the auxiliary branch classifier, which outputs the (i-1)-th classification result, where i ≤ N;
the third feature data output by the jump connection layer in the (i-1)-th layer cascade structure is input into the i-th layer cascade structure, which outputs the i-th classification result;
and the bird classification result is obtained from the N classification results.
In operation S230, a loss value is calculated from the bird classification result and the tag data to obtain a loss result.
In operation S240, the parameters of the bird voice recognition model are iteratively adjusted using the loss result to obtain a trained bird voice recognition model.
According to an embodiment of the present disclosure, the bird voice recognition model provided by the present disclosure is an N-layer cascade structure, N > 1, in which each layer includes a feature extraction layer, a dual-path time-frequency joint modeling unit, a jump connection layer and an auxiliary branch classifier connected in sequence. The preprocessed bird sound sample data is input into this N-layer cascade model for training to obtain N classification results, and the bird classification result is then obtained from the N classification results. Specifically: the preprocessed bird sound sample data is input into the feature extraction layer, which outputs first feature data; the first feature data is input into the dual-path time-frequency joint modeling unit, which extracts local time-frequency structure features while also building global time and frequency dependencies, yielding second feature data. The second feature data is input into the jump connection layer, and the third feature data output by the jump connection layer is fed both to the auxiliary branch classifier for classification and to the next layer of the cascade for further feature extraction and classification. The influence of classification results at different levels of the cascade on the final bird classification result is thus fully taken into account, effectively improving the accuracy of the bird voice recognition model in recognizing target bird sounds.
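The combination of the N auxiliary classification results into a single loss (operations S220-S230) can be illustrated with a deep-supervision-style objective. The equal branch weights below are an assumption, since the patent does not state how the N results are weighted.

```python
import numpy as np

def cross_entropy(logits, label):
    """Cross-entropy of a single logit vector against an integer label."""
    e = np.exp(logits - logits.max())         # numerically stable softmax
    return -np.log(e[label] / e.sum())

def multi_branch_loss(branch_logits, label, weights=None):
    """Weighted sum of cross-entropy over the N auxiliary branch
    classifier outputs (equal weights by default)."""
    n = len(branch_logits)
    if weights is None:
        weights = [1.0 / n] * n
    return sum(w * cross_entropy(l, label)
               for w, l in zip(weights, branch_logits))

# Toy logits from N = 3 cascade levels for a 3-class problem
branches = [np.array([2.0, 0.1, 0.1]),
            np.array([1.5, 0.3, 0.2]),
            np.array([3.0, 0.0, 0.1])]
loss = multi_branch_loss(branches, label=0)
```

The loss result would then drive the iterative parameter adjustment of operation S240 via backpropagation in a real framework.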
The structure of the bird voice recognition model according to the embodiments of the present disclosure will be described in detail with reference to fig. 3 to 6.
Fig. 3 schematically illustrates a simplified architecture of feature extraction layers in each layer of the cascade structure in an embodiment of the disclosure.
As shown in fig. 3, the feature extraction layer provided in the present disclosure is formed by stacking two convolution layers with a convolution kernel size of 3*3 and a step size of 1, batch normalization (Batch Normalization), an activation function layer and an average pooling layer, where the activation function is a linear rectification function (ReLU) and the average pooling layer is an adaptive average pooling layer. The feature extraction process of the feature extraction layer may be represented as shown in formula (1):

y = F_avg(F_3*3(F_3*3(x)))    formula (1);

wherein x is the preprocessed bird sound sample data or the third feature data output by the jump connection layer in the (i-1)-th layer cascade structure, i ≤ N, N > 1; F_3*3(x) represents the output after processing by the first convolution layer, batch normalization and activation function layer; F_3*3(F_3*3(x)) represents the output after processing by the second convolution layer, batch normalization and activation function layer; F_avg represents the average pooling operation; and y represents the first feature data.
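The stacked structure of formula (1) can be illustrated with a minimal NumPy sketch. This is not the disclosed implementation: shapes and weights are toy values, and batch normalization is reduced to a per-channel standardization for a single sample.

```python
import numpy as np

def conv3x3_bn_relu(x, w):
    """One F_3*3 stage: 3x3 convolution (stride 1, zero padding),
    followed by simplified batch normalization and ReLU."""
    h, wd, c_in = x.shape
    c_out = w.shape[0]
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, wd, c_out))
    for i in range(h):
        for j in range(wd):
            patch = xp[i:i + 3, j:j + 3, :]   # 3x3 receptive field
            out[i, j] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    # simplified batch normalization: per-channel standardization
    mean = out.mean(axis=(0, 1), keepdims=True)
    std = out.std(axis=(0, 1), keepdims=True) + 1e-5
    out = (out - mean) / std
    return np.maximum(out, 0.0)               # ReLU activation

def adaptive_avg_pool(x, out_hw):
    """F_avg: adaptive average pooling to a fixed spatial size."""
    h, w, c = x.shape
    oh, ow = out_hw
    out = np.zeros((oh, ow, c))
    for i in range(oh):
        for j in range(ow):
            hs, he = i * h // oh, (i + 1) * h // oh
            ws, we = j * w // ow, (j + 1) * w // ow
            out[i, j] = x[hs:he, ws:we].mean(axis=(0, 1))
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 1))        # toy time-frequency input, 1 channel
w1 = rng.standard_normal((4, 3, 3, 1))    # first conv: 1 -> 4 channels
w2 = rng.standard_normal((4, 3, 3, 4))    # second conv: 4 -> 4 channels
# y = F_avg(F_3*3(F_3*3(x))) as in formula (1)
y = adaptive_avg_pool(conv3x3_bn_relu(conv3x3_bn_relu(x, w1), w2), (4, 4))
```

The two nested `conv3x3_bn_relu` calls followed by `adaptive_avg_pool` mirror the term-by-term structure of formula (1).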
Fig. 4 schematically shows a schematic structural diagram of a dual path time-frequency joint modeling unit in a cascade structure of each layer in an embodiment of the disclosure.
As shown in fig. 4, the dual-path time-frequency joint modeling unit provided by the present disclosure uses two encoder units, namely a time convolution-attention mechanism unit (encoder-Time unit) and a frequency convolution-attention mechanism unit (encoder-Freq unit), to capture the global feature information of the time dependency and the frequency dependency in stages. The encoder-Time unit and the encoder-Freq unit have the same structure: a convolution operation is applied within the encoding layer of the self-attention mechanism, and the encoding layer mainly comprises a feedforward layer, a multi-head attention mechanism and a convolution module.
The following describes in detail, with reference to fig. 4, the operation of obtaining the second feature data by inputting the first feature data output by the feature extraction layer in each layer of the cascade structure into the dual-path time-frequency joint modeling unit.
According to an embodiment of the present disclosure, inputting first feature data into a dual path time-frequency joint modeling unit in each layer of cascade structure, the operation of obtaining second feature data includes:
For the first feature data D (D ∈ R^(B×T×F×C)), a first reshaping operation is performed to obtain first reshaped feature data D_T (D_T ∈ R^(BF×T×C)); the first reshaping operation reduces the 4-dimensional first feature data to 3 dimensions. The first reshaped feature data D_T is then input into the time convolution-attention mechanism unit, which outputs time feature data D_T'. The time convolution-attention mechanism unit models the data along the time dimension of the bird sound data; after modeling, the global feature information of the time dependency relationship is obtained while the local detail feature information of the time-frequency structure is also attended to, where B represents the batch size, T the time, F the frequency and C the channel.

The time feature data D_T' and the first reshaped feature data D_T are combined through a first residual connection to obtain first residual feature data D_T''.

A second reshaping operation is performed on the first residual feature data D_T'' to obtain second reshaped feature data D_F (D_F ∈ R^(BT×F×C)); the second reshaped feature data D_F is then input into the frequency convolution-attention mechanism unit, which outputs frequency feature data D_F'. The frequency convolution-attention mechanism unit models the data along the frequency dimension of the bird sound data; after modeling, the global feature information of the frequency dependency relationship is obtained while the local detail feature information of the time-frequency structure is also attended to, so that further feature information of the bird sounds can be extracted. Subsequently, the frequency feature data D_F' and the second reshaped feature data D_F are combined through a second residual connection to obtain second residual feature data D_F''.

Finally, a third reshaping operation is performed on the second residual feature data D_F'' to obtain the second feature data E. In this way, the bird voice recognition model jointly considers the local time-frequency features and the global time and frequency dependencies, which improves its ability to extract both long-term sequence features and local features.
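The three reshaping operations and two residual connections can be traced with a short NumPy sketch. The convolution-attention units themselves are replaced here by a hypothetical shape-preserving stand-in (`np.tanh`), since only the data-flow shapes are being illustrated, not the disclosed unit internals.

```python
import numpy as np

B, T, F, C = 2, 6, 5, 3                 # batch, time, frequency, channel
rng = np.random.default_rng(0)
D = rng.standard_normal((B, T, F, C))   # first feature data, D in R^{B x T x F x C}

def stand_in_unit(x):
    """Hypothetical stand-in for a convolution-attention unit:
    any shape-preserving transform suffices for tracing shapes."""
    return np.tanh(x)

# first reshape: fold frequency into the batch axis -> time path, R^{BF x T x C}
D_T = D.transpose(0, 2, 1, 3).reshape(B * F, T, C)
D_T_out = stand_in_unit(D_T)            # encoder-Time unit (time-dimension modeling)
D_T_res = D_T_out + D_T                 # first residual connection

# second reshape: fold time into the batch axis -> frequency path, R^{BT x F x C}
D_F = D_T_res.reshape(B, F, T, C).transpose(0, 2, 1, 3).reshape(B * T, F, C)
D_F_out = stand_in_unit(D_F)            # encoder-Freq unit (frequency-dimension modeling)
D_F_res = D_F_out + D_F                 # second residual connection

# third reshape: restore the original 4-D layout to obtain second feature data E
E = D_F_res.reshape(B, T, F, C)
```

The key point the sketch shows is that each path sees a 3-D tensor whose middle axis is the dimension being modeled (T, then F), with the other spatial axis folded into the batch axis.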
Fig. 5 schematically illustrates a simplified structure of a jump connection layer in each layer of the cascade structure in the embodiment of the present disclosure.
As shown in fig. 5, the jump connection layer provided in the present disclosure is formed by stacking a convolution layer with a convolution kernel size of 3*3 and a step size of 1, batch normalization (Batch Normalization), an activation function layer and an average pooling layer, where the activation function is a linear rectification function (ReLU) and the average pooling layer is a global average pooling layer. The jump connection layer helps alleviate the problems of gradient explosion and gradient vanishing during the training of the bird voice recognition model. In the embodiment of the disclosure, the second feature data E output by the dual-path time-frequency joint modeling unit is input into the jump connection layer, which outputs the third feature data. The third feature data serves both as the input of the auxiliary branch classifier and as the input of the feature extraction layer in the adjacent next-layer cascade structure; that is, the third feature data output by the jump connection layer in the (i-1)-th layer cascade structure is input into the feature extraction layer in the i-th layer cascade structure for feature extraction. In other words, the third feature data output by the jump connection layer in the 1st layer cascade structure is input into the feature extraction layer in the 2nd layer cascade structure, the third feature data output by the jump connection layer in the 2nd layer cascade structure is input into the feature extraction layer in the 3rd layer cascade structure, and so on.
Fig. 6 schematically illustrates a block diagram of an auxiliary branch classifier in a per-layer cascade structure in an embodiment of the disclosure.
As shown in fig. 6, the auxiliary branch classifier provided in the present disclosure is composed of an average pooling layer, a Dropout layer, a fully connected layer, an activation function layer and batch normalization (Batch Normalization). In each layer of the cascade structure, the auxiliary branch classifier is connected after the jump connection layer of that layer and is mainly used for classifying the third feature data output by the jump connection layer, so as to obtain a classification result. Since the bird voice recognition model in the embodiment of the present disclosure is an N-layer cascade structure, the N auxiliary branch classifiers correspondingly produce N classification results.
According to an embodiment of the present disclosure, the N classification results are different from each other, and each of the N classification results independently includes any one of the following features:
bird voiceprint amplitude, bird voiceprint edge, bird voiceprint gradient, bird voiceprint shape, bird voiceprint texture, bird voiceprint first semantic information, bird voiceprint structure information, bird voiceprint second semantic information.
According to an embodiment of the present disclosure, preprocessing the acquired bird sound sample data in operation S210, obtaining preprocessed bird sound sample data and corresponding tag data includes:
extracting effective sound fragments from the acquired bird sound sample data to obtain effective bird sound sample fragments;
filtering, segmenting and labeling the effective bird sound sample fragments to obtain first bird sound sample data and tag data;
framing, windowing, fourier transforming and Mel filtering are carried out on the first bird sound sample data to obtain bird Mel spectrogram sample data;
and carrying out data enhancement processing on the bird mel spectrogram sample data to obtain the preprocessed bird sound sample data.
According to embodiments of the present disclosure, bird sound sample data is collected using a bird sound collector, wherein the bird sound sample data includes bird sound data and environmental sound data, and the bird sound collector may be a cell phone, a voice recorder, a radio, and the like. Before effective sound fragments are extracted from the obtained bird sound sample data, the data format of the bird sound sample data is unified, for example, the audio format is unified to wav and the sampling frequency is resampled to 44100 Hz. Then, based on a sound signal endpoint detection algorithm (namely, the double-threshold method), whether each frame of the obtained bird sound sample data is silent is judged frame by frame, and the non-silent segments are spliced to extract the effective bird sound sample fragments.
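A simplified sketch of the frame-by-frame silence judgment can illustrate the idea of the double-threshold method. This is a hedged approximation: the classical double-threshold detector also uses zero-crossing rate and state machines, whereas here only short-time energy with a high and a low threshold is used, and all threshold values are illustrative.

```python
import numpy as np

def double_threshold_vad(signal, frame_len=1024, high=0.1, low=0.02):
    """Simplified double-threshold endpoint detection: a frame is kept as
    non-silent if its energy exceeds the high threshold, or exceeds the
    low threshold while adjacent to a high-energy frame. The kept
    (non-silent) frames are spliced into one effective segment."""
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len)
    energy = (frames ** 2).mean(axis=1)        # short-time energy per frame
    strong = energy > high
    keep = strong.copy()
    for i in range(n):  # extend strong regions through low-threshold neighbors
        if energy[i] > low and ((i > 0 and strong[i - 1]) or
                                (i + 1 < n and strong[i + 1])):
            keep[i] = True
    return frames[keep].reshape(-1)

# toy signal: silence, a loud burst, silence
sig = np.concatenate([np.zeros(2048), 0.8 * np.ones(2048), np.zeros(2048)])
effective = double_threshold_vad(sig)
```

Only the middle burst survives the splicing, which is the behavior the preprocessing step relies on.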
According to the embodiment of the disclosure, the obtained effective bird sound sample fragments are filtered based on a sound signal noise reduction algorithm, namely spectral subtraction, to eliminate the interference of background noise, and are then segmented so that the duration of each segment of the bird audio signal is unified to 3 s for subsequent processing. Since each segment of bird audio may contain more than one bird species, the multi-species information of each segment of equal-length audio data is labeled through a comprehensive analysis of the sound waveform and the spectrogram, so as to obtain the first bird sound sample data and the corresponding tag data; the labeling may be performed manually. In addition, because the bird sound sample data includes, besides bird sound data, environmental sound data such as wind, rain, whistles and human speech, labels for the environmental sound scene information are additionally added during labeling, so as to avoid the bird voice recognition model erroneously recognizing environmental sounds as bird sounds.
According to an embodiment of the present disclosure, bird mel spectrum sample data is obtained after framing, windowing, fourier transforming, and mel filtering operations are performed on the obtained first bird sound sample data. Then, the obtained bird mel spectrogram sample data is subjected to data enhancement processing, so that more bird sound sample data can be obtained, and training sample data of a bird sound recognition model are enriched.
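The framing, windowing, Fourier transform and mel filtering chain can be sketched in NumPy as follows. The frame length, hop size and number of mel bands are illustrative assumptions (the disclosure does not fix them), and the triangular filter bank is a standard simplified construction rather than the disclosed one.

```python
import numpy as np

def mel_spectrogram(signal, sr=44100, frame_len=1024, hop=512, n_mels=8):
    """Framing + Hann window + FFT magnitude + triangular mel filtering;
    parameter values are illustrative."""
    # framing and windowing
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # magnitude spectrum via the Fourier transform
    spec = np.abs(np.fft.rfft(frames, axis=1))     # (n_frames, frame_len//2 + 1)
    # triangular mel filter bank, equally spaced on the mel axis
    n_bins = spec.shape[1]
    mel_max = 2595.0 * np.log10(1.0 + (sr / 2) / 700.0)
    mel_pts = np.linspace(0.0, mel_max, n_mels + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.floor((n_bins - 1) * hz_pts / (sr / 2)).astype(int)
    fbank = np.zeros((n_mels, n_bins))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    return spec @ fbank.T                           # (n_frames, n_mels)

t = np.arange(44100) / 44100.0
mel = mel_spectrogram(np.sin(2 * np.pi * 1000 * t))  # one second of a 1 kHz tone
```

In practice a library such as librosa would typically be used for this step; the sketch only makes the four named operations concrete.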
The data enhancement processing of the bird mel spectrogram sample data includes at least one of the following steps:
1) The bird mel spectrogram sample data is enhanced by utilizing a horizontal stretching mode and a vertical stretching mode, and the method is specifically shown as follows: scaling the input bird mel spectrogram sample data in the horizontal direction or the vertical direction by using a random stretching proportion so as to simulate the change of tone and rhythm in the sounding process of birds;
2) The bird mel spectrogram sample data is enhanced by utilizing a horizontal and vertical scrolling mode, which is specifically shown as follows: shifting the input bird mel spectrogram sample data in the horizontal direction or the vertical direction by a random scrolling proportion, so as to simulate the change of bird song pitch and sounding time;
3) The data enhancement of bird mel spectrogram sample data by using elastic distortion and torsion is specifically shown as follows: distorting the input bird mel spectrogram sample data along a given grid to simulate distortion conditions encountered in the bird sound collection process;
4) The bird mel spectrogram sample data is enhanced by using a time and frequency masking mode, and the method is specifically shown as follows: the input bird mel spectrogram sample data is masked in the horizontal direction or the vertical direction by using a random masking proportion so as to simulate the problem of information loss in bird sound collection.
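Of the four enhancement modes, the time and frequency masking of step 4) is the simplest to sketch. The masking ratio and the use of zero as the mask value are illustrative assumptions.

```python
import numpy as np

def time_frequency_mask(mel, rng, max_ratio=0.2):
    """Masks one random horizontal (time) band and one random vertical
    (frequency) band of the mel spectrogram with zeros, simulating
    information loss during bird sound collection."""
    out = mel.copy()
    n_t, n_f = out.shape
    t_w = max(1, int(rng.uniform(0, max_ratio) * n_t))   # random masking proportion
    f_w = max(1, int(rng.uniform(0, max_ratio) * n_f))
    t0 = rng.integers(0, n_t - t_w + 1)
    f0 = rng.integers(0, n_f - f_w + 1)
    out[t0:t0 + t_w, :] = 0.0                            # time mask
    out[:, f0:f0 + f_w] = 0.0                            # frequency mask
    return out

rng = np.random.default_rng(0)
mel = np.ones((100, 64))            # toy mel spectrogram (time x frequency)
aug = time_frequency_mask(mel, rng)
```

The other three modes (stretching, scrolling, elastic distortion) would follow the same pattern: a random geometric transform applied to the spectrogram array.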
After data enhancement is performed on all bird sound data (namely, the bird mel spectrogram sample data), the enhanced data are proportionally divided into a training set, a test set and a validation set for training, testing and validating the bird voice recognition model. The division ratio after enhancement may be 6:2:2; other ratios may also be used, which are not particularly limited herein.
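The 6:2:2 division can be sketched as a simple shuffled split; the shuffling and rounding behavior here are assumptions, not the disclosed procedure.

```python
import numpy as np

def split_6_2_2(samples, rng):
    """Shuffles the enhanced samples and divides them into training,
    test and validation sets in the 6:2:2 ratio mentioned above."""
    idx = rng.permutation(len(samples))
    n_train = int(0.6 * len(samples))
    n_test = int(0.2 * len(samples))
    train = [samples[i] for i in idx[:n_train]]
    test = [samples[i] for i in idx[n_train:n_train + n_test]]
    val = [samples[i] for i in idx[n_train + n_test:]]
    return train, test, val

rng = np.random.default_rng(0)
train, test, val = split_6_2_2(list(range(100)), rng)
```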
According to an embodiment of the present disclosure, in operation S220, a bird voice recognition model is trained using the pre-processed bird voice sample data to obtain a classification result of birds, which is specifically expressed as: the following operations are executed according to the sequence of the N-layer cascade structure until N classification results are obtained:
inputting the preprocessed bird sound sample data into the feature extraction layer of the 1st layer cascade structure, outputting first feature data; inputting the first feature data into the dual-path time-frequency joint modeling unit, outputting second feature data; inputting the second feature data into the jump connection layer, outputting third feature data; and inputting the third feature data into the auxiliary branch classifier, outputting the 1st classification result;
inputting the third feature data output by the jump connection layer in the (i-1)-th layer cascade structure into the i-th layer cascade structure, and outputting the i-th classification result, wherein 1 < i ≤ N and i is a positive integer;
And obtaining the bird classification result according to the N classification results.
It can be understood that when N=2, i takes the values 1 and 2; that is, the preprocessed bird sound sample data is input into the 1st layer cascade structure to obtain the first classification result, and the third feature data output by the jump connection layer in the 1st layer cascade structure is input into the 2nd layer cascade structure, which outputs the second classification result. The bird classification result is then obtained according to the first classification result and the second classification result.
The following description takes as an example a bird voice recognition model with a 4-layer cascade structure, each layer composed of a feature extraction layer, a dual-path time-frequency joint modeling unit, a jump connection layer and an auxiliary branch classifier. It should be noted that the cascade structure of the bird voice recognition model is not limited to 4 layers and may also be 8 layers, 12 layers, etc., which is not particularly limited herein.
FIG. 7 schematically illustrates a block diagram of a training method of a bird voice recognition model based on multiple auxiliary branches of dual path time-frequency joint modeling in an embodiment of the present disclosure.
As shown in fig. 7, the preprocessed bird sound sample data is input into the feature extraction layer of the first layer cascade structure, which outputs the first feature data; the first feature data is input into the dual-path time-frequency joint modeling unit, which outputs the second feature data; the second feature data is input into the jump connection layer, which outputs the third feature data; and the third feature data is input into the auxiliary branch classifier, which outputs the first classification result, thereby completing the extraction and classification of primary features in the bird sound sample data, where the first classification result can be basic features such as bird voiceprint amplitude, edge and gradient.
And then, inputting third characteristic data output by a jump connection layer in the first layer cascade structure into a characteristic extraction layer in the second layer cascade structure, outputting first characteristic data of the second layer cascade structure, inputting the first characteristic data of the second layer cascade structure into a dual-path time-frequency joint modeling unit of the second layer cascade structure, outputting second characteristic data of the second layer cascade structure, inputting the second characteristic data of the second layer cascade structure into a jump connection layer of the second layer cascade structure, outputting third characteristic data of the second layer cascade structure, inputting the third characteristic data of the second layer cascade structure into an auxiliary branch classifier of the second layer cascade structure, outputting a second classification result, and finishing the extraction and classification of middle-level characteristics of bird sound sample data, wherein the second classification result can be complex characteristics such as bird voiceprint shape, texture and the like.
And then inputting third characteristic data output by the jump connection layer in the second cascade structure into a characteristic extraction layer in the third cascade structure, outputting first characteristic data of the third cascade structure, inputting the first characteristic data of the third cascade structure into a dual-path time-frequency joint modeling unit of the third cascade structure, outputting second characteristic data of the third cascade structure, inputting the second characteristic data of the third cascade structure into the jump connection layer of the third cascade structure, outputting third characteristic data of the third cascade structure, inputting the third characteristic data into an auxiliary branch classifier of the third cascade structure, outputting a third classification result, and completing extraction and classification of higher-level characteristics of bird sound sample data, wherein the third classification result can be first semantic information or first structure information of bird voiceprints.
And finally, inputting the third feature data output by the jump connection layer in the third cascade structure into the feature extraction layer in the fourth cascade structure, outputting first feature data of the fourth cascade structure, inputting the first feature data of the fourth cascade structure into the dual-path time-frequency joint modeling unit of the fourth cascade structure, outputting second feature data of the fourth cascade structure, inputting the second feature data of the fourth cascade structure into the jump connection layer of the fourth cascade structure, outputting third feature data of the fourth cascade structure, inputting the third feature data into the auxiliary branch classifier of the fourth cascade structure, and outputting a fourth classification result, thereby completing the extraction and classification of the highest-level features of the bird sound sample data, wherein the fourth classification result can be second semantic information or second structure information of the bird voiceprint, which has more detailed characteristics than the first semantic information or the first structure information.
And then, weighting the output results of the four-layer auxiliary branch classifier module according to different weights of the auxiliary branch classifier in different layer cascade structures, and outputting the final classification vector.
For example: setting the weight of the auxiliary branch classifier in the first layer cascade structure as a 1 The weight of the auxiliary branch classifier in the second layer cascade structure is set to be alpha=0.4 2 =0.6, setting the weight of the auxiliary branch classifier in the third layer cascade structure to α 3 =0.8, setting the weight of the auxiliary branch classifier in the fourth layer cascade structure to α 4 =1. And then, adding different weights of the auxiliary branch classifiers in the four-cascade structure to obtain a bird classification result, namely obtaining the bird classification result according to the N classification results.
According to an embodiment of the present disclosure, in operation S230, a loss value is calculated according to a classification result of birds and tag data, resulting in a loss result, which is embodied as:
training the multi-auxiliary-branch bird voice recognition model based on dual-path time-frequency joint modeling using a multi-label loss function, wherein the loss function is as shown in formula (2):

L = -Σ_{k=1}^{c} [ y_k · log σ(ŷ_k) + (1 - y_k) · log(1 - σ(ŷ_k)) ]    formula (2);

wherein y_k corresponds to the one-hot encoded tag, ŷ_k represents the output result of the model, which contains c neurons for c classes, and σ is the sigmoid activation function.
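Formula (2) is a standard multi-label binary cross-entropy, which can be computed directly as a sketch. The label and output vectors below are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multi_label_loss(y, y_hat):
    """Multi-label loss of formula (2): binary cross-entropy summed over
    the c classes, with a sigmoid applied to the raw model outputs."""
    p = sigmoid(y_hat)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

y = np.array([1.0, 0.0, 1.0])        # multi-label tag (c = 3, two species present)
y_hat = np.array([2.0, -1.5, 0.5])   # raw outputs of the c output neurons
loss = multi_label_loss(y, y_hat)
```

Unlike a softmax loss, each class is judged independently, which is what allows several bird species to be recognized in one audio segment.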
According to an embodiment of the present disclosure, in operation S240, parameters of a bird voice recognition model are iteratively adjusted using a loss result, resulting in a trained bird voice recognition model, which is embodied as:
In the bird voice recognition model optimization process, an adaptive moment estimation (Adam) optimizer is used as the training optimizer, the initial learning rate is set to 1e-4, and an equal-interval (step) learning rate adjustment strategy is adopted. In other embodiments, other types of optimizers or learning rate adjustment strategies may be used according to actual needs, for example, a stochastic batch gradient descent optimizer (Batch_SGD), a cosine annealing learning rate adjustment strategy, and the like, which are not described in detail herein. When the calculated loss value becomes sufficiently small, training of the bird voice recognition model is completed, and the trained bird voice recognition model is obtained.
Fig. 8 schematically illustrates a flow chart of a multi-auxiliary branch bird voice recognition method based on dual path time-frequency joint modeling in accordance with an embodiment of the present disclosure.
As shown in fig. 8, the multi-auxiliary branch bird voice recognition method 800 based on dual-path time-frequency joint modeling provided by the present disclosure includes: operation S810-operation S830.
In operation S810, the acquired target bird sound data is preprocessed to obtain preprocessed target bird sound data, the target bird sound data including bird sound data and environmental sound data.
In operation S820, the preprocessed target bird sound data is input into the trained bird sound recognition model, and a prediction result is output.
In operation S830, comparing the prediction result with a preset output threshold, and determining target bird information in the target bird sound data if the prediction result is greater than the preset output threshold;
the bird voice recognition model after training is obtained by training by the model training method in the embodiment.
According to an embodiment of the present disclosure, preprocessing the acquired target bird sound data in operation S810 includes:
extracting effective sound fragments from the acquired target bird sound data to obtain target effective bird sound fragments;
filtering and segmenting the target effective bird sound fragment to obtain first target bird sound data;
framing, windowing, Fourier transforming and mel filtering are performed on the first target bird sound data to obtain the preprocessed target bird sound data.
According to an embodiment of the present disclosure, a bird sound collector is employed to collect the target bird sound data, wherein the target bird sound data includes bird sound data and environmental sound data. Before effective sound fragments are extracted from the obtained target bird sound data, the data format of the target bird sound data is likewise unified, for example, the audio format is unified to wav and the sampling frequency is resampled to 44100 Hz. Then, based on the sound signal endpoint detection algorithm (namely, the double-threshold method), whether each frame of the obtained target bird sound data is silent is judged frame by frame, and the non-silent segments are spliced to extract the target effective bird sound fragments.
According to the embodiment of the disclosure, the obtained target effective bird sound fragments are filtered based on the sound signal noise reduction algorithm, namely spectral subtraction, to eliminate the interference of background noise, and are then segmented so that the duration of each segment of the bird audio signal is unified to 3 s for subsequent processing, thereby obtaining the first target bird sound data.
According to an embodiment of the present disclosure, after the preprocessed target bird sound data is input into the trained bird voice recognition model in operation S820, the output result of the bird voice recognition model is first non-linearly mapped to a probability distribution in the (0, 1) interval, which is specifically described as follows:
the output results are activated using a Sigmoid activation function to output a predicted probability distribution vector (i.e., output a predicted result).
According to an embodiment of the present disclosure, after the probability distribution (prediction result) is acquired, the identified multi-species information is output through a threshold setting policy in operation S830, which includes:
setting two preset output threshold selection strategies, namely an automatic preset output threshold selection strategy and a manual preset output threshold selection strategy. After the preset output threshold is selected, screening the output probability distribution vectors, and outputting bird species larger than the preset output threshold to obtain target bird information in the target bird sound data.
The present disclosure provides a training device for a bird voice recognition model based on multiple auxiliary branches of dual-path time-frequency joint modeling, which is described in detail below with reference to fig. 9.
Fig. 9 schematically illustrates a block diagram of a training apparatus of a multi-auxiliary branch bird voice recognition model based on dual path time-frequency joint modeling in accordance with an embodiment of the present disclosure.
As shown in fig. 9, in the training apparatus of the multi-auxiliary-branch bird voice recognition model based on dual-path time-frequency joint modeling in this embodiment, the bird voice recognition model is an N-layer cascade structure, N > 1, and each layer of the cascade structure includes a feature extraction layer, a dual-path time-frequency joint modeling unit, a jump connection layer and an auxiliary branch classifier, which are sequentially connected. The apparatus 900 includes: a bird sound sample preprocessing module 910, a model training module 920, a calculation module 930 and an adjustment module 940.
The bird sound sample preprocessing module 910 is configured to preprocess the obtained bird sound sample data to obtain preprocessed bird sound sample data and corresponding tag data, where the bird sound sample data includes bird sound data and environmental sound data. In an embodiment, the bird sound sample preprocessing module 910 may be used to perform the operation S210 described above, which is not described herein.
The model training module 920 is configured to perform the following operations in order of the N-layer cascade structure until N classification results are obtained:
inputting the preprocessed bird sound sample data into the feature extraction layer of the 1st layer cascade structure, outputting first feature data; inputting the first feature data into the dual-path time-frequency joint modeling unit, outputting second feature data; inputting the second feature data into the jump connection layer, outputting third feature data; and inputting the third feature data into the auxiliary branch classifier, outputting the 1st classification result;
inputting the third feature data output by the jump connection layer in the (i-1)-th layer cascade structure into the i-th layer cascade structure, and outputting the i-th classification result, wherein 1 < i ≤ N and i is a positive integer;
and obtaining the bird classification result according to the N classification results. In an embodiment, the model training module 920 may be used to perform the operation S220 described above, which is not described herein.
And the calculating module 930 is configured to calculate a loss value according to the classification result and the tag data of the birds, so as to obtain a loss result. In an embodiment, the calculating module 930 may be configured to perform the operation S230 described above, which is not described herein.
And the adjusting module 940 is configured to iteratively adjust parameters of the bird voice recognition model according to the loss result to obtain a trained bird voice recognition model. In an embodiment, the adjustment module 940 may be configured to perform the operation S240 described above, which is not described herein.
Any of the bird sound sample preprocessing module 910, model training module 920, computing module 930, adjustment module 940 may be combined in one module to be implemented, or any of the modules may be split into multiple modules according to embodiments of the present disclosure. Alternatively, at least some of the functionality of one or more of the modules may be combined with at least some of the functionality of other modules and implemented in one module. According to embodiments of the present disclosure, at least one of the bird sound sample preprocessing module 910, the model training module 920, the calculation module 930, the adjustment module 940 may be implemented at least in part as hardware circuitry, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or in hardware or firmware, such as any other reasonable manner of integrating or packaging the circuitry, or in any one of or a suitable combination of three of software, hardware, and firmware. Alternatively, at least one of the bird sound sample preprocessing module 910, the model training module 920, the computing module 930, the adjustment module 940 may be at least partially implemented as a computer program module that, when executed, may perform the corresponding functions.
Based on the above-mentioned bird voice recognition method based on the multi-auxiliary branches of the dual-path time-frequency joint modeling, the present disclosure provides a bird voice recognition device based on the multi-auxiliary branches of the dual-path time-frequency joint modeling, which will be described in detail below with reference to fig. 10.
Fig. 10 schematically illustrates a block diagram of a multi-auxiliary branch bird voice recognition device based on dual path time-frequency joint modeling in accordance with an embodiment of the present disclosure.
As shown in fig. 10, the bird voice recognition device 1000 of multiple auxiliary branches based on dual path time-frequency joint modeling includes: a target bird sound preprocessing module 1010, a prediction module 1020, a determination output module 1030.
The target bird sound preprocessing module 1010 is configured to preprocess the acquired target bird sound data to obtain preprocessed target bird sound data, where the target bird sound data includes the sound data of the target bird and environmental sound data. In an embodiment, the target bird sound preprocessing module 1010 may be used to perform the operation S810 described above, which will not be repeated here.
The prediction module 1020 is configured to input the preprocessed target bird sound data into a trained bird sound recognition model, and output a prediction result, where the trained bird sound recognition model is trained by the training method of the bird sound recognition model based on the multi-auxiliary branch of the dual-path time-frequency joint modeling. In an embodiment, the prediction module 1020 may be configured to perform the operation S820 described above, which is not described herein.
The determining output module 1030 is configured to compare the prediction result with a preset output threshold, and determine target bird information in the target bird sound data if the prediction result is greater than the preset output threshold. In an embodiment, the determining output module 1030 may be used to perform the operation S830 described above, which is not described herein.
According to embodiments of the present disclosure, any of the target bird sound preprocessing module 1010, the prediction module 1020, and the determination output module 1030 may be combined and implemented in one module, or any one of them may be split into multiple modules. Alternatively, at least some of the functionality of one or more of these modules may be combined with at least some of the functionality of other modules and implemented in one module. At least one of the target bird sound preprocessing module 1010, the prediction module 1020, and the determination output module 1030 may be implemented at least in part as a hardware circuit, such as a field-programmable gate array (FPGA), a programmable logic array (PLA), a system on chip, a system on substrate, a system in package, or an application-specific integrated circuit (ASIC), or by hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or by any one of, or a suitable combination of, software, hardware, and firmware. Alternatively, at least one of these modules may be implemented at least in part as a computer program module that, when executed, performs the corresponding functions.
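The threshold comparison performed by the determination output module 1030 can be sketched as follows. This is an illustrative sketch only; the function name `predict_bird` and the example class names are assumptions, not part of the disclosed device.

```python
import numpy as np

def predict_bird(probabilities, class_names, threshold=0.5):
    """Return the predicted species only when the model's confidence
    exceeds the preset output threshold; otherwise report no detection."""
    best = int(np.argmax(probabilities))
    if probabilities[best] > threshold:
        return class_names[best]
    return None
```

Thresholding in this way lets the device abstain when the prediction result does not exceed the preset output threshold, rather than forcing a species label onto ambiguous audio.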
Fig. 11 schematically illustrates a block diagram of an electronic device adapted to implement a multi-auxiliary branch bird voice recognition method based on dual path time-frequency joint modeling, and a training method of the model, in accordance with an embodiment of the present disclosure.
As shown in fig. 11, an electronic device 1100 according to an embodiment of the present disclosure includes a processor 1101 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1102 or a program loaded from a storage section 1108 into a Random Access Memory (RAM) 1103. The processor 1101 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor 1101 may also include on-board memory for caching purposes. The processor 1101 may comprise a single processing unit or a plurality of processing units for performing the different actions of the method flow according to embodiments of the present disclosure.
In the RAM 1103, various programs and data necessary for the operation of the electronic device 1100 are stored. The processor 1101, the ROM 1102, and the RAM 1103 are connected to one another by a bus 1104. The processor 1101 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 1102 and/or the RAM 1103. Note that the programs may also be stored in one or more memories other than the ROM 1102 and the RAM 1103. The processor 1101 may also perform various operations of the method flow according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the disclosure, the electronic device 1100 may also include an input/output (I/O) interface 1105, which is also connected to the bus 1104. The electronic device 1100 may also include one or more of the following components connected to the I/O interface 1105: an input section 1106 including a keyboard, a mouse, and the like; an output section 1107 including a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, and the like; a storage section 1108 including a hard disk or the like; and a communication section 1109 including a network interface card such as a LAN card, a modem, and the like. The communication section 1109 performs communication processing via a network such as the Internet. A drive 1110 is also connected to the I/O interface 1105 as needed. Removable media 1111, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, are mounted in the drive 1110 as needed, so that a computer program read therefrom is installed into the storage section 1108 as needed.
The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, the computer-readable storage medium may include ROM1102 and/or RAM1103 described above and/or one or more memories other than ROM1102 and RAM 1103.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowcharts. When the computer program product runs in a computer system, the program code is used for enabling the computer system to realize the bird voice recognition method based on the multi-auxiliary branch of the dual-path time-frequency joint modeling and the training method of the model provided by the embodiment of the disclosure.
The above-described functions defined in the system/apparatus of the embodiments of the present disclosure are performed when the computer program is executed by the processor 1101. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
In one embodiment, the computer program may be carried on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed over a network medium in the form of a signal, downloaded and installed via the communication section 1109, and/or installed from the removable media 1111. The computer program may include program code that may be transmitted using any appropriate medium, including but not limited to wireless, wired, or any suitable combination of the foregoing.
According to embodiments of the present disclosure, program code for carrying out the computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages and/or assembly/machine languages. Programming languages include, but are not limited to, Java, C++, Python, C, or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, via the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments of the disclosure and/or in the claims may be combined and/or integrated in a variety of ways, even if such combinations or integrations are not explicitly recited in the disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or the claims may be combined and/or integrated without departing from the spirit and teachings of the present disclosure. All such combinations and integrations fall within the scope of the present disclosure.
The embodiments of the present disclosure are described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims (10)

1. A training method of a bird voice recognition model with multiple auxiliary branches based on dual-path time-frequency joint modeling, wherein the bird voice recognition model has an N-layer cascade structure, N &gt; 1, and each layer of the cascade structure comprises a feature extraction layer, a dual-path time-frequency joint modeling unit, a skip connection layer, and an auxiliary branch classifier connected in sequence, the method comprising:
Preprocessing the obtained bird sound sample data to obtain preprocessed bird sound sample data and corresponding tag data, wherein the bird sound sample data comprises bird sound data and environment sound data;
the following operations are executed according to the sequence of the N-layer cascade structure until N classification results are obtained:
inputting the preprocessed bird sound sample data into the feature extraction layer of the first-layer cascade structure, outputting first feature data; inputting the first feature data into the dual-path time-frequency joint modeling unit, outputting second feature data; inputting the second feature data into the skip connection layer, outputting third feature data; and inputting the third feature data into the auxiliary branch classifier, outputting a first classification result;
inputting the third feature data output by the skip connection layer of the (i-1)-th layer cascade structure into the i-th layer cascade structure, and outputting an i-th classification result, where 1 &lt; i ≤ N;
obtaining a bird classification result according to the N classification results;
calculating a loss value according to the classification result of the birds and the tag data to obtain a loss result;
and iteratively adjusting parameters of the bird voice recognition model by using the loss result to obtain a trained bird voice recognition model.
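The cascade recited above can be sketched as follows. This is an illustrative, non-limiting NumPy sketch: `toy_layer` is a hypothetical stand-in for one cascade level, and averaging the logits is only one possible way to obtain the bird classification result from the N results.

```python
import numpy as np

def toy_layer(feats):
    """Hypothetical stand-in for one cascade level: it plays the role of the
    feature extraction layer, dual-path unit, and skip connection (returning
    new features) plus the auxiliary branch classifier (returning logits)."""
    skip = feats * 1.5                 # placeholder skip-connection output
    logits = skip.mean(axis=0)         # placeholder auxiliary classification
    return skip, logits

def run_cascade(x, layers):
    """Layer 1 consumes the preprocessed sample; each later layer consumes
    the skip-connection output of the previous layer; the N auxiliary
    classification results are then combined."""
    results = []
    feats = x
    for layer in layers:
        feats, logits = layer(feats)
        results.append(logits)
    return np.mean(results, axis=0)    # one possible combination: averaging
```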
2. The method of claim 1, wherein the dual path time-frequency joint modeling unit comprises: a time convolution-attention mechanism unit and a frequency convolution-attention mechanism unit;
the dual-path time-frequency joint modeling unit in each layer of cascade structure executes the following operations:
performing a first reshaping operation on the first feature data to obtain first reshaped feature data, inputting the first reshaped feature data into the time convolution-attention mechanism unit, and outputting time-domain reshaped feature data;
performing a first residual connection on the time-domain reshaped feature data and the first reshaped feature data to obtain first residual feature data;
performing a second reshaping operation on the first residual feature data to obtain second reshaped feature data, inputting the second reshaped feature data into the frequency convolution-attention mechanism unit, and outputting frequency-domain reshaped feature data;
performing a second residual connection on the frequency-domain reshaped feature data and the second reshaped feature data to obtain second residual feature data;
and performing a third reshaping operation on the second residual feature data to obtain the second feature data.
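The reshape, attend, residual sequence of the dual-path unit can be sketched as follows. This is a minimal illustration assuming a plain single-head self-attention in place of the claimed convolution-attention units; the function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Minimal single-head self-attention over the rows of x."""
    scores = softmax(x @ x.T / np.sqrt(x.shape[-1]))
    return scores @ x

def dual_path_unit(feat):
    """feat has shape (time, frequency). First path: rows index time steps,
    attend, add a residual. Second path: reshape so rows index frequency
    bins, attend, add a residual. Finally reshape back."""
    time_out = self_attention(feat)        # first path: attend along time
    feat = feat + time_out                 # first residual connection
    freq_in = feat.T                       # second reshape: rows = freq bins
    freq_out = self_attention(freq_in)     # second path: attend along frequency
    freq_in = freq_in + freq_out           # second residual connection
    return freq_in.T                       # third reshape: back to (time, freq)
```

Modeling the two axes in separate passes with residual connections is what lets the unit capture temporal and spectral structure jointly without flattening the spectrogram.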
3. The method of claim 1, wherein the N classification results are different from each other, each independently comprising any one of the following features:
Bird voiceprint amplitude, bird voiceprint edge, bird voiceprint gradient, bird voiceprint shape, bird voiceprint texture, bird voiceprint first semantic information, bird voiceprint structure information, bird voiceprint second semantic information.
4. The method of claim 1, wherein preprocessing the obtained bird sound sample data to obtain preprocessed bird sound sample data and corresponding tag data comprises:
extracting the acquired bird sound sample data to obtain effective bird sound sample fragments;
filtering, segmenting and labeling the effective bird sound sample fragments to obtain first bird sound sample data and tag data;
framing, windowing, fourier transforming and Mel filtering the first bird sound sample data to obtain bird Mel spectrogram sample data;
and carrying out data enhancement processing on the bird Mel spectrogram sample data to obtain pretreated bird sound sample data.
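The framing, windowing, Fourier transform, and Mel filtering chain recited above can be sketched as follows. This is a simplified NumPy illustration; the frame length, hop size, number of Mel bands, and filterbank construction are assumptions for illustration, not claimed values.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Framing: split the waveform into overlapping frames."""
    n = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced evenly on the Mel scale (simplified)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(x, sr, frame_len=400, hop=160, n_mels=40):
    frames = frame_signal(x, frame_len, hop)                 # framing
    frames = frames * np.hanning(frame_len)                  # windowing
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2         # Fourier transform
    return power @ mel_filterbank(n_mels, frame_len, sr).T   # Mel filtering
```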
5. The method of claim 4, wherein the performing data enhancement processing on the bird mel spectrogram sample data to obtain pre-processed bird sound sample data comprises at least one of the following data enhancement processing modes:
Scaling the input bird mel spectrogram sample data in the horizontal direction or the vertical direction by using a random stretching proportion so as to simulate the change of tone and rhythm in the sounding process of birds;
performing a horizontal or vertical shift on the input bird mel spectrogram sample data using a random shift ratio to simulate changes in bird song pitch and sounding time;
distorting the input bird mel spectrogram sample data along a given grid to simulate distortion conditions encountered in a bird sound collection process;
masking the input bird mel spectrogram sample data in a horizontal direction or a vertical direction by using a random masking proportion so as to simulate the problem of information loss encountered in bird sound collection.
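Two of the enhancement modes above, random shifting and random masking, can be sketched as follows. This is an illustrative sketch; the function names and the proportion limits are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def random_shift(spec, max_frac=0.2):
    """Shift the spectrogram along the time axis by a random proportion,
    simulating a change in when the bird starts vocalizing."""
    limit = int(spec.shape[1] * max_frac)
    shift = int(rng.integers(-limit, limit + 1))
    return np.roll(spec, shift, axis=1)

def random_mask(spec, max_frac=0.15):
    """Zero a random horizontal or vertical band, simulating information
    lost during bird sound collection."""
    out = spec.copy()
    axis = int(rng.integers(0, 2))
    size = spec.shape[axis]
    width = int(rng.integers(1, max(int(size * max_frac), 2)))
    start = int(rng.integers(0, size - width + 1))
    if axis == 0:
        out[start:start + width, :] = 0.0
    else:
        out[:, start:start + width] = 0.0
    return out
```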
6. A multi-auxiliary branch bird voice recognition method based on double-path time-frequency joint modeling is characterized by comprising the following steps:
preprocessing the acquired target bird sound data to obtain preprocessed target bird sound data, wherein the target bird sound data comprises sound data of the target bird and environment sound data;
inputting the preprocessed target bird sound data into a bird sound recognition model after training, and outputting a prediction result;
Comparing the prediction result with a preset output threshold value, and determining target bird information in the target bird sound data under the condition that the prediction result is larger than the preset output threshold value;
wherein the trained bird voice recognition model is trained by the method of any one of claims 1-5.
7. The method of claim 6, wherein preprocessing the acquired target bird sound data to obtain preprocessed target bird sound data comprises:
extracting the acquired target bird sound data to obtain a target effective bird sound fragment;
filtering and segmenting the target effective bird sound fragment to obtain first target bird sound data;
and framing, windowing, fourier transform and Mel filtering are carried out on the first target bird sound data to obtain preprocessed target bird sound data.
8. A training device of a bird voice recognition model with multiple auxiliary branches based on dual-path time-frequency joint modeling, wherein the bird voice recognition model has an N-layer cascade structure, N &gt; 1, and each layer of the cascade structure comprises a feature extraction layer, a dual-path time-frequency joint modeling unit, a skip connection layer, and an auxiliary branch classifier connected in sequence, the device comprising:
The bird sound sample preprocessing module is used for preprocessing the acquired bird sound sample data to obtain preprocessed bird sound sample data and corresponding tag data, wherein the bird sound sample data comprises bird sound data and environment sound data;
the model training module is used for executing the following operations according to the sequence of the N-layer cascade structure until N classification results are obtained:
inputting the preprocessed bird sound sample data into the feature extraction layer of the first-layer cascade structure, outputting first feature data; inputting the first feature data into the dual-path time-frequency joint modeling unit, outputting second feature data; inputting the second feature data into the skip connection layer, outputting third feature data; and inputting the third feature data into the auxiliary branch classifier, outputting a first classification result;
inputting the third feature data output by the skip connection layer of the (i-1)-th layer cascade structure into the i-th layer cascade structure, and outputting an i-th classification result, where 1 &lt; i ≤ N;
obtaining a bird classification result according to the N classification results;
the calculating module is used for calculating a loss value according to the bird classification result and the tag data to obtain a loss result;
and the adjusting module is used for iteratively adjusting parameters of the bird voice recognition model by using the loss result to obtain the bird voice recognition model after training.
9. A multi-auxiliary branch bird voice recognition device based on dual-path time-frequency joint modeling, comprising:
the target bird sound preprocessing module is used for preprocessing the acquired target bird sound data to obtain preprocessed target bird sound data, wherein the target bird sound data comprises sound data of the target bird and environment sound data;
the prediction module is used for inputting the preprocessed target bird sound data into a bird sound recognition model after training, and outputting a prediction result;
the determining output module is used for comparing the prediction result with a preset output threshold value, and determining target bird information in the target bird sound data under the condition that the prediction result is larger than the preset output threshold value;
wherein the trained bird voice recognition model is trained by the method of any one of claims 1-5.
10. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1 to 7.
CN202310216766.2A 2023-03-02 2023-03-02 Bird voice recognition method, model training method, device and electronic equipment Pending CN116206612A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310216766.2A CN116206612A (en) 2023-03-02 2023-03-02 Bird voice recognition method, model training method, device and electronic equipment


Publications (1)

Publication Number Publication Date
CN116206612A true CN116206612A (en) 2023-06-02

Family

ID=86514528

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310216766.2A Pending CN116206612A (en) 2023-03-02 2023-03-02 Bird voice recognition method, model training method, device and electronic equipment

Country Status (1)

Country Link
CN (1) CN116206612A (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110120224A (en) * 2019-05-10 2019-08-13 平安科技(深圳)有限公司 Construction method, device, computer equipment and the storage medium of bird sound identification model
CN110827837A (en) * 2019-10-18 2020-02-21 中山大学 Whale activity audio classification method based on deep learning
WO2021043015A1 (en) * 2019-09-05 2021-03-11 腾讯科技(深圳)有限公司 Speech recognition method and apparatus, and neural network training method and apparatus
CN114863938A (en) * 2022-05-24 2022-08-05 西南石油大学 Bird language identification method and system based on attention residual error and feature fusion
CN115294994A (en) * 2022-06-28 2022-11-04 重庆理工大学 Bird sound automatic identification system in real environment
CN115376518A (en) * 2022-10-26 2022-11-22 广州声博士声学技术有限公司 Voiceprint recognition method, system, device and medium for real-time noise big data


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Li Guorui; He Xiaohai; Wu Xiaohong; Qing Linbo; Teng Qizhi: "Fine-grained bird recognition based on cross-layer feature fusion of semantic information", Computer Applications and Software, no. 04, 12 April 2020 (2020-04-12) *
Chen Weibin: "Selection and extraction of features for bird image classification", Journal of Yangtze University (Natural Science Edition), Science and Engineering Volume, no. 04, 15 December 2009 (2009-12-15) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117095694A (en) * 2023-10-18 2023-11-21 中国科学技术大学 Bird song recognition method based on tag hierarchical structure attribute relationship
CN117095694B (en) * 2023-10-18 2024-02-23 中国科学技术大学 Bird song recognition method based on tag hierarchical structure attribute relationship
CN117292693A (en) * 2023-11-27 2023-12-26 安徽大学 CRNN rare animal identification and positioning method integrated with self-attention mechanism
CN117292693B (en) * 2023-11-27 2024-02-09 安徽大学 CRNN rare animal identification and positioning method integrated with self-attention mechanism


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination