CN112750459A - Audio scene recognition method, apparatus, device, and computer-readable storage medium

Audio scene recognition method, apparatus, device, and computer-readable storage medium

Info

Publication number
CN112750459A
Authority
CN
China
Prior art keywords
audio
recognition
scene
clip
features
Prior art date
Legal status
Granted
Application number
CN202010794916.4A
Other languages
Chinese (zh)
Other versions
CN112750459B (en)
Inventor
夏咸军
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010794916.4A
Publication of CN112750459A
Application granted
Publication of CN112750459B
Legal status: Active

Classifications

    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18: the extracted parameters being spectral information of each sub-band
    • G10L25/30: characterised by the analysis technique using neural networks


Abstract

The application provides an audio scene recognition method, apparatus, device, and computer-readable storage medium. The method includes: extracting audio clips from an audio signal to be recognized to obtain a first audio clip and a second audio clip, where the first audio clip contains the second audio clip and the duration of the first audio clip is greater than that of the second audio clip; acquiring the dynamic audio features of the first audio clip and the dynamic audio features of the second audio clip; inputting the dynamic audio features of the first audio clip into a first recognition model to perform audio scene recognition on the audio signal and obtain a first recognition result, and inputting the dynamic audio features of the second audio clip into the first recognition model to obtain a second recognition result; and determining the audio scene corresponding to the audio signal by combining the first recognition result and the second recognition result. With the method and device, the accuracy of audio scene recognition can be improved.

Description

Audio scene recognition method, apparatus, device, and computer-readable storage medium
Technical Field
The present application relates to computer technologies, and in particular, to an audio scene recognition method, apparatus, device, and computer-readable storage medium.
Background
Artificial intelligence is a comprehensive discipline covering a wide range of fields at both the hardware level and the software level. Artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and deep learning.
Audio scene recognition is one of the important applications of speech processing technology; it aims to recognize the audio scenes contained in a continuous audio stream, such as speech, music, and noise. A conventional audio scene recognition system usually extracts features from the input audio signal and uses different classification models to recognize the extracted features. However, this approach places high demands on the recording device and recording environment, is only suitable for noise-free environments, and has low accuracy when recognizing audio scenes in noisy environments.
Disclosure of Invention
The embodiments of the application provide an audio scene recognition method, apparatus, device, and computer-readable storage medium, which can improve the accuracy of audio scene recognition.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides an audio scene identification method, which comprises the following steps:
carrying out audio clip extraction on an audio signal to be identified to obtain a first audio clip and a second audio clip;
wherein the first audio clip comprises the second audio clip and the duration of the first audio clip is greater than the duration of the second audio clip;
acquiring the dynamic audio features of the first audio clip and the dynamic audio features of the second audio clip;
inputting the dynamic audio features of the first audio clip into a first recognition model for audio scene recognition to obtain a corresponding first recognition result, and inputting the dynamic audio features of the second audio clip into the first recognition model for audio scene recognition to obtain a corresponding second recognition result;
and determining an audio scene corresponding to the audio signal by combining the first recognition result and the second recognition result.
An embodiment of the present application provides an audio scene recognition device, including:
the segment extraction module is used for extracting audio segments of the audio signal to be identified to obtain a first audio segment and a second audio segment;
wherein the first audio clip comprises the second audio clip and the duration of the first audio clip is greater than the duration of the second audio clip;
the characteristic acquisition module is used for acquiring the dynamic audio characteristics of the first audio clip and the dynamic audio characteristics of the second audio clip;
the scene recognition module is used for inputting the dynamic audio features of the first audio clip into a first recognition model for audio scene recognition to obtain a corresponding first recognition result, and inputting the dynamic audio features of the second audio clip into the first recognition model for audio scene recognition to obtain a corresponding second recognition result;
and the scene determining module is used for determining the audio scene corresponding to the audio signal by combining the first recognition result and the second recognition result.
In the above scheme, the segment extracting module is further configured to perform audio segment extraction on the audio signal through a first window to obtain a first audio segment;
performing audio clip extraction on the audio signal through a second window to obtain a second audio clip;
the time domain corresponding to the first window comprises the time domain corresponding to the second window, and the window size of the first window is larger than that of the second window.
In the foregoing solution, the feature obtaining module is further configured to perform the following operations on the first audio clip and the second audio clip respectively:
acquiring static audio features of an audio clip;
performing first-order difference processing on the static audio features to obtain corresponding first-order difference features;
performing second-order differential processing on the static audio features to obtain corresponding second-order differential features;
and splicing the static audio features, the first-order difference features and the second-order difference features to obtain the dynamic audio features of the audio clip.
In the above scheme, the characteristic obtaining module is further configured to perform fast fourier transform on the audio segment to obtain a corresponding audio frequency spectrum;
squaring the audio frequency spectrum to obtain a corresponding audio power spectrum;
carrying out Mel filtering on the audio power spectrum to obtain a corresponding audio Mel frequency spectrum;
and carrying out logarithm processing on the audio Mel frequency spectrum to obtain corresponding logarithmic Mel frequency spectrum characteristics, and determining the logarithmic Mel frequency spectrum characteristics as the static audio frequency characteristics of the audio frequency fragment.
In the above scheme, the feature obtaining module is further configured to perform framing processing on the audio segment to obtain at least two corresponding audio frames;
windowing the at least two audio frames to obtain corresponding windowed audio signals;
and carrying out fast Fourier transform on the windowed audio signal to obtain a corresponding audio frequency spectrum.
In the above solution, the apparatus further includes a first recognition model training module, where the first recognition model training module is configured to, before inputting the dynamic audio features of the first audio segment into the first recognition model,
acquiring dynamic audio features of audio signal samples, wherein the audio signal samples are marked with corresponding audio scenes;
inputting the dynamic audio features of the audio signal samples into a first recognition model, and performing audio scene recognition on the audio signal samples to obtain recognition results;
obtaining a difference between the recognition result and the label of the audio signal sample;
updating model parameters of the first recognition model based on the obtained difference.
In the above solution, the first recognition model training module is further configured to determine an error signal of the first recognition model based on the difference when the difference exceeds a difference threshold;
and reversely propagating the error signal in the first recognition model, and updating the model parameters of each layer in the process of propagation.
In the above scheme, the first identification result represents a first prediction probability that the audio signal corresponds to different audio scenes, and the second identification result represents a second prediction probability that the audio signal corresponds to different audio scenes;
the scene determining module is further configured to obtain mean values of the first prediction probability and the second prediction probability in the same audio scene respectively;
and taking the audio scene with the maximum average value as the audio scene corresponding to the audio signal.
In the above solution, the first identification result represents a first prediction probability of the audio signal corresponding to different audio scenes, the second identification result represents a second prediction probability of the audio signal corresponding to different audio scenes, and when the number of the second audio segments is at least two,
the scene determining module is further configured to determine, based on the second recognition results of the second audio segments, a first mean value of the second prediction probability values in the same audio scene corresponding to the second audio segments, respectively;
acquiring a second average value of the first prediction probability and the first average value corresponding to the same audio scene;
and taking the audio scene with the largest second average value as the audio scene corresponding to the audio signal.
In the above solution, the first audio clip and the second audio clip form an audio clip pair, and when the number of the audio clip pair is at least two,
the scene determining module is further configured to determine an audio scene corresponding to each of the audio segment pairs based on the first recognition result and the second recognition result of each of the audio segment pairs, respectively;
respectively acquiring the number of audio segment pairs corresponding to each determined audio scene;
and determining the audio scene corresponding to the audio signal based on the number of the audio fragment pairs corresponding to each audio scene.
In the foregoing solution, when it is determined that the audio scene corresponding to the audio signal is a target audio scene, the apparatus further includes a second identification processing module, where the second identification processing module is configured to
Acquiring static audio features of the first audio clip and static audio features of the second audio clip;
inputting the static audio features of the first audio clip into a second recognition model for audio scene recognition to obtain a corresponding third recognition result, and inputting the static audio features of the second audio clip into the second recognition model for audio scene recognition to obtain a corresponding fourth recognition result;
and determining that the audio signal corresponds to a sub audio scene in the target audio scene by combining the third recognition result and the fourth recognition result.
In the above solution, the apparatus further includes a second recognition model training module, where the second recognition model training module is configured to, before inputting the static audio features of the first audio segment into a second recognition model for audio scene recognition,
obtaining static audio features of an audio signal sample, wherein the audio signal sample is marked with a corresponding audio scene;
inputting the static audio features of the audio signal samples into a second recognition model, and performing audio scene recognition on the audio signal samples to obtain recognition results;
obtaining a difference between the recognition result and the label of the audio signal sample;
updating model parameters of the second recognition model based on the obtained difference.
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the audio scene recognition method provided by the embodiment of the application when the executable instructions stored in the memory are executed.
The embodiment of the application provides a computer-readable storage medium, which stores executable instructions for causing a processor to execute the method for recognizing an audio scene provided by the embodiment of the application.
The embodiment of the application has the following beneficial effects:
the method comprises the steps of extracting a first audio clip and a second audio clip with different durations from an audio signal to be identified, respectively obtaining an identification result for audio scene identification based on each audio clip, and then combining the identification result corresponding to each audio clip to obtain an audio scene corresponding to the audio signal.
Drawings
Fig. 1 is a schematic diagram of an alternative architecture of an audio scene recognition system according to an embodiment of the present application;
fig. 2 is an alternative structural schematic diagram of an electronic device provided in an embodiment of the present application;
fig. 3 is a schematic flow chart of an alternative audio scene recognition method according to an embodiment of the present application;
FIG. 4 is a schematic flow chart illustrating an alternative method for determining dynamic audio characteristics according to an embodiment of the present application;
FIG. 5 is a schematic flow chart illustrating an alternative method for determining static audio characteristics according to an embodiment of the present application;
FIG. 6 is a schematic diagram of an audio clip provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of an audio clip provided by an embodiment of the present application;
fig. 8 is a schematic data flow diagram of audio scene recognition provided in an embodiment of the present application;
fig. 9 is a schematic view of a feature extraction process provided in an embodiment of the present application;
FIG. 10 is a schematic diagram of classifier training provided in an embodiment of the present application;
fig. 11 is a schematic structural diagram of an audio scene recognition apparatus according to an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the description that follows, the terms "first" and "second" are used merely to distinguish between similar objects and do not represent a particular ordering of the objects. It is understood that "first" and "second" may be interchanged in a particular order or sequence where permitted, so that the embodiments of the application described herein can be implemented in an order other than that illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Referring to fig. 1, fig. 1 is an alternative architecture diagram of an audio scene recognition system 100 provided in this embodiment of the present application, in order to support an exemplary application, a terminal 400 (an exemplary terminal 400-1 and a terminal 400-2 are shown) is connected to a server 200 through a network 300, where the network 300 may be a wide area network or a local area network, or a combination of the two, and data transmission is implemented using a wireless link.
In practical applications, the terminal 400 may be various types of user terminals such as a smart phone, a tablet computer, a notebook computer, and the like, and may also be a desktop computer, a game console, a television, or a combination of any two or more of these data processing devices; the server 200 may be a single server configured to support various services, may also be configured as a server cluster, may also be a cloud server, and the like. In practical implementation, the audio scene recognition method provided by the embodiment of the present application may be implemented by a server or a terminal alone, or may be implemented by the server and the terminal in a cooperative manner.
In some embodiments, the terminal 400 is configured to perform audio segment extraction on an audio signal to be identified to obtain a first audio segment and a second audio segment; acquiring the dynamic audio features of the first audio clip and the dynamic audio features of the second audio clip; inputting the dynamic audio features of the first audio clip into a first recognition model for audio scene recognition to obtain a corresponding first recognition result, and inputting the dynamic audio features of the second audio clip into the first recognition model for audio scene recognition to obtain a corresponding second recognition result; and determining the audio scene corresponding to the audio signal by combining the first recognition result and the second recognition result.
In other embodiments, the terminal 400 is provided with an audio acquisition device (e.g., a microphone), acquires an audio signal to be identified through the audio acquisition device, and sends the audio signal to be identified to the server 200, where the server 200 is configured to extract an audio clip from the audio signal to be identified to obtain a first audio clip and a second audio clip; acquiring the dynamic audio features of the first audio clip and the dynamic audio features of the second audio clip; inputting the dynamic audio features of the first audio clip into a first recognition model for audio scene recognition to obtain a corresponding first recognition result, and inputting the dynamic audio features of the second audio clip into the first recognition model for audio scene recognition to obtain a corresponding second recognition result; and determining and returning an audio scene corresponding to the audio signal to the terminal 400 by combining the first recognition result and the second recognition result, and executing next processing by the terminal 400 based on the audio scene corresponding to the audio signal, such as improving speech definition under noise or improving appreciation ability of music.
As an example, the audio scene recognition method provided by the embodiments of the application can be applied to smart homes. When a user is at work and nobody is home during the day, the method can be used to detect abnormalities occurring in the home by acquiring an audio signal and recognizing the corresponding audio scene, such as a door being violently forced or a fire alarm, so that the user learns what is happening at home immediately and can take countermeasures in time.
As another example, the audio scene recognition method provided in the embodiments of the present application may also be applied to unmanned driving. Most unmanned driving technologies are based on image recognition and do not make effective use of audio resources, yet in some scenes there are places that video cannot reach, for example video blind spots such as turns, where the camera cannot provide synchronous picture information. In such cases the audio scene recognition method can be used to recognize the audio scenes corresponding to certain audio signals: for example, for the audio scene of a signal emitted by a vehicle that needs emergency passage, it is necessary to decelerate and yield in time; or, for the audio scene of a recognized emergency event (for example, violence or pedestrians screaming), it is necessary to take evasive action in time.
Referring to fig. 2, fig. 2 is an optional schematic structural diagram of an electronic device 500 provided in the embodiment of the present application. In practical applications, the electronic device 500 may be the terminal 400 or the server 200 in fig. 1; the electronic device implementing the audio scene recognition method of the embodiment of the present application is described below taking the terminal 400 shown in fig. 1 as an example. The electronic device 500 shown in fig. 2 includes: at least one processor 510, memory 550, at least one network interface 520, and a user interface 530. The various components in the electronic device 500 are coupled together by a bus system 540. It is understood that the bus system 540 is used to enable communications among these components. In addition to a data bus, the bus system 540 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are labeled as bus system 540 in fig. 2.
The Processor 510 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.
The user interface 530 includes one or more output devices 531 enabling presentation of media content, including one or more speakers and/or one or more visual display screens. The user interface 530 also includes one or more input devices 532, including user interface components to facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 550 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 550 optionally includes one or more storage devices physically located remote from processor 510.
The memory 550 may comprise volatile memory or nonvolatile memory, and may also comprise both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 550 described in embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 550 can store data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 551 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
a network communication module 552 for communicating with other computing devices via one or more (wired or wireless) network interfaces 520, exemplary network interfaces 520 including: Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), etc.;
a presentation module 553 for enabling presentation of information (e.g., a user interface for operating peripherals and displaying content and information) via one or more output devices 531 (e.g., a display screen, speakers, etc.) associated with the user interface 530;
an input processing module 554 to detect one or more user inputs or interactions from one of the one or more input devices 532 and to translate the detected inputs or interactions.
In some embodiments, the audio scene recognition apparatus provided in the embodiments of the present application may be implemented in software, and fig. 2 shows an audio scene recognition apparatus 555 stored in a memory 550, which may be software in the form of programs and plug-ins, and includes the following software modules: the segment extraction module 5551, the feature acquisition module 5552, the scene recognition module 5553 and the scene determination module 5554 are logical and thus may be arbitrarily combined or further split depending on the functions implemented.
The functions of the respective modules will be explained below.
In other embodiments, the audio scene recognition Device provided in the embodiments of the present Application may be implemented in hardware, and for example, the audio scene recognition Device provided in the embodiments of the present Application may be a processor in the form of a hardware decoding processor, which is programmed to execute the audio scene recognition method provided in the embodiments of the present Application, for example, the processor in the form of the hardware decoding processor may be one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
Next, the audio scene recognition method provided in the embodiment of the present application is described, and in practical implementation, the audio scene recognition method provided in the embodiment of the present application may be implemented by a server or a terminal alone, or may be implemented by a server and a terminal in a cooperative manner.
Referring to fig. 3, fig. 3 is an alternative flowchart of an audio scene recognition method provided in the embodiment of the present application, which will be described with reference to the steps shown in fig. 3.
Step 101: and the terminal extracts audio clips of the audio signal to be identified to obtain a first audio clip and a second audio clip.
In practical application, an audio signal acquisition device (such as a microphone) is installed on the terminal, and an audio signal to be identified is acquired by the audio signal acquisition device, or the audio signal to be identified is a signal sent by other devices or a server.
Before feature extraction is carried out on the audio signal to be recognized, audio segmentation extraction needs to be carried out on the audio signal to obtain a first audio segment and a second audio segment. In some embodiments, the terminal may extract the audio segment from the audio signal to obtain the first audio segment and the second audio segment by:
audio clip extraction is carried out on the audio signal through a first window to obtain a first audio clip; audio clip extraction is carried out on the audio signal through a second window to obtain a second audio clip; the time domain corresponding to the first window comprises the time domain corresponding to the second window, and the window size of the first window is larger than that of the second window.
Here, the sizes of the first window and the second window may be determined according to the actual situation. For example, real-time audio/video calling is a multi-user voice and video chat function implemented over the Internet, and its real-time requirements on audio and video are high. In this case, the sampling duration (i.e., the window size) corresponding to the second window may be as short as 20 milliseconds; when the sampling duration of the second window is set to 20 milliseconds, the second audio clip extracted through the second window is an audio clip with a duration of 20 milliseconds. Since the sampling duration of the first window is greater than that of the second window, the sampling duration of the first window may be set to 50 milliseconds, so that the first audio clip extracted through the first window is an audio clip with a duration of 50 milliseconds. Because the time domain corresponding to the first window includes the time domain corresponding to the second window, the extracted first audio clip accordingly includes the second audio clip, and the duration of the first audio clip is greater than the duration of the second audio clip.
For another example, for an audio signal to be recognized collected by a hearing aid worn by a hearing-impaired person, it is not necessary to recognize the audio scene in real time, in order to avoid excessive consumption of the hearing aid's CPU. In this case, the sampling duration corresponding to the second window may be as long as 5 seconds; when the sampling duration of the second window is 5 seconds, the sampling duration of the first window may be 20 seconds, so that the second audio clip extracted through the second window is an audio clip with a duration of 5 seconds and the first audio clip extracted through the first window is an audio clip with a duration of 20 seconds.
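As a concrete illustration of the window-based extraction described above, the following minimal sketch (not taken from the patent; function and variable names are hypothetical) cuts one long clip and one short clip whose time span lies inside the long clip, assuming a mono signal array and the 50 ms / 20 ms window sizes of the real-time-call example:

```python
import numpy as np

def extract_clip_pair(signal, sample_rate, long_ms=50, short_ms=20):
    """Cut a first (long) clip and a second (short) clip contained in its time span."""
    long_len = int(sample_rate * long_ms / 1000)    # samples covered by the first window
    short_len = int(sample_rate * short_ms / 1000)  # samples covered by the second window
    first_clip = signal[:long_len]
    # centre the short window inside the long one (one possible placement, assumed here)
    offset = (long_len - short_len) // 2
    second_clip = signal[offset:offset + short_len]
    return first_clip, second_clip

# usage sketch: 1 second of 16 kHz mono audio standing in for a captured signal
sr = 16000
audio = np.random.randn(sr)
clip_a, clip_b = extract_clip_pair(audio, sr)
```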
Step 102: and acquiring the dynamic audio features of the first audio clip and the dynamic audio features of the second audio clip.
Here, the dynamic audio features of an audio clip are three-dimensional audio features obtained by splicing the static audio features of the audio clip with the first-order difference features and second-order difference features corresponding to those static audio features.
In some embodiments, referring to fig. 4, fig. 4 is an optional flowchart of the method for determining a dynamic audio feature provided in the embodiment of the present application, and step 102 shown in fig. 3 may be implemented by performing steps 201 and 204 shown in fig. 4 for a first audio segment and a second audio segment, respectively:
step 201: static audio features of an audio clip are obtained.
Here, the static audio features of the first audio segment and the static audio features of the second audio segment are obtained, respectively, and the static audio features are logarithmic mel-frequency spectrum features of the corresponding audio segments.
In some embodiments, referring to fig. 5, fig. 5 is an optional flowchart of the method for determining a static audio feature provided in the embodiment of the present application, and step 201 shown in fig. 4 may be implemented by step 2011-2014 shown in fig. 5:
step 2011: and carrying out fast Fourier transform on the audio segments to obtain corresponding audio frequency spectrums.
In some embodiments, for each audio clip, the terminal may obtain the audio spectrum of the audio clip by:
performing framing processing on the audio clips to obtain at least two corresponding audio frames; windowing at least two audio frames to obtain corresponding windowed audio signals; and carrying out fast Fourier transform on the windowed audio signal to obtain a corresponding audio frequency spectrum.
In order to obtain a highly accurate recognition result, the audio signal should be stationary when performing audio scene recognition; in practical applications, however, the audio signal to be recognized may not be stationary as a whole, so the extracted audio clips may also be non-stationary. Although an audio signal is time-varying, its characteristics can be considered relatively stable over a short time (e.g., in the range of 10-30 milliseconds). Therefore, in order to obtain stable signal segments, each audio clip needs to be framed, i.e., divided into a plurality of audio frames. In practical implementation, framing may be performed by continuous segmentation or by overlapping segmentation, where overlapping segmentation achieves smooth transitions between frames and maintains continuity between audio frames; the overlapping portion between two adjacent frames is called the frame shift, and the ratio of the frame shift to the frame length is generally between 0 and 1/2. For example, since an audio signal in the range of 10-30 milliseconds is considered stable, each audio clip may be framed with a frame length of no less than 20 milliseconds and a frame shift of about 1/2 of the frame length.
After framing, the beginning and end of each frame may be discontinuous, and the more frames the clip is divided into, the larger the error relative to the original audio clip. To reduce this error, that is, to make the framed audio signal continuous and to make each frame exhibit the characteristics of a periodic function, the framed audio signal is windowed with a window function. In practical implementation, commonly used window functions include the rectangular window, the Hamming window, and the Hanning window, and different window functions can be selected for different situations.
Finally, a fast Fourier transform is performed on each audio frame of the windowed audio signal to obtain the corresponding audio spectrum.
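A hedged sketch of the framing, windowing and FFT steps just described, assuming a 25 ms frame length, a frame shift of half the frame length, and a Hamming window (these are example choices, not values fixed by the patent):

```python
import numpy as np

def frame_window_fft(clip, sample_rate, frame_ms=25, shift_ratio=0.5):
    """Split a clip into overlapping frames, apply a Hamming window, return per-frame magnitude spectra."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(frame_len * shift_ratio)              # frame shift of about 1/2 the frame length
    window = np.hamming(frame_len)
    frames = [clip[i:i + frame_len] * window
              for i in range(0, len(clip) - frame_len + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))   # one-sided FFT per frame
```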
Step 2012: and squaring the audio frequency spectrum to obtain a corresponding audio power spectrum.
Step 2013: and carrying out Mel filtering on the audio power spectrum to obtain a corresponding audio Mel spectrum.
Step 2014: and carrying out logarithm processing on the audio Mel frequency spectrum to obtain corresponding logarithmic Mel frequency spectrum characteristics, and determining the logarithmic Mel frequency spectrum characteristics as the static audio frequency characteristics of the audio frequency segment.
Here, the audio spectrum represents the spectrogram of an audio clip. The spectrogram is often large, and to obtain audio features of a suitable size it is usually converted into an audio Mel spectrum through a Mel filter bank, which simulates the sensitivity of human hearing to the actual frequency of the audio signal: the loudness perceived by the human ear is not linearly related to the actual frequency (Hz) of the sound, whereas the Mel frequency (Mel) is more consistent with the hearing characteristics of the human ear, i.e., the Mel scale is approximately linear below 1000 Hz and grows logarithmically above 1000 Hz. The relationship between the Mel frequency and the Hz frequency is:

f_Mel = 2595 * log10(1 + f / 700)

where f represents the actual frequency and f_Mel represents the Mel frequency.
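The Hz-to-Mel conversion and the power-spectrum / Mel-filtering / log steps (steps 2012-2014) can be sketched as follows. This is an illustrative implementation rather than the patent's code; it assumes librosa is available for building the triangular Mel filterbank, and n_fft must equal the frame length used for the FFT above:

```python
import numpy as np
import librosa  # used here only to build the Mel filterbank

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)   # standard Mel-scale relation

def log_mel_features(magnitude_spectrum, sample_rate, n_fft, n_mels=40):
    """Magnitude spectrum -> power spectrum -> Mel filtering -> log-Mel (static) features."""
    power = magnitude_spectrum ** 2                                         # step 2012: square
    mel_fb = librosa.filters.mel(sr=sample_rate, n_fft=n_fft, n_mels=n_mels)
    mel_spec = power @ mel_fb.T                                             # step 2013: Mel filtering
    return np.log(mel_spec + 1e-10)                                         # step 2014: take the logarithm
```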
Step 202: and performing first-order difference processing on the static audio features to obtain corresponding first-order difference features.
Step 203: and carrying out second-order differential processing on the static audio features to obtain corresponding second-order differential features.
Step 204: and splicing the static audio features, the first-order difference features and the second-order difference features to obtain the dynamic audio features of the audio clip.
After the logarithmic Mel spectrum features (i.e., the static audio features) of the audio clip are obtained, the first-order difference and the second-order difference are calculated to obtain the corresponding first-order difference features and second-order difference features, and the logarithmic Mel spectrum features, first-order difference features, and second-order difference features are spliced to finally obtain the three-dimensional dynamic audio features.
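A small sketch of steps 202-204, stacking the static log-Mel features with their first-order and second-order differences into a three-channel, image-like feature (taking the difference along the frame axis is one common convention and an assumption here):

```python
import numpy as np

def dynamic_features(static):
    """Stack log-Mel features with their 1st- and 2nd-order differences."""
    delta1 = np.diff(static, n=1, axis=0, prepend=static[:1])   # first-order difference over frames
    delta2 = np.diff(delta1, n=1, axis=0, prepend=delta1[:1])   # second-order difference
    return np.stack([static, delta1, delta2], axis=-1)          # shape (frames, mels, 3)
```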
Step 103: and inputting the dynamic audio features of the first audio clip into the first recognition model for audio scene recognition to obtain a corresponding first recognition result, and inputting the dynamic audio features of the second audio clip into the first recognition model for audio scene recognition to obtain a corresponding second recognition result.
In practical implementation, in order to reduce the influence caused by noise, the dynamic audio features of the first audio segment and the dynamic audio features of the second audio segment may be normalized and then input into the first recognition model, where the first recognition model is used to classify and recognize the extracted dynamic audio features, so as to obtain a recognition result for characterizing an audio scene corresponding to the audio signal.
In practical applications, the first recognition model may be a traditional classification model such as a Hidden Markov Model (HMM) or a Support Vector Machine (SVM), or a neural network model such as a Convolutional Neural Network (CNN) or a Recurrent Neural Network (RNN). For applications with high real-time requirements, such as real-time audio/video calls, the first recognition model can be a compressed Residual Network (ResNet) model, which is small and meets the requirements of low computational complexity and low latency.
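For illustration only, below is a tiny ResNet-style classifier of the kind such a first recognition model could be; the patent does not specify this architecture, so the layer sizes, the three-class output, and the use of PyTorch are all assumptions:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1, self.bn2 = nn.BatchNorm2d(channels), nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)              # identity shortcut

class SmallSceneClassifier(nn.Module):
    """Compressed ResNet-like classifier over 3-channel dynamic audio features."""
    def __init__(self, n_classes=3):            # e.g. music / clean speech / noise
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU())
        self.blocks = nn.Sequential(ResidualBlock(16), ResidualBlock(16))
        self.head = nn.Linear(16, n_classes)

    def forward(self, x):                        # x: (batch, 3, frames, mels), channels first
        h = self.blocks(self.stem(x))
        h = h.mean(dim=(2, 3))                   # global average pooling
        return self.head(h)                      # logits; softmax at inference gives per-scene probabilities
```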
In some embodiments, before inputting the dynamic audio features of the first audio segment into the first recognition model, the first recognition model may also be trained by:
acquiring dynamic audio features of an audio signal sample, wherein the audio signal sample is marked with a corresponding audio scene; inputting the dynamic audio features of the audio signal samples into a first recognition model, and performing audio scene recognition on the audio signal samples to obtain recognition results; acquiring the difference between the identification result and the label of the audio signal sample; based on the obtained difference, the model parameters of the first recognition model are updated.
In practical implementation, the value of the loss function of the first recognition model may be determined according to the difference between the recognition result and the label of the audio signal sample; in the training process, cross entropy can be adopted as a loss function, and when the value of the loss function reaches a preset threshold value, a corresponding error signal is determined based on the value of the loss function of the first recognition model; the error signal is propagated back in the first recognition model and model parameters of the respective layers of the first recognition model are updated during the propagation.
To describe back propagation: training sample data is input into the input layer of the neural network model, passes through the hidden layers, and finally reaches the output layer, where the result is output; this is the forward propagation of the neural network model. Because there is an error between the output result of the model and the actual result, the error between the output result and the actual value is calculated and propagated backward from the output layer toward the input layer through the hidden layers, and during back propagation the values of the model parameters are adjusted according to the error; this process is iterated until convergence.
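A minimal training-step sketch matching this description (cross-entropy loss, back propagation, parameter update); the optimizer choice and all names are assumptions:

```python
import torch
from torch import nn, optim

def train_step(model, optimizer, features, labels):
    """Forward pass, cross-entropy loss against the labelled scene, back propagation, update."""
    criterion = nn.CrossEntropyLoss()      # cross entropy as the loss function
    optimizer.zero_grad()
    logits = model(features)               # forward propagation through the recognition model
    loss = criterion(logits, labels)       # difference between recognition result and label
    loss.backward()                        # error signal propagated back through each layer
    optimizer.step()                       # model parameters updated based on the error
    return loss.item()

# usage sketch
# model = SmallSceneClassifier(n_classes=3)
# opt = optim.Adam(model.parameters(), lr=1e-3)
# loss = train_step(model, opt, batch_features, batch_labels)
```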
Step 104: and determining the audio scene corresponding to the audio signal by combining the first recognition result and the second recognition result.
Here, the first recognition result determined based on the first audio piece and the second recognition result determined based on the second audio piece are comprehensively considered to obtain an audio scene corresponding to the audio signal.
In some embodiments, the first recognition result characterizes a first prediction probability that the audio signal corresponds to a different audio scene, and the second recognition result characterizes a second prediction probability that the audio signal corresponds to a different audio scene; the terminal can determine the audio scene corresponding to the audio signal by combining the first recognition result and the second recognition result as follows:
respectively obtaining the mean values of the first prediction probability and the second prediction probability under the same audio scene; and taking the audio scene with the maximum average value as the audio scene corresponding to the audio signal.
Here, the dynamic audio feature corresponding to the first audio piece is input to the first recognition model, and the prediction probability of the audio signal corresponding to each audio scene is output. Respectively obtaining a first prediction probability of each audio scene corresponding to the first audio clip and a second prediction probability of each audio scene corresponding to the second audio clip, taking the average value of the first prediction probability and the second prediction probability corresponding to the same audio scene, and taking the audio scene with the largest average value as the audio scene corresponding to the audio signal.
For example, common audio scenes of real-time audio/video calls include three types: music, clean speech, and noise. The dynamic audio features of an audio clip of the real-time call are input into the first recognition model, and the prediction probabilities of the three types are output. Assume that when the dynamic audio features corresponding to the first audio clip of the audio signal to be recognized are input into the first recognition model, the prediction probabilities output for the three audio scenes of music, clean speech, and noise are 0.5, 0.8, and 0.6 respectively, and when the dynamic audio features corresponding to the second audio clip are input into the first recognition model, the prediction probabilities output for music, clean speech, and noise are 0.4, 0.7, and 0.7 respectively. The means for the three audio scenes are then 0.45, 0.75, and 0.65 respectively; the mean for the clean-speech audio scene is clearly the largest, so it can be determined that the audio scene corresponding to the audio signal to be recognized is clean speech.
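The score fusion in this example amounts to a per-scene mean followed by an argmax, as the short sketch below shows (numbers taken from the example above):

```python
import numpy as np

scenes = ["music", "clean speech", "noise"]
p_first = np.array([0.5, 0.8, 0.6])      # first recognition result (first audio clip)
p_second = np.array([0.4, 0.7, 0.7])     # second recognition result (second audio clip)

mean_p = (p_first + p_second) / 2        # per-scene means: [0.45, 0.75, 0.65]
print(scenes[int(np.argmax(mean_p))])    # -> "clean speech"
```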
In some embodiments, the first recognition result represents a first prediction probability that the audio signal corresponds to different audio scenes, the second recognition result represents a second prediction probability that the audio signal corresponds to different audio scenes, and when the number of the second audio segments is at least two, the terminal may determine the audio scene corresponding to the audio signal by combining the first recognition result and the second recognition result in the following manner:
determining a first mean value of second prediction probability values under the same audio scene corresponding to the second audio segments respectively based on the second identification results of the second audio segments; acquiring a second average value of the first prediction probability and the first average value corresponding to the same audio scene; and taking the audio scene with the largest second average value as the audio scene corresponding to the audio signal.
In practical application, an audio signal to be recognized is divided into a plurality of short-duration second audio segments and a long-duration first audio segment, dynamic audio features of the second audio segments are respectively input into a first recognition model for audio scene recognition, and a plurality of corresponding prediction probabilities corresponding to different audio scenes are obtained; then, for the same audio scene, the prediction probability mean values corresponding to the second audio segments are obtained, the prediction probabilities corresponding to the first audio segments are weighted and averaged to obtain the final prediction probability mean value corresponding to the audio scene, and the audio scene corresponding to the maximum prediction probability mean value determined based on the second audio segments and the first audio segment is selected as the audio scene corresponding to the audio signal to be identified.
For example, assuming that the audio signal to be recognized is an audio signal of a captured audio-video real-time call, referring to fig. 6, fig. 6 is a schematic audio segment diagram provided in the embodiment of the present application, as shown in fig. 6, the audio signal to be recognized is divided into an audio segment a with a long duration and two audio segments B1 and B2 with short durations, where B1 and B2 are both included in a, and when performing audio scene recognition, the result of performing audio scene recognition based on B1 is: the prediction probabilities of outputting three audio scenes, namely, music, clean speech and noise, are 0.4, 0.7 and 0.5 respectively, and the result of performing audio scene recognition based on B2 is: the prediction probabilities of outputting three audio scenes, namely corresponding music, clean voice and noise, are 0.5, 0.9 and 0.7 respectively, and the probability mean values of the three audio scenes, namely corresponding music, clean voice and noise, determined based on B1 and B2 are 0.45, 0.8 and 0.6 respectively; the result of audio scene recognition based on a is: the predicted probabilities of outputting three audio scenes corresponding to music, clean speech and noise are 0.5, 0.9 and 0.6 respectively, and the average of the probabilities of the three audio scenes corresponding to music, clean speech and noise finally determined based on A, B1 and B2 are respectively: 0.475, 0.85, and 0.6, it is known that the average of the probabilities of the audio scene corresponding to the clean speech is the largest, and it is determined that the audio scene corresponding to the speech signal to be recognized is the clean speech.
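The two-level averaging of this example (first over the short clips, then with the long clip) can be written directly, using the probabilities from the example:

```python
import numpy as np

p_long = np.array([0.5, 0.9, 0.6])        # recognition result for clip A
p_short = np.array([[0.4, 0.7, 0.5],      # clip B1
                    [0.5, 0.9, 0.7]])     # clip B2

first_mean = p_short.mean(axis=0)         # [0.45, 0.8, 0.6]
second_mean = (p_long + first_mean) / 2   # [0.475, 0.85, 0.6]
scene_index = int(np.argmax(second_mean)) # index 1 -> clean speech
```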
In some embodiments, the first audio clip and the second audio clip form an audio clip pair, and when the number of the audio clip pair is at least two, the terminal may further determine the audio scene corresponding to the audio signal by combining the first recognition result and the second recognition result as follows:
determining an audio scene corresponding to each audio clip pair based on the first recognition result and the second recognition result of each audio clip pair respectively; respectively acquiring the number of audio clip pairs corresponding to each determined audio scene; and determining the audio scene corresponding to the audio signal based on the number of the audio fragment pairs corresponding to each audio scene.
The audio signal to be recognized is divided into a plurality of audio segment pairs, wherein each audio segment pair contains a second audio segment of short duration and a first audio segment of long duration. For example, assuming that the audio signal to be recognized is an audio signal of a captured audio-video real-time call, referring to fig. 7, fig. 7 is a schematic diagram of an audio segment provided in the embodiment of the present application, as shown in fig. 7, the audio signal is divided into three pairs of audio segments { a1, B1}, { a2, B2} and { A3, B3}, and for the pair of audio segments { a1, B1}, the result of performing audio scene recognition based on a1 is: the prediction probabilities of outputting three audio scenes, namely, music, clean speech and noise, are 0.4, 0.7 and 0.5 respectively, and the result of performing audio scene recognition based on B1 is: if the prediction probabilities of outputting three audio scenes, namely, corresponding music, clean speech and noise, are 0.5, 0.9 and 0.7 respectively, it can be determined that the average values of the prediction probabilities of the three audio scenes, namely, corresponding music, clean speech and noise, are 0.45, 0.8 and 0.6 based on a1 and B1, and the audio scene of the clean speech corresponding to the largest prediction probability average value (0.8) is selected as the audio scene corresponding to the audio segment pair { a1 and B1 }. Similarly, if it is determined that the audio scene corresponding to the audio segment pair { a2, B2} is music and the audio scene corresponding to the audio segment pair { A3, B3} is clean speech, it is known that, in the three audio segment pairs, 2 audio segment pairs corresponding to the audio scene of clean speech are present, that is, the ratio of the audio segment pair corresponding to the audio scene of clean speech to the total audio segment pair is maximum, it is determined that the audio scene corresponding to the audio signal to be recognized is clean speech.
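A sketch of the pair-wise decision followed by the count over pairs; the probabilities for the second and third pairs are assumed values chosen only so that the vote matches the example (music for {A2, B2}, clean speech for {A3, B3}):

```python
from collections import Counter
import numpy as np

def scene_for_pair(p_first, p_second):
    """Scene index with the largest mean prediction probability for one {A, B} pair."""
    return int(np.argmax((np.asarray(p_first) + np.asarray(p_second)) / 2))

def scene_by_vote(pair_results):
    """Pick the scene determined by the largest number of audio clip pairs."""
    votes = Counter(scene_for_pair(pf, ps) for pf, ps in pair_results)
    return votes.most_common(1)[0][0]

pairs = [([0.4, 0.7, 0.5], [0.5, 0.9, 0.7]),   # {A1, B1} -> clean speech
         ([0.8, 0.3, 0.4], [0.7, 0.2, 0.5]),   # {A2, B2} -> music (assumed values)
         ([0.3, 0.8, 0.5], [0.4, 0.9, 0.6])]   # {A3, B3} -> clean speech (assumed values)
scene_index = scene_by_vote(pairs)             # -> index of "clean speech"
```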
In some embodiments, when it is determined that the audio scene corresponding to the audio signal is the target audio scene, the terminal may further determine that the audio scene corresponding to the audio signal to be identified is a sub-audio scene in the target audio scene by:
acquiring static audio features of a first audio clip and static audio features of a second audio clip; inputting the static audio features of the first audio clip into a second recognition model for audio scene recognition to obtain a corresponding third recognition result, and inputting the static audio features of the second audio clip into the second recognition model for audio scene recognition to obtain a corresponding fourth recognition result; and determining a sub audio scene in the target audio scene corresponding to the audio signal by combining the third recognition result and the fourth recognition result.
Here, if the audio scene identified by the first recognition model is a target audio scene and the target audio scene includes a plurality of sub audio scenes, it is necessary to further identify which sub audio scene the audio signal to be recognized corresponds to. In actual implementation, the static audio features of each audio clip are obtained (the manner of obtaining the static audio features may refer to steps 2011 to 2014), the obtained static audio features are input into the second recognition model for audio scene recognition to obtain a third recognition result and a fourth recognition result, and, in a similar manner to the above, the sub audio scene in the target audio scene corresponding to the audio signal is determined by combining the third recognition result and the fourth recognition result.
For example, common audio scenes of real-time audio/video calls include music, clean speech, and a noise class, where the noise class further includes the sub audio scenes of noise and noisy speech. If the audio scene corresponding to the audio signal is determined to be the noise class, the static audio features of the first audio clip and the second audio clip are respectively input into the second recognition model for audio scene recognition, and by combining the obtained third recognition result and fourth recognition result in a similar manner, it is determined whether the audio signal corresponds to noise or noisy speech within the target audio scene.
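The two-stage decision can be summarised in a few lines; this is a schematic outline only, and first_model, second_model and fuse are hypothetical stand-ins for the first recognition model, the second recognition model and whichever score-fusion rule from the preceding sections is used:

```python
def recognize_scene(dyn_a, dyn_b, stat_a, stat_b, first_model, second_model, fuse):
    """Coarse scene first; if it is the noise class, refine into noise vs. noisy speech."""
    coarse = fuse(first_model(dyn_a), first_model(dyn_b))      # music / clean speech / noise class
    if coarse != "noise":
        return coarse
    # second stage works on static features and distinguishes the sub-scenes of the noise class
    return fuse(second_model(stat_a), second_model(stat_b))    # "noise" or "noisy speech"
```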
In some embodiments, before inputting the static audio features of the first audio segment into the second recognition model for audio scene recognition, the second recognition model may be further trained by:
obtaining static audio features of an audio signal sample, wherein the audio signal sample is marked with a corresponding audio scene; inputting the static audio features of the audio signal samples into a second recognition model, and performing audio scene recognition on the audio signal samples to obtain recognition results; acquiring the difference between the identification result and the label of the audio signal sample; based on the obtained difference, the model parameters of the second recognition model are updated.
Here, the network structures used by the first recognition model and the second recognition model may be the same, for example, both use compressed residual network models, and the difference is that the inputs of the first recognition model and the second recognition model are different, where the first recognition model inputs the dynamic audio features of the audio segment, and the second recognition model inputs the static audio features of the audio segment, and the loss functions used in the training process may also be the same, for example, both use cross entropy as the loss function.
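As a concrete illustration of this shared structure, the following is a minimal sketch of a small residual convolutional classifier written in PyTorch. The layer widths, the class names ResidualBlock and SmallResNet, and the choice of PyTorch are assumptions made for illustration; the embodiment only specifies a compressed residual network. Under these assumptions, the first recognition model would be instantiated with in_channels=3 (the three-dimensional dynamic features) and the second with in_channels=1 (the static log-Mel features):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """3x3 conv -> BN -> ReLU -> 3x3 conv -> BN with an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)

class SmallResNet(nn.Module):
    """A deliberately small ("compressed") residual classifier.

    in_channels=3 for the dynamic features of the first recognition model,
    in_channels=1 for the static features of the second recognition model;
    num_classes=3 or 2 accordingly.
    """
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(16), nn.ReLU())
        self.blocks = nn.Sequential(ResidualBlock(16), ResidualBlock(16))
        self.head = nn.Linear(16, num_classes)

    def forward(self, x):                      # x: (batch, C, n_mels, frames)
        h = self.blocks(self.stem(x))
        h = h.mean(dim=(2, 3))                 # global average pooling
        return self.head(h)                    # logits; softmax gives probabilities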
Next, an exemplary application of the embodiment of the present application in a practical application scenario will be described.
Audio scene recognition is the recognition of the audio scenes contained in a continuous audio stream, such as speech, music, and so on. Audio-video real-time calling is a multi-person voice and video chat function implemented over the Internet, and its common use scenes fall into four types: music, noise, clean speech and noise-added speech. Classifying the common scenes of a real-time call makes it convenient to carry out subsequent processing for different scenes, for example improving speech clarity in noise, improving the listening quality of music, and so on.
In audio-video real-time communication, bandwidth is limited, so the model size is strictly constrained, and low computational complexity and low delay must also be guaranteed. In the present application, a compressed ResNet network is used to train the recognition models, and a two-stage classification scheme is adopted to train two ResNet models: one is the first-stage classifier (namely the above first recognition model) and the other is the second-stage classifier (namely the above second recognition model). The first-stage classifier is a three-class classifier used to distinguish music, clean speech and noise, and the second-stage classifier is a two-class classifier used to distinguish noise and noise-added speech within the noise class. During classification and recognition, the audio signal is divided into audio clips of different durations, the audio features of the long-duration audio clip (namely the first audio clip) and the short-duration audio clip (namely the second audio clip) are respectively used as inputs to the same classifier, and the audio scene corresponding to the audio signal is obtained through score fusion as post-processing.
Referring to fig. 8, fig. 8 is a schematic data flow diagram of audio scene recognition provided in the embodiment of the present application. As shown in fig. 8, audio clip extraction is first performed on the acquired audio signal to be recognized to obtain audio clip A (i.e., the first audio clip) and audio clip B (i.e., the second audio clip), where audio clip A contains audio clip B, the duration of audio clip A is greater than the duration of audio clip B, and each audio clip A corresponds to one audio clip B.
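A minimal sketch of this paired clip extraction is shown below; the clip durations (3 s for clip A, 1 s for clip B) and the placement of B at the tail of A are assumptions made for illustration, since the embodiment only requires that clip A contains clip B and has the longer duration:

import numpy as np

def extract_pairs(signal, sr, long_sec=3.0, short_sec=1.0):
    """Slide a long window over the signal; each long clip A is paired with
    one short clip B taken from inside it (here: its trailing part)."""
    long_len, short_len = int(long_sec * sr), int(short_sec * sr)
    pairs = []
    for start in range(0, len(signal) - long_len + 1, long_len):
        clip_a = signal[start:start + long_len]
        clip_b = clip_a[-short_len:]          # B lies inside A, so A contains B
        pairs.append((clip_a, clip_b))
    return pairs

# e.g. a 9-second mono signal at 16 kHz yields three {A, B} pairs
sig = np.random.randn(9 * 16000).astype(np.float32)
print(len(extract_pairs(sig, 16000)))          # 3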
Then, feature extraction is performed on audio clip A and audio clip B. Referring to fig. 9, fig. 9 is a schematic diagram of the feature extraction flow provided in an embodiment of the present application; the processing shown in fig. 9 is performed on audio clip A and audio clip B respectively:
Step 401: perform framing processing on the audio clip to obtain at least two corresponding audio frames;
Step 402: perform windowing processing on the at least two audio frames to obtain a corresponding windowed audio signal;
Step 403: perform fast Fourier transform on the windowed audio signal to obtain a corresponding audio frequency spectrum;
Step 404: square the audio frequency spectrum to obtain a corresponding audio power spectrum;
Step 405: perform Mel filtering on the audio power spectrum to obtain a corresponding audio Mel frequency spectrum;
Step 406: perform logarithm processing on the audio Mel frequency spectrum to obtain corresponding logarithmic Mel frequency spectrum features.
Here, the logarithmic mel-frequency spectrum feature is the static audio feature of the above-mentioned audio segment.
Step 407: perform first-order difference processing on the logarithmic Mel frequency spectrum features to obtain corresponding first-order difference features;
Step 408: perform second-order difference processing on the logarithmic Mel frequency spectrum features to obtain corresponding second-order difference features;
Step 409: splice the logarithmic Mel frequency spectrum features, the first-order difference features and the second-order difference features to obtain the three-dimensional Mel frequency spectrum dynamic features.
Here, the three-dimensional mel-frequency spectrum dynamic characteristics are the above dynamic audio characteristics.
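Steps 401 to 409 can be summarized in the following sketch, which uses librosa for illustration; the frame length, hop length and number of Mel bands are assumed values not specified in the embodiment, and librosa.feature.melspectrogram internally performs the framing, windowing, FFT, squaring and Mel filtering of steps 401 to 405:

import numpy as np
import librosa

def static_and_dynamic_features(clip, sr, n_mels=64, n_fft=1024, hop=512):
    # Steps 401-405: framing, windowing, FFT, power spectrum, Mel filtering
    mel = librosa.feature.melspectrogram(y=clip, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels, power=2.0)
    # Step 406: logarithm -> log-Mel spectrogram (the static audio feature)
    log_mel = librosa.power_to_db(mel)
    # Steps 407-408: first- and second-order differences along the time axis
    delta1 = librosa.feature.delta(log_mel, order=1)
    delta2 = librosa.feature.delta(log_mel, order=2)
    # Step 409: stack into a 3-channel feature (the dynamic audio feature)
    dynamic = np.stack([log_mel, delta1, delta2], axis=0)
    return log_mel, dynamic   # shapes: (n_mels, frames), (3, n_mels, frames)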
Next, the three-dimensional Mel frequency spectrum dynamic features corresponding to audio clip A are input into the first-stage classifier for audio scene recognition to obtain a first recognition result, the three-dimensional Mel frequency spectrum dynamic features corresponding to audio clip B are input into the first-stage classifier for audio scene recognition to obtain a second recognition result, and the audio scene corresponding to the audio signal, such as music, noise or clean speech, is determined by combining the first recognition result and the second recognition result.
When the audio scene corresponding to the audio signal is determined to be noise, the logarithmic Mel frequency spectrum features corresponding to audio clip A and audio clip B are respectively input into the second-stage classifier, and whether the audio scene corresponding to the audio signal is noise or noise-added speech is further judged. Assume that C1,1, C1,2 and C1,3 represent the three audio scenes of the first-stage classifier, C2,1 and C2,2 represent the two audio scenes of the second-stage classifier, where C2,1, C2,2 ∈ C1,2, and F1 and F2 represent the output results of the first-stage classifier and the second-stage classifier. For an input x, the final predicted audio scene Class(x) is:
Class(x) = F1(x), if F1(x) ∈ {C1,1, C1,3};  Class(x) = F2(x), if F1(x) = C1,2
The second-stage classifier outputs a prediction probability value for each audio clip, the prediction probability value lying in the range [0, 1]. The mean of the prediction probability values of the audio clips is then obtained; if the mean is greater than 0.5, the audio scene corresponding to the audio signal is determined to be noise, otherwise the audio scene corresponding to the audio signal is determined to be noise-added speech.
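Putting the two stages together, the decision rule and the 0.5 threshold can be sketched as follows; the scene names, the assumed ordering of C1,1 to C1,3, and the helper functions f1 and f2 are illustrative assumptions, and the first-stage fusion is simplified here to a single average over all clips rather than the pair-wise fusion and voting described earlier:

import numpy as np

PRIMARY = ["music", "noise_class", "clean_speech"]   # C1,1  C1,2  C1,3 (order assumed)
SECONDARY = ["noise", "noisy_speech"]                 # C2,1  C2,2, sub-scenes of C1,2

def classify(signal_pairs, f1, f2):
    """f1(clip) -> primary scene probabilities; f2(clip) -> P(noise) in [0, 1]."""
    # First stage: average the primary probabilities over all long/short clips
    p1 = np.mean([f1(c) for a, b in signal_pairs for c in (a, b)], axis=0)
    scene = PRIMARY[int(np.argmax(p1))]
    if scene != "noise_class":
        return scene                                   # Class(x) = F1(x)
    # Second stage: average the secondary classifier's noise probability
    p_noise = np.mean([f2(c) for a, b in signal_pairs for c in (a, b)])
    return "noise" if p_noise > 0.5 else "noisy_speech"   # Class(x) = F2(x)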
Finally, the training of the first-stage and second-stage classifiers in the present application is explained. Referring to fig. 10, fig. 10 is a schematic diagram of classifier training provided in this embodiment of the present application. As shown in fig. 10, the training processes of the first-stage classifier and the second-stage classifier are exactly the same. Before training, audio scene corpora need to be collected, that is, audio signal samples are collected; different clean speech and noise signals are superimposed in the time domain, with the signal-to-noise ratio kept between -5 dB and 10 dB during superposition, to generate noise-added speech signals. The noise-added speech signals, the clean speech, the noise and the music signals constitute the audio signal training samples in the corpus, where the audio signal training samples are labeled with corresponding audio scenes.
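The superposition at a controlled signal-to-noise ratio can be sketched as follows; the function name mix_at_snr is an assumption, and the SNR is drawn from the -5 dB to 10 dB range mentioned above:

import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db,
    then superimpose it on the clean speech in the time domain."""
    noise = np.resize(noise, speech.shape)             # loop/trim noise to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

rng = np.random.default_rng(0)
snr = rng.uniform(-5.0, 10.0)                          # SNR kept within [-5, 10] dB
# noisy_speech = mix_at_snr(clean_speech, noise, snr)  # clean_speech, noise: 1-D arrays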
The network structures adopted by the first-stage classifier and the second-stage classifier are the same, for example both adopt a compressed residual network model; the difference is that their inputs differ: the input of the first-stage classifier is the three-dimensional Mel frequency spectrum dynamic features of an audio signal sample, while the input of the second-stage classifier is the logarithmic Mel frequency spectrum features of the audio signal sample. Therefore, when feature extraction is performed on the audio signal samples, the processing of steps 401 to 409 needs to be performed on the audio signal samples for training the first-stage classifier, and the processing of steps 401 to 406 needs to be performed on the audio signal samples for training the second-stage classifier. After the features are extracted, in the training process, the extracted features of the audio signal samples are input into the residual network model for training to obtain a recognition result, the difference between the recognition result and the label of the audio signal sample is obtained, and the model parameters of the residual network model are updated based on the obtained difference.
In actual implementation, the value of the loss function of the training model can be determined according to the difference between the recognition result and the label of the audio signal sample. In the training process, cross entropy can be used as the loss function; when the value of the loss function reaches a preset threshold, a corresponding error signal is determined based on the value of the loss function, the error signal is back-propagated in the training model, and the model parameters of each layer of the training model are updated during propagation until convergence.
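A minimal training sketch under these assumptions (PyTorch, a small residual model such as the SmallResNet sketched earlier, and Adam as the optimizer, which the embodiment does not specify) might look like this:

import torch
import torch.nn as nn

def train(model, loader, epochs=10, lr=1e-3, device="cpu"):
    """loader yields (features, label) pairs: (3, n_mels, frames) tensors for the
    first-stage classifier or (1, n_mels, frames) for the second-stage one."""
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()          # cross entropy as the loss function
    for _ in range(epochs):
        for features, labels in loader:
            features, labels = features.to(device), labels.to(device)
            logits = model(features)
            loss = criterion(logits, labels)   # difference between result and label
            optimizer.zero_grad()
            loss.backward()                    # back-propagate the error signal
            optimizer.step()                   # update model parameters of each layer
    return model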
In this way, the audio scene recognition method provided in the embodiment of the present application combines two-stage classification with classification over long and short audio clips. Compared with a common audio scene recognition system, it achieves high accuracy while meeting the requirements of low computational complexity and low delay, which matches the use scenes and requirements of real-time audio-video communication.
Continuing with the exemplary structure in which the audio scene recognition device 555 provided in the embodiment of the present application is implemented as software modules, in some embodiments, as shown in fig. 11, which is a schematic structural diagram of the audio scene recognition device provided in the embodiment of the present application, the software modules stored in the audio scene recognition device 555 of the memory 550 may include:
the segment extracting module 5551 is configured to perform audio segment extraction on the audio signal to be identified to obtain a first audio segment and a second audio segment;
wherein the first audio clip comprises the second audio clip and the duration of the first audio clip is greater than the duration of the second audio clip;
a feature obtaining module 5552, configured to obtain a dynamic audio feature of the first audio segment and a dynamic audio feature of the second audio segment;
the scene recognition module 5553 is configured to input the dynamic audio features of the first audio segment into a first recognition model for audio scene recognition to obtain a corresponding first recognition result, and input the dynamic audio features of the second audio segment into the first recognition model for audio scene recognition to obtain a corresponding second recognition result;
a scene determining module 5554, configured to determine an audio scene corresponding to the audio signal by combining the first recognition result and the second recognition result.
In some embodiments, the segment extracting module is further configured to perform audio segment extraction on the audio signal through a first window to obtain a first audio segment;
performing audio clip extraction on the audio signal through a second window to obtain a second audio clip;
the time domain corresponding to the first window comprises the time domain corresponding to the second window, and the window size of the first window is larger than that of the second window.
In some embodiments, the feature obtaining module is further configured to perform the following operations on the first audio segment and the second audio segment, respectively:
acquiring static audio features of an audio clip;
performing first-order difference processing on the static audio features to obtain corresponding first-order difference features;
performing second-order differential processing on the static audio features to obtain corresponding second-order differential features;
and splicing the static audio features, the first-order difference features and the second-order difference features to obtain the dynamic audio features of the audio clip.
In some embodiments, the feature obtaining module is further configured to perform fast fourier transform on the audio segment to obtain a corresponding audio frequency spectrum;
squaring the audio frequency spectrum to obtain a corresponding audio power spectrum;
carrying out Mel filtering on the audio power spectrum to obtain a corresponding audio Mel frequency spectrum;
and carrying out logarithm processing on the audio Mel frequency spectrum to obtain corresponding logarithmic Mel frequency spectrum characteristics, and determining the logarithmic Mel frequency spectrum characteristics as the static audio frequency characteristics of the audio frequency fragment.
In some embodiments, the feature obtaining module is further configured to perform framing processing on the audio segment to obtain at least two corresponding audio frames;
windowing the at least two audio frames to obtain corresponding windowed audio signals;
and carrying out fast Fourier transform on the windowed audio signal to obtain a corresponding audio frequency spectrum.
In some embodiments, the apparatus further comprises a first recognition model training module to, prior to inputting the dynamic audio features of the first audio segment into the first recognition model,
acquiring dynamic audio features of audio signal samples, wherein the audio signal samples are marked with corresponding audio scenes;
inputting the dynamic audio features of the audio signal samples into a first recognition model, and performing audio scene recognition on the audio signal samples to obtain recognition results;
obtaining a difference between the recognition result and the label of the audio signal sample;
updating model parameters of the first recognition model based on the obtained difference.
In some embodiments, the first recognition model training module is further configured to determine an error signal for the first recognition model based on the difference when the difference exceeds a difference threshold;
and reversely propagating the error signal in the first recognition model, and updating the model parameters of each layer in the process of propagation.
In some embodiments, the first recognition result characterizes a first prediction probability that the audio signal corresponds to a different audio scene, and the second recognition result characterizes a second prediction probability that the audio signal corresponds to a different audio scene;
the scene determining module is further configured to obtain mean values of the first prediction probability and the second prediction probability in the same audio scene respectively;
and taking the audio scene with the maximum average value as the audio scene corresponding to the audio signal.
In some embodiments, the first recognition result characterizes a first prediction probability that the audio signal corresponds to different audio scenes, the second recognition result characterizes a second prediction probability that the audio signal corresponds to different audio scenes, and when the number of the second audio pieces is at least two,
the scene determining module is further configured to determine, based on the second recognition results of the second audio segments, a first mean value of the second prediction probability values in the same audio scene corresponding to the second audio segments, respectively;
acquiring a second average value of the first prediction probability and the first average value corresponding to the same audio scene;
and taking the audio scene with the largest second average value as the audio scene corresponding to the audio signal.
In some embodiments, the first audio piece and the second audio piece constitute an audio piece pair, and when the number of audio piece pairs is at least two,
the scene determining module is further configured to determine an audio scene corresponding to each of the audio segment pairs based on the first recognition result and the second recognition result of each of the audio segment pairs, respectively;
respectively acquiring the number of audio segment pairs corresponding to each determined audio scene;
and determining the audio scene corresponding to the audio signal based on the number of the audio fragment pairs corresponding to each audio scene.
In some embodiments, when it is determined that the audio scene corresponding to the audio signal is the target audio scene, the apparatus further includes a second recognition processing module configured to:
Acquiring static audio features of the first audio clip and static audio features of the second audio clip;
inputting the static audio features of the first audio clip into a second recognition model for audio scene recognition to obtain a corresponding third recognition result, and inputting the static audio features of the second audio clip into the second recognition model for audio scene recognition to obtain a corresponding fourth recognition result;
and determining that the audio signal corresponds to a sub audio scene in the target audio scene by combining the third recognition result and the fourth recognition result.
In some embodiments, the apparatus further comprises a second recognition model training module for, prior to inputting the static audio features of the first audio piece into a second recognition model for audio scene recognition,
obtaining static audio features of an audio signal sample, wherein the audio signal sample is marked with a corresponding audio scene;
inputting the static audio features of the audio signal samples into a second recognition model, and performing audio scene recognition on the audio signal samples to obtain recognition results;
obtaining a difference between the recognition result and the label of the audio signal sample;
updating model parameters of the second recognition model based on the obtained difference.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, so that the computer device executes the audio scene recognition method described in the embodiment of the present application.
Embodiments of the present application provide a computer-readable storage medium having stored therein executable instructions that, when executed by a processor, cause the processor to perform an audio scene recognition method provided by embodiments of the present application, for example, the method as illustrated in fig. 4.
In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (15)

1. A method for audio scene recognition, the method comprising:
carrying out audio clip extraction on an audio signal to be identified to obtain a first audio clip and a second audio clip; wherein the first audio clip comprises the second audio clip and the duration of the first audio clip is greater than the duration of the second audio clip;
acquiring the dynamic audio features of the first audio clip and the dynamic audio features of the second audio clip;
inputting the dynamic audio features of the first audio clip into a first recognition model for audio scene recognition to obtain a corresponding first recognition result, and inputting the dynamic audio features of the second audio clip into the first recognition model for audio scene recognition to obtain a corresponding second recognition result;
and determining an audio scene corresponding to the audio signal by combining the first recognition result and the second recognition result.
2. The method of claim 1, wherein said extracting the audio segment of the audio signal to obtain a first audio segment and a second audio segment comprises:
performing audio clip extraction on the audio signal through a first window to obtain a first audio clip;
performing audio clip extraction on the audio signal through a second window to obtain a second audio clip;
the time domain corresponding to the first window comprises the time domain corresponding to the second window, and the window size of the first window is larger than that of the second window.
3. The method of claim 1, wherein the obtaining the dynamic audio features of the first audio segment and the dynamic audio features of the second audio segment comprises:
performing the following operations on the first and second audio segments, respectively:
acquiring static audio features of an audio clip;
performing first-order difference processing on the static audio features to obtain corresponding first-order difference features;
performing second-order differential processing on the static audio features to obtain corresponding second-order differential features;
and splicing the static audio features, the first-order difference features and the second-order difference features to obtain the dynamic audio features of the audio clip.
4. The method of claim 3, wherein the obtaining the static audio features of the audio segment comprises:
performing fast Fourier transform on the audio clips to obtain corresponding audio frequency spectrums;
squaring the audio frequency spectrum to obtain a corresponding audio power spectrum;
carrying out Mel filtering on the audio power spectrum to obtain a corresponding audio Mel frequency spectrum;
and carrying out logarithm processing on the audio Mel frequency spectrum to obtain corresponding logarithmic Mel frequency spectrum characteristics, and determining the logarithmic Mel frequency spectrum characteristics as the static audio frequency characteristics of the audio frequency fragment.
5. The method of claim 4, wherein the fast Fourier transforming the audio segment to obtain a corresponding audio spectrum comprises:
performing framing processing on the audio clips to obtain at least two corresponding audio frames;
windowing the at least two audio frames to obtain corresponding windowed audio signals;
and carrying out fast Fourier transform on the windowed audio signal to obtain a corresponding audio frequency spectrum.
6. The method of claim 1, wherein prior to the inputting the dynamic audio features of the first audio segment into the first recognition model, the method further comprises:
acquiring dynamic audio features of audio signal samples, wherein the audio signal samples are marked with corresponding audio scenes;
inputting the dynamic audio features of the audio signal samples into a first recognition model, and performing audio scene recognition on the audio signal samples to obtain recognition results;
obtaining a difference between the recognition result and the label of the audio signal sample;
updating model parameters of the first recognition model based on the obtained difference.
7. The method of claim 6, wherein said updating model parameters of said first recognition model based on said obtained differences comprises:
determining an error signal for the first recognition model based on the difference when the difference exceeds a difference threshold;
and reversely propagating the error signal in the first recognition model, and updating the model parameters of each layer in the process of propagation.
8. The method according to any of claims 1 to 7, wherein the first recognition result characterizes a first prediction probability of the audio signal to be recognized corresponding to different audio scenes, and the second recognition result characterizes a second prediction probability of the audio signal corresponding to different audio scenes;
determining an audio scene corresponding to the audio signal by combining the first recognition result and the second recognition result, including:
respectively obtaining the mean value of the first prediction probability and the second prediction probability under the same audio scene;
and taking the audio scene with the maximum average value as the audio scene corresponding to the audio signal.
9. The method according to any of claims 1 to 7, wherein the first recognition result characterizes a first prediction probability of the audio signal for different audio scenes, the second recognition result characterizes a second prediction probability of the audio signal for different audio scenes, and when the number of the second audio segments is at least two,
determining an audio scene corresponding to the audio signal by combining the first recognition result and the second recognition result, including:
determining a first mean value of the second prediction probability values under the same audio scene corresponding to the second audio segments respectively based on second identification results of the second audio segments;
acquiring a second average value of the first prediction probability and the first average value corresponding to the same audio scene;
and taking the audio scene with the largest second average value as the audio scene corresponding to the audio signal.
10. The method according to any one of claims 1 to 7, wherein the first audio piece and the second audio piece constitute an audio piece pair, and when the number of the audio piece pair is at least two,
determining an audio scene corresponding to the audio signal by combining the first recognition result and the second recognition result, including:
determining an audio scene corresponding to each of the audio clip pairs based on the first recognition result and the second recognition result of each of the audio clip pairs, respectively;
respectively acquiring the number of audio segment pairs corresponding to each determined audio scene;
and determining the audio scene corresponding to the audio signal based on the number of the audio fragment pairs corresponding to each audio scene.
11. The method according to any one of claims 1 to 7, wherein when the audio scene corresponding to the audio signal is determined to be a target audio scene, the method further comprises:
acquiring static audio features of the first audio clip and static audio features of the second audio clip;
inputting the static audio features of the first audio clip into a second recognition model for audio scene recognition to obtain a corresponding third recognition result, and inputting the static audio features of the second audio clip into the second recognition model for audio scene recognition to obtain a corresponding fourth recognition result;
and determining that the audio signal corresponds to a sub audio scene in the target audio scene by combining the third recognition result and the fourth recognition result.
12. The method of claim 11, wherein prior to inputting the static audio features of the first audio segment into a second recognition model for audio scene recognition, the method further comprises:
obtaining static audio features of an audio signal sample, wherein the audio signal sample is marked with a corresponding audio scene;
inputting the static audio features of the audio signal samples into a second recognition model, and performing audio scene recognition on the audio signal samples to obtain recognition results;
obtaining a difference between the recognition result and the label of the audio signal sample;
updating model parameters of the second recognition model based on the obtained difference.
13. An audio scene recognition apparatus, characterized in that the apparatus comprises:
the segment extraction module is used for extracting audio segments of the audio signal to be identified to obtain a first audio segment and a second audio segment;
wherein the first audio clip comprises the second audio clip and the duration of the first audio clip is greater than the duration of the second audio clip;
the characteristic acquisition module is used for acquiring the dynamic audio characteristics of the first audio clip and the dynamic audio characteristics of the second audio clip;
the scene recognition module is used for inputting the dynamic audio features of the first audio clip into a first recognition model for audio scene recognition to obtain a corresponding first recognition result, and inputting the dynamic audio features of the second audio clip into the first recognition model for audio scene recognition to obtain a corresponding second recognition result;
and the scene determining module is used for determining the audio scene corresponding to the audio signal by combining the first recognition result and the second recognition result.
14. An electronic device, comprising:
a memory for storing executable instructions;
a processor for implementing the audio scene recognition method of any one of claims 1 to 12 when executing executable instructions stored in the memory.
15. A computer-readable storage medium storing executable instructions for implementing the audio scene recognition method of any one of claims 1 to 12 when executed by a processor.
CN202010794916.4A 2020-08-10 2020-08-10 Audio scene recognition method, device, equipment and computer readable storage medium Active CN112750459B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010794916.4A CN112750459B (en) 2020-08-10 2020-08-10 Audio scene recognition method, device, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN112750459A true CN112750459A (en) 2021-05-04
CN112750459B CN112750459B (en) 2024-02-02

Family

ID=75645375

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010794916.4A Active CN112750459B (en) 2020-08-10 2020-08-10 Audio scene recognition method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN112750459B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101645265A (en) * 2008-08-05 2010-02-10 中兴通讯股份有限公司 Method and device for identifying audio category in real time
CN102486920A (en) * 2010-12-06 2012-06-06 索尼公司 Audio event detection method and device
CN102968986A (en) * 2012-11-07 2013-03-13 华南理工大学 Overlapped voice and single voice distinguishing method based on long time characteristics and short time characteristics
CN108305616A (en) * 2018-01-16 2018-07-20 国家计算机网络与信息安全管理中心 A kind of audio scene recognition method and device based on long feature extraction in short-term
CN108717856A (en) * 2018-06-16 2018-10-30 台州学院 A kind of speech-emotion recognition method based on multiple dimensioned depth convolution loop neural network
US20200160845A1 (en) * 2018-11-21 2020-05-21 Sri International Real-time class recognition for an audio stream

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113793622A (en) * 2021-09-10 2021-12-14 中国科学院声学研究所 Audio scene recognition method, system and device
CN113793622B (en) * 2021-09-10 2023-08-29 中国科学院声学研究所 Audio scene recognition method, system and device
CN115334349A (en) * 2022-07-15 2022-11-11 北京达佳互联信息技术有限公司 Audio processing method and device, electronic equipment and storage medium
CN115334349B (en) * 2022-07-15 2024-01-02 北京达佳互联信息技术有限公司 Audio processing method, device, electronic equipment and storage medium
CN116070174A (en) * 2023-03-23 2023-05-05 长沙融创智胜电子科技有限公司 Multi-category target recognition method and system

Also Published As

Publication number Publication date
CN112750459B (en) 2024-02-02


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40043561

Country of ref document: HK

GR01 Patent grant
GR01 Patent grant