CN117834596A - Audio processing method, device, apparatus, storage medium and computer program product

Info

Publication number
CN117834596A
Authority
CN
China
Prior art keywords
audio
coding
processing
decoding
code stream
Prior art date
Legal status
Pending
Application number
CN202310212294.3A
Other languages
Chinese (zh)
Inventor
肖玮 (Xiao Wei)
刘文哲 (Liu Wenzhe)
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202310212294.3A
Publication of CN117834596A

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00: Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/80: Responding to QoS
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04: Speech or audio signals analysis-synthesis techniques using predictive techniques
    • G10L19/26: Pre-filtering or post-filtering
    • G10L19/265: Pre-filtering, e.g. high frequency emphasis prior to encoding
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

Provided are an artificial intelligence-based audio processing method, apparatus, electronic device, computer-readable storage medium, and computer program product, relating to artificial intelligence technology. The method comprises the following steps: performing feature extraction processing on an audio signal to obtain audio features of the audio signal; filtering the audio features of the audio signal, and pooling the filtered audio features to obtain sampling features of the audio signal; performing feature coding processing on the sampling features of the audio signal to obtain audio coding features of the audio signal; and performing signal coding processing on the audio coding features of the audio signal to obtain a code stream of the audio signal. The method and apparatus can improve audio coding efficiency.

Description

Audio processing method, device, apparatus, storage medium and computer program product
Technical Field
The present application relates to artificial intelligence technology, and in particular, to an artificial intelligence-based audio processing method, apparatus, electronic device, computer readable storage medium, and computer program product.
Background
Artificial Intelligence (AI) is a comprehensive technology of computer science; by researching the design principles and implementation methods of various intelligent machines, it gives machines the abilities of sensing, reasoning and decision-making. Artificial intelligence is a comprehensive discipline covering a wide range of fields, such as natural language processing and machine learning/deep learning; as technology develops, artificial intelligence will be applied in ever more fields and take on increasingly important value.
Audio codec technology is one of the important applications in the field of artificial intelligence, and is a core technology in communication services such as remote audio-video calls. Put simply, speech coding technology uses as few network bandwidth resources as possible to transmit as much speech information as possible. From the perspective of Shannon's information theory, speech coding is a form of source coding; the purpose of source coding is to compress, at the encoding end, the data volume of the information to be transmitted as much as possible by removing redundancy in the information, and to recover the information losslessly (or near-losslessly) at the decoding end.
In the related art, the quality of audio compression is greatly compromised during encoding in order to guarantee the speed of audio transmission.
Disclosure of Invention
Embodiments of the present application provide an audio processing method, apparatus, electronic device, computer readable storage medium and computer program product based on artificial intelligence, which can improve audio coding efficiency.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides an audio processing method based on artificial intelligence, which comprises the following steps:
performing feature extraction processing on an audio signal to obtain audio features of the audio signal;
Filtering the audio characteristics of the audio signals, and pooling the filtered audio characteristics to obtain sampling characteristics of the audio signals;
performing feature coding processing on the sampling features of the audio signal to obtain audio coding features of the audio signal;
and performing signal coding processing on the audio coding characteristics of the audio signal to obtain a code stream of the audio signal.
The embodiment of the application provides an audio processing method based on artificial intelligence, which comprises the following steps:
performing signal decoding processing on the code stream to obtain audio coding characteristics corresponding to the code stream;
performing feature decoding processing on the audio coding features corresponding to the code stream to obtain sampling features corresponding to the code stream;
pooling the sampling features corresponding to the code stream, and filtering the pooled sampling features to obtain audio features corresponding to the code stream;
and carrying out feature reconstruction processing on the audio features corresponding to the code stream to obtain an audio signal corresponding to the code stream.
An embodiment of the present application provides an audio processing apparatus, including:
the feature extraction module is used for carrying out feature extraction processing on the audio signal to obtain the audio feature of the audio signal;
The first pooling module is used for carrying out filtering processing on the audio characteristics of the audio signals and pooling processing on the filtered audio characteristics to obtain sampling characteristics of the audio signals;
the feature coding module is used for carrying out feature coding processing on the sampling features of the audio signal to obtain audio coding features of the audio signal;
and the signal coding module is used for carrying out signal coding processing on the audio coding characteristics of the audio signal to obtain a code stream of the audio signal.
An embodiment of the present application provides an audio processing apparatus, including:
the signal decoding module is used for carrying out signal decoding processing on the code stream to obtain audio coding characteristics corresponding to the code stream;
the feature decoding module is used for carrying out feature decoding processing on the audio coding features corresponding to the code stream to obtain sampling features corresponding to the code stream;
the second pooling module is used for pooling the sampling characteristics corresponding to the code stream and filtering the pooled sampling characteristics to obtain audio characteristics corresponding to the code stream;
and the characteristic reconstruction module is used for carrying out characteristic reconstruction processing on the audio characteristics corresponding to the code stream to obtain an audio signal corresponding to the code stream.
An embodiment of the present application provides an electronic device for audio processing, including:
a memory for storing executable instructions;
and the processor is used for realizing the audio processing method based on artificial intelligence when executing the executable instructions stored in the memory.
The embodiment of the application provides a computer readable storage medium, which stores executable instructions for causing a processor to execute, so as to implement the audio processing method based on artificial intelligence.
Embodiments of the present application provide a computer program product comprising a computer program or instructions which, when executed by a processor, implement the artificial intelligence based audio processing method provided by the embodiments of the present application.
The embodiment of the application has the following beneficial effects:
the audio characteristics of the audio signals are filtered and pooled, and the signal processing technology and the artificial intelligence technology are organically combined, so that high-quality data sampling is realized, the introduced distortion of the audio signals is reduced from the data source in the coding process, the audio coding quality is improved, and the characteristic coding processing and the signal coding processing are performed on the sampling characteristics of the audio signals, so that the audio coding efficiency is improved under the condition of ensuring the audio quality.
Drawings
Fig. 1 is a schematic diagram of spectrum comparison at different code rates according to an embodiment of the present application;
Fig. 2 is a schematic architecture diagram of an audio codec system according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
Fig. 4 is a flow chart of an artificial intelligence based audio processing method provided in an embodiment of the present application;
Fig. 5 is a flow chart of an artificial intelligence based audio processing method provided in an embodiment of the present application;
Fig. 6 is a schematic diagram of an end-to-end voice communication link provided by an embodiment of the present application;
Fig. 7 is a flow chart of a speech coding method based on an improved filtering operator according to an embodiment of the present application;
Fig. 8A is a schematic diagram of a sampling operation provided by an embodiment of the present application;
Fig. 8B is a schematic diagram of a pulse signal with a length of 31 according to an embodiment of the present application;
Fig. 8C is a schematic diagram of a pulse signal with a length of 5 according to an embodiment of the present application;
Fig. 9A is a schematic diagram of a generic convolutional network provided by an embodiment of the present application;
Fig. 9B is a schematic diagram of a hole convolution network provided by an embodiment of the present application;
Fig. 10 is a schematic diagram of a preprocessing network and an encoding network provided by an embodiment of the present application;
Fig. 11 is a schematic flow chart of preprocessing provided in an embodiment of the present application;
Fig. 12 is a schematic diagram of a decoding network and a post-processing network provided by an embodiment of the present application;
Fig. 13 is a schematic illustration of interpolation provided by an embodiment of the present application;
Fig. 14 is a flow diagram of post-processing provided by an embodiment of the present application;
Fig. 15 is a signal diagram of an upsampling filter provided in an embodiment of the present application;
Fig. 16 is a schematic flow chart of post-processing provided in an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail with reference to the accompanying drawings, and the described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without making any inventive effort are within the scope of the present application.
In the following description, the terms "first", "second", and the like are merely used to distinguish similar objects and do not represent a particular ordering of the objects, it being understood that the "first", "second", and the like may be interchanged with one another, if permitted, to enable embodiments of the application described herein to be practiced otherwise than as illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
Before further describing embodiments of the present application in detail, the terms and expressions referred to in the embodiments of the present application are explained as follows.
1) Neural Network (NN): an algorithmic mathematical model that imitates the behavioral characteristics of biological neural networks and performs distributed parallel information processing. Such a network depends on the complexity of the system, and achieves the purpose of processing information by adjusting the interconnection relationships among a large number of internal nodes.
2) Deep Learning (DL): a new research direction in the field of Machine Learning (ML). Deep learning learns the inherent regularities and representation hierarchies of sample data, and the information obtained in the learning process greatly helps the interpretation of data such as text, images and sounds. Its ultimate goal is to give machines human-like analytical learning capabilities, able to recognize data such as text, images and sounds.
3) Quantization: the process of approximating the continuous values of a signal (or a large number of discrete values) by a finite number of (or fewer) discrete values. Quantization includes vector quantization (VQ, Vector Quantization) and scalar quantization, among others.
Among these, vector quantization is an effective lossy compression technique, and its theoretical basis is shannon's rate distortion theory. The basic principle of vector quantization is to replace the input vector with the index of the codeword in the codebook that best matches the input vector for transmission and storage, and only a simple look-up table operation is needed for decoding. For example, a vector space is formed by combining a plurality of scalar data, the vector space is divided into a plurality of small areas, and the vector which falls into the small areas during quantization is replaced by a corresponding index.
Scalar quantization is the quantization of a scalar, i.e., one-dimensional vector quantization: the dynamic range is divided into several intervals, each with a representative value (i.e., index). When the input signal falls within a certain interval, it is quantized to that representative value. A sketch of both quantizer types is given below.
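The following is a minimal, hypothetical sketch of the two quantizer types described above; the step size and codebook are illustrative values, not taken from the patent.

```python
import numpy as np

def scalar_quantize(x, step=0.1):
    """Uniform scalar quantization: each component is mapped independently
    to an interval index; decoding multiplies the index by the step size."""
    indices = np.round(x / step).astype(int)  # encode: value -> index
    return indices, indices * step            # decode: index -> representative value

def vector_quantize(x, codebook):
    """Vector quantization: transmit the index of the nearest codeword;
    decoding is a simple table lookup."""
    idx = int(np.argmin(np.linalg.norm(codebook - x, axis=1)))
    return idx, codebook[idx]

x = np.array([0.23, -0.41])
codebook = np.array([[0.2, -0.4], [0.5, 0.1], [-0.3, 0.3]])
print(scalar_quantize(x))            # (array([ 2, -4]), array([ 0.2, -0.4]))
print(vector_quantize(x, codebook))  # (0, array([ 0.2, -0.4]))
```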
4) Entropy coding: a lossless coding scheme that, following the entropy principle, loses no information in the coding process; it is also a key module in lossy coding, located at the end of the encoder. Entropy coding includes Shannon coding, Huffman coding, exponential Golomb coding (Exp-Golomb) and arithmetic coding.
5) Quadrature mirror filter bank (QMF, quadrature Mirror Filters): is a filter pair comprising analysis-synthesis, wherein QMF analysis filters are used for subband signal decomposition to reduce the signal bandwidth, so that each subband signal can be successfully processed through its respective channel; the QMF synthesis filter is used for synthesizing each subband signal recovered by the decoding end, for example, reconstructing the original audio signal by zero value interpolation, band-pass filtering and other modes.
Speech coding techniques use less network bandwidth resources to deliver as much speech information as possible. The compression ratio of a speech codec can reach more than 10 times: 10 MB of original speech data needs only 1 MB for transmission after compression by the encoder, greatly reducing the bandwidth resources required for information transfer. For example, for a wideband speech signal with a sampling rate of 16000 Hz and a 16-bit sampling depth (the granularity with which speech intensity is recorded in each sample), the code rate (the amount of data transferred per unit time) of the uncompressed version is 256 kbps; with speech coding, even lossy coding, the quality of the reconstructed speech signal can approach the uncompressed version at a code rate of 10-20 kbps, even to the point of being audibly indistinguishable (see the worked check below). If a higher sampling rate service is required, such as ultra-wideband speech at 32000 Hz, the code rate range is at least 30 kbps.
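A quick arithmetic check of the uncompressed bitrate quoted above, using the values from the text:

```python
# Uncompressed bitrate = sampling rate x bit depth.
sampling_rate_hz = 16000      # wideband speech
bit_depth = 16                # bits per sample
kbps = sampling_rate_hz * bit_depth / 1000
print(kbps)                   # 256.0 kbps, matching the figure in the text
```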
In a communication system, standard speech codec protocols are deployed to ensure smooth communication, for example standards from international and domestic standards organizations such as ITU-T, 3GPP, IETF, AVS and CCSA, including G.711, G.722, the AMR series, EVS and OPUS. Fig. 1 shows a spectrum comparison at different code rates to demonstrate the relationship between compression code rate and quality. Curve 101 is the spectral curve of the original speech, i.e. the uncompressed signal; curve 102 is the spectral curve of OPUS encoding at a code rate of 20 kbps; curve 103 is the spectral curve of OPUS encoding at a code rate of 6 kbps. As can be seen from Fig. 1, as the coding rate increases, the compressed signal comes closer to the original signal.
In the related art, the principle of speech coding is roughly as follows: speech coding can either encode the speech waveform directly, sample by sample; or, based on the principles of human vocal production, extract relevant low-dimensional features that are encoded at the encoding end, with the decoding end reconstructing the speech signal from these parameters.
These coding principles all derive from speech signal modeling, i.e., compression methods based on signal processing, which cannot guarantee audio coding quality. In order to improve coding efficiency while guaranteeing speech quality, embodiments of the present application provide an artificial intelligence-based audio processing method, apparatus, electronic device, computer-readable storage medium, and computer program product. An exemplary application of the electronic device provided by the embodiments of the present application is described below; the electronic device may be implemented as a terminal device, as a server, or cooperatively by a terminal device and a server. The following description takes the case where the electronic device is implemented as a terminal device.
Referring to fig. 2, fig. 2 is a schematic architecture diagram of an audio codec system 10 according to an embodiment of the present application, where the audio codec system 10 includes: server 200, network 300, terminal device 400 (i.e., encoding side), and terminal device 500 (i.e., decoding side), wherein network 300 may be a local area network, or a wide area network, or a combination of both.
In some embodiments, a client 410 is running on the terminal device 400, and the client 410 may be various types of clients, such as an instant messaging client, a web conference client, a live client, a browser, and the like. The client 410 responds to an audio collection instruction triggered by a sender (such as an initiator of a network conference, a host, an initiator of a voice call, etc.), invokes a microphone of the terminal device 400 to collect an audio signal, and encodes the collected audio signal to obtain a code stream.
For example, the client 410 invokes the artificial intelligence-based audio processing method provided in the embodiments of the present application to encode the collected audio signal: feature extraction processing is performed on the audio signal to obtain the audio features of the audio signal; the audio features of the audio signal are filtered and pooled to obtain sampling features of the audio signal; feature coding processing is performed on the sampling features of the audio signal to obtain audio coding features of the audio signal; and signal coding processing is performed on the audio coding features of the audio signal to obtain a code stream of the audio signal. The encoding end (i.e., the terminal device 400) combines signal processing technology with a deep neural network to achieve high-quality data sampling, ensuring that the distortion introduced into the audio signal during encoding is reduced at the data source and improving audio coding quality; feature coding and signal coding of the sampling features of the audio signal then improve audio coding efficiency while guaranteeing audio quality.
The client 410 may send the code stream to the server 200 over the network 300 to cause the server 200 to send the code stream to the associated terminal device 500 of the recipient (e.g., the participant of the web conference, the viewer, the recipient of the voice call, etc.).
After receiving the code stream sent by the server 200, the client 510 (e.g., an instant messaging client, a web conference client, a live client, a browser, etc.) may perform decoding processing on the code stream to obtain an audio signal, thereby implementing audio communication.
For example, the client 510 invokes the artificial intelligence-based audio processing method provided in the embodiment of the present application to decode the received code stream, that is, perform signal decoding processing on the code stream, to obtain the audio coding feature corresponding to the code stream; performing feature decoding processing on the audio coding features corresponding to the code stream to obtain sampling features corresponding to the code stream; filtering and pooling the sampling characteristics corresponding to the code stream to obtain audio characteristics corresponding to the code stream; and carrying out feature reconstruction processing on the audio features corresponding to the code stream to obtain an audio signal corresponding to the code stream.
In some embodiments, the embodiments of the present application may be implemented by Cloud Technology, which refers to a hosting technology that unifies a series of resources such as hardware, software and networks in a wide area network or a local area network to implement computation, storage, processing and sharing of data.
Cloud technology is a general term for network technology, information technology, integration technology, management platform technology, application technology and the like based on the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support. The service interaction functions of the server 200 may be implemented through cloud technology.
By way of example, the server 200 shown in fig. 2 may be an independent physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, and basic cloud computing services such as big data and artificial intelligence platforms. The terminal device 400 and the terminal device 500 shown in fig. 2 may be, but are not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a car terminal, and the like. The terminal devices (e.g., terminal device 400 and terminal device 500) and the server 200 may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiment of the present application.
In some embodiments, the terminal device or the server 200 may also implement the audio processing method provided in the embodiments of the present application by running a computer program. For example, the computer program may be a native program or a software module in an operating system; the method can be a local (Native) Application program (APP), namely a program which can be run only by being installed in an operating system, such as a live APP, a network conference APP, or an instant messaging APP; the method can also be an applet, namely a program which can be run only by being downloaded into a browser environment; but also an applet that can be embedded in any APP. In general, the computer programs described above may be any form of application, module or plug-in.
In some embodiments, multiple servers may be organized into a blockchain, and server 200 may be a node on the blockchain, where there may be an information connection between each node in the blockchain, and where information may be transferred between nodes via the information connection. The data (e.g., logic for audio processing, code stream) related to the audio processing method based on artificial intelligence provided in the embodiments of the present application may be stored on a blockchain.
Referring to fig. 3, fig. 3 is a schematic structural diagram of an electronic device 500 provided in an embodiment of the present application. Taking the electronic device 500 as a terminal device as an example, the electronic device 500 shown in fig. 3 includes: at least one processor 520, memory 550, at least one network interface 530, and a user interface 540. The various components in the electronic device 500 are coupled together by a bus system 550. It can be appreciated that the bus system 550 is used to implement connection and communication between these components. In addition to the data bus, the bus system 550 includes a power bus, a control bus and a status signal bus. For clarity of illustration, however, the various buses are all labeled as bus system 550 in fig. 3.
The processor 520 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor (e.g., a microprocessor or any conventional processor), a digital signal processor (DSP, Digital Signal Processor), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
The user interface 540 includes one or more output devices 541, including one or more speakers and/or one or more visual displays, that enable presentation of media content. The user interface 540 also includes one or more input devices 542 that include user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 550 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 550 may optionally include one or more storage devices physically located remote from processor 520.
Memory 550 includes volatile memory or nonvolatile memory, and may also include both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM); the volatile memory may be a Random Access Memory (RAM). The memory 550 described in embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 550 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 551 including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and handling hardware-based tasks;
network communication module 552 for accessing other computing devices via one or more (wired or wireless) network interfaces 530, exemplary network interfaces 530 include: bluetooth, wireless compatibility authentication (WiFi), and universal serial bus (USB, universal Serial Bus), etc.;
A presentation module 553 for enabling presentation of information (e.g., a user interface for operating a peripheral device and displaying content and information) via one or more output devices 541 (e.g., a display screen, speakers, etc.) associated with the user interface 540;
the input processing module 554 is configured to detect one or more user inputs or interactions from one of the one or more input devices 542 and translate the detected inputs or interactions.
In some embodiments, the artificial intelligence-based audio processing device provided in the embodiments of the present application may be implemented in software, and fig. 3 shows an artificial intelligence-based audio processing device 555 stored in a memory 550, which may be software in the form of a program, a plug-in, or the like, including the following software modules: the feature extraction module 5551, the first pooling module 5552, the feature encoding module 5553, the signal encoding module 5554, or the signal decoding module 5555, the feature decoding module 5556, the second pooling module 5557, and the feature reconstruction module 5558, wherein the feature extraction module 5551, the first pooling module 5552, the feature encoding module 5553, the signal encoding module 5554 are configured to implement an audio encoding function, and the signal decoding module 5555, the feature decoding module 5556, the second pooling module 5557, and the feature reconstruction module 5558 are configured to implement an audio decoding function, and these modules are logical, so any combination or further splitting may be performed according to the implemented functions.
As previously described, the artificial intelligence based audio processing method provided by the embodiments of the present application may be implemented by various types of electronic devices. Referring to fig. 4, fig. 4 is a schematic flow chart of an audio processing method based on artificial intelligence according to an embodiment of the present application, in which an audio encoding function is implemented; the following description refers to the steps shown in fig. 4.
In step 101, a feature extraction process is performed on an audio signal, so as to obtain an audio feature of the audio signal.
As an example of acquiring an audio signal, the encoding end responds to an audio acquisition instruction triggered by a sender (such as an initiator of a web conference, a host, an initiator of a voice call, etc.), and invokes a microphone of a terminal device of the encoding end to acquire the audio signal (also called an input signal).
It should be noted that, after the audio signal is obtained, feature extraction is performed on it through a neural network model to obtain the audio features of the audio signal, where the audio features are initial features capable of fully and faithfully characterizing the audio signal.
It should be noted that the embodiments of the present application do not limit the feature extraction process of the neural network model. For example, the neural network model in the embodiments of the present application may be implemented by a causal convolution module: if the audio signal is a 1×320 tensor, a 24-channel causal convolution module expands the 1×320 tensor into a 24×320 tensor (the audio features), and encoding processing is then performed on the 24×320 tensor to compress the audio signal, as sketched below.
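A minimal PyTorch sketch, under stated assumptions, of such a causal convolution module; the kernel size and left-only padding (which is what makes the convolution causal) are assumptions, as the text does not specify them:

```python
import torch
import torch.nn as nn

# Causal 1-D convolution expanding a 1x320 input into 24x320 audio features.
class CausalConv(nn.Module):
    def __init__(self, in_ch=1, out_ch=24, k=3):
        super().__init__()
        self.pad = nn.ConstantPad1d((k - 1, 0), 0.0)  # pad on the left only
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=k)

    def forward(self, x):              # x: (batch, 1, 320)
        return self.conv(self.pad(x))  # -> (batch, 24, 320)

print(CausalConv()(torch.randn(1, 1, 320)).shape)  # torch.Size([1, 24, 320])
```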
In step 102, filtering is performed on the audio features of the audio signal, and pooling is performed on the filtered audio features to obtain sampling features of the audio signal.
Following step 101, the audio features of the audio signal are filtered and pooled to obtain the sampling features of the audio signal, where the feature dimension of the sampling features is smaller than that of the audio signal. By organically combining signal processing technology with artificial intelligence technology, high-quality data sampling is realized, the distortion introduced during encoding of the audio signal is reduced at the data source, and audio coding quality is improved.
In some embodiments, filtering audio features of an audio signal includes: performing low-pass filtering processing based on the digital signals on the audio characteristics of the audio signals to obtain filtered audio characteristics; pooling the filtered audio features to obtain sampling features of the audio signals, including: and performing down-sampling processing based on the neural network on the filtered audio characteristics to obtain the sampling characteristics of the audio signals.
It should be noted that the embodiments of the present application are not limited to the model structure of the neural network, for example, the neural network may be a deep neural network, a convolutional neural network, or the like.
It should be noted that, in a neural network, the encoding end generally compresses the data dimensions layer by layer, and the output of each network module undergoes a data sampling operation, i.e., Pooling. For example, downsampling is achieved by means of Average Pooling: N adjacent elements are averaged into a single output element, which is equivalent to downsampling by a factor of 1/N, as sketched below.
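As a tiny numpy illustration of average pooling as 1/N downsampling:

```python
import numpy as np

def avg_pool(x, N=2):
    """Average every N adjacent elements into one output element,
    i.e., downsample by a factor of 1/N."""
    return x[:len(x) // N * N].reshape(-1, N).mean(axis=1)

print(avg_pool(np.arange(8.0), N=2))  # [0.5 2.5 4.5 6.5]
```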
However, the pooling operation is essentially a simple operation such as directly extracting or averaging an input sequence, and has a problem of "spectrum leakage" in terms of digital signal processing. Therefore, the embodiment of the application organically combines the signal processing technology and the artificial intelligence technology, provides a downsampling pooling method (comprising filtering and pooling processing), compresses the audio characteristics of the audio signals, and simultaneously effectively completes the smoothing between adjacent data, thereby avoiding the problem of spectrum leakage.
As an example, for an input signal of length L (i.e., the audio features of the audio signal), a low-pass filter is used to implement the digital-signal-based low-pass filtering; the low-pass filter may be, without limitation, a recursive filter (IIR, Infinite Impulse Response) or a non-recursive filter (FIR, Finite Impulse Response). Then, the low-pass-filtered input signal is sampled every N points, completing downsampling by a factor of 1/N and implementing the neural-network-based downsampling process; the length of the output signal (i.e., the sampling features of the audio signal) is 1/N times that of the input signal.
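A minimal sketch of this "filter, then decimate" operation using a scipy FIR low-pass; the 5-tap (4th-order) filter matches the FIR order used in the example later in the text, but the exact coefficients are an assumption:

```python
import numpy as np
from scipy.signal import firwin, lfilter

def downsample_pool(x, N=2, taps=5):
    """Low-pass filter with cutoff at 1/N of the band, then keep every
    N-th sample, i.e., downsample by a factor of 1/N."""
    h = firwin(taps, 1.0 / N)       # FIR low-pass; cutoff normalized to Nyquist
    y = lfilter(h, [1.0], x)        # digital low-pass filtering
    return y[::N]                   # decimation

x = np.random.randn(320)
print(downsample_pool(x, N=2).shape)  # (160,)
```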
It should be noted that the embodiments of the present application are not limited to the form of filtering, and for example, the filtering process may be a low-pass filtering process based on a digital signal, and may also be a band-pass filtering process based on a digital signal.
In some embodiments, before filtering the audio features of the audio signal, performing a first convolution process on the audio features of the audio signal to obtain first convolution features of the audio signal; and activating the first convolution characteristic to obtain an audio characteristic for filtering.
For example, before filtering, the embodiment of the application may call a deep neural network, perform a first convolution process on an audio feature of an audio signal to obtain the first convolution feature of the audio signal, and then perform an activation process on the first convolution feature to obtain the audio feature for performing pooling processing, so as to perform downsampling pooling processing. It should be noted that the embodiments of the present application are not limited to the model structure of the deep neural network.
Continuing the above example, as shown in fig. 11, the audio features of the audio signal form a 24×320 tensor, which is convolved by a 24-channel one-dimensional convolution module (i.e., the convolution network module shown in fig. 11) to implement the first convolution processing, then activated by an activation function (e.g., ReLU) (i.e., the activation module shown in fig. 11), and finally sampled by the downsampling pooling provided above, i.e., low-pass filtered by a 4th-order FIR and then downsampled by a factor of 1/2. Thus, the 24×320 tensor is converted into a 24×160 tensor by the preprocessing shown in fig. 11. As described above, the downsampling pooling provided by the embodiments of the present application avoids distortion as much as possible during downsampling and retains the most comprehensive information of the input signal.
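A hedged PyTorch sketch of the Fig. 11 preprocessing chain (convolution, ReLU activation, 4th-order FIR low-pass, 1/2 decimation); the channel count, FIR order and downsampling factor follow the text, while the convolution kernel size and the binomial FIR coefficients are assumptions:

```python
import torch
import torch.nn as nn

class Preprocess(nn.Module):
    def __init__(self, channels=24):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.ReLU()
        # Fixed 5-tap (4th-order) binomial low-pass, applied per channel.
        h = torch.tensor([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
        self.fir = nn.Conv1d(channels, channels, kernel_size=5, padding=2,
                             groups=channels, bias=False)
        self.fir.weight.data.copy_(h.view(1, 1, -1).repeat(channels, 1, 1))
        self.fir.weight.requires_grad_(False)

    def forward(self, x):          # x: (batch, 24, 320)
        x = self.act(self.conv(x))
        x = self.fir(x)            # low-pass filtering
        return x[:, :, ::2]        # pooling: keep every 2nd sample -> (batch, 24, 160)

print(Preprocess()(torch.randn(1, 24, 320)).shape)  # torch.Size([1, 24, 160])
```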
In step 103, feature encoding processing is performed on the sampled features of the audio signal, so as to obtain audio encoding features of the audio signal.
After step 102 filters and pools the audio features of the audio signal, feature encoding processing is performed on the sampling features of the audio signal to obtain the audio coding features of the audio signal, where the feature dimension of the audio coding features is smaller than the feature dimension of the audio features. The audio signal is compressed again through the feature encoding process, so that audio coding efficiency is improved while audio quality is ensured.
In some embodiments, performing feature encoding processing on a sampling feature of an audio signal to obtain an audio encoding feature of the audio signal, including: performing coding processing based on a neural network on the sampling characteristics of the audio signal to obtain coding characteristics of the audio signal; and performing second convolution processing on the coding characteristics of the audio signal to obtain the audio coding characteristics of the audio signal.
For example, the sampling feature of the audio signal is subjected to coding processing based on a neural network through a coding block in the coding network, so as to obtain the coding feature of the audio signal, and the coding feature of the audio signal is subjected to second convolution processing through a causal convolution module in the coding network, so as to obtain the audio coding feature of the audio signal. It should be noted that the embodiments of the present application are not limited to the model structure of the coding network, for example, the coding network may be a deep neural network, a convolutional neural network, or the like.
For example, as shown in fig. 10, the encoding network includes encoding blocks and a causal convolution; based on the 24×160 tensor output by the preprocessing network (i.e., the sampling features of the audio signal), the encoding network is invoked to generate a lower-dimensional feature vector F(n) (i.e., the audio coding features of the audio signal). The dimension of F(n) is 56; in terms of data volume this realizes a "dimension reduction" function, i.e., data compression.
In some embodiments, the neural network-based encoding process is implemented by a plurality of cascaded encoding blocks; performing a neural network-based encoding process on the sampled characteristics of the audio signal to obtain encoded characteristics of the audio signal, including: encoding the sampled characteristics of the audio signal by a first one of a plurality of cascaded encoding blocks; outputting the coding result of the first coding block to the coding block of the subsequent cascade connection, and continuing coding processing and coding result output through the coding block of the subsequent cascade connection until the last coding block is output; and taking the coding result output by the last coding block as the coding characteristic of the audio signal.
It should be noted that each layer of encoding block deepens the understanding of the downsampled features by one step; after learning through multiple layers of encoding blocks, the encoding features of the audio signal can be learned progressively and accurately. By cascading the encoding blocks, progressively refined encoding features of the audio signal are obtained.
In some embodiments, continuing the encoding process and the encoding result output through the subsequently concatenated encoded blocks includes: the following processing is performed by the subsequently concatenated coding blocks: performing third convolution processing on the coding result input to the coding block in the subsequent cascade connection to obtain a third convolution characteristic of the coding result; and performing downsampling processing on the third convolution characteristic of the coding result to obtain the coding result of the coding block of the subsequent cascade, and outputting the coding result of the coding block of the subsequent cascade.
The processing procedure of the first code block is similar to the processing procedure of the code block of the subsequent cascade, that is, the convolution processing is performed first, and then the downsampling processing is performed.
Continuing the above example, as shown in fig. 10, after the sampling features of the audio signal are obtained, 3 encoding blocks with different downsampling factors (down_factor) are cascaded. Taking one encoding block (down_factor=4) as an example, 1 or more hole (dilated) convolutions may be performed first, each with a convolution kernel fixed to a size of 1×3 and a stride rate of 1. In addition, the dilation rate of the 1 or more hole convolutions can be set as required, e.g., to 3; the embodiments of the present application do not limit the dilation rates of the different hole convolutions. The down_factor of the 3 encoding blocks is set to 4, 5 and 8 respectively, which is equivalent to setting pooling factors of different sizes and plays the role of downsampling. The channel numbers of the 3 encoding blocks are set to 48, 96 and 192, respectively. Thus, the 24×160 tensor is sequentially converted into 48×40, 96×8 and 192×1 tensors through the 3 encoding blocks. Finally, for the 192×1 tensor, a 56-dimensional feature vector F(n) can be output by a causal convolution like that of the preprocessing network. In each encoding block, hole convolution (i.e., the third convolution processing) is performed first, followed by pooling to complete the downsampling.
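A hedged PyTorch sketch of these cascaded encoding blocks; the down factors, channel counts, 1×3 kernels, stride 1 and dilation 3 follow the text, and average pooling is used for the downsampling step (one of the options the text mentions):

```python
import torch
import torch.nn as nn

class EncodeBlock(nn.Module):
    def __init__(self, in_ch, out_ch, down_factor, dilation=3):
        super().__init__()
        # hole (dilated) convolution: kernel 1x3, stride 1, dilation 3
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=3, stride=1,
                              dilation=dilation, padding=dilation)
        self.pool = nn.AvgPool1d(down_factor)  # pooling as downsampling

    def forward(self, x):
        return self.pool(self.conv(x))

encoder = nn.Sequential(EncodeBlock(24, 48, 4),
                        EncodeBlock(48, 96, 5),
                        EncodeBlock(96, 192, 8))
# 24x160 -> 48x40 -> 96x8 -> 192x1, as in the text
print(encoder(torch.randn(1, 24, 160)).shape)  # torch.Size([1, 192, 1])
```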
In some embodiments, downsampling the third convolution characteristic of the coding result to obtain a coding result output by a coding block of a subsequent cascade, including: and carrying out pooling processing based on the neural network on the third convolution characteristic of the coding result to obtain the coding result output by the coding block of the subsequent cascade connection.
For example, the Pooling manner (i.e., the Pooling process based on the neural network) in the encoding block in the encoding end may be an Average Pooling manner, an odd sampling manner, or the like. By adopting a pooling processing mode based on a neural network, rapid data compression can be realized.
In some embodiments, downsampling the third convolution characteristic of the coding result to obtain a coding result output by a coding block of a subsequent cascade, including: and filtering the third convolution characteristic of the coding result, and pooling the filtered third convolution characteristic to obtain the coding result output by the coding block in the subsequent cascade.
In connection with the above example, it should be noted that a downsampling pooling scheme (i.e., filtering first and then pooling) may be employed for the 3 encoded blocks. So that distortion is avoided as much as possible during the downsampling process, preserving the most comprehensive information of the audio signal.
In step 104, signal encoding processing is performed on the audio encoding characteristics of the audio signal, so as to obtain a code stream of the audio signal.
Following step 103, signal encoding processing is performed on the audio coding features of the audio signal to obtain the code stream of the audio signal; the code stream is transmitted to a decoding end, and the decoding end decodes the code stream to recover the audio signal, so that audio coding efficiency is improved while audio quality is ensured.
In some embodiments, signal encoding processing is performed on audio encoding features of an audio signal to obtain a code stream of the audio signal, including: carrying out quantization processing on the audio coding features of the audio signals to obtain index values of the audio coding features; and carrying out entropy coding processing on the index value of the audio coding characteristic to obtain a code stream of the audio signal.
For example, scalar quantization (quantizing each component separately) and entropy coding may be performed on the audio coding features of the audio signal. Alternatively, a combination of vector quantization (combining a plurality of adjacent components into a vector for joint quantization) and entropy coding may be used; the encoded code stream is transmitted to the decoding end, which decodes it. That is, the embodiments of the present application do not limit the quantization manner.
As previously described, the artificial intelligence based audio processing method provided by the embodiments of the present application may be implemented by various types of electronic devices. Referring to fig. 5, fig. 5 is a schematic flow chart of an audio processing method based on artificial intelligence according to an embodiment of the present application, in which an audio decoding function is implemented; the following description refers to the steps shown in fig. 5.
In step 201, signal decoding processing is performed on the code stream, so as to obtain audio coding features corresponding to the code stream.
Wherein the code stream is obtained by audio encoding by the artificial intelligence based audio processing method shown in fig. 4.
For example, after the code stream is obtained by encoding by the audio processing method based on artificial intelligence as shown in fig. 4, the code stream obtained by encoding is transmitted to a decoding end, and after the decoding end receives the code stream, signal decoding processing is performed on the code stream, so as to obtain the audio coding feature corresponding to the code stream.
In some embodiments, performing signal decoding on the code stream to obtain the audio coding features corresponding to the code stream includes: performing entropy decoding processing on the code stream to obtain index values corresponding to the code stream; and performing inverse quantization processing on the index values corresponding to the code stream to obtain the audio coding features corresponding to the code stream.
Signal decoding is the inverse of signal encoding. For the received code stream, entropy decoding is performed first, and an estimate F'(n) of the feature vector, i.e., the audio coding feature corresponding to the code stream, is obtained by looking up the quantization table (i.e., inverse quantization; the quantization table is the mapping table generated by quantization during encoding). It should be noted that the decoding end's processing of the received code stream is the inverse of the encoding end's processing, so values generated during decoding are estimates of the corresponding values in encoding; for example, the audio coding features generated during decoding are estimates of the audio coding features during encoding. A sketch of the table-lookup dequantization is shown below.
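As a hedged illustration of this table lookup, assuming a scalar quantizer with an 8-bit table; the table values and indices below are invented for the example, and the entropy-decoding step that recovers the indices is omitted:

```python
import numpy as np

quant_table = np.linspace(-1.0, 1.0, 256)  # hypothetical quantization table
indices = np.array([100, 212, 7])          # indices recovered by entropy decoding
F_est = quant_table[indices]               # inverse quantization: table lookup -> F'(n)
print(F_est)
```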
In step 202, feature decoding processing is performed on the audio coding features corresponding to the code stream, so as to obtain sampling features corresponding to the code stream.
It should be noted that, feature decoding in the audio decoding process is an inverse process of feature encoding in the audio encoding process. The sample characteristics generated during decoding are an estimate relative to the sample characteristics during encoding.
After the signal decoding processing is performed on the code stream, feature decoding processing is performed on the audio coding features corresponding to the code stream to obtain the sampling features corresponding to the code stream, where the feature dimension of the sampling features corresponding to the code stream is greater than the feature dimension of the audio coding features corresponding to the code stream. The code stream is decompressed through the feature decoding process, improving audio decoding efficiency while ensuring decoding quality.
In some embodiments, performing feature decoding processing on an audio coding feature corresponding to a code stream to obtain a sampling feature corresponding to the code stream, including: performing fifth convolution processing on the audio coding features corresponding to the code stream to obtain fifth convolution features corresponding to the audio coding features; and decoding the fifth convolution characteristic corresponding to the audio coding characteristic based on the neural network to obtain a sampling characteristic corresponding to the code stream.
For example, a causal convolution module in the decoding network performs fifth convolution processing on the audio coding feature corresponding to the code stream to obtain a fifth convolution feature corresponding to the audio coding feature, and a decoding block in the decoding network performs decoding processing based on the neural network on the fifth convolution feature corresponding to the audio coding feature to obtain a sampling feature corresponding to the code stream. It should be noted that the embodiments of the present application are not limited to the model structure of the decoding network, for example, the decoding network may be a deep neural network, a convolutional neural network, or the like.
For example, as shown in fig. 12, the decoding network includes decoding blocks and a causal convolution. Based on the 56×1-dimensional feature vector obtained by quantization decoding at the decoding end (i.e., the audio coding features corresponding to the code stream), the decoding network is invoked to generate a 24×160-dimensional feature vector (i.e., the sampling features corresponding to the code stream). In terms of data volume, this realizes a "dimension raising" function, i.e., data decompression.
In some embodiments, the neural network-based decoding process is implemented by a plurality of cascaded decoding blocks; performing decoding processing based on a neural network on a fifth convolution characteristic corresponding to the audio coding characteristic to obtain a sampling characteristic corresponding to the code stream, including: decoding the fifth convolution characteristic corresponding to the audio coding characteristic through a first decoding block in the plurality of cascaded decoding blocks; outputting the decoding result of the first decoding block to the subsequent cascade decoding blocks, and continuing decoding processing and decoding result output through the subsequent cascade decoding blocks until the decoding result is output to the last decoding block; and taking the decoding result output by the last decoding block as the sampling characteristic corresponding to the code stream.
It should be noted that each layer of decoding block deepens the understanding of the audio coding features by one step; after learning through multiple layers of decoding blocks, the upsampled features of the audio coding features can be learned progressively and accurately. By cascading the decoding blocks, progressively refined sampling features corresponding to the code stream can be obtained.
In some embodiments, continuing the decoding processing and decoding result output through the subsequently cascaded decoding blocks includes: performing the following processing through the subsequently cascaded decoding blocks: upsampling the decoding result input to the subsequently cascaded decoding block to obtain a sampling result of the subsequently cascaded decoding block; and performing sixth convolution processing on the sampling result of the subsequently cascaded decoding block to obtain the decoding result of the subsequently cascaded decoding block, and outputting that decoding result.
It should be noted that, the processing procedure of the first decoding block is similar to the processing procedure of the decoding block in the subsequent cascade, that is, the upsampling processing is performed first, and then the sixth convolution processing (that is, a convolution processing manner) is performed.
Continuing the above example, as shown in fig. 12, after the audio coding features corresponding to the code stream are obtained, a causal convolution similar to that of the post-processing network is performed first, outputting a 192×1-dimensional feature vector (i.e., the fifth convolution feature corresponding to the audio coding features); then 3 decoding blocks with different upsampling factors (up_factor) are cascaded. Taking one decoding block (up_factor=5) as an example, 1 or more hole convolutions may be performed, each with a convolution kernel fixed to a size of 1×3 and a stride rate of 1. In addition, the dilation rate of the 1 or more hole convolutions can be set as required, e.g., to 3; the embodiments of the present application do not limit the dilation rates of the different hole convolutions. The up_factor of the 3 decoding blocks is set to 8, 5 and 4 respectively, which is equivalent to setting pooling factors of different sizes and plays the role of upsampling. The channel numbers of the 3 decoding blocks are set to 96, 48 and 24, respectively. Thus, the 192×1 tensor is sequentially converted into 96×8, 48×40 and 24×160 tensors through the 3 decoding blocks. In each decoding block, pooling is performed first to complete the upsampling, followed by hole convolution (i.e., the sixth convolution processing).
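A hedged PyTorch sketch of these cascaded decoding blocks, mirroring the encoder; the up factors and channel counts follow the text, while nearest-neighbor upsampling stands in for the Repeat operation and the kernel/dilation settings are assumed to mirror the encoder:

```python
import torch
import torch.nn as nn

class DecodeBlock(nn.Module):
    def __init__(self, in_ch, out_ch, up_factor, dilation=3):
        super().__init__()
        self.up = nn.Upsample(scale_factor=up_factor, mode='nearest')  # Repeat
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=3, stride=1,
                              dilation=dilation, padding=dilation)

    def forward(self, x):
        return self.conv(self.up(x))   # upsample first, then hole convolution

decoder = nn.Sequential(DecodeBlock(192, 96, 8),
                        DecodeBlock(96, 48, 5),
                        DecodeBlock(48, 24, 4))
# 192x1 -> 96x8 -> 48x40 -> 24x160, as in the text
print(decoder(torch.randn(1, 192, 1)).shape)  # torch.Size([1, 24, 160])
```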
In some embodiments, upsampling the decoding result input to the subsequently cascaded decoding block to obtain the sampling result of the subsequently cascaded decoding block includes: performing neural-network-based pooling processing on the decoding result input to the subsequently cascaded decoding block to obtain the sampling result of the subsequently cascaded decoding block.
For example, the pooling manner in the decoding blocks at the decoding end (i.e., the neural-network-based pooling processing) may be an up-sampling method such as a replication (Repeat) operation or a continuation operation. Adopting a neural-network-based pooling manner enables fast data decompression.
As in the interpolation of fig. 13, each element of the input is duplicated once, which doubles the length of the sequence, thereby realizing the Repeat operation. It should be noted that the upsampling method in the embodiments of the present application is not limited to the Repeat operation.
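As a minimal illustration (assuming numpy arrays), the Repeat operation amounts to:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.repeat(x, 2)  # duplicate each element once: length doubles
    # y: [1. 1. 2. 2. 3. 3. 4. 4.]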
In some embodiments, upsampling the decoding result input to the subsequently cascaded decoding block to obtain the sampling result of the subsequently cascaded decoding block includes: pooling the decoding result input to the subsequently cascaded decoding block, and filtering the pooled decoding result, to obtain the sampling result of the subsequently cascaded decoding block.
In connection with the above example, it should be noted that the pooling-plus-filtering method (referred to simply as up-sampling pooling) may be used for up-sampling in the 3 decoding blocks. In this way, distortion is avoided as far as possible during up-sampling, and the information of the audio signal is preserved as completely as possible.
In step 203, the sampling features corresponding to the code stream are pooled, and the pooled sampling features are filtered, to obtain the audio features corresponding to the code stream.
It should be noted that the upsampling pooling processing in audio decoding is the inverse of the filtering-and-pooling processing in audio encoding, and the audio features generated during decoding are an estimate of the audio features used during encoding.
Following step 202, after the feature decoding processing is performed on the audio coding features corresponding to the code stream, up-sampling pooling processing is performed on the sampling features corresponding to the code stream to obtain the audio features corresponding to the code stream, where the feature dimension of the audio features corresponding to the code stream is larger than that of the sampling features corresponding to the code stream. Through up-sampling pooling of the code stream, audio decoding efficiency is improved while decoding quality is ensured.
In some embodiments, pooling the sampling features corresponding to the code stream includes: performing up-sampling processing based on a neural network on sampling features corresponding to the code stream to obtain pooled sampling features; filtering the pooled sampling features to obtain audio features corresponding to the code stream, wherein the filtering comprises the following steps: and carrying out low-pass filtering processing based on the digital signals on the pooled sampling characteristics to obtain audio characteristics corresponding to the code stream.
It should be noted that, in a deep neural network, the decoding end generally decompresses the data dimension layer by layer, and the output of each network module undergoes a data replication (Repeat) operation to implement upsampling, corresponding to the Average Pooling manner of the encoding end.
However, the above replication operation may also introduce a risk of signal distortion. Therefore, the embodiments of the present application organically combine signal processing techniques with artificial intelligence and provide an up-sampling pooling method (i.e., pooling-plus-filtering processing) that, while decompressing the code stream, effectively smooths adjacent data and thereby avoids signal distortion. Similar to the downsampling pooling method described above, the implementation of upsampling pooling includes the following steps:
First, for an input signal of length L (i.e., the sampling feature corresponding to the code stream), a continuation operation of factor N (i.e., the neural-network-based up-sampling processing) is performed: between every two adjacent elements of the input signal, N−1 zeros are inserted (inserting values other than zero would, in digital signal processing terms, change the energy distribution). The length of the new output signal thus changes from L to N×L, approximating up-sampling by a factor of N.
Then, the sequence of length N×L is filtered with an FIR corresponding to the encoding end, giving an output of length N×L (i.e., the audio feature corresponding to the code stream).
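A minimal numpy sketch of these two steps (zero-insertion continuation by factor N, then causal FIR filtering); the filter taps below are placeholders, not the filter used in fig. 15:

    import numpy as np

    def upsample_pooling(x, n, fir):
        y = np.zeros(len(x) * n)             # continuation: length L -> N*L
        y[::n] = x                           # insert N-1 zeros between samples
        return np.convolve(y, fir)[:len(y)]  # causal FIR smoothing, length kept

    x = np.random.randn(160)
    fir = np.array([0.15, 0.25, 0.2, 0.25, 0.15])  # placeholder low-pass taps
    print(upsample_pooling(x, 2, fir).shape)       # (320,)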
It should be noted that the embodiments of the present application do not limit the form of filtering; for example, the filtering processing may be digital-signal-based low-pass filtering, or digital-signal-based band-pass filtering.
In some embodiments, before the pooling processing is performed on the sampling features corresponding to the code stream, fourth convolution processing is performed on the sampling features corresponding to the code stream to obtain the fourth convolution features corresponding to the code stream; the fourth convolution features corresponding to the code stream are then activated to obtain the sampling features used for pooling.
For example, before the up-sampling pooling processing, the embodiments of the present application may invoke a deep neural network on the sampling features corresponding to the code stream: fourth convolution processing is performed on the sampling features corresponding to the code stream to obtain the fourth convolution features corresponding to the code stream, and the fourth convolution features corresponding to the code stream are activated to obtain the sampling features used for the up-sampling pooling processing. It should be noted that the embodiments of the present application do not limit the model structure of the deep neural network.
Continuing the above example, as shown in fig. 14, the sampling feature corresponding to the code stream is a 24×160 tensor; convolution is performed by a 24-channel one-dimensional convolution module (i.e., the convolution network module shown in fig. 14) to implement the fourth convolution processing, activation is then performed using an activation function (e.g., ReLU) (i.e., the activation module shown in fig. 14), and finally up-sampling is performed using the up-sampling pooling provided above, that is, up-sampling by a factor of 2 followed by FIR low-pass filtering. Thus, the 24×160 tensor is converted into a 24×320 tensor by the post-processing shown in fig. 14. As described above, by adopting the up-sampling pooling provided by the embodiments of the present application, distortion can be avoided as much as possible during up-sampling and the information of the input signal is preserved as completely as possible.
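A PyTorch-style sketch of this fig. 14 sub-module (convolution, activation, then up-sampling pooling); implementing the fixed FIR as a non-trainable grouped convolution, and the specific taps, are assumptions for illustration:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class UpsamplePooling(nn.Module):
        # Zero-insertion continuation by `factor`, then fixed causal FIR filtering.
        def __init__(self, channels, factor, taps):
            super().__init__()
            self.register_buffer("fir", torch.tensor(taps).repeat(channels, 1, 1))
            self.factor = factor

        def forward(self, x):  # x: (batch, channels, length)
            b, c, n = x.shape
            y = torch.zeros(b, c, n * self.factor, device=x.device)
            y[..., ::self.factor] = x                  # N-1 zeros between samples
            y = F.pad(y, (self.fir.shape[-1] - 1, 0))  # causal: pad the past only
            return F.conv1d(y, self.fir, groups=c)

    post = nn.Sequential(nn.Conv1d(24, 24, kernel_size=3, padding=1), nn.ReLU(),
                         UpsamplePooling(24, 2, [0.15, 0.25, 0.2, 0.25, 0.15]))
    print(post(torch.randn(1, 24, 160)).shape)  # torch.Size([1, 24, 320])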
In step 204, feature reconstruction processing is performed on the audio features corresponding to the code stream, so as to obtain an audio signal corresponding to the code stream.
Following step 203, after the audio features corresponding to the code stream are obtained, feature reconstruction processing is performed on the audio features corresponding to the code stream through a neural network model to obtain the audio signal corresponding to the code stream.
It should be noted that the embodiments of the present application do not limit the structure used by the neural network model for feature reconstruction. For example, the neural network model may be implemented by a causal convolution module: where the audio feature corresponding to the code stream is a 24×320 tensor, the causal convolution module generates a 1×320 tensor (the audio signal corresponding to the code stream).
In the following, an exemplary application of the embodiments of the present application in a practical application scenario will be described.
The embodiment of the application can be applied to various audio scenes, such as voice call, instant messaging and the like. The following description will take voice call as an example:
In the related art, the principle of speech coding is roughly as follows: either the speech waveform samples are coded directly, sample by sample; or, based on the mechanism of human speech production, relevant low-dimensional features are extracted and encoded at the encoding end, and the decoding end reconstructs the speech signal from these parameters.
Both coding principles derive from speech signal modeling, i.e., compression methods based on signal processing, and cannot guarantee audio coding quality. To improve coding efficiency while guaranteeing speech quality, the embodiments of the present application provide a speech coding method based on an improved filtering operator (i.e., the artificial-intelligence-based audio processing method). After the input signal (i.e., the audio signal) is processed using neural network (NN) technology, fewer bits are needed than in the related art while better coding quality is guaranteed: in terms of the neural network structure, a dedicated resampling module is introduced to sample the data with higher quality, so that the features extracted at the encoding end introduce as little distortion as possible from the data source. Thus, after the encoding end processes the input signal, a feature vector of lower dimension than the input signal is obtained, and this feature vector is compressed and encoded to obtain a code stream for transmission. The decoding end decodes the received code stream to obtain the feature vector and invokes the inverse processes of the corresponding encoding-end stages to complete the generation of the audio signal.
The embodiments of the present application can be applied to a voice communication link as shown in fig. 6. Taking a Voice over Internet Protocol (VoIP) conference system as an example, the speech codec technology of the embodiments of the present application is deployed in the encoding and decoding parts to provide the basic speech compression function. The encoder is deployed at the upstream client 601 and the decoder at the downstream client 602: speech is collected by the upstream client and undergoes preprocessing enhancement, encoding and similar processing; the encoded code stream is transmitted over the network to the downstream client 602, which performs decoding, enhancement and similar processing so that the decoded speech is played back at the downstream client 602.
Considering forward compatibility (i.e., the new encoder being compatible with existing encoders), a transcoder needs to be deployed in the background (i.e., server) of the system to enable interworking between the new encoder and existing encoders. For example, the sender (upstream client) may use the new NN encoder while the receiver (downstream client) sits on the Public Switched Telephone Network (PSTN) and uses G.722. In the background it is then necessary to run the NN decoder to generate the speech signal, and then to call the G.722 encoder to generate the corresponding code stream, implementing the transcoding function so that the receiving end can decode correctly.
The following describes a speech coding method based on the improved filtering operator according to the embodiment of the present application with reference to fig. 7:
the following processing is performed for the encoding end:
The input signal x(n) of the nth frame is non-linearly mapped using a preprocessing network. The resampling module provided by the embodiments of the present application is used in the preprocessing network to guarantee speech coding quality.
After the preprocessing network, the coding network is invoked to obtain a low-dimensional feature vector F(n), whose dimension is smaller than the dimension of the subband signal (i.e., the input signal), so as to reduce the data volume. The embodiments of the present application do not limit the NN structure of the coding network, which may be, for example, an autoencoder, a fully connected (FC) network, a long short-term memory (LSTM) network, a convolutional neural network (CNN) combined with LSTM, and the like.
The feature vector obtained from the coding network is subjected to vector quantization or scalar quantization, the quantized index values are entropy-coded, and the coded code stream is transmitted to the decoding end.
The following processing is performed for the decoding side:
The decoding end decodes the received code stream to obtain an estimate F′(n) of the feature vector. Based on the estimate F′(n), the decoding network and the post-processing network corresponding to the preprocessing network are invoked to recover the speech signal estimate x′(n).
Before the speech coding method based on the improved filtering operator provided in the embodiments of the present application is described in detail, the pooling technique and the dilated (hole) convolution network are introduced.
As described above, the function of the encoding end is to map a segment of a speech frame into a corresponding feature vector through the deep neural network, where the dimension of the resulting feature vector is much smaller than that of the input speech frame. For example, a speech segment containing 320 points maps to a 56-dimensional feature vector. Thus, only the 56-dimensional feature vector needs to be quantized and encoded for the decoding end to approximately regenerate the speech data containing 320 points; during transmission only the 56-dimensional feature vector is handled, which constitutes a compression technique.
In a deep neural network, the encoding end typically compresses the data dimension layer by layer. Thus, regardless of the network structure used, the output of each network module undergoes a data sampling operation, i.e., pooling. The sampling operation shown in fig. 8A has 8 points as input and 4 points as output, corresponding to downsampling by a factor of 1/2; specifically, fig. 8A shows simple odd-point sampling (decimation). In a deep neural network, downsampling may also be implemented by Average Pooling, i.e., averaging N adjacent elements and outputting a single value, which is equivalent to downsampling by a factor of 1/N.
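For concreteness, a numpy illustration of these two baseline operations (an illustration only, not code from this embodiment):

    import numpy as np

    x = np.arange(8.0)                    # 8 input points
    odd = x[::2]                          # fig. 8A-style decimation: 4 points out
    avg = x.reshape(-1, 2).mean(axis=1)   # Average Pooling with N=2: 4 points out
    print(odd, avg)                       # [0. 2. 4. 6.] [0.5 2.5 4.5 6.5]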
The pooling operation is essentially a simple operation, such as directly extracting elements from, or averaging over, the input sequence, and from a digital signal processing perspective it suffers from the "spectrum leakage" problem. Thus, drawing on classical digital-signal "downsampling" theory, the embodiments of the present application provide a downsampling pooling method (Decimate Pooling) for deep neural networks. The implementation of downsampling pooling includes the following steps:
First, an input signal of length L is low-pass filtered with a low-pass filter. The low-pass filter may be either a recursive filter (IIR, Infinite Impulse Response) or a non-recursive filter (FIR, Finite Impulse Response). For causality, the embodiments of the present application perform FIR-based low-pass filtering, so that filtering can be completed by referring to historical data only. Compared with the odd-point sampling or Average Pooling mentioned above, low-pass filtering effectively smooths adjacent data and minimizes the "spectrum leakage" problem. The FIR-based low-pass filtering does not change the length of the input signal.
Then, the low-pass-filtered input signal is sampled every N points, completing downsampling by a factor of 1/N; the output signal length is 1/N of the input signal length.
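A minimal numpy/scipy sketch of downsampling pooling as just described; using scipy.signal.firwin to design the length-5 (4th-order) low-pass filter is an assumption, since the text does not specify the filter design method:

    import numpy as np
    from scipy.signal import firwin

    def decimate_pooling(x, n, numtaps=5):
        fir = firwin(numtaps, 1.0 / n)    # low-pass, cutoff at 1/n of Nyquist
        y = np.convolve(x, fir)[:len(x)]  # causal FIR: length unchanged
        return y[::n]                     # keep every n-th point: length L/n

    x = np.random.randn(320)
    print(decimate_pooling(x, 2).shape)   # (160,)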
For the FIR mentioned above, this embodiment gives the two implementations shown in fig. 8B and fig. 8C: fig. 8B shows a 30th-order FIR filter (impulse response of length 31), and fig. 8C a 4th-order FIR filter (impulse response of length 5). The 30th-order FIR has higher accuracy but also higher complexity; the 4th-order FIR has much lower complexity, although its accuracy is somewhat lower. Considering the learning capability of deep neural networks, the embodiments of the present application may use the 4th-order FIR for low-pass filtering. It should be noted that the embodiments of the present application do not exclude other forms of low-pass filtering.
Referring to fig. 9A and 9B, fig. 9A is a schematic diagram of an ordinary convolutional network provided in an embodiment of the present application, and fig. 9B is a schematic diagram of a dilated (hole) convolutional network provided in an embodiment of the present application. Compared with an ordinary convolutional network, dilated convolution enlarges the receptive field while keeping the feature map size unchanged, avoiding errors introduced by up-sampling and down-sampling. Although the convolution kernel sizes shown in fig. 9A and 9B are both 3×3, the receptive field 901 of the ordinary convolution in fig. 9A is only 3, whereas the receptive field 902 of the dilated convolution in fig. 9B reaches 5. That is, for a 3×3 convolution kernel, the ordinary convolution in fig. 9A has a receptive field of 3 and a dilation rate (the spacing between points in the convolution kernel) of 1, whereas the dilated convolution in fig. 9B has a receptive field of 5 and a dilation rate of 2.
The convolution kernel may also be shifted across the plane as in fig. 9A or 9B, which involves the concept of stride (Stride Rate). For example, if the convolution kernel shifts by 1 position at a time, the corresponding stride is 1.
In addition, there is the concept of the number of convolution channels, i.e., how many convolution kernels (sets of parameters) are used for the convolution analysis. In theory, the more channels, the more comprehensive the analysis of the signal and the higher the accuracy, but also the higher the complexity. For example, a 1×320 tensor may be processed with a 24-channel convolution operation, producing a 24×320 tensor as output.
It should be noted that, according to practical application requirements, the dilated convolution kernel size (for example, for a speech signal the kernel size may be set to 1×3), the dilation rate, the stride and the number of channels can all be customized; the embodiments of the present application do not specifically limit them.
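A short PyTorch check of the numbers above (3-point kernel, dilation 2 giving a receptive field of 5, stride 1, 24 channels); the padding choice is an assumption made to keep the feature length unchanged:

    import torch
    import torch.nn as nn

    conv = nn.Conv1d(in_channels=1, out_channels=24, kernel_size=3,
                     stride=1, dilation=2, padding=2)
    x = torch.randn(1, 1, 320)   # a 1x320 tensor
    print(conv(x).shape)         # torch.Size([1, 24, 320])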
The following describes a speech coding method based on an improved filtering operator according to an embodiment of the present application in detail.
In some embodiments, a speech signal with a sampling rate fs = 16000 Hz is taken as an example (it should be noted that the method provided in the embodiments of the present application also applies to other sampling rates, including but not limited to 8000 Hz, 32000 Hz and 48000 Hz). The frame length is assumed to be 20 ms; for fs = 16000 Hz this corresponds to 320 sample points per frame.
The encoding side and the decoding side will be described in detail with reference to a flowchart shown in fig. 7.
The procedure for the encoding end is as follows:
1. Input signal generation.
For a speech signal with a sampling rate fs = 16000 Hz, the input signal of the nth frame comprises 320 sample points, denoted x(n).
2. Preprocessing network.
As shown in fig. 10, for a speech signal containing 320 points, the preprocessing network (comprising causal convolution and preprocessing) and the coding network (comprising three coding blocks and a causal convolution) output a 56-dimensional feature vector F(n). In terms of data volume, the preprocessing network and the coding network evidently reduce the dimension of the input signal, i.e., implement data compression.
It should be noted that the preprocessing network includes two sub-modules: causal convolution and preprocessing. The causal convolution may be set to 24 channels, expanding the input 1×320 tensor (i.e., the input signal) into a 24×320 tensor.
As shown in fig. 11, the embodiment of the present application applies the downsampling pooling method to preprocess the 24×320 tensor. Convolution is performed by a 24-channel one-dimensional convolution module (i.e., the convolution network module shown in fig. 11), activation is then performed using an activation function (e.g., ReLU) (i.e., the activation module shown in fig. 11), and finally sampling is performed using the downsampling pooling provided above, i.e., low-pass filtering with a 4th-order FIR followed by downsampling by a factor of 1/2. Thus, the 24×320 tensor is converted into a 24×160 tensor by the preprocessing shown in fig. 11. As described above, by adopting the downsampling pooling provided by the embodiments of the present application, distortion can be avoided as much as possible during downsampling and the information of the input signal is preserved as completely as possible.
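Putting the pieces together, a sketch of this fig. 11 preprocessing sub-module (convolution, activation, downsampling pooling), assuming PyTorch; realizing the fixed 4th-order FIR as a non-trainable grouped convolution, and the specific taps, are illustrative choices:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DecimatePooling(nn.Module):
        # Fixed causal low-pass FIR (length 5, i.e. 4th order), then decimation.
        def __init__(self, channels, factor, taps):
            super().__init__()
            self.register_buffer("fir", torch.tensor(taps).repeat(channels, 1, 1))
            self.factor = factor

        def forward(self, x):  # x: (batch, channels, length)
            x = F.pad(x, (self.fir.shape[-1] - 1, 0))    # refer to history only
            y = F.conv1d(x, self.fir, groups=x.shape[1])
            return y[..., ::self.factor]                 # keep every factor-th point

    pre = nn.Sequential(nn.Conv1d(24, 24, kernel_size=3, padding=1), nn.ReLU(),
                        DecimatePooling(24, 2, [0.1, 0.2, 0.4, 0.2, 0.1]))
    print(pre(torch.randn(1, 24, 320)).shape)  # torch.Size([1, 24, 160])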
3. Coding network.
The purpose of the coding network (comprising the coding blocks and a causal convolution) is to invoke the deep neural network to generate a lower-dimensional feature vector F(n) from the 24×160 tensor output by the preprocessing network. In the embodiments of the present application the dimension of F(n) is 56, so in terms of data volume the coding network performs dimension reduction, i.e., data compression.
As shown in fig. 10, after the preprocessing, 3 coding blocks with different downsampling factors (down_factor) are cascaded. Taking the coding block with down_factor=4 as an example, 1 or more dilated convolutions may be performed first, each convolution kernel being fixed to a size of 1×3 with a stride rate (Stride Rate) of 1. The dilation rate (Dilation Rate) of the 1 or more dilated convolutions can be set as required, for example to 3; of course, the embodiments of the present application do not limit the dilation rates of different dilated convolutions. The down_factor of the 3 coding blocks is set to 4, 5 and 8 respectively, which is equivalent to setting pooling factors of different sizes and plays the role of downsampling. Finally, the channel numbers of the 3 coding blocks are set to 48, 96 and 192, respectively. Thus, the 24×160 tensor is converted, through the 3 coding blocks in sequence, into 48×40, 96×8 and 192×1 tensors. Finally, for the 192×1 tensor, a 56-dimensional feature vector F(n) can be output through a causal convolution like that of the preprocessing network.
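Mirroring the decoding-block sketch earlier, one possible shape check of this cascade (convolution first, then pooling); Average Pooling is used here as the baseline that the next paragraph replaces with downsampling pooling, and the single convolution per block is again an illustrative assumption:

    import torch
    import torch.nn as nn

    class EncodingBlock(nn.Module):
        def __init__(self, in_ch, out_ch, down_factor, dilation=3):
            super().__init__()
            # Dilated convolution first (1x3 kernel, stride 1, length preserved) ...
            self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=3, stride=1,
                                  dilation=dilation, padding=dilation)
            # ... then pooling by down_factor.
            self.pool = nn.AvgPool1d(down_factor)

        def forward(self, x):
            return self.pool(torch.relu(self.conv(x)))

    encoder = nn.Sequential(EncodingBlock(24, 48, down_factor=4),
                            EncodingBlock(48, 96, down_factor=5),
                            EncodingBlock(96, 192, down_factor=8))
    print(encoder(torch.randn(1, 24, 160)).shape)  # torch.Size([1, 192, 1])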
It should be noted that the 3 coding blocks each internally contain a pooling operation similar to Average Pooling. The embodiments of the present application may therefore use the downsampling pooling scheme in 1 or more of the above coding blocks to replace Average Pooling. Specifically, if the coding blocks use the downsampling pooling method, it is applied starting from the coding block with down_factor=4, and so on through the coding block with down_factor=8.
4. Quantization coding.
The feature vector F(n) computed by the preprocessing network and the coding network may be subjected to scalar quantization (each component quantized individually) followed by entropy coding. The embodiments of the present application also do not exclude the combination of vector quantization (grouping several adjacent components into one vector for joint quantization) with entropy coding.
After the feature vector F(n) is quantized, the corresponding code stream can be generated. Experiments show that high-quality compression of 32 kHz super-wideband signals can be achieved at code rates of 6-10 kbps.
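As a rough consistency check (an illustrative calculation, not a figure stated in this embodiment): with the 20 ms frame length assumed above, 6 kbps corresponds to 6000 × 0.02 = 120 bits per frame, i.e., about 120 / 56 ≈ 2.1 bits per component of F(n), which is the regime in which scalar or vector quantization combined with entropy coding operates.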
The flow for the decoding side is as follows:
1. Quantization decoding.
Quantization decoding is the inverse of quantization coding. The received code stream is first entropy-decoded, and the estimate F′(n) of the feature vector is obtained by looking up the quantization table.
2. Decoding network and post-processing network.
As shown by the decoding network and post-processing network in fig. 12, the decoding end takes the 56-dimensional feature vector obtained by quantization decoding and, through prediction by the deep neural network, obtains the 320-dimensional reconstructed speech signal frame (an estimate) x′(n).
It should be noted that the processing of the decoding network is similar to that of the coding network, and the processing of the post-processing network is similar to that of the preprocessing network; for example, the causal convolution in the decoding network is similar to the causal convolution in the coding network. The structures of the decoding blocks and of the coding blocks at the encoding end are symmetrical: a coding block at the encoding end performs dilated convolution first and then pooling to complete downsampling, whereas a decoding block at the decoding end performs pooling first to complete upsampling and then performs dilated convolution.
In some embodiments, corresponding to the pooling manner at the encoding end (pooling in the coding blocks and pooling in the preprocessing), the decoding end may perform upsampling-like operations through replication (Repeat) operations, continuation operations, etc., which may be applied in the decoding blocks and the post-processing network.
It should be noted that, if the downsampling pooling method provided in the embodiments of the present application replaces the Average Pooling scheme at least in the preprocessing network, the decoding end may adopt the interpolation illustrated in fig. 13, copying each element of the input once so that the sequence length doubles, thereby implementing the Repeat operation. It should be noted that the embodiments of the present application are not limited to this implementation of the Repeat operation.
For example, as shown in fig. 14, the Repeat operation is applied to a post-processing sub-module in the post-processing network: convolution is first performed by a 24-channel one-dimensional convolution module (i.e., the convolution network module shown in fig. 14), activation is then performed using an activation function (e.g., ReLU) (i.e., the activation module shown in fig. 14), and finally interpolation using the Repeat operation provided above achieves up-sampling by a factor of 2.
In some embodiments, corresponding to the pooling approach at the encoding end (pooling in the encoding block and pooling in the pre-processing), the decoding end may perform an upsampling-like operation through an upsampling pooling operation, which may be applied to the decoding block and post-processing network.
It should be noted that, from the perspective of classical signal processing theory, the Repeat operation may carry a risk of signal distortion. Similar to the downsampling pooling method, the implementation of upsampling pooling includes the following steps:
First, a continuation operation of factor N is performed on an input signal of length L: between every two adjacent elements of the input signal, N−1 zeros are inserted (inserting values other than zero would, in digital signal processing terms, change the energy distribution). The length of the new output signal thus changes from L to N×L, approximating up-sampling by a factor of N.
Then, as shown in fig. 15, the sequence of length N×L is filtered using an FIR corresponding to the encoding end (a 26th-order up-sampling filter, impulse response of length 27), giving an output of length N×L.
It should be noted that, compared with the pooling realized by the Repeat operation, the upsampling pooling provided in the embodiments of the present application applies smoothing filtering during up-sampling, thereby improving speech quality.
For example, as shown in fig. 16, the upsampling pooling method is applied to a post-processing sub-module in the post-processing network: convolution is performed by a 24-channel one-dimensional convolution module (i.e., the convolution network module shown in fig. 16), activation is then performed using an activation function (e.g., ReLU) (i.e., the activation module shown in fig. 16), and finally the up-sampling pooling operation provided above implements up-sampling by a factor of 2. Thus, the 24×160 tensor is converted into a 24×320 tensor by the post-processing shown in fig. 16. As described above, by adopting the upsampling pooling provided by the embodiments of the present application, distortion can be avoided as much as possible during up-sampling and the information of the input signal is preserved as completely as possible.
Furthermore, the above upsampling pooling method may also be applied to 1 or more of the decoding blocks in fig. 12. The 3 decoding blocks originally contain pooling similar to the Repeat operation; the embodiments of the present application may use the upsampling pooling method in 1 or more of these decoding blocks to replace the original Repeat operation. Specifically, if the 3 decoding blocks use the upsampling pooling scheme, it is applied starting from the decoding block with up_factor=4, and so on through the decoding block with up_factor=8.
Thus, through the multi-layer network at the decoding end (comprising the decoding network and the post-processing network), the prediction of the 320-dimensional speech signal x′(n) from the 56-dimensional feature vector F′(n) can be completed.
In the embodiments of the present application, the networks of the encoding end and the decoding end can be trained jointly on collected data to obtain optimal parameters. Once the data is prepared and the corresponding network structures are configured, the model can be trained in the background and then put into use.
In summary, the speech coding method based on the improved filtering operator provided by the embodiments of the present application organically combines signal processing techniques with deep neural networks, significantly improving coding efficiency over pure signal-processing schemes while guaranteeing audio quality at acceptable complexity.
The artificial-intelligence-based audio processing method provided by the embodiments of the present application has been described so far in connection with exemplary applications and implementations of the terminal device provided by the embodiments of the present application. The embodiments of the present application further provide an artificial-intelligence-based audio processing device. In practice, the functional modules of the artificial-intelligence-based audio processing device can be implemented cooperatively by the hardware resources of an electronic device (such as a terminal device, a server or a server cluster): computing resources such as processors, communication resources (such as those supporting communication over optical cables, cellular networks, etc.) and memory. Fig. 3 shows an artificial-intelligence-based audio processing device 555 stored in a memory 550, which may be software in the form of programs, plug-ins, etc., for example software modules designed in programming languages such as C/C++ or Java, application software, or dedicated software modules, application program interfaces, plug-ins or cloud services within a large software system; different implementations are exemplified below.
The artificial intelligence-based audio processing device 555 includes a series of modules, including a feature extraction module 5551, a first pooling module 5552, a feature encoding module 5553, and a signal encoding module 5554. The following continues to describe a scheme for implementing audio coding by matching each module in the artificial intelligence-based audio processing device 555 provided in the embodiments of the present application.
The feature extraction module 5551 is configured to perform feature extraction processing on an audio signal to obtain an audio feature of the audio signal; the first pooling module 5552 is configured to perform filtering processing on the audio features of the audio signal, and pool the filtered audio features to obtain sampling features of the audio signal; the feature encoding module 5553 is configured to perform feature encoding processing on the sampled feature of the audio signal, so as to obtain an audio encoding feature of the audio signal; the signal encoding module 5554 is configured to perform signal encoding processing on the audio encoding feature of the audio signal, so as to obtain a code stream of the audio signal.
In some embodiments, the first pooling module 5552 is further configured to perform a low-pass filtering process based on a digital signal on the audio feature of the audio signal, to obtain the filtered audio feature; and performing down-sampling processing based on a neural network on the filtered audio characteristics to obtain sampling characteristics of the audio signals.
In some embodiments, before the filtering processing is performed on the audio features of the audio signal, the first pooling module 5552 is further configured to perform a first convolution processing on the audio features of the audio signal, to obtain a first convolution feature of the audio signal; and activating the first convolution characteristic to obtain an audio characteristic for filtering.
In some embodiments, the feature encoding module 5553 is further configured to perform a neural network-based encoding process on the sampled feature of the audio signal, to obtain an encoded feature of the audio signal; and performing second convolution processing on the coding characteristics of the audio signals to obtain the audio coding characteristics of the audio signals.
In some embodiments, the neural network-based encoding process is implemented by a plurality of cascaded encoding blocks; the feature encoding module 5553 is further configured to encode, by a first encoding block of the plurality of cascaded encoding blocks, a sampling feature of the audio signal; outputting the coding result of the first coding block to a coding block of a subsequent cascade, and continuing coding processing and coding result output through the coding block of the subsequent cascade until outputting to the last coding block; and taking the coding result output by the last coding block as the coding characteristic of the audio signal.
In some embodiments, the feature encoding module 5553 is further configured to perform the following processing by the subsequently concatenated encoded blocks: performing third convolution processing on the coding result input to the coding block in the subsequent cascade to obtain a third convolution characteristic of the coding result; and performing downsampling processing on the third convolution characteristic of the coding result to obtain the coding result of the coding block of the subsequent cascade, and outputting the coding result of the coding block of the subsequent cascade.
In some embodiments, the feature encoding module 5553 is further configured to perform a neural network-based pooling process on the third convolution feature of the encoding result, to obtain an encoding result output by the encoding block of the subsequent cascade; or filtering the third convolution characteristic of the coding result, and pooling the filtered third convolution characteristic to obtain the coding result output by the coding block of the subsequent cascade.
The artificial intelligence based audio processing device 555 includes a series of modules including a signal decoding module 5555, a feature decoding module 5556, a second pooling module 5557, and a feature reconstruction module 5558. The following continues to describe a scheme for implementing audio decoding by matching each module in the audio processing device 555 based on artificial intelligence provided in the embodiments of the present application.
The signal decoding module 5555 is configured to perform signal decoding processing on a code stream to obtain an audio coding feature corresponding to the code stream; the feature decoding module 5556 is configured to perform feature decoding processing on the audio coding feature corresponding to the code stream, so as to obtain a sampling feature corresponding to the code stream; the second pooling module 5557 is configured to pool the sampling features corresponding to the code stream, and filter the pooled sampling features to obtain audio features corresponding to the code stream; and the feature reconstruction module 5558 is configured to perform feature reconstruction processing on the audio feature corresponding to the code stream, so as to obtain an audio signal corresponding to the code stream.
In some embodiments, the second pooling module 5557 is further configured to perform a neural network-based upsampling process on the sampling feature corresponding to the code stream to obtain the pooled sampling feature; and carrying out low-pass filtering processing based on digital signals on the pooled sampling characteristics to obtain audio characteristics corresponding to the code stream.
In some embodiments, before the pooling processing is performed on the sampling feature corresponding to the code stream, the second pooling module 5557 is further configured to perform a fourth convolution processing on the sampling feature corresponding to the code stream, so as to obtain a fourth convolution feature corresponding to the code stream; and activating the fourth convolution characteristic corresponding to the code stream to obtain a sampling characteristic for pooling.
In some embodiments, the feature decoding module 5556 is further configured to perform a fifth convolution process on the audio coding feature corresponding to the code stream, to obtain a fifth convolution feature corresponding to the audio coding feature; and decoding the fifth convolution characteristic corresponding to the audio coding characteristic based on the neural network to obtain the sampling characteristic corresponding to the code stream.
In some embodiments, the neural network-based decoding process is implemented by a plurality of cascaded decoding blocks; the feature decoding module 5556 is further configured to perform decoding processing on a fifth convolution feature corresponding to the audio coding feature through a first decoding block in the plurality of cascaded decoding blocks; outputting the decoding result of the first decoding block to a subsequent cascade decoding block, and continuing decoding processing and decoding result output through the subsequent cascade decoding block until outputting to the last decoding block; and taking the decoding result output by the last decoding block as the sampling characteristic corresponding to the code stream.
In some embodiments, the feature decoding module 5556 is further configured to perform the following processing by the subsequently concatenated decoding blocks: up-sampling the decoding result input to the subsequently concatenated decoding block to obtain a sampling result of the subsequently concatenated decoding block; and performing sixth convolution processing on the sampling result of the subsequently concatenated decoding block to obtain a decoding result of the subsequently concatenated decoding block, and outputting the decoding result of the subsequently concatenated decoding block.
In some embodiments, the feature decoding module 5556 is further configured to perform neural-network-based pooling processing on the decoding result input to the subsequently concatenated decoding block to obtain the sampling result of the subsequently concatenated decoding block; or to pool the decoding result input to the subsequently concatenated decoding block and filter the pooled decoding result to obtain the sampling result of the subsequently concatenated decoding block.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the electronic device executes the audio processing method based on artificial intelligence according to the embodiment of the application.
Embodiments of the present application provide a computer readable storage medium having stored therein executable instructions that, when executed by a processor, cause the processor to perform the artificial intelligence based audio processing method provided by embodiments of the present application, for example, as shown in fig. 4-5.
In some embodiments, the computer-readable storage medium may be FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disc or CD-ROM, or any device including one or any combination of the above memories.
In some embodiments, the executable instructions may be in the form of programs, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, the executable instructions may, but need not, correspond to files in a file system, and may be stored as part of a file that holds other programs or data, for example in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices located at one site or, alternatively, distributed across multiple sites and interconnected by a communication network.
It will be appreciated that in the embodiments of the present application, related data such as user information is referred to, and when the embodiments of the present application are applied to specific products or technologies, user permissions or consents need to be obtained, and the collection, use and processing of related data need to comply with related laws and regulations and standards of related countries and regions.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application. Any modifications, equivalent substitutions, improvements, etc. that are within the spirit and scope of the present application are intended to be included within the scope of the present application.

Claims (19)

1. An artificial intelligence based audio processing method, the method comprising:
performing feature extraction processing on an audio signal to obtain audio features of the audio signal;
filtering the audio characteristics of the audio signals, and pooling the filtered audio characteristics to obtain sampling characteristics of the audio signals;
Performing feature coding processing on the sampling features of the audio signal to obtain audio coding features of the audio signal;
and performing signal coding processing on the audio coding characteristics of the audio signal to obtain a code stream of the audio signal.
2. The method of claim 1, wherein
the filtering processing of the audio characteristics of the audio signal comprises:
performing low-pass filtering processing based on digital signals on the audio characteristics of the audio signals to obtain the filtered audio characteristics;
pooling the filtered audio features to obtain sampling features of the audio signals, including:
and performing down-sampling processing based on a neural network on the filtered audio characteristics to obtain sampling characteristics of the audio signals.
3. The method of claim 1, wherein prior to filtering the audio characteristics of the audio signal, the method further comprises:
performing first convolution processing on the audio characteristics of the audio signals to obtain first convolution characteristics of the audio signals;
and activating the first convolution characteristic to obtain an audio characteristic for filtering.
4. The method of claim 1, wherein the feature encoding the sampled features of the audio signal to obtain audio encoded features of the audio signal comprises:
performing coding processing based on a neural network on the sampling characteristics of the audio signal to obtain coding characteristics of the audio signal;
and performing second convolution processing on the coding characteristics of the audio signals to obtain the audio coding characteristics of the audio signals.
5. The method of claim 4, wherein
the coding processing based on the neural network is realized by a plurality of cascade coding blocks;
the processing of the sampling feature of the audio signal based on the neural network to obtain the coding feature of the audio signal includes:
encoding the sampled characteristics of the audio signal by a first one of the plurality of concatenated encoding blocks;
outputting the coding result of the first coding block to a coding block of a subsequent cascade, and continuing coding processing and coding result output through the coding block of the subsequent cascade until outputting to the last coding block;
and taking the coding result output by the last coding block as the coding characteristic of the audio signal.
6. The method of claim 5, wherein said continuing the encoding process and the output of the encoding results through the subsequently concatenated encoded blocks comprises:
performing the following processing by the subsequently concatenated coding blocks:
performing third convolution processing on the coding result input to the coding block in the subsequent cascade to obtain a third convolution characteristic of the coding result;
and performing downsampling processing on the third convolution characteristic of the coding result to obtain the coding result of the coding block of the subsequent cascade, and outputting the coding result of the coding block of the subsequent cascade.
7. The method of claim 6, wherein performing downsampling processing on the third convolution feature of the encoding result to obtain the encoding result output by the subsequently cascaded coding block comprises:
pooling processing based on a neural network is carried out on the third convolution characteristic of the coding result, and the coding result output by the coding block in the subsequent cascade is obtained; or,
and filtering the third convolution characteristic of the coding result, and pooling the filtered third convolution characteristic to obtain the coding result output by the coding block in the subsequent cascade.
8. An artificial intelligence based audio processing method, the method comprising:
performing signal decoding processing on the code stream to obtain audio coding characteristics corresponding to the code stream;
performing feature decoding processing on the audio coding features corresponding to the code stream to obtain sampling features corresponding to the code stream;
pooling the sampling features corresponding to the code stream, and filtering the pooled sampling features to obtain audio features corresponding to the code stream;
and carrying out feature reconstruction processing on the audio features corresponding to the code stream to obtain an audio signal corresponding to the code stream.
9. The method of claim 8, wherein
the pooling processing of the sampling characteristics corresponding to the code stream comprises the following steps:
performing up-sampling processing based on a neural network on sampling features corresponding to the code stream to obtain pooled sampling features;
filtering the pooled sampling features to obtain audio features corresponding to the code stream, wherein the filtering comprises the following steps:
and carrying out low-pass filtering processing based on digital signals on the pooled sampling characteristics to obtain audio characteristics corresponding to the code stream.
10. The method of claim 8, wherein prior to pooling the sampled features corresponding to the code stream, the method further comprises:
performing fourth convolution processing on the sampling features corresponding to the code stream to obtain fourth convolution features corresponding to the code stream;
and activating the fourth convolution characteristic corresponding to the code stream to obtain a sampling characteristic for pooling.
11. The method of claim 8, wherein performing feature decoding processing on the audio coding feature corresponding to the code stream to obtain a sampling feature corresponding to the code stream, comprises:
performing fifth convolution processing on the audio coding features corresponding to the code stream to obtain fifth convolution features corresponding to the audio coding features;
and decoding the fifth convolution characteristic corresponding to the audio coding characteristic based on the neural network to obtain the sampling characteristic corresponding to the code stream.
12. The method of claim 11, wherein
the neural network-based decoding process is implemented by a plurality of cascaded decoding blocks;
the decoding processing based on the neural network is performed on the fifth convolution characteristic corresponding to the audio coding characteristic to obtain a sampling characteristic corresponding to the code stream, which comprises the following steps:
Decoding a fifth convolution feature corresponding to the audio coding feature through a first decoding block in the plurality of cascaded decoding blocks;
outputting the decoding result of the first decoding block to a subsequent cascade decoding block, and continuing decoding processing and decoding result output through the subsequent cascade decoding block until outputting to the last decoding block;
and taking the decoding result output by the last decoding block as the sampling characteristic corresponding to the code stream.
13. The method of claim 12, wherein said continuing the decoding process through the subsequently concatenated decoding blocks and decoding result output comprises:
performing the following processing by the subsequently concatenated decoding blocks:
up-sampling the decoding result input to the subsequently cascaded decoding block to obtain a sampling result of the subsequently cascaded decoding block;
and performing sixth convolution processing on the sampling result of the subsequently cascaded decoding block to obtain a decoding result of the subsequently cascaded decoding block, and outputting the decoding result of the subsequently cascaded decoding block.
14. The method according to claim 13, wherein upsampling the decoding result input to the subsequently cascaded decoding block to obtain the sampling result of the subsequently cascaded decoding block comprises:
performing neural-network-based pooling processing on the decoding result input to the subsequently cascaded decoding block to obtain the sampling result of the subsequently cascaded decoding block; or,
pooling the decoding result input to the subsequently cascaded decoding block, and filtering the pooled decoding result, to obtain the sampling result of the subsequently cascaded decoding block.
15. An audio processing apparatus, the apparatus comprising:
the feature extraction module is used for carrying out feature extraction processing on the audio signal to obtain the audio feature of the audio signal;
the first pooling module is used for carrying out filtering processing on the audio characteristics of the audio signals and pooling processing on the filtered audio characteristics to obtain sampling characteristics of the audio signals;
the feature coding module is used for carrying out feature coding processing on the sampling features of the audio signal to obtain audio coding features of the audio signal;
and the signal coding module is used for carrying out signal coding processing on the audio coding characteristics of the audio signal to obtain a code stream of the audio signal.
16. An audio processing apparatus, the apparatus comprising:
The signal decoding module is used for carrying out signal decoding processing on the code stream to obtain audio coding characteristics corresponding to the code stream;
the feature decoding module is used for carrying out feature decoding processing on the audio coding features corresponding to the code stream to obtain sampling features corresponding to the code stream;
the second pooling module is used for pooling the sampling characteristics corresponding to the code stream and filtering the pooled sampling characteristics to obtain audio characteristics corresponding to the code stream;
and the characteristic reconstruction module is used for carrying out characteristic reconstruction processing on the audio characteristics corresponding to the code stream to obtain an audio signal corresponding to the code stream.
17. An electronic device, the electronic device comprising:
a memory for storing executable instructions;
a processor for implementing the artificial intelligence based audio processing method of any one of claims 1 to 14 when executing executable instructions stored in the memory.
18. A computer readable storage medium storing executable instructions for implementing the artificial intelligence based audio processing method of any one of claims 1 to 14 when executed by a processor.
19. A computer program product comprising a computer program or instructions which, when executed by a processor, implements the artificial intelligence based audio processing method of any one of claims 1 to 14.