CN116504254A - Audio encoding and decoding method and device, storage medium and computer equipment - Google Patents

Audio encoding and decoding method and device, storage medium and computer equipment

Info

Publication number
CN116504254A
CN116504254A (application CN202310453713.2A)
Authority
CN
China
Prior art keywords
vector
codebook
audio
audio data
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310453713.2A
Other languages
Chinese (zh)
Inventor
姜鹏
谯轶轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202310453713.2A priority Critical patent/CN116504254A/en
Publication of CN116504254A publication Critical patent/CN116504254A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/005 - Correction of errors induced by the transmission channel, if related to the coding algorithm
    • G10L19/04 - Speech or audio signals analysis-synthesis techniques using predictive techniques
    • G10L19/16 - Vocoder architecture
    • G10L2019/0001 - Codebooks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to the field of computer technology and digital healthcare, and discloses an audio encoding and decoding method and device, a storage medium, and computer equipment. The method first acquires original audio data, downsamples it to obtain a first feature vector, and convolves the first feature vector to obtain a coding vector; it then performs multi-stage quantization on the coding vector based on preset codebooks, quantizing it into the codebook vector in the preset codebooks most similar to it; finally, it upsamples the codebook vector to obtain a second feature vector and convolves the second feature vector to obtain the decoded audio data. By applying downsampling, multi-stage quantization, and upsampling to the original audio data in sequence, the method compresses audio at a low bit rate and restores it at high quality, effectively improving both the transmission efficiency and the transmission quality of audio data.

Description

Audio encoding and decoding method and device, storage medium and computer equipment
Technical Field
The present invention relates to the field of computer technology and digital medical technology, and in particular, to an audio encoding and decoding method, an audio encoding and decoding device, a storage medium, and a computer device.
Background
With the rapid development of mobile internet healthcare, more and more intelligent diagnosis and treatment devices support functions such as computer-aided diagnosis, health management, and remote consultation. Such devices overcome the distance limitations of traditional diagnostic equipment and improve the accuracy with which physiological indicators are acquired. In remote consultation, an intelligent device can accurately collect important physiological signals such as heart sounds and lung sounds, capture biological sounds as faint as breathing, and transmit the collected audio to a remote doctor via an audio codec for accurate diagnosis. An audio codec encodes or decodes audio in order to compress it efficiently and reduce storage requirements or network bandwidth; ideally, the decoded audio is audibly indistinguishable from the original, and the codec introduces no perceptible delay. Conventional audio codec technologies fall into two main categories: waveform codecs and parametric codecs.
A waveform codec produces a decoder-side reconstruction that is as similar as possible to the input audio at the waveform level: the input time-domain waveform is mapped to the time-frequency domain, the transform coefficients are quantized and entropy-coded, and the decoder inverts the transform to reconstruct the time-domain waveform. Waveform codecs make few assumptions about the type of audio content, so they can operate on general audio and produce very high quality at medium to high bit rates, but they tend to introduce coding artifacts at low bit rates. A parametric codec, by contrast, makes specific assumptions about the source audio and introduces strong prior information in the form of a parametric model of the audio synthesis process. The encoder quantizes the model parameters, and the decoder drives a synthesis model with the quantized parameters to generate a time-domain waveform that is acoustically similar to the original. Both kinds of codec work well at medium and high bit rates, but at lower bit rates the efficiency and quality of data transmission degrade to varying degrees.
Disclosure of Invention
In view of this, the present application provides an audio encoding and decoding method and device, a storage medium, and a computer device, mainly aimed at solving the technical problem that, in the prior art, the codecs used by intelligent diagnosis and treatment devices suffer reduced transmission efficiency and quality when operating at lower bit rates.
According to a first aspect of the present invention, there is provided an audio codec method, the method comprising:
acquiring original audio data, downsampling the original audio data to obtain a first feature vector, and convolving the first feature vector to obtain a coding vector;
performing multi-stage quantization on the coding vector based on a preset codebook, so as to quantize the coding vector into the codebook vector in the preset codebook most similar to it;
and upsampling the codebook vector to obtain a second feature vector, and convolving the second feature vector to obtain the decoded audio data.
According to a second aspect of the present invention, there is provided an audio codec apparatus comprising:
the audio coding module is used for obtaining original audio data, downsampling the original audio data to obtain a first feature vector, and convolving the first feature vector to obtain a coding vector;
the vector quantization module is used for performing multi-stage quantization on the coding vector based on a preset codebook, so as to quantize the coding vector into the codebook vector in the preset codebook most similar to it;
and the vector decoding module is used for upsampling the codebook vector to obtain a second feature vector, and convolving the second feature vector to obtain the decoded audio data.
According to a third aspect of the present invention, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described audio codec method.
According to a fourth aspect of the present invention, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above audio codec method when executing the program.
The invention provides an audio encoding and decoding method and device, a storage medium, and computer equipment. Original audio data is first acquired and downsampled to obtain a first feature vector, which is convolved to obtain a coding vector; the coding vector is then quantized in multiple stages based on preset codebooks, so that it is quantized into the codebook vector most similar to it; finally, the codebook vector is upsampled to obtain a second feature vector, which is convolved to obtain the decoded audio data. Downsampling the original audio data to obtain the first feature vector and convolving it lets the encoding process run stably at a low bit rate and gives it a larger receptive field over the data; quantizing the coding vector in multiple stages avoids codebook explosion and effectively improves quantization accuracy; and decoding the codebook vector is the mirror inverse of encoding, so upsampling can restore the original audio data at high quality. By applying downsampling, multi-stage quantization, and upsampling in sequence, the method compresses audio at a low bit rate and restores it at high quality, effectively improving both the transmission efficiency and the transmission quality of audio data.
The method is applied to remote diagnosis and treatment, can accurately collect and restore the physiological indexes of the patient with high quality, and is convenient for doctors to accurately grasp the physiological states of the patient so as to carry out accurate diagnosis.
The foregoing is only an overview of the technical solution of the present application. To make the technical means of the present application clearer and implementable according to the specification, and to make the above and other objects, features, and advantages of the present application more readily understood, a detailed description of the application follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:
fig. 1 is a schematic flow chart of an audio encoding and decoding method according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of an audio encoding and decoding method according to an embodiment of the present invention;
fig. 3 is a schematic flow chart of audio encoding in an audio encoding and decoding method according to an embodiment of the present invention;
fig. 4 is a schematic diagram illustrating an audio encoding principle in an audio encoding and decoding method according to an embodiment of the present invention;
fig. 5 is a schematic diagram illustrating a multi-level quantization in an audio encoding and decoding method according to an embodiment of the present invention;
fig. 6 is a schematic flow chart of an audio encoding and decoding method according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of an audio codec according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an audio codec according to an embodiment of the present invention;
fig. 9 shows a schematic device structure of a computer device according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The embodiment of the application provides an audio encoding and decoding method, as shown in fig. 1, comprising the following steps:
101. Acquire original audio data, downsample the original audio data to obtain a first feature vector, and convolve the first feature vector to obtain a coding vector.
First, the application scenario of the method is introduced. An audio codec encodes or decodes audio in order to compress it efficiently and reduce storage requirements or network bandwidth; ideally, the decoded audio is audibly indistinguishable from the original, and the encoding and decoding process introduces no perceptible delay. Conventional audio codec technology is mainly divided into waveform codecs and parametric codecs. Both existing kinds of codec work well at medium and high bit rates, but at lower bit rates the efficiency and quality of data transmission degrade to varying degrees.
Specifically, audio encoding and decoding comprise an encoding process and a decoding process. Audio encoding compresses the original audio data into data of smaller volume that is convenient for network transmission, occupies less bandwidth, and effectively improves transmission efficiency; audio decoding restores the encoded data to audio. The convolution operations are performed by a convolutional neural network, one of the most successful deep learning architectures for multimedia data such as text, images, audio, and video. Such a network is built from stacked layers, which typically include convolutional layers, downsampling layers, activation layers, normalization layers, and fully connected layers.
Building on this, the present application proposes an audio encoding and decoding method. The original audio data to be transmitted is first acquired and then downsampled: downsampling samples a sequence once per group of several values, and the resulting new sequence is a downsampled version of the original. Each downsampling step reduces the dimension of the vector produced from the original audio data while doubling the number of channels. Downsampling thus compresses the original audio data at a low bit rate; an intermediate first feature vector is introduced to describe the original audio data, and convolving the first feature vector yields the coding vector used for subsequent processing.
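As a minimal illustrative sketch (not the patent's implementation), downsampling in the sense described above can be pictured as keeping one value out of every `stride` samples; chaining the stride sequence (2, 4, 5, 8) given later in the description collapses 320 input samples into a single feature step. The function names here are hypothetical:

```python
def downsample(sequence, stride):
    """Keep one sample out of every `stride` samples."""
    return sequence[::stride]

def encode_length(n_samples, strides=(2, 4, 5, 8)):
    """Length of the feature sequence after all downsampling stages."""
    seq = list(range(n_samples))
    for s in strides:
        seq = downsample(seq, s)
    return len(seq)
```

With these strides, one second of 24000 Hz audio shrinks to 75 feature steps, matching the frame-rate arithmetic used later in the description.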
102. The encoded vector is quantized in multiple stages based on a preset codebook to quantize the encoded vector into a codebook vector most similar to the encoded vector within the preset codebook.
Specifically, a codebook is a vector table, i.e., a set of vectors. Quantization replaces an input vector with the most similar vector found in a fixed vector table; because the vector space spanned by the table is far smaller than the input space, this amounts to lossy compression. Taking a picture as an example: each pixel value lies in the range 0-255, but suppose the codebook contains only five colors (0, 31, 63, 127, 255). Pixel colors are then replaced by codebook colors: every value in the range 0-30 is replaced by color 0, every value in the range 31-62 by color 31, and so on. The input picture is thereby represented by five colors while key image structure is retained, achieving lossy compression.
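The five-color picture example above can be sketched directly; this is an illustrative toy, and the function names are hypothetical. Each pixel is replaced by the largest codebook color not above it, exactly as in the 0-30 and 31-62 ranges described:

```python
CODEBOOK = [0, 31, 63, 127, 255]

def quantize_pixel(value):
    """Replace a pixel value (0-255) with the largest codebook colour <= value."""
    chosen = CODEBOOK[0]
    for colour in CODEBOOK:
        if colour <= value:
            chosen = colour
    return chosen

def quantize_image(pixels):
    """Quantize every pixel of a flat pixel list."""
    return [quantize_pixel(p) for p in pixels]
```
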
In this embodiment, the input vector is quantized using a limited number of vectors in the codebook. With the single codebook of the prior art, a very large number of vectors would have to be stored, causing the codebook to balloon. The present application therefore adopts multi-stage quantization: several codebooks are introduced, and the input vector is passed through each codebook in turn. Each stage matches the codebook vector most similar to its current input, and all the resulting codebook vectors are finally combined so that their sum is as close as possible to the input coding vector. Multi-stage quantization effectively improves quantization precision and thereby ensures high-quality transmission of the audio data.
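The multi-stage (residual) scheme described above can be sketched in a few lines of NumPy; the codebooks and function names here are illustrative assumptions, not the patent's implementation. Each stage quantizes the residual left by the previous stages, and the sum of the selected codebook vectors approximates the input:

```python
import numpy as np

def nearest(codebook, vector):
    """Index of the codebook row closest (L2 distance) to `vector`."""
    return int(np.argmin(np.linalg.norm(codebook - vector, axis=1)))

def residual_quantize(vector, codebooks):
    """Multi-stage quantization: each stage quantizes what the previous
    stages left over; the sum of the chosen rows approximates `vector`."""
    residual = np.asarray(vector, dtype=float)
    total = np.zeros_like(residual)
    for cb in codebooks:
        row = cb[nearest(cb, residual)]
        total += row
        residual = residual - row
    return total
```

Adding a second stage refines the one-stage approximation, which is the point of the multi-stage design.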
103. Upsample the codebook vector to obtain a second feature vector, and convolve the second feature vector to obtain the decoded audio data.
In this embodiment, the codebook vector obtained after quantization is decoded. Decoding is similar to encoding and can be understood as its mirror inverse: where encoding uses ordinary strided convolutions for downsampling, decoding uses transposed convolutions for upsampling, mirroring the encoder's stages while halving the number of channels at each upsampling step. This ensures that the decoder outputs high-quality restored audio data that remains highly similar to the original audio data, improving the quality of data transmission.
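As a hedged stand-in for the learned transposed convolutions described above, simple nearest-neighbour repetition shows the mirror-image bookkeeping: the stride sequence is traversed in reverse and the sequence grows back to its original length. The helpers and the stride sequence (2, 4, 5, 8) are illustrative:

```python
def upsample(sequence, stride):
    """Repeat each value `stride` times (a toy stand-in for a
    transposed convolution with that stride)."""
    return [v for v in sequence for _ in range(stride)]

def decode_length(n_frames, strides=(2, 4, 5, 8)):
    """Mirror of encoding: apply the encoder's strides in reverse order."""
    seq = [0.0] * n_frames
    for s in reversed(strides):
        seq = upsample(seq, s)
    return len(seq)
```
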
The invention provides an audio encoding and decoding method and device, a storage medium, and computer equipment. Original audio data is first acquired and downsampled to obtain a first feature vector, which is convolved to obtain a coding vector; the coding vector is then quantized in multiple stages based on preset codebooks, so that it is quantized into the codebook vector most similar to it; finally, the codebook vector is upsampled to obtain a second feature vector, which is convolved to obtain the decoded audio data. Downsampling the original audio data to obtain the first feature vector and convolving it lets the encoding process run stably at a low bit rate and gives it a larger receptive field over the data; quantizing the coding vector in multiple stages avoids codebook explosion and effectively improves quantization accuracy; and decoding the codebook vector is the mirror inverse of encoding, so upsampling can restore the original audio data at high quality. By applying downsampling, multi-stage quantization, and upsampling in sequence, the method compresses audio at a low bit rate and restores it at high quality, effectively improving both the transmission efficiency and the transmission quality of audio data.
The audio encoding and decoding method provided in this embodiment can be applied to telemedicine. While diagnosing and monitoring a patient, physiological signals such as heart sounds, lung sounds, and breathing sounds are first collected as original audio data; each stream of original audio data representing a different physiological indicator then undergoes downsampling, convolution, multi-stage quantization, upsampling, and deconvolution in turn, finally yielding the decoded audio data corresponding to the original. With this method, important physiological signals such as a patient's heart, lung, and breathing sounds can be collected, processed, and restored at high quality, improving transmission efficiency and quality, so that a remote doctor can accurately grasp the patient's physiological state from the decoded audio and make an accurate diagnosis.
The embodiment of the application provides an audio encoding and decoding method, as shown in fig. 2, comprising the following steps:
201. raw audio data is acquired and an encoder is constructed.
Specifically, original audio data is first acquired and framed to obtain the audio frames of the original audio data. An encoder is then constructed from a convolutional network; the encoder comprises an encoder input layer, several downsampling layers, and an encoder output layer.
In this embodiment, because the characteristics and parameters of the original audio data change over time, signal processing techniques designed for stationary signals cannot be applied to the data as a whole. An audio frame is a segment of the original audio data with a certain duration; the original data may be framed or windowed, dividing it into multiple segments whose characteristic parameters are analyzed, each segment being one audio frame. The encoder built on the convolutional network specifically comprises an encoder input layer, at least one downsampling layer, and an encoder output layer, where the input and output layers are convolutional layers built from one-dimensional convolution kernels. In this application, four downsampling layers are connected in sequence between the encoder input layer and the encoder output layer.
202. And downsampling the audio frames of the original audio data through a plurality of downsampling layers in the encoder to obtain a first feature vector.
Specifically, an audio frame of the original audio data is first fed into the encoder input layer. The downsampling layers derive feature vectors from the historical audio frames, the number of these feature vectors equaling the number of downsampling layers. The current audio frame is then downsampled by the downsampling layers while the historical-frame feature vectors are downsampled synchronously, yielding the first feature vector. Finally, the first feature vector is fed to the encoder output layer and convolved to obtain the coding vector of the original audio data.
In this embodiment, as shown in fig. 3, the encoder input layer is a 1D convolutional layer with C_enc channels that receives an audio frame of the original audio data. It is followed by four downsampling layers, each downsampling by strided convolution and each corresponding to one encoder block B_enc; every block contains three residual units built from dilated convolutions with dilation rates 1, 3, and 9. Starting from the C_enc channels of the input layer, the channel count doubles at every downsampling step, and the number of blocks B_enc (together with the corresponding stride sequence) determines the temporal resampling rate: for example, with strides (2, 4, 5, 8) for the blocks, one embedding is computed for every M = 2·4·5·8 = 320 input samples. This yields the first feature vector describing the audio frame. The encoder output layer is also a 1D convolutional layer, with kernel length 3 and stride 1, which sets the embedding dimension to D, convolves the first feature vector, and outputs the coding vector of the original audio data.
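The channel-doubling and stride bookkeeping above can be checked with a short helper; C_enc = 32 here is an assumed example value, not one specified by the text:

```python
def encoder_shapes(c_enc, strides):
    """Channel count after each downsampling stage (doubling per stage)
    and the overall stride factor M (input samples per output embedding)."""
    channels, m = [c_enc], 1
    for s in strides:
        channels.append(channels[-1] * 2)
        m *= s
    return channels, m
```
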
Specifically, because some information is lost while encoding the original audio data, the exact original cannot be recovered during later decoding. Historical audio frames are therefore introduced into the strided sampling: a historical audio frame is one or more audio frames that immediately precede the current frame in the frame sequence. The principle of the strided sampling is shown in fig. 4: reading from bottom to top, the current audio frame uses only historical frames, which are sampled together with it. This dilated causal convolution lets the top layer see information spanning a much wider extent of the first layer, enlarging the receptive field over the input for the current audio frame and improving the accuracy of audio encoding and decoding.
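The growth of the receptive field from the dilated convolutions with rates 1, 3, and 9 can be quantified; a kernel size of 3 is assumed here for the residual-unit convolutions (the text fixes kernel length 3 only for the encoder output layer):

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in input steps) of stacked dilated causal
    convolutions: each layer adds (kernel_size - 1) * dilation steps."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)
```

Under these assumptions one residual block with dilations (1, 3, 9) sees 27 input steps, versus 7 for three undilated layers, which is the enlargement of the receptive field the text describes.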
203. The encoded vector is quantized in multiple stages based on a preset codebook to quantize the encoded vector into a codebook vector most similar to the encoded vector within the preset codebook.
Specifically, quantizers are first obtained and the preset codebook corresponding to each quantizer is determined, where the number of quantizers is N, N being a positive integer greater than 1, and the preset codebooks comprise a first codebook through an N-th codebook. The coding vector is quantized stage by stage against these codebooks, and the first through N-th codebook vectors are summed to obtain the codebook vector corresponding to the coding vector.
In this embodiment, suppose the encoder and decoder operate at a bit rate R = 6000 bps. With a stride factor M = 320 and a sampling rate f_s = 24000 Hz, each second of audio is represented at the encoder output by S = 24000/320 = 75 frames, so r = 6000/75 = 80 bits are allocated to each audio frame, each bit being 0 or 1. A conventional single-stage quantizer would need to store N = 2^80 codebook vectors, which is clearly infeasible. The multi-stage quantizer used in the present application instead splits the 80 bits into several independent parts, for example into 8 groups of 10 bits, requiring 8 quantizers each of which stores only 2^10 codebook vectors. With the number of quantizers N = 8, as shown in fig. 5, the first through eighth codebooks may be mutually independent or fully shared. The coding vector is first quantized against the first codebook: its codebook vectors are traversed, the one most similar to the coding vector is extracted as the first codebook vector, and the difference between the coding vector and the first codebook vector gives the first residual vector. The first codebook vector is retained, and the first residual vector is quantized against the second codebook, where the vector most similar to it is taken as the second codebook vector and their difference gives the second residual vector; the second residual vector is then quantized against the third codebook, and so on until the seventh residual vector has been quantized against the eighth codebook. Finally, the first through eighth codebook vectors are summed to obtain the codebook vector.
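The bit-budget arithmetic above can be verified in a few lines (the helper names are illustrative):

```python
def bits_per_frame(bit_rate, sample_rate, stride_factor):
    """Bits available for each encoder output frame."""
    frames_per_second = sample_rate // stride_factor  # 24000 / 320 = 75
    return bit_rate // frames_per_second              # 6000 / 75 = 80

def codebook_sizes(total_bits, n_quantizers):
    """Vectors stored by one monolithic codebook vs. by n residual
    quantizers that split the bit budget evenly."""
    single = 2 ** total_bits
    split = n_quantizers * 2 ** (total_bits // n_quantizers)
    return single, split
```

Splitting 80 bits across 8 quantizers shrinks storage from 2^80 vectors to 8 × 1024 = 8192, which is the feasibility argument the text makes.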
With this multi-stage quantization, the sum of several codebook vectors comes closer to the coding vector, which solves both the problem of an oversized codebook and the problem of quantization precision.
Further, the preset codebook corresponding to each quantizer is obtained first, and its codebook vectors are examined one by one: when a codebook vector is dissimilar to every coding vector, it is removed from the preset codebook. The coding vectors are then clustered with a k-means clustering algorithm, the vectors of the cluster centroids are obtained, and these centroid vectors are added to the preset codebook.
In this embodiment of the present application, the preset codebook corresponding to each quantizer may be initialized randomly and then updated by gradient descent. The quality of the initialized preset codebook determines the quality of quantization and has a large influence on the quality of the recovered audio data. Therefore, by recording how often each codebook vector in the preset codebook is used, a vector that is never selected during training or actual quantization, i.e., one that is dissimilar to every encoded vector, is shown to be invalid or inefficient and needs to be removed, which keeps the preset codebook effective. When generating the preset codebook, the existing encoded vectors can be used directly as its initial vectors. Clustering is performed with the k-means algorithm, an iterative clustering-analysis algorithm that divides the data into K groups: K objects are randomly selected as initial cluster centers, the distance between each object and each cluster center is computed, and each object is assigned to the nearest cluster center, so that a cluster center together with the objects assigned to it represents one cluster. In the present application, the vectors of the cluster centroids are filled directly into the preset codebook as its initialization values, which effectively improves the quality of the preset codebook.
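The dead-codeword removal and k-means initialization described above can be sketched as follows; the usage-count bookkeeping and the cluster count are illustrative assumptions, not details fixed by the application:

```python
import numpy as np

def kmeans_centroids(vectors, k, iters=20, seed=0):
    """Plain k-means: pick k random vectors as initial centers, then
    alternate nearest-center assignment and mean update."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)].copy()
    for _ in range(iters):
        # distance from every vector to every centroid
        d = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = vectors[labels == j]
            if len(members):               # leave empty clusters unchanged
                centroids[j] = members.mean(axis=0)
    return centroids

def refresh_codebook(codebook, usage_counts, encoded_vectors):
    """Remove codewords that were never selected (dissimilar to every
    encoded vector) and refill those slots with k-means centroids."""
    dead = np.flatnonzero(usage_counts == 0)
    if len(dead):
        codebook[dead] = kmeans_centroids(encoded_vectors, len(dead))
    return codebook

rng = np.random.default_rng(1)
encoded = np.concatenate([rng.normal(0.0, 0.1, (50, 4)),
                          rng.normal(5.0, 0.1, (50, 4))])
codebook = np.zeros((4, 4))
codebook[0] = 9.0                      # one live codeword
usage = np.array([10, 0, 0, 0])        # the other three were never selected
codebook = refresh_codebook(codebook, usage, encoded)
```

Only the unused slots are rewritten; codewords that are still being selected are kept as-is.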
204. And up-sampling the codebook vector to obtain a second eigenvector, and carrying out convolution processing on the second eigenvector to obtain audio decoding data.
Specifically, constructing a decoder based on a convolutional network, wherein the decoder comprises a decoder input layer, a plurality of upsampling layers, and a decoder output layer; transmitting the codebook vector to an input layer of a decoder, and up-sampling the codebook vector through a plurality of up-sampling layers to obtain a second feature vector; and sending the second feature vector to an output layer of the decoder, and carrying out convolution processing on the second feature vector to obtain audio decoding data.
In this embodiment of the present application, the decoder upsamples with transposed convolutions. A transposed convolution is a special convolution operation: zeros are inserted into a low-dimensional vector to enlarge its dimension, and a forward convolution with a convolution kernel then produces a high-dimensional vector. The decoding process can be understood as a mirror image of encoding run in reverse order, and an upsampling layer is likewise the mirror structure of a downsampling layer: each upsampling layer also contains three residual units and uses the same strides as the encoder, but in reverse order, so the waveform is reconstructed at the same resolution as the input waveform. The number of channels is halved in each upsampling layer, so the final decoder output layer outputs single-channel audio decoding data.
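The zero-insertion view of a transposed convolution can be shown directly; the averaging kernel below is an arbitrary illustrative filter, not the decoder's learned weights:

```python
import numpy as np

def transposed_conv1d(x, kernel, stride):
    """Upsample by zero-insertion followed by an ordinary convolution,
    which is how a transposed (fractionally strided) convolution
    enlarges a low-dimensional vector into a higher-resolution one."""
    # insert stride-1 zeros between input samples to enlarge the vector
    up = np.zeros(len(x) * stride)
    up[::stride] = x
    # a forward convolution with the kernel then fills in dense values
    return np.convolve(up, kernel, mode="same")

x = np.array([1.0, 2.0, 3.0])
y = transposed_conv1d(x, kernel=np.ones(3) / 3, stride=2)
# the output is stride times longer than the input: 3 -> 6 samples
```

In a real decoder the kernel is learned and the operation is applied per channel, with the channel count halved at each upsampling layer.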
205. The original audio data and the audio decoding data are discriminated by the discriminator to update the network parameters.
Specifically, a generative adversarial network (GAN) model is first constructed, and a discriminator is generated within the GAN model. The original audio data and the audio decoding data are then obtained and input into the discriminator for discrimination, yielding a discrimination result. Finally, a loss function between the original audio data and the audio decoding data is determined based on the discrimination result, and the downsampling, multi-level quantization and upsampling processes are trained and their network parameters updated according to the loss function.
In the embodiment of the application, model training is performed by constructing a generative adversarial network (GAN) model, so that the generated audio decoding data approaches the original audio data as closely as possible and high-quality restoration of the audio data is achieved. In the present application the generator corresponds to the encoder and the decoder, while the discriminator takes the original audio data and the audio decoding data as input and distinguishes between them; the specific discrimination can be performed from two angles, the audio-sample level and the mel-spectrogram level. Discrimination yields a loss function between the original audio data and the audio decoding data that represents their degree of inconsistency: the greater the inconsistency, the greater the difference between the restored audio decoding data and the original audio data, and the lower the restoration quality. Based on the obtained loss function, network training is then applied to the downsampling, multi-level quantization and upsampling processes; specifically, the encoder, the decoder and the discriminator are trained and their respective network parameters updated. Continual comparison and training after each restoration makes the restored audio decoding data increasingly close to the original audio data, which ensures the accuracy of restoration, improves the quality of discrimination, and achieves high-quality restoration of the audio data.
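A sketch of the adversarial losses: the application does not fix a particular GAN objective, so the hinge formulation commonly used for audio GANs is assumed here, with hand-picked discriminator scores standing in for real network outputs:

```python
import numpy as np

def discriminator_loss(scores_real, scores_fake):
    """Hinge loss for the discriminator: push scores on original audio
    above +1 and scores on decoded audio below -1. (Hinge loss is one
    common GAN objective; assumed here for illustration.)"""
    return (np.mean(np.maximum(0.0, 1.0 - scores_real))
            + np.mean(np.maximum(0.0, 1.0 + scores_fake)))

def generator_loss(scores_fake):
    """Adversarial loss for the encoder+decoder (the generator):
    raise the discriminator's score on the reconstructed audio."""
    return -np.mean(scores_fake)

real = np.array([1.5, 0.5])    # discriminator outputs on original audio
fake = np.array([-1.2, 0.3])   # discriminator outputs on decoded audio
d_loss = discriminator_loss(real, fake)
g_loss = generator_loss(fake)
```

In training, the discriminator step minimizes `d_loss` and the encoder/decoder step minimizes `g_loss` (typically alongside reconstruction terms), alternating until the decoded audio becomes hard to distinguish from the original.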
Specifically, as shown in fig. 6, the original audio data is input to the encoder for encoding, the resulting encoded vector is input to the quantizer for multi-level quantization, the resulting codebook vector is input to the decoder for decoding to obtain the audio decoding data, and the discriminator discriminates between the original audio data and the audio decoding data; the encoder, the decoder and the discriminator are then trained further to improve the restoration quality of the data.
The invention provides an audio encoding and decoding method, apparatus, storage medium and computer device. Original audio data is first obtained and an encoder is constructed; the audio frames of the original audio data are downsampled through a plurality of downsampling layers in the encoder to obtain a first feature vector, which is convolved to obtain the encoded vector; the encoded vector is then quantized in multiple stages based on a preset codebook, so that it is quantized into the codebook vector most similar to it within the preset codebook; the codebook vector is upsampled to obtain a second feature vector, which is convolved to obtain the audio decoding data; finally, the discriminator discriminates between the original audio data and the audio decoding data to update the network parameters. According to the method, the original audio data is first downsampled by the constructed encoder to obtain the encoded vector, a multi-stage quantizer is then introduced to quantize the encoded vector in multiple stages, and the decoder upsamples the resulting codebook vector; the audio data can thus be compressed at a low bit rate and restored with high quality, effectively improving the transmission efficiency and transmission quality of the audio data, while the discriminator trains the encoding and decoding processes to improve the restoration quality of the audio data.
Further, as a specific implementation of the method of fig. 1, an embodiment of the present application provides an audio codec apparatus, as shown in fig. 7, where the apparatus includes: an audio encoding module 301, a vector quantization module 302, a vector decoding module 303.
The audio encoding module 301 may be configured to obtain original audio data, downsample the original audio data to obtain a first feature vector, and convolve the first feature vector to obtain an encoded vector;
the vector quantization module 302 is configured to perform multi-level quantization on the encoded vector based on a preset codebook, so as to quantize the encoded vector into a codebook vector that is most similar to the encoded vector in the preset codebook;
the vector decoding module 303 may be configured to up-sample the codebook vector to obtain a second feature vector, and convolve the second feature vector to obtain audio decoding data.
In a specific application scenario, the audio encoding module 301 may be specifically configured to obtain original audio data, and perform frame segmentation processing on the original audio data to obtain an audio frame of the original audio data; constructing an encoder based on a convolutional network, wherein the encoder comprises an encoder input layer, a plurality of downsampling layers and an encoder output layer; transmitting an audio frame of original audio data to an input layer of an encoder, and downsampling the audio frame of the original audio data through a plurality of downsampling layers to obtain a first feature vector; and sending the first feature vector to an output layer of the encoder, and carrying out convolution processing on the first feature vector to obtain the encoded vector of the original audio data.
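The encoder pipeline described above (an input layer feeding a stack of downsampling layers, followed by a convolutional output layer) can be sketched with plain strided convolutions. The kernels and strides below are illustrative placeholders, not the configuration actually used in the application (which reaches an overall stride factor of M = 320):

```python
import numpy as np

def strided_conv1d(x, kernel, stride):
    """One downsampling layer: a convolution evaluated every `stride`
    samples, reducing the temporal resolution by that factor."""
    out = []
    k = len(kernel)
    for start in range(0, len(x) - k + 1, stride):
        out.append(float(np.dot(x[start:start + k], kernel)))
    return np.array(out)

def encode(audio, layers):
    """Chain several downsampling layers, as the encoder input layer
    feeds the stack of downsampling layers."""
    h = audio
    for kernel, stride in layers:
        h = strided_conv1d(h, kernel, stride)
    return h

audio = np.arange(320, dtype=float)                   # one toy "frame" of samples
layers = [(np.ones(4) / 4, 2),
          (np.ones(4) / 4, 4),
          (np.ones(4) / 4, 5)]                        # overall stride 2*4*5 = 40
code = encode(audio, layers)
```

A real encoder would use learned multi-channel kernels and residual units per layer; the point here is only how stacked strided convolutions shrink an audio frame into a short feature vector.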
In a specific application scenario, the audio encoding module 301 may be further configured to downsample the historical audio frames based on a plurality of downsampling layers to obtain feature vectors of a plurality of historical audio frames, where the number of feature vectors of the historical audio frames is equal to the number of downsampling layers; downsampling the current audio frame based on the plurality of downsampling layers, and synchronously downsampling the feature vectors of the historical audio frame to obtain a first feature vector.
In a specific application scenario, the vector quantization module 302 is specifically configured to obtain the quantizers and determine the preset codebook corresponding to each quantizer, where the number of quantizers is N, N is a positive integer greater than 1, and the preset codebooks include a first codebook and a second codebook; obtain the encoded vector, retrieve in the first codebook the first codebook vector most similar to the encoded vector, and subtract the first codebook vector from the encoded vector to obtain a first residual vector; retrieve in the second codebook the second codebook vector most similar to the first residual vector, and subtract the second codebook vector from the first residual vector to obtain a second residual vector; and so on, until the Nth codebook vector most similar to the (N-1)th residual vector is retrieved in the Nth codebook; and sum the first codebook vector through the Nth codebook vector to obtain the codebook vector corresponding to the encoded vector.
In a specific application scenario, the vector quantization module 302 may be further configured to obtain the preset codebook corresponding to the quantizer and examine the codebook vectors in the preset codebook one by one; when a codebook vector is dissimilar to every encoded vector, remove that codebook vector from the preset codebook; and obtain the encoded vectors, cluster them based on a k-means clustering algorithm to obtain the vectors of the cluster centroids, and add the centroid vectors to the preset codebook.
In a specific application scenario, the vector decoding module 303 is specifically configured to construct a decoder based on a convolutional network, where the decoder includes a decoder input layer, a plurality of upsampling layers, and a decoder output layer; transmitting the codebook vector to an input layer of a decoder, and up-sampling the codebook vector through a plurality of up-sampling layers to obtain a second feature vector; and sending the second feature vector to an output layer of the decoder, and carrying out convolution processing on the second feature vector to obtain audio decoding data.
In a specific application scenario, as shown in fig. 8, the apparatus further includes an audio recognition module 304, where the audio recognition module 304 is specifically configured to construct a generative adversarial network (GAN) model, and generate a discriminator on the GAN model; acquire the original audio data and the audio decoding data, and input the original audio data and the audio decoding data into the discriminator for discrimination, obtaining a discrimination result; determine a loss function between the original audio data and the audio decoding data based on the discrimination result, and perform network training on the downsampling, multi-level quantization and upsampling processes and update the network parameters according to the loss function.
It should be noted that, for other corresponding descriptions of the functional units related to the audio codec device provided in this embodiment, reference may be made to corresponding descriptions in fig. 1 and fig. 2, and details are not repeated here.
Based on the above method as shown in fig. 1, correspondingly, the present embodiment further provides a storage medium, on which a computer program is stored, which when executed by a processor, implements the above audio codec method.
Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the audio codec method of each implementation scenario of the present application.
Based on the method shown in fig. 1 and fig. 2 and the embodiment of the audio codec device shown in fig. 7 and fig. 8, in order to achieve the above object, as shown in fig. 9, this embodiment further provides an entity device for audio codec, where the device includes a communication bus, a processor, a memory, a communication interface, and may further include an input/output interface and a display device, where each functional unit may complete communication with each other through the bus. The memory stores a computer program and a processor for executing the program stored in the memory to perform the audio encoding and decoding method in the above embodiment.
Optionally, the physical device may further include a user interface, a network interface, a camera, radio Frequency (RF) circuitry, sensors, audio circuitry, WI-FI modules, and the like. The user interface may include a Display screen (Display), an input unit such as a Keyboard (Keyboard), etc., and the optional user interface may also include a USB interface, a card reader interface, etc. The network interface may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), etc.
It will be appreciated by those skilled in the art that the structure of the audio codec entity device provided in this embodiment does not constitute a limitation on that device, which may include more or fewer components, combine certain components, or arrange the components differently.
The storage medium may also include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the entity device and supports the operation of the information processing program and other software and/or programs. The network communication module is used to implement communication among the components in the storage medium and with other hardware and software in the information processing entity device.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by means of software plus a necessary general hardware platform, or by hardware. By applying the technical scheme, original audio data is first obtained and downsampled to obtain a first feature vector, and the first feature vector is convolved to obtain an encoded vector; the encoded vector is then quantized in multiple stages based on a preset codebook, so that it is quantized into the codebook vector most similar to it within the preset codebook; finally the codebook vector is upsampled to obtain a second feature vector, and the second feature vector is convolved to obtain the audio decoding data. Downsampling the original audio data to obtain the first feature vector and convolving it allows the encoding process to run stably at a low bit rate and to obtain a larger receptive field over the data; quantizing the encoded vector in multiple stages avoids an explosion in codebook size and effectively improves the accuracy of the quantization process; and decoding the codebook vector is the mirror-image reverse of the encoding process, so upsampling can restore the original audio data with high quality. By applying downsampling, multi-level quantization and upsampling to the original audio data in sequence, the method compresses the audio data at a low bit rate and restores it with high quality, effectively improving the transmission efficiency and transmission quality of the audio data.
Those skilled in the art will appreciate that the drawings are merely schematic illustrations of a preferred implementation scenario, and that the modules or flows in the drawings are not necessarily required to practice the present application. Those skilled in the art will also appreciate that the modules in an apparatus of an implementation scenario may be distributed across the apparatus as described in the implementation scenario, or may be changed correspondingly and located in one or more apparatuses different from the present implementation scenario. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.
The foregoing application serial numbers are merely for description, and do not represent advantages or disadvantages of the implementation scenario. The foregoing disclosure is merely a few specific implementations of the present application, but the present application is not limited thereto and any variations that can be considered by a person skilled in the art shall fall within the protection scope of the present application.

Claims (10)

1. An audio codec method, the method comprising:
acquiring original audio data, downsampling the original audio data to obtain a first eigenvector, and convolving the first eigenvector to obtain a coding vector;
performing multistage quantization on the coding vector based on a preset codebook to quantize the coding vector into a codebook vector most similar to the coding vector in the preset codebook;
And up-sampling the codebook vector to obtain a second eigenvector, and carrying out convolution processing on the second eigenvector to obtain audio decoding data.
2. The method of claim 1, wherein the obtaining the original audio data, downsampling the original audio data to obtain a first feature vector, and convolving the first feature vector to obtain a coded vector, comprises:
obtaining original audio data, and carrying out framing treatment on the original audio data to obtain an audio frame of the original audio data;
constructing an encoder based on a convolutional network, wherein the encoder comprises an encoder input layer, a plurality of downsampling layers and an encoder output layer;
transmitting the audio frame of the original audio data to the input layer of the encoder, and downsampling the audio frame of the original audio data through the plurality of downsampling layers to obtain the first feature vector;
and sending the first eigenvector to the output layer of the encoder, and carrying out convolution processing on the first eigenvector to obtain the encoding vector of the original audio data.
3. The method of claim 2, wherein the audio frames of the original audio data comprise a current audio frame and a historical audio frame; the downsampling, by the plurality of downsampling layers, the audio frame of the original audio data to obtain the first feature vector, including:
Downsampling the historical audio frames based on the downsampling layers to obtain feature vectors of the historical audio frames, wherein the number of the feature vectors of the historical audio frames is equal to that of the downsampling layers;
and downsampling the current audio frame based on the downsampling layers, and synchronously downsampling the feature vectors of the historical audio frame to obtain the first feature vector.
4. The method of claim 1, wherein the multi-level quantizing the encoded vector based on a preset codebook to quantize the encoded vector to a codebook vector most similar to the encoded vector within the preset codebook, comprises:
obtaining quantizers and determining a preset codebook corresponding to each quantizer, wherein the number of the quantizers is N, N is a positive integer greater than 1, and the preset codebook comprises a first codebook and a second codebook;
the coding vector is obtained, a first codebook vector which is most similar to the coding vector is searched in the first codebook, and the coding vector and the first codebook vector are subjected to difference to obtain a first residual error vector;
Searching a second codebook vector which is most similar to the first residual vector in the second codebook, and carrying out difference on the first residual vector and the second codebook vector to obtain a second residual vector;
and so on until an Nth codebook vector most similar to the (N-1)th residual vector is retrieved in the Nth codebook;
and summing the first codebook vector through the Nth codebook vector to obtain the codebook vector corresponding to the coding vector.
5. The method of claim 4, wherein the obtaining the quantizers and determining a corresponding preset codebook for each quantizer comprises:
acquiring a preset codebook corresponding to the quantizer, and judging codebook vectors in the preset codebook one by one;
when the codebook vector is dissimilar to any coding vector, removing the codebook vector from the preset codebook;
and acquiring the coding vector, clustering the coding vector based on a k-means clustering algorithm to obtain a vector of clustered centroid points, and adding the vector of clustered centroid points into the preset codebook.
6. The method of claim 1, wherein upsampling the codebook vector to obtain a second eigenvector and convolving the second eigenvector to obtain audio decoded data comprises:
Constructing a decoder based on a convolutional network, wherein the decoder comprises a decoder input layer, a plurality of upsampling layers and a decoder output layer;
transmitting the codebook vector to the decoder input layer, and up-sampling the codebook vector through the plurality of up-sampling layers to obtain the second feature vector;
and sending the second feature vector to the decoder output layer, and carrying out convolution processing on the second feature vector to obtain the audio decoding data.
7. The method of claim 1, wherein after upsampling the codebook vector to obtain a second eigenvector and convolving the second eigenvector to obtain audio decoded data, the method comprises:
constructing a generative adversarial network (GAN) model, and generating a discriminator on the GAN model;
acquiring the original audio data and the audio decoding data, and inputting the original audio data and the audio decoding data into the discriminator for discrimination to obtain a discrimination result;
a loss function between the original audio data and the audio decoding data is determined based on the discrimination result, and the down-sampling, the multi-level quantization, and the up-sampling processes are network trained and network parameters are updated according to the loss function.
8. An audio codec apparatus, the apparatus comprising:
the audio coding module is used for obtaining original audio data, downsampling the original audio data to obtain a first feature vector, and convolving the first feature vector to obtain a coding vector;
the vector quantization module is used for carrying out multistage quantization on the coding vector based on a preset codebook so as to quantize the coding vector into a codebook vector which is most similar to the coding vector in the preset codebook;
and the vector decoding module is used for up-sampling the codebook vector to obtain a second eigenvector, and carrying out convolution processing on the second eigenvector to obtain audio decoding data.
9. A storage medium having stored thereon a computer program, which when executed by a processor, implements the steps of the method of any of claims 1 to 7.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the computer program when executed by the processor implements the steps of the method according to any one of claims 1 to 7.
CN202310453713.2A 2023-04-18 2023-04-18 Audio encoding and decoding method and device, storage medium and computer equipment Pending CN116504254A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310453713.2A CN116504254A (en) 2023-04-18 2023-04-18 Audio encoding and decoding method and device, storage medium and computer equipment

Publications (1)

Publication Number Publication Date
CN116504254A true CN116504254A (en) 2023-07-28

Family

ID=87327932

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310453713.2A Pending CN116504254A (en) 2023-04-18 2023-04-18 Audio encoding and decoding method and device, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN116504254A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117423348A (en) * 2023-12-19 2024-01-19 山东省计算中心(国家超级计算济南中心) Speech compression method and system based on deep learning and vector prediction
CN117423348B (en) * 2023-12-19 2024-04-02 山东省计算中心(国家超级计算济南中心) Speech compression method and system based on deep learning and vector prediction


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination