WO2021258958A1 - Speech encoding method and apparatus, computer device, and storage medium - Google Patents

Speech encoding method and apparatus, computer device, and storage medium

Info

Publication number
WO2021258958A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame
speech frame
encoded
voice
speech
Prior art date
Application number
PCT/CN2021/095714
Other languages
English (en)
French (fr)
Inventor
梁俊斌 (Liang Junbin)
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Priority to EP21828640.9A (EP4040436B1)
Priority to JP2022554706A (JP7471727B2)
Publication of WO2021258958A1
Priority to US17/740,309 (US20220270622A1)

Classifications

    • G PHYSICS > G10 MUSICAL INSTRUMENTS; ACOUSTICS > G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signal analysis-synthesis techniques using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/022 Blocking, i.e. grouping of samples in time; Choice of analysis windows; Overlap factoring
    • G10L19/025 Detection of transients or attacks for time/frequency resolution switching
    • G10L19/22 Mode decision, i.e. based on audio signal content versus external parameters
    • G10L19/24 Variable rate codecs, e.g. for generating different qualities using a scalable representation such as hierarchical encoding or layered encoding
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/90 Pitch determination of speech signals
    • G10L25/93 Discriminating between voiced and unvoiced parts of speech signals

Definitions

  • This application relates to the field of Internet technology, and in particular to a speech encoding method, apparatus, computer device, and storage medium.
  • Speech codecs occupy an important position in modern communication systems.
  • In conventional schemes, the bit rate parameters for speech encoding are usually set in advance, and these preset bit rate parameters are then used for speech encoding.
  • Encoding with preset bit rate parameters may produce redundant coding, which leads to the problem of low coding quality.
  • According to various embodiments provided in this application, a speech encoding method, apparatus, computer device, and storage medium are provided.
  • A speech encoding method, executed by a computer device, the method including:
  • encoding the speech frame to be encoded according to the encoding bit rate to obtain an encoding result.
  • In one embodiment, encoding the speech frame to be encoded according to the encoding bit rate to obtain the encoding result includes:
  • passing the encoding bit rate to a standard encoder through an interface to obtain the encoding result, where the standard encoder encodes the speech frame to be encoded using the encoding bit rate.
  • A speech encoding apparatus, comprising:
  • a speech frame acquisition module, configured to acquire a speech frame to be encoded and a backward speech frame corresponding to the speech frame to be encoded;
  • a first criticality calculation module, configured to extract features of the speech frame to be encoded, and obtain the criticality of the speech frame to be encoded based on those features;
  • a second criticality calculation module, configured to extract features of the backward speech frame, and obtain the criticality of the backward speech frame based on those features;
  • a bit rate calculation module, configured to obtain key trend features based on the criticality of the speech frame to be encoded and the criticality of the backward speech frame, and use the key trend features to determine the encoding bit rate corresponding to the speech frame to be encoded; and
  • an encoding module, configured to encode the speech frame to be encoded according to the encoding bit rate to obtain an encoding result.
  • A computer device includes a memory and a processor. The memory stores computer-readable instructions which, when executed by the processor, cause the processor to perform the following steps:
  • encoding the speech frame to be encoded according to the encoding bit rate to obtain an encoding result.
  • One or more non-volatile storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
  • encoding the speech frame to be encoded according to the encoding bit rate to obtain an encoding result.
  • FIG. 1 is an application environment diagram of a speech encoding method in an embodiment;
  • FIG. 2 is a schematic flowchart of a speech encoding method in an embodiment;
  • FIG. 3 is a schematic flowchart of feature extraction in an embodiment;
  • FIG. 4 is a schematic flowchart of calculating the criticality of a speech frame to be encoded in an embodiment;
  • FIG. 5 is a schematic flowchart of calculating an encoding bit rate in an embodiment;
  • FIG. 6 is a schematic flowchart of obtaining the criticality difference degree in an embodiment;
  • FIG. 7 is a schematic flowchart of determining an encoding bit rate in an embodiment;
  • FIG. 8 is a schematic flowchart of calculating the criticality of a speech frame to be encoded in a specific embodiment;
  • FIG. 9 is a schematic flowchart of calculating the criticality of backward speech frames in the specific embodiment of FIG. 8;
  • FIG. 10 is a schematic flowchart of obtaining an encoding result in the specific embodiment of FIG. 8;
  • FIG. 11 is a schematic flowchart of broadcasting audio in a specific embodiment;
  • FIG. 12 is an application environment diagram of a speech encoding method in a specific embodiment;
  • FIG. 13 is a structural block diagram of a speech encoding apparatus in an embodiment;
  • FIG. 14 is a diagram of the internal structure of a computer device in an embodiment.
  • Key technologies of speech processing include automatic speech recognition (ASR), text-to-speech synthesis (TTS), and voiceprint recognition.
  • Enabling computers to listen, see, speak, and feel is the future direction of human-computer interaction, and voice has become one of the most promising human-computer interaction methods.
  • The speech encoding method provided in this application can be applied in the application environment shown in FIG. 1.
  • The terminal 102 collects the voice signal uttered by the user.
  • The terminal 102 acquires the speech frame to be encoded and the backward speech frame corresponding to it; extracts the features of the speech frame to be encoded and obtains the criticality of the speech frame to be encoded based on those features; extracts the features of the backward speech frame and obtains the criticality of the backward speech frame based on those features; obtains key trend features from the criticality of the speech frame to be encoded and the criticality of the backward speech frame, and uses the key trend features to determine the encoding bit rate corresponding to the speech frame to be encoded; and finally encodes the speech frame to be encoded according to the encoding bit rate to obtain the encoding result.
  • The terminal 102 can be, but is not limited to, a personal computer, notebook computer, smartphone, or tablet computer with recording capability, or an audio broadcast device. It is understandable that the speech encoding method can also be applied to a server, or to a system including a terminal and a server.
  • The server can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, big data, and artificial intelligence platforms.
  • a speech coding method is provided.
  • The method is described by taking its application to the terminal in FIG. 1 as an example, and includes the following steps:
  • Step 202 Obtain a speech frame to be encoded and a backward speech frame corresponding to the speech frame to be encoded.
  • the speech frame is obtained after speech is divided into frames.
  • the speech frame to be coded refers to the speech frame that currently needs to be coded.
  • the backward speech frame refers to a future speech frame relative to the speech frame to be encoded, that is, a speech frame collected after the speech frame to be encoded.
  • specifically, the terminal may collect voice signals through a voice collection device, such as a microphone.
  • the terminal converts the collected voice signal into a digital signal, and then obtains the speech frame to be encoded and the corresponding backward speech frame from the digital signal.
  • for example, the number of acquired backward speech frames may be 3.
  • the terminal can also obtain the pre-stored voice signal in the memory, convert the voice signal into a digital signal, and then obtain the voice frame to be encoded and the backward voice frame corresponding to the voice frame to be encoded from the digital signal.
  • the terminal can also download the voice signal from the Internet, convert the voice signal into a digital signal, and then obtain the voice frame to be encoded and the backward voice frame corresponding to the voice frame to be encoded from the digital signal.
  • the terminal can also obtain a voice signal sent by another terminal or server, convert the voice signal into a digital signal, and then obtain a voice frame to be encoded from the digital signal, and a backward voice frame corresponding to the voice frame to be encoded.
  • Step 204: Extract the features of the speech frame to be encoded, and obtain the criticality of the speech frame to be encoded based on those features.
  • the voice frame feature refers to a feature used to measure the sound quality of the voice frame.
  • Voice frame features include, but are not limited to, voice start frame features, energy change features, pitch period mutation frame features, and non-speech frame features.
  • the voice start frame feature indicates whether the speech frame is the frame at which the voice signal starts.
  • the energy change feature reflects how the frame energy of the current speech frame changes relative to the frame energy of the previous speech frame.
  • the pitch period mutation frame feature characterizes the pitch period corresponding to the speech frame.
  • the non-speech frame feature indicates whether the speech frame is a noise frame.
  • the feature of the voice frame to be encoded refers to the feature of the voice frame corresponding to the voice frame to be encoded.
  • the criticality of a speech frame refers to the contribution of the sound quality of that frame to the overall speech quality within a period of time before and after it; the higher the contribution, the higher the criticality of the speech frame.
  • the criticality of the voice frame to be encoded refers to the criticality of the voice frame corresponding to the voice frame to be encoded.
  • the terminal extracts the features of the voice frame to be encoded corresponding to the voice frame to be encoded according to the voice frame type corresponding to the voice frame to be encoded.
  • the speech frame type may include at least one of a voice start frame, an energy surge frame, a pitch period mutation frame, and a non-speech frame.
  • when the speech frame to be encoded is a voice start frame, the corresponding voice start frame feature is obtained; when it is an energy surge frame, the corresponding energy change feature is obtained; when it is a pitch period mutation frame, the corresponding pitch period mutation frame feature is obtained; and when it is a non-speech frame, the corresponding non-speech frame feature is obtained.
  • a weighted calculation is then performed on the extracted features to obtain the criticality of the speech frame to be encoded.
  • specifically, a forward weighted calculation can be performed on the voice start frame feature, the energy change feature, and the pitch period mutation frame feature to obtain the forward criticality of the speech frame to be encoded, and a reverse weighted calculation can be performed on the non-speech frame feature to obtain the reverse criticality.
  • the final criticality of the speech frame to be encoded is then obtained from the forward criticality and the reverse criticality.
  • Step 206: Extract the features of the backward speech frame, and obtain the criticality of the backward speech frame based on those features.
  • the backward speech frame features are the speech frame features corresponding to the backward speech frame; each backward speech frame has its own corresponding features.
  • the criticality of the backward speech frame refers to the speech frame criticality corresponding to the backward speech frame.
  • specifically, the terminal extracts the backward speech frame features according to the speech frame type of the backward speech frame: when the backward speech frame is a voice start frame, the corresponding voice start frame feature is obtained; when it is an energy surge frame, the corresponding energy change feature is obtained; when it is a pitch period mutation frame, the corresponding pitch period mutation frame feature is obtained; and when it is a non-speech frame, the corresponding non-speech frame feature is obtained.
  • a weighted calculation is then performed on the backward speech frame features to obtain the criticality of the backward speech frame.
  • specifically, a forward weighted calculation can be performed on the voice start frame feature, the energy change feature, and the pitch period mutation frame feature to obtain the forward criticality of the backward speech frame, and a reverse weighted calculation on the non-speech frame feature yields the reverse criticality; the final criticality of the backward speech frame is obtained from the forward criticality and the reverse criticality.
  • in one embodiment, the features of the speech frame to be encoded and of the backward speech frame may each be input into a criticality measurement model for calculation, yielding the criticality of the speech frame to be encoded and the criticality of the backward speech frame.
  • the criticality measurement model is a model established with a linear regression algorithm from historical speech frame features and their criticality, and deployed in the terminal. Determining the criticality of speech frames through the criticality measurement model can improve accuracy and efficiency.
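  • As an illustration of such a criticality measurement model, the sketch below fits an ordinary least-squares linear regression from the four frame features to criticality labels. The training data values and the numpy-based fit are illustrative assumptions, not details from the patent:

```python
# Minimal sketch of a criticality measurement model: linear regression from
# historical frame features to criticality labels (all data values made up).
import numpy as np

# Feature rows: [voice_start, energy_change, pitch_mutation, non_speech]
X_hist = np.array([[1, 0, 0, 0],
                   [0, 1, 1, 0],
                   [0, 0, 0, 1],
                   [1, 1, 0, 0]], dtype=float)
y_hist = np.array([0.4, 0.7, 0.0, 0.7])  # criticality labels for history frames

# Ordinary least squares with a bias column appended.
coef, *_ = np.linalg.lstsq(np.c_[X_hist, np.ones(len(X_hist))], y_hist, rcond=None)

def predict_criticality(features):
    """Predict a frame's criticality from its four 0/1 features."""
    return float(np.dot(coef[:-1], features) + coef[-1])

print(predict_criticality([1, 0, 0, 0]))
```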
  • Step 208: Obtain key trend features based on the criticality of the speech frame to be encoded and the criticality of the backward speech frame, and use the key trend features to determine the encoding bit rate corresponding to the speech frame to be encoded.
  • the critical trend refers to the trend of speech frame criticality across the speech frame to be encoded and its corresponding backward speech frames; it may become stronger, become weaker, or remain unchanged.
  • a key trend feature is a feature that reflects the critical trend; it can be a statistical feature, such as the criticality average or the criticality difference.
  • the encoding bit rate is the bit rate used to encode the speech frame to be encoded.
  • specifically, the terminal obtains the key trend features based on the criticality of the speech frame to be encoded and the criticality of the backward speech frame, for example by computing statistical features over the two.
  • the statistical features can include at least one of the average, median, standard deviation, mode, range, and difference of the speech frame criticalities.
  • the encoding bit rate corresponding to the speech frame to be encoded is then calculated using the key trend features and preset bit rate calculation functions.
  • a bit rate calculation function is a monotonically increasing function and can be customized according to requirements.
  • each key trend feature can have a corresponding bit rate calculation function, or the same function can be shared.
  • Step 210 Encode the to-be-encoded speech frame according to the encoding bit rate to obtain an encoding result.
  • specifically, the encoding bit rate is used to encode the speech frame to be encoded to obtain the encoding result, i.e., the code stream data corresponding to the speech frame to be encoded.
  • the terminal can store the code stream data in memory or send it to the server for storage; the encoding itself can be performed by a speech encoder.
  • when the speech needs to be played, the saved code stream data is acquired and decoded, and the result is finally played through the terminal's voice playback device, such as a loudspeaker.
  • In the above speech encoding method, the criticality of the speech frame to be encoded and the criticality of the corresponding backward speech frame are calculated separately; key trend features are then obtained from the two criticalities, the key trend features are used to determine the encoding bit rate corresponding to the speech frame to be encoded, and encoding is performed at that bit rate to obtain the encoding result.
  • that is, the encoding bit rate can be adjusted according to the criticality trend of the speech frames, so that each speech frame to be encoded is encoded at an individually adjusted bit rate: when the critical trend becomes stronger, the speech frame to be encoded is assigned a higher encoding bit rate, and when the trend becomes weaker, it is assigned a lower one.
  • the encoding bit rate of each speech frame is thus controlled adaptively, which avoids redundant coding and improves the quality of speech coding.
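  • The control loop described above can be sketched end to end as follows. The criticality measure, rate mapping, and encoder are passed in as parameters because they are detailed only in later sections; the toy stand-ins at the bottom are assumptions for illustration:

```python
def adaptive_encode(frames, criticality, trend_to_bitrate, encode, lookahead=3):
    """Sketch of the adaptive loop: the encoding bit rate follows the
    criticality trend of the frame to be encoded and its backward frames."""
    out = []
    for i, frame in enumerate(frames):
        r_cur = criticality(frame)                       # frame to be encoded
        r_back = [criticality(f)                         # backward (future) frames
                  for f in frames[i + 1 : i + 1 + lookahead]]
        out.append(encode(frame, trend_to_bitrate(r_cur, r_back)))
    return out

# Toy stand-ins: louder frames count as "more critical".
frames = [[0] * 320, [500] * 320, [8000] * 320, [9000] * 320, [100] * 320]
crit = lambda f: min(1.0, sum(abs(x) for x in f) / (320 * 8000))
rate = lambda r, rb: int(6000 + 18000 * max([r] + rb))   # stronger trend -> higher rate
enc = lambda f, br: (br, len(f))                         # placeholder encoder call
print(adaptive_encode(frames, crit, rate, enc))
```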
  • In one embodiment, the features of the speech frame to be encoded and the features of the backward speech frame include at least one of the voice start frame feature and the non-speech frame feature. As shown in FIG. 3, the extraction of the voice start frame feature and the non-speech frame feature includes the following steps:
  • Step 302 Acquire a voice frame to be extracted, which is at least one of a voice frame to be encoded and a backward voice frame.
  • Step 304a Perform voice endpoint detection based on the voice frame to be extracted to obtain the voice endpoint detection result.
  • the speech frame to be extracted refers to the speech frame for which the characteristics of the speech frame need to be extracted, and it may be the speech frame to be encoded or the backward speech frame.
  • Voice endpoint detection refers to using a voice activity detection (VAD) algorithm to detect the voice start endpoint in the voice signal, that is, the transition point of the voice signal from 0 to 1.
  • Voice endpoint detection algorithms include decision algorithms based on sub-band signal-to-noise ratio, speech frame decision algorithms based on DNNs (Deep Neural Networks), voice endpoint detection based on short-term energy, dual-threshold voice endpoint detection, and so on.
  • the voice endpoint detection result indicates whether the speech frame to be extracted is a voice endpoint; the result is either that the frame is the voice start endpoint or that it is a non-voice start endpoint.
  • the server uses a voice endpoint detection algorithm to perform voice endpoint detection on the voice frame to be extracted, and obtains the voice endpoint detection result.
  • Step 306a: When the voice endpoint detection result is the voice start endpoint, determine at least one of the following: the voice start frame feature corresponding to the speech frame to be extracted is the first target value, and the non-speech frame feature corresponding to the speech frame to be extracted is the second target value.
  • the voice start endpoint means that the voice frame to be extracted is the start of the voice signal.
  • the first target value is a specific feature value, and its meaning differs across features: when the voice start frame feature takes the first target value, it characterizes the speech frame to be extracted as a voice start frame; when the non-speech frame feature takes the first target value, it characterizes the speech frame to be extracted as a noise frame.
  • the second target value is likewise a specific feature value whose meaning differs across features: when the non-speech frame feature takes the second target value, it characterizes the speech frame to be extracted as a non-noise speech frame; when the voice start frame feature takes the second target value, it characterizes the speech frame to be extracted as a frame that is not a voice start endpoint.
  • for example, the first target value may be 1 and the second target value may be 0.
  • specifically, when the voice endpoint detection result is the voice start endpoint, the voice start frame feature corresponding to the speech frame to be extracted is set to the first target value, and the non-speech frame feature is set to the second target value.
  • alternatively, only one of the two is set, i.e., the voice start frame feature is set to the first target value, or the non-speech frame feature is set to the second target value.
  • Step 308a: When the voice endpoint detection result is a non-voice start endpoint, determine at least one of the following: the voice start frame feature corresponding to the speech frame to be extracted is the second target value, and the non-speech frame feature corresponding to the speech frame to be extracted is the first target value.
  • the non-voice start endpoint means that the speech frame to be extracted is not the start point of the speech signal, that is, the speech frame to be extracted is a noise signal before the speech signal.
  • specifically, the second target value is used as the voice start frame feature corresponding to the speech frame to be extracted and the first target value is used as its non-speech frame feature; alternatively, only one of the two is set.
  • in the above embodiment, voice endpoint detection is performed on the speech frame to be extracted to obtain the voice start frame feature and the non-speech frame feature, which improves efficiency and accuracy.
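  • As a concrete sketch, both features can be derived from a sequence of per-frame VAD decisions produced by any of the endpoint detection algorithms above; the 0/1 flag convention below is an assumption for illustration:

```python
def start_and_nonspeech_features(vad_flags, i):
    """Derive (voice_start_feature, non_speech_feature) for frame i.

    vad_flags: per-frame VAD decisions, 1 = speech, 0 = noise/silence.
    The voice start endpoint is the 0 -> 1 transition; a frame with
    VAD = 0 is treated as a non-speech (noise) frame.
    """
    is_speech = vad_flags[i] == 1
    was_speech = i > 0 and vad_flags[i - 1] == 1
    voice_start = 1 if (is_speech and not was_speech) else 0  # first target value
    non_speech = 0 if is_speech else 1                        # second target value
    return voice_start, non_speech

# Frame 2 is the voice start endpoint in this example.
flags = [0, 0, 1, 1, 1]
print([start_and_nonspeech_features(flags, i) for i in range(len(flags))])
```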
  • the features of the speech frame to be encoded and the features of the backward speech frame include energy change features.
  • the extraction of the energy change feature includes the following steps:
  • Step 302 Acquire a voice frame to be extracted, which is a voice frame to be encoded or a backward voice frame.
  • Step 304b Obtain the forward speech frame corresponding to the speech frame to be extracted, calculate the energy of the frame to be extracted corresponding to the speech frame to be extracted, and calculate the forward frame energy corresponding to the forward speech frame.
  • the forward speech frame refers to the previous frame of the speech frame to be extracted, and is the speech frame that has been acquired before the speech frame to be extracted is acquired.
  • for example, when the speech frame to be extracted is the 8th frame, the forward speech frame is the 7th frame.
  • the frame energy is used to reflect the strength of the speech frame signal.
  • the frame energy to be extracted refers to the frame energy corresponding to the speech frame to be extracted.
  • the forward frame energy refers to the frame energy corresponding to the forward speech frame.
  • the terminal obtains the speech frame to be extracted, the speech frame to be extracted is the speech frame to be encoded or the backward speech frame, the forward speech frame corresponding to the speech frame to be extracted is obtained, and the energy of the frame to be extracted corresponding to the speech frame to be extracted is calculated, At the same time, the forward frame energy corresponding to the forward speech frame is calculated.
  • the energy of the frame to be extracted or the energy of the forward frame can be obtained by calculating the sum of squares of all digital signals in the speech frame to be extracted or the forward speech frame. It is also possible to sample from all digital signals in the speech frame to be extracted or the forward speech frame, and calculate the sum of the squares of the sampled data to obtain the energy of the frame to be extracted or the energy of the forward frame.
  • Step 306b: Calculate the ratio of the energy of the frame to be extracted to the energy of the forward frame, and determine the energy change feature corresponding to the speech frame to be extracted according to the ratio.
  • specifically, the terminal calculates the ratio of the energy of the frame to be extracted to the energy of the forward frame, and determines the energy change feature corresponding to the speech frame to be extracted according to the result.
  • when the ratio is greater than a preset threshold, the frame energy of the speech frame to be extracted has changed significantly compared with the previous frame, and the corresponding energy change feature is 1; when the ratio is not greater than the preset threshold, the energy change relative to the previous frame is small, and the corresponding energy change feature is 0.
  • in one embodiment, the energy change feature can be determined from both the ratio and the energy of the frame to be extracted: when the energy of the frame to be extracted is greater than a preset frame energy and the ratio is greater than the preset threshold, the speech frame to be extracted is a frame with a sudden increase in frame energy, and the corresponding energy change feature is 1.
  • otherwise, the speech frame to be extracted is not a frame with a sudden increase in frame energy, and the corresponding energy change feature is 0.
  • the preset threshold is a preset value, for example a preset multiple that the ratio must exceed.
  • the preset frame energy is a preset frame energy threshold.
  • the energy change feature corresponding to the speech frame to be extracted is determined according to the energy of the frame to be extracted and the energy of the forward frame, which improves the accuracy of obtaining the energy change feature.
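  • The decision rule of this embodiment can be sketched as follows; the threshold values are illustrative assumptions rather than values fixed by the method:

```python
def energy_change_feature(e_cur, e_prev, ratio_threshold=2.0, energy_floor=1e4):
    """Energy change feature: 1 if the current frame's energy jumps
    relative to the previous frame's, else 0.

    ratio_threshold -- assumed preset multiple for a "sudden increase"
    energy_floor    -- assumed preset frame energy threshold
    """
    ratio = e_cur / max(e_prev, 1e-12)  # guard against a silent previous frame
    return 1 if (e_cur > energy_floor and ratio > ratio_threshold) else 0

print(energy_change_feature(e_cur=5e4, e_prev=1e4))    # 1: sudden energy increase
print(energy_change_feature(e_cur=1.2e4, e_prev=1e4))  # 0: small change
```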
  • In one embodiment, calculating the energy of the frame to be extracted corresponding to the speech frame to be extracted includes:
  • performing data sampling on the speech frame to be extracted to obtain the data value of each sample point and the number of samples; calculating the sum of squares of the sample point data values; and calculating the ratio of the sum of squares to the number of samples to obtain the energy of the frame to be extracted.
  • the sample point data value is the data obtained by sampling the voice frame to be extracted.
  • the number of samples refers to the total number of sample data obtained.
  • the terminal performs data sampling on the voice frame to be extracted to obtain the data value of each sample point and the number of samples. Calculate the sum of squares of the data values of each sample point, and then calculate the ratio of the sum of squares to the number of samples, and use the ratio as the frame energy to be extracted.
  • specifically, the following formula (1) can be used to calculate the energy of the frame to be extracted:
  • E = (x(1)^2 + x(2)^2 + ... + x(m)^2) / m   (1)
  • where m is the number of sample points and x(i) is the data value of the i-th sample point.
  • for example, when 20 ms of speech is regarded as one frame and the sampling rate is 16 kHz, 320 sample point data values are obtained per frame.
  • each sample point data value is a 16-bit signed number with a value range of [-32768, 32767].
  • with the data value of the i-th sample point denoted x(i), the frame energy of the frame is calculated as E = (x(1)^2 + x(2)^2 + ... + x(320)^2) / 320.
  • the terminal performs data sampling based on the forward voice frame to obtain the data value of each sample point and the number of samples; calculate the square sum of the data value of each sample point, and calculate the ratio of the square sum to the number of samples to obtain the previous To frame energy.
  • the terminal can use formula (1) to calculate the forward frame energy corresponding to the forward speech frame.
  • in the above embodiment, calculating the frame energy from the sampled data values can improve the efficiency of obtaining the frame energy.
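  • A direct implementation of formula (1), together with the 20 ms / 16 kHz example above (the test tone is made up):

```python
import math

def frame_energy(samples):
    """Frame energy per formula (1): mean of the squared sample values."""
    m = len(samples)
    return sum(x * x for x in samples) / m

# A 20 ms frame at a 16 kHz sampling rate contains 320 sample point values,
# each a 16-bit signed number; here a 200 Hz tone stands in for real speech.
frame = [int(10000 * math.sin(2 * math.pi * 200 * n / 16000)) for n in range(320)]
print(frame_energy(frame))  # reflects the strength of the speech frame signal
```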
  • the feature of the speech frame to be encoded and the feature of the backward speech frame include the feature of the pitch period mutation frame.
  • the extraction of the pitch period mutation frame feature includes the following steps:
  • Step 302 Obtain a voice frame to be extracted, which is a voice frame to be encoded or a backward voice frame;
  • Step 304c Obtain the forward speech frame corresponding to the speech frame to be extracted, detect the pitch period of the speech frame to be extracted and the forward speech frame, and obtain the pitch period to be extracted and the forward pitch period.
  • the pitch period refers to the duration of each opening and closing cycle of the vocal cords.
  • the pitch period to be extracted refers to the pitch period corresponding to the speech frame to be extracted, that is, the pitch period corresponding to the speech frame to be encoded or the pitch period corresponding to the backward speech frame.
  • the terminal obtains a voice frame to be extracted, and the voice frame to be extracted may be a voice frame to be encoded or may be a backward voice frame. Then the forward speech frame corresponding to the speech frame to be extracted is obtained, and the pitch period detection algorithm is used to detect the speech frame to be extracted and the pitch period corresponding to the forward speech frame respectively, to obtain the pitch period and the forward pitch period to be extracted.
  • pitch period detection algorithms can be divided into non-time-based and time-based methods.
  • non-time-based pitch period detection methods include the autocorrelation function method, the average magnitude difference function method, and the cepstrum method.
  • time-based pitch period detection methods include the waveform estimation method, the correlation processing method, and the transformation method.
  • In step 306c, the pitch period change degree is calculated according to the pitch period to be extracted and the forward pitch period, and the pitch period mutation frame feature corresponding to the speech frame to be extracted is determined according to the pitch period change degree.
  • the pitch period change degree is used to reflect the pitch period change degree between the forward speech frame and the speech frame to be extracted.
  • specifically, the terminal calculates the absolute value of the difference between the forward pitch period and the pitch period to be extracted to obtain the pitch period change degree.
  • when the pitch period change degree exceeds a preset period change threshold, the speech frame to be extracted is a pitch period mutation frame, and the obtained pitch period mutation frame feature can be represented by "1".
  • when the pitch period change degree does not exceed the preset period change threshold, the pitch period of the speech frame to be extracted has no mutation compared with the previous frame, and the obtained pitch period mutation frame feature can be represented by "0".
  • the forward pitch period and the pitch period to be extracted are obtained through detection, and the pitch period mutation frame feature is obtained according to the forward pitch period and the pitch period to be extracted, which improves the accuracy of obtaining the pitch period mutation frame feature.
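  • The decision rule can be sketched as follows; the change threshold is an illustrative assumption, and the pitch periods may come from any of the detection methods listed above:

```python
def pitch_mutation_feature(pitch_cur, pitch_prev, change_threshold=20):
    """Pitch period mutation frame feature: 1 if the pitch period changes
    abruptly from the previous frame to the current one, else 0."""
    change = abs(pitch_cur - pitch_prev)  # pitch period change degree
    return 1 if change > change_threshold else 0

print(pitch_mutation_feature(pitch_cur=160, pitch_prev=80))  # 1: mutation frame
print(pitch_mutation_feature(pitch_cur=82, pitch_prev=80))   # 0: no mutation
```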
  • In one embodiment, as shown in FIG. 4, step 204, namely obtaining the criticality of the speech frame to be encoded based on the features of the speech frame to be encoded, includes:
  • Step 402: Determine the forward features from the features of the speech frame to be encoded, and perform a weighted calculation on the forward features to obtain the forward criticality of the speech frame to be encoded. The forward features include at least one of the voice start frame feature, the energy change feature, and the pitch period mutation frame feature.
  • a forward feature is a feature whose value is positively related to the criticality of the speech frame; the more pronounced the forward features, the more critical the speech frame.
  • the forward criticality of the speech frame to be encoded is the criticality obtained from the forward features.
  • specifically, the terminal determines the forward features from the features of the speech frame to be encoded, obtains the preset weight corresponding to each forward feature, performs a weighted calculation on the forward features, and aggregates the weighted results to obtain the forward criticality of the speech frame to be encoded.
  • Step 404: Determine the reverse features from the features of the speech frame to be encoded, and determine the reverse criticality of the speech frame to be encoded according to the reverse features. The reverse features include the non-speech frame feature.
  • a reverse feature is a feature whose value is inversely related to the criticality of the speech frame; the reverse features include the non-speech frame feature.
  • the reverse criticality of the speech frame to be encoded is the criticality obtained from the reverse features.
  • specifically, the terminal determines the reverse features from the features of the speech frame to be encoded and determines the reverse criticality from them. For example, when the non-speech frame feature is 1, the speech frame is noise and its criticality is 0; when the non-speech frame feature is 0, the speech frame is collected speech and its criticality is 1.
  • Step 406: Weight the forward criticality by a preset forward weight and the reverse criticality by a preset reverse weight, and obtain the criticality of the speech frame to be encoded based on the weighted forward criticality and reverse criticality.
  • the preset forward weight is the preset weight of the forward criticality, and the preset reverse weight is the preset weight of the reverse criticality.
  • specifically, the terminal calculates the product of the forward criticality and the preset forward weight, and the product of the reverse criticality and the preset reverse weight, and then adds the two products to obtain the criticality of the speech frame to be encoded. It is also possible, for example, to calculate the product of the forward criticality and the reverse criticality to obtain the criticality of the speech frame to be encoded.
  • specifically, the following formula (2) can be used to calculate the criticality of the speech frame to be encoded:
  • r = (w1*r1 + w2*r2 + w3*r3 + b) * (1 - r4)   (2)
  • where r is the criticality of the speech frame to be encoded; r1 is the voice start frame feature, r2 is the energy change feature, r3 is the pitch period mutation frame feature, and r4 is the non-speech frame feature; w1, w2, and w3 are the preset weights corresponding to r1, r2, and r3, respectively.
  • w1*r1 + w2*r2 + w3*r3 is the forward criticality of the speech frame to be encoded, and (1 - r4) is the reverse criticality.
  • b is a constant and positive number serving as a forward bias. For example, b can be 0.1, and w1, w2, and w3 can all be 0.3.
  • in one embodiment, formula (2) can also be used to calculate the criticality of the backward speech frame from the backward speech frame features. Specifically, the voice start frame feature, the energy change feature, and the pitch period mutation frame feature corresponding to the backward speech frame are weighted to obtain the forward criticality of the backward speech frame; the reverse criticality is determined from the non-speech frame feature corresponding to the backward speech frame; and the criticality of the backward speech frame is obtained from the forward criticality and the reverse criticality.
  • in the above embodiment, the forward criticality and the reverse criticality are calculated separately and then combined into the criticality of the speech frame, which improves the accuracy of obtaining the criticality of the speech frame to be encoded.
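  • Formula (2) with the example constants above can be sketched as follows, assuming, as described, that the forward and reverse criticality are combined by multiplication:

```python
def frame_criticality(r1, r2, r3, r4, w=(0.3, 0.3, 0.3), b=0.1):
    """Criticality per formula (2) with the example constants above.

    r1 -- voice start frame feature      (0 or 1)
    r2 -- energy change feature          (0 or 1)
    r3 -- pitch period mutation feature  (0 or 1)
    r4 -- non-speech frame feature       (0 or 1)
    """
    forward = w[0] * r1 + w[1] * r2 + w[2] * r3 + b  # forward criticality + bias
    reverse = 1 - r4                                 # reverse criticality
    return forward * reverse

print(frame_criticality(1, 1, 1, 0))  # 1.0: maximally critical speech frame
print(frame_criticality(0, 0, 0, 1))  # 0.0: noise frame
```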
  • In one embodiment, obtaining the key trend features based on the criticality of the speech frame to be encoded and the criticality of the backward speech frame, and using the key trend features to determine the encoding bit rate corresponding to the speech frame to be encoded, includes:
  • obtaining the criticality of the forward speech frame; obtaining target key trend features based on the criticality of the forward speech frame, the criticality of the speech frame to be encoded, and the criticality of the backward speech frame; and using the target key trend features to determine the encoding bit rate corresponding to the speech frame to be encoded.
  • the forward speech frame refers to the speech frame that has been coded before the speech frame to be coded.
  • the criticality of the forward voice frame refers to the criticality of the voice frame corresponding to the forward voice frame.
  • specifically, the terminal can obtain the criticality of the forward speech frame, calculate the criticality average over the forward speech frame, the speech frame to be encoded, and the backward speech frame, and calculate the criticality difference among the criticality of the forward speech frame, the criticality of the speech frame to be encoded, and the criticality of the backward speech frame; the target key trend features are then obtained from the criticality average and the criticality difference and used to determine the encoding bit rate corresponding to the speech frame to be encoded.
  • in the above embodiment, the target key trend features are obtained using the criticality of the forward speech frame, the criticality of the speech frame to be encoded, and the criticality of the backward speech frame, and are then used to determine the encoding bit rate corresponding to the speech frame to be encoded, which makes the resulting bit rate more accurate.
  • In one embodiment, as shown in FIG. 5, obtaining the key trend features based on the criticality of the speech frame to be encoded and the criticality of the backward speech frame, and using the key trend features to determine the encoding bit rate corresponding to the speech frame to be encoded, includes:
  • Step 502 Calculate the criticality difference degree and the criticality average degree based on the criticality of the speech frame to be encoded and the criticality of the backward speech frame.
  • the degree of criticality difference is used to reflect the criticality difference between the backward speech frame and the speech frame to be encoded.
  • the criticality average degree is used to reflect the criticality average of the speech frame to be encoded and the backward speech frame.
  • specifically, the server performs statistical calculations on the criticality of the speech frame to be encoded and the criticality of the backward speech frame: it calculates their average to obtain the criticality average degree, and combines the differences between the criticality of the backward speech frames and the criticality of the speech frame to be encoded to obtain the criticality difference degree.
  • Step 504 Calculate the encoding bit rate corresponding to the speech frame to be encoded according to the degree of criticality difference and the average degree of criticality.
  • In one embodiment, preset bit rate calculation functions are obtained, and the encoding bit rate corresponding to the speech frame to be encoded is calculated from the criticality difference degree and the criticality average degree using those functions.
  • a bit rate calculation function is used to calculate a bit rate; it is a monotonically increasing function and can be customized according to the needs of the application scenario.
  • in one embodiment, one bit rate can be calculated with the calculation function corresponding to the criticality difference degree and another with the calculation function corresponding to the criticality average degree, and the sum of the two is used as the encoding bit rate corresponding to the speech frame to be encoded.
  • the same calculation function can also be used for both the criticality difference degree and the criticality average degree, with the sum of the resulting bit rates again giving the encoding bit rate.
  • in the above embodiment, the criticality difference degree and the criticality average degree between the backward speech frames and the speech frame to be encoded are obtained by calculation, and the encoding bit rate corresponding to the speech frame to be encoded is calculated from them, which makes the obtained bit rate more accurate.
  • In one embodiment, as shown in FIG. 6, step 502, namely calculating the criticality difference degree based on the criticality of the speech frame to be encoded and the criticality of the backward speech frame, includes:
  • Step 602: Weight the criticality of the speech frame to be encoded by a preset first weight to obtain a first weighted value, and weight the criticality of each backward speech frame by its preset second weight to obtain second weighted values.
  • the preset first weight is the preset weight corresponding to the criticality of the speech frame to be encoded.
  • the preset second weight is the weight corresponding to the criticality of a backward speech frame; each backward speech frame has a corresponding criticality, and each such criticality has a corresponding weight.
  • the first weighted value is the value obtained by weighting the criticality of the speech frame to be encoded, and a second weighted value is the value obtained by weighting the criticality of a backward speech frame.
  • specifically, the terminal calculates the product of the criticality of the speech frame to be encoded and the preset first weight to obtain the first weighted value, and the product of the criticality of each backward speech frame and its preset second weight to obtain the second weighted values.
  • Step 604: Calculate a target weighted value based on the first weighted value and the second weighted values, and calculate the difference between the target weighted value and the criticality of the speech frame to be encoded to obtain the criticality difference degree.
  • the target weighted value is the sum of the first weighted value and the second weighted values.
  • specifically, the terminal calculates the sum of the first weighted value and the second weighted values to obtain the target weighted value, then calculates the difference between the target weighted value and the criticality of the speech frame to be encoded, and uses the difference as the criticality difference degree.
  • specifically, the following formula (3) can be used to calculate the criticality difference degree:
  • ΔR(i) = a0*r(i) + a1*r(i+1) + ... + a(N-1)*r(i+N-1) - r(i)   (3)
  • where ΔR(i) is the criticality difference degree, and N is the total number of the speech frame to be encoded and its backward speech frames; r(i) is the criticality of the speech frame to be encoded, r(i+j) is the criticality of the j-th backward speech frame, and a0, ..., a(N-1) are the preset weights; the weighted sum a0*r(i) + a1*r(i+1) + ... + a(N-1)*r(i+N-1) is the target weighted value.
  • the preset second weights corresponding to the backward speech frames may be the same or different; a_j may take a larger value for larger j. For example, when there are 3 backward speech frames, N is 4, and a0 can be 0.1, a1 can be 0.2, a2 can be 0.3, and a3 can be 0.4.
  • in the above embodiment, the target weighted value is calculated first, and the criticality difference degree is then obtained from the target weighted value and the criticality of the speech frame to be encoded, which improves the accuracy of obtaining the criticality difference degree.
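  • A direct sketch of formula (3) with the example weights above:

```python
def criticality_difference(r_cur, r_back, weights=(0.1, 0.2, 0.3, 0.4)):
    """Criticality difference degree per formula (3).

    r_cur   -- criticality of the speech frame to be encoded
    r_back  -- criticalities of the backward speech frames
    weights -- preset weights a_0..a_{N-1}; the example values grow with j
    """
    crits = [r_cur] + list(r_back)
    target = sum(a * r for a, r in zip(weights, crits))  # target weighted value
    return target - r_cur

print(criticality_difference(0.2, [0.4, 0.6, 0.8]))  # 0.4: rising criticality
```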
  • In one embodiment, step 502, namely calculating the criticality average degree based on the criticality of the speech frame to be encoded and the criticality of the backward speech frame, includes:
  • obtaining the number of frames, i.e., the total number of the speech frame to be encoded and the backward speech frames; for example, when there are 3 backward speech frames, the number of frames is 4.
  • specifically, the terminal obtains the number of frames, counts the sum of the criticality of the speech frame to be encoded and the criticality of the backward speech frames to obtain the comprehensive criticality, and then calculates the ratio of the comprehensive criticality to the number of frames to obtain the criticality average degree.
  • specifically, the following formula (4) can be used to calculate the criticality average degree:
  • R_avg(i) = (r(i) + r(i+1) + ... + r(i+N-1)) / N   (4)
  • where N is the total number of the speech frame to be encoded and its backward speech frames, r(i) is the criticality of the speech frame to be encoded, and r(i+j) is the criticality of the j-th backward speech frame.
  • in the above embodiment, the criticality average degree is calculated from the number of frames and the comprehensive criticality, which improves the accuracy of obtaining the criticality average degree.
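  • A direct sketch of formula (4):

```python
def criticality_average(r_cur, r_back):
    """Criticality average degree per formula (4): comprehensive criticality
    divided by the total number of frames N."""
    crits = [r_cur] + list(r_back)
    return sum(crits) / len(crits)

print(criticality_average(0.2, [0.4, 0.6, 0.8]))  # 0.5
```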
  • In one embodiment, as shown in FIG. 7, step 504, namely calculating the encoding bit rate corresponding to the speech frame to be encoded according to the criticality difference degree and the criticality average degree, includes:
  • Step 702: Obtain a first bit rate calculation function and a second bit rate calculation function.
  • Step 704: Calculate a first bit rate using the criticality average degree and the first bit rate calculation function, calculate a second bit rate using the criticality difference degree and the second bit rate calculation function, and determine an integrated bit rate from the first bit rate and the second bit rate, where the first bit rate is proportional to the criticality average degree and the second bit rate is proportional to the criticality difference degree.
  • the first bit rate calculation function is a preset function that calculates a bit rate from the criticality average degree, and the second bit rate calculation function is a preset function that calculates a bit rate from the criticality difference degree.
  • both bit rate calculation functions can be set according to the specific needs of the application scenario.
  • the first code rate refers to the code rate calculated by using the first code rate calculation function.
  • the second code rate refers to the code rate calculated by using the second code rate calculation function.
  • the integrated code rate refers to the code rate obtained by integrating the first code rate and the second code rate. For example, the sum of the first code rate and the second code rate can be calculated, and the sum is used as the integrated code rate.
  • the terminal obtains the preset first code rate calculation function and second code rate calculation function, calculates the first code rate from the criticality average degree and the second code rate from the criticality difference degree, and then calculates the sum of the first code rate and the second code rate, using the sum as the integrated code rate.
  • formula (5) can be used to calculate the integrated code rate:
  •   bitrate_c(i) = f_1(R̄(i)) + f_2(ΔR(i))        formula (5)
  • where R̄(i) is the criticality average degree, ΔR(i) is the criticality difference degree, f_1() is the first code rate calculation function, and f_2() is the second code rate calculation function; f_1(R̄(i)) yields the first code rate and f_2(ΔR(i)) yields the second code rate.
  • formula (6) can be used as the first code rate calculation function and formula (7) as the second code rate calculation function; both are monotonically increasing functions whose exact closed forms appear only as images in the original filing and are not recoverable here.
  • p_0, c_0, b_0, p_1, c_1 and b_1 are the constants of formulas (6) and (7); all are constants and positive numbers. A sketch with assumed functional forms follows.
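Because the closed forms of formulas (6) and (7) survive only as images in the filing, the sketch below substitutes assumed monotonically increasing functions (a power form for f_1, an affine form for f_2) with illustrative positive constants standing in for p_0, c_0, b_0, p_1 and b_1; treat both the functional forms and the constants purely as placeholders:

```python
def f1(avg, p0=12_000.0, c0=1.0, b0=6_000.0):
    """Assumed stand-in for formula (6): p0 * avg**c0 + b0, increasing in avg."""
    return p0 * avg ** c0 + b0

def f2(diff, p1=8_000.0, b1=200.0):
    """Assumed stand-in for formula (7): p1 * diff + b1, increasing in diff."""
    return p1 * diff + b1

def integrated_rate(avg, diff):
    """Formula (5): integrated code rate = first code rate + second code rate."""
    return f1(avg) + f2(diff)

print(integrated_rate(0.5, 0.4))  # 15400.0 bits/s with the assumed constants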
  • Step 706 Obtain a preset code rate upper limit value and a preset code rate lower limit value, and determine an encoding code rate based on the preset code rate upper limit value, the preset code rate lower limit value and the integrated code rate.
  • the preset code rate upper limit refers to the preset maximum value of the voice frame encoding code rate
  • the preset code rate lower limit refers to the preset minimum value of the voice frame encoding code rate.
  • the terminal obtains the preset code rate upper limit and the preset code rate lower limit, compares them with the integrated code rate, and determines the final encoding code rate according to the comparison result.
  • in this embodiment, the first code rate and the second code rate are calculated with the first and second code rate calculation functions, and the integrated code rate is then obtained from them, which improves the accuracy of the integrated code rate; the encoding code rate is finally determined from the preset code rate upper limit, the preset code rate lower limit and the integrated code rate, so that the obtained encoding code rate is more accurate.
  • step 706, that is, determining the encoding code rate based on the preset upper limit of the code rate, the preset lower limit of the code rate, and the integrated code rate, includes:
  • the terminal compares the preset code rate upper limit with the integrated code rate. When the integrated code rate is less than the preset code rate upper limit, the integrated code rate does not exceed the upper limit; the terminal then compares the preset code rate lower limit with the integrated code rate, and when the integrated code rate is greater than the preset code rate lower limit, the integrated code rate is used directly as the encoding code rate.
  • when the integrated code rate is greater than the preset code rate upper limit, it exceeds the upper limit, and the preset code rate upper limit is used as the encoding code rate.
  • when the integrated code rate is less than the preset code rate lower limit, it does not reach the lower limit, and the preset code rate lower limit is used as the encoding code rate.
  • formula (8) can be used to obtain the encoding code rate:
  •   bitrate(i) = min(max(bitrate_c(i), min_bitrate), max_bitrate)        formula (8)
  • where max_bitrate refers to the preset code rate upper limit, min_bitrate refers to the preset code rate lower limit, bitrate_c(i) is the integrated code rate, and bitrate(i) represents the encoding code rate of the speech frame to be encoded. A one-line clamp implementing this follows.
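Formula (8) reduces to a simple clamp; the helper name and the limit values of 6,000 and 32,000 bits/s are illustrative assumptions:

```python
def clamp_rate(integrated, min_bitrate=6_000, max_bitrate=32_000):
    """Formula (8): keep the encoding code rate inside [min_bitrate, max_bitrate]."""
    return max(min_bitrate, min(integrated, max_bitrate))

print(clamp_rate(15_400))  # 15400 -- already inside the preset range
print(clamp_rate(40_000))  # 32000 -- capped at the preset upper limit
print(clamp_rate(1_000))   # 6000  -- raised to the preset lower limit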
  • in this embodiment, the encoding code rate is determined from the preset code rate upper limit, the preset code rate lower limit and the integrated code rate, which ensures that the encoding code rate of each speech frame stays within the preset code rate range and guarantees the overall speech coding quality.
  • step 210 that is, encoding the to-be-encoded speech frame according to the encoding rate to obtain the encoding result, includes:
  • the encoding rate is passed to the standard encoder through the interface to obtain the encoding result.
  • the standard encoder is used to encode the to-be-encoded speech frame using the encoding rate.
  • the standard encoder is used to perform speech encoding on the speech frame to be encoded.
  • the interface refers to the external interface of the standard encoder, which is used to control the encoding rate.
  • the terminal transmits the encoding code rate to the standard encoder through the interface; when the standard encoder receives the encoding code rate, it obtains the corresponding speech frame to be encoded and encodes it with the encoding code rate to obtain the encoding result, thereby ensuring an accurate and error-free standard encoding result. A minimal sketch of such an interface follows.
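The filing does not name a concrete codec, so the sketch below wraps a hypothetical StandardEncoder whose set_bitrate method stands in for the external rate-control interface described above; both names are placeholders rather than a real library API:

```python
class StandardEncoder:
    """Placeholder for any rate-controllable speech codec (e.g. a CELP-family coder)."""

    def __init__(self):
        self.bitrate = None

    def set_bitrate(self, bitrate):
        # The "external interface" of the text: adjust the coding rate per frame.
        self.bitrate = bitrate

    def encode(self, frame):
        # A real codec would compress the PCM frame here; we only record
        # what would happen so the control flow can be exercised.
        return {"bitrate": self.bitrate, "payload_len": len(frame)}

encoder = StandardEncoder()
encoder.set_bitrate(15_400)         # rate chosen by the criticality analysis
result = encoder.encode([0] * 320)  # one 20 ms frame at 16 kHz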
  • a speech coding method is provided, specifically:
  • after the speech frame to be encoded and the backward speech frames corresponding to it are obtained, the to-be-encoded speech frame criticality and the backward speech frame criticality of each backward speech frame are calculated in parallel.
  • obtaining the criticality of the speech frame to be coded corresponding to the speech frame to be coded includes the following steps:
  • Step 802 Perform voice endpoint detection based on the voice frame to be encoded to obtain a voice endpoint detection result, and determine the voice start frame feature corresponding to the voice frame to be encoded and the non-voice frame feature corresponding to the voice frame to be encoded according to the voice endpoint detection result.
  • Step 804: Obtain the forward speech frame corresponding to the speech frame to be encoded, calculate the to-be-encoded frame energy corresponding to the speech frame to be encoded, calculate the forward frame energy corresponding to the forward speech frame, calculate the ratio of the to-be-encoded frame energy to the forward frame energy, and determine the energy change feature corresponding to the speech frame to be encoded according to the ratio result.
  • Step 806: Detect the pitch periods of the speech frame to be encoded and the forward speech frame to obtain the to-be-encoded pitch period and the forward pitch period, calculate the pitch period change degree according to the to-be-encoded pitch period and the forward pitch period, and determine, according to the pitch period change degree, the pitch period mutation frame feature corresponding to the speech frame to be encoded.
  • Step 808: Determine the forward to-be-encoded speech frame features from the to-be-encoded speech frame features, and perform a weighted calculation on the forward to-be-encoded speech frame features to obtain the forward to-be-encoded speech frame criticality.
  • Step 810: Determine the reverse to-be-encoded speech frame feature from the to-be-encoded speech frame features, and determine the reverse to-be-encoded speech frame criticality according to the reverse to-be-encoded speech frame feature.
  • Step 812: Obtain the to-be-encoded speech frame criticality corresponding to the speech frame to be encoded based on the forward to-be-encoded speech frame criticality and the reverse to-be-encoded speech frame criticality.
  • obtaining the criticality of the backward speech frame corresponding to the backward speech frame includes the following steps:
  • Step 902 Perform voice endpoint detection based on the backward voice frame to obtain a voice endpoint detection result, and determine the voice start frame feature corresponding to the backward voice frame and the non-voice frame feature corresponding to the backward voice frame according to the voice endpoint detection result.
  • Step 904: Obtain the forward speech frame corresponding to the backward speech frame, calculate the backward frame energy corresponding to the backward speech frame, calculate the forward frame energy corresponding to the forward speech frame, calculate the ratio of the backward frame energy to the forward frame energy, and determine the energy change feature corresponding to the backward speech frame according to the ratio result.
  • Step 906: Detect the pitch periods of the backward speech frame and the forward speech frame to obtain the backward pitch period and the forward pitch period, calculate the pitch period change degree according to the backward pitch period and the forward pitch period, and determine, according to the pitch period change degree, the pitch period mutation frame feature corresponding to the backward speech frame.
  • Step 908 Perform weighted calculation on the voice start frame feature, energy change feature, and pitch period mutation frame feature corresponding to the backward voice frame to obtain the forward criticality corresponding to the backward voice frame.
  • Step 910 Determine the reverse criticality corresponding to the backward speech frame according to the characteristics of the non-speech frame corresponding to the backward speech frame.
  • Step 912: Obtain, based on the forward criticality and the reverse criticality, the backward speech frame criticality corresponding to the backward speech frame.
  • calculating the encoding rate corresponding to the voice frame to be encoded includes the following steps:
  • Step 1002: Calculate the first weighted value of the to-be-encoded speech frame criticality and the preset first weight, and calculate the second weighted value of the backward speech frame criticality and the preset second weight.
  • Step 1004: Calculate the target weighted value based on the first weighted value and the second weighted value, and calculate the difference between the target weighted value and the to-be-encoded speech frame criticality to obtain the criticality difference degree.
  • Step 1006: Obtain the frame number of the speech frame to be encoded and the backward speech frames, sum the to-be-encoded speech frame criticality and the backward speech frame criticality to obtain the comprehensive criticality, and calculate the ratio of the comprehensive criticality to the frame number to obtain the criticality average degree.
  • Step 1008 Obtain the first code rate calculation function and the second code rate calculation function.
  • Step 1010: Use the criticality average degree and the first code rate calculation function to calculate the first code rate, use the criticality difference degree and the second code rate calculation function to calculate the second code rate, and determine the integrated code rate according to the first code rate and the second code rate.
  • Step 1012 Compare the upper limit of the preset code rate with the integrated code rate, and when the integrated code rate is less than the upper limit of the preset code rate, compare the lower limit of the preset code rate with the integrated code rate.
  • Step 1014 When the integrated code rate is greater than the preset lower limit of the code rate, the integrated code rate is used as the encoding code rate.
  • Step 1016: Pass the encoding code rate into the standard encoder through the interface to obtain the encoding result; the standard encoder encodes the speech frame to be encoded with the encoding code rate. Finally, the obtained encoding result is saved. An end-to-end sketch of this recap follows.
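Putting the recap together, a top-level loop might look like the following sketch; every helper it calls refers to the illustrative functions defined earlier in this section, and the window of 3 backward frames mirrors the running example:

```python
def encode_stream(frames, criticality, encoder, lookahead=3):
    """Encode each frame at a rate adapted to its criticality trend.

    frames      -- list of PCM frames (each a sequence of sample values)
    criticality -- function mapping a frame index to its criticality r(i)
    encoder     -- object exposing set_bitrate()/encode() as sketched above
    lookahead   -- number of backward (future) frames used for the trend
    """
    results = []
    for i in range(len(frames) - lookahead):
        r_cur = criticality(i)
        r_back = [criticality(i + j) for j in range(1, lookahead + 1)]
        diff, avg = criticality_trend_features(r_cur, r_back)
        rate = clamp_rate(integrated_rate(avg, diff))
        encoder.set_bitrate(rate)                  # adjust via the external interface
        results.append(encoder.encode(frames[i]))  # then encode at that rate
    return results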
  • This application also provides an application scenario, which applies the above-mentioned speech coding method.
  • the application of the speech encoding method in this application scenario is as follows: as shown in FIG. 11, which is a schematic flowchart of audio broadcasting, when the announcer is broadcasting, the microphone collects the audio signal of the broadcast, and a multi-frame speech signal is read from the audio signal, the multi-frame speech signal including the current speech frame to be encoded and 3 backward speech frames.
  • the multi-frame speech criticality analysis is then performed. Specifically, the to-be-encoded speech frame feature corresponding to the speech frame to be encoded is extracted, and the to-be-encoded speech frame criticality is obtained based on it.
  • the backward speech frame features corresponding to the 3 backward speech frames are extracted respectively, and the backward speech frame criticality corresponding to each backward speech frame is obtained based on the backward speech frame features.
  • the criticality trend feature is obtained based on the to-be-encoded speech frame criticality and the criticality of each backward speech frame, and the criticality trend feature is used to determine the encoding code rate corresponding to the speech frame to be encoded.
  • the encoding code rate is then set, that is, the code rate of the standard encoder is adjusted, through the external interface, to the encoding code rate corresponding to the speech frame to be encoded.
  • the standard encoder encodes the current speech frame to be encoded with that encoding code rate to obtain code stream data, which is stored; during playback, the code stream data is decoded to obtain the audio signal, which is played through the speaker, making the broadcast sound clearer.
  • This application also provides an application scenario, which applies the above-mentioned speech coding method.
  • the application of the speech encoding method in this application scenario is as follows: as shown in Figure 12, which is an application scenario diagram for voice communication, the scenario includes a terminal 1202, a server 1204 and a terminal 1206, where the terminal 1202 is connected to the server 1204 through the network, and the server 1204 is connected to the terminal 1206 through the network.
  • when user A sends a voice message to user B's terminal 1206 through the communication application in the terminal 1202, the terminal 1202 collects user A's speech signal, obtains the speech frame to be encoded and the backward speech frames from the speech signal, extracts the to-be-encoded speech frame feature and obtains the to-be-encoded speech frame criticality based on it, and extracts the backward speech frame features and obtains the backward speech frame criticality based on them. The criticality trend feature is then acquired based on the to-be-encoded speech frame criticality and the backward speech frame criticality, the encoding code rate corresponding to the speech frame to be encoded is determined by using the criticality trend feature, and the speech frame to be encoded is encoded with that code rate to obtain code stream data.
  • the code stream data is sent to the terminal 1206 through the server 1204.
  • when user B plays the voice message through the communication application in the terminal 1206, the code stream data is decoded to obtain the corresponding speech signal, which is played through the speaker.
  • since the speech encoding quality is improved, the voice heard by user B is clearer, and network bandwidth resources are saved.
  • This application also provides an application scenario, which applies the above-mentioned speech coding method.
  • the application of the speech encoding method in this application scenario is as follows: during meeting recording, the meeting audio signal is collected through a microphone, and the speech frame to be encoded and 5 backward speech frames are obtained from the meeting audio signal. The to-be-encoded speech frame feature corresponding to the speech frame to be encoded is extracted, and the to-be-encoded speech frame criticality is obtained based on it; the backward speech frame feature corresponding to each backward speech frame is extracted, and the backward speech frame criticality corresponding to each backward speech frame is obtained based on the backward speech frame features. The criticality trend feature is acquired based on the to-be-encoded speech frame criticality and each backward speech frame criticality, the encoding code rate corresponding to the speech frame to be encoded is determined by using the criticality trend feature, and the speech frame to be encoded is encoded with that code rate to obtain code stream data, which is saved to a designated server address. Since the encoding code rate can be regulated, the overall code rate can be reduced, saving the server's storage resources. When other meeting users later want to review the meeting content, the saved code stream data can be obtained from the server address and decoded to obtain the meeting audio signal, which can then be played, so that meeting users or other users can conveniently hear the meeting content.
  • a speech coding apparatus 1300 is provided.
  • the apparatus may be implemented as a software module or a hardware module, or a combination of the two, and may form a part of a computer device.
  • the apparatus specifically includes: a speech frame acquiring module 1302, a first criticality calculation module 1304, a second criticality calculation module 1306, a code rate calculation module 1308 and an encoding module 1310, where:
  • the speech frame obtaining module 1302 is used to obtain the speech frame to be encoded and the backward speech frame corresponding to the speech frame to be encoded;
  • the first criticality calculation module 1304 is configured to extract the characteristics of the voice frame to be encoded corresponding to the voice frame to be encoded, and obtain the criticality of the voice frame to be encoded corresponding to the voice frame to be encoded based on the characteristics of the voice frame to be encoded;
  • the second criticality calculation module 1306 is configured to extract the backward speech frame characteristics corresponding to the backward speech frame, and obtain the backward speech frame criticality corresponding to the backward speech frame based on the backward speech frame characteristics;
  • the code rate calculation module 1308 is used to acquire the criticality trend feature based on the to-be-encoded speech frame criticality and the backward speech frame criticality, and to use the criticality trend feature to determine the encoding code rate corresponding to the speech frame to be encoded;
  • the encoding module 1310 is used to encode the to-be-encoded speech frame according to the encoding bit rate to obtain an encoding result.
  • the feature of the speech frame to be encoded and the feature of the backward speech frame include at least one of a speech start frame feature and a non-speech frame feature, and the speech encoding device 1300 further includes: a first feature extraction module, configured to acquire a speech frame to be extracted, the speech frame to be extracted being the speech frame to be encoded or the backward speech frame, and to perform voice endpoint detection based on the speech frame to be extracted to obtain a voice endpoint detection result;
  • when the voice endpoint detection result is a speech start endpoint, it is determined that at least one of the following holds: the speech start frame feature corresponding to the speech frame to be extracted is the first target value, and the non-speech frame feature corresponding to the speech frame to be extracted is the second target value;
  • when the voice endpoint detection result is a non-speech start endpoint, it is determined that at least one of the following holds: the speech start frame feature corresponding to the speech frame to be extracted is the second target value, and the non-speech frame feature corresponding to the speech frame to be extracted is the first target value. A sketch of this flag assignment follows.
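Assuming a VAD front end that reports whether a frame is a speech start endpoint, the two flags can be assigned as follows, taking the first target value as 1 and the second as 0 in line with the example values used in the description:

```python
def endpoint_features(is_speech_start_endpoint):
    """Map a VAD result to (speech_start_frame_feature, non_speech_frame_feature)."""
    if is_speech_start_endpoint:
        return 1, 0  # start feature = first target value, non-speech = second
    return 0, 1      # start feature = second target value, non-speech = first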
  • the feature of the speech frame to be encoded and the feature of the backward speech frame include an energy change feature, and the speech encoding device 1300 further includes: a second feature extraction module, configured to acquire a speech frame to be extracted, the speech frame to be extracted being the speech frame to be encoded or the backward speech frame; acquire the forward speech frame corresponding to the speech frame to be extracted, calculate the to-be-extracted frame energy corresponding to the speech frame to be extracted, and calculate the forward frame energy corresponding to the forward speech frame; and calculate the ratio of the to-be-extracted frame energy to the forward frame energy, and determine the energy change feature corresponding to the speech frame to be extracted according to the ratio result.
  • the speech encoding device 1300 further includes: a frame energy calculation module, configured to perform data sampling based on the speech frame to be extracted to obtain each sample point data value and the number of sample points, calculate the sum of squares of the sample point data values, and calculate the ratio of the sum of squares to the number of sample points to obtain the to-be-extracted frame energy. A sketch of the frame energy and the derived energy change feature follows.
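A sketch of the frame energy (the mean of squared sample data values, per formula (1)) and of the derived energy change feature follows; the filing's example uses 20 ms frames at a 16 kHz sampling rate, giving 320 sample values in [-32768, 32767] per frame, while the surge thresholds below are our illustrative assumptions:

```python
def frame_energy(samples):
    """Formula (1): sum of squared sample data values divided by the sample count."""
    return sum(x * x for x in samples) / len(samples)

def energy_change_feature(cur_frame, prev_frame,
                          ratio_threshold=4.0, energy_threshold=1e5):
    """Return 1 when the frame's energy surges relative to its forward frame."""
    e_cur = frame_energy(cur_frame)
    e_prev = frame_energy(prev_frame)
    ratio = e_cur / max(e_prev, 1e-9)  # guard against an all-zero previous frame
    return 1 if (e_cur > energy_threshold and ratio > ratio_threshold) else 0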
  • the feature of the speech frame to be encoded and the feature of the backward speech frame include a pitch period mutation frame feature, and the speech encoding device 1300 further includes: a third feature extraction module, configured to acquire a speech frame to be extracted, the speech frame to be extracted being the speech frame to be encoded or the backward speech frame; acquire the forward speech frame corresponding to the speech frame to be extracted, and detect the pitch periods of the speech frame to be extracted and the forward speech frame to obtain the to-be-extracted pitch period and the forward pitch period; and calculate the pitch period change degree according to the to-be-extracted pitch period and the forward pitch period, and determine, according to the pitch period change degree, the pitch period mutation frame feature corresponding to the speech frame to be extracted. A pitch-detection sketch follows.
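The autocorrelation function method is one of the pitch period detection algorithms named in the description; the sketch below pairs a bare-bones autocorrelation search with an assumed mutation threshold (the lag range and threshold are illustrative, not values from the filing):

```python
def pitch_period(samples, min_lag=32, max_lag=320):
    """Crude autocorrelation pitch estimate, in samples, for 16 kHz audio."""
    limit = min(max_lag, len(samples) // 2)   # keep enough overlap for the sum
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, limit):
        score = sum(samples[n] * samples[n - lag] for n in range(lag, len(samples)))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def pitch_mutation_feature(cur_frame, prev_frame, threshold=20):
    """Return 1 when the pitch period jumps by more than `threshold` samples."""
    change = abs(pitch_period(cur_frame) - pitch_period(prev_frame))
    return 1 if change > threshold else 0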
  • the first criticality calculation module 1304 includes: a forward calculation unit, configured to determine the forward to-be-encoded speech frame features from the to-be-encoded speech frame features and perform a weighted calculation on them to obtain the forward to-be-encoded speech frame criticality, the forward to-be-encoded speech frame features including at least one of a speech start frame feature, an energy change feature and a pitch period mutation frame feature; a reverse calculation unit, configured to determine the reverse to-be-encoded speech frame feature from the to-be-encoded speech frame features and determine the reverse to-be-encoded speech frame criticality according to it, the reverse to-be-encoded speech frame feature including a non-speech frame feature; and a criticality calculation unit, configured to obtain the to-be-encoded speech frame criticality corresponding to the speech frame to be encoded based on the forward to-be-encoded speech frame criticality and the reverse to-be-encoded speech frame criticality. The weighted combination is illustrated below.
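Formula (2) of the filing, r = b + (1 - r4)*(w1*r1 + w2*r2 + w3*r3), combines the three forward features with the non-speech flag r4; using the concrete constants quoted in the filing (b = 0.1 and w1 = w2 = w3 = 0.3), the computation is:

```python
def frame_criticality(start_flag, energy_flag, pitch_flag, non_speech_flag,
                      b=0.1, w=(0.3, 0.3, 0.3)):
    """Formula (2): r = b + (1 - r4) * (w1*r1 + w2*r2 + w3*r3)."""
    positive = w[0] * start_flag + w[1] * energy_flag + w[2] * pitch_flag
    return b + (1 - non_speech_flag) * positive

print(round(frame_criticality(1, 1, 0, 0), 2))  # 0.7: speech onset with an energy surge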
  • the code rate calculation module 1308 includes: a degree calculation unit, configured to calculate the degree of criticality difference and the average degree of criticality based on the criticality of the speech frame to be encoded and the criticality of the backward speech frame;
  • the rate obtaining unit is configured to calculate the encoding rate corresponding to the speech frame to be encoded according to the degree of criticality difference and the average degree of criticality.
  • the degree calculation unit is further configured to calculate the first weighted value of the to-be-encoded speech frame criticality and the preset first weight, and calculate the second weighted value of the backward speech frame criticality and the preset second weight; and to calculate the target weighted value based on the first weighted value and the second weighted value, and calculate the difference between the target weighted value and the to-be-encoded speech frame criticality to obtain the criticality difference degree.
  • the degree calculation unit is further configured to obtain the frame number of the speech frame to be encoded and the backward speech frames, sum the to-be-encoded speech frame criticality and the backward speech frame criticality to obtain the comprehensive criticality, and calculate the ratio of the comprehensive criticality to the frame number to obtain the criticality average degree.
  • the code rate obtaining unit is further configured to obtain the first code rate calculation function and the second code rate calculation function; calculate the first code rate by using the criticality average degree and the first code rate calculation function, calculate the second code rate by using the criticality difference degree and the second code rate calculation function, and determine the integrated code rate according to the first code rate and the second code rate, where the first code rate is proportional to the criticality average degree and the second code rate is proportional to the criticality difference degree; and obtain the preset code rate upper limit and the preset code rate lower limit, and determine the encoding code rate based on the preset code rate upper limit, the preset code rate lower limit and the integrated code rate.
  • the code rate obtaining unit is further configured to compare the preset code rate upper limit with the integrated code rate; when the integrated code rate is less than the preset code rate upper limit, compare the preset code rate lower limit with the integrated code rate; and when the integrated code rate is greater than the preset code rate lower limit, use the integrated code rate as the encoding code rate.
  • the encoding module 1310 is further configured to pass the encoding rate into a standard encoder through an interface to obtain an encoding result, and the standard encoder is used to perform the encoding on the to-be-encoded speech frame using the encoding rate. coding.
  • Each module in the above-mentioned speech coding device can be implemented in whole or in part by software, hardware, and a combination thereof.
  • the foregoing modules may be embedded, in hardware form, in or independent of the processor of the computer device, or may be stored in software form in the memory of the computer device, so that the processor can invoke and execute the operations corresponding to the foregoing modules.
  • a computer device is provided.
  • the computer device may be a terminal, and its internal structure diagram may be as shown in FIG. 14.
  • the computer equipment includes a processor, a memory, a communication interface, a display screen, an input device and a recording device connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and computer readable instructions.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the communication interface of the computer device is used to communicate with an external terminal in a wired or wireless manner, and the wireless manner can be implemented through WIFI, an operator's network, NFC (near field communication) or other technologies.
  • the computer-readable instructions are executed by the processor to realize a speech coding method.
  • the display screen of the computer device can be a liquid crystal display or an electronic ink display, and the input device of the computer device can be a touch layer covering the display screen, a button, trackball or touchpad set on the housing of the computer device, or an external keyboard, touchpad or mouse.
  • the voice collection device of the computer equipment may be a microphone.
  • FIG. 14 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied; the specific computer device may include more or fewer parts than shown in the figure, combine some parts, or have a different arrangement of parts.
  • a computer device is provided, including a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to implement the steps in the foregoing method embodiments.
  • one or more non-volatile storage media storing computer-readable instructions are provided; when the computer-readable instructions are executed by one or more processors, the one or more processors implement the steps in the foregoing method embodiments.
  • a computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instruction from the computer-readable storage medium, and the processor executes the computer instruction, so that the computer device executes the steps in the foregoing method embodiments.
  • Non-volatile memory may include read-only memory (Read-Only Memory, ROM), magnetic tape, floppy disk, flash memory, or optical storage.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM may be in various forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Quality & Reliability (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

A speech encoding method and apparatus, a computer device and a storage medium. The method includes: acquiring a speech frame to be encoded and a backward speech frame corresponding to the speech frame to be encoded (step 202); extracting a to-be-encoded speech frame feature corresponding to the speech frame to be encoded, and obtaining, based on the to-be-encoded speech frame feature, a to-be-encoded speech frame criticality corresponding to the speech frame to be encoded (step 204); extracting a backward speech frame feature corresponding to the backward speech frame, and obtaining, based on the backward speech frame feature, a backward speech frame criticality corresponding to the backward speech frame (step 206); acquiring a criticality trend feature based on the to-be-encoded speech frame criticality and the backward speech frame criticality, and determining, by using the criticality trend feature, an encoding code rate corresponding to the speech frame to be encoded (step 208); and encoding the speech frame to be encoded according to the encoding code rate to obtain an encoding result (step 210).


Claims (20)

  1. 一种语音编码方法,其特征在于,由计算机设备执行,所述方法包括:
    获取待编码语音帧及与所述待编码语音帧对应的后向语音帧;
    提取所述待编码语音帧对应的待编码语音帧特征,基于所述待编码语音帧特征得到所述待编码语音帧对应的待编码语音帧关键性;
    提取所述后向语音帧对应的后向语音帧特征,基于所述后向语音帧特征得到所述后向语音帧对应的后向语音帧关键性;
    基于所述待编码语音帧关键性和所述后向语音帧关键性获取关键性趋势特征,使用所述关键性趋势特征确定所述待编码语音帧对应的编码码率,其中,通过所述关键性趋势特征表征的关键性趋势的强弱自适应的控制各个待编码语音帧对应的编码码率;及
    根据所述编码码率对所述待编码语音帧进行编码,得到编码结果。
  2. 根据权利要求1所述的方法,其特征在于,所述待编码语音帧特征和所述后向语音帧特征包括语音起始帧特征和非语音帧特征中的至少一种,所述语音起始帧特征和非语音帧特征的提取包括以下步骤:
    获取待提取语音帧,所述待提取语音帧为所述待编码语音帧和所述后向语音帧中的至少一种;
    基于所述待提取语音帧进行语音端点检测,得到语音端点检测结果;
    当所述语音端点检测结果为语音起始端点时,确定所述待提取语音帧对应的语音起始帧特征为第一目标值和所述待提取语音帧对应的非语音帧特征为第二目标值中的至少一种;及
    当所述语音端点检测结果为非语音起始端点时,确定所述待提取语音帧对应的语音起始帧特征为所述第二目标值和所述待提取语音帧对应的非语音帧特征为所述第一目标值中的至少一种。
  3. 根据权利要求1所述的方法,其特征在于,所述待编码语音帧特征和所述后向语音帧特征包括能量变化特征,所述能量变化特征的提取包括以下步骤:
    获取待提取语音帧,所述待提取语音帧为所述待编码语音帧和所述后向语音帧中的至少一种;
    获取所述待提取语音帧对应的前向语音帧,计算所述待提取语音帧对应的待提取帧能量,并计算所述前向语音帧对应的前向帧能量;及
    计算所述待提取帧能量和所述前向帧能量的比值,根据比值结果确定所述待提取语音帧对应的能量变化特征。
  4. 根据权利要求3所述的方法,其特征在于,所述计算所述待提取语音帧对应的待提取帧能量,包括:
    基于所述待提取语音帧进行数据采样,得到各个样点数据值和样点数量;及
    计算所述各个样点数据值的平方和,并计算所述平方和与所述样点数量的比值,得到所述待提取帧能量。
  5. 根据权利要求1所述的方法,其特征在于,所述待编码语音帧特征和所述后向语音帧特征包括基音周期突变帧特征,所述基音周期突变帧特征的提取包括以下步骤:
    获取待提取语音帧,所述待提取语音帧为所述待编码语音帧和所述后向语音帧中的至少一种;
    获取所述待提取语音帧对应的前向语音帧,检测所述待提取语音帧和所述前向语音帧的基音周期,得到待提取基音周期和前向基音周期;及
    根据所述待提取基音周期和所述前向基音周期计算基音周期变化程度,根据所述基音周期变化程度确定所述待提取语音帧对应的基音周期突变帧特征。
  6. 根据权利要求1所述的方法,其特征在于,所述基于所述待编码语音帧特征得到所述待编码语音帧对应的待编码语音帧关键性,包括:
    从所述待编码语音帧特征中确定正向待编码语音帧特征,对所述正向待编码语音帧特征进行加权计算,得到正向待编码语音帧关键性,所述正向待编码语音帧特征包括语音起始帧特征、能量变化特征和基音周期突变帧特征中的至少一种;
    从所述待编码语音帧特征中确定反向待编码语音帧特征,根据所述反向待编码语音帧特征确定反向待编码语音帧关键性,所述反向待编码语音帧特征包括非语音帧特征;及
    基于所述正向待编码语音帧关键性和预设正向权重计算得到正向关键性,基于所述反向待编码语音帧关键性和预设反向权重计算得到反向关键性,基于所述正向关键性和所述反向 关键性得到所述待编码语音帧对应的待编码语音帧关键性。
  7. 根据权利要求1所述的方法,所述基于所述待编码语音帧关键性和所述后向语音帧关键性获取关键性趋势特征,使用所述关键性趋势特征确定所述待编码语音帧对应的编码码率,包括:
    获取前向语音帧关键性,基于所述前向语音帧关键性、所述待编码语音帧关键性和所述后向语音帧关键性获取目标关键性趋势特征,使用所述目标关键性趋势特征确定所述待编码语音帧对应的编码码率。
  8. 根据权利要求1所述的方法,其特征在于,所述基于所述待编码语音帧关键性和所述后向语音帧关键性获取关键性趋势特征,使用所述关键性趋势特征确定所述待编码语音帧对应的编码码率,包括:
    基于所述待编码语音帧关键性和所述后向语音帧关键性计算关键性差异程度和关键性平均程度;及
    根据所述关键性差异程度和所述关键性平均程度计算得到所述待编码语音帧对应的编码码率。
  9. 根据权利要求8所述的方法,其特征在于,基于所述待编码语音帧关键性和所述后向语音帧关键性计算关键性差异程度,包括:
    计算所述待编码语音帧关键性与预设第一权重的第一加权值,并计算所述后向语音帧关键性与预设第二权重的第二加权值;及
    基于所述第一加权值和所述第二加权值计算得到目标加权值,计算所述目标加权值与所述待编码语音帧关键性的差值,得到所述关键性差异程度。
  10. 根据权利要求8所述的方法,其特征在于,所述基于所述待编码语音帧关键性和所述后向语音帧关键性计算关键性平均程度,包括:
    获取所述待编码语音帧和所述后向语音帧的帧数量;及
    统计所述待编码语音帧关键性与所述后向语音帧关键性得到综合关键性,并计算所述综合关键性与所述帧数量的比值,得到所述关键性平均程度。
  11. 根据权利要求8所述的方法,其特征在于,所述根据所述关键性差异程度和所述关键性平均程度计算得到所述待编码语音帧对应的编码码率,包括:
    获取第一码率计算函数和第二码率计算函数;
    使用所述关键性平均程度和所述第一码率计算函数计算得到第一码率,并使用所述关键性差异程度和所述第二码率计算函数计算得到第二码率,根据所述第一码率和第二码率确定综合码率,其中,所述第一码率与所述关键性平均程度成正比关系,所述第二码率与所述关键性差异程度成正比关系;及
    获取预设码率上限值和预设码率下限值,基于所述预设码率上限值、预设码率下限值和所述综合码率确定所述编码码率。
  12. 根据权利要求11所述的方法,其特征在于,所述基于所述预设码率上限值、预设码率下限值和所述综合码率确定所述编码码率,包括:
    比较所述预设码率上限值和所述综合码率;
    当所述综合码率小于所述预设码率上限值时,比较所述预设码率下限值和所述综合码率;及
    当所述综合码率大于所述预设码率下限值时,将所述综合码率作为所述编码码率。
  13. 一种语音编码装置,其特征在于,所述装置包括:
    语音帧获取模块,用于获取待编码语音帧,及与所述待编码语音帧对应的后向语音帧;
    第一关键性计算模块,用于提取所述待编码语音帧对应的待编码语音帧特征,基于所述待编码语音帧特征计算得到所述待编码语音帧对应的待编码语音帧关键性;
    第二关键性计算模块,用于提取所述后向语音帧对应的后向语音帧特征,基于所述后向语音帧特征计算得到所述后向语音帧对应的后向语音帧关键性;
    码率计算模块,用于基于所述待编码语音帧关键性和所述后向语音帧关键性获取关键性趋势特征,使用所述关键性趋势特征确定所述待编码语音帧对应的编码码率,其中,通过所述关键性趋势特征表征的关键性趋势的强弱自适应的控制各个待编码语音帧对应的编码码率;及
    编码模块,用于根据所述编码码率对所述待编码语音帧进行编码,得到编码结果。
  14. The apparatus according to claim 13, wherein the to-be-encoded speech frame feature and the backward speech frame feature comprise at least one of a speech start frame feature and a non-speech frame feature, and the apparatus further comprises:
    a first feature extraction module, configured to: obtain a speech frame to be extracted, the speech frame to be extracted being at least one of the speech frame to be encoded and the backward speech frame; perform voice endpoint detection based on the speech frame to be extracted, to obtain a voice endpoint detection result; when the voice endpoint detection result is a speech start endpoint, determine at least one of: the speech start frame feature corresponding to the speech frame to be extracted being a first target value, and the non-speech frame feature corresponding to the speech frame to be extracted being a second target value; and when the voice endpoint detection result is a non-speech start endpoint, determine at least one of: the speech start frame feature corresponding to the speech frame to be extracted being the second target value, and the non-speech frame feature corresponding to the speech frame to be extracted being the first target value.
  15. The apparatus according to claim 13, wherein the to-be-encoded speech frame feature and the backward speech frame feature comprise an energy change feature, and the apparatus further comprises:
    a second feature extraction module, configured to: obtain a speech frame to be extracted, the speech frame to be extracted being at least one of the speech frame to be encoded and the backward speech frame; obtain a forward speech frame corresponding to the speech frame to be extracted, calculate a to-be-extracted frame energy corresponding to the speech frame to be extracted, and calculate a forward frame energy corresponding to the forward speech frame; and calculate a ratio of the to-be-extracted frame energy to the forward frame energy, and determine, according to the ratio result, the energy change feature corresponding to the speech frame to be extracted.
  16. The apparatus according to claim 15, wherein the apparatus further comprises:
    a frame energy calculation module, configured to: perform data sampling based on the speech frame to be extracted, to obtain each sample data value and a sample quantity; and calculate a sum of squares of the sample data values, and calculate a ratio of the sum of squares to the sample quantity, to obtain the to-be-extracted frame energy.
  17. The apparatus according to claim 13, wherein the to-be-encoded speech frame feature and the backward speech frame feature comprise a pitch period mutation frame feature, and the apparatus further comprises:
    a third feature extraction module, configured to: obtain a speech frame to be extracted, the speech frame to be extracted being the speech frame to be encoded or the backward speech frame; obtain a forward speech frame corresponding to the speech frame to be extracted, and detect pitch periods of the speech frame to be extracted and the forward speech frame, to obtain a to-be-extracted pitch period and a forward pitch period; and calculate a pitch period change degree according to the to-be-extracted pitch period and the forward pitch period, and determine, according to the pitch period change degree, the pitch period mutation frame feature corresponding to the speech frame to be extracted.
  18. The apparatus according to claim 13, wherein the first criticality calculation module comprises:
    a positive calculation unit, configured to determine a positive to-be-encoded speech frame feature from the to-be-encoded speech frame feature, and perform weighted calculation on the positive to-be-encoded speech frame feature, to obtain a positive to-be-encoded speech frame criticality, the positive to-be-encoded speech frame feature comprising at least one of a speech start frame feature, an energy change feature, and a pitch period mutation frame feature;
    a negative calculation unit, configured to determine a negative to-be-encoded speech frame feature from the to-be-encoded speech frame feature, and determine a negative to-be-encoded speech frame criticality according to the negative to-be-encoded speech frame feature, the negative to-be-encoded speech frame feature comprising a non-speech frame feature; and
    a criticality calculation unit, configured to obtain, based on the positive to-be-encoded speech frame criticality and the negative to-be-encoded speech frame criticality, the to-be-encoded speech frame criticality corresponding to the speech frame to be encoded.
  19. A computer device, comprising a memory and a processor, the memory storing computer-readable instructions that, when executed by the processor, cause the processor to implement the steps of the method according to any one of claims 1 to 12.
  20. One or more non-volatile storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to implement the steps of the method according to any one of claims 1 to 12.
PCT/CN2021/095714 2020-06-24 2021-05-25 Speech coding method and apparatus, computer device, and storage medium WO2021258958A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP21828640.9A EP4040436B1 (en) 2021-05-25 Speech encoding method and apparatus, computer device, and storage medium
JP2022554706A JP7471727B2 (ja) 2020-06-24 2021-05-25 音声符号化方法、装置、コンピュータ機器及びコンピュータプログラム
US17/740,309 US20220270622A1 (en) 2020-06-24 2022-05-09 Speech coding method and apparatus, computer device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010585545.9A CN112767953B (zh) 2020-06-24 2020-06-24 Speech coding method and apparatus, computer device, and storage medium
CN202010585545.9 2020-06-24

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/740,309 Continuation US20220270622A1 (en) 2020-06-24 2022-05-09 Speech coding method and apparatus, computer device, and storage medium

Publications (1)

Publication Number Publication Date
WO2021258958A1 (zh)

Family

ID=75693048

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/095714 WO2021258958A1 (zh) 2020-06-24 2021-05-25 Speech coding method and apparatus, computer device, and storage medium

Country Status (4)

Country Link
US (1) US20220270622A1 (zh)
JP (1) JP7471727B2 (zh)
CN (1) CN112767953B (zh)
WO (1) WO2021258958A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112767953B (zh) * 2020-06-24 2024-01-23 腾讯科技(深圳)有限公司 Speech coding method and apparatus, computer device, and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103841418A (zh) * 2012-11-22 2014-06-04 中国科学院声学研究所 Optimization method and system for video monitor bit rate control in a 3G network
CN109151470A (zh) * 2017-06-28 2019-01-04 腾讯科技(深圳)有限公司 Encoding resolution control method and terminal
CN109729353A (zh) * 2019-01-31 2019-05-07 深圳市迅雷网文化有限公司 Video encoding method, apparatus, system, and medium
CN110166781A (zh) * 2018-06-22 2019-08-23 腾讯科技(深圳)有限公司 Video encoding method and apparatus, and readable medium
CN110166780A (zh) * 2018-06-06 2019-08-23 腾讯科技(深圳)有限公司 Video bit rate control method, transcoding processing method, apparatus, and machine device
US20200029081A1 (en) * 2018-07-17 2020-01-23 Wowza Media Systems, LLC Adjusting encoding frame size based on available network bandwidth
CN110890945A (zh) * 2019-11-20 2020-03-17 腾讯科技(深圳)有限公司 Data transmission method, apparatus, terminal, and storage medium
CN112767953A (zh) * 2020-06-24 2021-05-07 腾讯科技(深圳)有限公司 Speech coding method and apparatus, computer device, and storage medium

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH05175941A (ja) * 1991-12-20 1993-07-13 Fujitsu Ltd Variable coding rate transmission system
TW271524B (zh) * 1994-08-05 1996-03-01 Qualcomm Inc
US20070036227A1 (en) * 2005-08-15 2007-02-15 Faisal Ishtiaq Video encoding system and method for providing content adaptive rate control
KR100746013B1 (ko) * 2005-11-15 2007-08-06 삼성전자주식회사 Method and apparatus for data transmission in a wireless network
JP4548348B2 (ja) * 2006-01-18 2010-09-22 カシオ計算機株式会社 Speech encoding device and speech encoding method
US20090319261A1 (en) * 2008-06-20 2009-12-24 Qualcomm Incorporated Coding of transitional speech frames for low-bit-rate applications
US8352252B2 (en) * 2009-06-04 2013-01-08 Qualcomm Incorporated Systems and methods for preventing the loss of information within a speech frame
JP5235168B2 (ja) 2009-06-23 2013-07-10 日本電信電話株式会社 Encoding method, decoding method, encoding device, decoding device, encoding program, and decoding program
WO2013062392A1 (ko) 2011-10-27 2013-05-02 엘지전자 주식회사 Method for encoding a voice signal, method for decoding a voice signal, and apparatus using same
CN102543090B (zh) * 2011-12-31 2013-12-04 深圳市茂碧信息科技有限公司 Automatic bit rate control system for variable-rate speech and audio coding
US9208798B2 (en) 2012-04-09 2015-12-08 Board Of Regents, The University Of Texas System Dynamic control of voice codec data rate
CN103050122B (zh) * 2012-12-18 2014-10-08 北京航空航天大学 MELP-based multi-frame joint quantization low-bit-rate speech coding and decoding method
CN103338375A (zh) * 2013-06-27 2013-10-02 公安部第一研究所 Dynamic bit rate allocation method based on video data importance in a broadband trunking system
CN104517612B (zh) * 2013-09-30 2018-10-12 上海爱聊信息科技有限公司 Variable bit rate encoder and decoder based on AMR-NB speech signals and encoding and decoding methods thereof
CN106534862B (zh) * 2016-12-20 2019-12-10 杭州当虹科技股份有限公司 Video encoding method
CN110740334B (zh) * 2019-10-18 2021-08-31 福州大学 Frame-level application-layer dynamic FEC encoding method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4040436A4

Also Published As

Publication number Publication date
JP2023517973A (ja) 2023-04-27
JP7471727B2 (ja) 2024-04-22
EP4040436A1 (en) 2022-08-10
EP4040436A4 (en) 2023-01-18
CN112767953B (zh) 2024-01-23
CN112767953A (zh) 2021-05-07
US20220270622A1 (en) 2022-08-25

Similar Documents

Publication Publication Date Title
US10540979B2 (en) User interface for secure access to a device using speaker verification
WO2019196196A1 (zh) Whispered speech recovery method, apparatus, device, and readable storage medium
US8731936B2 (en) Energy-efficient unobtrusive identification of a speaker
Li et al. Robust endpoint detection and energy normalization for real-time speech and speaker recognition
CN108346425B (zh) Voice activity detection method and apparatus, and speech recognition method and apparatus
US20150317977A1 (en) Voice profile management and speech signal generation
JP2016180988A (ja) System and method of smart audio logging for mobile devices
JP2006079079A (ja) Distributed speech recognition system and method therefor
WO2014114049A1 (zh) Speech recognition method and apparatus
US11741943B2 (en) Method and system for acoustic model conditioning on non-phoneme information features
CN111540342B (zh) Energy threshold adjustment method, apparatus, device, and medium
CN111916061A (zh) Voice endpoint detection method, apparatus, readable storage medium, and electronic device
CN112786052A (zh) Speech recognition method, electronic device, and storage apparatus
US8868419B2 (en) Generalizing text content summary from speech content
WO2021258958A1 (zh) Speech coding method and apparatus, computer device, and storage medium
US20180082703A1 (en) Suitability score based on attribute scores
JP2012168296A (ja) Apparatus and program for detecting a depressed state from speech
WO2020003413A1 (ja) Information processing device, control method, and program
CN112767955B (zh) Audio encoding method and apparatus, storage medium, and electronic device
Zhu et al. A robust and lightweight voice activity detection algorithm for speech enhancement at low signal-to-noise ratio
EP4040436B1 (en) Speech encoding method and apparatus, computer device, and storage medium
CN115985347B (zh) Deep learning-based voice endpoint detection method, apparatus, and computer device
WO2022068675A1 (zh) Speaker voice extraction method and apparatus, storage medium, and electronic device
CN113793598B (zh) Training method and data augmentation method, apparatus, and device for speech processing model
Weychan et al. Real time recognition of speakers from internet audio stream

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21828640

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021828640

Country of ref document: EP

Effective date: 20220428

ENP Entry into the national phase

Ref document number: 2022554706

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE