CN113314151A - Voice information processing method and device, electronic equipment and storage medium - Google Patents

Voice information processing method and device, electronic equipment and storage medium

Info

Publication number
CN113314151A
Authority
CN
China
Prior art keywords
voice information
information processing
speech
digital signal
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110581331.9A
Other languages
Chinese (zh)
Inventor
柳丝婉
陈永录
李变
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202110581331.9A priority Critical patent/CN113314151A/en
Publication of CN113314151A publication Critical patent/CN113314151A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/28 - Constructional details of speech recognition systems
    • G10L15/30 - Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/45 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The present disclosure provides a voice information processing method, including: acquiring voice information and preprocessing the voice information; performing feature extraction on the preprocessed voice information, wherein the extracted features include at least one of short-time energy and a Mel frequency cepstrum coefficient; inputting the features into a deep neural network for classification to obtain classification features of speech emotion, wherein the deep neural network includes a convolutional layer and a fully-connected layer which are sequentially connected; and recognizing the classification features to obtain a speech emotion classification result. The disclosure also provides a voice information processing apparatus, an electronic device and a storage medium.

Description

Voice information processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of speech emotion recognition and the field of finance, and in particular, to a speech information processing method and apparatus, an electronic device, and a storage medium.
Background
With the rapid development of network technology, financial risks have increased. Elderly customers in particular have a weak ability to identify financial risks and hidden transfer dangers, so they are often deceived into transferring large amounts of money to the accounts of other people. To protect customer rights and interests, when a customer transfers a large amount of money, a bank performs risk prevention and control: it checks the identity of the user, triggers a risk-control model, confirms by voice call whether the customer really intends to transfer the large amount to another person, and remits the money to the account designated by the customer only after a positive response is obtained.
However, current risk-control measures still have drawbacks. For example, when a bank teller reminds the user of the risk, the user may not heed the teller's well-intentioned reminder, may feel that the teller is prying into his or her privacy, and may still insist on remitting the money to others. For another example, the decision to block a customer transfer currently rests with the teller, but tellers' business skills are uneven, making it difficult to protect customer transfers under a uniform standard; a teller's judgment can also be swayed by the customer's emotions, leading to errors.
BRIEF SUMMARY OF THE PRESENT DISCLOSURE
In view of the above, the present disclosure provides, in one aspect, a method for processing voice information, including: acquiring voice information and preprocessing the voice information; performing feature extraction on the preprocessed voice information, wherein the extracted features include at least one of short-time energy and a Mel frequency cepstrum coefficient; inputting the features into a deep neural network for classification to obtain classification features of speech emotion, wherein the deep neural network includes a convolutional layer and a fully-connected layer which are sequentially connected; and recognizing the classification features to obtain a speech emotion classification result.
According to an embodiment of the present disclosure, the preprocessing the voice information includes: converting the voice information into a digital signal; pre-emphasis processing is carried out on the digital signal so as to improve the high-frequency spectrum of the digital signal; and segmenting the pre-emphasized digital signal through a window function to obtain a multi-frame digital signal.
According to an embodiment of the present disclosure, the extracting the feature of the preprocessed voice information includes: setting a frame moving step length; and calculating the short-time energy of each frame of digital signal through the window function and the frame moving step length.
According to an embodiment of the present disclosure, the window function comprises a hamming window function.
According to an embodiment of the present disclosure, the pre-emphasis processing of the digital signal includes: inputting the digital signal into a digital filter with a preset boost per octave for filtering, so as to improve the high-frequency spectrum of the digital signal.
According to an embodiment of the present disclosure, the extracting the feature of the preprocessed voice information includes: carrying out Fourier transform on the preprocessed voice information, and calculating an energy spectrum of the voice information; calculating a response of the speech information from the energy spectrum; and calculating the Mel frequency cepstrum coefficient according to the response.
According to an embodiment of the present disclosure, the calculating of the Mel frequency cepstrum coefficients from the response comprises: taking the logarithm of the response; and performing an inverse discrete cosine transform on the logarithm to calculate the Mel frequency cepstrum coefficients.
According to an embodiment of the present disclosure, the identifying the classification feature includes: and inputting the classification features into an SVM network for recognition to obtain a speech emotion classification result.
According to the embodiment of the disclosure, the voice information comprises call content of transfer confirmation; the voice information processing method further includes: and confirming whether to transfer or not according to the voice emotion classification result.
According to an embodiment of the present disclosure, the voice information processing method further includes: after the transfer is finished, revisiting the transfer result and optimizing the deep neural network according to the revisit data.
Another aspect of the present disclosure provides a voice information processing apparatus, including: the preprocessing module is used for acquiring voice information and preprocessing the voice information; the feature extraction module is used for extracting features of the preprocessed voice information, wherein the extracted features comprise at least one of short-time energy and a Mel frequency cepstrum coefficient; the classification module is used for inputting the features into the deep neural network for classification to obtain classification features of the speech emotion, wherein the deep neural network comprises a convolutional layer and a fully-connected layer which are sequentially connected; and the recognition module is used for recognizing the classification features to obtain a speech emotion classification result.
According to an embodiment of the present disclosure, the preprocessing module includes: the conversion unit is used for converting the voice information into a digital signal; the pre-emphasis unit is used for performing pre-emphasis processing on the digital signal so as to improve the high-frequency spectrum of the digital signal; and the division unit is used for dividing the digital signal after the pre-emphasis through a window function to obtain a multi-frame digital signal.
According to an embodiment of the present disclosure, the feature extraction module includes: a setting unit for setting a frame moving step; and the first calculation unit is used for calculating the short-time energy of each frame of digital signal through the window function and the frame moving step length.
According to an embodiment of the present disclosure, the window function comprises a hamming window function.
According to an embodiment of the present disclosure, the pre-emphasis processing of the digital signal by the pre-emphasis unit includes: inputting the digital signal into a digital filter with a preset boost per octave for filtering, so as to improve the high-frequency spectrum of the digital signal.
According to an embodiment of the present disclosure, the feature extraction module includes: the second calculation unit is used for carrying out Fourier transform on the preprocessed voice information and calculating an energy spectrum of the voice information; a third calculation unit for calculating a response of the speech information from the energy spectrum; and the fourth calculating unit is used for calculating the Mel frequency cepstrum coefficient according to the response.
According to an embodiment of the present disclosure, the fourth calculating unit calculating the Mel frequency cepstrum coefficients according to the response includes: taking the logarithm of the response; and performing an inverse discrete cosine transform on the logarithm to calculate the Mel frequency cepstrum coefficients.
According to an embodiment of the present disclosure, the identifying the classification feature by the identifying module includes: and inputting the classification features into an SVM network for recognition to obtain a speech emotion classification result.
According to the embodiment of the disclosure, the voice information comprises call content of transfer confirmation; the voice information processing apparatus further includes: and the confirming module is used for confirming whether to transfer the account or not according to the voice emotion classification result.
According to an embodiment of the present disclosure, the voice information processing apparatus further includes: and the optimization module is used for revisiting the transfer result after the transfer is finished and optimizing the deep neural network according to the revisit data.
Another aspect of the present disclosure provides an electronic device including: one or more processors; memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method as described above.
Another aspect of the present disclosure provides a computer-readable storage medium storing computer-executable instructions for implementing the method as described above when executed.
Another aspect of the disclosure provides a computer program comprising computer executable instructions for implementing the method as described above when executed.
Drawings
Fig. 1 schematically illustrates a system architecture 100 of a voice information processing method and apparatus according to an embodiment of the present disclosure;
FIG. 2 schematically shows a flow chart of a method of speech information processing according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a block diagram of a deep neural network architecture, in accordance with an embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow chart of a method of pre-processing speech information according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow diagram of a feature extraction method according to an embodiment of the present disclosure;
FIG. 6 schematically illustrates a graph of speech emotion versus average energy in accordance with an embodiment of the present disclosure;
FIG. 7 schematically illustrates a flow diagram of a feature extraction method according to an embodiment of the present disclosure;
FIG. 8 schematically illustrates a flow diagram of a method of identifying classification features according to an embodiment of the present disclosure;
FIG. 9 schematically illustrates a flow chart of a method of speech information processing according to an embodiment of the present disclosure;
FIG. 10 schematically illustrates a flow chart of a method of speech information processing according to an embodiment of the present disclosure;
FIG. 11 schematically shows a block diagram of a speech information processing apparatus according to an embodiment of the present disclosure;
fig. 12 schematically shows a block diagram of an information processing apparatus according to still another embodiment of the present disclosure;
fig. 13 schematically shows a block diagram of an information processing apparatus according to still another embodiment of the present disclosure;
FIG. 14 schematically illustrates a block diagram of a pre-processing module according to an embodiment of the present disclosure;
FIG. 15 schematically shows a block diagram of a feature extraction module according to an embodiment of the present disclosure;
FIG. 16 schematically shows a block diagram of a feature extraction module according to yet another embodiment of the present disclosure;
FIG. 17 schematically shows a block diagram of a feature extraction module according to yet another embodiment of the present disclosure;
fig. 18 schematically shows a block diagram of an electronic device adapted to implement the above described method according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have a alone, B alone, C alone, a and B together, a and C together, B and C together, and/or A, B, C together, etc.). Where a convention analogous to "A, B or at least one of C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B or C" would include but not be limited to systems that have a alone, B alone, C alone, a and B together, a and C together, B and C together, and/or A, B, C together, etc.).
Some block diagrams and/or flow diagrams are shown in the figures. It will be understood that some blocks of the block diagrams and/or flowchart illustrations, or combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the instructions, which execute via the processor, create means for implementing the functions/acts specified in the block diagrams and/or flowchart block or blocks. The techniques of this disclosure may be implemented in hardware and/or software (including firmware, microcode, etc.). In addition, the techniques of this disclosure may take the form of a computer program product on a computer-readable storage medium having instructions stored thereon for use by or in connection with an instruction execution system.
An embodiment of the present disclosure provides a voice information processing method, including: acquiring voice information and preprocessing the voice information; performing feature extraction on the preprocessed voice information, wherein the extracted features include at least one of short-time energy and Mel frequency cepstrum coefficients; inputting the features into a deep neural network for classification to obtain classification features of speech emotion, wherein the deep neural network includes a convolutional layer and a fully-connected layer which are sequentially connected; and recognizing the classification features to obtain a speech emotion classification result.
Fig. 1 schematically shows a system architecture 100 of a voice information processing method and apparatus according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.
As shown in fig. 1, the system architecture 100 according to the embodiment may include a storage unit 101, a network 102 and a server 103. Network 102 is used to provide communication links between storage unit 101 and servers 103.
The storage unit 101 may be implemented in hardware or software, for example as an electronic device that stores data (e.g., a hard disk) built with semiconductor or magnetic-media technology, or as a database. The storage unit 101 stores the voice information to be processed. Network 102 may include various connection types, such as wired or wireless communication links, or fiber optic cables, to name a few. The server 103 may be a server capable of acquiring voice data from the storage unit and processing the voice data. According to the embodiment of the disclosure, in the process of voice processing, the server 103 acquires the voice information stored in the storage unit 101 through the network 102, preprocesses the voice information, and performs feature extraction on the preprocessed voice information, where the extracted features include at least one of short-time energy and Mel frequency cepstrum coefficients. The features are then input into the deep neural network for classification to obtain classification features of speech emotion, where the deep neural network includes a convolutional layer and a fully-connected layer that are sequentially connected, and the classification features are recognized to obtain a speech emotion classification result.
It should be noted that the voice information processing method provided by the embodiment of the present disclosure may be executed by the server 103. Accordingly, the voice information processing apparatus provided by the embodiment of the present disclosure may be disposed in the server 103. Alternatively, the voice information processing method provided by the embodiment of the present disclosure may also be executed by a server or a server cluster that is different from the server 103 and is capable of communicating with the storage unit 101 and/or the server 103. Accordingly, the voice information processing apparatus provided by the embodiment of the present disclosure may also be disposed in a server or a server cluster different from the server 103 and capable of communicating with the storage unit 101 and/or the server 103. Alternatively, the voice information processing method provided by the embodiment of the present disclosure may also be executed in part by the server 103 and in part by the storage unit 101. Accordingly, the voice information processing apparatus provided by the embodiment of the present disclosure may also be partially disposed in the server 103 and partially disposed in the storage unit 101.
It should be understood that the number of storage units, networks, and servers in FIG. 1 is illustrative only. There may be any number of storage units, networks, and servers, as desired for an implementation.
The voice information processing method provided by the embodiment of the disclosure can be applied to the field of financial services. Taking a bank as an example, elderly customers have a weak ability to identify financial risks and hidden transfer dangers. When an elderly customer transfers money at a counter or through other channels and the risk-control model is triggered, bank service personnel can conduct a voice call with the customer, monitor and analyze the customer's current tone and emotion, and judge the customer's motivation and emotional state for the transfer. If the customer's tone and emotion are normal and the intention to transfer is genuine, the money is remitted to the account designated by the customer; if the customer's tone is agitated, the semantics are ambiguous, or the customer appears confused, the money is not remitted to the other party's account, so the risk can be stopped in time. According to the voice information processing method provided by the embodiment of the disclosure, by processing the call content between the bank staff and the client, emotion recognition can be performed on the voice content quickly and accurately to obtain a speech emotion classification result, and whether the transfer is allowed can be accurately judged according to that result.
It should be understood that the voice information processing method provided by the embodiment of the present disclosure is not limited to be applied to the technical field of financial services, the above description is only exemplary, and for the field related to voice emotion classification recognition, such as the communication fraud related field, the voice emotion classification can be performed by applying the voice information processing method of the embodiment of the present disclosure.
Fig. 2 schematically shows a flow chart of a voice information processing method according to an embodiment of the present disclosure.
As shown in fig. 2, the voice information processing method may include operations S201 to S204, for example.
In operation S201, voice information is acquired and preprocessed.
In operation S202, feature extraction is performed on the preprocessed voice information, wherein the extracted features include at least one of short-time energy and mel-frequency cepstrum coefficients.
In operation S203, the features are input into a deep neural network for classification, so as to obtain classification features of speech emotion, where the deep neural network includes a convolutional layer and a fully-connected layer, which are sequentially connected.
In operation S204, the classification features are recognized to obtain a speech emotion classification result.
In the embodiment of the disclosure, the voice information may include the call content of a transfer confirmation. The preprocessing facilitates better subsequent feature extraction and speech emotion classification and recognition.
In the embodiment of the present disclosure, the deep neural network used to classify the features includes convolutional layers and fully-connected layers connected in sequence. The network is structured as shown in fig. 3: for example, a plurality of convolutional layers and a plurality of fully-connected layers connected to them are disposed between the input and the output, forming a convolutional neural network for extracting significant features. A traditional convolutional neural network feeds one-dimensional variables directly into a Softmax classifier after the pooling layers; that approach is better suited to image recognition, because image features have strong continuity. In carrying out the disclosed concept, the applicants have discovered that for speech signals the feature relations among the one-dimensional vectors are not obvious, and features are easily lost in the pooling layer, which biases the result. Therefore, the embodiment of the present disclosure improves the conventional convolutional neural network to obtain the deep neural network shown in fig. 3: when classifying the features, no pooling is performed; the features are directly subjected to convolution operations and then input to the fully-connected layers for feature classification, which improves the accuracy of speech emotion classification and recognition.
According to the voice information processing method provided by the embodiment of the disclosure, extracting at least one of the short-time energy and the Mel frequency cepstrum coefficients allows the speech emotion contained in the voice information to be represented more accurately, which facilitates the subsequent classification. Furthermore, classifying the short-time energy and Mel frequency cepstrum coefficients with the above-described deep neural network, which contains only convolutional layers and fully-connected layers, yields the classification features of speech emotion and improves the accuracy of speech emotion classification and recognition.
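For illustration only, the following is a minimal Python/PyTorch sketch of a pooling-free network of the kind described above: convolutional layers followed directly by fully-connected layers. The layer counts, channel widths, input feature length and class count are assumptions chosen for the example and are not specified by the disclosure.

```python
# A minimal PyTorch sketch of the pooling-free architecture described above.
# Layer counts, channel widths, and the input feature length (40) are
# illustrative assumptions; the disclosure only specifies convolutional layers
# followed by fully-connected layers, with no pooling.
import torch
import torch.nn as nn

class EmotionFeatureNet(nn.Module):
    def __init__(self, feature_len: int = 40, num_classes: int = 2):
        super().__init__()
        # Convolutional stage: no pooling layers, so the feature resolution is
        # preserved for the fully-connected stage.
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels=1, out_channels=16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(in_channels=16, out_channels=32, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Fully-connected stage producing the classification output.
        self.fc = nn.Sequential(
            nn.Linear(32 * feature_len, 64),
            nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, feature_len) one-dimensional feature vectors
        x = x.unsqueeze(1)          # -> (batch, 1, feature_len)
        x = self.conv(x)            # -> (batch, 32, feature_len)
        x = x.flatten(start_dim=1)  # -> (batch, 32 * feature_len)
        return self.fc(x)

model = EmotionFeatureNet()
dummy = torch.randn(8, 40)          # a batch of 8 toy feature vectors
print(model(dummy).shape)           # torch.Size([8, 2])
```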
The voice information processing method shown in fig. 2 will be described in further detail below with reference to the accompanying drawings, taking a large-amount transfer transaction as an example.
Fig. 4 schematically shows a flow chart of a method of pre-processing speech information according to an embodiment of the present disclosure.
As shown in fig. 4, the preprocessing method may include, for example, operations S401 to S403.
In operation S401, voice information is converted into a digital signal.
In the embodiment of the disclosure, when a large amount of money is transferred, the transaction is confirmed by telephone, and the real-time call content of the telephone confirmation is recorded as the voice information to be processed, denoted I. The analog speech signal may be converted into a digital signal by low-pass filtering, sampling and quantization.
In operation S402, a pre-emphasis process is performed on the digital signal to increase a high frequency spectrum of the digital signal.
In the embodiment of the disclosure, research shows that the amplitude of a speech signal decreases with increasing frequency, and the amplitude at the high-frequency end is typically significantly lower than at the low-frequency end. To facilitate subsequent processing, the amplitude of the high-frequency band may be raised, for example by pre-emphasis. Pre-emphasis means that, after the signal is digitized, a digital filter with a preset boost per octave (e.g. 6 dB) is used to boost the high-frequency spectrum of the signal. After pre-emphasis, the spectrum of the speech signal becomes flatter, which is beneficial for subsequent processing.
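For illustration, a minimal sketch of first-order pre-emphasis is given below. The filter y[n] = x[n] - a * x[n-1] and the coefficient a = 0.97 are common textbook choices assumed here; the disclosure itself only requires a digital filter with a preset high-frequency boost per octave (e.g. 6 dB).

```python
# A minimal pre-emphasis sketch, assuming the common first-order filter
# y[n] = x[n] - a * x[n-1]; the coefficient a = 0.97 is an illustrative
# assumption, not a value fixed by the disclosure.
import numpy as np

def pre_emphasis(signal: np.ndarray, a: float = 0.97) -> np.ndarray:
    """Boost the high-frequency spectrum of a digitized speech signal."""
    return np.append(signal[0], signal[1:] - a * signal[:-1])

x = np.sin(2 * np.pi * 50 * np.linspace(0, 1, 8000))   # toy digital signal
y = pre_emphasis(x)
```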
In operation S403, the pre-emphasized digital signal is divided by a window function to obtain a plurality of frames of digital signals.
In the embodiment of the disclosure, research shows that although the speech signal is non-stationary, the vibration of the vocal organs is much slower than the acoustic vibration itself, so the speech signal can be considered stationary over a very short time (for example, within 30 milliseconds). Therefore, when processing the voice information, it can be divided into short segments that are processed one by one. The segmentation can be realized by a window function, and the data of each segment is called a frame, so that the continuous speech information I becomes a sequence of stationary frame signals I1, I2, I3, …, In. There is a certain overlap between adjacent frames, which ensures a smooth transition of the sound signal.
In the embodiment of the disclosure, a Hamming window may be selected as the specific window function; the Hamming window has a wider main lobe and a smaller side-lobe peak, which is beneficial for analyzing the main characteristics of the sound. The Hamming window function may be:
w(n) = 0.54 - 0.46 * cos(2πn / (N - 1)), 0 ≤ n ≤ N - 1
where w(n) denotes the Hamming window function, N is the window (frame) length in samples, and n is the sample index within the frame.
According to the voice information preprocessing method provided by the embodiment of the disclosure, the voice signal is subjected to pre-emphasis processing, so that the frequency spectrum of the voice signal becomes flat, and the subsequent processing is facilitated. By segmenting the speech signal into multiple frame signals, the stationary frame signals are beneficial to improving the accuracy of subsequent feature extraction and classification.
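For illustration, the following sketch frames a pre-emphasized signal into overlapping windows using a Hamming window that matches the window function given above. The 25 ms frame length and 10 ms frame shift are assumed values for the example; the disclosure does not fix them.

```python
# A framing sketch, assuming a 25 ms frame length and 10 ms frame shift
# (values not fixed by the disclosure). np.hamming implements the
# 0.54 - 0.46*cos(2*pi*n/(N-1)) window given above.
import numpy as np

def frame_signal(signal: np.ndarray, sample_rate: int,
                 frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    window = np.hamming(frame_len)
    num_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    frames = np.stack([
        signal[i * hop_len: i * hop_len + frame_len] * window
        for i in range(num_frames)
    ])
    return frames  # shape: (num_frames, frame_len), overlapping windowed frames

sr = 8000
frames = frame_signal(np.random.randn(sr), sr)   # one second of toy audio
```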
Fig. 5 schematically shows a flow chart of a feature extraction method according to an embodiment of the present disclosure.
This feature extraction method can be used to extract the short-time energy contained in the voice information. The embodiment of the disclosure finds through research that the speech energy contained in different emotions differs markedly, so extracting the short-time energy of the sound can effectively describe the emotional changes in the voice. Short-time energy is most directly proportional to the amplitude of the speech: when the amplitude of the speech signal is large, the short-time energy contained in it is large, and vice versa. For high-intensity emotional speech, whether positive emotions such as excitement or negative emotions such as anger, the signal has a larger amplitude, while the amplitude of a low-intensity speech signal, such as silence, is generally smaller. By performing speech analysis on historical voice information, a curve relating emotion to average energy is obtained, as shown in fig. 6; it is evident from fig. 6 that the average energy of the angry emotion is the largest and that of the happy emotion is the second largest.
As shown in fig. 5, the method may include, for example, operations S501 to S502.
In operation S501, a frame moving step is set.
In the embodiment of the present disclosure, the short-time energy contained in the speech information is extracted based on the window function described above. Specifically, a frame moving step length is set and the window function is applied to the framed speech signal; the nth frame speech signal In(m) obtained after framing may be expressed as:
In(m) = w(m) I(n + m), 0 ≤ m ≤ N - 1, n = 0, T, 2T, …
where N is the frame length in samples, n is the starting index of the nth frame signal, and T is the frame moving step length.
In operation S502, the short-time energy of each frame of the digital signal is calculated by the window function and the frame moving step size.
In the disclosed embodiment, the short-time energy X'n of the nth frame speech signal can be calculated by the following formula:
X'n = Σ In(m)^2 = Σ [w(m) I(n + m)]^2, where the sum runs over m = 0, 1, …, N - 1.
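For illustration, a minimal sketch of the short-time energy computation follows: each windowed frame is squared and summed over the frame. The toy frame array stands in for the output of the framing sketch shown earlier.

```python
# A minimal sketch of the short-time energy computation: each windowed frame
# I_n(m) is squared and summed over the frame.
import numpy as np

def short_time_energy(frames: np.ndarray) -> np.ndarray:
    """frames: (num_frames, frame_len) windowed frames -> one energy per frame."""
    return np.sum(frames ** 2, axis=1)

toy_frames = np.random.randn(10, 200)      # 10 toy frames of 200 samples each
energies = short_time_energy(toy_frames)   # shape: (10,)
```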
in the embodiment of the disclosure, emotional energy distribution of short-time voice can help to distinguish the current psychological activity state during available transfer to a certain extent. In a general sense, decisions made in the case of large mood swings are often not the most appropriate. Thus, the bank would choose these mood swings to intervene. However, the intervention method based on short-term emotional fluctuation has large errors, and the problem that the customer is easily subjected to financial fraud is not thoroughly solved. Therefore, in the embodiment of the present disclosure, the mel-frequency cepstrum coefficient in the speech information may also be extracted.
Fig. 7 schematically shows a flow chart of a feature extraction method according to an embodiment of the present disclosure.
This feature extraction method can be used to extract the Mel Frequency Cepstrum Coefficients (MFCC) contained in the voice information. The Mel frequency cepstrum coefficient is a characteristic parameter extracted by simulating the hearing characteristics of the human ear, and it can represent the characteristics of speech well. The Mel frequency scale, obtained with a set of filter banks that imitate the human ear, is related to the actual frequency as follows:
Mel(f)=2595*lg(1+f/700)
where f is the actual frequency.
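For illustration, the formula above can be transcribed directly, taking lg as the base-10 logarithm:

```python
# Direct transcription of the mel-scale formula above, with lg as log base 10.
import numpy as np

def hz_to_mel(f_hz: float) -> float:
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

print(hz_to_mel(1000.0))   # roughly 1000 mel, by construction of the scale
```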
In the embodiment of the disclosure, for each frame I1, I2, I3, …, In of the frame-divided speech signal, the method shown in fig. 7 can be adopted to extract the MFCC features and obtain the MFCC feature values.
As shown in fig. 7, the method may include, for example, operations S701 to S703.
In operation S701, fourier transform is performed on the preprocessed voice information, and an energy spectrum of the voice information is calculated.
In the disclosed embodiment, a fast Fourier transform is performed on each frame In of the speech signal to obtain the energy spectrum function of the speech signal.
In operation S702, a response of the voice information is calculated from the energy spectrum.
In the disclosed embodiment, the energy spectrum function of the voice signal can be input into a Mel filter to calculate the response of the voice signal.
In operation S703, a mel-frequency cepstrum coefficient is calculated from the response.
In the embodiment of the present disclosure, the logarithm of the response output by the Mel filter may be taken, and then an inverse Discrete Cosine Transform (DCT) may be performed on the logarithm to calculate the Mel frequency cepstrum coefficients.
In the embodiment of the disclosure, extracting the Mel frequency cepstrum coefficients of the speech information and classifying the speech emotion contained in the speech signal in combination with the extracted short-time energy features further improves the accuracy of speech emotion classification.
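For illustration, the following sketch strings the steps of operations S701 to S703 together for a single frame: FFT, energy spectrum, Mel filterbank response, logarithm, and a discrete cosine transform. The FFT size, number of filters and number of coefficients are assumed values, the triangular filterbank construction is a simplified textbook version rather than the patent's exact filters, and the final transform is implemented with the DCT-II, as is common in MFCC implementations.

```python
# A per-frame MFCC sketch following the steps above: FFT -> energy spectrum ->
# mel filterbank response -> logarithm -> DCT. The FFT size, filter count and
# coefficient count are illustrative assumptions.
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(num_filters: int, n_fft: int, sample_rate: int) -> np.ndarray:
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2), num_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((num_filters, n_fft // 2 + 1))
    for i in range(1, num_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):                       # rising slope
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):                      # falling slope
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def frame_mfcc(frame: np.ndarray, sample_rate: int,
               n_fft: int = 512, num_filters: int = 26, num_ceps: int = 13):
    spectrum = np.fft.rfft(frame, n_fft)               # operation S701: FFT
    energy = (np.abs(spectrum) ** 2) / n_fft            # energy spectrum
    response = mel_filterbank(num_filters, n_fft, sample_rate) @ energy  # S702
    log_resp = np.log(response + 1e-10)                 # logarithm of the response
    return dct(log_resp, type=2, norm='ortho')[:num_ceps]  # S703: DCT -> MFCC

mfcc = frame_mfcc(np.random.randn(200), sample_rate=8000)   # toy frame
```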
It should be understood that when performing feature extraction on voice information, features to be extracted may be selected according to practical application requirements, for example, only short-term energy in the voice information is extracted, only mel-frequency cepstrum coefficients in the voice information are extracted, or both the short-term energy and the mel-frequency cepstrum coefficients in the voice information are extracted.
Fig. 8 schematically illustrates a flow chart of a method of identifying classification features according to an embodiment of the present disclosure.
As shown in fig. 8, the method may include, for example, operation S801.
In operation S801, the classification features of the speech emotion are input to an SVM network for recognition, and a speech emotion classification result is obtained.
In the disclosed embodiment, the SVM (support vector machine) network allows the decision boundary to be very complex even if the data has only a few features. It performs well on both low-dimensional and high-dimensional data (i.e., with few features or many features). Based on the SVM network, the speech emotion classification result can be recognized well.
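For illustration, a minimal scikit-learn sketch of the SVM recognition step is given below. The RBF kernel and the random toy training data are assumptions for the example; in practice the classifier would be trained on classification features produced by the deep neural network together with labelled emotions.

```python
# A minimal scikit-learn sketch of the SVM recognition step; the kernel choice
# and the random toy data are illustrative assumptions only.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 64))       # stand-ins for DNN classification features
y_train = rng.integers(0, 2, size=100)     # 0 = calm, 1 = non-calm (toy labels)

clf = SVC(kernel='rbf')
clf.fit(X_train, y_train)
emotion = clf.predict(rng.normal(size=(1, 64)))   # speech emotion classification result
```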
FIG. 9 schematically shows a flow chart of a method of speech information processing according to an embodiment of the present disclosure.
As shown in fig. 9, the method of voice information processing includes operations S201, S202, S203, S204, and S901.
In operation S201, voice information is acquired and preprocessed.
In operation S202, feature extraction is performed on the preprocessed voice information, wherein the extracted features include at least one of short-time energy and mel-frequency cepstrum coefficients.
In operation S203, the features are input into a deep neural network for classification, so as to obtain classification features of speech emotion, where the deep neural network includes a convolutional layer and a fully-connected layer, which are sequentially connected.
In operation S204, the classification features are recognized to obtain a speech emotion classification result.
In operation S901, whether to transfer money is confirmed according to the speech emotion classification result.
In the disclosed embodiment, the speech emotion classification may include, for example, calm or non-calm (anger, happiness, boredom, etc.). If the emotion is in a non-calm state such as happiness, boredom, sadness or anger, the customer's large transfer is stopped; if the emotion is in a calm state, the customer's large transfer is allowed.
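For illustration, the decision rule described above can be sketched as a simple mapping from the emotion label to an allow/stop decision. The label names and the blocking policy shown are illustrative assumptions, not mandated by the disclosure.

```python
# A hedged sketch of the decision rule described above; label names and the
# blocking policy are illustrative assumptions.
def confirm_transfer(emotion_label: str) -> bool:
    """Allow a large transfer only when the classified emotion is 'calm'."""
    non_calm = {"angry", "happy", "bored", "sad"}
    if emotion_label in non_calm:
        return False    # stop the large transfer and intervene
    return emotion_label == "calm"

print(confirm_transfer("calm"))    # True
print(confirm_transfer("angry"))   # False
```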
FIG. 10 schematically shows a flow chart of a method of speech information processing according to an embodiment of the present disclosure.
As shown in fig. 10, the method of voice information processing includes operation S201, operation S202, operation S203, operation S204, operation S901, and operation S1001.
In operation S201, voice information is acquired and preprocessed.
In operation S202, feature extraction is performed on the preprocessed voice information, wherein the extracted features include at least one of short-time energy and mel-frequency cepstrum coefficients.
In operation S203, the features are input into a deep neural network for classification, so as to obtain classification features of speech emotion, where the deep neural network includes a convolutional layer and a fully-connected layer, which are sequentially connected.
In operation S204, the classification features are recognized to obtain a speech emotion classification result.
In operation S901, whether to transfer money is confirmed according to the speech emotion classification result.
In operation S1001, the transfer result is revisited, and the deep neural network is optimized according to the revisit data.
In the embodiment of the disclosure, the family or friends of the transferring client can be revisited to confirm the outcome of the transfer, and the deep neural network used for feature classification is further optimized according to the confirmed result, achieving continuous optimization.
Based on this optimization, the accuracy of the speech emotion classification process can be further improved, and the accuracy of the transfer confirmation result is improved accordingly.
Fig. 11 schematically shows a block diagram of a speech information processing apparatus according to an embodiment of the present disclosure.
As shown in fig. 11, the speech information processing apparatus 1100 may include, for example, a preprocessing module 1110, a feature extraction module 1120, a classification module 1130, and a recognition module 1140.
The preprocessing module 1110 is configured to obtain voice information and perform preprocessing on the voice information.
The feature extraction module 1120 is configured to perform feature extraction on the preprocessed voice information, where the extracted features include at least one of short-time energy and mel-frequency cepstrum coefficients.
The classification module 1130 is configured to input the features into the deep neural network for classification, so as to obtain classification features of speech emotion, where the deep neural network includes a convolutional layer and a fully-connected layer that are sequentially connected.
And the recognition module 1140 is used for recognizing the classification features to obtain a speech emotion classification result.
Fig. 12 schematically shows a block diagram of an information processing apparatus according to still another embodiment of the present disclosure.
As shown in fig. 12, the voice information processing apparatus 1100 may be used for transfer confirmation, and the corresponding voice information includes call contents of the transfer confirmation, and the voice information processing apparatus 1100 may further include, for example, a confirmation module 1150.
And the confirming module 1150 is used for confirming whether to transfer the account according to the voice emotion classification result.
Fig. 13 schematically shows a block diagram of an information processing apparatus according to still another embodiment of the present disclosure.
As shown in fig. 13, the speech information processing apparatus 1100 may further include an optimization module 1160, for example.
And the optimization module 1160 is used for revisiting the transfer result after the transfer is finished and optimizing the deep neural network according to the revisit data.
FIG. 14 schematically illustrates a block diagram of a pre-processing module according to an embodiment of the present disclosure.
As shown in fig. 14, the pre-processing module 1110 may include, for example, a transformation unit 1111, a pre-emphasis unit 1112, and a segmentation unit 1113.
A converting unit 1111, configured to convert the voice information into a digital signal.
A pre-emphasis unit 1112 is configured to perform pre-emphasis processing on the digital signal to improve the high frequency spectrum of the digital signal.
The dividing unit 1113 is configured to divide the pre-emphasized digital signal by a window function to obtain a multi-frame digital signal.
FIG. 15 schematically shows a block diagram of a feature extraction module according to an embodiment of the present disclosure.
As shown in fig. 15, the feature extraction module 1120 may include, for example, a setting unit 1121 and a first calculation unit 1122.
A setting unit 1121 configured to set a frame moving step size.
A first calculating unit 1122 for calculating the short-time energy of each frame of the digital signal by the window function and the frame moving step size.
FIG. 16 schematically shows a block diagram of a feature extraction module according to yet another embodiment of the present disclosure.
As shown in fig. 16, the feature extraction module 1120 may include, for example, a second calculation unit 1123, a third calculation unit 1124, and a fourth calculation unit 1125.
A second calculating unit 1123, configured to perform fourier transform on the preprocessed voice information, and calculate an energy spectrum of the voice information;
a third calculating unit 1124 for calculating a response of the speech information according to the energy spectrum;
a fourth calculating unit 1125 for calculating the mel-frequency cepstrum coefficients according to the response.
Fig. 17 schematically shows a block diagram of a feature extraction module according to yet another embodiment of the present disclosure.
As shown in fig. 17, the feature extraction module 1120 may include, for example, a setting unit 1121, a first calculation unit 1122, a second calculation unit 1123, a third calculation unit 1124, and a fourth calculation unit 1125.
A setting unit 1121 configured to set a frame moving step size.
A first calculating unit 1122 for calculating the short-time energy of each frame of the digital signal by the window function and the frame moving step size.
A second calculating unit 1123, configured to perform fourier transform on the preprocessed voice information, and calculate an energy spectrum of the voice information;
a third calculating unit 1124 for calculating a response of the speech information according to the energy spectrum;
a fourth calculating unit 1125 for calculating the mel-frequency cepstrum coefficients according to the response.
Any number of modules, sub-modules, units, sub-units, or at least part of the functionality of any number thereof according to embodiments of the present disclosure may be implemented in one module. Any one or more of the modules, sub-modules, units, and sub-units according to the embodiments of the present disclosure may be implemented by being split into a plurality of modules. Any one or more of the modules, sub-modules, units, sub-units according to embodiments of the present disclosure may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in any other reasonable manner of hardware or firmware by integrating or packaging a circuit, or in any one of or a suitable combination of software, hardware, and firmware implementations. Alternatively, one or more of the modules, sub-modules, units, sub-units according to embodiments of the disclosure may be at least partially implemented as a computer program module, which when executed may perform the corresponding functions.
For example, any plurality of the preprocessing module 1110, the feature extraction module 1120, the classification module 1130, the recognition module 1140, the confirmation module 1150, and the optimization module 1160 may be combined and implemented in one module/unit/sub-unit, or any one of the modules/units/sub-units may be split into a plurality of modules/units/sub-units. Alternatively, at least part of the functionality of one or more of these modules/units/sub-units may be combined with at least part of the functionality of other modules/units/sub-units and implemented in one module/unit/sub-unit. According to an embodiment of the present disclosure, at least one of the preprocessing module 1110, the feature extraction module 1120, the classification module 1130, and the recognition module 1140, the confirmation module 1150, and the optimization module 1160 may be at least partially implemented as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or implemented in any one of three implementations of software, hardware, and firmware, or in a suitable combination of any of them. Alternatively, the preprocessing module 1110, the feature extraction module 1120, the classification module 1130, and at least one of the recognition module 1140, the confirmation module 1150, and the optimization module 1160 may be at least partially implemented as computer program modules that, when executed, may perform corresponding functions.
It should be noted that the voice information processing apparatus portion in the embodiment of the present disclosure corresponds to the voice information processing method portion in the embodiment of the present disclosure, and the specific implementation details and the technical effects thereof are also the same, and are not described herein again.
Fig. 18 schematically shows a block diagram of an electronic device adapted to implement the above described method according to an embodiment of the present disclosure. The electronic device shown in fig. 18 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 18, an electronic device 1800 according to an embodiment of the present disclosure includes a processor 1801, which may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 1802 or a program loaded from a storage portion 1808 into a Random Access Memory (RAM) 1803. The processor 1801 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 1801 may also include onboard memory for caching purposes. The processor 1801 may include a single processing unit or multiple processing units for performing the different actions of the method flows in accordance with embodiments of the present disclosure.
In the RAM 1803, various programs and data necessary for the operation of the electronic apparatus 1800 are stored. The processor 1801, the ROM 1802, and the RAM 1803 are connected to one another by a bus 1804. The processor 1801 performs various operations of the method flows according to embodiments of the present disclosure by executing programs in the ROM 1802 and/or the RAM 1803. Note that the programs may also be stored in one or more memories other than the ROM 1802 and the RAM 1803. The processor 1801 may also perform various operations of method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, the electronic device 1800 may also include an input/output (I/O) interface 1805, the input/output (I/O) interface 1805 also being connected to the bus 1804. The electronic device 1800 may also include one or more of the following components connected to the I/O interface 1805: an input portion 1806 including a keyboard, a mouse, and the like; an output portion 1807 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 1808 including a hard disk and the like; and a communication section 1809 including a network interface card such as a LAN card, a modem, or the like. The communication section 1809 performs communication processing via a network such as the internet. A driver 1810 is also connected to the I/O interface 1805 as needed. A removable medium 1811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 1810 as necessary, so that a computer program read out therefrom is mounted in the storage portion 1808 as necessary.
Method flows according to embodiments of the present disclosure may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer-readable storage medium, the computer program containing program code for performing the method illustrated by the flowchart. In such embodiments, the computer program may be downloaded and installed from a network via the communication portion 1809 and/or installed from the removable medium 1811. The computer program, when executed by the processor 1801, performs the above-described functions defined in the system of the embodiments of the present disclosure. The systems, devices, apparatuses, modules, units, and the like described above may be implemented by computer program modules according to embodiments of the present disclosure.
The present disclosure also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.
According to an embodiment of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium. Examples may include, but are not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
For example, according to embodiments of the present disclosure, a computer-readable storage medium may include the ROM 1802 and/or the RAM 1803 and/or one or more memories other than the ROM 1802 and the RAM 1803 described above.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or by combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that various combinations and/or associations of the features recited in the various embodiments and/or claims of the present disclosure can be made, even if such combinations or associations are not expressly recited in the present disclosure. In particular, various combinations and/or associations of the features recited in the various embodiments and/or claims of the present disclosure may be made without departing from the spirit or teaching of the present disclosure. All such combinations and/or associations fall within the scope of the present disclosure.

Claims (22)

1. A method of processing speech information, comprising:
acquiring voice information, and preprocessing the voice information;
extracting the features of the preprocessed voice information, wherein the extracted features comprise at least one of short-time energy and a Mel frequency cepstrum coefficient;
inputting the extracted features into a deep neural network for classification to obtain classification features of speech emotion, wherein the deep neural network comprises a convolutional layer and a fully connected layer which are sequentially connected;
and identifying the classification features to obtain a speech emotion classification result.
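For illustration only (this sketch is not part of the claims), a deep neural network with a convolutional layer followed by a fully connected layer, as recited in claim 1, could be realized roughly as follows. The sketch assumes PyTorch; the layer sizes, the number of emotion classes, and names such as EmotionFeatureNet are hypothetical placeholders rather than values taken from the application.

# Illustrative sketch only, not claim language; sizes and names are assumptions.
import torch
import torch.nn as nn

class EmotionFeatureNet(nn.Module):
    # A convolutional layer and a fully connected layer, sequentially connected.
    def __init__(self, n_mfcc=13, n_frames=100, n_classes=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # convolutional layer
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.fc = nn.Linear(16 * (n_mfcc // 2) * (n_frames // 2), n_classes)  # fully connected layer

    def forward(self, x):
        # x: (batch, 1, n_mfcc, n_frames), a feature map built from the extracted
        # short-time energy and/or Mel frequency cepstrum coefficient features
        h = self.conv(x)
        h = h.flatten(1)
        return self.fc(h)  # classification features of speech emotion

net = EmotionFeatureNet()
features = torch.randn(8, 1, 13, 100)   # a batch of 8 feature maps
cls_features = net(features)            # later recognized by the SVM of claim 8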
2. The voice information processing method of claim 1, wherein preprocessing the voice information comprises:
converting the voice information into a digital signal;
performing pre-emphasis processing on the digital signal to boost the high-frequency spectrum of the digital signal;
and segmenting the pre-emphasized digital signal through a window function to obtain a multi-frame digital signal.
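For illustration only (not claim language), the preprocessing of claim 2 could be sketched as follows, assuming NumPy; the pre-emphasis coefficient, frame length, and frame step are hypothetical choices (200 samples is 25 ms at an assumed 8 kHz sampling rate).

# Illustrative sketch only; parameter values are assumptions.
import numpy as np

def preprocess(signal, frame_len=200, frame_step=80, alpha=0.97):
    # Pre-emphasis: a first-order high-pass step that boosts the high-frequency spectrum
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Segment the pre-emphasized signal into overlapping frames and apply a
    # Hamming window function (see claim 4)
    window = np.hamming(frame_len)
    n_frames = 1 + (len(emphasized) - frame_len) // frame_step
    frames = np.stack([
        emphasized[i * frame_step: i * frame_step + frame_len] * window
        for i in range(n_frames)
    ])
    return frames  # the multi-frame digital signal

frames = preprocess(np.random.randn(8000))  # e.g. one second of audio at 8 kHz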
3. The voice information processing method according to claim 2, wherein extracting the features of the preprocessed voice information comprises:
setting a frame moving step length;
and calculating the short-time energy of each frame of digital signal through the window function and the frame moving step length.
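For illustration only (not claim language), the short-time energy of claim 3 could be computed per windowed frame as follows, assuming NumPy and frames shaped like the output of the preprocessing sketch above.

# Illustrative sketch only.
import numpy as np

def short_time_energy(frames):
    # Energy of each frame: the sum of squared windowed samples; the frame moving
    # step length is already fixed by how the frames were cut.
    return np.sum(frames.astype(np.float64) ** 2, axis=1)

energies = short_time_energy(np.random.randn(98, 200))  # one energy value per frame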
4. The speech information processing method according to claim 2, wherein the window function comprises a Hamming window function.
5. The speech information processing method according to claim 2, wherein the pre-emphasis processing of the digital signal comprises:
and inputting the digital signal into a digital filter with a preset octave for filtering, so as to boost the high-frequency spectrum of the digital signal.
6. The voice information processing method according to claim 1, wherein extracting the features of the preprocessed voice information comprises:
carrying out Fourier transform on the preprocessed voice information, and calculating an energy spectrum of the voice information;
calculating a response of the speech information from the energy spectrum;
and calculating the Mel frequency cepstrum coefficient according to the response.
7. The speech information processing method of claim 6, wherein calculating the Mel frequency cepstrum coefficient from the response comprises:
taking a logarithm of the response;
and performing an inverse discrete cosine transform on the logarithm to calculate the Mel frequency cepstrum coefficient.
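For illustration only (not claim language), the computation of claims 6 and 7 could be sketched as follows, assuming NumPy and SciPy. A triangular Mel filter bank stands in for the "response" of claim 6, and a type-II DCT is used for the decorrelating transform that the claim recites as an inverse discrete cosine transform; the filter count, FFT size, and number of retained coefficients are hypothetical choices.

# Illustrative sketch only; filter-bank construction and sizes are assumptions.
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frames, fs=8000, n_filters=26, n_ceps=13, n_fft=256):
    # Fourier transform of each preprocessed frame, then the energy (power) spectrum
    spectrum = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Triangular Mel filter bank; its output plays the role of the claimed "response"
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        fbank[i - 1, bins[i - 1]:bins[i]] = np.linspace(0, 1, bins[i] - bins[i - 1], endpoint=False)
        fbank[i - 1, bins[i]:bins[i + 1]] = np.linspace(1, 0, bins[i + 1] - bins[i], endpoint=False)
    response = spectrum @ fbank.T
    # Claim 7: take a logarithm of the response, then apply the cosine transform
    log_response = np.log(response + 1e-10)
    return dct(log_response, type=2, axis=1, norm='ortho')[:, :n_ceps]

coeffs = mfcc(np.random.randn(98, 200))  # frames from the preprocessing sketch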
8. The method of processing speech information according to claim 1, wherein identifying the classification features comprises:
and inputting the classification features into an SVM network for recognition to obtain a speech emotion classification result.
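For illustration only (not claim language), the SVM recognition of claim 8 could be sketched as follows, assuming scikit-learn; the feature dimensionality, emotion labels, and kernel choice are hypothetical.

# Illustrative sketch only; data and labels are synthetic placeholders.
import numpy as np
from sklearn.svm import SVC

cls_features = np.random.randn(200, 4)      # classification features from the deep network
labels = np.random.randint(0, 4, size=200)  # hypothetical emotion class indices

svm = SVC(kernel='rbf')
svm.fit(cls_features, labels)
emotion = svm.predict(cls_features[:1])     # speech emotion classification result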
9. The voice information processing method of claim 1, wherein the voice information comprises the call content of a transfer confirmation;
the voice information processing method further includes:
and confirming whether to transfer or not according to the speech emotion classification result.
10. The voice information processing method according to claim 9, wherein the voice information processing method further comprises:
and after the transfer is finished, revisiting the transfer result, and optimizing the deep neural network according to the revisit data.
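For illustration only (not claim language), the optimization of claim 10 could be sketched as a fine-tuning pass over the revisit data, assuming PyTorch and the hypothetical EmotionFeatureNet from the sketch following claim 1; the data format and hyperparameters are assumptions.

# Illustrative sketch only; assumes labelled revisit data is already available.
import torch
import torch.nn as nn

def optimize_with_revisit_data(net, revisit_features, revisit_labels, epochs=5, lr=1e-4):
    # Fine-tune the deep neural network with labels gathered by revisiting
    # completed transfer results.
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    net.train()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(net(revisit_features), revisit_labels)
        loss.backward()
        optimizer.step()
    return net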
11. A speech information processing apparatus comprising:
the preprocessing module is used for acquiring voice information and preprocessing the voice information;
the feature extraction module is used for extracting features of the preprocessed voice information, wherein the extracted features comprise at least one of short-time energy and a Mel frequency cepstrum coefficient;
the classification module is used for inputting the features into a deep neural network for classification to obtain classification features of speech emotion, wherein the deep neural network comprises a convolutional layer and a fully connected layer which are sequentially connected;
and the recognition module is used for recognizing the classification features to obtain a speech emotion classification result.
12. The speech information processing apparatus according to claim 11, wherein the preprocessing module includes:
the conversion unit is used for converting the voice information into a digital signal;
the pre-emphasis unit is used for performing pre-emphasis processing on the digital signal so as to boost the high-frequency spectrum of the digital signal;
and the division unit is used for dividing the digital signal after the pre-emphasis through a window function to obtain a multi-frame digital signal.
13. The speech information processing apparatus according to claim 12, wherein the feature extraction module includes:
a setting unit for setting a frame moving step;
and the first calculation unit is used for calculating the short-time energy of each frame of digital signal through the window function and the frame moving step length.
14. The speech information processing apparatus according to claim 12, wherein the window function comprises a Hamming window function.
15. The speech information processing apparatus according to claim 12, wherein the pre-emphasis processing performed by the pre-emphasis unit on the digital signal comprises:
and inputting the digital signal into a digital filter with a preset octave for filtering, so as to boost the high-frequency spectrum of the digital signal.
16. The speech information processing apparatus according to claim 11, wherein the feature extraction module includes:
the second calculation unit is used for carrying out Fourier transform on the preprocessed voice information and calculating an energy spectrum of the voice information;
a third calculation unit for calculating a response of the speech information from the energy spectrum;
and the fourth calculating unit is used for calculating the Mel frequency cepstrum coefficient according to the response.
17. The speech information processing apparatus according to claim 16, wherein the calculation of the Mel frequency cepstrum coefficient from the response by the fourth calculation unit comprises:
taking a logarithm of the response;
and performing an inverse discrete cosine transform on the logarithm to calculate the Mel frequency cepstrum coefficient.
18. The speech information processing apparatus according to claim 11, wherein the recognition of the classification features by the recognition module comprises:
and inputting the classification features into an SVM network for recognition to obtain a speech emotion classification result.
19. The voice information processing apparatus according to claim 11, wherein the voice information comprises the call content of a transfer confirmation;
the voice information processing apparatus further includes:
and the confirmation module is used for confirming whether to transfer or not according to the speech emotion classification result.
20. The speech information processing apparatus according to claim 19, further comprising:
and the optimization module is used for revisiting the transfer result after the transfer is finished and optimizing the deep neural network according to the revisit data.
21. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-10.
22. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to carry out the method of any one of claims 1 to 10.
CN202110581331.9A 2021-05-26 2021-05-26 Voice information processing method and device, electronic equipment and storage medium Pending CN113314151A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110581331.9A CN113314151A (en) 2021-05-26 2021-05-26 Voice information processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110581331.9A CN113314151A (en) 2021-05-26 2021-05-26 Voice information processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113314151A true CN113314151A (en) 2021-08-27

Family

ID=77375302

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110581331.9A Pending CN113314151A (en) 2021-05-26 2021-05-26 Voice information processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113314151A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108550375A (en) * 2018-03-14 2018-09-18 鲁东大学 A kind of emotion identification method, device and computer equipment based on voice signal
CN108899049A (en) * 2018-05-31 2018-11-27 中国地质大学(武汉) A kind of speech-emotion recognition method and system based on convolutional neural networks
CN109036467A (en) * 2018-10-26 2018-12-18 南京邮电大学 CFFD extracting method, speech-emotion recognition method and system based on TF-LSTM
CN109146066A (en) * 2018-11-01 2019-01-04 重庆邮电大学 A kind of collaborative virtual learning environment natural interactive method based on speech emotion recognition
CN109243492A (en) * 2018-10-28 2019-01-18 国家计算机网络与信息安全管理中心 A kind of speech emotion recognition system and recognition methods
US20190074028A1 (en) * 2017-09-01 2019-03-07 Newton Howard Real-time vocal features extraction for automated emotional or mental state assessment
CN109493886A (en) * 2018-12-13 2019-03-19 西安电子科技大学 Speech-emotion recognition method based on feature selecting and optimization
CN111145786A (en) * 2019-12-17 2020-05-12 深圳追一科技有限公司 Speech emotion recognition method and device, server and computer readable storage medium
CN111445899A (en) * 2020-03-09 2020-07-24 咪咕文化科技有限公司 Voice emotion recognition method and device and storage medium
CN112200556A (en) * 2020-10-23 2021-01-08 中国工商银行股份有限公司 Transfer processing method and device of automatic teller machine

Similar Documents

Publication Publication Date Title
CN112562691B (en) Voiceprint recognition method, voiceprint recognition device, computer equipment and storage medium
CN111145786A (en) Speech emotion recognition method and device, server and computer readable storage medium
CN111276131A (en) Multi-class acoustic feature integration method and system based on deep neural network
EP3469582A1 (en) Neural network-based voiceprint information extraction method and apparatus
Palo et al. Recognition of human speech emotion using variants of mel-frequency cepstral coefficients
Musaev et al. Image approach to speech recognition on CNN
Jia et al. Speaker recognition based on characteristic spectrograms and an improved self-organizing feature map neural network
CN109256138A (en) Auth method, terminal device and computer readable storage medium
Mittal et al. Static–dynamic features and hybrid deep learning models based spoof detection system for ASV
CN114141237A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
KR20220071059A (en) Method for evaluation of emotion based on emotion analysis model and device using the same
Krishna et al. Language independent gender identification from raw waveform using multi-scale convolutional neural networks
Karthikeyan Adaptive boosted random forest-support vector machine based classification scheme for speaker identification
CN108172214A (en) A kind of small echo speech recognition features parameter extracting method based on Mel domains
Devi et al. Automatic speaker recognition with enhanced swallow swarm optimization and ensemble classification model from speech signals
Gaurav et al. An efficient speaker identification framework based on Mask R-CNN classifier parameter optimized using hosted cuckoo optimization (HCO)
CN114155460A (en) Method and device for identifying user type, computer equipment and storage medium
Kuchebo et al. Convolution neural network efficiency research in gender and age classification from speech
Saritha et al. Deep Learning-Based End-to-End Speaker Identification Using Time–Frequency Representation of Speech Signal
Zhang et al. An effective deep learning approach for dialogue emotion recognition in car-hailing platform
CN116312559A (en) Training method of cross-channel voiceprint recognition model, voiceprint recognition method and device
CN113314151A (en) Voice information processing method and device, electronic equipment and storage medium
Alhlffee MFCC-Based Feature Extraction Model for Long Time Period Emotion Speech Using CNN.
CN114822558A (en) Voiceprint recognition method and device, electronic equipment and storage medium
Wickramasinghe et al. DNN controlled adaptive front-end for replay attack detection systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20210827)