CN110782916A - Multi-modal complaint recognition method, device and system - Google Patents

Multi-modal complaint recognition method, device and system Download PDF

Info

Publication number
CN110782916A
CN110782916A (application CN201910943563.7A); granted as CN110782916B
Authority
CN
China
Prior art keywords
complaint
user
data
voice
image sequence
Prior art date
Legal status
Granted
Application number
CN201910943563.7A
Other languages
Chinese (zh)
Other versions
CN110782916B (en)
Inventor
苏绥绥
常富洋
Current Assignee
Beijing Qiyu Information Technology Co Ltd
Original Assignee
Beijing Qiyu Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Qiyu Information Technology Co Ltd filed Critical Beijing Qiyu Information Technology Co Ltd
Priority to CN201910943563.7A
Publication of CN110782916A
Application granted
Publication of CN110782916B
Legal status: Active

Classifications

    • G10L25/63 (Speech or voice analysis for estimating an emotional state)
    • G06N3/044 (Neural network architectures: recurrent networks, e.g. Hopfield networks)
    • G06N3/045 (Neural network architectures: combinations of networks)
    • G06Q30/01 (Customer relationship services)
    • G10L25/30 (Speech or voice analysis using neural networks)


Abstract

The invention discloses a multi-modal complaint identification method, device and system for identifying whether the content of a user call contains complaint content. The method comprises the following steps: receiving the user voice in the user call and converting it into an acoustic waveform; converting the acoustic waveform into image sequence data while recognizing text content data from the waveform; calculating a score reflecting the probability of a complaint from the image sequence data and the text content data; and judging from the score whether the call contains complaint content. By converting the user voice into both image sequence data and text content data and performing emotion recognition on each, the invention improves the accuracy of emotion recognition.

Description

Multi-modal complaint recognition method, device and system
Technical Field
The invention relates to the field of computer information processing, in particular to a multi-modal complaint identification method, device and system.
Background
The customer service center is the main bridge for communication between an enterprise and its users, and the main channel for improving user satisfaction. In the past, customer service centers relied mainly on manual customer service, with professional staff serving users.
With the development of computer information processing technology, more and more customer service centers have begun using voice robots to serve users, solving the problem of overlong waiting times for manual customer service.
At present, voice robots generally cannot recognize the user's emotion. To address this, some customer service centers have introduced speech recognition to analyze and judge customer emotion. However, speech recognition alone is not very accurate, and misjudgments or omissions occur.
A more accurate technique is therefore needed: one that identifies the user's emotion from multiple angles, detects emotional fluctuations more accurately, and reduces user complaints.
Disclosure of Invention
The invention aims to solve the problem of low accuracy of the existing user emotion recognition technology.
In order to solve the above technical problem, a first aspect of the present invention provides a multi-modal complaint identification method for identifying whether the content of a user call includes complaint content, the method comprising:
receiving the user voice in the user call and converting the user voice into an acoustic waveform;
converting the acoustic waveform into image sequence data while recognizing text content data from the acoustic waveform;
calculating a score reflecting the probability of a complaint from the image sequence data and the text content data;
and judging whether the user call contains complaint content according to the score.
According to a preferred embodiment of the present invention, calculating a score reflecting the probability of a complaint from the image sequence data and the text content data comprises:
inputting the image sequence data and the text content data into a complaint probability judgment model for calculation, wherein the complaint probability judgment model is a machine self-learning model trained on historical user call records.
According to a preferred embodiment of the present invention, inputting the image sequence data and the text content data into the complaint probability judgment model for calculation comprises:
vectorizing the image sequence data and the text content data, and inputting the vectorized data into the complaint probability judgment model for calculation.
According to a preferred embodiment of the present invention, converting the user voice into an acoustic waveform specifically comprises: detecting the voice input using a VAD algorithm to obtain the acoustic waveform.
According to a preferred embodiment of the present invention, continuously sampling the acoustic waveform specifically comprises: setting the length of a sliding window as the sampling period, setting the overlap length of the sliding window, and cutting the acoustic waveform with the overlapping sliding window to obtain a series of waveform samples.
According to a preferred embodiment of the present invention, the speech emotion judgment model is an RNN (recurrent neural network) model.
According to a preferred embodiment of the present invention, the text emotion judgment model is a CNN (convolutional neural network) model.
In order to solve the above technical problem, a second aspect of the present invention provides a multi-modal complaint recognition apparatus for recognizing whether the content of a user call includes complaint content, the apparatus comprising:
a voice receiving module for receiving the user voice in the user call and converting the user voice into an acoustic waveform;
a voice conversion module for converting the acoustic waveform into image sequence data and recognizing text content data from the acoustic waveform;
a probability calculation module for calculating a score reflecting the probability of a complaint from the image sequence data and the text content data;
and a complaint judging module for judging whether the user call contains complaint content according to the score.
According to a preferred embodiment of the present invention, calculating the score comprises inputting the image sequence data and the text content data into a complaint probability judgment model for calculation, wherein the complaint probability judgment model is a machine self-learning model trained on historical user call records.
According to a preferred embodiment of the present invention, the image sequence data and the text content data are vectorized, and the vectorized data are input into the complaint probability judgment model for calculation.
According to a preferred embodiment of the present invention, converting the user voice into an acoustic waveform specifically comprises: detecting the voice input using a VAD algorithm to obtain the acoustic waveform.
According to a preferred embodiment of the present invention, continuously sampling the acoustic waveform specifically comprises: setting the length of a sliding window as the sampling period, setting the overlap length of the sliding window, and cutting the acoustic waveform with the overlapping sliding window to obtain a series of waveform samples.
According to a preferred embodiment of the present invention, the speech emotion judgment model is an RNN (recurrent neural network) model.
According to a preferred embodiment of the present invention, the text emotion judgment model is a CNN (convolutional neural network) model.
In order to solve the above technical problem, a third aspect of the present invention provides a multi-modal complaint recognition system, including:
a storage unit for storing a computer executable program;
and the processing unit is used for reading the computer executable program in the storage unit so as to execute the multi-modal complaint identification method.
In order to solve the above technical problem, a fourth aspect of the present invention provides a computer-readable medium storing a computer-readable program which, when executed, performs the multi-modal complaint identification method.
By adopting this technical scheme, existing data are used to train the speech emotion judgment model, and the complaint probability is analyzed and judged from both the image sequence data and the text content data, improving the accuracy of complaint identification.
Drawings
To make the technical problems solved by the present invention, the technical means adopted, and the technical effects obtained clearer, embodiments of the invention are described in detail below with reference to the accompanying drawings. It should be noted, however, that the drawings depict only exemplary embodiments of the invention, from which those skilled in the art can derive other embodiments without inventive effort.
FIG. 1 is a schematic flow diagram of a multi-modal complaint identification method in an embodiment of the invention;
FIG. 2A is a diagram of a speech waveform in the time domain in accordance with an embodiment of the present invention;
FIG. 2B is a time domain speech waveform within 800ms for one embodiment of the present invention;
FIG. 2C is a block diagram of the speech waveform of FIG. 2A after being cut in succession;
FIG. 3 is a schematic diagram of a multi-modal complaint recognition device in an embodiment of the invention;
FIG. 4 is a block diagram of a structural framework for a multi-modal complaint recognition system in an embodiment of the invention;
fig. 5 is a schematic structural diagram of a computer-readable storage medium in an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will now be described more fully with reference to the accompanying drawings. The invention may, however, be embodied in many specific forms and should not be construed as limited to the embodiments set forth herein; rather, these exemplary embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the invention to those skilled in the art.
The structures, properties, effects or other characteristics described in a certain embodiment may be combined in any suitable manner in one or more other embodiments, while still complying with the technical idea of the invention.
In describing particular embodiments, specific details of structures, properties, effects, or other features are set forth in order to provide a thorough understanding of the embodiments by one skilled in the art. However, it is not excluded that a person skilled in the art may implement the invention in a specific case without the above-described structures, performances, effects or other features.
The flow chart in the drawings is only an exemplary flow demonstration, and does not represent that all the contents, operations and steps in the flow chart are necessarily included in the scheme of the invention, nor does it represent that the execution is necessarily performed in the order shown in the drawings. For example, some operations/steps in the flowcharts may be divided, some operations/steps may be combined or partially combined, and the like, and the execution order shown in the flowcharts may be changed according to actual situations without departing from the gist of the present invention.
The block diagrams in the figures generally represent functional entities and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different network and/or processing unit devices and/or microcontroller devices.
The same reference numerals denote the same or similar elements, components, or parts throughout the drawings, and repeated descriptions of them may be omitted hereinafter. It will be further understood that, although the terms first, second, third, etc. may be used herein to describe various elements, components, or sections, these elements, components, or sections should not be limited by those terms, which serve only to distinguish one from another. For example, a first device may also be referred to as a second device without departing from the spirit of the present invention. Furthermore, the term "and/or" is intended to include all combinations of any one or more of the listed items.
The invention is mainly applied to voice robots. As described above, current voice robots cannot recognize the user's emotion from the user's voice and therefore cannot respond accordingly. To solve this problem, the present invention proposes a method of recognizing the user's emotion by analyzing acoustic waveform patterns and text data with pre-trained models.
Fig. 1 shows a multi-modal complaint recognition method for recognizing whether the content of a user call contains complaint content. As shown in fig. 1, the method of the present invention comprises the following steps:
S1, receiving the user voice in the user call, and converting the user voice into an acoustic waveform.
On the basis of the above technical solution, converting the user voice input into an acoustic waveform specifically comprises: detecting the voice input using a VAD algorithm to obtain the acoustic waveform.
In this embodiment, while the voice robot answers the client's questions and communicates with the client, the user's voice is processed so that the non-speech part is filtered out and only the speech part is retained, which facilitates subsequent analysis and improves accuracy.
The voice activity detection (VAD) algorithm is also called a voice endpoint detection or voice boundary detection algorithm. Owing to environmental noise, equipment noise and the like, the user's voice input often contains not only the user's voice but also noise from the user's surroundings; if this noise is not filtered out, it affects the analysis result. The VAD algorithm therefore marks the speech and non-speech sections in the audio data, the non-speech sections are removed using the marking result, environmental noise is filtered out, and only the user's voice is retained and converted into the acoustic waveform.
There are many specific VAD algorithms; in this embodiment, a Gaussian mixture model (GMM) algorithm is used for human voice detection. In other embodiments, other VAD algorithms may be employed.
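The patent does not give code, and its detector is GMM-based; as a rough illustration of the speech/non-speech marking step, the following Python sketch uses a much simpler RMS-energy threshold per frame. The frame length and threshold are illustrative assumptions, not values from the patent.

```python
import math

def frame_energy(frame):
    """Root-mean-square energy of one audio frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def simple_vad(samples, frame_len=160, threshold=0.01):
    """Label each fixed-length frame as speech (True) or non-speech (False)
    by comparing its RMS energy to a threshold. A simplified stand-in for
    the GMM-based detector described in the text."""
    n_frames = len(samples) // frame_len
    return [frame_energy(samples[i * frame_len:(i + 1) * frame_len]) > threshold
            for i in range(n_frames)]

def keep_speech(samples, frame_len=160, threshold=0.01):
    """Drop the non-speech frames, keeping only the voiced waveform."""
    flags = simple_vad(samples, frame_len, threshold)
    voiced = []
    for i, is_speech in enumerate(flags):
        if is_speech:
            voiced.extend(samples[i * frame_len:(i + 1) * frame_len])
    return voiced
```

A real GMM-based VAD would model speech and noise energy distributions rather than use a fixed threshold; the structure (mark frames, then strip non-speech frames) is the same.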
S2, converting the acoustic waveform into image sequence data, and recognizing text content data from the acoustic waveform.
The voice input received from the user is typically analog audio data, but may also be digital audio data, which is usually compressed to some degree. After receiving the user's voice input, the voice robot performs speech-to-text recognition on the audio data and then applies a semantic understanding engine to the recognized content. Unlike the prior art, the present invention also converts the audio data in real time into graphic data that a data processing device can handle, so that the graphics can be analyzed in subsequent steps to obtain emotional information.
In the present invention, the graphic data refers to a speech waveform obtained by processing the input speech.
In one embodiment, the speech waveform is a graphical representation of the speech energy over the time dimension. The speech data may be presented as a waveform map of the speech energy, one form of which is the time-domain representation; that is, a piece of speech can be displayed as a graphical pattern according to its energy over time.
Fig. 2A is a time-domain waveform diagram of one embodiment of the present invention. It shows the time-domain waveform of a segment of speech over the period from 0 to 600 ms; it can be seen that different speech exhibits different waveforms.
Fig. 2A shows a continuous curve; if a longer time range is taken, the waveform may instead appear as a block picture, as shown in fig. 2B, which shows a time-domain speech waveform over 800 ms. In other embodiments, a filling algorithm may be used to convert the line graph into a block graph; the present invention is not limited to a particular graphical presentation method.
Whether the audio data is analog or digital, it must be resampled. Preferably, the present invention detects the voice input using a VAD algorithm to obtain the acoustic waveform: as described above, the speech and non-speech sections in the audio data are marked, the non-speech sections are removed, environmental noise is filtered out, and only the user's voice is retained and converted into the acoustic waveform.
There are many specific VAD algorithms; the invention preferably uses a Gaussian mixture model (GMM) algorithm for human voice detection, though other VAD algorithms may also be employed.
In order to convert the speech waveform into a format that a machine learning model can recognize, the waveform must be segmented. That is, the waveform is divided over a predetermined time window so that the user's voice input produces speech waveform segments that are continuous in time. For example, the waveform map may be divided continuously over a time window, generating individual consecutive waveform segments. The length of the time window may be predetermined, e.g. 25 ms, 50 ms, 100 ms or 200 ms.
In this embodiment, continuously sampling the acoustic waveform specifically comprises: setting the length of a sliding window as the sampling period, setting the overlap length of the sliding window, and cutting the acoustic waveform with the overlapping sliding window to obtain a series of waveform samples. Fig. 2C shows the segment images obtained by continuously cutting the speech waveform of fig. 2A.
In another embodiment, the user's voice input may also produce speech waveform maps that overlap in time. To avoid losing boundary information between consecutive segments, the present invention may use overlapped slicing; for example, the waveform shown in fig. 2A may be cut into 0-50 ms, 25-75 ms, 50-100 ms, 75-125 ms, and so on.
The cut images may be stored as jpg files; in other embodiments they may be converted into image files of other formats. The converted images are also represented as vectors for input to the emotion judgment model.
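The overlapped slicing and vectorization steps above can be sketched as follows. Window and hop lengths (in samples) and the vector length are illustrative, and a resampled, peak-normalized vector stands in for rendering each segment as a jpg image and flattening it.

```python
def sliding_window_cut(samples, window_len, hop_len):
    """Cut a waveform into overlapping segments: window_len is the
    sliding-window length (the sampling period) and hop_len the step
    between window starts, so the overlap is window_len - hop_len.
    With a 50 ms window and a 25 ms hop this reproduces the
    0-50 ms, 25-75 ms, 50-100 ms, 75-125 ms slicing in the text."""
    segments = []
    start = 0
    while start + window_len <= len(samples):
        segments.append(samples[start:start + window_len])
        start += hop_len
    return segments

def segment_to_vector(segment, length=64):
    """Resample a segment to a fixed length and scale it to [-1, 1]
    by its peak value, giving a vector suitable for the judgment
    model. A stand-in for the image-file representation described
    in the text; the vector length is an assumption."""
    step = len(segment) / length
    resampled = [segment[int(i * step)] for i in range(length)]
    peak = max(abs(x) for x in resampled) or 1.0
    return [x / peak for x in resampled]
```

At an 8 kHz sample rate, a 50 ms window with a 25 ms hop corresponds to `window_len=400, hop_len=200`.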
S3, calculating a score reflecting the complaint probability from the image sequence data and the text content data.
On the basis of the above technical solution, this comprises inputting the image sequence data and the text content data into a complaint probability judgment model for calculation, wherein the complaint probability judgment model is a machine self-learning model trained on historical user call records.
Further, the image sequence data and the text content data are vectorized, and the vectorized data are input into the complaint probability judgment model for calculation.
One commonly used technique in deep neural networks is pre-training. Many research results show that initializing a neural network's parameters with vectors obtained from unsupervised or supervised training on large-scale data yields a better model than training from random initialization. Therefore, in this embodiment, the machine self-learning model is trained on historical user call records.
On the basis of the technical scheme, the speech emotion judgment model is an RNN recurrent neural network model.
The recurrent neural network (RNN) is a type of deep network usable for both unsupervised and supervised learning; its depth can even match the length of the input sequence. In the unsupervised setting, an RNN predicts the future data sequence from previous data samples without using class information, which makes it well suited to modeling sequence data.
Furthermore, in the field of language processing the RNN is one of the most widely used neural network models. The influence of preceding context on what follows is generally analyzed with a language model; through its recurrent hidden layer, an RNN naturally exploits the preceding context and can in theory use all of it, which traditional language models cannot. Therefore, in this embodiment, the speech emotion judgment model is an RNN model.
In this embodiment, the speech emotion judgment model comprises an input layer for the image sequence data, a hidden layer, and an output layer for the user's sequence of emotion judgment values, where the number of nodes in the input layer equals the number of nodes in the output layer.
The image sequence data are fed to the input layer; the output layer, having the same number of nodes as the input layer, emits an emotion judgment value for each sample in the image sequence, and these values form the emotion judgment value sequence.
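The many-to-many structure described above can be sketched as a minimal recurrent net that emits one emotion judgment value per input vector, so the output sequence length matches the input. All dimensions are illustrative, and the random weights are placeholders for parameters that would be trained on historical user call records.

```python
import numpy as np

class TinyEmotionRNN:
    """Minimal many-to-many RNN sketch: one emotion judgment value
    per input vector, so the output sequence has the same length as
    the input sequence, as the text requires. Weights are random
    placeholders, not a trained model."""

    def __init__(self, in_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.Wx = rng.normal(0.0, 0.1, (hidden_dim, in_dim))
        self.Wh = rng.normal(0.0, 0.1, (hidden_dim, hidden_dim))
        self.Wo = rng.normal(0.0, 0.1, (1, hidden_dim))

    def forward(self, sequence):
        h = np.zeros(self.Wh.shape[0])
        values = []
        for x in sequence:
            # recurrent state update over the image-sequence vectors
            h = np.tanh(self.Wx @ x + self.Wh @ h)
            logit = (self.Wo @ h)[0]
            # sigmoid squashes each emotion judgment value into (0, 1)
            values.append(float(1.0 / (1.0 + np.exp(-logit))))
        return values
```

Each vectorized waveform segment from step S2 would be one element of `sequence`; the returned list is the emotion judgment value sequence processed further in step S4.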
On the basis of the technical scheme, the text emotion judgment model is a CNN convolutional neural network model.
In this embodiment, the text emotion judgment model based on a convolutional neural network (CNN) performs emotion classification of the text content data in the problem domain using lexical semantic vectors generated in the target domain. Its input is a sentence or document expressed as a matrix in which each row corresponds to one word-segmentation element, i.e. each row is a vector representing one word.
The text emotion judgment model outputs a text emotion fluctuation value.
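The convolutional core of such a text model can be sketched as follows: filters spanning a few word rows slide down the sentence matrix, each feature map is max-pooled, and the pooled features are squashed into a single score. Filter weights are random placeholders for trained parameters, and all dimensions are assumptions for illustration.

```python
import numpy as np

def text_cnn_score(word_matrix, n_filters=4, width=2, seed=0):
    """Sketch of a text-CNN scoring pass: word_matrix has one row per
    word vector. Filters of `width` rows slide over the matrix, each
    feature map is max-pooled, and the pooled features are combined
    into one fluctuation score in (0, 1). Not a trained model."""
    rng = np.random.default_rng(seed)
    n_words, dim = word_matrix.shape
    filters = rng.normal(0.0, 0.1, (n_filters, width, dim))
    pooled = []
    for f in filters:
        activations = [float(np.sum(word_matrix[i:i + width] * f))
                       for i in range(n_words - width + 1)]
        pooled.append(max(activations))      # max-over-time pooling
    w = rng.normal(0.0, 0.1, n_filters)
    return float(1.0 / (1.0 + np.exp(-np.dot(w, pooled))))
```

In practice the rows would be the lexical semantic vectors of the recognized call text, and the output would serve as the text emotion fluctuation value used in step S4.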
S4, judging whether the user call contains complaint content according to the score.
In this embodiment, the emotion judgment value sequence output by the speech emotion judgment model requires further processing: the variance of the sequence is calculated, and the resulting value is the speech emotion fluctuation value, with different fluctuation values corresponding to different emotions.
The variance measures the magnitude of the user's emotional fluctuation: the larger the variance, the stronger the fluctuation.
Weights are then set for the speech emotion fluctuation value and the text emotion fluctuation value, and a global emotion fluctuation value is calculated. A global emotion fluctuation threshold is preset; when the calculated global value exceeds this threshold, the user's emotion is fluctuating severely and the complaint probability is high. The voice robot's conversation strategy is then adjusted, for example by changing the speaking speed, tone, or content.
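The variance and weighted-combination steps above reduce to a few lines. The weights and threshold below are illustrative assumptions; the patent does not specify their values.

```python
def fluctuation_value(emotion_values):
    """Variance of the emotion judgment value sequence: the larger
    the variance, the stronger the user's emotional fluctuation."""
    mean = sum(emotion_values) / len(emotion_values)
    return sum((v - mean) ** 2 for v in emotion_values) / len(emotion_values)

def global_fluctuation(speech_values, text_value,
                       w_speech=0.6, w_text=0.4, threshold=0.05):
    """Weighted combination of the speech fluctuation value (variance
    of the RNN output sequence) and the text emotion fluctuation
    value, compared against a preset global threshold. Weights and
    threshold are illustrative, not taken from the patent."""
    g = w_speech * fluctuation_value(speech_values) + w_text * text_value
    return g, g > threshold
```

A flat emotion sequence yields zero speech fluctuation, while a strongly oscillating one pushes the global value over the threshold and triggers the strategy adjustment.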
As shown in fig. 3, in the present embodiment, there is also provided a multi-modal complaint recognition apparatus 300 including:
the voice receiving module 301 is configured to receive a user voice in the user call, and convert the user voice into a sound wave.
On the basis of the above technical solution, further, converting the user voice input into a sound wave specifically includes: voice input is detected by using VAD algorithm, and a sound wave shape is obtained.
In the embodiment, when the voice robot answers the question for the client and performs communication, the voice of the user is processed, the non-voice part is filtered, only the voice part is reserved, the subsequent analysis is facilitated, and the accuracy is improved.
The voice activity detection VAD algorithm is also called a voice endpoint detection algorithm or a voice boundary detection algorithm. In this embodiment, due to the influence of noise such as environmental noise and equipment noise, the voice input of the user often includes not only the sound of the user but also the noise of the environment where the user is located, and if the noise is not filtered, the analysis result is affected. Therefore, the voice section and the non-voice section in the audio data are marked by VAD algorithm, the non-voice section in the audio data is removed by using the marking result, the voice input of the user is detected, the environmental noise is filtered, only the voice of the user is reserved, and the voice is converted into the sound wave.
There are many specific VAD algorithms; in this embodiment, a Gaussian mixture model (GMM) algorithm is used for human voice detection. In other embodiments, other VAD algorithms may also be employed.
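A full GMM-based VAD is outside the scope of this description; the sketch below substitutes a simple per-frame energy threshold to illustrate the marking-and-filtering idea described above. The frame length and threshold ratio are arbitrary assumptions, not parameters of the embodiment:

```python
def mark_voice_frames(samples, frame_len=400, energy_ratio=0.5):
    """Mark each frame as voice (True) or non-voice (False): a frame
    counts as voice when its mean energy exceeds energy_ratio times
    the mean energy of the whole signal."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    energies = [sum(s * s for s in f) / len(f) for f in frames]
    mean_energy = sum(energies) / len(energies)
    return [e > energy_ratio * mean_energy for e in energies]

def keep_voice(samples, frame_len=400):
    """Remove the frames marked as non-voice, keeping only the
    user's speech for subsequent analysis."""
    marks = mark_voice_frames(samples, frame_len)
    out = []
    for i, keep in enumerate(marks):
        if keep:
            out.extend(samples[i * frame_len:(i + 1) * frame_len])
    return out
```

A production system would replace the energy rule with the GMM (or another VAD) decision while keeping the same mark-then-remove structure.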
A voice conversion module 302, configured to convert the acoustic waveform into image sequence data and to recognize the text content data of the acoustic waveform.
The voice input received from the user is typically analog audio data, but may also be digital audio data, which usually has some compression rate. After receiving the user's voice input, the voice robot on the one hand performs speech-to-text recognition on the audio data and, after recognizing the content, performs semantic understanding on it using a semantic understanding engine. Unlike the prior art, the present invention in the above process also converts the audio data in real time into graphic data that can be processed by a data processing device, so that the graphics can be recognized in subsequent steps to obtain emotion information.
In the present invention, the graphic data refers to a speech waveform obtained by processing an input speech.
In one embodiment, the speech waveform is a graphical representation of the energy value of the speech in the time dimension. The speech data may be presented as a waveform map of the speech energy in the time domain; that is, the energy of a piece of speech over time can be displayed as a graphical pattern.
Whether for analog audio data or digital audio data, resampling of the audio data is required. Preferably, the present invention detects the voice input using a VAD algorithm to obtain the acoustic waveform. The voice activity detection (VAD) algorithm is also called a voice endpoint detection or voice boundary detection algorithm. In this embodiment, owing to environmental noise, equipment noise and the like, the user's voice input often contains not only the user's own voice but also noise from the user's surroundings; if this noise is not filtered out, it affects the analysis result. Therefore, the VAD algorithm marks the voice segments and non-voice segments in the audio data, the marking result is used to remove the non-voice segments, and the environmental noise is thereby filtered out so that only the user's voice remains and is converted into the acoustic waveform.
There are many specific VAD algorithms; the invention preferably uses the Gaussian mixture model (GMM) algorithm for human voice detection. In other embodiments, other VAD algorithms may also be employed.
In order to convert the speech waveform map into a format that can be recognized by a machine learning model, it needs to be segmented. That is, the speech waveform is divided over a predetermined time window, so that the user's voice input produces waveform segments that are continuous in time. For example, the waveform map may be divided continuously over a time window, generating individual continuous waveform segments. The length of the time window may be predetermined, e.g. 25 ms, 50 ms, 100 ms or 200 ms.
In this embodiment, continuously sampling the acoustic waveform specifically includes: setting the length of a sliding window as the sampling period, setting the overlap length of the sliding window, and cutting the acoustic waveform with the overlapped sliding windows to obtain a series of waveform samples.
In another embodiment, the user's voice input may also produce speech waveform maps that overlap in time. In order to avoid missing boundary information between consecutive segments, the present invention may use an overlapped slicing manner; for example, for the waveform map shown in fig. 2A, the slices may be 0-50 ms, 25-75 ms, 50-100 ms, 75-125 ms, and so on.
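The overlapped slicing described above can be sketched as follows; a rate of 8 samples per millisecond is assumed purely for illustration, so a 50 ms window with 25 ms overlap becomes a 400-sample window advanced in 200-sample steps:

```python
def slice_waveform(samples, window, step):
    """Cut the waveform into overlapping segments: `window` is the
    sampling period in samples, and (window - step) is the overlap."""
    segments = []
    start = 0
    while start + window <= len(samples):
        segments.append(samples[start:start + window])
        start += step
    return segments

# 125 ms of audio at the assumed 8 samples/ms yields four segments
# covering 0-50 ms, 25-75 ms, 50-100 ms and 75-125 ms.
segs = slice_waveform(list(range(1000)), window=400, step=200)
```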
The cut image may be stored as a jpg file. In other embodiments, it may be converted into an image file of another format. In other embodiments, the converted image is also represented as a vector for input to the emotion judgment model.

A probability calculation module 303, configured to calculate a score reflecting the probability of complaint according to the image sequence data and the text content data.
On the basis of the above technical solution, further, calculating a score reflecting the probability of complaint from the image sequence data and the text content data includes: inputting the image sequence data and the text content data into a complaint probability judgment model for calculation, wherein the complaint probability judgment model is a machine self-learning model trained on historical user call records.
On the basis of the above technical solution, further, inputting the image sequence data and the text content data into a complaint probability judgment model for calculation includes: vectorizing the image sequence data and the text content data, and inputting the vectorized data into the complaint probability judgment model for calculation.
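Vectorizing an image segment before it enters the judgment model can be as simple as flattening its pixel rows into a one-dimensional vector; this sketch assumes, for illustration, that the segments are given as nested lists of pixel values:

```python
def vectorize_segments(segments):
    """Flatten each 2-D image segment (rows of pixel values) into a
    1-D vector so it can be fed to the complaint probability model."""
    return [[pixel for row in segment for pixel in row]
            for segment in segments]
```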
One commonly used technique in deep neural networks is pre-training. Many research results demonstrate that initializing the parameters of a neural network with vectors obtained from unsupervised or supervised training on large-scale data yields a better model than training from random initialization. Therefore, in this embodiment, the machine self-learning model is trained on historical user call records.
On the basis of the technical scheme, the speech emotion judgment model is an RNN recurrent neural network model.
The recurrent neural network (RNN) is a type of deep network that can be used for both unsupervised and supervised learning; its depth can even match the length of the input sequence. In the unsupervised mode, the RNN predicts future data in a sequence from previous data samples, and no class information is used during learning, so the RNN is very well suited to modeling sequence data.
Furthermore, in the field of language processing the RNN model is one of the most widely used neural networks. The influence of preceding context on what follows is generally analyzed with a language model; through its recurrent hidden-layer feedback, the RNN naturally exploits the preceding context and can in theory use all of it, which a traditional language model cannot do. Therefore, in this embodiment, the speech emotion judgment model is an RNN model.
In this embodiment, the speech emotion judgment model includes an input layer for inputting the image sequence data, a hidden layer, and an output layer for outputting an emotion judgment value sequence of a user, where the number of nodes in the input layer is the same as the number of nodes in the output layer.
In this embodiment, the image sequence data is input to the input layer of the speech emotion judgment model. The number of nodes in the output layer is the same as in the input layer, an emotion judgment value is output for each sample in the image sequence data, and these output values constitute the emotion judgment value sequence.
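The per-timestep behavior, one emotion value emitted for every input sample so that input and output sequences have equal length, can be illustrated with a toy single-unit recurrent cell. The weights here are arbitrary stand-ins; the real model is trained on historical call records:

```python
import math

def rnn_emotion_sequence(x_seq, w_in=0.9, w_rec=0.4):
    """Toy one-unit RNN: each image-segment feature x_t updates the
    hidden state via recurrent feedback, and one emotion judgment
    value is emitted per timestep (outputs match inputs in number)."""
    h = 0.0
    out = []
    for x in x_seq:
        h = math.tanh(w_in * x + w_rec * h)  # recurrent hidden-layer feedback
        out.append(h)                        # one output per input node
    return out
```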
On the basis of the technical scheme, the text emotion judgment model is a CNN convolutional neural network model.
In this embodiment, the text emotion judgment model based on the convolutional neural network (CNN) performs emotion classification on the text content data of the problem domain using lexical semantic vectors generated in the target domain. Its input is a sentence or document expressed as a matrix, where each row of the matrix corresponds to one word-segmentation element, i.e. a vector representing one word.
In the present embodiment, the text emotion determination model outputs a text emotion fluctuation value.
A complaint judgment module 304, configured to judge whether the user call includes complaint content according to the score.
In this embodiment, the speech emotion judgment model outputs an emotion judgment value sequence that requires further processing. The variance of the emotion judgment value sequence is calculated; the resulting value is the speech emotion fluctuation value, and different fluctuation values correspond to different emotions.
In this embodiment, the variance of the emotion judgment value sequence is calculated to judge the magnitude of the user's emotion fluctuation; the variance is the emotion fluctuation value, and the larger the variance, the stronger the user's emotion fluctuation.
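Computed directly, the fluctuation value is the population variance of the judgment sequence:

```python
def speech_fluctuation_value(emotion_seq):
    """Speech emotion fluctuation value = variance of the emotion
    judgment value sequence; a larger value indicates stronger
    emotion fluctuation."""
    n = len(emotion_seq)
    mean = sum(emotion_seq) / n
    return sum((v - mean) ** 2 for v in emotion_seq) / n
```

A flat sequence such as [0.5, 0.5, 0.5] yields 0 (a calm user), while a widely swinging sequence yields a large value.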
In this embodiment, weights are set for the speech emotion fluctuation value and the text emotion fluctuation value respectively, and a global emotion fluctuation value is calculated from the weighted sum. A global emotion fluctuation value threshold is also preset; when the calculated global emotion fluctuation value exceeds this threshold, the user's emotion is fluctuating severely and the complaint probability is high. At this point the conversation strategy of the voice robot needs to be adjusted, for example by adjusting the speech rate, the tone, or the content of the speech.
As shown in fig. 4, in an embodiment of the present invention, a multi-modal complaint recognition system is further disclosed. The information processing system shown in fig. 4 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.
A multi-modal complaint recognition system 400 includes a storage unit 420 for storing a computer-executable program and a processing unit 410 for reading the computer-executable program in the storage unit to perform the steps of the various embodiments of the present invention.
In the present embodiment, the multi-modal complaint recognition system 400 further includes a bus 430 connecting different system components (including the storage unit 420 and the processing unit 410), a display unit 440, and the like.
The storage unit 420 stores a computer-readable program, which may be a code of a source program or a read-only program. The program may be executed by the processing unit 410 such that the processing unit 410 performs the steps of various embodiments of the present invention. For example, the processing unit 410 may perform the steps as shown in fig. 1.
The storage unit 420 may include readable media in the form of volatile storage units, such as a random access memory unit (RAM)4201 and/or a cache memory unit 4202, and may further include a read only memory unit (ROM) 4203. The storage unit 420 may also include a program/utility 4204 having a set (at least one) of program modules 4205, such program modules 4205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 430 may be any bus representing one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
Multimodal complaint identification system 400 can also communicate with one or more external devices 470 (e.g., keyboard, display, network device, bluetooth device, etc.) such that a user can interact with processing unit 410 via input/output (I/O) interfaces 450 via these external devices 470, and can also interact with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via network adapter 460. Network adapter 460 may communicate with other modules of multi-modal complaint identification system 400 over bus 430. It should be understood that, although not shown in the figures, other hardware and/or software modules may be used in multi-modal complaint identification system 400, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
FIG. 5 is a schematic diagram of a computer-readable medium embodiment of the present invention. As shown in fig. 5, the computer program may be stored on one or more computer-readable media. The computer-readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory unit (RAM), a read-only memory unit (ROM), an erasable programmable read-only memory unit (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory unit (CD-ROM), an optical storage unit, a magnetic storage unit, or any suitable combination of the foregoing. The computer program, when executed by one or more data processing devices, enables the computer-readable medium to implement the above-described method of the invention, namely:
S1, receiving the user voice in the user call, and converting the user voice into an acoustic waveform;
S2, converting the acoustic waveform into image sequence data, and recognizing the text content data of the acoustic waveform;
S3, calculating a score reflecting the probability of complaint according to the image sequence data and the text content data;
and S4, judging whether the user call contains complaint content according to the score.
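Steps S1-S4 can be wired together as in the following sketch. The two model arguments are hypothetical callables standing in for the trained speech and text emotion judgment models, and the weights, threshold, segment size and silence filter are all assumed values for illustration only:

```python
def identify_complaint(call_audio, speech_model, text_model,
                       w_speech=0.6, w_text=0.4, threshold=0.5):
    """S1-S4 in sequence: filter silence, segment the waveform,
    score with the two (stand-in) models, and compare the weighted
    score with the preset threshold."""
    # S1: crude stand-in for VAD filtering of the raw call audio
    waveform = [s for s in call_audio if abs(s) > 1e-6]
    # S2: overlapped slicing into short segments (image sequence data)
    segments = [waveform[i:i + 4]
                for i in range(0, max(len(waveform) - 3, 0), 2)]
    # S3: weighted score from the speech-side and text-side models
    score = w_speech * speech_model(segments) + w_text * text_model(waveform)
    # S4: threshold decision on whether the call contains a complaint
    return score > threshold
```

With high-scoring stand-in models the call is flagged as a complaint; with low-scoring ones it is not.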
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments of the present invention described herein may be implemented in software, or in software combined with necessary hardware. The technical solution according to the embodiments of the present invention can therefore be embodied as a software product, which can be stored in a computer-readable storage medium (e.g. a CD-ROM, USB flash drive or removable hard disk) or on a network, and which includes several instructions that cause a data processing device (e.g. a personal computer, server or network device) to execute the above-described method according to the present invention.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium, other than a readable storage medium, that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ as well as conventional procedural programming languages such as the "C" language. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the latter case, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
In summary, the present invention can be implemented as a method, an apparatus, an electronic device, or a computer-readable medium executing a computer program. Some or all of the functions of the present invention may be implemented in practice using general purpose data processing equipment such as a micro-processing unit or a digital signal processing unit (DSP).
While the foregoing embodiments have described the objects, technical solutions and advantages of the present invention in further detail, it should be understood that the present invention is not inherently tied to any particular computer, virtual machine or electronic device, and various general-purpose machines may be used to implement it. The invention is not limited to the specific embodiments described; rather, all modifications, changes and equivalents that come within the spirit and scope of the invention are intended to be embraced.

Claims (10)

1. A multi-modal complaint identification method is used for identifying whether the conversation content of a user contains complaint content, and is characterized by comprising the following steps:
receiving the user voice in the user call, and converting the user voice into an acoustic waveform;
converting the acoustic waveform into image sequence data while recognizing text content data of the acoustic waveform;
calculating a score reflecting a probability of complaint from the image sequence data and the text content data;
and judging whether the user call contains complaint content according to the score.
2. The multi-modal complaint identification method of claim 1, wherein calculating a score reflecting a probability of complaint based on the image sequence data and the text content data comprises:
inputting the image sequence data and the text content data into a complaint probability judgment model for calculation, wherein the complaint probability judgment model is a machine self-learning model trained on historical user call records.
3. The multimodal complaint recognition method according to any one of claims 1 to 2, wherein inputting the image series data and the text content data into a complaint probability judgment model for calculation includes:
vectorizing the image sequence data and the text content data, and inputting the vectorized data into the complaint probability judgment model for calculation.
4. The complaint identification method according to any one of claims 1 to 3, wherein converting the user voice input into an acoustic waveform specifically comprises: detecting the voice input using a VAD algorithm to obtain the acoustic waveform.
5. The complaint identification method according to any one of claims 1 to 4, wherein continuously sampling the acoustic waveform specifically comprises: setting the length of a sliding window as the sampling period, setting the overlap length of the sliding window, and cutting the acoustic waveform with the overlapped sliding windows to obtain a series of waveform samples.
6. The complaint recognition method of any one of claims 1 to 5, wherein the speech emotion judgment model is an RNN recurrent neural network model.
7. The complaint identification method of any one of claims 1 to 6, wherein the text emotion judgment model is a CNN convolutional neural network model.
8. A multi-modal complaint recognition apparatus for recognizing whether or not a content of a user call contains a complaint content, the complaint recognition apparatus comprising:
the voice receiving module is used for receiving the user voice in the user call and converting the user voice into an acoustic waveform;
the voice conversion module is used for converting the acoustic waveform into image sequence data and identifying the text content data of the acoustic waveform;
a probability calculation module for calculating a score reflecting a probability of complaint according to the image sequence data and the text content data;
and the complaint judging module is used for judging whether the user call contains complaint content according to the score.
9. A multi-modal complaint identification system, comprising:
a storage unit for storing a computer executable program;
a processing unit for reading the computer executable program in the storage unit to execute the multi-modal complaint identification method of any of claims 1-7.
10. A computer-readable medium storing a computer-readable program for executing the multi-modal complaint identification method of any of claims 1-7.
CN201910943563.7A 2019-09-30 2019-09-30 Multi-mode complaint identification method, device and system Active CN110782916B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910943563.7A CN110782916B (en) 2019-09-30 2019-09-30 Multi-mode complaint identification method, device and system

Publications (2)

Publication Number Publication Date
CN110782916A true CN110782916A (en) 2020-02-11
CN110782916B CN110782916B (en) 2023-09-05

Family

ID=69385079

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910943563.7A Active CN110782916B (en) 2019-09-30 2019-09-30 Multi-mode complaint identification method, device and system

Country Status (1)

Country Link
CN (1) CN110782916B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102339606A (en) * 2011-05-17 2012-02-01 首都医科大学宣武医院 Depressed mood phone automatic speech recognition screening system
WO2014069443A1 (en) * 2012-10-31 2014-05-08 日本電気株式会社 Complaint call determination device and complaint call determination method
CN103811009A (en) * 2014-03-13 2014-05-21 华东理工大学 Smart phone customer service system based on speech analysis
CN105810205A (en) * 2014-12-29 2016-07-27 ***通信集团公司 Speech processing method and device
US20180027123A1 (en) * 2015-02-03 2018-01-25 Dolby Laboratories Licensing Corporation Conference searching and playback of search results
CN109817245A (en) * 2019-01-17 2019-05-28 深圳壹账通智能科技有限公司 Generation method, device, computer equipment and the storage medium of meeting summary

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101046A (en) * 2020-11-02 2020-12-18 北京淇瑀信息科技有限公司 Conversation analysis method, device and system based on conversation behavior
CN112101046B (en) * 2020-11-02 2022-04-29 北京淇瑀信息科技有限公司 Conversation analysis method, device and system based on conversation behavior
CN112804400A (en) * 2020-12-31 2021-05-14 中国工商银行股份有限公司 Customer service call voice quality inspection method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110782916B (en) 2023-09-05


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant