CN112633172A - Communication optimization method, device, equipment and medium - Google Patents

Communication optimization method, device, equipment and medium

Info

Publication number
CN112633172A
CN112633172A
Authority
CN
China
Prior art keywords
target
emotion
voice
communication optimization
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011545611.6A
Other languages
Chinese (zh)
Other versions
CN112633172B (en)
Inventor
彭钊 (Peng Zhao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Bank Co Ltd
Original Assignee
Ping An Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Bank Co Ltd
Priority to CN202011545611.6A
Publication of CN112633172A
Application granted
Publication of CN112633172B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/174: Facial expression recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48: Speech or voice analysis techniques specially adapted for particular use
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L 25/63: Speech or voice analysis techniques specially adapted for estimating an emotional state
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00: Reducing energy consumption in communication networks
    • Y02D 30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Psychiatry (AREA)
  • Hospice & Palliative Care (AREA)
  • Child & Adolescent Psychology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Image Analysis (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention relates to the field of artificial intelligence and provides a communication optimization method, apparatus, device, and medium. The method judges jointly from the emotion recognition result of the video and the emotion recognition result of the voice, effectively improving the accuracy of emotion judgment, and softens the real-time voice input by the customer when an abnormal customer emotion is detected. This prevents the customer's emotion from affecting the customer service staff, makes communication between the two parties smoother, further reduces the customer complaint rate, and brings both parties a better interactive experience. In addition, the invention relates to blockchain technology: the emotion recognition model may be stored in a blockchain node.

Description

Communication optimization method, device, equipment and medium
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a communication optimization method, a communication optimization apparatus, a communication optimization device, and a communication optimization medium.
Background
At present, many fields involve frequent audio-video communication between clients and customer service agents for problem consultation or service handling. In this process, if either party is in a bad mood, the exchange is likely to become unpleasant, leaving the client with a bad impression of the enterprise the agent represents and exposing the agent to related responsibility.
A common countermeasure is to impose a code of conduct on customer service agents, so that an agent facing an emotionally unstable client stays calm and controlled out of professionalism and concern for work performance. Relying solely on the agent's subjective self-restraint, however, carries certain risks.
Existing schemes that use intelligent recognition to judge emotion, and thereby give the agent a basis for assessing the client's mood, usually judge emotion only from the currently captured picture of the user. The judgment result is therefore not accurate enough, which impairs the agent's subsequent response.
Disclosure of Invention
In view of the above, it is necessary to provide a communication optimization method, apparatus, device, and medium that jointly evaluate the emotion recognition result of the video and the emotion recognition result of the voice, effectively improving the accuracy of emotion judgment, and that soften the real-time voice input by the client when an abnormal client emotion is detected. This prevents the client's emotion from affecting the customer service staff, makes communication between the two parties smoother, further reduces the customer complaint rate, and brings both parties a better interactive experience.
A communication optimization method, comprising:
responding to a communication optimization instruction, acquiring target acquisition equipment according to the communication optimization instruction, and starting the target acquisition equipment to acquire target voice and target video of a target user;
intercepting the characteristics of the target video to obtain a picture to be detected;
inputting the picture to be detected into a pre-trained emotion recognition model, and determining a first emotion type according to the output of the emotion recognition model, wherein the emotion recognition model is obtained based on a frame attention mechanism and residual error network training;
performing emotion recognition on the target voice to obtain a second emotion type;
when the first emotion type and/or the second emotion type are/is abnormal, acquiring input voice of the target user in real time by using the target acquisition equipment, and optimizing the input voice to obtain real-time voice;
and outputting the real-time voice.
According to a preferred embodiment of the present invention, the acquiring target acquisition device according to the communication optimization instruction includes:
analyzing the method body of the communication optimization instruction to obtain the information carried by the communication optimization instruction;
acquiring a preset label corresponding to the equipment identifier;
constructing a regular expression according to the preset label;
traversing in the information carried by the communication optimization instruction by using the regular expression, and determining the traversed data as a target equipment identifier;
and determining user equipment according to the target equipment identification, and determining acquisition equipment of the user equipment as the target acquisition equipment.
According to a preferred embodiment of the present invention, the intercepting the features of the target video to obtain the picture to be detected includes:
acquiring all frame pictures contained in the target video;
inputting each frame picture in all the frame pictures into a YOLOv3 network for identification to obtain a face area of each frame picture;
and intercepting the face area of each frame picture to obtain the picture to be detected.
According to a preferred embodiment of the present invention, said determining a first emotion type from an output of said emotion recognition model comprises:
obtaining the predicted emotion and the corresponding predicted probability of each picture to be detected from the output of the emotion recognition model;
acquiring a maximum prediction probability from the prediction probabilities as a target prediction probability;
and acquiring the predicted emotion corresponding to the target prediction probability as the first emotion type.
According to a preferred embodiment of the invention, the method further comprises:
obtaining a sample video, and splitting the sample video by a preset time length to obtain at least one sub-video;
performing feature interception on the at least one sub video to obtain a training sample;
extracting the features of the training samples by using a preset residual error network to obtain initial features;
inputting the initial features into a fully connected layer corresponding to each color channel, and outputting feature vectors;
processing the feature vector by a first sigmoid function to obtain a first attention weight;
converting the feature vector based on the first attention weight to obtain an initial global frame feature;
concatenating the feature vector and the initial global frame feature to obtain a concatenated feature;
processing the concatenated feature with a second sigmoid function to obtain a second attention weight;
converting the concatenated feature based on the second attention weight to obtain a target global frame feature;
processing the target global frame characteristics by a softmax function, and outputting a prediction result and a loss value;
and when the convergence of the loss value is detected, stopping training to obtain the emotion recognition model.
According to the preferred embodiment of the present invention, the feature vector is converted based on the first attention weight by the following formula to obtain the initial global frame feature:

\[ f'_v = \frac{\sum_{i=1}^{n} \alpha_i f_i}{\sum_{i=1}^{n} \alpha_i} \]

where \(f'_v\) is the initial global frame feature, \(\alpha_i\) is the first attention weight, \(f_i\) is the feature vector, \(i\) is the frame number to which the feature vector belongs, and \(n\) is the maximum frame number;

the concatenated feature is converted based on the second attention weight by the following formula to obtain the target global frame feature:

\[ f_v = \frac{\sum_{i=1}^{n} \beta_i \, [f_i : f'_v]}{\sum_{i=1}^{n} \beta_i} \]

where \(f_v\) is the target global frame feature, \(\beta_i\) is the second attention weight, and \([f_i : f'_v]\) is the concatenated feature.
According to a preferred embodiment of the present invention, the optimizing the input speech to obtain real-time speech includes:
carrying out noise reduction processing on the input voice to obtain a first voice;
identifying a target sound wave in the first voice, and deleting the target sound wave from the first voice to obtain a second voice;
and performing fade-in and fade-out processing on the second voice to obtain the real-time voice.
A communication optimization device, the communication optimization device comprising:
the acquisition unit is used for responding to a communication optimization instruction, acquiring target acquisition equipment according to the communication optimization instruction, and starting the target acquisition equipment to acquire target voice and target video of a target user;
the intercepting unit is used for intercepting the characteristics of the target video to obtain a picture to be detected;
the recognition unit is used for inputting the picture to be detected to a pre-trained emotion recognition model and determining a first emotion type according to the output of the emotion recognition model, wherein the emotion recognition model is obtained based on a frame attention mechanism and residual error network training;
the recognition unit is further used for carrying out emotion recognition on the target voice to obtain a second emotion type;
the optimization unit is used for acquiring the input voice of the target user in real time by using the target acquisition equipment and optimizing the input voice to obtain real-time voice when the first emotion type and/or the second emotion type are/is abnormal;
and the output unit is used for outputting the real-time voice.
An electronic device, the electronic device comprising:
a memory storing at least one instruction; and
and the processor executes the instructions stored in the memory to realize the communication optimization method.
A computer-readable storage medium having stored therein at least one instruction, the at least one instruction being executable by a processor in an electronic device to implement the communication optimization method.
According to the technical scheme, the method responds to a communication optimization instruction, obtains the target acquisition device according to the instruction, and starts that device to collect the target voice and target video of the target user. Feature interception is performed on the target video to obtain pictures to be detected, which are input into a pre-trained emotion recognition model, and a first emotion type is determined from the model's output. Because the model is trained with a frame attention mechanism and a residual network, it integrates time-related sequence features so as to effectively classify the features of video segments; performing emotion recognition on video-level features is more conducive to capturing representative facial emotional states. Emotion recognition is also performed on the target voice to obtain a second emotion type. When the first emotion type and/or the second emotion type is abnormal, the target acquisition device collects the input voice of the target user in real time, the input voice is optimized to obtain real-time voice, and the real-time voice is output. Judging jointly from the emotion recognition result of the video and that of the voice effectively improves the accuracy of emotion judgment, and softening the real-time voice input by the client when an abnormal client emotion is detected prevents the client's emotion from affecting the customer service staff, makes communication between the two parties smoother, further reduces the customer complaint rate, and brings both parties a better interactive experience.
Drawings
Fig. 1 is a flow chart of a communication optimization method according to a preferred embodiment of the present invention.
Fig. 2 is a functional block diagram of a communication optimization apparatus according to a preferred embodiment of the present invention.
Fig. 3 is a schematic structural diagram of an electronic device implementing a communication optimization method according to a preferred embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a flow chart of a communication optimization method according to a preferred embodiment of the present invention. The order of the steps in the flow chart may be changed and some steps may be omitted according to different needs.
The communication optimization method is applied to one or more electronic devices. An electronic device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The electronic device may be any electronic product capable of performing human-computer interaction with a user, for example, a Personal computer, a tablet computer, a smart phone, a Personal Digital Assistant (PDA), a game machine, an interactive Internet Protocol Television (IPTV), an intelligent wearable device, and the like.
The electronic device may also include a network device and/or a user device. The network device includes, but is not limited to, a single network server, a server group consisting of a plurality of network servers, or a Cloud Computing (Cloud Computing) based Cloud consisting of a large number of hosts or network servers.
The Network where the electronic device is located includes, but is not limited to, the internet, a wide area Network, a metropolitan area Network, a local area Network, a Virtual Private Network (VPN), and the like.
S10, responding to the communication optimization instruction, acquiring the target acquisition equipment according to the communication optimization instruction, and starting the target acquisition equipment to acquire the target voice and the target video of the target user.
In at least one embodiment of the present invention, the communication optimization instruction may be triggered by the customer service agent currently on the call, or may be triggered automatically when the start of audio/video is detected; the present invention is not limited thereto.
In at least one embodiment of the present invention, the acquiring the target collecting device according to the communication optimization instruction includes:
analyzing the method body of the communication optimization instruction to obtain the information carried by the communication optimization instruction;
acquiring a preset label corresponding to the equipment identifier;
constructing a regular expression according to the preset label;
traversing in the information carried by the communication optimization instruction by using the regular expression, and determining the traversed data as a target equipment identifier;
and determining user equipment according to the target equipment identification, and determining acquisition equipment of the user equipment as the target acquisition equipment.
For example, when a bank customer service agent and a client interact by video, each party holds a terminal device for the conference. By parsing the communication optimization instruction, the client's terminal device is determined as the user equipment, and the acquisition device of that user equipment is determined as the target acquisition device.
The information carried by the communication optimization instruction may include, but is not limited to: equipment identification, a user name for triggering the communication optimization instruction and the like.
The communication optimization instruction is code; by coding convention, the content between the braces { } of the instruction is called the method body.
The preset label may be user-defined, and the preset label corresponds one-to-one with the device identifier. For example, the preset label may be ZID; the regular expression ZID() is then constructed from the label and used to traverse the carried information.
Through the implementation mode, the equipment identification can be rapidly determined based on the regular expression and the preset label, and the target acquisition equipment is further determined by utilizing the equipment identification.
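The following is a minimal sketch of this lookup in Python. The instruction format, the braces convention, and the ZID tag value are illustrative assumptions; the patent does not fix a concrete wire format.

import re
from typing import Optional

def extract_target_device_id(instruction: str, tag: str = "ZID") -> Optional[str]:
    # The "method body" is taken as the content between the braces.
    body = re.search(r"\{(.*)\}", instruction, re.DOTALL)
    if body is None:
        return None
    # Construct a regular expression from the preset tag and traverse
    # the carried information with it.
    hit = re.search(re.escape(tag) + r"\((.*?)\)", body.group(1))
    return hit.group(1) if hit else None

# Hypothetical instruction carrying a device identifier.
instr = "optimize{user=li_hua;ZID(dev-4711);channel=video}"
print(extract_target_device_id(instr))  # -> dev-4711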
And S11, performing feature interception on the target video to obtain a picture to be detected.
Since each target video may contain non-face information that would interfere with feature recognition, feature interception is performed on the video first.
Specifically, performing feature interception on the target video to obtain the picture to be detected includes:
acquiring all frame pictures contained in the target video;
inputting each frame picture in all the frame pictures into a YOLOv3 network for identification to obtain a face area of each frame picture;
and intercepting the face area of each frame picture to obtain the picture to be detected.
Through this embodiment, since the YOLOv3 network offers high stability and precision, intercepting the facial features with the YOLOv3 network effectively eliminates redundant information in the video and improves the accuracy and efficiency of subsequent emotion recognition.
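A minimal sketch of this step using OpenCV follows. To keep it self-contained, a Haar cascade stands in for the YOLOv3 face detector named in the text; a faithful implementation would load YOLOv3 face-detection weights instead.

import cv2

def detect_face(frame):
    # Stand-in for the YOLOv3 face detector; returns (x, y, w, h) or None.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return boxes[0] if len(boxes) else None

def crop_face_pictures(video_path):
    # Yield the intercepted face region of every frame picture in the video.
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        box = detect_face(frame)
        if box is not None:
            x, y, w, h = box
            yield frame[y:y + h, x:x + w]
    cap.release()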
S12, inputting the picture to be detected into a pre-trained emotion recognition model, and determining a first emotion type according to the output of the emotion recognition model, wherein the emotion recognition model is obtained based on a frame attention mechanism and residual error network training.
For example: the output of the emotion recognition model may be: anger, 0.95.
In this embodiment, the determining the first emotion type according to the output of the emotion recognition model includes:
obtaining the predicted emotion and the corresponding predicted probability of each picture to be detected from the output of the emotion recognition model;
acquiring a maximum prediction probability from the prediction probabilities as a target prediction probability;
and acquiring the predicted emotion corresponding to the target prediction probability as the first emotion type.
Through the embodiment, the final emotion recognition result can be determined by integrating all recognition results, so that the recognition accuracy is higher.
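As an illustration, a minimal Python sketch of this selection step, assuming the model output is available as (predicted emotion, predicted probability) pairs, one per picture to be detected:

def first_emotion_type(per_picture_outputs):
    # Return the predicted emotion with the maximum prediction probability.
    emotion, _ = max(per_picture_outputs, key=lambda pair: pair[1])
    return emotion

print(first_emotion_type([("neutral", 0.61), ("anger", 0.95), ("anger", 0.80)]))
# -> anger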
In at least one embodiment of the invention, the method further comprises:
obtaining a sample video, and splitting the sample video by a preset time length to obtain at least one sub-video;
performing feature interception on the at least one sub video to obtain a training sample;
extracting the features of the training samples by using a preset residual error network to obtain initial features;
inputting the initial features into a fully connected layer corresponding to each color channel, and outputting feature vectors;
processing the feature vector by a first sigmoid function to obtain a first attention weight;
converting the feature vector based on the first attention weight to obtain an initial global frame feature;
concatenating the feature vector and the initial global frame feature to obtain a concatenated feature;
processing the concatenated feature with a second sigmoid function to obtain a second attention weight;
converting the concatenated feature based on the second attention weight to obtain a target global frame feature;
processing the target global frame characteristics by a softmax function, and outputting a prediction result and a loss value;
and when the convergence of the loss value is detected, stopping training to obtain the emotion recognition model.
The preset time duration may be configured by user-defined means, such as 10 seconds.
Further, the at least one sub-video may be feature-intercepted using a YOLOv3 network.
Further, the predetermined residual network may be a Resnet50 network, and the present invention is not limited thereto.
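A minimal sketch of the sample-video splitting step, using OpenCV. The 10-second default follows the example above; returning in-memory frame lists rather than writing sub-video files is a simplification.

import cv2

def split_video(path, seconds=10):
    # Split a sample video into sub-videos of a preset duration.
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0  # fall back if FPS is unknown
    chunk_len = max(1, int(fps * seconds))
    chunks, current = [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        current.append(frame)
        if len(current) == chunk_len:
            chunks.append(current)
            current = []
    cap.release()
    if current:  # keep the trailing, shorter sub-video
        chunks.append(current)
    return chunks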
In this embodiment, when concatenating the feature vector and the initial global frame feature, horizontal concatenation may be adopted.
For example: after the concatenation of two 1024 x 1 vectors, a 2048 x 1 vector is obtained.
Through the embodiment, the time-dependent sequence features can be integrated based on the frame attention mechanism, so that the features of the video segments can be effectively classified, and the trained emotion recognition model has higher accuracy.
Specifically, the feature vector is converted based on the first attention weight by the following formula to obtain the initial global frame feature:

\[ f'_v = \frac{\sum_{i=1}^{n} \alpha_i f_i}{\sum_{i=1}^{n} \alpha_i} \]

where \(f'_v\) is the initial global frame feature, \(\alpha_i\) is the first attention weight, \(f_i\) is the feature vector, \(i\) is the frame number to which the feature vector belongs, and \(n\) is the maximum frame number;

the concatenated feature is converted based on the second attention weight by the following formula to obtain the target global frame feature:

\[ f_v = \frac{\sum_{i=1}^{n} \beta_i \, [f_i : f'_v]}{\sum_{i=1}^{n} \beta_i} \]

where \(f_v\) is the target global frame feature, \(\beta_i\) is the second attention weight, and \([f_i : f'_v]\) is the concatenated feature.
Through this embodiment, feature normalization is performed multiple times based on the frame attention mechanism, and image features are converted into a global video feature; performing emotion recognition on the video feature is more conducive to capturing representative facial emotional states.
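The following PyTorch sketch ties the training steps and the two formulas together. The feature dimension (1024, matching the concatenation example above) and the number of emotion classes are assumptions; this is an illustrative attention head under those assumptions, not the patented implementation.

import torch
import torch.nn as nn

class FrameAttention(nn.Module):
    def __init__(self, feat_dim=1024, num_classes=7):
        super().__init__()
        self.alpha_fc = nn.Linear(feat_dim, 1)       # first attention branch
        self.beta_fc = nn.Linear(2 * feat_dim, 1)    # second attention branch
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, f):             # f: (n, feat_dim), one row per frame i
        alpha = torch.sigmoid(self.alpha_fc(f))            # (n, 1)
        # f'_v = sum_i alpha_i f_i / sum_i alpha_i (initial global feature)
        f_v0 = (alpha * f).sum(0) / alpha.sum()            # (feat_dim,)
        # [f_i : f'_v]: horizontal concatenation, 1024 -> 2048 per frame
        cat = torch.cat([f, f_v0.expand_as(f)], dim=1)     # (n, 2*feat_dim)
        beta = torch.sigmoid(self.beta_fc(cat))            # (n, 1)
        # f_v = sum_i beta_i [f_i : f'_v] / sum_i beta_i (target feature)
        f_v = (beta * cat).sum(0) / beta.sum()             # (2*feat_dim,)
        return self.classifier(f_v)   # logits; softmax plus cross-entropy
                                      # give the prediction and loss value

# Toy usage: 16 frames of 1024-d features from the residual network.
logits = FrameAttention()(torch.randn(16, 1024))
probabilities = torch.softmax(logits, dim=0)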
And S13, performing emotion recognition on the target voice to obtain a second emotion type.
In at least one embodiment of the present invention, emotion recognition may be performed on the target speech using a GMM (Gaussian Mixture Model), an SVM (Support Vector Machine), an HMM (Hidden Markov Model), a multiple classifier system, and the like, which will not be described herein.
It should be noted that the first emotion type refers to the emotion type recognized by the emotion recognition model, while the second emotion type refers to the emotion type recognized from the target speech. Both the first emotion type and the second emotion type may include emotions such as anger and happiness.
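As one illustration of the speech branch, here is a hedged sketch using an SVM over MFCC features (librosa and scikit-learn). The feature choice, the 16 kHz sample rate, the file names, and the label set are all assumptions, not fixed by the text.

import numpy as np
import librosa
from sklearn.svm import SVC

def mfcc_features(wav_path):
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # (13, frames)
    return mfcc.mean(axis=1)  # one utterance-level feature vector

# A labelled speech corpus would supply these; the paths are placeholders.
train_paths = ["angry_sample.wav", "neutral_sample.wav"]
train_labels = ["anger", "neutral"]
X = np.stack([mfcc_features(p) for p in train_paths])
clf = SVC(probability=True).fit(X, train_labels)

second_emotion_type = clf.predict(mfcc_features("target_voice.wav")[None, :])[0]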
And S14, when the first emotion type and/or the second emotion type are/is abnormal, acquiring the input voice of the target user in real time by using the target acquisition equipment, and optimizing the input voice to obtain real-time voice.
In this embodiment, when it is detected that the first emotion type and/or the second emotion type is anger, excitement, or the like, it may be determined as abnormal.
This embodiment judges jointly from the emotion recognition result of the video and the emotion recognition result of the voice, effectively improving the accuracy of emotion judgment.
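A minimal sketch of this and/or decision, with an assumed set of abnormal emotion types taken from the examples above:

ABNORMAL_EMOTIONS = {"anger", "excitement"}  # example set from this embodiment

def emotion_abnormal(first_type: str, second_type: str) -> bool:
    # "and/or": either modality flagging an abnormal emotion suffices.
    return first_type in ABNORMAL_EMOTIONS or second_type in ABNORMAL_EMOTIONS

print(emotion_abnormal("anger", "neutral"))  # -> True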
In at least one embodiment of the present invention, the optimizing the input speech to obtain real-time speech includes:
carrying out noise reduction processing on the input voice to obtain a first voice;
identifying a target sound wave in the first voice, and deleting the target sound wave from the first voice to obtain a second voice;
and performing fade-in and fade-out processing on the second voice to obtain the real-time voice.
Wherein the target sound wave may include an aggressive sound wave, etc., and the present invention is not limited thereto.
By the embodiment, when the emotion abnormality of the client is detected, the real-time voice input by the client can be softly processed.
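A sketch of the three-step optimization on a raw sample array follows. The noise gate and the energy-spike heuristic are simplistic stand-ins for real noise reduction and for a trained detector of the target (aggressive) sound waves.

import numpy as np

def optimize_voice(samples, sr, fade_ms=200):
    x = np.asarray(samples, dtype=np.float32)
    # 1) Noise reduction: zero out very quiet samples (a simple noise gate).
    x[np.abs(x) < 0.01 * np.abs(x).max()] = 0.0
    # 2) Delete target sound waves: attenuate 10 ms frames whose energy
    #    spikes far above the average (stand-in for aggressive sound).
    frame = max(1, sr // 100)
    n = len(x) // frame
    frames = x[:n * frame].reshape(n, frame)
    energy = (frames ** 2).mean(axis=1)
    frames[energy > 4 * energy.mean()] *= 0.2
    x = frames.reshape(-1)
    # 3) Fade-in / fade-out processing to soften the onset and the tail.
    k = max(1, min(len(x), int(sr * fade_ms / 1000)))
    x[:k] *= np.linspace(0.0, 1.0, k)
    x[-k:] *= np.linspace(1.0, 0.0, k)
    return x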
And S15, outputting the real-time voice.
For example: and outputting the real-time voice on the terminal equipment of the customer service staff.
Through this implementation, the influence of the customer's emotion on the customer service staff can be avoided, communication between the two parties becomes smoother, the customer complaint rate is further reduced, and both parties enjoy a better interactive experience.
It should be noted that, in order to further ensure the security of the data, the emotion recognition model may be deployed in a blockchain to avoid malicious tampering of the data.
According to the technical scheme, the method responds to a communication optimization instruction, obtains the target acquisition device according to the instruction, and starts that device to collect the target voice and target video of the target user. Feature interception is performed on the target video to obtain pictures to be detected, which are input into a pre-trained emotion recognition model, and a first emotion type is determined from the model's output. Because the model is trained with a frame attention mechanism and a residual network, it integrates time-related sequence features so as to effectively classify the features of video segments; performing emotion recognition on video-level features is more conducive to capturing representative facial emotional states. Emotion recognition is also performed on the target voice to obtain a second emotion type. When the first emotion type and/or the second emotion type is abnormal, the target acquisition device collects the input voice of the target user in real time, the input voice is optimized to obtain real-time voice, and the real-time voice is output. Judging jointly from the emotion recognition result of the video and that of the voice effectively improves the accuracy of emotion judgment, and softening the real-time voice input by the client when an abnormal client emotion is detected prevents the client's emotion from affecting the customer service staff, makes communication between the two parties smoother, further reduces the customer complaint rate, and brings both parties a better interactive experience.
Fig. 2 is a functional block diagram of a communication optimization apparatus according to a preferred embodiment of the present invention. The communication optimization device 11 includes an acquisition unit 110, an interception unit 111, an identification unit 112, an optimization unit 113, and an output unit 114. The module/unit referred to in the present invention refers to a series of computer program segments that can be executed by the processor 13 and that can perform a fixed function, and that are stored in the memory 12. In the present embodiment, the functions of the modules/units will be described in detail in the following embodiments.
In response to the communication optimization instruction, the acquisition unit 110 acquires the target acquisition device according to the communication optimization instruction, and starts the target acquisition device to acquire the target voice and the target video of the target user.
In at least one embodiment of the present invention, the communication optimization instruction may be triggered by the customer service agent currently on the call, or may be triggered automatically when the start of audio/video is detected; the present invention is not limited thereto.
In at least one embodiment of the present invention, the acquisition unit 110 acquiring the target acquisition device according to the communication optimization instruction includes:
analyzing the method body of the communication optimization instruction to obtain the information carried by the communication optimization instruction;
acquiring a preset label corresponding to the equipment identifier;
constructing a regular expression according to the preset label;
traversing in the information carried by the communication optimization instruction by using the regular expression, and determining the traversed data as a target equipment identifier;
and determining user equipment according to the target equipment identification, and determining acquisition equipment of the user equipment as the target acquisition equipment.
For example, when a bank customer service agent and a client interact by video, each party holds a terminal device for the conference. By parsing the communication optimization instruction, the client's terminal device is determined as the user equipment, and the acquisition device of that user equipment is determined as the target acquisition device.
The information carried by the communication optimization instruction may include, but is not limited to: equipment identification, a user name for triggering the communication optimization instruction and the like.
The communication optimization instruction is code; by coding convention, the content between the braces { } of the instruction is called the method body.
The preset label may be user-defined, and the preset label corresponds one-to-one with the device identifier. For example, the preset label may be ZID; the regular expression ZID() is then constructed from the label and used to traverse the carried information.
Through the implementation mode, the equipment identification can be rapidly determined based on the regular expression and the preset label, and the target acquisition equipment is further determined by utilizing the equipment identification.
The intercepting unit 111 intercepts the features of the target video to obtain a picture to be detected.
Since each target video may contain non-face information that would interfere with feature recognition, feature interception is performed on the video first.
Specifically, the intercepting unit 111 performs feature interception on the target video to obtain a to-be-detected picture, including:
acquiring all frame pictures contained in the target video;
inputting each frame picture in all the frame pictures into a YOLOv3 network for identification to obtain a face area of each frame picture;
and intercepting the face area of each frame picture to obtain the picture to be detected.
Through this embodiment, since the YOLOv3 network offers high stability and precision, intercepting the facial features with the YOLOv3 network effectively eliminates redundant information in the video and improves the accuracy and efficiency of subsequent emotion recognition.
The recognition unit 112 inputs the picture to be detected to a pre-trained emotion recognition model, and determines a first emotion type according to the output of the emotion recognition model, wherein the emotion recognition model is obtained based on a frame attention mechanism and residual error network training.
For example: the output of the emotion recognition model may be: anger, 0.95.
In this embodiment, the determining, by the recognition unit 112, the first emotion type according to the output of the emotion recognition model includes:
obtaining the predicted emotion and the corresponding predicted probability of each picture to be detected from the output of the emotion recognition model;
acquiring a maximum prediction probability from the prediction probabilities as a target prediction probability;
and acquiring the predicted emotion corresponding to the target prediction probability as the first emotion type.
Through the embodiment, the final emotion recognition result can be determined by integrating all recognition results, so that the recognition accuracy is higher.
In at least one embodiment of the invention, a sample video is obtained, and the sample video is split according to a preset duration to obtain at least one sub-video;
performing feature interception on the at least one sub video to obtain a training sample;
extracting the features of the training samples by using a preset residual error network to obtain initial features;
inputting the initial features into a fully connected layer corresponding to each color channel, and outputting feature vectors;
processing the feature vector by a first sigmoid function to obtain a first attention weight;
converting the feature vector based on the first attention weight to obtain an initial global frame feature;
concatenating the feature vector and the initial global frame feature to obtain a concatenated feature;
processing the concatenated feature with a second sigmoid function to obtain a second attention weight;
converting the concatenated feature based on the second attention weight to obtain a target global frame feature;
processing the target global frame characteristics by a softmax function, and outputting a prediction result and a loss value;
and when the convergence of the loss value is detected, stopping training to obtain the emotion recognition model.
The preset time duration may be configured by user-defined means, such as 10 seconds.
Further, the at least one sub-video may be feature-intercepted using a YOLOv3 network.
Further, the predetermined residual network may be a Resnet50 network, and the present invention is not limited thereto.
In this embodiment, when concatenating the feature vector and the initial global frame feature, horizontal concatenation may be adopted.
For example: after the concatenation of two 1024 x 1 vectors, a 2048 x 1 vector is obtained.
Through the embodiment, the time-dependent sequence features can be integrated based on the frame attention mechanism, so that the features of the video segments can be effectively classified, and the trained emotion recognition model has higher accuracy.
Specifically, the feature vector is converted based on the first attention weight by the following formula to obtain the initial global frame feature:

\[ f'_v = \frac{\sum_{i=1}^{n} \alpha_i f_i}{\sum_{i=1}^{n} \alpha_i} \]

where \(f'_v\) is the initial global frame feature, \(\alpha_i\) is the first attention weight, \(f_i\) is the feature vector, \(i\) is the frame number to which the feature vector belongs, and \(n\) is the maximum frame number;

the concatenated feature is converted based on the second attention weight by the following formula to obtain the target global frame feature:

\[ f_v = \frac{\sum_{i=1}^{n} \beta_i \, [f_i : f'_v]}{\sum_{i=1}^{n} \beta_i} \]

where \(f_v\) is the target global frame feature, \(\beta_i\) is the second attention weight, and \([f_i : f'_v]\) is the concatenated feature.
Through this embodiment, feature normalization is performed multiple times based on the frame attention mechanism, and image features are converted into a global video feature; performing emotion recognition on the video feature is more conducive to capturing representative facial emotional states.
The recognition unit 112 performs emotion recognition on the target voice to obtain a second emotion type.
In at least one embodiment of the present invention, emotion recognition may be performed on the target speech using a GMM (Gaussian Mixture Model), an SVM (Support Vector Machine), an HMM (Hidden Markov Model), a multiple classifier system, and the like, which will not be described herein.
It should be noted that the first emotion type refers to the emotion type recognized by the emotion recognition model, while the second emotion type refers to the emotion type recognized from the target speech. Both the first emotion type and the second emotion type may include emotions such as anger and happiness.
When the first emotion type and/or the second emotion type are/is abnormal, the optimization unit 113 acquires the input voice of the target user in real time by using the target acquisition device, and performs optimization processing on the input voice to obtain real-time voice.
In this embodiment, when it is detected that the first emotion type and/or the second emotion type is anger, excitement, or the like, it may be determined as abnormal.
This embodiment judges jointly from the emotion recognition result of the video and the emotion recognition result of the voice, effectively improving the accuracy of emotion judgment.
In at least one embodiment of the present invention, the optimization unit 113 performing optimization processing on the input speech to obtain real-time speech includes:
carrying out noise reduction processing on the input voice to obtain a first voice;
identifying a target sound wave in the first voice, and deleting the target sound wave from the first voice to obtain a second voice;
and performing fade-in and fade-out processing on the second voice to obtain the real-time voice.
Wherein the target sound wave may include an aggressive sound wave, etc., and the present invention is not limited thereto.
By the embodiment, when the emotion abnormality of the client is detected, the real-time voice input by the client can be softly processed.
The output unit 114 outputs the real-time voice.
For example: and outputting the real-time voice on the terminal equipment of the customer service staff.
Through this implementation, the influence of the customer's emotion on the customer service staff can be avoided, communication between the two parties becomes smoother, the customer complaint rate is further reduced, and both parties enjoy a better interactive experience.
It should be noted that, in order to further ensure the security of the data, the emotion recognition model may be deployed in a blockchain to avoid malicious tampering of the data.
According to the technical scheme, the apparatus responds to a communication optimization instruction, obtains the target acquisition device according to the instruction, and starts that device to collect the target voice and target video of the target user. Feature interception is performed on the target video to obtain pictures to be detected, which are input into a pre-trained emotion recognition model, and a first emotion type is determined from the model's output. Because the model is trained with a frame attention mechanism and a residual network, it integrates time-related sequence features so as to effectively classify the features of video segments; performing emotion recognition on video-level features is more conducive to capturing representative facial emotional states. Emotion recognition is also performed on the target voice to obtain a second emotion type. When the first emotion type and/or the second emotion type is abnormal, the target acquisition device collects the input voice of the target user in real time, the input voice is optimized to obtain real-time voice, and the real-time voice is output. Judging jointly from the emotion recognition result of the video and that of the voice effectively improves the accuracy of emotion judgment, and softening the real-time voice input by the client when an abnormal client emotion is detected prevents the client's emotion from affecting the customer service staff, makes communication between the two parties smoother, further reduces the customer complaint rate, and brings both parties a better interactive experience.
Fig. 3 is a schematic structural diagram of an electronic device according to a preferred embodiment of the communication optimization method of the present invention.
The electronic device 1 may comprise a memory 12, a processor 13 and a bus, and may further comprise a computer program, such as a communication optimization program, stored in the memory 12 and executable on the processor 13.
It will be understood by those skilled in the art that the schematic diagram is merely an example of the electronic device 1 and does not constitute a limitation of it. The electronic device 1 may have a bus-type or star-type structure, may include more or fewer hardware or software components than shown, or may have a different arrangement of components; for example, it may further include input/output devices, network access devices, and the like.
It should be noted that the electronic device 1 is only an example; other existing or future electronic products that can be adapted to the present invention should also be included in the scope of protection of the present invention and are incorporated herein by reference.
The memory 12 includes at least one type of readable storage medium, which includes flash memory, removable hard disks, multimedia cards, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disks, optical disks, etc. The memory 12 may in some embodiments be an internal storage unit of the electronic device 1, for example a removable hard disk of the electronic device 1. The memory 12 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the electronic device 1. Further, the memory 12 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 12 may be used not only to store application software installed in the electronic device 1 and various types of data, such as codes of a communication optimization program, etc., but also to temporarily store data that has been output or is to be output.
The processor 13 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The processor 13 is a Control Unit (Control Unit) of the electronic device 1, connects various components of the electronic device 1 by using various interfaces and lines, and executes various functions and processes data of the electronic device 1 by running or executing programs or modules (e.g., executing a communication optimization program, etc.) stored in the memory 12 and calling data stored in the memory 12.
The processor 13 executes an operating system of the electronic device 1 and various installed application programs. The processor 13 executes the application program to implement the steps of the above-described communication optimization method embodiments, such as the steps shown in fig. 1.
Illustratively, the computer program may be divided into one or more modules/units, which are stored in the memory 12 and executed by the processor 13 to accomplish the present invention. The one or more modules/units may be a series of computer readable instruction segments capable of performing certain functions, which are used for describing the execution process of the computer program in the electronic device 1. For example, the computer program may be divided into an acquisition unit 110, a truncation unit 111, a recognition unit 112, an optimization unit 113, an output unit 114.
The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a computer device, or a network device) or a processor (processor) to execute parts of the communication optimization method according to the embodiments of the present invention.
The integrated modules/units of the electronic device 1 may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as separate products. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented.
Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), random-access Memory, or the like.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, each containing a batch of network transaction information used to verify the validity (anti-counterfeiting) of the information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one arrow is shown in FIG. 3, but this does not indicate only one bus or one type of bus. The bus is arranged to enable connection communication between the memory 12 and at least one processor 13 or the like.
Although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 13 through a power management device, so as to implement functions of charge management, discharge management, power consumption management, and the like through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 1 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device 1 may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a bluetooth interface, etc.), which are generally used for establishing a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
Fig. 3 only shows the electronic device 1 with components 12-13, and it will be understood by a person skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or a combination of certain components, or a different arrangement of components.
Referring to fig. 1, the memory 12 of the electronic device 1 stores a plurality of instructions to implement a communication optimization method, and the processor 13 can execute the plurality of instructions to implement:
responding to a communication optimization instruction, acquiring target acquisition equipment according to the communication optimization instruction, and starting the target acquisition equipment to acquire target voice and target video of a target user;
intercepting the characteristics of the target video to obtain a picture to be detected;
inputting the picture to be detected into a pre-trained emotion recognition model, and determining a first emotion type according to the output of the emotion recognition model, wherein the emotion recognition model is obtained based on a frame attention mechanism and residual error network training;
performing emotion recognition on the target voice to obtain a second emotion type;
when the first emotion type and/or the second emotion type are/is abnormal, acquiring input voice of the target user in real time by using the target acquisition equipment, and optimizing the input voice to obtain real-time voice;
and outputting the real-time voice.
Specifically, the processor 13 may refer to the description of the relevant steps in the embodiment corresponding to fig. 1 for a specific implementation method of the instruction, which is not described herein again.
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into modules is only one logical functional division, and other divisions may be adopted in actual implementations.
The modules described as separate parts may or may not be physically separate, and parts shown as modules may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in system embodiments may also be implemented by one unit or means through software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art will understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from the spirit and scope of those technical solutions.

Claims (10)

1. A communication optimization method, comprising:
responding to a communication optimization instruction, acquiring target acquisition equipment according to the communication optimization instruction, and starting the target acquisition equipment to acquire target voice and target video of a target user;
performing feature interception on the target video to obtain a picture to be detected;
inputting the picture to be detected into a pre-trained emotion recognition model, and determining a first emotion type according to the output of the emotion recognition model, wherein the emotion recognition model is obtained based on a frame attention mechanism and residual network training;
performing emotion recognition on the target voice to obtain a second emotion type;
when the first emotion type and/or the second emotion type is abnormal, acquiring input voice of the target user in real time by using the target acquisition equipment, and optimizing the input voice to obtain real-time voice;
and outputting the real-time voice.
2. The communication optimization method of claim 1, wherein the acquiring the target acquisition equipment according to the communication optimization instruction comprises:
parsing the body of the communication optimization instruction to obtain the information carried by the communication optimization instruction;
acquiring a preset label corresponding to the equipment identifier;
constructing a regular expression according to the preset label;
traversing the information carried by the communication optimization instruction by using the regular expression, and determining the traversed data as a target equipment identifier;
and determining user equipment according to the target equipment identifier, and determining the acquisition equipment of the user equipment as the target acquisition equipment.
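By way of illustration only, the regular-expression step of this claim might look as follows, assuming a JSON-like instruction body and assuming the preset label for the equipment identifier is the string "device_id" (neither is fixed by the patent):

```python
import re

# Hypothetical illustration of claim 2: the preset label "device_id" and the
# instruction body below are assumptions, not values fixed by the patent.
PRESET_LABEL = "device_id"
pattern = re.compile(rf'"{PRESET_LABEL}"\s*:\s*"([^"]+)"')

instruction_info = '{"action": "optimize", "device_id": "cam-07", "user": "u123"}'
match = pattern.search(instruction_info)
if match:
    target_device_id = match.group(1)  # the traversed data becomes the target identifier
    print(target_device_id)            # -> cam-07
```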
3. The communication optimization method according to claim 1, wherein the performing feature interception on the target video to obtain the picture to be detected comprises:
acquiring all frame pictures contained in the target video;
inputting each frame picture in all the frame pictures into a YOLOv3 network for identification to obtain a face area of each frame picture;
and intercepting the face area of each frame picture to obtain the picture to be detected.
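By way of illustration only, a sketch of this claim using OpenCV's DNN module; the model file names are assumptions, and the single-box output decoding is a simplification of full YOLOv3 post-processing:

```python
import cv2

# Sketch of claim 3, assuming a Darknet-format YOLOv3 face model is available
# under the (assumed) file names below; the patent specifies no model files.
net = cv2.dnn.readNetFromDarknet("yolov3-face.cfg", "yolov3-face.weights")
out_names = net.getUnconnectedOutLayersNames()

def best_face_box(frame):
    """Return (x, y, w, h) for the highest-confidence detection, or None."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    best, best_conf = None, 0.0
    for out in net.forward(out_names):
        for det in out:                    # det = [cx, cy, bw, bh, objectness, ...]
            if float(det[4]) > best_conf:
                best_conf = float(det[4])
                cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
                best = (int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh))
    return best

def pictures_to_detect(video_path):
    """Crop the face area out of every frame picture of the target video."""
    cap = cv2.VideoCapture(video_path)
    pictures = []
    ok, frame = cap.read()
    while ok:
        box = best_face_box(frame)
        if box is not None:
            x, y, bw, bh = box
            pictures.append(frame[max(y, 0):y + bh, max(x, 0):x + bw])
        ok, frame = cap.read()
    cap.release()
    return pictures
```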
4. The communication optimization method of claim 1, wherein determining a first emotion type based on the output of the emotion recognition model comprises:
obtaining the predicted emotion and the corresponding predicted probability of each picture to be detected from the output of the emotion recognition model;
acquiring a maximum prediction probability from the prediction probabilities as a target prediction probability;
and acquiring the predicted emotion corresponding to the target prediction probability as the first emotion type.
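This claim reduces to an argmax over the per-picture model outputs; a toy illustration (the labels and probabilities below are invented for the example):

```python
# Illustrative (picture, prediction) outputs of the emotion recognition model.
predictions = [("calm", 0.61), ("angry", 0.87), ("calm", 0.55)]

# The maximum prediction probability is the target prediction probability,
# and its emotion becomes the first emotion type.
first_emotion_type, target_probability = max(predictions, key=lambda p: p[1])
print(first_emotion_type)  # -> angry
```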
5. The communication optimization method of claim 1, further comprising:
obtaining a sample video, and splitting the sample video by a preset time length to obtain at least one sub-video;
performing feature interception on the at least one sub-video to obtain a training sample;
extracting features of the training sample by using a preset residual network to obtain initial features;
inputting the initial features into a fully connected layer corresponding to each color channel, and outputting feature vectors;
processing the feature vector by a first sigmoid function to obtain a first attention weight;
converting the feature vector based on the first attention weight to obtain an initial global frame feature;
concatenating the feature vector and the initial global frame feature to obtain a concatenated feature;
processing the concatenated feature by a second sigmoid function to obtain a second attention weight;
converting the concatenated feature based on the second attention weight to obtain a target global frame feature;
processing the target global frame feature by a softmax function, and outputting a prediction result and a loss value;
and when the convergence of the loss value is detected, stopping training to obtain the emotion recognition model.
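By way of illustration only, a compact PyTorch sketch of the claimed forward pass, assuming a ResNet-18 backbone, 512-dimensional features and seven emotion classes (none of which the patent fixes); the per-color-channel detail of the fully connected layer is simplified to a single layer, and the claimed softmax is folded into the cross-entropy loss:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18  # depth is an assumption; the patent
                                         # only says "a preset residual network"

class FrameAttention(nn.Module):
    """Sketch of the claimed forward pass; feature size and class count assumed."""
    def __init__(self, feat_dim=512, n_classes=7):
        super().__init__()
        backbone = resnet18(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])  # initial features
        self.fc = nn.Linear(feat_dim, feat_dim)       # single FC layer (per-channel detail elided)
        self.alpha = nn.Linear(feat_dim, 1)           # scores for the first sigmoid
        self.beta = nn.Linear(feat_dim * 2, 1)        # scores for the second sigmoid
        self.cls = nn.Linear(feat_dim * 2, n_classes)

    def forward(self, frames):                        # frames: (n, 3, H, W), one sub-video
        f = self.fc(self.backbone(frames).flatten(1))   # feature vectors f_i
        a = torch.sigmoid(self.alpha(f))                # first attention weights
        f_v0 = (a * f).sum(0) / a.sum(0)                # initial global frame feature
        cat = torch.cat([f, f_v0.expand_as(f)], dim=1)  # concatenated feature [f_i : f'_v]
        b = torch.sigmoid(self.beta(cat))               # second attention weights
        f_v = (b * cat).sum(0) / b.sum(0)               # target global frame feature
        return self.cls(f_v)                            # logits; softmax sits in the loss

model = FrameAttention()
logits = model(torch.randn(8, 3, 224, 224))             # 8 frames from one sub-video
loss = nn.CrossEntropyLoss()(logits.unsqueeze(0), torch.tensor([2]))
loss.backward()                                         # repeat until the loss converges
```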
6. The communication optimization method of claim 5, wherein the feature vector is converted based on the first attention weight using the following formula to obtain the initial global frame feature:

f′_v = ( Σ_{i=1}^{n} α_i · f_i ) / ( Σ_{i=1}^{n} α_i )

wherein f′_v is the initial global frame feature, α_i is the first attention weight, f_i is the feature vector, i is the frame number of the feature vector, and n is the maximum frame number;

and the concatenated feature is converted based on the second attention weight using the following formula to obtain the target global frame feature:

f_v = ( Σ_{i=1}^{n} β_i · [f_i : f′_v] ) / ( Σ_{i=1}^{n} β_i )

wherein f_v is the target global frame feature, β_i is the second attention weight, and [f_i : f′_v] is the concatenated feature.
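As a numeric sanity check, the two attention-weighted aggregations can be evaluated with toy values (n = 3 frames, 4-dimensional features; all numbers are random and illustrative):

```python
import numpy as np

n, d = 3, 4
f = np.random.rand(n, d)                      # feature vectors f_i
alpha = np.random.rand(n, 1)                  # first attention weights

f_v0 = (alpha * f).sum(axis=0) / alpha.sum()  # initial global frame feature f'_v

cat = np.hstack([f, np.tile(f_v0, (n, 1))])   # concatenated features [f_i : f'_v]
beta = np.random.rand(n, 1)                   # second attention weights
f_v = (beta * cat).sum(axis=0) / beta.sum()   # target global frame feature f_v

print(f_v.shape)                              # (8,) = twice the per-frame dimension
```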
7. The communication optimization method of claim 1, wherein the optimizing the input speech to obtain real-time speech comprises:
carrying out noise reduction processing on the input voice to obtain a first voice;
identifying a target sound wave in the first voice, and deleting the target sound wave from the first voice to obtain a second voice;
and performing fade-in and fade-out processing on the second voice to obtain the real-time voice.
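By way of illustration only, a deliberately crude sketch of these three steps on a mono float buffer; the DC-offset removal standing in for noise reduction and the energy threshold standing in for target-sound-wave identification are placeholders, not the patent's method:

```python
import numpy as np

RATE = 16000  # sample rate is an assumption

def optimize(voice: np.ndarray, fade_ms: int = 50) -> np.ndarray:
    """Toy version of the three claimed steps on a mono float buffer."""
    first = voice - voice.mean()           # placeholder noise reduction (DC removal)
    second = first[np.abs(first) > 0.02]   # delete the "target sound wave" via an
                                           # assumed energy threshold
    n = min(len(second) // 2, int(RATE * fade_ms / 1000))
    if n:
        ramp = np.linspace(0.0, 1.0, n)
        second[:n] *= ramp                 # fade-in
        second[-n:] *= ramp[::-1]          # fade-out
    return second

real_time_voice = optimize(np.random.randn(RATE).astype(np.float32) * 0.1)
```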
8. A communication optimization apparatus, comprising:
the acquisition unit is used for responding to a communication optimization instruction, acquiring target acquisition equipment according to the communication optimization instruction, and starting the target acquisition equipment to acquire target voice and target video of a target user;
the intercepting unit is used for performing feature interception on the target video to obtain a picture to be detected;
the recognition unit is used for inputting the picture to be detected into a pre-trained emotion recognition model and determining a first emotion type according to the output of the emotion recognition model, wherein the emotion recognition model is obtained based on a frame attention mechanism and residual network training;
the recognition unit is further used for carrying out emotion recognition on the target voice to obtain a second emotion type;
the optimization unit is used for acquiring the input voice of the target user in real time by using the target acquisition equipment and optimizing the input voice to obtain real-time voice when the first emotion type and/or the second emotion type is abnormal;
and the output unit is used for outputting the real-time voice.
9. An electronic device, characterized in that the electronic device comprises:
a memory storing at least one instruction; and
a processor executing instructions stored in the memory to implement the communication optimization method of any of claims 1 to 7.
10. A computer-readable storage medium, characterized in that at least one instruction is stored in the computer-readable storage medium, and the at least one instruction is executed by a processor in an electronic device to implement the communication optimization method of any one of claims 1 to 7.
CN202011545611.6A 2020-12-23 2020-12-23 Communication optimization method, device, equipment and medium Active CN112633172B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011545611.6A CN112633172B (en) 2020-12-23 2020-12-23 Communication optimization method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN112633172A (en) 2021-04-09
CN112633172B (en) 2023-11-14

Family

ID=75324289

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011545611.6A Active CN112633172B (en) 2020-12-23 2020-12-23 Communication optimization method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN112633172B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100036660A1 (en) * 2004-12-03 2010-02-11 Phoenix Solutions, Inc. Emotion Detection Device and Method for Use in Distributed Systems
US20170270922A1 (en) * 2015-11-18 2017-09-21 Shenzhen Skyworth-Rgb Electronic Co., Ltd. Smart home control method based on emotion recognition and the system thereof
CN108962255A (en) * 2018-06-29 2018-12-07 北京百度网讯科技有限公司 Emotion identification method, apparatus, server and the storage medium of voice conversation
US20190012599A1 (en) * 2010-06-07 2019-01-10 Affectiva, Inc. Multimodal machine learning for emotion metrics
US20190295533A1 (en) * 2018-01-26 2019-09-26 Shanghai Xiaoi Robot Technology Co., Ltd. Intelligent interactive method and apparatus, computer device and computer readable storage medium
CN111276162A (en) * 2020-01-14 2020-06-12 林泽珊 Hearing aid-based voice output optimization method, server and storage medium
CN111368609A (en) * 2018-12-26 2020-07-03 深圳Tcl新技术有限公司 Voice interaction method based on emotion engine technology, intelligent terminal and storage medium

Similar Documents

Publication Publication Date Title
CN111488433B (en) Artificial intelligence interactive system suitable for bank and capable of improving field experience
WO2020211388A1 (en) Behavior prediction method and device employing prediction model, apparatus, and storage medium
WO2021232594A1 (en) Speech emotion recognition method and apparatus, electronic device, and storage medium
US11315366B2 (en) Conference recording method and data processing device employing the same
CN111741356B (en) Quality inspection method, device and equipment for double-recording video and readable storage medium
US10970334B2 (en) Navigating video scenes using cognitive insights
CN111723727A (en) Cloud monitoring method and device based on edge computing, electronic equipment and storage medium
WO2022116420A1 (en) Speech event detection method and apparatus, electronic device, and computer storage medium
WO2021175019A1 (en) Guide method for audio and video recording, apparatus, computer device, and storage medium
CN108920640B (en) Context obtaining method and device based on voice interaction
CN110598008B (en) Method and device for detecting quality of recorded data and storage medium
CN113343824A (en) Double-recording quality inspection method, device, equipment and medium
CN114007131A (en) Video monitoring method and device and related equipment
CN114677650B (en) Intelligent analysis method and device for pedestrian illegal behaviors of subway passengers
CN113345431A (en) Cross-language voice conversion method, device, equipment and medium
CN112686232B (en) Teaching evaluation method and device based on micro expression recognition, electronic equipment and medium
CN112542172A (en) Communication auxiliary method, device, equipment and medium based on online conference
CN112528265A (en) Identity recognition method, device, equipment and medium based on online conference
CN116761013A (en) Digital human face image changing method, device, equipment and storage medium
CN112633172B (en) Communication optimization method, device, equipment and medium
CN112101191A (en) Expression recognition method, device, equipment and medium based on frame attention network
CN114401346A (en) Response method, device, equipment and medium based on artificial intelligence
CN112183347A (en) Depth space gradient-based in-vivo detection method, device, equipment and medium
CN113408265A (en) Semantic analysis method, device and equipment based on human-computer interaction and storage medium
CN112633170B (en) Communication optimization method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant