WO2021135140A1 - Word collection method matching emotion polarity - Google Patents

Word collection method matching emotion polarity

Info

Publication number
WO2021135140A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
matching
voice
facial expression
facial
Prior art date
Application number
PCT/CN2020/100549
Other languages
French (fr)
Chinese (zh)
Inventor
路璐
Original Assignee
北京三快在线科技有限公司
Priority date
Filing date
Publication date
Application filed by 北京三快在线科技有限公司
Publication of WO2021135140A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/24: Speech recognition using non-acoustical features
    • G10L15/25: Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174: Facial expression recognition
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Definitions

  • The embodiments of the present application relate to the field of data processing technology, and in particular to words matching emotional polarity.
  • In daily life, the emotional polarity of words is commonly described as commendatory, derogatory, or neutral. In application scenarios such as information push, the emotional polarity of words can instead be divided into two types, positive emotion and negative emotion, such as words that are of interest to a user and words that are not. Accurately determining words of different emotional polarities is particularly important in many application scenarios. For example, in an information push application, identifying the words a user is interested in and the words a user is not interested in makes it possible to decide which information to push to the user. As another example, during an intelligent conversation, words the user is interested in can be output to the user to improve the user experience. In the prior art, the words that are of interest or not of interest to users in different application scenarios are usually determined manually, based on experience with language use.
  • An embodiment of the present application provides a word collection method matching emotional polarity, including:
  • Step S1, acquiring the voice of the first user and the facial image of the second user during a conversation between the first user and the second user;
  • Step S2, determining, by performing expression recognition on the facial image of the second user, the facial expressions of the second user occurring at different times during the conversation;
  • Step S3, matching, according to the occurrence time of each facial expression and the occurrence time of the voice, each facial expression of the second user with the text obtained by converting the voice of the first user, and determining the text corresponding to each facial expression;
  • Step S4, determining, according to the text corresponding to each facial expression of the second user, words matching the preset emotional polarity of the second user.
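  • Read as a data-processing pipeline, steps S1 to S4 pass a few simple records between them: timestamped expressions, timestamped speech text, and polarity-tagged words. The sketch below (Python) shows one plausible set of container types for that data; the type and field names are illustrative assumptions, not terms used by the application.

```python
from dataclasses import dataclass

@dataclass
class ExpressionEvent:
    """Output of step S2: one recognized facial expression of the second user."""
    timestamp: float      # seconds since the start of the conversation
    expression: str       # e.g. "smile", "calm", "disgust", "angry"

@dataclass
class SpeechSegment:
    """Input to step S3: a piece of the first user's voice after speech-to-text."""
    start: float
    end: float
    text: str

@dataclass
class PolarWord:
    """Result of step S4: a word matched to a preset emotional polarity."""
    word: str
    polarity: str         # "positive" or "negative"
    frequency: int
```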
  • An embodiment of the present application provides a word collection device matching emotional polarity, including:
  • a voice and facial image acquisition module, configured to acquire the voice of the first user and the facial image of the second user during a conversation between the first user and the second user;
  • a facial expression determining module, configured to determine, by performing expression recognition on the facial image of the second user, the facial expressions of the second user occurring at different times during the conversation;
  • a facial expression and voice matching module, configured to match, according to the occurrence time of each facial expression and the occurrence time of the voice, each facial expression of the second user with the text obtained by converting the voice of the first user, and to determine the text corresponding to each facial expression;
  • a word determination module matching emotional polarity, configured to determine, according to the text corresponding to each facial expression of the second user, words matching the preset emotional polarity of the second user.
  • An embodiment of the present application also discloses an electronic device, including a memory, a processor, and a computer program stored on the memory and capable of running on the processor. When the processor executes the computer program, the word collection method matching emotional polarity described in the embodiments of the present application is implemented.
  • An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, the steps of the word collection method matching emotional polarity disclosed in the embodiments of the present application are implemented.
  • The word collection method matching emotional polarity acquires the voice of the first user and the facial image of the second user during a conversation between the first user and the second user; determines, by performing expression recognition on the facial image of the second user, the facial expressions of the second user occurring at different times during the conversation; matches, according to the occurrence time of each facial expression and the occurrence time of the voice, each facial expression of the second user with the text obtained by converting the voice of the first user, and determines the text corresponding to each facial expression; and determines, according to the text corresponding to each facial expression of the second user, words matching the preset emotional polarity of the second user, which can improve the efficiency of collecting words based on emotional polarity.
  • By automatically collecting the facial expressions of one party and the voice of the other party during a conversation between two users, and relying on what the other party said when a given facial expression occurred, the method disclosed in the embodiments of the present application can accurately determine the words that cause the user to produce positive emotions and negative emotions.
  • FIG. 1 is a flowchart of a word collection method for matching emotional polarity according to Embodiment 1 of the present application;
  • FIG. 2 is a flowchart of a word collection method for matching emotional polarity according to Embodiment 2 of the present application;
  • FIG. 3 is one of the structural schematic diagrams of a word collection device for matching emotion polarity according to Embodiment 3 of the present application;
  • FIG. 4 is a second structural diagram of a word collection device matching emotional polarity in Embodiment 3 of the present application.
  • FIG. 5 is the third structural diagram of a word collection device for matching emotional polarity according to the fourth embodiment of the present application.
  • Fig. 6 schematically shows a block diagram of an electronic device for executing the method according to the present application.
  • Fig. 7 schematically shows a storage unit for holding or carrying program codes for implementing the method according to the present application.
  • An embodiment of the present application discloses a word collection method for matching emotional polarity. As shown in FIG. 1, the method includes: step S1 to step S4.
  • Step S1 Acquire the voice of the first user and the facial image of the second user during the conversation between the first user and the second user.
  • the voice of the first user is stored in the cloud server of the word collection platform in the form of a voice file
  • the facial image is stored in the cloud server in the form of an image file.
  • each facial image has a collection time; each voice file has a collection time.
  • The facial image file may be a facial image file uploaded by a client registered in advance with the word collection platform, or it may be extracted by the word collection platform from a video file uploaded by such a registered client. Similarly, the voice file may be a voice file uploaded by a registered client of the word collection platform, or it may be extracted by the word collection platform from a video file uploaded by a registered client.
  • the word collection method for matching emotion polarity described in the embodiment of the present application is applicable to a scene where facial expressions and voices of both parties in a conversation can be collected.
  • the voice and video images of both participants in the conversation are collected, and words matching the emotional polarity of the other party are collected based on the matching relationship between the voice of one party and the expression of the other party.
  • For another example, in conversations between sales staff and customers in offline stores, the sales staff's voice and the customer's facial expressions are collected, and words matching the customer's emotional polarity are collected based on the matching relationship between the sales staff's voice and the customer's facial expressions.
  • the voice of the intelligent robot and the facial expression of a real person are collected, and words matching the emotional polarity of the real person are collected based on the matching relationship between the voice of the intelligent robot and the real person's expression.
  • corresponding technical means are used to collect the voice and facial images of both parties in the conversation.
  • the voice and facial image can be collected by the same device, or can be collected by different devices.
  • In some embodiments, an electronic device running a conversation application can collect the voice file and video file of the party currently using it. For example, when a salesperson (i.e., the first user) and a customer (i.e., the second user) hold an online video conversation through the application client on the electronic device, the microphone of the salesperson's electronic device collects the salesperson's voice to generate a voice file; at the same time, the customer's video image stream is obtained through the application client to generate a video file.
  • The voice file and video file collected by the electronic device may be associated and uploaded to the word collection platform; the word collection platform then extracts facial images of the customer at different times from the video file, generates multiple facial image files with time stamps, and stores the voice file and the generated facial image files in the cloud server in association with each other.
  • Alternatively, the video file may be processed by a video image processing module provided on the electronic device to generate multiple facial image files with time stamps, and the voice file and the generated facial image files are then associated and uploaded to the word collection platform.
  • the time stamp indicates the collection time of the facial image in the corresponding facial image file.
  • the voice of the first user and the facial image of the second user can be obtained in the following ways.
  • In the first approach, a video file containing the salesperson's voice, the customer's voice, and the customer's facial image is collected through a camera with a microphone installed at the service desk or information desk. The audio stream is extracted to generate an audio file, and the audio file carries a time stamp indicating the collection time of the audio stream; the video image frames are extracted to obtain an image file for each frame, and each image file carries a time stamp indicating the collection time of that frame.
  • Since the audio file may contain the voices of both the salesperson and the customer, it is necessary to further process the audio file and extract the voice of the salesperson (that is, the first user) from it. For example, by pre-collecting the salesperson's voiceprint, audio matching that voiceprint can be extracted from the generated audio file to produce the salesperson's voice file for the dialogue scene.
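  • A minimal sketch of this voiceprint-based filtering step is shown below, assuming speaker embeddings for the pre-collected voiceprint and for each audio segment are produced by some external speaker-recognition model (not shown); only the similarity comparison is illustrated, with random vectors standing in for real embeddings, and the threshold is an assumed value.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def salesperson_segment_indices(segment_embeddings, voiceprint, threshold=0.75):
    """Return indices of audio segments whose speaker embedding matches the
    pre-collected salesperson voiceprint."""
    return [i for i, emb in enumerate(segment_embeddings)
            if cosine_similarity(emb, voiceprint) >= threshold]

# Toy usage with random vectors in place of real voiceprint embeddings.
rng = np.random.default_rng(0)
voiceprint = rng.normal(size=128)
segments = [voiceprint + rng.normal(scale=0.1, size=128),  # likely the salesperson
            rng.normal(size=128)]                           # likely the customer
print(salesperson_segment_indices(segments, voiceprint))    # -> [0]
```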
  • the generated image file can be used as a facial image file, and the voice file can be associated and uploaded directly to the word collection platform.
  • Alternatively, face detection and localization may be performed separately on the image file generated from each frame and, according to the face localization result, each image file is cropped so that only the facial region is retained, generating the corresponding facial image file (see the sketch below).
  • Each generated facial image file has a time stamp, which is the time stamp of the video image frame from which the facial image file was generated. After that, the facial image files containing only the face region are associated with the voice file and uploaded to the word collection platform.
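  • One plausible way to produce such timestamped, face-only image files from a video is sketched below using OpenCV; the Haar-cascade detector, sampling rate, and output naming scheme are illustrative choices, not prescribed by the application.

```python
import cv2

def extract_face_crops(video_path: str, out_dir: str, sample_every_n_frames: int = 30):
    """Sample frames from a video, detect the largest face in each sampled frame,
    and save a face-only crop named with the frame's timestamp in milliseconds."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(video_path)
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % sample_every_n_frames == 0:
            timestamp_ms = int(cap.get(cv2.CAP_PROP_POS_MSEC))
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
            if len(faces):
                x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # largest face
                cv2.imwrite(f"{out_dir}/face_{timestamp_ms}.jpg", frame[y:y + h, x:x + w])
        frame_idx += 1
    cap.release()
```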
  • the word collection platform stores the received voice file of the first user and multiple facial image files of the second user in a cloud server.
  • the second method is to collect video files containing the voice of the salesperson, the voice of the customer, and the facial image of the customer through a camera with a microphone installed at the service desk and the information desk.
  • The video file is uploaded to the word collection platform, and the word collection platform processes it: the audio stream is extracted to generate an audio file, and the audio file carries a time stamp indicating the collection time of the audio stream; the video image frames are extracted to obtain the image file corresponding to each frame, and each image file carries a time stamp indicating the collection time of that frame.
  • the word collection platform generates a voice file of the salesperson (that is, the first user) according to the audio file according to the above-mentioned voice processing method.
  • The word collection platform performs face detection and localization on the image file generated from each frame in the manner described above, crops each image file according to the face localization result, and retains only the facial region to generate the corresponding facial image file.
  • Each generated facial image file has a timestamp, and the timestamp of the facial image file is the timestamp of the video image frame in which the facial image file is generated.
  • the word collection platform stores the generated voice file of the first user and multiple facial image files of the second user in a cloud server.
  • In the third approach, a video file containing the voices and facial images of both the sales staff and the customer is collected through monitoring equipment installed in the store. Afterwards, by processing the video file, the audio stream is extracted to generate an audio file; the audio file carries a time stamp indicating the collection time of the audio stream. The video image frames are extracted to obtain the image file corresponding to each frame, and each image file carries a time stamp indicating the collection time of that frame.
  • Since the audio file may contain the voices of both the salesperson and the customer, it is necessary to further process the audio file and extract the voice of the salesperson (i.e., the first user) from it to generate the salesperson's voice file.
  • For the manner of generating the salesperson's (i.e., the first user's) voice file from the audio file, refer to the foregoing description; it will not be repeated here.
  • Since the image file generated from each frame may include both the customer's facial image and the salesperson's facial image, face detection and localization also need to be performed on each image file to determine the face regions included in each image.
  • The image of each face region is then compared with the face image pre-collected from the salesperson, and any face region that does not match is taken as the face region of the customer (that is, the second user).
  • Facial image files of the customer are generated accordingly.
  • the generated facial image file of the customer and the voice file of the salesperson are associated and uploaded to the word collection platform, and stored in a cloud server.
  • In the fourth approach, a video file containing the voices and facial images of both the sales staff and the customer is collected through monitoring equipment installed in the store, and the collected video file is uploaded to the word collection platform.
  • The word collection platform processes the video file to extract the audio file and the image files, performs voiceprint recognition on the audio file with reference to the third approach to generate the salesperson's voice file, and performs face detection, localization, and recognition on the image files to generate multiple facial image files of the customer.
  • the generated facial image file of the customer and the voice file of the salesperson are stored in a cloud server in association.
  • Step S2, determine, by performing expression recognition on the facial image of the second user, the facial expressions of the second user occurring at different times during the conversation.
  • the facial expression of the second user in each facial image can be determined.
  • the facial expression of the second user obtained by recognizing each facial image includes but is not limited to any one of the following: smiling, focused, calm, disgusted, and angry.
  • For the specific implementation of expression recognition on each facial image, refer to the prior art; it will not be repeated in this embodiment.
  • the embodiments of the present application do not limit the specific implementation manners used to perform expression recognition on each facial image.
  • the time stamp of each facial image is used as the occurrence time of the facial expression of the second user recognized from the facial image.
  • the facial expressions that occurred at different times during the dialogue between the second user and the first user can be obtained.
  • Step S3, match, according to the occurrence time of each facial expression and the occurrence time of the voice, each facial expression of the second user with the text obtained by converting the voice of the first user, and determine the text corresponding to each facial expression.
  • the facial expressions of one party to the conversation reflect the real-time emotional polarity of what they say to the other party.
  • the facial expressions of the second user at different times reflect whether the second user is satisfied or disliked by the first user at that point in time. Therefore, by matching the text obtained by the voice conversion of the first user with the facial expression of the second user based on time, words matching the different emotional polarities of the second user can be obtained.
  • In some embodiments, matching each facial expression of the second user with the text obtained by converting the voice of the first user, and determining the text corresponding to each facial expression, includes: for each facial expression, using the voice segment of the first user that occurred within a preset time range around the occurrence time of the facial expression as the voice segment matching that facial expression; and, for each facial expression, using the text obtained by converting the matching voice segment as the text matching that facial expression.
  • For example, the acquired voice file of the first user is represented as voice.wav, whose time attributes include the collection time, and the acquired facial images of the second user are represented as {p1, t1}, ..., {pN, tN}, where pN denotes the N-th acquired facial image of the second user, tN denotes the collection time (i.e., the time stamp) of facial image pN, and N is a natural number greater than 1.
  • Assume the facial expression recognition results for the facial images are: the second user's facial expression in facial image p1 is "smile", in facial image p2 is "calm", in facial image p3 is "calm", in facial image p4 is "disgust", and in facial image p5 is "angry".
  • The voice segment of the first user that occurs within a preset time range (for example, 10 seconds) around the occurrence time of each facial expression of the second user (that is, the collection time of the facial image corresponding to that expression) is used as the voice segment matching the facial expression.
  • For example, the voice segment in the first user's voice file voice.wav whose audio stream has time stamps within the time range (t1-5, t1+5) is used as the voice segment matching the second user's "smile" facial expression.
  • the voice segment of the first user corresponding to each facial expression of the second user at different time points can be determined.
  • the same facial expression may correspond to different speech segments, which means that different words of the first user can trigger the same expression of the second user.
  • the preset time range is set according to specific needs.
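  • A runnable toy version of this time-window matching is shown below, using the ±5 second window from the example above; the segment boundaries are invented for illustration, and the sentences are the ones used later in this description.

```python
# Toy illustration of matching each expression to the first user's speech that
# occurred within a +/-5 second window of the expression's timestamp.
# Times are seconds from the start of the conversation; all values are invented.

speech_segments = [   # (start, end, text) from speech-to-text on the first user's voice
    (0.0, 4.0, "Hello, I am very happy to serve you"),
    (40.0, 46.0, "You must do this as soon as possible, otherwise it will be too late"),
]
expressions = [       # (timestamp, expression) from step S2
    (2.0, "smile"),
    (44.0, "angry"),
]

WINDOW = 5.0  # the "preset time range" from the example

def text_for_expression(ts, segments, window=WINDOW):
    """Concatenate the text of every speech segment overlapping (ts-window, ts+window)."""
    matched = [text for start, end, text in segments
               if start < ts + window and end > ts - window]
    return " ".join(matched)

for ts, expr in expressions:
    print(expr, "->", text_for_expression(ts, speech_segments))
# smile -> Hello, I am very happy to serve you
# angry -> You must do this as soon as possible, otherwise it will be too late
```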
  • Step S4 Determine a word matching the preset emotional polarity of the second user according to the text corresponding to each facial expression of the second user.
  • the preset emotion polarity includes: positive emotion and negative emotion.
  • Positive emotion is the emotional polarity embodied by facial expressions such as "smile", "focused", and "calm";
  • negative emotion is the emotional polarity embodied by facial expressions such as "disgust" and "angry".
  • In some embodiments, determining the words matching the preset emotional polarity of the second user according to the text corresponding to each facial expression of the second user includes: determining, according to the correspondence between facial expressions and emotional polarities, the emotional polarity matched by each facial expression; using the emotional polarity matched by each facial expression as the emotional polarity matched by the text matching that facial expression; and determining, according to the frequency of occurrence of different words in the texts matching the same emotional polarity, the words matching each emotional polarity of the second user.
  • the correspondence between facial expressions and emotional polarities can be established in advance based on expert common sense.
  • emotional polarity includes: positive emotions and negative emotions.
  • For example, the facial expressions "smile", "focused", and "calm" are defined as matching the emotional polarity of positive emotion, while the facial expressions "disgust" and "angry" are defined as matching the emotional polarity of negative emotion.
  • Based on this correspondence, the emotional polarity triggered in the second user by the first user's speech at different time points can be determined; for example, at time t1 the first user's words cause the second user to produce positive emotions, while at time t5 the first user's words cause the second user to produce negative emotions.
  • the corresponding text of the first user's speech in the (t1-5, t1+5) time period is "Hello, I am very happy to serve you” ;
  • the corresponding text of the first user's speech in the (t5-5, t5+5) time period is "You must do this as soon as possible, otherwise it will be too late” .
  • Since the emotional polarities matched by the second user's facial expressions at the five time points t1 to t5 in the dialogue have been determined, and the first user's utterance texts matched by those facial expressions have also been determined, the emotional polarity of the second user matched by the first user's utterance text at each time point can be further determined.
  • At time t1, the first user's utterance text "Hello, I am very happy to serve you" matches the positive emotion of the second user; at time t5, the first user's utterance text "You must do this as soon as possible, otherwise it will be too late" matches the negative emotion of the second user.
  • a set of words matching the emotional polarity of positive emotions and a set of words matching the emotional polarity of negative emotions are determined.
  • multiple sets of words matching the emotional polarity of positive emotions and multiple sets of words matching the emotional polarity of negative emotions can be obtained.
  • The multiple sets of words matching the positive emotional polarity are analyzed, and the words whose occurrence frequency meets a preset condition (for example, the five most frequently occurring words) are determined as words matching the second user's positive emotional polarity; likewise, among the multiple sets of words matching the negative emotional polarity, the words whose occurrence frequency meets the preset condition (for example, the five most frequently occurring words) are determined as words matching the second user's negative emotional polarity.
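  • The expression-to-polarity mapping and the frequency-based selection just described can be illustrated in a few lines of Python; the whitespace tokenizer, the top-5 cutoff, and the sample sentences below are illustrative assumptions rather than details fixed by the application.

```python
from collections import Counter

# Correspondence between facial expressions and emotional polarity, as defined above.
EXPRESSION_POLARITY = {
    "smile": "positive", "focused": "positive", "calm": "positive",
    "disgust": "negative", "angry": "negative",
}

def words_by_polarity(expression_text_pairs, top_n=5):
    """Group matched texts by the polarity of their expression and return the
    top_n most frequent words for each polarity."""
    counters = {"positive": Counter(), "negative": Counter()}
    for expression, text in expression_text_pairs:
        polarity = EXPRESSION_POLARITY.get(expression)
        if polarity:
            counters[polarity].update(text.lower().split())
    return {p: [w for w, _ in c.most_common(top_n)] for p, c in counters.items()}

pairs = [
    ("smile", "hello I am very happy to serve you"),
    ("calm", "happy to help you with this order"),
    ("angry", "you must do this as soon as possible otherwise it will be too late"),
]
print(words_by_polarity(pairs))
```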
  • The embodiment of the present application discloses a word collection method matching emotional polarity: acquiring the voice of the first user and the facial image of the second user during a conversation between the first user and the second user; determining, by performing expression recognition on the facial image of the second user, the facial expressions of the second user occurring at different times during the conversation; matching, according to the occurrence time of each facial expression and the occurrence time of the voice, each facial expression of the second user with the text obtained by converting the voice of the first user, and determining the text corresponding to each facial expression; and determining, according to the text corresponding to each facial expression of the second user, the words matching the preset emotional polarity of the second user. This can improve the efficiency of collecting words based on emotional polarity.
  • the embodiment of the application discloses a method for collecting words matching emotional polarity.
  • the word collection method for matching emotional polarity disclosed in the embodiments of the present application can be applied to many fields.
  • a word collection method for matching emotional polarity disclosed in the embodiment of the present application is applied to the field of chat robots.
  • The robot collects the facial expressions of the real person it is talking to, together with the robot's own voice, and uses the word collection platform to recognize and match the robot's voice and the real person's facial expressions based on collection time, so as to determine the words that cause different emotional polarities in the real person.
  • In some embodiments, the method further includes: establishing a positive emotion vocabulary of the second user according to the words matching the second user's positive emotional polarity; and/or establishing a negative emotion vocabulary of the second user according to the words matching the second user's negative emotional polarity.
  • For example, after the words matching the real person's positive and negative emotions are determined based on multiple dialogues between the robot and the real person, a positive emotion vocabulary is established from the words matching positive emotion and a negative emotion vocabulary is established from the words matching negative emotion, and both vocabularies are added to the robot's corpus for optimizing the dialogue content between the robot and the real person, so as to improve the real person's chat experience.
  • For another example, in a customer service scenario, by collecting the facial images of a customer and the voices of the customer service personnel during conversations between that customer and multiple customer service personnel, and performing the above steps S1 to S4 on the voice and facial images of each conversation, the customer's positive emotion word database and negative emotion word database can be determined, so that customer service staff can refer to these word databases when choosing what to say to the customer, improving the quality of service.
  • the word collection method for matching emotional polarity described in Embodiment 1 can also be applied to the establishment of a corpus vocabulary in a preset conversation scenario.
  • When the conversation between the first user and the second user is a conversation in a preset conversation scenario, as shown in FIG. 2, after the words matching the preset emotional polarity of the second user are determined according to the text corresponding to each facial expression of the second user, the method further includes: step S5 and step S6.
  • Step S5 reselect the first user and the second user, and repeat steps S1 to S4 until the word set output condition is satisfied;
  • Step S6 Output a set of words matching the preset emotional polarity in the conversation scene according to the words matching the preset emotional polarity of all the selected second users.
  • The word set output condition includes, but is not limited to, any one of the following: the number of repeated executions of steps S1 to S4 reaches a preset number (such as 10,000 times), the number of selected second users reaches a preset value (for example, 1,000 people), or the amount of acquired first-user voice reaches a preset value (for example, 10,000 voice recordings).
  • For example, after steps S1 to S4 have been performed for 1,000 selected customers, the words matching the positive emotions and the negative emotions of the 1,000 customers are respectively determined. After that, a word set matching positive emotions is constructed from the words matching the positive emotions of the 1,000 customers, and a word set matching negative emotions is constructed from the words matching the negative emotions of the 1,000 customers.
  • The constructed word set matching positive emotions can be used as the positive emotion word set for the conversation scenario between salespersons and customers; the constructed word set matching negative emotions can be used as the negative emotion word set for that conversation scenario.
  • the set of words matching the positive sentiment can be output for constructing a preferred corpus of salespersons. It is also possible to output the set of words matching the negative sentiment for building an evasive corpus of sales staff.
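  • Steps S5 and S6 amount to repeating the per-user collection and then merging the results into scene-level word sets. A minimal sketch of that aggregation is shown below; the data layout, the top-n cutoff, and the sample words are assumptions made purely for illustration.

```python
from collections import Counter

def scene_word_sets(per_user_words, top_n=20):
    """Merge per-user polarity words (the output of steps S1-S4 for each selected
    second user) into word sets for the whole conversation scenario (step S6)."""
    totals = {"positive": Counter(), "negative": Counter()}
    for user_words in per_user_words:
        for polarity, words in user_words.items():
            totals[polarity].update(words)
    return {p: [w for w, _ in c.most_common(top_n)] for p, c in totals.items()}

# Toy usage: words collected from three customers in the salesperson/customer scene.
customers = [
    {"positive": ["discount", "gift"], "negative": ["late", "fee"]},
    {"positive": ["discount"], "negative": ["fee", "queue"]},
    {"positive": ["gift", "refund"], "negative": ["fee"]},
]
print(scene_word_sets(customers))
```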
  • The embodiment of the present application discloses a word collection method matching emotional polarity. The voice of the first user and the facial image of the second user are collected during several conversations between multiple second users and at least one first user in a preset conversation scenario, and the following data processing operations are performed on the voice and facial images corresponding to each conversation: determining, by performing expression recognition on the facial image of the second user, the facial expressions of the second user occurring at different times during the conversation; matching, according to the occurrence time of each facial expression and the occurrence time of the voice, each facial expression of the second user with the text obtained by converting the voice of the first user, and determining the text corresponding to each facial expression; and determining, according to the text corresponding to each facial expression of the second user, the words matching the preset emotional polarity of the second user. Finally, according to the words matching the preset emotional polarity of all the second users, a set of words matching the preset emotional polarity in the conversation scenario is output, which helps to automatically establish corpora of different emotional polarities matched to the conversation scenario. For example, when building a training corpus for sales staff, it is no longer necessary to manually collect and match words of different emotional polarities for different users, which can improve the efficiency of building the corpus.
  • An embodiment of the present application discloses a word collection device matching emotional polarity. As shown in FIG. 3, the device includes:
  • the voice and facial image acquisition module 310 is configured to acquire the voice of the first user and the facial image of the second user during the conversation between the first user and the second user;
  • the facial expression determining module 320 is configured to perform facial expression recognition on the facial image of the second user to determine the facial expressions of the second user at different times during the conversation;
  • the facial expression and voice matching module 330 is configured to match, according to the occurrence time of each facial expression and the occurrence time of the voice, each facial expression of the second user with the text obtained by converting the voice of the first user, and to determine the text corresponding to each facial expression;
  • the word determination module 340 for matching emotional polarity is configured to determine a word matching the preset emotional polarity of the second user according to the text corresponding to each facial expression of the second user.
  • the facial expression and voice matching module 330 is further used for:
  • For each facial expression, use the voice segment of the first user that occurs, in the voice, within a preset time range around the occurrence time of the facial expression as the voice segment matching that facial expression;
  • the text obtained by converting the voice segment matching the facial expression is used as the text matching the facial expression.
  • In some embodiments, the word determination module 340 matching emotional polarity is further configured to: determine, according to the correspondence between facial expressions and emotional polarities, the emotional polarity matched by each facial expression; use the emotional polarity matched by each facial expression as the emotional polarity matched by the text matching that facial expression; and determine, according to the frequency of occurrence of different words in the texts matching the same emotional polarity, the words matching each emotional polarity of the second user.
  • the preset emotion polarity includes: positive emotion and negative emotion.
  • the device further includes:
  • the user positive emotion vocabulary establishment module 350 is configured to establish the positive emotion vocabulary of the second user according to the words matching the positive emotion of the second user, which is the emotional polarity; and/or,
  • the user negative sentiment vocabulary building module 360 is configured to build a negative sentiment vocabulary of the second user according to words matching the negative sentiment of the second user.
  • the words that cause the user to produce positive and negative emotions can be accurately determined.
  • the conversation process between the first user and the second user is a conversation process in a preset conversation scenario.
  • the apparatus further includes:
  • the multi-conversation word acquisition module 370 is configured to reselect the first user and the second user, and to repeatedly call the voice and facial image acquisition module 310, the facial expression determining module 320, the facial expression and voice matching module 330, and the word determination module 340 matching emotional polarity, until the word set output condition is satisfied;
  • the scene word set output module 380 is configured to output a set of words matching the preset emotional polarity in the conversation scene according to the words matching the preset emotional polarity of all the selected second users.
  • the word collection device for matching emotional polarity disclosed in the embodiment of the present application is used to implement the word collection method for matching emotional polarity described in the first embodiment or the second embodiment of the present application.
  • The specific implementation of each module of the device will not be repeated here.
  • The embodiment of the present application discloses a word collection device matching emotional polarity, which acquires the voice of the first user and the facial image of the second user during a conversation between the first user and the second user; determines, by performing expression recognition on the facial image of the second user, the facial expressions of the second user occurring at different times during the conversation; matches, according to the occurrence time of each facial expression and the occurrence time of the voice, each facial expression of the second user with the text obtained by converting the voice of the first user, and determines the text corresponding to each facial expression; and determines, according to the text corresponding to each facial expression of the second user, the words matching the preset emotional polarity of the second user, which can improve the efficiency of collecting words based on emotional polarity.
  • The word collection device matching emotional polarity disclosed in the embodiment of the present application automatically collects the facial expressions of one party and the voice of the other party during a conversation between two users and, based on what the other party said when the user's facial expression occurred, can accurately determine the words that cause the user to produce positive emotions and negative emotions.
  • The embodiment of the present application further discloses a word collection device matching emotional polarity, which collects the voice of the first user and the facial image of the second user during several conversations between multiple second users and at least one first user in a preset conversation scenario, and performs the following data processing operations on the voice and facial images corresponding to each conversation: determining, by performing expression recognition on the facial image of the second user, the facial expressions of the second user occurring at different times during the conversation; matching, according to the occurrence time of each facial expression and the occurrence time of the voice, each facial expression of the second user with the text obtained by converting the voice of the first user, and determining the text corresponding to each facial expression; and determining, according to the text corresponding to each facial expression of the second user, the words matching the preset emotional polarity of the second user. Finally, according to the words matching the preset emotional polarity of all the second users, a set of words matching the preset emotional polarity in the conversation scenario is output, which helps to automatically establish corpora of different emotional polarities matched to the conversation scenario.
  • the device embodiments described above are merely illustrative, where the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in One place, or it can be distributed to multiple network units. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the embodiments. Those of ordinary skill in the art can understand and implement it without creative work.
  • the various component embodiments of the present application may be implemented by hardware, or by software modules running on one or more processors, or by a combination of them.
  • a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in the electronic device according to the embodiments of the present application.
  • This application can also be implemented as a device or device program (for example, a computer program and a computer program product) for executing part or all of the methods described herein.
  • Such a program for implementing the present application may be stored on a computer-readable medium, or may have the form of one or more signals.
  • Such a signal can be downloaded from an Internet website, or provided on a carrier signal, or provided in any other form.
  • FIG. 6 shows an electronic device that can implement the method according to the present application.
  • the electronic device may be a PC, a mobile terminal, a personal digital assistant, a tablet computer, etc.
  • The electronic device conventionally includes a processor 610, a memory 620, and program code 630 that is stored on the memory 620 and can run on the processor 610.
  • When the processor 610 executes the program code 630, the method described in the above embodiments is implemented.
  • the memory 620 may be a computer program product or a computer readable medium.
  • the memory 620 may be an electronic memory such as flash memory, EEPROM (Electrically Erasable Programmable Read Only Memory), EPROM, hard disk, or ROM.
  • the memory 620 has a storage space 6201 of the program code 630 of the computer program for executing any method steps in the above-mentioned method.
  • the storage space 6201 for the program code 630 may include various computer programs respectively used to implement various steps in the above method.
  • the program code 630 is computer readable code. These computer programs can be read from or written into one or more computer program products. These computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards, or floppy disks.
  • the computer program includes computer-readable code, which when run on an electronic device, causes the electronic device to execute the method according to the above-mentioned embodiment.
  • The embodiment of the present application also discloses a computer-readable storage medium on which a computer program is stored.
  • When the program is executed by a processor, the steps of the word collection method matching emotional polarity described in Embodiment 1 or Embodiment 2 of this application are implemented.
  • Such a computer program product may be a computer-readable storage medium, and the computer-readable storage medium may have storage segments, storage spaces, etc., arranged similarly to the memory 620 in the electronic device shown in FIG. 6.
  • the program code may be compressed and stored in the computer-readable storage medium in an appropriate form, for example.
  • the computer-readable storage medium is usually a portable or fixed storage unit as described with reference to FIG. 7.
  • the storage unit includes computer-readable codes 630', which are codes read by a processor, and when these codes are executed by the processor, each step in the method described above is implemented.
  • any reference signs placed between parentheses should not be constructed as a limitation to the claims.
  • the word “comprising” does not exclude the presence of elements or steps not listed in the claims.
  • the word “a” or “an” preceding an element does not exclude the presence of multiple such elements.
  • the application can be realized by means of hardware including several different elements and by means of a suitably programmed computer. In the unit claims listing several devices, several of these devices may be embodied in the same hardware item.
  • the use of the words first, second, and third, etc. do not indicate any order. These words can be interpreted as names.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Provided is a word collection method matching emotion polarity, belonging to the technical field of data processing. The method comprises: acquiring a voice of a first user and a facial image of a second user in a dialog process of the first user and the second user (S1); performing expression recognition on the facial image of the second user to determine each facial expression of the second user at different times in the dialog process (S2); according to the occurrence time of each facial expression and the occurrence time of the voice, matching each facial expression of the second user with text obtained by converting the voice of the first user, to determine text corresponding to each facial expression (S3); and according to the text corresponding to each facial expression of the second user, determining words matching a preset emotion polarity of the second user (S4). By automatically collecting the facial expression of one party and the voice of the other party in the dialog process of two users, and on the basis of the words spoken by the other party when a facial expression of a user occurs, words causing the user to generate a positive emotion and a negative emotion can be accurately determined.

Description

Word collection matching emotional polarity
This application claims priority to a Chinese patent application filed with the Chinese Patent Office on December 31, 2019, with application number 201911419689.0 and the invention title "Word collection method, device, and electronic equipment matching emotional polarity", the entire content of which is incorporated into this application by reference.
Technical field
The embodiments of the present application relate to the field of data processing technology, in particular to words matching emotional polarity.
Background
In daily life, people's definitions of the emotional polarity of words include: commendatory words, derogatory words and neutral words. Combined with application scenarios such as information push, the emotional polarity of words can be divided into two types: positive emotions and negative emotions, such as words that are of interest to users and words that are not of interest to users. It is particularly important to accurately determine words with different emotional polarities in many application scenarios. For example, in an information push application, by identifying words that are of interest to the user and words that are not of interest to the user, it is possible to determine which information to push to the user. For another example, in the process of intelligent conversation, words of interest to the user can be output to the user to improve the user experience. In the prior art, it is usually based on language use experience to manually determine the words that are of interest to the user and the words that are not of interest in different application scenarios.
Summary of the invention
In the first aspect, an embodiment of the present application provides a word collection method matching emotional polarity, including:
Step S1, acquiring the voice of the first user and the facial image of the second user during a conversation between the first user and the second user;
Step S2, determining, by performing expression recognition on the facial image of the second user, the facial expressions of the second user occurring at different times during the conversation;
Step S3, matching, according to the occurrence time of each facial expression and the occurrence time of the voice, each facial expression of the second user with the text obtained by converting the voice of the first user, and determining the text corresponding to each facial expression;
Step S4, determining, according to the text corresponding to each facial expression of the second user, words matching the preset emotional polarity of the second user.
In the second aspect, an embodiment of the present application provides a word collection device matching emotional polarity, including:
a voice and facial image acquisition module, configured to acquire the voice of the first user and the facial image of the second user during a conversation between the first user and the second user;
a facial expression determining module, configured to determine, by performing expression recognition on the facial image of the second user, the facial expressions of the second user occurring at different times during the conversation;
a facial expression and voice matching module, configured to match, according to the occurrence time of each facial expression and the occurrence time of the voice, each facial expression of the second user with the text obtained by converting the voice of the first user, and to determine the text corresponding to each facial expression;
a word determination module matching emotional polarity, configured to determine, according to the text corresponding to each facial expression of the second user, words matching the preset emotional polarity of the second user.
In the third aspect, an embodiment of the present application also discloses an electronic device, including a memory, a processor, and a computer program stored on the memory and capable of running on the processor; when the processor executes the computer program, the word collection method matching emotional polarity described in the embodiments of the present application is implemented.
In the fourth aspect, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the steps of the word collection method matching emotional polarity disclosed in the embodiments of the present application are implemented.
The word collection method matching emotional polarity disclosed in the embodiments of the present application acquires the voice of the first user and the facial image of the second user during a conversation between the first user and the second user; determines, by performing expression recognition on the facial image of the second user, the facial expressions of the second user occurring at different times during the conversation; matches, according to the occurrence time of each facial expression and the occurrence time of the voice, each facial expression of the second user with the text obtained by converting the voice of the first user, and determines the text corresponding to each facial expression; and determines, according to the text corresponding to each facial expression of the second user, words matching the preset emotional polarity of the second user, which can improve the efficiency of collecting words based on emotional polarity. By automatically collecting the facial expressions of one party and the voice of the other party during a conversation between two users, and relying on what the other party said when the user's facial expression occurred, the method can accurately determine the words that cause the user to produce positive emotions and negative emotions.
The above description is only an overview of the technical solution of this application. In order to understand the technical means of this application more clearly, it can be implemented in accordance with the content of the specification; and in order to make the above and other purposes, features and advantages of this application more obvious and understandable, the specific implementations of this application are set forth below.
附图说明Description of the drawings
为使本申请实施例的目的、技术方案和优点更加清楚,下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。In order to make the purpose, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be described clearly and completely in conjunction with the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments It is a part of the embodiments of the present application, but not all of the embodiments. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.
FIG. 1 is a flowchart of a word collection method for matching emotion polarity according to Embodiment 1 of the present application;
FIG. 2 is a flowchart of a word collection method for matching emotion polarity according to Embodiment 2 of the present application;
FIG. 3 is a first schematic structural diagram of a word collection apparatus for matching emotion polarity according to Embodiment 3 of the present application;
FIG. 4 is a second schematic structural diagram of a word collection apparatus for matching emotion polarity according to Embodiment 3 of the present application;
FIG. 5 is a third schematic structural diagram of a word collection apparatus for matching emotion polarity according to Embodiment 3 of the present application;
FIG. 6 schematically shows a block diagram of an electronic device for executing the method according to the present application; and
FIG. 7 schematically shows a storage unit for holding or carrying program code that implements the method according to the present application.
Specific embodiments
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, rather than all of them. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present application.
Embodiment 1
An embodiment of the present application discloses a word collection method for matching emotion polarity. As shown in FIG. 1, the method includes steps S1 to S4.
Step S1: acquiring, during a conversation between a first user and a second user, the voice of the first user and facial images of the second user.
In some embodiments of the present application, the voice of the first user is stored in a cloud server of a word collection platform in the form of a voice file, and the facial images are stored in the cloud server in the form of image files. In some embodiments of the present application, each facial image has a collection time, and each voice file has a collection time.
The facial image files may be facial image files uploaded by a client registered in advance with the word collection platform, or facial image files extracted by the word collection platform from a video file uploaded by a registered client of the word collection platform. The voice file may be a voice file uploaded by a registered client of the word collection platform, or a voice file extracted by the word collection platform from a video file uploaded by a registered client of the word collection platform.
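Purely as an illustrative sketch (the programming language, field names and types below are assumptions of this description and are not prescribed by the embodiments), the association between a first user's voice file and the second user's time-stamped facial images might be represented as follows:

from dataclasses import dataclass, field
from typing import List

@dataclass
class VoiceFile:
    path: str          # for example "voice.wav"
    start_time: float  # collection time of the voice file (assumed representation)

@dataclass
class FacialImage:
    path: str          # one facial image file
    timestamp: float   # collection time of this facial image

@dataclass
class ConversationRecord:
    first_user_voice: VoiceFile
    second_user_faces: List[FacialImage] = field(default_factory=list)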
The word collection method for matching emotion polarity described in the embodiments of the present application is applicable to scenarios in which the facial expressions and voices of both parties to a conversation can be collected. For example, in a video-call scenario, the voices and video images of both participants are collected, and words matching the emotion polarity of one party are collected based on the matching relationship between the other party's voice and that party's facial expressions. For another example, in conversations between sales staff and customers in offline stores of various kinds, the voice of the sales staff and the facial expressions of the customer are collected, and words matching the customer's emotion polarity are collected based on the matching relationship between the sales staff's voice and the customer's facial expressions. For yet another example, in a scenario where an intelligent robot talks with a human, the voice of the intelligent robot and the facial expressions of the human are collected, and words matching the human's emotion polarity are collected based on the matching relationship between the robot's voice and the human's expressions.
In the embodiments of the present application, for different application scenarios, corresponding technical means are used to collect the voices and facial images of both parties to the conversation. The voice and the facial images may be collected by the same device or by different devices.
Taking the video-call scenario as an example, the electronic device running the conversation application may collect the voice file and video file of the current party. For example, when a salesperson (i.e., the first user) and a customer (i.e., the second user) conduct an online video conversation through an application client on an electronic device, the microphone of the salesperson's electronic device collects the salesperson's voice to generate a voice file; at the same time, the video image stream of the customer is obtained through the application client to generate a video file.
In some embodiments of the present application, the voice file and the video file collected by the electronic device may be uploaded to the word collection platform in association with each other, and the word collection platform extracts the customer's facial images at different times from the video file, generates multiple time-stamped facial image files, and stores the voice file and the generated facial images in the cloud server in association with each other. In other embodiments of the present application, the video file may be processed by a video image processing module provided on the electronic device to generate multiple time-stamped facial image files, after which the voice file and the generated time-stamped facial image files are uploaded to the word collection platform in association with each other. Here, the time stamp indicates the collection time of the facial image in the corresponding facial image file.
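One possible way for the platform or the on-device video image processing module to derive time-stamped frames from the video file is sketched below; the use of OpenCV, the sampling interval and the way the collection start time is supplied are illustrative assumptions rather than requirements of the embodiments:

import cv2  # OpenCV is only one possible choice of video processing library

def extract_timestamped_frames(video_path, collection_start_time, sample_every_n=30):
    """Yield (frame, timestamp) pairs sampled from the video.

    collection_start_time is the (assumed) wall-clock time at which the
    recording began; per-frame timestamps are derived from the frame rate.
    """
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0  # fall back if FPS metadata is missing
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % sample_every_n == 0:
            timestamp = collection_start_time + index / fps
            yield frame, timestamp
        index += 1
    cap.release()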
Taking as an example a conversation between a salesperson (i.e., the first user) and a customer (i.e., the second user) in an offline store of any kind, the voice of the first user and the facial images of the second user may be obtained in the following ways.
In the first way, a camera with a microphone installed at a service desk or information desk may collect a video file containing the salesperson's voice, the customer's voice and the customer's facial images. Then, by performing image processing on the video file, the audio stream is extracted to generate an audio file, where the audio file carries a time stamp indicating the collection time of the audio stream; and the video image frames are extracted to obtain an image file for each frame, where each image file carries a time stamp indicating the collection time of that frame.
Since the audio file may contain the voices of both the salesperson and the customer, the audio file needs to be further processed to extract the voice of the salesperson (i.e., the first user). For example, by collecting the salesperson's voiceprint in advance, the audio information matching the salesperson's voiceprint can be extracted from the generated audio file to generate the salesperson's voice file for this conversation.
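A minimal sketch of such voiceprint-based filtering is given below; the segmentation of the audio, the speaker_embedding function and the similarity threshold are hypothetical placeholders introduced only for illustration and are not part of the disclosed method:

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def extract_first_user_voice(segments, enrolled_embedding, speaker_embedding, threshold=0.7):
    """Keep only the audio segments whose speaker embedding matches the
    pre-enrolled voiceprint of the salesperson (first user).

    segments is a list of (audio_array, start_time, end_time) tuples and
    speaker_embedding is a hypothetical function mapping an audio array to
    a fixed-length embedding vector; both are assumptions of this sketch.
    """
    kept = []
    for audio, start, end in segments:
        emb = speaker_embedding(audio)
        if cosine_similarity(emb, enrolled_embedding) >= threshold:
            kept.append((audio, start, end))
    return kept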
After that, the generated image files may be used as facial image files and uploaded directly to the word collection platform in association with the voice file.
In some embodiments of the present application, in order to reduce the network resources occupied by uploading facial image files and to improve the transmission efficiency of facial images, face detection and localization may be performed separately on the image file generated from each frame, and each image file may be cropped according to the face localization result so that only the facial image region is retained to generate the corresponding facial image file. Each generated facial image file carries a time stamp, which is the time stamp of the video image frame from which the facial image file was generated. After that, the facial image files that include only the face region are uploaded to the word collection platform in association with the voice file.
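The face detection and cropping step could, for instance, be sketched as follows; the Haar-cascade detector bundled with OpenCV and the largest-face heuristic are illustrative choices, since the embodiments do not prescribe a particular detector:

import cv2

# Illustrative detector choice; any face detection technique could be substituted.
_face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_face_region(frame):
    """Return the cropped face region of a frame, or None if no face is found."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = _face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    # Keep the largest detected face, on the assumption that it belongs to the customer.
    x, y, w, h = max(faces, key=lambda box: box[2] * box[3])
    return frame[y:y + h, x:x + w]

Cropping before upload keeps only the face region, which is what motivates the bandwidth saving described above.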
The word collection platform stores the received voice file of the first user and the multiple facial image files of the second user in the cloud server.
In the second way, a camera with a microphone installed at a service desk or information desk may collect a video file containing the salesperson's voice, the customer's voice and the customer's facial images. The video file is uploaded to the word collection platform, and the word collection platform performs image processing on the video file, extracts the audio stream to generate an audio file carrying a time stamp indicating the collection time of the audio stream, and extracts the video image frames to obtain an image file for each frame, where each image file carries a time stamp indicating the collection time of that frame. The word collection platform then generates the voice file of the salesperson (i.e., the first user) from the audio file according to the voice processing described above. Likewise, the word collection platform performs face detection and localization on the image file generated from each frame in the manner described above, crops each image file according to the face localization result so that only the facial image region is retained, and generates the corresponding facial image files. Each generated facial image file carries a time stamp, which is the time stamp of the video image frame from which the facial image file was generated. The word collection platform stores the generated voice file of the first user and the multiple facial image files of the second user in the cloud server.
In the third way, a video file containing the salesperson's voice and facial images and the customer's voice and facial images may be collected by monitoring equipment installed in the store. Then, by performing image processing on the video file, the audio stream is extracted to generate an audio file carrying a time stamp indicating the collection time of the audio stream; and the video image frames are extracted to obtain an image file for each frame, where each image file carries a time stamp indicating the collection time of that frame.
Since the audio file may contain the voices of both the salesperson and the customer, the audio file needs to be further processed to extract the voice of the salesperson (i.e., the first user) and generate the salesperson's voice file. For the specific implementation of generating the salesperson's voice file from the audio file, refer to the foregoing description, which is not repeated here.
Since the image file generated from each frame may include both the customer's face and the salesperson's face, face detection and localization also need to be performed on each image file to determine the face regions included in each image. The image of each face region is compared with the face image of the salesperson collected in advance, and the face regions that fail this comparison are taken as the face regions of the customer (i.e., the second user). Finally, the customer's facial image files are generated from the customer's face regions in the respective image files.
After that, the generated facial image files of the customer and the voice file of the salesperson are uploaded to the word collection platform in association with each other and stored in the cloud server.
In the fourth way, monitoring equipment installed in the store collects a video file containing the salesperson's voice and facial images and the customer's voice and facial images, and uploads the collected video file to the word collection platform. The word collection platform processes the video file with reference to the third way, extracts the audio file and the image files, further performs voiceprint recognition on the audio file to generate the voice file of the salesperson (i.e., the first user), and performs face detection, localization and recognition on the extracted image files to generate multiple facial image files of the customer.
After that, the generated facial image files of the customer and the voice file of the salesperson are stored in the cloud server in association with each other.
In some embodiments of the present application, other ways may also be used to obtain the voice of one party to the conversation and the facial images of the other party, which are not enumerated one by one in the embodiments of the present application.
Step S2: determining, by performing expression recognition on the facial images of the second user, the facial expressions of the second user occurring at different times during the conversation.
Further, after the multiple facial images of the second user are obtained, the facial expression of the second user in each facial image can be determined by performing expression recognition on each facial image. The facial expression of the second user recognized from each facial image includes, but is not limited to, any one of: smiling, focused, calm, disgusted, and angry. For the specific implementation of performing expression recognition on each facial image, refer to the prior art, which is not repeated in this embodiment. The embodiments of the present application do not limit the specific implementation used to perform expression recognition on each facial image.
Further, for each facial image, the time stamp of that facial image is taken as the occurrence time of the facial expression of the second user recognized from that facial image.
In this way, the facial expressions of the second user occurring at different times during the conversation with the first user can be obtained.
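As a sketch of this per-image recognition and time stamping, the following shows how each time-stamped facial image might be mapped to an expression label; the classify_expression function stands in for whatever expression recognition technique is used, which the embodiments deliberately leave open:

def recognize_expressions(facial_images, classify_expression):
    """Map each time-stamped facial image to a (timestamp, expression) pair.

    facial_images is a list of (image, timestamp) pairs, and
    classify_expression is a hypothetical recognition function returning one
    of "smiling", "focused", "calm", "disgusted", "angry".
    """
    return [(timestamp, classify_expression(image))
            for image, timestamp in facial_images]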
Step S3: matching, according to the occurrence time of each facial expression and the occurrence time of the voice, each facial expression of the second user against the text obtained by converting the voice of the first user, and determining the text corresponding to each facial expression.
Generally, during a conversation, the facial expressions of one party reflect that party's real-time emotional response to what the other party says. For example, during a conversation between the first user and the second user, the facial expressions of the second user at different times reflect whether the second user is satisfied with, or repelled by, what the first user said at that point in time. Therefore, by matching the text obtained by converting the first user's voice against the second user's facial expressions on the basis of time, words matching the second user's different emotion polarities can be obtained.
In some embodiments of the present application, matching each facial expression of the second user against the text obtained by converting the voice of the first user according to the occurrence time of each facial expression and the occurrence time of the voice, and determining the text corresponding to each facial expression, includes: for each facial expression, taking the voice segment of the first user that occurred within a preset time range of the occurrence time of that facial expression as the voice segment matching that facial expression; and, for each facial expression, taking the text obtained by converting the voice segment matching that facial expression as the text matching that facial expression.
For example, for a conversation between the first user and the second user that starts at time T, the acquired voice file of the first user is denoted voice.wav, and the acquired facial images of the second user are denoted picture{{p1,t1},...,{pN,tN}}, where the time attribute of voice.wav includes its collection time, pN denotes the N-th facial image acquired of the second user, tN denotes the collection time (i.e., time stamp) of facial image pN, and N is a natural number greater than 1. After expression recognition is performed on each of the facial images picture{{p1,t1},...,{pN,tN}} of the second user, the facial expression of the second user in each facial image can be obtained. Taking five acquired facial images of the second user as an example, the expression recognition results are: the facial expression of the second user is "smiling" in facial image p1, "calm" in p2, "calm" in p3, "disgusted" in p4, and "angry" in p5.
In some embodiments of the present application, the voice segment of the first user that occurred within a preset time range (for example, 10 seconds) around the occurrence time of each facial expression of the second user (i.e., the collection time of the facial image corresponding to that facial expression) may be taken as the voice segment matching that facial expression. For example, the voice segment of the first user's voice file voice.wav whose audio-stream time stamps fall within the time range (t1-5, t1+5) is taken as the voice segment matching the second user's "smiling" facial expression; the segment within (t2-5, t2+5) is taken as the first voice segment matching the second user's "calm" facial expression; the segment within (t3-5, t3+5) is taken as the second voice segment matching the second user's "calm" facial expression; the segment within (t4-5, t4+5) is taken as the voice segment matching the second user's "disgusted" facial expression; and the segment within (t5-5, t5+5) is taken as the voice segment matching the second user's "angry" facial expression.
Following the above method, the voice segment of the first user corresponding to each facial expression of the second user at the different time points can be determined. The same facial expression may correspond to different voice segments, meaning that different utterances of the first user can trigger the same expression of the second user.
In some embodiments of the present application, the preset time range is set according to specific needs.
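A minimal sketch of this time-window matching is given below; the representation of expressions and transcribed voice segments, and the default window of 5 seconds on each side, are illustrative assumptions consistent with the 10-second example above, and speech-to-text conversion is assumed to have been applied to the voice segments already:

def match_expressions_to_segments(expressions, voice_segments, window=5.0):
    """Pair each facial expression with the first user's speech that occurred
    within +/- window seconds of the expression's occurrence time.

    expressions is a list of (timestamp, expression) pairs;
    voice_segments is a list of (start_time, end_time, text) tuples.
    """
    matches = []
    for ts, expression in expressions:
        texts = [text for start, end, text in voice_segments
                 if start < ts + window and end > ts - window]
        matches.append((expression, " ".join(texts)))
    return matches

Overlap rather than exact containment is used here so that an utterance spanning the window boundary is not lost.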
Step S4: determining, according to the text corresponding to each facial expression of the second user, words matching a preset emotion polarity of the second user.
In some embodiments of the present application, the preset emotion polarities include positive emotion and negative emotion, where positive emotion is the emotion polarity embodied by facial expressions such as the user's "smiling", "focused" and "calm", and negative emotion is the emotion polarity embodied by facial expressions such as "disgusted" and "angry".
In some embodiments of the present application, determining words matching the preset emotion polarity of the second user according to the text corresponding to each facial expression of the second user includes: determining, according to a correspondence between facial expressions and emotion polarities, the emotion polarity matched by each facial expression; taking the emotion polarity matched by each facial expression as the emotion polarity matched by the text matching that facial expression; and determining the words matching each emotion polarity of the second user according to the frequency with which different words appear in the texts matching the same emotion polarity.
In some embodiments of the present application, the correspondence between facial expressions and emotion polarities can be established in advance based on expert knowledge. For example, suppose the user's emotion polarities include two kinds, positive emotion and negative emotion; the "smiling", "focused" and "calm" facial expressions are defined as matching the positive-emotion polarity, and the "disgusted" and "angry" expressions are defined as matching the negative-emotion polarity. Then, according to this correspondence and the facial expressions of the second user at the different time points recognized above, it can be determined that the second user's emotional response to the first user's utterances at the different time points is, respectively: at time t1 the first user's utterance produced a positive emotion in the second user; at t2, a positive emotion; at t3, a positive emotion; at t4, a negative emotion; and at t5, a negative emotion.
After the correspondence between the second user's facial expressions and the matching voice segments of the first user has been determined, text conversion is performed on each voice segment to determine the utterance text of the first user in each voice segment, and thereby to determine the utterance text of the first user matching the facial expressions of the second user occurring at the different time points. For the specific implementation of converting a voice segment into text, refer to the prior art, which is not repeated in the embodiments of the present application; the embodiments of the present application do not limit the specific implementation used to convert a voice segment into text. Then, the text obtained by converting each voice segment is taken as the text matching the facial expression of the second user that matches that voice segment. For example, for the second user's facial expression "smiling" occurring at time t1, the corresponding utterance text of the first user in the (t1-5, t1+5) time period is "Hello, I am happy to serve you"; for the second user's facial expression "angry" occurring at time t5, the corresponding utterance text of the first user in the (t5-5, t5+5) time period is "You must handle this as soon as possible, otherwise it will be too late".
As described above, based on the correspondence between facial expressions and emotion polarities, the emotion polarity matched by the second user's facial expression at each of the five time points t1 to t5 in this conversation has been determined, and the utterance text of the first user matched by the second user's facial expression at each of those time points has also been determined. It can therefore be further determined which emotion polarity of the second user is matched by the first user's utterance text at each time point. For example, at time t1, the first user's utterance text "Hello, I am happy to serve you" matches the second user's positive emotion; at time t5, the first user's utterance text "You must handle this as soon as possible, otherwise it will be too late" matches the second user's negative emotion.
Finally, the words contained in the portions of the first user's utterance text in the current conversation that match the second user's positive emotion are added to the set of candidate words for the positive-emotion polarity, and the words contained in the portions that match the second user's negative emotion are added to the set of candidate words for the negative-emotion polarity.
Thus, a group of words matching the positive-emotion polarity and a group of words matching the negative-emotion polarity have been determined from one conversation between the first user and the second user. By collecting multiple conversations between the first user and the second user, multiple groups of words matching the positive-emotion polarity and multiple groups of words matching the negative-emotion polarity can be obtained.
Further, by separately analyzing the multiple groups of words matching the different emotion polarities collected from the multiple conversations between the first user and the second user, the words whose occurrence frequency in the groups matching the positive-emotion polarity satisfies a preset condition (for example, the five most frequently occurring words) are determined as the words matching the second user's positive emotion, and the words whose occurrence frequency in the groups matching the negative-emotion polarity satisfies a preset condition (for example, the five most frequently occurring words) are determined as the words matching the second user's negative emotion.
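The polarity assignment and frequency-based selection described above might be sketched as follows; the expression-to-polarity mapping mirrors the example given earlier, while the tokenize function is a hypothetical word-segmentation step (for Chinese text a segmenter would be needed) not specified by the embodiments:

from collections import Counter

# Illustrative expression-to-polarity correspondence, following the example above.
EXPRESSION_POLARITY = {
    "smiling": "positive", "focused": "positive", "calm": "positive",
    "disgusted": "negative", "angry": "negative",
}

def top_polarity_words(matches_per_conversation, tokenize, top_n=5):
    """Aggregate (expression, text) matches from several conversations and
    return the top_n most frequent words per emotion polarity.
    """
    counters = {"positive": Counter(), "negative": Counter()}
    for matches in matches_per_conversation:
        for expression, text in matches:
            polarity = EXPRESSION_POLARITY.get(expression)
            if polarity:
                counters[polarity].update(tokenize(text))
    return {polarity: [word for word, _ in counter.most_common(top_n)]
            for polarity, counter in counters.items()}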
At this point, the collection of words matching the different emotion polarities of the second user is complete.
In the word collection method for matching emotion polarity disclosed in the embodiments of the present application, the voice of the first user and the facial images of the second user are acquired during a conversation between the first user and the second user; expression recognition is performed on the facial images of the second user to determine the facial expressions of the second user occurring at different times during the conversation; according to the occurrence time of each facial expression and the occurrence time of the voice, each facial expression of the second user is matched against the text obtained by converting the voice of the first user, so as to determine the text corresponding to each facial expression; and, according to the text corresponding to each facial expression of the second user, words matching a preset emotion polarity of the second user are determined. This improves the efficiency of collecting words based on emotion polarity. By automatically collecting the facial expressions of one party and the voice of the other party during a conversation between two users, and relying on what the other party said when a given facial expression occurred, the method can accurately determine the words that cause the user to produce positive emotions and negative emotions.
The word collection method for matching emotion polarity disclosed in the embodiments of the present application can be applied in many fields. For example, it can be applied to the field of chat robots. First, the facial expressions of the human with whom the robot is conversing, as well as the robot's own voice, are collected by the robot, and the word collection platform recognizes and matches the robot's voice against the human's facial expressions based on the collection times, so as to determine the words that elicit the human's different emotion polarities. In some embodiments, after determining the words matching the preset emotion polarity of the second user according to the text corresponding to each facial expression of the second user, the method further includes: building a positive-emotion lexicon of the second user according to the words matching the second user's positive emotion; and/or building a negative-emotion lexicon of the second user according to the words matching the second user's negative emotion. For example, after the words matching the human's positive emotions and negative emotions have been determined from multiple conversations between the robot and the human, a positive-emotion lexicon is built from the words matching positive emotions and a negative-emotion lexicon is built from the words matching negative emotions, and both lexicons are updated into the robot's corpus lexicon, so as to optimize the content of the robot's dialogue with the human and improve the human's chat experience.
For another example, in a customer service scenario, by collecting the customer's facial images and the customer service staff's voices during conversations between a given customer and multiple customer service staff, and performing the above steps S1 to S4 on the voice and facial images corresponding to each conversation, the positive-emotion lexicon and negative-emotion lexicon of that customer can be determined, making it easier for customer service staff to choose dialogue content for that customer with reference to those lexicons and improving the quality of service provided to the customer.
Embodiment 2
In other embodiments of the present application, the word collection method for matching emotion polarity described in Embodiment 1 can also be applied to building a corpus lexicon for a preset conversation scenario. For example, the conversation between the first user and the second user is a conversation in a preset conversation scenario; as shown in FIG. 2, after the step of determining, according to the text corresponding to each facial expression of the second user, the words matching the preset emotion polarity of the second user, the method further includes steps S5 and S6.
Step S5: reselecting the first user and the second user, and repeating steps S1 to S4 until a word-set output condition is satisfied.
Step S6: outputting, according to the words matching the preset emotion polarity of all the selected second users, the set of words matching the preset emotion polarity for the conversation scenario.
The word-set output condition includes, but is not limited to, any one of the following: the number of times steps S1 to S4 have been repeated reaches a preset number (for example, 10,000 times); the number of selected second users reaches a preset value (for example, 1,000 people); or the number of acquired voices of the first user reaches a preset value (for example, 10,000 voices).
Taking as an example the building of a corpus for sales staff in the salesperson-customer conversation scenario, the conversations between 1,000 customers (i.e., second users) and the sales staff may be selected for voice and facial image collection, and the words matching the positive emotions and the words matching the negative emotions of these 1,000 customers are determined according to the method described in steps S1 to S4 of Embodiment 1. Then, a set of words matching positive emotions is constructed from the words matching the positive emotions of these 1,000 customers, and a set of words matching negative emotions is constructed from the words matching the negative emotions of these 1,000 customers. The constructed set of words matching positive emotions can serve as the positive-emotion word set for the salesperson-customer conversation scenario, and the constructed set of words matching negative emotions can serve as the negative-emotion word set for that scenario. Finally, the set of words matching positive emotions can be output and used to build a preferred corpus for sales staff, and the set of words matching negative emotions can be output and used to build an avoidance corpus for sales staff.
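A sketch of this repeated collection with a word-set output condition is given below; conversation_source and collect_polarity_words are placeholders for the per-conversation acquisition and the steps S1 to S4 processing respectively, and the stop condition shown (a fixed number of conversations) is only one of the example conditions mentioned above:

def build_scene_word_sets(conversation_source, collect_polarity_words,
                          max_conversations=10000):
    """Repeat the per-conversation collection over many second users until
    the output condition is met, then merge the results per polarity.

    collect_polarity_words is assumed to return a dict such as
    {"positive": [...], "negative": [...]} for one conversation.
    """
    scene_sets = {"positive": set(), "negative": set()}
    for count, conversation in enumerate(conversation_source, start=1):
        words = collect_polarity_words(conversation)
        scene_sets["positive"].update(words.get("positive", []))
        scene_sets["negative"].update(words.get("negative", []))
        if count >= max_conversations:
            break
    return scene_sets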
In the word collection method for matching emotion polarity disclosed in the embodiments of the present application, the voice of the first user and the facial images of the second user are collected during a number of conversations between multiple second users and at least one first user in a preset conversation scenario, and the following data processing operations are performed on the voice and facial images corresponding to each collected conversation: expression recognition is performed on the facial images of the second user to determine the facial expressions of the second user occurring at different times during the conversation; according to the occurrence time of each facial expression and the occurrence time of the voice, each facial expression of the second user is matched against the text obtained by converting the voice of the first user, so as to determine the text corresponding to each facial expression; and, according to the text corresponding to each facial expression of the second user, words matching the preset emotion polarity of the second user are determined. Finally, according to the words matching the preset emotion polarity of all the second users, the set of words matching the preset emotion polarity for the conversation scenario is output, which helps to automatically build corpora matching different emotion polarities for a conversation scenario. For example, when building a training corpus for sales staff, it is no longer necessary to manually tally the words matching the different emotion polarities of different users, which improves the efficiency of corpus construction.
Embodiment 3
An embodiment of the present application discloses a word collection apparatus for matching emotion polarity. As shown in FIG. 3, the apparatus includes:
a voice and facial image acquisition module 310, configured to acquire, during a conversation between a first user and a second user, the voice of the first user and facial images of the second user;
a facial expression determination module 320, configured to determine, by performing expression recognition on the facial images of the second user, the facial expressions of the second user occurring at different times during the conversation;
a facial expression and voice matching module 330, configured to match, according to the occurrence time of each facial expression and the occurrence time of the voice, each facial expression of the second user against the text obtained by converting the voice of the first user, and determine the text corresponding to each facial expression; and
an emotion-polarity-matched word determination module 340, configured to determine, according to the text corresponding to each facial expression of the second user, words matching a preset emotion polarity of the second user.
In some embodiments of the present application, the facial expression and voice matching module 330 is further configured to:
for each facial expression, take the voice segment of the first user that occurred within a preset time range of the occurrence time of that facial expression as the voice segment matching that facial expression; and
for each facial expression, take the text obtained by converting the voice segment matching that facial expression as the text matching that facial expression.
In some embodiments of the present application, the emotion-polarity-matched word determination module 340 is further configured to:
determine, according to a correspondence between facial expressions and emotion polarities, the emotion polarity matched by each facial expression;
take the emotion polarity matched by each facial expression as the emotion polarity matched by the text matching that facial expression; and
determine the words matching each emotion polarity of the second user according to the frequency with which different words appear in the texts matching the same emotion polarity.
In some embodiments of the present application, the preset emotion polarities include positive emotion and negative emotion; as shown in FIG. 4, the apparatus further includes:
a user positive-emotion lexicon building module 350, configured to build a positive-emotion lexicon of the second user according to the words matching the second user's positive emotion; and/or
a user negative-emotion lexicon building module 360, configured to build a negative-emotion lexicon of the second user according to the words matching the second user's negative emotion.
By automatically collecting the facial expressions of one party and the voice of the other party during a conversation between two users, and relying on what the other party said when a given facial expression of that user occurred, the words that cause the user to produce positive emotions and negative emotions can be accurately determined.
In some embodiments of the present application, the conversation between the first user and the second user is a conversation in a preset conversation scenario; as shown in FIG. 5, the apparatus further includes:
a multi-conversation word collection module 370, configured to reselect the first user and the second user, and repeatedly invoke the voice and facial image acquisition module 310, the facial expression determination module 320, the facial expression and voice matching module 330 and the emotion-polarity-matched word determination module 340 until a word-set output condition is satisfied; and
a scenario word set output module 380, configured to output, according to the words matching the preset emotion polarity of all the selected second users, the set of words matching the preset emotion polarity for the conversation scenario.
The word collection apparatus for matching emotion polarity disclosed in the embodiments of the present application is used to implement the word collection method for matching emotion polarity described in Embodiment 1 or Embodiment 2 of the present application. The specific implementations of the modules of the apparatus are not repeated here; refer to the specific implementations of the corresponding steps in the method embodiments.
In the word collection apparatus for matching emotion polarity disclosed in the embodiments of the present application, the voice of the first user and the facial images of the second user are acquired during a conversation between the first user and the second user; expression recognition is performed on the facial images of the second user to determine the facial expressions of the second user occurring at different times during the conversation; according to the occurrence time of each facial expression and the occurrence time of the voice, each facial expression of the second user is matched against the text obtained by converting the voice of the first user, so as to determine the text corresponding to each facial expression; and, according to the text corresponding to each facial expression of the second user, words matching a preset emotion polarity of the second user are determined. This improves the efficiency of collecting words based on emotion polarity. By automatically collecting the facial expressions of one party and the voice of the other party during a conversation between two users, and relying on what the other party said when a given facial expression occurred, the apparatus can accurately determine the words that cause the user to produce positive emotions and negative emotions.
In the word collection apparatus for matching emotion polarity disclosed in the embodiments of the present application, the voice of the first user and the facial images of the second user are collected during a number of conversations between multiple second users and at least one first user in a preset conversation scenario, and the following data processing operations are performed on the voice and facial images corresponding to each collected conversation: expression recognition is performed on the facial images of the second user to determine the facial expressions of the second user occurring at different times during the conversation; according to the occurrence time of each facial expression and the occurrence time of the voice, each facial expression of the second user is matched against the text obtained by converting the voice of the first user, so as to determine the text corresponding to each facial expression; and, according to the text corresponding to each facial expression of the second user, words matching the preset emotion polarity of the second user are determined. Finally, according to the words matching the preset emotion polarity of all the second users, the set of words matching the preset emotion polarity for the conversation scenario is output, which helps to automatically build corpora matching different emotion polarities for a conversation scenario. For example, when building a training corpus for sales staff, it is no longer necessary to manually tally the words matching the different emotion polarities of different users, which improves the efficiency of corpus construction.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on the differences from the other embodiments, and the same or similar parts of the embodiments may be referred to each other. As for the apparatus embodiment, since it is basically similar to the method embodiments, its description is relatively simple, and for relevant details reference may be made to the description of the method embodiments.
The word collection method and apparatus for matching emotion polarity provided by the present application have been described in detail above. Specific examples have been used herein to explain the principles and implementations of the present application, and the description of the above embodiments is only intended to help in understanding the method of the present application and its core idea. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementations and the scope of application in accordance with the idea of the present application. In summary, the contents of this specification should not be construed as limiting the present application.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative work.
The various component embodiments of the present application may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art should understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the electronic device according to the embodiments of the present application. The present application may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for executing part or all of the methods described herein. Such a program implementing the present application may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
For example, FIG. 6 shows an electronic device that can implement the method according to the present application. The electronic device may be a PC, a mobile terminal, a personal digital assistant, a tablet computer, or the like. The electronic device conventionally includes a processor 610, a memory 620, and program code 630 stored on the memory 620 and executable on the processor 610, where the processor 610, when executing the program code 630, implements the method described in the above embodiments. The memory 620 may be a computer program product or a computer-readable medium. The memory 620 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read-only memory), an EPROM, a hard disk or a ROM. The memory 620 has a storage space 6201 for the program code 630 of the computer program for executing any of the method steps described above. For example, the storage space 6201 for the program code 630 may include individual computer programs respectively used to implement the various steps of the above method. The program code 630 is computer-readable code. These computer programs may be read from or written into one or more computer program products. These computer program products include program code carriers such as a hard disk, a compact disc (CD), a memory card or a floppy disk. The computer program includes computer-readable code which, when run on the electronic device, causes the electronic device to execute the method according to the above embodiments.
The embodiments of the present application also disclose a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the steps of the word collection method for matching emotion polarity described in Embodiment 1 or Embodiment 2 of the present application are implemented.
Such a computer program product may be a computer-readable storage medium, which may have storage segments, storage spaces, and the like arranged similarly to the memory 620 in the electronic device shown in FIG. 6. The program code may, for example, be stored in the computer-readable storage medium in a suitably compressed form. The computer-readable storage medium is typically a portable or fixed storage unit as described with reference to FIG. 7. The storage unit generally includes computer-readable code 630', that is, code that can be read by a processor; when executed by the processor, this code implements the steps of the method described above.
Reference herein to "one embodiment", "an embodiment", or "one or more embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. In addition, note that instances of the phrase "in one embodiment" herein do not necessarily all refer to the same embodiment.
The description provided here sets forth numerous specific details. It will be understood, however, that the embodiments of the present application may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
In the claims, any reference signs placed between parentheses shall not be construed as limiting the claims. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (13)

  1. A word collection method for matching emotion polarity, comprising:
    step S1: acquiring, during a conversation between a first user and a second user, the voice of the first user and facial images of the second user;
    step S2: determining, by performing expression recognition on the facial images of the second user, the facial expressions of the second user occurring at different times during the conversation;
    step S3: matching, according to the occurrence time of each facial expression and the occurrence time of the voice, each facial expression of the second user with text obtained by converting the voice of the first user, to determine the text corresponding to each facial expression;
    step S4: determining, according to the text corresponding to each facial expression of the second user, words matching a preset emotion polarity of the second user.
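As an editorial illustration of the four-step pipeline in claim 1, and not as part of the claims, the following is a minimal Python sketch. The callables `fer` (expression recognition), `asr` (speech-to-text), the `EXPRESSION_POLARITY` table, and the 2-second window are assumptions introduced only for this sketch; they are not components disclosed by the application.

```python
from collections import Counter

# Illustrative expression-to-polarity table (assumption, not fixed by the claim)
EXPRESSION_POLARITY = {"smile": "positive", "laugh": "positive",
                       "frown": "negative", "disgust": "negative"}

def collect_polarity_words(first_user_voice, second_user_frames,
                           fer, asr, target_polarity, window=2.0):
    """Sketch of steps S1-S4 for one conversation.

    `fer` turns facial images into [(time, expression_label), ...];
    `asr` turns the speaker's voice into [(t_start, t_end, text), ...].
    """
    expressions = fer(second_user_frames)          # S2: timestamped expressions
    segments = asr(first_user_voice)               # speech-to-text used in S3

    # S3: text spoken around the occurrence time of each expression
    expr_to_text = {}
    for t_expr, label in expressions:
        expr_to_text[(t_expr, label)] = " ".join(
            text for t0, t1, text in segments
            if t0 - window <= t_expr <= t1 + window)

    # S4: keep words whose expression maps to the target polarity
    counts = Counter()
    for (t_expr, label), text in expr_to_text.items():
        if EXPRESSION_POLARITY.get(label) == target_polarity:
            counts.update(text.split())
    return [word for word, _ in counts.most_common()]
```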
  2. The method according to claim 1, wherein the step of matching, according to the occurrence time of each facial expression and the occurrence time of the voice, each facial expression of the second user with the text obtained by converting the voice of the first user to determine the text corresponding to each facial expression comprises:
    for each facial expression, taking the voice segment of the first user that occurs within a preset time range of the occurrence time of that facial expression as the voice segment matching that facial expression;
    for each facial expression, taking the text obtained by converting the voice segment matching that facial expression as the text matching that facial expression.
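The time-window matching of claim 2 can be sketched as below (illustration only). The segment layout `[(t_start, t_end, text), ...]` and the 2-second default window are assumptions; the claim only requires a preset time range.

```python
def match_text_to_expressions(expressions, voice_segments, window=2.0):
    """For each (time, label) expression, keep the speaker's voice segments
    whose span lies within `window` seconds of the expression's occurrence
    time, and join their transcripts as the text matching that expression."""
    matched = {}
    for t_expr, label in expressions:
        texts = [text for t_start, t_end, text in voice_segments
                 if t_start - window <= t_expr <= t_end + window]
        matched[(t_expr, label)] = " ".join(texts)
    return matched
```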
  3. The method according to claim 1, wherein the step of determining, according to the text corresponding to each facial expression of the second user, words matching the preset emotion polarity of the second user comprises:
    determining the emotion polarity matched by each facial expression according to a correspondence between facial expressions and emotion polarities;
    taking the emotion polarity matched by each facial expression as the emotion polarity matched by the text matching that facial expression;
    determining words matching each emotion polarity of the second user according to the frequency with which different words appear in the text matching the same emotion polarity.
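A possible reading of claim 3, sketched for illustration: propagate each expression's polarity to its matched text and rank words by frequency within the same polarity. The `expression_polarity` table, `top_k`, and `min_count` thresholds are assumptions, not values taken from the application.

```python
from collections import Counter

def words_by_polarity(expr_to_text, expression_polarity, top_k=20, min_count=2):
    """Group the matched text by the polarity of its expression, then rank
    words by how often they appear under the same polarity."""
    counters = {}
    for (t_expr, label), text in expr_to_text.items():
        polarity = expression_polarity.get(label, "neutral")
        counters.setdefault(polarity, Counter()).update(text.split())
    return {polarity: [w for w, c in counter.most_common(top_k) if c >= min_count]
            for polarity, counter in counters.items()}
```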
  4. The method according to claim 1, wherein the preset emotion polarity comprises positive emotion and negative emotion, and after the step of determining, according to the text corresponding to each facial expression of the second user, words matching the preset emotion polarity of the second user, the method further comprises:
    building a positive emotion lexicon of the second user according to the words matching the positive emotion polarity of the second user; and/or,
    building a negative emotion lexicon of the second user according to the words matching the negative emotion polarity of the second user.
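For illustration only, the per-user lexicons of claim 4 could be persisted as below; the JSON layout and file path are assumed, as the application does not prescribe a storage format.

```python
import json

def build_emotion_lexicons(polarity_words, user_id, path="lexicons.json"):
    """Store one second user's positive and negative word lists as that
    user's emotion lexicons (assumed storage format)."""
    lexicons = {"user": user_id,
                "positive": sorted(set(polarity_words.get("positive", []))),
                "negative": sorted(set(polarity_words.get("negative", [])))}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(lexicons, f, ensure_ascii=False, indent=2)
    return lexicons
```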
  5. The method according to claim 1, wherein the conversation between the first user and the second user is a conversation in a preset conversation scene, and after the step of determining, according to the text corresponding to each facial expression of the second user, words matching the preset emotion polarity of the second user, the method further comprises:
    reselecting the first user and the second user, and repeating steps S1 to S4 until a word-set output condition is met;
    outputting, according to the words matching the preset emotion polarity of all selected second users, a set of words matching the preset emotion polarity in the conversation scene.
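The scene-level aggregation of claim 5 might look like the sketch below (illustration only). `collect_for_pair` stands for the single-conversation procedure of claim 1; the stopping rule (a minimum number of second users) and the share threshold are assumptions, since the claim leaves the word-set output condition unspecified.

```python
from collections import Counter

def collect_scene_word_set(user_pairs, target_polarity, collect_for_pair,
                           min_users=30, min_share=0.2):
    """Repeat steps S1-S4 over reselected user pairs, then output the words
    that match the polarity for a sufficient share of second users."""
    per_user_words = []
    for first_user, second_user in user_pairs:
        per_user_words.append(set(collect_for_pair(first_user, second_user,
                                                   target_polarity)))
        if len(per_user_words) >= min_users:      # word-set output condition
            break
    counts = Counter(w for words in per_user_words for w in words)
    threshold = min_share * len(per_user_words)
    return {w for w, c in counts.items() if c >= threshold}
```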
  6. A word collection apparatus for matching emotion polarity, comprising:
    a voice and facial image acquisition module, configured to acquire, during a conversation between a first user and a second user, the voice of the first user and facial images of the second user;
    a facial expression determination module, configured to determine, by performing expression recognition on the facial images of the second user, the facial expressions of the second user occurring at different times during the conversation;
    a facial expression and voice matching module, configured to match, according to the occurrence time of each facial expression and the occurrence time of the voice, each facial expression of the second user with text obtained by converting the voice of the first user, to determine the text corresponding to each facial expression;
    a module for determining words matching emotion polarity, configured to determine, according to the text corresponding to each facial expression of the second user, words matching a preset emotion polarity of the second user.
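One possible decomposition of the apparatus of claim 6, sketched for illustration with one injected callable per module; the constructor arguments are assumed stand-ins for the concrete modules, not interfaces defined by the application.

```python
class EmotionPolarityWordCollector:
    """Sketch of the apparatus of claim 6 as a thin orchestrating class."""
    def __init__(self, capture, fer, asr, matcher, selector):
        self.capture = capture    # voice and facial image acquisition module
        self.fer = fer            # facial expression determination module
        self.asr = asr            # speech-to-text used by the matching module
        self.matcher = matcher    # facial expression and voice matching module
        self.selector = selector  # module determining words matching polarity

    def run(self, first_user, second_user, target_polarity):
        voice, frames = self.capture(first_user, second_user)
        expressions = self.fer(frames)
        expr_to_text = self.matcher(expressions, self.asr(voice))
        return self.selector(expr_to_text, target_polarity)
```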
  7. The apparatus according to claim 6, wherein the facial expression and voice matching module is further configured to:
    for each facial expression, take the voice segment of the first user that occurs within a preset time range of the occurrence time of that facial expression as the voice segment matching that facial expression;
    for each facial expression, take the text obtained by converting the voice segment matching that facial expression as the text matching that facial expression.
  8. The apparatus according to claim 6, wherein the module for determining words matching emotion polarity is further configured to:
    determine the emotion polarity matched by each facial expression according to a correspondence between facial expressions and emotion polarities;
    take the emotion polarity matched by each facial expression as the emotion polarity matched by the text matching that facial expression;
    determine words matching each emotion polarity of the second user according to the frequency with which different words appear in the text matching the same emotion polarity.
  9. The apparatus according to claim 6, wherein the preset emotion polarity comprises positive emotion and negative emotion, and the apparatus further comprises:
    a user positive emotion lexicon building module, configured to build a positive emotion lexicon of the second user according to the words matching the positive emotion polarity of the second user; and/or,
    a user negative emotion lexicon building module, configured to build a negative emotion lexicon of the second user according to the words matching the negative emotion polarity of the second user.
  10. The apparatus according to claim 6, wherein the conversation between the first user and the second user is a conversation in a preset conversation scene, and the apparatus further comprises:
    a multi-conversation word collection module, configured to reselect the first user and the second user and to repeatedly invoke the voice and facial image acquisition module, the facial expression determination module, the facial expression and voice matching module, and the module for determining words matching emotion polarity, until a word-set output condition is met;
    a scene word set output module, configured to output, according to the words matching the preset emotion polarity of all selected second users, a set of words matching the preset emotion polarity in the conversation scene.
  11. An electronic device, comprising a memory, a processor, and program code stored in the memory and executable on the processor, wherein the processor, when executing the program code, implements the word collection method for matching emotion polarity according to any one of claims 1 to 5.
  12. A computer-readable storage medium on which program code is stored, wherein the program code, when executed by a processor, implements the steps of the word collection method for matching emotion polarity according to any one of claims 1 to 5.
  13. A computer program, comprising computer-readable code that, when run on an electronic device, causes the electronic device to execute the word collection method for matching emotion polarity according to any one of claims 1 to 5.
PCT/CN2020/100549 (WO2021135140A1), priority date 2019-12-31, filing date 2020-07-07: Word collection method matching emotion polarity

Applications Claiming Priority (2)

- CN201911419689.0, priority date 2019-12-31
- CN201911419689.0A (published as CN111210818B), priority date 2019-12-31, filing date 2019-12-31: Word acquisition method and device matched with emotion polarity and electronic equipment

Publications (1)

- WO2021135140A1 (en)

Family

ID=70786549

Family Applications (1)

- PCT/CN2020/100549 (WO2021135140A1), priority date 2019-12-31, filing date 2020-07-07: Word collection method matching emotion polarity

Country Status (2)

- CN: CN111210818B (en)
- WO: WO2021135140A1 (en)

Families Citing this family (2)

- CN111210818B, priority date 2019-12-31, publication date 2021-10-01, 北京三快在线科技有限公司: Word acquisition method and device matched with emotion polarity and electronic equipment
- CN112200051B, priority date 2020-09-30, publication date 2023-09-29, 重庆天智慧启科技有限公司: Case field inspection system and method


Also Published As

- CN111210818A, published 2020-05-29
- CN111210818B, published 2021-10-01


Legal Events

- 121 (EP): the EPO has been informed by WIPO that EP was designated in this application (ref document number 20909297, country of ref document: EP, kind code: A1)
- NENP: non-entry into the national phase (ref country code: DE)
- 122 (EP): PCT application non-entry in European phase (ref document number 20909297, country of ref document: EP, kind code: A1)