WO2014069121A1 - Conversation analysis device and conversation analysis method - Google Patents

Conversation analysis device and conversation analysis method

Info

Publication number
WO2014069121A1
WO2014069121A1 (PCT/JP2013/075243)
Authority
WO
WIPO (PCT)
Prior art keywords
data
conversation
expression data
expression
apology
Prior art date
Application number
PCT/JP2013/075243
Other languages
French (fr)
Japanese (ja)
Inventor
真 寺尾
祥史 大西
真宏 谷
岡部 浩司
Original Assignee
日本電気株式会社
Priority date
Filing date
Publication date
Application filed by 日本電気株式会社 filed Critical 日本電気株式会社
Priority to JP2014544379A priority Critical patent/JP6365304B2/en
Publication of WO2014069121A1 publication Critical patent/WO2014069121A1/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 3/00 Automatic or semi-automatic exchanges
    • H04M 3/42 Systems providing special services or facilities to subscribers
    • H04M 3/50 Centralised arrangements for answering calls; Centralised arrangements for recording messages for absent or busy subscribers; Centralised arrangements for recording messages
    • H04M 3/51 Centralised call answering arrangements requiring operator intervention, e.g. call or contact centers for telemarketing
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 2015/088 Word spotting
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 2201/00 Electronic components, circuits, software, systems or apparatus used in telephone systems
    • H04M 2201/40 Electronic components, circuits, software, systems or apparatus used in telephone systems using speech recognition
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 2203/00 Aspects of automatic or semi-automatic exchanges
    • H04M 2203/20 Aspects of automatic or semi-automatic exchanges related to features of supplementary services
    • H04M 2203/2038 Call context notifications

Definitions

  • the present invention relates to a conversation analysis technique.
  • An example of a technology for analyzing conversation is a technology for analyzing call data.
  • data of a call performed in a department called a call center or a contact center is analyzed.
  • Hereinafter, such a department, which specializes in the business of responding to customer calls such as inquiries, complaints, and orders regarding products and services, is referred to as a contact center.
  • In Patent Document 1, it is determined whether or not a keyword issued at the time of a complaint is included in a call by performing voice recognition on the contents of the call between the customer and the operator, and a method of judging the customer's CS (customer satisfaction) level from the determination result is proposed.
  • However, with such a method, the degree of satisfaction or dissatisfaction of the persons who participate in the conversation (hereinafter referred to as "conversation participants"), that is, the degree of satisfaction or dissatisfaction of the customer, may not be determined appropriately.
  • This is because even expressions (keywords) that can express satisfaction may be uttered regardless of satisfaction. For example, the thank-you expression "Thank you" can express satisfaction. However, it does not necessarily indicate satisfaction when used in a dialogue such as the following. Operator: "If that is the symptom, please first restart the PC, and ..." Customer: "Thank you. However, I have already tried that."
  • In addition, in the voice recognition used in such a method, misrecognitions such as insertion errors and omission errors may occur. Due to such misrecognition, an expression that is not actually uttered in the conversation (call) may be recognized, or an expression that is actually uttered in the conversation may fail to be recognized. As a result, the keyword to be extracted is erroneously detected or dropped, and the estimation accuracy of customer satisfaction or dissatisfaction based on that keyword decreases.
  • the present invention has been made in view of such circumstances, and provides a technique for accurately estimating the degree of satisfaction or dissatisfaction of a conversation participant.
  • the degree of satisfaction or dissatisfaction of a conversation participant means the degree of satisfaction or dissatisfaction that at least one conversation participant felt in the conversation.
  • The degree of satisfaction also includes merely indicating the presence or absence of satisfaction, and the degree of dissatisfaction also includes merely indicating the presence or absence of dissatisfaction.
  • the first aspect relates to a conversation analysis device.
  • The conversation analysis device according to the first aspect includes an expression detection unit that detects, from data corresponding to the voice of only the closing section of a conversation between a first conversation participant and a second conversation participant, at least one of thank-you expression data uttered by the first conversation participant and apology expression data uttered by the second conversation participant as specific expression data, and an estimation unit that estimates the degree of satisfaction or dissatisfaction of the first conversation participant in the conversation according to the detection result of the specific expression data.
  • the second aspect relates to a conversation analysis method executed by at least one computer.
  • The conversation analysis method according to the second aspect includes detecting, from data corresponding to the voice of only the closing section of a conversation between a first conversation participant and a second conversation participant, at least one of thank-you expression data uttered by the first conversation participant and apology expression data uttered by the second conversation participant as specific expression data, and estimating the degree of satisfaction or dissatisfaction of the first conversation participant in the conversation according to the detection result of the specific expression data.
  • Another aspect of the present invention may be a program that causes at least one computer to implement each configuration of the first aspect, or a computer-readable recording medium on which such a program is recorded.
  • This recording medium includes a non-transitory tangible medium.
  • In the present embodiment, the conversation analysis apparatus includes an expression detection unit that detects, from data corresponding to the voice of only the closing section of the conversation between the first conversation participant and the second conversation participant, at least one of thank-you expression data uttered by the first conversation participant and apology expression data uttered by the second conversation participant as specific expression data, and an estimation unit that estimates the degree of satisfaction or dissatisfaction of the first conversation participant in the conversation according to the detection result of the specific expression data.
  • The conversation analysis method of the present embodiment is executed by at least one computer and includes detecting, from data corresponding to the voice of only the closing section of the conversation between the first conversation participant and the second conversation participant, at least one of thank-you expression data uttered by the first conversation participant and apology expression data uttered by the second conversation participant as specific expression data, and estimating the degree of satisfaction or dissatisfaction of the first conversation participant in the conversation according to the detection result of the specific expression data.
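  • As a purely illustrative rendering of these two units, the following Python sketch shows how an expression detection unit restricted to the closing section and a threshold-based estimation unit could fit together. All class names, field names, and the threshold value are hypothetical assumptions made for illustration; they are not part of the specification.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RecognizedWord:
    """One word of voice text data with its utterance time (seconds)."""
    speaker: str      # "customer" (first conversation participant) or "operator"
    text: str
    time: float

class ExpressionDetector:
    """Detects thank-you / apology expressions in the closing section only."""
    def __init__(self, thanks: set, apologies: set):
        self.thanks = thanks
        self.apologies = apologies

    def detect(self, words: List[RecognizedWord], closing_start: float,
               closing_end: float) -> dict:
        counts = {"thanks": 0, "apology": 0}
        for w in words:
            if not (closing_start <= w.time <= closing_end):
                continue  # only the closing section is examined
            if w.speaker == "customer" and w.text in self.thanks:
                counts["thanks"] += 1
            if w.speaker == "operator" and w.text in self.apologies:
                counts["apology"] += 1
        return counts

class Estimator:
    """Estimates satisfaction / dissatisfaction from the detection result."""
    def estimate(self, counts: dict, threshold: int = 1) -> Optional[str]:
        if counts["thanks"] >= threshold:
            return "satisfied"
        if counts["apology"] >= threshold:
            return "dissatisfied"
        return None  # neutral / undetermined
```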
  • Here, a conversation means an exchange in which two or more speakers express their intentions to one another by uttering language.
  • Conversations include those in which the conversation participants talk directly, such as at a bank counter or a store cash register, and remote conversations such as telephone calls and video conferences.
  • the content and form of the target conversation are not limited, but a public conversation is more preferable as the target conversation than a private conversation such as a conversation between friends.
  • the above-mentioned thank-you expression data, apology expression data, and specific expression data are a word, a word string that is a sequence of a plurality of words, or a set of words scattered in a certain utterance in a conversation.
  • the thank-you expression data and the thank-you expression, the apology expression data and the apology expression, the specific expression data and the specific expression may be used without distinction.
  • For the thank-you expression data, for example, the single word "Thank you", a word string of consecutive words such as "Thank you", "Yes", and "Masu" (a segmentation of a polite Japanese thank-you phrase), and a word set such as "Truly" and "Thank you" can be included. Similarly, the apology expression data may include the single word "Sorry", a word string such as "Sorry", "present", "no", "n" (a segmentation of a polite Japanese apology phrase), and the like.
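  • The three forms of specific expression data described above (a single word, a consecutive word string, and a set of words scattered within one utterance) can be matched against word-segmented voice text data as in the following minimal sketch. The table entries are English stand-ins for the Japanese examples and are assumptions made only for illustration.

```python
def utterance_matches(utterance_words, expression):
    """Return True if `expression` (a str = single word, a tuple = consecutive
    word string, or a frozenset = words scattered anywhere in the utterance)
    occurs in the word-segmented utterance."""
    if isinstance(expression, str):                      # single word
        return expression in utterance_words
    if isinstance(expression, tuple):                    # consecutive word string
        n = len(expression)
        return any(tuple(utterance_words[i:i + n]) == expression
                   for i in range(len(utterance_words) - n + 1))
    if isinstance(expression, frozenset):                # scattered word set
        return expression.issubset(set(utterance_words))
    return False

# Illustrative specific expression table (English stand-ins).
THANKS_TABLE = ["thanks", ("thank", "you"), frozenset({"truly", "thank"})]

words = ["truly", "i", "thank", "you"]
print(any(utterance_matches(words, e) for e in THANKS_TABLE))  # True
```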
  • Conversation participants often express gratitude when they are satisfied with the conversation.
  • Conversely, a conversation participant often utters an apology when he or she senses that the conversation partner is dissatisfied.
  • However, an apology may also be uttered regardless of any dissatisfaction of the conversation partner. For example, a typical apology expression such as "I'm sorry, but please wait for a while" may be uttered; in this case, the conversation participant expresses an apology regardless of the conversation partner's emotion.
  • The present inventors have found that the emotions of conversation participants concerning the conversation as a whole, in particular satisfaction and dissatisfaction, tend to appear in the conversation-ending process, and further that thank-you and apology expressions uttered during the conversation-ending process are particularly likely to express those emotions.
  • Therefore, the present embodiment introduces the concept of a closing section, which means the conversation-ending process, and detects, from the data corresponding to the voice of only this closing section, specific expression data representing at least one of the thanks uttered by the first conversation participant and the apology uttered by the second conversation participant.
  • The end time of the closing section is set to the conversation end time.
  • the end of the conversation is represented, for example, by disconnecting the call in the case of a call, and by the dissolution of the conversation participants in the case of a conversation other than a call.
  • When a conversation is terminated by a specific sudden cause, such as circumstances beyond a conversation participant's control, the conversation may have no closing section.
  • In addition, noise information resulting from voice recognition errors on speech outside the closing section can also be excluded from the material used to estimate the satisfaction or dissatisfaction of the first conversation participant. Specifically, even if a thank-you or apology expression that was not actually uttered by a conversation participant is misrecognized outside the closing section, that misrecognized thank-you or apology expression is excluded from the estimation material.
  • Thus, the satisfaction or dissatisfaction level of the first conversation participant is estimated using only specific expression data that is likely to represent the satisfaction or dissatisfaction of the conversation participant. Therefore, according to the present embodiment, the satisfaction or dissatisfaction of the first conversation participant can be estimated with high accuracy from specific expression data of high purity, which excludes both specific expressions that do not reflect the satisfaction or dissatisfaction of the first conversation participant and noise data caused by voice recognition errors.
  • The conversation analysis apparatus and conversation analysis method described above are not limited to application to a contact center system that handles call data, and can be applied to various forms that handle conversation data. For example, they can also be applied to in-house call management systems other than contact centers, and to personal terminals such as PCs (Personal Computers), fixed telephones, mobile phones, tablet terminals, and smartphones owned by individuals.
  • Examples of conversation data include data representing a conversation between a person in charge and a customer at a bank counter or a store cash register.
  • A call handled in each embodiment refers to the speech exchanged between the call terminals of two callers from the time the call is connected until the time it is disconnected.
  • a continuous area in which a single caller is speaking in a call voice is referred to as an utterance or an utterance section.
  • For example, an utterance section is detected as a section in which an amplitude equal to or greater than a predetermined value continues in the caller's voice waveform. A normal call is composed of utterance sections of each caller, silent sections, and the like.
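  • As one hedged illustration of the amplitude-based detection mentioned above, the sketch below marks frames whose RMS energy is at or above a threshold and merges nearby active frames into utterance sections. The threshold, frame length, and merge gap are arbitrary assumed values; the embodiment does not prescribe a particular detection method.

```python
import numpy as np

def detect_utterance_sections(samples, rate, threshold=0.02,
                              frame_ms=20, min_gap_ms=300):
    """Return (start_sec, end_sec) spans where frame RMS >= threshold.
    Active spans closer than min_gap_ms are merged into one utterance section."""
    frame = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame
    active = [np.sqrt(np.mean(samples[i * frame:(i + 1) * frame] ** 2)) >= threshold
              for i in range(n_frames)]
    sections, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            sections.append((start, i))
            start = None
    if start is not None:
        sections.append((start, n_frames))
    gap = int(min_gap_ms / frame_ms)          # merge short silent gaps
    merged = []
    for s, e in sections:
        if merged and s - merged[-1][1] <= gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return [(s * frame_ms / 1000.0, e * frame_ms / 1000.0) for s, e in merged]
```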
  • FIG. 1 is a conceptual diagram showing a configuration example of a contact center system 1 in the first embodiment.
  • the contact center system 1 in the first embodiment includes an exchange (PBX) 5, a plurality of operator telephones 6, a plurality of operator terminals 7, a file server 9, a call analysis server 10, and the like.
  • the call analysis server 10 includes a configuration corresponding to the conversation analysis device in the above-described embodiment.
  • the customer corresponds to the first conversation participant described above
  • the operator corresponds to the second conversation participant described above.
  • the exchange 5 is communicably connected via a communication network 2 to a call terminal (customer telephone) 3 such as a PC, a fixed telephone, a mobile phone, a tablet terminal, or a smartphone that is used by a customer.
  • the communication network 2 is a public network such as the Internet or a PSTN (Public Switched Telephone Network), a wireless communication network, or the like.
  • the exchange 5 is connected to each operator telephone 6 used by each operator of the contact center. The exchange 5 receives the call from the customer and connects the call to the operator telephone 6 of the operator corresponding to the call.
  • Each operator uses an operator terminal 7.
  • Each operator terminal 7 is a general-purpose computer such as a PC connected to a communication network 8 (LAN (Local Area Network) or the like) in the contact center system 1.
  • each operator terminal 7 records customer voice data and operator voice data in a call between each operator and the customer.
  • the customer voice data and the operator voice data may be generated by being separated from the mixed state by predetermined voice processing. Note that this embodiment does not limit the recording method and the recording subject of such audio data.
  • Each voice data may be generated by a device (not shown) other than the operator terminal 7.
  • the file server 9 is realized by a general server computer.
  • the file server 9 stores the call data of each call between the customer and the operator together with the identification information of each call.
  • Each call data includes a pair of customer voice data and operator voice data, and disconnection time data indicating the time when the call was disconnected.
  • the file server 9 acquires customer voice data and operator voice data from another device (each operator terminal 7 or the like) that records each voice of the customer and the operator. Further, the file server 9 acquires disconnection time data from each operator telephone 6, the exchange 5 and the like.
  • the call analysis server 10 estimates the degree of customer satisfaction or dissatisfaction for each call data stored in the file server 9.
  • the call analysis server 10 includes a CPU (Central Processing Unit) 11, a memory 12, an input / output interface (I / F) 13, a communication device 14 and the like as a hardware configuration.
  • the memory 12 is a RAM (Random Access Memory), a ROM (Read Only Memory), a hard disk, a portable storage medium, or the like.
  • the input / output I / F 13 is connected to a device that accepts an input of a user operation such as a keyboard and a mouse, and a device that provides information to the user such as a display device and a printer.
  • the communication device 14 communicates with the file server 9 and the like via the communication network 8. Note that the hardware configuration of the call analysis server 10 is not limited.
  • FIG. 2 is a diagram conceptually illustrating a processing configuration example of the call analysis server 10 in the first embodiment.
  • the call analysis server 10 in the first embodiment includes a call data acquisition unit 20, a voice recognition unit 21, a closing detection unit 23, a specific expression table 25, an expression detection unit 26, an estimation unit 27, and the like.
  • Each of these processing units is realized, for example, by the CPU 11 executing a program stored in the memory 12. The program may be installed from a portable recording medium such as a CD (Compact Disc) or a memory card, or from another computer on the network via the input/output I/F 13, and stored in the memory 12.
  • the call data acquisition unit 20 acquires the call data of the call to be analyzed from the file server 9 together with the identification information of the call. As described above, the call data includes disconnection time data.
  • the call data may be acquired by communication between the call analysis server 10 and the file server 9, or may be acquired via a portable recording medium.
  • the voice recognition unit 21 performs voice recognition processing on each voice data of the operator and the customer included in the call data. Thereby, the voice recognition unit 21 acquires each voice text data and each utterance time data corresponding to the operator voice and the customer voice from the call data.
  • the voice text data is character data in which a voice uttered by a customer or an operator is converted into text. Each voice text data is divided for each word (part of speech). Each utterance time data includes utterance time data for each word of each voice text data.
  • The voice recognition unit 21 may detect the utterance sections of the operator and the customer from their respective voice data and acquire the start time and end time of each utterance section. In this case, the voice recognition unit 21 may determine an utterance time for each word string corresponding to each utterance section in each voice text data, and use the utterance time of each such word string as the utterance time data.
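  • A possible in-memory representation of such recognition results (voice text data divided into utterance sections, each with utterance time data) is sketched below; the data structure and the helper function are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Utterance:
    """One utterance section of recognized voice text data."""
    speaker: str          # "customer" or "operator"
    start: float          # start time of the utterance section (sec)
    end: float            # end time of the utterance section (sec)
    words: List[str]      # word-segmented text of the section

def words_in_range(utterances: List[Utterance], t0: float, t1: float,
                   speaker: str) -> List[str]:
    """Collect the words of one speaker whose utterance time falls in [t0, t1].
    Here the utterance time of a word string is approximated by the start time
    of its utterance section, as permitted in the embodiment."""
    return [w for u in utterances
            if u.speaker == speaker and t0 <= u.start <= t1
            for w in u.words]
```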
  • a voice recognition parameter (hereinafter referred to as a reference voice recognition parameter) adapted for a call in a contact center is used.
  • As this speech recognition parameter, for example, an acoustic model and a language model learned from a plurality of speech samples are used.
  • a known method may be used for the voice recognition process, and the voice recognition process itself and various voice recognition parameters used in the voice recognition process are not limited.
  • the method for detecting the utterance section is not limited.
  • The voice recognition unit 21 may perform the voice recognition processing on only one of the customer's and the operator's voice data, depending on the processing contents of the closing detection unit 23 and the expression detection unit 26. For example, when the closing section is detected by searching for a predetermined closing phrase as described later, the closing detection unit 23 requires the operator's voice text data. Moreover, the expression detection unit 26 requires the customer's voice text data.
  • The closing detection unit 23 detects the closing section of the target call based on the disconnection time data included in the call data and on the voice text data of the operator or the customer acquired by the voice recognition unit 21, together with its utterance time data.
  • the closing detection unit 23 generates closing section data including the start time and the end time of the detected closing section.
  • The end time of the closing section is set to the disconnection time indicated by the disconnection time data.
  • the start time of the closing section is set as follows, for example.
  • For example, the closing detection unit 23 determines, as the start time of the closing section, the start time of the utterance section located a predetermined number of utterances back from the call disconnection time.
  • Alternatively, the closing detection unit 23 may determine, as the start time of the closing section, a time point that is a predetermined time earlier than the call disconnection time. According to these methods of determining the start time of the closing section, the start time can be determined based only on the voice text data of whichever of the operator or the customer is used by the expression detection unit 26.
  • the predetermined number of utterances and the predetermined time for determining the width of the closing section are determined in advance according to a closing sentence described in an operator manual or the like, a result of listening to audio data at a contact center, or the like.
  • Alternatively, the closing detection unit 23 may determine, as the start time of the closing section, the utterance time of the first predetermined closing phrase appearing in the operator's voice text data.
  • the closing phrase is a phrase issued by the operator in the process of ending the call, such as a final greeting phrase.
  • In a contact center, the phrases to be uttered by an operator in the process of ending a call are often prescribed in a manual or the like. Therefore, the closing detection unit 23 may hold data of a plurality of such predetermined closing phrases in advance in an adjustable manner.
  • Such predetermined closing phrase data may be input by a user based on an input screen or the like, or may be acquired from a portable recording medium, another computer, or the like via the input / output I / F 13.
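  • The three ways of determining the start time of the closing section described above (a predetermined number of utterance sections back, a predetermined time back, or the first predetermined closing phrase in the operator's voice text data) could be combined roughly as in the following sketch. The function name, argument layout, and default values are assumptions.

```python
def closing_section(disconnect_time, utterance_starts=None, fixed_width=None,
                    operator_utterances=None, closing_phrases=None,
                    n_utterances=5):
    """Return (start, end) of the closing section; end = call disconnection time.
    Strategy 1: start of the n-th utterance section counted back from the end.
    Strategy 2: a fixed time width back from the disconnection time.
    Strategy 3: utterance time of the first predetermined closing phrase
                found in the operator's voice text data."""
    end = disconnect_time
    if closing_phrases and operator_utterances:
        for start_time, word_list in operator_utterances:
            text = " ".join(word_list)
            if any(p in text for p in closing_phrases):
                return (start_time, end)              # strategy 3
    if fixed_width is not None:
        return (max(0.0, end - fixed_width), end)     # strategy 2
    starts = sorted(t for t in (utterance_starts or []) if t < end)
    if starts:
        return (starts[-min(n_utterances, len(starts))], end)  # strategy 1
    return (0.0, end)
```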
  • the specific expression table 25 holds thanks expression data and apology expression data as specific expression data. Specifically, the specific expression table 25 holds the specific expression data to be detected by the expression detection unit 26 so that it can be distinguished into thank-you expression data and apology expression data. The specific expression table 25 may hold only one of thank-you expression data and apology expression data in accordance with the processing of the expression detection unit 26.
  • the expression detection unit 26 executes any one of the following three types of processing according to the specific expression data to be detected.
  • The first processing type detects only thank-you expression data, the second processing type detects only apology expression data, and the third processing type detects both thank-you expression data and apology expression data.
  • In the first processing type, the expression detection unit 26 extracts, from the customer's voice text data acquired by the voice recognition unit 21, the voice text data whose utterance time falls within the time range indicated by the closing section data generated by the closing detection unit 23.
  • the expression detection unit 26 detects thank-you expression data held in the specific expression table 25 from the voice text data of the customer corresponding to the extracted closing section. Along with this detection, the expression detection unit 26 counts the number of thanks expression data detected.
  • In the second processing type, the expression detection unit 26 extracts, from the operator's voice text data acquired by the voice recognition unit 21, the voice text data whose utterance time falls within the time range indicated by the closing section data generated by the closing detection unit 23.
  • the expression detection unit 26 detects apology expression data held in the specific expression table 25 from the voice text data of the operator corresponding to the extracted closing section. Along with this detection, the expression detection unit 26 counts the number of detected apology expression data.
  • In the third processing type, the expression detection unit 26 extracts, from the voice text data of both the customer and the operator acquired by the voice recognition unit 21, the voice text data whose utterance time falls within the time range indicated by the closing section data generated by the closing detection unit 23.
  • The expression detection unit 26 then detects apology expression data held in the specific expression table 25 from the operator's voice text data corresponding to the extracted closing section, and detects thank-you expression data held in the specific expression table 25 from the customer's voice text data corresponding to the extracted closing section.
  • the expression detection unit 26 separately counts the number of detected thank-you expression data and the number of detected apology expression data.
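  • A minimal sketch of the counting performed by the expression detection unit for the three processing types is shown below, assuming that the word lists have already been restricted to the closing section and that the specific expression table holds single-word entries; these assumptions simplify the illustration only.

```python
def count_specific_expressions(customer_words, operator_words,
                               thanks_table, apology_table,
                               processing_type="both"):
    """Count thank-you / apology expression data inside the closing section.
    customer_words / operator_words: word lists already restricted to the
    closing section. processing_type: "thanks", "apology", or "both"."""
    result = {"thanks": 0, "apology": 0}
    if processing_type in ("thanks", "both"):
        result["thanks"] = sum(w in thanks_table for w in customer_words)
    if processing_type in ("apology", "both"):
        result["apology"] = sum(w in apology_table for w in operator_words)
    return result
```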
  • the estimation unit 27 estimates at least one of customer satisfaction and dissatisfaction in the target call according to the number of thank-you expression data detected by the expression detection unit 26. For example, when the number of thank-you expression data detected is greater than or equal to a predetermined threshold, the estimation unit 27 estimates that there is satisfaction. Further, when the number of thank-you expression data detected is equal to or greater than a predetermined threshold, it may be estimated that there is no dissatisfaction. Furthermore, the estimation unit 27 may estimate that there is no satisfaction when the number of thank-you expression data detected is smaller than a predetermined threshold.
  • the predetermined threshold for estimating the presence or absence of satisfaction or dissatisfaction is determined in advance based on the result of listening to audio data at the contact center.
  • the table below shows the results of examining the relationship between the number of times the customer expressed gratitude in the closing period of the contact center call and the satisfaction and dissatisfaction of the customer. “Neutral” in the table indicates that the customer does not feel satisfaction or dissatisfaction. From the table below, it can be seen that the greater the number of times thanking in the closing section, the greater the probability that the customer feels satisfied and the less likely that the customer feels dissatisfied.
  • The threshold value for estimating the presence or absence of satisfaction or dissatisfaction is determined in advance based on such survey results. For example, based on the table below, it can be expected that the presence of satisfaction can be estimated with an accuracy of about 80% when the number of thanks is three or more. Moreover, when the number of thanks is less than one (that is, zero), it can be expected that the absence of satisfaction can be estimated with an accuracy of about 88%.
  • In the second processing type, the estimation unit 27 estimates at least one of the customer's dissatisfaction level and satisfaction level in the target call according to the number of apology expression data detected by the expression detection unit 26. For example, the estimation unit 27 estimates that there is dissatisfaction when the detected number of apology expression data is equal to or greater than a predetermined threshold. Moreover, the estimation unit 27 may determine a satisfaction level value or a dissatisfaction level value according to the number of detected thank-you expression data, and similarly may determine a dissatisfaction level value or a satisfaction level value according to the number of detected apology expression data.
  • In the third processing type, in which both thank-you expression data and apology expression data are detected, the estimation unit 27 may estimate at least one of the customer's satisfaction level and dissatisfaction level in the target call according to the detected numbers of both. For example, the estimation unit 27 estimates that there is satisfaction when the detected number of thank-you expression data is greater than the detected number of apology expression data, and estimates that there is dissatisfaction when the detected number of apology expression data is greater. In addition, the estimation unit 27 may determine a satisfaction level value and a dissatisfaction level value according to the respective detected numbers, or may determine a satisfaction level value or a dissatisfaction level value based on the difference between the two.
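  • The estimation rules described above can be summarized in a sketch such as the following; the concrete threshold values are assumptions, since the specification only states that thresholds are determined in advance, for example from listening surveys.

```python
def estimate(counts, thanks_threshold=3, apology_threshold=2):
    """Illustrative estimation rules; all threshold values are assumptions.
    Returns presence/absence flags and simple level values."""
    thanks, apology = counts["thanks"], counts["apology"]
    out = {
        "satisfied": thanks >= thanks_threshold,
        "not_satisfied": thanks == 0,
        "dissatisfied": apology >= apology_threshold,
        "satisfaction_level": thanks,           # level value from the count
        "dissatisfaction_level": apology,
    }
    # third processing type: compare the two detection counts
    if thanks > apology:
        out["overall"] = "satisfied"
    elif apology > thanks:
        out["overall"] = "dissatisfied"
    else:
        out["overall"] = "neutral"
    return out
```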
  • the estimation unit 27 generates output data including information indicating the estimation result, and outputs the determination result to the display unit or another output device via the input / output I / F 13.
  • the present embodiment does not limit the specific form of output of the determination result.
  • FIG. 3 is a flowchart showing an operation example of the call analysis server 10 in the first embodiment.
  • the call analysis server 10 acquires call data (S30).
  • the call analysis server 10 acquires call data to be analyzed from a plurality of call data stored in the file server 9.
  • the call analysis server 10 performs voice recognition processing on the customer voice data included in the call data acquired in (S30) (S31). Thereby, the call analysis server 10 acquires the customer's voice text data and utterance time data.
  • the customer's voice text data is divided for each word (part of speech).
  • the utterance time data includes utterance time data for each word or for each word string corresponding to each utterance section.
  • The call analysis server 10 detects the closing section of the target call based on the disconnection time data included in the call data acquired in (S30) and the utterance time data acquired in (S31) (S32). For example, the call analysis server 10 determines, as the start time of the closing section, a time point that is a predetermined time back from the call disconnection time indicated by the disconnection time data. As another example, the call analysis server 10 determines, as the start time of the closing section, the start time of the customer's utterance section located a predetermined number of utterances back from the call disconnection time. The call analysis server 10 then generates closing section data indicating the start time and end time of the detected closing section.
  • the call analysis server 10 extracts voice text data corresponding to the utterance time within the time range indicated by the closing section data generated in (S32) from the customer voice text data acquired in (S31). From the extracted speech text data, thank-you expression data as specific expression data is detected (S33). With this detection, the call analysis server 10 counts the number of thank-you expression data detected (S34).
  • the call analysis server 10 estimates the customer satisfaction of the target call based on the number of thank-you expression data detected in (S34) (S35). For example, when the number of thank-you expression data detected is greater than a predetermined threshold, the call analysis server 10 estimates that there is satisfaction and no dissatisfaction. When the number of thank-you expression data detected is smaller than the predetermined threshold, the call analysis server 10 estimates that there is no satisfaction. The call analysis server 10 generates output data indicating the presence or absence of the estimated satisfaction or dissatisfaction level, or a level value.
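  • Tying steps (S30) to (S35) together for the first processing type might look roughly like the following; the call-data fields, the speech recognizer interface, and the fixed-width closing section are all assumptions made to keep the sketch self-contained.

```python
def analyze_call(call_data, recognize, thanks_table,
                 closing_width_sec=60.0, threshold=1):
    """(S30)-(S35), first processing type. `call_data` carries the customer's
    voice data and the disconnection time; `recognize` is any speech
    recognizer returning [(word, utterance_time_sec), ...]."""
    words = recognize(call_data["customer_voice"])                      # (S31)
    start = max(0.0, call_data["disconnect_time"] - closing_width_sec)  # (S32)
    closing_words = [w for w, t in words
                     if start <= t <= call_data["disconnect_time"]]     # (S33)
    n_thanks = sum(w in thanks_table for w in closing_words)            # (S34)
    return {"satisfied": n_thanks >= threshold,                         # (S35)
            "thanks_count": n_thanks}
```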
  • For the second processing type, in (S31), the call analysis server 10 performs voice recognition processing on the operator's voice data included in the call data. Thereby, the call analysis server 10 acquires the operator's voice text data and utterance time data.
  • In (S32), the call analysis server 10 detects the closing section of the target call based on the disconnection time data included in the call data acquired in (S30) and the operator's voice text data acquired in (S31). In this case, the call analysis server 10 determines, as the start time of the closing section, the utterance time of the first predetermined closing phrase in the operator's voice text data.
  • In (S33), the call analysis server 10 extracts, from the operator's voice text data acquired in (S31), the voice text data whose utterance time falls within the time range indicated by the closing section data generated in (S32), and detects apology expression data as specific expression data from the extracted voice text data. In (S34), the call analysis server 10 counts the number of detected apology expression data.
  • the call analysis server 10 estimates the degree of dissatisfaction of the customer of the target call based on the detected number of apology expression data counted in (S34) (S35). The call analysis server 10 estimates that there is dissatisfaction if the detected number of apology expression data is greater than a predetermined threshold value, and otherwise estimates that there is no dissatisfaction.
  • For the third processing type, in (S31), the call analysis server 10 performs voice recognition processing on the voice data of both the customer and the operator. Thereby, the call analysis server 10 acquires voice text data and utterance time data for the customer and the operator, respectively.
  • the call analysis server 10 executes (S33) and (S34) in the above two cases, respectively. As a result, the number of detected thank-you expression data and the number of detected apology data are counted.
  • In (S35), the call analysis server 10 estimates at least one of the satisfaction level and dissatisfaction level of the customer of the target call based on the detected number of thank-you expression data and the detected number of apology expression data counted in (S34).
  • As described above, in the first embodiment, at least one of the customer's satisfaction level and dissatisfaction level in the target call is estimated based on at least one of the detected number of thank-you expression data uttered by the customer and the detected number of apology expression data uttered by the operator, both detected from the data corresponding to the voice of the closing section of the target call. According to this embodiment, since thank-you expressions and apology expressions are detected only in the closing section, the estimation uses specific expressions that are highly likely to reflect the customer's satisfaction or dissatisfaction and is not adversely affected by specific expressions misrecognized outside the closing section, so the customer's satisfaction or dissatisfaction can be estimated with high accuracy.
  • Furthermore, even in a form in which voice recognition processing is performed on only one of the customer's and the operator's voice data, the customer's satisfaction or dissatisfaction level can be estimated with high accuracy as described above. Therefore, according to the present embodiment, the load of the voice recognition processing can be reduced compared with a form in which voice recognition processing is performed on the voice data of both the customer and the operator.
  • At least one of the satisfaction level and dissatisfaction level of the customer of the target call can also be estimated based on both the detected number of thank-you expression data uttered by the customer and the detected number of apology expression data uttered by the operator. In this way, both the customer's thank-you expressions and the operator's apology expressions, which correlate strongly with the customer's satisfaction and dissatisfaction, are taken into account, so the accuracy of estimating the customer's satisfaction or dissatisfaction can be further improved.
  • FIG. 4 is a diagram conceptually illustrating a processing configuration example of the call analysis server 10 in the second embodiment.
  • the call analysis server 10 in the second embodiment further includes a voice recognition unit 41 in addition to the configuration of the first embodiment.
  • the voice recognition unit 41 is realized by executing a program stored in the memory 12 by the CPU 11, for example, in the same manner as the other processing units.
  • the voice recognition unit 21 performs voice recognition processing on the voice data of the operator included in the call data, using the reference voice recognition parameter LM-1. Since the voice text data acquired by the voice recognition process is used only by the closing detection unit 23, the voice recognition process may be performed only on the voice data of the operator. Note that the voice recognition unit 21 may perform voice recognition processing on the voice data of both the operator and the customer.
  • the voice recognition unit 21 holds in advance a reference voice recognition parameter LM-1 that has been learned in advance for general calls in the contact center.
  • The voice recognition unit 41 performs voice recognition processing on the voice data of the closing section of the target call using a recognition parameter (hereinafter referred to as a weighted speech recognition parameter) LM-2, which is obtained by weighting the reference voice recognition parameter LM-1 used by the voice recognition unit 21 so that the specific expression data detected by the expression detection unit 26 is recognized more easily than other word data.
  • the voice recognition unit 21 and the voice recognition unit 41 are distinguished from each other, but both may be realized as one processing unit, and the voice recognition parameters to be used may be switched.
  • the weighted speech recognition parameter LM-2 is calculated by a predetermined method based on the reference speech recognition parameter LM-1, for example, and is held in advance by the speech recognition unit 41.
  • The following equation shows an example of calculating the weighted speech recognition parameter LM-2 when an N-gram language model is used as the speech recognition parameter.
  • P_new(w_i | w_{i-n+1}^{i-1}) = P_old(w_i | w_{i-n+1}^{i-1}) × (P_new(w_i) / P_old(w_i))
  • Here, P_new(w_i | w_{i-n+1}^{i-1}) on the left side represents the N-gram language model corresponding to the weighted speech recognition parameter LM-2, that is, the appearance probability of the i-th word w_i under the condition of the word string from the (i-n+1)-th word to the (i-1)-th word. P_old(w_i | w_{i-n+1}^{i-1}) on the right side represents the N-gram language model corresponding to the reference speech recognition parameter LM-1, learned in advance for general calls in the contact center. P_new(w_i) on the right side is a unigram language model in which the appearance probabilities of the thank-you expressions and the apology expressions are increased, and P_old(w_i) is the corresponding unigram language model of the reference speech recognition parameter LM-1. In other words, the N-gram language model weighted by (P_new(w_i) / P_old(w_i)) so as to increase the appearance probabilities of thank-you expressions and apology expressions is calculated as the weighted speech recognition parameter LM-2.
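  • In code, this unigram-ratio weighting of an N-gram language model amounts to scaling each conditional probability by P_new(w_i)/P_old(w_i) and renormalizing per history, as in the following sketch. The dictionary-based model format and the boost factor are assumptions; real ASR toolkits apply the same idea inside their own model formats.

```python
def weight_ngram_probs(p_old_cond, p_old_uni, boosted_words, boost=10.0):
    """Build LM-2 from LM-1 by multiplying each conditional probability
    P_old(w | h) with P_new(w)/P_old(w), where P_new raises the unigram
    probability of thank-you / apology words, then renormalize per history.
    p_old_cond: {history_tuple: {word: prob}}, p_old_uni: {word: prob}."""
    # unigram with boosted specific expressions, renormalized
    p_new_uni = {w: p * (boost if w in boosted_words else 1.0)
                 for w, p in p_old_uni.items()}
    z = sum(p_new_uni.values())
    p_new_uni = {w: p / z for w, p in p_new_uni.items()}

    p_new_cond = {}
    for hist, dist in p_old_cond.items():
        scaled = {w: p * (p_new_uni[w] / p_old_uni[w]) for w, p in dist.items()}
        s = sum(scaled.values())
        p_new_cond[hist] = {w: p / s for w, p in scaled.items()}
    return p_new_cond
```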
  • The voice recognition unit 41 performs the voice recognition processing only on the voice data within the time range indicated by the closing section data generated by the closing detection unit 23. Depending on the processing content of the expression detection unit 26, the voice recognition unit 41 may target the voice data of both the customer and the operator, or only the voice data of one of them.
  • the expression detection unit 26 detects at least one of thanks expression data and apology expression data held in the specific expression table 25 from the voice text data acquired by the voice recognition unit 41.
  • FIG. 5 is a flowchart illustrating an operation example of the call analysis server 10 according to the second embodiment.
  • the same steps as those in FIG. 3 are denoted by the same reference numerals as those in FIG.
  • the call analysis server 10 applies weighted speech recognition parameters to the voice data in the time range indicated by the closing section data generated in (S32) among the voice data included in the call data acquired in (S30). Speech recognition using LM-2 is performed (S51). The call analysis server 10 detects at least one of thank-you expression data and apology expression data as specific expression data from the speech text data acquired in (S51) (S33).
  • the speech recognition process is performed on the speech data in the closing section using the weighted speech recognition parameters weighted so as to easily recognize the thanks and apologies. Then, at least one of thank-you expression data and apology expression data is detected from the speech text data acquired by this speech recognition process, and the satisfaction or dissatisfaction level of the customer of the target call is estimated based on the detection result.
  • Even when the detection rate of thank-you expressions and apology expressions is increased in this way, if no thank-you expression is detected, the estimation made from that detection result, namely that the customer is not satisfied, exhibits extremely high accuracy (purity). Therefore, according to the second embodiment, very high estimation accuracy can be expected by estimating that there is no satisfaction when the number of detected thank-you expressions is zero.
  • In the second embodiment, a language model weighted so that thank-you expressions are easily recognized is used; therefore, when the number of detected thank-you expressions is zero, there is a high possibility that the customer did not say thank you at all, and it is also possible to estimate that the customer is dissatisfied with the call.
  • FIG. 6 is a diagram conceptually illustrating a processing configuration example of the call analysis server 10 in the first modification.
  • the closing detection unit 23 detects a closing section using at least one of voice data and disconnection time data included in the call data acquired by the call data acquisition unit 20.
  • For example, the closing detection unit 23 may set the call disconnection time indicated by the disconnection time data as the end time of the closing section and determine a time point a predetermined time width back from the call disconnection time as the start time of the closing section. Alternatively, the closing detection unit 23 may hold the voice signal waveform of each closing phrase and acquire the utterance time of a closing phrase by collating each held waveform with the waveform of the voice data included in the call data.
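  • One conceivable way to realize the waveform collation mentioned above is a normalized cross-correlation between a stored closing-phrase waveform and the call waveform, as sketched below. This is an assumed, simplified substitute for whatever matching method an actual implementation uses; the hop size and score threshold are arbitrary.

```python
import numpy as np

def find_phrase_by_waveform(call_wave, phrase_wave, rate, min_score=0.6):
    """Slide the stored closing-phrase waveform over the call waveform and
    return the best-matching start time (seconds) if the normalized
    cross-correlation exceeds min_score, else None."""
    n = len(phrase_wave)
    pw = (phrase_wave - phrase_wave.mean()) / (phrase_wave.std() + 1e-9)
    best_score, best_pos = -1.0, None
    step = max(1, n // 4)                       # coarse hop to keep it cheap
    for pos in range(0, len(call_wave) - n, step):
        seg = call_wave[pos:pos + n]
        sg = (seg - seg.mean()) / (seg.std() + 1e-9)
        score = float(np.dot(pw, sg) / n)
        if score > best_score:
            best_score, best_pos = score, pos
    if best_pos is not None and best_score >= min_score:
        return best_pos / rate
    return None
```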
  • the voice recognition unit 21 may perform voice recognition processing on the voice data in the closing section of the target call.
  • the step (S31) shown in FIG. 3 may be executed after the step (S32) and before the step (S33).
  • FIG. 7 is a diagram conceptually illustrating a processing configuration example of the call analysis server 10 in the second modification.
  • the call analysis server 10 may not have the voice recognition unit 21.
  • the closing detection unit 23 detects a closing section using at least one of voice data and disconnection time data included in the call data acquired by the call data acquisition unit 20. Since the processing content of the closing detection unit 23 in the second modification may be the same as that in the first modification, description thereof is omitted here.
  • In the second modification, the step (S31) shown in FIG. 5 is omitted. According to the first and second modifications, since voice recognition is applied only to the section detected by the closing detection unit, there is an advantage that the calculation time required for estimating the degree of satisfaction or dissatisfaction of the customer can be reduced.
  • customer satisfaction or dissatisfaction is estimated based on the number of thank-you expression data detected and the number of apology expression data detected.
  • customer satisfaction or dissatisfaction may be estimated from other than the number of detections.
  • For example, a satisfaction point may be assigned in advance to each piece of thank-you expression data, and a dissatisfaction point to each apology expression. The customer's satisfaction level value may then be estimated from the total of the satisfaction points of the detected thank-you expression data, and the dissatisfaction level value from the total of the dissatisfaction points of the detected apology expression data.
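  • A sketch of this point-based variant is shown below; the point tables and the default weight are assumptions, since the specification only states that satisfaction and dissatisfaction points are assigned in advance.

```python
def point_based_levels(detected_thanks, detected_apologies,
                       thanks_points, apology_points, default=1.0):
    """Sum pre-assigned satisfaction points of detected thank-you expressions
    and dissatisfaction points of detected apology expressions."""
    satisfaction = sum(thanks_points.get(e, default) for e in detected_thanks)
    dissatisfaction = sum(apology_points.get(e, default)
                          for e in detected_apologies)
    return {"satisfaction_level": satisfaction,
            "dissatisfaction_level": dissatisfaction}
```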
  • the contact center system 1 is exemplified, and an example in which the reference voice recognition parameter is adapted (learned) for general calls in the contact center is shown.
  • the reference speech recognition parameters need only be adapted to the type of call being handled. For example, when a general call by a call terminal is handled, a reference speech recognition parameter adapted for such a general call may be used.
  • the call data includes disconnection time data
  • The disconnection time data is generated by each operator telephone 6, the exchange 5, or the like, and may also be generated by detecting a disconnection tone from the customer's voice data.
  • the disconnection time data may be generated by the file server 9 or the call analysis server 10.
  • the above-described call analysis server 10 may be realized as a plurality of computers.
  • For example, the call analysis server 10 may include only the expression detection unit 26 and the estimation unit 27, with the other processing units provided on another computer.
  • In this case, the closing detection unit 23 may acquire the closing section data through a user operation on an input device based on an input screen or the like, or via the input/output I/F 13 from a portable recording medium, another computer, or the like.
  • In each of the above-described embodiments, call data is handled. However, the conversation analysis device and conversation analysis method described above may also be applied to a device or system that handles conversation data other than calls.
  • a recording device for recording a conversation to be analyzed is installed at a place (conference room, bank window, store cash register, etc.) where the conversation is performed.
  • When the conversation data is recorded in a state in which the voices of a plurality of conversation participants are mixed, the conversation data is separated from the mixed state into voice data for each conversation participant by predetermined voice processing.
  • In the above-described embodiments, the call disconnection time data is used as the data indicating the end time of the conversation. For conversation data other than a call, an event indicating the end of the conversation may be detected automatically or manually, and the detection time point may be treated as conversation end time data. For the automatic detection, the end of the utterances of all conversation participants may be detected, or the movement of persons indicating the dissolution of the conversation participants may be detected by a sensor or the like. For the manual detection, an input operation by which a conversation participant notifies the end of the conversation may be detected.
  • In this case, the closing detection unit 23 may detect the closing section of the target conversation based on the conversation end time data included in the conversation data, the voice text data of the conversation participants acquired by the voice recognition unit 21, and their utterance time data.
  • In this case, the predetermined number of utterances and the predetermined time for determining the width of the closing section are determined according to the conversation type, such as conversations conducted at bank counters, at store cash registers, or at facility information centers.
  • a predetermined closing phrase is determined according to the conversation type.
  • The expression detection unit includes a speech recognition unit that performs speech recognition processing on the voice data of the closing section using a speech recognition parameter obtained by weighting a reference speech recognition parameter, adapted for speech recognition of a predetermined form of conversation including the conversation, so that the specific expression data is recognized more easily than other word data, and the expression detection unit detects the specific expression data from the voice text data of the closing section of the conversation obtained by the speech recognition processing of the speech recognition unit. The conversation analysis device according to appendix 1.
  • The expression detection unit counts the number of detected thank-you expression data or the number of detected apology expression data by detecting the specific expression data based on a specific expression table that holds the specific expression data in a manner distinguishable between thank-you expression data and apology expression data, and the estimation unit estimates at least one of satisfaction and dissatisfaction of the first conversation participant in the conversation based on the number of detected thank-you expression data or the number of detected apology expression data.
  • the conversation analyzer according to appendix 1 or 2.
  • The expression detection unit counts the number of detected thank-you expression data and the number of detected apology expression data by detecting the specific expression data based on a specific expression table that holds the specific expression data in a manner distinguishable between thank-you expression data and apology expression data, and the estimation unit estimates at least one of satisfaction and dissatisfaction of the first conversation participant in the conversation based on the number of detected thank-you expression data and the number of detected apology expression data.
  • the conversation analyzer according to appendix 1 or 2.
  • The conversation analysis method further includes performing speech recognition processing on the voice data of the closing section using a speech recognition parameter obtained by weighting a reference speech recognition parameter, adapted for speech recognition of a predetermined form of conversation including the conversation, so that the specific expression data is recognized more easily than other word data, and the specific expression data is detected from the voice text data of the closing section of the conversation obtained by the speech recognition processing.
  • The conversation analysis method further includes counting at least one of the number of detected thank-you expression data and the number of detected apology expression data by detecting the specific expression data based on a specific expression table that holds the specific expression data in a manner distinguishable between thank-you expression data and apology expression data, and the estimating estimates at least one of satisfaction and dissatisfaction of the first conversation participant in the conversation based on the number of detected thank-you expression data or the number of detected apology expression data.
  • the conversation analysis method according to appendix 5 or 6.
  • Appendix 9 A program for causing at least one computer to execute the conversation analysis method according to any one of appendices 5 to 8.

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Provided is a conversation analysis device comprising: an expression detection unit that detects data related to thanking expressions uttered by a first conversation participant and/or data related to apology expressions uttered by a second conversation participant as specific expression data from data corresponding to audio of only the closing segment of a conversation between the first conversation participant and the second conversation participant; and an estimation unit that estimates the degree of satisfaction or dissatisfaction of the first conversation participant in the conversation in accordance with the detection result of the specific expression data.

Description

Conversation analysis device and conversation analysis method
The present invention relates to a conversation analysis technique.
An example of a technology for analyzing conversation is a technology for analyzing call data. For example, data of calls performed in a department called a call center or a contact center is analyzed. Hereinafter, such a department, which specializes in the business of responding to customer calls such as inquiries, complaints, and orders regarding products and services, will be referred to as a contact center.
The voices of customers received at a contact center often reflect customer needs and satisfaction, and extracting such customer emotions and needs from calls with customers is very important for companies in order to increase repeat customers. Therefore, various methods have been proposed for extracting customer emotions (anger, irritation, discomfort, and the like) by analyzing voice. Patent Document 1 below proposes a method in which voice recognition is performed on the contents of a call between a customer and an operator to determine whether or not the call contains a keyword issued at the time of a complaint, and the customer's CS (customer satisfaction) level is judged from the determination result.
Patent Document 1: JP 2005-252845 A
However, in the proposed method, there is a possibility that the degree of satisfaction or dissatisfaction of a person who participates in the conversation (hereinafter referred to as a "conversation participant"), that is, the degree of satisfaction or dissatisfaction of the customer, cannot be determined appropriately. This is because even expressions (keywords) that can express satisfaction may be uttered regardless of satisfaction. For example, the thank-you expression "Thank you" can express satisfaction. However, the expression does not necessarily indicate satisfaction when used in a dialogue such as the following.
Operator: "If that is the symptom, please first restart the PC, and ..."
Customer: "Thank you. However, I have already tried that."
In addition, the speech recognition used in the proposed method may produce recognition errors such as insertion errors and deletion errors. Because of such misrecognition, an expression that was not actually uttered in the conversation (call) may be recognized, or an expression that was actually uttered may fail to be recognized. As a result, keywords to be extracted are falsely detected or missed, and the accuracy of estimating the customer's degree of satisfaction or dissatisfaction based on those keywords deteriorates.
The present invention has been made in view of such circumstances and provides a technique for estimating the degree of satisfaction or dissatisfaction of a conversation participant with high accuracy. Here, the degree of satisfaction or dissatisfaction of a conversation participant means the degree of satisfaction or dissatisfaction that at least one of the conversation participants presumably felt in the conversation. The degree of satisfaction also includes simply indicating the presence or absence of satisfaction, and the degree of dissatisfaction also includes simply indicating the presence or absence of dissatisfaction.
In order to solve the above problem, each aspect of the present invention adopts the following configuration.
A first aspect relates to a conversation analysis device. The conversation analysis device according to the first aspect includes: an expression detection unit that detects, from data corresponding to the speech of only the closing section of a conversation between a first conversation participant and a second conversation participant, at least one of thank-you expression data uttered by the first conversation participant and apology expression data uttered by the second conversation participant as specific expression data; and an estimation unit that estimates the degree of satisfaction or dissatisfaction of the first conversation participant in the conversation in accordance with the detection result of the specific expression data.
A second aspect relates to a conversation analysis method executed by at least one computer. The conversation analysis method according to the second aspect includes: detecting, from data corresponding to the speech of only the closing section of a conversation between a first conversation participant and a second conversation participant, at least one of thank-you expression data uttered by the first conversation participant and apology expression data uttered by the second conversation participant as specific expression data; and estimating the degree of satisfaction or dissatisfaction of the first conversation participant in the conversation in accordance with the detection result of the specific expression data.
Another aspect of the present invention may be a program that causes at least one computer to implement each configuration of the first aspect, or a computer-readable recording medium on which such a program is recorded. The recording medium includes a non-transitory tangible medium.
According to each of the above aspects, it is possible to provide a technique for estimating the degree of satisfaction or dissatisfaction of a conversation participant with high accuracy.
The above-described object and other objects, features, and advantages will become more apparent from the preferred embodiments described below and the accompanying drawings.
FIG. 1 is a conceptual diagram showing a configuration example of a contact center system according to a first embodiment.
FIG. 2 is a diagram conceptually showing a processing configuration example of a call analysis server according to the first embodiment.
FIG. 3 is a flowchart showing an operation example of the call analysis server according to the first embodiment.
FIG. 4 is a diagram conceptually showing a processing configuration example of a call analysis server according to a second embodiment.
FIG. 5 is a flowchart showing an operation example of the call analysis server according to the second embodiment.
FIG. 6 is a diagram conceptually showing a processing configuration example of a call analysis server according to a first modification.
FIG. 7 is a diagram conceptually showing a processing configuration example of a call analysis server according to a second modification.
Hereinafter, embodiments of the present invention will be described. The embodiments given below are merely examples, and the present invention is not limited to the configurations of the following embodiments.
The conversation analysis device according to the present embodiment includes: an expression detection unit that detects, from data corresponding to the speech of only the closing section of a conversation between a first conversation participant and a second conversation participant, at least one of thank-you expression data uttered by the first conversation participant and apology expression data uttered by the second conversation participant as specific expression data; and an estimation unit that estimates the degree of satisfaction or dissatisfaction of the first conversation participant in the conversation in accordance with the detection result of the specific expression data.
The conversation analysis method according to the present embodiment is executed by at least one computer and includes: detecting, from data corresponding to the speech of only the closing section of a conversation between a first conversation participant and a second conversation participant, at least one of thank-you expression data uttered by the first conversation participant and apology expression data uttered by the second conversation participant as specific expression data; and estimating the degree of satisfaction or dissatisfaction of the first conversation participant in the conversation in accordance with the detection result of the specific expression data.
Here, a conversation means that two or more speakers talk with each other by expressing their intentions, for example by uttering language. Some conversations take a form in which the participants talk face to face, such as at a bank counter or a store register, while others take place between participants at separate locations, such as a telephone call using call terminals or a video conference. In the present embodiment, the content and form of the target conversation are not limited, but a public (business) conversation is more suitable as a target conversation than a private one such as a conversation between friends.
The above-mentioned thank-you expression data, apology expression data, and specific expression data are each a word, a word string that is a sequence of a plurality of words, or a set of words scattered within a single utterance in a conversation. Hereinafter, thank-you expression data and thank-you expressions, apology expression data and apology expressions, and specific expression data and specific expressions may be used interchangeably. Examples of thank-you expression data include the single word 「ありがとう」 ("thank you"), the word string 「ありがとう」「ござい」「ます」 ("thank you very much"), and the word set 「本当」 ("really") and 「ありがとう」. Examples of apology expression data include the single word 「申し訳」 ("sorry") and the word string 「申し訳」「ござい」「ませ」「ん」 ("we are very sorry").
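As a rough illustration only, such a specific expression table could be held as a small keyword dictionary together with a matching helper; the entries, names, and matching rules below are assumptions made for this sketch, not wording fixed by the specification.

```python
# Minimal sketch of a specific expression table for thank-you / apology keywords.
SPECIFIC_EXPRESSIONS = {
    "thanks": [
        ("ありがとう",),                   # single word
        ("ありがとう", "ござい", "ます"),  # ordered word string
        {"本当", "ありがとう"},            # unordered word set scattered in one utterance
    ],
    "apology": [
        ("申し訳",),
        ("申し訳", "ござい", "ませ", "ん"),
    ],
}

def contains_word_string(words, expr):
    """True if expr (a tuple of words) appears as a contiguous run in words."""
    n = len(expr)
    return any(tuple(words[i:i + n]) == expr for i in range(len(words) - n + 1))

def matches(expression, utterance_words):
    """True if the expression occurs in the word-segmented utterance.
    A set is treated as unordered words scattered in the utterance,
    a tuple as an ordered word string."""
    if isinstance(expression, set):
        return expression.issubset(set(utterance_words))
    return contains_word_string(list(utterance_words), expression)
```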
A conversation participant often utters a thank-you expression when he or she feels satisfied with the conversation. On the other hand, a conversation participant often utters an apology expression upon sensing that the conversation partner is dissatisfied because of a fault on the participant's own side. However, as described above, even a thank-you expression may be uttered regardless of the participant's satisfaction. Similarly, an apology expression may be uttered regardless of the partner's dissatisfaction. For example, when a conversation participant temporarily leaves the conversation, he or she may utter a formulaic apology such as "I am sorry, but please wait a moment." In this case, the participant utters the apology expression without any direct relation to the partner's emotion.
The inventors found that, in the process of ending a conversation, the participants' feelings about the conversation as a whole, in particular satisfaction and dissatisfaction, tend to be expressed, and further found from this observation that thank-you expressions and apology expressions uttered in the ending process of a conversation are highly likely to reflect the emotions of the conversation participants.
Therefore, the present embodiment introduces the concept of a closing section, which means the ending process of a conversation, and detects, from data corresponding to the speech of only this closing section, specific expression data representing at least one of thanks uttered by the first conversation participant and an apology uttered by the second conversation participant. For example, the end time of the closing section is set to the end time of the conversation. The end of a conversation is represented, for example, by the disconnection of the call in the case of a telephone call, and by the dispersal of the participants in the case of a conversation other than a call. There are various methods for determining the start time of the closing section. In addition, when a conversation is terminated for a specific sudden reason, such as unavoidable circumstances of a participant, the conversation may have no closing section at all.
By narrowing the detection target of the specific expression data to data corresponding to the speech of the closing section in this way, the present embodiment excludes thank-you and apology expressions that are uttered regardless of the first conversation participant's satisfaction or dissatisfaction from the material used to estimate the first conversation participant's degree of satisfaction or dissatisfaction.
Furthermore, by narrowing the detection target of the specific expression data to data corresponding to the speech of the closing section as described above, the present embodiment can also exclude noise information caused by misrecognition of speech outside the closing section from the estimation material. Specifically, when a thank-you or apology expression that was not actually uttered by a conversation participant outside the closing section is misrecognized, that misrecognized expression is excluded from the estimation material.
As a result, in the present embodiment, the degree of satisfaction or dissatisfaction of the first conversation participant is estimated using only specific expression data that is highly likely to represent the participant's satisfaction or dissatisfaction. Therefore, according to the present embodiment, the degree of satisfaction or dissatisfaction of a conversation participant can be estimated with high accuracy from high-purity specific expression data from which specific expressions that do not reflect the first conversation participant's satisfaction or dissatisfaction, as well as noise data caused by misrecognition, have been removed.
The above embodiment will now be described in more detail. A first embodiment and a second embodiment are given below as detailed embodiments. Each of the following embodiments is an example in which the above-described conversation analysis device and conversation analysis method are applied to a contact center system. Note that the conversation analysis device and method described above are not limited to application to a contact center system handling call data, and can be applied to various modes handling conversation data. For example, they can also be applied to an in-house call management system other than a contact center, or to a call terminal owned by an individual, such as a PC (Personal Computer), a fixed-line telephone, a mobile phone, a tablet terminal, or a smartphone. Examples of conversation data further include data representing a conversation between a person in charge and a customer at a bank counter or a store register.
Hereinafter, a call handled in each embodiment means a call from the time a call connection is established between the call terminals of two callers until the call is disconnected. In the speech of a call, a continuous region in which one caller is speaking is referred to as an utterance or an utterance section. For example, an utterance section is detected as a section in which the amplitude of the caller's speech waveform remains at or above a predetermined value. A normal call is composed of the utterance sections of each caller, silent sections, and the like.
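As an illustration of the amplitude-based utterance section detection mentioned above, a minimal sketch might look like the following; the frame length, threshold, and minimum duration are assumed values chosen for the example, and the input is assumed to be a mono waveform as a NumPy array of floats.

```python
import numpy as np

def detect_utterance_sections(samples, sample_rate, amp_threshold=0.02,
                              frame_ms=20, min_duration_s=0.3):
    """Return (start_time, end_time) pairs, in seconds, for regions whose
    per-frame peak amplitude stays at or above amp_threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    active = [np.max(np.abs(samples[i * frame_len:(i + 1) * frame_len])) >= amp_threshold
              for i in range(n_frames)]

    sections, start = [], None
    for i, is_active in enumerate(active + [False]):   # sentinel closes a trailing run
        if is_active and start is None:
            start = i
        elif not is_active and start is not None:
            t0, t1 = start * frame_ms / 1000, i * frame_ms / 1000
            if t1 - t0 >= min_duration_s:
                sections.append((t0, t1))
            start = None
    return sections
```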
[First Embodiment]
〔System Configuration〕
FIG. 1 is a conceptual diagram showing a configuration example of the contact center system 1 according to the first embodiment. The contact center system 1 according to the first embodiment includes an exchange (PBX) 5, a plurality of operator telephones 6, a plurality of operator terminals 7, a file server 9, a call analysis server 10, and the like. The call analysis server 10 includes a configuration corresponding to the conversation analysis device of the above-described embodiment. In the first embodiment, the customer corresponds to the above-described first conversation participant, and the operator corresponds to the above-described second conversation participant.
The exchange 5 is communicably connected, via a communication network 2, to a call terminal (customer telephone) 3 used by a customer, such as a PC, a fixed-line telephone, a mobile phone, a tablet terminal, or a smartphone. The communication network 2 is a public network such as the Internet or a PSTN (Public Switched Telephone Network), a wireless communication network, or the like. Furthermore, the exchange 5 is connected to each operator telephone 6 used by each operator of the contact center. The exchange 5 receives a call from a customer and connects the call to the operator telephone 6 of the operator who handles the call.
Each operator uses an operator terminal 7. Each operator terminal 7 is a general-purpose computer such as a PC connected to a communication network 8 (a LAN (Local Area Network) or the like) within the contact center system 1. For example, each operator terminal 7 records the customer's voice data and the operator's voice data in a call between the operator and a customer. The customer's voice data and the operator's voice data may be generated by separating a mixed signal through predetermined voice processing. Note that the present embodiment does not limit the recording method or the entity performing the recording of such voice data; each item of voice data may be generated by a device (not shown) other than the operator terminal 7.
The file server 9 is realized by a general server computer. The file server 9 stores the call data of each call between a customer and an operator together with identification information of the call. Each item of call data includes a pair of the customer's voice data and the operator's voice data, and disconnection time data indicating the time at which the call was disconnected. The file server 9 acquires the customer's voice data and the operator's voice data from other devices (such as the operator terminals 7) that record the voices of the customer and the operator, and acquires the disconnection time data from the operator telephones 6, the exchange 5, or the like.
The call analysis server 10 estimates the customer's degree of satisfaction or dissatisfaction for each item of call data stored in the file server 9.
As shown in FIG. 1, the call analysis server 10 has, as a hardware configuration, a CPU (Central Processing Unit) 11, a memory 12, an input/output interface (I/F) 13, a communication device 14, and the like. The memory 12 is a RAM (Random Access Memory), a ROM (Read Only Memory), a hard disk, a portable storage medium, or the like. The input/output I/F 13 is connected to devices that accept user operation inputs, such as a keyboard and a mouse, and devices that provide information to the user, such as a display device and a printer. The communication device 14 communicates with the file server 9 and the like via the communication network 8. The hardware configuration of the call analysis server 10 is not limited.
〔Processing Configuration〕
FIG. 2 is a diagram conceptually showing a processing configuration example of the call analysis server 10 according to the first embodiment. The call analysis server 10 according to the first embodiment includes a call data acquisition unit 20, a voice recognition unit 21, a closing detection unit 23, a specific expression table 25, an expression detection unit 26, an estimation unit 27, and the like. Each of these processing units is realized, for example, by the CPU 11 executing a program stored in the memory 12. The program may be installed from a portable recording medium such as a CD (Compact Disc) or a memory card, or from another computer on the network, via the input/output I/F 13, and stored in the memory 12.
The call data acquisition unit 20 acquires, from the file server 9, the call data of a call to be analyzed together with identification information of the call. As described above, the call data includes the disconnection time data. The call data may be acquired through communication between the call analysis server 10 and the file server 9, or via a portable recording medium.
The voice recognition unit 21 performs voice recognition processing on each of the operator's and the customer's voice data included in the call data. The voice recognition unit 21 thereby obtains, from the call data, voice text data and utterance time data corresponding to the operator's voice and to the customer's voice. Here, the voice text data is character data in which the voice uttered by the customer or the operator has been converted into text. Each item of voice text data is segmented into words (parts of speech). Each item of utterance time data includes the utterance time of each word of the corresponding voice text data.
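As an example only, the per-word voice text data and utterance time data could be held in a structure like the following; the class and field names are hypothetical and chosen purely for the sketches that follow.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RecognizedWord:
    surface: str       # the word as text, e.g. "ありがとう"
    start_time: float  # utterance time of the word, in seconds from the start of the call

@dataclass
class ChannelTranscript:
    speaker: str                  # "customer" or "operator"
    words: List[RecognizedWord]   # word-segmented voice text data with utterance times
```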
The voice recognition unit 21 may also detect the utterance sections of the operator and the customer from their respective voice data and acquire the start time and end time of each utterance section. In this case, the voice recognition unit 21 may determine an utterance time for each word string corresponding to each utterance section in each item of voice text data, and use the utterance time of each such word string as the above utterance time data.
The voice recognition processing uses voice recognition parameters adapted for calls in a contact center (hereinafter referred to as reference voice recognition parameters). As the voice recognition parameters, for example, an acoustic model and a language model learned from a plurality of speech samples are used. In the present embodiment, any known method may be used for this voice recognition processing, and neither the voice recognition processing itself nor the various voice recognition parameters used in it are limited. The present embodiment also does not limit the method of detecting utterance sections.
The voice recognition unit 21 may perform the voice recognition processing on only one of the customer's and the operator's voice data, depending on the processing performed by the closing detection unit 23 and the expression detection unit 26. For example, when the closing section is detected by searching for predetermined closing phrases as described later, the closing detection unit 23 requires the operator's voice text data. The expression detection unit 26 requires the customer's voice text data when detecting thank-you expression data, and requires the operator's voice text data when detecting apology expression data.
The closing detection unit 23 detects the closing section of the target call based on the disconnection time data included in the call data and on the operator's or customer's voice text data and utterance time data acquired by the voice recognition unit 21. The closing detection unit 23 generates closing section data including the start time and the end time of the detected closing section. The end time of the closing section is set to the disconnection time indicated by the disconnection time data.
The start time of the closing section is set, for example, as follows. The closing detection unit 23 determines the start time of the utterance section located a predetermined number of utterances before the call disconnection time as the start time of the closing section. Alternatively, the closing detection unit 23 may determine a point in time a predetermined time before the call disconnection time as the start time of the closing section. With these methods, the start time of the closing section can be determined based only on the voice text data of whichever of the operator and the customer is used by the expression detection unit 26. The predetermined number of utterances and the predetermined time that determine the width of the closing section are decided in advance based on, for example, standard closing phrases described in an operator manual or the results of listening to voice data at the contact center.
Furthermore, the closing detection unit 23 may determine the utterance time of the earliest predetermined closing phrase in the operator's voice text data as the start time of the closing section. Here, a closing phrase is a phrase, such as a final greeting, that the operator utters in the process of ending a call. In a contact center, the phrases an operator should utter in the ending process of a call are often prescribed by a manual. Even for general callers who do not belong to a specialized department such as a contact center, certain more or less fixed phrases are uttered in the ending process of a call. The closing detection unit 23 may therefore hold data of a plurality of such predetermined closing phrases in an adjustable manner in advance. Such predetermined closing phrase data may be input by a user via an input screen or the like, or may be acquired from a portable recording medium, another computer, or the like via the input/output I/F 13.
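A minimal sketch of the three ways of determining the closing section described above (a fixed look-back, a fixed number of utterance sections back, or the earliest predetermined closing phrase) might look like the following. The look-back length, utterance count, and example closing phrase are assumptions for illustration, and the word items are the hypothetical RecognizedWord structure sketched earlier.

```python
def closing_by_fixed_lookback(disconnect_time, lookback_s=60.0):
    """Start the closing section a fixed time before the call disconnection."""
    return max(0.0, disconnect_time - lookback_s), disconnect_time

def closing_by_last_utterances(utterance_sections, disconnect_time, n_utterances=5):
    """Start the closing section at the start of the n-th utterance section
    counted back from the disconnection time."""
    starts = sorted(s for s, _ in utterance_sections)
    tail = starts[-n_utterances:] if starts else [disconnect_time]
    return tail[0], disconnect_time

def closing_by_phrase(operator_words, disconnect_time,
                      closing_phrases=(("お電話", "ありがとう", "ござい", "まし", "た"),)):
    """Start the closing section at the earliest occurrence of a predetermined
    closing phrase in the operator's word-segmented text (phrase list illustrative)."""
    for i in range(len(operator_words)):
        for phrase in closing_phrases:
            n = len(phrase)
            if tuple(w.surface for w in operator_words[i:i + n]) == phrase:
                return operator_words[i].start_time, disconnect_time
    return None  # no closing section found
```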
The specific expression table 25 holds thank-you expression data and apology expression data as specific expression data. Specifically, the specific expression table 25 holds the specific expression data to be detected by the expression detection unit 26 in such a way that thank-you expression data and apology expression data can be distinguished from each other. Depending on the processing of the expression detection unit 26, the specific expression table 25 may hold only one of the thank-you expression data and the apology expression data.
The expression detection unit 26 executes one of the following three types of processing according to the specific expression data to be detected. The first processing type detects only thank-you expression data, the second processing type detects only apology expression data, and the third processing type detects both thank-you expression data and apology expression data.
In the first processing type, the expression detection unit 26 extracts, from the customer's voice text data acquired by the voice recognition unit 21, the voice text data whose utterance times fall within the time range indicated by the closing section data generated by the closing detection unit 23. The expression detection unit 26 then detects, from the customer's voice text data corresponding to the extracted closing section, the thank-you expression data held in the specific expression table 25, and counts the number of detected thank-you expression data.
In the second processing type, the expression detection unit 26 extracts, from the operator's voice text data acquired by the voice recognition unit 21, the voice text data whose utterance times fall within the time range indicated by the closing section data generated by the closing detection unit 23. The expression detection unit 26 then detects, from the operator's voice text data corresponding to the extracted closing section, the apology expression data held in the specific expression table 25, and counts the number of detected apology expression data.
In the third processing type, the expression detection unit 26 extracts, from each of the customer's and the operator's voice text data acquired by the voice recognition unit 21, the voice text data whose utterance times fall within the time range indicated by the closing section data generated by the closing detection unit 23. The expression detection unit 26 detects the apology expression data held in the specific expression table 25 from the operator's voice text data corresponding to the extracted closing section, and detects the thank-you expression data held in the specific expression table 25 from the customer's voice text data corresponding to the extracted closing section. Along with these detections, the expression detection unit 26 counts the number of detected thank-you expression data and the number of detected apology expression data separately.
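The detection-and-counting step could be sketched as follows, reusing the hypothetical matches() helper and RecognizedWord structure from the earlier sketches; grouping words into utterances by utterance section is a simplification made for this example, not a requirement of the specification.

```python
def count_specific_expressions(transcript, expressions, closing_start, closing_end,
                               utterance_sections):
    """Count occurrences of the given expressions (the thank-you or apology entries
    of the specific expression table) in one speaker's utterances whose utterance
    sections fall entirely within the closing section."""
    count = 0
    for sec_start, sec_end in utterance_sections:
        if sec_start < closing_start or sec_end > closing_end:
            continue  # outside the closing section
        words = [w.surface for w in transcript.words
                 if sec_start <= w.start_time <= sec_end]
        count += sum(1 for expr in expressions if matches(expr, words))
    return count
```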
The estimation unit 27 estimates at least one of the customer's degree of satisfaction and degree of dissatisfaction in the target call according to the number of thank-you expression data counted by the expression detection unit 26. For example, the estimation unit 27 estimates that satisfaction is present when the number of detected thank-you expression data is equal to or greater than a predetermined threshold. It may also estimate that dissatisfaction is absent when the number of detected thank-you expression data is equal to or greater than the predetermined threshold, and may estimate that satisfaction is absent when the number is smaller than the predetermined threshold. The predetermined thresholds for estimating the presence or absence of satisfaction or dissatisfaction are determined in advance based on, for example, the results of listening to voice data at the contact center.
The table below shows the result of examining the relationship between the number of times a customer expressed thanks in the closing section of contact center calls and the presence or absence of customer satisfaction and dissatisfaction. "Neutral" in the table indicates that the customer felt neither satisfaction nor dissatisfaction. The table shows that the greater the number of thanks in the closing section, the higher the probability that the customer felt satisfied and the lower the probability that the customer felt dissatisfied. The above thresholds for estimating the presence or absence of satisfaction or dissatisfaction are determined in advance based on such survey results. For example, based on the table below, setting the threshold to three or more thanks allows the presence of satisfaction to be estimated with an accuracy of about 80%, and setting it to fewer than one (that is, zero) allows the absence of satisfaction to be estimated with an accuracy of about 88%.
[Table 1: number of thanks in the closing section versus presence or absence of customer satisfaction and dissatisfaction]
The estimation unit 27 also estimates at least one of the customer's degree of dissatisfaction and degree of satisfaction in the target call according to the number of apology expression data counted by the expression detection unit 26. For example, the estimation unit 27 estimates that dissatisfaction is present when the number of detected apology expression data is equal to or greater than a predetermined threshold. The estimation unit 27 may also determine a satisfaction level value or a dissatisfaction level value according to the number of detected thank-you expression data, and may similarly determine a dissatisfaction level value or a satisfaction level value according to the number of detected apology expression data.
Furthermore, when the numbers of both thank-you expression data and apology expression data have been counted, the estimation unit 27 may estimate at least one of the customer's degree of satisfaction and degree of dissatisfaction in the target call according to both counts. For example, the estimation unit 27 estimates that satisfaction is present when the number of detected thank-you expression data is larger than the number of detected apology expression data, and estimates that dissatisfaction is present when the number of detected apology expression data is larger. The estimation unit 27 may also determine a satisfaction level value and a dissatisfaction level value according to each count, or may determine a satisfaction level value or a dissatisfaction level value from the difference between the two counts.
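A rule-of-thumb sketch of this estimation logic is shown below. The apology threshold is a purely hypothetical placeholder; the thank-you threshold of three is only loosely motivated by the survey figures quoted above and is not a value fixed by the specification.

```python
def estimate_satisfaction(thanks_count, apology_count=None,
                          thanks_threshold=3, apology_threshold=2):
    """Threshold-based estimation sketch; threshold values are illustrative."""
    result = {
        "satisfied": thanks_count >= thanks_threshold,
        "no_satisfaction": thanks_count == 0,
    }
    if apology_count is not None:
        result["dissatisfied"] = apology_count >= apology_threshold
        # When both counts are available, the larger one can decide the overall label.
        if thanks_count > apology_count:
            result["overall"] = "satisfied"
        elif apology_count > thanks_count:
            result["overall"] = "dissatisfied"
        else:
            result["overall"] = "neutral"
    return result
```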
The estimation unit 27 generates output data including information indicating the estimation result, and outputs the result to a display unit or another output device via the input/output I/F 13. The present embodiment does not limit the specific form in which the estimation result is output.
〔Operation Example〕
A call analysis method according to the first embodiment will be described below with reference to FIG. 3. FIG. 3 is a flowchart showing an operation example of the call analysis server 10 according to the first embodiment.
First, the call analysis method in the case where only thank-you expressions are used will be described.
The call analysis server 10 acquires call data (S30). In the first embodiment, the call analysis server 10 acquires the call data to be analyzed from among the plurality of call data stored in the file server 9.
The call analysis server 10 performs voice recognition processing on the customer's voice data included in the call data acquired in (S30) (S31). The call analysis server 10 thereby acquires the customer's voice text data and utterance time data. The customer's voice text data is segmented into words (parts of speech), and the utterance time data includes the utterance time of each word or of each word string corresponding to each utterance section.
The call analysis server 10 detects the closing section of the target call based on the disconnection time data included in the call data acquired in (S30) and the utterance time data acquired in (S31) (S32). For example, the call analysis server 10 determines a point in time a predetermined time before the call disconnection time indicated by the disconnection time data as the start time of the closing section. As another example, the call analysis server 10 determines the start time of the customer's utterance section located a predetermined number of utterances before the call disconnection time as the start time of the closing section. The call analysis server 10 generates closing section data indicating the start time and the end time of the detected closing section.
The call analysis server 10 extracts, from the customer's voice text data acquired in (S31), the voice text data corresponding to utterance times within the time range indicated by the closing section data generated in (S32), and detects thank-you expression data as specific expression data from the extracted voice text data (S33). Along with this detection, the call analysis server 10 counts the number of detected thank-you expression data (S34).
The call analysis server 10 estimates the degree of satisfaction of the customer of the target call based on the number of thank-you expression data counted in (S34) (S35). For example, when the number of detected thank-you expression data is greater than a predetermined threshold, the call analysis server 10 estimates that satisfaction is present and dissatisfaction is absent, and when the number is smaller than the predetermined threshold, it estimates that satisfaction is absent. The call analysis server 10 generates output data indicating the presence or absence of the estimated satisfaction or dissatisfaction, or a level value.
Next, the call analysis method in the case where only apology expressions are used will be described.
In this case, in (S31), the call analysis server 10 performs voice recognition processing on the operator's voice data included in the call data, and thereby acquires the operator's voice text data and utterance time data.
In (S32), the call analysis server 10 detects the closing section of the target call based on the disconnection time data included in the call data acquired in (S30) and the operator's voice text data acquired in (S31). In this case, the call analysis server 10 determines the utterance time of the earliest predetermined closing phrase in the operator's voice text data as the start time of the closing section.
In (S33), the call analysis server 10 extracts, from the operator's voice text data acquired in (S31), the voice text data corresponding to utterance times within the time range indicated by the closing section data generated in (S32), and detects apology expression data as specific expression data from the extracted voice text data. In (S34), the call analysis server 10 counts the number of detected apology expression data.
In (S35), the call analysis server 10 estimates the degree of dissatisfaction of the customer of the target call based on the number of apology expression data counted in (S34). The call analysis server 10 estimates that dissatisfaction is present when the number of detected apology expression data is greater than a predetermined threshold, and otherwise estimates that dissatisfaction is absent.
The call analysis method in the case where both thank-you expressions and apology expressions are used as specific expressions will now be described. In this case, in (S31), the call analysis server 10 performs voice recognition processing on each of the customer's and the operator's voice data, and thereby acquires voice text data and utterance time data for both the customer and the operator.
In (S33) and (S34), the call analysis server 10 executes (S33) and (S34) of both of the two cases described above, so that the number of detected thank-you expression data and the number of detected apology expression data are each counted.
In (S35), the call analysis server 10 estimates at least one of the degree of satisfaction and the degree of dissatisfaction of the customer of the target call based on the number of thank-you expression data and the number of apology expression data counted in (S34).
〔Operation and Effects of the First Embodiment〕
As described above, in the first embodiment, at least one of the degree of satisfaction and the degree of dissatisfaction of the customer of the target call is estimated based on at least one of the number of thank-you expression data uttered by the customer and the number of apology expression data uttered by the operator, detected from the data corresponding to the speech of the closing section of the target call. According to the present embodiment, because thank-you expressions and apology expressions are detected only from the closing section, these specific expressions are highly likely to reflect the customer's satisfaction or dissatisfaction, and the estimation is not adversely affected by specific expressions misrecognized outside the closing section; the customer's degree of satisfaction or dissatisfaction can therefore be estimated with high accuracy.
Furthermore, according to the present embodiment, the customer's degree of satisfaction or dissatisfaction can be estimated with high accuracy from the voice text data of only one of the customer and the operator, as described above. Therefore, compared with a configuration that performs voice recognition processing on the voice data of both the customer and the operator, the load of the voice recognition processing can also be reduced.
In the first embodiment, at least one of the degree of satisfaction and the degree of dissatisfaction of the customer of the target call can also be estimated based on both the number of thank-you expression data uttered by the customer and the number of apology expression data uttered by the operator. In this way, both the customer's thank-you expressions and the operator's apology expressions, which are strongly correlated with customer satisfaction and dissatisfaction, are taken into account, so the accuracy of estimating the customer's degree of satisfaction or dissatisfaction can be further improved.
[Second Embodiment]
In the second embodiment, voice recognition processing is performed on the voice data of the closing section using voice recognition parameters weighted so that thank-you expressions and apology expressions are easier to recognize. The contact center system 1 according to the second embodiment will be described below, focusing on the differences from the first embodiment; descriptions of content common to the first embodiment will be omitted as appropriate.
〔Processing Configuration〕
FIG. 4 is a diagram conceptually showing a processing configuration example of the call analysis server 10 according to the second embodiment. The call analysis server 10 according to the second embodiment further includes a voice recognition unit 41 in addition to the configuration of the first embodiment. Like the other processing units, the voice recognition unit 41 is realized, for example, by the CPU 11 executing a program stored in the memory 12.
The voice recognition unit 21 performs voice recognition processing on the operator's voice data included in the call data, using a reference voice recognition parameter LM-1. Since the voice text data obtained by this processing is used only by the closing detection unit 23, the processing need only be performed on the operator's voice data; the voice recognition unit 21 may, however, perform voice recognition processing on the voice data of both the operator and the customer. The voice recognition unit 21 holds in advance the reference voice recognition parameter LM-1, learned in advance for contact center calls in general.
The voice recognition unit 41 performs voice recognition processing on the voice data of the closing section of the target call using a voice recognition parameter LM-2 (hereinafter referred to as a weighted voice recognition parameter), obtained by weighting the reference voice recognition parameter LM-1 used by the voice recognition unit 21 so that the specific expression data detected by the expression detection unit 26 are more easily recognized than other word data. Although the voice recognition unit 21 and the voice recognition unit 41 are shown separately in FIG. 4, they may be realized as a single processing unit whose voice recognition parameters are switched.
The weighted voice recognition parameter LM-2 is calculated, for example, by a predetermined method based on the reference voice recognition parameter LM-1, and is held in advance by the voice recognition unit 41. The following equation shows an example of calculating the weighted voice recognition parameter LM-2 when an N-gram language model is used as the voice recognition parameter.

P_new(w_i | w_{i-n+1}^{i-1}) = P_old(w_i | w_{i-n+1}^{i-1}) × P_new(w_i) / P_old(w_i)
The left-hand side P_new(w_i | w_{i-n+1}^{i-1}) of the above equation is the N-gram language model corresponding to the weighted voice recognition parameter LM-2, and represents the appearance probability of the i-th word w_i given the word string w_{i-n+1}^{i-1} from the (i-n+1)-th to the (i-1)-th word. P_old(w_i | w_{i-n+1}^{i-1}) on the right-hand side is the N-gram language model corresponding to the reference voice recognition parameter LM-1. P_new(w_i) on the right-hand side is a unigram language model in which the appearance probabilities of thank-you expressions and apology expressions have been increased, and P_old(w_i) is the unigram language model corresponding to the reference voice recognition parameter LM-1. According to the above equation, the N-gram language model learned in advance for contact center calls in general is weighted by (P_new(w_i) / P_old(w_i)) so that the appearance probabilities of thank-you expressions and apology expressions become larger, and the resulting N-gram language model is calculated as the weighted voice recognition parameter LM-2.
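A minimal sketch of this re-weighting is shown below. The dict-of-dicts representation {history_tuple: {word: probability}} of the N-gram model is an assumption made for the example and not the format used by the specification.

```python
def build_weighted_ngram(lm_old, unigram_old, unigram_new):
    """Compute P_new(w_i | history) = P_old(w_i | history) * P_new(w_i) / P_old(w_i),
    exactly as in the equation above. Words missing from the unigram tables are left
    unweighted; a practical system would typically also renormalize each distribution."""
    lm_new = {}
    for history, word_probs in lm_old.items():
        lm_new[history] = {
            w: p * unigram_new.get(w, unigram_old.get(w, 1.0)) / unigram_old.get(w, 1.0)
            for w, p in word_probs.items()
        }
    return lm_new
```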
The voice recognition unit 41 performs voice recognition processing only on the voice data within the time range indicated by the closing section data generated by the closing detection unit 23. Depending on the processing performed by the expression detection unit 26, the voice recognition unit 41 may target the voice data of both the customer and the operator, or only the voice data of one of them.
The expression detection unit 26 detects at least one of the thank-you expression data and the apology expression data held in the specific expression table 25 from the voice text data acquired by the voice recognition unit 41.
〔Operation Example〕
A call analysis method according to the second embodiment will be described below with reference to FIG. 5. FIG. 5 is a flowchart showing an operation example of the call analysis server 10 according to the second embodiment. In FIG. 5, steps that are the same as in FIG. 3 are given the same reference signs as in FIG. 3.
 通話分析サーバ10は、(S30)で取得された通話データに含まれる音声データの中の、(S32)で生成されたクロージング区間データで示される時間範囲の音声データに対して、加重音声認識パラメータLM-2を用いた音声認識を行う(S51)。
 通話分析サーバ10は、(S51)で取得された音声テキストデータの中から、特定表現データとしてのお礼表現データ及び謝罪表現データの少なくとも一方を検出する(S33)。
The call analysis server 10 applies weighted speech recognition parameters to the voice data in the time range indicated by the closing section data generated in (S32) among the voice data included in the call data acquired in (S30). Speech recognition using LM-2 is performed (S51).
The call analysis server 10 detects at least one of thank-you expression data and apology expression data as specific expression data from the speech text data acquired in (S51) (S33).
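The detection step (S33) can be pictured with the following sketch. The expression entries and the simple substring matching are assumptions made for illustration; the specification only requires that the specific expression table keep thank-you expression data and apology expression data distinguishable.

```python
from dataclasses import dataclass

# Hypothetical specific expression table (in the role of table 25): the
# entries keep thank-you data and apology data distinguishable.
SPECIFIC_EXPRESSION_TABLE = {
    "thanks":  ["thank you", "thanks a lot", "that was helpful"],
    "apology": ["sorry", "we apologize", "our apologies"],
}


@dataclass
class DetectionResult:
    thanks_count: int
    apology_count: int


def detect_specific_expressions(closing_text: str) -> DetectionResult:
    """Sketch of step (S33): count thank-you and apology expressions found
    in the speech text data of the closing section."""
    text = closing_text.lower()
    thanks = sum(text.count(e) for e in SPECIFIC_EXPRESSION_TABLE["thanks"])
    apology = sum(text.count(e) for e in SPECIFIC_EXPRESSION_TABLE["apology"])
    return DetectionResult(thanks_count=thanks, apology_count=apology)
```

For example, detect_specific_expressions("thank you, that was helpful") would return thanks_count=2 and apology_count=0 under the assumed table.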
[Operation and Effect of Second Embodiment]
As described above, in the second embodiment, the speech recognition processing is performed on the speech data of the closing section using the weighted speech recognition parameter weighted so that the thank-you expressions and the apology expressions are easily recognized. Then, at least one of thank-you expression data and apology expression data is detected from the speech text data acquired by this speech recognition processing, and the satisfaction or dissatisfaction level of the customer of the target call is estimated based on the detection result.
In the process of ending a call, the possibility that a thank-you expression or an apology expression is uttered is higher than in other sections. For this reason, the speech recognition processing performed on the speech data of the closing section uses the weighted speech recognition parameter weighted so that the thank-you expressions and the apology expressions are easily recognized. Therefore, according to the second embodiment, thank-you expression data and apology expression data can be reliably detected from the speech data of the closing section.
On the other hand, if speech recognition processing using such a weighted speech recognition parameter were performed on the speech data of sections other than the closing section, the recognition error rate for thank-you expressions and apology expressions would be more likely to increase, and the accuracy of estimating the customer's satisfaction or dissatisfaction could consequently decrease. In contrast, in the second embodiment, as described above, the speech recognition processing using the weighted speech recognition parameter is restricted to the speech data of the closing section, where thank-you expressions and apology expressions have a high appearance probability, so such a decrease in estimation accuracy can be avoided.
In the second embodiment, the detection rate of thank-you expressions and apology expressions is raised in this way. Therefore, if no thank-you expression is detected even so, the estimation that the customer felt no satisfaction, made in accordance with that detection result, exhibits extremely high accuracy (purity). Thus, according to the second embodiment, by estimating that there is no satisfaction when the number of detected thank-you expressions is zero, the estimation accuracy can be expected to be very high. Moreover, since the second embodiment uses a language model weighted so that thank-you expressions are easily recognized, a detection count of zero makes it particularly likely that the customer did not express any thanks at all, so it is also possible to estimate that the customer was dissatisfied with the call.
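The estimation discussed above can be summarized, again only as an illustrative sketch, by the following rule that reuses the DetectionResult of the earlier sketch. The exact decision policy and thresholds are assumptions introduced here, not values taken from the specification.

```python
def estimate_customer_state(result: DetectionResult) -> str:
    """Sketch of the estimation: when no thank-you expression is detected in
    the closing section, estimate 'no satisfaction' (or 'dissatisfied' when
    apology expressions were detected); otherwise compare the two counts."""
    if result.thanks_count == 0:
        return "dissatisfied" if result.apology_count > 0 else "no satisfaction"
    if result.thanks_count > result.apology_count:
        return "satisfied"
    return "undetermined"
```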
[First Modification]
Hereinafter, a modification of the call analysis server 10 according to the first embodiment will be described as a first modification. FIG. 6 is a diagram conceptually illustrating a processing configuration example of the call analysis server 10 in the first modification. In the first modification, the closing detection unit 23 detects the closing section using at least one of the voice data and the disconnection time data included in the call data acquired by the call data acquisition unit 20.
The closing detection unit 23 may set the call disconnection time indicated by the disconnection time data as the end time of the closing section and determine a point a predetermined time width before that call disconnection time as the start time of the closing section. Alternatively, the closing detection unit 23 may hold the audio signal waveform obtained from the voice data of each closing phrase and acquire the utterance time of a closing phrase by matching each of those waveforms against the waveform of the voice data included in the call data.
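As an illustration of the first of these options, the following sketch derives a closing section from the disconnection time. The 30-second width is an assumed value; the specification only speaks of a predetermined time width.

```python
def closing_section_from_disconnection(disconnect_time_s: float,
                                       window_s: float = 30.0) -> tuple[float, float]:
    """Set the closing section end to the call disconnection time and the
    start to a point a fixed (assumed) time width earlier."""
    return max(0.0, disconnect_time_s - window_s), disconnect_time_s
```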
In the first modification, the voice recognition unit 21 only needs to perform voice recognition processing on the voice data of the closing section of the target call.
In the call analysis method according to the first modification, the step (S31) shown in FIG. 3 only needs to be executed after the step (S32) and before the step (S33).
[Second Modification]
Hereinafter, a modification of the call analysis server 10 according to the second embodiment will be described as a second modification. FIG. 7 is a diagram conceptually illustrating a processing configuration example of the call analysis server 10 in the second modification. In the second modification, the call analysis server 10 does not have to include the voice recognition unit 21. The closing detection unit 23 detects the closing section using at least one of the voice data and the disconnection time data included in the call data acquired by the call data acquisition unit 20. Since the processing content of the closing detection unit 23 in the second modification may be the same as in the first modification, its description is omitted here.
In the call analysis method according to the second modification, the step (S31) shown in FIG. 5 is omitted. According to the first and second modifications, since voice recognition is applied only to the section detected by the closing detection unit, there is an advantage that the computation time required for estimating the customer's satisfaction or dissatisfaction can be reduced.
[Other Modifications]
In each of the above-described embodiments and modifications, the customer's satisfaction or dissatisfaction is estimated from the number of detected thank-you expression data and the number of detected apology expression data. However, the customer's satisfaction or dissatisfaction may be estimated from something other than the detection counts. For example, in the specific expression table 25, a satisfaction point may be assigned in advance to each thank-you expression data item and a dissatisfaction point to each apology expression data item, and the customer's satisfaction level value and dissatisfaction level value may then be estimated from the total satisfaction points of the detected thank-you expression data and the total dissatisfaction points of the detected apology expression data.
Each of the above-described embodiments and modifications exemplifies the contact center system 1, and therefore an example in which the reference speech recognition parameter is adapted (trained) for general calls in a contact center has been shown. The reference speech recognition parameter only needs to be adapted to the form of conversation being handled. For example, when general calls made with ordinary call terminals are handled, a reference speech recognition parameter adapted to such general calls may be used.
In each of the above-described embodiments and modifications, an example is shown in which the call data includes disconnection time data and the disconnection time data is generated by each operator telephone 6, the exchange 5, or the like. However, the disconnection time data may instead be generated by detecting a disconnection tone from the customer's voice data. In this case, the disconnection time data may be generated by the file server 9 or by the call analysis server 10.
The above-described call analysis server 10 may also be realized as a plurality of computers. In this case, for example, the call analysis server 10 has only the expression detection unit 26 and the estimation unit 27, and another computer has the other processing units. Furthermore, the closing detection unit 23 may acquire the closing section data through a user operating an input device on an input screen or the like, or may acquire it from a portable recording medium, another computer, or the like via the input/output I/F 13.
[Other Embodiments]
In each of the above-described embodiments and modifications, call data is handled. However, the above-described conversation analysis device and conversation analysis method may also be applied to devices or systems that handle conversation data other than calls. In this case, for example, a recording device for recording the conversation to be analyzed is installed at the place where the conversation takes place (a conference room, a bank counter, a store cash register, or the like). When the conversation data is recorded in a state in which the voices of a plurality of conversation participants are mixed, the mixed data is separated into voice data for each conversation participant by predetermined audio processing.
In each of the above-described embodiments and modifications, the call disconnection time data is used as the data indicating the end point of the conversation. In a form in which conversation data other than call data is handled, an event indicating the end of the conversation may be detected automatically or manually, and the detection time may be treated as the conversation end time data. With automatic detection, the end of utterances by all the conversation participants may be detected, or a movement of a person indicating that the conversation participants have dispersed may be detected with a sensor or the like. With manual detection, an input operation by a conversation participant for notifying the end of the conversation may be detected.
In a form in which conversation data other than call data is handled, the closing detection unit 23 may detect the closing section of the target conversation based on the conversation end time data included in the conversation data and on the voice text data of the conversation participants and its utterance time data acquired by the voice recognition unit 21. In this case, the predetermined number of utterances and the predetermined time for determining the width of the closing section are decided according to the conversation type, such as a conversation held at a bank counter, a conversation held at a store cash register, or a conversation held at a facility information center. Likewise, the predetermined closing phrases are decided according to the conversation type.
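The following sketch illustrates how such conversation-type-dependent parameters might be organized. The type names, the numeric values, and the way the two criteria are combined are all assumptions for illustration; the specification only states that they are decided according to the conversation type.

```python
# Illustrative per-conversation-type parameters for deciding the closing section.
CLOSING_PARAMS = {
    "contact_center_call": {"max_utterances": 6, "max_seconds": 30.0},
    "bank_counter":        {"max_utterances": 4, "max_seconds": 20.0},
    "store_register":      {"max_utterances": 3, "max_seconds": 15.0},
}


def closing_window(conversation_type: str, end_time_s: float,
                   utterance_start_times: list) -> tuple[float, float]:
    """Return a (start, end) closing window that ends at the conversation end
    time and covers at most the predetermined number of final utterances and
    the predetermined time width for the given conversation type."""
    p = CLOSING_PARAMS[conversation_type]
    last = utterance_start_times[-p["max_utterances"]:]
    start_by_count = last[0] if last else end_time_s
    start_by_time = end_time_s - p["max_seconds"]
    return max(0.0, start_by_count, start_by_time), end_time_s
```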
In the plurality of flowcharts used in the above description, a plurality of steps (processes) are described in order, but the execution order of the steps executed in the present embodiments is not limited to the described order. In the present embodiments, the order of the illustrated steps can be changed within a range that does not affect the substance of the processing. The above-described embodiments and modifications can also be combined to the extent that their contents do not conflict.
Some or all of the above embodiments and modifications can also be specified as in the following appendices. However, the embodiments and modifications are not limited to the following descriptions.
(Appendix 1)
A conversation analysis device comprising:
an expression detection unit that detects, as specific expression data, at least one of thank-you expression data uttered by a first conversation participant and apology expression data uttered by a second conversation participant from data corresponding to the speech of only a closing section of a conversation between the first conversation participant and the second conversation participant; and
an estimation unit that estimates a degree of satisfaction or a degree of dissatisfaction of the first conversation participant in the conversation according to a detection result of the specific expression data.
(Appendix 2)
The conversation analysis device according to appendix 1, wherein
the expression detection unit includes a speech recognition unit that performs speech recognition processing on the speech data of the closing section of the conversation using a speech recognition parameter obtained by weighting a reference speech recognition parameter, adapted to speech recognition of a predetermined form of conversation including the conversation, so that the specific expression data is more easily recognized than other word data, and
the expression detection unit detects the specific expression data from the speech text data of the closing section of the conversation obtained by the speech recognition processing of the speech recognition unit.
(Appendix 3)
The conversation analysis device according to appendix 1 or 2, wherein
the expression detection unit counts the number of detections of at least one of the thank-you expression data and the apology expression data by detecting the specific expression data based on a specific expression table that holds the specific expression data such that the thank-you expression data and the apology expression data are distinguishable from each other, and
the estimation unit estimates at least one of the degree of satisfaction and the degree of dissatisfaction of the first conversation participant in the conversation based on the number of detections of the thank-you expression data or the number of detections of the apology expression data.
(Appendix 4)
The conversation analysis device according to appendix 1 or 2, wherein
the expression detection unit counts the number of detections of the thank-you expression data and the number of detections of the apology expression data by detecting the specific expression data based on a specific expression table that holds the specific expression data such that the thank-you expression data and the apology expression data are distinguishable from each other, and
the estimation unit estimates at least one of the degree of satisfaction and the degree of dissatisfaction of the first conversation participant in the conversation based on the number of detections of the thank-you expression data and the number of detections of the apology expression data.
(Appendix 5)
A conversation analysis method executed by at least one computer, the method comprising:
detecting, as specific expression data, at least one of thank-you expression data uttered by a first conversation participant and apology expression data uttered by a second conversation participant from data corresponding to the speech of only a closing section of a conversation between the first conversation participant and the second conversation participant; and
estimating a degree of satisfaction or a degree of dissatisfaction of the first conversation participant in the conversation according to a detection result of the specific expression data.
(Appendix 6)
The conversation analysis method according to appendix 5, further comprising
performing speech recognition processing on the speech data of the closing section of the conversation using a speech recognition parameter obtained by weighting a reference speech recognition parameter, adapted to speech recognition of a predetermined form of conversation including the conversation, so that the specific expression data is more easily recognized than other word data,
wherein the detecting of the specific expression data detects the specific expression data from the speech text data of the closing section of the conversation obtained by the speech recognition processing.
(Appendix 7)
The conversation analysis method according to appendix 5 or 6, further comprising
counting the number of detections of at least one of the thank-you expression data and the apology expression data by detecting the specific expression data based on a specific expression table that holds the specific expression data such that the thank-you expression data and the apology expression data are distinguishable from each other,
wherein the estimating estimates at least one of the degree of satisfaction and the degree of dissatisfaction of the first conversation participant in the conversation based on the number of detections of the thank-you expression data or the number of detections of the apology expression data.
(Appendix 8)
The conversation analysis method according to appendix 5 or 6, further comprising
counting the number of detections of the thank-you expression data and the number of detections of the apology expression data by detecting the specific expression data based on a specific expression table that holds the specific expression data such that the thank-you expression data and the apology expression data are distinguishable from each other,
wherein the estimating estimates at least one of the degree of satisfaction and the degree of dissatisfaction of the first conversation participant in the conversation based on the number of detections of the thank-you expression data and the number of detections of the apology expression data.
(Appendix 9)
A program that causes at least one computer to execute the conversation analysis method according to any one of appendices 5 to 8.
(Appendix 10) A computer-readable recording medium on which the program according to appendix 9 is recorded.
This application claims priority based on Japanese Patent Application No. 2012-240750 filed on October 31, 2012, the entire disclosure of which is incorporated herein.

Claims (9)

1. A conversation analysis device comprising:
an expression detection unit that detects, as specific expression data, at least one of thank-you expression data uttered by a first conversation participant and apology expression data uttered by a second conversation participant from data corresponding to the speech of only a closing section of a conversation between the first conversation participant and the second conversation participant; and
an estimation unit that estimates a degree of satisfaction or a degree of dissatisfaction of the first conversation participant in the conversation according to a detection result of the specific expression data.
2. The conversation analysis device according to claim 1, wherein
the expression detection unit includes a speech recognition unit that performs speech recognition processing on the speech data of the closing section of the conversation using a speech recognition parameter obtained by weighting a reference speech recognition parameter, adapted to speech recognition of a predetermined form of conversation including the conversation, so that the specific expression data is more easily recognized than other word data, and
the expression detection unit detects the specific expression data from the speech text data of the closing section of the conversation obtained by the speech recognition processing of the speech recognition unit.
3. The conversation analysis device according to claim 1 or 2, wherein
the expression detection unit counts the number of detections of at least one of the thank-you expression data and the apology expression data by detecting the specific expression data based on a specific expression table that holds the specific expression data such that the thank-you expression data and the apology expression data are distinguishable from each other, and
the estimation unit estimates at least one of the degree of satisfaction and the degree of dissatisfaction of the first conversation participant in the conversation based on the number of detections of the thank-you expression data or the number of detections of the apology expression data.
4. The conversation analysis device according to claim 1 or 2, wherein
the expression detection unit counts the number of detections of the thank-you expression data and the number of detections of the apology expression data by detecting the specific expression data based on a specific expression table that holds the specific expression data such that the thank-you expression data and the apology expression data are distinguishable from each other, and
the estimation unit estimates at least one of the degree of satisfaction and the degree of dissatisfaction of the first conversation participant in the conversation based on the number of detections of the thank-you expression data and the number of detections of the apology expression data.
5. A conversation analysis method executed by at least one computer, the method comprising:
detecting, as specific expression data, at least one of thank-you expression data uttered by a first conversation participant and apology expression data uttered by a second conversation participant from data corresponding to the speech of only a closing section of a conversation between the first conversation participant and the second conversation participant; and
estimating a degree of satisfaction or a degree of dissatisfaction of the first conversation participant in the conversation according to a detection result of the specific expression data.
6. The conversation analysis method according to claim 5, further comprising
performing speech recognition processing on the speech data of the closing section of the conversation using a speech recognition parameter obtained by weighting a reference speech recognition parameter, adapted to speech recognition of a predetermined form of conversation including the conversation, so that the specific expression data is more easily recognized than other word data,
wherein the detecting of the specific expression data detects the specific expression data from the speech text data of the closing section of the conversation obtained by the speech recognition processing.
7. The conversation analysis method according to claim 5 or 6, further comprising
counting the number of detections of at least one of the thank-you expression data and the apology expression data by detecting the specific expression data based on a specific expression table that holds the specific expression data such that the thank-you expression data and the apology expression data are distinguishable from each other,
wherein the estimating estimates at least one of the degree of satisfaction and the degree of dissatisfaction of the first conversation participant in the conversation based on the number of detections of the thank-you expression data or the number of detections of the apology expression data.
8. The conversation analysis method according to claim 5 or 6, further comprising
counting the number of detections of the thank-you expression data and the number of detections of the apology expression data by detecting the specific expression data based on a specific expression table that holds the specific expression data such that the thank-you expression data and the apology expression data are distinguishable from each other,
wherein the estimating estimates at least one of the degree of satisfaction and the degree of dissatisfaction of the first conversation participant in the conversation based on the number of detections of the thank-you expression data and the number of detections of the apology expression data.
9. A program that causes at least one computer to execute the conversation analysis method according to any one of claims 5 to 8.
PCT/JP2013/075243 2012-10-31 2013-09-19 Conversation analysis device and conversation analysis method WO2014069121A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2014544379A JP6365304B2 (en) 2012-10-31 2013-09-19 Conversation analyzer and conversation analysis method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2012-240750 2012-10-31
JP2012240750 2012-10-31

Publications (1)

Publication Number Publication Date
WO2014069121A1 true WO2014069121A1 (en) 2014-05-08

Family

ID=50627037

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2013/075243 WO2014069121A1 (en) 2012-10-31 2013-09-19 Conversation analysis device and conversation analysis method

Country Status (2)

Country Link
JP (1) JP6365304B2 (en)
WO (1) WO2014069121A1 (en)


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4972107B2 (en) * 2009-01-28 2012-07-11 日本電信電話株式会社 Call state determination device, call state determination method, program, recording medium
US20100332287A1 (en) * 2009-06-24 2010-12-30 International Business Machines Corporation System and method for real-time prediction of customer satisfaction
JP5533219B2 (en) * 2010-05-11 2014-06-25 セイコーエプソン株式会社 Hospitality data recording device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010041507A1 (en) * 2008-10-10 2010-04-15 インターナショナル・ビジネス・マシーンズ・コーポレーション System and method which extract specific situation in conversation
JP2012047875A (en) * 2010-08-25 2012-03-08 Nippon Telegr & Teleph Corp <Ntt> Business section extracting method and device, and program therefor
JP2013156524A (en) * 2012-01-31 2013-08-15 Fujitsu Ltd Specific phoning detection device, specific phoning detection method and specific phoning detecting computer program

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104750674A (en) * 2015-02-17 2015-07-01 北京京东尚科信息技术有限公司 Man-machine conversation satisfaction degree prediction method and system
JP2019101399A (en) * 2017-11-30 2019-06-24 日本電信電話株式会社 Favorability estimating apparatus, favorability estimating method, and program
JP2020126185A (en) * 2019-02-06 2020-08-20 日本電信電話株式会社 Voice recognition device, search device, voice recognition method, search method and program
JP7177348B2 (en) 2019-02-06 2022-11-24 日本電信電話株式会社 Speech recognition device, speech recognition method and program
WO2023119992A1 (en) * 2021-12-24 2023-06-29 ソニーグループ株式会社 Information processing device, information processing method, and program

Also Published As

Publication number Publication date
JPWO2014069121A1 (en) 2016-09-08
JP6365304B2 (en) 2018-08-01

Similar Documents

Publication Publication Date Title
JP6341092B2 (en) Expression classification device, expression classification method, dissatisfaction detection device, and dissatisfaction detection method
EP2717258B1 (en) Phrase spotting systems and methods
WO2014069076A1 (en) Conversation analysis device and conversation analysis method
JP6358093B2 (en) Analysis object determination apparatus and analysis object determination method
US9293133B2 (en) Improving voice communication over a network
US10592611B2 (en) System for automatic extraction of structure from spoken conversation using lexical and acoustic features
US9269357B2 (en) System and method for extracting a specific situation from a conversation
US8417524B2 (en) Analysis of the temporal evolution of emotions in an audio interaction in a service delivery environment
US9711167B2 (en) System and method for real-time speaker segmentation of audio interactions
JP6213476B2 (en) Dissatisfied conversation determination device and dissatisfied conversation determination method
CN102254556A (en) Estimating a Listener&#39;s Ability To Understand a Speaker, Based on Comparisons of Their Styles of Speech
US10199035B2 (en) Multi-channel speech recognition
JP6365304B2 (en) Conversation analyzer and conversation analysis method
CN113744742A (en) Role identification method, device and system in conversation scene
JP6327252B2 (en) Analysis object determination apparatus and analysis object determination method
JP7287006B2 (en) Speaker Determining Device, Speaker Determining Method, and Control Program for Speaker Determining Device
WO2014069443A1 (en) Complaint call determination device and complaint call determination method
WO2014069444A1 (en) Complaint conversation determination device and complaint conversation determination method
CN116975242A (en) Voice broadcast interrupt processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13851196

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2014544379

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13851196

Country of ref document: EP

Kind code of ref document: A1