JP7394192B2 - Audio processing device, audio processing method, and program

Info

Publication number: JP7394192B2
Application number: JP2022150288A
Authority: JP
Inventors: 敏之中谷; 君慧末永; 俊雄今村; 啓祐阪下; 周平 ▲高▼原
Original assignee: SoftBank Corp
Current assignee: SoftBank Corp
Priority date: 2021-11-25
Filing date: 2022-09-21
Publication date: 2023-12-07
Anticipated expiration: 2041-11-25
Also published as: JP2023078068A; JP2023164770A; JP7164793B1; JP2023077444A

Description

The present invention relates to an audio processing system, an audio processing device, and an audio processing method.

Conventionally, various call centers have been operated in which operators respond by telephone to customer complaints and the like in order to improve customer satisfaction (CS). In such customer service work, "customer harassment," in which customers behave in an intimidating manner or make unreasonable demands toward operators, is regarded as a problem because it can cause mental health problems among operators and raise operator turnover.

In recent years, voice conversion systems have also been studied as a way for companies to protect operators, who are their employees, from such customer harassment. For example, Patent Document 1 describes calculating the volume and the amount of pitch variation from an input audio signal and, when these exceed predetermined values, converting the volume and pitch before output so that the volume and the amount of pitch variation fall within predetermined ranges.

Patent Document 1: Japanese Patent Application Publication No. 2004-252085

However, merely converting the speaker's uttered speech by, for example, the method described in Patent Document 1 may not sufficiently suppress the emotion of the speaker (first user) and may not sufficiently reduce the stress of the listener (second user). On the other hand, if the speaker's uttered speech output to the listener is converted in order to reduce the listener's stress, the listener may be unable to fully recognize the speaker's emotion and therefore unable to respond appropriately.

Accordingly, the present invention provides an audio processing system, an audio processing device, and an audio processing method that sufficiently reduce the listener's stress and/or enable the listener to respond appropriately.

An audio processing system according to one aspect of the present invention includes: an acquisition unit that acquires an uttered speech signal, which is a signal of speech uttered by a first user; a speech recognition unit that inputs feature quantities extracted based on the uttered speech signal into a speech recognition model to generate text data including a word string consisting of one or more words; a speech synthesis unit that inputs feature quantities extracted based on the text data into a speech synthesis model to generate a synthesized speech signal, which is a signal of synthesized speech; and an audio output unit that outputs the synthesized speech to a second user.

According to this aspect, text data is generated based on the first user's uttered speech signal, and synthesized speech generated based on that text data is output to the second user. The second user therefore hears synthesized speech in which the customer's emotion contained in the first user's uttered speech is sufficiently suppressed, and the second user's stress caused by the first user's emotional utterances can be sufficiently reduced.

In the above audio processing system, the emotion recognition unit may generate emotion information of the first user corresponding to the uttered speech signal acquired by the acquisition unit by inputting that uttered speech signal, speech feature quantities extracted from it, text data generated from it, text feature quantities corresponding to that text data, or a combination of at least two of these into an emotion recognition model. The emotion recognition model is machine-learned to take as input an uttered speech signal, feature quantities extracted from the uttered speech signal, text data generated from the uttered speech signal, feature quantities extracted from the text data, or a combination of at least two of these, and to output emotion information of the speaker of the uttered speech signal.

A diagram showing a schematic example of the audio processing system 1 according to the present embodiment.
A diagram showing an example of the physical configuration of each device constituting the audio processing system 1 according to the present embodiment.
A diagram showing an example of the functional configuration of the audio processing device 10 according to the present embodiment.
A diagram showing an example of the generation of a synthesized speech signal according to the present embodiment.
A diagram showing an example of the generation of customer emotion information according to the present embodiment.
A diagram showing an example of the generation of customer emotion information according to the present embodiment.
A diagram showing an example of the functional configuration of the operator terminal 20 according to the present embodiment.
A diagram showing an example of the screen D1 according to the present embodiment.
A diagram showing an example of the screen D2 according to the present embodiment.
A flowchart showing an example of the emotion suppression operation according to the present embodiment.
A flowchart showing the automatic switching operation of the emotion suppression function according to the present embodiment.
A diagram showing an example of the generation of a synthesized speech signal according to a modification of the present embodiment.
A diagram showing an example of the screen D3 according to the present embodiment.

Embodiments of the present invention will be described with reference to the accompanying drawings. In the drawings, elements given the same reference numerals have the same or similar configurations.

The following description assumes that the audio processing system according to the present embodiment is used in customer service work such as at a call center, but the application of the present invention is not limited to this. The present embodiment is applicable to any situation in which audio generated by applying predetermined processing to a signal of speech uttered by a first user (hereinafter, "uttered speech signal") is output to a second user. In the following, the first user is assumed to be a customer and the second user an operator, but this is not limiting.

(Configuration of the audio processing system)
<Overall configuration>
FIG. 1 is a diagram showing a schematic example of the audio processing system 1 according to the present embodiment. As shown in FIG. 1, the audio processing system 1 includes an audio processing device 10, a terminal 20 used by the second user (hereinafter, "operator"), referred to below as the "operator terminal," and a terminal 30 used by the first user (hereinafter, "customer"), referred to below as the "customer terminal."

The audio processing device 10 receives, via a network 40, the uttered speech signal acquired by the customer terminal 30. The network 40 may be an external network such as the Internet, or may include an external network and an internal network such as a Local Area Network (LAN). The audio processing device 10 transmits to the operator terminal 20 audio obtained by applying predetermined processing to the customer's uttered speech signal. The audio processing device 10 may consist of one or more servers.

The operator terminal 20 is, for example, a telephone, a smartphone, a personal computer, or a tablet. The operator terminal 20 outputs audio to the operator based on the audio signal generated by the predetermined processing in the audio processing device 10 or on the uttered speech signal from the customer terminal 30.

The customer terminal 30 is, for example, a telephone, a smartphone, a personal computer, or a tablet. The customer terminal 30 picks up the customer's uttered speech with a microphone and transmits the uttered speech signal, which is the signal of that speech, to the audio processing device 10.

<Physical configuration>
FIG. 2 is a diagram showing an example of the physical configuration of each device constituting the audio processing system 1 according to the present embodiment. Each device (for example, the audio processing device 10, the operator terminal 20, and the customer terminal 30) has a processor 10a corresponding to an arithmetic unit, a RAM (Random Access Memory) 10b corresponding to a storage unit, a ROM (Read Only Memory) 10c corresponding to a storage unit, a communication unit 10d, an input unit 10e, a display unit 10f, a camera 10g, an audio input unit 10h, and an audio output unit 10i. These components are connected to one another via a bus so that they can exchange data. The configuration shown in FIG. 2 is an example; each device may have components other than these, or may lack some of them.

The processor 10a is, for example, a CPU (Central Processing Unit). The processor 10a is a control unit that controls various processes in each device by executing programs stored in the RAM 10b or the ROM 10c. Working together with the other components of each device and with the programs, the processor 10a realizes the functions of each device and controls the execution of processing. The processor 10a receives various data from the input unit 10e and the communication unit 10d, and displays the results of computation on the display unit 10f or stores them in the RAM 10b.

The RAM 10b and the ROM 10c are storage units that store data required for various processes and data resulting from processing. In addition to the RAM 10b and the ROM 10c, each device may include a large-capacity storage unit such as a hard disk drive. The RAM 10b and the ROM 10c may be composed of, for example, semiconductor memory elements.

The communication unit 10d is an interface that connects each device to other equipment and communicates with that equipment. The input unit 10e is a device for accepting data input from a user or for inputting data from outside the device. The input unit 10e may include, for example, a keyboard, a mouse, and a touch panel. The display unit 10f is a device that displays information under the control of the processor 10a. The display unit 10f may be, for example, an LCD (Liquid Crystal Display).

The camera 10g includes an image sensor that captures still or moving images, and generates a captured image (for example, a still image or a moving image) by imaging a predetermined area. The audio input unit 10h is a device that picks up sound, for example a microphone. The audio output unit 10i is a device that outputs sound, for example a speaker.

The program that causes each device to operate may be provided stored in a computer-readable storage medium such as the RAM 10b or the ROM 10c, or may be provided via the network 40 to which the device is connected through the communication unit 10d. In each device, the processor 10a executes the program to realize the various operations for controlling that device. These physical components are examples and need not be independent of one another. For example, each device may include an LSI (Large-Scale Integration) in which the processor 10a is integrated with the RAM 10b and the ROM 10c.

<Functional configuration>
≪Audio processing device≫
FIG. 3 is a diagram showing an example of the functional configuration of the audio processing device 10 according to the present embodiment. The audio processing device 10 includes a storage unit 101, a transmission/reception unit 102, a speech recognition unit 103, a removal unit 104, a speech synthesis unit 105, an emotion recognition unit 106, a stress recognition unit 107, a control unit 108, and a learning unit 109.

The storage unit 101 stores various information, programs, algorithms, models, operation logs, and the like. Specifically, the storage unit 101 stores a speech recognition model 101a, a speech synthesis model 101b, an emotion recognition model 101c, a stress recognition model 101d, an emotion suppression switching model 101e, and so on, which are described later.

The transmission/reception unit 102 transmits and/or receives various information and/or signals to and from the operator terminal 20 and/or the customer terminal 30. For example, the transmission/reception unit 102 (acquisition unit) acquires the uttered speech signal, which is the signal of the customer's uttered speech picked up by the customer terminal 30. The transmission/reception unit 102 transmits the synthesized speech signal and/or the uttered speech signal to the operator terminal 20. The transmission/reception unit 102 may also acquire from the operator terminal 20 a log of the operator's operations. The operation log may include information on the operator's subjective evaluation of the customer's emotion (hereinafter, "subjective evaluation information"), the "degree of stress" described later, and the "manual switching history data" described later. The transmission/reception unit 102 may also transmit information on the customer's emotion (hereinafter, "emotion information") and the like to the operator terminal 20.

The speech recognition unit 103 inputs feature quantities extracted based on the uttered speech signal acquired by the transmission/reception unit 102 (hereinafter, "speech feature quantities") into the speech recognition model 101a to generate text data including a word string consisting of one or more words. Specifically, the speech recognition unit 103 may generate a word string from the speech feature quantities using the acoustic model of the speech recognition model 101a, and generate the text data according to the result of analyzing the word string with the language model. The speech recognition unit 103 may extract the speech feature quantities by preprocessing the uttered speech signal (for example, digitizing an analog signal, removing noise, or applying a Fourier transform).
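The Fourier-transform step of this preprocessing can be illustrated with a minimal sketch. It assumes the signal is already digitized, splits it into overlapping frames, and takes the magnitude of a naive discrete Fourier transform per frame. The function names, frame length, hop size, and test tone are hypothetical choices for illustration; the patent does not specify which feature quantities are used.

```python
import cmath
import math

def frame_signal(samples, frame_len, hop):
    """Split a sample sequence into overlapping frames of equal length."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2 of one frame (naive O(N^2) transform)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]

# Illustrative input: a 1 kHz tone sampled at 8 kHz (assumed values).
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(256)]

frames = frame_signal(tone, frame_len=64, hop=32)
features = [magnitude_spectrum(f) for f in frames]

# Index of the strongest frequency bin in the first frame.
peak_bin = max(range(len(features[0])), key=features[0].__getitem__)
```

With a 64-sample frame at 8 kHz, each bin spans 125 Hz, so the 1 kHz tone falls in bin 8. A practical system would more likely use a fast Fourier transform and derived features such as mel-scale coefficients.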

The speech recognition model 101a is an algorithm that estimates the content of speech based on an audio signal. The speech recognition model 101a may include an acoustic model, which models what sounds a given word is likely to appear as, and/or a language model, which models the probability with which a given word string appears in a particular language. As the acoustic model, for example, a Hidden Markov Model (HMM) and/or a Deep Neural Network (DNN) may be used. As the language model, for example, a probabilistic language model such as an n-gram language model may be used.
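As an illustration of how an n-gram language model assigns a probability to a word string, the following is a minimal bigram sketch with add-alpha smoothing. The toy corpus, the smoothing constant, and the function names are assumptions made for this example and are not taken from the patent.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Count context words and bigrams over tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        unigrams.update(padded[:-1])                   # every word that acts as a context
        bigrams.update(zip(padded[:-1], padded[1:]))   # adjacent word pairs
    return unigrams, bigrams

def sequence_probability(tokens, unigrams, bigrams, vocab_size, alpha=1.0):
    """Probability of a word string under an add-alpha smoothed bigram model."""
    padded = ["<s>"] + list(tokens) + ["</s>"]
    prob = 1.0
    for prev, word in zip(padded[:-1], padded[1:]):
        prob *= (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab_size)
    return prob

# Toy training corpus (invented for illustration).
corpus = [
    ["the", "bill", "is", "wrong"],
    ["the", "bill", "is", "late"],
    ["the", "line", "is", "slow"],
]
unigrams, bigrams = train_bigram_model(corpus)
vocab = {w for s in corpus for w in s} | {"</s>"}

likely = sequence_probability(["the", "bill", "is", "wrong"], unigrams, bigrams, len(vocab))
unlikely = sequence_probability(["wrong", "is", "bill", "the"], unigrams, bigrams, len(vocab))
```

A recognizer can use such scores to prefer the candidate word string that is more plausible in the language; here `likely` is larger than `unlikely` because its word pairs were observed in the corpus.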

The removal unit 104 detects a specific word string contained in the text data generated by the speech recognition unit 103, generates text data in which that specific word string has been removed or replaced with another word string, and outputs it to the speech synthesis unit 105. If no specific word string is detected in the text data generated by the speech recognition unit 103, the removal unit 104 may output that text data to the speech synthesis unit 105 as is.

The specific word string may be, for example, one or more words that have an adverse psychological effect on the listener, such as words that insult the listener, deny the listener's character, or make the listener uncomfortable. Here, each word may include at least one part of speech such as a noun, verb, adverb, particle, adjective, or auxiliary verb, or a phonetically altered form of such a part of speech. For example, the specific word string may be a "sentence" such as 「お前、ぶっ殺すぞ」 ("I'm going to kill you"), or a "part of a sentence" that marks rough language, such as 「っつってん」 in 「困るっつってんの」 (a coarse way of saying "I'm telling you this is a problem"). The removal unit 104 may output to the speech synthesis unit 105 text data in which only the specific word string detected in the text data has been replaced with another word string, or text data in which the whole sentence containing the specific word string has been replaced with another word string. The other word string may be a blank or the like.

The removal unit 104 may detect specific word strings in the text data and/or replace them with other word strings based on specific word strings stored in advance in the storage unit 101.
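The dictionary-based variant can be sketched as a lookup table from specific word strings to their replacements, applied longest-first so that a full phrase is matched before any shorter string it contains. The table contents reuse the examples given in this description; the table name, the function name, and the longest-first rule are assumptions for illustration, not the patented implementation.

```python
# Hypothetical pre-stored table: specific word string -> replacement ("" removes it).
SPECIFIC_WORD_STRINGS = {
    "お前、ぶっ殺すぞ": "",      # threatening sentence, removed entirely
    "っつってん": "という",      # coarse contraction, replaced with a neutral form
    "お前": "あなた",            # rude "you", replaced with a neutral "you"
}

def filter_text(text, table=SPECIFIC_WORD_STRINGS):
    """Return (filtered_text, detected), where detected lists the matched strings."""
    detected = []
    # Longest entries first, so a whole phrase wins over a shorter substring.
    for phrase in sorted(table, key=len, reverse=True):
        if phrase in text:
            detected.append(phrase)
            text = text.replace(phrase, table[phrase])
    return text, detected

filtered, detected = filter_text("お前、困るっつってんの")
```

Plain substring matching is only a starting point: it cannot tell a harmless occurrence from an abusive one, which is one reason a learned model may be preferable.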

Alternatively, the removal unit 104 may detect specific word strings in the text data and/or replace them with other word strings whose semantic emotion is softened, based on a model trained by machine learning. For example, the specific word string 「お前」 (a rude form of "you") in the text data may be replaced with 「あなた」 (a neutral form of "you"). Detection of specific word strings in the text data and/or their replacement with other word strings may thus be performed by a model obtained through machine learning.

When a specific word string is detected in the text data, the removal unit 104 may generate information on the detection of that specific word string (hereinafter, "detection information"). The detection information may include, for example, at least one of: information indicating that the specific word string was detected (for example, the character string "NG word" or "NG word detected"), information indicating the specific word string itself, and information on a warning to the customer (hereinafter, "warning information"). The warning information may be, for example, information for notifying the customer that what the customer has said to the operator could be the subject of a criminal complaint for an offense such as criminal insult or defamation. The detection information may be transmitted to the operator terminal 20 by the transmission/reception unit 102. When detection information is generated, the audio processing device 10 may cause the customer terminal 30 to output the warning information (for example, "Your remarks to our operator may constitute an offense such as criminal insult. We recognize that there may have been shortcomings on our part, but such remarks can place an excessive burden on our operators, so we would appreciate your cooperation."). Such warning information can be used as advance notice regarding customer harassment.
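One possible shape for the detection information is a small record holding the three kinds of content listed above. The class name, field names, and abbreviated warning text below are hypothetical; the patent states only what the information may contain, not how it is encoded.

```python
from dataclasses import dataclass
from typing import List, Optional

# Abbreviated, illustrative warning wording (not the full text quoted above).
WARNING_TEXT = "Your remarks may constitute an offense such as criminal insult."

@dataclass
class DetectionInfo:
    label: str                      # e.g. "NG word detected"
    word_strings: List[str]         # the specific word strings that were found
    warning: Optional[str] = None   # warning information for the customer, if any

def make_detection_info(detected: List[str], warn: bool = True) -> Optional[DetectionInfo]:
    """Build detection information, or return None when nothing was detected."""
    if not detected:
        return None
    return DetectionInfo("NG word detected", list(detected),
                         WARNING_TEXT if warn else None)

info = make_detection_info(["っつってん"])
nothing = make_detection_info([])
```

Returning `None` when the list is empty lets the caller skip both the notification to the operator terminal and the warning to the customer terminal with a single check.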

The speech synthesis unit 105 inputs feature quantities extracted based on the text data input from the removal unit 104 (hereinafter, "text feature quantities") into the speech synthesis model 101b to generate a signal of synthesized speech (hereinafter, "synthesized speech signal"). Specifically, the speech synthesis unit 105 may predict speech synthesis parameters based on the text feature quantities and generate the synthesized speech signal using the predicted parameters. The speech synthesis unit 105 outputs the synthesized speech signal to the transmission/reception unit 102. The synthesized speech signal can also be described as a signal of speech that reads out the content of the text data.

The speech synthesis model 101b is an algorithm that takes text data as input and outputs a synthesized speech signal corresponding to the content of that text data. As the speech synthesis model 101b, for example, the HMM and/or DNN mentioned above may be used.

The speech synthesis model 101b may support multiple voice types. The speech synthesis unit 105 may select, from among the multiple voice types, the voice type to use for the synthesized speech signal, input the selected voice type and the text data into the speech synthesis model 101b, and synthesize a synthesized speech signal of the selected voice type. The multiple voice types may include, for example, at least one of a voice with little intonation, a mechanical voice, a character's voice, a celebrity's voice, and a voice actor's voice. The speech synthesis unit 105 may accept the selection of the voice type from the operator via the operator terminal 20.

FIG. 4 is a diagram showing an example of the generation of a synthesized speech signal according to the present embodiment. In FIG. 4, it is assumed that the speech recognition unit 103 generates text data T1 to T3 based on uttered speech signals S1 to S3 acquired by the transmission/reception unit 102. For example, in FIG. 4, the removal unit 104 detects no specific word string in text data T1 and therefore outputs T1 unchanged to the speech synthesis unit 105. By contrast, the removal unit 104 detects specific word strings in text data T2 and T3 (「お前、ぶっ殺すぞ」 in T2 and 「っつってん」 in T3), and therefore outputs to the speech synthesis unit 105 text data T2' and T3' in which those specific word strings have been removed or replaced. For example, in text data T2', the specific word string in T2 is replaced with a blank (□). In text data T3', the specific word string 「っつってん」 in T3 is replaced with 「という」. The speech synthesis unit 105 generates synthesized speech signals S1, S2' and S3' from text data T1, T2' and T3', respectively.
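The flow of FIG. 4 can be sketched as three stages chained together: recognition, removal, and synthesis. In the sketch below all three stages are stand-ins. Recognition is a lookup of canned transcripts (the text for T1 is invented, since the source does not give it), removal is two hard-coded replacements, and synthesis returns a tagged tuple in place of audio. Real components would wrap the speech recognition model 101a and the speech synthesis model 101b.

```python
def run_pipeline(signal, recognize, remove, synthesize):
    """Recognize text from a signal, filter it, then synthesize from the filtered text."""
    text = recognize(signal)
    filtered = remove(text)
    return synthesize(filtered)

# Stand-in recognizer: canned transcripts keyed by signal name (T1 text is invented).
transcripts = {
    "S1": "請求額が違います",
    "S2": "お前、ぶっ殺すぞ",
    "S3": "困るっつってんの",
}

def remove(text):
    """Stand-in for the removal unit: drop the threat, soften the contraction."""
    return text.replace("お前、ぶっ殺すぞ", "").replace("っつってん", "という")

def synthesize(text):
    """Stand-in for the synthesizer: a tagged tuple in place of an audio signal."""
    return ("synthesized", text)

outputs = {name: run_pipeline(name, transcripts.get, remove, synthesize)
           for name in transcripts}
```

Because only text crosses from recognition to synthesis, the volume and intonation of the original utterance are not carried into the output, which is the basis of the emotion suppression described here.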

The emotion recognition unit 106 generates emotion information of the customer based on at least one of the uttered speech signal acquired by the transmission/reception unit 102, the text data generated by the speech recognition unit 103, and the subjective evaluation information received by the transmission/reception unit 102. The emotion recognition unit 106 may generate the customer's emotion information based on speech feature quantities (for example, intonation and volume) extracted based on the uttered speech signal. The emotion recognition unit 106 may generate the customer's emotion information based on the fact that a specific word string was detected in the text data generated from the uttered speech signal, or that no specific word string was detected for a predetermined time or longer. The emotion recognition unit 106 may generate the customer's emotion information based on a captured image of the customer acquired by the camera 10g. The emotion recognition unit 106 may generate the customer's emotion information using the emotion recognition model 101c.

The emotion recognition model 101c is a model that takes as input an uttered speech signal, speech feature quantities extracted from that signal, text data generated from that signal, text feature quantities, or a combination of at least two of these, and outputs emotion information, that is, the emotion of the customer corresponding to that uttered speech signal.

FIG. 5A is an explanatory diagram of the learning process of the emotion recognition model 101c. For example, the emotion recognition model 101c may be trained using a set of a plurality of data items (hereinafter referred to as a "data set"), each of which includes at least one of voice features extracted from an uttered voice signal, text features extracted from text data, and "subjective evaluation information" provided by an operator (or features extracted from the subjective evaluation information). The subjective evaluation information is information in which an operator, having listened to the customer's uttered voice signal, subjectively evaluates the customer's emotion. For example, the operator may rate the customer's anger on a plurality of levels, such as anger levels 1 to 10. The data set for training the emotion recognition model 101c may be generated, for example, as follows. An operator listens to the customer's raw uttered voice signal and annotates the customer's emotion estimated from that signal (that is, attaches "subjective evaluation information" to the uttered voice signal). This yields information in which the uttered voice signal and the customer's emotion estimated from it are associated on the time axis. When a plurality of operators attach subjective evaluation information to a plurality of uttered voice signals, a data set that is a collection of such information is obtained. The emotion recognition model 101c may be trained by supervised machine learning using such a data set. Note that the data set used for training the emotion recognition model 101c may include the uttered voice signal in addition to or instead of the voice features, and may include the text data in addition to or instead of the text features.
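The data set generation described above can be illustrated with a minimal Python sketch. The record layout (`AnnotatedSegment`), the field names, and the validation rules are assumptions made for illustration only; the description specifies only that annotations are associated with the uttered voice signal on the time axis and that annotations from a plurality of operators are collected into a data set.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class AnnotatedSegment:
    """One annotated span of an uttered voice signal (hypothetical layout)."""
    utterance_id: str          # which uttered voice signal the span belongs to
    start_sec: float           # position on the time axis
    end_sec: float
    operator_id: str           # operator who gave the subjective evaluation
    anger_level: int           # subjective evaluation, e.g. anger level 1 to 10
    voice_features: Optional[Sequence[float]] = None
    text: Optional[str] = None


def build_dataset(per_operator: List[List[AnnotatedSegment]]) -> List[AnnotatedSegment]:
    """Merge the annotations of several operators into one data set."""
    dataset: List[AnnotatedSegment] = []
    for segments in per_operator:
        for seg in segments:
            if not 1 <= seg.anger_level <= 10:
                raise ValueError("anger_level must be between 1 and 10")
            if seg.end_sec <= seg.start_sec:
                raise ValueError("segment must have positive duration")
            if seg.voice_features is None and seg.text is None:
                raise ValueError("segment needs voice features or text")
            dataset.append(seg)
    # Keep the records ordered along the time axis of each utterance.
    dataset.sort(key=lambda s: (s.utterance_id, s.start_sec, s.operator_id))
    return dataset
```

Each record pairs a label (here the anger level) with the inputs for the same time span, which is the form a supervised learner would consume.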

FIG. 5B is an explanatory diagram of estimation processing using the emotion recognition model 101c. For example, as shown in FIG. 5B, voice features extracted from an uttered voice signal S1 and/or text features extracted from text data T1 generated from the uttered voice signal S1 are input to the emotion recognition model 101c, which yields an output corresponding to the input, namely emotion information corresponding to the uttered voice signal. Note that the uttered voice signal S1 may be input to the emotion recognition model 101c in addition to or instead of the voice features, and the text data T1 may be input in addition to or instead of the text features.

The subjective evaluation information may indicate numerically the degree of one or more emotions (for example, at least one of "happiness," "surprise," "fear," "anger," "disgust," and "sadness"). Alternatively, the emotion information may indicate a specific emotion (for example, "anger") that the customer is likely to be feeling.

The stress recognition unit 107 generates information on the operator's stress state (hereinafter referred to as "stress information"). For example, the stress recognition unit 107 may estimate the operator's stress state by a conventionally known method, based on vital data such as the operator's heart rate, amount of perspiration, and respiration volume, or on image information such as the operator's line of sight and facial expression collected using a camera. The stress recognition unit 107 may also estimate the operator's stress state based on the operator's own uttered voice. Specifically, it may do so based on changes in the tone or speed of the operator's speech, the appearance of words of apology, the operator talking over the customer, and the like. The stress recognition unit 107 may also estimate the operator's stress state based on the operation log of the operator terminal 20. Specifically, it may do so according to the movement of a mouse or the like, or the absence of operation input in a situation where an operation is expected. The stress recognition unit 107 may generate the stress information based on a stress recognition model 101d. The stress recognition model 101d is a model that takes as input an uttered voice signal, voice features extracted from that signal, text data generated from that signal, text features, or a combination of at least two of these, and outputs an estimate of the stress felt by the operator listening to the uttered voice. The stress recognition model 101d may be trained using measured values of the stress that operators actually felt on hearing customers' uttered voices. The data set for training the stress recognition model 101d may be generated, for example, as follows. An operator annotates the degree of stress felt on hearing the customer's uttered voice (for example, a level from 1 to 10); that is, the operator attaches the "degree of stress" he or she felt to the uttered voice signal. This yields information in which the uttered voice signal and the operator's stress on hearing it are associated on the time axis. When a plurality of operators attach degrees of stress to a plurality of uttered voice signals, a data set that is a collection of such information is obtained. The stress recognition model 101d may be trained by supervised machine learning using such a data set.

The control unit 108 performs various controls relating to the audio processing device 10. Specifically, based on the stress information generated by the stress recognition unit 107, the control unit 108 may switch whether the operator terminal 20 outputs the synthesized voice generated by the voice synthesis unit 105 or the customer's uttered voice. The control unit 108 may also switch, based on the stress information, whether or not a synthesized voice signal is generated from the uttered voice signal. For example, when the stress level indicated by the stress information is equal to or greater than (or strictly greater than) a predetermined threshold, the control unit 108 may perform control so that the synthesized voice, rather than the customer's uttered voice, is output to the operator. Conversely, when the stress level indicated by the stress information is less than (or less than or equal to) the predetermined threshold, the control unit 108 may perform control so that the uttered voice is output to the operator. When instruction information specifying automatic switching of the emotion suppression function is input by the operator, the control unit 108 may perform the above switching based on the stress information. The emotion suppression function is a function that outputs the synthesized voice to the operator in place of the customer's uttered voice.

The control unit 108 may perform the above switching based on the emotion information. The control unit 108 may also perform the switching based on the output of an emotion suppression switching model 101e. The emotion suppression switching model 101e is a model that takes as input an uttered voice signal, voice features, text data, text features, or a combination of at least two of these, and outputs the timing at which the emotion suppression function is to be switched on or off. The emotion suppression switching model 101e may additionally take stress information or emotion information as input. Details of the emotion suppression switching model 101e are described later.

The control unit 108 may also perform the above switching based on switching information input by the operator. Here, the switching information is information on switching the emotion suppression function applied to the customer's voice between applied (on) and not applied (off). For example, when the switching information indicates that the emotion suppression function is to be applied, the control unit 108 may perform control so that the synthesized voice is output to the operator. Conversely, when the switching information indicates that the emotion suppression function is not to be applied, the control unit 108 may perform control so that the uttered voice is output to the operator. When instruction information specifying manual switching of the emotion suppression function is input by the operator, the control unit 108 may perform the above switching based on the switching information.
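The switching performed by the control unit 108 can be summarized in a short Python sketch. The names `SwitchMode`, `Output`, and `select_output`, and the `inclusive` flag that chooses between "equal to or greater than" and "strictly greater than", are hypothetical and introduced here for illustration; the description does not prescribe an implementation.

```python
from enum import Enum


class SwitchMode(Enum):
    AUTO = "auto"      # switch according to the stress information
    MANUAL = "manual"  # switch according to the operator's switching information


class Output(Enum):
    SYNTHESIZED = "synthesized voice"
    CUSTOMER = "customer's uttered voice"


def select_output(mode: SwitchMode, stress_level: float, threshold: float,
                  manual_on: bool, inclusive: bool = True) -> Output:
    """Decide which voice is output to the operator."""
    if mode is SwitchMode.MANUAL:
        # Manual switching: follow the operator's on/off choice.
        suppress = manual_on
    elif inclusive:
        # Automatic switching, "equal to or greater than" variant.
        suppress = stress_level >= threshold
    else:
        # Automatic switching, "strictly greater than" variant.
        suppress = stress_level > threshold
    return Output.SYNTHESIZED if suppress else Output.CUSTOMER
```

In this sketch the mode only selects which input drives the decision, so the automatic and manual paths share the same output step.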

The learning unit 109 may perform the learning processing for the emotion recognition model 101c, the stress recognition model 101d, and the emotion suppression switching model 101e.

The audio processing device 10 may associate, on the time axis, any one of the items of information 1) to 7) below, or a combination of at least two of them, and transmit the result to the operator terminal 20 via the transmitting/receiving unit 102: 1) the customer's uttered voice signal, 2) the text data generated from the uttered voice signal, 3) the text data after processing by the removal unit 104, 4) the detection information, 5) the synthesized voice signal, 6) the customer's emotion information estimated from the customer's uttered voice signal, and 7) the timing at which the emotion suppression function is switched on or off. When the emotion suppression function is on, the audio processing device 10 need not send the customer's uttered voice signal to the operator terminal 20. When the emotion suppression function is off, the audio processing device 10 need not send the synthesized voice signal to the operator terminal 20. Alternatively, the audio processing device 10 may send both the customer's uttered voice signal and the synthesized voice signal to the operator terminal 20 regardless of whether the emotion suppression function is on or off.

≪Operator terminal≫

FIG. 6 is a diagram showing an example of the functional configuration of the operator terminal according to the present embodiment. The operator terminal 20 includes a transmitting/receiving unit 201, an input reception unit 202, and a control unit 203. Note that the functional configuration shown in FIG. 6 is merely an example, and the terminal may include other components not shown.

The transmitting/receiving unit 201 transmits and/or receives various information and/or signals to and from the audio processing device 10 and/or the customer terminal 30. For example, the transmitting/receiving unit 201 may receive an uttered voice signal, which is a signal of the customer's uttered voice picked up by the customer terminal 30. The transmitting/receiving unit 201 may receive the synthesized voice signal from the audio processing device 10. The transmitting/receiving unit 201 may also transmit subjective evaluation information to the audio processing device 10, and may receive the customer's emotion information from the audio processing device 10.

The input reception unit 202 accepts input of various information based on the operator's operation of the input unit 10e. For example, as part of the work of generating the data sets for training the emotion recognition model 101c and the stress recognition model 101d, the input reception unit 202 may accept input of subjective evaluation information and of the degree of stress for the customer's raw uttered voice signal. Hereinafter, the work in which an operator inputs subjective evaluation information or the degree of stress on the operator terminal 20 is referred to as "annotation work." Annotation work may be positioned as a task separate from normal call center duties. The input reception unit 202 may also accept input of the switching information for the emotion suppression function, and may accept input of instruction information specifying either manual switching or automatic switching of the emotion suppression function.

The control unit 203 performs various controls relating to the operator terminal 20. For example, the control unit 203 controls the display of information and/or images on the display unit 10f. The control unit 203 also controls the output of audio from the audio output unit 10i. The control unit 203 may control the audio output based on information transmitted from the audio processing device 10, or based on information accepted by the input reception unit 202.

The control unit 203 causes the audio output unit 10i to output the synthesized voice based on the synthesized voice signal received from the audio processing device 10. The control unit 203 may also cause the audio output unit 10i to output the uttered voice based on the uttered voice signal from the customer terminal 30.

Based on the emotion information received from the audio processing device 10, the control unit 203 may cause the display unit 10f to display the emotion information corresponding to the synthesized voice signal. The control unit 203 may also cause the display unit 10f to display the text data corresponding to the synthesized voice signal received from the audio processing device 10. For example, the control unit 203 may cause the display unit 10f to display a screen D1 that includes at least one of the emotion information, the text data, and the detection information. The control unit 203 may also cause the display unit 10f to display the stress information, for example on a screen D2 that includes the stress information.

FIG. 7 is a diagram showing an example of the screen D1 according to the present embodiment. As shown in FIG. 7, on the screen D1 the control unit 203 may cause the display unit 10f to display emotion information I1 in step with the output timing T of the synthesized voice from the audio output unit 10i. Because the emotion information I1 is displayed at each output timing T of the synthesized voice, the operator can recognize the customer's emotion in real time even while listening to synthesized voice in which the customer's emotion has been suppressed by the emotion suppression function.

On the screen D1, the control unit 203 may also cause the display unit 10f to display the content of text data I2 corresponding to the synthesized voice, in step with the output timing T of that synthesized voice. Displaying the content of the text data I2 lets the operator grasp what the customer said visually as well as through the synthesized voice.

On the screen D1, based on the detection information received from the audio processing device 10, the control unit 203 may also cause the display unit 10f to display information I3 indicating that a specific word string was detected (for example, "NG word detected") instead of displaying the specific word string itself. This function is referred to as the "NG word hiding function." It avoids exposing the operator directly to the content of customer utterances that would have an adverse psychological effect, and thus reduces the operator's stress. At the same time, the operator is notified that such an utterance occurred, so the operator can respond to the customer appropriately.

On the screen D1, based on the emotion information from the audio processing device 10, the control unit 203 may also cause the display unit 10f to display, as a time series, the level I4 of a specific emotion of the customer at each output timing T of the synthesized voice. For example, in FIG. 7 the level I4 of the customer's "anger" at each output timing T of the synthesized voice is shown as a line graph. This lets the operator easily follow the transition of the customer's specific emotion (for example, "anger"), which can improve the customer's satisfaction with the operator's response.

On the screen D1, the control unit 203 may cause the display unit 10f to display a selection button I5. The selection button I5 is an interface that lets the operator select whether the emotion suppression function is switched between applied (on) and not applied (off) automatically or manually. By clicking, tapping, sliding, or otherwise operating the selection button I5, the operator can switch between an "automatic switching mode" and a "manual switching mode." In the automatic switching mode, the emotion suppression function is switched on and off automatically based on, for example, the emotion information, the stress information, or the output of the emotion suppression switching model 101e.

When the "manual switching mode" is selected, the control unit 203 may cause the display unit 10f to display a switching button I6, which is an interface that lets the operator select whether or not the emotion suppression function is applied. The timing at which the operator switched the emotion suppression function on or off is associated on the time axis with the customer's uttered voice (and/or various features extracted from the uttered voice) and accumulated in a storage unit (not shown) as "manual switching history data." Identification information of the operator may additionally be associated with the "manual switching history data."

A switching button I7 is a button for switching the "NG word hiding function" on and off. When the "NG word hiding function" is off, the text data I2 as it was before processing by the removal unit 104 is displayed on the display unit 10f unchanged, even if a specific word string is detected in the text data I2. When the emotion suppression function is on and the NG word hiding function is off, the operator does not directly hear the specific word string spoken by the customer, which reduces stress, while still being able to see exactly what the customer said and therefore gauge the customer's emotion more accurately.
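As an illustration of the NG word hiding function and its on/off switch, the following Python sketch replaces each detected specific word string with a notice when the function is on and leaves the text untouched when it is off. The function name `render_transcript`, the notice text, and the use of plain substring matching are assumptions; the description does not specify how specific word strings are matched.

```python
from typing import List


def render_transcript(text: str, specific_word_strings: List[str],
                      hide_ng_words: bool = True,
                      notice: str = "[NG word detected]") -> str:
    """Return the text to show on the display unit for one utterance."""
    if not hide_ng_words:
        # Hiding function off: show the text as recognized.
        return text
    # Replace longer strings first so a shorter entry cannot split a longer one.
    for phrase in sorted(specific_word_strings, key=len, reverse=True):
        if phrase:
            text = text.replace(phrase, notice)
    return text
```

For example, with `["idiot"]` registered as a specific word string, `render_transcript("You idiot, fix it", ["idiot"])` yields `"You [NG word detected], fix it"`, while passing `hide_ng_words=False` returns the original sentence.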

The data set for training the emotion suppression switching model 101e may be a collection of data in which stress information, emotion information, the uttered voice signal S1, voice features, text data, text features, or a combination of at least two of these is associated on the time axis with the timing at which the operator switched the emotion suppression function on or off. The emotion suppression switching model 101e can be trained in various ways, for example as described in 1) to 3) below. 1) The emotion suppression switching model 101e may be trained per operator. That is, the emotion suppression switching model 101e applied to a given operator may be trained only on that operator's own "manual switching history data" for the emotion suppression function. With this method, the emotion suppression switching model 101e can switch the emotion suppression function at timings that match that operator's preferences. 2) Alternatively, the emotion suppression switching model 101e applied to a given operator may be trained on the "manual switching history data" of an unspecified large number of operators. With this method, more data is available for training, so the emotion suppression switching model 101e can be trained sooner. 3) Alternatively, the emotion suppression switching model 101e applied to a given operator may be trained on the "manual switching history data" of operators who are similar to that operator in age, gender, or other characteristics. With this method, more data is available for training than with method 1), so the model can be trained sooner, and compared with method 2) the model can learn switching timings that better match the operator's own preferences.
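The three ways of choosing training data described above can be sketched as a selection step over the manual switching history data. The record types, the strategy names ("own", "all", "similar"), and the similarity rule (same gender and an age gap of at most `max_age_gap` years) are hypothetical; the description only states that operators similar in age, gender, or other characteristics may be used.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SwitchEvent:
    """One entry of manual switching history data (hypothetical layout)."""
    operator_id: str
    time_sec: float    # position on the time axis of the call
    turned_on: bool    # True if the emotion suppression function was switched on


@dataclass(frozen=True)
class OperatorProfile:
    operator_id: str
    age: int
    gender: str


def select_history(events: List[SwitchEvent], target: OperatorProfile,
                   profiles: List[OperatorProfile], strategy: str,
                   max_age_gap: int = 5) -> List[SwitchEvent]:
    """Pick the history data used to train the model applied to `target`."""
    if strategy == "all":          # method 2): every operator's history
        return list(events)
    if strategy == "own":          # method 1): only the target operator
        ids = {target.operator_id}
    elif strategy == "similar":    # method 3): operators with similar attributes
        ids = {p.operator_id for p in profiles
               if p.gender == target.gender
               and abs(p.age - target.age) <= max_age_gap}
        ids.add(target.operator_id)  # assumption: own history is included
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [e for e in events if e.operator_id in ids]
```

The trade-off stated in the text shows up directly in the size of the returned list: "own" yields the least data, "all" the most, and "similar" falls in between.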

FIG. 8 is a diagram showing an example of the screen D2 according to the present embodiment. On the screen D2, the control unit 203 may display the stress information from the audio processing device 10. For example, in FIG. 8 the displayed stress information consists of information indicating the estimated value of the stress felt by the operator (for example, "56%") and information indicating an evaluation value relative to that operator's normal state (for example, "8.1% less than normal").
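One possible way to produce the two displayed values is sketched below in Python. The description does not define how the relative evaluation value is calculated; the sketch assumes it is the percentage change of the current estimate from a per-operator baseline, and the function name `format_stress` and the baseline parameter are hypothetical.

```python
from typing import Tuple


def format_stress(current_pct: float, baseline_pct: float) -> Tuple[str, str]:
    """Format the stress estimate and its change relative to the normal state."""
    if baseline_pct <= 0:
        raise ValueError("baseline_pct must be positive")
    # Assumed definition: relative change from the operator's normal level.
    change = (current_pct - baseline_pct) * 100.0 / baseline_pct
    estimate = f"{current_pct:.0f}%"
    if change == 0:
        relative = "same as normal"
    else:
        direction = "more" if change > 0 else "less"
        relative = f"{abs(change):.1f}% {direction} than normal"
    return estimate, relative
```

With a baseline of 60 and a current estimate of 54, this returns `("54%", "10.0% less than normal")`.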

FIG. 12 is a diagram showing an example of a screen D3 according to the present embodiment. On the screen D3, the control unit 203 may display an interface I8 with which the operator performs annotation work. For example, while listening to a customer's raw voice (sample voice), the operator selects from the interface I8, each time, the customer emotion perceived from the sample voice. In FIG. 12, the customer emotion I1 is the operator's subjective evaluation information on the customer's emotion. For example, if the operator annotates the sample voice "Please somehow deliver it by this evening" with the emotion "anger," then, as shown in FIG. 12, the sample voice "Please somehow deliver it by this evening" and the information "anger" are associated on the time axis. Annotation may be performed sentence by sentence or at predetermined time intervals.

(Operation of audio processing system)

FIG. 9 is a flowchart showing an example of the emotion suppression operation according to the present embodiment. Note that FIG. 9 is merely an example: the order of at least some of the steps (for example, step S106) may be changed, steps not shown may be performed, and some steps may be omitted.

The audio processing device 10 acquires an uttered voice signal, which is a signal of the customer's uttered voice picked up by the audio input unit 10h of the customer terminal 30 (S101).

The audio processing device 10 inputs features extracted from the uttered voice signal acquired in S101 to the speech recognition model 101a and generates text data including a word string consisting of one or more words (S102).

The audio processing device 10 determines whether the text data generated in S102 contains a specific word string (S103). If the text data contains a specific word string, the audio processing device 10 generates text data in which the specific word string has been removed or converted into another word string (S104).

The audio processing device 10 inputs features extracted from the text data to the speech synthesis model 101b and generates a synthesized voice signal, which is a signal of synthesized voice (S105).

The audio processing device 10 inputs features extracted from at least one of the uttered voice signal acquired in S101, the text data generated in S102, and the subjective evaluation information on the customer's emotion input by the operator to the emotion recognition model 101c and generates the customer's emotion information (S106).

The operator terminal 20 causes the audio output unit 10i to output the synthesized voice based on the synthesized voice signal generated in S105 and, in step with the output timing T of the synthesized voice, causes the display unit 10f to display the emotion information corresponding to that synthesized voice (S107; see, for example, FIG. 7).

The audio processing device 10 determines whether to end the processing (S108). If the processing is not to be ended (S108: NO), the audio processing device 10 executes the processing of S101 to S107 again. If the voice conversion processing is to be ended (S108: YES), the audio processing device 10 ends the processing.
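The flow of FIG. 9 (S101 to S108) can be outlined as a Python generator. This is a sketch under stated assumptions, not the implementation of the embodiment: the three models are passed in as plain callables (`recognize`, `synthesize`, `estimate_emotion`), the specific word strings are matched by simple substring search, and the end-of-processing check S108 is represented by exhaustion of the input sequence. All function and parameter names are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def remove_specific_words(text: str, specific_word_strings: List[str],
                          replacement: str = "") -> Tuple[str, bool]:
    """S103/S104: detect specific word strings and remove or replace them."""
    found = [w for w in specific_word_strings if w and w in text]
    for w in sorted(found, key=len, reverse=True):
        text = text.replace(w, replacement)
    # Collapse whitespace left behind by removed strings.
    return " ".join(text.split()), bool(found)


def emotion_suppression_loop(
    utterances: Iterable[bytes],
    recognize: Callable[[bytes], str],
    synthesize: Callable[[str], str],
    estimate_emotion: Callable[[bytes, str], str],
    specific_word_strings: List[str],
) -> Iterator[Tuple[str, str, bool]]:
    """Yield (synthesized voice, emotion information, detection flag) per utterance."""
    for signal in utterances:                      # S101; loop ends at S108
        text = recognize(signal)                   # S102: speech recognition
        clean, detected = remove_specific_words(   # S103, S104
            text, specific_word_strings)
        synthesized = synthesize(clean)            # S105: speech synthesis
        emotion = estimate_emotion(signal, text)   # S106: emotion recognition
        yield synthesized, emotion, detected       # S107: passed to operator terminal
```

Note that emotion estimation uses the original signal and the unfiltered text, while synthesis uses the filtered text, which matches the separation between S105 and S106 in the flowchart.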

FIG. 10 is a flowchart showing the automatic switching operation of the emotion suppression function according to the present embodiment. Note that FIG. 10 is merely an example: the order of at least some of the steps may be changed, steps not shown may be performed, and some steps may be omitted.

The audio processing device 10 generates the operator's stress information (S201).

The audio processing device 10 determines whether the stress information satisfies a predetermined condition (S202). For example, the predetermined condition may be that the stress level indicated by the stress information is equal to or greater than (or strictly greater than) a predetermined threshold.

If the stress information satisfies the predetermined condition (S202: YES), the audio processing device 10 may apply the emotion suppression function (that is, have the operator terminal 20 output the synthesized voice) (S203). If the stress information does not satisfy the predetermined condition (S202: NO), the audio processing device 10 may leave the emotion suppression function unapplied (that is, have the operator terminal 20 output the customer's uttered voice) (S204).

The audio processing device 10 determines whether to end the processing (S205). If the processing is not to be ended (S205: NO), the audio processing device 10 executes the processing of S201 to S204 again. If the voice conversion processing is to be ended (S205: YES), the audio processing device 10 ends the processing. Note that in S201 and S202 the audio processing device 10 may instead decide whether to apply the emotion suppression function based on the emotion information or on the output of the emotion suppression switching model 101e.

以上のように、本実施形態に係る音声処理システム１によれば、顧客の発話音声信号に基づいてテキストデータを生成し、当該テキストデータに基づいて生成される合成音声をオペレータに出力する。このため、顧客の発話音声に含まれる顧客の感情を十分に抑制した合成音声をオペレータに聞かせることができ、顧客の感情的発話に起因するオペレータのストレスを軽減できる。本発明の発明者は、約５０名の被験者に対して、１）顧客の発話音声そのもの、２）顧客の発話音声の音量を調整した音声、３）顧客の発話音声の声質を変換した音声、４）顧客の発話音声をテキスト化してから生成した合成音声、の４種類の音声を聞き比べてもらい、音声から感じられる怒りの度合いを７段階の尺度で評価してもらう実験を行った。その結果、２）や３）と比較して４）が、被験者に伝わった怒りの軽減度合いが顕著であった。 As described above, according to the voice processing system 1 according to the present embodiment, text data is generated based on the customer's uttered voice signal, and synthesized voice generated based on the text data is output to the operator. Therefore, the operator can be presented with synthesized voice in which the customer's emotions contained in the customer's uttered voice are sufficiently suppressed, and the stress on the operator caused by the customer's emotional utterances can be reduced. The inventors of the present invention conducted an experiment in which approximately 50 subjects listened to and compared four types of voice, namely 1) the customer's uttered voice itself, 2) the customer's uttered voice with its volume adjusted, 3) the customer's uttered voice with its voice quality converted, and 4) synthesized voice generated after converting the customer's uttered voice into text, and rated the degree of anger perceived in each voice on a seven-point scale. As a result, the reduction in the anger conveyed to the subjects was markedly greater for 4) than for 2) or 3).

また、本実施形態に係る音声処理システム１によれば、オペレータに対して、合成音声を出力するだけでなく顧客の感情情報を合成音声出力のタイミングに合わせて通知することができるので、合成音声を聞いたオペレータが顧客の感情をリアルタイムに認識でき、顧客に対して適切な応対を行うことができる。 Furthermore, according to the voice processing system 1 according to the present embodiment, the operator is not only presented with the synthesized voice but can also be notified of the customer's emotion information in time with the output of the synthesized voice. The operator who hears the synthesized voice can therefore recognize the customer's emotions in real time and respond to the customer appropriately.

また、本実施形態に係る音声処理システム１によれば、オペレータのストレス情報又は顧客の感情情報等に基づいて、感情抑制機能を適用するか否か（すなわち、オペレータに対して合成音声又は発話音声のどちらを出力するか）が切り替えられるので、オペレータのストレスと顧客の満足度とのバランスを適切に図ることができる。 Further, according to the voice processing system 1 according to the present embodiment, whether or not to apply the emotion suppression function (that is, whether to output the synthesized voice or the uttered voice to the operator) is switched based on the operator's stress information, the customer's emotion information, or the like, so that operator stress and customer satisfaction can be appropriately balanced.

（変更例） (Modification)
上記音声処理システム１では、音声認識部１０３は、発話音声信号から、一つ又は複数の文として確定された単語列を含むテキストデータを生成したが、これに限られない。音声認識部１０３は、発話音声信号から認識された単語列が一つ又は複数の文として確定される前に、一つ又は複数の単語（品詞又は形態素）からなる単語列を含むテキストデータを生成してもよい。除去部１０４は、当該文として確定されていないテキストデータ内の特定の単語列を除去し、音声合成部１０５は、当該文として確定されていないテキストデータから合成音声信号を生成してもよい。 In the above-described speech processing system 1, the speech recognition unit 103 generates, from the uttered speech signal, text data including a word string finalized as one or more sentences, but the present invention is not limited to this. The speech recognition unit 103 may generate text data including a word string consisting of one or more words (parts of speech or morphemes) before the word string recognized from the uttered speech signal is finalized as one or more sentences. The removal unit 104 may remove a specific word string from the text data that has not yet been finalized as a sentence, and the speech synthesis unit 105 may generate a synthesized speech signal from the text data that has not yet been finalized as a sentence.

図１１は、本実施形態の変更例に係る合成音声信号の生成の一例を示す図である。図１１では、送受信部１０２で取得された発話音声信号Ｓ４に基づいて、音声認識部１０３においてテキストデータＴ４１～Ｔ４３が生成されるものとする。図１１に示すように、テキストデータＴ４１～Ｔ４３は、「はやく送ってください」という一文の確定前に、意味を持つ形態素単位（「はやく」、「送って」、「ください」）でテキストデータが生成される点で、図４と異なる。除去部１０４は、テキストデータＴ４１～Ｔ４３それぞれに対して特定の単語列が含まれるか否かを判定して、当該特定の単語列を除去して音声合成部１０５に出力する。音声合成部１０５は、テキストデータＴ４１～Ｔ４３からそれぞれ合成音声信号Ｓ４１～Ｓ４３を生成する。 FIG. 11 is a diagram illustrating an example of generation of a synthesized speech signal according to a modification of the present embodiment. In FIG. 11, it is assumed that text data T41 to T43 are generated in the speech recognition unit 103 based on the uttered speech signal S4 acquired by the transmission/reception unit 102. As shown in FIG. 11, the text data T41 to T43 differ from FIG. 4 in that they are generated in meaningful morpheme units ("quickly," "send," "please") before the single sentence "please send it quickly" is finalized. The removal unit 104 determines, for each of the text data T41 to T43, whether a specific word string is included, removes any such word string, and outputs the resulting text data to the speech synthesis unit 105. The speech synthesis unit 105 generates synthesized speech signals S41 to S43 from the text data T41 to T43, respectively.

図１１に示すように、文の確定前に一つ又は複数の形態素単位でテキストデータを生成して合成音声を出力することにより、テキストデータの生成によりオペレータの応答遅延を軽減できる。なお、形態素単位での複数のテキストデータ（又は合成音声）が意味的に不自然でないかを判定するモデルなどが用いられてもよい。 As shown in FIG. 11, by generating text data in units of one or more morphemes and outputting synthesized speech before a sentence is finalized, the delay in the operator's response caused by text data generation can be reduced. Note that a model or the like that determines whether a plurality of pieces of text data (or synthesized speech) in morpheme units are semantically natural may be used.
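The per-morpheme processing of FIG. 11 may be sketched as follows. This is a minimal sketch under stated assumptions rather than the embodiment itself: the function `process_partial_text`, the stand-in synthesizer `placeholder_tts`, and the blocked word "BLOCKED" are hypothetical, and the removal is modeled as simple substring deletion. Each partial text is cleaned and synthesized as soon as it arrives, without waiting for the sentence to be finalized.

```python
from typing import Callable, Iterable, Iterator, Sequence


def process_partial_text(
    chunks: Iterable[str],
    blocked_words: Sequence[str],
    synthesize: Callable[[str], str],
) -> Iterator[str]:
    """Clean and synthesize each partial text as soon as it is available."""
    for chunk in chunks:  # e.g. T41, T43, T43 arriving one by one
        cleaned = chunk
        for word in blocked_words:  # removal unit: drop specific word strings
            cleaned = cleaned.replace(word, "")
        if cleaned:  # nothing is synthesized if the chunk was removed entirely
            yield synthesize(cleaned)


def placeholder_tts(text: str) -> str:
    """Hypothetical stand-in for a speech synthesis model."""
    return f"audio({text})"


chunks = ["はやく", "送って", "ください"]
signals = list(
    process_partial_text(chunks, blocked_words=["BLOCKED"], synthesize=placeholder_tts)
)
print(signals)
```

Because the function is a generator, the first synthesized signal is available before later chunks have been recognized, which is the source of the latency reduction described above.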

また、応答遅延を軽減するために、図４に示す合成音声信号Ｓ１～Ｓ３、図１１に示す合成音声信号Ｓ４１～Ｓ４３それぞれの前及び／又は後に、例えば、「あ～」、「え～」、「まあ」等のフィラー音が追加されてもよい。これにより、オペレータも応答遅延による顧客の満足度の低下を防止できる。 In addition, in order to reduce response delay, a filler sound such as "ah," "er," or "well" may be added before and/or after each of the synthesized speech signals S1 to S3 shown in FIG. 4 and the synthesized speech signals S41 to S43 shown in FIG. 11. This also allows the operator to prevent a decrease in customer satisfaction due to response delay.

また、音声合成部１０５は、感情認識部１０６が推定した顧客の感情に基づいて、複数の音声合成モデル１０１ｂのうちから顧客の感情に合った音声合成モデル１０１ｂを選択してもよい。例えば、感情認識部１０６が推定した顧客の感情が「激昂」である場合、音声合成部１０５は、ピッチが速く抑揚が激しい音声合成モデル１０１ｂを用いてよい。例えば、感情認識部１０６が推定した顧客の感情が「号泣」である場合、音声合成部１０５は、泣き声のような音声を出力する音声合成モデル１０１ｂを用いてよい。或いは、音声合成部１０５は、感情認識部１０６が推定した顧客の感情に基づいて音声合成モデル１０１ｂのパラメータを変更し、顧客の感情に合った音声が出力されるように調整してよい。顧客が激昂している際の生の音声を直接聞いたオペレータは極めて強いストレスを感じてしまう。他方、オペレータは顧客対応業務を適切に遂行するために、顧客の感情をリアルタイムで正確に把握する必要がある。オペレータに発話音声を直接聞かせないことによりオペレータは過剰なストレスを感じることがなく、合成音声に顧客の感情を乗せることにより、オペレータは聴覚を通じて顧客の感情をリアルタイムに把握することができる。 Furthermore, the speech synthesis unit 105 may select, from among a plurality of speech synthesis models 101b, a speech synthesis model 101b that matches the customer's emotion, based on the customer's emotion estimated by the emotion recognition unit 106. For example, if the customer's emotion estimated by the emotion recognition unit 106 is "enraged," the speech synthesis unit 105 may use a speech synthesis model 101b with a fast tempo and strong intonation. For example, if the customer's emotion estimated by the emotion recognition unit 106 is "crying," the speech synthesis unit 105 may use a speech synthesis model 101b that outputs speech that sounds like crying. Alternatively, the speech synthesis unit 105 may change the parameters of the speech synthesis model 101b based on the customer's emotion estimated by the emotion recognition unit 106, and adjust them so that speech matching the customer's emotion is output. An operator who directly hears the raw voice of an enraged customer experiences extremely strong stress. On the other hand, the operator needs to grasp the customer's emotions accurately and in real time in order to perform customer service work appropriately. By not letting the operator hear the uttered voice directly, the operator is spared excessive stress, and by conveying the customer's emotion in the synthesized speech, the operator can grasp the customer's emotion in real time through hearing.
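Selecting a speech synthesis model according to the estimated emotion may be sketched as a simple lookup. The following is only an illustration under stated assumptions: the class `SynthesisProfile`, its `rate` and `intonation` fields, the numeric values, and the fallback to a neutral profile are all hypothetical and do not appear in the embodiment, which leaves the concrete models and parameters open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynthesisProfile:
    """Hypothetical stand-in for one speech synthesis model or parameter set."""

    name: str
    rate: float  # relative speaking tempo (1.0 = neutral), assumed value
    intonation: float  # relative intonation range (1.0 = neutral), assumed value


NEUTRAL = SynthesisProfile("neutral", rate=1.0, intonation=1.0)

# Assumed mapping from an estimated emotion label to a profile.
PROFILES = {
    "enraged": SynthesisProfile("enraged", rate=1.3, intonation=1.5),
    "crying": SynthesisProfile("crying", rate=0.9, intonation=0.8),
}


def select_profile(emotion: str) -> SynthesisProfile:
    """Return the profile matching the estimated emotion, or a neutral default."""
    return PROFILES.get(emotion, NEUTRAL)


print(select_profile("enraged"))
print(select_profile("unknown"))
```

The same lookup could equally return parameter overrides for a single model, corresponding to the alternative in which the parameters of one speech synthesis model are changed rather than a different model being selected.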

（その他の実施形態） (Other embodiments)
上記実施形態では、顧客の発話音声信号をテキスト化して、合成音声信号をオペレータに出力するものとしたがこれに限られない。音声処理装置１０は、顧客の発話音声信号に基づいて抽出される音声特徴量を音声変換モデルに入力して、変換音声の信号を生成し、オペレータ端末２０から変換音声を出力してもよい。 In the above embodiment, the customer's uttered voice signal is converted into text and the synthesized voice signal is output to the operator, but the present invention is not limited thereto. The voice processing device 10 may input the voice feature amount extracted based on the customer's uttered voice signal into the voice conversion model, generate a converted voice signal, and output the converted voice from the operator terminal 20.

特許請求の範囲に記載の「音声変換モデル」は、発話音声信号を一旦テキスト化して合成音声として出力するモデルと、発話音声信号をテキスト化せずに声質を変換させて出力するモデルとの両方を包含する概念である。顧客の発話音声に代えて合成音声または変換音声をオペレータに対して出力することにより、効果の程度の差こそあれ、オペレータが感じるストレスを軽減できる。他方で、顧客対応業務の遂行のためには、オペレータが顧客の感情をリアルタイムに把握することも欠かせない。 The "voice conversion model" recited in the claims is a concept that encompasses both a model that first converts the uttered voice signal into text and outputs it as synthesized voice, and a model that converts the voice quality of the uttered voice signal and outputs it without converting the signal into text. By outputting synthesized voice or converted voice to the operator in place of the customer's uttered voice, the stress felt by the operator can be reduced, although the degree of the effect may vary. On the other hand, in order to perform customer service work, it is also essential for the operator to grasp the customer's emotions in real time.

本変形例における音声処理装置１０は、顧客の発話音声信号に基づいて抽出される音声特徴量を音声変換モデルに入力して、変換音声信号を生成する。音声処理装置１０は、１）変換音声信号と、２）顧客の発話音声から推定される顧客の感情情報とを時間軸上で関連付けた情報を生成し、オペレータ端末２０に対して送信する。音声処理装置１０が送信する情報には、発話音声信号、発話音声信号から生成されたテキストデータ、除去部１０４の処理を経たあとのテキストデータ、検出情報、感情抑制機能のオン・オフを切り替えるタイミングが関連付けされていてもよい。 The voice processing device 10 in this modification inputs voice features extracted based on a customer's uttered voice signal into a voice conversion model to generate a converted voice signal. The voice processing device 10 generates information in which 1) the converted voice signal and 2) the customer's emotional information estimated from the customer's uttered voice are associated on the time axis, and transmits it to the operator terminal 20. The information transmitted by the voice processing device 10 includes a spoken voice signal, text data generated from the spoken voice signal, text data after processing by the removal unit 104, detection information, and timing for switching on/off of the emotion suppression function. may be associated.

オペレータ端末２０は、音声処理装置１０から受信した変換音声の信号を音声出力部１０ｉから出力し、且つ、音声出力部１０ｉからの変換音声の出力タイミングＴに合わせて、感情情報を示す情報を表示部１０ｆに表示してよい。オペレータ端末２０は更に、音声出力部１０ｉからの変換音声の出力タイミングＴに合わせて、テキストデータを表示部１０ｆに表示してよい。かかる表示の態様は図７に図示するようであってよい。 The operator terminal 20 may output the converted voice signal received from the voice processing device 10 from the audio output unit 10i, and may display information indicating the emotion information on the display unit 10f in time with the output timing T of the converted voice from the audio output unit 10i. The operator terminal 20 may further display the text data on the display unit 10f in time with the output timing T of the converted voice from the audio output unit 10i. The mode of such display may be as illustrated in FIG. 7.

本変形例における音声処理装置１０は、感情情報に基づいて、感情情報が示す感情が変換音声に反映されるように、変換音声の信号を生成してもよい。例えば感情情報が示す感情が「激昂」である場合、ピッチが速く抑揚が激しい音声変換モデルを用いてよい。例えば感情情報が示す感情が「号泣」である場合、泣き声のような音声を出力する音声変換モデルを用いてよい。音声処理装置１０は、感情情報が示す感情が変換音声に反映されるように、変換音声の信号を生成してよい。オペレータに発話音声を直接聞かせないことによりオペレータは過剰なストレスを感じることがなく、変換音声に顧客の感情を乗せることにより、オペレータは聴覚を通じて顧客の感情をリアルタイムに把握することができる。 The audio processing device 10 in this modification may generate the converted voice signal based on the emotion information so that the emotion indicated by the emotion information is reflected in the converted voice. For example, if the emotion indicated by the emotion information is "enraged," a voice conversion model with a fast tempo and strong intonation may be used. For example, if the emotion indicated by the emotion information is "crying," a voice conversion model that outputs a voice that sounds like crying may be used. The audio processing device 10 may generate the converted voice signal so that the emotion indicated by the emotion information is reflected in the converted voice. By not letting the operator hear the uttered voice directly, the operator is spared excessive stress, and by conveying the customer's emotion in the converted voice, the operator can grasp the customer's emotion in real time through hearing.

本変形例における音声処理システム１においては、オペレータによるアノテーション作業は、オペレータによる通常のコールセンター業務中において、変換音声に対して行われても良い。オペレータが変換音声に対して「怒りの感情」をアノテーションした場合、当該アノテーションの結果に基づいて、音声変換モデルがより柔らかい音声を出力するようにリアルタイムに調整されてもよい。 In the speech processing system 1 in this modification, the annotation work by the operator may be performed on the converted speech during normal call center work by the operator. When the operator annotates the converted voice with "emotion of anger," the voice conversion model may be adjusted in real time to output softer voice based on the result of the annotation.

以上説明した実施形態では、第１のユーザが顧客であり、第２のユーザがオペレータであるコールセンターを想定したが、本実施形態の適用場面はコールセンターに限られない。例えば、Ｗｅｂミーティング等、第１のユーザの感情を抑制した音声を第２のユーザに出力するどのような場面にも適用可能である。すなわち、本実施形態は、カスタマーハラスメント対策だけでなく、社内のパワーハラスメント等、様々なハラスメントに対する企業側の対策として利用可能である。 In the embodiment described above, a call center is assumed where the first user is a customer and the second user is an operator, but the application scene of this embodiment is not limited to a call center. For example, the present invention can be applied to any situation, such as a web meeting, in which a first user's emotion-suppressed voice is output to a second user. That is, the present embodiment can be used not only as a countermeasure against customer harassment but also as a countermeasure on the company side against various types of harassment such as internal power harassment.

以上説明した実施形態における、感情情報と合成音声とを「時間軸上で関連付け」する処理は、図７に示すように、合成音声または変換音声の出力タイミングに合わせて、それらの元となった発話音声から推定される感情情報を表示することが実現可能な態様であれば、その具体的な態様を問わない。以上説明した実施形態における「時間軸上で関連付け」する処理は、何時何分何秒といった時刻情報に基づいて関連付けする処理でも良いし、発話音声情報の開始から何分何秒経過時といった情報に基づいて関連付けする処理でも良いし、文単位、単語単位又は形態素単位で関連付けする処理でもよい。 In the embodiment described above, the process of "associating on the time axis" the emotion information and the synthesized voice may take any specific form, as long as it makes it possible, as shown in FIG. 7, to display the emotion information estimated from the underlying uttered voice in time with the output of the synthesized voice or converted voice. The process of "associating on the time axis" in the embodiment described above may be a process of associating based on time-of-day information such as hours, minutes, and seconds, a process of associating based on information such as the minutes and seconds elapsed since the start of the uttered voice information, or a process of associating in units of sentences, words, or morphemes.
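One of the forms mentioned above, association by time elapsed since the start of the uttered voice, may be sketched as follows. This is a minimal sketch under stated assumptions: the record type `EmotionSpan`, the function `emotion_at`, the emotion labels, and the numeric offsets are hypothetical, and the spans are assumed to be sorted by start offset. Given the elapsed time at which a piece of synthesized or converted voice is output, the lookup returns the emotion information to display at that moment.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class EmotionSpan:
    """Emotion estimated for the utterance starting at start_s (hypothetical)."""

    start_s: float  # seconds elapsed since the start of the uttered voice
    emotion: str


def emotion_at(spans: List[EmotionSpan], elapsed_s: float) -> Optional[str]:
    """Return the emotion in effect at elapsed_s, or None before the first span.

    Assumes spans is sorted by start_s in ascending order.
    """
    starts = [span.start_s for span in spans]
    index = bisect_right(starts, elapsed_s) - 1  # last span starting at or before
    return spans[index].emotion if index >= 0 else None


spans = [
    EmotionSpan(0.0, "neutral"),
    EmotionSpan(4.2, "anger"),
    EmotionSpan(9.0, "neutral"),
]
print(emotion_at(spans, 5.0))
```

Association by sentence, word, or morpheme would replace the floating-point offset with an index into the corresponding unit, with the lookup otherwise unchanged.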

以上説明した実施形態における音声処理システム１において、顧客からは、自身の音声が感情抑制されてオペレータに届いていることが分からないようにしてもよい。すなわち、感情抑制機能がオンになっているかオフになっているかは、顧客からは把握できないようにしてもよい。 In the voice processing system 1 in the embodiment described above, the customer may not be able to tell that his or her own voice is being delivered to the operator with emotion suppressed. That is, the customer may not be able to know whether the emotion suppression function is on or off.

アノテーション作業は、オペレータがオペレータ端末２０上で行っても良いし、別途、アノテーション作業用の専用のアプリケーションや端末が用意されていてもよい。 The annotation work may be performed by an operator on the operator terminal 20, or a dedicated application or terminal for annotation work may be separately prepared.

また、以上説明した実施形態は、本発明の理解を容易にするためのものであり、本発明を限定して解釈するためのものではない。実施形態が備える各要素並びにその配置、材料、条件、形状及びサイズ等は、例示したものに限定されるわけではなく適宜変更することができる。また、異なる実施形態で示した構成同士を部分的に置換し又は組み合わせることが可能である。また、音声処理装置１０の機能として記載した機能をオペレータ端末２０が備えていてもよい。また、オペレータ端末２０の機能として記載した機能を音声処理装置１０が備えていてもよい。 Further, the embodiments described above are for facilitating understanding of the present invention, and are not intended to be interpreted as limiting the present invention. Each element included in the embodiment, as well as its arrangement, material, conditions, shape, size, etc., are not limited to those illustrated, and can be changed as appropriate. Further, it is possible to partially replace or combine the structures shown in different embodiments. Furthermore, the operator terminal 20 may have the functions described as the functions of the voice processing device 10. Furthermore, the voice processing device 10 may include the functions described as the functions of the operator terminal 20.

１…音声処理システム、１０…音声処理装置、２０…オペレータ端末、３０…顧客端末、１０ａ…プロセッサ、１０ｂ…ＲＡＭ、１０ｃ…ＲＯＭ、１０ｄ…通信部、１０ｅ…入力部、１０ｆ…表示部、１０ｇ…カメラ、１０ｈ…音声入力部、１０ｉ…音声出力部、１０１…記憶部、１０２…送受信部、１０３…音声認識部、１０４…除去部、１０５…音声合成部、１０６…感情認識部、１０７…ストレス認識部、１０８…制御部、１０９…学習部、２０１…送受信部、２０２…入力受付部、２０３…制御部 1... Voice processing system, 10... Voice processing device, 20... Operator terminal, 30... Customer terminal, 10a... Processor, 10b... RAM, 10c... ROM, 10d... Communication unit, 10e... Input unit, 10f... Display unit, 10g... Camera, 10h... Audio input unit, 10i... Audio output unit, 101... Storage unit, 102... Transmission/reception unit, 103... Speech recognition unit, 104... Removal unit, 105... Speech synthesis unit, 106... Emotion recognition unit, 107... Stress recognition unit, 108... Control unit, 109... Learning unit, 201... Transmission/reception unit, 202... Input reception unit, 203... Control unit

Claims

第１のユーザの発話音声の信号である発話音声信号を取得する取得部と、
前記発話音声信号に基づいて抽出される特徴量を音声変換モデルに入力して、変換音声の信号を生成する音声変換部と、
第２のユーザによって入力される切り替え情報に基づいて、前記第２のユーザに対して前記変換音声を出力する音声出力部から前記変換音声又は前記発話音声のどちらを出力するかを切り替える制御部と、
前記第２のユーザによって入力された切り替え情報を、前記切り替え情報が入力された際の発話音声信号と時間軸上で関連付けた情報を生成し、当該情報に基づいて、発話音声信号、当該発話音声信号から抽出した特徴量、当該発話音声信号から生成したテキストデータ、当該テキストデータから抽出された特徴量、又はこれらの少なくとも二つの組み合わせを入力とし、前記変換音声と前記発話音声とを切り替えるタイミングを出力とする感情抑制切替モデルを機械学習する学習部と、を備え、
前記制御部は、前記感情抑制切替モデルに、前記取得部が取得した発話音声信号、当該発話音声信号から抽出した特徴量、当該発話音声信号から生成したテキストデータ、当該テキストデータから抽出された特徴量、又はこれらの少なくとも二つの組み合わせを入力することにより、前記変換音声と前記発話音声とを切り替えるタイミングを生成する、
音声処理装置。 An audio processing device comprising: an acquisition unit that acquires an uttered voice signal that is a signal of a first user's uttered voice;
a voice conversion unit that inputs a feature quantity extracted based on the uttered voice signal into a voice conversion model to generate a converted voice signal;
a control unit that switches, based on switching information input by a second user, which of the converted voice and the uttered voice is output from a voice output unit that outputs the converted voice to the second user; and
a learning unit that generates information associating, on a time axis, the switching information input by the second user with the uttered voice signal at the time the switching information was input, and that, based on the information, performs machine learning of an emotion suppression switching model that takes as input an uttered voice signal, a feature quantity extracted from the uttered voice signal, text data generated from the uttered voice signal, a feature quantity extracted from the text data, or a combination of at least two of these, and outputs a timing for switching between the converted voice and the uttered voice,
wherein the control unit generates the timing for switching between the converted voice and the uttered voice by inputting, to the emotion suppression switching model, the uttered voice signal acquired by the acquisition unit, a feature quantity extracted from the uttered voice signal, text data generated from the uttered voice signal, a feature quantity extracted from the text data, or a combination of at least two of these.

前記発話音声信号に対応する第１のユーザの感情情報を生成する感情認識部をさらに備える、請求項１に記載の音声処理装置。 The audio processing device according to claim 1, further comprising an emotion recognition unit that generates emotional information of the first user corresponding to the uttered audio signal.

前記感情認識部は、発話音声信号、当該発話音声信号から抽出した特徴量、当該発話音声信号から生成したテキストデータ、当該テキストデータから抽出された特徴量、又はこれらの少なくとも二つの組み合わせを入力とし、当該発話音声信号の発話者の感情情報を出力するよう機械学習された感情認識モデルに、前記取得部が取得した発話音声信号、当該発話音声信号から抽出した音声特徴量、当該発話音声信号から生成したテキストデータ、当該テキストデータに対応するテキスト特徴量、又はこれらの少なくとも二つの組み合わせを入力することにより、前記取得部が取得した発話音声信号に対応する第１のユーザの感情情報を生成する、請求項２に記載の音声処理装置。 The emotion recognition unit receives as input a spoken voice signal, a feature extracted from the spoken voice signal, text data generated from the spoken voice signal, a feature extracted from the text data, or a combination of at least two of these. , the utterance audio signal acquired by the acquisition unit, the audio feature extracted from the utterance audio signal, and the emotion recognition model that has been machine learned to output emotional information of the speaker of the utterance audio signal. By inputting the generated text data, a text feature corresponding to the text data, or a combination of at least two of these, the first user's emotional information corresponding to the speech audio signal acquired by the acquisition unit is generated. , The audio processing device according to claim 2.

前記感情認識部は、前記第２のユーザに対して、前記音声出力部による前記変換音声の出力タイミングに合わせて、前記変換音声に対応する前記感情情報を表示する表示部に表示される当該感情情報を生成する、請求項２又は３に記載の音声処理装置。 The audio processing device according to claim 2 or 3, wherein the emotion recognition unit generates the emotion information to be displayed on a display unit that displays, to the second user, the emotion information corresponding to the converted voice in time with the output of the converted voice by the voice output unit.

前記発話音声信号に基づいて抽出される特徴量を音声認識モデルに入力して、一以上の単語からなる単語列を含むテキストデータを生成する音声認識部と、
前記テキストデータに含まれる特定の単語列を検出し、前記特定の単語列を除去又は前記特定の単語列を他の単語列に置換したテキストデータを生成する除去部と、
をさらに備える、請求項１に記載の音声処理装置。 a speech recognition unit that inputs feature quantities extracted based on the speech audio signal into a speech recognition model to generate text data including a word string consisting of one or more words;
a removal unit that detects a specific word string included in the text data and generates text data in which the specific word string is removed or the specific word string is replaced with another word string;
The audio processing device according to claim 1, further comprising:

前記第２のユーザは、コールセンターのオペレータを含み、
前記除去部は、前記特定の単語列が検出される場合、前記第１のユーザに対する警告に関する情報を生成し、
生成された当該情報は、前記第１のユーザが操作する情報処理装置において出力される、請求項５に記載の音声処理装置。 The second user includes a call center operator,
The removing unit generates information regarding a warning to the first user when the specific word string is detected;
The audio processing device according to claim 5, wherein the generated information is outputted in an information processing device operated by the first user.

前記特定の単語列は、前記第２のユーザに心理的悪影響を与える一以上の単語を含む、請求項５又は６に記載の音声処理装置。 The audio processing device according to claim 5 or 6, wherein the specific word string includes one or more words that have a negative psychological impact on the second user.

前記音声変換部は、前記発話音声に含まれる前記第１のユーザの感情を抑制するように前記発話音声の信号の音量又は声質の少なくとも一方を変換した変換音声を生成する、請求項１から７のいずれか一項に記載の音声処理装置。 The audio processing device according to any one of claims 1 to 7, wherein the voice conversion unit generates the converted voice by converting at least one of the volume and the voice quality of the uttered voice signal so as to suppress the first user's emotion contained in the uttered voice.

第１のユーザの発話音声の信号である発話音声信号を取得する工程と、
前記発話音声信号に基づいて抽出される特徴量を音声変換モデルに入力して、変換音声の信号を生成する工程と、
第２のユーザによって入力される切り替え情報に基づいて、前記変換音声又は前記発話音声のどちらを出力するかを切り替える工程と、
前記第２のユーザによって入力された切り替え情報を、前記切り替え情報が入力された際の発話音声信号と時間軸上で関連付けた情報を生成し、当該情報に基づいて、発話音声信号、当該発話音声信号から抽出した特徴量、当該発話音声信号から生成したテキストデータ、当該テキストデータから抽出された特徴量、又はこれらの少なくとも二つの組み合わせを入力とし、前記変換音声と前記発話音声とを切り替えるタイミングを出力とする感情抑制切替モデルを機械学習する工程と、を含み、
前記切り替える工程は、前記感情抑制切替モデルに、取得された前記発話音声信号、当該発話音声信号から抽出した特徴量、当該発話音声信号から生成したテキストデータ、当該テキストデータから抽出された特徴量、又はこれらの少なくとも二つの組み合わせを入力することにより、前記変換音声と前記発話音声とを切り替えるタイミングを生成する、
音声処理方法。 An audio processing method comprising: a step of acquiring an uttered voice signal that is a signal of a first user's uttered voice;
a step of inputting a feature quantity extracted based on the uttered voice signal into a voice conversion model to generate a converted voice signal;
a step of switching, based on switching information input by a second user, which of the converted voice and the uttered voice is output; and
a step of generating information associating, on a time axis, the switching information input by the second user with the uttered voice signal at the time the switching information was input, and, based on the information, performing machine learning of an emotion suppression switching model that takes as input an uttered voice signal, a feature quantity extracted from the uttered voice signal, text data generated from the uttered voice signal, a feature quantity extracted from the text data, or a combination of at least two of these, and outputs a timing for switching between the converted voice and the uttered voice,
wherein the step of switching generates the timing for switching between the converted voice and the uttered voice by inputting, to the emotion suppression switching model, the acquired uttered voice signal, a feature quantity extracted from the uttered voice signal, text data generated from the uttered voice signal, a feature quantity extracted from the text data, or a combination of at least two of these.

コンピュータを、請求項１から８のいずれか一項に記載の音声処理装置として機能させるためのプログラム。 A program for causing a computer to function as the audio processing device according to any one of claims 1 to 8.