JP7483226B2

JP7483226B2 - Computer program, server device and method

Info

Publication number: JP7483226B2
Application number: JP2019222758A
Authority: JP
Inventors: 暁彦白井
Original assignee: GREE Inc
Current assignee: GREE Inc
Priority date: 2019-12-10
Filing date: 2019-12-10
Publication date: 2024-05-15
Anticipated expiration: 2039-12-10
Also published as: JP2021092644A

Description

The technology disclosed in this application relates to a voice changer that generates a processed audio signal by applying signal processing to an audio signal obtained from a user's speech.

In recent years, services and products have become available that let users obtain speech rendered in a voice different from their own.

First, in a service called "LisPon," in response to a request from one user, another user with an attractive voice records his or her own voice and sends the recording back to the requesting user (Non-Patent Document 1).

Next, a technology called a voice changer, which processes input speech and outputs the result, is also known. An example of a hardware voice changer is a product called "VT3 Voice Transformer," which is equipped with a DSP (digital signal processor) (Non-Patent Document 2). A known software voice changer is "Koigoe" (Non-Patent Document 3). Both of these voice changers modify parameters of the audio signal input through a microphone, including pitch and formants, according to values set by the user, and output the result as an audio signal. Yet another type of voice changer is described in Japanese Patent Laid-Open Publication No. 2007-114561 (Patent Document 1). In the technology described in that publication, a mobile phone applies a voice conversion algorithm to an audio signal input through a microphone and outputs an audio signal that sounds as if harmony sung by multiple people had been added.

Furthermore, a service called "UserLocal Voice Changer" is known as a service that provides a voice changer via a website (Non-Patent Document 4). In this service, a web browser uploads an audio file generated by recording the user's voice, and then sets and transmits parameters including pitch, formant, and conversion pattern. The server, acting as the voice changer, then processes the audio file according to the set parameters and plays it back.

Non-Patent Documents 1 to 4 and Patent Document 1 above are incorporated herein by reference in their entirety.

Non-Patent Document 1: "LisPon", [online], September 17, 2018, Baidu Inc., [retrieved November 2, 2018], Internet (URL: https://lispon.moe/)
Non-Patent Document 2: "VT3 Voice Transformer", [online], March 8, 2014, Roland Corporation, [retrieved November 2, 2018], Internet (URL: https://www.roland.com/jp/products/vt-3/)
Non-Patent Document 3: "Koigoe", [online], May 1, 2018, Koigoe Moe, [retrieved November 2, 2018], Internet (URL: http://www.geocities.jp/moe_koigoe/index.html)
Non-Patent Document 4: "UserLocal Voice Changer", [online], August 1, 2018, UserLocal, Inc., [retrieved November 2, 2018], Internet (URL: https://voice-changer.userlocal.jp/)

Patent Document 1: JP 2007-114561 A

Recently, there has been demand for voice changers suited to the individual user. The technology disclosed in this application therefore provides a technique that makes it possible to provide a voice changer suited to the user.

A computer program according to one aspect, "when executed by at least one processor, causes the processor to: acquire, as a reference fundamental frequency, a fundamental frequency calculated by signal processing of an audio signal based on speech by a target user; acquire an amount of change in the reference fundamental frequency relative to a first reference value; acquire a plurality of voice conversion presets, each of which defines an amount of change in fundamental frequency relative to the first reference value; and calculate a distance between the voice corresponding to each voice conversion preset included in the plurality of voice conversion presets and the voice of the target user, based on the amount of change in fundamental frequency defined by that voice conversion preset and the amount of change in the reference fundamental frequency."

A server device according to one aspect "comprises at least one processor, wherein the processor: acquires, as a reference fundamental frequency, a fundamental frequency calculated by signal processing of an audio signal based on speech by a target user; acquires an amount of change in the reference fundamental frequency relative to a first reference value; acquires a plurality of voice conversion presets, each of which defines an amount of change in fundamental frequency relative to the first reference value; and calculates a distance between the voice corresponding to each voice conversion preset included in the plurality of voice conversion presets and the voice of the target user, based on the amount of change in fundamental frequency defined by that voice conversion preset and the amount of change in the reference fundamental frequency."

A method according to one aspect "includes: a fourth acquisition step of acquiring a plurality of voice conversion presets, each of which defines an amount of change in fundamental frequency relative to a first reference value; and a calculation step of calculating a distance between the voice corresponding to each voice conversion preset included in the plurality of voice conversion presets and the voice of the target user, based on the amount of change in fundamental frequency defined by that voice conversion preset and the amount of change in the reference fundamental frequency."
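The distance calculation recited above can be illustrated with a minimal sketch. The passage does not fix a unit for the "amount of change" or a distance formula, so the choices here are assumptions made only for illustration: the change is expressed in semitones relative to the first reference value, and the distance is the absolute difference between the two change amounts. The names `FIRST_REFERENCE_HZ`, `VoicePreset`, `change_from_reference`, and `preset_distances` are hypothetical.

```python
import math
from dataclasses import dataclass

# Hypothetical first reference value (Hz); the source does not state a number.
FIRST_REFERENCE_HZ = 220.0


def change_from_reference(f0_hz: float, reference_hz: float = FIRST_REFERENCE_HZ) -> float:
    """Amount of change of f0 relative to the reference, in semitones (assumed unit)."""
    return 12.0 * math.log2(f0_hz / reference_hz)


@dataclass(frozen=True)
class VoicePreset:
    """A voice conversion preset, reduced here to the one parameter the claim names."""
    name: str
    f0_change: float  # change in fundamental frequency relative to the first reference value


def preset_distances(reference_f0_hz: float, presets: list[VoicePreset]) -> dict[str, float]:
    """Distance between each preset's voice and the target user's voice.

    Assumption: distance = |preset change - user change|; the source only says
    the distance is calculated "based on" these two amounts.
    """
    user_change = change_from_reference(reference_f0_hz)
    return {p.name: abs(p.f0_change - user_change) for p in presets}
```

For example, with a target user whose reference fundamental frequency is 440 Hz (one octave above the assumed reference), a preset defining a change of +12 semitones would have distance 0 under these assumptions, and a preset defining no change would have distance 12.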

FIG. 1 is a block diagram showing an example of the configuration of a communication system according to an embodiment.
FIG. 2 is a block diagram schematically showing an example of the hardware configuration of the terminal device 20 (server device 30) shown in FIG. 1.
FIG. 3 is a block diagram schematically showing an example of the functions of the terminal device 20 (server device 30) shown in FIG. 1.
FIG. 4 is a diagram showing the relationship between the fundamental frequency and the formant frequencies in a frequency spectrum obtained from an audio signal of human speech.
FIG. 5A is a schematic diagram for explaining the function of a male voice conversion preset used in the communication system shown in FIG. 1.
FIG. 5B is a schematic diagram for explaining the function of a female voice conversion preset used in the communication system shown in FIG. 1.
FIG. 5C is a schematic diagram for explaining the function of a neutral voice conversion preset used in the communication system shown in FIG. 1.
FIG. 6 is a flow diagram showing an example of operations performed in the communication system 1 shown in FIG. 1.
FIG. 7 is a diagram showing examples of lines of dialogue prepared individually for each character (each voice conversion preset) in the communication system 1 shown in FIG. 1.
FIG. 8 is a flow diagram showing an example of a method, performed in the communication system 1 shown in FIG. 1, for calculating the distance between the voice of the target user and the voice corresponding to each voice conversion preset.
FIG. 9 is a block diagram showing an example of a method used to obtain the fundamental frequency (and the frequency of the first formant) in the communication system 1 shown in FIG. 1.
FIG. 10 is a diagram showing an example of a screen displayed by the display unit 220 of the terminal device 20 in the communication system 1 shown in FIG. 1.
FIG. 11 is a diagram showing an example of information, stored in the communication system 1 shown in FIG. 1, that associates each user with at least one voice conversion preset whose distance from that user's voice is less than a predetermined value.
FIG. 12 is a diagram showing an example of information, stored in the communication system shown in FIG. 1, that associates each user with at least one voice conversion preset used by that user in the past.
FIG. 13 is a diagram showing an example of a method, in the communication system shown in FIG. 1, for using collaborative filtering to select a voice conversion preset to recommend to the target user from among the voice conversion presets used in the past by at least one similar user.
FIG. 14 is a diagram showing another example of a screen displayed by the display unit 220 of the terminal device 20 in the communication system shown in FIG. 1.

Various embodiments of the present invention are described below with reference to the accompanying drawings. Components common to the drawings are given the same reference numerals. Note that a component shown in one drawing may be omitted from another drawing for convenience of explanation, and that the accompanying drawings are not necessarily drawn to scale.

1. Example of a communication system
FIG. 1 is a block diagram showing an example of the configuration of a communication system according to an embodiment. As shown in FIG. 1, the communication system 1 can include one or more terminal devices 20 connected to a communication network 10 and one or more server devices 30 connected to the communication network 10. FIG. 1 shows three terminal devices 20A to 20C as examples of the terminal devices 20 and three server devices 30A to 30C as examples of the server devices 30, but one or more additional terminal devices 20 and one or more additional server devices 30 may also be connected to the communication network 10.

The communication system 1 can also include one or more studio units 40 connected to the communication network 10. FIG. 1 shows two studio units 40A and 40B as examples of the studio units 40, but one or more additional studio units 40 may also be connected to the communication network 10.

In the "first aspect," in the communication system 1 shown in FIG. 1, a terminal device 20 (e.g., terminal device 20A) that is operated by a user and executes a specific application (for example, an audio/video distribution application and/or an application with a voice changer function; middleware may be used instead of, or together with, the application) acquires an audio signal of the speech of the user facing the terminal device 20A. Based on the acquired audio signal, the terminal device acquires a "converter," that is, a voice conversion algorithm and a voice conversion preset (a set of parameters used for voice conversion), generates a converted audio signal using the acquired converter, and transmits the generated audio signal (together with a video signal, if necessary) to a server device 30 (e.g., server device 30A) via the communication network 10. The server device 30A can then distribute the audio signal received from the terminal device 20A (together with a video signal, if necessary) via the communication network 10 to one or more other terminal devices 20 that execute a specific application (for example, an audio/video viewing application and/or an application with a voice changer function; middleware may be used instead of, or together with, the application) and have sent a request for audio/video distribution.

In this "first aspect," as described later, the entire series of operations from acquiring the audio signal of the user's speech to acquiring the converter (a voice conversion algorithm and a set of parameters used for voice conversion (a voice conversion preset)) may be executed by the terminal device 20. Alternatively, at least some of these operations, other than the acquisition of the audio signal, may be executed by the server device 30 or by another terminal device 20.

In the "second aspect," in the communication system 1 shown in FIG. 1, a server device 30 (e.g., server device 30B) installed in, for example, a studio or another location acquires an audio signal of the speech of a user who is in that studio or other location. Based on the acquired audio signal, the server device acquires a converter (a voice conversion algorithm and a set of parameters used for voice conversion (a voice conversion preset)), generates a converted audio signal using the acquired converter, and can distribute the generated audio signal (together with a video signal, if necessary) via the communication network 10 to one or more terminal devices 20 that execute a specific application (for example, a video viewing application and/or an application with a voice changer function; middleware may be used instead of, or together with, the application) and have sent a request for video distribution.

In the "third aspect," in the communication system 1 shown in FIG. 1, a studio unit 40 installed in, for example, a studio or another location acquires an audio signal of the speech of a user who is in that studio or other location. Based on the acquired audio signal, the studio unit acquires a converter (a voice conversion algorithm and a set of parameters used for voice conversion (a voice conversion preset)), generates a converted audio signal using the acquired converter, and transmits the generated audio signal (together with a video signal, if necessary) to a server device 30 (e.g., server device 30A) via the communication network 10. The server device 30A can then distribute the audio signal received from the studio unit 40 (together with a video signal, if necessary) via the communication network 10 to one or more other terminal devices 20 that execute a specific application (for example, an audio/video viewing application and/or an application with a voice changer function; middleware may be used instead of, or together with, the application) and have sent a request for audio/video distribution.

The communication network 10 can include, without limitation, a mobile phone network, a wireless LAN, a landline telephone network, the Internet, an intranet, and/or Ethernet (registered trademark).

By executing a specific installed application, the terminal device 20 can perform operations such as acquiring an audio signal of the user's speech, acquiring a converter, that is, a voice conversion algorithm and a voice conversion preset (a set of parameters used for voice conversion), based on the acquired audio signal, generating a converted audio signal using the acquired converter, and transmitting the generated audio signal (together with a video signal, if necessary) to the server device 30 (e.g., server device 30A) via the communication network 10. Any algorithm can be used as the voice conversion algorithm. Alternatively, the terminal device 20 can perform the same operations by executing an installed web browser to receive and display a web page from the server device 30.
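As an illustration of the "converter" described above, that is, a voice conversion algorithm paired with a voice conversion preset, a minimal sketch might look like the following. The types, the `Converter` class, and `toy_gain_algorithm` are hypothetical names introduced only for this example. The toy algorithm merely scales amplitude and stands in for a real voice conversion algorithm, which the text leaves open ("any algorithm can be used"); the `gain` parameter is likewise an assumption and does not appear in the source.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Sequence

Samples = Sequence[float]            # audio samples of the user's speech
Preset = Mapping[str, float]         # set of parameters used for voice conversion
Algorithm = Callable[[Samples, Preset], list[float]]


@dataclass(frozen=True)
class Converter:
    """A converter: a voice conversion algorithm plus a voice conversion preset."""
    algorithm: Algorithm
    preset: Preset

    def convert(self, samples: Samples) -> list[float]:
        # Apply the algorithm with this converter's preset parameters.
        return self.algorithm(samples, self.preset)


def toy_gain_algorithm(samples: Samples, preset: Preset) -> list[float]:
    """Placeholder algorithm (assumption): scales amplitude only.

    A real voice changer would modify parameters such as pitch and formants.
    """
    gain = preset.get("gain", 1.0)
    return [s * gain for s in samples]
```

Keeping the algorithm and the preset as separate fields reflects the text's definition of a converter as the combination of the two, so the same algorithm can be reused with different presets.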

The terminal device 20 is any terminal device capable of performing such operations and can include, without limitation, a smartphone, a tablet, a mobile phone (feature phone), and/or a personal computer.

In the "first aspect," the server device 30 functions as an application server by executing a specific installed application, and can thereby perform operations such as receiving a user's audio signal (together with a video signal, if necessary) from each terminal device 20 via the communication network 10 and distributing the received audio signal (together with a video signal, if necessary) to each terminal device 20 via the communication network 10. Alternatively, the server device 30 can function as a web server by executing a specific installed application and perform the same operations via web pages that it sends to each terminal device 20.

In the "second aspect," the server device 30 functions as an application server by executing a specific installed application, and can thereby perform operations such as acquiring an audio signal of the speech of a user who is in the studio or other location where the server device 30 is installed, acquiring a converter, that is, a voice conversion algorithm and a voice conversion preset (a set of parameters used for voice conversion), based on the acquired audio signal, generating a converted audio signal using the acquired converter, and distributing the generated audio signal (together with a video signal, if necessary) to each terminal device 20 via the communication network 10. Alternatively, the server device 30 can function as a web server by executing a specific installed application and perform the same operations via web pages that it sends to each terminal device 20.

Furthermore, in the "third aspect," the server device 30 functions as an application server by executing a specific installed application, and can thereby perform operations such as receiving, via the communication network 10, the audio signal (together with a video signal, if necessary) of a user in a studio or other location from the studio unit 40 installed there, and distributing the received audio signal (together with a video signal, if necessary) to each terminal device 20 via the communication network 10. Alternatively, the server device 30 can function as a web server by executing a specific installed application and perform the same operations via web pages that it sends to each studio unit 40.

The studio unit 40 functions as an information processing device that executes a specific installed application, and can thereby perform operations such as acquiring an audio signal of the speech of a user who is in the studio or other location where the studio unit 40 is installed, acquiring a converter, that is, a voice conversion algorithm and a voice conversion preset (a set of parameters used for voice conversion), based on the acquired audio signal, generating a converted audio signal using the acquired converter, and transmitting the generated audio signal (together with a video signal, if necessary) to the server device 30 (e.g., server device 30A) via the communication network 10. Alternatively, the studio unit 40 can perform the same operations by executing an installed web browser to receive and display a web page from the server device 30.
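The division of roles among the three aspects described in this section can be summarized in a small lookup structure. This is only an organizational sketch; the names `Device`, `Aspect`, `ASPECTS`, and `needs_upload` are hypothetical, and the table omits the variation in the first aspect in which some operations other than audio acquisition are executed by the server device 30 or another terminal device 20.

```python
from dataclasses import dataclass
from enum import Enum


class Device(Enum):
    TERMINAL = "terminal device 20"
    SERVER = "server device 30"
    STUDIO = "studio unit 40"


@dataclass(frozen=True)
class Aspect:
    converts: Device      # acquires the audio signal and generates the converted signal
    distributes: Device   # distributes the converted signal to requesting terminal devices


# Roles as described for the first, second, and third aspects.
ASPECTS = {
    1: Aspect(converts=Device.TERMINAL, distributes=Device.SERVER),
    2: Aspect(converts=Device.SERVER, distributes=Device.SERVER),
    3: Aspect(converts=Device.STUDIO, distributes=Device.SERVER),
}


def needs_upload(aspect: int) -> bool:
    """True when the converted signal must first be sent to a server device 30."""
    roles = ASPECTS[aspect]
    return roles.converts is not roles.distributes
```

Under this summary, only the second aspect skips the upload step, because the server device 30 both converts and distributes the audio signal.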

2. Hardware configuration of each device
Next, an example of the hardware configuration of each of the terminal device 20, the server device 30, and the studio unit 40 will be described.

2-1. Hardware configuration of terminal device 20
An example of the hardware configuration of each terminal device 20 will be described with reference to FIG. 2. FIG. 2 is a block diagram schematically showing an example of the hardware configuration of the terminal device 20 (server device 30) shown in FIG. 1. (In FIG. 2, the reference numerals in parentheses relate to each server device 30, as described later.)

As shown in FIG. 2, each terminal device 20 can mainly include a central processing unit 21, a main storage device 22, an input/output interface device 23, an input device 24, an auxiliary storage device 25, and an output device 26. These devices are connected to one another by a data bus and/or a control bus.

The central processing unit 21, referred to as a "CPU," performs operations on the instructions and data stored in the main storage device 22 and stores the results of those operations in the main storage device 22. The central processing unit 21 can also control the input device 24, the auxiliary storage device 25, the output device 26, and the like via the input/output interface device 23. The terminal device 20 can include one or more such central processing units 21.

The main storage device 22, referred to as "memory," stores instructions and data received via the input/output interface device 23 from the input device 24, the auxiliary storage device 25, the communication network 10 (the server device 30 and the like), and so on, as well as the results of operations by the central processing unit 21. The main storage device 22 can include, without limitation, RAM (random access memory), ROM (read-only memory), and/or flash memory.

The auxiliary storage device 25 is a storage device with a larger capacity than the main storage device 22. It stores the instructions and data (computer programs) that constitute the specific application, the web browser, and the like, and, under the control of the central processing unit 21, can send these instructions and data (computer programs) to the main storage device 22 via the input/output interface device 23. The auxiliary storage device 25 can include, without limitation, a magnetic disk device and/or an optical disk device.

The input device 24 is a device that takes in data from outside and includes, without limitation, a touch panel, buttons, a keyboard, a mouse, and/or a sensor (microphone).

The output device 26 can include, without limitation, a display device, a touch panel, and/or a printer device.

In this hardware configuration, the central processing unit 21 sequentially loads into the main storage device 22 the instructions and data (computer program) that constitute a specific application stored in the auxiliary storage device 25 and performs operations on the loaded instructions and data. In this way it can control the output device 26 via the input/output interface device 23, or exchange various information with other devices (e.g., the server device 30 and other terminal devices 20) via the input/output interface device 23 and the communication network 10.

As a result, by executing a specific installed application, the terminal device 20 can acquire an audio signal of the user's speech, acquire a converter, that is, a voice conversion algorithm and a voice conversion preset (a set of parameters used for voice conversion), based on the acquired audio signal, generate a converted audio signal using the acquired converter, and transmit the generated audio signal (together with a video signal, if necessary) to the server device 30 (e.g., server device 30A) via the communication network 10. Alternatively, the terminal device 20 can perform the same operations by executing an installed web browser to receive and display a web page from the server device 30.

The terminal device 20 may include one or more microprocessors and/or graphics processing units (GPUs) instead of, or together with, the central processing unit 21.

2-2. Hardware configuration of server device 30
An example of the hardware configuration of each server device 30 will also be described with reference to FIG. 2. Each server device 30 can use, for example, the same hardware configuration as each terminal device 20 described above. The reference numerals for the components of each server device 30 are therefore shown in parentheses in FIG. 2.

図２に示すように、各サーバ装置３０は、主に、中央処理装置３１と、主記憶装置３２と、入出力インタフェイス装置３３と、入力装置３４と、補助記憶装置３５と、出力装置３６と、を含むことができる。これら装置同士は、データバス及び／又は制御バスにより接続されている。 As shown in FIG. 2, each server device 30 can mainly include a central processing unit 31, a main memory device 32, an input/output interface device 33, an input device 34, an auxiliary memory device 35, and an output device 36. These devices are connected to each other by a data bus and/or a control bus.

中央処理装置３１、主記憶装置３２、入出力インタフェイス装置３３、入力装置３４、補助記憶装置３５及び出力装置３６は、それぞれ、上述した各端末装置２０に含まれる、中央処理装置２１、主記憶装置２２、入出力インタフェイス装置２３、入力装置２４、補助記憶装置２５及び出力装置２６と略同一なものとすることができる。 The central processing unit 31, main memory device 32, input/output interface device 33, input device 34, auxiliary memory device 35, and output device 36 can be substantially the same as the central processing unit 21, main memory device 22, input/output interface device 23, input device 24, auxiliary memory device 25, and output device 26 included in each of the terminal devices 20 described above, respectively.

このようなハードウェア構成にあっては、中央処理装置３１が、補助記憶装置３５に記憶された特定のアプリケーションを構成する命令及びデータ（コンピュータプログラム）を順次主記憶装置３２にロードし、ロードした命令及びデータを演算することにより、入出力インタフェイス装置３３を介して出力装置３６を制御し、或いはまた、入出力インタフェイス装置３３及び通信網１０を介して、他の装置（例えば各端末装置２０等）との間で様々な情報の送受信を行うことができる。 In this hardware configuration, the central processing unit 31 sequentially loads the instructions and data (computer program) constituting a specific application stored in the auxiliary storage device 35 into the main storage device 32 and processes the loaded instructions and data. It can thereby control the output device 36 via the input/output interface device 33, or send and receive various information to and from other devices (such as each terminal device 20) via the input/output interface device 33 and the communication network 10.

これにより、サーバ装置３０は、「第１の態様」では、インストールされた特定のアプリケーションを実行してアプリケーションサーバとして機能することにより、各端末装置２０からユーザの音声信号を（必要に応じて動画信号とともに）、通信網１０を介して受信し、受信した音声信号を（必要に応じて動画信号とともに）通信網１０を介して各端末装置２０に配信する、という動作等を実行することができる。或いはまた、サーバ装置３０は、インストールされた特定のアプリケーションを実行してウェブサーバとして機能することにより、各端末装置２０に送信するウェブページを介して、同様の動作を実行することができる。 In this way, in the "first aspect," the server device 30 executes a specific installed application to function as an application server, thereby performing operations such as receiving a user's voice signal (together with a video signal, if necessary) from each terminal device 20 via the communication network 10, and distributing the received voice signal (together with a video signal, if necessary) to each terminal device 20 via the communication network 10. Alternatively, the server device 30 executes a specific installed application to function as a web server, thereby performing similar operations via a web page that is sent to each terminal device 20.

また、サーバ装置３０は、「第２の態様」では、インストールされた特定のアプリケーションを実行してアプリケーションサーバとして機能することにより、このサーバ装置３０が設置されたスタジオ等又は他の場所に居るユーザの発話に関する音声信号を取得し、取得した音声信号に基づいて、変換器、すなわち、音声変換アルゴリズム及び音声変換プリセット（音声変換に用いられるパラメータのセット）を取得し、取得した変換器を用いて変換された音声信号を生成し、生成された音声信号を（必要に応じて動画信号とともに）通信網１０を介して各端末装置２０に配信する、という動作等を実行することができる。或いはまた、サーバ装置３０は、インストールされた特定のアプリケーションを実行してウェブサーバとして機能することにより、各端末装置２０に送信するウェブページを介して、同様の動作を実行することができる。 In the "second aspect," the server device 30 executes a specific installed application to function as an application server, thereby acquiring an audio signal related to the speech of a user in a studio or other location where the server device 30 is installed, acquiring a converter, i.e., a voice conversion algorithm and a voice conversion preset (a set of parameters used for voice conversion), based on the acquired audio signal, generating a converted audio signal using the acquired converter, and distributing the generated audio signal (together with a video signal, if necessary) to each terminal device 20 via the communication network 10. Alternatively, the server device 30 executes a specific installed application to function as a web server, thereby performing similar operations via a web page that is sent to each terminal device 20.

さらにまた、サーバ装置３０は、「第３の態様」では、インストールされた特定のアプリケーションを実行してアプリケーションサーバとして機能することにより、スタジオ等又は他の場所に設置されたスタジオユニット４０からこのスタジオ等に居るユーザの音声信号を（必要に応じて動画信号とともに）、通信網１０を介して受信し、受信した音声信号を（必要に応じて動画信号とともに）通信網１０を介して各端末装置２０に配信する、という動作等を実行することができる。 Furthermore, in the "third aspect," the server device 30 executes a specific installed application and functions as an application server, thereby receiving, via the communication network 10, audio signals (along with video signals, if necessary) of users in a studio or other location from a studio unit 40 installed in the studio or other location, and distributing the received audio signals (along with video signals, if necessary) to each terminal device 20 via the communication network 10.

なお、サーバ装置３０は、中央処理装置３１に代えて又は中央処理装置３１とともに、１又はそれ以上のマイクロプロセッサ、及び／又は、グラフィックスプロセッシングユニット（ＧＰＵ）を含むものであってもよい。或いはまた、サーバ装置３０は、インストールされた特定のアプリケーションを実行してウェブサーバとして機能することにより、各スタジオユニット４０に送信するウェブページを介して、同様の動作を実行することができる。 The server device 30 may include one or more microprocessors and/or a graphics processing unit (GPU) instead of or in addition to the central processing unit 31. Alternatively, the server device 30 may execute a specific installed application to function as a web server, thereby performing similar operations via web pages that it transmits to each studio unit 40.

２－３．スタジオユニット４０のハードウェア構成
スタジオユニット４０は、パーソナルコンピュータ等の情報処理装置により実装可能なものであって、図示はされていないが、上述した端末装置２０及びサーバ装置３０と同様に、主に、中央処理装置と、主記憶装置と、入出力インタフェイス装置と、入力装置と、補助記憶装置と、出力装置と、を含むことができる。これら装置同士は、データバス及び／又は制御バスにより接続されている。 2-3. Hardware Configuration of Studio Unit 40 The studio unit 40 can be implemented by an information processing device such as a personal computer, and although not shown, can mainly include a central processing unit, a main memory device, an input/output interface device, an input device, an auxiliary memory device, and an output device, similar to the above-mentioned terminal device 20 and server device 30. These devices are connected to each other by a data bus and/or a control bus.

スタジオユニット４０は、インストールされた特定のアプリケーションを実行して情報処理装置として機能することにより、このスタジオユニット４０が設置されたスタジオ等又は他の場所に居るユーザの発話に関する音声信号を取得し、取得した音声信号に基づいて、変換器、すなわち、音声変換アルゴリズム及び音声変換プリセット（音声変換に用いられるパラメータのセット）を取得し、取得した変換器を用いて変換された音声信号を生成し、生成された音声信号を（必要に応じて動画信号とともに）、通信網１０を介してサーバ装置３０（例えばサーバ装置３０Ａ）に送信する、という動作等を実行することができる。或いはまた、スタジオユニット４０は、インストールされたウェブブラウザを実行することにより、サーバ装置３０からウェブページを受信及び表示して、同様の動作を実行することができる。 The studio unit 40 executes a specific installed application and functions as an information processing device, thereby acquiring an audio signal related to the speech of a user in the studio where the studio unit 40 is installed or in another location, acquiring a converter, i.e., a voice conversion algorithm and a voice conversion preset (a set of parameters used for voice conversion) based on the acquired audio signal, generating a converted audio signal using the acquired converter, and transmitting the generated audio signal (together with a video signal, if necessary) to the server device 30 (e.g., server device 30A) via the communication network 10. Alternatively, the studio unit 40 can execute an installed web browser to receive and display a web page from the server device 30 and perform similar operations.

３．各装置の機能
次に、端末装置２０、サーバ装置３０及びスタジオユニット４０の各々が有する機能の一例について説明する。 3. Functions of Each Device Next, an example of the functions of each of the terminal device 20, the server device 30, and the studio unit 40 will be described.

３－１．端末装置２０の機能
端末装置２０の機能の一例について図３を参照して説明する。図３は、図１に示した端末装置２０（サーバ装置３０）の機能の一例を模式的に示すブロック図である（なお、図３において、括弧内の参照符号は、後述するようにサーバ装置３０に関連して記載されたものである。）。 3-1. Functions of the terminal device 20 An example of the functions of the terminal device 20 will be described with reference to Fig. 3. Fig. 3 is a block diagram showing an example of the functions of the terminal device 20 (server device 30) shown in Fig. 1 (note that in Fig. 3, the reference numerals in parentheses are those described in relation to the server device 30, as described later).

図３に示すように、端末装置２０は、主に、音声入力部２１０と、特徴量抽出部２１２と、変換器取得部２１４と、記憶部２１６と、通信部２１８と、表示部２２０と、を含むことができる。さらに、端末装置２０は、特徴量変換部２２２と、音声合成部２２４と、を含むことができる。 As shown in FIG. 3, the terminal device 20 may mainly include a voice input unit 210, a feature extraction unit 212, a converter acquisition unit 214, a storage unit 216, a communication unit 218, and a display unit 220. Furthermore, the terminal device 20 may include a feature conversion unit 222 and a voice synthesis unit 224.

（１）音声入力部２１０
音声入力部２１０は、図示しないマイクを用いて、ユーザの発話に関する音声信号を入力する。なお、端末装置２０がスマートフォン、タブレット及びラップトップ型のパーソナルコンピュータ等である場合には、音声入力部２１０は、上記マイクとして、本体に内蔵されたマイクを用いることが可能である。 (1) Voice Input Unit 210
The voice input unit 210 inputs a voice signal related to the user's speech using a microphone (not shown). When the terminal device 20 is a smartphone, a tablet, a laptop-type personal computer, or the like, the voice input unit 210 can use a microphone built into the main body as the microphone.

（２）特徴量抽出部２１２
特徴量抽出部２１２は、音声入力部２１０により入力された音声信号に対して、例えば短時間フレーム分析を施すことにより、各時間フレームにおける各種の特徴量（音声特徴量）を抽出することができる。一実施形態では、特徴量抽出部２１２は、特徴量として、（i）声の高さを示す基本周波数、及び、（ii）声道の共鳴によって強調される周波数成分（例えば、第１フォルマントの周波数）を抽出することができる。 (2) Feature Extraction Unit 212
The feature extraction unit 212 can extract various features (speech features) in each time frame by, for example, performing a short-time frame analysis on the speech signal input by the speech input unit 210. In one embodiment, the feature extraction unit 212 can extract, as the features, (i) a fundamental frequency indicating the pitch of the voice, and (ii) a frequency component emphasized by resonance in the vocal tract (for example, the frequency of the first formant).
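
The short-time frame analysis described above can be illustrated with a minimal sketch. It assumes a plain autocorrelation pitch estimator, which the description does not specify; the function names `split_frames` and `estimate_f0`, the 8 kHz sample rate, the frame length, and the 50 to 500 Hz search range are all hypothetical choices made for illustration only.

```python
import math

def split_frames(signal, frame_len, hop):
    """Cut a signal into overlapping short-time frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def estimate_f0(frame, sample_rate, f_min=50.0, f_max=500.0):
    """Estimate the fundamental frequency (Hz) of one frame by autocorrelation.

    Assumption: the frame is voiced and its pitch lies in [f_min, f_max].
    """
    lag_min = int(sample_rate / f_max)  # shortest pitch period considered
    lag_max = int(sample_rate / f_min)  # longest pitch period considered
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, min(lag_max, len(frame) - 1) + 1):
        # Correlate the frame with a copy of itself shifted by `lag` samples.
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

# Synthetic test input: a pure 200 Hz tone, 0.2 s long.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 200.0 * n / SAMPLE_RATE) for n in range(1600)]
frames = split_frames(tone, frame_len=400, hop=200)
f0_per_frame = [estimate_f0(frame, SAMPLE_RATE) for frame in frames]
```

Each entry of `f0_per_frame` is the pitch estimate for one time frame. Real speech would additionally need a voiced/unvoiced decision and a separate formant estimator, neither of which is sketched here.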

（３）変換器取得部２１４
変換器取得部２１４は、特徴量抽出部２１２により抽出された特徴量を用いて、ユーザにより用いられるべき１又は複数の変換器を取得することができる。ここで、「変換器」とは、ユーザの発話に関する音声信号であって変換対象である音声信号から抽出される少なくとも１つの特徴量をどのように変換するかを示すパラメータ（例えば、基本周波数をどの程度増加又は低下させるかを示すパラメータ、第１フォルマントの周波数をいずれの周波数の範囲に移動させるかを示すパラメータ等）有するものである。 (3) Converter acquisition unit 214
The converter acquisition unit 214 can use the features extracted by the feature extraction unit 212 to acquire one or more converters to be used by the user. Here, a "converter" has parameters indicating how to convert at least one feature extracted from the speech signal to be converted, that is, the speech signal related to the user's speech (e.g., a parameter indicating how much to increase or decrease the fundamental frequency, a parameter indicating the frequency range into which the frequency of the first formant should be moved, etc.).

（４）記憶部２１６
記憶部２１６は、端末装置２０の動作に必要な様々な情報を記憶するものである。例えば、記憶部２１６は、音声／動画配信用のアプリケーション、音声／動画視聴用のアプリケーション、ボイスチェンジャー機能を有するアプリケーション、及び／又は、ウェブブラウザ等を含む様々なアプリケーションと、これらのアプリケーションにより必要とされる及び／又は生成される様々な情報・信号・データ等と、を記憶することができる。 (4) Storage unit 216
The storage unit 216 stores various information necessary for the operation of the terminal device 20. For example, the storage unit 216 can store various applications including an application for audio/video distribution, an application for audio/video viewing, an application with a voice changer function, and/or a web browser, and various information, signals, data, etc. required and/or generated by these applications.

（５）通信部２１８
通信部２１８は、ユーザの発話に関する音声信号に用いるべき変換器を取得するに際して必要とされる情報及び／又は生成される情報、ユーザの発話に関する音声信号に対して、取得した変換器を用いて生成（加工）された音声信号等、を含む様々な情報を、通信網１０を介してサーバ装置３０及び／又は他の端末装置２０等との間で送受信することができる。 (5) Communication unit 218
The communication unit 218 can transmit and receive various information between the server device 30 and/or other terminal devices 20, etc. via the communication network 10, including information required and/or information generated when obtaining a converter to be used for the audio signal related to the user's speech, an audio signal generated (processed) using the obtained converter for the audio signal related to the user's speech, etc.

（６）表示部２２０
表示部２２０は、音声／動画配信用のアプリケーション、音声／動画視聴用のアプリケーション、ボイスチェンジャー機能を有するアプリケーション、及び／又は、ウェブブラウザ等を含む様々なアプリケーションの実行により生成される様々な情報を、タッチパネル及びディスプレイ等を介して、ユーザに表示することができる。 (6) Display unit 220
The display unit 220 can display to the user, via a touch panel, display, etc., various information generated by the execution of various applications including applications for audio/video distribution, applications for audio/video viewing, applications with voice changer functions, and/or web browsers.

（７）特徴量変換部２２２
特徴量変換部２２２は、ユーザの発話に関する音声信号から抽出した少なくとも１つの特徴量を、変換器取得部２１４により取得された変換器（音声変換アルゴリズム及び音声変換に用いられるパラメータのセット（プリセット））を用いて変換し、変換された少なくとも１つの特徴量を、音声合成部２２４に出力することができる。 (7) Feature Transformation Unit 222
The feature conversion unit 222 converts at least one feature extracted from a voice signal related to the user's speech using a converter (a voice conversion algorithm and a set of parameters (presets) used for voice conversion) acquired by the converter acquisition unit 214, and outputs the at least one converted feature to the voice synthesis unit 224.

（８）音声合成部２２４
音声合成部２２４は、特徴量変換部２２２から入力した、変換された少なくとも１つの特徴量を用いて音声合成処理を行うことにより、ユーザの音声が加工された音声信号を生成することができる。例えば、音声合成部２２４は、変換された少なくとも１つの特徴量から、ボコーダを用いることにより、ユーザの音声が加工された音声信号を生成することができる。 (8) Voice synthesis unit 224
The voice synthesis unit 224 can generate a voice signal in which the user's voice has been processed by performing a voice synthesis process using at least one converted feature input from the feature conversion unit 222. For example, the voice synthesis unit 224 can generate a voice signal in which the user's voice has been processed from the at least one converted feature by using a vocoder.

上述した各部の動作は、ユーザの端末装置２０にインストールされた所定のアプリケーション（例えば音声／動画配信用のアプリケーション、及び／又は、ボイスチェンジャー機能を有するアプリケーション等。ここでいうアプリケーションに代えて又はアプリケーションとともに、ミドルウェアを用いることも可能である。）がこの端末装置２０により実行されることにより、この端末装置２０により実行され得るものである。 The operation of each of the above-mentioned parts can be performed by the terminal device 20 by executing a specific application (e.g., an application for audio/video distribution and/or an application with a voice changer function, etc.; it is also possible to use middleware instead of or together with the application) installed on the user's terminal device 20.

３－２．サーバ装置３０の機能
サーバ装置３０の機能の具体例について同じく図３を参照して説明する。サーバ装置３０の機能としては、例えば、上述した端末装置２０の機能の少なくとも一部を用いることが可能である。したがって、サーバ装置３０が有する構成要素に対する参照符号は、図３において括弧内に示されている。 3-2. Functions of the server device 30 A specific example of the functions of the server device 30 will be described with reference to Fig. 3. For example, at least some of the functions of the terminal device 20 described above can be used as the functions of the server device 30. Therefore, the reference characters for the components of the server device 30 are shown in parentheses in Fig. 3.

まず、上述した「第２の態様」では、サーバ装置３０は、以下に述べる相違点を除き、音声入力部３１０～音声合成部３２４として、それぞれ、端末装置２０に関連して説明した音声入力部２１０～音声合成部２２４と同一のものを有するものとすることができる。 First, in the above-mentioned "second aspect," the server device 30 can have the voice input unit 310 to the voice synthesis unit 324 that are the same as the voice input unit 210 to the voice synthesis unit 224 described in relation to the terminal device 20, respectively, except for the differences described below.

但し、この「第２の態様」では、サーバ装置３０は、スタジオ等又は他の場所に配置され、複数のユーザにより用いられることが想定され得る。したがって、記憶部３１６は、複数のユーザの各々に対応付けて、取得した変換器等を含む様々な情報を記憶することができる。 However, in this "second aspect," it is possible that the server device 30 is placed in a studio or other location and is used by multiple users. Therefore, the storage unit 316 can store various information, including the acquired converter, in association with each of the multiple users.
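
The per-user association held by the storage unit 316 can be pictured as a simple keyed store. The sketch below is only an illustration under assumptions: the class name `ConverterStore`, its methods, and the user identifiers are hypothetical and do not appear in the description.

```python
from collections import defaultdict

class ConverterStore:
    """Hypothetical in-memory stand-in for per-user storage of acquired converters."""

    def __init__(self):
        # Maps a user identifier to the list of converter names acquired for that user.
        self._by_user = defaultdict(list)

    def save(self, user_id, converter_name):
        """Record a converter for a user, ignoring duplicates."""
        if converter_name not in self._by_user[user_id]:
            self._by_user[user_id].append(converter_name)

    def converters_for(self, user_id):
        """Return a copy of the converters stored for a user (empty if none)."""
        return list(self._by_user.get(user_id, []))

store = ConverterStore()
store.save("user_a", "A_M")
store.save("user_a", "B_M")
store.save("user_b", "C_F")
store.save("user_a", "A_M")  # duplicate, not stored twice
```

Keeping the data keyed by user lets several users share one server device in a studio without their acquired converters being mixed up.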

さらに、音声入力部３１０により用いられる又は音声入力部３１０に含まれるマイクは、サーバ装置３０が設置されるスタジオ等又は他の場所において、ユーザが発話を行う空間においてユーザに対向して配置され得るものである。同様に、表示部３２０を構成するディスプレイやタッチパネル等もまた、ユーザが発話を行う空間においてユーザに対向して又はユーザの近くに配置され得るものである。 Furthermore, the microphone used by or included in the voice input unit 310 can be placed facing the user in the space in which the user speaks, such as in a studio where the server device 30 is installed or in another location. Similarly, the display, touch panel, etc. that constitute the display unit 320 can also be placed facing the user or near the user in the space in which the user speaks.

通信部３１８は、ユーザの発話に関する音声信号に用いるべき変換器を取得するに際して必要とされる情報及び／又は生成される情報、ユーザの発話に関する音声信号に対して、取得した変換器を用いて生成（加工）された音声信号等、を含む様々な情報を、通信網１０を介して、他のサーバ装置３０及び／又は各端末装置２０等との間で送受信することができる。 The communication unit 318 can transmit and receive various information, including information required and/or generated when acquiring a converter to be used for a voice signal related to the user's speech, a voice signal generated (processed) using the acquired converter for the voice signal related to the user's speech, etc., between other server devices 30 and/or each terminal device 20, etc., via the communication network 10.

また、通信部３１８は、各ユーザに対応付けて記憶部３１６に記憶された音声信号及び／又は動画信号を格納したファイル等を、通信網１０を介して複数の端末装置２０に配信することができる。これら複数の端末装置２０の各々は、インストールされた所定のアプリケーション（例えば音声／動画視聴用のアプリケーション、及び／又は、ボイスチェンジャー機能を有するアプリケーション等。ここでいうアプリケーションに代えて又はアプリケーションとともに、ミドルウェアを用いることも可能である。）を実行して、サーバ装置３０に対して所望の動画の配信を要求する信号（リクエスト信号）を送信することにより、この信号に応答したサーバ装置３０から所望の音声信号及び／又は動画信号を格納したファイル等を当該所定のアプリケーションを介して受信することができる。 The communication unit 318 can also distribute files and the like containing audio signals and/or video signals stored in the storage unit 316 in association with each user to a plurality of terminal devices 20 via the communication network 10. Each of these terminal devices 20 executes a predetermined installed application (e.g., an application for audio/video viewing and/or an application with a voice changer function; middleware can also be used instead of or together with the application) and transmits to the server device 30 a signal (request signal) requesting delivery of a desired video. It can then receive, via the predetermined application, files and the like containing the desired audio signals and/or video signals from the server device 30 that responds to this signal.

なお、記憶部３１６に記憶される情報（音声信号及び／又は動画信号を格納したファイル等）は、当該サーバ装置３０に通信網１０を介して通信可能な１又はそれ以上の他のサーバ装置（ストレージ）３０に記憶されるようにしてもよい。 In addition, the information stored in the memory unit 316 (such as a file storing an audio signal and/or a video signal) may be stored in one or more other server devices (storage) 30 that can communicate with the server device 30 via the communication network 10.

一方、上述した「第１の態様」では、上記「第２の態様」において用いられた音声入力部３１０～変換器取得部３１４、表示部３２０、特徴量変換部３２２及び音声合成部３２４をオプションとして用いることができる。通信部３１８は、上記のように動作することに加えて、各端末装置２０により送信され通信網１０から受信した、音声信号及び／又は動画信号を格納したファイル等を、記憶部３１６に記憶させた上で、複数の端末装置２０に対して配信することができる。 On the other hand, in the above-mentioned "first aspect," the voice input unit 310 to converter acquisition unit 314, display unit 320, feature conversion unit 322, and voice synthesis unit 324 used in the above-mentioned "second aspect" can be used as options. In addition to operating as described above, the communication unit 318 can store in the storage unit 316 files containing voice signals and/or video signals transmitted by each terminal device 20 and received from the communication network 10, and then distribute them to multiple terminal devices 20.

他方、「第３の態様」では、上記「第２の態様」において用いられた音声入力部３１０～変換器取得部３１４、表示部３２０、特徴量変換部３２２及び音声合成部３２４をオプションとして用いることができる。通信部３１８は、上記のように動作することに加えて、スタジオユニット４０により送信され通信網１０から受信した、音声信号及び／又は動画情報を格納したファイル等を、記憶部３１６に記憶させた上で、複数の端末装置２０に対して配信することができる。 On the other hand, in the "third aspect", the voice input unit 310 to converter acquisition unit 314, display unit 320, feature conversion unit 322 and voice synthesis unit 324 used in the above "second aspect" can be used as options. In addition to operating as described above, the communication unit 318 can store files containing voice signals and/or video information transmitted by the studio unit 40 and received from the communication network 10 in the storage unit 316 and distribute them to multiple terminal devices 20.

３－３．スタジオユニット４０の機能
スタジオユニット４０は、図３に示した端末装置２０又はサーバ装置３０と同様の構成を有することにより、端末装置２０又はサーバ装置３０と同様の動作を行うことが可能である。但し、通信部２１８（３１８）は、記憶部２１６（３１６）に記憶された、音声信号及び／又は動画信号を格納したファイル等、通信網１０を介してサーバ装置３０に送信することができる。 3-3. Functions of the studio unit 40 The studio unit 40 has a similar configuration to the terminal device 20 or the server device 30 shown in Fig. 3, and is therefore capable of performing the same operations as the terminal device 20 or the server device 30. However, the communication unit 218 (318) can transmit files, etc., that contain audio signals and/or video signals stored in the storage unit 216 (316) to the server device 30 via the communication network 10.

音声入力部２１０（３１０）により用いられる又は音声入力部２１０（３１０）に含まれるマイクは、スタジオユニット４０が設置されるスタジオ等又は他の場所において、ユーザが発話を行う空間においてユーザに対向して配置され得るものである。同様に、表示部２２０（３２０）を構成するディスプレイやタッチパネル等もまた、ユーザが発話を行う空間においてユーザに対向して又はユーザの近くに配置され得るものである。 The microphone used by or included in the audio input unit 210 (310) can be placed facing the user in the space in which the user speaks, such as in the studio where the studio unit 40 is installed or in another location. Similarly, the display, touch panel, etc. that constitutes the display unit 220 (320) can also be placed facing the user or near the user in the space in which the user speaks.

４．通信システム１において用いられる音声変換プリセットの機能について
次に、通信システム１において用いられる音声変換プリセットの機能について説明する。通信システム１では、特徴量の具体例として、（i）基本周波数、及び（ii）第１フォルマントの周波数が用いられる。 4. Functions of voice conversion presets used in the communication system 1 Next, a description will be given of functions of voice conversion presets used in the communication system 1. In the communication system 1, (i) the fundamental frequency and (ii) the frequency of the first formant are used as specific examples of features.

人の声は、基本周波数、周波数特性及び音圧という３つの要素により特徴付けられるものである。基本周波数は、人の声の高さを特徴付けるものであり、周波数特性は、人の声の音色を特徴付けるものであり、音圧は、人の声の大きさを特徴付けるものである。 The human voice is characterized by three elements: fundamental frequency, frequency response, and sound pressure. The fundamental frequency characterizes the pitch of the human voice, the frequency response characterizes the timbre of the human voice, and the sound pressure characterizes the loudness of the human voice.

人の声道は、共鳴によって特定の周波数成分を強調する一種のフィルタであるといえる。声道の共鳴によって強調される周波数成分がフォルマントの周波数である。フォルマントの周波数は、無数に存在するが、周波数の低いものから、順次、第１フォルマントの周波数、第２フォルマントの周波数、第３フォルマントの周波数等のように称される。図４（横軸及び縦軸にそれぞれ周波数（［Ｈｚ］）及び音圧・振幅（［ｄＢ］）が示されている）に例示されるように、周波数スペクトルにおいては、声の高さを示す基本周波数の後に、第１フォルマントの周波数、第２フォルマントの周波数等が順次続く。 The human vocal tract can be thought of as a kind of filter that emphasizes certain frequency components through resonance. The frequency components emphasized by resonance in the vocal tract are formant frequencies. There are an infinite number of formant frequencies, but from the lowest frequency they are called the first formant frequency, the second formant frequency, the third formant frequency, and so on. As illustrated in Figure 4 (where the horizontal and vertical axes show frequency ([Hz]) and sound pressure/amplitude ([dB]), respectively), in the frequency spectrum, the fundamental frequency indicating the pitch of the voice is followed by the first formant frequency, the second formant frequency, and so on.

通信システム１において用意される複数の音声変換プリセットの各々は、ユーザの発話に関する音声信号から抽出された基本周波数及び第１フォルマントの周波数を、その音声変換プリセットにより定められた変化量に応じて変換するものである。 Each of the multiple voice conversion presets provided in communication system 1 converts the fundamental frequency and the first formant frequency extracted from the voice signal related to the user's speech according to the amount of change determined by that voice conversion preset.

具体的には、図５Ａ、図５Ｂ及び図５Ｃに示すように、基本周波数（ｐｉｔｃｈ）（のオクターブ表現）を示す第１軸（横軸）と第１フォルマント（１ｓｔｆｏｒｍａｎｔ）の周波数（のオクターブ表現）を示す第２軸（縦軸）とにより定められる２次元座標系（以下「ｐｆ平面」と称する）を考える。 Specifically, as shown in Figures 5A, 5B, and 5C, consider a two-dimensional coordinate system (hereinafter referred to as the "pf plane") defined by a first axis (horizontal axis) indicating the fundamental frequency (pitch) (its octave expression) and a second axis (vertical axis) indicating the frequency of the first formant (1st formant) (its octave expression).

例えば、基本周波数ｆ_P１及び第１フォルマントの周波数ｆ_F１を有する標準的な男性の声が、ｐｆ平面において「標準男性」（０,０）として配置される。 For example, a standard male voice with a fundamental frequency f_P1 and a first formant frequency f_F1 is placed as "standard male" (0,0) on the pf plane.

一般的に、女性の基本周波数は、男性の基本周波数を１２ｐｉｔｃｈ増加させることにより得られることが分かっている。但し、８ｐｉｔｃｈが１物理的オクターブに相当するものとする。また、一般的には、基本周波数ｐと第１フォルマントｆとの間には、ｆ＝ｐ／３という関係が成り立ち得る。したがって、標準的な女性の声が、ｐｆ平面において「標準女性」（１２,４）として仮に配置される。これは、基本周波数ｆ_P２及び第１フォルマントの周波数ｆ_F２を有する標準的な女性の声が、ｐｆ平面において「標準女性」（１２,４）として配置されることを意味する。 It is generally known that a female fundamental frequency can be obtained by increasing a male fundamental frequency by 12 pitches. Here, 8 pitches correspond to one physical octave. In addition, the relationship f = p/3 can generally hold between the fundamental frequency p and the first formant f. Therefore, a standard female voice is provisionally placed as "standard female" (12,4) on the pf plane. This means that a standard female voice having a fundamental frequency f_P2 and a first formant frequency f_F2 is placed as "standard female" (12,4) on the pf plane.
さらに、中性の声が、標準男性（０,０）と標準女性（１２,４）との中点において「中性（６,２）」として配置される。 Additionally, a neutral voice is placed as "neutral" (6,2) at the midpoint between standard male (0,0) and standard female (12,4).
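
The placement of the three reference voices on the pf plane follows from simple arithmetic. The sketch below reproduces it in pf-plane units, using the relationship f = p/3 and the 12-pitch offset stated above; the names `STANDARD_MALE`, `formant_from_pitch`, and `midpoint` are hypothetical.

```python
# Coordinates are (pitch, first formant) in the pf-plane units used in the description.
STANDARD_MALE = (0.0, 0.0)

def formant_from_pitch(pitch):
    """Relationship f = p / 3 between fundamental frequency p and first formant f."""
    return pitch / 3.0

def midpoint(a, b):
    """Midpoint of two points on the pf plane."""
    return ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)

# A female fundamental frequency is taken to be 12 pitches above the male one.
FEMALE_PITCH_OFFSET = 12.0
STANDARD_FEMALE = (FEMALE_PITCH_OFFSET, formant_from_pitch(FEMALE_PITCH_OFFSET))

# The neutral voice sits halfway between the two.
NEUTRAL = midpoint(STANDARD_MALE, STANDARD_FEMALE)
```

With these definitions, `STANDARD_FEMALE` evaluates to (12.0, 4.0) and `NEUTRAL` to (6.0, 2.0), matching the coordinates given in the text.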

図５Ａには、男性の声を変換する音声変換プリセットの例（Ａ_Ｍ、Ｂ_Ｍ及びＣ_Ｍ）が示され、図５Ｂには、女性の声を変換する音声変換プリセット（Ａ_Ｆ、Ｂ_Ｆ及びＣ_Ｆ）の例が示されている。図５Ｃには、中性の声を変換する音声変換プリセット（Ａ_Ｎ、Ｂ_Ｎ及びＣ_Ｎ）の例が示されている。なお、Ａ、Ｂ及びＣは、それぞれ、キャラクターＡ、Ｂ及びＣの声を目標として入力音声信号を変換する音声変換プリセットの名称を示し、添字Ｍは、男性用の入力音声信号を変換するプリセットを示し、添字Ｆは、女性用の入力音声信号を変換するプリセットを示し、添字Ｎは、中性用の入力音声信号を変換するプリセットを示すものである。 FIG. 5A shows examples of voice conversion presets (A_M, B_M, and C_M) for converting a male voice, and FIG. 5B shows examples of voice conversion presets (A_F, B_F, and C_F) for converting a female voice. FIG. 5C shows examples of voice conversion presets (A_N, B_N, and C_N) for converting a neutral voice. Note that A, B, and C are the names of voice conversion presets that convert an input voice signal toward the voices of characters A, B, and C, respectively; the subscript M indicates a preset for converting a male input voice signal, the subscript F a preset for converting a female input voice signal, and the subscript N a preset for converting a neutral input voice signal.

まず、図５Ａを参照すると、各音声変換プリセットは、標準男性の基本周波数（男性用の第１基準値）（＝０）を基準とした基本周波数の変化量を定め、標準男性の第１フォルマントの周波数（男性用の第２基準値）（＝０）を基準とした第１フォルマントの周波数の変化量を定めるものである。例えば、音声変換プリセットＡ_Ｍ（１７,６）は、入力音声信号の基本周波数及び第１フォルマントの周波数がｐｆ平面上において（０,０）に配置されると仮定して、その入力音声信号の基本周波数を１７ｐｉｔｃｈ増加させ、その入力音声信号の第１フォルマントの周波数を６ｆｏｒｍａｎｔ増加させるものである。 First, referring to FIG. 5A, each voice conversion preset defines the amount of change in fundamental frequency relative to the standard male fundamental frequency (first reference value for males) (=0), and the amount of change in the frequency of the first formant relative to the standard male first formant frequency (second reference value for males) (=0). For example, the voice conversion preset A_M (17,6) increases the fundamental frequency of the input voice signal by 17 pitches and increases the frequency of the first formant of the input voice signal by 6 formants, assuming that the fundamental frequency and the frequency of the first formant of the input voice signal are located at (0,0) on the pf plane.

同様に、音声変換プリセットＢ_Ｍ（９,３）は、入力音声信号の基本周波数及び第１フォルマントの周波数がｐｆ平面上において（０,０）に配置されると仮定して、その入力音声信号の基本周波数を９ｐｉｔｃｈ増加させ、その入力音声信号の第１フォルマントの周波数を３ｆｏｒｍａｎｔ増加させるものである。さらに同様に、音声変換プリセットＣ_Ｍ（－３,－１）は、入力音声信号の基本周波数及び第１フォルマントの周波数がｐｆ平面上において（０,０）に配置されると仮定して、その入力音声信号の基本周波数を３ｐｉｔｃｈ減少させ、その入力音声信号の第１フォルマントの周波数を１ｆｏｒｍａｎｔ減少させるものである。 Similarly, the voice conversion preset B_M (9,3) increases the fundamental frequency of the input voice signal by 9 pitches and increases the frequency of the first formant of the input voice signal by 3 formants, assuming that the fundamental frequency and the frequency of the first formant of the input voice signal are located at (0,0) on the pf plane. Likewise, the voice conversion preset C_M (-3,-1) decreases the fundamental frequency of the input voice signal by 3 pitches and decreases the frequency of the first formant of the input voice signal by 1 formant, under the same assumption.

次に、図５Ｂを参照すると、各音声変換プリセットは、標準女性の基本周波数（女性用の第１基準値）（＝１２）を基準とした基本周波数の変化量を定め、標準女性の第１フォルマントの周波数（女性用の第２基準値）（＝４）を基準とした第１フォルマントの周波数の変化量を定めるものである。例えば、音声変換プリセットＡ_Ｆ（５,３）は、入力音声信号の基本周波数及び第１フォルマントの周波数がｐｆ平面上において（１２,４）に配置されると仮定して、その入力音声信号の基本周波数を５ｐｉｔｃｈ増加させ、その入力音声信号の第１フォルマントの周波数を３ｆｏｒｍａｎｔ増加させるものである。 Next, referring to FIG. 5B, each voice conversion preset defines the amount of change in fundamental frequency relative to the standard female fundamental frequency (first reference value for females) (=12), and the amount of change in the frequency of the first formant relative to the standard female first formant frequency (second reference value for females) (=4). For example, the voice conversion preset A_F (5,3) increases the fundamental frequency of the input voice signal by 5 pitches and increases the frequency of the first formant of the input voice signal by 3 formants, assuming that the fundamental frequency and the frequency of the first formant of the input voice signal are located at (12,4) on the pf plane.

同様に、音声変換プリセットＢ_Ｆ（－３,０）は、入力音声信号の基本周波数及び第１フォルマントの周波数がｐｆ平面上において（１２,４）に配置されると仮定して、その入力音声信号の基本周波数を３ｐｉｔｃｈ減少させ、その入力音声信号の第１フォルマントの周波数を変化させない（そのまま維持する）ものである。さらに同様に、音声変換プリセットＣ_Ｆ（－１５,－４）は、入力音声信号の基本周波数及び第１フォルマントの周波数がｐｆ平面上において（１２,４）に配置されると仮定して、その入力音声信号の基本周波数を１５ｐｉｔｃｈ減少させ、その入力音声信号の第１フォルマントの周波数を４ｆｏｒｍａｎｔ減少させるものである。 Similarly, the voice conversion preset B_F (-3,0) decreases the fundamental frequency of the input voice signal by 3 pitches and leaves the frequency of the first formant of the input voice signal unchanged, assuming that the fundamental frequency and the frequency of the first formant of the input voice signal are located at (12,4) on the pf plane. Likewise, the voice conversion preset C_F (-15,-4) decreases the fundamental frequency of the input voice signal by 15 pitches and decreases the frequency of the first formant of the input voice signal by 4 formants, under the same assumption.

次に、図５Ｃを参照すると、各音声変換プリセットは、中性の基本周波数（中性用の第１基準値）（＝６）を基準とした基本周波数の変化量を定め、中性の第１フォルマントの周波数（中性用の第２基準値）（＝２）を基準とした第１フォルマントの周波数の変化量を定めるものである。例えば、音声変換プリセットＡ_Ｎ（１１,２.５）は、入力音声信号の基本周波数及び第１フォルマントの周波数がｐｆ平面上において（６,２）に配置されると仮定して、その入力音声信号の基本周波数を１１ｐｉｔｃｈ増加させ、その入力音声信号の第１フォルマントの周波数を２ｆｏｒｍａｎｔ増加させるものである。 Next, referring to FIG. 5C, each voice conversion preset defines the amount of change in fundamental frequency relative to the neutral fundamental frequency (first reference value for neutral) (=6), and the amount of change in the frequency of the first formant relative to the neutral first formant frequency (second reference value for neutral) (=2). For example, the voice conversion preset A_N (11,2.5) increases the fundamental frequency of the input voice signal by 11 pitches and increases the frequency of the first formant of the input voice signal by 2 formants, assuming that the fundamental frequency and the frequency of the first formant of the input voice signal are located at (6,2) on the pf plane.

同様に、音声変換プリセットＢ_Ｎ（２.５,３）は、入力音声信号の基本周波数及び第１フォルマントの周波数がｐｆ平面上において（６,２）に配置されると仮定して、その入力音声信号の基本周波数を２.５ｐｉｔｃｈ増加させ、その入力音声信号の第１フォルマントの周波数を３ｆｏｒｍａｎｔ増加させるものである。さらに同様に、音声変換プリセットＣ_Ｎ（－７,－４）は、入力音声信号の基本周波数及び第１フォルマントの周波数がｐｆ平面上において（６,２）に配置されると仮定して、その入力音声信号の基本周波数を７ｐｉｔｃｈ減少させ、その入力音声信号の第１フォルマントの周波数を４ｆｏｒｍａｎｔ減少させるものである。 Similarly, the voice conversion preset B_N (2.5,3) increases the fundamental frequency of the input voice signal by 2.5 pitches and increases the frequency of the first formant of the input voice signal by 3 formants, assuming that the fundamental frequency and the frequency of the first formant of the input voice signal are located at (6,2) on the pf plane. Likewise, the voice conversion preset C_N (-7,-4) decreases the fundamental frequency of the input voice signal by 7 pitches and decreases the frequency of the first formant of the input voice signal by 4 formants, under the same assumption.
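
The nine presets described for FIGS. 5A to 5C can be collected in a small lookup table to show where each one moves an input voice on the pf plane. This is a sketch under assumptions: the dictionary layout and the function `target_point` are hypothetical, and each preset is entered as the coordinate pair written next to its name (so A_N is entered as (11, 2.5)).

```python
# Reference point on the pf plane assumed for each input-voice group.
REFERENCE = {
    "M": (0.0, 0.0),   # standard male
    "F": (12.0, 4.0),  # standard female
    "N": (6.0, 2.0),   # neutral
}

# Each preset is (change in pitch, change in first formant) relative to its group's reference.
PRESETS = {
    "A_M": (17.0, 6.0), "B_M": (9.0, 3.0),  "C_M": (-3.0, -1.0),
    "A_F": (5.0, 3.0),  "B_F": (-3.0, 0.0), "C_F": (-15.0, -4.0),
    "A_N": (11.0, 2.5), "B_N": (2.5, 3.0),  "C_N": (-7.0, -4.0),
}

def target_point(preset_name):
    """Return the pf-plane point reached by applying a preset to its reference voice."""
    group = preset_name.split("_")[1]  # "M", "F" or "N"
    ref_pitch, ref_formant = REFERENCE[group]
    d_pitch, d_formant = PRESETS[preset_name]
    return (ref_pitch + d_pitch, ref_formant + d_formant)

targets = {name: target_point(name) for name in PRESETS}
```

For instance, A_M takes a standard male voice from (0, 0) to (17, 6), while A_F takes a standard female voice from (12, 4) to (17, 7), so presets sharing a character letter land in a similar region of the plane from different starting points.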

なお、ここでは、標準的な男性の声が、基本周波数ｆ_P１及び第１フォルマントの周波数ｆ_F１を有するものとして、ｐｆ平面上において（０,０）に配置される場合について説明したが、複数の男性の基本周波数及び第１フォルマントの周波数を収集し、これらの基本周波数の平均値（例えばｆ_PAVE）及びこれらの第１フォルマントの周波数の平均値（例えばｆ_FAVE）が、ｐｆ平面上において（０,０）に配置されるようにしてもよい。このように、男性用の第１基準値は、複数の男性ユーザから取得された基本周波数の平均値に基づいて設定され得るものであり、男性用の第２基準値は、複数の男性ユーザから取得された第１フォルマントの周波数の平均値に基づいて設定され得るものである。 Although the case has been described here where a standard male voice, having a fundamental frequency f_P1 and a first formant frequency f_F1, is placed at (0,0) on the pf plane, the fundamental frequencies and first formant frequencies of a plurality of males may instead be collected, and the average of these fundamental frequencies (e.g., f_PAVE) and the average of these first formant frequencies (e.g., f_FAVE) may be placed at (0,0) on the pf plane. In this way, the first reference value for males may be set based on the average of the fundamental frequencies obtained from a plurality of male users, and the second reference value for males may be set based on the average of the first formant frequencies obtained from a plurality of male users.

Similarly, although the case has been described here in which a standard female voice is placed at (12, 4) on the pf plane, the fundamental frequencies and first formant frequencies of a plurality of females may instead be collected, and the average of these fundamental frequencies (e.g., f_PAVE2) and the average of these first formant frequencies (e.g., f_FAVE2) may be placed at (12, 4) on the pf plane. In this way, the first reference value for females can be set based on the average of the fundamental frequencies obtained from a plurality of female users, and the second reference value for females can be set based on the average of the first formant frequencies obtained from a plurality of female users.

5. Operation of communication system 1
Next, a specific example of the operation of the communication system 1 having the above-described configuration will be described with reference to FIG. 6. FIG. 6 is a flow diagram showing an example of the operation performed in the communication system 1 shown in FIG. 1. The description here focuses on the case where (i) the fundamental frequency and (ii) the first formant frequency are used as the features.

図６を参照すると、まず、ステップ（以下「ＳＴ」という。）６００において、対象ユーザＡの端末装置２０が、この対象ユーザＡの発話に関する音声信号をサンプルとして入力することができる。具体的には、まず、端末装置２０は、対象ユーザＡに対して、対象ユーザＡの性別（男性、女性又は中性）、及び、用意された複数のキャラクター（例えば、図７に例示するキャラクターＡ～Ｌ）の中から対象ユーザＡが希望するキャラクターを指定するように、表示部２２０に表示されたユーザインタフェイスを介して要求することができる。 Referring to FIG. 6, first, in step (hereinafter referred to as "ST") 600, the terminal device 20 of the target user A can input a sample of an audio signal related to the speech of the target user A. Specifically, the terminal device 20 can first request the target user A, via a user interface displayed on the display unit 220, to specify the gender of the target user A (male, female, or neutral) and a desired character from among a plurality of prepared characters (for example, characters A to L illustrated in FIG. 7).

端末装置２０は、図７に例示されるように、各キャラクター（各音声変換プリセット）に固有のサンプルとなるセリフを記憶する（又はサーバ装置３０から受信する）ことができる。各キャラクター（各音声変換プリセット）に個別に用意されたセリフは、同一の音素が並ぶように設定可能なものである。また、各キャラクターに固有のセリフは、そのキャラクターのイメージに沿って多様な抑揚が付与されたものとされ得る。これにより、このセリフを発話することにより得られる音声信号にあっては、ユーザごとに同様の抑揚が生ずる可能性が高くなる、すなわち、最も高い声（最も高い周波数）、最も低い声（最も低い周波数）及びその中間の声（その中間の周波数）が生ずる可能性が高くなる。さらに、各キャラクターに固有のセリフは、端末装置２０によりユーザインタフェイスを介して指定された発話開始時間から発話終了時間までの間（例えば約１０秒間）に、各ユーザにより発話されるものであるため、各ユーザ間において、発話を開始するタイミング及び発話を終了するタイミングが略一致するようになっている。なお、図７に例示した各セリフにおいて、〇〇という部分は、各ユーザに固有の名前等に相当し得る。よって、この部分は、ユーザ毎に異なるセリフとなるが、統計的誤差として吸収可能なものである。 As illustrated in FIG. 7, the terminal device 20 can store (or receive from the server device 30) sample lines specific to each character (each voice conversion preset). The lines individually prepared for each character (each voice conversion preset) can be set so that the same phonemes are arranged. In addition, the lines specific to each character can be given various intonations in accordance with the image of the character. This increases the likelihood that the voice signal obtained by speaking the lines will have similar intonations for each user, that is, the highest voice (highest frequency), the lowest voice (lowest frequency), and a voice in between (intermediate frequencies) will be produced. Furthermore, the lines specific to each character are spoken by each user between the speech start time and speech end time (for example, about 10 seconds) specified by the terminal device 20 via the user interface, so that the timing of starting and ending the speech is approximately the same for each user. In addition, in each of the lines shown in FIG. 7, the part marked with "XX" may correspond to a name or the like that is unique to each user. Therefore, this part will be a different line for each user, but this can be absorbed as a statistical error.

次に、端末装置２０は、対象ユーザＡにより指定されたキャラクターについて個別に用意されたセリフを表示部２２０に表示して、発話開始時間から発話終了時間までの間に対象ユーザにそのセリフを発話させる。これにより、端末装置２０は、対象ユーザＡの発話に関する音声信号（サンプル音声信号）を取得することができる。 Next, the terminal device 20 displays lines prepared individually for the character designated by the target user A on the display unit 220, and has the target user speak the lines between the speech start time and the speech end time. This allows the terminal device 20 to acquire an audio signal (sample audio signal) related to the speech of the target user A.

次に、ＳＴ６０２において、端末装置２０は、サンプル音声信号を用いて、対象ユーザＡの声と各音声変換プリセットに対応する声との距離を算出する。この処理について、図８を参照して説明する。図８は、図１に示した通信システム１において行われる対象ユーザの声と各音声変換プリセットに対応する声との距離を算出する方法の一例を示すフロー図である。 Next, in ST602, the terminal device 20 uses the sample voice signal to calculate the distance between the voice of the target user A and the voice corresponding to each voice conversion preset. This process will be described with reference to FIG. 8. FIG. 8 is a flow diagram showing an example of a method for calculating the distance between the voice of the target user and the voice corresponding to each voice conversion preset, which is performed in the communication system 1 shown in FIG. 1.

図８を参照すると、ＳＴ７００において、端末装置２０が、ＳＴ６００において取得したサンプル音声信号を用いて、基本周波数を参照基本周波数として取得することができる。具体的には、端末装置２０は、対象ユーザＡについて得られたサンプル音声信号に対して任意の既知の信号処理を実行することにより基本周波数を抽出することができる。 Referring to FIG. 8, in ST700, the terminal device 20 can acquire the fundamental frequency as a reference fundamental frequency using the sample voice signal acquired in ST600. Specifically, the terminal device 20 can extract the fundamental frequency by performing any known signal processing on the sample voice signal acquired for the target user A.

既知の信号処理の第１の手法として、ゼロ交差法を用いた手法を利用することが可能である。図９は、図１に示した通信システム１において基本周波数（及び第１フォルマントの周波数）を取得するために用いられる方法の一例を示すブロック図である。 As a first known signal processing technique, a technique using the zero-crossing method can be used. Figure 9 is a block diagram showing an example of a method used to obtain the fundamental frequency (and the frequency of the first formant) in the communication system 1 shown in Figure 1.

As illustrated in FIG. 9, the sample voice signal is input to, for example, M filters (filters 710A_1 to 710A_M). Each of these filters functions as a band-pass filter and can output only those frequency components of the input sample voice signal that fall within the pass band specific to that filter.

The calculation units 710B_1 to 710B_M can each calculate a fundamental-frequency likelihood based on the zero-crossing method, using the signals output by the filters 710A_1 to 710A_M, respectively. The selection unit 710C can select the most reliable of the fundamental-frequency likelihoods calculated by the calculation units 710B_1 to 710B_M, and output the frequency corresponding to the selected likelihood as the fundamental frequency of the sample voice signal.
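The arrangement of FIG. 9 can be sketched roughly as follows. This is a minimal illustration, not the implementation in the specification: the band-pass filters themselves are omitted (`band_outputs` is assumed to already hold the outputs of filters 710A_1 to 710A_M), and the reliability score based on how uniform the crossing intervals are is an assumption, since the text does not define how "most reliable" is measured. All function names are hypothetical.

```python
import math


def zero_crossing_frequency(samples, sample_rate):
    """Estimate a frequency from the spacing of upward zero crossings.

    Returns (frequency_hz, reliability). The reliability approaches 1.0 as the
    intervals between crossings become more uniform (an assumed measure).
    """
    crossings = []
    for i in range(1, len(samples)):
        if samples[i - 1] < 0.0 <= samples[i]:
            # Linear interpolation gives a sub-sample crossing time.
            frac = -samples[i - 1] / (samples[i] - samples[i - 1])
            crossings.append((i - 1 + frac) / sample_rate)
    if len(crossings) < 3:
        return 0.0, 0.0
    periods = [b - a for a, b in zip(crossings, crossings[1:])]
    mean = sum(periods) / len(periods)
    variance = sum((p - mean) ** 2 for p in periods) / len(periods)
    reliability = 1.0 / (1.0 + math.sqrt(variance) / mean)
    return 1.0 / mean, reliability


def select_fundamental(band_outputs, sample_rate):
    """Return the candidate frequency with the highest reliability.

    Plays the role of the selection unit 710C over the per-band candidates.
    """
    candidates = [zero_crossing_frequency(band, sample_rate) for band in band_outputs]
    return max(candidates, key=lambda c: c[1])[0]
```

A band whose output is close to a single sinusoid yields nearly uniform crossing intervals and therefore a high score, which is why it wins over a band that still contains several mixed components.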

As a second known signal processing technique, the terminal device 20 can extract the fundamental frequency by performing, for example, the following signal processing on the sample voice signal.
- Emphasize the high-frequency components of the waveform using a pre-emphasis filter
- Apply a window function, then perform a fast Fourier transform (FFT) to obtain the amplitude spectrum
- Compress the amplitude spectrum by applying a mel filter bank
- Treat the compressed numeric sequence as a signal and perform a discrete cosine transform
In one embodiment, the terminal device 20 can calculate the fundamental frequency by, for example, using algorithms such as Harvest and DIO, which have open-source implementations in the speech analysis, conversion, and synthesis system "World" (http://www.kki.yamanashi.ac.jp/~mmorise/world/index.html).
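The four listed steps can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the pre-emphasis coefficient 0.97, the Hamming window, the number of filters and coefficients, and all function names are hypothetical choices not given in the text, and a naive DFT stands in for the FFT to keep the example dependency-free. Common cepstral recipes also take a logarithm before the cosine transform; that step is not in the list above and is therefore left out. The sketch stops at the resulting coefficient sequence and does not show how a fundamental frequency would be read from it.

```python
import math


def pre_emphasis(x, coeff=0.97):
    """Boost high frequencies: y[n] = x[n] - coeff * x[n-1] (coeff is assumed)."""
    return [x[0]] + [x[n] - coeff * x[n - 1] for n in range(1, len(x))]


def hamming(n_points):
    """Hamming window (one possible choice of window function)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1)) for n in range(n_points)]


def amplitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2 (naive O(N^2) stand-in for an FFT)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        re = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        spectrum.append(math.hypot(re, im))
    return spectrum


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10 ** (m / 2595.0) - 1.0)


def mel_filterbank(spectrum, sample_rate, n_filters=20):
    """Compress the spectrum with triangular filters equally spaced in mel."""
    max_mel = hz_to_mel(sample_rate / 2)
    edges = [mel_to_hz(max_mel * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = (sample_rate / 2) / (len(spectrum) - 1)
    outputs = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        total = 0.0
        for k, amp in enumerate(spectrum):
            f = k * bin_hz
            if lo < f <= mid:
                total += amp * (f - lo) / (mid - lo)
            elif mid < f < hi:
                total += amp * (hi - f) / (hi - mid)
        outputs.append(total)
    return outputs


def dct_ii(values, n_coeffs):
    """Unnormalized DCT-II of the compressed sequence."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
            for k in range(n_coeffs)]


def cepstral_features(frame, sample_rate, n_filters=20, n_coeffs=12):
    """Run the four steps in order on one frame of samples."""
    emphasized = pre_emphasis(frame)
    windowed = [s * w for s, w in zip(emphasized, hamming(len(frame)))]
    spectrum = amplitude_spectrum(windowed)
    return dct_ii(mel_filterbank(spectrum, sample_rate, n_filters), n_coeffs)
```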

一実施形態では、発話開始時間から発話終了時間まで約１０秒間のサンプル音声信号を複数の時間区間に分割し、各時間区間ごとに上述したいずれか既知の手法により基本周波数が算出され得る。これにより、基本周波数の最大値、最小値及び中央値が抽出され得る。このような基本周波数の最大値、最小値及び中央値を平均した値が、最終的な基本周波数（以下「参照基本周波数」ということがある。）として取得され得る。別の実施形態では、上記のように抽出された基本周波数の最大値、最小値及び中央値のうち、いずれか１つの値が「参照基本周波数」として抽出され得る。 In one embodiment, a sample voice signal of about 10 seconds from the speech start time to the speech end time is divided into multiple time intervals, and the fundamental frequency can be calculated for each time interval using any of the known methods described above. This allows the maximum, minimum, and median values of the fundamental frequency to be extracted. The average value of such maximum, minimum, and median values of the fundamental frequency can be obtained as the final fundamental frequency (hereinafter sometimes referred to as the "reference fundamental frequency"). In another embodiment, any one of the maximum, minimum, and median values of the fundamental frequency extracted as described above can be extracted as the "reference fundamental frequency".
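The aggregation described above can be sketched as follows, assuming the per-interval fundamental frequencies have already been estimated. The function name, the `mode` parameter, and the rule of discarding non-positive values (treated here as intervals with no detected pitch) are assumptions for illustration, not part of the specification.

```python
from statistics import median


def reference_frequency(per_interval_hz, mode="mean_of_three"):
    """Reduce per-interval estimates to a single reference frequency.

    mode="mean_of_three" averages the maximum, minimum, and median;
    mode="max", "min", or "median" returns that single value instead.
    """
    # Assumption: values <= 0 mark intervals where no pitch was found.
    voiced = [f for f in per_interval_hz if f > 0.0]
    if not voiced:
        raise ValueError("no voiced intervals")
    highest, lowest, middle = max(voiced), min(voiced), median(voiced)
    if mode == "mean_of_three":
        return (highest + lowest + middle) / 3.0
    return {"max": highest, "min": lowest, "median": middle}[mode]
```

For example, per-interval estimates of 100 Hz, 120 Hz, and 200 Hz give (200 + 100 + 120) / 3 = 140 Hz in the first embodiment, and 120 Hz if the median alone is used.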

図８に戻り、次に、ＳＴ７０２において、端末装置２０は、第１基準値を基準とした参照基本周波数の変化量を取得する。具体的には、対象ユーザＡが男性である場合、すなわち、対象ユーザＡが「男性」を選択した場合（対象ユーザＡの性別は上述したＳＴ６００において対象ユーザＡにより入力されている）には、男性用の第１基準値は、例えば標準的な男性の声の基本周波数ｆ_P1として設定されているところ、端末装置２０は、対象ユーザＡの参照基本周波数が、男性用の第１基準値からどれだけ（何ｐｉｔｃｈ）変化させたものであるかを算出することができる。対象ユーザＡの参照基本周波数が男性用の第１基準値から何ｐｉｔｃｈ変化させたものであるかは、基本周波数を０.５倍すれば、基本周波数のｐｉｔｃｈは８ｐｉｔｃｈ（１オクターブ）減少し、基本周波数を２倍すれば、基本周波数のｐｉｔｃｈは８ｐｉｔｃｈ（１オクターブ）増加するという周波数とｐｉｔｃｈとの関係と、男性用の第１基準値と参照基本周波数との差異（割合）と、に基づいて算出することが可能なものである。 Returning to FIG. 8, next, in ST702, the terminal device 20 acquires the amount of change in the reference fundamental frequency based on the first reference value. Specifically, when the target user A is a male, that is, when the target user A selects "male" (the gender of the target user A is input by the target user A in ST600 described above), the first reference value for males is set as the fundamental frequency f _P1 of a standard male voice, for example, and the terminal device 20 can calculate how much (how many pitches) the reference fundamental frequency of the target user A has changed from the first reference value for males. How many pitches the reference fundamental frequency of the target user A has changed from the first reference value for males can be calculated based on the relationship between frequency and pitch, that is, if the fundamental frequency is multiplied by 0.5, the pitch of the fundamental frequency decreases by 8 pitches (one octave), and if the fundamental frequency is multiplied by two, the pitch of the fundamental frequency increases by 8 pitches (one octave), and the difference (proportion) between the first reference value for males and the reference fundamental frequency.

一方、対象ユーザＡが女性である場合、すなわち、対象ユーザＡが上述したＳＴ６００において「女性」を選択した場合には、女性用の第１基準値は、例えば標準的な女性の声の基本周波数ｆ_P2として設定されているところ、端末装置２０は、対象ユーザＡの参照基本周波数が、女性用の第１基準値からどれだけ（何ｐｉｔｃｈ）変化させたものであるかを算出することができる。対象ユーザＡの参照基本周波数が女性用の第１基準値から何ｐｉｔｃｈ変化させたものであるかは、基本周波数を０.５倍すれば、基本周波数のｐｉｔｃｈは８ｐｉｔｃｈ（１オクターブ）減少し、基本周波数を２倍すれば、基本周波数のｐｉｔｃｈは８ｐｉｔｃｈ（１オクターブ）増加するという周波数とｐｉｔｃｈとの関係と、女性用の第１基準値と参照基本周波数との差異（割合）と、に基づいて算出することが可能なものである。 On the other hand, when the target user A is a woman, that is, when the target user A selects "female" in ST600 described above, the first reference value for women is set as, for example, the fundamental frequency f _P2 of a standard female voice, and the terminal device 20 can calculate how much (how many pitches) the reference fundamental frequency of the target user A has changed from the first reference value for women. How many pitches the reference fundamental frequency of the target user A has changed from the first reference value for women can be calculated based on the relationship between frequency and pitch, that is, if the fundamental frequency is multiplied by 0.5, the pitch of the fundamental frequency decreases by 8 pitches (one octave), and if the fundamental frequency is multiplied by two, the pitch of the fundamental frequency increases by 8 pitches (one octave), and on the difference (proportion) between the first reference value for women and the reference fundamental frequency.

他方、対象ユーザＡが中性である場合、すなわち、対象ユーザＡが上述したＳＴ６００において「中性」を選択した場合には、中性用の第１基準値は、例えば標準的な女性の声の基本周波数ｆ_P３として設定されているところ、端末装置２０は、対象ユーザＡの参照基本周波数が、中性用の第１基準値からどれだけ（何ｐｉｔｃｈ）変化させたものであるかを算出することができる。対象ユーザＡの参照基本周波数が中性用の第１基準値から何ｐｉｔｃｈ変化させたものであるかは、基本周波数を０.５倍すれば、基本周波数のｐｉｔｃｈは８ｐｉｔｃｈ（１オクターブ）減少し、基本周波数を２倍すれば、基本周波数のｐｉｔｃｈは８ｐｉｔｃｈ（１オクターブ）増加するという周波数とｐｉｔｃｈとの関係と、中性用の第１基準値と参照基本周波数との差異（割合）と、に基づいて算出することが可能なものである。 On the other hand, when the target user A is neutral, that is, when the target user A selects "neutral" in ST600 described above, the first reference value for neutral is set as, for example, the fundamental frequency f _P3 of a standard female voice, and the terminal device 20 can calculate how much (how many pitches) the reference fundamental frequency of the target user A has changed from the first reference value for neutral. How many pitches the reference fundamental frequency of the target user A has changed from the first reference value for neutral can be calculated based on the relationship between frequency and pitch, that is, if the fundamental frequency is multiplied by 0.5, the pitch of the fundamental frequency decreases by 8 pitches (one octave), and if the fundamental frequency is multiplied by two, the pitch of the fundamental frequency increases by 8 pitches (one octave), and on the difference (proportion) between the first reference value for neutral and the reference fundamental frequency.
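The conversion from a frequency ratio to a pitch offset can be sketched as below. The text fixes only two points (halving the frequency gives -8 pitches, doubling gives +8 pitches); interpolating between them on a logarithmic scale is an assumption made here, and the function name and the example frequencies are hypothetical.

```python
import math

# Number of pitch steps per octave, as stated in the text.
PITCHES_PER_OCTAVE = 8


def pitch_change(reference_hz, base_hz, pitches_per_octave=PITCHES_PER_OCTAVE):
    """Offset, in pitches, of reference_hz relative to the reference value base_hz.

    Assumes a logarithmic frequency-to-pitch relationship, so that each
    doubling of frequency adds pitches_per_octave steps.
    """
    return pitches_per_octave * math.log2(reference_hz / base_hz)
```

With a hypothetical first reference value of 100 Hz, a user whose reference fundamental frequency is 200 Hz lies 8 pitches above it, and a user at 50 Hz lies 8 pitches below it. The same function applies to each gender selection; only the reference value `base_hz` differs.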

次に、ＳＴ７０４において、端末装置２０が、ＳＴ６００において取得したサンプル音声信号を用いて、第１フォルマントの周波数を参照第１フォルマント周波数として取得することができる。具体的には、端末装置２０は、対象ユーザＡについて得られたサンプル音声信号に対して任意の既知の信号処理を実行することにより第１フォルマントの周波数を抽出することができる。 Next, in ST704, the terminal device 20 can acquire the frequency of the first formant as a reference first formant frequency using the sample voice signal acquired in ST600. Specifically, the terminal device 20 can extract the frequency of the first formant by performing any known signal processing on the sample voice signal obtained for the target user A.

When the first technique described above (the technique using the zero-crossing method) is used, each of the filters 710A_1 to 710A_M shown in FIG. 9 uses, as its specific pass band, a pass band corresponding to the frequency of the first formant, and the calculation units 710B_1 to 710B_M can each calculate a first-formant-frequency likelihood based on the zero-crossing method, using the signals output by the filters 710A_1 to 710A_M, respectively. The selection unit 710C can then select the most reliable of the first-formant-frequency likelihoods calculated by the calculation units 710B_1 to 710B_M, and output the frequency corresponding to the selected likelihood as the first formant frequency of the sample voice signal.

When the second technique described above is used, the terminal device 20 can extract the first formant frequency in addition to the fundamental frequency by performing, for example, the following signal processing on the sample voice signal.
- Emphasize the high-frequency components of the waveform using a pre-emphasis filter
- Apply a window function, then perform a fast Fourier transform (FFT) to obtain the amplitude spectrum
- Compress the amplitude spectrum by applying a mel filter bank
- Treat the compressed numeric sequence as a signal and perform a discrete cosine transform
In this case as well, the terminal device 20 can calculate the first formant frequency in addition to the fundamental frequency by, for example, using "openSMILE", a library available for the Python programming language.

一実施形態では、発話開始時間から発話終了時間まで約１０秒間のサンプル音声信号を複数の時間区間に分割し、各時間区間ごとに上述したいずれか既知の手法により第１フォルマントの周波数が算出され得る。これにより、第１フォルマントの周波数の最大値、最小値及び中央値が抽出され得る。このような第１フォルマントの周波数の最大値、最小値及び中央値を平均した値が、最終的な第１フォルマントの周波数（以下「参照第１フォルマント周波数」ということがある。）として取得され得る。別の実施形態では、上記のように抽出された第１フォルマントの周波数の最大値、最小値及び中央値のうち、いずれか１つの値が「参照第１フォルマント周波数」として抽出され得る。 In one embodiment, a sample speech signal of about 10 seconds from the speech start time to the speech end time is divided into a plurality of time intervals, and the frequency of the first formant can be calculated for each time interval by any of the known methods described above. This allows the maximum, minimum, and median values of the frequency of the first formant to be extracted. The average value of the maximum, minimum, and median values of the frequency of the first formant can be obtained as the final frequency of the first formant (hereinafter sometimes referred to as the "reference first formant frequency"). In another embodiment, any one of the maximum, minimum, and median values of the frequency of the first formant extracted as described above can be extracted as the "reference first formant frequency".

Returning to FIG. 8, next, in ST706, the terminal device 20 acquires the amount of change in the reference first formant frequency relative to the second reference value. Specifically, when the target user A is male (the gender of the target user A has been input by the target user A in ST600 described above), the second reference value for males is set, for example, as the first formant frequency f_F1 of a standard male voice, and the terminal device 20 can calculate by how much (by how many formants) the reference first formant frequency of the target user A differs from the second reference value for males. The number of formants by which the reference first formant frequency of the target user A differs from the second reference value for males can be calculated based on: the relationship between frequency and pitch, namely that multiplying the fundamental frequency by 0.5 decreases its pitch by 8 pitches (one octave) and multiplying it by 2 increases its pitch by 8 pitches (one octave); the relationship that 1 formant is 1 pitch/3; and the difference (ratio) between the second reference value for males and the reference first formant frequency.

On the other hand, when the target user A is female, the second reference value for females is set, for example, as the first formant frequency f_F2 of a standard female voice, and the terminal device 20 can calculate by how much (by how many formants) the reference first formant frequency of the target user A differs from the second reference value for females. The number of formants by which the reference first formant frequency of the target user A differs from the second reference value for females can be calculated based on: the relationship between frequency and pitch, namely that multiplying the fundamental frequency by 0.5 decreases its pitch by 8 pitches (one octave) and multiplying it by 2 increases its pitch by 8 pitches (one octave); the relationship that 1 formant is 1 pitch/3; and the difference (ratio) between the second reference value for females and the reference first formant frequency.

Further, when the target user A is neutral, the second reference value for neutral is set, for example, as the first formant frequency f_F3 of a standard female voice, and the terminal device 20 can calculate by how much (by how many formants) the reference first formant frequency of the target user A differs from the second reference value for neutral. The number of formants by which the reference first formant frequency of the target user A differs from the second reference value for neutral can be calculated based on: the relationship between frequency and pitch, namely that multiplying the fundamental frequency by 0.5 decreases its pitch by 8 pitches (one octave) and multiplying it by 2 increases its pitch by 8 pitches (one octave); the relationship that 1 formant is 1 pitch/3; and the difference (ratio) between the second reference value for neutral and the reference first formant frequency.

次に、ＳＴ７０８において、端末装置２０は、対象ユーザＡの声と各音声変換プリセットに対応する声との距離を、ＳＴ７０２及びＳＴ７０６で取得した変化量を用いて取得することができる。具体的には、端末装置２０は、まず、対象ユーザＡの第１基準値を基準とした参照基本周波数の変化量、及び、対象ユーザＡの第２基準値を基準とした参照第１フォルマント周波数の変化量をｐｆ平面（２次元座標系）に配置する。さらに、端末装置２０は、各音声変換プリセットにより定められる、第１基準値を基準とした基本周波数の変化量及び第２基準値を基準とした第１フォルマントの周波数の変化量を上記ｐｆ平面（２次元座標系）に配置する。 Next, in ST708, the terminal device 20 can obtain the distance between the voice of the target user A and the voice corresponding to each voice conversion preset using the amount of change obtained in ST702 and ST706. Specifically, the terminal device 20 first arranges on the pf plane (two-dimensional coordinate system) the amount of change in the reference fundamental frequency based on the first reference value of the target user A and the amount of change in the reference first formant frequency based on the second reference value of the target user A. Furthermore, the terminal device 20 arranges on the pf plane (two-dimensional coordinate system) the amount of change in the fundamental frequency based on the first reference value and the amount of change in the first formant frequency based on the second reference value, which are determined by each voice conversion preset.

Furthermore, let (U_p, U_f) denote the amount of change in the reference fundamental frequency and the amount of change in the reference first formant frequency of the target user A, and let (T_1p, T_1f), (T_2p, T_2f), and (T_3p, T_3f) denote the amounts of change in the fundamental frequency and in the first formant frequency determined by, for example, three voice conversion presets. The terminal device 20 can then calculate the distances V_1, V_2, and V_3 between the voice of the target user A and the voices corresponding to the three voice conversion presets using the Pythagorean theorem, according to the following formulas.

$$V_1 = \sqrt{(T_{1p} - U_p)^2 + (T_{1f} - U_f)^2} \tag{1}$$
$$V_2 = \sqrt{(T_{2p} - U_p)^2 + (T_{2f} - U_f)^2} \tag{2}$$
$$V_3 = \sqrt{(T_{3p} - U_p)^2 + (T_{3f} - U_f)^2} \tag{3}$$
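Formulas (1) to (3) are ordinary Euclidean distances on the pf plane, and can be sketched for any number of presets as follows. The function names and the preset coordinates in the usage note are hypothetical and chosen only for illustration.

```python
import math


def preset_distances(user_point, presets):
    """Distance on the pf plane between the user and each preset.

    user_point is (U_p, U_f); presets maps a preset name to (T_p, T_f).
    """
    u_p, u_f = user_point
    return {name: math.hypot(t_p - u_p, t_f - u_f)
            for name, (t_p, t_f) in presets.items()}


def nearest_preset(user_point, presets):
    """Name of the preset whose voice is closest to the user's voice."""
    distances = preset_distances(user_point, presets)
    return min(distances, key=distances.get)
```

For a user at (0, 0) and hypothetical presets at (3, 4) and (-6, -8), the distances are 5 and 10, so the first preset is the nearer one.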

なお、対象ユーザＡが男性である場合には、対象ユーザＡの参照基本周波数の変化量及び参照第１フォルマント周波数の変化量と、「男性用の」各音声変換プリセット（例えば図５Ａに例示したＡ_Ｍ、Ｂ_Ｍ及びＣ_Ｍ等）により定められる基本周波数の変化量及び第１フォルマントの周波数の変化量と、の距離が算出される。一方、対象ユーザＡが女性である場合には、対象ユーザＡの参照基本周波数の変化量及び参照第１フォルマント周波数の変化量と、「女性用の」各音声変換プリセット（例えば図５Ｂに例示したＡ_Ｆ、Ｂ_Ｆ及びＣ_Ｆ等）により定められる基本周波数の変化量及び第１フォルマントの周波数の変化量と、の距離が算出される。他方、対象ユーザＡが中性である場合には、対象ユーザＡの参照基本周波数の変化量及び参照第１フォルマント周波数の変化量と、「中性用の」各音声変換プリセット（例えば図５Ｃに例示したＡ_Ｎ、Ｂ_Ｎ及びＣ_Ｎ等）により定められる基本周波数の変化量及び第１フォルマントの周波数の変化量と、の距離が算出される。 In addition, when the target user A is male, the distance between the change amount of the reference fundamental frequency and the change amount of the reference first formant frequency of the target user A and the change amount of the fundamental frequency and the change amount of the first formant frequency determined by each voice conversion preset for "male" (e.g., A _M , B _M and C _M as exemplified in FIG. 5A ) is calculated. On the other hand, when the target user A is female, the distance between the change amount of the reference fundamental frequency and the change amount of the reference first formant frequency of the target user A and the change amount of the fundamental frequency and the change amount of the first formant frequency determined by each voice conversion preset for "female" (e.g., A _F , B _F and C _F as exemplified in FIG. 5B ) is calculated. On the other hand, when the target user A is neutral, the distance between the change amount of the reference fundamental frequency and the change amount of the reference first formant frequency of the target user A and the change amount of the fundamental frequency and the change amount of the first formant frequency determined by each voice conversion preset for "neutral" (e.g., A _N , B _N and C _N as exemplified in FIG. 5C ) is calculated.

次に、ＳＴ７１０において、端末装置２０は、ＳＴ７０８において取得した距離を、各音声変換プリセットに関連する情報に対応付けて表示部２２０に表示することができる。図１０は、図１に示した通信システム１において端末装置２０の表示部２２０により表示される画面の一例を示す図である。 Next, in ST710, the terminal device 20 can display the distance acquired in ST708 on the display unit 220 in association with information related to each voice conversion preset. FIG. 10 is a diagram showing an example of a screen displayed by the display unit 220 of the terminal device 20 in the communication system 1 shown in FIG. 1.

図１０に示すように、端末装置２０は、各音声変換プリセット（各キャラクター）ごとに、キャラクターに対応する画像又は写真、キャラクターに対応する名称、プリセット番号に対応付けて、その音声変換プリセットに対応する声と対象ユーザＡの声との距離を表示することができる。例えば、端末装置２０は、キャラクター７２０Ａ_４を例に挙げると、そのキャラクターに対応する写真、そのキャラクターに対応する名称（「俳優」）、プリセット番号（「Ｐ_５９」）、及び、その音声変換プリセット（「Ｐ_５９」）に対応する声と対象ユーザＡの声との距離（「５５」）を表示することができる。これにより、対象ユーザＡは、複数の音声変換プリセットのうち、いずれの音声変換プリセットに対応する声が自分の声に近いのかを認識することができる。なお、図１０に示した例では、距離として「１２」が表示されたキャラクター７２０Ａ_１に対応する声が対象ユーザＡの声に最も近いということが理解される。 As shown in FIG. 10, the terminal device 20 can display, for each voice conversion preset (each character), an image or photo corresponding to the character, a name corresponding to the character, and a preset number, and display the distance between the voice corresponding to the voice conversion preset and the voice of the target user A. For example, taking the character 720A ₄ as an example, the terminal device 20 can display a photo corresponding to the character, a name corresponding to the character ("Actor"), a preset number ("P ₅₉ "), and a distance ("55") between the voice corresponding to the voice conversion preset ("P ₅₉ ") and the voice of the target user A. This allows the target user A to recognize which voice conversion preset among the multiple voice conversion presets is closest to his or her own voice. In the example shown in FIG. 10, it can be understood that the voice corresponding to the character 720A ₁ , which is displayed with a distance of "12", is closest to the voice of the target user A.

なお、図１０には、端末装置２０が、複数の音声変換プリセットのうち、対象ユーザＡの声との距離が所定値（ここでは「１００」）未満である少なくとも１つの（ここでは５つの）音声変換プリセットを、対象ユーザＡの声に近い特徴を有する音声変換プリセットとして表示している。この所定値は、端末装置２０の表示部の解像度等を含む様々な条件に応じて、任意に設定可能なものである。これに代えて又はこれに加えて、端末装置２０は、複数の音声変換プリセットのうち、対象ユーザＡとの距離が別の所定値を上回る少なくとも１つの音声変換プリセットを、対象ユーザＡの声から「遠い」特徴を有する音声変換プリセットとして表示することも可能である。これにより、対象ユーザＡは、自己にとって意外性のある音声変換プリセットを提示されることにより、当該サービスをさらに楽しむことができる。 In FIG. 10, the terminal device 20 displays at least one (here, five) voice conversion presets, among the multiple voice conversion presets, whose distance from the voice of the target user A is less than a predetermined value (here, "100"), as a voice conversion preset having characteristics close to the voice of the target user A. This predetermined value can be set arbitrarily depending on various conditions including the resolution of the display unit of the terminal device 20. Alternatively or in addition to this, the terminal device 20 can also display at least one voice conversion preset, among the multiple voice conversion presets, whose distance from the target user A exceeds another predetermined value, as a voice conversion preset having characteristics "distant" from the voice of the target user A. In this way, the target user A can enjoy the service even more by being presented with a voice conversion preset that is unexpected to him/her.
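The selection of presets to display can be sketched as a simple threshold filter. The threshold of 100 for "near" presets follows the example in FIG. 10; the function name, the `far_above` parameter, its example value, and the ordering of the results by distance are assumptions for illustration.

```python
def split_by_distance(distances, near_below=100.0, far_above=None):
    """Split presets into those near the user's voice and those far from it.

    distances maps a preset name to its distance from the user's voice.
    Near presets are sorted closest first; far presets farthest first.
    """
    near = sorted((name for name, d in distances.items() if d < near_below),
                  key=distances.get)
    if far_above is None:
        return near, []
    far = sorted((name for name, d in distances.items() if d > far_above),
                 key=distances.get, reverse=True)
    return near, far
```

For instance, with distances of 12, 55, 100, and 140 and `far_above=120`, the presets at 12 and 55 are listed as near (a distance of exactly 100 is not below the threshold) and the preset at 140 is listed as far.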

さらにまた、表示部２２０に表示されるその音声変換プリセットに対応する声と対象ユーザＡの声との距離は、上述した数式（１）～（３）等により算出された値そのものであってもよいし、このように算出された値に対して更なる任意の計算が施されたものであってもよい。
なお、図８には、一例として、ＳＴ７００において基本周波数（参照基本周波数）を取得して、ＳＴ７０２においてそれぞれ参照基本周波数の変化量を取得した後、ＳＴ７０４において参照フォルマント周波数を取得し、ＳＴ７０６において参照フォルマント周波数の変化量を取得する場合について説明した。別の例として、ＳＴ７００において、参照基本周波数及び参照フォルマント周波数（逆の順序でもよい）を順次取得した後、その後のステップにおいて、参照基本周波数の変化量及び参照フォルマント周波数の変化量（逆の順序でもよい）を順次取得することも可能である。
いずれの場合においても、Ｆ０推定（"Harvest: A High-Performance Fundamental Frequency Estimator from Speech Signals", Masanori Morise, Interspeech 2017, https://www.isca-speech.org/archive/Interspeech_2017/abstracts/0068.html）、スペクトル包絡推定法（M. Morise, CheapTrick, a spectral envelope estimator for high-quality speech synthesis, Speech Communication, vol. 67, pp. 1-7, March 2015, http://www.sciencedirect.com/science/article/pii/S0167639314000697）、（M. Morise, Error evaluation of an F0-adaptive spectral envelope estimator in robustness against the additive noise and F0 error, IEICE transactions on information and systems, vol. E98-D, no. 7, pp. 1405-1408, July 2015）、音声パラメータのデザイン（https://www.jstage.jst.go.jp/article/jasj/74/11/74_608/_pdf）、及び、音声分析合成（p.118-122, https://www.amazon.co.jp/%E9%9F%B3%E5%A3%B0%E5%88%86%E6%9E%90%E5%90%88%E6%88%90-%E9%9F%B3%E9%9F%BF%E3%83%86%E3%82%AF%E3%83%8E%E3%83%AD%E3%82%B8%E3%83%BC%E3%82%B7%E3%83%AA%E3%83%BC%E3%82%BA-22-%E6%A3%AE%E5%8B%A2-%E5%B0%86%E9%9B%85/dp/4339011371/ref=asc_df_4339011371/?tag=jpgo-22&linkCode=df0&hvadid=288872634447&hvpos=1o1&hvnetw=g&hvrand=13207960527415520975&hvpone=&hvptwo=&hvqmt=&hvdev=c&hvdvcmdl=&hvlocint=&hvlocphy=1028853&hvtargid=pla-527203759435&psc=1&th=1&psc=1）を含む文献等に記載された任意の技術を用いて、参照基本周波数及び／又は参照フォルマント周波数を取得することが可能である。
なお、これらの文献等は、引用によりその全体が本明細書に組み入れられる。 Furthermore, the distance between the voice corresponding to the voice conversion preset displayed on the display unit 220 and the voice of the target user A may be the value calculated using the above-mentioned formulas (1) to (3), or may be a value calculated in this manner after further arbitrary calculations have been performed.
FIG. 8 has been described as an example in which the fundamental frequency (reference fundamental frequency) is acquired in ST700, the amount of change in the reference fundamental frequency is acquired in ST702, the reference formant frequency is acquired in ST704, and the amount of change in the reference formant frequency is acquired in ST706. As another example, it is also possible to sequentially acquire the reference fundamental frequency and the reference formant frequency (the order may be reversed) in ST700, and then sequentially acquire the amount of change in the reference fundamental frequency and the amount of change in the reference formant frequency (the order may be reversed) in the subsequent steps.
In either case, the reference fundamental frequency and/or the reference formant frequency can be acquired using any technique described in the literature, including F0 estimation ("Harvest: A High-Performance Fundamental Frequency Estimator from Speech Signals", Masanori Morise, Interspeech 2017, https://www.isca-speech.org/archive/Interspeech_2017/abstracts/0068.html), spectral envelope estimation (M. Morise, CheapTrick, a spectral envelope estimator for high-quality speech synthesis, Speech Communication, vol. 67, pp. 1-7, March 2015, http://www.sciencedirect.com/science/article/pii/S0167639314000697) and (M. Morise, Error evaluation of an F0-adaptive spectral envelope estimator in robustness against the additive noise and F0 error, IEICE transactions on information and systems, vol. E98-D, no. 7, pp. 1405-1408, July 2015), the design of speech parameters (https://www.jstage.jst.go.jp/article/jasj/74/11/74_608/_pdf), and speech analysis and synthesis (p.118-122, https://www.amazon.co.jp/%E9%9F%B3%E5%A3%B0%E5%88%86%E6%9E%90%E5%90%88%E6%88%90-%E9%9F%B3%E9%9F%BF%E3%83%86%E3%82%AF%E3%83%8E%E3%83%AD%E3%82%B8%E3%83%BC%E3%82%B7%E3%83%AA%E3%83%BC%E3%82%BA-22-%E6%A3%AE%E5%8B%A2-%E5%B0%86%E9%9B%85/dp/4339011371/ref=asc_df_4339011371/?tag=jpgo-22&linkCode=df0&hvadid=288872634447&hvpos=1o1&hvnetw=g&hvrand=13207960527415520975&hvpone=&hvptwo=&hvqmt=&hvdev=c&hvdvcmdl=&hvlocint=&hvlocphy=1028853&hvtargid=pla-527203759435&psc=1&th=1&psc=1).
These documents are incorporated herein by reference in their entirety.

以上、図６に示したＳＴ６０２において行われる動作について、図８を用いて説明した。 The operations performed in ST602 shown in Figure 6 have been explained above using Figure 8.

図６に戻り、ＳＴ６０４において、端末装置２０は、複数の音声変換プリセットのうち、ＳＴ６０２において抽出された、対象ユーザＡの声との距離が所定値未満である少なくとも１つの音声変換プリセットに基づいて、対象ユーザＡに推奨すべき少なくとも１つの音声変換プリセットを取得する。ＳＴ６０４において行われる動作について、図１１及び図１２を参照して説明する。図１１は、図１に示した通信システム１において各ユーザとそのユーザの声との距離が所定値未満である少なくとも１つの音声変換プリセットとを対応付けて記憶する情報の一例を示す図である。図１２は、図１に示した通信システムにおいて各ユーザとそのユーザにより過去に使用された少なくとも１つの音声変換プリセットとを対応付けて記憶する情報の一例を示す図である。 Returning to FIG. 6, in ST604, the terminal device 20 acquires at least one voice conversion preset to be recommended to the target user A, based on the at least one voice conversion preset that was extracted in ST602 from among the plurality of voice conversion presets and whose distance from the voice of the target user A is less than a predetermined value. The operation performed in ST604 will be described with reference to FIG. 11 and FIG. 12. FIG. 11 is a diagram showing an example of information, stored in the communication system 1 shown in FIG. 1, that associates each user with at least one voice conversion preset whose distance from that user's voice is less than the predetermined value. FIG. 12 is a diagram showing an example of information, stored in the communication system shown in FIG. 1, that associates each user with at least one voice conversion preset used by that user in the past.

端末装置２０は、各ユーザ（例えばユーザＵ_１～ユーザＵ_Ｎ）に対してそのユーザの声との距離が所定値未満である音声変換プリセットが対応付けられた図１１に例示されるような情報を、例えばサーバ装置３０から受信することにより取得することができる。かかる情報は、例えば、各ユーザの端末装置２０が、図８におけるＳＴ７００～ＳＴ７０８において説明した処理を行い、そのユーザの声との距離が所定値未満である少なくとも１つの音声変換プリセットをサーバ装置３０に通知することにより、サーバ装置３０により生成可能なものである。 The terminal device 20 can obtain information such as that illustrated in FIG. 11, in which each user (e.g., user U_1 to user U_N) is associated with the voice conversion presets whose distance from that user's voice is less than a predetermined value, by receiving the information from, for example, the server device 30. Such information can be generated by the server device 30 when, for example, the terminal device 20 of each user performs the processes described for ST700 to ST708 in FIG. 8 and notifies the server device 30 of at least one voice conversion preset whose distance from that user's voice is less than the predetermined value.

ここで、対象ユーザＡがユーザＵ_２であるとする。対象ユーザＡの端末装置２０は、上記情報において、対象ユーザＡ（ユーザＵ_２）に対して選択された音声変換プリセットＰ_３、Ｐ_１５、Ｐ_３３、Ｐ_４０、Ｐ_７２のうちのいずれかと同一の音声変換プリセットに対応付けられた少なくとも１人のユーザを、対象ユーザＡに類似する声を有する少なくとも１人の類似ユーザとして選択することができる。この例では、対象ユーザＡの端末装置２０は、音声変換プリセットＰ_３に対応付けられたユーザＵ_４、及び、音声変換プリセットＰ_３３に対応付けられたユーザＵ_１を、対象ユーザＡに類似する声を有する少なくとも１人の類似ユーザとして選択することができる。 Here, assume that the target user A is user U_2. In the above information, the terminal device 20 of the target user A can select, as at least one similar user having a voice similar to that of the target user A, at least one user associated with the same voice conversion preset as any of the voice conversion presets P_3, P_15, P_33, P_40, and P_72 selected for the target user A (user U_2). In this example, the terminal device 20 of the target user A can select user U_4, who is associated with voice conversion preset P_3, and user U_1, who is associated with voice conversion preset P_33, as the at least one similar user having a voice similar to that of the target user A.
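
The similar-user selection described above amounts to a set-intersection test, which can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the implementation of the embodiment: the FIG. 11 information is assumed to be available as a mapping from each user to a set of presets. The entry for U_2, the association of U_4 with P_3, and the association of U_1 with P_33 come from the text; the other presets listed for U_1, U_3, and U_4, and the names `near_presets` and `find_similar_users`, are hypothetical.

```python
# Hypothetical in-memory form of the FIG. 11 information:
# user -> presets whose distance from that user's voice is below the threshold.
near_presets = {
    "U1": {"P33", "P41"},                      # P33 is from the text; P41 is made up
    "U2": {"P3", "P15", "P33", "P40", "P72"},  # from the text (target user A)
    "U3": {"P8", "P90"},                       # made up; shares nothing with U2
    "U4": {"P3", "P77"},                       # P3 is from the text; P77 is made up
}


def find_similar_users(target, table):
    """Return the users who share at least one near preset with the target."""
    target_presets = table[target]
    return sorted(
        user
        for user, presets in table.items()
        if user != target and presets & target_presets
    )


similar = find_similar_users("U2", near_presets)
print(similar)
```

With this data the sketch selects U_1 and U_4, matching the example in the text, and leaves out the hypothetical user U_3.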

次に、端末装置２０は、各ユーザに対してそのユーザにより過去に使用された音声変換プリセットが対応付けられた図１２に例示されるような情報を、例えばサーバ装置３０から受信することにより取得することができる。かかる情報は、例えば、各ユーザの端末装置２０が、新たな音声変換プリセットを通信システム１において又は他の通信システム（他のホームページ及びＳＮＳサイト等を含む）において使用する度に、そのように使用した音声変換プリセットをサーバ装置３０に通知することにより、サーバ装置３０により生成可能なものである。 Next, the terminal device 20 can obtain information such as that illustrated in FIG. 12, in which each user is associated with a voice conversion preset that has been used by that user in the past, by receiving the information from, for example, the server device 30. For example, each time the terminal device 20 of each user uses a new voice conversion preset in the communication system 1 or in another communication system (including other homepages, SNS sites, etc.), the terminal device 20 notifies the server device 30 of the voice conversion preset that has been used in that way, and the information can be generated by the server device 30.

次に、端末装置２０は、図１２に例示された情報に基づいて、少なくとも１つの類似ユーザにより過去に使用された少なくとも１つの音声変換プリセット、すなわち、ユーザＵ_１により過去に使用された少なくとも１つの音声変換プリセット（ここでは音声変換プリセットＰ_５、Ｐ_６０、Ｐ_７２及びＰ_９９のうちの少なくとも１つ）、及び、ユーザＵ_４により過去に使用された少なくとも１つの音声変換プリセット（ここでは音声変換プリセットＰ_１、Ｐ_１８、Ｐ_３６、Ｐ_１０５、Ｐ_２５０のうちの少なくとも１つ）を、対象ユーザＡに推奨すべき少なくとも１つの音声変換プリセットとして選択することができる。 Next, based on the information illustrated in FIG. 12, the terminal device 20 can select, as at least one voice conversion preset to be recommended to the target user A, at least one voice conversion preset previously used by at least one similar user, i.e., at least one voice conversion preset previously used by user U_1 (here, at least one of voice conversion presets P_5, P_60, P_72, and P_99) and at least one voice conversion preset previously used by user U_4 (here, at least one of voice conversion presets P_1, P_18, P_36, P_105, and P_250).

一実施形態では、端末装置２０は、ユーザＵ_１により過去に使用されたすべての音声変換プリセット（ここでは音声変換プリセットＰ_５、Ｐ_６０、Ｐ_７２及びＰ_９９のすべて）、及び、ユーザＵ_４により過去に使用されたすべての音声変換プリセット（ここでは音声変換プリセットＰ_１、Ｐ_１８、Ｐ_３６、Ｐ_１０５、Ｐ_２５０のすべて）を、対象ユーザＡに推奨すべき少なくとも１つの音声変換プリセットとして選択してもよい。 In one embodiment, the terminal device 20 may select all voice conversion presets previously used by user U_1 (here, all of voice conversion presets P_5, P_60, P_72, and P_99) and all voice conversion presets previously used by user U_4 (here, all of voice conversion presets P_1, P_18, P_36, P_105, and P_250) as the at least one voice conversion preset to be recommended to the target user A.
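
The embodiment above, in which every preset used by a similar user is recommended, can be illustrated as follows. This is a minimal Python sketch, assuming the FIG. 12 information is held as a mapping from user to a list of presets; the preset lists are the ones given in the text, while the names `used_presets` and `recommend_all` are hypothetical. The text does not say whether presets already close to the target user's voice (such as P_72) should be filtered out, so the sketch does not filter them.

```python
# FIG. 12 information as given in the text: user -> presets used in the past.
used_presets = {
    "U1": ["P5", "P60", "P72", "P99"],
    "U4": ["P1", "P18", "P36", "P105", "P250"],
}


def recommend_all(similar_users, history):
    """Collect every preset used by any similar user, dropping duplicates
    while keeping the order in which presets are first seen."""
    recommended = []
    for user in similar_users:
        for preset in history.get(user, []):
            if preset not in recommended:
                recommended.append(preset)
    return recommended


recommended = recommend_all(["U1", "U4"], used_presets)
print(recommended)
```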

別の実施形態では、端末装置２０は、ユーザＵ_１により過去に使用された音声変換プリセット（ここでは音声変換プリセットＰ_５、Ｐ_６０、Ｐ_７２及びＰ_９９）のうち、少なくとも１人のユーザにより良い評価が与えられた少なくとも１つの音声変換プリセット、及び、ユーザＵ_４により過去に使用された音声変換プリセット（ここでは音声変換プリセットＰ_１、Ｐ_１８、Ｐ_３６、Ｐ_１０５、Ｐ_２５０）のうち、少なくとも１人のユーザにより良い評価を与えられた少なくとも１つの音声変換プリセットを、対象ユーザＡに推奨すべき少なくとも１つの音声変換プリセットとして選択してもよい。 In another embodiment, the terminal device 20 may select, as at least one voice conversion preset to be recommended to the target user A, at least one voice conversion preset that has been given a good rating by at least one user from among the voice conversion presets used in the past by user U_1 (here, voice conversion presets P_5, P_60, P_72, and P_99), and at least one voice conversion preset that has been given a good rating by at least one user from among the voice conversion presets used in the past by user U_4 (here, voice conversion presets P_1, P_18, P_36, P_105, and P_250).

この場合、あるユーザ（ここでは説明を簡単にするために例えば上記ユーザＵ_１とする）により過去に使用された音声変換プリセットのうち、少なくとも１人のユーザによりウェブサイト及び／又はＳＮＳにおいて良い評価が与えられた音声変換プリセットとは、（i）ユーザＵ_１本人又は他の少なくとも１人のユーザにより購入された音声変換プリセット、（ii）ユーザＵ_１本人又は他の少なくとも１人のユーザによりウェブサイト及び／又はＳＮＳにおいてレビューが作成された音声変換プリセット、（iii）ユーザＵ_１本人又は他の少なくとも１人のユーザによりウェブサイト及び／又はＳＮＳにおいてシェアされた音声変換プリセット、（iv）ユーザＵ_１本人又は他の少なくとも１人のユーザによりウェブサイト及び／又はＳＮＳにおいて参照された音声変換プリセット、及び／又は、（v）ユーザＵ_１本人又は他の少なくとも１人のユーザによりウェブサイト及び／又はＳＮＳにおいて再生された音声変換プリセットを、これらに限定することなく含むことができる。これを実現するために、サーバ装置３０は、各ユーザに対して、そのユーザにより過去に使用された音声変換プリセットを対応付けるだけでなく、そのように過去に使用された音声変換プリセットの各々に対して上記（i）～（v）に関する情報を対応付けて記憶する情報を用意及び更新しておき、かかる情報を各端末装置２０に送信することができる。 In this case, among the voice conversion presets used in the past by a certain user (for simplicity, user U_1 above), the voice conversion presets that have been given a good rating on a website and/or SNS by at least one user may include, without limitation, (i) voice conversion presets purchased by user U_1 personally or by at least one other user, (ii) voice conversion presets for which reviews have been written on a website and/or SNS by user U_1 personally or by at least one other user, (iii) voice conversion presets shared on a website and/or SNS by user U_1 personally or by at least one other user, (iv) voice conversion presets referenced on a website and/or SNS by user U_1 personally or by at least one other user, and/or (v) voice conversion presets played on a website and/or SNS by user U_1 personally or by at least one other user. To achieve this, the server device 30 not only associates each user with the voice conversion presets used by that user in the past, but also prepares and updates information that stores, in association with each of those previously used voice conversion presets, the information regarding (i) to (v) above, and can transmit such information to each terminal device 20.

さらに別の実施形態では、端末装置２０が、あるユーザ（ここでは説明を簡単にするために例えば上記ユーザＵ_１とする）により過去に使用された音声変換プリセットの各々に対して、（a）ユーザＵ_１本人又は他の少なくとも１人のユーザにより購入された回数に比例する係数、（b）ユーザＵ_１本人又は他の少なくとも１人のユーザによりウェブサイト及び／又はＳＮＳにおいてレビューが作成された回数に比例する係数、（c）ユーザＵ_１本人又は他の少なくとも１人のユーザによりウェブサイト及び／又はＳＮＳにおいてシェアされた回数に比例する係数、（d）ユーザＵ_１本人又は他の少なくとも１人のユーザによりウェブサイト及び／又はＳＮＳにおいて参照された回数に比例する係数、（e）ユーザＵ_１本人又は他の少なくとも１人のユーザによりウェブサイト及び／又はＳＮＳにおいて再生された回数に比例する係数、のうちの少なくとも１つの係数を掛けた値が大きいものを優先的に、対象ユーザＡに推奨すべき少なくとも１つの音声変換プリセットとして選択することができる。 In yet another embodiment, for each of the voice conversion presets used in the past by a certain user (for simplicity, user U_1 above), the terminal device 20 can multiply by at least one of the following coefficients and preferentially select the presets with the larger resulting values as the at least one voice conversion preset to be recommended to the target user A: (a) a coefficient proportional to the number of times the preset has been purchased by user U_1 personally or by at least one other user; (b) a coefficient proportional to the number of times reviews of the preset have been written on a website and/or SNS by user U_1 personally or by at least one other user; (c) a coefficient proportional to the number of times the preset has been shared on a website and/or SNS by user U_1 personally or by at least one other user; (d) a coefficient proportional to the number of times the preset has been referenced on a website and/or SNS by user U_1 personally or by at least one other user; and (e) a coefficient proportional to the number of times the preset has been played on a website and/or SNS by user U_1 personally or by at least one other user.
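
The coefficient-based prioritization above can be sketched as a simple ranking. This is an illustrative Python sketch only: the text states that each coefficient is proportional to a count but gives neither the counts nor the proportionality constants, so every number below, and the names `counts`, `WEIGHTS`, `score`, and `rank`, are hypothetical. The score of a preset is taken here to be the product of the coefficients for the chosen signals.

```python
from math import prod

# Hypothetical engagement counts for the presets used in the past by user U_1.
counts = {
    "P5":  {"purchases": 10, "reviews": 2, "shares": 1},
    "P60": {"purchases": 3,  "reviews": 5, "shares": 4},
    "P72": {"purchases": 8,  "reviews": 1, "shares": 0},
    "P99": {"purchases": 1,  "reviews": 1, "shares": 1},
}

# Hypothetical proportionality constants (coefficient = constant * count).
WEIGHTS = {"purchases": 1.0, "reviews": 0.5, "shares": 0.5}


def score(preset_counts, signals):
    """Multiply together the coefficients for the chosen signals."""
    return prod(WEIGHTS[s] * preset_counts[s] for s in signals)


def rank(all_counts, signals, top_n):
    """Return the top_n presets, highest score first."""
    ordered = sorted(
        all_counts, key=lambda p: score(all_counts[p], signals), reverse=True
    )
    return ordered[:top_n]


top_by_purchases = rank(counts, ["purchases"], 2)
top_by_two_signals = rank(counts, ["purchases", "reviews"], 2)
print(top_by_purchases, top_by_two_signals)
```

Because the score is a product, a preset with a zero count for any chosen signal scores zero; whether that behavior is desirable is a design choice that the text leaves open.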

さらに別の実施形態では、端末装置２０は、ユーザＵ_１により過去に使用された音声変換プリセット（ここでは音声変換プリセットＰ_５、Ｐ_６０、Ｐ_７２及びＰ_９９）のうち、少なくとも１人のユーザにより悪い評価が与えられた少なくとも１つの音声変換プリセット、及び、ユーザＵ_４により過去に使用された音声変換プリセット（ここでは音声変換プリセットＰ_１、Ｐ_１８、Ｐ_３６、Ｐ_１０５、Ｐ_２５０）のうち、少なくとも１人のユーザにより悪い評価を与えられた少なくとも１つの音声変換プリセットを、対象ユーザＡに推奨すべき少なくとも１つの音声変換プリセットとして「選択しない」ようにしてもよい。 In yet another embodiment, the terminal device 20 may refrain from selecting ("not select"), as at least one voice conversion preset to be recommended to the target user A, at least one voice conversion preset that has been given a bad rating by at least one user from among the voice conversion presets used in the past by user U_1 (here, voice conversion presets P_5, P_60, P_72, and P_99), and at least one voice conversion preset that has been given a bad rating by at least one user from among the voice conversion presets used in the past by user U_4 (here, voice conversion presets P_1, P_18, P_36, P_105, and P_250).

さらにまた、別の実施形態では、端末装置２０は、少なくとも１人の類似ユーザにより過去に使用された音声変換プリセットのうち、協調フィルタリングを用いて選択された少なくとも１つの音声変換プリセットを、対象ユーザＡに推奨すべき少なくとも１つの音声変換プリセットとして選択してもよい。図１３は、図１に示した通信システムにおいて、少なくとも１人の類似ユーザにより過去に使用された音声変換プリセットのうち、協調フィルタリングを用いて、対象ユーザに推奨すべき音声変換プリセットを選択する方法の一例を示す図である。 Furthermore, in another embodiment, the terminal device 20 may select at least one voice conversion preset selected using collaborative filtering from among voice conversion presets used in the past by at least one similar user as at least one voice conversion preset to be recommended to the target user A. Figure 13 is a diagram showing an example of a method for selecting a voice conversion preset to be recommended to a target user using collaborative filtering from among voice conversion presets used in the past by at least one similar user in the communication system shown in Figure 1.

図１３には、対象ユーザＡに類似する声を有する少なくとも１人の類似ユーザとして、一例として、３人の類似ユーザＵ_２０、Ｕ_３０、Ｕ_４０が示されている。これら３人の類似ユーザのすべてによって過去に使用された音声変換プリセットとして、音声変換プリセットＰ_２０、Ｐ_２５、Ｐ_３２が例示されている。類似ユーザＵ_２０は、音声変換プリセットＰ_２０、Ｐ_２５、Ｐ_３２に対して、それぞれ、５点、３点及び５点という評価（但し５点満点）を与えている。類似ユーザＵ_３０は、音声変換プリセットＰ_２０、Ｐ_２５、Ｐ_３２に対して、それぞれ、２点、５点及び２点という評価を与えている。類似ユーザＵ_４０は、音声変換プリセットＰ_２０、Ｐ_２５、Ｐ_３２に対して、それぞれ、５点、２点及び５点という評価を与えている。 FIG. 13 shows, as an example of at least one similar user having a voice similar to that of the target user A, three similar users U_20, U_30, and U_40. Voice conversion presets P_20, P_25, and P_32 are shown as examples of voice conversion presets used in the past by all three of these similar users. Similar user U_20 gives voice conversion presets P_20, P_25, and P_32 ratings of 5 points, 3 points, and 5 points (out of 5), respectively. Similar user U_30 gives voice conversion presets P_20, P_25, and P_32 ratings of 2 points, 5 points, and 2 points, respectively. Similar user U_40 gives voice conversion presets P_20, P_25, and P_32 ratings of 5 points, 2 points, and 5 points, respectively.

ここで、対象ユーザＡが、図１３に例示されているように音声変換プリセットＰ_３２に対して５点という高い評価を与えている場合には、端末装置２０は、同一の音声変換プリセットＰ_３２に対して高い評価を与えている類似ユーザＵ_２０及びＵ_４０に着目し、これらの類似ユーザによって同様に高い評価（ここでは５点）が与えられている音声変換プリセットＰ_２０を、対象ユーザＡに推奨すべき音声変換プリセットとして選択することができる。よって、図１３において、対象ユーザＡの列において音声変換プリセットＰ_２０に対応する行には、推奨することを意味する記号（◎）が付されている。 Here, when the target user A gives a high rating of 5 points to voice conversion preset P_32 as illustrated in FIG. 13, the terminal device 20 focuses on similar users U_20 and U_40, who also give high ratings to the same voice conversion preset P_32, and can select voice conversion preset P_20, which has likewise been given a high rating (here, 5 points) by these similar users, as the voice conversion preset to be recommended to the target user A. Therefore, in FIG. 13, the row corresponding to voice conversion preset P_20 in the column of the target user A is marked with a symbol (◎) indicating recommendation.

さらに、次に、対象ユーザＢに着目すると、対象ユーザＢが、図１３に例示されているように音声変換プリセットＰ_２５に対して４点という高い評価を与えている場合には、端末装置２０は、同一の音声変換プリセットＰ_２５に対して高い評価を与えている類似ユーザＵ_３０に着目し、この類似ユーザＵ_３０によって低い評価（ここでは２点）が与えられている音声変換プリセットＰ_２０、Ｐ_３２を、対象ユーザＢに推奨すべき音声変換プリセットとして「選択しない」ようにすることができる。よって、図１３において、対象ユーザＢの列において音声変換プリセットＰ_２０、Ｐ_３２に対応する行には、推奨しないことを意味する記号（×）が付されている。 Turning next to the target user B, when the target user B gives a high rating of 4 points to voice conversion preset P_25 as illustrated in FIG. 13, the terminal device 20 focuses on similar user U_30, who also gives a high rating to the same voice conversion preset P_25, and can refrain from selecting ("not select") voice conversion presets P_20 and P_32, which have been given a low rating (here, 2 points) by similar user U_30, as voice conversion presets to be recommended to the target user B. Therefore, in FIG. 13, the rows corresponding to voice conversion presets P_20 and P_32 in the column of the target user B are marked with a symbol (×) indicating that they are not recommended.
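
The two FIG. 13 examples above (target user A and target user B) can be reproduced with a small rule-based sketch of collaborative filtering. This is a Python sketch under stated assumptions rather than the method of the embodiment: the ratings are those given in the text, but the thresholds `HIGH` and `LOW`, the rule that all matching peers must agree, and the function name `advise` are assumptions introduced for illustration.

```python
# Ratings from FIG. 13 as described in the text (5-point scale).
ratings = {
    "U20": {"P20": 5, "P25": 3, "P32": 5},
    "U30": {"P20": 2, "P25": 5, "P32": 2},
    "U40": {"P20": 5, "P25": 2, "P32": 5},
}

HIGH = 4  # assumed: a rating of 4 or more counts as "high"
LOW = 2   # assumed: a rating of 2 or less counts as "low"


def advise(target_ratings, peer_ratings):
    """Return (recommend, avoid) preset lists for one target user."""
    liked = {p for p, r in target_ratings.items() if r >= HIGH}
    # Peers: similar users who also rated one of the target's liked presets highly.
    peers = [
        u for u, rs in peer_ratings.items()
        if any(rs.get(p, 0) >= HIGH for p in liked)
    ]
    # Candidates: presets rated by the peers but not yet rated by the target.
    candidates = sorted(
        {p for u in peers for p in peer_ratings[u]} - set(target_ratings)
    )
    recommend = [
        p for p in candidates
        if all(peer_ratings[u].get(p, 0) >= HIGH for u in peers)
    ]
    avoid = [
        p for p in candidates
        if all(0 < peer_ratings[u].get(p, 0) <= LOW for u in peers)
    ]
    return recommend, avoid


result_a = advise({"P32": 5}, ratings)  # target user A
result_b = advise({"P25": 4}, ratings)  # target user B
print(result_a, result_b)
```

For target user A the peers are U_20 and U_40, so P_20 is recommended; for target user B the only peer is U_30, so P_20 and P_32 are marked as not recommended, in line with the ◎ and × marks described for FIG. 13.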

図１３に示した例において、類似ユーザ又は対象ユーザによって高い評価が与えられた音声プリセットは、上述した（i）～（v）のうちの少なくとも１つの音声変換プリセットを含むことができ、類似ユーザ又は対象ユーザによって低い評価が与えられた音声プリセットは、上述した（i）～（v）に反する少なくとも１つの音声変換プリセットを含むことができる。 In the example shown in FIG. 13, the voice presets that have been given high ratings by similar users or the target user may include at least one of the voice conversion presets (i) to (v) described above, and the voice presets that have been given low ratings by similar users or the target user may include at least one voice conversion preset that contradicts (i) to (v) described above.

なお、端末装置２０は、少なくとも１人の類似ユーザにより過去に使用された音声変換プリセットのうち、図１３を参照して上述したもの以外のその他の任意の協調フィルタリングを用いて選択された少なくとも１つの音声変換プリセットを、対象ユーザＡに推奨すべき少なくとも１つの音声変換プリセットとして選択してもよい。 In addition, the terminal device 20 may select at least one voice conversion preset selected using any other collaborative filtering other than that described above with reference to FIG. 13 from among the voice conversion presets used in the past by at least one similar user as at least one voice conversion preset to be recommended to the target user A.

以上、図６に示したＳＴ６０４において行われる動作について説明した。 The above explains the operations performed in ST604 shown in Figure 6.

次に、図６に戻り、ＳＴ６０６において、端末装置２０は、上述したＳＴ６０４において取得した少なくとも１つの推奨すべき音声変換プリセットを対象ユーザＡに提示する。図１４は、図１に示した通信システムにおいて端末装置２０の表示部２２０により表示される画面の別の例を示す図である。 Next, returning to FIG. 6, in ST606, the terminal device 20 presents at least one recommended voice conversion preset acquired in ST604 described above to the target user A. FIG. 14 is a diagram showing another example of a screen displayed by the display unit 220 of the terminal device 20 in the communication system shown in FIG. 1.

図１４には、対象ユーザＡに推奨すべき少なくとも１つの音声変換プリセットとして、５つのプリセット８００Ａ～８００Ｅが表示部２２０に表示される例が示されている。音声変換プリセット８００Ａ～８００Ｅの各々は、その音声変換プリセットに関連する情報（例えば、キャラクターに対応する画像又は写真、キャラクターに対応する名称、プリセット番号等）とともに表示され得る。 FIG. 14 shows an example in which five presets 800A-800E are displayed on the display unit 220 as at least one voice conversion preset to be recommended to target user A. Each of the voice conversion presets 800A-800E may be displayed together with information related to the voice conversion preset (e.g., an image or photo corresponding to the character, a name corresponding to the character, a preset number, etc.).

さらに、音声変換プリセット８００Ａ～８００Ｅの各々は、その音声変換プリセットの価値を示す少なくとも１つの情報とともに表示され得る。図１４には、音声変換プリセットの価値を示す少なくとも１つの情報として、その音声変換プリセットの価格、再生可能回数、再生可能時間及び同時使用人数が表示される例が示されている。別の実施形態では、これらの情報のうちの少なくとも１つの情報が表示されるようにしてもよい。 Furthermore, each of the voice conversion presets 800A to 800E may be displayed together with at least one piece of information indicating the value of that voice conversion preset. FIG. 14 shows an example in which the price, the permitted number of plays, the permitted playback time, and the number of simultaneous users of the voice conversion preset are displayed as the at least one piece of information indicating its value. In another embodiment, at least one of these pieces of information may be displayed.

音声変換プリセットの価格が高い（又は低い）ことは、その音声変換プリセットの価値が高い（又は低い）ことを意味する。 A higher (or lower) price for a voice conversion preset means that the voice conversion preset has a higher (or lower) value.

音声変換プリセットの再生可能回数とは、その音声変換プリセットを再生可能な回数の上限を意味する。音声変換プリセットの再生可能回数が少ない（又は多い）ことは、その音声変換プリセットの価値が高い（又は低い）ことを意味する。 The permitted number of plays of a voice conversion preset means the upper limit on the number of times that voice conversion preset can be played. A smaller (or larger) permitted number of plays means that the voice conversion preset has a higher (or lower) value.

音声変換プリセットの再生可能時間とは、その音声変換プリセットを再生可能な時間の上限を意味する。音声変換プリセットの再生可能時間が短い（又は長い）ことは、その音声変換プリセットの価値が高い（又は低い）ことを意味する。 The permitted playback time of a voice conversion preset means the upper limit on the length of time for which that voice conversion preset can be played. A shorter (or longer) permitted playback time means that the voice conversion preset has a higher (or lower) value.

音声変換プリセットの同時使用人数とは、その音声変換プリセットを同時に使用（再生）することができる人数の上限を意味する。音声変換プリセットの同時使用人数が少ない（又は多い）ことは、その音声変換プリセットの価値が高い（又は低い）ことを意味する。 The number of simultaneous users of a voice conversion preset means the upper limit on the number of people who can use (play) that voice conversion preset at the same time. A smaller (or larger) number of simultaneous users means that the voice conversion preset has a higher (or lower) value.

例えば、価格に着目すると、音声変換プリセット８００Ａの価値（４００円）は、音声変換プリセット８００Ｂ（２００円）の２倍高いといえる。次に、再生可能回数に着目すると、音声変換プリセット８００Ａの価値（１０回）は、音声変換プリセット８００Ｂ（２０回）の２倍高いといえる。さらに、再生可能時間に着目すると、音声変換プリセット８００Ａの価値（１０分）は、音声変換プリセット８００Ｂ（２０分）の２倍高いといえる。また、同時使用人数に着目すると、音声変換プリセット８００Ａの価値（１人）は、音声変換プリセット８００Ｂ（２人）の２倍高いといえる。 For example, when focusing on price, the value of voice conversion preset 800A (400 yen) is twice as high as that of voice conversion preset 800B (200 yen). Next, when focusing on the number of times it can be played, the value of voice conversion preset 800A (10 times) is twice as high as that of voice conversion preset 800B (20 times). Furthermore, when focusing on the amount of time it can be played, the value of voice conversion preset 800A (10 minutes) is twice as high as that of voice conversion preset 800B (20 minutes). Also, when focusing on the number of people using it at the same time, the value of voice conversion preset 800A (1 person) is twice as high as that of voice conversion preset 800B (2 people).

このような音声変換プリセット（通信システム１において用いられるすべての音声変換プリセット）の各々の価値は、以下の５つの係数のうちの少なくとも１つの係数を掛けることにより定められるようにしてもよい。
（１）いずれかのユーザ、複数のユーザ又はすべてのユーザにより購入された回数に比例する係数、
（２）いずれかのユーザ、複数のユーザ又はすべてのユーザによりウェブサイト及び／又はＳＮＳにおいてレビューが作成された回数に比例する係数、
（３）いずれかのユーザ、複数のユーザ又はすべてのユーザによりウェブサイト及び／又はＳＮＳにおいてシェアされた回数に比例する係数、
（４）いずれかのユーザ、複数のユーザ又はすべてのユーザによりウェブサイト及び／又はＳＮＳにおいて参照された回数に比例する係数、
（５）いずれかのユーザ、複数のユーザ又はすべてのユーザウェブサイト及び／又はＳＮＳにおいて再生された回数に比例する係数。 The value of each of such voice conversion presets (all voice conversion presets used in the communication system 1) may be determined by multiplying it by at least one of the following five coefficients:
(1) a coefficient proportional to the number of times the preset has been purchased by any user, multiple users, or all users;
(2) a coefficient proportional to the number of times reviews of the preset have been written on a website and/or SNS by any user, multiple users, or all users;
(3) a coefficient proportional to the number of times the preset has been shared on a website and/or SNS by any user, multiple users, or all users;
(4) a coefficient proportional to the number of times the preset has been referenced on a website and/or SNS by any user, multiple users, or all users;
(5) a coefficient proportional to the number of times the preset has been played on a website and/or SNS by any user, multiple users, or all users.

これを実現するために、サーバ装置３０は、各音声変換プリセットに対して、上記（１）～（５）のうちの少なくとも１つの係数を対応付けて記憶する情報を、保持及び更新し、必要に応じて各端末装置２０に送信することができる。 To achieve this, the server device 30 can hold and update information that associates at least one of the coefficients (1) to (5) above with each voice conversion preset, and transmit this information to each terminal device 20 as necessary.

端末装置２０は、図１４に例示されたように推奨された少なくとも１つの音声変換プリセットのうち、対象ユーザＡによりユーザインタフェイスを介して選択された音声変換プリセットについて、その音声変換プリセットにより定められる、第１基準値を基準とした基本周波数の変化量及び第２基準値を基準とした第１フォルマントの周波数の変化量を、記憶部２１６から読み出すことにより取得することができる。或いはまた、端末装置２０は、このように対象ユーザＡにより選択された音声変換プリセットについて、その音声変換プリセットにより定められる、第１基準値を基準とした基本周波数の変化量及び第２基準値を基準とした第１フォルマントの周波数の変化量を、端末装置２０による要求に応答したサーバ装置３０から受信して取得することができる。 The terminal device 20 can obtain the amount of change in fundamental frequency based on the first reference value and the amount of change in frequency of the first formant based on the second reference value, which are determined by the voice conversion preset selected by the target user A via the user interface from among at least one voice conversion preset recommended as illustrated in FIG. 14, by reading them from the storage unit 216. Alternatively, the terminal device 20 can receive and obtain the amount of change in fundamental frequency based on the first reference value and the amount of change in frequency of the first formant based on the second reference value, which are determined by the voice conversion preset, for the voice conversion preset selected by the target user A in this manner, from the server device 30 in response to a request by the terminal device 20.

図６に戻り、次に、ＳＴ６０８において、端末装置２０は、上述したＳＴ６０６において取得した音声変換プリセットを用いて、音声入力部２１０により入力された音声信号（入力音声信号）を変換することにより、出力音声信号を生成することができる。具体的には、端末装置２０は、例えば次に述べるような処理を行うことにより、出力音声信号を生成することができる。 Returning to FIG. 6, next, in ST608, the terminal device 20 can generate an output audio signal by converting the audio signal (input audio signal) input by the audio input unit 210 using the audio conversion preset acquired in ST606 described above. Specifically, the terminal device 20 can generate an output audio signal by performing, for example, the processing described below.

まず、端末装置２０（の特徴量抽出部２１２）が、ＳＴ７００において説明したものと同様の手法により、入力音声信号から基本周波数を抽出し、ＳＴ７０４において説明したものと同様の手法により、第１フォルマントの周波数を抽出する。 First, the terminal device 20 (feature extraction unit 212) extracts the fundamental frequency from the input speech signal using a method similar to that described in ST700, and extracts the frequency of the first formant using a method similar to that described in ST704.

次に、端末装置２０（の特徴量変換部２２２）が、上記のように抽出した入力音声信号の基本周波数を、ＳＴ６０６において取得した音声変換プリセットにより定められる基本周波数の変化量に応じて変換（シフト、すなわち、増加又は減少）し、かつ、上記のように抽出した入力音声信号の第１フォルマントの周波数を、ＳＴ６０６において取得した音声変換プリセットにより定められる第１フォルマントの周波数の変化量に応じて変換（シフト、すなわち、増加又は減少）する。 Next, the terminal device 20 (the feature conversion unit 222) converts (shifts, i.e., increases or decreases) the fundamental frequency of the input audio signal extracted as described above in accordance with the amount of change in the fundamental frequency determined by the voice conversion preset acquired in ST606, and converts (shifts, i.e., increases or decreases) the frequency of the first formant of the input audio signal extracted as described above in accordance with the amount of change in the frequency of the first formant determined by the voice conversion preset acquired in ST606.
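
The feature conversion step above can be illustrated with a short sketch. This is a minimal Python sketch, assuming that the change amounts defined by a preset are additive offsets in hertz; the text says only that the frequencies are shifted (increased or decreased) according to the change amounts, so the additive form, the clamping at zero, and the names `Preset` and `convert_features` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    """Hypothetical container for the change amounts defined by a preset."""
    delta_f0_hz: float  # change applied to the fundamental frequency
    delta_f1_hz: float  # change applied to the first formant frequency


def convert_features(f0_hz, f1_hz, preset):
    """Shift the extracted features by the preset's change amounts.

    Results are clamped at 0 Hz so that a large negative change cannot
    produce a negative frequency (an assumption, not stated in the text).
    """
    new_f0 = max(0.0, f0_hz + preset.delta_f0_hz)
    new_f1 = max(0.0, f1_hz + preset.delta_f1_hz)
    return new_f0, new_f1


# Hypothetical input features and preset values.
converted = convert_features(120.0, 500.0, Preset(delta_f0_hz=100.0, delta_f1_hz=150.0))
print(converted)
```

A multiplicative (ratio-based) shift would be an equally plausible reading; only the body of `convert_features` would change.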

次に、端末装置２０（の音声合成部２２４）は、特徴量変換部２２２により変換された基本周波数及び第１フォルマントの周波数を用いて、音声合成処理を行うことにより、入力音声信号を加工した音声信号（出力音声信号）を生成することができる。変換された基本周波数及び第１フォルマントの周波数を用いて音声を合成する処理は、周知技術である様々な手法を用いて実行することが可能なものである。 Next, the terminal device 20 (the voice synthesis unit 224) can perform voice synthesis processing using the fundamental frequency and the first formant frequency converted by the feature conversion unit 222 to generate a voice signal (output voice signal) that is a processed input voice signal. The process of synthesizing voice using the converted fundamental frequency and the first formant frequency can be performed using various techniques that are well-known technologies.

次に、ＳＴ６１０において、端末装置２０（の通信部２１８）は、生成された出力音声信号をサーバ装置３０に送信することができる。さらに、ＳＴ６１２において、サーバ装置３０は、端末装置２０から受信した出力音声信号を、他の端末装置２０に配信することも可能である。 Next, in ST610, the terminal device 20 (the communication unit 218) can transmit the generated output audio signal to the server device 30. Furthermore, in ST612, the server device 30 can also distribute the output audio signal received from the terminal device 20 to other terminal devices 20.

以上、通信システム１において行われる動作の具体例について説明した。 The above describes specific examples of operations performed in communication system 1.

６．変形例 6. Modifications
上述した様々な実施形態では、最も好ましい態様として、対象ユーザの発話に基づく音声信号に対する信号処理により、特徴量として基本周波数及び第１フォルマントの周波数を取得する場合について説明した。しかし、別の実施形態では、特徴量として基本周波数のみを参照基本周波数として抽出するようにしてもよい。この場合、対象ユーザの声と各音声変換プリセットに対応する声との距離は、第１基準値を基準とした参照基本周波数の変化量と、各音声変換プリセットにより定められる第１基準値を基準とした基本周波数の変化量と、の差に基づいて算出することが可能である。 In the various embodiments described above, a case has been described in which, as the most preferred aspect, the fundamental frequency and the frequency of the first formant are obtained as features by signal processing of the voice signal based on the speech of the target user. However, in another embodiment, only the fundamental frequency may be extracted as a feature, as the reference fundamental frequency. In this case, the distance between the voice of the target user and the voice corresponding to each voice conversion preset can be calculated based on the difference between the amount of change in the reference fundamental frequency relative to the first reference value and the amount of change in the fundamental frequency relative to the first reference value determined by each voice conversion preset.
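
The fundamental-frequency-only variant above can be sketched as follows. This is a minimal Python sketch, assuming that the distance is simply the absolute difference between the two change amounts; the text says only that the distance is calculated based on that difference, so the absolute-value form, the numeric values, and the names `f0_only_distance` and `presets_within` are hypothetical.

```python
def f0_only_distance(ref_delta_f0, preset_delta_f0):
    """Distance based only on the difference between the two F0 change amounts
    (both measured relative to the first reference value)."""
    return abs(ref_delta_f0 - preset_delta_f0)


def presets_within(ref_delta_f0, preset_deltas, threshold):
    """Return presets whose F0-only distance is below the threshold, nearest first."""
    scored = [
        (f0_only_distance(ref_delta_f0, delta), name)
        for name, delta in preset_deltas.items()
    ]
    return [name for dist, name in sorted(scored) if dist < threshold]


# Hypothetical F0 change amounts defined by three presets.
preset_deltas = {"P3": 25.0, "P15": 80.0, "P33": 40.0}
near = presets_within(30.0, preset_deltas, threshold=20.0)
print(near)
```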

また、上述した実施形態では、図６等を参照して、入力音声信号の入力（ＳＴ６００）、参照基本周波数の取得（ＳＴ７００）、参照基本周波数の変化量の取得（ＳＴ７０２）、参照第１フォルマント周波数の取得（ＳＴ７０４）、参照第１フォルマント周波数の変化量の取得（ＳＴ７０６）、距離の取得（ＳＴ７０８）及びレコメンドすべき音声変換プリセットの取得（ＳＴ６０４）が、すべて対象ユーザの端末装置２０により実行される場合について説明した。しかし、別の実施形態では、入力音声信号の入力（ＳＴ６００）のみが対象ユーザの端末装置２０により実行され、残りの工程がサーバ装置３０により及び／又はサーバ装置３０に接続される他のユーザの端末装置により実行されるようにしてもよい。この場合、取得した距離を各音声変換プリセットに関連する情報と対応付けて表示すること（ＳＴ７１０）及び音声変換プリセットの表示（ＳＴ６０６）は、依然として、対象ユーザの端末装置２０により実行されるようにしてもよい。 In the above embodiment, a case has been described, with reference to FIG. 6 and other figures, in which the input of the input voice signal (ST600), the acquisition of the reference fundamental frequency (ST700), the acquisition of the amount of change in the reference fundamental frequency (ST702), the acquisition of the reference first formant frequency (ST704), the acquisition of the amount of change in the reference first formant frequency (ST706), the acquisition of the distance (ST708), and the acquisition of the voice conversion preset to be recommended (ST604) are all performed by the terminal device 20 of the target user. However, in another embodiment, only the input of the input voice signal (ST600) may be performed by the terminal device 20 of the target user, and the remaining steps may be performed by the server device 30 and/or by a terminal device of another user connected to the server device 30. In this case, displaying the acquired distance in association with information related to each voice conversion preset (ST710) and displaying the voice conversion presets (ST606) may still be performed by the terminal device 20 of the target user.

さらに、上述した実施形態では、対象ユーザにより選択された音声変換プリセットを用いた入力音声信号の変換（ＳＴ６０８）が、対象ユーザの端末装置２０により実行され、その変換により生成された出力音声信号が端末装置２０によりサーバ装置３０に送信される（ＳＴ６１０）場合について説明した。しかし、別の実施形態では、対象ユーザにより選択された音声変換プリセットを用いた入力音声信号の変換（ＳＴ６０８）が、サーバ装置３０により実行され、ＳＴ６１０に代えて、その変換により生成された出力音声信号がサーバ装置３０により対象ユーザの端末装置２０に送信される工程が実行されるようにしてもよい。 Furthermore, in the above-described embodiment, a case has been described in which the conversion of the input voice signal using the voice conversion preset selected by the target user (ST608) is performed by the terminal device 20 of the target user, and the output voice signal generated by the conversion is transmitted by the terminal device 20 to the server device 30 (ST610). However, in another embodiment, the conversion of the input voice signal using the voice conversion preset selected by the target user (ST608) may be performed by the server device 30, and, instead of ST610, a step may be performed in which the output voice signal generated by the conversion is transmitted by the server device 30 to the terminal device 20 of the target user.

以上のように、様々な実施形態によれば、対象ユーザの声と各音声変換プリセットに対応する声との距離が、第１基準値を基準とした参照基本周波数の変化量及び第２基準値を基準とした参照第１フォルマント周波数の変化量（又は、第１基準値を基準とした参照基本周波数の変化量のみ）と、各音声変換プリセットにより定められる第１基準値を基準とした基本周波数の変化量及び第２基準値を基準とした第１フォルマントの周波数の変化量（又は、各音声変換プリセットにより定められる第１基準値を基準とした基本周波数の変化量のみ）とに基づいて算出され、対象ユーザに提示される。これにより、対象ユーザは、通信システム１において用意された複数の音声変換プリセットにおいて自己の声に近い特徴を有する音声変換プリセットとしてどのようなものが存在するのかを認識することができる。これにより、対象ユーザは、自己の声が例えば有名な芸能人の声や有名なキャラクター（声優）の声に近いことを発見することにより、当該通信システムにより提供されるサービスを楽しむことができる。 As described above, according to various embodiments, the distance between the target user's voice and the voice corresponding to each voice conversion preset is calculated based on the amount of change in the reference fundamental frequency based on the first reference value and the amount of change in the reference first formant frequency based on the second reference value (or only the amount of change in the reference fundamental frequency based on the first reference value), and the amount of change in the fundamental frequency based on the first reference value determined by each voice conversion preset and the amount of change in the frequency of the first formant based on the second reference value (or only the amount of change in the fundamental frequency based on the first reference value determined by each voice conversion preset), and is presented to the target user. This allows the target user to recognize what voice conversion presets have characteristics similar to his/her own voice among the multiple voice conversion presets prepared in the communication system 1. This allows the target user to enjoy the services provided by the communication system by discovering that his/her own voice is similar to the voice of, for example, a famous entertainer or a famous character (voice actor).

さらに、対象ユーザの声に近い特徴を有する音声変換プリセットが取得された際に、これと同一の音声変換プリセットに対応付けられた少なくとも１人の類似ユーザが選択され、当該類似ユーザにより過去に使用された少なくとも１つの音声変換プリセットが、推奨すべき音声変換プリセットとして対象ユーザに対して提供される。一般的に、予め用意された複数の音声変換プリセットのうち、いずれの音声変換プリセットが対象ユーザにとって品質の高い（満足される）ものであるかを認識することは容易ではない。しかし、上述した様々な実施形態では、対象ユーザに似通った声を有する類似ユーザにより過去に使用された（さらにはこの類似ユーザ又は他のユーザにより高い評価が与えられた）音声変換プリセットが、推奨すべき音声変換プリセットとして対象ユーザに提示されることにより、品質の高い音声変換プリセットが対象ユーザに提供される可能性を高めることができる。 Furthermore, when a voice conversion preset having characteristics similar to the target user's voice is acquired, at least one similar user associated with the same voice conversion preset is selected, and at least one voice conversion preset used in the past by the similar user is provided to the target user as a voice conversion preset to be recommended. In general, it is not easy to recognize which of a plurality of voice conversion presets prepared in advance is of high quality (satisfying) to the target user. However, in the various embodiments described above, a voice conversion preset that has been used in the past by a similar user having a voice similar to the target user (and has been highly rated by this similar user or other users) is presented to the target user as a voice conversion preset to be recommended, thereby increasing the possibility that a high-quality voice conversion preset will be provided to the target user.
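The selection of similar users and recommended presets described above can be sketched as follows. This is only an illustration under assumptions: the reference information is modeled as a dictionary from user ID to the set of presets close to that user's voice, the usage history as a dictionary from user ID to a list of preset IDs, and poorly rated presets as a plain set. The function names and data layout are hypothetical.

```python
def select_similar_users(target_presets: set, reference_info: dict, target_user=None) -> list:
    """Users associated with at least one preset that was also selected for the target user."""
    return sorted(
        user
        for user, presets in reference_info.items()
        if user != target_user and presets & target_presets
    )


def recommend_presets(similar_users: list, usage_history: dict, poorly_rated=frozenset()) -> list:
    """Presets used in the past by similar users, skipping poorly rated ones.

    Order of first appearance is kept and duplicates are dropped.
    """
    recommended = []
    for user in similar_users:
        for preset in usage_history.get(user, []):
            if preset not in poorly_rated and preset not in recommended:
                recommended.append(preset)
    return recommended
```

A rating-based filter is used here only because it is the simplest case mentioned in the text; collaborative filtering or other ranking could replace `recommend_presets` without changing `select_similar_users`.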

また、予め用意された複数の音声変換プリセットの各々の品質が十分であるかどうかを判定することは、一般的に容易ではない。様々な実施形態では、各音声変換プリセットに対応付けて以下の５つの係数のうちの少なくとも１つの係数が記憶された情報が用意され更新される。
（１）いずれかのユーザ、複数のユーザ又はすべてのユーザにより購入された回数に比例する係数、
（２）いずれかのユーザ、複数のユーザ又はすべてのユーザによりウェブサイト及び／又はＳＮＳにおいてレビューが作成された回数に比例する係数、
（３）いずれかのユーザ、複数のユーザ又はすべてのユーザによりウェブサイト及び／又はＳＮＳにおいてシェアされた回数に比例する係数、
（４）いずれかのユーザ、複数のユーザ又はすべてのユーザによりウェブサイト及び／又はＳＮＳにおいて参照された回数に比例する係数、
（５）いずれかのユーザ、複数のユーザ又はすべてのユーザウェブサイト及び／又はＳＮＳにおいて再生された回数に比例する係数。
これにより、各音声変換プリセットに対する様々なユーザの反応に基づいて各音声変換プリセットが評価されることにより、各音声変換プリセットの価値が客観的に把握され得る。 In addition, it is generally not easy to determine whether the quality of each of the multiple voice conversion presets prepared in advance is sufficient. In various embodiments, information is prepared and updated in which at least one of the following five coefficients is stored in association with each voice conversion preset:
(1) a coefficient proportional to the number of times the preset has been purchased by any user, multiple users, or all users;
(2) a coefficient proportional to the number of times a review of the preset has been written on a website and/or SNS by any user, multiple users, or all users;
(3) a coefficient proportional to the number of times the preset has been shared on a website and/or SNS by any user, multiple users, or all users;
(4) a coefficient proportional to the number of times the preset has been viewed on a website and/or SNS by any user, multiple users, or all users;
(5) a coefficient proportional to the number of times the preset has been played on a website and/or SNS by any user, multiple users, or all users.
Because each voice conversion preset is thus evaluated based on various users' reactions to it, the value of each voice conversion preset can be grasped objectively.
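As a minimal sketch of how such per-preset information might be stored and updated, the following Python code keeps one counter per reaction type and derives each coefficient as a constant multiple of its counter. The class `PresetReactions`, the function `coefficients` and the weight values are hypothetical; the text only states that each coefficient is proportional to the corresponding count.

```python
from dataclasses import dataclass


@dataclass
class PresetReactions:
    """Reaction counters stored in association with one voice conversion preset."""
    purchases: int = 0
    reviews: int = 0
    shares: int = 0
    views: int = 0
    plays: int = 0

    def record(self, reaction: str, count: int = 1) -> None:
        """Update one counter, e.g. record("plays") after a playback."""
        setattr(self, reaction, getattr(self, reaction) + count)


def coefficients(reactions: PresetReactions, weights: dict) -> dict:
    """Coefficient for each tracked reaction: weight times count (proportional to the count)."""
    return {name: weight * getattr(reactions, name) for name, weight in weights.items()}
```

Only the reactions listed in `weights` yield a coefficient, which mirrors the statement that at least one of the five coefficients is stored.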

したがって、様々な実施形態によれば、ユーザに適したボイスチェンジャを提供することが可能な手法を提供することができる。 Therefore, various embodiments can provide a technique capable of offering a voice changer suited to the user.

７．本件出願に開示された技術が適用される分野
本件出願に開示された技術は、例えば、次のような分野において適用することが可能なものである。
（１）音声及び／又は動画を通信網及び／又は放送網を介して配信するアプリケーション・サービス
（２）音声を用いてコミュニケーションすることができるアプリケーション・サービス（チャットアプリケーション、メッセンジャー、メールアプリケーション等）
（３）ユーザの音声を送信することが可能なゲーム・サービス（シューティングゲーム、恋愛ゲーム及びロールプレイングゲーム等）
7. Fields to which the technology disclosed in the present application is applicable
The technology disclosed in the present application can be applied, for example, in the following fields.
(1) Application services that distribute audio and/or video via communication networks and/or broadcasting networks
(2) Application services that enable communication using voice (chat applications, messengers, email applications, etc.)
(3) Game services that allow users to transmit their voice (shooting games, romance games, role-playing games, etc.)

１通信システム
１０通信網
２０（２０Ａ～２０Ｃ）端末装置
３０（３０Ａ～３０Ｃ）サーバ装置
４０（４０Ａ、４０Ｂ）スタジオユニット
２１０、３１０音声入力部
２１２、３１２特徴量抽出部
２１４、３１４変換器取得部
２１６、３１６記憶部
２１８、３１８通信部
２２０、３２０表示部
２２２、３２２特徴量変換部
２２４、３２４音声合成部
REFERENCE SIGNS LIST
1 Communication system
10 Communication network
20 (20A to 20C) Terminal device
30 (30A to 30C) Server device
40 (40A, 40B) Studio unit
210, 310 Voice input unit
212, 312 Feature extraction unit
214, 314 Converter acquisition unit
216, 316 Storage unit
218, 318 Communication unit
220, 320 Display unit
222, 322 Feature conversion unit
224, 324 Voice synthesis unit

Claims

少なくとも１つのプロセッサにより実行されることにより、
対象ユーザによる発話に基づく音声信号に対する信号処理により算出される基本周波数を参照基本周波数として取得し、
第１所定値を基準とした前記参照基本周波数の変化量を取得し、
各々が、前記第１所定値を基準とした基本周波数の変化量を定め、前記対象ユーザによる発話に基づく音声信号を変換するために用いられる、複数の音声変換プリセットを取得し、
前記複数の音声変換プリセットに含まれる各音声変換プリセットに対応する声と前記対象ユーザの声との間の距離を、前記音声変換プリセットにより定められる前記基本周波数の変化量及び前記参照基本周波数の変化量に基づいて算出する、ように前記プロセッサを機能させる、ことを特徴とするコンピュータプログラム。 A computer program which, when executed by at least one processor, causes the processor to function to:
acquire, as a reference fundamental frequency, a fundamental frequency calculated by signal processing of a voice signal based on an utterance by a target user;
acquire an amount of change in the reference fundamental frequency relative to a first predetermined value;
acquire a plurality of voice conversion presets, each of which defines an amount of change in a fundamental frequency relative to the first predetermined value and is used to convert a voice signal based on an utterance by the target user; and
calculate a distance between a voice corresponding to each voice conversion preset included in the plurality of voice conversion presets and the voice of the target user, based on the amount of change in the fundamental frequency defined by the voice conversion preset and the amount of change in the reference fundamental frequency.

前記コンピュータプログラムが、
前記対象ユーザによる発話に基づく音声信号に対する信号処理により算出されるフォルマントの周波数を参照フォルマント周波数として取得し、
第２所定値を基準とした前記参照フォルマント周波数の変化量を取得し、
前記複数の音声変換プリセットの各々が、さらに、前記第２所定値を基準としたフォルマントの周波数の変化量を定め、
前記複数の音声変換プリセットに含まれる各音声変換プリセットに対応する声と前記対象ユーザの声との間の距離を、第１軸及び第２軸がそれぞれ前記基本周波数の変化量及び前記フォルマントの周波数の変化量を表現する２次元座標系に配置された、前記音声変換プリセットにより定められる前記基本周波数の変化量及び前記フォルマントの周波数の変化量と、前記参照基本周波数の変化量及び前記参照フォルマント周波数の変化量と、を用いて算出する、請求項１に記載のコンピュータプログラム。 The computer program according to claim 1, wherein:
the computer program acquires, as a reference formant frequency, a formant frequency calculated by signal processing of the voice signal based on the utterance by the target user;
the computer program acquires an amount of change in the reference formant frequency relative to a second predetermined value;
each of the plurality of voice conversion presets further defines an amount of change in a formant frequency relative to the second predetermined value; and
the distance between the voice corresponding to each voice conversion preset included in the plurality of voice conversion presets and the voice of the target user is calculated using the amount of change in the fundamental frequency and the amount of change in the formant frequency defined by the voice conversion preset, and the amount of change in the reference fundamental frequency and the amount of change in the reference formant frequency, arranged in a two-dimensional coordinate system whose first axis and second axis represent the amount of change in the fundamental frequency and the amount of change in the formant frequency, respectively.

前記第１所定値は、複数のユーザから取得された基本周波数の平均値に基づいて設定され、
前記第２所定値は、複数のユーザから取得されたフォルマントの周波数の平均値に基づいて設定される、請求項２に記載のコンピュータプログラム。 The computer program according to claim 2, wherein the first predetermined value is set based on an average value of fundamental frequencies acquired from a plurality of users, and the second predetermined value is set based on an average value of formant frequencies acquired from a plurality of users.

前記複数の音声変換プリセットに含まれる各音声変換プリセットに対応する声と前記対象ユーザの声との間の前記距離に基づく情報を、前記音声変換プリセットに関連する情報と対応付けて表示部に表示する、請求項１又は請求項２に記載のコンピュータプログラム。 The computer program according to claim 1 or 2, which displays information based on the distance between the voice corresponding to each voice conversion preset included in the plurality of voice conversion presets and the voice of the target user on a display unit in association with information related to the voice conversion preset.

前記複数の音声変換プリセットに含まれる各音声変換プリセットに対応する声と前記対象ユーザの声との間の距離は、
√｛（前記第１所定値を基準とした参照基本周波数の変化量－前記音声変換プリセットにより定められる前記第１所定値を基準とした基本周波数の変化量）^２＋（前記第２所定値を基準とした参照フォルマント周波数の変化量－前記音声変換プリセットにより定められる前記第２所定値を基準としたフォルマントの周波数の変化量）^２｝
という数式により算出される、請求項２又は請求項３に記載のコンピュータプログラム。 The computer program according to claim 2 or 3, wherein the distance between the voice corresponding to each voice conversion preset included in the plurality of voice conversion presets and the voice of the target user is calculated by the formula
√{(amount of change in the reference fundamental frequency relative to the first predetermined value − amount of change in the fundamental frequency relative to the first predetermined value defined by the voice conversion preset)² + (amount of change in the reference formant frequency relative to the second predetermined value − amount of change in the formant frequency relative to the second predetermined value defined by the voice conversion preset)²}.

前記複数の音声変換プリセットのうち、前記対象ユーザの声との距離が所定の条件を満たす少なくとも１つの音声変換プリセットを、前記対象ユーザの声に近い特徴を有する音声変換プリセットとして選択する、請求項１から請求項５のいずれかに記載のコンピュータプログラム。 The computer program according to any one of claims 1 to 5, wherein at least one voice conversion preset, the distance from the target user's voice satisfying a predetermined condition, is selected as a voice conversion preset having characteristics close to the target user's voice, from among the plurality of voice conversion presets.

複数のユーザに含まれる各ユーザと、該ユーザの声との距離が所定の条件を満たす少なくとも１つの音声変換プリセットと、を対応付ける参照情報に基づいて、
前記対象ユーザに対して選択された前記少なくとも１つの音声変換プリセットと同一の音声変換プリセットに対応付けられた少なくとも１人のユーザを、前記対象ユーザに類似する声を有する少なくとも１人の類似ユーザとして選択し、
前記複数の音声変換プリセットのうち、前記少なくとも１人の類似ユーザにより過去に使用された少なくとも１つの音声変換プリセットを、前記対象ユーザに推奨すべき少なくとも１つの推奨音声変換プリセットとして選択する、請求項１から請求項６のいずれかに記載のコンピュータプログラム。 The computer program according to any one of claims 1 to 6, which:
based on reference information associating each user included in a plurality of users with at least one voice conversion preset whose distance from the voice of that user satisfies a predetermined condition, selects at least one user associated with the same voice conversion preset as the at least one voice conversion preset selected for the target user, as at least one similar user having a voice similar to that of the target user; and
selects, from among the plurality of voice conversion presets, at least one voice conversion preset used in the past by the at least one similar user as at least one recommended voice conversion preset to be recommended to the target user.

前記少なくとも１人の類似ユーザにより過去に使用された少なくとも１つの音声変換プリセットのうち、協調フィルタリングを用いて選択された少なくとも１つの音声変換プリセットを、前記対象ユーザに推奨すべき少なくとも１つの推奨音声変換プリセットとして選択する、請求項７に記載のコンピュータプログラム。 The computer program according to claim 7, further comprising: selecting, from among at least one voice conversion preset previously used by the at least one similar user, at least one voice conversion preset selected using collaborative filtering as at least one recommended voice conversion preset to be recommended to the target user.

前記少なくとも１人の類似ユーザにより過去に使用された少なくとも１つの音声変換プリセットのうち、少なくとも１人のユーザにより良い評価が与えられた少なくとも１つの音声変換プリセットを、前記対象ユーザに推奨すべき少なくとも１つの推奨音声変換プリセットとして選択する、請求項７又は請求項８に記載のコンピュータプログラム。 The computer program according to claim 7 or 8, which selects, from among the at least one voice conversion preset used in the past by the at least one similar user, at least one voice conversion preset that has been given a good rating by at least one user, as at least one recommended voice conversion preset to be recommended to the target user.

前記少なくとも１人の類似ユーザにより過去に使用された少なくとも１つの音声変換プリセットのうち、少なくとも１人のユーザにより悪い評価が与えられた少なくとも１つの音声変換プリセットを、前記対象ユーザに推奨すべき少なくとも１つの推奨音声変換プリセットとして選択しない、請求項７から請求項９のいずれかに記載のコンピュータプログラム。 The computer program according to any one of claims 7 to 9, which does not select, from among the at least one voice conversion preset used in the past by the at least one similar user, at least one voice conversion preset that has been given a bad rating by at least one user, as at least one recommended voice conversion preset to be recommended to the target user.

前記少なくとも１人の類似ユーザにより過去に使用された少なくとも１つの音声変換プリセットのうち、良い評価が与えられた少なくとも１つの音声変換プリセットは、前記少なくとも１人のユーザにより購入されたプリセット、前記少なくとも１人のユーザによりレビューが作成されたプリセット、前記少なくとも１人のユーザによりシェアされたプリセット、及び／又は、前記少なくとも１人のユーザにより再生されたプリセットを含む、請求項９に記載のコンピュータプログラム。 The computer program of claim 9, wherein at least one of the voice conversion presets that has been used in the past by the at least one similar user and that has been given a good rating includes a preset purchased by the at least one user, a preset reviewed by the at least one user, a preset shared by the at least one user, and/or a preset played by the at least one user.

複数の音声変換プリセットに含まれる各音声変換プリセットに関連する情報を表示部に表示する、請求項１から請求項１１のいずれかに記載のコンピュータプログラム。 A computer program according to any one of claims 1 to 11, which displays information related to each of the voice conversion presets included in the plurality of voice conversion presets on a display unit.

前記情報が前記音声変換プリセットに対応付けられたキャラクター又は人物に関する情報を含む、請求項１２に記載のコンピュータプログラム。 The computer program of claim 12, wherein the information includes information about a character or person associated with the voice conversion preset.

前記複数の音声変換プリセットに含まれる各音声変換プリセットを用いた音声変換処理は、
該音声変換プリセットが定める前記基本周波数の変化量に応じて、入力音声信号の基本周波数を変換し、該音声変換プリセットが定める前記フォルマントの周波数の変化量に応じて、前記入力音声信号のフォルマントの周波数を変換することにより、出力音声信号を生成するものである、請求項２、請求項３又は請求項５に記載のコンピュータプログラム。 The computer program according to claim 2, 3 or 5, wherein the voice conversion process using each voice conversion preset included in the plurality of voice conversion presets generates an output voice signal by converting the fundamental frequency of an input voice signal in accordance with the amount of change in the fundamental frequency defined by the voice conversion preset, and converting the formant frequency of the input voice signal in accordance with the amount of change in the formant frequency defined by the voice conversion preset.

前記複数の音声変換プリセットに含まれる各音声変換プリセットの価値は、
該音声変換プリセットがいずれかのユーザにより再生された回数及び／又はいずれかのユーザによりシェアされた回数に比例するように設定される、請求項１から請求項１４のいずれかに記載のコンピュータプログラム。 The computer program according to any one of claims 1 to 14, wherein the value of each voice conversion preset included in the plurality of voice conversion presets is set to be proportional to the number of times the voice conversion preset has been played by any user and/or the number of times it has been shared by any user.

前記各音声変換プリセットの価値は、該音声変換プリセットの価格、該音声変換プリセットの再生可能回数、該音声変換プリセットの再生可能時間、及び／又は、該音声変換プリセットを同時に使用可能な人数を含む、請求項１５に記載のコンピュータプログラム。 The computer program according to claim 15, wherein the value of each voice conversion preset includes a price of the voice conversion preset, the number of times the voice conversion preset can be played, the length of time for which the voice conversion preset can be played, and/or the number of people who can use the voice conversion preset simultaneously.

前記少なくとも１つのプロセッサが、前記対象ユーザの端末装置、該対象ユーザの端末装置に対して通信回線を介して接続されるサーバ装置、又は、前記対象ユーザの端末装置に対して通信回線を介して接続される他のユーザの端末装置に搭載される、請求項１から請求項１６のいずれかに記載のコンピュータプログラム。 The computer program according to any one of claims 1 to 16, wherein the at least one processor is installed in the terminal device of the target user, a server device connected to the terminal device of the target user via a communication line, or a terminal device of another user connected to the terminal device of the target user via a communication line.

前記少なくとも１つのプロセッサが、中央処理装置（ＣＰＵ）、マイクロプロセッサ又はグラフィックスプロセッシングユニット（ＧＰＵ）を含む、請求項１から請求項１７のいずれかに記載のコンピュータプログラム。 The computer program according to any one of claims 1 to 17, wherein the at least one processor comprises a central processing unit (CPU), a microprocessor, or a graphics processing unit (GPU).

少なくとも１つのプロセッサを具備し、
該プロセッサが、
対象ユーザによる発話に基づく音声信号に対する信号処理により算出される基本周波数を参照基本周波数として取得し、
第１所定値を基準とした前記参照基本周波数の変化量を取得し、
各々が、第１所定値を基準とした基本周波数の変化量を定め、前記対象ユーザによる発話に基づく音声信号を変換するために用いられる、複数の音声変換プリセットを取得し、前記複数の音声変換プリセットに含まれる各音声変換プリセットに対応する声と前記対象ユーザの声との距離を、前記音声変換プリセットにより定められる前記基本周波数の変化量及び前記参照基本周波数の変化量に基づいて算出する、ことを特徴とするサーバ装置。 A server device comprising at least one processor, wherein the processor:
acquires, as a reference fundamental frequency, a fundamental frequency calculated by signal processing of a voice signal based on an utterance by a target user;
acquires an amount of change in the reference fundamental frequency relative to a first predetermined value;
acquires a plurality of voice conversion presets, each of which defines an amount of change in a fundamental frequency relative to the first predetermined value and is used to convert a voice signal based on an utterance by the target user; and
calculates a distance between a voice corresponding to each voice conversion preset included in the plurality of voice conversion presets and the voice of the target user, based on the amount of change in the fundamental frequency defined by the voice conversion preset and the amount of change in the reference fundamental frequency.

前記プロセッサが、
複数のユーザに含まれる各ユーザと、該ユーザの声との距離が所定値未満である少なくとも１つの音声変換プリセットと、を対応付ける参照情報に基づいて、
前記対象ユーザに対して選択された前記少なくとも１つの音声変換プリセットと同一の音声変換プリセットに対応付けられた少なくとも１人のユーザを、前記対象ユーザに類似する声を有する少なくとも１人の類似ユーザとして選択し、
前記複数の音声変換プリセットのうち、前記少なくとも１人の類似ユーザにより過去に使用された少なくとも１つの音声変換プリセットを、前記対象ユーザに推奨すべき少なくとも１つの推奨音声変換プリセットとして選択する、請求項１９に記載のサーバ装置。 The server device according to claim 19, wherein the processor:
based on reference information associating each user included in a plurality of users with at least one voice conversion preset whose distance from the voice of that user is less than a predetermined value, selects at least one user associated with the same voice conversion preset as the at least one voice conversion preset selected for the target user, as at least one similar user having a voice similar to that of the target user; and
selects, from among the plurality of voice conversion presets, at least one voice conversion preset used in the past by the at least one similar user as at least one recommended voice conversion preset to be recommended to the target user.

前記プロセッサが、中央処理装置（ＣＰＵ）、マイクロプロセッサ又はグラフィックスプロセッシングユニット（ＧＰＵ）を含む、請求項１９又は請求項２０に記載のサーバ装置。 The server device according to claim 19 or 20, wherein the processor comprises a central processing unit (CPU), a microprocessor, or a graphics processing unit (GPU).

コンピュータにより読み取り可能な命令を実行する少なくとも１つのプロセッサにより実行される方法であって、
該プロセッサが、前記命令を実行することにより、
対象ユーザによる発話に基づいて音声信号を取得する第１取得工程と、
前記音声信号に対する信号処理により算出される基本周波数を参照基本周波数として取得する第２取得工程と、
第１所定値を基準とした参照基本周波数の変化量を取得する第３取得工程と、
各々が、第１所定値を基準とした基本周波数の変化量を定め、前記対象ユーザによる発話に基づく音声信号を変換するために用いられる、複数の音声変換プリセットを取得する第４取得工程と、
前記複数の音声変換プリセットに含まれる各音声変換プリセットに対応する声と前記対象ユーザの声との距離を、前記音声変換プリセットにより定められる前記基本周波数の変化量及び前記参照基本周波数の変化量に基づいて算出する算出工程と、
を含むことを特徴とする方法。 A method performed by at least one processor executing computer-readable instructions, the method comprising, by the processor executing the instructions:
a first acquisition step of acquiring a voice signal based on an utterance by a target user;
a second acquisition step of acquiring, as a reference fundamental frequency, a fundamental frequency calculated by signal processing of the voice signal;
a third acquisition step of acquiring an amount of change in the reference fundamental frequency relative to a first predetermined value;
a fourth acquisition step of acquiring a plurality of voice conversion presets, each of which defines an amount of change in a fundamental frequency relative to the first predetermined value and is used to convert a voice signal based on an utterance by the target user; and
a calculation step of calculating a distance between a voice corresponding to each voice conversion preset included in the plurality of voice conversion presets and the voice of the target user, based on the amount of change in the fundamental frequency defined by the voice conversion preset and the amount of change in the reference fundamental frequency.

複数のユーザに含まれる各ユーザと、該ユーザの声との距離が所定値未満である少なくとも１つの音声変換プリセットと、を対応付ける参照情報に基づいて、
前記対象ユーザに対して選択された前記少なくとも１つの音声変換プリセットと同一の音声変換プリセットに対応付けられた少なくとも１人のユーザを、前記対象ユーザに類似する声を有する少なくとも１人の類似ユーザとして選択する第１選択工程と、
前記複数の音声変換プリセットのうち、前記少なくとも１人の類似ユーザにより過去に使用された少なくとも１つの音声変換プリセットを、前記対象ユーザに推奨すべき少なくとも１つの推奨音声変換プリセットとして選択する第２選択工程と、
を含む、請求項２２に記載の方法。 The method according to claim 22, further comprising:
a first selection step of selecting, based on reference information associating each user included in a plurality of users with at least one voice conversion preset whose distance from the voice of that user is less than a predetermined value, at least one user associated with the same voice conversion preset as the at least one voice conversion preset selected for the target user, as at least one similar user having a voice similar to that of the target user; and
a second selection step of selecting, from among the plurality of voice conversion presets, at least one voice conversion preset used in the past by the at least one similar user as at least one recommended voice conversion preset to be recommended to the target user.

各工程が前記対象ユーザの端末装置により実行される、請求項２３に記載の方法。 The method of claim 23, wherein each step is performed by a terminal device of the target user.

前記第１取得工程、前記第２取得工程、前記第３取得工程、前記第４取得工程、前記算出工程、前記第１選択工程及び前記第２選択工程のうち、前記第１取得工程から少なくとも１つの工程が前記対象ユーザの端末装置により実行され、残りの工程が、前記対象ユーザの端末装置に通信回線を介して接続されるサーバ装置により実行される、請求項２３に記載の方法。 The method according to claim 23, wherein, among the first acquisition step, the second acquisition step, the third acquisition step, the fourth acquisition step, the calculation step, the first selection step, and the second selection step, at least one step counted from the first acquisition step is executed by the terminal device of the target user, and the remaining steps are executed by a server device connected to the terminal device of the target user via a communication line.

前記第１取得工程、前記第２取得工程、前記第３取得工程、前記第４取得工程、前記算出工程、前記第１選択工程及び前記第２選択工程のうち、前記第１取得工程から少なくとも１つの工程が前記対象ユーザの端末装置により実行され、残りの工程が、前記対象ユーザの端末装置に通信回線を介して接続される他のユーザの端末装置により実行される、請求項２３に記載の方法。 The method according to claim 23, wherein, among the first acquisition step, the second acquisition step, the third acquisition step, the fourth acquisition step, the calculation step, the first selection step, and the second selection step, at least one step counted from the first acquisition step is executed by the terminal device of the target user, and the remaining steps are executed by a terminal device of another user connected to the terminal device of the target user via a communication line.

前記プロセッサが、中央処理装置（ＣＰＵ）、マイクロプロセッサ又はグラフィックスプロセッシングユニット（ＧＰＵ）を含む、請求項２２から請求項２６のいずれかに記載の方法。 The method of any of claims 22 to 26, wherein the processor comprises a central processing unit (CPU), a microprocessor, or a graphics processing unit (GPU).