JP4563418B2 - Audio processing apparatus, audio processing method, and program

Info

Publication number: JP4563418B2
Application number: JP2007082677A
Authority: JP
Inventors: 勉渡邉
Original assignee: Konami Digital Entertainment Co Ltd
Current assignee: Konami Digital Entertainment Co Ltd
Priority date: 2007-03-27
Filing date: 2007-03-27
Publication date: 2010-10-13
Anticipated expiration: 2027-03-27
Also published as: JP2008242082A

Abstract

PROBLEM TO BE SOLVED: To efficiently extract a desired part from speech data.
SOLUTION: A storage section 201 stores waveform data 251 representing speech that utters a character string. A determination section 202 determines a threshold period based on the length of the character string. A holding section 203 extracts a sound period from the waveform data 251 and holds it. When the time length of a held sound period is shorter than the threshold period determined by the determination section 202, an update section 204 replaces it with a new sound period composed of that sound period, another sound period near it, and the period between the two, and causes the holding section 203 to hold the new period. An output section 205 associates the character string with the sound period held by the holding section 203 and outputs it.
COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to an audio processing apparatus, an audio processing method, and a program suitable for efficiently extracting a desired portion from audio data.

In producing audio data, it is common to rent a studio, for example, and record in a single session voice actors speaking various lines, along with performances of sound effects and music. For a video game in which character objects (hereinafter simply "characters") have many lines, the work typically runs as follows: the voice actors speak a number of lines in succession, the lines are recorded together into a single audio data file, and then the portion corresponding to each line is extracted and edited to create separate audio data for each line. Various techniques have been devised to reduce the burden of this work.

For example, Patent Literature 1 discloses an apparatus that improves the efficiency of editing audio data. If the audio data is larger than a predetermined size (for example, the size of the ROM (Read Only Memory) that stores the data), the portion exceeding that size is deleted so that the audio data does not become too large, which reduces the burden of editing.
Patent Literature 2 discloses an apparatus that supports the editing of audio data through the way it manages data. Audio data is stored as independent track data before and after editing, so an edit can be undone (the user's most recent operation is cancelled and the previous state restored). In addition, the data is stored so that after editing no cluster (the unit of area in which a storage device stores data) holds only a very small amount of audio data, which allows stable playback.
Patent Literature 3 discloses an apparatus that can generate a wide variety of timbres. For example, in MIDI (Musical Instrument Digital Interface), audio data can be edited using not just one but several timbre sets, each of which records a plurality of timbre data as a set.
As described above, the prior art has devised ways to reduce the labor of editing audio data that has already been extracted.
Patent Literature 1: JP 2006-201666 A
Patent Literature 2: JP 2002-124022 A
Patent Literature 3: JP 2001-100744 A

A conventional audio processing apparatus extracts voiced sections from audio data by identifying them on the basis of the size and ratio of the attack portion (onset, rise) and the release portion (end, fall) of the waveform, the length of silent sections, and the like. Suppose, for example, that one wants to extract a single line from audio data containing many recorded lines and create audio data for that line alone. If a line such as "○○, △△" contains a pause (as at the comma) or a connecting portion, then depending on the length and volume of that pause or connection, it may be judged to be a silent section. As a result, what should be a single piece of audio data is split and extracted as two, or part of it is deleted as unnecessary. Conversely, if the interval between two recorded lines "○○" and "△△" is short, what should be extracted as separate audio data is extracted as one. The editor then has to check each piece of audio data to see whether it was extracted correctly and, if not, redo the extraction or merge and split the data by hand, which is tedious. The editor also has to estimate from experience roughly how long the audio data for each line will be in order to extract the desired audio data, which places a heavy burden on the editing work.

The present invention solves these problems, and its object is to provide an audio processing apparatus, an audio processing method, and a program suitable for efficiently extracting a desired portion from audio data.

To achieve the above object, the following invention is disclosed in accordance with the principle of the present invention.

An audio processing apparatus according to a first aspect of the present invention includes a storage unit, a determination unit, a holding unit, and an update unit.
The storage unit stores waveform data containing speech that utters a character string.
The determination unit determines a threshold time based on the character string.
The holding unit extracts voiced sections from the waveform data stored in the storage unit and holds them.
For each voiced section held by the holding unit, if the time length of the voiced section is shorter than the threshold time determined by the determination unit, the update unit updates the holding unit so that it holds, as a new voiced section, the section composed of that voiced section, one other voiced section located before or after it, and the section sandwiched between the two.
As a result, the audio processing apparatus can easily extract voiced sections from audio waveform data. A voiced section contains all or part of speech that utters a character string, such as dubbed dialogue for a film, a line spoken by a game character, or guidance from a voice guidance system. The apparatus also extracts speech so that no extracted section is shorter than the threshold time, which serves as the minimum time length for extraction. Consequently, speech that should be one continuous line is not split, and speech from different lines is not joined together, so the burden of audio editing on the user is reduced and the user can edit audio efficiently.
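The role of the holding unit can be illustrated with a minimal Python sketch. This passage does not say how voiced sections are detected, so the sketch assumes a simple amplitude gate: a sample counts as voiced when its absolute value is at least `level`. The function name `extract_voiced_sections`, the `level` parameter, and the representation of a section as a pair of sample indices are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# A section is (start, end) in sample indices, with end exclusive (assumed layout).
Section = Tuple[int, int]


def extract_voiced_sections(samples: List[float], level: float) -> List[Section]:
    """Return maximal runs of samples whose absolute value is at least `level`.

    Assumption: a plain amplitude gate stands in for whatever detector
    the holding unit actually uses.
    """
    sections: List[Section] = []
    start: Optional[int] = None
    for i, s in enumerate(samples):
        voiced = abs(s) >= level
        if voiced and start is None:
            start = i                      # a voiced run begins here
        elif not voiced and start is not None:
            sections.append((start, i))    # the run ended just before i
            start = None
    if start is not None:                  # a run that lasts to the end of the data
        sections.append((start, len(samples)))
    return sections
```

A gate of this kind tends to split a line at every pause, which is exactly the situation the update unit is meant to repair by merging sections that are shorter than the threshold time.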

The audio processing apparatus may further include an output unit that outputs a voiced section selected by the user from among the voiced sections held in the holding unit, in association with the character string.
As a result, the apparatus extracts and outputs speech so that no section is shorter than the threshold time, which serves as the minimum time length for extraction. Consequently, speech that should be one continuous line is not split when output, and speech from different lines is not joined together when output, so the burden of audio editing on the user is reduced and the user can edit audio efficiently. Moreover, even if the audio waveform data contains multiple lines, the waveform data of each extracted line is output in association with the character string representing its content, so it is clear which data corresponds to which speech, and the user can manage the extracted audio more easily.

The determination unit can determine the threshold time so that it increases monotonically with the length of the character string.
As a result, the apparatus can adjust the minimum time length for extraction according to the length of the character string; that is, it extracts speech of a time length suited to the length of the line. For example, if the audio waveform data being edited contains a long line, the minimum time length can be made longer so that the line is not split. If it contains a short line, the minimum time length can be made shorter so that the line is not merged with other lines.

The determination unit can determine the threshold time by summing constants of zero or greater that are predetermined according to character type.
For example, each constant can be a pronunciation time set according to the character type. The pronunciation time used here need not be an exact value for human pronunciation; any value that gives a rough estimate of the length of the speech to be extracted will do.
As a result, the apparatus takes the sum of the pronunciation times set according to character type as the minimum time length and can extract speech so that no section falls short of it. Character types include, for example, hiragana, katakana, kanji, digits, letters of the alphabet, characters of other languages, and punctuation marks. Alternatively, the minimum time length may be calculated from associations between symbols and pronunciation durations set arbitrarily by the user.
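The per-character summation can be sketched as follows. The duration values in `DURATION_BY_TYPE`, the helper `char_type`, and the Unicode ranges used to classify characters are illustrative assumptions, not values taken from the patent. Because every constant is zero or greater, appending characters never lowers the result, so the threshold is non-decreasing in the length of the string.

```python
import unicodedata

# Hypothetical pronunciation times in seconds per character type (illustrative only).
DURATION_BY_TYPE = {
    "hiragana": 0.15,
    "katakana": 0.15,
    "kanji": 0.30,
    "digit": 0.25,
    "latin": 0.10,
    "punctuation": 0.0,
    "other": 0.10,
}


def char_type(ch: str) -> str:
    """Classify one character by Unicode block or category (rough assumption)."""
    if "\u3041" <= ch <= "\u309f":
        return "hiragana"
    if "\u30a0" <= ch <= "\u30ff":
        return "katakana"
    if "\u4e00" <= ch <= "\u9fff":
        return "kanji"
    if ch.isdigit():
        return "digit"
    if ch.isascii() and ch.isalpha():
        return "latin"
    if unicodedata.category(ch).startswith("P") or ch.isspace():
        return "punctuation"
    return "other"


def threshold_time(text: str) -> float:
    """Sum the per-type constants over every character of the string."""
    return sum(DURATION_BY_TYPE[char_type(ch)] for ch in text)
```

A user-defined table, as the text allows, would simply replace `DURATION_BY_TYPE` with a mapping from individual symbols to durations.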

The update unit can obtain the time length of each of
(a) a first section sandwiched between the voiced section and a preceding voiced section that comes before it in time, and
(b) a second section sandwiched between the voiced section and a following voiced section that comes after it in time,
select whichever of the first and second sections has the shorter time length, and update the holding unit so that it holds, as the new voiced section, the section composed of the selected section, the voiced section, and the preceding or following voiced section corresponding to the selected section.
As a result, when there is a voiced section shorter than the minimum time length for extraction, the apparatus combines it with whichever of the two neighboring voiced sections is closer in time and extracts them as a single piece of speech. The apparatus can thus extract speech so that no piece is shorter than the minimum time length, reducing the burden of the user's editing work.
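The nearer-neighbor rule of (a) and (b) can be sketched in Python as below. Sections are assumed to be sorted, non-overlapping `(start, end)` pairs in seconds. The function names, the tie-break toward the preceding section when both gaps are equal, and the loop that repeats merging until no short section remains are assumptions made for illustration; the text fixes only the choice of the shorter gap.

```python
from typing import List, Tuple

Section = Tuple[float, float]  # (start, end) in seconds, assumed sorted and disjoint


def merge_with_nearer_neighbor(sections: List[Section], i: int) -> List[Section]:
    """Merge sections[i] with the neighbor separated from it by the shorter gap."""
    start, end = sections[i]
    gap_before = start - sections[i - 1][1] if i > 0 else None
    gap_after = sections[i + 1][0] - end if i + 1 < len(sections) else None
    if gap_before is None and gap_after is None:
        return list(sections)  # nothing to merge with
    # Assumption: on a tie, prefer the preceding section.
    use_before = gap_after is None or (gap_before is not None and gap_before <= gap_after)
    if use_before:
        merged = (sections[i - 1][0], end)
        return sections[: i - 1] + [merged] + sections[i + 1 :]
    merged = (start, sections[i + 1][1])
    return sections[:i] + [merged] + sections[i + 2 :]


def update_sections(sections: List[Section], threshold: float) -> List[Section]:
    """Repeat the merge until no section shorter than `threshold` can be merged."""
    result = list(sections)
    i = 0
    while i < len(result):
        start, end = result[i]
        if end - start < threshold and len(result) > 1:
            result = merge_with_nearer_neighbor(result, i)
            i = max(i - 1, 0)  # step back so the merged section is re-checked
        else:
            i += 1
    return result
```

For example, with a threshold of 1.0 second, the short section `(2.2, 2.5)` in `[(0.0, 2.0), (2.2, 2.5), (4.0, 6.0)]` is 0.2 seconds from its predecessor and 1.5 seconds from its successor, so it is absorbed into the preceding section.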

The update unit can obtain the time length of each of
(c) a first section composed of a preceding voiced section that comes before the voiced section in time and the section sandwiched between the preceding voiced section and the voiced section, and
(d) a second section composed of a following voiced section that comes after the voiced section in time and the section sandwiched between the following voiced section and the voiced section,
select whichever of the first and second sections has the shorter time length, and update the holding unit so that it holds, as the new voiced section, the section composed of the selected section and the voiced section.
As a result, when there is a voiced section shorter than the minimum time length for extraction, the apparatus combines it with whichever of the two neighboring voiced sections yields the shorter combined length and extracts them as a single piece of speech. The apparatus can thus extract speech so that no piece is shorter than the minimum time length, reducing the burden of the user's editing work.
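The rule of (c) and (d) differs from a pure gap comparison in that each candidate is measured as neighbor plus gap. A minimal Python sketch, assuming sorted `(start, end)` pairs in seconds, follows; `pick_side_by_span`, `merge`, and the tie-break toward the preceding side are hypothetical choices for illustration.

```python
from typing import List, Optional, Tuple

Section = Tuple[float, float]  # (start, end) in seconds, assumed sorted and disjoint


def pick_side_by_span(sections: List[Section], i: int) -> Optional[str]:
    """Return "before" or "after": the side whose neighbor-plus-gap span is shorter."""
    start, end = sections[i]
    # Span (c): from the start of the preceding section up to the start of section i.
    span_before = start - sections[i - 1][0] if i > 0 else None
    # Span (d): from the end of section i up to the end of the following section.
    span_after = sections[i + 1][1] - end if i + 1 < len(sections) else None
    if span_before is None and span_after is None:
        return None
    if span_after is None:
        return "before"
    if span_before is None:
        return "after"
    return "before" if span_before <= span_after else "after"  # tie-break is an assumption


def merge(sections: List[Section], i: int, side: str) -> List[Section]:
    """Replace sections[i] and the chosen neighbor with one combined section."""
    if side == "before":
        return sections[: i - 1] + [(sections[i - 1][0], sections[i][1])] + sections[i + 1 :]
    return sections[:i] + [(sections[i][0], sections[i + 1][1])] + sections[i + 2 :]
```

In `[(0.0, 3.0), (3.2, 3.5), (4.5, 5.0)]` the middle section is closer to its predecessor (gap 0.2 versus 1.0), yet the preceding span is 3.2 seconds and the following span 1.5 seconds, so this rule picks the following side and keeps the merged result short.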

The update unit can obtain the time length of each of
(e) a preceding voiced section that comes before the voiced section in time, and
(f) a following voiced section that comes after the voiced section in time,
select whichever of the preceding and following voiced sections has the shorter time length, and update the holding unit so that it holds, as the new voiced section, the section composed of the selected section, the voiced section, and the section sandwiched between the selected section and the voiced section.
As a result, when there is a voiced section shorter than the minimum time length for extraction, the apparatus combines it with whichever of the two neighboring voiced sections has the shorter time length and extracts them as a single piece of speech. The apparatus can thus extract speech so that no piece is shorter than the minimum time length, reducing the burden of the user's editing work.
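The rule of (e) and (f) compares only the lengths of the neighboring voiced sections themselves and ignores the gaps. A minimal Python sketch, again assuming sorted `(start, end)` pairs in seconds, is given below; the function name and the tie-break toward the preceding section are assumptions.

```python
from typing import List, Tuple

Section = Tuple[float, float]  # (start, end) in seconds, assumed sorted and disjoint


def merge_with_shorter_neighbor(sections: List[Section], i: int) -> List[Section]:
    """Merge sections[i] with whichever neighboring section is itself shorter."""
    start, end = sections[i]
    prev = sections[i - 1] if i > 0 else None
    nxt = sections[i + 1] if i + 1 < len(sections) else None
    if prev is None and nxt is None:
        return list(sections)  # nothing to merge with
    prev_len = prev[1] - prev[0] if prev is not None else None
    next_len = nxt[1] - nxt[0] if nxt is not None else None
    # Assumption: on a tie, prefer the preceding section.
    if nxt is None or (prev is not None and prev_len <= next_len):
        return sections[: i - 1] + [(prev[0], end)] + sections[i + 1 :]
    return sections[:i] + [(start, nxt[1])] + sections[i + 2 :]
```

In `[(0.0, 0.4), (2.0, 2.3), (2.4, 5.0)]` the preceding section is 0.4 seconds long and the following one 2.6 seconds, so the middle section joins the preceding one even though the following one is much closer in time.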

The holding unit may further extract and hold an offset section of predetermined length immediately preceding the start position of the voiced section and an offset section of predetermined length immediately following its end position.
The output unit may also play back the two extracted offset sections, have the user select one of the held voiced sections, and output the selected voiced section and the two extracted offset sections in association with the character string.
As a result, the apparatus can extract and output the audio before and after the voiced section as well. This reduces the burden of the user's editing work and allows effects to be applied before and after the speech, such as volume fade-in, fade-out, echo, low-pass filtering, high-pass filtering, and changes of playback speed.

The output unit may increase the volume monotonically from zero over the offset section of predetermined length preceding the start position of the voiced section, and decrease the volume monotonically to zero over the offset section of predetermined length following its end position.
As a result, the apparatus fades in at the beginning of the extracted speech and fades out at its end. This reduces the burden of the user's editing work and applies an effect that makes the beginning and end of the speech sound smooth.
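The fade over the two offset sections can be sketched in Python as follows. The clip is assumed to be the leading offset section, the voiced section, and the trailing offset section concatenated into one list of samples. The text requires only that the gain change monotonically; the linear ramp, the function name `apply_fades`, and the parameters `lead` and `tail` (offset lengths in samples) are assumptions made for illustration.

```python
from typing import List


def apply_fades(samples: List[float], lead: int, tail: int) -> List[float]:
    """Fade in over the first `lead` samples and fade out over the last `tail`.

    Assumption: a linear ramp; any monotonic gain curve would satisfy the text.
    """
    n = len(samples)
    if lead < 0 or tail < 0 or lead + tail > n:
        raise ValueError("offset lengths must be non-negative and fit in the clip")
    out = list(samples)
    for k in range(lead):
        out[k] *= k / lead            # gain starts at 0 and rises toward 1
    for k in range(tail):
        out[n - 1 - k] *= k / tail    # gain falls toward 0, reaching 0 at the last sample
    return out
```

With `lead=4` and `tail=2` on eight samples of value 1.0, the gains are 0, 0.25, 0.5, 0.75 over the leading offset, 1 over the voiced part, and 0.5, 0 over the trailing offset.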

An audio processing method according to another aspect of the present invention is executed by an apparatus having a storage unit, a determination unit, a holding unit, and an update unit, and includes a determination step, a holding step, and an update step.
The storage unit stores waveform data containing speech that utters a character string.
In the determination step, the determination unit determines a threshold time based on the character string.
In the holding step, the holding unit extracts voiced sections from the waveform data stored in the storage unit and holds them.
In the update step, for each voiced section held in the holding step, if the time length of the voiced section is shorter than the determined threshold time, the update unit performs an update so that the section composed of that voiced section, one other voiced section located before or after it, and the section sandwiched between the two is held as a new voiced section.
As a result, an apparatus using this audio processing method can easily extract voiced sections from audio waveform data. A voiced section contains speech that utters a character string, such as dubbed dialogue for a film, a line spoken by a game character, or guidance from a voice guidance system. The apparatus also extracts speech so that no extracted section is shorter than the threshold time, which serves as the minimum time length for extraction. Consequently, speech that should be one continuous line is not split, and speech from different lines is not joined together when extracted, so the burden of audio editing on the user is reduced and the user can edit audio efficiently.

A program according to another aspect of the present invention causes a computer to function as a storage unit, a determination unit, a holding unit, and an update unit.
The storage unit stores waveform data containing speech that utters a character string.
The determination unit determines a threshold time based on the character string.
The holding unit extracts voiced sections from the waveform data stored in the storage unit and holds them.
For each voiced section held by the holding unit, if the time length of the voiced section is shorter than the determined threshold time, the update unit updates the holding unit so that it holds, as a new voiced section, the section composed of that voiced section, one other voiced section located before or after it, and the section sandwiched between the two.
As a result, the program causes the computer to function as an apparatus that can easily extract voiced sections from audio waveform data. A voiced section contains speech that utters a character string, such as dubbed dialogue for a film, a line spoken by a game character, or guidance from a voice guidance system. The computer also extracts speech so that no extracted section is shorter than the threshold time, which serves as the minimum time length for extraction. Consequently, speech that should be one continuous line is not split, and speech from different lines is not joined together when extracted, so the burden of audio editing on the user is reduced and the user can edit audio efficiently.

The program of the present invention can be recorded on a computer-readable information storage medium such as a compact disc, flexible disk, hard disk, magneto-optical disk, digital video disc, magnetic tape, or semiconductor memory.
The program can be distributed and sold over a computer communication network independently of the computer on which it is executed. The information storage medium can likewise be distributed and sold independently of the computer.

According to the present invention, it is possible to provide an audio processing apparatus, an audio processing method, and a program suitable for efficiently extracting a desired portion from audio data.

An embodiment of the audio processing apparatus according to the present invention is described below.
Example 1
FIG. 1 shows the configuration of the audio processing apparatus 100 of this embodiment. As shown in the figure, the audio processing apparatus 100 includes an input unit 101, an image processing unit 102, an audio processing unit 103, a communication processing unit 104, a DVD-ROM (Digital Versatile Disk-Read Only Memory) drive 105, a storage device 106, a ROM (Read Only Memory) 107, a RAM (Random Access Memory) 108, a control unit 109, and a system bus 110.

The input unit 101 is connected to a keyboard 121 and a mouse 122, generates input signals based on instructions and data that the user enters with them, and passes the signals to the control unit 109. The user can instruct the audio processing apparatus 100 to perform a desired operation using the keyboard 121 and the mouse 122. The input unit 101 may also be connected to other input devices such as a touch panel.

The image processing unit 102 processes data read from the storage device 106, a DVD-ROM, or the like using the control unit 109 or an image arithmetic processor (not shown) included in the image processing unit 102, and then records the result in a frame memory (not shown) included in the image processing unit 102. The image information recorded in the frame memory is converted into a video signal at a predetermined synchronization timing and output to a monitor 123 connected to the image processing unit 102. This enables various kinds of image display.

The image arithmetic processor can perform, at high speed, overlay operations on two-dimensional images, transparency operations such as α blending, and various saturation operations. When the virtual space is three-dimensional, it can also perform at high speed the computation that renders, by the Z-buffer method, polygon information placed in the three-dimensional space and carrying various texture information, yielding a rendered image of the polygons in the virtual space as viewed from a predetermined viewpoint position along a predetermined line of sight. Furthermore, by operating in cooperation, the control unit 109 and the image arithmetic processor can draw a character string as a two-dimensional image into the frame memory or onto the surface of each polygon, in accordance with font information that defines the shapes of characters.

The audio processing unit 103 converts audio data into an analog audio signal with a D/A (Digital/Analog) converter and outputs the sound from a speaker 124. Under the control of the control unit 109, it also generates various sound effects and music data and outputs the corresponding sound from the speaker 124. When the audio data is MIDI data, it converts the MIDI data into PCM data by referring to the sound source data it holds. When the audio data is compressed, as in the ADPCM or Ogg Vorbis format, it decompresses the data and converts it into PCM data. PCM data is D/A converted at a timing corresponding to its sampling frequency and output to the speaker 124, which produces the audio output.

The audio processing unit 103 also converts an audio signal picked up by a microphone 125 into a digital audio signal with an A/D (Analog/Digital) converter and inputs it to the control unit 109. The audio processing unit 103 can compress any audio signal using a compression method such as ADPCM, Ogg Vorbis, AAC (Advanced Audio Coding), or MP3 (Moving Picture Experts Group-1 Audio Layer-3). The present invention does not limit the compression method.

The communication processing unit 104 includes a NIC (Network Interface Card) for connecting the audio processing apparatus 100 to a network such as the Internet. The NIC comprises a device conforming to the 10BASE-T/100BASE-T standards used to build a LAN (Local Area Network); an analog modem, ISDN (Integrated Services Digital Network) modem, or ADSL (Asymmetric Digital Subscriber Line) modem for connecting to the Internet over a telephone line; a cable modem for connecting to the Internet over a cable television line; or the like, together with an interface (not shown) that mediates between these and the control unit 109. In cooperation with the communication processing unit 104, the control unit 109 can exchange data with other computers connected to a network such as the Internet.

The DVD-ROM drive 105 reads the necessary programs and data from a DVD-ROM on which, for example, a game program, image data, and audio data are recorded. These are temporarily stored in the RAM 108 or the like. The drive may instead be one that reads data from, or writes data to, another information recording medium such as a CD-ROM (Compact Disc-Read Only Memory).

記憶装置１０６は、ハードディスクドライブなどから構成され、制御部１０９により実行されるオペレーティングシステム（ＯＳ）や各種の制御プログラムなどを記憶する。また、音声データ、静止画像データ、動画像データなど様々なデータを記憶することができる。 The storage device 106 includes a hard disk drive and the like, and stores an operating system (OS) executed by the control unit 109, various control programs, and the like. Various data such as audio data, still image data, and moving image data can be stored.

ＲＯＭ１０７は、制御部１０９が所定の処理を実行するためのプログラム等を予め格納する不揮発性メモリである。制御部１０９は、ＲＯＭ１０７から必要に応じてプログラム等を読み出してＲＡＭ１０８に展開し、このプログラム等に基づいて所定の処理を実行する。 The ROM 107 is a nonvolatile memory that stores in advance a program or the like for the control unit 109 to execute a predetermined process. The control unit 109 reads a program or the like from the ROM 107 as necessary, develops the program or the like in the RAM 108, and executes predetermined processing based on the program or the like.

The RAM 108 temporarily stores data and programs, and temporarily holds data read from the storage device 106 or the DVD-ROM. The control unit 109 also performs processing such as providing a variable area in the RAM 108 and operating directly on the value stored in a variable, or loading a value stored in the RAM 108 into a register, operating on the register, and writing the result back to memory.

The control unit 109 comprises a CPU (Central Processing Unit) or the like, controls the overall operation of the audio processing apparatus 100, and is connected to each of the components described above to exchange control signals and data. Using an ALU (Arithmetic Logic Unit) (not shown), the control unit 109 can perform, on registers (not shown), which are storage areas that can be accessed at high speed, arithmetic operations such as addition, subtraction, multiplication, and division; logical operations such as logical OR, logical AND, and logical NOT; and bit operations such as bitwise OR, bitwise AND, bit inversion, bit shift, and bit rotation. Furthermore, in some cases the control unit 109 is itself configured to perform saturation arithmetic such as addition, subtraction, multiplication, and division for multimedia processing, as well as trigonometric functions and vector operations, at high speed, and in other cases this is achieved by providing a coprocessor.

システムバス１１０は、上述した各部の間で命令やデータを転送するための伝送経路である。 The system bus 110 is a transmission path for transferring commands and data between the above-described units.

A widely available information processing apparatus, such as a so-called personal computer, can be used as the audio processing apparatus 100.

Next, the processing performed by each unit of the audio processing apparatus 100 of this embodiment will be described.

FIG. 2 is a diagram for explaining the configuration of the audio processing apparatus 100 of this embodiment. As shown in the figure, the audio processing apparatus 100 includes a storage unit 201, a determination unit 202, a holding unit 203, an update unit 204, and an output unit 205.

The storage unit 201 stores audio waveform data 251. The waveform data 251 contains, for example, speech reading out character strings such as lines of dialogue used in a game. The format of the waveform data 251 is arbitrary, as long as the output unit 205 can decode and reproduce it. The waveform data 251 can contain not only the audio data of a single line (hereinafter referred to as "line data") but also a plurality of line data items. The waveform data 251 may also contain audio other than lines, such as music, songs, and sound effects. The control unit 109 and the storage device 106 operate in cooperation to function as the storage unit 201.

図３は、波形データ２５１の例を示す図である。波形データ２５１は、典型的には、縦軸に音声のパワー値（音の強さ）、横軸に経過時間をとったスペクトルで表される。１つの波形データ２５１は、１つの音声データファイルとして記憶装置１０６に記憶される。波形データ２５１は有音区間３１０と無音区間３２０とを含むが、詳細については後述する。 FIG. 3 is a diagram illustrating an example of the waveform data 251. The waveform data 251 is typically represented by a spectrum in which the vertical axis represents the sound power value (sound intensity) and the horizontal axis represents the elapsed time. One waveform data 251 is stored in the storage device 106 as one audio data file. The waveform data 251 includes a voiced section 310 and a silent section 320, details of which will be described later.

For example, when recording dubbed dialogue for a film, character voices for games or animation, or guidance audio for a voice guidance system, it is often difficult to gather many voice actors in a studio at the same time because of their individual schedules, and the number of required line variations can be very large. For this reason, production sites often record many lines together into a single piece of audio data, or create separate audio data for each voice actor. The audio waveform data 251 handled in this embodiment is, for example, audio data in which a plurality of line data items are recorded together in this way.

また、記憶部２０１は、波形データ２５１がセリフなどの音声を含む場合、波形データ２５１に対応付けて、セリフに相当する文字列データを記憶する。 In addition, when the waveform data 251 includes speech such as speech, the storage unit 201 stores character string data corresponding to speech in association with the waveform data 251.

FIG. 4(a) shows a configuration example of a line table 400 that stores the character strings of lines in association with the waveform data 251. The line table 400 associates a waveform data name 410, a line number 420, and character string data 430 with one another. The waveform data name 410 is information for identifying the waveform data 251; for example, a data file name is used. The line number 420 is information for identifying each line within one waveform data 251; for example, a number indicating the order in which the lines were recorded is used. One line data item can be specified by the combination of the waveform data name 410 and the line number 420. The character string data 430 is the character string representing a line, and is expressed using, for example, hiragana, katakana, kanji, numerals, alphabetic characters, characters of other languages, and punctuation marks. The configuration of the line table 400 shown in the figure is merely an example; a configuration containing only part of this information, or one storing other information in addition, may also be adopted.

For example, the control unit 109 reads a data file in which the waveform data name 410 of the waveform data 251 and the character string data 430 representing the lines contained in that waveform data 251 are described in association with each other in advance, in a format such as that shown in FIG. 4(b), creates the line table 400, and stores it in the storage device 106. The control unit 109 may assign line numbers in the order in which the character strings are recorded in the data file. Counting the character strings also gives the number of line data items contained in the waveform data 251. This data file is an electronic file created in advance by the user, typically in CSV (Comma Separated Values) format or XML (Extensible Markup Language) format, although the data format is not limited to these. Alternatively, the input unit 101 may accept from the user the character strings of the lines recorded in each waveform data 251 and pass them to the control unit 109, which then creates the line table 400.
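The construction of the line table from a CSV data file can be sketched as follows. This is only an illustrative sketch: the actual layout of FIG. 4(b) is not reproduced here, so the two-column layout (waveform data name, line text), the function name `build_line_table`, and the dictionary keys are all assumptions made for the example.

```python
import csv
import io
from collections import defaultdict


def build_line_table(csv_text):
    """Build a list of table rows from CSV text.

    Assumed (hypothetical) CSV layout: column 0 is the waveform data name,
    column 1 is the character string of one line.
    """
    counters = defaultdict(int)  # next line number per waveform data name
    table = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        name, text = row[0].strip(), row[1].strip()
        counters[name] += 1  # numbers follow the recorded order
        table.append({"waveform": name, "number": counters[name], "text": text})
    return table


def count_lines(table, waveform_name):
    """Number of line data items expected in one waveform data file."""
    return sum(1 for row in table if row["waveform"] == waveform_name)
```

Line numbers are assigned per waveform data name in the order the rows appear, which mirrors the numbering rule described above.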

なお、波形データ２５１は、記憶装置１０６に記憶されてもよいし、ＤＶＤ−ＲＯＭ、ＣＤ−ＲＯＭ、磁気テープなどの情報記録媒体や、インターネットやＬＡＮなどのネットワークに繋がった他のコンピュータに記憶されていてもよい。情報記録媒体に記憶される場合には、制御部１０９は、その情報記録媒体に対応したドライブ装置を用いて波形データ２５１を取得すればよい。ネットワーク上の他のコンピュータに記憶される場合には、制御部１０９は、通信処理部１０４を制御して、ネットワークアドレス等を用いて保存先のコンピュータに接続し、データ通信で波形データ２５１を取得すればよい。 The waveform data 251 may be stored in the storage device 106, or in an information recording medium such as a DVD-ROM, CD-ROM, or magnetic tape, or in another computer connected to a network such as the Internet or a LAN. It may be. When the information is stored in the information recording medium, the control unit 109 may obtain the waveform data 251 using a drive device corresponding to the information recording medium. When the data is stored in another computer on the network, the control unit 109 controls the communication processing unit 104 to connect to the storage destination computer using a network address or the like, and acquire the waveform data 251 by data communication. do it.

決定部２０２は、文字列データ４３０に基づいて、波形データ２５１からセリフ部分を判別するために用いられる閾時間ＴＳを決定する。制御部１０９と記憶装置１０６が協働して動作することにより、決定部２０２として機能する。 Based on the character string data 430, the determination unit 202 determines a threshold time TS that is used to determine a speech portion from the waveform data 251. The control unit 109 and the storage device 106 operate in cooperation to function as the determination unit 202.

Specifically, the control unit 109 determines the threshold time TS so that it increases monotonically with the length of the character string data 430 corresponding to a line. For example, as shown in [Equation 1], a predetermined pronunciation time Tp (Tp is a positive constant) is set per character of the character string data 430 representing the line, and the threshold time TS is obtained by multiplying Tp by the number of characters cnt. According to [Equation 1], the threshold time TS increases in proportion to the number of characters.

ＴＳ＝ｃｎｔ × Ｔｐ・・・［数１］ TS = cnt × Tp (Equation 1)

Alternatively, as shown in [Equation 2], the pronunciation time Tk predetermined for each character type k may be multiplied by the number of characters cnt(k) belonging to that type, and the sum over all types k may be used as the threshold time TS.

ＴＳ＝ Σ（ｃｎｔ（ｋ）×Ｔｋ）・・・［数２］ TS = Σ (cnt (k) × Tk) (Equation 2)

Here, the character type is a classification such as hiragana, katakana, kanji, numerals, alphabetic characters, characters of other languages, and punctuation marks, as shown for example in FIG. 5(a). That is, a time per character, such as a certain number of seconds for each hiragana character, is determined in advance and stored in the storage device 106. The control unit 109 multiplies the number of hiragana characters in the character string data 430 by the predetermined pronunciation time Th to obtain the total pronunciation time for the hiragana characters. The control unit 109 calculates the totals for the other types of characters and symbols in the same way, and uses their sum as the threshold time TS. The predetermined pronunciation time may also differ between full-width and half-width characters, or between uppercase and lowercase letters.

Alternatively, as shown in FIG. 5(b), each individual character may be treated as its own type. That is, a predetermined pronunciation time may be set for each character or symbol, for example Ta seconds for the hiragana “あ”, Tb seconds for “い”, Tc seconds for the kanji “六”, Td seconds for the alphabetic character “Ｚ”, and Te seconds for the comma “、”, and stored in the storage device 106. The control unit 109 multiplies the total number of occurrences of the character “あ” in the character string data 430 by the predetermined pronunciation time Ta to obtain the total time for “あ”. The control unit 109 calculates the total pronunciation time for the other characters and symbols in the same way, and uses their sum as the threshold time TS. The same applies to characters other than those mentioned here.
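[Equation 1] and [Equation 2] can be sketched as follows. The classification by Unicode character name, the type labels, and the function names are assumptions made for illustration; the description only requires that each type k has a predetermined time Tk. Types missing from the table contribute zero here, which is also an assumption of this sketch.

```python
import unicodedata


def char_type(ch):
    """Classify one character into a coarse type (hypothetical labels)."""
    name = unicodedata.name(ch, "")
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if name.startswith("KATAKANA"):
        return "katakana"
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "kanji"
    category = unicodedata.category(ch)
    if category == "Nd":
        return "digit"
    if category.startswith("P"):
        return "punctuation"
    if "LATIN" in name:
        return "alphabet"
    return "other"


def threshold_eq1(text, tp):
    """[Equation 1]: TS = cnt x Tp, one fixed time per character."""
    return len(text) * tp


def threshold_eq2(text, type_times):
    """[Equation 2]: TS = sum over types k of cnt(k) x Tk."""
    counts = {}
    for ch in text:
        k = char_type(ch)
        counts[k] = counts.get(k, 0) + 1
    # Types without an entry are given zero seconds (assumption).
    return sum(n * type_times.get(k, 0.0) for k, n in counts.items())
```

For the per-character variant of FIG. 5(b), the same summation applies with the character itself used as the key instead of `char_type(ch)`.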

In general, the pronunciation time per character varies relatively little for hiragana but tends to vary widely for kanji. For example, although both belong to the same type "kanji", “木” corresponds to the single hiragana character “き” (ki), whereas “閾” corresponds to the three hiragana characters “しきい” (shikii). Thus, particularly when the character string data 430 contains kanji, it is preferable to obtain the threshold time TS as in [Equation 2].

Here, the pronunciation time may also be determined in advance for predetermined character strings, for example Tf seconds for the character string “こんにちは” (konnichiwa) and Tg seconds for “六本木” (Roppongi). That is, a predetermined pronunciation time for the whole character string may be associated with character strings such as words or sentences that are used repeatedly, words or sentences that are expected to be used frequently, and words or sentences containing characters whose pronunciation time differs depending on the combination even though the character is the same. These times are stored in the storage device 106, and the control unit 109 may calculate the pronunciation time of the character string data 430 based on the predetermined pronunciation times of the character strings stored in the storage unit 201.

For example, the pronunciation time of the single character “六” (written “ろく”, roku, in hiragana) differs from that of the portion corresponding to “六” in the character string “六本木” (written “ろっ” in hiragana). Likewise, the pronunciation time of the single character “木” differs greatly depending on whether it is pronounced “き” (ki) (or “ぎ”, gi) or “もく” (moku). In such cases, it is desirable to determine the pronunciation time for the predetermined character string (here “六本木”) in advance and store it in the storage device 106.

In addition, by including in the character string data 430 punctuation marks such as “。” and “、”, or symbols indicating a pause such as “・・・”, the control unit 109 can calculate a pronunciation time closer to that of actual human speech.

For example, when punctuation marks are included, as in the character string “今日は、いい天気です。” ("It is nice weather today."), the control unit 109 can calculate the pronunciation time including the subtle pauses that connect the words, which makes it easier to extract the whole utterance as a single line. That is, by associating predetermined pronunciation times not only with individual characters and symbols but also with whole character strings, it is possible to avoid the first half “今日は” and the second half “いい天気です” being split into audio data for separate lines, or the final “す” being cut off partway through. These associations are stored in advance in a predetermined storage area of the storage device 106, and the control unit 109 refers to them as appropriate when calculating the threshold time TS.

The symbols that can be included in the character string data 430 are not limited to punctuation marks. The control unit 109 can calculate the pronunciation time based on associations between symbols and pronunciation times that the user has set freely and stored in the storage device 106 in advance. For example, the user can define rules such as "the symbol “○” inserts a pause of 1 second", "if the symbol “▲” appears at the beginning of the character string, the overall pronunciation time is multiplied by n", or "a space (blank) counts as 0 seconds".

The control unit 109 then determines the threshold time TS based on the pronunciation time calculated for each character string data 430. For example, suppose that certain waveform data 251 contains a plurality of line data items and that the lengths of the character string data 430 representing the respective lines, expressed as calculated pronunciation times, are Ta, Tb, ..., Tn. In this case, the control unit 109 sets the threshold time TS to the smallest of Ta, Tb, ..., Tn.

Each of the pronunciation times Tk, Th, and Ta to Tg is a predetermined constant greater than or equal to zero. For example, when the character string data 430 contains characters or symbols that are not included in the associations shown in FIG. 5, or characters or symbols that the control unit 109 cannot read correctly (such as platform-dependent characters that become garbled), the control unit 109 may ignore them and calculate the threshold time TS with the pronunciation time for those characters and symbols set to zero. The characters and character strings described here are merely examples, and any characters and character strings may be used. Although characters or character strings are written in the associations of FIG. 5, they may instead be expressed by predetermined character codes or combinations thereof.
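The use of whole-string pronunciation times, zero time for unknown characters, and the minimum over all lines can be sketched as follows. The greedy longest-match lookup and the function names are assumptions made for illustration; the description does not specify how registered character strings are matched against a line. Phrase keys are assumed to be non-empty.

```python
def pronunciation_time(text, phrase_times, char_times):
    """Estimate the pronunciation time of one line.

    phrase_times: registered character strings -> time for the whole string.
    char_times:   individual characters or symbols -> time per occurrence.
    Characters found in neither table contribute zero seconds.
    """
    # Try longer registered strings first (assumed matching policy).
    phrases = sorted(phrase_times, key=len, reverse=True)
    total = 0.0
    pos = 0
    while pos < len(text):
        for phrase in phrases:
            if text.startswith(phrase, pos):
                total += phrase_times[phrase]
                pos += len(phrase)
                break
        else:
            total += char_times.get(text[pos], 0.0)
            pos += 1
    return total


def threshold_time(lines, phrase_times, char_times):
    """TS is the smallest pronunciation time among the lines of one waveform."""
    return min(pronunciation_time(t, phrase_times, char_times) for t in lines)
```

Taking the minimum means that no genuine line in the waveform data is expected to be shorter than TS, so any shorter voiced section is a candidate for merging.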

保持部２０３は、記憶部２０１に記憶された波形データ２５１の有音区間３１０と無音区間３２０を判別し、波形データ２５１から有音区間３１０を抽出して記憶部２０１に記憶させる。制御部１０９と記憶装置１０６が協働して動作することにより、保持部２０３として機能する。ここで、保持部２０３が有音区間３１０と無音区間３２０を判別するための手法には様々な手法がある。 The holding unit 203 discriminates the voiced section 310 and the silent section 320 of the waveform data 251 stored in the storage unit 201, extracts the voiced section 310 from the waveform data 251, and stores it in the storage unit 201. The control unit 109 and the storage device 106 operate in cooperation to function as the holding unit 203. Here, there are various methods for the holding unit 203 to discriminate between the voiced section 310 and the silent section 320.

For example, as shown in FIG. 6, when the power value (or amplitude, or sound intensity) of the spectrum represented by the waveform data 251 is greater than or equal to a reference value Pbase at a time T(i), and the power value of the waveform data 251 before that time T(i) has remained below the reference value Pbase for a predetermined time TX or longer, the control unit 109 sets that time T(i) as the start point 610 of a voiced section 310 (or the end point of a silent section 320).
Similarly, when the power value of the waveform data 251 is below a reference value Qbase at a time T(j), and the power value of the waveform data 251 before that time T(j) has remained at or above the reference value Qbase for a predetermined time TY or longer, the control unit 109 sets that time T(j) as the end point 620 of the voiced section 310 (or the start point of a silent section 320). That is, even if there is a section in which the amplitude of the waveform data 251 is not zero, it may be judged to be noise and treated as a silent section 320.
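The start-point and end-point rules above can be sketched as a small state machine over a sequence of power values sampled at a fixed interval. This is one literal reading of the two conditions, not a definitive implementation: the function name, the use of frame counts for TX and TY, and the treatment of the time before the recording begins as silence are assumptions of the sketch.

```python
def find_voiced_sections(power, pbase, qbase, tx, ty):
    """Return (start, end) frame indices of voiced sections.

    power: power value per frame; tx, ty: required history in frames.
    'end' is the first frame that is judged silent again.
    """
    sections = []
    start = None
    for t, p in enumerate(power):
        if start is None:
            # Start: p >= Pbase after at least tx frames below Pbase.
            # Frames before the recording starts count as silence (assumption).
            history = power[max(0, t - tx):t]
            if p >= pbase and all(q < pbase for q in history):
                start = t
        else:
            # End: p < Qbase after at least ty frames at or above Qbase.
            history = power[max(0, t - ty):t]
            if p < qbase and len(history) == ty and all(q >= qbase for q in history):
                sections.append((start, t))
                start = None
    if start is not None:
        # Still voiced at the end of the data: close at the last frame.
        sections.append((start, len(power)))
    return sections
```

Using separate thresholds Pbase and Qbase gives a simple hysteresis, so a signal hovering near a single threshold does not toggle between voiced and silent on every frame.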

また、制御部１０９は、例えば図７（ａ）に示すように、波形データ２５１の表すスペクトルのアタック比率（あるいはリリース比率）を単位時間ごとに計算し、計算されたアタック比率あるいはリリース比率に基づいて有音区間３１０と無音区間３２０を判別してもよい。アタック比率（あるいはリリース比率）は、波形データ２５１のパワー値の変化量を表す数値である。例えば、ある時刻Ｔ（１）においてパワー値がＦ（１）であり、その後の時刻Ｔ（２）においてパワー値がＦ（２）であったとすると、その間の増減比率はＦ（２）／Ｆ（１）となる。この増減比率が１以上であればアタック（波形の立ち上がり）であり、１未満であればリリース（波形の立ち下がり、減衰）である。 Further, for example, as shown in FIG. 7A, the control unit 109 calculates the attack ratio (or release ratio) of the spectrum represented by the waveform data 251 for each unit time, and based on the calculated attack ratio or release ratio. Thus, the voiced section 310 and the silent section 320 may be discriminated. The attack ratio (or release ratio) is a numerical value representing the amount of change in the power value of the waveform data 251. For example, if the power value is F (1) at a certain time T (1) and the power value is F (2) at a subsequent time T (2), the increase / decrease ratio between them is F (2) / F (1). If the increase / decrease ratio is 1 or more, it is an attack (rise of waveform), and if it is less than 1, it is release (fall of waveform, attenuation).

For example, as shown in FIG. 7(b), when the attack ratio (or release ratio) of the spectrum represented by the waveform data 251 is greater than or equal to the reference value Pbase at a time T(i), and the power value of the waveform data 251 before that time T(i) has remained below the reference value Pbase for a predetermined time TX or longer, the control unit 109 sets that time T(i) as the start point 610 of a voiced section 310 (or the end point of a silent section 320).
Similarly, when the attack ratio (or release ratio) is below the reference value Qbase at a time T(j), and the power value of the waveform data 251 before that time T(j) has remained at or above the reference value Qbase for a predetermined time TY or longer, the control unit 109 sets that time T(j) as the end point 620 of the voiced section 310 (or the start point of a silent section 320).

The control unit 109 may also calculate, as the attack ratio (or release ratio), the ratio of the average power value over a predetermined time before a time T(i) to the average power value over a predetermined time after that time T(i). The control unit 109 may also calculate the ratio using other statistical measures over a predetermined time, such as the variance or the standard deviation.
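The per-unit-time increase/decrease ratio F(2)/F(1) described for FIG. 7(a) can be sketched as follows. The function names and the small floor `eps`, used to avoid division by zero in completely silent frames, are assumptions of the sketch and do not appear in the description.

```python
def change_ratios(power, eps=1e-9):
    """Ratio F(t+1) / F(t) for each pair of consecutive power values.

    eps floors both values so that silent frames do not divide by zero
    (an assumption; the description does not cover this case).
    """
    return [max(later, eps) / max(earlier, eps)
            for earlier, later in zip(power, power[1:])]


def classify_ratio(ratio):
    """A ratio of 1 or more is an attack (rise); below 1 is a release (decay)."""
    return "attack" if ratio >= 1.0 else "release"
```

The resulting ratio sequence can then be compared against the reference values in the same way as the raw power values.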

As shown for example in FIG. 8, the control unit 109 may also obtain an approximate curve 810 representing the waveform data 251 (or the absolute value of the waveform data 251) using a known technique such as the fast Fourier transform, and distinguish the voiced sections 310 from the silent sections 320 based on this approximate curve 810. That is, the control unit 109 divides the waveform data 251 into sections at the points where the approximate curve 810 takes the reference value Pbase (in other words, at its intersections with the straight line P = Pbase), and when the value of the approximate curve 810 before the intersection at a time T(i) has remained below the reference value Pbase for a predetermined time TX or longer, sets that time T(i) as the start point 610 of a voiced section 310 (or the end point of a silent section 320).
Similarly, when the value of the approximate curve 810 before the intersection at a time T(j) has remained at or above the reference value Qbase for a predetermined time TY or longer, the control unit 109 sets that time T(j) as the end point 620 of the voiced section 310 (or the start point of a silent section 320).
The reference value Pbase used to determine the start point 610 of a voiced section 310 and the reference value Qbase used to determine its end point 620 may be the same value or different values.
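The idea of thresholding a smoothed curve instead of the raw waveform can be sketched as follows. Note that this sketch substitutes a simple trailing moving average of the absolute sample values for the fast Fourier transform mentioned above, purely to keep the example short and dependency-free; the function names and the window parameter are assumptions.

```python
def moving_average_envelope(samples, window):
    """Smooth |samples| with a trailing moving average of 'window' samples."""
    magnitudes = [abs(s) for s in samples]
    envelope = []
    for i in range(len(magnitudes)):
        chunk = magnitudes[max(0, i - window + 1):i + 1]
        envelope.append(sum(chunk) / len(chunk))
    return envelope


def crossings(curve, base):
    """Indices where the curve crosses the line P = base, with direction.

    "up" means the curve rose to base or above; "down" means it fell below.
    """
    points = []
    for i in range(1, len(curve)):
        was_below = curve[i - 1] < base
        is_below = curve[i] < base
        if was_below != is_below:
            points.append((i, "up" if was_below else "down"))
    return points
```

The crossing points are only candidates; the TX and TY duration conditions described above would still be applied to decide which of them become start points 610 and end points 620.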

FIG. 9 is a simplified diagram of the voiced sections 310 and silent sections 320 of the waveform data 251 determined by such techniques. The control unit 109 extracts the audio data of each voiced section 310, in which speech such as a line is presumed to be recorded, and stores it in the storage device 106. The control unit 109 may store the extracted partial waveform data itself in the storage device 106, or may store only information indicating the extracted voiced section (for example, its start time and end time). The control unit 109 may also store the information indicating the voiced section 310 in the RAM 108.

For each voiced section 310 held in the holding unit 203, when the time length of the voiced section 310 is shorter than the threshold time TS determined by the determination unit 202, the update unit 204 causes the holding unit 203 to hold, as a new voiced section 310, a section composed of that voiced section 310, another voiced section 310 located near it, and the silent section 320 sandwiched between the two voiced sections 310. The control unit 109 and the storage device 106 operate in cooperation to function as the update unit 204.

Specifically, as shown in FIG. 10(a), when the time length of a voiced section 1011 is greater than or equal to the threshold time TS, the control unit 109 stores the voiced section 1011 in the storage device 106 as it is, without updating it.
On the other hand, as shown in FIG. 10(b), when the time length of the voiced section 1011 is shorter than the threshold time TS, the control unit 109 sets, as a new voiced section, a section composed of the voiced section 1011, another voiced section 1012, and the silent section 1021 sandwiched between the two voiced sections 1011 and 1012. That is, as shown in FIG. 10(c), the control unit 109 stores the section composed of the voiced sections 1011 and 1012 and the silent section 1021 in the storage device 106 as a new voiced section 1030. Here, the other voiced section 1012 is, for example, the voiced section adjacent to the voiced section 1011 across one silent section.

The control unit 109 repeats the update until the time length of the new voiced section becomes greater than or equal to the threshold time TS. For example, as shown in FIG. 10(d), when the time length of the voiced section 1011 is shorter than the threshold time TS, the control unit 109 sets, as a new voiced section, a section of time length T1 composed of the voiced section 1011, another voiced section 1012, and the silent section 1021 sandwiched between them. However, because the time length T1 is still shorter than the threshold time TS, the control unit 109 sets a section of time length T2, which further includes a voiced section 1013 and the intervening silent section 1022, as the new voiced section 1030, as shown in FIG. 10(e). Because the time length T2 is longer than the threshold time TS, the control unit 109 then ends the update of the voiced section 1030. If the time length T2 were shorter than the threshold time TS, the control unit 109 would simply update the voiced section 1030 again, and it may repeat the update any number of times.
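The repeated update can be sketched as follows, with each voiced section represented as a (start, end) pair in time order. Merging only with the next section in time, and leaving a short final section as it is when nothing follows it, are assumptions of this sketch; the description says only that a nearby voiced section is absorbed.

```python
def merge_short_sections(sections, ts):
    """Absorb following sections until each merged section lasts at least ts.

    sections: list of (start, end) pairs sorted by start time.
    The silent gap between absorbed sections becomes part of the result.
    """
    merged = []
    i = 0
    while i < len(sections):
        start, end = sections[i]
        i += 1
        # Keep extending to the end of the next voiced section while too short.
        while end - start < ts and i < len(sections):
            end = sections[i][1]
            i += 1
        merged.append((start, end))
    return merged
```

For example, with sections of lengths 1, 1, and 4 separated by gaps of 1 and TS = 5, the first section is extended twice before it reaches the threshold, which corresponds to the T1 and T2 steps of FIG. 10(d) and FIG. 10(e).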

このように、制御部１０９が有音区間３１０を更新することにより、計算された閾時間ＴＳ以上の時間長の有音区間が記憶装置１０６に記憶されることとなる。ここで、閾時間ＴＳは文字列データ４３０に基づいて計算されるため、編集者はいちいち閾時間ＴＳをマニュアルで指定する必要はない。また、制御部１０９によって更新された有音区間３１０には、計算された閾時間ＴＳに満たない時間長の有音区間がないので、本来１つであるべき音声データが分割されてしまったため結合し直す、といった余計な手間を省くことができる。 As described above, when the control unit 109 updates the voiced sections 310, voiced sections whose time length is equal to or greater than the calculated threshold time TS are stored in the storage device 106. Since the threshold time TS is calculated based on the character string data 430, the editor does not need to specify the threshold time TS manually each time. In addition, since the voiced sections 310 updated by the control unit 109 include no voiced section whose time length is less than the calculated threshold time TS, the extra work of rejoining audio data that should have been a single piece but was split can be avoided.

本実施形態では、更新部２０４は、ある有音区間３１０と、その有音区間３１０と１つの無音区間３２０を挟んで隣り合わせに存在する他の有音区間とをまとめて１つの新たな有音区間にする。ただし、２つ以上の無音区間３２０を挟んでいてもよい。すなわち、更新部２０４は、時間長が閾時間ＴＳ未満の有音区間３１０が存在すると、その有音区間３１０より時系列的に後に存在する他の有音区間のうち、その有音区間３１０の開始点から閾時間ＴＳ以上離れ、且つ、最も近いものを選択する。そして、更新部２０４は、その有音区間３１０の開始点から、選択した他の有音区間の終了点までを新たな有音区間にする。このようにすれば、更新部２０４は、有音区間３１０を繰り返し更新しなくて済み、制御部１０９が行う処理の負荷を軽減できる。
あるいは、更新部２０４は、時間長が閾時間ＴＳ未満の有音区間３１０が存在すると、その有音区間３１０より時系列的に前に存在する他の有音区間のうち、その有音区間３１０の終了点から閾時間ＴＳ以上離れ、且つ、最も近いものを選択してもよい。そして、更新部２０４は、その有音区間３１０の終了点から、選択した他の有音区間の開始点までを新たな有音区間にしてもよい。
さらには、更新部２０４は、時間長が閾時間ＴＳ未満の有音区間３１０が存在すると、
（１）その有音区間３１０より時系列的に後に存在する他の有音区間のうち、その有音区間３１０の開始点から閾時間ＴＳ以上離れ、且つ、最も近いもの
（２）その有音区間３１０より時系列的に前に存在する他の有音区間のうち、その有音区間３１０の終了点から閾時間ＴＳ以上離れ、且つ、最も近いもの
の両方を特定し、いずれか近い方を選択して、新たな有音区間を生成してもよい。 In the present embodiment, the update unit 204 combines a voiced section 310 and another voiced section adjacent to it across one silent section 320 into one new voiced section. However, two or more silent sections 320 may lie between them. That is, when there is a voiced section 310 whose time length is less than the threshold time TS, the update unit 204 selects, from among the other voiced sections that exist after the voiced section 310 in time series, the nearest one that is at least the threshold time TS away from the start point of the voiced section 310. Then, the update unit 204 sets the span from the start point of the voiced section 310 to the end point of the selected voiced section as a new voiced section. In this way, the update unit 204 does not need to update the voiced section 310 repeatedly, and the processing load on the control unit 109 can be reduced.
Alternatively, when there is a voiced section 310 whose time length is less than the threshold time TS, the update unit 204 may select, from among the other voiced sections that exist before the voiced section 310 in time series, the nearest one that is at least the threshold time TS away from the end point of the voiced section 310. Then, the update unit 204 may set the span from the end point of the voiced section 310 to the start point of the selected voiced section as a new voiced section.
Furthermore, when there is a voiced section 310 whose time length is less than the threshold time TS, the update unit 204 may identify both
(1) among the other voiced sections that exist after the voiced section 310 in time series, the nearest one that is at least the threshold time TS away from the start point of the voiced section 310, and
(2) among the other voiced sections that exist before the voiced section 310 in time series, the nearest one that is at least the threshold time TS away from the end point of the voiced section 310,
select whichever of the two is nearer, and generate a new voiced section.
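
The single-pass forward selection described above might look like the following sketch. The function name `extend_forward_once` and the tuple format are hypothetical, and the sketch assumes that "at least the threshold time TS away from the start point" is measured to the end point of the candidate section, which the text does not state explicitly.

```python
from typing import List, Tuple

Section = Tuple[int, int]  # (start_ms, end_ms), sorted by time (assumed format)


def extend_forward_once(sections: List[Section], index: int, ts_ms: int) -> Section:
    """Return the new voiced section for sections[index] in a single pass.

    Assumption: a later section qualifies when its end point lies at least
    ts_ms after the start point of sections[index].
    """
    start, end = sections[index]
    if end - start >= ts_ms:
        return (start, end)  # already long enough, nothing to do
    for _, cand_end in sections[index + 1:]:
        if cand_end - start >= ts_ms:
            # Nearest qualifying section: span start .. its end point,
            # possibly crossing two or more silent sections.
            return (start, cand_end)
    return (start, end)  # no later section qualifies


if __name__ == "__main__":
    voiced = [(0, 300), (500, 800), (1000, 2500)]
    print(extend_forward_once(voiced, 0, ts_ms=1000))
```

Compared with the repeated update, the loop here decides the final end point directly, which matches the stated aim of avoiding repeated updates of the same section.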

制御部１０９が有音区間３１０を更新する手法はこれに限られず、他の手法もある。 The method by which the control unit 109 updates the sound section 310 is not limited to this, and there are other methods.

例えば図１１（ａ）に示すように、制御部１０９は、
（１）有音区間１１１１と、有音区間１１１１より時系列的に前に存在する前方有音区間１１１２と、に挟まれる前方無音区間（第１区間）１１２１の時間長Ｔｆｗｄ、
（２）有音区間１１１１と、有音区間１１１１より時系列的に後に存在する後方有音区間１１１３と、に挟まれる後方無音区間（第２区間）１１２２の時間長Ｔｂｗｄ、
のそれぞれを計算する。そして、制御部１０９は、計算したＴｆｗｄとＴｂｗｄを比較して時間長の短い方を選択する。さらに、制御部１０９は、（イ）前方無音区間１１２１と後方無音区間１１２２のうち選択した方の無音区間、（ロ）有音区間１１１１、（ハ）選択した方の無音区間に対応する前方有音区間１１１２又は後方有音区間１１１３のどちらか、から構成される区間を、新たな有音区間１１３０としてもよい。
言い換えれば、Ｔｆｗｄ＞Ｔｂｗｄの場合、制御部１０９は、図１１（ｂ）に示すように、有音区間１１１１と前方無音区間１１２１と前方有音区間１１１２とから構成される区間を、新たな有音区間１１３０とする。一方、Ｔｆｗｄ＜Ｔｂｗｄの場合、制御部１０９は、図１１（ｃ）に示すように、有音区間１１１１と後方無音区間１１２２と後方有音区間１１１３とから構成される区間を、新たな有音区間１１３０とする。なお、Ｔｆｗｄ＝Ｔｂｗｄの場合には、制御部１０９は、前方無音区間１１２１と後方無音区間１１２２のどちらを選択してもよい。 For example, as shown in FIG. 11 (a), the control unit 109 calculates each of
(1) the time length Tfwd of the front silent section (first section) 1121 sandwiched between the voiced section 1111 and the front voiced section 1112 that exists before the voiced section 1111 in time series, and
(2) the time length Tbwd of the rear silent section (second section) 1122 sandwiched between the voiced section 1111 and the rear voiced section 1113 that exists after the voiced section 1111 in time series.
Then, the control unit 109 compares the calculated Tfwd and Tbwd and selects the one with the shorter time length. Further, the control unit 109 may set, as a new voiced section 1130, the section composed of (a) whichever of the front silent section 1121 and the rear silent section 1122 was selected, (b) the voiced section 1111, and (c) the front voiced section 1112 or the rear voiced section 1113, whichever corresponds to the selected silent section.
In other words, when Tfwd > Tbwd, as shown in FIG. 11 (b), the control unit 109 sets the section composed of the voiced section 1111, the front silent section 1121, and the front voiced section 1112 as the new voiced section 1130. On the other hand, when Tfwd < Tbwd, as shown in FIG. 11 (c), the control unit 109 sets the section composed of the voiced section 1111, the rear silent section 1122, and the rear voiced section 1113 as the new voiced section 1130. When Tfwd = Tbwd, the control unit 109 may select either the front silent section 1121 or the rear silent section 1122.

あるいは、例えば図１１（ｄ）に示すように、制御部１０９は、
（１）有音区間１１１１に時系列的に前に存在する前方有音区間１１１２と、前方有音区間１１１２と有音区間１１１１に挟まれる前方無音区間１１２１と、から構成される第１区間の時間長Ｔｆｗｄ、
（２）有音区間１１１１に時系列的に後に存在する後方有音区間１１１３と、後方有音区間１１１３と有音区間１１１１に挟まれる後方無音区間１１２２と、から構成される第２区間の時間長Ｔｂｗｄ、
のそれぞれを計算する。そして、制御部１０９は、計算したＴｆｗｄとＴｂｗｄを比較して時間長の短い方を選択する。さらに、制御部１０９は、（イ）第１区間と第２区間のうち選択した方の区間、（ロ）有音区間１１１１、から構成される区間を、新たな有音区間１１３０としてもよい。
言い換えれば、Ｔｆｗｄ＞Ｔｂｗｄの場合、制御部１０９は、図１１（ｂ）に示すように、有音区間１１１１と前方無音区間１１２１と前方有音区間１１１２とから構成される区間を、新たな有音区間１１３０とする。一方、Ｔｆｗｄ＜Ｔｂｗｄの場合、制御部１０９は、図１１（ｃ）に示すように、有音区間１１１１と後方無音区間１１２２と後方有音区間１１１３とから構成される区間を、新たな有音区間１１３０とする。なお、Ｔｆｗｄ＝Ｔｂｗｄの場合には、制御部１０９は、前方無音区間１１２１と後方無音区間１１２２のどちらを選択してもよい。 Alternatively, for example, as shown in FIG. 11 (d), the control unit 109 calculates each of
(1) the time length Tfwd of a first section composed of the front voiced section 1112 that exists before the voiced section 1111 in time series and the front silent section 1121 sandwiched between the front voiced section 1112 and the voiced section 1111, and
(2) the time length Tbwd of a second section composed of the rear voiced section 1113 that exists after the voiced section 1111 in time series and the rear silent section 1122 sandwiched between the rear voiced section 1113 and the voiced section 1111.
Then, the control unit 109 compares the calculated Tfwd and Tbwd and selects the one with the shorter time length. Further, the control unit 109 may set, as a new voiced section 1130, the section composed of (a) whichever of the first section and the second section was selected and (b) the voiced section 1111.
In other words, when Tfwd > Tbwd, as shown in FIG. 11 (b), the control unit 109 sets the section composed of the voiced section 1111, the front silent section 1121, and the front voiced section 1112 as the new voiced section 1130. On the other hand, when Tfwd < Tbwd, as shown in FIG. 11 (c), the control unit 109 sets the section composed of the voiced section 1111, the rear silent section 1122, and the rear voiced section 1113 as the new voiced section 1130. When Tfwd = Tbwd, the control unit 109 may select either the front silent section 1121 or the rear silent section 1122.

あるいは、例えば図１１（ｅ）に示すように、制御部１０９は、
（１）有音区間１１１１に時系列的に前に存在する前方有音区間１１１２の時間長Ｔｆｗｄ、
（２）有音区間１１１１に時系列的に後に存在する後方有音区間１１１３の時間長Ｔｂｗｄ、
のそれぞれを計算する。そして、制御部１０９は、計算したＴｆｗｄとＴｂｗｄを比較して時間長の短い方を選択する。さらに、制御部１０９は、（イ）前方有音区間１１１２と後方有音区間１１１３のうち選択した方の有音区間、（ロ）有音区間１１１１、（ハ）選択した方の有音区間と有音区間１１１１とに挟まれる無音区間、から構成される区間を、新たな有音区間１１３０としてもよい。
言い換えれば、Ｔｆｗｄ＞Ｔｂｗｄの場合、制御部１０９は、図１１（ｂ）に示すように、有音区間１１１１と前方無音区間１１２１と前方有音区間１１１２とから構成される区間を、新たな有音区間１１３０とする。一方、Ｔｆｗｄ＜Ｔｂｗｄの場合、制御部１０９は、図１１（ｃ）に示すように、有音区間１１１１と後方無音区間１１２２と後方有音区間１１１３とから構成される区間を、新たな有音区間１１３０とする。なお、Ｔｆｗｄ＝Ｔｂｗｄの場合には、制御部１０９は、前方有音区間１１１２と後方有音区間１１１３のどちらを選択してもよい。 Alternatively, for example, as shown in FIG. 11 (e), the control unit 109 calculates each of
(1) the time length Tfwd of the front voiced section 1112 that exists before the voiced section 1111 in time series, and
(2) the time length Tbwd of the rear voiced section 1113 that exists after the voiced section 1111 in time series.
Then, the control unit 109 compares the calculated Tfwd and Tbwd and selects the one with the shorter time length. Further, the control unit 109 may set, as a new voiced section 1130, the section composed of (a) whichever of the front voiced section 1112 and the rear voiced section 1113 was selected, (b) the voiced section 1111, and (c) the silent section sandwiched between the selected voiced section and the voiced section 1111.
In other words, when Tfwd > Tbwd, as shown in FIG. 11 (b), the control unit 109 sets the section composed of the voiced section 1111, the front silent section 1121, and the front voiced section 1112 as the new voiced section 1130. On the other hand, when Tfwd < Tbwd, as shown in FIG. 11 (c), the control unit 109 sets the section composed of the voiced section 1111, the rear silent section 1122, and the rear voiced section 1113 as the new voiced section 1130. When Tfwd = Tbwd, the control unit 109 may select either the front voiced section 1112 or the rear voiced section 1113.

図１１（ａ）、（ｄ）、（ｅ）に示すいずれの手法においても、制御部１０９は、計算したＴｆｗｄとＴｂｗｄを比較して、時間長の短い方ではなく、時間長の長い方を選択してもよい。時間長の短い方を選択する場合、閾時間より短い時間長の有音区間が近くに複数個存在すると、更新後の有音区間の数がなるべく多くなるような特性で有音区間を更新する。一方、時間長の長い方を選択する場合、閾時間より短い時間長の有音区間が近くに複数個存在すると、更新後の有音区間の数がなるべく少なくなるような特性で有音区間を更新する。どちらを採用するかは自由であり、状況によって使い分ければよい。 In any of the methods shown in FIGS. 11A, 11D, and 11E, the control unit 109 may compare the calculated Tfwd and Tbwd and select the one with the longer time length instead of the shorter one. When the shorter one is selected and several voiced sections shorter than the threshold time exist close together, the voiced sections are updated in a way that tends to leave as many voiced sections as possible after the update. On the other hand, when the longer one is selected and several voiced sections shorter than the threshold time exist close together, the voiced sections are updated in a way that tends to leave as few voiced sections as possible after the update. Either may be adopted freely, depending on the situation.
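
The three comparison methods and the shorter/longer option can be put side by side in one sketch. Everything named here (`pick_neighbor`, the `metric` and `prefer` arguments, the tuple format) is hypothetical. The sketch follows the stated rule of selecting the shorter (or, optionally, the longer) of Tfwd and Tbwd, treats "front" as earlier in time, and resolves ties toward the front side as an arbitrary choice.

```python
from typing import List, Tuple

Section = Tuple[int, int]  # (start_ms, end_ms), sorted by time (assumed format)


def pick_neighbor(sections: List[Section], i: int,
                  metric: str = "gap", prefer: str = "shorter") -> Section:
    """Merge sections[i] with its earlier or later neighbor.

    Assumes sections[i] has a neighbor on both sides.
    """
    front, cur, rear = sections[i - 1], sections[i], sections[i + 1]
    if metric == "gap":                # silent gap only, as in FIG. 11(a)
        t_fwd = cur[0] - front[1]
        t_bwd = rear[0] - cur[1]
    elif metric == "gap_plus_voiced":  # gap plus neighbor, as in FIG. 11(d)
        t_fwd = cur[0] - front[0]
        t_bwd = rear[1] - cur[1]
    elif metric == "voiced":           # neighbor length only, as in FIG. 11(e)
        t_fwd = front[1] - front[0]
        t_bwd = rear[1] - rear[0]
    else:
        raise ValueError(f"unknown metric: {metric}")

    if prefer == "shorter":
        take_front = t_fwd <= t_bwd
    else:  # "longer"
        take_front = t_fwd >= t_bwd
    return (front[0], cur[1]) if take_front else (cur[0], rear[1])


if __name__ == "__main__":
    voiced = [(0, 1000), (1200, 1500), (2000, 2400)]
    for metric in ("gap", "gap_plus_voiced", "voiced"):
        print(metric, pick_neighbor(voiced, 1, metric=metric))
```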

出力部２０５は、保持部２０３によって保持された（記憶された）有音区間３１０のそれぞれを再生して、保持された有音区間３１０の中からいずれかをユーザに選択させる。さらに、出力部２０５は、選択された有音区間３１０を文字列データ４３０に対応付けて出力する。制御部１０９、記憶装置１０６、音声処理部１０３、入力部１０１、画像処理部１０２が協働して動作することにより、出力部２０５として機能する。 The output unit 205 plays back each of the voiced sections 310 held (stored) by the holding unit 203 and lets the user select one of the held voiced sections 310. Further, the output unit 205 outputs the selected voiced section 310 in association with the character string data 430. The control unit 109, the storage device 106, the audio processing unit 103, the input unit 101, and the image processing unit 102 operate in cooperation to function as the output unit 205.

具体的には、まず、制御部１０９は、記憶装置１０６に記憶された有音区間３１０を示す情報を取得する。記憶装置１０６には、上述のように更新された有音区間３１０と更新されていない有音区間３１０とが記憶される。次に、制御部１０９は、取得した有音区間３１０を示す情報から、波形データ２５１の中で有音区間３１０に相当する部分波形データを記憶装置１０６から読み出す。そして、音声処理部１０３は、制御部１０９の制御により、読み出された部分波形データを所定のアルゴリズムに基づいてデコードして再生する。これにより、有音区間３１０に相当する音声がスピーカー１２４から出力され、ユーザは抽出された音声を聴くことができる。有音区間３１０の更新後、複数の有音区間３１０が記憶装置１０６に記憶されている場合、制御部１０９は各々の有音区間３１０の部分波形データを読み出して、音声処理部１０３は各々の部分波形データを再生する。この部分波形データが、音声処理装置１００により抽出されユーザに提供される音声データである。 Specifically, first, the control unit 109 acquires information indicating the voiced sections 310 stored in the storage device 106. The storage device 106 stores both the voiced sections 310 updated as described above and the voiced sections 310 that have not been updated. Next, based on the acquired information indicating a voiced section 310, the control unit 109 reads from the storage device 106 the partial waveform data that corresponds to the voiced section 310 within the waveform data 251. Then, under the control of the control unit 109, the audio processing unit 103 decodes the read partial waveform data based on a predetermined algorithm and plays it back. As a result, the sound corresponding to the voiced section 310 is output from the speaker 124, and the user can listen to the extracted sound. When a plurality of voiced sections 310 are stored in the storage device 106 after the update, the control unit 109 reads the partial waveform data of each voiced section 310, and the audio processing unit 103 plays back each partial waveform data. This partial waveform data is the audio data extracted by the audio processing apparatus 100 and provided to the user.

ここで、制御部１０９は、有音区間３１０のリストをモニター１２３に表示させ、ユーザにより選択された有音区間３１０の音声をスピーカー１２４から出力させるようにしてもよい。 Here, the control unit 109 may display a list of the sounded sections 310 on the monitor 123 and output the sound of the sounded sections 310 selected by the user from the speaker 124.

例えば、図１２は、モニター１２３に表示される画面の構成例である。制御部１０９は、記憶装置１０６から取得した有音区間３１０の一覧を作成し、画像処理部１０２を制御して音声リスト１２０１を含む画面を表示させる。この画面には、例えば再生開始ボタン１２０２、再生一時停止ボタン１２０３、再生終了ボタン１２０４、文字列入力ボタン１２０５、波形画像１２０６、文字列候補リスト１２０７などが含まれる。 For example, FIG. 12 is a configuration example of a screen displayed on the monitor 123. The control unit 109 creates a list of the sound sections 310 acquired from the storage device 106 and controls the image processing unit 102 to display a screen including the sound list 1201. This screen includes, for example, a playback start button 1202, a playback pause button 1203, a playback end button 1204, a character string input button 1205, a waveform image 1206, a character string candidate list 1207, and the like.

音声リスト１２０１は、有音区間３１０を識別するための番号と、有音区間３１０の開始位置・終了位置と、文字列データ４３０とを対応付けたリストである。有音区間３１０と文字列データ４３０との対応付けがまだなされていない場合には、文字列データ４３０の表示欄に文字列は表示されない。 The voice list 1201 is a list in which a number for identifying the voiced section 310, a start position / end position of the voiced section 310, and character string data 430 are associated with each other. If the voiced section 310 and the character string data 430 are not yet associated, the character string is not displayed in the display column of the character string data 430.

文字列候補リスト１２０７は、セリフテーブル４００に含まれる文字列データ４３０のリストである。ユーザは文字列候補リスト１２０７の中から自由に選択してそれぞれの有音区間３１０に文字列データ４３０を対応付けることができる。 The character string candidate list 1207 is a list of character string data 430 included in the serif table 400. The user can freely select from the character string candidate list 1207 and associate the character string data 430 with each voiced section 310.

ユーザによって音声リスト１２０１の中からいずれかの有音区間３１０が選択されて再生開始ボタン１２０２が押下されると、制御部１０９は、選択された有音区間３１０に対応する部分波形データを読み出して音声処理部１０３に再生させる。また、ユーザによって音声リスト１２０１の中からいずれかの有音区間３１０が選択され、且つ、文字列候補リスト１２０７の中からいずれかの文字列データ４３０が選択されると、制御部１０９は、選択された有音区間３１０と選択された文字列データ４３０とを対応付けて記憶装置１０６に記憶させ、音声リスト１２０１を更新して表示させる。 When the user selects one of the voiced sections 310 from the voice list 1201 and presses the playback start button 1202, the control unit 109 reads the partial waveform data corresponding to the selected voiced section 310 and causes the audio processing unit 103 to play it back. When the user selects one of the voiced sections 310 from the voice list 1201 and one of the character string data 430 from the character string candidate list 1207, the control unit 109 stores the selected voiced section 310 and the selected character string data 430 in the storage device 106 in association with each other, and updates and displays the voice list 1201.

なお、制御部１０９は、ユーザの指示に基づいて有音区間３１０の開始位置と終了位置を変更できるようにしてもよい。この場合、入力部１０１は、キーボード１２１やマウス１２２などの入力装置を用いたユーザからの指示を受け付け、対応する指示信号を制御部１０９に入力し、制御部１０９は入力された指示信号に応じて開始位置と終了位置を変更する。これにより、ユーザによる音声の抽出領域の調整が可能になる。 Note that the control unit 109 may allow the start position and end position of the voiced section 310 to be changed based on a user instruction. In this case, the input unit 101 receives an instruction from the user via an input device such as the keyboard 121 or the mouse 122 and inputs the corresponding instruction signal to the control unit 109, and the control unit 109 changes the start position and end position according to the input instruction signal. This allows the user to adjust the region from which audio is extracted.

また、制御部１０９は、選択された有音区間３１０と選択された文字列データ４３０とを対応付けて任意の形式の電子ファイルとして出力してもよい。例えば、制御部１０９は、選択された有音区間３１０に対応する部分波形データを波形データ２５１の中から抽出して電子ファイルを作成し、選択された文字列データ４３０をファイル名にする。これにより、ユーザは、どのファイルがどのセリフの音声データを格納しているのか容易に判別でき、各セリフの音声データを管理しやすくなる。 Further, the control unit 109 may output the selected sound section 310 and the selected character string data 430 as an electronic file in an arbitrary format in association with each other. For example, the control unit 109 extracts the partial waveform data corresponding to the selected sound section 310 from the waveform data 251 to create an electronic file, and uses the selected character string data 430 as the file name. Accordingly, the user can easily determine which file stores which speech data of which speech, and can easily manage the speech data of each speech.
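
As an illustration of writing one extracted section to a file named after its character string, the sketch below uses Python's standard `wave` module. It assumes the waveform data is an uncompressed WAV file and that the character string is already safe to use as a file name; the function name `export_section` and its parameters are hypothetical and are not taken from the patent.

```python
import wave
from pathlib import Path


def export_section(src_path, out_dir, start_ms: int, end_ms: int, text: str) -> Path:
    """Copy the frames of [start_ms, end_ms) from a WAV file to <out_dir>/<text>.wav."""
    out_path = Path(out_dir) / f"{text}.wav"
    with wave.open(str(src_path), "rb") as src:
        rate = src.getframerate()
        first_frame = start_ms * rate // 1000
        frame_count = (end_ms - start_ms) * rate // 1000
        src.setpos(first_frame)              # seek to the start of the section
        frames = src.readframes(frame_count)
        with wave.open(str(out_path), "wb") as dst:
            dst.setnchannels(src.getnchannels())
            dst.setsampwidth(src.getsampwidth())
            dst.setframerate(rate)
            dst.writeframes(frames)          # header is finalized on close
    return out_path
```

A real tool would also need to replace characters that are not allowed in file names and decide what to do when two lines share the same text.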

次に、音声処理装置１００の上述した各部が行う音声抽出処理について図１３のフローチャートを用いて説明する。音声処理装置１００は、複数のセリフデータを格納する波形データ２５１の中から、セリフに対応する部分を抽出する。波形データ２５１は予め記憶部２０１に記憶されているものとする。 Next, voice extraction processing performed by the above-described units of the voice processing apparatus 100 will be described with reference to the flowchart of FIG. The speech processing apparatus 100 extracts a portion corresponding to a speech from the waveform data 251 storing a plurality of speech data. It is assumed that the waveform data 251 is stored in the storage unit 201 in advance.

まず、決定部２０２は、波形データ２５１に含まれるセリフの文字列が記録されたデータファイルを読み出して、文字列データ４３０を取得する（ステップＳ１３０１）。上述したように、決定部２０２は、波形データ２５１に含まれるセリフの文字列の入力をユーザから受け付けて、文字列データ４３０を取得してもよい。決定部２０２は、取得した文字列データ４３０に基づいてセリフテーブル４００を作成して記憶部２０１に記憶させる。 First, the determination unit 202 reads a data file in which a character string of lines included in the waveform data 251 is recorded, and acquires character string data 430 (step S1301). As described above, the determination unit 202 may acquire the character string data 430 by receiving an input of the character string of the serif included in the waveform data 251 from the user. The determination unit 202 creates a serif table 400 based on the acquired character string data 430 and stores it in the storage unit 201.

決定部２０２は、セリフテーブル４００に記憶された各々の文字列データ４３０の発音時間に基づいて閾時間ＴＳを計算する（ステップＳ１３０２）。具体的には、決定部２０２は、上述の［数１］や［数２］を用いた発音時間の計算方法によって、文字列データ４３０ごとに発音時間を計算する。ここで計算される発音時間は、人間による正確な発音時間とは限らず、人間が発音すればおおよそこの程度であろうと推測される目安値でよい。そして、決定部２０２は、計算された発音時間の中の最小値を閾時間ＴＳに決定する。 The determination unit 202 calculates a threshold time TS based on the pronunciation time of each character string data 430 stored in the serif table 400 (step S1302). Specifically, the determination unit 202 calculates the pronunciation time for each character string data 430 by the calculation method of the pronunciation time using the above [Equation 1] and [Equation 2]. The sounding time calculated here is not necessarily an accurate sounding time by a human, but may be a standard value that is estimated to be approximately this level if a human is sounding. Then, the determination unit 202 determines the minimum value among the calculated sound generation times as the threshold time TS.
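
Step S1302 reduces to taking the minimum of the estimated pronunciation times. The sketch below shows only that step. [Equation 1] and [Equation 2] are defined elsewhere in this document and are not reproduced here, so the flat per-character constant `ms_per_char` is a stand-in assumption, as are the function names.

```python
from typing import Iterable


def estimate_duration_ms(text: str, ms_per_char: int = 150) -> int:
    """Rough pronunciation time: a fixed time per character (placeholder rule)."""
    return len(text) * ms_per_char


def threshold_time_ms(lines: Iterable[str], ms_per_char: int = 150) -> int:
    """Threshold time TS: the smallest estimated pronunciation time of all lines."""
    return min(estimate_duration_ms(t, ms_per_char) for t in lines)


if __name__ == "__main__":
    print(threshold_time_ms(["おはよう", "こんにちは", "はい"]))
```

Because TS is the minimum over all lines, even the shortest line can still be extracted as its own voiced section.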

次に、保持部２０３は、記憶部２０１に記憶された波形データ２５１を取得して有音区間３１０を抽出する（ステップＳ１３０３）。具体的には、保持部２０３は、上述したいずれかの判別方法により有音区間３１０を判別して抽出する。ここでは、波形データ２５１の中からＮ個（Ｎは１以上の整数）の有音区間３１０が抽出されたとする。 Next, the holding unit 203 acquires the waveform data 251 stored in the storage unit 201 and extracts the sound section 310 (step S1303). Specifically, the holding unit 203 discriminates and extracts the voiced section 310 by any one of the discrimination methods described above. Here, it is assumed that N (N is an integer of 1 or more) sounded sections 310 are extracted from the waveform data 251.

そして、更新部２０４は、抽出されたＮ個の有音区間３１０のそれぞれについて、時間長がステップＳ１３０２で計算された閾時間ＴＳより短いか否かを判別する（ステップＳ１３０４）。 Then, the update unit 204 determines whether or not the time length of each of the extracted N sound sections 310 is shorter than the threshold time TS calculated in step S1302 (step S1304).

Ｎ個の有音区間３１０の中に時間長が閾時間ＴＳより短いものがあると判別された場合（ステップＳ１３０４；ＹＥＳ）、更新部２０４は、各々の有音区間３１０の時間長が閾時間ＴＳ以上になるように、有音区間３１０を更新する（ステップＳ１３０５）。具体的には、図１０や図１１に示した方法を用いて、時間長が閾時間ＴＳより短いと判別された有音区間と、他の有音区間と、これら２つの有音区間に挟まれた区間と、から構成される区間を、新たな有音区間とする。更新部２０４は、波形データ２５１に含まれるすべての有音区間３１０の時間長が閾時間ＴＳ以上になるように更新する。 When it is determined that the N voiced sections 310 include one whose time length is shorter than the threshold time TS (step S1304; YES), the update unit 204 updates the voiced sections 310 so that the time length of each voiced section 310 becomes equal to or greater than the threshold time TS (step S1305). Specifically, using the methods shown in FIG. 10 and FIG. 11, the section composed of the voiced section whose time length was determined to be shorter than the threshold time TS, another voiced section, and the section sandwiched between these two voiced sections is set as a new voiced section. The update unit 204 performs the update so that the time lengths of all the voiced sections 310 included in the waveform data 251 become equal to or greater than the threshold time TS.

時間長が閾時間ＴＳより短いものがないと判別された場合（ステップＳ１３０４；ＮＯ）、出力部２０５は、更新された有音区間３１０と、それに対応する文字列データ４３０とを対応付けて出力する（ステップＳ１３０６）。具体的には、出力部２０５は、図１２に示すような音声リスト１２０１と文字列候補リスト１２０７を含む画面をモニター１２３に表示させる。出力部２０５は、ユーザから任意の有音区間３１０を再生する指示入力を受け付けて、再生する旨の指示入力があった有音区間３１０に相当する部分波形データを再生する。また、出力部２０５は、有音区間３１０と文字列データ４３０とを対応付ける選択指示入力をユーザから受け付けて、この選択指示入力に基づいて音声リスト１２０１を更新して表示させる。出力部２０５は、有音区間３１０のそれぞれと文字列データ４３０とを対応付ける指示入力をユーザから受け付けて、この指示入力に基づいて有音区間３１０に相当する部分波形データを波形データ２５１から抽出してデータファイルとして出力してもよい。 When it is determined that no voiced section has a time length shorter than the threshold time TS (step S1304; NO), the output unit 205 outputs the updated voiced sections 310 in association with the corresponding character string data 430 (step S1306). Specifically, the output unit 205 causes the monitor 123 to display a screen including the voice list 1201 and the character string candidate list 1207 as shown in FIG. 12. The output unit 205 receives from the user an instruction input to play back an arbitrary voiced section 310 and plays back the partial waveform data corresponding to the voiced section 310 designated by that instruction. Further, the output unit 205 receives from the user a selection instruction input that associates a voiced section 310 with character string data 430, and updates and displays the voice list 1201 based on the selection instruction input. The output unit 205 may also receive from the user an instruction input that associates each of the voiced sections 310 with character string data 430 and, based on this instruction input, extract the partial waveform data corresponding to the voiced section 310 from the waveform data 251 and output it as a data file.

このように、本実施形態によれば、音声処理装置１００は、波形データ２５１からセリフ部分の音声データを容易に抽出することができる。その際、抽出される音声データの最小の長さは、予め用意されたセリフに相当する文字列データ４３０の長さに基づいて最適になるように決定されるので、本来１つの音声データであるべきものが複数の音声データに分割されて抽出されてしまったり、逆に複数の音声データに分割されるべきものが１つの音声データに結合されて抽出されてしまったりすることはない。また、ユーザは音声データの抽出サイズをいちいち指定する必要はない。 Thus, according to the present embodiment, the audio processing apparatus 100 can easily extract the audio data of the spoken lines from the waveform data 251. The minimum length of the extracted audio data is determined so as to be optimal based on the length of the character string data 430 corresponding to the lines prepared in advance. Therefore, audio that should be a single piece of audio data is not split and extracted as several pieces, and conversely, audio that should be split into several pieces is not combined and extracted as a single piece. Further, the user does not have to specify the extraction size of the audio data each time.

（実施例２）
次に、本発明の他の実施形態について説明する。本実施形態は、音声処理装置１００に波形データ２５１から音声データを抽出させるための詳細な設定ができるようにしたものである。 (Example 2)
Next, another embodiment of the present invention will be described. In the present embodiment, detailed settings for allowing the audio processing apparatus 100 to extract audio data from the waveform data 251 can be performed.

本実施形態では、ユーザは次に示す基本パラメータと補助パラメータのそれぞれを任意の値に設定することができる。 In this embodiment, the user can set each of the following basic parameters and auxiliary parameters to arbitrary values.

基本パラメータには次の４つがある。
（Ａ）無音時間パラメータ・・・時間長を示す数値（例えばミリ秒単位など）で設定される。保持部２０３はこれより短い時間の発音があっても無音とみなす。これにより、ノイズ等による瞬間的な波形変化を無視できる。上述の実施形態における所定時間ＴＸ，ＴＹに相当する。
（Ｂ）最低発音時間パラメータ・・・時間長を示す数値で設定される。保持部２０３はこれより短い時間長の有音区間３１０を作成しない。すなわち、決定部２０２が計算した閾時間ＴＳがこれより短い場合、保持部２０３はこの最低発音時間パラメータを優先する。
（Ｃ）アタック音量パラメータ・・・音量を示す数値（例えばデシベル単位など）で設定される。保持部２０３はこれより大きい音量のときにアタック（発音の開始）とみなす。上述の実施形態における基準値Pbaseに相当する。
（Ｄ）リリース音量パラメータ・・・音量を示す数値（例えばデシベル単位など）で設定される。保持部２０３はこれより小さい音量のときにリリース（発音の終了）とみなす。上述の実施形態における基準値Qbaseに相当する。 There are four basic parameters:
(A) Silent time parameter: set as a numerical value indicating a time length (for example, in milliseconds). The holding unit 203 treats any sound shorter than this as silence. As a result, momentary waveform changes due to noise or the like can be ignored. This corresponds to the predetermined times TX and TY in the embodiment described above.
(B) Minimum sounding time parameter: set as a numerical value indicating a time length. The holding unit 203 does not create a voiced section 310 shorter than this. That is, when the threshold time TS calculated by the determination unit 202 is shorter than this value, the holding unit 203 gives priority to the minimum sounding time parameter.
(C) Attack volume parameter: set as a numerical value indicating a volume (for example, in decibels). The holding unit 203 regards a volume higher than this as an attack (start of sound). This corresponds to the reference value Pbase in the above-described embodiment.
(D) Release volume parameter: set as a numerical value indicating a volume (for example, in decibels). The holding unit 203 regards a volume lower than this as a release (end of sound). This corresponds to the reference value Qbase in the above-described embodiment.

保持部２０３は、これらのパラメータに基づいて有音区間３１０の開始位置と終了位置を判別する。 The holding unit 203 determines the start position and end position of each voiced section 310 based on these parameters.
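
One simplified reading of how parameters (A), (C), and (D) could interact is sketched below. It assumes the waveform has already been reduced to one volume value in decibels per fixed-length frame, and it is not the determination method of the earlier embodiment; the function name `detect_sections` and all argument names are hypothetical. Parameter (B) is left out for brevity.

```python
from typing import List, Sequence, Tuple


def detect_sections(levels_db: Sequence[float], frame_ms: int,
                    attack_db: float, release_db: float,
                    silence_ms: int) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) voiced sections from per-frame volume levels."""
    sections: List[Tuple[int, int]] = []
    start = None  # frame index where the current sound began, if any
    for i, level in enumerate(levels_db):
        if start is None:
            if level > attack_db:        # attack: start of sound
                start = i
        elif level < release_db:         # release: end of sound
            # Sounds shorter than the silent time parameter count as silence.
            if (i - start) * frame_ms >= silence_ms:
                sections.append((start * frame_ms, i * frame_ms))
            start = None
    # A sound still in progress at the end of the data is closed there.
    if start is not None and (len(levels_db) - start) * frame_ms >= silence_ms:
        sections.append((start * frame_ms, len(levels_db) * frame_ms))
    return sections


if __name__ == "__main__":
    levels = [-60, -60, -10, -60, -60, -10, -12, -15, -20, -60, -60]
    print(detect_sections(levels, frame_ms=10, attack_db=-20,
                          release_db=-40, silence_ms=20))
```

Using separate attack and release levels gives hysteresis: a sound that dips between the two levels is not cut in the middle.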

補助パラメータには次の２つがある。図１４を用いて説明する。
（Ｘ）前オフセット時間パラメータ・・・時間長を示す数値で設定される。更新部２０４は、有音区間３１０の開始点６１０からこのパラメータで指定された時間長だけ前までの区間を前オフセット区間１４１０として抽出する。例えば、出力部２０５は、前オフセット区間１４１０では音量をゼロから単調増加させてだんだんと大きくしていき、有音区間３１０の開始点６１０の音量に滑らかに繋げて再生する（いわゆるフェードイン再生）。
（Ｙ）後ろオフセット時間パラメータ・・・時間長を示す数値で設定される。更新部２０４は、有音区間３１０の終了点６２０からこのパラメータで指定された時間長だけ後ろまでの区間を後ろオフセット区間１４２０として抽出する。例えば、出力部２０５は、後ろオフセット区間１４２０では音量を有音区間３１０の終了点６２０の音量から単調減少させてだんだんと小さくしていき、後ろオフセット区間１４２０の最後で音量をゼロにする（いわゆるフェードアウト再生）。 There are the following two auxiliary parameters, which are described with reference to FIG. 14.
(X) Front offset time parameter: set as a numerical value indicating a time length. The update unit 204 extracts, as the front offset section 1410, the section that extends back from the start point 610 of the voiced section 310 by the time length specified by this parameter. For example, in the front offset section 1410 the output unit 205 increases the volume monotonically from zero and plays the sound so that it connects smoothly to the volume at the start point 610 of the voiced section 310 (so-called fade-in playback).
(Y) Rear offset time parameter: set as a numerical value indicating a time length. The update unit 204 extracts, as the rear offset section 1420, the section that extends forward from the end point 620 of the voiced section 310 by the time length specified by this parameter. For example, in the rear offset section 1420 the output unit 205 decreases the volume monotonically from the volume at the end point 620 of the voiced section 310 and brings it to zero at the end of the rear offset section 1420 (so-called fade-out playback).
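
The fade-in and fade-out over the two offset sections can be sketched as per-sample gains. The text only requires the volume to change monotonically, so the linear ramp used here is an assumption, as are the function names and the use of sample indices rather than times.

```python
from typing import List, Sequence


def fade_gains(pre_n: int, body_n: int, post_n: int) -> List[float]:
    """Gains for a front offset, the voiced section, and a rear offset."""
    fade_in = [k / pre_n for k in range(pre_n)]                  # starts at 0.0
    body = [1.0] * body_n                                        # unchanged
    fade_out = [1.0 - (k + 1) / post_n for k in range(post_n)]   # ends at 0.0
    return fade_in + body + fade_out


def apply_offsets(samples: Sequence[float], start: int, end: int,
                  pre_n: int, post_n: int) -> List[float]:
    """Cut samples[start:end] plus both offset sections and apply the fades."""
    lo = max(0, start - pre_n)              # clamp at the start of the data
    hi = min(len(samples), end + post_n)    # clamp at the end of the data
    gains = fade_gains(start - lo, end - start, hi - end)
    return [s * g for s, g in zip(samples[lo:hi], gains)]


if __name__ == "__main__":
    print(apply_offsets([1.0] * 10, start=4, end=6, pre_n=2, post_n=2))
```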

なお、本実施形態では、出力部２０５は、前オフセット区間１４１０を音量のフェードイン区間としてフェードインさせ、後ろオフセット区間１４２０を音量のフェードアウト区間としてフェードアウトさせているが、これらの区間１４１０，１４２０で波形データ２５１に他のエフェクトをかけて再生するようにしてもよい。例えば、エコーをかける（反響させる）、所定周波数帯域のみ透過させる（ローパスフィルタやハイパスフィルタ）、再生スピードを変える、など様々なエフェクトをかけることができる。出力部２０５は、有音区間３１０にも同様なエフェクトをかけることができる。 In the present embodiment, the output unit 205 fades in the front offset section 1410 as a volume fade-in section and fades out the rear offset section 1420 as a volume fade-out section, but in these sections 1410 and 1420 The waveform data 251 may be reproduced by applying another effect. For example, it is possible to apply various effects such as applying echo (resonating), transmitting only a predetermined frequency band (low-pass filter or high-pass filter), and changing the reproduction speed. The output unit 205 can apply the same effect to the voiced section 310.

出力部２０５は、有音区間３１０と、その前オフセット区間１４１０と、その後ろオフセット区間１４２０とに相当する部分波形データを再生する代わりに、もしくは、再生するのに加えて、電子ファイルに出力して記憶部２０１に記憶するようにしてもよい。その際、出力部２０５は、前オフセット区間１４１０と後ろオフセット区間１４２０にフェードイン・フェードアウトなどのエフェクトをかけた後の部分波形データを出力してもよい。これにより、音声データの編集作業を効率化できる。出力部２０５は、有音区間３１０にも同様なエフェクトをかけて部分波形データを出力することができる。 Instead of, or in addition to, playing back the partial waveform data corresponding to the voiced section 310, its front offset section 1410, and its rear offset section 1420, the output unit 205 may output that data to an electronic file and store it in the storage unit 201. In doing so, the output unit 205 may output the partial waveform data after applying effects such as fade-in and fade-out to the front offset section 1410 and the rear offset section 1420. This makes the editing of audio data more efficient. The output unit 205 can also apply similar effects to the voiced section 310 before outputting the partial waveform data.

このほか、上述の基本パラメータのそれぞれを別々に指定する代わりに、まとめてセットにして変化させることができるパラメータ（以下、「感度パラメータ」と呼ぶ）も用意されている。
（Ｚ）感度パラメータ・・・段階を示す数値やセットの固有番号等で設定される。ユーザは、図１５に示すように、予め記憶部２０１に記憶された強感度用、弱感度用といったセットを用いたり、よく使う設定として任意に編集した各パラメータをセットにして記憶部２０１に記憶させて用いたりすることができる。ユーザは、感度パラメータを設定すれば、各基本パラメータを一つ一つ設定する必要はないので、編集作業を効率化できる。なお、各セットに含まれるパラメータはこれらに限定されず、補助パラメータ等の他のパラメータも含まれていてもよい。 In addition, parameters (hereinafter referred to as “sensitivity parameters”) that can be changed together as a set instead of separately specifying each of the above basic parameters are also provided.
(Z) Sensitivity parameter: It is set by a numerical value indicating a stage, a set unique number, or the like. As shown in FIG. 15, the user uses a set for strong sensitivity and weak sensitivity stored in advance in the storage unit 201 or stores each parameter edited arbitrarily as a frequently used setting in the storage unit 201. Can be used. If the user sets the sensitivity parameter, it is not necessary to set each basic parameter one by one, and the editing work can be made more efficient. The parameters included in each set are not limited to these, and other parameters such as auxiliary parameters may also be included.
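
A sensitivity parameter of this kind amounts to a lookup from a set name to a bundle of basic parameters. The sketch below illustrates the idea; the preset names, the numeric values, and the names `BasicParams`, `PRESETS`, and `resolve_params` are all invented for the example and do not reproduce the contents of FIG. 15.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class BasicParams:
    silence_ms: int      # (A) silent time
    min_sound_ms: int    # (B) minimum sounding time
    attack_db: float     # (C) attack volume
    release_db: float    # (D) release volume


# Built-in sets; the values are placeholders, not taken from the patent.
PRESETS: Dict[str, BasicParams] = {
    "strong": BasicParams(50, 200, -40.0, -50.0),
    "weak": BasicParams(200, 500, -20.0, -30.0),
}


def resolve_params(sensitivity: str,
                   user_presets: Optional[Dict[str, BasicParams]] = None) -> BasicParams:
    """Look up a parameter set; user-defined sets override built-in ones."""
    table = {**PRESETS, **(user_presets or {})}
    return table[sensitivity]


if __name__ == "__main__":
    print(resolve_params("strong"))
```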

このように、本実施形態によれば、音声処理装置１００は、波形データ２５１からユーザの好みに合わせて音声データを抽出することができる。また、音声処理装置１００は、ユーザに大きな作業負担を強いることなく、抽出した音声データにユーザの好みに合わせてエフェクトをかけることができる。 Thus, according to the present embodiment, the speech processing apparatus 100 can extract speech data from the waveform data 251 according to the user's preference. Furthermore, the audio processing apparatus 100 can apply effects to the extracted audio data according to the user's preference without imposing a heavy work burden on the user.

本発明は、上述した実施形態に限定されず、種々の変形及び応用が可能である。また、上述した実施形態の各構成要素を自由に組み合わせることも可能である。 The present invention is not limited to the above-described embodiments, and various modifications and applications are possible. Moreover, it is also possible to freely combine the constituent elements of the above-described embodiments.

音声処理装置１００を装置の全部又は一部として動作させるためのプログラムを、メモリカード、ＣＤ−ＲＯＭ、ＤＶＤ、ＭＯ（Magneto Optical disk）などのコンピュータ読み取り可能な記録媒体に格納して配布し、これを別のコンピュータにインストールし、上述の手段として動作させ、あるいは、上述の工程を実行させてもよい。 A program for operating the audio processing apparatus 100, as all or part of the apparatus, may be stored and distributed on a computer-readable recording medium such as a memory card, CD-ROM, DVD, or MO (Magneto Optical disk), installed on another computer, and made to operate as the means described above or to execute the steps described above.

さらに、インターネット上のサーバ装置が有するディスク装置等にプログラムを格納しておき、例えば、搬送波に重畳させて、コンピュータにダウンロード等するものとしてもよい。 Furthermore, the program may be stored in a disk device or the like included in a server device on the Internet, and may be downloaded onto a computer by being superimposed on a carrier wave, for example.

以上説明したように、本発明によれば、音声データから所望の部分を効率よく取り出すために好適な音声処理装置、音声処理方法、ならびに、プログラムを提供することができる。 As described above, according to the present invention, it is possible to provide a sound processing apparatus, a sound processing method, and a program suitable for efficiently extracting a desired portion from sound data.

本発明の音声処理装置の構成を示す図である。 A diagram showing the configuration of the audio processing apparatus of the present invention.
音声処理装置の各部が行う処理を説明するための図である。 A diagram for explaining the processing performed by each unit of the audio processing apparatus.
波形データの構成例を示す図である。 A diagram showing a configuration example of the waveform data.
（ａ）セリフテーブルの構成例を示す図である。（ｂ）波形データと、波形データに含まれるセリフの文字列とを対応付ける手法の一例である。 (a) A diagram showing a configuration example of the dialogue table. (b) An example of a method for associating waveform data with the character strings of the dialogue lines contained in that waveform data.
（ａ）文字の種類と発音時間との関連付けの例を示す図である。（ｂ）文字または文字列と、発音時間との対応付けの例を示す図である。 (a) A diagram showing an example of associating character types with pronunciation times. (b) A diagram showing an example of associating characters or character strings with pronunciation times.
波形データの有音区間と無音区間を説明するための図である。 A diagram for explaining the voiced sections and silent sections of the waveform data.
（ａ）パワー値と、保持部によって計算されるアタック比率を説明するための図である。（ｂ）波形データの有音区間と無音区間を説明するための図である。 (a) A diagram for explaining the power value and the attack ratio calculated by the holding unit. (b) A diagram for explaining the voiced sections and silent sections of the waveform data.
近似曲線と、波形データの有音区間と無音区間を説明するための図である。 A diagram for explaining the approximate curve and the voiced sections and silent sections of the waveform data.
波形データの有音区間と無音区間を簡略化して表す図である。 A simplified diagram of the voiced sections and silent sections of the waveform data.
（ａ）〜（ｅ）は、更新部が有音区間を更新する処理を説明するための図である。 (a) to (e) are diagrams for explaining the processing by which the update unit updates voiced sections.
（ａ）〜（ｅ）は、更新部が有音区間を更新する処理を説明するための図である。 (a) to (e) are diagrams for explaining the processing by which the update unit updates voiced sections.
音声リストと文字列候補リストを含む画面の構成例である。 A configuration example of a screen containing an audio list and a character string candidate list.
音声抽出処理を説明するためのフローチャートである。 A flowchart for explaining the audio extraction processing.
前オフセット区間と後ろオフセット区間を説明するための図である。 A diagram for explaining the front offset section and the rear offset section.
基本パラメータと補助パラメータを説明するための図である。 A diagram for explaining the basic parameters and the auxiliary parameters.

符号の説明 Explanation of symbols

１００ 音声処理装置 100 Audio processing apparatus
１０１ 入力部 101 Input unit
１０２ 画像処理部 102 Image processing unit
１０３ 音声処理部 103 Audio processing unit
１０４ 通信処理部 104 Communication processing unit
１０５ ＤＶＤ−ＲＯＭドライブ 105 DVD-ROM drive
１０６ 記憶装置 106 Storage device
１０７ ＲＯＭ 107 ROM
１０８ ＲＡＭ 108 RAM
１０９ 制御部 109 Control unit
１１０ システムバス 110 System bus
１２１ キーボード 121 Keyboard
１２２ マウス 122 Mouse
１２３ モニタ 123 Monitor
１２４ スピーカー 124 Speaker
１２５ マイク 125 Microphone
２０１ 記憶部 201 Storage unit
２０２ 決定部 202 Determination unit
２０３ 保持部 203 Holding unit
２０４ 更新部 204 Update unit
２０５ 出力部 205 Output unit
２５１ 波形データ 251 Waveform data
３１０ 有音区間 310 Voiced section
３２０ 無音区間 320 Silent section
４００ セリフテーブル 400 Dialogue table
４１０ 波形データ名 410 Waveform data name
４２０ セリフ番号 420 Dialogue line number
４３０ 文字列データ 430 Character string data
６１０ 有音区間の開始点（あるいは無音区間の終了点） 610 Start point of a voiced section (or end point of a silent section)
６２０ 有音区間の終了点（あるいは無音区間の開始点） 620 End point of a voiced section (or start point of a silent section)
８１０ 近似曲線 810 Approximate curve
１４１０ 前オフセット区間 1410 Front offset section
１４２０ 後ろオフセット区間 1420 Rear offset section

Claims

文字列を表すセリフデータを複数含む波形データを記憶する記憶部と、
前記複数のセリフデータが表す文字列の発音時間長のそれぞれに基づいて、前記波形データの時間長より短い閾時間を決定する決定部と、
前記記憶された波形データから有音区間を抽出して保持する保持部と、
前記保持された有音区間のそれぞれについて、当該有音区間の時間長が前記決定された閾時間より短い場合、当該有音区間と、当該有音区間の前後に存在する他の一の有音区間と、当該二つの有音区間に挟まれる区間と、から構成される区間を新たな有音区間として前記保持部に保持させるように更新する更新部と、
を備え、
前記更新部は、
（ａ）当該有音区間と、当該有音区間より時系列的に前に存在する前方有音区間と、に挟まれる第１区間、
（ｂ）当該有音区間と、当該有音区間より時系列的に後に存在する後方有音区間と、に挟まれる第２区間、
のそれぞれの時間長を求め、当該第１区間と当該第２区間のうち、当該求めた時間長の短い方の区間を選択し、当該選択した区間と、当該有音区間と、当該選択した区間に対応する当該前方有音区間又は当該後方有音区間のいずれかと、から構成される区間を、当該新たな有音区間として前記保持部に保持させるように更新する、
ことを特徴とする音声処理装置。
An audio processing apparatus comprising:
a storage unit that stores waveform data containing a plurality of pieces of dialogue data, each representing a character string;
a determination unit that determines a threshold time shorter than the time length of the waveform data, based on the pronunciation time length of each of the character strings represented by the plurality of pieces of dialogue data;
a holding unit that extracts voiced sections from the stored waveform data and holds them; and
an update unit that, for each held voiced section whose time length is shorter than the determined threshold time, updates the holding unit so that it holds, as a new voiced section, a section composed of that voiced section, one other voiced section located before or after it, and the section sandwiched between the two voiced sections,
wherein the update unit obtains the time length of each of
(a) a first section sandwiched between the voiced section and a preceding voiced section located before it in time, and
(b) a second section sandwiched between the voiced section and a following voiced section located after it in time,
selects the shorter of the first section and the second section, and updates the holding unit so that it holds, as the new voiced section, a section composed of the selected section, the voiced section, and whichever of the preceding voiced section and the following voiced section corresponds to the selected section.
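The update described in this claim, where a voiced section shorter than the threshold is joined to the neighbor across the shorter gap, can be sketched as follows. This is a minimal illustration under stated assumptions: sections are sorted, non-overlapping `(start, end)` tuples; a tie between the two gaps goes to the preceding neighbor; and a merged section is re-checked against the threshold. The function name `merge_short_sections` is hypothetical, and the tie-breaking and re-checking behavior are not specified by the claim.

```python
def merge_short_sections(sections, threshold):
    """Merge each too-short voiced section with the neighbor across the shorter gap.

    sections  : sorted, non-overlapping (start, end) tuples
    threshold : minimum acceptable length of a voiced section
    """
    secs = [tuple(s) for s in sections]
    i = 0
    while i < len(secs):
        start, end = secs[i]
        # Long enough, or nothing to merge with: move on.
        if end - start >= threshold or len(secs) == 1:
            i += 1
            continue
        # (a) gap to the preceding section, (b) gap to the following section.
        gap_before = start - secs[i - 1][1] if i > 0 else None
        gap_after = secs[i + 1][0] - end if i + 1 < len(secs) else None
        if gap_after is None or (gap_before is not None and gap_before <= gap_after):
            # Join with the preceding section, absorbing the gap between them.
            secs[i - 1:i + 1] = [(secs[i - 1][0], end)]
            i -= 1  # re-check the merged section (assumption)
        else:
            # Join with the following section, absorbing the gap between them.
            secs[i:i + 2] = [(start, secs[i + 1][1])]
    return secs


if __name__ == "__main__":
    print(merge_short_sections([(0.0, 2.0), (2.1, 2.4), (3.0, 5.0)], threshold=1.0))
```

In the example, the 0.3-second section is closer to the section before it than to the one after it, so it is absorbed into the preceding section.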

文字列を表すセリフデータを複数含む波形データを記憶する記憶部と、
前記複数のセリフデータが表す文字列の発音時間長のそれぞれに基づいて、前記波形データの時間長より短い閾時間を決定する決定部と、
前記記憶された波形データから有音区間を抽出して保持する保持部と、
前記保持された有音区間のそれぞれについて、当該有音区間の時間長が前記決定された閾時間より短い場合、当該有音区間と、当該有音区間の前後に存在する他の一の有音区間と、当該二つの有音区間に挟まれる区間と、から構成される区間を新たな有音区間として前記保持部に保持させるように更新する更新部と、
を備え、
前記更新部は、
（ｃ）当該有音区間より時系列的に前に存在する前方有音区間と、当該前方有音区間と当該有音区間に挟まれる区間と、から構成される第１区間、
（ｄ）当該有音区間より時系列的に後に存在する後方有音区間と、当該後方有音区間と当該有音区間に挟まれる区間と、から構成される第２区間、
のそれぞれの時間長を求め、当該第１区間と当該第２区間のうち、当該求めた時間長の短い方の区間を選択し、当該選択した区間と当該有音区間とから構成される区間を、当該新たな有音区間として前記保持部に保持させるように更新する、
ことを特徴とする音声処理装置。
An audio processing apparatus comprising:
a storage unit that stores waveform data containing a plurality of pieces of dialogue data, each representing a character string;
a determination unit that determines a threshold time shorter than the time length of the waveform data, based on the pronunciation time length of each of the character strings represented by the plurality of pieces of dialogue data;
a holding unit that extracts voiced sections from the stored waveform data and holds them; and
an update unit that, for each held voiced section whose time length is shorter than the determined threshold time, updates the holding unit so that it holds, as a new voiced section, a section composed of that voiced section, one other voiced section located before or after it, and the section sandwiched between the two voiced sections,
wherein the update unit obtains the time length of each of
(c) a first section composed of a preceding voiced section located before the voiced section in time and the section sandwiched between the preceding voiced section and the voiced section, and
(d) a second section composed of a following voiced section located after the voiced section in time and the section sandwiched between the following voiced section and the voiced section,
selects the shorter of the first section and the second section, and updates the holding unit so that it holds, as the new voiced section, a section composed of the selected section and the voiced section.

文字列を表すセリフデータを複数含む波形データを記憶する記憶部と、
前記複数のセリフデータが表す文字列の発音時間長のそれぞれに基づいて、前記波形データの時間長より短い閾時間を決定する決定部と、
前記記憶された波形データから有音区間を抽出して保持する保持部と、
前記保持された有音区間のそれぞれについて、当該有音区間の時間長が前記決定された閾時間より短い場合、当該有音区間と、当該有音区間の前後に存在する他の一の有音区間と、当該二つの有音区間に挟まれる区間と、から構成される区間を新たな有音区間として前記保持部に保持させるように更新する更新部と、
を備え、
前記更新部は、
（ｅ）当該有音区間より時系列的に前に存在する前方有音区間、
（ｆ）当該有音区間より時系列的に後に存在する後方有音区間、
のそれぞれの時間長を求め、当該前方有音区間と当該後方有音区間のうち、当該求めた時間長の短い方の区間を選択し、当該選択した区間と、当該有音区間と、当該選択した区間と当該有音区間に挟まれる区間を、当該新たな有音区間として前記保持部に保持させるように更新する、
ことを特徴とする音声処理装置。
An audio processing apparatus comprising:
a storage unit that stores waveform data containing a plurality of pieces of dialogue data, each representing a character string;
a determination unit that determines a threshold time shorter than the time length of the waveform data, based on the pronunciation time length of each of the character strings represented by the plurality of pieces of dialogue data;
a holding unit that extracts voiced sections from the stored waveform data and holds them; and
an update unit that, for each held voiced section whose time length is shorter than the determined threshold time, updates the holding unit so that it holds, as a new voiced section, a section composed of that voiced section, one other voiced section located before or after it, and the section sandwiched between the two voiced sections,
wherein the update unit obtains the time length of each of
(e) a preceding voiced section located before the voiced section in time, and
(f) a following voiced section located after the voiced section in time,
selects the shorter of the preceding voiced section and the following voiced section, and updates the holding unit so that it holds, as the new voiced section, a section composed of the selected section, the voiced section, and the section sandwiched between the selected section and the voiced section.
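The three independent apparatus claims above differ only in what is compared when choosing the neighbor to merge with: the gap alone ((a)/(b)), the neighbor plus the gap ((c)/(d)), or the neighbor alone ((e)/(f)). The following sketch puts the three criteria side by side. The function name `pick_neighbor`, the criterion labels, and the tie-breaking toward the preceding section are assumptions made for illustration.

```python
def pick_neighbor(prev, cur, nxt, criterion):
    """Return "prev" or "next": the neighbor that a short section `cur` joins.

    prev, cur, nxt : (start, end) tuples in time order
    criterion      : "gap", "gap_plus_neighbor", or "neighbor" (hypothetical labels)
    """
    gap_prev = cur[0] - prev[1]   # section between prev and cur
    gap_next = nxt[0] - cur[1]    # section between cur and nxt
    len_prev = prev[1] - prev[0]
    len_next = nxt[1] - nxt[0]
    costs = {
        "gap": (gap_prev, gap_next),                                      # (a)/(b)
        "gap_plus_neighbor": (gap_prev + len_prev, gap_next + len_next),  # (c)/(d)
        "neighbor": (len_prev, len_next),                                 # (e)/(f)
    }
    before, after = costs[criterion]
    # The shorter candidate wins; ties go to the preceding side (assumption).
    return "prev" if before <= after else "next"


if __name__ == "__main__":
    prev, cur, nxt = (0, 3000), (3100, 3400), (4000, 4500)  # milliseconds
    for name in ("gap", "gap_plus_neighbor", "neighbor"):
        print(name, pick_neighbor(prev, cur, nxt, name))
```

With these example values the gap criterion selects the long preceding section because it is only 100 ms away, while the other two criteria select the shorter following section, which shows that the variants can produce different groupings from the same data.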

請求項１乃至３のいずれか１項に記載の音声処理装置であって、
前記保持された有音区間のうちユーザによって選択された有音区間を当該文字列に対応付けて出力する出力部を更に備える、
ことを特徴とする音声処理装置。
The audio processing apparatus according to any one of claims 1 to 3, further comprising an output unit that outputs, in association with the character string, a voiced section selected by the user from among the held voiced sections.

請求項４に記載の音声処理装置であって、
前記保持部は、当該有音区間の開始位置から時系列的に前の所定長さのオフセット区間と、当該有音区間の終了位置から時系列的に後の所定長さのオフセット区間とを更に抽出して保持し、
前記出力部は、前記抽出された２つのオフセット区間をさらに再生し、前記保持された有音区間の中からいずれかをユーザに選択させ、前記選択された有音区間と前記抽出された２つのオフセット区間を、当該文字列に対応付けて出力する、
ことを特徴とする音声処理装置。
The audio processing apparatus according to claim 4, wherein
the holding unit further extracts and holds an offset section of a predetermined length preceding the start position of the voiced section in time and an offset section of a predetermined length following the end position of the voiced section in time, and
the output unit further plays back the two extracted offset sections, has the user select one of the held voiced sections, and outputs the selected voiced section and the two extracted offset sections in association with the character string.

請求項５に記載の音声処理装置であって、
前記出力部は、当該有音区間の開始位置から時系列的に前の所定長さのオフセット区間の音量をゼロから単調増加させ、当該有音区間の終了位置から時系列的に後の所定長さのオフセット区間の音量を単調減少させてゼロにする、
ことを特徴とする音声処理装置。
The audio processing apparatus according to claim 5, wherein the output unit monotonically increases the volume from zero over the offset section of predetermined length preceding the start position of the voiced section in time, and monotonically decreases the volume to zero over the offset section of predetermined length following the end position of the voiced section in time.

請求項１乃至６のいずれか１項に記載の音声処理装置であって、
前記決定部は、当該文字列の長さに対して単調増加させて当該閾時間を決定する、
ことを特徴とする音声処理装置。
The audio processing apparatus according to any one of claims 1 to 6, wherein the determination unit determines the threshold time so that it increases monotonically with the length of the character string.

請求項１乃至６のいずれか１項に記載の音声処理装置であって、
前記決定部は、文字の種類に応じて予め定められたゼロ以上の定数の総和を求めることにより当該閾時間を決定する、
ことを特徴とする音声処理装置。
The audio processing apparatus according to any one of claims 1 to 6, wherein the determination unit determines the threshold time by computing the sum of constants, each zero or greater, that are predetermined according to character type.
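A threshold computed as a sum of per-character-type constants, as in the claim above, might look like the following sketch. The three character classes, the Unicode ranges used to detect them, and the millisecond values are all hypothetical; the claim only requires that each constant be zero or greater.

```python
# Hypothetical per-type constants in milliseconds (each must be >= 0).
DURATION_MS = {"kana": 150, "kanji": 300, "other": 0}


def char_type(ch):
    """Classify a character by Unicode block (simplified, illustrative)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x30FF:   # Hiragana and Katakana blocks
        return "kana"
    if 0x4E00 <= cp <= 0x9FFF:   # CJK Unified Ideographs
        return "kanji"
    return "other"               # punctuation, spaces, Latin letters, ...


def threshold_ms(text):
    """Threshold time as the sum of the constants for each character in text."""
    return sum(DURATION_MS[char_type(ch)] for ch in text)


if __name__ == "__main__":
    print(threshold_ms("こんにちは。"))
    print(threshold_ms("今日は"))
```

Because every constant is non-negative, appending characters never lowers the result, so a threshold built this way is also non-decreasing in the length of the character string.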

記憶部、決定部、保持部、更新部を有する音声処理装置にて実行される音声処理方法であって、 An audio processing method executed by an audio processing apparatus having a storage unit, a determination unit, a holding unit, and an update unit, wherein
前記記憶部には、文字列を表すセリフデータを複数含む波形データが記憶され、 the storage unit stores waveform data containing a plurality of pieces of dialogue data, each representing a character string, and the method comprises:
前記決定部が、前記複数のセリフデータが表す文字列の発音時間長のそれぞれに基づいて、前記波形データの時間長より短い閾時間を決定する決定ステップと、 a determination step in which the determination unit determines a threshold time shorter than the time length of the waveform data, based on the pronunciation time length of each of the character strings represented by the plurality of pieces of dialogue data;
前記保持部が、前記記憶された波形データから有音区間を抽出して保持する保持ステップと、 a holding step in which the holding unit extracts voiced sections from the stored waveform data and holds them; and
前記保持された有音区間のそれぞれについて、当該有音区間の時間長が前記決定された閾時間より短い場合、前記更新部が、当該有音区間と、当該有音区間の前後に存在する他の一の有音区間と、当該二つの有音区間に挟まれる区間と、から構成される区間を新たな有音区間として前記保持ステップに保持させるように更新する更新ステップと、 an update step in which, for each held voiced section whose time length is shorter than the determined threshold time, the update unit performs an update so that a section composed of that voiced section, one other voiced section located before or after it, and the section sandwiched between the two voiced sections is held as a new voiced section,
を備え、 wherein,
前記更新ステップでは、前記更新部が、 in the update step, the update unit obtains the time length of each of
（ａ）当該有音区間と、当該有音区間より時系列的に前に存在する前方有音区間と、に挟まれる第１区間、 (a) a first section sandwiched between the voiced section and a preceding voiced section located before it in time, and
（ｂ）当該有音区間と、当該有音区間より時系列的に後に存在する後方有音区間と、に挟まれる第２区間、 (b) a second section sandwiched between the voiced section and a following voiced section located after it in time,
のそれぞれの時間長を求め、当該第１区間と当該第２区間のうち、当該求めた時間長の短い方の区間を選択し、当該選択した区間と、当該有音区間と、当該選択した区間に対応する当該前方有音区間又は当該後方有音区間のいずれかと、から構成される区間を、当該新たな有音区間として前記保持部に保持させるように更新する、 selects the shorter of the first section and the second section, and performs the update so that the holding unit holds, as the new voiced section, a section composed of the selected section, the voiced section, and whichever of the preceding voiced section and the following voiced section corresponds to the selected section.
ことを特徴とする音声処理方法。

記憶部、決定部、保持部、更新部を有する音声処理装置にて実行される音声処理方法であって、 An audio processing method executed by an audio processing apparatus having a storage unit, a determination unit, a holding unit, and an update unit, wherein
前記記憶部には、文字列を表すセリフデータを複数含む波形データが記憶され、 the storage unit stores waveform data containing a plurality of pieces of dialogue data, each representing a character string, and the method comprises:
前記決定部が、前記複数のセリフデータが表す文字列の発音時間長のそれぞれに基づいて、前記波形データの時間長より短い閾時間を決定する決定ステップと、 a determination step in which the determination unit determines a threshold time shorter than the time length of the waveform data, based on the pronunciation time length of each of the character strings represented by the plurality of pieces of dialogue data;
前記保持部が、前記記憶された波形データから有音区間を抽出して保持する保持ステップと、 a holding step in which the holding unit extracts voiced sections from the stored waveform data and holds them; and
前記保持された有音区間のそれぞれについて、当該有音区間の時間長が前記決定された閾時間より短い場合、前記更新部が、当該有音区間と、当該有音区間の前後に存在する他の一の有音区間と、当該二つの有音区間に挟まれる区間と、から構成される区間を新たな有音区間として前記保持ステップに保持させるように更新する更新ステップと、 an update step in which, for each held voiced section whose time length is shorter than the determined threshold time, the update unit performs an update so that a section composed of that voiced section, one other voiced section located before or after it, and the section sandwiched between the two voiced sections is held as a new voiced section,
を備え、 wherein,
前記更新ステップでは、前記更新部が、 in the update step, the update unit obtains the time length of each of
（ｃ）当該有音区間より時系列的に前に存在する前方有音区間と、当該前方有音区間と当該有音区間に挟まれる区間と、から構成される第１区間、 (c) a first section composed of a preceding voiced section located before the voiced section in time and the section sandwiched between the preceding voiced section and the voiced section, and
（ｄ）当該有音区間より時系列的に後に存在する後方有音区間と、当該後方有音区間と当該有音区間に挟まれる区間と、から構成される第２区間、 (d) a second section composed of a following voiced section located after the voiced section in time and the section sandwiched between the following voiced section and the voiced section,
のそれぞれの時間長を求め、当該第１区間と当該第２区間のうち、当該求めた時間長の短い方の区間を選択し、当該選択した区間と当該有音区間とから構成される区間を、当該新たな有音区間として前記保持部に保持させるように更新する、 selects the shorter of the first section and the second section, and performs the update so that the holding unit holds, as the new voiced section, a section composed of the selected section and the voiced section.
ことを特徴とする音声処理方法。

記憶部、決定部、保持部、更新部を有する音声処理装置にて実行される音声処理方法であって、 An audio processing method executed by an audio processing apparatus having a storage unit, a determination unit, a holding unit, and an update unit, wherein
前記記憶部には、文字列を表すセリフデータを複数含む波形データが記憶され、 the storage unit stores waveform data containing a plurality of pieces of dialogue data, each representing a character string, and the method comprises:
前記決定部が、前記複数のセリフデータが表す文字列の発音時間長のそれぞれに基づいて、前記波形データの時間長より短い閾時間を決定する決定ステップと、 a determination step in which the determination unit determines a threshold time shorter than the time length of the waveform data, based on the pronunciation time length of each of the character strings represented by the plurality of pieces of dialogue data;
前記保持部が、前記記憶された波形データから有音区間を抽出して保持する保持ステップと、 a holding step in which the holding unit extracts voiced sections from the stored waveform data and holds them; and
前記保持された有音区間のそれぞれについて、当該有音区間の時間長が前記決定された閾時間より短い場合、前記更新部が、当該有音区間と、当該有音区間の前後に存在する他の一の有音区間と、当該二つの有音区間に挟まれる区間と、から構成される区間を新たな有音区間として前記保持ステップに保持させるように更新する更新ステップと、 an update step in which, for each held voiced section whose time length is shorter than the determined threshold time, the update unit performs an update so that a section composed of that voiced section, one other voiced section located before or after it, and the section sandwiched between the two voiced sections is held as a new voiced section,
を備え、 wherein,
前記更新ステップでは、前記更新部が、 in the update step, the update unit obtains the time length of each of
（ｅ）当該有音区間より時系列的に前に存在する前方有音区間、 (e) a preceding voiced section located before the voiced section in time, and
（ｆ）当該有音区間より時系列的に後に存在する後方有音区間、 (f) a following voiced section located after the voiced section in time,
のそれぞれの時間長を求め、当該前方有音区間と当該後方有音区間のうち、当該求めた時間長の短い方の区間を選択し、当該選択した区間と、当該有音区間と、当該選択した区間と当該有音区間に挟まれる区間を、当該新たな有音区間として前記保持部に保持させるように更新する、 selects the shorter of the preceding voiced section and the following voiced section, and performs the update so that the holding unit holds, as the new voiced section, a section composed of the selected section, the voiced section, and the section sandwiched between the selected section and the voiced section.
ことを特徴とする音声処理方法。

コンピュータを、 A program that causes a computer to function as:
文字列を表すセリフデータを複数含む波形データを記憶する記憶部、 a storage unit that stores waveform data containing a plurality of pieces of dialogue data, each representing a character string;
前記複数のセリフデータが表す文字列の発音時間長のそれぞれに基づいて、前記波形データの時間長より短い閾時間を決定する決定部、 a determination unit that determines a threshold time shorter than the time length of the waveform data, based on the pronunciation time length of each of the character strings represented by the plurality of pieces of dialogue data;
前記記憶された波形データから有音区間を抽出して保持する保持部、 a holding unit that extracts voiced sections from the stored waveform data and holds them; and
前記保持された有音区間のそれぞれについて、当該有音区間の時間長が前記決定された閾時間より短い場合、当該有音区間と、当該有音区間の前後に存在する他の一の有音区間と、当該二つの有音区間に挟まれる区間と、から構成される区間を新たな有音区間として前記保持部に保持させるように更新する更新部、 an update unit that, for each held voiced section whose time length is shorter than the determined threshold time, updates the holding unit so that it holds, as a new voiced section, a section composed of that voiced section, one other voiced section located before or after it, and the section sandwiched between the two voiced sections,
として機能させ、 wherein
前記更新部は、 the update unit obtains the time length of each of
（ａ）当該有音区間と、当該有音区間より時系列的に前に存在する前方有音区間と、に挟まれる第１区間、 (a) a first section sandwiched between the voiced section and a preceding voiced section located before it in time, and
（ｂ）当該有音区間と、当該有音区間より時系列的に後に存在する後方有音区間と、に挟まれる第２区間、 (b) a second section sandwiched between the voiced section and a following voiced section located after it in time,
のそれぞれの時間長を求め、当該第１区間と当該第２区間のうち、当該求めた時間長の短い方の区間を選択し、当該選択した区間と、当該有音区間と、当該選択した区間に対応する当該前方有音区間又は当該後方有音区間のいずれかと、から構成される区間を、当該新たな有音区間として前記保持部に保持させるように更新する、 selects the shorter of the first section and the second section, and updates the holding unit so that it holds, as the new voiced section, a section composed of the selected section, the voiced section, and whichever of the preceding voiced section and the following voiced section corresponds to the selected section.
ことを特徴とするプログラム。

コンピュータを、 A program that causes a computer to function as:
文字列を表すセリフデータを複数含む波形データを記憶する記憶部、 a storage unit that stores waveform data containing a plurality of pieces of dialogue data, each representing a character string;
前記複数のセリフデータが表す文字列の発音時間長のそれぞれに基づいて、前記波形データの時間長より短い閾時間を決定する決定部、 a determination unit that determines a threshold time shorter than the time length of the waveform data, based on the pronunciation time length of each of the character strings represented by the plurality of pieces of dialogue data;
前記記憶された波形データから有音区間を抽出して保持する保持部、 a holding unit that extracts voiced sections from the stored waveform data and holds them; and
前記保持された有音区間のそれぞれについて、当該有音区間の時間長が前記決定された閾時間より短い場合、当該有音区間と、当該有音区間の前後に存在する他の一の有音区間と、当該二つの有音区間に挟まれる区間と、から構成される区間を新たな有音区間として前記保持部に保持させるように更新する更新部、 an update unit that, for each held voiced section whose time length is shorter than the determined threshold time, updates the holding unit so that it holds, as a new voiced section, a section composed of that voiced section, one other voiced section located before or after it, and the section sandwiched between the two voiced sections,
として機能させ、 wherein
前記更新部は、 the update unit obtains the time length of each of
（ｃ）当該有音区間より時系列的に前に存在する前方有音区間と、当該前方有音区間と当該有音区間に挟まれる区間と、から構成される第１区間、 (c) a first section composed of a preceding voiced section located before the voiced section in time and the section sandwiched between the preceding voiced section and the voiced section, and
（ｄ）当該有音区間より時系列的に後に存在する後方有音区間と、当該後方有音区間と当該有音区間に挟まれる区間と、から構成される第２区間、 (d) a second section composed of a following voiced section located after the voiced section in time and the section sandwiched between the following voiced section and the voiced section,
のそれぞれの時間長を求め、当該第１区間と当該第２区間のうち、当該求めた時間長の短い方の区間を選択し、当該選択した区間と当該有音区間とから構成される区間を、当該新たな有音区間として前記保持部に保持させるように更新する、 selects the shorter of the first section and the second section, and updates the holding unit so that it holds, as the new voiced section, a section composed of the selected section and the voiced section.
ことを特徴とするプログラム。

コンピュータを、 A program that causes a computer to function as:
文字列を表すセリフデータを複数含む波形データを記憶する記憶部、 a storage unit that stores waveform data containing a plurality of pieces of dialogue data, each representing a character string;
前記複数のセリフデータが表す文字列の発音時間長のそれぞれに基づいて、前記波形データの時間長より短い閾時間を決定する決定部、 a determination unit that determines a threshold time shorter than the time length of the waveform data, based on the pronunciation time length of each of the character strings represented by the plurality of pieces of dialogue data;
前記記憶された波形データから有音区間を抽出して保持する保持部、 a holding unit that extracts voiced sections from the stored waveform data and holds them; and
前記保持された有音区間のそれぞれについて、当該有音区間の時間長が前記決定された閾時間より短い場合、当該有音区間と、当該有音区間の前後に存在する他の一の有音区間と、当該二つの有音区間に挟まれる区間と、から構成される区間を新たな有音区間として前記保持部に保持させるように更新する更新部、 an update unit that, for each held voiced section whose time length is shorter than the determined threshold time, updates the holding unit so that it holds, as a new voiced section, a section composed of that voiced section, one other voiced section located before or after it, and the section sandwiched between the two voiced sections,
として機能させ、 wherein
前記更新部は、 the update unit obtains the time length of each of
（ｅ）当該有音区間より時系列的に前に存在する前方有音区間、 (e) a preceding voiced section located before the voiced section in time, and
（ｆ）当該有音区間より時系列的に後に存在する後方有音区間、 (f) a following voiced section located after the voiced section in time,
のそれぞれの時間長を求め、当該前方有音区間と当該後方有音区間のうち、当該求めた時間長の短い方の区間を選択し、当該選択した区間と、当該有音区間と、当該選択した区間と当該有音区間に挟まれる区間を、当該新たな有音区間として前記保持部に保持させるように更新する、 selects the shorter of the preceding voiced section and the following voiced section, and updates the holding unit so that it holds, as the new voiced section, a section composed of the selected section, the voiced section, and the section sandwiched between the selected section and the voiced section.
ことを特徴とするプログラム。