TW201530535A - Speech processing system and speech processing method - Google Patents

Speech processing system and speech processing method Download PDF

Info

Publication number
TW201530535A
Authority
TW
Taiwan
Prior art keywords
text
speaker
voice
speech
personal information
Prior art date
Application number
TW102149186A
Other languages
Chinese (zh)
Inventor
Hai-Tao Liu
Original Assignee
Hon Hai Prec Ind Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hon Hai Prec Ind Co Ltd filed Critical Hon Hai Prec Ind Co Ltd
Priority to TW102149186A priority Critical patent/TW201530535A/en
Publication of TW201530535A publication Critical patent/TW201530535A/en


Landscapes

  • Telephonic Communication Services (AREA)
  • Telephone Function (AREA)

Abstract

A speech processing method includes: extracting, from the speech of a speaker collected over a continuous time period, the voice features of the speaker who speaks first, and then, in the chronological order in which the speakers speak, extracting the voice features of each speaker from that speaker's speech over a continuous time period; determining in turn whether the extracted voice features match a voice model in a database; converting each speaker's speech over the continuous time period into a paragraph of text in turn; inserting the personal information associated with the matched voice model before the converted paragraph; generating a document that includes the paragraph and the personal information; and adding, in the chronological order in which the speakers speak, the other paragraphs converted from each speaker's speech, together with the corresponding personal information, to the document.

Description

Speech processing system and speech processing method

The present invention relates to a speech processing system and a speech processing method.

Currently, speech recorded during a meeting or interview can be automatically converted into text. However, because the generated text is not preceded by the personal information of the corresponding speaker, the text is hard to follow, which makes subsequent editing inconvenient for the person keeping the record.

In view of the above, it is necessary to provide a speech processing system and a speech processing method that make the generated text easy to understand.

A speech processing system includes: a feature acquisition module for extracting, from the speech of a speaker collected by a sound collection unit over a continuous time period, the voice features of that speaker, and for extracting in turn, in the chronological order in which the speakers speak, the voice features of each speaker's speech over a continuous time period; a speech recognition module for determining in turn whether the extracted voice features match a voiceprint model in a voiceprint database, where each voiceprint model in the database is associated with corresponding personal information; a speech conversion module for converting, when the currently extracted voice features are determined to match a voiceprint model in the database, the speaker's speech over the continuous time period into a paragraph of text; and a text generation module for inserting the personal information associated with the matched voiceprint model before the paragraph converted from the speaker's speech, generating a document that includes the paragraph and the personal information, then inserting the personal information associated with the next speaker's voiceprint model before the paragraph converted from that speaker's speech and adding that paragraph and its personal information to the document, and adding, in the chronological order in which the speakers speak, each speaker's converted paragraph and corresponding personal information to the document in turn.

A speech processing method includes: extracting, from the speech of a speaker collected by a sound collection unit over a continuous time period, the voice features of that speaker, and extracting in turn, in the chronological order in which the speakers speak, the voice features of each speaker's speech over a continuous time period; determining in turn whether the extracted voice features match a voiceprint model in a voiceprint database, where each voiceprint model in the database is associated with corresponding personal information; converting, when the currently extracted voice features are determined to match a voiceprint model in the database, the speaker's speech over the continuous time period into a paragraph of text; and inserting the personal information associated with the matched voiceprint model before the paragraph converted from the speaker's speech, generating a document that includes the paragraph and the personal information, then inserting the personal information associated with the next speaker's voiceprint model before the paragraph converted from that speaker's speech and adding that paragraph and its personal information to the document, and adding, in the chronological order in which the speakers speak, each speaker's converted paragraph and corresponding personal information to the document in turn.

By first converting a speaker's speech over a continuous time period into a paragraph of text, inserting personal information before that paragraph, generating a document that includes the paragraph and the personal information, and then adding, in the chronological order in which the speakers speak, each paragraph converted from every speaker's speech together with the corresponding personal information, the present invention makes the generated text easy to understand.

FIG. 1 is a block diagram of a speech processing system according to an embodiment of the present invention.

FIG. 2 is a flowchart of a speech processing method according to an embodiment of the present invention.

Referring to FIG. 1, a block diagram of a speech processing system 10 according to an embodiment of the present invention is shown. In this embodiment, the speech processing system 10 is installed in and runs on a speech processing device 1, and is used to convert each speaker's speech into text and to insert each speaker's personal information before the corresponding text, thereby identifying which speaker each converted passage belongs to. The speech processing device 1 is connected to an input unit 2 and a sound collection unit 3, and further includes a central processing unit (CPU) 20 and a memory 30.

In this embodiment, the speech processing system 10 includes a feature acquisition module 11, a speech recognition module 12, a speech conversion module 13, and a text generation module 14. A module, as the term is used herein, is a series of computer program blocks that can be executed by the central processing unit 20 of the speech processing device 1 to perform a specific function, and is stored in the memory 30 of the speech processing device 1. The memory 30 also stores a voiceprint database and voice files. The voiceprint database stores the voiceprint models of speakers together with the personal information, such as name and photograph, of the speaker corresponding to each voiceprint model. A voice file is a recorded audio file containing the utterances of the speakers.

The feature acquisition module 11 extracts, from the speech of a speaker collected by the sound collection unit 3 over a continuous time period, the voice features of that speaker, and extracts in turn, in the chronological order in which the speakers speak, the voice features of each speaker's speech over a continuous time period. In one embodiment, the feature acquisition module 11 extracts a speaker's voice features while the sound collection unit 3 is capturing the speaker. In another embodiment, the feature acquisition module 11 responds to a user selecting a voice file through the input unit 2 by extracting the voice features of each speaker from the selected voice file in turn. In this embodiment, the feature acquisition module 11 extracts a speaker's voice features using Mel-frequency cepstral coefficients (MFCCs). The invention is not limited to this method of extracting voice features; other feature extraction methods also fall within the scope of the disclosure.
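As a rough sketch of the cepstral feature extraction mentioned above, the following Python computes simplified MFCC-style coefficients using NumPy only. It is illustrative, not the embodiment's implementation: a real MFCC front end would also apply pre-emphasis and a mel filterbank, both omitted here, and the frame sizes are arbitrary assumptions.

```python
import numpy as np

def mfcc_like_features(signal, sample_rate, frame_len=400, hop=160, n_coeffs=13):
    """Simplified MFCC-style features: frame -> window -> FFT -> log -> DCT-II.

    Illustrative only; production MFCCs add pre-emphasis and a mel filterbank.
    """
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        log_spec = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)
        # DCT-II of the log spectrum yields cepstral coefficients.
        n = len(log_spec)
        k = np.arange(n_coeffs)[:, None]
        basis = np.cos(np.pi * k * (2 * np.arange(n)[None, :] + 1) / (2 * n))
        frames.append(basis @ log_spec)
    return np.array(frames)

# Example: one second of a 100 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
feats = mfcc_like_features(np.sin(2 * np.pi * 100 * t), sr)
print(feats.shape)  # → (98, 13)
```

In practice a library extractor (for example `librosa.feature.mfcc`) would replace this sketch; only the overall shape of the computation matters here.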

The speech recognition module 12 determines in turn whether the voice features extracted by the feature acquisition module 11 match a voiceprint model in the voiceprint database.

When the voice features are determined to match a voiceprint model in the voiceprint database, the speech conversion module 13 converts the speaker's speech over the continuous time period into a paragraph of text in turn.

The text generation module 14 inserts the personal information associated with the matched voiceprint model before the paragraph converted from the speaker's speech, generates a document that includes the paragraph and the personal information, and then, after inserting the personal information associated with the next speaker's voiceprint model before the paragraph converted from that speaker's speech, adds that paragraph and its personal information to the document; in this way, in the chronological order in which the speakers speak, each paragraph converted from a speaker's speech over a continuous time period and the corresponding personal information are added to the document in turn.

For example, while the sound collection unit 3 is recording, Lili speaks from 9:00 to 9:05, Lei Lin from 9:05 to 9:10, Zhang Mei from 9:05 to 9:08, and then Lili again from 9:10 to 9:15. The feature acquisition module first obtains Lili's voice features. When the speech recognition module 12 determines that the extracted voice features match a voiceprint model in the voiceprint database, the speech conversion module 13 converts Lili's 9:00-9:05 speech into the paragraph "I'm hungry, do you have anything to eat?". The text generation module 14 then generates the text "Lili: I'm hungry, do you have anything to eat?". Next, the feature acquisition module 11, the speech recognition module 12, and the speech conversion module 13 perform the corresponding operations in turn, and the text generation module adds the paragraphs converted from Lei Lin's, Zhang Mei's, and Lili's speech, together with the corresponding personal information, to the document.
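The labeling behaviour in the example above can be sketched in a few lines of Python. The utterance records, names, and times are taken from the example; the helper name `build_transcript` is purely illustrative and not part of the disclosed system.

```python
# Hypothetical utterance records: (speaker name, start time, converted text).
utterances = [
    ("Lili",      "9:00", "I'm hungry, do you have anything to eat?"),
    ("Lei Lin",   "9:05", "I have some chocolate, would you like it?"),
    ("Zhang Mei", "9:05", "How about milk?"),
    ("Lili",      "9:10", "Really? Chocolate? Great, that's my favorite."),
]

def build_transcript(utterances):
    # Sort by start time so paragraphs appear in speaking order, then
    # prefix each converted paragraph with the speaker's personal information.
    ordered = sorted(utterances, key=lambda u: u[1])
    return "\n\n".join(f"{name}: {text}" for name, start, text in ordered)

print(build_transcript(utterances))
```

Each speaker's paragraph begins with that speaker's name, which is the property the embodiment relies on to make the transcript readable.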

In this embodiment, the feature acquisition module 11 also determines whether the successively acquired voice features have changed. When they have changed, the feature acquisition module 11 concludes that the acquired features belong to different speakers, and therefore that one speaker's utterance has ended and that the speech collected by the sound collection unit 3 in the time period before that point constitutes that speaker's speech over a continuous time period. When the successively acquired voice features have not changed, the feature acquisition module 11 concludes that they belong to the same speaker, that is, that the same speaker is still speaking within the continuous time period, and determines that speaker's voice features.
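One plausible way to implement the change test described above is to compare consecutive feature vectors under a distance metric. The patent does not specify a metric or a threshold; the cosine distance and cutoff below are illustrative assumptions only.

```python
import numpy as np

def speaker_changed(prev_feat, cur_feat, threshold=0.25):
    """Flag a speaker change when consecutive feature vectors diverge.

    Cosine distance is one plausible comparison; the threshold is
    purely illustrative and would be tuned on real data.
    """
    cos = np.dot(prev_feat, cur_feat) / (
        np.linalg.norm(prev_feat) * np.linalg.norm(cur_feat))
    return (1.0 - cos) > threshold

a = np.array([1.0, 0.0, 0.2])
b = np.array([1.0, 0.1, 0.2])   # nearly identical: same speaker
c = np.array([0.0, 1.0, 0.4])   # very different: new speaker
print(speaker_changed(a, b), speaker_changed(a, c))  # → False True
```

A change from `False` to `True` marks the boundary of a continuous time period, which is exactly when the module would close out the previous speaker's paragraph.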

In this embodiment, when adding each speaker's paragraph and personal information to the document in turn, the text generation module 14 starts a new paragraph for the text converted from the next speaker's speech and the corresponding personal information before adding them to the document; in the chronological order in which the speakers speak, each speaker's converted text and corresponding personal information begin a new paragraph as they are added. For example, the first paragraph might read "Lili: I'm hungry, do you have anything to eat?", the second "Lei Lin: I have some chocolate, would you like it?", the third "Zhang Mei: How about milk?", and the fourth "Lili: Really? Chocolate? Great, that's my favorite.". The record keeper can then conveniently identify what each speaker said.

In this embodiment, the text generation module 14 also records the start time of each speaker's utterance in each continuous time period and inserts the recorded start times before the corresponding paragraphs, so that the record keeper knows both the order in which the speakers spoke and the exact times of their utterances.

Referring to FIG. 2, a flowchart of a speech processing method according to an embodiment of the present invention is shown.

In step S201, the feature acquisition module 11 extracts, from the speech of a speaker collected by the sound collection unit 3 over a continuous time period, the voice features of that speaker, and extracts in turn, in the chronological order in which the speakers speak, the voice features of each speaker's speech over a continuous time period.

In step S202, the speech recognition module 12 determines in turn whether the voice features extracted by the feature acquisition module 11 match a voiceprint model in the voiceprint database. If the extracted voice features match a voiceprint model in the database, step S203 is performed; if they match no voiceprint model in the database, the flow returns to step S201.

In step S203, the speech conversion module 13 converts the speaker's speech over the continuous time period into a paragraph of text in turn.

In step S204, the text generation module 14 inserts the personal information associated with the matched voiceprint model before the paragraph converted from the speaker's speech, generates a document that includes the paragraph and the personal information, then inserts the personal information associated with the next speaker's voiceprint model before the paragraph converted from that speaker's speech and adds that paragraph and its personal information to the document, and, in the chronological order in which the speakers speak, adds each speaker's converted paragraph and corresponding personal information to the document in turn.
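The control flow of steps S201 through S204 can be sketched as a loop. The toy voiceprint "models" and string-valued features below stand in for real feature matching; only the extract, match, convert, and label sequence mirrors the method.

```python
# A minimal end-to-end sketch of steps S201-S204 with stand-in components.
voiceprint_db = {"Lili": "high", "Lei Lin": "low"}   # model -> personal info

segments = [  # (toy feature, recognised words) per continuous time period
    ("high", "I'm hungry."),
    ("low", "I have chocolate."),
    ("unknown", "inaudible mumbling"),   # no matching voiceprint: skipped
]

transcript = []
for feature, words in segments:                      # S201: features per period
    speaker = next((name for name, model in voiceprint_db.items()
                    if model == feature), None)      # S202: match a voiceprint
    if speaker is None:
        continue                                     # no match: back to S201
    paragraph = words                                # S203: speech -> text
    transcript.append(f"{speaker}: {paragraph}")     # S204: prepend identity

print("\n".join(transcript))
```

The unmatched third segment is dropped, matching the flowchart's return to step S201 when no voiceprint model matches.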

Those of ordinary skill in the art may make other corresponding changes or adjustments in accordance with the inventive scheme and inventive concept of the present invention and the actual needs of production, and all such changes and adjustments fall within the scope of protection of the claims of the present invention.

10‧‧‧speech processing system

1‧‧‧speech processing device

2‧‧‧input unit

3‧‧‧sound collection unit

20‧‧‧central processing unit

30‧‧‧memory

11‧‧‧feature acquisition module

12‧‧‧speech recognition module

13‧‧‧speech conversion module

14‧‧‧text generation module


Claims (8)

1. A speech processing system, comprising:
a feature acquisition module for extracting, from the speech of a speaker collected by a sound collection unit over a continuous time period, the voice features of that speaker, and for extracting in turn, in the chronological order in which the speakers speak, the voice features of each speaker's speech over a continuous time period;
a speech recognition module for determining in turn whether the extracted voice features match a voiceprint model in a voiceprint database, wherein each voiceprint model in the voiceprint database is associated with corresponding personal information;
a speech conversion module for converting, when the currently extracted voice features are determined to match a voiceprint model in the voiceprint database, the speaker's speech over the continuous time period into a paragraph of text; and
a text generation module for inserting the personal information associated with the matched voiceprint model before the paragraph converted from the speaker's speech, generating a document that includes the paragraph and the personal information, inserting the personal information associated with the next speaker's voiceprint model before the paragraph converted from that speaker's speech and adding that paragraph and its personal information to the document, and adding, in the chronological order in which the speakers speak, each speaker's converted paragraph and corresponding personal information to the document in turn.

2. The speech processing system of claim 1, wherein the feature acquisition module determines whether the successively acquired voice features have changed; when they have changed, the feature acquisition module concludes that one speaker's utterance has ended and that the speech collected by the sound collection unit in the time period before that point constitutes that speaker's speech over a continuous time period; and when they have not changed, the feature acquisition module concludes that the same speaker is still speaking within the continuous time period.

3. The speech processing system of claim 1, wherein, when adding each speaker's paragraph and personal information to the document in turn, the text generation module starts a new paragraph for the text converted from the next speaker's speech and the corresponding personal information before adding them to the document, and, in the chronological order in which the speakers speak, each speaker's converted text and corresponding personal information begin a new paragraph as they are added to the document.

4. The speech processing system of claim 1, wherein the text generation module further records the start time of each speaker's utterance in each continuous time period and inserts the recorded start times before the corresponding paragraphs.

5. A speech processing method, comprising:
extracting, from the speech of a speaker collected by a sound collection unit over a continuous time period, the voice features of that speaker, and extracting in turn, in the chronological order in which the speakers speak, the voice features of each speaker's speech over a continuous time period;
determining in turn whether the extracted voice features match a voiceprint model in a voiceprint database, wherein each voiceprint model in the voiceprint database is associated with corresponding personal information;
converting, when the currently extracted voice features are determined to match a voiceprint model in the voiceprint database, the speaker's speech over the continuous time period into a paragraph of text; and
inserting the personal information associated with the matched voiceprint model before the paragraph converted from the speaker's speech, generating a document that includes the paragraph and the personal information, inserting the personal information associated with the next speaker's voiceprint model before the paragraph converted from that speaker's speech and adding that paragraph and its personal information to the document, and adding, in the chronological order in which the speakers speak, each speaker's converted paragraph and corresponding personal information to the document in turn.

6. The speech processing method of claim 5, further comprising:
determining whether the successively acquired voice features have changed;
when they have changed, concluding that one speaker's utterance has ended and that the speech collected by the sound collection unit in the time period before that point constitutes that speaker's speech over a continuous time period; and
when they have not changed, concluding that the same speaker is still speaking within the continuous time period.

7. The speech processing method of claim 5, further comprising:
when adding each speaker's paragraph and personal information to the document in turn, starting a new paragraph for the text converted from the next speaker's speech and the corresponding personal information before adding them to the document, and, in the chronological order in which the speakers speak, beginning a new paragraph for each speaker's converted text and corresponding personal information as they are added to the document.

8. The speech processing method of claim 5, further comprising:
recording the start time of each speaker's utterance in each continuous time period, and inserting the recorded start times before the corresponding paragraphs.
TW102149186A 2013-12-31 2013-12-31 Speech processing system and speech processing method TW201530535A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW102149186A TW201530535A (en) 2013-12-31 2013-12-31 Speech processing system and speech processing method


Publications (1)

Publication Number Publication Date
TW201530535A true TW201530535A (en) 2015-08-01

Family

ID=54342810

Family Applications (1)

Application Number Title Priority Date Filing Date
TW102149186A TW201530535A (en) 2013-12-31 2013-12-31 Speech processing system and speech processing method

Country Status (1)

Country Link
TW (1) TW201530535A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10867020B2 (en) 2018-01-19 2020-12-15 Far Eastone Telecommunications Co., Ltd. Voiceprint certification method and electronic device thereof

