TW202006532A - Broadcast voice determination method, device and apparatus - Google Patents

Broadcast voice determination method, device and apparatus

Info

Publication number
TW202006532A
Authority
TW
Taiwan
Prior art keywords
audio data
syllables
character
syllable
adjacent characters
Prior art date
Application number
TW108115683A
Other languages
Chinese (zh)
Other versions
TWI711967B (en)
Inventor
韓喆
陳力
姚四海
楊磊
吳軍
Original Assignee
香港商阿里巴巴集團服務有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 香港商阿里巴巴集團服務有限公司
Publication of TW202006532A
Application granted
Publication of TWI711967B


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 — Input arrangements for transferring data to be processed into a form capable of being handled by the computer; output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 — Sound input; sound output
    • G06F3/165 — Management of the audio stream, e.g. setting of volume, audio stream path
    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 — Speech synthesis; text-to-speech systems
    • G10L13/02 — Methods for producing synthetic speech; speech synthesisers

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Telephonic Communication Services (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The present description provides a broadcast voice determination method, device and apparatus. The broadcast voice determination method comprises: acquiring a target digital sequence to be broadcast; converting the target digital sequence into a character string; acquiring audio data of the trunk syllables of the characters and audio data of the linking syllables between adjacent characters, the linking syllables being used to connect the trunk syllables of adjacent characters; and splicing, in a preset order, the audio data of the trunk syllables of the characters and the audio data of the linking syllables between adjacent characters to obtain audio data of the target digital sequence. Because the audio data of the linking syllables between adjacent characters is acquired and used to join the audio data of the trunk syllables of the corresponding characters, voice audio data with more natural transitions between syllables is obtained for the voice broadcasting of digital content, so that the broadcast target digital sequence sounds more natural and fluent and the user experience is improved.

Description

Method, device and apparatus for determining broadcast voice

The present invention relates to the technical field of speech synthesis, and in particular to a method, device and apparatus for determining broadcast voice.

In daily life and work, there are many situations in which digital content needs to be broadcast by voice. For example, in commercial transactions, a merchant usually uses a plug-in built into a mobile payment application to automatically announce the amount of money received in the merchant's account. At present, when broadcasting digital content, most existing methods for determining broadcast voice acquire and splice the audio data of the main body of the character syllable of each character (including characters corresponding to digits, units and the like). For example, when a specific number is broadcast, the audio data of the main body of the character syllable of each character in the number is extracted and spliced to obtain the audio data to be played. When audio data obtained by directly splicing the main bodies of the character syllables in this way is played, the transitions between character syllables are often not smooth or natural: the resulting speech sounds abrupt to listeners, does not match human speech habits, and may even affect the listener's understanding of the broadcast digital content, so the user experience is relatively poor. Therefore, a method for determining broadcast voice that can broadcast digital content naturally and fluently is urgently needed.

The purpose of the present invention is to provide a method, device and apparatus for determining broadcast voice, so as to solve the problems of unnatural digital broadcasting and poor user experience in existing methods, and to broadcast digital content efficiently and fluently while keeping the computational cost low.

The method, device and apparatus for determining broadcast voice provided by the present invention are implemented as follows.

A method for determining broadcast voice comprises: acquiring a target digital sequence to be broadcast; converting the target digital sequence into a character string, the character string comprising a plurality of characters arranged in a preset order; acquiring audio data of the trunk syllable of each character in the character string, and audio data of the linking syllables between adjacent characters in the character string, the linking syllables being used to connect the trunk syllables of adjacent characters; and splicing, in the preset order, the audio data of the trunk syllables of the characters and the audio data of the linking syllables between adjacent characters to obtain the audio data of the target digital sequence.

A device for determining broadcast voice comprises: a first acquisition module configured to acquire a target digital sequence to be broadcast; a conversion module configured to convert the target digital sequence into a character string comprising a plurality of characters arranged in a preset order; a second acquisition module configured to acquire the audio data of the trunk syllable of each character in the character string and the audio data of the linking syllables between adjacent characters in the character string, the linking syllables being used to connect the trunk syllables of adjacent characters; and a splicing module configured to splice, in the preset order, the audio data of the trunk syllables of the characters and the audio data of the linking syllables between adjacent characters to obtain the audio data of the target digital sequence.

A method for determining broadcast voice comprises: acquiring a character string to be played, the character string comprising a plurality of characters arranged in a preset order; acquiring the audio data of the trunk syllable of each character in the character string and the audio data of the linking syllables between adjacent characters in the character string, the linking syllables being used to connect the trunk syllables of adjacent characters; and splicing, in the preset order, the audio data of the trunk syllables of the characters and the audio data of the linking syllables between adjacent characters to obtain the audio data of the character string to be played.

An apparatus for determining broadcast voice comprises a processor and a memory storing processor-executable instructions. When executing the instructions, the processor acquires a target digital sequence to be broadcast; converts the target digital sequence into a character string comprising a plurality of characters arranged in a preset order; acquires the audio data of the trunk syllable of each character in the character string and the audio data of the linking syllables between adjacent characters in the character string, the linking syllables being used to connect the trunk syllables of adjacent characters; and splices, in the preset order, the audio data of the trunk syllables of the characters and the audio data of the linking syllables between adjacent characters to obtain the audio data of the target digital sequence.

A computer-readable storage medium stores computer instructions which, when executed, acquire a target digital sequence to be broadcast; convert the target digital sequence into a character string comprising a plurality of characters arranged in a preset order; acquire the audio data of the trunk syllable of each character in the character string and the audio data of the linking syllables between adjacent characters in the character string, the linking syllables being used to connect the trunk syllables of adjacent characters; and splice, in the preset order, the audio data of the trunk syllables of the characters and the audio data of the linking syllables between adjacent characters to obtain the audio data of the target digital sequence.

In the method, device and apparatus for determining broadcast voice provided by the present invention, the audio data of the linking syllables between adjacent characters is acquired and used to join the audio data of the trunk syllables of the corresponding characters, so that voice audio data with more natural transitions is obtained for voice broadcasting. This solves the problems of unnatural digital broadcasting and poor user experience in existing methods, and achieves efficient and fluent voice broadcasting of digital content while keeping the computational cost low.
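For illustration only, the following is a minimal sketch, in Python, of the splicing step summarized above. The character string, the lookup tables and the placeholder sample values are hypothetical and are not part of the patent disclosure; a real implementation would store recorded waveforms in the preset audio database.

# Minimal sketch of the claimed splicing step; the audio "data" are placeholder
# lists of samples rather than recorded waveforms.
TRUNK_AUDIO = {                      # hypothetical trunk-syllable audio per character
    "五": [0.1, 0.2, 0.1],
    "十": [0.3, 0.4, 0.3],
    "四": [0.2, 0.1, 0.2],
}
LINK_AUDIO = {                       # hypothetical linking-syllable audio per ordered pair
    ("五", "十"): [0.05, 0.05],
    ("十", "四"): [0.02, 0.03],
}

def splice(characters):
    """Concatenate the trunk audio of each character in the preset order,
    inserting the linking audio for each pair of adjacent characters."""
    audio = []
    for i, ch in enumerate(characters):
        if i > 0:
            audio.extend(LINK_AUDIO.get((characters[i - 1], ch), []))
        audio.extend(TRUNK_AUDIO[ch])
    return audio

print(splice(["五", "十", "四"]))     # trunk(五)+link(五,十)+trunk(十)+link(十,四)+trunk(四)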

為了使本技術領域的人員更好地理解本發明中的技術方案,下面將結合本發明實施例中的圖式,對本發明實施例中的技術方案進行清楚、完整地描述,顯然,所描述的實施例僅僅是本發明一部分實施例,而不是全部的實施例。基於本發明中的實施例,本領域普通技術人員在沒有作出創造性勞動前提下所獲得的所有其他實施例,都應當屬於本發明保護的範圍。 考慮到現有的播報語音的確定方法往往沒有深入地分析人類正常說話時的語言習慣和語音特點。例如,人在說“十六”這個數字時,在發出字元音節“十”之後,發出字元音節“六”之前,通常還會發出一種用於連接上述兩種字元音節“十”和“六”的銜接音節。且不同的字元音節之間的銜接音節往往還會存在差異。例如“五十”中字元音節“五”和字元音節“十”之間的銜接音節與“十五”中字元音節“十”和字元音節“五”之間的銜接音節也是不相同的。上述銜接音節本身並不對應某個具體字元,也不能表徵什麼具體的內容或含義,而是類似於一種連接助詞,將人類正常說的話中相鄰的字元音節自然、流暢地連接在了一起,以便聽話者能夠更好地接收並理解說話者說的話中的資訊和內容。而現有的播報語音的確定方法由於沒有考慮到上述人類的語音習慣和語音特點,在合成關於待播報的目標數字序列的語音音訊資料時,通常只截取樣本資料中的對應的數字字元的字元音節的主體部分的音訊資料直接進行拼接。由於相鄰的字元音節之間沒有符合人類語音習慣的自然過渡,導致基於上述方法所產生的關於目標數字序列的語音音訊資料在播放時往往不像人類說的數字那麼自然、流暢,甚至會影響人們對所播報的數字內容的理解,造成使用上的不方便。因此,現有方法在具體實施時,往往會存在數字播報不自然、用戶體驗差的問題。 針對產生上述問題的根本原因,本發明深入、全面地分析了人類正常說話時的語言習慣和語音特點,考慮並關注了人類在正常說話時相鄰的字元音節之間的銜接音節的存在和作用。在建立預設的音訊資料庫時,不但截取儲存了字元音節的主幹音節的音訊資料,還有意識地截取儲存了相鄰的字元音節之間的銜接音節的音訊資料。進而在產生某一個具體數字的語音音訊資料時,會同時獲取該數字對應的多個字元中各個字元的主幹音節的音訊資料和相鄰的字元之間的銜接音節的音訊資料,再利用相鄰的字元之間的銜接音節的音訊資料拼接對應的兩個相鄰的字元的主幹音節的音訊資料,使得所產生的語音音訊資料中,相鄰的字元音節之間的過渡更加自然、流暢,從而解決了現有方法中存在的數字播報不自然、用戶體驗差的問題,達到能兼顧運算成本,高效、流暢地進行有關數字的語音播報。 基於上述原因,本發明實施例提供了一種能夠高效、自然地進行數字語音播報的播報語音的確定設備,通過該播報語音的確定設備可以實現以下功能:獲取待播報的目標數字序列;將所述目標數字序列轉換為字串,其中,所述字串包括多個按照預設順序排列的字元;獲取所述字串中的各個字元的主幹音節的音訊資料,以及所述字串中的相鄰的字元之間的銜接音節的音訊資料,其中,所述銜接音節用於連接相鄰的字元的主幹音節;按照預設順序拼接所述字元的主幹音節的音訊資料和所述相鄰的字元之間的銜接音節的音訊資料,得到所述目標數字序列的音訊資料;播放所述目標數字序列的音訊資料。 在本實施方式中,所述播報語音的確定設備可以是一種在用戶側使用的較為簡單的電子設備。具體地,所述播報語音的確定設備可以是一種具有資料運算、語音播放功能以及網路交互功能的電子設備;也可以為運行於該電子設備中,為資料處理、語音播放和網路交互等提供支援的軟體應用。 具體地,上述播報語音的確定設備例如可以是桌上型電腦、平板電腦、筆記型電腦、智慧手機、數位助理、智慧可穿戴設備、導購終端等。或者,上述播報語音的確定設備也可以是能夠運行於上述電子設備中的軟體應用。例如,上述播報語音的確定設備還可以是在智慧手機中運行的XX寶APP。 在一個場景示例中,可以通過應用本發明實施例提供的播報語音的確定方法的播報語音的確定設備為商家A自動播報商家A的帳戶即時到帳的錢款的金額數字。 在本實施方式中,商家A可以使用自己的手機作為上述播報語音的確定設備。在具體實施前,商家A可以先通過手機的設置操作將手機號碼與商家A在某支付平台上的帳戶關聯。參閱圖1所示,通常消費者在商家A的店中消費結束後可以直接通過手機上的某支付平台的支付軟體在網上進行結帳付款,而不需要線上下與商家進行當面付款。具體的,消費者可以利用手機與某支付平台的伺服器進行通信,通過支付平台將應付給商家A的錢款轉帳到商家A的帳戶中,完成結帳付款。支付平台的伺服器在確認商家A的帳戶接收到消費者通過網上轉帳的錢款後,會向商家A的手機發送到帳提示資訊(例如發送到帳提示短信,或者在商家A的手機上的支付APP中推送對應的到帳提示對話方塊等),以提示商家A:消費者已經在網上進行了結帳付款,同時還會在提示資訊中標識出商家A的帳戶所收到的錢款的具體金額數字,以便商家A可以進一步確認消費者在網上支付的錢款的金額是否準確。例如,支付平台的伺服器可以在確認商家A的帳戶接收到消費者網上轉帳的54元的錢款時,可以向與商家A的帳戶關聯的手機發送包括以下內容的提示資訊:“帳戶到帳54元”。 通常在營業期間,商家會相對比較忙,往往可能沒有時間及時地翻看、閱讀上述提示資訊,因此不方便及時地確認消費者是否在網上進行了結帳付款,以及消費者在網上結帳付款的金額是否準確。這時商家希望可以通過手機即時地語音播報出自己的帳戶所收到錢款具體的金額數字,這樣商家即使營業期間比較忙,沒有時間自己去翻看、確認支付平台的伺服器發送的提示資訊,也能及時地瞭解到消費者通過支付平台結帳付款的具體情況。 手機在接收到支付平台發送的提示資訊後,可以先對提示資訊進行解析,並提取提示資訊中的金額數字“54”作為待播報的目標數字序列,以便後續確定該數字序列所對應的音訊資料進行語音播報。 在本實施方式中,上述提示資訊通常是按照固定規則產生的,因此具有相對統一的格式。例如,在本場景示例中,上述提示資訊可以是按照以下格式構成的:前置引導語部分(即“帳戶到帳”)+數字部分(即具體金額“54”)+單位部分(即“元”)。因此,在獲取待播報的目標數字序列,即提示資訊中數字部分的具體內容時,可以按照與上述產生提示資訊的固定規則對應的解析規則對提示資訊進行解析、拆分,即可以從提示資訊的數字部分中提取得到待播報的數字,即目標數字序列。 在本實施方式中,需要說明的是,對於不同的提示資訊,上述前置引導語部分和單位部分的內容通常都是一樣的,只有數字部分的內容會隨提示資訊不同而不同。因此,可以預先產生並儲存統一的前置引導語部分的音訊資料、單位部分的音訊資料,在播報提示資訊時,只需要產生提示資訊中數字部分的音訊資料,再與預先儲存的前置引導語部分的音訊資料、單位部分的音訊資料進行拼接,即可以得到提示資訊完整的語音音訊資料。 手機在獲取得到了待播報的目標數字序列後,可以先將目標數字序列轉換為對應的字串。其中,上述字串具體可以理解為用於表徵目標數字序列的字元音節的,且按照與目標數字序列對應的排列順序(即預設排列順序)排列的字串,上述字串中每一個字元對應目標數字序列中的一個字元音節。 例如,目標數字序列“54”轉換後得到的對應的字串可以表示為“五十四”。字串“五十四”可以理解為表徵目標數字序列“54”的字元音節的字串,其中,字串中的字元“五”、“十”與目標數字序列中的位於十位元上的數字“5”對應;字串中的字元“四”與目標數字序列中的位於個位上的數字“4”對應。且字串中的字元按照與目標數字序列“54”中數字的排列順序(即先“5”後“4”)對應的預設排列順序進行排列,即先排對應十位上的“5”的字元“五”“十”,再排對應個位上“4”的字元“四”。當然,需要說明的是,上述所列舉的字串,以及對應的預設排列順序只是為了更好地說明本發明實施方式。具體實施時,還可以根據具體的場景情況,選擇使用其他形式的字串和預設規則,也可以對目標數字序列不作轉換,直接進行識別拼接等。對此,本發明不作限定。 手機在得到與目標數字序列對應的字串後,可以識別並確定字串中按順序排列的各個字元,以及相鄰字元之間的連接關係。其中,上述相鄰字元之間的連接關係具體可以理解為相鄰的兩個字元之間的先後順序的一種標識資訊。例如,字串“五十四”中字元“五”和“十”是相鄰的兩個字元,“五”和“十”之間的連接關係可以表述為:字元“五”連字號“十”。當然,需要說明的是上述所列舉的相鄰字元之間的連接關係只是一種示意性說明。具體實施時還可以通過其他標識方式表示相鄰字元之間的連接關係。對此,本發明不作限定。 
在本實施方式中,手機通過字元識別,可以確定出字串中的各個字元按順序依次為“五”、“十”、“四”,對應的相鄰字元之間的連接關係依次為:字元“五”連字號“十”、字元“十”連字號“四”。 進一步,手機可以根據所識別得到的各個字元,以及相鄰的字元之間的連接關係,從預設的音訊資料庫中進行檢索,以得到與各個字元,以及相鄰的字元之間的連接關係對應的音訊資料,即獲取字串中各個字元的主幹音節的音訊資料,以及相鄰的字元之間的銜接音節的音訊資料。 其中,上述字元的主幹音節具體可以理解為字元音節的主要部分,通常該部分的音節具有較高的辨識度,同一個字元音節的主幹音節的基頻、音強等音訊特徵較為一致,近似相同,因此可以提取字元音節的主幹音節用以區分其他字元音節。例如,人在發出字元“五”對應的語音時,中間部分的語音是該字元音節的主要部分,即主幹音節,通常不同人發字元“五”對應的語音時,雖然存在差異,但主幹音節部分大多都是一致的。 上述相鄰字元之間的銜接音節具體可以理解為用於連接相鄰字元的主幹音節的連接部分的音節。例如,人在發出“五十”時,在字元“五”的主幹音節和字元“十”的主幹音節之間的連接部分的語,即為字元“五”和字元“十”之間的銜接音節。這部分音節不同於主幹音節本身並沒有什麼具體含義,也不用於對應表徵某一個具體字元,但在音訊資料中的波形資料並不為0。在人的語音習慣中,通常會出現在相鄰的字元的主幹音節之間,起到承接、過渡的作用,從而能夠使得人說的話不同於機器發音,不是單調、呆板地直接將各個字元的主幹音節簡單地連接起來,而是很自然、流暢地從一個字元音節過渡到另一個字元音節。這樣播報出來的數字更符合人類的聽說習慣,便於人類的接收和理解,同時也會使得收聽者收聽時感覺更舒服,體驗更好。還需要補充的是,不同的相鄰字元(包括字元不同,以及字元相同字元先後順序不同等)之間的銜接音節往往也不相同。例如,字元“五”和“十”之間的銜接音節與“五”和“百”之間的銜接音節、“十”和“五”之間的銜接音節在對應的音訊資料的波形上相互之間都存在差異。因此,在本實施方式中,需要利用相鄰字元之間的連接關係準確地獲取到對應的銜接音節的音訊資料。 上述預設的音訊資料庫具體可以是事先有平台伺服器建立並儲存於伺服器或者播報語音的確定設備的資料庫,其中,上述預設的音訊資料庫中具體可以包含有各個字元的主幹音節的音訊資料,以及各個相鄰字元之間的銜接音節的音訊資料。 具體的,手機可以根據所識別得到的各個字元,以及相鄰的字元之間的連接關係,檢索預設的音訊資料庫分別得到 字元“五”的主幹音節的音訊資料A、“十”的主幹音節的音訊資料B、“四”的主幹音節的音訊資料C,以及字元“五”連字號“十”之間的銜接音節的音訊資料f、字元“十”連字號“四”之間的銜接音節的音訊資料r。 進而,手機可以將上述字元的主幹音節的音訊資料,以及相鄰的字元之間的銜接音節的音訊資料按照字串中字元的排列順序(即預設順序)進行拼接,以得到對應目標數字序列的音訊資料。具體的,可以按照預設順序(即與目標數字序列的字串中字元的排列順序),排列各個字元的主幹音節的音訊資料;再利用相鄰的字元之間的銜接音節的音訊資料連接相鄰的字元的主幹音節的音訊資料。 具體的,例如,可以參閱圖2所示,按照字串(即“五十四”)中字元的排列順序先排“五”的主幹音節的音訊資料A,然後再排“十”的主幹音節的音訊資料B,最後排“四”的主幹音節的音訊資料C。在排好主幹音節的音訊資料後;進一步可以利用字元“五”連字號“十”之間的銜接音節的音訊資料f連接音訊資料A和音訊資料B,利用字元“十”連字號“四”之間的銜接音節的音訊資料r連接音訊資料B和音訊資料C。最終得到的拼接好的,針對目標數字序列“54”的音訊資料可以表示為:“A-f-B-r-C”。這樣便得到了過渡更為自然的針對目標數字序列的音訊資料。 在獲得了目標數字序列的音訊資料後,可以將預先設置好儲存在手機或者伺服器的用於指示所述目標數字序列所表徵的資料對象的前置音訊資料(例如前置引導語部分的音訊資料、單位部分的音訊資料)與目標數字序列的音訊資料進行拼接,得到待播放的語音音訊資料,手機再根據上述語音音訊資料,播放相應的內容資訊。 在本實施方式中,參閱圖3所示,商家A的手機可以獲取預設並儲存在手機本地的前置音訊資料,即預先設置好的用於表述“帳戶到帳”的音訊資料Y和用於表述“元”的音訊資料Z;並將上述前置音訊資料與產生的關於目標數字序列“54”的音訊資料進行拼接,得到完整的待播放的語音音訊資料,可以表示為“Y-A-f-B-r-C-Z”,進而播放上述語音音訊資料,這樣商家A就可以聽到清楚、自然、流暢,且更符合人類正常的收聽習慣的語音播報,避免了機器語音對商家收聽體驗造成的影響。 由上可見,本發明實施例提供的播報語音的確定方法通過獲取相鄰的字元之間的銜接音節的音訊資料,並利用相鄰的字元之間的銜接音節的音訊資料拼接對應的字元的主幹音節的音訊資料,得到過渡更為自然的語音音訊資料,以進行語音播報,從而解決了現有方法中存在的數字播報不自然、用戶體驗差的問題,達到能兼顧運算成本,高效、流暢地進行有關數字的語音播報。 在另一個場景示例中,支付平台的伺服器可以預先建立預設的音訊資料庫,並將上述預設的音訊資料庫發送至播報語音的確定設備。播報語音的確定設備在接收到預設的音訊資料庫,可以將預設的音訊資料庫儲存在播報語音的確定設備的本地,以便播報語音的確定設備可以通過檢索預設的音訊資料庫以獲取目標數字序列的字串中的各個字元的主幹音節的音訊資料,以及所述字串中的相鄰的字元之間的銜接音節的音訊資料。當然,支付平台的伺服器在建立了預設的音訊資料庫後,也可以不將預設的音訊資料庫發送至播報語音的確定設備,而是儲存在伺服器一側,播報語音的確定設備在產生目標數字序列的音訊資料時,可以通過調用儲存在伺服器一側的預設的音訊資料庫以獲取目標數字序列的字串中的各個字元的主幹音節的音訊資料,以及所述字串中的相鄰的字元之間的銜接音節的音訊資料。 在本實施方式中,具體實施時,伺服器可以獲取包含有數字的音訊資料作為樣本資料。進而可以按照一定的規則從標記後的樣本資料中分別截取獲得字元的主幹音節的音訊資料,以及相鄰的字元之間的銜接音節的音訊資料,再根據上述字元的主幹音節的音訊資料,以及相鄰的字元之間的銜接音節的音訊資料,建立預設的音訊資料庫。 具體的,上述獲取包含有數字的音訊資料作為樣本資料可以包括:截取播音員的播報音訊資料中包含有與數字相關的播報內容的音訊資料作為上述樣本資料。也可以採集人按照預設文本讀出的語音資料,作為上述樣本資料,其中,上預設文本可以是預先設置的包含有多種數字組合的文本內容。 在獲取了樣本資料後,還可以先對樣本資料進行標記。具體的,可以參閱圖4所示,在所獲取的樣本資料中,利用字元音節標識可以標識出各個字元音節對應的音訊資料的所處的範圍區域。例如,對於樣本資料中的音訊資料“五十六”,可以利用“5”、“10”、“6”分別作為字元音節“五”的字元音節標識、字元音節“十”的字元音節標識、字元音節“六”的字元音節標識,分別標識出字元音節“五”、“十”、“六”在所述音訊資料中的範圍區域。當然,需要說明的是,上述所列舉的字元音節標識只是一種示意性說明,不應構成對本發明的不當限定。 進一步的,在從樣本資料中截取得到字元的主幹音節的音訊資料時,具體可以包括:檢索所述樣本資料中的字元音節標識;根據所述字元音節標識,截取所述樣本資料中所述字元音節標識所標識的範圍中的指定區域的音訊資料作為所述字元的主幹音節的音訊資料。 
具體的,可以檢索確定樣本資料中音訊資料的字元音節標識,進而可以根據上述字元音節標識,確定樣本資料中的各個字元音節在音訊資料中的區域範圍,即樣本資料中的字元音節標識所標識的範圍;再從上述字元音節在音訊資料中的區域範圍中按照預設的規則截取指定區域內的音訊資料作為字元的主幹音節的音訊資料。例如,對於樣本資料中的音訊資料“五十六”,可以先檢索該音訊資料中的字元音節標識“5”、“10”、“6”;進而可以根據字元音節標識“5”確定字元音節“五”在音訊資料中的區域範圍,根據字元音節標識“10”確定字元音節“十”在音訊資料中的區域範圍,根據字元音節標識“6”確定字元音節“六”在音訊資料中的區域範圍;進而可以從字元音節“五”所在的區域範圍的音訊資料中截取指定區域的音訊資料作為字元“五”的主幹音節的音訊資料,從字元音節“十”所在的區域範圍的音訊資料中截取指定區域的音訊資料作為字元“十”的主幹音節的音訊資料,從字元音節“六”所在的區域範圍的音訊資料中截取指定區域的音訊資料作為字元“六”的主幹音節的音訊資料。 在具體截取字元的主幹音節的音訊資料時,考慮到人在說具體數字時,對應於數字中的每一個數字或單位的音節的中間部分的音訊資料大多是較為一致的,即相同字元音節的音訊資料大多中間部分的音訊資料差異相對較小,不同字元音節的音訊資料大多中間部分的音訊資料差異相對較大。例如,人在說“五十六”和“六十五”這兩個數字時,“五十六”中的字元“五”的音節的音訊資料的中間部分往往與“六十五”中的字元“五”的音節的音訊資料的中間部分相同。因此,可以將字元音節的音訊資料中的中間部分的音訊資料作為指定區域的音訊資料進行截取,以得到該字元音節的主幹音節的音訊資料。基於上述特點,具體實施時,可以在所述字元音節標識所標識的範圍中,以所述字元音節標識所標識的範圍中的中點為中心對稱點,且區域的區間長度與所述字元音節標識所標識的範圍的區間長度的比值等於預設比值的區域。 例如,可以參閱圖5所示,將字元音節標識“5”所標識的範圍中的中點O作為中心對稱點,分別截取中心對稱點O兩側1/2區域組合作為指定區域,將該指定區域的音訊資料確定為字元“五”的主幹音節的音訊資料。其中,上述指定區域占字元音節標識“5”所標識的範圍的1/2。按照上述方式,還可以截取得到字元“十”的主幹音節的音訊資料,以及字元“六”的主幹音節的音訊資料。當然,上述所列舉的預設比值只是為了更好地說明本發明實施方式。具體實施時,也可以根據具體的場景情況,選擇其他數值作為預設比值,以確定指定區域進而截取對應的字元的主幹音節的音訊資料。 在截取得到字元的主幹音節的音訊資料後,可以截取樣本資料的音訊資料中相鄰的字元的主幹音節的音訊資料之間的區域內的音訊資料作為上述相鄰的字元之間的銜接音節的音訊資料。 例如,可以參閱圖5所示,截取樣本資料的音訊資料中的相鄰的字元“五”的主幹音節的音訊資料與字元“十”的主幹音節的音訊資料之間的區域內的音訊資料作為字元“五”連字號“十”之間的銜接音節的音訊資料,即相鄰的字元之間的銜接音節的音訊資料。按照上述方式還可以截取得到相鄰的字元“十”和字元“六”之間的銜接音節的音訊資料。 在本場景實例中,考慮到如果樣本資料較為豐富,可以截取得到多個表徵同一相鄰的字元之間的銜接音節的音訊資料。例如,樣本資料中的音訊資料“五十六”、“五十四”中都可以截取到相同的字元“五”與字元“十”之間的銜接音節的音訊資料(或稱字元“五”連字號“十”之間的銜接音節的音訊資料)。此外,樣本資料中可能包含有不同人發出的“五十六”的音訊資料,進而可以基於不同人的音訊資料,得到多個字元“五”和字元“十”之間的銜接音節的音訊資料。 因此,在所截取得到的相鄰的字元之間的銜接音節的音訊資料中包括有同一相鄰的字元之間的銜接音節的音訊資料的情況下,為了獲取效果較好的音訊資料作為相鄰的字元之間的銜接音節的音訊資料,以便後續用於銜接相應的字元的主幹音節的音訊資料時更為自然、流暢,可以將同一相鄰的字元之間的多個銜接音節的音訊資料劃分為多種類型,分別統計樣本資料中各種類型的音訊資料的出現頻率,並從多種類型的音訊資料中篩選出現頻率最高的類型的音訊資料作為上述相鄰的字元之間的銜接音節之間的音訊資料,儲存在預設的音訊資料庫中。當然,除了上述所列舉的根據各種類型的音訊資料的出現頻率從同一相鄰的字元之間的多個銜接音節的音訊資料中篩選出效果較好的音訊資料進行儲存外還可以採用其他合適的方式從同一相鄰的字元之間的多個銜接音節的音訊資料中篩選出效果較好的音訊資料進行儲存。例如,還可以分別計算同一相鄰的字元之間的多個銜接音節的音訊資料的MOS值(Mean Opinion Score,平均主觀意見分),根據銜接音節的音訊資料的MOS值,篩選出MOS值最高的銜接音節的音訊資料作為相鄰的字元之間的銜接音節的音訊資料。其中,上述MOS值可以用於較為準確、客觀地評價音訊資料的自然、流暢程度。 類似的,在截取得到多個表徵同一字元的主幹音節的音訊資料時,可以統計同一字元的多個主幹音節的音訊資料中不同類型的主幹音節的音訊資料的出現頻率,進而可以從同一子符的多種類型的主幹音節的音訊資料中篩選出出現頻率最高的音訊資料作為該字元的主幹音節的音訊資料並儲存至預設的音訊資料庫中。也可以分別確定同一字元的多個主幹音節的音訊資料的MOS值,篩選出MOS值最高的音訊資料作為該字元的主幹音節的音訊資料並儲存至預設的音訊資料庫中等。 由上可見,本發明實施例提供的播報語音的確定方法通過獲取相鄰的字元之間的銜接音節的音訊資料,並利用相鄰的字元之間的銜接音節的音訊資料拼接對應的字元的主幹音節的音訊資料,得到過渡更為自然的語音音訊資料,以進行語音播報,從而解決了現有方法中存在的數字播報不自然、用戶體驗差的問題,達到能兼顧運算成本,高效、流暢地進行有關數字的語音播報;還通過獲取包含有數字的樣本資料,從樣本資料中截取指定區域內的音訊資料作為字元的主幹音節的音訊資料,進而截取字元的主幹音節的音訊資料之間的音訊資料作為相鄰字元之間的銜接音節的音訊資料,從而可以建立較為準確的預設的音訊資料庫,以便可以通過檢索上述預設的音訊資料庫,產生更為自然、流暢的目標數字序列的音訊資料。 參閱圖6所示,本發明提供了一種播報語音的確定方法,其中,該方法具體應用於播報語音的確定設備(或用戶端)一側。具體實施時,該方法可以包括以下內容。 S601:獲取待播報的目標數字序列。 在本實施方式中,上述待播報的目標數字序列具體可以是到帳的錢款的金額數字,例如54元中的54;也可以是汽車行駛里程的距離數字,例如80公里中的80;還可以是股票的即時價格,例如20.9元每股中的20.9。當然上述所列舉的目標數字序列所表徵的資料對象只是為了更好地說明本實施方式。具體實施時,根據具體的應用場景,上述待播報的目標數字序列還可以是用於表徵其他資料對象的數字。對此,本發明不作限定。 在本實施方式中,獲取待播報的目標數字序列具體可以理解為,獲取待播報的資料,解析待播報的資料,提取所述待播報的資料中數字作為上述待播報的目標數字序列。例如,支付平台的伺服器在確認用戶的帳戶到帳54元時,會向與該用戶的帳戶關聯的播報語音的確定設備(例如該用戶的手機)發送到帳提示資訊“帳戶到帳54元”。播報語音的確定設備在接收到上述到帳提示資訊後,可以解析該提示資訊,並提取該提示資訊中的數字“54”作為待播報的目標數字序列。當然,需要說明的是,上述所列舉的獲取待播報的目標數字序列只是一種示意性說明,對此,本發明不作限定。 S603:將所述目標數字序列轉換為字串,其中,所述字串包括多個按照預設順序排列的字元。 在本實施方式中,其中,上述字串具體可以理解為用於表徵目標數字序列的字元音節的,且按照與目標數字序列對應的排列順序(即預設排列順序)排列的字串,上述字串中每一個字元對應目標數字序列中的一個字元音節。例如,目標數字序列“67”的字串可以表示為“六十七”,其中,字元“六”、“十”、“七”分別對應於目標數字序列中的一個字元音節,並且上述字元按照與目標數字序列對應的預設順序排列。當然,需要說明的是,上述所列舉的字串只是為了更好地說明本實施方式。具體實施時,根據具體情況還可以選擇使用其他類型的字串。對此,本發明不作限定。 
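For illustration of the mapping rule discussed in S603, the following is a minimal Python sketch that converts a two-digit integer into the character string that would be read aloud. It is a simplified, hypothetical example (two-digit integers only, no units or decimals); the patent does not prescribe a particular conversion algorithm.

DIGIT_CHARS = "零一二三四五六七八九"

def two_digit_to_characters(n):
    """Convert an integer 0-99 into the character string read aloud,
    e.g. 54 -> "五十四", 67 -> "六十七", 10 -> "十" (simplified sketch)."""
    assert 0 <= n <= 99
    tens, units = divmod(n, 10)
    chars = ""
    if tens >= 2:
        chars += DIGIT_CHARS[tens] + "十"
    elif tens == 1:
        chars += "十"
    if units or n == 0:
        chars += DIGIT_CHARS[units]
    return chars

for n in (54, 67, 10, 20, 9):
    print(n, two_digit_to_characters(n))   # 五十四, 六十七, 十, 二十, 九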
在本實施方式中,上述將所述目標數字序列轉換為字串,具體可以理解為根據預設的映射規則,將目標數字序列轉換為對應的用於表徵目標數字序列的字元音節的字串。例如,根據預設的映射規則,可以將目標數字序列“67”中十位元上的數字“6”轉換為對應的字元“六”和“十”,將個位上的數字“7”轉換為對應的字元“七”,再按照與目標數字序列“67”對應的預設順序,排列得到的字元,從而得到對應的字串為“六十七”。當然,需要說明的是,上述所列舉的將所述目標數字序列轉換為字串的實現方式只是一種示意性說明。具體實施時,也可以根據具體情況,採用其他方式將目標數字序列轉換為對應的字串。對此,本發明不作限定。 S605:獲取所述字串中的各個字元的主幹音節的音訊資料,以及所述字串中的相鄰的字元之間的銜接音節的音訊資料,其中,所述銜接音節用於連接相鄰的字元的主幹音節。 在本實施方式中,上述字元的主幹音節具體可以理解為一個字元音節的主要部分(例如字元音節的中間部分)。通常這部分的音節具有較高的辨識度,同一個字元音節的主幹音節的基頻、音強等音訊特徵較為一致,近似相同,因此可以提取字元音節的主幹音節用以區分其他字元音節。 在本實施方式中,上述相鄰字元之間的銜接音節具體可以理解為用於連接相鄰字元的主幹音節的連接部分的音節。通常這部分的音節不同於主幹音節本身並沒有什麼具體含義,也不用於對應表徵某一個具體字元,但在音訊資料中的波形資料並不為0。在人的語音習慣中,通常會出現在相鄰的字元的主幹音節之間,起到承接、過渡的作用,從而能夠使得人說的話不同於機器發音,不是單調、呆板地直接將各個字元的主幹音節簡單地連接起來,而是很自然、流暢地從一個字元音節過渡到另一個字元音節。例如,人在發出“五十”時,在字元“五”的主幹音節和字元“十”的主幹音節之間的連接部分的語,即為字元“五”和字元“十”之間的銜接音節。 在本實施方式中,上述獲取所述字串中的各個字元的主幹音節的音訊資料,以及所述字串中的相鄰的字元之間的銜接音節的音訊資料,具體可以包括:根據目標數字序列的字串中的具體字元,檢索預設的音訊資料庫以獲取所述字串中的各個字元的主幹音節的音訊資料,以及所述字串中的相鄰的字元之間的銜接音節的音訊資料。 其中,上述預設的音訊資料庫具體可以是事先建立的並儲存於伺服器或者播報語音的確定設備的資料庫。具體的,上述預設的音訊資料庫中具體可以包含有各個字元的主幹音節的音訊資料,以及各個相鄰字元之間的銜接音節的音訊資料。 S607:按照預設順序拼接所述字元的主幹音節的音訊資料和所述相鄰的字元之間的銜接音節的音訊資料,得到所述目標數字序列的音訊資料。 在本實施方式中,上述目標數字序列的音訊資料具體可以理解為用於語音播報目標數字序列的音訊資料。 在本實施方式中,上述按照預設順序拼接所述字元的主幹音節的音訊資料和所述相鄰的字元之間的銜接音節的音訊資料,具體實施時,可以包括:按照預設順序(即與目標數字序列的字串中字元的排列順序),排列各個字元的主幹音節的音訊資料;再利用相鄰的字元之間的銜接音節的音訊資料連接相鄰的字元的主幹音節的音訊資料。 在本實施方式中,需要說明的是,考慮到通常用戶使用的播報語音的確定設備大多是嵌入式的設備系統,這類設備系統受限於自身的結構,往往運算能力、資料處理能力相對較弱,導致直接通過語音合成模型合成相應的數字序列的音訊資料成本相對較高、處理效率也相對較差。通過利用本發明實施例提供的播報語音的確定方法可以避免通過資源佔用較高的語音合成模型產生對應的音訊資料,而是簡單地在預設的音訊資料庫中檢索確定對應的字元的主幹音節的音訊資料,以及相鄰字元之間的銜接音節的音訊資料進行拼接組合,以得到具有較高準確度的目標數字序列的音訊資料,從而可以降低對資源的佔用,提高處理效率,更好地適用於嵌入式的設備系統。 在一個實施方式中,上述獲取所述字串中的各個字元的主幹音節的音訊資料,以及所述字串中的相鄰的字元之間的銜接音節的音訊資料,具體實施時,可以包括以下內容。 S1:識別所述字串中的各個字元,並確定所述字串中的相鄰的字元之間的連接關係,其中,所述字串中的相鄰的字元之間的連接關係用於指示字串中的相鄰的字元之間的先後連接順序; S2:根據所述字串中的各個字元,從預設的音訊資料庫中檢索並獲取各個字元的主幹音節的音訊資料,其中,所述預設的音訊資料庫中儲存有字元的主幹音節的音訊資料和相鄰的字元之間的銜接音節的音訊資料; S3:根據所述字串中的相鄰的字元之間的連接關係,從預設的音訊資料庫中檢索並獲取所述字串中的相鄰的字元之間的銜接音節的音訊資料。 在本實施方式中,上述相鄰的字元之間的連接關係具體可以理解為相鄰的兩個字元之間的先後順序的一種標識資訊。例如,字串“五十四”中字元“五”和“十”是相鄰的兩個字元,“五”和“十”之間的連接關係可以表述為:字元“五”連字號“十”。當然,需要說明的是上述所列舉的相鄰字元之間的連接關係只是一種示意性說明。具體實施時還可以通過其他標識方式表示相鄰字元之間的連接關係。對此,本發明不作限定。 在本實施方式中,具體實施時,可以根據將所識別的字元,以及所確定的相鄰的字元之間的連接關係作為標識,在預設的音訊資料庫中進行檢索,以提取預設的音訊資料庫中與上述標識匹配的音訊資料中作為上述字元的主幹音節的音訊資料,或相鄰的字元之間的銜接音節的音訊資料。 在一個實施方式中,所述預設的音訊資料庫具體可以按照以下方式建立。 S1:獲取樣本資料;其中,所述樣本資料為包含有數字序列所對應的字串的音訊資料; S2:從所述樣本資料中截取得到字元的主幹音節的音訊資料; S3:從所述樣本資料中截取得到相鄰的字元之間的銜接音節的音訊資料; S4:根據所述字元的主幹音節的音訊資料、所述相鄰的字元之間的銜接音節的音訊資料,建立所述預設的音訊資料庫。 在本實施方式中,上述獲取包含有數字的音訊資料作為樣本資料具體實施時,可以包括:截取播音員的播報音訊資料中包含有與數字相關的播報內容的音訊資料作為上述樣本資料;也可以採集人按照預設文本讀出的語音資料,作為上述樣本資料,其中,上預設文本可以是預先設置的包含有多種數字組合的文本內容。當然需要說明的是,上述所列舉的獲取包含有數字的音訊資料作為樣本資料的實現方式只是一種示意性說明。具體實施時,還可以根據具體情況選擇通過其他方式獲取包含有數字的音訊資料作為樣本資料。對此,本發明不作限定。 在本實施方式中,在獲取了樣本資料後,還可以對樣本資料進行標記。具體的,可以在所獲取的樣本資料中,利用相應的字元音節標識標記出各個字元音節對應的音訊資料的所處的範圍區域。 相應的,上述從樣本資料中截取得到字元的主幹音節的音訊資料時,具體可以包括:檢索所述樣本資料中的字元音節標識;根據所述字元音節標識,截取所述樣本資料中所述字元音節標識所標識的範圍中的指定區域的音訊資料作為所述字元的主幹音節的音訊資料。 在本實施方式中,上述指定區域具體可以理解為在所述字元音節標識所標識的範圍中,以所述字元音節標識所標識的範圍中的中點為中心對稱點,且區域的區間長度與所述字元音節標識所標識的範圍的區間長度的比值等於預設比值的區域。 例如,可以將字元音節標識“5”所標識的範圍中的中點O作為中心對稱點,分別截取中心對稱點O兩側1/2區域組合作為指定區域,將該指定區域的音訊資料確定為字元“五”的主幹音節的音訊資料。其中,上述指定區域占字元音節標識“5”所標識的範圍的1/2。當然,需要說明的是,上述所列舉的指定區域,以及確定指定區域的方式只是為了更好地說明本發明實施方式。具體實施時,還可以根據具體情況選擇使用其他的區域作為指定區域,進而採用對應的確定方式確定指定區域。 例如,還可以將字元音節標識所標識的範圍中音強幅值大於閾值強度的區域作為指定區域。相應的,具體實施時,可以根據音強,從字元音節標識所表示的範圍中,截取音強幅值大於閾值強度的區域內的音訊資料作為字元的主幹音節的音訊資料。 具體實施時,可以參閱圖7所示。從字元音節標識所標識的範圍中,選擇音強的幅值大於閾值強度的第一個週期中的音強值為0的位置點與音強幅值小於閾值強度的第一個週期中的音強為0的位置點之間的區域作為指定區域,進而可以截取上述指定區域中的音訊資料作為上述字元的主幹音節的音訊資料。 
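As an illustration of the amplitude-threshold idea described above with reference to FIG. 7, the following is a simplified Python sketch. It bounds the designated region by the first and last samples whose amplitude reaches the threshold rather than by the zero-crossing points of the first periods above and below the threshold, and the waveform values are placeholders; it is a sketch under these assumptions, not the patent's implementation.

def threshold_region(samples, threshold):
    """Return (start, end) indices of the sub-region bounded by the first and
    last samples whose absolute amplitude reaches the threshold (a simplified
    stand-in for the zero-crossing-bounded region of FIG. 7)."""
    above = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not above:
        return None
    return above[0], above[-1] + 1

marked_range = [0.0, 0.01, 0.05, 0.2, 0.4, 0.3, 0.15, 0.02, 0.0]   # toy waveform
start, end = threshold_region(marked_range, 0.1)
trunk_audio = marked_range[start:end]
print(start, end, trunk_audio)            # 3 7 [0.2, 0.4, 0.3, 0.15]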
其中,需要說明的是,上述閾值強度的具體數值可以根據字元音節的音素確定。具體的,如果字元音節的音素為母音,可以將上述閾值強度設置得相對較高,例如可以設置為0.1。如果字元音節的音素為輔音,可以將上述閾值強度設置得相對較低,例如可以設置為0.03。例如,對於某一個字元的字元音節以母音開頭,以輔音結尾,具體實施時可以將該字元的字元音節標識所標識範圍中音強的幅值大於0.1的第一個週期中的音強值為0的位置點與音強幅值小於0.03的第一個週期中的音強為0的位置點之間的區域作為指定區域,進而可以獲取該指定區域中的音訊資料作為該字元的主幹音節的音訊資料。 此外,上述閾值強度的具體數值還可以根據音訊資料中背景聲音的強弱確定、具體的,如果音訊資料中的背景聲音較強,可以將上述閾值強度設置得相對較高,例如可以設置為0.16.如果,音訊資料中的背景聲音較弱,可以將上述閾值強度設置得相對較低,例如可以設置為0.047。當然,需要說明的是,上述所列舉的確定閾值強度的方式只是為了更好地說明本實施時方式。具體實施時,還可以根據具體的應用場景,選擇採用其他合適的方式確定閾值強度。對此,本發明不作限定。 在從所述樣本資料中截取得到字元的主幹音節的音訊資料後,相應的,上述從所述樣本資料中截取得到相鄰的字元之間的銜接音節的音訊資料,具體實施時,可以包括:截取所述樣本資料中相鄰的字元的主幹音節的音訊資料之間的區域的音訊資料作為所述相鄰的字元之間的銜接音節的音訊資料。 在本實施方式中,進一步考慮到根據人類的語音習慣,在發出關於目標數字序列的語音資料中的第一個字元音節時,在音強為0至第一個字元的主幹音節的音訊資料之間也存在一種起銜接作用的連接音節的音訊資料。因此,具體實施時,還可以截取樣本資料中的音訊資料中起始位置與第一字元有的主幹音節的音訊資料之間的音訊資料作為一種銜接音節的音訊資料,以便後續可以拼接得到效果較好、較為自然流暢的目標數字的音訊資料的起始部分的字元的音訊資料。 在本實施方式中,具體實施時,可以截取樣本資料中音訊資料內兩個相鄰的指定區域之間的區域內的音訊資料作為對應的相鄰字元之間的銜接音節的音訊資料。 在本實施方式中,具體實施時,可以按照上述方式分別對樣本資料中的各個音訊資料進行截取,以獲取所述字元的主幹音節的音訊資料、所述相鄰的字元之間的銜接音節的音訊資料,進而可以儲存所獲取的所述字元的主幹音節的音訊資料、所述相鄰的字元之間的銜接音節的音訊資料,並根據所述字元的主幹音節的音訊資料、所述相鄰的字元之間的銜接音節的音訊資料,建立所述預設的音訊資料庫。 在一個實施方式中,從所述樣本資料中截取得到字元的主幹音節的音訊資料,具體實施時,可以包括以下內容:檢索所述樣本資料中的字元音節標識;根據所述字元音節標識,截取所述樣本資料中所述字元音節標識所標識的範圍中的指定區域的音訊資料作為所述字元的主幹音節的音訊資料。 在一個實施方式中,所述指定區域具體可以理解為在所述字元音節標識所標識的範圍中,以所述字元音節標識所標識的範圍中的中點為中心對稱點,且區域的區間長度與所述字元音節標識所標識的範圍的區間長度的比值等於預設比值的區域。 在一個實施方式中,從所述樣本資料中截取得到相鄰的字元之間的銜接音節的音訊資料,具體實施時,可以包括以下內容:截取所述樣本資料中相鄰的字元的主幹音節的音訊資料之間的區域的音訊資料作為所述相鄰的字元之間的銜接音節的音訊資料。 在一個實施方式中,在從所述樣本資料中截取得到相鄰的字元之間的銜接音節的音訊資料後,為尋找並確定銜接效果較好、較為自然流暢的銜接音節的音訊資料進行儲存,具體實施時,所述方法還可以包括以下內容: S1:檢測所述相鄰的字元之間的銜接音節的音訊資料中是否包括同一相鄰的字元之間的多個銜接音節的音訊資料; S2:在確定所述相鄰的字元之間的銜接音節的音訊資料中包括同一相鄰的字元之間的多個銜接音節的音訊資料的情況下,統計所述同一相鄰的字元之間的多個銜接音節的音訊資料中各種類型的銜接音節的音訊資料的出現頻率,將所述出現頻率最高的類型的銜接音節的音訊資料確定為所述相鄰的字元之間的銜接音節的音訊資料。 在本實施方式中,由於樣本資料大多是由人發出的包含有數字的語音音訊資料,對於同一相鄰的字元之間的多個銜接音節的音訊資料,出現頻率越高對應在人類正常的語音習慣中使用越頻繁,越能吻合人類較為普遍的語音習慣。因此可以將出現頻率最高的類型的銜接音節的音訊資料作為效果較好、較為自然的音訊資料儲存在預設的音訊資料庫中以提高音訊資料庫的準確度。 具體的,可以將同一相鄰的字元之間的多個銜接音節的音訊資料劃分為多種類型,分別統計樣本資料中各種類型的音訊資料的出現頻率,並從多種類型的音訊資料中篩選出現頻率最高的類型的音訊資料作為上述相鄰的字元之間的銜接音節之間的音訊資料,儲存在預設的音訊資料庫中。當然,除了上述所列舉的根據各種類型的音訊資料的出現頻率從同一相鄰的字元之間的多個銜接音節的音訊資料中篩選出效果較好的音訊資料進行儲存外還可以採用其他合適的方式從同一相鄰的字元之間的多個銜接音節的音訊資料中篩選出效果較好的音訊資料進行儲存。例如,還可以分別計算同一相鄰的字元之間的多個銜接音節的音訊資料的MOS值(Mean Opinion Score,平均主觀意見分),根據銜接音節的音訊資料的MOS值,篩選出MOS值最高的銜接音節的音訊資料作為相鄰的字元之間的銜接音節的音訊資料。其中,上述MOS值可以用於較為準確、客觀地評價音訊資料的自然、流暢程度。 類似的,在截取得到多個表徵同一字元的主幹音節的音訊資料時,可以統計同一字元的多個主幹音節的音訊資料中不同類型的主幹音節的音訊資料的出現頻率,進而可以從同一子符的多種類型的主幹音節的音訊資料中篩選出出現頻率最高的音訊資料作為該字元的主幹音節的音訊資料並儲存至預設的音訊資料庫中。也可以分別確定同一字元的多個主幹音節的音訊資料的MOS值,篩選出MOS值最高的音訊資料作為該字元的主幹音節的音訊資料並儲存至預設的音訊資料庫中等。 在一個實施方式中,為了得到較為完整的語音音訊資料進行包含有目標數字序列的語音播報,在得到所述目標數字序列的音訊資料後,所述方法具體實施時還可以包括以下內容: S1:獲取預設的前置音訊資料,其中,所述預設的前置音訊資料用於指示所述目標數字序列所表徵的資料對象; S2:將所述預設的前置音訊資料和所述目標數字序列的音訊資料進行拼接,得到待播放的語音音訊資料; S3:播放所述待播放的語音音訊資料。 在本實施方式中,上述預設的前置音訊資料具體可以是用於指示目標數字序列所表徵的資料對象等內容的音訊資料。例如,對於到帳金額播報而言,上述預設的前置音訊資料可以包括設置在金額數字之前的語音音訊資料“帳戶到帳”,以及設置在金額數字之後的語音音訊資料“元”。對於股票價格播報而言,上述預設的前置音訊資料可以包括設置在價格數字之前的語音音訊資料“XX股票的最新單價是”,以及設置在價格數字之後的語音音訊資料“元每股”。當然,上述所列舉的預設的前置音訊資料只是一種示意性說明。具體實施時,還可以根據具體的應用場景,設置其他的音訊資料作為上述預設的前置音訊資料。對此,本發明不作限定。 在本實施方式中,需要說明的是,通常所播報的語音資料中前置音訊資料往往較為固定,變化的只是語音資料中待播報的目標數字序列。以到帳金額播報為例,不同的到帳金額的語音播報數據中前置音訊資料都是相同。例如,“帳戶到帳金額為五十四元”、“帳戶到帳金額為七十九元”中前置音訊資料完全相同都是“帳戶到帳金額為”,以及“元”,不同只是待播報的金額數字。因此,具體實施時,為了提高處理效率,可以預先設置儲存對應的前置音訊資料,再產生了目標數字序列的音訊資料後,可以將預設的前置音訊資料與所產生的目標數字序列的音訊資料直接進行拼接組合,得到待播放的語音音訊資料,進行語音播放。從而可以避免對內容相同的前置音訊資料進行重複的音訊資料合成,提高處理效率,使得本發明提供的播報語音的確定方法更加適用於資料處理能力有限的嵌入式系統,例如手機等播報語音的確定設備。 具體的,例如,在得到了目標數字序列“54”的音訊資料後,可以先調用預設設置好的前置音訊資料“帳戶到帳金額為”、“元”;再按照一定的順序將目標數字序列“54”的音訊資料與預設的前置音訊資料進行拼接組合。具體的,可以在“帳戶到帳金額為”的音訊資料後連接目標數字序列“54”的音訊資料,在在目標數字序列“54”的音訊資料後連接“元”,從而得到了較為完整的,包含有目標數字序列的到帳金額的語音播報數據。 
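For illustration of the splicing of preset leading audio data with the audio data of the target digital sequence described above, the following is a minimal Python sketch. The stored phrases, their placeholder audio values and the function name are hypothetical.

# Hypothetical pre-stored leading/unit audio data (placeholder sample lists).
PREFIX_AUDIO = {"帳戶到帳金額為": [0.1, 0.1, 0.1], "元": [0.2, 0.2]}

def assemble_announcement(number_audio):
    """Splice the fixed leading phrase, the already-spliced audio data of the
    target digital sequence, and the unit word, in that order."""
    return PREFIX_AUDIO["帳戶到帳金額為"] + number_audio + PREFIX_AUDIO["元"]

audio_54 = [0.3, 0.4, 0.3]                # stand-in for the spliced audio of "五十四"
print(assemble_announcement(audio_54))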
在一個實施方式中,所述預設的前置音訊資料具體可以包括以下至少之一:用於播報到帳金額的前置用語的音訊資料、用於播報行駛里程的前置用語的音訊資料、用於播報股票價格的前置用語的音訊資料等。當然,需要說明的是,上述所列舉的預設的前置音訊資料只是為了更好地說明本實施方式。具體實施時,根據具體的應用場景和要求,還可以選擇使用其他的預設的音訊資料作為上述預設的前置資料。對此,本發明不作限定。 由上可見,本發明實施例提供的播報語音的確定方法通過獲取相鄰的字元之間的銜接音節的音訊資料,並利用相鄰的字元之間的銜接音節的音訊資料拼接對應的字元的主幹音節的音訊資料,得到過渡更為自然的語音音訊資料,以進行語音播報,從而解決了現有方法中存在的數字播報不自然、用戶體驗差的問題,達到能兼顧運算成本,高效、流暢地進行有關數字的語音播報;還通過獲取包含有數字的樣本資料,從樣本資料中截取指定區域內的音訊資料作為字元的主幹音節的音訊資料,進而截取字元的主幹音節的音訊資料之間的音訊資料作為相鄰字元之間的銜接音節的音訊資料,從而可以建立較為準確的預設的音訊資料庫,以便可以通過檢索上述預設的音訊資料庫,產生更為自然、流暢的目標數字序列的音訊資料。 參閱圖8所示,本發明提供了一種播報語音的確定方法,其中,該方法具體應用於播報語音的確定設備一側。具體實施時,該方法可以包括以下內容。 S801:獲取待播放的字串,其中,所述字串包括多個按照預設順序排列的字元; S803:獲取所述字串中的各個字元的主幹音節的音訊資料,以及所述字串中的相鄰的字元之間的銜接音節的音訊資料,其中,所述銜接音節用於連接相鄰的字元的主幹音節; S805:按照預設順序拼接所述字元的主幹音節的音訊資料和所述相鄰的字元之間的銜接音節的音訊資料,得到所述待播放的字串的音訊資料。 在本實施方式中,上述待播放的字串具體可以是待播放的數字序列的字串,也可以是待播放的文字資訊的字串。具體實施時,可以根據具體應用場景和實施要求選擇相應內容的字串作為上述待播放的字串。對於上述待播放的字串所表徵的具體內容,本發明不作限定。 本發明實施例還提供了一種播報語音的確定設備,包括處理器以及用於儲存處理器可執行指令的記憶體,所述處理器具體實施時可以根據指令執行以下步驟:獲取待播報的目標數字序列;將所述目標數字序列轉換為字串,其中,所述字串包括多個按照預設順序排列的字元;獲取所述字串中的各個字元的主幹音節的音訊資料,以及所述字串中的相鄰的字元之間的銜接音節的音訊資料,其中,所述銜接音節用於連接相鄰的字元的主幹音節;按照預設順序拼接所述字元的主幹音節的音訊資料和所述相鄰的字元之間的銜接音節的音訊資料,得到所述目標數字序列的音訊資料。 為了能夠更加準確地完成上述指令,參閱圖9,本發明還提供了另一種具體的播報語音的確定設備,其中,所述播報語音的確定設備包括輸入介面901、處理器902以及記憶體903,上述結構通過內部線纜相連,以便各個結構可以進行具體的資料交互。 其中,所述輸入介面901,具體可以用於輸入待播報的目標數字序列。 所述處理器902,具體可以用於將所述目標數字序列轉換為字串,其中,所述字串包括多個按照預設順序排列的字元;獲取所述字串中的各個字元的主幹音節的音訊資料,以及所述字串中的相鄰的字元之間的銜接音節的音訊資料,其中,所述銜接音節用於連接相鄰的字元的主幹音節;按照預設順序拼接所述字元的主幹音節的音訊資料和所述相鄰的字元之間的銜接音節的音訊資料,得到所述目標數字序列的音訊資料。 所述記憶體903,具體可以用於儲存經輸入介面901輸入的待播報的目標數字序列、預設的音訊資料庫,以及儲存相應的指令程式。 在本實施方式中,所述輸入介面901具體可以是一種支援播報語音的確定設備獲取,並從所獲取的資訊資料中提取待播報的目標資料序列的單元、模組。 在本實施方式中,所述處理器902可以按任何適當的方式實現。例如,處理器可以採取例如微處理器或處理器以及儲存可由該(微)處理器執行的電腦可讀程式碼(例如軟體或韌體)的電腦可讀介質、邏輯閘、開關、專用積體電路(Application Specific Integrated Circuit,ASIC)、可程式化邏輯控制器和嵌入微控制器的形式等等。本發明並不作限定。 在本實施方式中,所述記憶體903可以包括多個層次,在數字系統中,只要能儲存二進位資料的都可以是記憶體;在積體電路中,一個沒有實物形式的具有儲存功能的電路也叫記憶體,如RAM、FIFO等;在系統中,具有實物形式的存放裝置也叫記憶體,如記憶體條、TF卡等。 本發明實施例還提供了一種基於上述支付方法的電腦儲存介質,所述電腦儲存介質儲存有電腦程式指令,在所述電腦程式指令被執行時實現:將所述目標數字序列轉換為字串,其中,所述字串包括多個按照預設順序排列的字元;獲取所述字串中的各個字元的主幹音節的音訊資料,以及所述字串中的相鄰的字元之間的銜接音節的音訊資料,其中,所述銜接音節用於連接相鄰的字元的主幹音節;按照預設順序拼接所述字元的主幹音節的音訊資料和所述相鄰的字元之間的銜接音節的音訊資料,得到所述目標數字序列的音訊資料。 在本實施方式中,上述儲存介質包括但不限於隨機存取記憶體(Random Access Memory, RAM)、唯讀記憶體(Read-Only Memory, ROM)、快取(Cache)、硬碟(Hard Disk Drive, HDD)或者儲存卡(Memory Card)。所述記憶體可以用於儲存電腦程式指令。網路通信單元可以是依照通信協定規定的標準設定的,用於進行網路連接通信的介面。 在本實施方式中,該電腦儲存介質儲存的程式指令具體實現的功能和效果,可以與其它實施方式對照解釋,在此不再贅述。 參閱圖10,在軟體層面上,本發明實施例還提供了一種播報語音的確定裝置,該裝置具體可以包括以下的結構模組: 第一獲取模組1001,具體可以用於獲取待播報的目標數字序列; 轉換模組1002,具體可以用於將所述目標數字序列轉換為字串,其中,所述字串包括多個按照預設順序排列的字元; 第二獲取模組1003,具體可以用於獲取所述字串中的各個字元的主幹音節的音訊資料,以及所述字串中的相鄰的字元之間的銜接音節的音訊資料,其中,所述銜接音節用於連接相鄰的字元的主幹音節; 拼接模組1004,具體可以用於按照預設順序拼接所述字元的主幹音節的音訊資料和所述相鄰的字元之間的銜接音節的音訊資料,得到所述目標數字序列的音訊資料。 在一個實施方式中,所述第二獲取模組1003具體可以包括以下結構單元: 識別單元,具體可以用於識別所述字串中的各個字元,並確定所述字串中的相鄰的字元之間的連接關係,其中,所述字串中的相鄰的字元之間的連接關係用於指示字串中的相鄰的字元之間的先後連接順序; 第一獲取單元,具體可以用於根據所述字串中的各個字元,從預設的音訊資料庫中檢索並獲取各個字元的主幹音節的音訊資料,其中,所述預設的音訊資料庫中儲存有字元的主幹音節的音訊資料和相鄰的字元之間的銜接音節的音訊資料; 第二獲取單元,具體可以用於根據所述字串中的相鄰的字元之間的連接關係,從預設的音訊資料庫中檢索並獲取所述字串中的相鄰的字元之間的銜接音節的音訊資料。 在一個實施方式中,為了預先準備好需要使用的預設的音訊資料庫,具體實施時,所述裝置還可以包括建立模組,具體可以用於建立預設的音訊資料庫。 在一個實施方式中,所述建立模組具體實施時,可以包括以下結構單元: 第三獲取單元,具體可以用於獲取包含有數字的音訊資料作為樣本資料; 第一截取單元,具體可以用於從所述樣本資料中截取得到字元的主幹音節的音訊資料; 第二截取單元,具體可以用於從所述樣本資料中截取得到相鄰的字元之間的銜接音節的音訊資料; 建立單元,具體可以用於根據所述字元的主幹音節的音訊資料、所述相鄰的字元之間的銜接音節的音訊資料,建立所述預設的音訊資料庫。 在一個實施方式中,所述裝置具體實施時,還可以包括播放模組,具體可以用於獲取預設的前置音訊資料,其中,所述預設的前置音訊資料用於指示所述目標數字序列所表徵的資料對象;將所述預設的前置音訊資料和所述目標數字序列的音訊資料進行拼接,得到待播放的語音音訊資料;播放所述待播放的語音音訊資料。 
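As a purely structural illustration of the modules 1001 to 1004 of FIG. 10, the following Python sketch mirrors the division into acquisition, conversion, audio retrieval and splicing. The class, its method names and the simplified conversion (which keeps the digit characters instead of converting them into readable characters) are hypothetical and are not the patent's implementation.

class BroadcastVoiceDeterminationDevice:
    """Structural sketch of the modules 1001-1004 of FIG. 10; the bodies are
    placeholders, not the patent's implementation."""

    def __init__(self, trunk_audio, link_audio):
        self.trunk_audio = trunk_audio            # preset database: trunk-syllable audio
        self.link_audio = link_audio              # preset database: linking-syllable audio

    def acquire_target_sequence(self, message):   # first acquisition module 1001
        return "".join(ch for ch in message if ch.isdigit())

    def convert_to_characters(self, digits):      # conversion module 1002
        return list(digits)                       # simplified: a real conversion maps digits to readable characters

    def acquire_audio(self, chars):               # second acquisition module 1003
        trunks = [self.trunk_audio[c] for c in chars]
        links = [self.link_audio.get(pair, []) for pair in zip(chars, chars[1:])]
        return trunks, links

    def splice(self, trunks, links):              # splicing module 1004
        audio = list(trunks[0])
        for link, trunk in zip(links, trunks[1:]):
            audio.extend(link)
            audio.extend(trunk)
        return audio

device = BroadcastVoiceDeterminationDevice(
    trunk_audio={"5": [0.1], "4": [0.2]},
    link_audio={("5", "4"): [0.05]},
)
digits = device.acquire_target_sequence("帳戶到帳54元")
chars = device.convert_to_characters(digits)
print(device.splice(*device.acquire_audio(chars)))   # [0.1, 0.05, 0.2]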
在一個實施方式中,所述預設的前置音訊資料具體可以包括以下至少之一:用於播報到帳金額的前置用語的音訊資料、用於播報行駛里程的前置用語的音訊資料、用於播報股票變化值的前置用語的音訊資料等。當然,需要說明的是上述所列舉的前置音訊資料只是一種示意性說明。具體實施時,還可以根據具體的應用場景和要求,選擇或者獲取其他合適的音訊資料作為上述預設的前置音訊資料。對此,本發明不作限定。 需要說明的是,上述實施例闡明的單元、裝置或模組等,具體可以由電腦晶片或實體實現,或者由具有某種功能的產品來實現。為了描述的方便,描述以上裝置時以功能分為各種模組分別描述。當然,在實施本發明時可以把各模組的功能在同一個或多個軟體和/或硬體中實現,也可以將實現同一功能的模組由多個子模組或子單元的組合實現等。以上所描述的裝置實施例僅僅是示意性的,例如,所述單元的劃分,僅僅為一種邏輯功能劃分,實際實現時可以有另外的劃分方式,例如多個單元或元件可以結合或者可以整合到另一個系統,或一些特徵可以忽略,或不執行。另一點,所顯示或討論的相互之間的耦合或直接耦合或通信連接可以是通過一些介面,裝置或單元的間接耦合或通信連接,可以是電性,機械或其它的形式。 由上可見,本發明實施例提供的播報語音的確定裝置通過第二獲取模組獲取相鄰的字元之間的銜接音節的音訊資料,並通過拼接模組利用相鄰的字元之間的銜接音節的音訊資料拼接對應的字元的主幹音節的音訊資料,得到過渡更為自然的語音音訊資料,以進行語音播報,從而解決了現有方法中存在的數字播報不自然、用戶體驗差的問題,達到能兼顧運算成本,高效、流暢地進行有關數字的語音播報;還通過建立模組獲取包含有數字的樣本資料,從樣本資料中截取指定區域內的音訊資料作為字元的主幹音節的音訊資料,進而截取字元的主幹音節的音訊資料之間的音訊資料作為相鄰字元之間的銜接音節的音訊資料,從而可以建立較為準確的預設的音訊資料庫,以便可以通過檢索上述預設的音訊資料庫,產生更為自然、流暢的目標數字序列的音訊資料。 雖然本發明提供了如實施例或流程圖所述的方法操作步驟,但基於常規或者無創造性的手段可以包括更多或者更少的操作步驟。實施例中列舉的步驟順序僅僅為眾多步驟執行順序中的一種方式,不代表唯一的執行順序。在實際中的裝置或用戶端產品執行時,可以按照實施例或者圖式所示的方法循序執行或者並存執行(例如並行處理器或者多執行緒處理的環境,甚至為分散式資料處理環境)。術語“包括”、“包含”或者其任何其他變體意在涵蓋非排他性的包含,從而使得包括一系列要素的過程、方法、產品或者設備不僅包括那些要素,而且還包括沒有明確列出的其他要素,或者是還包括為這種過程、方法、產品或者設備所固有的要素。在沒有更多限制的情況下,並不排除在包括所述要素的過程、方法、產品或者設備中還存在另外的相同或等同要素。第一,第二等詞語用來表示名稱,而並不表示任何特定的順序。 本領域技術人員也知道,除了以純電腦可讀程式碼方式實現控制器以外,完全可以通過將方法步驟進行邏輯程式設計來使得控制器以邏輯閘、開關、專用積體電路、可程式化邏輯控制器和嵌入微控制器等的形式來實現相同功能。因此這種控制器可以被認為是一種硬體元件,而對其內部包括的用於實現各種功能的裝置也可以視為硬體元件內的結構。或者甚至,可以將用於實現各種功能的裝置視為既可以是實現方法的軟體模組又可以是硬體元件內的結構。 本發明可以在由電腦執行的電腦可執行指令的一般上下文中描述,例如程式模組。一般地,程式模組包括執行特定任務或實現特定抽象資料類型的常式、程式、對象、元件、資料結構、類等等。也可以在分散式運算環境中實踐本發明,在這些分散式運算環境中,由通過通信網路而被連接的遠端處理設備來執行任務。在分散式運算環境中,程式模組可以位於包括存放裝置在內的本地和遠端電腦儲存介質中。 通過以上的實施方式的描述可知,本領域的技術人員可以清楚地瞭解到本發明可借助軟體加必需的通用硬體平台的方式來實現。基於這樣的理解,本發明的技術方案本質上或者說對現有技術做出貢獻的部分可以以軟體產品的形式體現出來,該電腦軟體產品可以儲存在儲存介質中,如ROM/RAM、磁碟、光碟等,包括若干指令用以使得一台電腦設備(可以是個人電腦,移動終端,伺服器,或者網路設備等)執行本發明各個實施例或者實施例的某些部分所述的方法。 本發明中的各個實施例採用遞進的方式描述,各個實施例之間相同或相似的部分互相參見即可,每個實施例重點說明的都是與其他實施例的不同之處。本發明可用於眾多通用或專用的電腦系統環境或配置中。例如:個人電腦、伺服器電腦、手持設備或可擕式設備、平板型設備、多處理器系統、基於微處理器的系統、機上盒、可程式化的電子設備、網路PC、小型電腦、大型電腦、包括以上任何系統或設備的分散式運算環境等等。 雖然通過實施例描繪了本發明,本領域普通技術人員知道,本發明有許多變形而不脫離本發明的精神,希望所附的申請專利範圍包括這些變形和變化而不脫離本發明的精神。In order to enable those skilled in the art to better understand the technical solutions in the present invention, the technical solutions in the embodiments of the present invention will be described clearly and completely in conjunction with the drawings in the embodiments of the present invention. Obviously, the described The embodiments are only a part of the embodiments of the present invention, but not all the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention. Considering that the existing methods for determining broadcast speech often do not deeply analyze the language habits and speech characteristics of human normal speaking. For example, when a person speaks the number "sixteen", after issuing the vowel syllable "ten" and before issuing the vowel syllable "six", they usually also emit a syllable "ten" that connects the above two characters. "Six" connection syllable. Moreover, there are often differences in the connection syllables between different vowel syllables. For example, the connection syllable between the vowel syllable "five" and the character syllable "ten" in "Fifty" is also different from the connection syllable between the character syllable "ten" and the character syllable "five" in "Fifteen". . 
The linking syllables described above do not correspond to any specific character and do not represent any specific content or meaning; rather, they act like a connecting particle that joins the adjacent character syllables of normal human speech naturally and fluently, so that the listener can better receive and understand the information in what the speaker says. Because existing methods for determining broadcast voice do not take these human speech habits and characteristics into account, when synthesizing the voice audio data of the target digital sequence to be broadcast they usually intercept only the audio data of the main body of the character syllable of each corresponding digit character from the sample data and splice it directly. Since there is then no natural transition between adjacent character syllables that matches human speech habits, the voice audio data of the target digital sequence produced by such methods often does not sound as natural and fluent as a number spoken by a person, and may even affect the listener's understanding of the broadcast digital content, causing inconvenience in use. Therefore, existing methods often suffer from unnatural digital broadcasting and poor user experience in practice. Addressing the root cause of these problems, the present invention thoroughly and comprehensively analyses the language habits and speech characteristics of normal human speech, and pays attention to the existence and role of the linking syllables between adjacent character syllables in normal speech. When the preset audio database is built, not only is the audio data of the trunk syllables of the character syllables intercepted and stored, but the audio data of the linking syllables between adjacent character syllables is also deliberately intercepted and stored. Then, when the voice audio data of a specific number is generated, the audio data of the trunk syllable of each character corresponding to the number and the audio data of the linking syllables between adjacent characters are acquired together, and the audio data of the linking syllable between two adjacent characters is used to splice the audio data of the trunk syllables of those two characters. In the generated voice audio data, the transitions between adjacent character syllables are therefore more natural and fluent, which solves the problems of unnatural digital broadcasting and poor user experience in existing methods and achieves efficient and fluent voice broadcasting of numbers while keeping the computational cost low.
Based on the above reasons, the embodiments of the present invention provide a broadcast voice determining device capable of efficiently and naturally performing digital voice broadcast, through which the broadcast voice determining device can realize the following functions: acquiring a target digital sequence to be broadcast; The target digital sequence is converted into a character string, wherein the character string includes a plurality of characters arranged in a preset order; acquiring audio data of the main syllables of each character in the character string, and the Audio data of cohesive syllables between adjacent characters, wherein the cohesive syllables are used to connect the main syllables of adjacent characters; the audio data of the main syllables of the characters and the The audio data of the connected syllables between adjacent characters obtains the audio data of the target digital sequence; and plays the audio data of the target digital sequence. In this embodiment, the voice announcement determining device may be a relatively simple electronic device used on the user side. Specifically, the voice broadcast determining device may be an electronic device with data calculation, voice playback function and network interaction function; it may also be an electronic device running in the electronic device for data processing, voice playback and network interaction, etc. Provide supported software applications. Specifically, the device for determining the broadcast voice may be, for example, a desktop computer, a tablet computer, a notebook computer, a smart phone, a digital assistant, a smart wearable device, and a shopping guide terminal. Alternatively, the device for determining the broadcast voice may also be a software application that can run in the electronic device. For example, the above-mentioned device for determining the broadcast voice may also be the XX Bao app running on a smartphone. In an example of a scenario, the device for determining a broadcast voice by applying the method for determining a broadcast voice provided by an embodiment of the present invention can automatically broadcast the amount of money instantly received by the account of Merchant A for Merchant A. In this embodiment, the merchant A can use his own mobile phone as the device for determining the broadcast voice. Before the specific implementation, the merchant A may first associate the mobile phone number with the merchant A account on a payment platform through the setting operation of the mobile phone. As shown in FIG. 1, generally, after spending in the store of the merchant A, the consumer can directly make a checkout payment online through the payment software of a payment platform on the mobile phone, and does not need to make a face-to-face payment with the merchant online or offline. Specifically, the consumer can use the mobile phone to communicate with the server of a payment platform, and transfer the money due to merchant A to the account of merchant A through the payment platform to complete the checkout payment. 
After confirming that the merchant A’s account has received the money transferred by the consumer through online transfer, the server of the payment platform will send the account reminder information to the merchant A’s mobile phone (for example, the SMS message sent to the account, or on the merchant A’s mobile phone) The corresponding payment reminder dialog box is pushed in the payment APP of the app to remind merchant A: the consumer has already made a checkout payment online, and also identifies the money received by merchant A's account in the prompt information The specific amount of money, so that merchant A can further confirm whether the amount of money paid by consumers online is accurate. For example, when the server of the payment platform can confirm that the merchant A’s account receives the 54 yuan of money transferred by the consumer online, he can send a reminder message including the following to the mobile phone associated with the merchant A’s account: 54 yuan." Usually during business hours, merchants will be relatively busy, and they may not have the time to read and read the above prompt information in time, so it is inconvenient to confirm whether the consumer has made the checkout payment online and the consumer is online. Whether the amount of the payment is accurate. At this time, the merchant hopes to be able to broadcast the specific amount of money received by his account in real time through the mobile phone, so that even if the business is busy during the business period, the merchant has no time to go through and confirm the prompt information sent by the server of the payment platform. It can also know the specific situation of consumers' payment through the payment platform in time. After receiving the prompt information sent by the payment platform, the mobile phone can first parse the prompt information and extract the amount digit "54" in the prompt information as the target digital sequence to be broadcast, so as to subsequently determine the audio data corresponding to the digital sequence Perform voice broadcast. In this embodiment, the above prompt information is usually generated according to a fixed rule, so it has a relatively uniform format. For example, in the example of this scenario, the above prompt information may be formed in the following format: pre-leader part (that is, "account to account") + numeric part (that is, the specific amount "54") + unit part (that is, "yuan "). Therefore, when obtaining the target digital sequence to be broadcast, that is, the specific content of the digital part of the prompt information, the prompt information can be parsed and split according to the parsing rules corresponding to the above fixed rules for generating the prompt information, that is, the prompt information can be In the digital part of, the number to be broadcast is extracted, that is, the target digital sequence. In this embodiment, it should be noted that for different prompt information, the content of the preamble and the unit part are usually the same, and only the content of the digital part will be different according to the different prompt information. Therefore, it is possible to generate and store the audio data of the unified preamble part and the audio data of the unit part in advance. 
When the prompt information is broadcast, only the audio data of the digit part of the prompt information needs to be generated; it is then spliced with the pre-stored audio data of the preamble part and of the unit part to obtain the complete voice audio data of the prompt information. After obtaining the target digital sequence to be broadcast, the mobile phone can first convert the target digital sequence into a corresponding character string. Here, the character string can be understood as a string that represents the character syllables of the target digital sequence and is arranged in the order corresponding to the target digital sequence (that is, the preset order); each character in the string corresponds to one character syllable of the target digital sequence. For example, the character string obtained by converting the target digital sequence "54" can be expressed as "五十四" ("fifty-four"). The string "五十四" represents the character syllables of the target digital sequence "54": the characters "五" (five) and "十" (ten) correspond to the digit "5" in the tens place, and the character "四" (four) corresponds to the digit "4" in the units place. The characters in the string are arranged in the preset order corresponding to the order of the digits in "54" (first "5", then "4"): the characters "五" and "十" for the "5" in the tens place come first, followed by the character "四" for the "4" in the units place. Of course, the character strings and preset orders listed above are only intended to better explain the embodiments of the present invention. In specific implementations, other forms of character strings and preset rules may be used depending on the scenario, or the target digital sequence may be identified and spliced directly without conversion. The present invention is not limited in this regard. After obtaining the character string corresponding to the target digital sequence, the mobile phone can identify the characters arranged in order in the string and determine the connection relationships between adjacent characters. The connection relationship between adjacent characters can be understood as identification information for the order of two adjacent characters. For example, the characters "五" and "十" in the string "五十四" are adjacent, and the connection relationship between them can be expressed as: character "五" followed by character "十". Of course, the connection relationships listed above are only schematic; in specific implementations, the connection relationships between adjacent characters may also be expressed in other ways. The present invention is not limited in this regard. In this embodiment, the mobile phone determines by character recognition that the characters in the string are, in order, "五", "十" and "四", and that the connection relationships between adjacent characters are: character "五" followed by "十", and character "十" followed by "四".
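As an illustration of how the connection relationships between adjacent characters described above could be represented, the following minimal Python sketch derives the ordered pairs of adjacent characters, which can then serve as keys for looking up linking-syllable audio data. The representation is hypothetical; the patent does not prescribe a particular encoding.

def connection_relationships(characters):
    """Ordered pairs of adjacent characters, e.g. for "五十四":
    [("五", "十"), ("十", "四")]. The order matters: the linking syllable
    for ("五", "十") differs from the one for ("十", "五")."""
    return list(zip(characters, characters[1:]))

print(connection_relationships(list("五十四")))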
Further, the mobile phone can retrieve from the preset audio database according to the recognized characters and the connection relationship between the adjacent characters to obtain the relationship between each character and the adjacent characters The audio data corresponding to the connection relationship between the two is to obtain the audio data of the main syllables of each character in the string, and the audio data of the connecting syllables between adjacent characters. Among them, the main syllables of the above characters can be understood as the main part of the syllables of the character, usually the syllables of this part have a higher degree of recognition, and the main syllables of the same character syllable are more consistent in the basic frequency and sound intensity. , Is approximately the same, so you can extract the main syllables of syllables to distinguish other syllables. For example, when a person utters the speech corresponding to the character "five", the middle part of the speech is the main part of the syllable of the character, that is, the main syllable. Usually, when different people send the speech corresponding to the character "five", although there are differences, But the main syllables are mostly the same. The cohesive syllables between the adjacent characters can be specifically understood as the syllables used to connect the main syllables of the adjacent characters. For example, when a person pronounces "fifty", the connection between the main syllable of the character "five" and the main syllable of the character "ten" is the character "five" and the character "ten" The connection between syllables. This part of the syllable is different from the main syllable itself and has no specific meaning, nor is it used to characterize a specific character, but the waveform data in the audio data is not 0. In human speech habits, it usually appears between the main characters of adjacent characters, which plays the role of inheritance and transition, so that people can say something different from machine pronunciation, instead of monotonous and dull The main syllables of the vowels are simply connected, but they transition from one syllable syllable to another naturally and smoothly. The numbers broadcasted in this way are more in line with human hearing and listening habits, which is convenient for human reception and understanding, and at the same time, it makes listeners feel more comfortable and experience better when listening. It should also be added that the contiguous syllables between different adjacent characters (including different characters, and the same character in different order, etc.) are often different. For example, the connection syllables between the characters "five" and "ten" and the connection syllables between "five" and "hundred", and the connection syllables between "ten" and "five" are on the waveform of the corresponding audio data There are differences between each other. Therefore, in this embodiment, it is necessary to accurately obtain the audio data of the corresponding connected syllable by using the connection relationship between adjacent characters. 
The preset audio database described above may be a database created in advance by the platform server and stored on the server or on the device for determining the broadcast voice; it contains the audio data of the trunk syllable of each character and the audio data of the linking syllables between adjacent characters. Specifically, according to the identified characters and the connection relationships between adjacent characters, the mobile phone can retrieve from the preset audio database the audio data A of the trunk syllable of the character "五", the audio data B of the trunk syllable of "十", the audio data C of the trunk syllable of "四", as well as the audio data f of the linking syllable between the characters "五" and "十" and the audio data r of the linking syllable between the characters "十" and "四". The mobile phone can then splice the audio data of the trunk syllables of the characters and the audio data of the linking syllables between adjacent characters according to the order of the characters in the string (that is, the preset order) to obtain the audio data corresponding to the target digital sequence. Specifically, the audio data of the trunk syllables of the characters can be arranged in the preset order (that is, the order of the characters in the string of the target digital sequence), and the audio data of the linking syllables between adjacent characters can then be used to connect the audio data of the trunk syllables of the adjacent characters. For example, referring to FIG. 2, according to the order of the characters in the string "五十四", the audio data A of the trunk syllable of "五" is arranged first, then the audio data B of the trunk syllable of "十", and finally the audio data C of the trunk syllable of "四". After the trunk-syllable audio data has been arranged, the audio data f of the linking syllable between "五" and "十" is used to connect audio data A and audio data B, and the audio data r of the linking syllable between "十" and "四" is used to connect audio data B and audio data C. The resulting spliced audio data for the target digital sequence "54" can be expressed as "A-f-B-r-C". In this way, audio data for the target digital sequence with more natural transitions is obtained. After the audio data of the target digital sequence has been obtained, the preset leading audio data stored in advance on the mobile phone or the server and used to indicate the data object represented by the target digital sequence (for example, the audio data of the preamble part and of the unit part) can be spliced with the audio data of the target digital sequence to obtain the voice audio data to be played, and the mobile phone then plays the corresponding content according to this voice audio data. In this embodiment, referring to FIG.
3, the mobile phone of the merchant A can obtain the pre-set audio data preset and stored locally on the mobile phone, that is, the pre-set audio data Y for expressing “account to account” and the user To express the audio data Z of "Yuan"; splice the above pre-audio data with the generated audio data about the target digital sequence "54" to obtain the complete voice audio data to be played, which can be expressed as "YAfBrCZ", Furthermore, the above voice and audio information is played, so that the merchant A can hear the voice announcement clearly, naturally, and smoothly, and more in line with the normal listening habits of human beings, so as to avoid the influence of the machine voice on the merchant's listening experience. It can be seen from the above that the method for determining the broadcast speech provided by the embodiment of the present invention obtains the audio data of the connecting syllables between adjacent characters, and uses the audio data of the connecting syllables between the adjacent characters to splice the corresponding characters The audio data of the main syllables of the yuan is obtained as a more natural voice audio data for voice broadcast, thereby solving the problems of unnatural digital broadcast and poor user experience in the existing methods, achieving the ability to take into account the operation cost, high efficiency, Smooth voice announcements about numbers. In another scenario example, the server of the payment platform may create a preset audio database in advance, and send the above-mentioned preset audio database to the voice broadcast determining device. When the voice broadcast confirming device receives the preset audio database, the preset audio database can be stored locally in the broadcast voice confirming device, so that the voice broadcast confirming device can be obtained by retrieving the default audio database The audio data of the main syllable of each character in the character string of the target digital sequence, and the audio data of the connecting syllables between adjacent characters in the character string. Of course, after establishing the default audio database, the server of the payment platform may not send the default audio database to the voice confirmation device, but store it on the server side, and the voice confirmation device When generating the audio data of the target digital sequence, the audio data of the main syllables of each character in the string of the target digital sequence can be obtained by calling the default audio database stored on the server side, and the character The audio data of the connecting syllables between adjacent characters in the string. In this embodiment, during specific implementation, the server can obtain audio data containing numbers as sample data. Furthermore, according to certain rules, the audio data of the main syllables of the characters and the audio data of the connecting syllables between adjacent characters can be intercepted from the marked sample data, and then the audio information of the main syllables of the above characters can be obtained. Data, and the audio data of connecting syllables between adjacent characters, to create a default audio database. Specifically, the acquiring of audio data containing numbers as sample data may include: intercepting the broadcast audio data of the announcer including audio data containing broadcast content related to numbers as the sample data. 
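Returning to the splicing of FIG. 2 and FIG. 3 described above, the following self-contained Python sketch illustrates the operation at the waveform level: placeholder segments standing in for the trunk-syllable audio data A, B, C and the linking-syllable audio data f, r are concatenated in the order A-f-B-r-C and written to a WAV file (leading audio data Y and Z could be prepended and appended in the same way). The sine bursts, the sample rate and the file name are assumptions made only for this sketch.

import wave
import numpy as np

SR = 16000                                     # sample rate in Hz, assumed for this sketch

def tone(freq_hz, dur_s):
    """Placeholder 'syllable': a short sine burst standing in for recorded audio."""
    t = np.arange(int(SR * dur_s)) / SR
    return 0.5 * np.sin(2 * np.pi * freq_hz * t)

A, B, C = tone(220, 0.20), tone(330, 0.20), tone(440, 0.20)   # trunk syllables of 五, 十, 四
f, r = tone(275, 0.05), tone(385, 0.05)                        # linking syllables

spliced = np.concatenate([A, f, B, r, C])                      # A-f-B-r-C

with wave.open("target_sequence_54.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)                          # 16-bit samples
    w.setframerate(SR)
    w.writeframes((np.clip(spliced, -1, 1) * 32767).astype(np.int16).tobytes())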
It is also possible to collect voice data read aloud by a person according to a preset text as the above-mentioned sample data, where the preset text may be preset text content containing a variety of digit combinations. After the sample data is obtained, it can also be marked. Specifically, as shown in FIG. 4, character syllable identifiers can be used in the acquired sample data to mark the range area of the audio data corresponding to each character syllable. For example, for the audio data of "56" in the sample data, the identifiers "5", "10" and "6" can be used as the character syllable identifiers of the character syllables "five", "ten" and "six" respectively, to mark the range areas that the character syllables "five", "ten" and "six" occupy in the audio data. Of course, it should be noted that the character syllable identifiers listed above are only schematic illustrations and should not constitute an undue limitation on the present invention. Further, intercepting the audio data of the main syllables of the characters from the sample data may specifically include: retrieving the character syllable identifiers in the sample data; and intercepting, according to each character syllable identifier, the audio data of the designated area within the range identified by that character syllable identifier as the audio data of the main syllable of the character. Specifically, the character syllable identifiers of the audio data in the sample data are retrieved and determined; the area range of each character syllable in the audio data, that is, the range identified by the character syllable identifier, is then determined according to those identifiers; the audio data of the designated area is then intercepted from that range according to preset rules as the audio data of the main syllable of the character. For example, for the audio data of "56" in the sample data, the character syllable identifiers "5", "10" and "6" can be retrieved first; the area range of the character syllable "five" in the audio data can then be determined according to the identifier "5", the area range of the character syllable "ten" according to the identifier "10", and the area range of the character syllable "six" according to the identifier "6"; the audio data of the designated area can then be intercepted from the area where the character syllable "five" is located as the audio data of the main syllable of the character "five", from the area where the character syllable "ten" is located as the audio data of the main syllable of the character "ten", and from the area where the character syllable "six" is located as the audio data of the main syllable of the character "six".
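The marked sample data described above could, under one possible reading, be represented as labelled sample ranges. The sketch below assumes such a label format (character name plus start and end sample indices); the field names and values are illustrative only.

```python
# A minimal sketch of character-syllable labels over one sample utterance.

from dataclasses import dataclass

@dataclass
class SyllableLabel:
    character: str   # e.g. "five", "ten", "six"
    start: int       # first sample index of the labelled range
    end: int         # one past the last sample index of the labelled range

def find_label(labels, character):
    """Retrieve the labelled range for a given character syllable."""
    for label in labels:
        if label.character == character:
            return label
    raise KeyError(character)

# Example: labels for the sample utterance "fifty-six"
labels = [SyllableLabel("five", 0, 8000),
          SyllableLabel("ten", 8000, 14000),
          SyllableLabel("six", 14000, 22000)]
print(find_label(labels, "ten"))
```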
When specifically intercepting the audio data of the main syllables of characters, it is considered that when people say specific numbers, the audio data of the middle part of the syllable corresponding to each digit or unit is largely consistent; that is, for the same character syllable the audio data of the middle part differs relatively little between utterances, while for different character syllables the audio data of the middle part differs considerably. For example, when a person says the two numbers "fifty-six" and "sixty-five", the middle part of the audio data of the syllable of the character "five" in "fifty-six" is largely the same as the middle part of the audio data of the syllable of the character "five" in "sixty-five". Therefore, the audio data of the middle part of the character syllable can be intercepted as the audio data of the designated area to obtain the audio data of the main syllable of that character. Based on this characteristic, during specific implementation the designated area may be the area within the range identified by the character syllable identifier that takes the midpoint of that range as its centre of symmetry and whose interval length, divided by the interval length of the range identified by the character syllable identifier, equals a preset ratio. For example, as shown in FIG. 5, the midpoint O of the range identified by the character syllable identifier "5" is taken as the centre of symmetry, and the area extending symmetrically to both sides of O to a combined length of 1/2 of the range is intercepted as the designated area; the audio data of that designated area is determined as the audio data of the main syllable of the character "five". Here, the designated area occupies 1/2 of the range identified by the character syllable identifier "5". In the same way, the audio data of the main syllable of the character "ten" and the audio data of the main syllable of the character "six" can also be intercepted. Of course, the preset ratio listed above is only intended to better illustrate the embodiment of the present invention. During specific implementation, other values can also be selected as the preset ratio according to the specific scenario, so as to determine the designated area and then intercept the audio data of the main syllable of the corresponding character. After the audio data of the main syllables of the characters has been intercepted, the audio data in the area between the main-syllable audio data of adjacent characters in the sample audio data can be intercepted as the audio data of the connecting syllable between those adjacent characters. For example, as shown in FIG. 5, the audio data in the area between the main-syllable audio data of the character "five" and the main-syllable audio data of the adjacent character "ten" is intercepted as the audio data of the connecting syllable between the characters "five" and "ten", that is, the audio data of a connecting syllable between adjacent characters. In the same way, the audio data of the connecting syllable between the adjacent characters "ten" and "six" can also be obtained.
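A minimal sketch of the centre-symmetric interception follows, assuming the labelled range is given in sample indices and the preset ratio is 1/2 as in the example above; the helper names are illustrative.

```python
# Centre-symmetric designated area and the connecting-syllable region between
# two adjacent main-syllable regions.

def main_syllable_region(start, end, ratio=0.5):
    """Take the sub-range centred on the midpoint of the labelled range whose
    length is `ratio` times the labelled range, as the main-syllable region."""
    mid = (start + end) / 2
    half = (end - start) * ratio / 2
    return int(mid - half), int(mid + half)

def connecting_region(prev_main_end, next_main_start):
    """The connecting-syllable region is the stretch between the main-syllable
    regions of two adjacent characters."""
    return prev_main_end, next_main_start

# Using the labelled ranges of "fifty-six" from the earlier sketch:
five_main = main_syllable_region(0, 8000)                      # (2000, 6000)
ten_main = main_syllable_region(8000, 14000)                   # (9500, 12500)
five_ten_link = connecting_region(five_main[1], ten_main[0])   # (6000, 9500)
```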
In this scenario example, it is considered that if the sample data is relatively rich, multiple pieces of audio data representing the connecting syllable between the same pair of adjacent characters may be obtained. For example, the audio data of "56" and the audio data of "54" in the sample data both allow the audio data of the connecting syllable between the same characters "five" and "ten" to be intercepted. In addition, the sample data may contain audio data of "56" uttered by different people, so that multiple pieces of audio data of the connection between the characters "five" and "ten" can be obtained from the audio data of different speakers. Therefore, in the case where the intercepted audio data of connecting syllables includes multiple pieces of audio data of the connecting syllable between the same pair of adjacent characters, and in order to obtain better audio data as the audio data of the connecting syllable between those adjacent characters so that the subsequent connection of the main-syllable audio data of the corresponding characters is more natural and smooth, the multiple pieces of connecting-syllable audio data between the same adjacent characters can be divided into multiple types, the frequency of occurrence of each type in the sample data can be counted, and the audio data of the most frequent type can be filtered out from the multiple types and stored in the preset audio database as the audio data of the connecting syllable between those adjacent characters. Of course, in addition to selecting among the multiple pieces of connecting-syllable audio data between the same adjacent characters according to the frequency of occurrence of the various types as listed above, other suitable methods may also be used to select and store the audio data with the better effect. For example, the MOS value (Mean Opinion Score, an average subjective opinion score) of each of the multiple pieces of connecting-syllable audio data between the same adjacent characters can be computed, and the connecting-syllable audio data with the highest MOS value can be filtered out, based on those MOS values, as the audio data of the connecting syllable between the adjacent characters. The MOS value can be used to evaluate the naturalness and fluency of audio data more accurately and objectively. Similarly, when multiple pieces of audio data characterizing the main syllable of the same character have been intercepted, the frequency of occurrence of the different types of main-syllable audio data among them can be counted, and the audio data of the most frequent type can be filtered out as the audio data of the main syllable of that character and stored in the preset audio database. Alternatively, the MOS values of the multiple pieces of main-syllable audio data of the same character can be determined, and the audio data with the highest MOS value can be filtered out as the audio data of the main syllable of that character and stored in the preset audio database.
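The frequency-based and MOS-based selection described above might be sketched as follows, assuming the grouping of clips into types and the MOS scores are supplied externally (for example, from clustering and from a listening test); none of these names come from the original.

```python
# Choosing one clip when several candidate connecting-syllable clips were
# intercepted for the same adjacent-character pair.

from collections import Counter

def pick_by_frequency(candidate_types):
    """candidate_types: one type label per intercepted clip; keep the most frequent type."""
    return Counter(candidate_types).most_common(1)[0][0]

def pick_by_mos(candidates_with_mos):
    """candidates_with_mos: (clip, mos_score) pairs; keep the clip with the highest MOS."""
    return max(candidates_with_mos, key=lambda pair: pair[1])[0]

# Example: three clips of the "five"-"ten" connecting syllable fell into two types
best_type = pick_by_frequency(["type_a", "type_a", "type_b"])   # "type_a"
best_clip = pick_by_mos([("clip1", 3.9), ("clip2", 4.4)])        # "clip2"
```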
It can be seen from the above that the method for determining broadcast speech provided by the embodiment of the present invention obtains the audio data of the connecting syllables between adjacent characters and uses it to splice the audio data of the main syllables of the corresponding characters, so that more natural voice audio data is obtained for voice broadcast. This solves the problems of unnatural digital broadcast and poor user experience in existing methods, and achieves efficient, smooth voice announcements of numbers while taking the computation cost into account. The method also obtains sample data containing numbers, intercepts the audio data of the designated area from the sample data as the main-syllable audio data of the characters, and then intercepts the audio data between the main-syllable audio data of the characters as the audio data of the connecting syllables between adjacent characters, so that a more accurate preset audio database can be established; by retrieving this preset audio database, more natural and smooth audio data of the target digital sequence can be produced. Referring to FIG. 6, the present invention provides a method for determining broadcast voice, where the method is applied on the side of the device (or user side) for determining broadcast voice. During specific implementation, the method may include the following. S601: Obtain the target digital sequence to be broadcast. In this embodiment, the target digital sequence to be broadcast may specifically be the amount of money received, such as the "54" of 54 yuan; it may also be a travelled distance, such as the "80" of 80 kilometres; it may also be the current price of a stock, such as the "20.9" of 20.9 yuan per share. Of course, the data objects represented by the target digital sequences listed above are only intended to better illustrate this embodiment. During specific implementation, according to the specific application scenario, the target digital sequence to be broadcast may also be a number used to characterize other data objects; in this regard, the present invention is not limited. In this embodiment, obtaining the target digital sequence to be broadcast can be understood as acquiring the data to be broadcast, parsing it, and extracting the numbers in it as the target digital sequence to be broadcast. For example, when the server of the payment platform confirms that 54 yuan has been credited to the user's account, it sends the prompt "54 yuan has been credited to the account" to the device for determining broadcast voice associated with the user's account (such as the user's mobile phone). After receiving the prompt, the device can parse it and extract the number "54" as the target digital sequence to be broadcast. Of course, it should be noted that the above manner of obtaining the target digital sequence to be broadcast is only a schematic illustration, and the present invention is not limited thereto. S603: Convert the target digital sequence into a character string, where the character string includes a plurality of characters arranged in a preset order.
In this embodiment, the above-mentioned character string can be understood as a string of characters that characterizes the character syllables of the target digital sequence and is arranged according to the order corresponding to the target digital sequence (that is, the preset order). Each character in the string corresponds to a character syllable in the target digital sequence. For example, the character string of the target digital sequence "67" can be represented as "six ten seven", where the characters "six", "ten" and "seven" each correspond to a character syllable in the target digital sequence, and the characters are arranged in the preset order corresponding to the target digital sequence. Of course, it should be noted that the character strings listed above are only intended to better explain this embodiment. During specific implementation, other types of character strings can also be selected according to the specific situation; in this regard, the present invention is not limited. In this embodiment, converting the target digital sequence into a character string can be understood as converting the target digital sequence, according to preset mapping rules, into a corresponding character string representing the character syllables of the target digital sequence. For example, according to the preset mapping rules, the digit "6" in the tens place of the target digital sequence "67" can be converted into the corresponding characters "six" and "ten", and the digit "7" in the units place can be converted into the corresponding character "seven"; the obtained characters are then arranged according to the preset order corresponding to the target digital sequence "67", giving the corresponding character string "six ten seven". Of course, it should be noted that the above implementation of converting the target digital sequence into a character string is only a schematic illustration. During specific implementation, other methods may be used to convert the target digital sequence into the corresponding character string according to the specific situation; in this regard, the present invention is not limited.
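A minimal sketch of the digit-to-character-string conversion for two-digit sequences follows, using English readings as stand-ins for the character syllables; real mapping rules for longer sequences and other locales would be more involved, and the dictionary below is an assumption for illustration.

```python
# Convert a digit sequence into its character string, e.g. "67" -> six ten seven.

DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
               "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def to_character_string(number: str):
    """Convert a two-digit sequence into characters arranged in the preset order."""
    if len(number) == 2 and number[0] != "0":
        chars = [DIGIT_WORDS[number[0]], "ten"]
        if number[1] != "0":
            chars.append(DIGIT_WORDS[number[1]])
        return chars
    return [DIGIT_WORDS[d] for d in number]   # fall back to digit-by-digit reading

print(to_character_string("67"))   # ['six', 'ten', 'seven']
print(to_character_string("54"))   # ['five', 'ten', 'four']
```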
S605: Acquire the audio data of the main syllable of each character in the character string, and the audio data of the connecting syllables between adjacent characters in the character string, where the connecting syllables are used to connect the main syllables of adjacent characters. In this embodiment, the main syllable of a character can be understood as the main part of the character syllable (for example, the middle part of the character syllable). This part of the syllable usually has a high degree of recognizability, and the main frequency and sound intensity of the main syllable of the same character syllable are consistent and approximately the same from one utterance to another; therefore, the main syllable can be extracted to distinguish one character syllable from another. The connecting syllable between adjacent characters can be understood as the syllable used to connect the main syllables of the adjacent characters. It usually differs from the main syllables, has no specific meaning and does not correspond to any specific character, but its waveform data in the audio data is not zero. In human speech habits it usually appears between the main syllables of adjacent characters and serves as a transition, so that human speech differs from machine pronunciation: instead of the main syllables simply being concatenated monotonously and stiffly, the speech passes naturally and smoothly from one character syllable to the next. For example, when a person says "fifty", the connection between the main syllable of the character "five" and the main syllable of the character "ten" is the connecting syllable between the character syllables "five" and "ten". In this embodiment, obtaining the audio data of the main syllable of each character in the character string and the audio data of the connecting syllables between adjacent characters may specifically include: searching the preset audio database according to the characters in the character string of the target digital sequence to obtain the audio data of the main syllable of each character in the string and the audio data of the connecting syllables between adjacent characters in the string. Here, the preset audio database may specifically be a database created in advance and stored on a server or on a device used for broadcasting voice, and may specifically include the audio data of the main syllable of each character and the audio data of the connecting syllables between adjacent characters. S607: Splice the audio data of the main syllables of the characters and the audio data of the connecting syllables between the adjacent characters according to the preset order to obtain the audio data of the target digital sequence. In this embodiment, the audio data of the target digital sequence can be understood as the audio data used for the voice broadcast of the target digital sequence. Splicing according to the preset order (that is, the order of the characters in the character string of the target digital sequence) means first arranging the audio data of the main syllables of the characters, and then using the audio data of the connecting syllables between adjacent characters to connect the main-syllable audio data of the adjacent characters. In this embodiment, it should be noted that most of the devices used by users for determining broadcast voice are embedded device systems. Limited by their own structure, such systems often have relatively weak computing power and data processing capability, so that directly synthesizing the audio data of the corresponding digital sequence through a speech synthesis model is relatively costly and relatively inefficient. By using the method for determining broadcast speech provided by the embodiment of the present invention, it is possible to avoid generating the corresponding audio data through a resource-intensive speech synthesis model; instead, the audio data of the main syllables of the corresponding characters and of the connecting syllables between adjacent characters is simply retrieved from the preset audio database and spliced, so that audio data of the target digital sequence with higher accuracy is obtained while the occupation of resources is reduced and the processing efficiency is improved. The method is therefore well suited to embedded device systems.
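As an illustration of this lookup-and-splice approach, the following sketch assumes the preset audio database is held as two small dictionaries, one keyed by character and one keyed by the adjacent-character pair; the keys and placeholder clip values are illustrative only.

```python
# Retrieval of main-syllable and connecting-syllable clips from a preset
# audio database represented as two lookup tables.

MAIN_DB = {"five": "A", "ten": "B", "four": "C"}
LINK_DB = {("five", "ten"): "f", ("ten", "four"): "r"}

def retrieve(characters):
    # Determine the connection relationship between adjacent characters
    connections = list(zip(characters, characters[1:]))
    # Retrieve the main-syllable clip of each character
    mains = [MAIN_DB[c] for c in characters]
    # Retrieve the connecting-syllable clip for each adjacent pair
    links = [LINK_DB[pair] for pair in connections]
    return mains, links

print(retrieve(["five", "ten", "four"]))   # (['A', 'B', 'C'], ['f', 'r'])
```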
In one embodiment, obtaining the audio data of the main syllables of the characters in the character string and the audio data of the connecting syllables between adjacent characters in the character string may include the following. S1: Identify each character in the character string and determine the connection relationship between adjacent characters in the character string, where the connection relationship between adjacent characters is used to indicate the sequential connection order of adjacent characters in the string. S2: Retrieve and obtain the audio data of the main syllable of each character from the preset audio database according to each character in the character string, where the preset audio database stores the audio data of the main syllables of characters and the audio data of the connecting syllables between adjacent characters. S3: Retrieve and obtain, from the preset audio database, the audio data of the connecting syllables between adjacent characters in the character string according to the connection relationships between the adjacent characters. In this embodiment, the connection relationship between adjacent characters can be understood as a piece of identification information describing the sequence of two adjacent characters. For example, the characters "five" and "ten" in the character string of "54" are two adjacent characters, and the connection between them can be expressed as: character "five" hyphen "ten". Of course, it should be noted that the connection relationship between adjacent characters listed above is only a schematic illustration; in specific implementation, the connection relationship may also be expressed by other identification methods, and the present invention is not limited in this regard. In this embodiment, during specific implementation, the identified characters and the determined connection relationships between adjacent characters may be used as identifiers for searching the preset audio database, so as to extract the audio data in the database that matches those identifiers, namely the audio data of the main syllables of the characters or the audio data of the connecting syllables between adjacent characters. In one embodiment, the preset audio database may be established in the following manner. S1: Obtain sample data, where the sample data is audio data containing the character string corresponding to a digital sequence. S2: Intercept the audio data of the main syllables of characters from the sample data. S3: Intercept the audio data of the connecting syllables between adjacent characters from the sample data. S4: Establish the preset audio database according to the audio data of the main syllables of the characters and the audio data of the connecting syllables between the adjacent characters. In this embodiment, acquiring audio data containing numbers as sample data may include: intercepting announcer broadcast audio data containing number-related broadcast content as the sample data; or collecting voice data read aloud by a person according to a preset text as the sample data, where the preset text may be preset text content containing a variety of digit combinations. Of course, it should be noted that the above implementations for acquiring audio data containing numbers as sample data are only schematic illustrations.
During specific implementation, other ways of obtaining audio data containing numbers as sample data can also be chosen according to the specific situation; in this regard, the present invention is not limited. In this embodiment, after the sample data is acquired, it may also be marked. Specifically, in the acquired sample data, corresponding character syllable identifiers may be used to mark the range area of the audio data corresponding to each character syllable. Correspondingly, intercepting the audio data of the main syllables of the characters from the sample data may specifically include: retrieving the character syllable identifiers in the sample data; and intercepting, according to each character syllable identifier, the audio data of the designated area within the range identified by that identifier as the audio data of the main syllable of the character. In this embodiment, the designated area can be understood as the area within the range identified by the character syllable identifier that takes the midpoint of that range as its centre of symmetry and whose interval length, divided by the interval length of the range identified by the character syllable identifier, equals a preset ratio. For example, the midpoint O of the range identified by the character syllable identifier "5" can be taken as the centre of symmetry, and the area extending symmetrically to both sides of O to a combined length of 1/2 of the range can be intercepted as the designated area, whose audio data is determined as the audio data of the main syllable of the character "five"; here the designated area occupies 1/2 of the range identified by the character syllable identifier "5". Of course, it should be noted that the designated area described above and the manner of determining it are only intended to better explain the embodiments of the present invention. During specific implementation, other areas can also be selected as the designated area according to the specific situation, and determined by a corresponding method. For example, the area within the range identified by the character syllable identifier in which the sound-intensity amplitude is greater than a threshold intensity may also be used as the designated area. Correspondingly, in specific implementation, the audio data in the area whose sound-intensity amplitude is greater than the threshold intensity can be intercepted, according to the sound intensity, from the range identified by the character syllable identifier as the audio data of the main syllable of the character. For a specific implementation, refer to FIG. 7: within the range identified by the character syllable identifier, the zero-intensity position in the first cycle whose intensity amplitude is greater than the threshold intensity and the zero-intensity position in the first cycle whose intensity amplitude falls below the threshold intensity are selected, and the area between these two position points is taken as the designated area; the audio data in the designated area can then be intercepted as the audio data of the main syllable of the character. It should be noted that the specific value of the threshold intensity may be determined according to the phonemes of the character syllable.
Specifically, if the phoneme of the character syllable is a vowel, the threshold intensity can be set relatively high, for example to 0.1; if the phoneme of the character syllable is a consonant, the threshold intensity can be set relatively low, for example to 0.03. For example, for a character whose syllable starts with a vowel and ends with a consonant, the area between the zero-intensity position of the first cycle whose intensity amplitude exceeds 0.1 within the range identified by that character's syllable identifier and the zero-intensity position of the first cycle whose intensity amplitude falls below 0.03 can be taken as the designated area, and the audio data in the designated area can then be obtained as the audio data of the main syllable of the character. In addition, the specific value of the threshold intensity can also be determined according to the strength of the background sound in the audio data: if the background sound in the audio data is strong, the threshold intensity can be set relatively high, for example to 0.16; if the background sound is weak, the threshold intensity can be set relatively low, for example to 0.047. Of course, it should be noted that the above methods for determining the threshold intensity are only intended to better explain the method in this embodiment; during specific implementation, other suitable methods may be used to determine the threshold intensity according to the specific application scenario, and the present invention is not limited in this regard. After the audio data of the main syllables of the characters has been intercepted from the sample data, intercepting the audio data of the connecting syllables between adjacent characters from the sample data may correspondingly include: intercepting the audio data of the area between the main-syllable audio data of adjacent characters in the sample data as the audio data of the connecting syllables between those adjacent characters. In this embodiment, it is further considered that, according to human speech habits, when the first character syllable of the speech data of the target digital sequence is uttered, there is also a kind of connecting-syllable audio leading from silence into the main syllable of the first character. Therefore, in specific implementation, the audio data between the starting position of the sample audio data and the main-syllable audio data of the first character can also be intercepted as a kind of connecting-syllable audio data, so that subsequent splicing can produce a better, more natural and more fluent beginning of the audio data of the target digital sequence. In this embodiment, during specific implementation, the audio data in the area between two adjacent designated areas in the sample audio data may be intercepted as the audio data of the connecting syllable between the corresponding adjacent characters.
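A rough sketch of the intensity-threshold variant of choosing the designated area (FIG. 7) follows, assuming the labelled range is a list of waveform samples and using the example thresholds quoted above; the near-zero-crossing handling is simplified and the function name is an assumption, not the original implementation.

```python
# Pick the loud core of a labelled syllable by intensity thresholds, then back
# off to nearby (near-)zero crossings so the cut points fall at quiet samples.

def designated_area_by_intensity(samples, start_threshold=0.1, end_threshold=0.03):
    """Return (start, end) indices bounding the loud core of a labelled syllable."""
    above_start = [i for i, s in enumerate(samples) if abs(s) > start_threshold]
    if not above_start:
        return 0, len(samples)
    first_loud = above_start[0]
    above_end = [i for i, s in enumerate(samples) if abs(s) > end_threshold]
    last_loud = above_end[-1]
    start = first_loud
    while start > 0 and abs(samples[start]) > 1e-3:
        start -= 1
    end = last_loud
    while end < len(samples) - 1 and abs(samples[end]) > 1e-3:
        end += 1
    return start, end
```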
In this embodiment, during specific implementation, each piece of audio data in the sample data may be processed in the above manner to obtain the audio data of the main syllables of the characters and the audio data of the connecting syllables between adjacent characters; the acquired main-syllable audio data and connecting-syllable audio data can then be stored, and the preset audio database can be established from them. In one embodiment, intercepting the audio data of the main syllables of the characters from the sample data may include the following: retrieving the character syllable identifiers in the sample data; and intercepting, according to each character syllable identifier, the audio data of the designated area within the range identified by that identifier as the audio data of the main syllable of the character. In one embodiment, the designated area may be understood as the area within the range identified by the character syllable identifier that takes the midpoint of that range as its centre of symmetry and whose interval length, divided by the interval length of the identified range, equals a preset ratio. In one embodiment, intercepting the audio data of the connecting syllables between adjacent characters from the sample data may include the following: intercepting the audio data of the area between the main-syllable audio data of adjacent characters in the sample data as the audio data of the connecting syllables between those adjacent characters. In one embodiment, after the audio data of the connecting syllables between adjacent characters has been intercepted from the sample data, in order to store connecting-syllable audio data that connects better and is more natural and smooth for later retrieval, the method may further include the following. S1: Detect whether the intercepted audio data of connecting syllables includes multiple pieces of audio data of the connecting syllable between the same pair of adjacent characters. S2: When it does, count the frequency of occurrence of the various types of connecting-syllable audio data among the multiple pieces for that pair, and determine the audio data of the most frequent type as the audio data of the connecting syllable between those adjacent characters. In this embodiment, since most of the sample data is voice audio containing numbers uttered by humans, for multiple pieces of connecting-syllable audio data between the same adjacent characters, the higher the frequency of occurrence of a type, the more frequently it is used in speech and the better it fits common human speech habits. Therefore, the connecting-syllable audio data of the most frequent type can be stored in the preset audio database as the better and more natural audio data, improving the accuracy of the audio database.
Specifically, the multiple pieces of connecting-syllable audio data between the same adjacent characters can be divided into multiple types, the frequency of occurrence of each type in the sample data can be counted, and the audio data of the most frequent type can be filtered out from the multiple types and stored in the preset audio database as the audio data of the connecting syllable between those adjacent characters. Of course, in addition to selecting among the multiple pieces of connecting-syllable audio data between the same adjacent characters according to the frequency of occurrence of the various types as listed above, other suitable methods may also be used to select and store the audio data with the better effect. For example, the MOS value (Mean Opinion Score, an average subjective opinion score) of each of the multiple pieces of connecting-syllable audio data between the same adjacent characters can be computed, and the connecting-syllable audio data with the highest MOS value can be filtered out, based on those MOS values, as the audio data of the connecting syllable between the adjacent characters. The MOS value can be used to evaluate the naturalness and fluency of audio data more accurately and objectively. Similarly, when multiple pieces of audio data characterizing the main syllable of the same character have been intercepted, the frequency of occurrence of the different types of main-syllable audio data among them can be counted, and the audio data of the most frequent type can be filtered out as the audio data of the main syllable of that character and stored in the preset audio database; alternatively, the MOS values of the multiple pieces of main-syllable audio data of the same character can be determined, and the audio data with the highest MOS value can be filtered out as the audio data of the main syllable of that character and stored in the preset audio database. In one embodiment, in order to obtain relatively complete voice audio data for a voice broadcast that includes the target digital sequence, after the audio data of the target digital sequence is obtained, the method may further include the following during specific implementation. S1: Acquire preset pre-audio data, where the preset pre-audio data is used to indicate the data object represented by the target digital sequence. S2: Splice the preset pre-audio data and the audio data of the target digital sequence to obtain the voice audio data to be played. S3: Play the voice audio data to be played. In this embodiment, the preset pre-audio data may specifically be audio data used to indicate the content of the data object represented by the target digital sequence. For example, for a broadcast of a received amount, the preset pre-audio data may include the voice audio data "the amount credited to the account is" set before the amount and the voice audio data "yuan" set after the amount. For a stock price broadcast, the preset pre-audio data may include the voice audio data "the latest unit price of XX is" set before the price and the voice audio data "yuan per share" set after the price. Of course, the preset pre-audio data listed above is only a schematic illustration.
During specific implementation, other audio data can also be set as the above-mentioned preset pre-audio data according to the specific application scenario; in this regard, the present invention is not limited. In this embodiment, it should be noted that the pre-audio data in broadcast voice data is generally fixed, and only the target digital sequence to be broadcast changes. Taking the broadcast of a credited amount as an example, the pre-audio data is the same in the voice announcements of different credited amounts: in "the amount credited to the account is fifty-four yuan" and "the amount credited to the account is seventy-nine yuan", the pre-audio data "the amount credited to the account is" and "yuan" are exactly the same, and only the amount to be announced differs. Therefore, in specific implementation, in order to improve processing efficiency, the corresponding pre-audio data can be preset and stored; after the audio data of the target digital sequence has been generated, the preset pre-audio data and the generated audio data of the target digital sequence can be spliced and combined directly to obtain the voice audio data to be played, which is then played. This avoids repeatedly synthesizing pre-audio data with the same content and improves processing efficiency, so that the method for determining broadcast speech provided by the present invention is better suited to devices for determining broadcast voice with limited data processing capability, such as mobile phones and other embedded systems. Specifically, for example, after the audio data of the target digital sequence "54" is obtained, the preset pre-audio data "the amount credited to the account is" and "yuan" can first be called; the audio data of the target digital sequence "54" is then combined with the preset pre-audio data in a certain order. Specifically, the audio data of the target digital sequence "54" can be connected after the audio data of "the amount credited to the account is", and the audio data of "yuan" can be connected after the audio data of the target digital sequence "54", thereby obtaining more complete voice broadcast data of the credited amount that contains the target digital sequence. In one embodiment, the preset pre-audio data may specifically include at least one of the following: audio data of pre-phrases used to broadcast credited amounts, audio data of pre-phrases used to broadcast mileage, audio data of pre-phrases used to broadcast stock prices, and so on. Of course, it should be noted that the preset pre-audio data listed above is only intended to better illustrate the embodiment; during specific implementation, other preset audio data may also be selected as the preset pre-audio data according to the specific application scenario and requirements, and the present invention is not limited in this regard.
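A minimal sketch of assembling the complete announcement from the fixed pre-audio clips and the spliced target-sequence audio is given below, assuming all clips are lists of PCM samples; the variables are placeholders, not data from the original.

```python
# Combine the fixed pre-audio, the spliced digit audio and the unit audio in order.

def assemble_announcement(prefix_clip, target_clip, unit_clip):
    """'the amount credited to the account is' + spliced digits + 'yuan'."""
    return prefix_clip + target_clip + unit_clip

prefix = [0.0] * 1600   # stands in for "the amount credited to the account is"
digits = [0.1] * 2400   # stands in for the spliced audio of "54"
unit = [0.0] * 400      # stands in for "yuan"
to_play = assemble_announcement(prefix, digits, unit)
```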
It can be seen from the above that the method for determining broadcast speech provided by the embodiment of the present invention obtains the audio data of the connecting syllables between adjacent characters and uses it to splice the audio data of the main syllables of the corresponding characters, so that more natural voice audio data is obtained for voice broadcast. This solves the problems of unnatural digital broadcast and poor user experience in existing methods, and achieves efficient, smooth voice announcements of numbers while taking the computation cost into account. The method also obtains sample data containing numbers, intercepts the audio data of the designated area from the sample data as the main-syllable audio data of the characters, and then intercepts the audio data between the main-syllable audio data of the characters as the audio data of the connecting syllables between adjacent characters, so that a more accurate preset audio database can be established; by retrieving this preset audio database, more natural and smooth audio data of the target digital sequence can be produced. Referring to FIG. 8, the present invention provides a method for determining broadcast voice, where the method is applied on the side of a device for determining broadcast voice. During specific implementation, the method may include the following. S801: Obtain a character string to be played, where the character string includes a plurality of characters arranged in a preset order. S803: Obtain the audio data of the main syllable of each character in the character string, and the audio data of the connecting syllables between adjacent characters in the character string, where the connecting syllables are used to connect the main syllables of adjacent characters. S805: Splice the audio data of the main syllables of the characters and the audio data of the connecting syllables between the adjacent characters according to the preset order to obtain the audio data of the character string to be played. In this embodiment, the character string to be played may specifically be the character string of a digital sequence to be played, or the character string of text information to be played. During specific implementation, the character string of the corresponding content may be selected as the character string to be played according to the specific application scenario and implementation requirements; the present invention does not limit the specific content represented by the character string to be played. An embodiment of the present invention also provides a device for determining broadcast speech, which includes a processor and a memory for storing processor-executable instructions.
When the instructions are executed, the processor may perform the following steps according to the instructions: obtain the target digital sequence to be broadcast; convert the target digital sequence into a character string, where the character string includes a plurality of characters arranged in a preset order; acquire the audio data of the main syllable of each character in the character string, and the audio data of the connecting syllables between adjacent characters in the character string, where the connecting syllables are used to connect the main syllables of adjacent characters; and splice, according to the preset order, the audio data of the main syllables of the characters and the audio data of the connecting syllables between the adjacent characters to obtain the audio data of the target digital sequence. In order to carry out the above instructions more accurately, referring to FIG. 9, the present invention also provides another specific device for determining broadcast voice, where the device includes an input interface 901, a processor 902 and a memory 903, the above structures being connected by internal cables so that each structure can carry out specific data interaction. The input interface 901 may specifically be used to input the target digital sequence to be broadcast. The processor 902 may specifically be configured to convert the target digital sequence into a character string, where the character string includes a plurality of characters arranged in a preset order; acquire the audio data of the main syllable of each character in the character string and the audio data of the connecting syllables between adjacent characters in the character string, where the connecting syllables are used to connect the main syllables of adjacent characters; and splice, according to the preset order, the audio data of the main syllables of the characters and the audio data of the connecting syllables between the adjacent characters to obtain the audio data of the target digital sequence. The memory 903 may specifically be used to store the target digital sequence received via the input interface 901, the preset audio database, and the corresponding instruction programs. In this embodiment, the input interface 901 may specifically be a unit or module that receives information data and obtains from it the target digital sequence to be broadcast. In this embodiment, the processor 902 can be implemented in any suitable manner. For example, the processor may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so on; the invention is not limited in this regard. In this embodiment, the memory 903 may take many forms. In a digital system, anything that can store binary data can serve as a memory; in an integrated circuit, a circuit with a storage function but without a physical form is also called a memory, such as a RAM or a FIFO; in a system, a storage device with a physical form is also called a memory, such as a memory stick or a TF card.
An embodiment of the present invention also provides a computer storage medium based on the above method, where the computer storage medium stores computer program instructions which, when executed, implement: converting the target digital sequence into a character string, where the character string includes a plurality of characters arranged in a preset order; acquiring the audio data of the main syllable of each character in the character string, and the audio data of the connecting syllables between adjacent characters in the character string, where the connecting syllables are used to connect the main syllables of adjacent characters; and splicing, according to the preset order, the audio data of the main syllables of the characters and the audio data of the connecting syllables between the adjacent characters to obtain the audio data of the target digital sequence. In this embodiment, the storage medium includes, but is not limited to, a random access memory (Random Access Memory, RAM), a read-only memory (Read-Only Memory, ROM), a cache, a hard disk drive (Hard Disk Drive, HDD) or a memory card (Memory Card). The memory can be used to store the computer program instructions. A network communication unit may be provided in accordance with a standard stipulated by a communication protocol, as an interface for carrying out network connection and communication. In this embodiment, the functions and effects achieved by the program instructions stored in the computer storage medium can be explained by reference to the other embodiments, and are not repeated here. Referring to FIG. 10, at the software level an embodiment of the present invention also provides a device for determining broadcast speech. The device may specifically include the following structural modules. The first obtaining module 1001 may specifically be used to obtain the target digital sequence to be broadcast. The conversion module 1002 may specifically be used to convert the target digital sequence into a character string, where the character string includes a plurality of characters arranged in a preset order. The second obtaining module 1003 may specifically be used to obtain the audio data of the main syllable of each character in the character string, and the audio data of the connecting syllables between adjacent characters in the character string, where the connecting syllables are used to connect the main syllables of adjacent characters. The splicing module 1004 may specifically be used to splice, according to the preset order, the audio data of the main syllables of the characters and the audio data of the connecting syllables between the adjacent characters to obtain the audio data of the target digital sequence.
In one embodiment, the second obtaining module 1003 may specifically include the following structural units. The identification unit may specifically be used to identify each character in the character string and determine the connection relationships between adjacent characters in the character string, where the connection relationship between adjacent characters is used to indicate the sequential connection order of adjacent characters in the string. The first acquiring unit may specifically be configured to retrieve and acquire the audio data of the main syllable of each character from the preset audio database according to each character in the character string, where the preset audio database stores the audio data of the main syllables of characters and the audio data of the connecting syllables between adjacent characters. The second acquiring unit may specifically be configured to retrieve and acquire, from the preset audio database, the audio data of the connecting syllables between adjacent characters in the character string according to the connection relationships between the adjacent characters. In one embodiment, in order to prepare the preset audio database in advance, the device may further include a creation module, which may specifically be used to create the preset audio database. In an embodiment, the creation module may include the following structural units. The third obtaining unit may specifically be used to obtain audio data containing numbers as sample data. The first intercepting unit may specifically be used to intercept the audio data of the main syllables of characters from the sample data. The second intercepting unit may specifically be used to intercept the audio data of the connecting syllables between adjacent characters from the sample data. The establishing unit may specifically be configured to establish the preset audio database based on the audio data of the main syllables of the characters and the audio data of the connecting syllables between the adjacent characters. In an embodiment, the device may further include a playback module, which may specifically be used to obtain preset pre-audio data, where the preset pre-audio data is used to indicate the data object represented by the target digital sequence; splice the preset pre-audio data and the audio data of the target digital sequence to obtain the voice audio data to be played; and play the voice audio data to be played. In one embodiment, the preset pre-audio data may specifically include at least one of the following: audio data of pre-phrases used to broadcast credited amounts, audio data of pre-phrases used to broadcast mileage, and audio data of pre-phrases used to broadcast stock changes. Of course, it should be noted that the pre-audio data listed above is only a schematic illustration; during specific implementation, other suitable audio data may also be selected or obtained as the preset pre-audio data according to the specific application scenario and requirements, and the present invention is not limited in this regard. It should be noted that the units, devices or modules described in the foregoing embodiments may be implemented by a computer chip or an entity, or by a product having a certain function.
For convenience of description, the above device is described in terms of separate modules divided by function. Of course, when implementing the present invention, the functions of the modules may be implemented in one or more pieces of software and/or hardware, or the modules implementing the same function may be implemented by a combination of multiple sub-modules or sub-units. The device embodiments described above are only schematic. For example, the division into units is only a division by logical function; in actual implementation there may be other ways of dividing them, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, devices or units, and may be electrical, mechanical or of another form. It can be seen from the above that the device for determining broadcast speech provided by the embodiment of the present invention obtains, through the second obtaining module, the audio data of the connecting syllables between adjacent characters, and uses it, through the splicing module, to splice the audio data of the main syllables of the corresponding characters, so that more natural voice audio data is obtained for voice broadcast. This solves the problems of unnatural digital broadcast and poor user experience in existing methods, and achieves efficient, smooth voice announcements of numbers while taking the computation cost into account. Through the creation module, the device also obtains sample data containing numbers, intercepts the audio data of the designated area from the sample data as the main-syllable audio data of the characters, and then intercepts the audio data between the main-syllable audio data of the characters as the audio data of the connecting syllables between adjacent characters, so that a more accurate preset audio database can be established; from the established audio database, more natural and smooth audio data of the target digital sequence can then be generated. Although the present invention provides method operation steps as described in the embodiments or flowcharts, more or fewer operation steps may be included on the basis of conventional or non-inventive means. The order of the steps listed in the embodiments is only one of many possible execution orders and does not represent the only execution order. When an actual device or client product executes the method, the steps may be executed sequentially or in parallel according to the methods shown in the embodiments or the drawings (for example, in a parallel-processor or multi-threaded processing environment, or even in a distributed data processing environment). The terms "comprise", "include" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, product or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, product or device. Without further restriction, an element defined by such wording does not exclude the presence of other identical or equivalent elements in the process, method, product or device that includes it.
The terms "first" and "second" are used only to denote names and do not indicate any particular order. Those skilled in the art also know that, in addition to implementing a controller purely in the form of computer-readable program code, the method steps can be logically programmed so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller can therefore be regarded as a hardware component, and the means included in it for realizing various functions can also be regarded as structures within the hardware component; or even the means for realizing various functions can be regarded both as software modules implementing the method and as structures within the hardware component. The invention can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, classes and the like that perform specific tasks or implement specific abstract data types. The present invention can also be practised in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in local and remote computer storage media, including storage devices. From the description of the above embodiments, those skilled in the art can clearly understand that the present invention can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as a ROM/RAM, a magnetic disk or an optical disc, and includes several instructions for enabling a computer device (which may be a personal computer, a mobile terminal, a server, a network device or the like) to execute the methods described in the embodiments of the present invention or in parts of the embodiments. The embodiments of the present invention are described in a progressive manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on what distinguishes it from the other embodiments. The present invention can be used in many general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronic devices, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices. Although the present invention has been described by way of embodiments, a person of ordinary skill in the art knows that there are many variations of the present invention that do not depart from its spirit, and it is hoped that the appended claims cover these variations and changes without departing from the spirit of the present invention.

S601-805‧‧‧Steps
901‧‧‧Input interface
902‧‧‧Processor
903‧‧‧Memory
1001‧‧‧First acquisition module
1002‧‧‧Conversion module
1003‧‧‧Second acquisition module
1004‧‧‧Splicing module

In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some of the embodiments described in the present invention, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic diagram of an embodiment in which the method for determining broadcast speech provided by an embodiment of the present invention is used to broadcast an amount credited to an account, in an example scenario;
FIG. 2 is a schematic diagram of an embodiment in which the audio data of a target digital sequence is obtained by splicing using the method for determining broadcast speech provided by an embodiment of the present invention, in an example scenario;
FIG. 3 is a schematic diagram of an embodiment in which the method for determining broadcast speech provided by an embodiment of the present invention is used to obtain voice audio data for playing an amount credited to an account, in an example scenario;
FIG. 4 is a schematic diagram of an embodiment of marking audio data, in an example scenario;
FIG. 5 is a schematic diagram of an embodiment of intercepting the audio data of the primary syllables of characters and the audio data of the linking syllables between adjacent characters, in an example scenario;
FIG. 6 is a schematic flowchart of an embodiment of the method for determining broadcast speech provided by an embodiment of the present invention;
FIG. 7 is a schematic diagram of an embodiment of determining the position points of a designated region in the method for determining broadcast speech provided by an embodiment of the present invention;
FIG. 8 is a schematic flowchart of an embodiment of the method for determining broadcast speech provided by an embodiment of the present invention;
FIG. 9 is a schematic diagram of an embodiment of the structure of an apparatus for determining broadcast speech provided by an embodiment of the present invention;
FIG. 10 is a schematic diagram of an embodiment of the structure of a device for determining broadcast speech provided by an embodiment of the present invention.

Claims (18)

1. A method for determining broadcast speech, the method comprising:
acquiring a target digital sequence to be broadcast;
converting the target digital sequence into a character string, wherein the character string comprises a plurality of characters arranged in a preset order;
acquiring audio data of the primary syllable of each character in the character string and audio data of the linking syllables between adjacent characters in the character string, wherein the linking syllables are used to connect the primary syllables of adjacent characters; and
splicing, in the preset order, the audio data of the primary syllables of the characters and the audio data of the linking syllables between the adjacent characters to obtain audio data of the target digital sequence.

2. The method according to claim 1, wherein acquiring the audio data of the primary syllable of each character in the character string and the audio data of the linking syllables between adjacent characters in the character string comprises:
identifying each character in the character string and determining the connection relationship between adjacent characters in the character string, wherein the connection relationship between adjacent characters in the character string is used to indicate the sequential order in which adjacent characters in the character string are connected;
retrieving and acquiring, according to each character in the character string, the audio data of the primary syllable of each character from a preset audio database, wherein the preset audio database stores audio data of primary syllables of characters and audio data of linking syllables between adjacent characters; and
retrieving and acquiring, according to the connection relationship between adjacent characters in the character string, the audio data of the linking syllables between adjacent characters in the character string from the preset audio database.

3. The method according to claim 2, wherein the preset audio database is established in the following manner:
acquiring sample data;
intercepting, from the sample data, audio data of primary syllables of characters;
intercepting, from the sample data, audio data of linking syllables between adjacent characters; and
establishing the preset audio database according to the audio data of the primary syllables of the characters and the audio data of the linking syllables between the adjacent characters.
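As a hedged illustration of claims 1 and 2, the sketch below shows the runtime lookup-and-splice step. It assumes the preset audio database has already been built as two in-memory dictionaries (primary syllables keyed by character, linking syllables keyed by a pair of adjacent characters) and that all segments share one audio format; the function name and data layout are assumptions, not part of the disclosure.

```python
from typing import Dict, List, Tuple


def synthesize_sequence(digits: str,
                        primary_db: Dict[str, List[float]],
                        linking_db: Dict[Tuple[str, str], List[float]]) -> List[float]:
    """Splice primary-syllable audio and linking-syllable audio in the string's order."""
    chars = list(digits)       # the target digital sequence as ordered characters
    out: List[float] = []
    for i, ch in enumerate(chars):
        out.extend(primary_db[ch])                        # primary syllable of this character
        if i + 1 < len(chars):                            # linking syllable to the next character
            out.extend(linking_db.get((ch, chars[i + 1]), []))
    return out


# Example (hypothetical databases): audio = synthesize_sequence("120", primary_db, linking_db)
```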
4. The method according to claim 3, wherein intercepting, from the sample data, the audio data of the primary syllables of characters comprises:
retrieving character-syllable identifiers in the sample data; and
intercepting, according to a character-syllable identifier, the audio data of a designated region within the range identified by the character-syllable identifier in the sample data as the audio data of the primary syllable of that character.

5. The method according to claim 4, wherein the designated region is a region within the range identified by the character-syllable identifier that is centered symmetrically on the midpoint of the identified range and whose interval length has a ratio to the interval length of the identified range equal to a preset ratio.

6. The method according to claim 4, wherein intercepting, from the sample data, the audio data of the linking syllables between adjacent characters comprises:
intercepting the audio data of the region between the audio data of the primary syllables of adjacent characters in the sample data as the audio data of the linking syllable between the adjacent characters.

7. The method according to claim 3, wherein after intercepting, from the sample data, the audio data of the linking syllables between adjacent characters, the method further comprises:
detecting whether the audio data of the linking syllables between adjacent characters includes audio data of multiple linking syllables between the same pair of adjacent characters; and
when it is determined that the audio data of the linking syllables between adjacent characters includes audio data of multiple linking syllables between the same pair of adjacent characters, counting the frequency of occurrence of each type of linking-syllable audio data among the multiple linking syllables between that pair of adjacent characters, and determining the linking-syllable audio data with the highest frequency of occurrence as the audio data of the linking syllable between those adjacent characters.

8. The method according to claim 1, wherein after the audio data of the target digital sequence is obtained, the method further comprises:
acquiring preset prefix audio data, wherein the preset prefix audio data is used to indicate the data object represented by the target digital sequence;
splicing the preset prefix audio data and the audio data of the target digital sequence to obtain voice audio data to be played; and
playing the voice audio data to be played.
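Claim 7 keeps, for each pair of adjacent characters, the most frequent type of linking-syllable audio when the sample data yields several candidates. A minimal sketch follows, assuming the candidates have already been grouped into type labels by some similarity measure that the claim does not specify; the function and its inputs are illustrative only.

```python
from collections import Counter
from typing import List, Tuple


def pick_linking_syllable(candidates: List[Tuple[str, List[float]]]) -> List[float]:
    """candidates: (type_label, audio) pairs collected for one pair of adjacent characters.

    Returns the audio of the most frequently occurring type, as in claim 7.
    """
    freq = Counter(label for label, _ in candidates)
    best_label, _ = freq.most_common(1)[0]
    # Keep the first candidate of the winning type as the representative linking syllable.
    return next(audio for label, audio in candidates if label == best_label)
```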
9. The method according to claim 8, wherein the preset prefix audio data includes at least one of the following: audio data of a prefix phrase for broadcasting an amount credited to an account, audio data of a prefix phrase for broadcasting a driving mileage, and audio data of a prefix phrase for broadcasting a stock price.

10. A device for determining broadcast speech, the device comprising:
a first acquisition module, configured to acquire a target digital sequence to be broadcast;
a conversion module, configured to convert the target digital sequence into a character string, wherein the character string comprises a plurality of characters arranged in a preset order;
a second acquisition module, configured to acquire audio data of the primary syllable of each character in the character string and audio data of the linking syllables between adjacent characters in the character string, wherein the linking syllables are used to connect the primary syllables of adjacent characters; and
a splicing module, configured to splice, in the preset order, the audio data of the primary syllables of the characters and the audio data of the linking syllables between the adjacent characters to obtain audio data of the target digital sequence.

11. The device according to claim 10, wherein the second acquisition module comprises:
a recognition unit, configured to identify each character in the character string and determine the connection relationship between adjacent characters in the character string, wherein the connection relationship between adjacent characters in the character string is used to indicate the sequential order in which adjacent characters in the character string are connected;
a first acquisition unit, configured to retrieve and acquire, according to each character in the character string, the audio data of the primary syllable of each character from a preset audio database, wherein the preset audio database stores audio data of primary syllables of characters and audio data of linking syllables between adjacent characters; and
a second acquisition unit, configured to retrieve and acquire, according to the connection relationship between adjacent characters in the character string, the audio data of the linking syllables between adjacent characters in the character string from the preset audio database.

12. The device according to claim 10, further comprising an establishing module configured to establish a preset audio database.
13. The device according to claim 12, wherein the establishing module comprises:
a third acquisition unit, configured to acquire sample data;
a first interception unit, configured to intercept, from the sample data, audio data of primary syllables of characters;
a second interception unit, configured to intercept, from the sample data, audio data of linking syllables between adjacent characters; and
an establishing unit, configured to establish the preset audio database according to the audio data of the primary syllables of the characters and the audio data of the linking syllables between the adjacent characters.

14. The device according to claim 10, further comprising a playback module, configured to acquire preset prefix audio data, wherein the preset prefix audio data is used to indicate the data object represented by the target digital sequence; splice the preset prefix audio data and the audio data of the target digital sequence to obtain voice audio data to be played; and play the voice audio data to be played.

15. The device according to claim 14, wherein the preset prefix audio data includes at least one of the following: audio data of a prefix phrase for broadcasting an amount credited to an account, audio data of a prefix phrase for broadcasting a driving mileage, and audio data of a prefix phrase for broadcasting a stock price.

16. A method for determining broadcast speech, the method comprising:
acquiring a character string to be played, wherein the character string comprises a plurality of characters arranged in a preset order;
acquiring audio data of the primary syllable of each character in the character string and audio data of the linking syllables between adjacent characters in the character string, wherein the linking syllables are used to connect the primary syllables of adjacent characters; and
splicing, in the preset order, the audio data of the primary syllables of the characters and the audio data of the linking syllables between the adjacent characters to obtain audio data of the character string to be played.

17. An apparatus for determining broadcast speech, comprising a processor and a memory for storing processor-executable instructions, wherein when the processor executes the instructions, the steps of the method according to any one of claims 1 to 9 are implemented.

18. A computer-readable storage medium having computer instructions stored thereon, wherein when the instructions are executed, the steps of the method according to any one of claims 1 to 9 are implemented.
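Claims 8, 14, and 15 prepend a prefix phrase (for example, one announcing an amount credited to an account) to the digit audio and then play the result. The sketch below is an assumption-laden illustration: it supposes all segments are 16-bit mono PCM at the same sample rate, and it writes a playable WAV file with Python's standard `wave` module instead of invoking any particular playback API.

```python
import wave


def splice_and_save(prefix_pcm: bytes, sequence_pcm: bytes,
                    path: str = "announcement.wav", rate: int = 16000) -> None:
    """Prepend the prefix-phrase audio to the target-sequence audio and write a playable file."""
    pcm = prefix_pcm + sequence_pcm          # same format assumed, so splicing is concatenation
    with wave.open(path, "wb") as w:
        w.setnchannels(1)                    # mono
        w.setsampwidth(2)                    # 16-bit samples
        w.setframerate(rate)
        w.writeframes(pcm)
```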
TW108115683A 2018-07-17 2019-05-07 Method, device and equipment for determining broadcast voice TWI711967B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810781624.X 2018-07-17
CN201810781624.XA CN109086026B (en) 2018-07-17 2018-07-17 Broadcast voice determination method, device and equipment

Publications (2)

Publication Number Publication Date
TW202006532A true TW202006532A (en) 2020-02-01
TWI711967B TWI711967B (en) 2020-12-01

Family

ID=64838106

Family Applications (1)

Application Number Title Priority Date Filing Date
TW108115683A TWI711967B (en) 2018-07-17 2019-05-07 Method, device and equipment for determining broadcast voice

Country Status (3)

Country Link
CN (1) CN109086026B (en)
TW (1) TWI711967B (en)
WO (1) WO2020015479A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI789891B (en) * 2021-09-03 2023-01-11 中華大學學校財團法人中華大學 Condition-triggered feedback system and method thereof

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109086026B (en) * 2018-07-17 2020-07-03 阿里巴巴集团控股有限公司 Broadcast voice determination method, device and equipment
CN112562637B (en) * 2019-09-25 2024-02-06 北京中关村科金技术有限公司 Method, device and storage medium for splicing voice audios
CN112863475B (en) * 2019-11-12 2022-08-16 北京中关村科金技术有限公司 Speech synthesis method, apparatus and medium
CN111752524A (en) * 2020-06-28 2020-10-09 支付宝(杭州)信息技术有限公司 Information output method and device
CN112039991B (en) * 2020-09-01 2023-02-07 平安付科技服务有限公司 Notification information processing method, device, computer system and readable storage medium
CN112615869B (en) * 2020-12-22 2022-08-26 平安银行股份有限公司 Audio data processing method, device, equipment and storage medium
CN114691844A (en) * 2020-12-31 2022-07-01 华为技术有限公司 Conversation task management method and device and electronic equipment
CN113506558A (en) * 2021-07-07 2021-10-15 深圳汇商通盈科技有限公司 Method, device and equipment for collection and broadcast and storage medium
CN114464161A (en) * 2022-01-29 2022-05-10 上海擎朗智能科技有限公司 Voice broadcasting method, mobile device, voice broadcasting device and storage medium
CN115022108A (en) * 2022-06-16 2022-09-06 深圳市欢太科技有限公司 Conference access method, conference access device, storage medium and electronic equipment

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN87100922A (en) * 1987-02-21 1988-11-16 杭州自动化研究所 The head and the tail splicing synthetic method of Chinese characters computer voice
DE19840890A1 (en) * 1998-09-03 2000-04-06 Siemens Ag Process for acoustic output of text and speech system
US7664233B1 (en) * 2003-06-25 2010-02-16 Everbridge, Inc. Emergency and non-emergency telecommunications notification system
CN1731510B (en) * 2004-08-05 2010-12-08 纽安斯通信有限公司 Text-speech conversion for amalgamated language
CN101984489A (en) * 2010-11-08 2011-03-09 无敌科技(西安)有限公司 Method for realizing numerical pronunciation of Chinese content by using Chinese TTS
KR20150144031A (en) * 2014-06-16 2015-12-24 삼성전자주식회사 Method and device for providing user interface using voice recognition
TWI605350B (en) * 2015-07-21 2017-11-11 華碩電腦股份有限公司 Text-to-speech method and multiplingual speech synthesizer using the method
US9865251B2 (en) * 2015-07-21 2018-01-09 Asustek Computer Inc. Text-to-speech method and multi-lingual speech synthesizer using the method
CN106970771B (en) * 2016-01-14 2020-01-14 腾讯科技(深圳)有限公司 Audio data processing method and device
CN106373580B (en) * 2016-09-05 2019-10-15 北京百度网讯科技有限公司 The method and apparatus of synthesis song based on artificial intelligence
CN107135247B (en) * 2017-02-16 2019-11-29 江苏南大电子信息技术股份有限公司 A kind of service system and method for the intelligent coordinated work of person to person's work
CN107644637B (en) * 2017-03-13 2018-09-25 平安科技(深圳)有限公司 Phoneme synthesizing method and device
CN107123424B (en) * 2017-04-27 2022-03-11 腾讯科技(深圳)有限公司 Audio file processing method and device
CN109086026B (en) * 2018-07-17 2020-07-03 阿里巴巴集团控股有限公司 Broadcast voice determination method, device and equipment


Also Published As

Publication number Publication date
CN109086026B (en) 2020-07-03
TWI711967B (en) 2020-12-01
WO2020015479A1 (en) 2020-01-23
CN109086026A (en) 2018-12-25

Similar Documents

Publication Publication Date Title
TWI711967B (en) Method, device and equipment for determining broadcast voice
CN110069608B (en) Voice interaction method, device, equipment and computer storage medium
JP6633153B2 (en) Method and apparatus for extracting information
CN112270920A (en) Voice synthesis method and device, electronic equipment and readable storage medium
CN105975569A (en) Voice processing method and terminal
CN111261151B (en) Voice processing method and device, electronic equipment and storage medium
CN109840052B (en) Audio processing method and device, electronic equipment and storage medium
CN110602516A (en) Information interaction method and device based on live video and electronic equipment
CN109979450B (en) Information processing method and device and electronic equipment
CN109754783A (en) Method and apparatus for determining the boundary of audio sentence
JP2020003774A (en) Method and apparatus for processing speech
CN107943914A (en) Voice information processing method and device
US20130253932A1 (en) Conversation supporting device, conversation supporting method and conversation supporting program
CN111798821B (en) Sound conversion method, device, readable storage medium and electronic equipment
CN104144108A (en) Information response method, device and system
CN109994106A (en) A kind of method of speech processing and equipment
US8868419B2 (en) Generalizing text content summary from speech content
WO2019076120A1 (en) Image processing method, device, storage medium and electronic device
KR102548365B1 (en) Method for generating conference record automatically and apparatus thereof
CN109727592A (en) O&M instruction executing method, medium and terminal based on natural language speech interaction
CN114064943A (en) Conference management method, conference management device, storage medium and electronic equipment
JP2016045253A (en) Data structure, voice interaction device, and electronic apparatus
CN109213466B (en) Court trial information display method and device
CN113314096A (en) Speech synthesis method, apparatus, device and storage medium
CN115831125A (en) Speech recognition method, device, equipment, storage medium and product