CN110444190A - Speech processing method, device, terminal device and storage medium - Google Patents
Speech processing method, device, terminal device and storage medium
- Publication number
- CN110444190A CN110444190A CN201910746794.9A CN201910746794A CN110444190A CN 110444190 A CN110444190 A CN 110444190A CN 201910746794 A CN201910746794 A CN 201910746794A CN 110444190 A CN110444190 A CN 110444190A
- Authority
- CN
- China
- Prior art keywords
- voice
- speech
- data
- voice information
- voice data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Telephonic Communication Services (AREA)
Abstract
The invention discloses a speech processing method, a device, a terminal device and a computer-readable storage medium. Voice information in the environment is acquired, and voice data is determined in a preset speech database according to the voice information; text information received by a preset interface is extracted, and target voice data is looked up in the voice data based on the text information; according to a speech synthesis instruction, the target voice data is synthesized into a voice sequence. The invention enables speech recognition and speech synthesis to be performed without being limited by factors such as scene and context, improves the efficiency of speech processing, and performs speech synthesis and output based on user customization and individual demand, improving the performance of speech processing.
Description
Technical field
The present invention relates to the field of speech analysis technology, and in particular to a speech processing method, device, terminal device and computer-readable storage medium.
Background technique
The development of computer technology and digital signal processing has facilitated the development and practical application of speech analysis technology. Waveform-concatenation speech synthesis based on unit selection, thanks to improvements in computing power and storage capacity, has been able to use larger-scale sound libraries and finer unit-selection strategies, significantly improving the sound quality, timbre and naturalness of synthesized speech. Another mainstream speech synthesis technique, parametric speech synthesis based on hidden Markov models (HMM), has also won high praise from many researchers for its better robustness and generalization.
In existing speech analysis, speech synthesis and speech recognition technologies, building the sound library of a traditional speech synthesis system relies mainly on manual operation: professional recording personnel must be arranged to manually annotate prosody and segments. The workload of construction is large and the production cycle is long, so the efficiency of speech processing is low. In addition, recording of the corpus can only be completed in a professional recording environment, so speech processing is severely limited by scene, context and other factors.
The above content is only intended to assist understanding of the technical solution of the present invention, and does not constitute an admission that it is prior art.
Summary of the invention
The main purpose of the present invention is to provide a speech processing method, a terminal device and a computer-readable storage medium, aiming to solve the technical problem that existing ways of processing speech are severely limited by factors such as scene and context and have low processing efficiency.
An embodiment of the present invention proposes a speech processing method, which includes:
acquiring voice information in the environment, and determining voice data in a preset speech database according to the voice information;
extracting text information received by a preset interface, and looking up target voice data in the voice data based on the text information;
synthesizing the target voice data into a voice sequence according to a speech synthesis instruction.
Optionally, before the step of acquiring the voice information in the environment, the method further includes:
performing noise reduction on the sound in the environment, including the voice information, according to the sound volume in the environment.
The step of acquiring the voice information in the environment includes:
extracting the voice information from the sound after noise reduction.
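The two claimed steps above (gate the ambient sound, then pull out the speech part) can be sketched as follows. This is a minimal illustration, not the patent's actual algorithm: the patent leaves the noise-reduction method unspecified, and the names `noise_gate` and `extract_speech` are assumptions.

```python
# Hypothetical sketch of the claimed pre-processing: a crude amplitude gate
# stands in for the unspecified noise-reduction step, after which the
# surviving (non-zero) samples are treated as the voice information.

def noise_gate(samples, threshold):
    """Zero out samples whose magnitude falls below the threshold."""
    return [s if abs(s) >= threshold else 0 for s in samples]

def extract_speech(samples):
    """Keep only the non-silent samples left after gating."""
    return [s for s in samples if s != 0]

ambient = [0.02, 0.5, -0.6, 0.01, 0.4, -0.03]   # toy microphone samples
gated = noise_gate(ambient, threshold=0.1)
speech = extract_speech(gated)
```

A real implementation would operate on audio frames with a spectral noise-reduction algorithm, but the control flow (reduce noise first, then extract) matches the claim.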
Optionally, the step of determining voice data in the preset speech database according to the voice information includes:
identifying the text content and sound quality information of the voice information;
detecting whether the preset speech database contains voice data corresponding to the text content;
if not, establishing a correspondence between the text content and the voice data in the current voice information, and storing the voice data in the current voice information into the preset speech database;
if so, determining voice data in the preset speech database based on the recognized sound quality information.
Optionally, the step of determining voice data in the preset speech database based on the recognized sound quality information includes:
detecting whether the sound quality of the voice data corresponding to the text content stored in the preset speech database is better than the recognized sound quality of the voice data in the current voice information;
if not, updating the voice data corresponding to the text content in the preset speech database to the voice data in the current voice information;
if so, refraining from updating the voice data corresponding to the text content to the voice data in the current voice information.
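The store-or-replace rule of the two optional claims above can be sketched as a small update routine. The dictionary database, the function name and the single numeric quality score are all assumptions for illustration; the patent describes sound quality as volume and tone information without fixing a comparison metric.

```python
# Illustrative sketch of the claimed database-update rule: store a new
# (text content -> voice data) pair, or replace the stored voice data only
# when the newly recognised sound quality is better than the stored one.

speech_db = {}  # text content -> (voice_data, quality_score)

def update_db(text, voice_data, quality):
    stored = speech_db.get(text)
    if stored is None:
        # No entry yet: establish the text -> voice correspondence.
        speech_db[text] = (voice_data, quality)
    elif stored[1] < quality:
        # Stored sample has worse quality: replace it with the current one.
        speech_db[text] = (voice_data, quality)
    # Otherwise the stored, better-quality sample is kept unchanged.

update_db("hello", b"clip-1", quality=0.6)
update_db("hello", b"clip-2", quality=0.9)   # better: replaces clip-1
update_db("hello", b"clip-3", quality=0.4)   # worse: discarded
```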
Optionally, after the step of determining voice data in the preset speech database according to the voice information, the method further includes:
establishing an association list between text contents and sentences.
The step of looking up target voice data in the voice data based on the text information includes:
performing word segmentation on the text information to obtain first text content of the text information, and matching a standard sentence in the association list;
looking up target voice data in the voice data stored in the preset speech database according to second text content in the standard sentence.
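The segment-match-lookup chain above can be sketched as below. Whitespace splitting stands in for the real (unspecified) word-segmentation algorithm, and the association list and database contents are invented toy data.

```python
# Minimal sketch of the claimed lookup: segment the input text into word
# units (first text content), match a standard sentence in the association
# list, then fetch voice data for the sentence's words (second text content).

speech_db = {"good": b"v-good", "morning": b"v-morning", "everyone": b"v-all"}
association_list = {("good", "morning"): ["good", "morning", "everyone"]}

def find_target_speech(text):
    words = tuple(text.split())                          # word segmentation
    sentence = association_list.get(words, list(words))  # standard sentence
    return [speech_db[w] for w in sentence if w in speech_db]

targets = find_target_speech("good morning")
```

Note how the association list can expand a short input into a fuller standard sentence before the database lookup, which appears to be the point of the claimed two-level matching.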
Optionally, after the step of acquiring the voice information in the environment, the method further includes:
performing voiceprint recognition on the voice data in the voice information to extract a voiceprint feature, and determining an output timbre based on the extracted voiceprint feature.
Optionally, the step of synthesizing the target voice data into a voice sequence according to the speech synthesis instruction includes:
detecting a sequence synthesis demand and a timbre synthesis demand carried in the speech synthesis instruction;
based on the sequence synthesis demand, combining the target voice data according to the character order of the first text content or of the second text content to form an initial voice sequence;
based on the timbre synthesis demand, adding the output timbre to the initial voice sequence to form a final voice sequence.
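The two-stage synthesis claim (order the clips, then apply the output timbre) can be sketched as follows, assuming clips are byte strings and the timbre is a simple label; real waveform synthesis and voice conversion are far more involved and are not specified by the patent.

```python
# Hedged sketch of the claimed synthesis step: order the target voice data
# according to the character order of the text (sequence synthesis demand),
# then tag the result with the output timbre (timbre synthesis demand).

def synthesize(target_clips, order, timbre):
    ordered = [target_clips[i] for i in order]   # initial voice sequence
    return {"timbre": timbre, "audio": b"".join(ordered)}  # final sequence

clips = [b"B", b"A"]
seq = synthesize(clips, order=[1, 0], timbre="female-voice")
```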
In addition, to achieve the above object, the present invention also provides a speech processing device, which includes:
an acquisition module for acquiring voice information in the environment and determining voice data in a preset speech database according to the voice information;
a lookup module for extracting text information received by a preset interface and looking up target voice data in the voice data based on the text information;
a synthesis module for synthesizing the target voice data into a voice sequence according to a speech synthesis instruction.
In addition, to achieve the above object, the present invention also provides a terminal device, which includes a memory, a processor, and a speech processing program stored on the memory and runnable on the processor; when executed by the processor, the speech processing program implements the steps of the speech processing method described above.
In addition, to achieve the above object, the present invention also provides a computer-readable storage medium on which a speech processing program is stored; when executed by a processor, the speech processing program implements the steps of the speech processing method described above.
In the speech processing method, device, terminal device and computer-readable storage medium proposed by the embodiments of the present invention, voice information in the environment is acquired and voice data is determined in a preset speech database according to the voice information; text information received by a preset interface is extracted, and target voice data is looked up in the voice data based on the text information; according to a speech synthesis instruction, the target voice data is synthesized into a voice sequence.
By acquiring voice information from the sound of any external environment and processing it accordingly, voice data is determined in a preset speech database (previously stored identical voice data is updated and replaced, and new voice data is stored together with its correspondence). Text information is extracted from the preset interface for receiving the text input by the user; after basic processing of the text information, the target voice data corresponding to the extracted text information is found in the preset speech database in which voice data is saved in advance; then, according to the user demand in the speech synthesis instruction triggered by the user, the found target voice data is combined to form a voice sequence that meets the user's demand. This enables speech recognition and speech synthesis to be performed without being limited by factors such as scene and context, improves the efficiency of speech processing, and performs speech synthesis based on user customization and individual demand, improving the performance of speech processing.
Brief description of the drawings
Fig. 1 is a schematic structural diagram of the terminal in the hardware running environment involved in the embodiments of the present invention;
Fig. 2 is a flow diagram of the first embodiment of the speech processing method of the present invention;
Fig. 3 is a flow diagram of the second embodiment of the speech processing method of the present invention;
Fig. 4 is a flow diagram of the third embodiment of the speech processing method of the present invention;
Fig. 5 is a module diagram of the speech processing device of the present invention.
The realization of the objects, the functional characteristics and the advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed description of the embodiments
It should be understood that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it.
The primary solution of the embodiments of the present invention is: acquiring voice information in the environment, and determining voice data in a preset speech database according to the voice information; extracting text information received by a preset interface, and looking up target voice data in the voice data based on the text information; and synthesizing the target voice data into a voice sequence according to a speech synthesis instruction.
In existing speech analysis, speech synthesis and speech recognition technologies, building the sound library of a traditional speech synthesis system relies mainly on manual operation: professional recording personnel must be arranged to manually annotate prosody and segments. The workload of construction is large and the production cycle is long, so the efficiency of speech processing is low. In addition, recording of the corpus can only be completed in a professional recording environment, so speech processing is severely limited by scene and context.
The present invention provides a solution that enables speech recognition and speech synthesis without being limited by factors such as scene and context, improves the efficiency of speech processing, and performs speech synthesis based on user customization and individual demand, improving the performance of speech processing.
As shown in Fig. 1, Fig. 1 is a schematic structural diagram of the terminal in the hardware running environment involved in the embodiments of the present invention.
The terminal device of the embodiments of the present invention can be a network terminal device such as a terminal server or a PC, or a portable terminal device such as a smartphone, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a digital broadcast receiver, a wearable device (e.g. a smart bracelet or smartwatch), a navigation device or a portable computer, or a non-portable terminal device.
As shown in Fig. 1, the terminal device may include: a processor 1001 such as a CPU, a network interface 1004, a user interface 1003, a memory 1005 and a communication bus 1002, where the communication bus 1002 is used to realize connection and communication between these components. The user interface 1003 may include a display (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface 1003 may also include standard wired and wireless interfaces. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface). The memory 1005 can be a high-speed RAM memory, or a stable non-volatile memory such as a magnetic disk memory; optionally, the memory 1005 can also be a storage device independent of the aforementioned processor 1001.
Those skilled in the art will understand that the terminal structure shown in Fig. 1 does not constitute a limitation of the terminal device, which may include more or fewer components than illustrated, combine certain components, or arrange the components differently.
As shown in Fig. 1, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module and a speech processing program.
In the terminal device shown in Fig. 1, the network interface 1004 is mainly used to connect to a background server and perform data communication with it; the user interface 1003 is mainly used to connect to a client (user terminal) and perform data communication with it; and the processor 1001 can be used to call the speech processing program stored in the memory 1005 and perform the following operations:
acquiring voice information in the environment, and determining voice data in a preset speech database according to the voice information;
extracting text information received by a preset interface, and looking up target voice data in the voice data based on the text information;
synthesizing the target voice data into a voice sequence according to a speech synthesis instruction.
Further, before the step of acquiring the voice information in the environment, the method further includes:
performing noise reduction on the sound in the environment, including the voice information, according to the sound volume in the environment.
The step of acquiring the voice information in the environment includes:
extracting the voice information from the sound after noise reduction.
Further, the step of determining voice data in the preset speech database according to the voice information includes:
identifying the text content and sound quality information of the voice information;
detecting whether the preset speech database contains voice data corresponding to the text content;
if not, establishing a correspondence between the text content and the voice data in the current voice information, and storing the voice data in the current voice information into the preset speech database;
if so, determining voice data in the preset speech database based on the recognized sound quality information.
Further, the step of determining voice data in the preset speech database based on the recognized sound quality information includes:
detecting whether the sound quality of the voice data corresponding to the text content stored in the preset speech database is better than the recognized sound quality of the voice data in the current voice information;
if not, updating the voice data corresponding to the text content in the preset speech database to the voice data in the current voice information;
if so, refraining from updating the voice data corresponding to the text content to the voice data in the current voice information.
Further, after the step of determining voice data in the preset speech database according to the voice information, the method further includes:
establishing an association list between text contents and sentences.
The step of looking up target voice data in the voice data based on the text information includes:
performing word segmentation on the text information to obtain first text content of the text information, and matching a standard sentence in the association list;
looking up target voice data in the voice data stored in the preset speech database according to second text content in the standard sentence.
Further, after the step of acquiring the voice information in the environment, the method further includes:
performing voiceprint recognition on the voice data in the voice information to extract a voiceprint feature, and determining an output timbre based on the extracted voiceprint feature.
Further, the step of synthesizing the target voice data into a voice sequence according to the speech synthesis instruction includes:
detecting a sequence synthesis demand and a timbre synthesis demand carried in the speech synthesis instruction;
based on the sequence synthesis demand, combining the target voice data according to the character order of the first text content or of the second text content to form an initial voice sequence;
based on the timbre synthesis demand, adding the output timbre to the initial voice sequence to form a final voice sequence.
Based on the above hardware structure, the embodiments of the speech processing method of the present invention are proposed.
Referring to Fig. 2, the first embodiment of the speech processing method of the present invention includes:
Step S10: acquiring voice information in the environment, and determining voice data in a preset speech database according to the voice information.
In this embodiment, the terminal device can be a smartphone, tablet computer or similar device. When the user needs to record speech, the terminal device can enable a voice conversion mode based on a voice conversion mode instruction triggered by the user; of course, the terminal device can also enable the voice conversion mode automatically in some scenarios, for example starting it automatically when the terminal device enters a recording state.
After enabling the voice conversion mode, the terminal device receives the sound of the external environment through a microphone installed on the terminal device, filters out the voice information in the sound received by the microphone based on speech recognition, and records the filtered voice information.
Specifically, for example, when the external environment is noisy, the sound received by the microphone on the terminal device contains, in addition to the voice information that currently needs to be filtered out (i.e. the voice of the user speaking), many unwanted noises, such as the interfering sounds of car horns, clamour or running machines. The terminal device screens the received environmental sound to obtain the voice information. After obtaining the voice information, the terminal device pre-processes it, for example with text recognition and sound quality recognition, and then, based on the pre-processed voice information, determines the voice data of the currently acquired voice information in the preset speech database established in advance for voice data.
Further, before step S10, the speech processing method of the present invention further includes:
Step A: performing noise reduction on the sound in the environment, including the voice information, according to the sound volume in the environment.
In this embodiment, to improve the efficiency with which the terminal device acquires and processes voice information, before screening the voice information from the external environmental sound received by the microphone, the terminal device first automatically detects whether the sound volume in the external environment exceeds a preset volume value, where the preset volume value can be flexibly set according to the user's needs. If it detects that the volume value in the current external environment exceeds the preset volume value, the terminal device determines that the sound currently received by the microphone requires noise-reduction screening to obtain the needed voice information, and immediately performs noise-reduction filtering on the sound currently received by the microphone, for example by using a noise reduction algorithm.
In another embodiment, if it detects that the volume value in the current external environment is below the preset volume value, the terminal device determines that the sound currently received by the microphone does not require noise-reduction screening, and can directly record the currently received sound as the needed voice information.
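The volume-threshold decision of this embodiment can be sketched as below. RMS amplitude is an assumed stand-in for the patent's unspecified "sound volume" measure, and the default threshold value is illustrative.

```python
# Sketch of the embodiment's volume check: run noise reduction only when the
# ambient volume exceeds a preset (user-adjustable) threshold; otherwise the
# sound can be recorded directly as the voice information.

import math

def ambient_volume(samples):
    """Root-mean-square amplitude as an assumed volume measure."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def needs_noise_reduction(samples, preset_volume=0.3):
    return ambient_volume(samples) > preset_volume

quiet = [0.05, -0.04, 0.06]   # below threshold: record directly
noisy = [0.8, -0.7, 0.9]      # above threshold: noise-reduce first
```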
Further, in step S10, acquiring the voice information in the environment includes:
Step S101: extracting the voice information from the sound after noise reduction.
After detecting that the volume value in the current external environment exceeds the preset volume value and performing noise-reduction filtering on the sound currently received by the microphone, for example with a noise reduction algorithm, the terminal device extracts relatively clear voice information from the noise-reduced sound based on existing speech recognition technology.
Step S20: extracting the text information received by the preset interface, and looking up target voice data in the voice data based on the text information.
In this embodiment, the preset interface is a data interface preset for obtaining the text sentence information input by the user.
After detecting that the preset data interface has received the text sentence information input by the user, the terminal device extracts that text sentence information and performs word segmentation on it to obtain each text content of the text sentence information input by the user. Based on each text content, the terminal device then looks up, in the preset speech database established in advance for voice data, the target voice data corresponding to each text content of the text sentence information currently input by the user.
Step S30: synthesizing the target voice data into a voice sequence according to the speech synthesis instruction.
The terminal device obtains the speech synthesis instruction triggered by the user and detects the synthesis demand carried in it, i.e. how the user wishes to synthesize speech from the currently input text sentence information, for example a demand to polish the input text sentence and/or to polish the output timbre of the synthesized voice. Based on the detected synthesis demand, the found target voice data corresponding to each text content of the user's input text sentence information is combined to form a voice sequence that can be output as voice.
In this embodiment, by acquiring voice information from the sound of any external environment and processing it accordingly, voice data is determined in a preset speech database; text information is extracted from the preset interface for receiving the text input by the user; after basic processing of the text information, the target voice data corresponding to the extracted text information is found in the preset speech database in which voice data is saved in advance; then, according to the user demand in the speech synthesis instruction triggered by the user, the found target voice data is combined to form a voice sequence that meets the user's demand. This enables speech recognition and speech synthesis to be performed without being limited by factors such as scene and context, improves the efficiency of speech processing, and performs speech synthesis based on user customization and individual demand, improving the performance of speech processing.
Further, on the basis of the above first embodiment, a second embodiment of the speech processing method of the present invention is proposed. Referring to Fig. 3, in the second embodiment, the step in step S10 of determining voice data in the preset speech database according to the voice information includes:
Step S102: identifying the text content and sound quality information of the voice information.
The terminal device performs text recognition and sound quality recognition on the voice information recorded after noise-reduction filtering of the external environmental sound received by the microphone, thereby identifying the text content and sound quality information of the current voice information, where the sound quality information obtained by the terminal device for the current voice information includes volume and tone.
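A toy version of the sound quality identification described in step S102 (deriving volume and tone from recorded samples) might look like the following. Peak amplitude and zero-crossing counting are assumed proxies for volume and tone; the patent does not specify how either quantity is measured.

```python
# Illustrative sketch: derive (volume, tone) as the "sound quality
# information" of a recorded snippet. Zero-crossing rate is a crude,
# assumed stand-in for a real pitch/tone estimator.

def sound_quality(samples, sample_rate):
    """Return (volume, tone_hz) for a list of audio samples."""
    volume = max(abs(s) for s in samples)
    # Count sign changes between adjacent samples.
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    tone_hz = crossings * sample_rate / (2.0 * len(samples))
    return volume, tone_hz

# Half a cycle of a crude square wave sampled 8 times at 8 Hz.
wave = [0.5, 0.5, 0.5, 0.5, -0.5, -0.5, -0.5, -0.5]
vol, tone = sound_quality(wave, sample_rate=8)
```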
Whether step S103 detects in the default speech database containing voice number corresponding to the word content
According to.
The terminal device detects whether the preset speech database established in advance for voice data already stores voice data corresponding to word content identical to the word content identified from the currently recorded voice information. In the preset speech database, voice data are saved according to the correspondence between word content and voice data. For example, the preset speech database may store the word content "hello" together with the voice band corresponding to that word content; alternatively, the database may store only the word content recognized from recorded voice data, with the voice band data corresponding to the stored word content downloaded from other databases.
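As a concrete illustration of the storage scheme just described, the following sketch (all names are invented, not from the patent) models the preset speech database as a mapping from word content to voice-band data, with a missing entry signalling that the voice band would have to be downloaded from another database:

```python
# Hypothetical sketch of the preset speech database: voice-band data
# (modeled as bytes) stored under their recognized word content.
class PresetSpeechDatabase:
    def __init__(self):
        self._entries = {}  # word content -> voice band

    def contains(self, word_content):
        # Step S103: does the database hold voice data for this word content?
        return word_content in self._entries

    def save(self, word_content, voice_band):
        # Step S104: store the word content and its voice data correspondingly.
        self._entries[word_content] = voice_band

    def lookup(self, word_content):
        # None signals that the voice band must be fetched from elsewhere.
        return self._entries.get(word_content)

db = PresetSpeechDatabase()
db.save("hello", b"\x00\x01")
print(db.contains("hello"))          # True
print(db.lookup("goodbye") is None)  # True: would be downloaded online
```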
Step S104: establish the correspondence between the word content and the voice data in the current voice information, and store the voice data of the current voice information into the preset speech database.
If it is detected that the preset speech database does not store voice data (a voice band) corresponding to word content identical to the word content identified from the currently recorded voice information, the correspondence between the word content of the current voice information and the current voice data is established, and the word content and the voice data are saved correspondingly into the current preset speech database. Alternatively, when it is detected that the preset speech database does not even store word content identical to the identified word content of the currently recorded voice information, the identified word content is saved into the preset speech database, and when the voice data corresponding to that word content need to be output, they are downloaded online from other databases.
Step S105: determine voice data in the preset speech database based on the recognized sound quality information.
If it is detected that the preset speech database already stores voice data corresponding to word content identical to the word content of the currently recorded voice information, the terminal further detects whether the sound quality information of the voice data in the currently recorded voice information is better than that of the voice data saved in the preset speech database, and on that basis updates or replaces the voice data corresponding to the word content of the current voice information.
Further, step S105 comprises:
Step S1051: detect whether the sound quality information of the voice data corresponding to the word content stored in the preset speech database is better than the sound quality information of the voice data in the currently recognized voice information.
The terminal device detects whether, in the preset database established in advance for voice data, the sound quality information such as volume and tone of the stored voice data whose word content is identical to that of the currently recorded voice information is better than the sound quality information such as volume and tone of the voice data in the currently recognized voice information.
Step S1052: in the preset speech database, update the voice data corresponding to the word content to the voice data in the current voice information.
If the terminal device detects that the sound quality information such as volume and tone of the voice data in the currently recorded voice information is better than that of the voice data stored in the preset database under identical word content, the voice data previously saved for the current word content are deleted, and the voice data in the currently recorded voice information are saved into the current preset database instead.
Step S1053: abandon updating the voice data corresponding to the word content to the voice data in the current voice information.
If the terminal device detects that the sound quality information such as volume and tone of the voice data stored in the preset database, whose word content is identical to that of the currently recorded voice information, is better than that of the voice data in the currently recorded voice information, the terminal abandons deleting the currently stored voice data and does not re-store the voice data of the currently recorded voice information.
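The replace-or-discard decision of steps S1051 to S1053 can be sketched as follows; the patent names volume and tone as the sound quality information but gives no comparison formula, so this sketch reduces quality to a single invented numeric score:

```python
# Keep whichever voice data scores higher on sound quality.
# 'db' maps word content -> (voice band, quality score); all names invented.
def update_if_better(db, word_content, new_voice, new_quality):
    stored = db.get(word_content)
    if stored is None or new_quality > stored[1]:
        db[word_content] = (new_voice, new_quality)  # S1052: replace stored data
        return True
    return False                                     # S1053: discard new data

db = {"hello": (b"old", 0.6)}
print(update_if_better(db, "hello", b"new", 0.9))    # True: better, replaced
print(update_if_better(db, "hello", b"worse", 0.3))  # False: stored copy kept
```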
In this embodiment, text recognition and sound quality recognition are performed on the voice information recorded after noise-reduction filtering of the ambient sound received by the microphone, so that the word content and sound quality information of the current voice information are identified. It is then detected whether the preset speech database established in advance for voice data already stores voice data corresponding to word content identical to that identified from the currently recorded voice information. If not, the correspondence between the current word content and the current voice data is established, and the word content and voice data are saved correspondingly into the current preset speech database; if so, the terminal further detects whether the sound quality information of the voice data in the currently recorded voice information is better than that of the voice data saved in the preset speech database, and updates or replaces the voice data corresponding to the word content of the current voice information accordingly. In this way, previously stored voice data with identical word content are updated and replaced, and new voice data of recorded voice information not yet stored are saved correspondingly into the preset speech database, which ensures the accuracy of the voice data in the speech database and thus improves the efficiency of speech synthesis based on that voice data.
Further, on the basis of the first embodiment above, a third embodiment of the speech processing method of the present invention is proposed. Referring to FIG. 4, in the third embodiment, after voice data are determined in the preset speech database according to the voice information in step S10, the speech processing method of the present invention further comprises:
Step S40: establish an association list between word content and sentences.
The terminal device may establish the association list between word content and sentences in advance and store it in a corresponding memory. Based on this list, the terminal can embellish certain text entered by the user, so that even when the user quickly enters fuzzy text information, the terminal still obtains the exact word content for which speech synthesis is required.
In this embodiment, the association list between word content and sentences records the correspondence between common text and sentences. For example, if the word content of the fuzzy text information quickly entered by the user is "tomorrow movie", the terminal searches the association list for a sentence matching the content "tomorrow movie", for instance the matched sentence "go to the cinema together tomorrow".
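Matching a fuzzy input against the association list can be sketched as below; the list entries and the word-overlap matching criterion are assumptions for illustration, since the patent does not specify the matching rule:

```python
# Invented association list between word content and standard sentences.
association_list = {
    "tomorrow movie": "go to the cinema together tomorrow",
    "lunch noon": "let's have lunch together at noon",
}

def match_standard_sentence(fuzzy_text):
    # Pick the entry sharing the most words with the fuzzy input.
    words = set(fuzzy_text.split())
    best = max(association_list, key=lambda key: len(words & set(key.split())))
    if words & set(best.split()):
        return association_list[best]
    return fuzzy_text  # no match: keep the user's text unchanged

print(match_standard_sentence("tomorrow movie"))  # the matched standard sentence
```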
Further, in this embodiment, in step S20 above, the step of searching for target speech data in the voice data based on the text information comprises:
Step S201: perform word segmentation on the text information to obtain first word content of the text information, and match a standard sentence in the association list.
After the terminal device detects that the preset data interface has received text sentence information entered by the user, it performs word segmentation on that text sentence information to obtain the first word content of the text entered by the user, for example "tomorrow movie". Further, based on the demand for sentence embellishment carried in the detected speech synthesis instruction triggered by the user, the terminal matches the corresponding standard sentence, for example "go to the cinema together tomorrow", from the association list between word content and sentences established in advance.
Step S202: search for target speech data in the voice data stored in the preset speech database according to second word content in the standard sentence.
Further, the terminal device performs word segmentation on the matched standard sentence "go to the cinema together tomorrow" again, thereby obtaining the second word content of the standard sentence (i.e., its individual characters, rendered literally as "bright", "day", "one", "rise", "go", "see", "electric" and "shadow"), and then, based on each item of word content, searches the preset speech database established in advance for voice data for the target speech data corresponding to each second word content of the current standard sentence.
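Step S202 can be sketched as follows; English words stand in for the per-character units, and the database contents are invented for illustration:

```python
# Split the standard sentence into word-content units and look up the
# target speech data stored for each unit in the preset speech database.
speech_db = {"tomorrow": b"\x01", "go": b"\x02", "cinema": b"\x03"}

def find_target_speech(standard_sentence):
    units = standard_sentence.split()  # stands in for word segmentation
    return [speech_db[u] for u in units if u in speech_db]

print(find_target_speech("tomorrow go cinema"))  # [b'\x01', b'\x02', b'\x03']
```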
Further, in step S10 above, after the voice information in the environment is obtained, the speech processing method of the present invention further comprises:
Step B: perform voiceprint recognition on the voice data in the voice information to extract a voiceprint feature, and determine an output timbre based on the extracted voiceprint feature.
Based on technologies such as voiceprint recognition, the terminal extracts from the recorded voice information the voiceprint feature of the speaker whose voice information was received by the microphone of the current terminal device, and then, according to the currently extracted voiceprint feature, determines the output timbre in a timbre database established in advance for storing the different output timbres used when the synthesized voice sequence is output as voice.
Specifically, for example, the terminal detects whether the timbre database of different output timbres established in advance already stores an output timbre built from a voiceprint feature identical to that of the speaker of the voice information received by the microphone of the current terminal device. If no output timbre built from the identical voiceprint feature is stored, a new output timbre is immediately established in the timbre database based on the voiceprint feature of the current speaker, and the speaker information of that output timbre is marked; if an output timbre built from the identical voiceprint feature is already stored in the current timbre database, no new output timbre is established.
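The register-once behavior described here can be sketched as follows; the voiceprint feature is reduced to a hashable tuple and all names are invented:

```python
# Timbre database: one output timbre per distinct voiceprint feature,
# marked with the speaker who produced it.
timbre_db = {}

def register_timbre(voiceprint, speaker):
    if voiceprint in timbre_db:
        return timbre_db[voiceprint]  # timbre already exists: reuse it
    timbre_db[voiceprint] = speaker   # new timbre, marked with its speaker
    return speaker

print(register_timbre((0.2, 0.7), "alice"))  # alice
print(register_timbre((0.2, 0.7), "bob"))    # alice: duplicate voiceprint ignored
```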
Further, step S30 comprises:
Step S301: detect the sequence synthesis demand and the timbre synthesis demand carried in the speech synthesis instruction.
When the terminal device detects that the user has triggered the speech synthesis instruction through a preset instruction control, it obtains the speech synthesis instruction and detects the synthesis demands carried in it for performing speech synthesis on the currently entered text sentence information, for example the user's demand to embellish the entered text sentence information into a sentence and/or the user's demand to embellish the output timbre of the synthesized voice.
Step S302: based on the sequence synthesis demand, combine the target speech data according to the character order of the first word content or the character order of the second word content to form an initial voice sequence.
According to the sentence embellishment demand carried in the speech synthesis instruction triggered by the user, the terminal device combines, in character order, the target speech data found in the preset speech database for each item of word content: either in the character order of each item of word content in the text information "tomorrow movie" entered by the user, or in the character order of the second word content of the standard sentence "go to the cinema together tomorrow" matched from the entered text information, thereby forming the initial voice sequence.
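Step S302 reduces to ordered concatenation; in this sketch raw byte concatenation stands in for audio splicing, and all names are illustrative:

```python
# Combine the per-unit target speech data in the character order of the
# (first or second) word content to form the initial voice sequence.
def combine_in_order(units, speech_db):
    return b"".join(speech_db[u] for u in units if u in speech_db)

speech_db = {"bright": b"\x01", "day": b"\x02"}
initial_sequence = combine_in_order(["bright", "day"], speech_db)
print(initial_sequence)  # b'\x01\x02'
```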
Step S303: based on the timbre synthesis demand, add the output timbre to the initial voice sequence to form a final voice sequence.
According to the demand carried in the speech synthesis instruction triggered by the user to embellish the output timbre of the synthesized voice, the terminal device searches the timbre database established in advance, which stores the different output timbres used when the synthesized voice sequence is output as voice, for the target output timbre, and adds the voiceprint feature of the target output timbre to the initial voice sequence that has been synthesized, thereby forming the final voice sequence for voice output.
In another embodiment, if the terminal device detects that the speech synthesis instruction triggered by the user does not carry a demand to embellish the output timbre of the synthesized voice, no output-timbre voiceprint feature is added to the initial voice sequence that has been synthesized; that is, the voice sequence is output directly in a "machine voice".
In this embodiment, the sequence synthesis demand and the timbre synthesis demand carried in the speech synthesis instruction triggered by the user are detected. Based on the sequence synthesis demand, the target speech data found in the voice data stored in the preset speech database are combined, according to the character order of each item of word content in the text information entered by the user, or according to the character order of each character of the standard sentence matched from the association list between word content and sentences established in advance, to form the initial voice sequence. Further, based on the timbre synthesis demand, the required target output timbre is searched for in the output timbre database established by performing voiceprint recognition on the voice data in the voice information to extract voiceprint features, and the voiceprint feature of the target output timbre is added to the initial voice sequence that has been synthesized to generate the final voice sequence, so that voice output is performed on the synthesized voice sequence with the selected output timbre. In this way, the synthesis and output of voice are performed flexibly based on the user's different synthesis demands, the user's individualized customization needs are met, and the performance of voice synthesis and output is improved.
In addition, referring to FIG. 5, an embodiment of the present invention further proposes a speech processing apparatus, the speech processing apparatus comprising:
an obtaining module, configured to obtain voice information in the environment and determine voice data in a preset speech database according to the voice information;
a searching module, configured to extract text information received by a preset interface and search for target speech data in the voice data based on the text information; and
a synthesis module, configured to synthesize the target speech data into a voice sequence according to a speech synthesis instruction.
Preferably, the speech processing apparatus of the present invention further comprises:
a detection module, configured to perform noise reduction on the sound in the environment, including the voice information, according to the sound volume in the environment;
and the obtaining module comprises:
an extraction unit, configured to extract the voice information from the sound after noise reduction.
Preferably, the obtaining module further comprises:
a recognition unit, configured to identify the word content and sound quality information of the voice information;
a first detection unit, configured to detect whether the preset speech database contains voice data corresponding to the word content;
a first determination unit, configured to establish the correspondence between the word content and the voice data in the current voice information, and store the voice data of the current voice information into the preset speech database; and
a second determination unit, configured to determine voice data in the preset speech database based on the recognized sound quality information.
Preferably, the second determination unit comprises:
a second detection unit, configured to detect whether the sound quality information of the voice data corresponding to the word content stored in the preset speech database is better than the sound quality information of the voice data in the currently recognized voice information; and
an updating unit, configured to update, in the preset speech database, the voice data corresponding to the word content to the voice data in the current voice information; wherein the updating unit is further configured to abandon updating the voice data corresponding to the word content to the voice data in the current voice information.
Preferably, the speech processing apparatus of the present invention further comprises:
an establishing module, configured to establish the association list between word content and sentences;
and the searching module comprises:
a segmentation unit, configured to perform word segmentation on the text information to obtain the first word content of the text information, and to match a standard sentence in the association list; and
a searching unit, configured to search for target speech data in the voice data stored in the preset speech database according to the second word content in the standard sentence.
Preferably, the speech processing apparatus of the present invention further comprises:
a voiceprint extraction module, configured to perform voiceprint recognition on the voice data in the voice information to extract a voiceprint feature, and to determine an output timbre based on the extracted voiceprint feature.
Preferably, the synthesis module comprises:
a third detection unit, configured to detect the sequence synthesis demand and the timbre synthesis demand carried in the speech synthesis instruction;
a first synthesis unit, configured to combine, based on the sequence synthesis demand, the target speech data according to the character order of the first word content or the character order of the second word content to form an initial voice sequence; and
a second synthesis unit, configured to add, based on the timbre synthesis demand, the output timbre to the initial voice sequence to form a final voice sequence.
When running, each functional module of the speech processing apparatus proposed in this embodiment implements the steps of the speech processing method described above, which will not be repeated here.
In addition, an embodiment of the present invention further proposes a computer-readable storage medium on which a speech processing program is stored; when the speech processing program is executed by a processor, the steps of the speech processing method described above are implemented. For specific embodiments of the computer-readable storage medium of the present invention, reference may be made to the embodiments of the speech processing method above, which will not be repeated here.
It should be noted that, in this document, the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or system that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or system. In the absence of further restrictions, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or system that includes that element.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the advantages or disadvantages of the embodiments.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, can be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk or optical disc) as described above, including several instructions for causing a terminal device (which may be a mobile phone, computer, server, network device, etc.) to execute the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and are not intended to limit the scope of the invention. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.
Claims (10)
1. A speech processing method, characterized in that the speech processing method comprises:
obtaining voice information in an environment, and determining voice data in a preset speech database according to the voice information;
extracting text information received by a preset interface, and searching for target speech data in the voice data based on the text information; and
synthesizing the target speech data into a voice sequence according to a speech synthesis instruction.
2. The speech processing method according to claim 1, characterized in that before the step of obtaining the voice information in the environment, the method further comprises:
performing noise reduction on the sound in the environment, including the voice information, according to the sound volume in the environment;
and the step of obtaining the voice information in the environment comprises:
extracting the voice information from the sound after noise reduction.
3. The speech processing method according to claim 1, characterized in that the step of determining voice data in the preset speech database according to the voice information comprises:
identifying the word content and sound quality information of the voice information;
detecting whether the preset speech database contains voice data corresponding to the word content;
if not, establishing the correspondence between the word content and the voice data in the current voice information, and storing the voice data of the current voice information into the preset speech database; and
if so, determining voice data in the preset speech database based on the recognized sound quality information.
4. The speech processing method according to claim 3, characterized in that the step of determining voice data in the preset speech database based on the recognized sound quality information comprises:
detecting whether the sound quality information of the voice data corresponding to the word content stored in the preset speech database is better than the sound quality information of the voice data in the currently recognized voice information;
if not, updating, in the preset speech database, the voice data corresponding to the word content to the voice data in the current voice information; and
if so, abandoning updating the voice data corresponding to the word content to the voice data in the current voice information.
5. The speech processing method according to claim 1, characterized in that after the step of determining voice data in the preset speech database according to the voice information, the method further comprises:
establishing an association list between word content and sentences;
and the step of searching for target speech data in the voice data based on the text information comprises:
performing word segmentation on the text information to obtain first word content of the text information, and matching a standard sentence in the association list; and
searching for target speech data in the voice data stored in the preset speech database according to second word content in the standard sentence.
6. The speech processing method according to claim 1, characterized in that after the step of obtaining the voice information in the environment, the method further comprises:
performing voiceprint recognition on the voice data in the voice information to extract a voiceprint feature, and determining an output timbre based on the extracted voiceprint feature.
7. The speech processing method according to any one of claims 1 to 6, characterized in that the step of synthesizing the target speech data into a voice sequence according to the speech synthesis instruction comprises:
detecting the sequence synthesis demand and the timbre synthesis demand carried in the speech synthesis instruction;
combining, based on the sequence synthesis demand, the target speech data according to the character order of the first word content or the character order of the second word content to form an initial voice sequence; and
adding, based on the timbre synthesis demand, the output timbre to the initial voice sequence to form a final voice sequence.
8. A speech processing apparatus, characterized in that the speech processing apparatus comprises:
an obtaining module, configured to obtain voice information in an environment and determine voice data in a preset speech database according to the voice information;
a searching module, configured to extract text information received by a preset interface and search for target speech data in the voice data based on the text information; and
a synthesis module, configured to synthesize the target speech data into a voice sequence according to a speech synthesis instruction.
9. A terminal device, characterized in that the terminal device comprises a memory, a processor, and a speech processing program stored in the memory and executable on the processor, wherein when the speech processing program is executed by the processor, the steps of the speech processing method according to any one of claims 1 to 7 are implemented.
10. A storage medium, characterized in that a speech processing program is stored on the storage medium, wherein when the speech processing program is executed by a processor, the steps of the speech processing method according to any one of claims 1 to 7 are implemented.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910746794.9A CN110444190A (en) | 2019-08-13 | 2019-08-13 | Method of speech processing, device, terminal device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110444190A true CN110444190A (en) | 2019-11-12 |
Family
ID=68435210
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111611208A (en) * | 2020-05-27 | 2020-09-01 | 北京太极华保科技股份有限公司 | File storage and query method and device and storage medium |
CN111667812A (en) * | 2020-05-29 | 2020-09-15 | 北京声智科技有限公司 | Voice synthesis method, device, equipment and storage medium |
CN111752524A (en) * | 2020-06-28 | 2020-10-09 | 支付宝(杭州)信息技术有限公司 | Information output method and device |
CN113053352A (en) * | 2021-03-09 | 2021-06-29 | 深圳软银思创科技有限公司 | Voice synthesis method, device, equipment and storage medium based on big data platform |
CN113299271A (en) * | 2020-02-06 | 2021-08-24 | 菜鸟智能物流控股有限公司 | Voice synthesis method, voice interaction method, device and equipment |
CN113436605A (en) * | 2021-06-22 | 2021-09-24 | 广州小鹏汽车科技有限公司 | Processing method of vehicle-mounted voice synthesis data, vehicle-mounted electronic equipment and vehicle |
TWI767197B (en) * | 2020-03-10 | 2022-06-11 | 中華電信股份有限公司 | Method and server for providing interactive voice tutorial |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1345028A (en) * | 2000-09-18 | 2002-04-17 | 松下电器产业株式会社 | Speech sunthetic device and method |
CN1770261A (en) * | 2004-11-01 | 2006-05-10 | 英业达股份有限公司 | Speech synthesis system and method |
CN102117614A (en) * | 2010-01-05 | 2011-07-06 | 索尼爱立信移动通讯有限公司 | Personalized text-to-speech synthesis and personalized speech feature extraction |
CN106448665A (en) * | 2016-10-28 | 2017-02-22 | 努比亚技术有限公司 | Voice processing device and method |
CN107293284A (en) * | 2017-07-27 | 2017-10-24 | 上海传英信息技术有限公司 | A kind of phoneme synthesizing method and speech synthesis system based on intelligent terminal |
CN108173740A (en) * | 2017-11-30 | 2018-06-15 | 维沃移动通信有限公司 | A kind of method and apparatus of voice communication |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110444190A (en) | Method of speech processing, device, terminal device and storage medium | |
CN109087669B (en) | Audio similarity detection method and device, storage medium and computer equipment | |
CN110288077B (en) | Method and related device for synthesizing speaking expression based on artificial intelligence | |
CN111261144B (en) | Voice recognition method, device, terminal and storage medium | |
CN104168353B (en) | Bluetooth headset and its interactive voice control method | |
CN111508511A (en) | Real-time sound changing method and device | |
CN107705783A (en) | Speech synthesis method and device | |
CN107291690A (en) | Punctuation adding method and device, and device for punctuation adding | |
KR100339587B1 (en) | Song title selecting method for mp3 player compatible mobile phone by voice recognition | |
CN111583944A (en) | Sound changing method and device | |
CN107992485A (en) | Simultaneous interpretation method and device | |
CN103377651B (en) | Automatic speech synthesis device and method | |
CN106302933B (en) | Voice information processing method and terminal | |
CN114401417B (en) | Live stream object tracking method, device, equipment and medium thereof | |
CN110097890A (en) | Speech processing method and device, and device for speech processing | |
CN110149548A (en) | Video dubbing method, electronic device and readable storage medium | |
CN109801618A (en) | Audio information generation method and device | |
CN110428811B (en) | Data processing method and device and electronic equipment | |
CN107291704A (en) | Processing method and apparatus, and device for processing | |
CN113033245A (en) | Function adjusting method and device, storage medium and electronic equipment | |
CN107517313A (en) | Wake-up method and device, terminal and readable storage medium | |
CN101354886A (en) | Apparatus for recognizing speech | |
CN113113040B (en) | Audio processing method and device, terminal and storage medium | |
CN107112007A (en) | Speech recognition device and speech recognition method | |
CN106372203A (en) | Information response method and device for smart terminal and smart terminal |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20191112 ||