CN107644646A - Speech processing method, speech processing apparatus, and device for speech processing - Google Patents

Speech processing method, speech processing apparatus, and device for speech processing Download PDF

Info

Publication number
CN107644646A
CN107644646A CN201710892705.2A CN201710892705A
Authority
CN
China
Prior art keywords
data
speech
text data
user
editing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710892705.2A
Other languages
Chinese (zh)
Other versions
CN107644646B (en)
Inventor
陈小帅
张扬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd filed Critical Beijing Sogou Technology Development Co Ltd
Priority to CN201710892705.2A priority Critical patent/CN107644646B/en
Publication of CN107644646A publication Critical patent/CN107644646A/en
Application granted granted Critical
Publication of CN107644646B publication Critical patent/CN107644646B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Machine Translation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention provides a speech processing method, a speech processing apparatus, and a device for speech processing. One speech processing method includes: in response to initial speech data being selected, performing speech recognition on the initial speech data to obtain corresponding text data; in response to the text data being edited, obtaining the edited text data; and synthesizing the speech fragment data corresponding to the edited text data into target speech data. With embodiments of the present invention, editing on the basis of text data is more convenient in application environments where voice input is inconvenient, satisfying users' need to send voice messages while chatting and to edit those messages, and improving the user experience.

Description

Speech processing method, speech processing apparatus, and device for speech processing
Technical field
The present invention relates to the field of input method technology, and in particular to a speech processing method, a speech processing apparatus, a device for speech processing, and a computer-readable medium.
Background art
With the rapid development of Internet technology, it has become increasingly common for users to exchange information through smart terminals, and more and more users use the voice chat function of chat software. When voice-chatting, a user can record his or her voice by pressing the chat software's voice send button; when the button is released, the chat software automatically sends the voice to the other user.
Summary of the invention
However, the inventors found during research that existing chat software sends the user's voice automatically: as soon as the user releases the voice send button, the recorded voice is sent out. Although a user who misspeaks can recall the voice within two minutes, the user cannot edit his or her own voice. Moreover, after receiving voice sent by another user, if the user is in a quiet environment or cannot speak aloud (for example, during a meeting or a lecture), the user has no way to send the same or similar content in his or her own voice.
On this basis, the invention provides a speech processing scheme: speech recognition is performed on the voice a user wants to send, or on voice the user has selected; the recognized text is presented to the user for editing; and target speech data is generated from the text after the user's edits. In this way, the user's own speech data can be generated without the user making a sound; alternatively, when the user selects another user's speech data, that speech data can be converted into the user's own voice. This makes it easier for users to produce voice messages in scenarios where speaking aloud is unsuitable, improving both chat efficiency and the user's chat experience.
The invention also provides a speech processing apparatus to ensure that the above method can be implemented and applied in practice.
An embodiment of the invention provides a speech processing method, which includes:
in response to initial speech data being selected, performing speech recognition on the initial speech data to obtain corresponding text data;
in response to the text data being edited, obtaining the edited text data;
synthesizing the speech fragment data corresponding to the edited text data into target speech data.
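For orientation, the sketch below strings the three claimed steps together. It is a minimal illustration under assumed names: the toy voice bank, the recognize/edit/synthesize helpers, and the byte placeholders are inventions of this sketch, not the patent's implementation.

```python
# A minimal sketch of the claimed three-step flow. All names and data
# here are illustrative assumptions, not the patent's implementation.

VOICE_BANK = {            # segment text -> speech fragment (toy bytes)
    "afternoon": b"<frag-afternoon>",
    "what time": b"<frag-what-time>",
    "party":     b"<frag-party>",
}

def recognize(initial_speech: bytes) -> list[str]:
    """Stand-in for speech recognition plus word segmentation."""
    return ["Tuesday", "afternoon", "what time", "attend", "party"]

def edit(segments: list[str]) -> list[str]:
    """Stand-in for the user's edit: delete 'Tuesday' and 'attend'."""
    return [s for s in segments if s not in ("Tuesday", "attend")]

def synthesize(segments: list[str]) -> bytes:
    """Concatenate fragments in the display order of the edited segments."""
    return b"".join(VOICE_BANK[s] for s in segments)

target = synthesize(edit(recognize(b"<initial speech data>")))
```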
Optionally, performing speech recognition on the selected initial speech data to obtain the corresponding text data includes:
in response to the initial speech data being selected, recognizing the initial speech data as full text data;
segmenting the full text data to obtain segmented text data.
Optionally, synthesizing the speech fragment data corresponding to the edited text data into target speech data includes:
obtaining, from a preset voice bank, the speech fragment data corresponding to the edited segmented text data, where the voice bank stores segmented text data, speech fragment data, and the correspondence between the two;
synthesizing the obtained speech fragment data into target speech data according to the display order of the edited segmented text data.
Optionally, the initial speech data is the voice of a first user and the target speech data is the voice of a second user. Obtaining the speech fragment data corresponding to the edited segmented text data from the preset voice bank then includes:
looking up, in the preset voice bank, the second user's speech fragment data corresponding to the edited segmented text data;
correspondingly, synthesizing the speech fragment data corresponding to the edited text data into target speech data includes:
synthesizing the found speech data of the second user into target speech data according to the display order of the edited segmented text data.
Optionally, the method further includes:
for each edited text segment, judging whether the corresponding speech data of the second user can be found; if all of it can be found, synthesizing the found speech data of the second user into target speech data according to the order of the edited segmentation results;
if not all of it can be found, synthesizing the first user's speech fragment data corresponding to the segments that were not found, together with the second user's speech fragment data that was found, into target speech data according to the display order of the edited segmented text data.
Optionally, the initial speech data and the target speech data are both the voice of the first user. Obtaining the speech fragment data corresponding to the edited segmented text data from the preset voice bank then includes:
looking up, in the preset voice bank, the first user's speech fragment data corresponding to the edited segmented text data;
correspondingly, synthesizing the speech fragment data corresponding to the edited text data into target speech data includes:
synthesizing the found speech fragment data of the first user into target speech data according to the display order of the edited segmented text data.
Optionally, the editing includes: deleting, adding, modifying, and/or replacing.
Optionally, the method may further include: sending the target speech data to a recipient.
An embodiment of the invention further provides another speech processing method, which includes:
receiving original speech data to be processed;
in response to supplemental text data being entered for the original speech data, looking up the supplemental speech data corresponding to the supplemental text data;
synthesizing the original speech data and the supplemental speech data into target speech data.
Optionally, looking up the supplemental speech data corresponding to the supplemental text data in response to the supplemental text data being entered for the original speech data includes:
in response to the supplemental text data being entered for the original speech data, obtaining the supplemental text data;
segmenting the supplemental text data to obtain segmented text data;
looking up, in a preset voice bank, the speech fragment data corresponding to each text segment, where the voice bank stores segmented text data, speech fragment data, and the correspondence between the two.
Optionally, synthesizing the original speech data and the supplemental speech data into target speech data includes:
synthesizing the found speech fragment data into supplemental speech data according to the display order of the segmented text data;
synthesizing the original speech data and the supplemental speech data into target speech data according to the semantic relation between the original speech data and the supplemental speech data.
An embodiment of the invention further provides a speech processing apparatus, which includes:
a speech recognition unit, configured to perform speech recognition on initial speech data in response to the initial speech data being selected, to obtain corresponding text data;
an obtaining unit, configured to obtain the edited text data in response to the text data being edited;
a synthesis unit, configured to synthesize the speech fragment data corresponding to the edited text data into target speech data.
Optionally, the speech recognition unit includes:
a recognition subunit, configured to recognize the initial speech data as full text data in response to the initial speech data being selected; and a segmentation subunit, configured to segment the full text data to obtain segmented text data.
Optionally, the synthesis unit includes:
an obtaining subunit, configured to obtain, from a preset voice bank, the speech fragment data corresponding to the edited segmented text data, where the voice bank stores the correspondence between segmented text data and speech data; and a synthesis subunit, configured to synthesize the obtained speech fragment data into target speech data according to the display order of the edited segmented text data.
Optionally, the initial speech data is the voice of a first user and the target speech data is the voice of a second user. The obtaining unit is then configured to:
look up, in the preset voice bank, the second user's speech fragment data corresponding to the edited segmented text data. Correspondingly, the synthesis unit may be configured to synthesize the found speech data of the second user into target speech data according to the display order of the edited segmented text data.
Optionally, the synthesis unit further includes:
a judgment subunit, configured to judge, for each edited text segment, whether the corresponding speech data of the second user can be found; a first processing subunit, configured to, if all of it can be found, synthesize the found speech data of the second user into target speech data according to the order of the edited segmentation results; and a second processing subunit, configured to, if not all of it can be found, synthesize the first user's speech fragment data corresponding to the segments that were not found, together with the second user's speech fragment data that was found, into target speech data according to the display order of the edited segmented text data.
Optionally, the initial speech data and the target speech data are both the voice of the first user. The obtaining unit is then configured to: look up, in the preset voice bank, the first user's speech fragment data corresponding to the edited segmented text data. Correspondingly, the synthesis unit is configured to: synthesize the found speech fragment data of the first user into target speech data according to the display order of the edited segmented text data.
Optionally, the editing operation may include: a deletion, addition, modification, and/or replacement operation.
Optionally, the speech processing apparatus further includes:
a sending unit, configured to send the target speech data to a recipient.
An embodiment of the invention further provides another speech processing apparatus, which includes:
a receiving unit, configured to receive original speech data to be processed;
a lookup unit, configured to look up, in response to supplemental text data being entered for the original speech data, the supplemental speech data corresponding to the supplemental text data;
a synthesis unit, configured to synthesize the original speech data and the supplemental speech data into target speech data.
Optionally, the lookup unit includes:
an obtaining subunit, configured to obtain the supplemental text data in response to the supplemental text data being entered for the original speech data; a segmentation subunit, configured to segment the supplemental text data to obtain segmented text data; and a lookup subunit, configured to look up, in a preset voice bank, the speech fragment data corresponding to each text segment, where the voice bank stores segmented text data, speech fragment data, and the correspondence between the two.
Optionally, the synthesis unit includes:
a first synthesis subunit, configured to synthesize the found speech fragment data into supplemental speech data according to the display order of the segmented text data; and a second synthesis subunit, configured to synthesize the original speech data and the supplemental speech data into target speech data according to the semantic relation between the original speech data and the supplemental speech data.
An embodiment of the invention further provides a device for speech processing, which includes a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, and contain instructions for performing the following operations:
in response to initial speech data being selected, performing speech recognition on the initial speech data to obtain corresponding text data;
in response to the text data being edited, obtaining the edited text data;
synthesizing the speech fragment data corresponding to the edited text data into target speech data.
An embodiment of the invention further provides a device for speech processing, which includes a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, and contain instructions for performing the following operations:
receiving original speech data to be processed;
in response to supplemental text data being entered for the original speech data, looking up the supplemental speech data corresponding to the supplemental text data;
synthesizing the original speech data and the supplemental speech data into target speech data.
An embodiment of the invention further provides a computer-readable medium storing instructions that, when executed by one or more processors, cause a device to perform one or more of the speech processing methods described above.
In embodiments of the present invention, not only can a user send a voice message during a chat without repeating the content of the initial speech data; editing on the basis of text data is also more convenient in application environments where voice input is inconvenient, satisfying the user's need to send voice while chatting and to edit that voice, and improving the user experience. In addition, embodiments of the invention can segment the full text data of the initial speech data into segmented text data, allowing the user to operate at the level of individual segments, modify the initial speech data quickly, and thus synthesize the target speech data more efficiently.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a speech processing method embodiment of the present invention;
Fig. 2a and Fig. 2b are exemplary schematic diagrams of a display interface in an embodiment of the present invention;
Fig. 3 is a flowchart of another speech processing method embodiment of the present invention;
Fig. 4 is a flowchart of a further speech processing method embodiment of the present invention;
Fig. 5 is a structural block diagram of a speech processing apparatus embodiment of the present invention;
Fig. 6 is a block diagram of a device 800 for speech processing according to an exemplary embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a server in an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The present invention can be used in numerous general-purpose or special-purpose computing environments or configurations, such as personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor devices, and distributed computing environments including any of the above devices.
The present invention can be described in the general context of computer-executable instructions, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present invention can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in local and remote computer storage media, including storage devices.
Referring to Fig. 1, a flowchart of a speech processing method embodiment of the present invention is shown. The present embodiment may include the following steps:
Step 101: in response to initial speech data being selected, perform speech recognition on the initial speech data to obtain corresponding text data.
In the present embodiment, the initial speech data may be speech data the user produces while chatting with other users on a terminal. For example, user A sends speech data to user B through the chat software on a smartphone. Once user A has finished one piece of speech data (for example, a pause after a sentence may be taken to mark the end of a piece of speech data), and before it is sent to user B, the embodiment of the present invention can intercept this speech data and perform speech recognition on it, thereby obtaining the corresponding text data. Of course, in practice, the initial speech data may also be speech data user A has saved on the terminal, or speech data of other users that user A has received or saved, and so on. The source of the initial speech data is therefore not limited in the embodiments of the present invention.
Specifically, in practice, algorithms based on dynamic time warping (DTW), methods based on parametric hidden Markov models (HMM), methods based on non-parametric vector quantization (VQ), or algorithms based on artificial neural networks (ANN) as well as hybrid algorithms may be used. The choice of speech recognition method does not affect the implementation of the present invention, so those skilled in the art may select one freely.
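As one concrete possibility (the patent deliberately leaves the recognizer open), a sketch using the open-source Python SpeechRecognition package might look as follows; the file name is hypothetical and any DTW/HMM/VQ/ANN-based engine could be swapped in.

```python
# Illustration only: the patent does not prescribe a recognizer, so this
# uses the open-source SpeechRecognition package as one possible choice.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("initial_speech.wav") as source:  # hypothetical file
    audio = recognizer.record(source)

# Recognize Mandarin speech into full text data.
full_text = recognizer.recognize_google(audio, language="zh-CN")
print(full_text)
```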
Specifically, step 101 may include the following steps A1 to A2 when implemented:
Step A1: in response to the initial speech data being selected, recognize the initial speech data as full text data.
In the present embodiment, after speech recognition is performed on the intercepted speech data of user A to obtain the corresponding full text data, the full text data can be presented on a pop-up display interface. Specifically, this display interface can be different from the chat interface of the chat software user A is using. For example, suppose the full text data recognized from user A's initial speech data is "what time is the party on Tuesday afternoon". Referring to Fig. 2a, an exemplary schematic diagram of full text data shown on a display interface: the chat interface 201 between user A and user B (named Zhang San) can be used to show the conversation between the two, while the display interface 202 that presents the text data recognized from user A's initial speech data can be a separate interface different from the chat interface 201. The content shown on the display interface 202 is the text data obtained by the recognition in step 101, which user A can review and then edit.
Step A2: segment the full text data to obtain segmented text data.
In practice, to make it easier for the user to edit the text corresponding to the initial speech data, this step may first segment the full text data into segmented text data and then display the segments. For example, segmenting the full text data "what time is the party on Tuesday afternoon" can yield the segments "Tuesday", "afternoon", "what time", "attend", and "party". Referring to Fig. 2b, an exemplary schematic diagram of segmented text data shown on the display interface 202: in Fig. 2b, the segments are displayed one below the other, in the order in which user A spoke them. Of course, the segments may also be displayed in other arrangements, as long as the display order of the segments corresponds to the pronunciation order of the initial speech data.
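A minimal segmentation sketch, using the open-source jieba library on a plausible Chinese original for the Fig. 2 example (the sentence and the exact tokenization are assumptions; results depend on jieba's dictionary):

```python
# Word segmentation of the recognized full text, shown with jieba.
import jieba

full_text = "周二下午几点参加聚会"  # "what time is the party on Tuesday afternoon"
segments = jieba.lcut(full_text)
print(segments)  # e.g. ['周二', '下午', '几点', '参加', '聚会']
```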
It should be noted that Fig. 2a and Fig. 2b are exemplary data listed to explain step 101 and step 102 in detail, and should not be construed as limiting the embodiments of the present invention. The manner in which the chat interface 201 shows the chat record between the two users, and the manner in which the display interface 202 shows the full text data or the segmented text data, do not affect the implementation of the embodiments of the present invention.
Step 102: in response to the text data being edited, obtain the edited text data.
In this step, user A can edit the full text data or the segmented text data shown on the display interface 202. Taking editing the segmented text data as an example: the user deletes the two segments "Tuesday" and "attend" shown in Fig. 2b, with the order of the other segments unchanged; or the user adds the segment "where", so that the full text corresponding to the original segments becomes "what time and where is the party on Tuesday afternoon". Of course, in practice, besides deleting and adding, the user can also modify individual segments (for example, changing the segment "周二" to "星期二", two ways of saying "Tuesday") or adjust the order of the segments (for example, swapping the order of "what time" and "where"), and so on; these are not enumerated one by one in this embodiment.
Step 103: synthesize the speech fragment data corresponding to the edited text data into target speech data.
In practice, a voice bank can be generated in advance. The voice bank can store segmented text data, speech fragment data, and the correspondence between the two. The speech fragments stored in the voice bank can be user A's speech fragments, and can of course also be the speech fragments of other users. It can be understood that user A's speech fragments can be collected in advance: for example, text segments can be presented for user A to read aloud, and each of the user's speech fragments is saved; alternatively, user A's speech fragments can be learned from the speech data produced while user A voice-chats with other users.
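A sketch of such a preset voice bank: a per-user mapping from text segment to that user's speech fragment. The storage scheme (a plain dict keyed by user and segment) and all names are assumptions of this sketch.

```python
# Voice bank: user -> (text segment -> speech fragment).
voice_bank: dict[str, dict[str, bytes]] = {}

def add_fragment(user: str, segment: str, fragment: bytes) -> None:
    """Save one fragment, e.g. collected while the user reads prompts
    aloud, or learned from the user's past voice chats."""
    voice_bank.setdefault(user, {})[segment] = fragment

add_fragment("user_A", "会议室", b"<user A saying 'meeting room'>")
add_fragment("user_A", "开会", b"<user A saying 'meeting'>")
```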
Specifically, this step may include steps B1 to B2:
Step B1: obtain, from the preset voice bank, the speech fragment data corresponding to the edited segmented text data.
After user A finishes editing the segments, user A can trigger the generation of the new target speech data by pressing the trigger button 203 in Fig. 2a or Fig. 2b. In response to the user's trigger operation, the embodiment of the present invention can obtain the segments after user A's edits, such as "afternoon", "what time", and "party" in Fig. 2b, and can derive the order of the corresponding speech fragments from the display order of the segments on the display interface 202: "afternoon" is the first speech fragment, "what time" the second, and "party" the third.
It should be noted that if the voice bank stores only user A's speech fragments, what is obtained in this step is user A's speech fragments. If there is a voice bank storing the speech fragments of other users, user A can also add some text segments, obtain the other users' speech fragments corresponding to the added segments from that voice bank, and combine them with part or all of his or her own initial speech data into the target speech data.
Step B2: synthesize the obtained speech fragment data into target speech data according to the display order of the edited segmentation results.
That is, according to the order in which the edited segments are displayed on the display interface, the speech fragments corresponding to the individual segments are synthesized into the new target speech data. Thus, after user A presses the trigger button 203, the corresponding speech fragments are synthesized into the target speech data in the display order of the edited segments. Continuing the example from step B1, the generated voice says "what time is the party this afternoon".
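Synthesis by display order amounts to splicing audio fragments head to tail; a sketch with the pydub library follows. pydub is one possible choice, and the fragment file names are hypothetical.

```python
# Concatenate speech fragments in display order, sketched with pydub.
from pydub import AudioSegment

ordered_fragments = ["afternoon.wav", "what_time.wav", "party.wav"]
# sum() over AudioSegments splices them head to tail.
target = sum(
    (AudioSegment.from_wav(f) for f in ordered_fragments),
    AudioSegment.empty(),
)
target.export("target_speech.wav", format="wav")
```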
In practice, after the target speech data is generated, user A can send it to user B, with whom user A is chatting, or into a chat group, and so on. Step 103 may therefore be followed by step 104:
Step 104: send the target speech data to a recipient.
In this step, the new target speech data generated in step 103 is sent to the recipient, i.e. user B. Of course, in practice, the recipient need not be the user B with whom user A is chatting; it can also be a user C, another group of users, or any other user to whom user A sends the generated target speech data. The embodiments of the present invention do not limit this.
It can be seen that in the embodiments of the present invention, to make it convenient for a user to edit the speech data he or she produces while chatting, or the existing speech data of other users, speech recognition can be performed on the initial speech data the user selects; after the user edits the content of the text data, for example by deleting or adding part of the text or adjusting the order of part of the text, a new target speech data different from the initial speech data is generated from the edited text. Thus, the embodiments of the present invention not only allow the user to send voice without repeating the content of the initial speech data during a chat; editing on the basis of text data is also more convenient in application environments where voice input is inconvenient, satisfying the user's need to send voice while chatting and to edit that voice, and improving the user experience.
In addition, the embodiments of the present invention can display the full text data or segmented text data obtained by speech recognition, so that the user can operate at the level of individual segments, modify the initial speech data quickly, and synthesize the target speech data more efficiently.
Referring to Fig. 3, a flowchart of another speech processing method embodiment of the present invention is shown. The present embodiment may include the following steps:
Step 301: in response to a second user selecting the initial speech data of a first user, perform speech recognition on the first user's initial speech data to obtain corresponding full text data.
The difference between the present embodiment and the previous one is that here user A edits the speech data of another user B, so as to convert user B's speech data into his or her own voice. Assume the first user is user B and user A is the second user. First, user A and user B chat through chat software, and the first user, user B, sends a piece of initial speech data to user A. After receiving it, user A selects user B's initial speech data; in response to user A's selection, the embodiment of the present invention performs speech recognition on the initial speech data to obtain the corresponding text data.
It can be understood that user A can select the initial speech data sent by user B by long-pressing, double-clicking, 3D-touching, or similar means. The specific manner of selecting user B's initial speech data does not affect the implementation of the embodiments of the present invention.
Step 302: display the segmented text data of the full text data.
The full text data corresponding to user B's initial speech data is then segmented to obtain segmented text data, which is presented on the display interface. For example, after the initial speech data sent by user B is recognized, the corresponding full text data is "Lao Zhang, get your classmates organized and come to the 8th-floor meeting room for a meeting right away". Segmenting this full text data yields the segments: "Lao Zhang", "get", "your", "organized", "classmates", "right away", "come", "8th floor", "meeting room", "meeting". User A may currently be unable to speak aloud, so user A selects user B's initial speech data to trigger the display of the corresponding segmented text data.
Step 303: in response to the second user editing the segmented text data, obtain the edited segmented text data.
In this step, user A can edit the segmentation results shown on the display interface as needed. For example, user A deletes the segments "Lao Zhang", "get", "your", "organized", and "classmates" without adjusting the order of the remaining segments; the edited segmentation results are then: "right away", "come", "8th floor", "meeting room", "meeting". In practice, user A can also adjust the order of multiple segments, for example by dragging.
Step 304: look up, in a pre-established voice bank, the second user's speech fragment data corresponding to the edited segmented text data.
In this example, assume that a voice bank has been established in advance for user A. The voice bank can store user A's speech fragments, the text segments, and the correspondence between the two. For example, it saves the correspondence between the text segment "meeting" and user A's speech fragment for "meeting", the correspondence between the text "meeting room" and the corresponding speech data, and so on. Of course, any one-to-one mapping between speech fragments and text content suffices; for example, those skilled in the art can also number the speech fragments, with each number corresponding to exactly one speech fragment and one text segment, and so on.
This step then looks up, in the pre-established voice bank, user A's speech data corresponding to each of the segments "right away", "come", "8th floor", "meeting room", and "meeting".
Step 305: for each edited text segment, judge whether the corresponding speech fragment data of the second user can be found; if all of it can be found, proceed to step 306; if not all of it can be found, proceed to step 307.
That is, for each of the edited segments "right away", "come", "8th floor", "meeting room", and "meeting", judge whether the corresponding speech fragment can be found in the preset voice bank. If all can be found, proceed to step 306; if not all can be found, proceed to step 307.
Step 306: synthesize the found speech fragment data of the second user into target speech data according to the display order of the edited segmented text data.
After all of user A's speech fragments are found, they are synthesized in the order of the edited segmentation results: for the segments "right away", "come", "8th floor", "meeting room", "meeting", the order of "right away" is first, "come" second, and so on, with "meeting" last. Concatenating user A's speech fragments in this order yields the target speech data, i.e. user A's voice saying "come to the 8th-floor meeting room for a meeting right away". User A can also forward the synthesized target speech data to target users, for example users C and D, who need to go to the meeting room for the meeting.
Step 307: synthesize the first user's speech fragment data corresponding to the segments for which no match was found, together with the found speech fragment data of the second user, into target speech data according to the display order of the edited segments.
If not all of user A's speech fragments can be found, the found fragments of user A can be synthesized together with user B's speech fragments for the segments that have no match for user A, generating target speech data that contains some of user A's speech fragments and some of user B's. For example, if only the fragments for "right away" and "meeting room" are found for user A in the voice bank, user B's speech fragments can be used for the unmatched segments "come", "8th floor", and "meeting"; the segments are then synthesized in order to obtain the target speech data.
Of course, in the case where none of user A's speech fragments can be found, the target speech data can also be generated entirely from user B's speech fragments. Those skilled in the art can configure this in advance, and it is not limited here.
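A compact sketch of the per-segment lookup with fallback described in steps 305 to 307. All data and names are toy assumptions; bank_a holds user A's fragments and bank_b holds user B's fragments cut from the original speech, keyed by the same segments.

```python
# Prefer the second user's (user A's) fragments; fall back to the
# first user's (user B's) original fragments where no match exists.
def pick_fragments(segments, bank_a, bank_b):
    return [bank_a.get(seg, bank_b[seg]) for seg in segments]

bank_a = {"马上": b"<A:right away>", "会议室": b"<A:meeting room>"}
bank_b = {"马上": b"<B:right away>", "来": b"<B:come>", "8楼": b"<B:8F>",
          "会议室": b"<B:meeting room>", "开会": b"<B:meeting>"}

mixed = pick_fragments(["马上", "来", "8楼", "会议室", "开会"], bank_a, bank_b)
target = b"".join(mixed)  # splice in display order
```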
It can be seen that, in addition to facilitating voice operations during a user's chats, the present embodiment can also collect a user's speech data in advance, so that the speech data of other users can be converted into the user's own speech data and transmitted. The user can thus send speech data without speaking, satisfying the need to send messages in one's own voice in environments where speaking is inconvenient.
Referring to Fig. 4, a flowchart of a further speech processing method embodiment of the present invention is shown. The present embodiment may include the following steps:
Step 401: receive original speech data to be processed.
In this example, assume user A has received a piece of user B's speech data, which may be, for example, a speech by user B or an introduction to a product. User A wishes to evaluate or summarize user B's speech or introduction; user B's speech data can then serve as the original speech data of the present embodiment.
Step 402: in response to supplemental text data being entered for the original speech data, look up the supplemental speech data corresponding to the supplemental text data.
The difference between the present embodiment and the first two is that user B's speech data need not be converted into full text data or segmented text data. Instead, user B's original speech data is kept unchanged: user A keeps all of user B's original speech data and supplements user B's speech or introduction with his or her own viewpoint or evaluation. User A can input the supplemental text on an editing interface; the supplemental text data is then segmented into segmented text data, and the speech fragments corresponding to each segment are looked up in a preset voice bank, where the voice bank stores segmented text data, speech fragment data, and the correspondence between the two.
For example, user B's initial speech data is an introduction to the scenic spot Yuanmingyuan Park. User A can then input, on the provided editing interface, "the introduction to Yuanmingyuan Park above is very comprehensive and accurate; I hope everyone studies it carefully and takes notes". This sentence is segmented into several text segments, and the speech fragment corresponding to each segment is looked up directly in the voice bank. Of course, the supplemental text data need not be segmented: if the voice bank directly stores the supplemental speech data corresponding to the supplemental text, it can also be looked up directly. This is only a specific example and should not be construed as limiting the present invention.
Step 403: synthesize the original speech data and the supplemental speech data into target speech data.
Then, if the supplemental speech data corresponding to user A's supplemental text can be found directly, user B's original speech data and user A's complete speech data can be synthesized into one piece of target speech data. If what is found instead are the individual speech fragments corresponding to the segments of user A's text, these fragments can first be synthesized into user A's complete speech data, which is then spliced with user B's original speech data, producing a target speech data that contains the speech of both user B and user A.
It can, of course, be understood that when the target speech data is synthesized, the original speech data and the supplemental speech data can be ordered according to the semantic relation between them, with either the original speech data first or the supplemental speech data first. The semantic relation can be used to indicate the logical relation between the voice content of the original speech data and that of the supplemental speech data. For example, if the original speech data is an introduction to some scenery and the supplemental speech data is an evaluation of that introduction, the original speech data should be placed before the supplemental speech data when synthesizing the target speech data; whereas if the original speech data is a speaker's speech and the supplemental speech data is an introduction of that speaker, the supplemental speech data can be placed before the original speech data.
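A sketch of this ordering rule follows. The relation labels and function names are assumptions; the patent does not specify how the semantic relation itself is determined, so it is taken here as an input from some upstream classifier.

```python
# Order original and supplemental speech by their semantic relation:
# commentary follows the speech it evaluates; an introduction precedes it.
def order_by_relation(original: bytes, supplement: bytes, relation: str) -> bytes:
    if relation == "evaluation":    # e.g. a review of a scenery intro
        return original + supplement
    if relation == "introduction":  # e.g. introducing the speaker
        return supplement + original
    return original + supplement    # default: original first

target = order_by_relation(b"<scenery intro>", b"<user A's review>", "evaluation")
```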
It can also be understood that if user B's original speech data has multiple pieces, user A can evaluate or summarize each piece separately, and user B's pieces of original speech data can be spliced alternately with user A's evaluations or summaries, forming a target speech data that resembles a dialogue between user B and user A. For example, if user B has three pieces of original speech data B1, B2, and B3, and user A has an evaluation of each, respectively A1, A2, and A3, the above speech data can be synthesized into one piece of target speech data in the order B1, A1, B2, A2, B3, A3, as sketched below. Of course, the target speech data can also be synthesized in other orders, which those skilled in the art can set in advance.
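A minimal interleaving sketch for the dialogue-like order above; the byte placeholders stand in for real audio fragments.

```python
# Interleave multi-piece speech into B1, A1, B2, A2, B3, A3.
from itertools import chain

b_parts = [b"<B1>", b"<B2>", b"<B3>"]
a_parts = [b"<A1>", b"<A2>", b"<A3>"]
target = b"".join(chain.from_iterable(zip(b_parts, a_parts)))
```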
In the present embodiment, the current user can synthesize the original speech data of other users with his or her own speech data without modifying it, achieving the goal of splicing the speech data of different users into target speech data; the splicing can moreover be done piecewise in the form of a dialogue. This is better suited to application scenarios where the current user cannot conveniently input voice, improving the user's voice sending experience.
For brevity, the foregoing method embodiments are all expressed as a series of action combinations. However, those skilled in the art should know that the present invention is not limited by the described order of actions, because according to the present invention, some steps can be performed in other orders or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
Corresponding to the methods provided by the above speech processing method embodiments, and referring to Fig. 5, the present invention also provides a speech processing apparatus embodiment. In the present embodiment, the apparatus may include:
a speech recognition unit 501, configured to perform speech recognition on initial speech data in response to the initial speech data being selected, to obtain corresponding text data.
The speech recognition unit 501 may specifically include:
a recognition subunit, configured to recognize the initial speech data as full text data in response to the initial speech data being selected; and a segmentation subunit, configured to segment the full text data to obtain segmented text data.
An obtaining unit 502, configured to obtain the edited text data in response to the text data being edited.
The editing operation may include: a deletion, addition, modification, and/or replacement operation.
A synthesis unit 503, configured to synthesize the speech fragment data corresponding to the edited text data into target speech data.
The synthesis unit 503 may specifically include:
an obtaining subunit, configured to obtain, from a preset voice bank, the speech fragment data corresponding to the edited segmented text data, where the voice bank stores the correspondence between segmented text data and speech data; and a synthesis subunit, configured to synthesize the obtained speech fragment data into target speech data according to the display order of the edited segmented text data.
Optionally, the initial speech data is the voice of a first user and the target speech data is the voice of a second user. The obtaining unit 502 can then be configured to:
look up, in the preset voice bank, the second user's speech fragment data corresponding to the edited segmented text data. Correspondingly, the synthesis unit 503 can be configured to synthesize the found speech data of the second user into target speech data according to the display order of the edited segmented text data.
The synthesis unit 503 may further include:
a judgment subunit, configured to judge, for each edited text segment, whether the corresponding speech data of the second user can be found; a first processing subunit, configured to, if all of it can be found, synthesize the found speech data of the second user into target speech data according to the order of the edited segmentation results; and a second processing subunit, configured to, if not all of it can be found, synthesize the first user's speech fragment data corresponding to the segments that were not found, together with the second user's speech fragment data that was found, into target speech data according to the display order of the edited segmented text data.
Optionally, the initial speech data and the target speech data are both the voice of the first user. The obtaining unit 502 can then specifically be configured to: look up, in the preset voice bank, the first user's speech fragment data corresponding to the edited segmented text data. Correspondingly, the synthesis unit 503 can specifically be configured to: synthesize the found speech fragment data of the first user into target speech data according to the display order of the edited segmented text data.
The speech processing apparatus may further include:
a sending unit 504, configured to send the target speech data to a recipient.
It can be seen that in the embodiments of the present invention, speech recognition is performed on the initial speech data the user selects, and after the user edits the content of the text data, for example by deleting or adding part of the text or adjusting the order of part of the text, a new target speech data different from the initial speech data is generated from the edited text. Thus, the embodiments of the present invention not only allow the user to send voice without repeating the content of the initial speech data during a chat; editing on the basis of text data is also more convenient in application environments where input is inconvenient, satisfying the user's need to send voice while chatting and to edit that voice, and improving the user experience.
The present invention also provides another speech processing apparatus embodiment. In the present embodiment, the apparatus may include:
a receiving unit, configured to receive original speech data to be processed; a lookup unit, configured to look up, in response to supplemental text data being entered for the original speech data, the supplemental speech data corresponding to the supplemental text data; and a synthesis unit, configured to synthesize the original speech data and the supplemental speech data into target speech data.
The lookup unit may specifically include:
an obtaining subunit, configured to obtain the supplemental text data in response to the supplemental text data being entered for the original speech data; a segmentation subunit, configured to segment the supplemental text data to obtain segmented text data; and a lookup subunit, configured to look up, in a preset voice bank, the speech fragment data corresponding to each text segment, where the voice bank stores segmented text data, speech fragment data, and the correspondence between the two.
The synthesis unit may specifically include:
a first synthesis subunit, configured to synthesize the found speech fragment data into supplemental speech data according to the display order of the segmented text data; and a second synthesis subunit, configured to synthesize the original speech data and the supplemental speech data into target speech data according to the semantic relation between the original speech data and the supplemental speech data.
In the present embodiment, the current user can synthesize the original speech data of other users with his or her own speech data without modifying it, achieving the goal of splicing the speech data of different users into target speech data; the splicing can moreover be done piecewise in the form of a dialogue. This is better suited to application scenarios where the current user cannot conveniently input voice, improving the user's voice sending experience.
Regarding the apparatuses in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments concerning the methods, and will not be elaborated here.
Fig. 6 is a block diagram of a speech processing device 800 according to an exemplary embodiment. For example, the device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, fitness equipment, a personal digital assistant, or the like.
Referring to Fig. 6, the device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls the overall operations of the device 800, such as operations associated with display, telephone calls, data communication, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions, to complete all or some of the steps of the above methods. In addition, the processing component 802 may include one or more modules to facilitate interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the device 800. Examples of such data include instructions for any application or method operated on the device 800, contact data, phone book data, messages, pictures, videos, and so on. The memory 804 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The power component 806 provides power for the various components of the device 800. The power component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 800.
The multimedia component 808 includes a screen providing an output interface between the device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action, but also the duration and pressure related to the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the device 800 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focusing and optical zoom capabilities.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC), which is configured to receive external audio signals when the device 800 is in an operating mode, such as a call mode, a recording mode, or a speech recognition mode. The received audio signals may be further stored in the memory 804 or sent via the communication component 816. In some embodiments, the audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, or the like. These buttons may include but are not limited to: a home button, volume buttons, a start button, and a lock button.
The sensor component 814 includes one or more sensors for providing state assessments of various aspects of the device 800. For example, the sensor component 814 can detect the open/closed state of the device 800 and the relative positioning of components (for example, the display and keypad of the device 800); the sensor component 814 can also detect a change in position of the device 800 or a component of the device 800, the presence or absence of user contact with the device 800, the orientation or acceleration/deceleration of the device 800, and a change in temperature of the device 800. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the device 800 and other devices. The device 800 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 also includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module can be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the device 800 may be implemented by one or more application-specific integrated circuits (ASIC), digital signal processors (DSP), digital signal processing devices (DSPD), programmable logic devices (PLD), field programmable gate arrays (FPGA), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above methods.
In an exemplary embodiment, there is also provided a non-transitory computer-readable storage medium including instructions, such as the memory 804 including instructions, where the instructions are executable by the processor 820 of the device 800 to perform the above method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Also provided is a non-transitory computer-readable storage medium such that, when the instructions in the storage medium are executed by a processor of a mobile terminal, the mobile terminal is enabled to perform a speech processing method, the method including: in response to selection of initial speech data, performing speech recognition on the initial speech data to obtain corresponding text data; in response to editing of the text data, obtaining the edited text data; and synthesizing the speech fragment data corresponding to the edited text data into target speech data.
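The following is a minimal Python sketch of this three-step flow. It is an illustration only: the names recognize, lookup_fragment, and apply_edit are hypothetical placeholders, since the disclosure does not define a concrete programming interface.

```python
# Minimal sketch of the three-step flow; recognize() and lookup_fragment()
# are hypothetical placeholders, as the disclosure defines no concrete API.
from typing import Callable

def recognize(speech: bytes) -> list[str]:
    """Placeholder ASR: recognize speech into segmented text tokens."""
    raise NotImplementedError

def lookup_fragment(token: str) -> bytes:
    """Placeholder: fetch the speech fragment corresponding to one token."""
    raise NotImplementedError

def process(initial_speech: bytes,
            apply_edit: Callable[[list[str]], list[str]]) -> bytes:
    tokens = recognize(initial_speech)          # step 1: speech recognition
    edited = apply_edit(tokens)                 # step 2: user edits the text
    fragments = [lookup_fragment(t) for t in edited]
    return b"".join(fragments)                  # step 3: target speech data
```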
The performing speech recognition on the initial speech data in response to selection of the initial speech data to obtain corresponding text data may include:
in response to selection of the initial speech data, recognizing the initial speech data as full text data; and segmenting the full text data to obtain participle text data.
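As one possible illustration of the segmentation step, a general-purpose Chinese segmenter such as jieba could be used; the disclosure does not name a particular segmentation tool, so this choice is an assumption.

```python
# Illustrative segmentation of recognized full text into participle text
# data; jieba is only an assumed example tool, not named by the disclosure.
import jieba

full_text = "今天天气很好"              # full text data from speech recognition
tokens = list(jieba.cut(full_text))    # e.g. ["今天", "天气", "很好"]
print(tokens)
```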
The synthesizing of the speech fragment data corresponding to the edited text data into target speech data may include:
obtaining, from a preset speech library, the speech fragment data corresponding to the edited participle text data, where the speech library is used to store participle text data, speech fragment data, and the correspondence between the two; and synthesizing the obtained speech fragment data into target speech data according to the display order of the edited participle text data.
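A hedged sketch of such a speech library follows. The dictionary layout and byte-level concatenation are assumptions of this illustration; the disclosure only requires that the correspondence between participle text data and speech fragment data be stored.

```python
# Sketch of the preset speech library: a mapping from participle text data
# to speech fragment data. The dict layout and byte-level concatenation are
# assumptions of this illustration, not structures fixed by the disclosure.
speech_library: dict[str, bytes] = {
    "今天": b"<fragment bytes>",
    "天气": b"<fragment bytes>",
    "很好": b"<fragment bytes>",
}

def synthesize(edited_tokens: list[str]) -> bytes:
    # Concatenate fragments in the display order of the edited tokens.
    return b"".join(speech_library[t] for t in edited_tokens)

target_speech = synthesize(["今天", "天气", "很好"])
```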
Where the initial speech data is the voice of a first user and the target speech data is the voice of a second user, the obtaining, from the preset speech library, of the speech fragment data corresponding to the edited participle text data includes:
searching the preset speech library for the second user's speech fragment data corresponding to the edited participle text data;
and accordingly, the synthesizing of the speech fragment data corresponding to the edited text data into target speech data includes:
synthesizing the found speech data of the second user into target speech data according to the display order of the edited participle text data.
Where the initial speech data is the voice of the first user and the target speech data is the voice of the second user, the obtaining, from the preset speech library, of the speech fragment data corresponding to the edited participle text data may include: searching the preset speech library for the second user's speech fragment data corresponding to the edited participle text data. Accordingly, the synthesizing of the speech fragment data corresponding to the edited text data into target speech data may include: synthesizing the found speech data of the second user into target speech data according to the display order of the edited participle text data.
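A sketch of this cross-user lookup, under the assumption that the library is keyed by a (user_id, token) pair; this layout is introduced by the illustration, not fixed by the disclosure.

```python
# Sketch of the cross-user lookup: tokens were recognized from the first
# user's voice, but fragments are fetched under the second user's key.
# Keying the library by (user_id, token) is an assumption of this sketch.
user_speech_library: dict[tuple[str, str], bytes] = {
    ("user2", "今天"): b"<fragment bytes>",
    ("user2", "天气"): b"<fragment bytes>",
}

def synthesize_as(user_id: str, edited_tokens: list[str]) -> bytes | None:
    fragments = []
    for t in edited_tokens:
        frag = user_speech_library.get((user_id, t))
        if frag is None:
            return None  # a fragment is missing for this user
        fragments.append(frag)
    # Fragments are joined in the display order of the edited tokens.
    return b"".join(fragments)
```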
The device 800 may also be configured such that the one or more programs, executed by the one or more processors, contain instructions for performing the following operations:
for each item of edited participle text data, determining whether the corresponding speech data of the second user can be found; if all of it can be found, synthesizing the found speech data of the second user into target speech data according to the order of the edited word segmentation result data; and if not all of it can be found, synthesizing the first user's speech fragment data corresponding to the unmatched edited participle text data, together with the second user's speech fragment data that was found, into target speech data according to the display order of the edited participle text data.
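A sketch of this fallback, again under the assumed (user_id, token) keying from the previous sketch:

```python
# Sketch of the fallback: prefer the second user's fragment for each token
# and fall back to the first user's own fragment when none exists. The
# (user_id, token) keying is the same assumption as in the previous sketch.
user_speech_library: dict[tuple[str, str], bytes] = {
    ("user1", "今天"): b"<fragment bytes>",
    ("user1", "天气"): b"<fragment bytes>",
    ("user2", "今天"): b"<fragment bytes>",
}

def synthesize_mixed(edited_tokens: list[str],
                     first_id: str, second_id: str) -> bytes:
    fragments = []
    for t in edited_tokens:
        frag = user_speech_library.get((second_id, t))
        if frag is None:
            # Missing for the second user: reuse the first user's fragment.
            frag = user_speech_library[(first_id, t)]
        fragments.append(frag)
    return b"".join(fragments)  # joined in display order

target_speech = synthesize_mixed(["今天", "天气"], "user1", "user2")
```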
Where the initial speech data and the target speech data are both the voice of the first user, the obtaining, from the preset speech library, of the speech fragment data corresponding to the edited participle text data may include: searching the preset speech library for the first user's speech fragment data corresponding to the edited participle text data. Accordingly, the synthesizing of the speech fragment data corresponding to the edited text data into target speech data may include: synthesizing the found speech fragment data of the first user into target speech data according to the display order of the edited participle text data.
The editing may include: deleting, adding, modifying, and/or replacing.
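For illustration, the four operations applied to a participle list might look as follows; representing the edited text as a Python list is an assumption of this sketch.

```python
# Tiny illustration of the four edit operations on the participle list;
# the list-based representation is an assumption of this illustration.
tokens = ["今天", "天气", "很好"]

tokens.insert(0, "北京")                                  # add
del tokens[2]                                             # delete
tokens[1] = "明天"                                        # modify
tokens = ["不错" if t == "很好" else t for t in tokens]   # replace
```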
The device 800 may also be configured such that the one or more programs, executed by the one or more processors, contain an instruction for performing the following operation:
sending the target speech data to a recipient.
Fig. 7 is a schematic structural diagram of a server in an embodiment of the present invention. The server 1900 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 1922 (for example, one or more processors), a memory 1932, and one or more storage media 1930 (for example, one or more mass storage devices) storing application programs 1942 or data 1944. The memory 1932 and the storage medium 1930 may provide transient or persistent storage. The programs stored in the storage medium 1930 may include one or more modules (not shown), and each module may include a series of instruction operations on the server. Further, the central processing unit 1922 may be configured to communicate with the storage medium 1930 and to execute, on the server 1900, the series of instruction operations in the storage medium 1930.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
Other embodiments of the invention will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. The present application is intended to cover any variations, uses, or adaptations of the invention that follow the general principles of the invention and include common knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the invention being indicated by the following claims.
It should be understood that the invention is not limited to the precise structures that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the invention is limited only by the appended claims.
The foregoing are merely preferred embodiments of the present invention and are not intended to limit the invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (10)

1. A speech processing method, characterized in that the method comprises:
in response to selection of initial speech data, performing speech recognition on the initial speech data to obtain corresponding text data;
in response to editing of the text data, obtaining the edited text data; and
synthesizing the speech fragment data corresponding to the edited text data into target speech data.
2. The method according to claim 1, characterized in that the performing speech recognition on the initial speech data in response to selection of the initial speech data to obtain corresponding text data comprises:
in response to selection of the initial speech data, recognizing the initial speech data as full text data; and
segmenting the full text data to obtain participle text data.
3. The method according to claim 2, characterized in that the synthesizing the speech fragment data corresponding to the edited text data into target speech data comprises:
obtaining, from a preset speech library, the speech fragment data corresponding to the edited participle text data, wherein the speech library is used to store participle text data, speech fragment data, and the correspondence between the two; and
synthesizing the obtained speech fragment data into target speech data according to the display order of the edited participle text data.
4. A speech processing method, characterized in that the method comprises:
receiving raw voice data to be processed;
in response to triggering of supplemental text data for the raw voice data, searching for the supplemental speech data corresponding to the supplemental text data; and
synthesizing the raw voice data and the supplemental speech data into target speech data.
5. A speech processing device, characterized in that the device comprises:
a speech recognition unit, configured to perform speech recognition on initial speech data in response to selection of the initial speech data, to obtain corresponding text data;
an acquiring unit, configured to obtain the edited text data in response to editing of the text data; and
a synthesis unit, configured to synthesize the speech fragment data corresponding to the edited text data into target speech data.
6. A speech processing device, characterized in that the device comprises:
a receiving unit, configured to receive raw voice data to be processed;
a searching unit, configured to search, in response to triggering of supplemental text data for the raw voice data, for the supplemental speech data corresponding to the supplemental text data; and
a synthesis unit, configured to synthesize the raw voice data and the supplemental speech data into target speech data.
7. A device for speech processing, characterized by comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs containing instructions for performing the following operations:
in response to selection of initial speech data, performing speech recognition on the initial speech data to obtain corresponding text data;
in response to editing of the text data, obtaining the edited text data; and
synthesizing the speech fragment data corresponding to the edited text data into target speech data.
8. A device for speech processing, characterized by comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs containing instructions for performing the following operations:
receiving raw voice data to be processed;
in response to triggering of supplemental text data for the raw voice data, searching for the supplemental speech data corresponding to the supplemental text data; and
synthesizing the raw voice data and the supplemental speech data into target speech data.
9. A computer-readable medium having instructions stored thereon which, when executed by one or more processors, cause a device to perform the speech processing method according to any one of claims 1 to 3.
10. A computer-readable medium having instructions stored thereon which, when executed by one or more processors, cause a device to perform the speech processing method according to claim 4.
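A hedged illustration of the supplement flow recited in claim 4 above follows; the dictionary bank and the decision to append the supplement at the end are assumptions of this sketch, since the claim does not fix where the supplement speech is inserted.

```python
# Hedged sketch of claim 4: keep the raw voice data as-is, look up the
# supplement speech for the triggered supplemental text, and join the two.
# The dict bank and appending at the end are assumptions of this sketch.
supplement_bank: dict[str, bytes] = {
    "我在开会": b"<fragment bytes>",
}

def supplement(raw_voice: bytes, supplemental_text: str) -> bytes:
    supplement_voice = supplement_bank[supplemental_text]
    return raw_voice + supplement_voice  # target speech data
```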
CN201710892705.2A 2017-09-27 2017-09-27 Voice processing method and device for voice processing Active CN107644646B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710892705.2A CN107644646B (en) 2017-09-27 2017-09-27 Voice processing method and device for voice processing

Publications (2)

Publication Number Publication Date
CN107644646A true CN107644646A (en) 2018-01-30
CN107644646B CN107644646B (en) 2021-02-02

Family

ID=61113673

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710892705.2A Active CN107644646B (en) 2017-09-27 2017-09-27 Voice processing method and device for voice processing

Country Status (1)

Country Link
CN (1) CN107644646B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008257116A (en) * 2007-04-09 2008-10-23 Seiko Epson Corp Speech synthesis system
EP2224705A1 (en) * 2009-02-27 2010-09-01 Research In Motion Limited Mobile wireless communications device with speech to text conversion and related method
CN103369122A (en) * 2012-03-31 2013-10-23 盛乐信息技术(上海)有限公司 Voice input method and system
CN103366742A (en) * 2012-03-31 2013-10-23 盛乐信息技术(上海)有限公司 Voice input method and system
CN104240703A (en) * 2014-08-21 2014-12-24 广州三星通信技术研究有限公司 Voice message processing method and device
CN106847256A (en) * 2016-12-27 2017-06-13 苏州帷幄投资管理有限公司 A kind of voice converts chat method
CN106971723A (en) * 2017-03-29 2017-07-21 北京搜狗科技发展有限公司 Method of speech processing and device, the device for speech processes

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108364653A (en) * 2018-02-12 2018-08-03 王磊 Voice data processing method and processing unit
CN108632465A (en) * 2018-04-27 2018-10-09 维沃移动通信有限公司 A kind of method and mobile terminal of voice input
CN109036422A (en) * 2018-08-10 2018-12-18 科大讯飞股份有限公司 A kind for the treatment of method and apparatus of voice data
CN110895575B (en) * 2018-08-24 2023-06-23 阿里巴巴集团控股有限公司 Audio processing method and device
CN110895575A (en) * 2018-08-24 2020-03-20 阿里巴巴集团控股有限公司 Audio processing method and device
CN109801648A (en) * 2018-12-11 2019-05-24 平安科技(深圳)有限公司 Message pop-up voice edition method, device, computer equipment and storage medium
CN110265016A (en) * 2019-06-25 2019-09-20 百度在线网络技术(北京)有限公司 Voice interactive method and device
CN110767209B (en) * 2019-10-31 2022-03-15 标贝(北京)科技有限公司 Speech synthesis method, apparatus, system and storage medium
CN110767209A (en) * 2019-10-31 2020-02-07 标贝(北京)科技有限公司 Speech synthesis method, apparatus, system and storage medium
CN112837668A (en) * 2019-11-01 2021-05-25 北京搜狗科技发展有限公司 Voice processing method and device for processing voice
CN112837668B (en) * 2019-11-01 2023-04-28 北京搜狗科技发展有限公司 Voice processing method and device for processing voice
CN111009244A (en) * 2019-12-06 2020-04-14 贵州电网有限责任公司 Voice recognition method and system
CN111369994A (en) * 2020-03-16 2020-07-03 维沃移动通信有限公司 Voice processing method and electronic equipment
CN111369994B (en) * 2020-03-16 2023-08-29 维沃移动通信有限公司 Voice processing method and electronic equipment
CN111489741A (en) * 2020-04-07 2020-08-04 四川虹美智能科技有限公司 Method and device for managing voice library
CN111489741B (en) * 2020-04-07 2022-09-06 四川虹美智能科技有限公司 Method and device for managing voice library
CN113421547A (en) * 2021-06-03 2021-09-21 华为技术有限公司 Voice processing method and related equipment
WO2022253061A1 (en) * 2021-06-03 2022-12-08 华为技术有限公司 Voice processing method and related device

Also Published As

Publication number Publication date
CN107644646B (en) 2021-02-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant