Specific Embodiments
In order to make the foregoing objectives, features and advantages of the present invention clearer and more comprehensible, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
To address the technical problem that recording conventional video news takes a long time, an embodiment of the present invention provides a scheme for generating, by machine, a target video corresponding to a text. The scheme may specifically include: receiving a text; determining a target voice sequence corresponding to the text; determining, according to a mapping relation between speech feature sequences and image feature sequences, a target image sequence corresponding to the target voice sequence; and fusing the target voice sequence and the target image sequence to obtain the corresponding target video.
The embodiment of the present invention can be applied to content expression scenes such as news reporting scenes, teaching scenes, medical scenes, customer service scenes, and legal scenes.
The target video of the embodiment of the present invention may include: the target voice sequence corresponding to the text, and the target image sequence corresponding to the target voice sequence.
In the embodiment of the present invention, the text may involve at least two languages, for example, at least two of Chinese, Japanese, Korean, English, French, German, Arabic, and the like. The target voice sequence and target image sequence then also involve at least two languages, so the embodiment of the present invention is applicable to multilingual content expression scenes.
For example, in a news reporting scene, the text may be a news release. The news release may involve a first language and a second language of the country where the news event occurred; for instance, the first language may be Chinese and the second language may be English. Of course, in addition to the first language and the second language, the text may also involve a third language, a fourth language, and so on.
As another example, in a customer service scene, the text may be a question text input by a user, which may involve a first language that is the user's mother tongue and a second language that is not. For instance, if the question text relates to a computer fault, it may include an English text corresponding to the fault together with the user's conclusion and summary in Chinese.
As yet another example, in a meeting presiding scene, the text may be a conference speech script, which may involve the multiple languages of the multilingual users attending the conference.
It can be appreciated that a text involving at least two languages can be applied to any content expression scene, and the embodiment of the present invention places no restriction on the specific content expression scene.
In practical applications, TTS (Text To Speech) technology can be used to convert the text into the target voice corresponding to the target voice sequence, and the target voice sequence can be characterized in the form of a waveform. It can be appreciated that a target voice sequence meeting the demand can be obtained according to speech synthesis parameters.
Optionally, the speech synthesis parameters may include at least one of a timbre parameter, a pitch parameter, and a loudness parameter.
The timbre parameter refers to the distinguishing characteristic that the frequencies of different sounds show in their waveforms. Different sounding bodies usually correspond to different timbres, so a target voice sequence matching the timbre of a target sounding body can be obtained according to the timbre parameter. The target sounding body can be specified by the user; for example, the target sounding body may be a designated media worker. In practical applications, the timbre parameter of the target sounding body can be obtained from audio of the target sounding body of a preset length.
The pitch parameter characterizes the tone of the sound and is measured by frequency. The loudness parameter, also called sound intensity or volume, refers to the magnitude of the sound and is measured in decibels (dB).
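For illustration only, the relation between waveform amplitude and loudness in decibels can be sketched as follows. The reference amplitude and the RMS convention are assumptions for the example; the text only states that loudness is measured in dB.

```python
import math

def loudness_db(samples, ref=1.0):
    """Root-mean-square loudness of a waveform, in decibels relative to
    a reference amplitude (a common convention, assumed here)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms / ref)

# A full-scale square wave has RMS 1.0, i.e. 0 dB relative to ref=1.0.
print(loudness_db([1.0, -1.0, 1.0, -1.0]))             # 0.0
# Halving the amplitude lowers loudness by about 6.02 dB.
print(round(loudness_db([0.5, -0.5, 0.5, -0.5]), 2))   # -6.02
```

Such a loudness parameter could then be one input among the speech synthesis parameters described above.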
The target image sequence can be used to characterize an entity image. An entity is a thing that is distinguishable and exists independently; entities may include people, robots, animals, plants, and so on. The embodiment of the present invention mainly takes a person as an example to illustrate the target image sequence; target image sequences corresponding to other entities can be treated by analogy. The entity image corresponding to a person may be called a portrait.
From the perspective of entity state, the above image features may include entity state features, which can reflect the features of an image sequence in terms of entity state.
Optionally, the above entity state features may include at least one of the following:
expression features;
lip features; and
limb features.
Expression is the outward showing of one's sentiments and affections, and can refer to the thoughts and feelings shown on the face. Expression features are usually directed at the entire face. Lip features are directed specifically at the lips and match the textual content, voice, and articulation of the target voice sequence, so they can improve the naturalness of the expression corresponding to the image sequence.
Limb features convey a character's thoughts through the coordinated activity of body parts such as the head, eyes, neck, hands, elbows, arms, torso, hips, and feet, so as to communicate views visually. Limb features may include turning the head, shrugging, gestures, and so on, and can improve the richness of the expression corresponding to the image sequence. For example, at least one arm may hang naturally when speaking, and at least one arm may rest naturally on the abdomen when silent.
The speech feature sequence may include: linguistic features and/or acoustic features.
Linguistic features may include phoneme features. A phoneme is the smallest unit of speech divided according to the natural attributes of speech; analyzed in terms of the articulatory actions within a syllable, one action constitutes one phoneme. Phonemes may include vowels and consonants.
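As a minimal sketch of phoneme features, a text can be decomposed into a phoneme sequence through a pronunciation lexicon. The two-word lexicon and the ARPAbet-style symbols below are invented for illustration; a real system would use a full lexicon or a grapheme-to-phoneme model.

```python
# Toy pronunciation lexicon; entries and symbols are illustrative only.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def to_phonemes(text):
    """Map each word of the text to its phoneme sequence via the lexicon;
    unknown words yield a placeholder symbol."""
    seq = []
    for word in text.lower().split():
        seq.extend(LEXICON.get(word, ["<unk>"]))
    return seq

print(to_phonemes("Hello world"))
# ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
```

The resulting phoneme sequence is one possible form of the linguistic features named above.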
Acoustic features characterize speech from the perspective of sound production. Acoustic features may include, but are not limited to, the following:
prosodic features (supra-segmental / paralinguistic features), specifically including duration-related features, fundamental-frequency-related features, energy-related features, and so on;
voice quality features;
spectrum-based correlation features, which embody the correlation between vocal tract shape changes and articulatory movements; the main spectrum-based features at present include linear prediction cepstral coefficients (LPCC), mel-frequency cepstral coefficients (MFCC), and so on.
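For illustration, one of the simplest acoustic features named above, short-time (frame-level) energy, can be computed directly from the waveform. The frame length and hop values are arbitrary choices for the example.

```python
import numpy as np

def frame_energy(signal, frame_len, hop):
    """Short-time energy, one value per analysis frame -- a simple
    instance of the 'energy-related' prosodic features in the text."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array([float(np.sum(f ** 2)) for f in frames])

sig = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 0.0])
print(frame_energy(sig, 2, 2))  # [0. 2. 0.]
```

Fundamental-frequency and duration features would be extracted per frame in the same fashion, yielding one acoustic feature vector per frame of the speech feature sequence.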
The embodiment of the present invention can obtain the mapping relation between speech feature sequences and image feature sequences according to speech samples and image samples aligned on the time axis.
There are rules governing the relationship between speech feature sequences and image feature sequences. For example, a specific phoneme feature corresponds to a specific lip feature; as another example, a specific prosodic feature corresponds to a specific expression feature; alternatively, a specific phoneme feature corresponds to a specific limb feature; and so on.
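The phoneme-to-lip rule mentioned above can be sketched, in its simplest table-lookup form, as follows. The phoneme symbols and lip-feature labels are invented for illustration; the actual mapping relation is learned from aligned samples rather than hand-written.

```python
# Illustrative rule table: each phoneme feature maps to a lip feature.
PHONEME_TO_LIP = {
    "AA": "wide_open",   # open vowel -> mouth wide open
    "M":  "closed",      # bilabial nasal -> lips closed
    "UW": "rounded",     # rounded vowel -> lips rounded
}

def image_feature_sequence(speech_features):
    """Map a speech (phoneme) feature sequence to an image (lip) feature
    sequence, frame by frame; unknown phonemes fall back to neutral."""
    return [PHONEME_TO_LIP.get(p, "neutral") for p in speech_features]

print(image_feature_sequence(["M", "AA", "UW"]))
# ['closed', 'wide_open', 'rounded']
```

A learned mapping relation generalizes this idea: instead of a fixed table, a model predicts an image feature vector for each speech feature vector.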
Therefore, the embodiment of the present invention can obtain the mapping relation according to speech samples and image samples aligned on the time axis, so that the mapping relation reflects the rules between speech feature sequences and image feature sequences. The rules reflected by the mapping relation are applicable to any language, and therefore to a text corresponding to at least two languages.
The embodiment of the present invention can use an end-to-end machine learning method to learn from the time-axis-aligned speech samples and image samples, so as to obtain the above mapping relation. The input of the end-to-end machine learning method can be a voice sequence and the output can be an image sequence; by learning from training data, the method obtains the rules between the input features and the output features.
Broadly speaking, machine learning is a method of endowing machines with the ability to learn, allowing them to accomplish functions that direct programming cannot. In a practical sense, machine learning is a method of training a model with data and then making predictions with the model. Machine learning methods may include decision tree methods, linear regression methods, logistic regression methods, neural network methods, and so on; it can be understood that the embodiment of the present invention places no restriction on the specific machine learning method.
Aligning the speech samples and image samples on the time axis can improve the synchronization between speech features and image features.
In one embodiment of the present invention, the speech samples and the image samples may originate from the same video file, whereby their alignment on the time axis is realized. For example, recorded video files may be collected, each of which may include the voice of a sounding body and the video pictures of the sounding body.
In another embodiment of the present invention, the speech samples and image samples may originate from different files; specifically, the speech samples may originate from an audio file, and the image samples may originate from a video file or an image file, where an image file may include multiple frames of images. In such cases, time-axis alignment can be performed on the speech samples and image samples to obtain time-axis-aligned speech samples and image samples.
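One simple way to realize the time-axis alignment described above is to associate each audio analysis frame with its nearest video frame. The hop size and frame rate below are assumptions for the example, and integer arithmetic is used so the result is exact; the text does not fix a particular alignment method.

```python
def align_frames(num_audio_frames, hop_ms, fps):
    """For each audio analysis frame (hop given in milliseconds), return
    the index of the nearest video frame (ties round up), so that speech
    features and image features line up on the time axis."""
    return [(i * hop_ms * fps + 500) // 1000 for i in range(num_audio_frames)]

# 10 ms audio hop against 25 fps video: audio frame i lies at 0.25*i
# video-frame units, so roughly four audio frames share one video frame.
print(align_frames(8, 10, 25))  # [0, 0, 1, 1, 1, 1, 2, 2]
```

After such an alignment, each (speech feature, image feature) pair can serve as one training example for learning the mapping relation.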
It can be appreciated that the above end-to-end machine learning method is merely an optional embodiment of the method for determining the mapping relation; in fact, those skilled in the art can determine the mapping relation by other methods according to practical application requirements, such as statistical methods. The embodiment of the present invention places no restriction on the specific method for determining the mapping relation.
The target image sequence of the embodiment of the present invention can be obtained on the basis of a target entity image; in other words, the embodiment of the present invention can endow the target entity image with the image features (entity state features) corresponding to the target voice sequence, so as to obtain the target image sequence. The target entity image can be specified by the user; for example, the target entity image may be the image of a celebrity (such as a host).
In summary, the target voice sequence of the embodiment of the present invention can match the timbre of the target sounding body, and the target image sequence can be obtained on the basis of the target entity image; the obtained target video can thus realize the expression of the text by the target entity image in the timbre of the target sounding body. Since the above target video can be generated by machine, the generation time of the target video can be shortened and its timeliness improved, so that the target video is applicable to content expression scenes with high timeliness requirements, such as breaking news scenes.
Moreover, since the target video expresses the text through the target entity image in the timbre of the target sounding body, labor costs can be saved and the working efficiency of related industries improved, relative to expressing the text manually.
In addition, the target image sequence corresponding to the target voice sequence is obtained according to the mapping relation; the rules between speech feature sequences and image feature sequences reflected by the mapping relation are applicable to any language, and therefore to a text corresponding to at least two languages.
The data processing method provided by the embodiment of the present invention can be applied in an application environment comprising a client and a server. The client and the server are located in a wired or wireless network, through which the client and the server interact.
Optionally, the client may run on a terminal. The above terminal specifically includes, but is not limited to: a smart phone, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, a vehicle-mounted computer, a desktop computer, a set-top box, a smart TV set, a wearable device, and so on.
A client is a program that corresponds to a server and provides local services to the user. The client in the embodiment of the present invention can provide the target video; the target video can be generated by the client or by the server, and the embodiment of the present invention places no restriction on the specific client.
In one embodiment of the present invention, the client can determine, through a question-and-answer interactive operation, the target sounding body information and target entity image information selected by the user, receive the user's text, and upload the text, the target sounding body information, and the target entity image information to the server, so that the server generates the target video corresponding to the text, the target sounding body, and the target entity image; moreover, the client can output the target video to the user.
Method Embodiment 1
Referring to Fig. 1, a flow chart of the steps of embodiment 1 of a data processing method of the present invention is shown, which may specifically include the following steps:
Step 101: receive a text; the text may involve at least two languages;
Step 102: determine a target voice sequence corresponding to the text;
Step 103: determine, according to a mapping relation between speech feature sequences and image feature sequences, a target image sequence corresponding to the target voice sequence;
in the mapping relation, the speech feature sequence and the image feature sequence are aligned on the time axis; the mapping relation may be obtained according to speech samples and image samples aligned on the time axis; the speech samples may involve one language, or the speech samples may involve multiple languages;
Step 104: fuse the target voice sequence and the target image sequence to obtain the corresponding target video.
In step 101, for a client, a text uploaded by the user can be received; for a server, a text sent by the client can be received. It can be appreciated that any first device can receive a text from a second device, and the embodiment of the present invention places no restriction on the specific transmission mode of the text.
In step 102, TTS technology can be used to convert the text into the target voice corresponding to the target voice sequence, and the target voice sequence can be characterized in the form of a waveform.
Optionally, the target voice sequence corresponding to the text can be determined according to the timbre parameter corresponding to the target sounding body information, so that a target voice sequence matching the timbre of the target sounding body can be obtained. The target sounding body information may include an identifier of a person, such as that of a celebrity; alternatively, the target sounding body information may include audio of the target sounding body.
The process of determining the target voice sequence corresponding to the text in step 102 may include: determining target linguistic features corresponding to the text, and determining the target voice sequence corresponding to the target linguistic features.
The embodiment of the present invention can determine the target voice sequence corresponding to the target linguistic features using the following determination methods:
Determination method 1: search a first speech bank for first speech units matching the target linguistic features, and splice the first speech units to obtain the target voice sequence.
Determination method 2: determine target acoustic features corresponding to the target linguistic features, search a second speech bank for second speech units matching the target acoustic features, and splice the second speech units to obtain the target voice sequence.
Determination method 3: use an end-to-end speech synthesis method; the source of the end-to-end speech synthesis method may include the text or the target linguistic features corresponding to the text, and the target may be the target voice sequence in waveform form.
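The unit-splicing idea of determination method 1 can be sketched as below. The unit bank, phoneme names, and waveform fragments are all invented for illustration; a real speech bank stores recorded waveform segments indexed by linguistic features.

```python
# Hypothetical first speech bank: phoneme -> waveform fragment.
UNIT_BANK = {
    "HH": [0.1, 0.2],
    "AY": [0.3, 0.4, 0.3],
}

def splice_units(phonemes):
    """Look up the speech unit matching each target linguistic feature
    and splice the units into one target voice sequence (waveform)."""
    wave = []
    for p in phonemes:
        wave.extend(UNIT_BANK[p])
    return wave

print(splice_units(["HH", "AY"]))  # [0.1, 0.2, 0.3, 0.4, 0.3]
```

Determination method 2 follows the same splicing pattern, except that the bank is searched by acoustic features rather than linguistic features.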
In an optional embodiment of the present invention, the end-to-end speech synthesis method can use a neural network, which may include a single-layer RNN (Recurrent Neural Network) and a dual active layer, where the dual active layer is used to predict 16-bit voice output. The state of the RNN is divided into two parts: a first (high 8 bits) state and a second (low 8 bits) state. The first state and the second state are input into their respective active layers; the second state is obtained based on the first state, and the first state is obtained based on the 16 bits of the previous moment. By designing the first state and the second state into one network structure, this neural network can accelerate training and simplify the training process, so the amount of computation of the neural network can be reduced, which in turn makes the end-to-end speech synthesis method suitable for mobile terminals with limited computing resources, such as mobile phones.
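The split of each 16-bit sample into a high-8-bit "first state" and a low-8-bit "second state" (a design that resembles the published WaveRNN architecture, though the text does not name it) can be illustrated as follows. The sample value is arbitrary.

```python
def split_sample(sample16):
    """Split an unsigned 16-bit sample into its high-8-bit (first state)
    and low-8-bit (second state) parts."""
    return sample16 >> 8, sample16 & 0xFF

def join_sample(high, low):
    """Recombine the two 8-bit states into the original 16-bit sample."""
    return (high << 8) | low

h, l = split_sample(0xABCD)
print(h, l)                          # 171 205
print(join_sample(h, l) == 0xABCD)   # True
```

Predicting the low 8 bits conditioned on the already-predicted high 8 bits is what lets each active layer work with a 256-way output instead of a 65536-way one.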
It can be appreciated that those skilled in the art can adopt any one or a combination of the above determination methods 1 to 3 according to practical application requirements; the embodiment of the present invention places no restriction on the specific process of determining the target voice sequence corresponding to the target linguistic features.
In step 103, the target image sequence corresponding to the target voice sequence is obtained according to the mapping relation; the rules between speech feature sequences and image feature sequences reflected by the mapping relation are applicable to any language, and therefore to a text corresponding to at least two languages.
In an optional embodiment of the present invention, determining the target image sequence corresponding to the target voice sequence may specifically include: determining, according to the target speech feature sequence corresponding to the target voice sequence and the mapping relation between speech feature sequences and image feature sequences, the target image feature sequence corresponding to the target speech feature sequence, from which the target image sequence corresponding to the target image feature sequence can then be determined.
A speech feature characterizes the voice and may specifically include: linguistic features and/or acoustic features. An image feature characterizes the entity image and may specifically include: the aforementioned entity state features.
In an optional embodiment of the present invention, determining the target image sequence corresponding to the target image feature sequence may specifically include: synthesizing the target entity image with the target image feature sequence to obtain the target image sequence, whereby the target image feature sequence can be endowed to the target entity image.
The target entity image can be specified by the user; for example, the target entity image may be the image of a celebrity (such as a host).
The target entity image may carry no entity state; synthesizing the target entity image with the target image feature sequence makes the target image sequence carry an entity state matching the text, which in turn can improve the naturalness and richness of the entity state in the target video.
In the embodiment of the present invention, optionally, a three-dimensional model corresponding to the target entity image can be synthesized with the target image feature sequence to obtain the target image sequence. The three-dimensional model can be obtained by three-dimensional reconstruction from multiple frames of the target entity image.
In practical applications, an entity usually exists in the form of a three-dimensional geometric body. A traditional two-dimensional image creates a visual sense of spatial depth through light-and-shade contrast and perspective, but cannot produce a natural, immersive three-dimensional perception. A three-dimensional image, whose spatial shape is close to the prototype, not only has the three-dimensional geometric features of height, width, and depth, but also carries lifelike state information, surpassing the sense of reality that a flat picture cannot provide and giving a warm, lifelike feeling.
In computer graphics, an entity is usually modeled with a three-dimensional model, which corresponds to the entity in space and can be displayed by a computer or other video device.
The features corresponding to a three-dimensional model may include: geometric features, texture features, entity state features, and so on; the entity state features may include expression features, lip features, limb features, and so on. Among these, geometric features are usually represented with polygons or voxels. Polygons express the geometric part of the three-dimensional model, that is, they represent or approximate the curved surfaces of the entity. Their basic objects are vertices in three-dimensional space; a straight line connecting two vertices is called an edge, and three vertices connected by three edges form a triangle, the simplest polygon in Euclidean space. Multiple triangles can compose more complex polygons, or generate a single entity with more than three vertices. Quadrilaterals and triangles are the most common shapes in polygon-expressed three-dimensional models; in expressing three-dimensional models, the triangle-mesh model has become a popular choice because of features such as its simple data structure and ease of rendering by all graphics hardware devices. Each triangle in it is a surface, so a triangle is also called a triangular patch.
The three-dimensional model can carry a default entity state and densely corresponding point cloud data; the default entity state may include a neutral expression, a closed-lip state, a drooping-arm state, and so on.
Synthesizing the three-dimensional model corresponding to the target entity image with the target image feature sequence can be realized by modifying vertex positions on the three-dimensional model and the like. The synthesis methods adopted may specifically include: the keyframe interpolation method, the parameterization method, and so on. The keyframe interpolation method interpolates the image features of keyframes. The parameterization method describes changes of entity state through parameters of the three-dimensional model; different entity states are obtained by adjusting these parameters.
With the keyframe interpolation method, the embodiment of the present invention can obtain interpolation vectors according to the target image feature sequence. With the parameterization method, the embodiment of the present invention can obtain parameter vectors according to the target image feature sequence.
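The keyframe interpolation method can be sketched as a linear blend between the vertex positions of two keyframes. The vertex coordinates below are invented for illustration; real models interpolate thousands of vertices per frame.

```python
def interpolate_vertices(key_a, key_b, t):
    """Linearly interpolate between two keyframe vertex positions of the
    3D model, with t in [0, 1]; intermediate frames of the target image
    sequence can be generated this way."""
    return [a + (b - a) * t for a, b in zip(key_a, key_b)]

# Halfway between a closed-lip and an open-lip vertex configuration.
print(interpolate_vertices([0.0, 1.0, 2.0], [2.0, 3.0, 2.0], 0.5))
# [1.0, 2.0, 2.0]
```

Under the parameterization method, the same intermediate state would instead be produced by setting a model parameter (e.g. a mouth-opening coefficient) to an intermediate value.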
It can be appreciated that the above keyframe interpolation method and parameterization method are merely optional embodiments of the synthesis method; in fact, those skilled in the art can adopt the required synthesis method according to practical application requirements, and the embodiment of the present application places no restriction on the specific synthesis method.
In step 103, the rules between speech feature sequences and image feature sequences are utilized in determining the image features corresponding to the target image sequence. The image features therein may include at least one of expression features, lip features, and limb features.
To improve the accuracy of the image features corresponding to the target image sequence, the embodiment of the present invention can also extend or adjust the image features corresponding to the target image sequence.
In an optional embodiment of the present invention, the limb features corresponding to the target image sequence can be obtained according to the semantic representation corresponding to the text. Since the embodiment of the present invention uses the semantic representation corresponding to the text in determining the limb features, the accuracy of the limb features can be improved.
In the embodiment of the present invention, optionally, any of the direction, position, speed, and strength parameters of a limb feature is related to the semantic representation corresponding to the text.
Optionally, the above semantic representation may involve affective features. Limb features can be classified according to affective features, so as to obtain the limb features corresponding to a class of affective features.
Optionally, the affective features may include: positive-affirmative, negative, neutral, and so on.
The position band of a limb feature may include: an upper band, a middle band, and a lower band. The upper band, above the shoulders, can express positive-affirmative affective features such as ideals, wishes, happiness, and congratulations. The middle band, from the shoulders to the waist, can describe things, illustrate reasoning, and express neutral emotions. The lower band, below the waist, can express negative emotions such as loathing, opposition, criticism, and disappointment.
Besides the position band, limb features may also include direction. For example, a palm turned upwards can express a positive-affirmative affective feature; as another example, a palm turned downwards can express a negative emotion.
In the embodiment of the present invention, the types of semantic representation may include: keywords, one-hot vectors, word embedding (WordEmbedding) vectors, and so on. Word embedding finds a mapping or function that generates an expression in a new space, and that expression is the word representation.
The embodiment of the present invention can determine, through a mapping relation between semantic representations and limb features, the limb features corresponding to the semantic representation of the text. The mapping relation between semantic representations and limb features can be obtained by statistical methods, or by end-to-end methods.
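The affective-feature-to-limb-feature mapping described above can be sketched as a lookup combining the position-band and palm-direction conventions from the preceding paragraphs. The table entries are illustrative assumptions, not a learned mapping.

```python
# Hypothetical mapping: affective feature -> limb feature
# (position band + palm direction), following the band convention above.
EMOTION_GESTURES = {
    "positive": {"band": "upper",  "palm": "up"},
    "neutral":  {"band": "middle", "palm": "side"},
    "negative": {"band": "lower",  "palm": "down"},
}

def limb_feature(affect):
    """Return the limb feature for an affective feature; unknown
    affects fall back to the neutral gesture."""
    return EMOTION_GESTURES.get(affect, EMOTION_GESTURES["neutral"])

print(limb_feature("positive"))  # {'band': 'upper', 'palm': 'up'}
```

A statistical or end-to-end method would learn such an association from data instead of hand-writing it.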
In step 104, since the mapping relation can be obtained according to speech samples and image samples aligned on the time axis, the target voice sequence and the target image sequence can be aligned on the time axis, and can therefore be fused to obtain the target video. Optionally, a multi-modal fusion technique can be used to fuse the target voice sequence and the target image sequence. It can be appreciated that the embodiment of the present invention places no restriction on the specific fusion method.
After the target video is obtained, it can be saved or output. For example, the server can send the target video to the client; as another example, the client can output the target video to the user.
In summary, in the data processing method of the embodiment of the present invention, the target voice sequence can match the timbre of the target sounding body, and the target image sequence can be obtained on the basis of the target entity image; the obtained target video can thus realize the expression of the text by the target entity image in the timbre of the target sounding body. Since the above target video can be generated by machine, the generation time of the target video can be shortened and its timeliness improved, so that the target video is applicable to content expression scenes with high timeliness requirements, such as breaking news scenes.
Moreover, since the target video expresses the text through the target entity image in the timbre of the target sounding body, labor costs can be saved and the working efficiency of related industries improved, relative to expressing the text manually.
In addition, the target image sequence corresponding to the target voice sequence is obtained according to the mapping relation; the rules between speech feature sequences and image feature sequences reflected by the mapping relation are applicable to any language, and therefore to a text corresponding to at least two languages.
Method Embodiment 2
Referring to Fig. 2, a flow chart of the steps of embodiment 2 of a data processing method of the present invention is shown, which may specifically include the following steps:
Step 201: receive a text; the text may involve at least two languages;
Step 202: determine a target voice sequence corresponding to the text;
Step 203: determine, according to a mapping relation between speech feature sequences and image feature sequences, a target image sequence corresponding to the target voice sequence;
in the mapping relation, the speech feature sequence and the image feature sequence are aligned on the time axis; the mapping relation may be obtained according to speech samples and image samples aligned on the time axis; the speech samples may involve one language, or the speech samples may involve multiple languages;
Step 204: compensate the boundary of a preset region in the target image sequence;
Step 205: fuse the target voice sequence and the compensated target image sequence to obtain the corresponding target video.
In determining the target image sequence corresponding to the target voice sequence, the embodiment of the present invention usually uses the three-dimensional model of the target entity image, and the limitations of the reconstruction method of the three-dimensional model and of the method for synthesizing the three-dimensional model with the image feature sequence easily cause detail-missing problems in the polygons of the three-dimensional model, so the target entity image corresponding to the target image sequence may be incomplete to some extent, for example with some teeth or parts of the nose missing. The embodiment of the present invention compensates the boundary of the preset region in the target image sequence, which can improve the integrity of the preset region.
The above preset region can characterize a part of the entity, such as the face or limbs; correspondingly, the above preset region may specifically include at least one of the following regions:
a facial region;
a dress region; and
a limb region.
In one embodiment of the present invention, compensating the boundary of a tooth region in the target image sequence can repair incomplete teeth or supplement teeth that did not appear, and can therefore improve the integrity of the tooth region.
It in practical applications, can be with reference to the target entity image including complete predeterminable area, to the target image sequence
The boundary of predeterminable area compensates in column, and the embodiment of the present invention is without restriction for specific compensation process.
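Since the patent leaves the compensation process unrestricted, the following is only one hypothetical reading of step 204: pixels missing inside a preset region are filled from a reference entity image that contains the complete region. The frame layout and the use of `None` for missing detail are assumptions for illustration.

```python
# Hypothetical boundary compensation: fill missing pixels of a preset region
# (e.g. the tooth region) from a reference image with the complete region.

def compensate_region(frame, reference, region):
    """Fill None pixels of `frame` inside `region` from `reference`.

    `frame` and `reference` are 2D lists of equal shape;
    `region` is (row_start, col_start, row_end, col_end).
    """
    r0, c0, r1, c1 = region
    out = [row[:] for row in frame]  # leave the input frame untouched
    for r in range(r0, r1):
        for c in range(c0, c1):
            if out[r][c] is None:          # detail lost in the 3D render
                out[r][c] = reference[r][c]  # borrow from the complete reference
    return out

frame = [[1, 1], [None, 1]]        # one missing pixel inside the region
reference = [[9, 9], [7, 9]]       # reference with the complete region
patched = compensate_region(frame, reference, (0, 0, 2, 2))
```

A production system would more likely use image inpainting or blending along the region boundary; the pixel-copy rule above only makes the compensation idea concrete.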
Method Embodiment Three
Referring to Fig. 3, a flow chart of the steps of a third embodiment of a data processing method of the present invention is shown, which may specifically include the following steps:
Step 301: receive a question-related text; the question-related text may involve at least two languages.
Step 302: determine the target speech sequence corresponding to the question-related text.
Step 303: according to the mapping relationship between speech feature sequences and image feature sequences, determine the target image sequence corresponding to the target speech sequence. The mode corresponding to the target image sequence may include an answering mode or a listening mode.
In the mapping relationship, the speech feature sequence and the image feature sequence are aligned on a time axis; the mapping relationship can be obtained according to time-axis-aligned speech samples and image samples. The speech samples may involve one language, or the speech samples may involve multiple languages.
Step 304: fuse the target speech sequence and the target image sequence to obtain the corresponding target video.
The embodiment of the present invention can be applied to question-and-answer interaction scenarios, such as customer service scenarios and video conference scenarios. In the embodiment of the present invention, the mode corresponding to the target image sequence may include an answering mode or a listening mode, which can improve the intelligence of the target image sequence in customer service scenarios.
The answering mode can refer to a mode in which a question is answered through the target video, and can correspond to a first entity state. In the answering mode, the target entity image corresponding to the target video can read aloud the answer to the question through the target speech sequence, and express emotion while reading the answer aloud through the first entity state corresponding to the target image sequence.
The listening mode can refer to a mode of listening to a question being input by a user, and can correspond to a second entity state. In the listening mode, the target entity image corresponding to the target video can express emotion while listening through the second entity state corresponding to the target image sequence. The second entity state may include features such as nodding. Optionally, in the listening mode, listening-state text such as "uh-huh" or "please continue" can also be expressed through the target speech sequence.
The question-related text may include answer text or listening-state text, where the answer text can correspond to the answering mode and the listening-state text can correspond to the listening mode.
In an alternative embodiment of the present invention, during the input of the question, the mode corresponding to the target image sequence is the listening mode; or,
after the input of the question is completed, the mode corresponding to the target image sequence can be the answering mode.
The embodiment of the present invention can switch the mode corresponding to the target image sequence according to whether the input of the question is complete. Optionally, if no input is received from the user within a preset duration, the input of the question can be considered complete.
In an alternative embodiment of the present invention, the mode corresponding to the target image sequence can be switched according to linking image samples, so as to improve the fluency of switching.
The linking image samples may include a first linking image sample. The first linking image sample may include an image sample corresponding to the listening mode followed by an image sample corresponding to the answering mode. By learning from the first linking image sample, a rule for switching from the listening mode to the answering mode can be obtained, thereby improving the fluency of switching from the listening mode to the answering mode.
The linking image samples may include a second linking image sample. The second linking image sample may include an image sample corresponding to the answering mode followed by an image sample corresponding to the listening mode. By learning from the second linking image sample, a rule for switching from the answering mode to the listening mode can be obtained, thereby improving the fluency of switching from the answering mode to the listening mode.
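One way to read the linking-sample idea is as a preprocessing step: scan a time-ordered stream of (mode, frame) samples and cut out the frames surrounding each mode change, yielding first and second linking samples as training material for the switching rule. The data layout and the context-window size below are assumptions; the actual learning procedure is not modeled here.

```python
# Illustrative extraction of linking image samples around mode transitions.

def extract_linking_segments(samples, context=1):
    """Return (from_mode, to_mode, frames) for each mode transition.

    `samples` is a list of (mode, frame) pairs ordered in time;
    `context` frames on each side of the transition are kept.
    """
    segments = []
    for i in range(1, len(samples)):
        prev_mode, next_mode = samples[i - 1][0], samples[i][0]
        if prev_mode != next_mode:
            lo, hi = max(0, i - context), min(len(samples), i + context)
            frames = [f for _, f in samples[lo:hi]]
            segments.append((prev_mode, next_mode, frames))
    return segments

stream = [("listen", "f0"), ("listen", "f1"), ("answer", "f2"), ("listen", "f3")]
segs = extract_linking_segments(stream)
```

Here `segs` contains one listen-to-answer segment (a first linking sample) and one answer-to-listen segment (a second linking sample).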
An example of a data processing method of the present invention may specifically include the following steps:
Step S1: in the listening mode, play a first target video and receive a question input by the user.
The first target video can correspond to the listening mode, and can be obtained from a first target speech sequence and a first target image sequence, where the first target image sequence can correspond to the listening mode.
Step S2: determine whether the input of the question is complete; if so, execute step S3, otherwise return to step S1.
Step S3: set the mode corresponding to the target image sequence to the answering mode, and play a second target video.
The determination process of the second target video may include:
Step S31: determine the answer text;
Step S32: determine a second target speech sequence corresponding to the answer text;
Step S33: according to the mapping relationship between speech feature sequences and image feature sequences, determine a second target image sequence corresponding to the second target speech sequence, where the second target image sequence can correspond to the answering mode;
Step S34: fuse the second target speech sequence and the second target image sequence to obtain the corresponding second target video.
Step S4: after the second target video finishes playing, set the mode corresponding to the target image sequence back to the listening mode.
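The S1 through S4 loop above amounts to a small two-state machine. The class below models only the mode switching; playing videos and producing answers (steps S31 through S34) are stubbed with a placeholder string, and all names are illustrative assumptions.

```python
# Sketch of the S1-S4 interaction loop as a listening/answering state machine.

class QAInteraction:
    def __init__(self):
        self.mode = "listening"   # S1: start in the listening mode
        self.buffer = []

    def receive(self, fragment):
        """S1/S2: accumulate question input while listening."""
        if self.mode == "listening":
            self.buffer.append(fragment)

    def complete_input(self):
        """S2 true branch, then S3: switch to answering and build the video."""
        self.mode = "answering"
        question = " ".join(self.buffer)
        self.buffer = []
        return f"video(answer to: {question})"  # stand-in for steps S31-S34

    def playback_finished(self):
        """S4: return to the listening mode after the answer video ends."""
        self.mode = "listening"

qa = QAInteraction()
qa.receive("what")
qa.receive("time is it")
video = qa.complete_input()
qa.playback_finished()
```

After the answer video finishes, the machine is back in the listening mode, ready for the next question, matching the return to S1.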
It can be appreciated that outputting the target video as described above is only an alternative embodiment; in fact, the embodiment of the present invention can output a link to the target video to the user, so that the user can decide whether to play the target video.
Optionally, the embodiment of the present invention can also output the target speech sequence, or a link to the target speech sequence, to the user.
Optionally, the embodiment of the present invention can also output the question-related text to the user. The question-related text may include answer text or listening-state text, where the answer text can correspond to the answering mode and the listening-state text can correspond to the listening mode.
In an alternative embodiment of the present invention, the above question-and-answer interaction can correspond to a communication window, and at least one of the following items of information can be displayed in the communication window: the link to the target speech sequence, the answer text, and the link to the target video. The link to the target video can be displayed in the identification area of the communication party, where the identification area can be used to display information such as the nickname, ID (identity), and avatar of the communication party.
It should be noted that, for simplicity of description, the method embodiments are described as a series of action combinations; however, those skilled in the art should understand that the embodiments of the present invention are not limited by the described sequence of actions, because according to the embodiments of the present invention, some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also be aware that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present invention.
Device Embodiment
Referring to Fig. 4, a structural block diagram of an embodiment of a data processing device of the present invention is shown, which may specifically include:
a receiving module 401, configured to receive a text, the text involving at least two languages;
a voice determining module 402, configured to determine the target speech sequence corresponding to the text;
an image determining module 403, configured to determine, according to the mapping relationship between speech feature sequences and image feature sequences, the target image sequence corresponding to the target speech sequence, where in the mapping relationship the speech feature sequence and the image feature sequence are aligned on a time axis, the mapping relationship can be obtained according to time-axis-aligned speech samples and image samples, and the speech samples involve one language or the speech samples involve multiple languages; and
a fusion module 404, configured to fuse the target speech sequence and the target image sequence to obtain the corresponding target video.
Optionally, the speech samples and the image samples originate from the same video file; or
the speech samples originate from an audio file, and the image samples originate from a video file or an image file.
Optionally, the image features corresponding to the target image sequence may include at least one of the following features:
expression features;
lip features; and
limb features.
Optionally, the limb features corresponding to the target image sequence are obtained according to the semantic representation corresponding to the text.
Optionally, the device may also include:
a compensating module, configured to compensate the boundary of a preset region in the target image sequence before the fusion module fuses the target speech sequence and the target image sequence.
Optionally, the preset region may include at least one of the following regions:
a facial region;
a clothing region; and
a limb region.
Optionally, the text may include a question-related text in a question-and-answer interaction; and
the mode corresponding to the target image sequence may include an answering mode or a listening mode.
Optionally, during the input of the question, the mode corresponding to the target image sequence is the listening mode; or,
after the input of the question is completed, the mode corresponding to the target image sequence is the answering mode.
Optionally, the device may also include:
a first output module, configured to output the target video to the user; or
a second output module, configured to output a link to the target video to the user; or
a third output module, configured to output the target speech sequence or a link to the target speech sequence to the user; or
a fourth output module, configured to output the question-related text to the user.
Since the device embodiments are basically similar to the method embodiments, their description is relatively simple; for relevant details, refer to the description of the method embodiments.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments can be referred to each other.
Regarding the devices in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the related method, and will not be elaborated here.
Fig. 5 is a structural block diagram of a device 900 for data processing according to an exemplary embodiment. For example, the device 900 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like.
Referring to Fig. 5, the device 900 may include one or more of the following components: a processing component 902, a memory 904, a power supply component 906, a multimedia component 908, an audio component 910, an input/output (I/O) interface 912, a sensor component 914, and a communication component 916.
The processing component 902 typically controls the overall operations of the device 900, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component 902 may include one or more processors 920 to execute instructions, so as to perform all or part of the steps of the methods described above. In addition, the processing component 902 may include one or more modules to facilitate interaction between the processing component 902 and other components; for example, the processing component 902 may include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.
The memory 904 is configured to store various types of data to support operation of the device 900. Examples of such data include instructions for any application or method operated on the device 900, contact data, phone book data, messages, pictures, video, and so on. The memory 904 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The power supply component 906 provides power to the various components of the device 900. The power supply component 906 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 900.
The multimedia component 908 includes a screen providing an output interface between the device 900 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 908 includes a front camera and/or a rear camera. When the device 900 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focusing and optical zoom capabilities.
The audio component 910 is configured to output and/or input audio signals. For example, the audio component 910 includes a microphone (MIC); when the device 900 is in an operation mode, such as a call mode, a recording mode, or a voice recognition mode, the microphone is configured to receive external audio signals. The received audio signals may be further stored in the memory 904 or sent via the communication component 916. In some embodiments, the audio component 910 also includes a loudspeaker for outputting audio signals.
The I/O interface 912 provides an interface between the processing component 902 and peripheral interface modules, which may be keyboards, click wheels, buttons, and the like. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 914 includes one or more sensors for providing state assessments of various aspects of the device 900. For example, the sensor component 914 can detect the open/closed state of the device 900 and the relative positioning of components, such as the display and keypad of the device 900; the sensor component 914 can also detect a change in position of the device 900 or a component of the device 900, the presence or absence of contact between the user and the device 900, the orientation or acceleration/deceleration of the device 900, and a change in temperature of the device 900. The sensor component 914 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 914 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 914 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 916 is configured to facilitate wired or wireless communication between the device 900 and other devices. The device 900 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 916 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 916 also includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module can be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the device 900 can be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, such as the memory 904 including instructions, and the above instructions can be executed by the processor 920 of the device 900 to complete the above methods. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
Fig. 6 is a structural block diagram of a server in some embodiments of the present invention. The server 1900 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 1922 (for example, one or more processors), a memory 1932, and one or more storage media 1930 (for example, one or more mass storage devices) storing application programs 1942 or data 1944. The memory 1932 and the storage medium 1930 may provide transient storage or persistent storage. The programs stored in the storage medium 1930 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server. Furthermore, the central processing unit 1922 can be configured to communicate with the storage medium 1930 and to execute, on the server 1900, the series of instruction operations in the storage medium 1930.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
A non-transitory computer-readable storage medium is provided; when the instructions in the storage medium are executed by a processor of a device (a device or a server), the device is enabled to execute a data processing method, the method comprising: receiving a text, the text involving at least two languages; determining a target speech sequence corresponding to the text; determining, according to a mapping relationship between speech feature sequences and image feature sequences, a target image sequence corresponding to the target speech sequence, wherein in the mapping relationship the speech feature sequence and the image feature sequence are aligned on a time axis, the mapping relationship is obtained according to time-axis-aligned speech samples and image samples, and the speech samples involve one language or the speech samples involve multiple languages; and fusing the target speech sequence and the target image sequence to obtain a corresponding target video.
Embodiments of the present invention disclose A1, a data processing method, comprising:
receiving a text, the text involving at least two languages;
determining a target speech sequence corresponding to the text;
determining, according to a mapping relationship between speech feature sequences and image feature sequences, a target image sequence corresponding to the target speech sequence, wherein in the mapping relationship the speech feature sequence and the image feature sequence are aligned on a time axis, the mapping relationship is obtained according to time-axis-aligned speech samples and image samples, and the speech samples involve one language or the speech samples involve multiple languages; and
fusing the target speech sequence and the target image sequence to obtain a corresponding target video.
A2. The method according to A1, wherein the speech samples and the image samples originate from the same video file; or
the speech samples originate from an audio file, and the image samples originate from a video file or an image file.
A3. The method according to A1, wherein the image features corresponding to the target image sequence include at least one of the following features:
expression features;
lip features; and
limb features.
A4. The method according to A1, wherein the limb features corresponding to the target image sequence are obtained according to a semantic representation corresponding to the text.
A5. The method according to any one of A1 to A4, wherein before the fusing of the target speech sequence and the target image sequence, the method further comprises:
compensating a boundary of a preset region in the target image sequence.
A6. The method according to A5, wherein the preset region includes at least one of the following regions:
a facial region;
a clothing region; and
a limb region.
A7. The method according to any one of A1 to A4, wherein the text includes a question-related text in a question-and-answer interaction; and
a mode corresponding to the target image sequence includes an answering mode or a listening mode.
A8. The method according to A7, wherein during input of the question, the mode corresponding to the target image sequence is the listening mode; or
after the input of the question is completed, the mode corresponding to the target image sequence is the answering mode.
A9. The method according to A7, the method further comprising:
outputting the target video to a user; or
outputting a link to the target video to a user; or
outputting the target speech sequence or a link to the target speech sequence to a user; or
outputting the question-related text to a user.
Embodiments of the present invention disclose B10, a data processing device, comprising:
a receiving module, configured to receive a text, the text involving at least two languages;
a voice determining module, configured to determine a target speech sequence corresponding to the text;
an image determining module, configured to determine, according to a mapping relationship between speech feature sequences and image feature sequences, a target image sequence corresponding to the target speech sequence, wherein in the mapping relationship the speech feature sequence and the image feature sequence are aligned on a time axis, the mapping relationship is obtained according to time-axis-aligned speech samples and image samples, and the speech samples involve one language or the speech samples involve multiple languages; and
a fusion module, configured to fuse the target speech sequence and the target image sequence to obtain a corresponding target video.
B11. The device according to B10, wherein the speech samples and the image samples originate from the same video file; or
the speech samples originate from an audio file, and the image samples originate from a video file or an image file.
B12. The device according to B10, wherein the image features corresponding to the target image sequence include at least one of the following features:
expression features;
lip features; and
limb features.
B13. The device according to B10, wherein the limb features corresponding to the target image sequence are obtained according to a semantic representation corresponding to the text.
B14. The device according to any one of B10 to B13, the device further comprising:
a compensating module, configured to compensate a boundary of a preset region in the target image sequence before the fusion module fuses the target speech sequence and the target image sequence.
B15. The device according to B14, wherein the preset region includes at least one of the following regions:
a facial region;
a clothing region; and
a limb region.
B16. The device according to any one of B10 to B13, wherein the text includes a question-related text in a question-and-answer interaction; and
a mode corresponding to the target image sequence includes an answering mode or a listening mode.
B17. The device according to B16, wherein during input of the question, the mode corresponding to the target image sequence is the listening mode; or
after the input of the question is completed, the mode corresponding to the target image sequence is the answering mode.
B18. The device according to B16, the device further comprising:
a first output module, configured to output the target video to a user; or
a second output module, configured to output a link to the target video to a user; or
a third output module, configured to output the target speech sequence or a link to the target speech sequence to a user; or
a fourth output module, configured to output the question-related text to a user.
Embodiments of the present invention disclose C19, a device for data processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs include instructions for performing the following operations:
receiving a text, the text involving at least two languages;
determining a target speech sequence corresponding to the text;
determining, according to a mapping relationship between speech feature sequences and image feature sequences, a target image sequence corresponding to the target speech sequence, wherein in the mapping relationship the speech feature sequence and the image feature sequence are aligned on a time axis, the mapping relationship is obtained according to time-axis-aligned speech samples and image samples, and the speech samples involve one language or the speech samples involve multiple languages; and
fusing the target speech sequence and the target image sequence to obtain a corresponding target video.
C20. The device according to C19, wherein the speech samples and the image samples originate from the same video file; or
the speech samples originate from an audio file, and the image samples originate from a video file or an image file.
C21. The device according to C19, wherein the image features corresponding to the target image sequence include at least one of the following features:
expression features;
lip features; and
limb features.
C22. The device according to C19, wherein the limb features corresponding to the target image sequence are obtained according to a semantic representation corresponding to the text.
C23. The device according to any one of C19 to C22, wherein before the fusing of the target speech sequence and the target image sequence, the device is further configured such that execution of the one or more programs by the one or more processors includes instructions for performing the following operation:
compensating a boundary of a preset region in the target image sequence.
C24. The device according to C23, wherein the preset region includes at least one of the following regions:
a facial region;
a clothing region; and
a limb region.
C25. The device according to any one of C19 to C22, wherein the text includes a question-related text in a question-and-answer interaction; and
a mode corresponding to the target image sequence includes an answering mode or a listening mode.
C26. The device according to C25, wherein during input of the question, the mode corresponding to the target image sequence is the listening mode; or
after the input of the question is completed, the mode corresponding to the target image sequence is the answering mode.
C27. The device according to C25, wherein the device is further configured such that execution of the one or more programs by the one or more processors includes instructions for performing the following operations:
outputting the target video to a user; or
outputting a link to the target video to a user; or
outputting the target speech sequence or a link to the target speech sequence to a user; or
outputting the question-related text to a user.
Embodiments of the present invention disclose D28, a machine-readable medium having instructions stored thereon which, when executed by one or more processors, cause a device to execute the data processing method according to any one of A1 to A9.
Other embodiments of the present invention will be readily apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed here. The present invention is intended to cover any variations, uses, or adaptations of the invention that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The description and examples are to be considered as exemplary only, with the true scope and spirit of the invention being indicated by the following claims.
It should be understood that the present invention is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims.
The foregoing is merely the preferred embodiments of the present invention and is not intended to limit the present invention; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.
A data processing method, a data processing device, and a device for data processing provided by the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is only intended to help understand the method of the present invention and its core ideas. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementation and application scope according to the ideas of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.