Specific Embodiments
In order to make the foregoing objectives, features and advantages of the present invention clearer and more comprehensible, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
To address the technical problem that recording conventional video news takes a long time, an embodiment of the present invention provides a scheme for generating, by machine, a target video corresponding to a text. The scheme may specifically include: receiving a text; determining a target voice sequence corresponding to the text; determining, according to a mapping relation between speech feature sequences and image feature sequences, a target image sequence corresponding to the target voice sequence; and fusing the target voice sequence and the target image sequence to obtain the corresponding target video.
The embodiment of the present invention can be applied to content expression scenes such as news reporting scenes, teaching scenes, medical scenes, customer service scenes, and legal scenes.
The target video of the embodiment of the present invention may include: the target voice sequence corresponding to the text, and the target image sequence corresponding to the target voice sequence.
In the embodiment of the present invention, the text may involve at least two languages, for example, at least two of Chinese, Japanese, Korean, English, French, German, Arabic, and the like. The target voice sequence and target image sequence then also involve at least two languages, so the embodiment of the present invention is applicable to multilingual content expression scenes.
For example, in a news reporting scene, the text may be a news release. The news release may involve a first language and a second language of the country where the news event occurred; for instance, the first language may be Chinese and the second language may be English. Of course, in addition to the first language and the second language, the text may also involve a third language, a fourth language, and so on.
As another example, in a customer service scene, the text may be a question text input by a user, which may involve a first language that is the user's mother tongue and a second language that is not. For instance, if the question text relates to a computer fault, it may include an English text corresponding to the fault together with the user's conclusion and summary in Chinese.
As yet another example, in a meeting presiding scene, the text may be a conference speech script, which may involve the multiple languages of the multilingual users attending the conference.
It can be appreciated that a text involving at least two languages can be applied to any content expression scene, and the embodiment of the present invention places no restriction on the specific content expression scene.
In practical applications, TTS (Text To Speech) technology can be used to convert the text into the target voice corresponding to the target voice sequence, and the target voice sequence can be characterized in the form of a waveform. It can be appreciated that a target voice sequence meeting the demand can be obtained according to speech synthesis parameters.
Optionally, the speech synthesis parameters may include at least one of a timbre parameter, a pitch parameter, and a loudness parameter.
The timbre parameter refers to the distinguishing characteristic that the frequencies of different sounds show in their waveforms. Different sounding bodies usually correspond to different timbres, so a target voice sequence matching the timbre of a target sounding body can be obtained according to the timbre parameter. The target sounding body can be specified by the user; for example, the target sounding body may be a designated media worker. In practical applications, the timbre parameter of the target sounding body can be obtained from audio of the target sounding body of a preset length.
The pitch parameter characterizes the tone of the sound and is measured by frequency. The loudness parameter, also called sound intensity or volume, refers to the magnitude of the sound and is measured in decibels (dB).
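For illustration only, the relation between waveform amplitude and loudness in decibels can be sketched as follows. The reference amplitude and the RMS convention are assumptions for the example; the text only states that loudness is measured in dB.

```python
import math

def loudness_db(samples, ref=1.0):
    """Root-mean-square loudness of a waveform, in decibels relative to
    a reference amplitude (a common convention, assumed here)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms / ref)

# A full-scale square wave has RMS 1.0, i.e. 0 dB relative to ref=1.0.
print(loudness_db([1.0, -1.0, 1.0, -1.0]))             # 0.0
# Halving the amplitude lowers loudness by about 6.02 dB.
print(round(loudness_db([0.5, -0.5, 0.5, -0.5]), 2))   # -6.02
```

Such a loudness parameter could then be one input among the speech synthesis parameters described above.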
The target image sequence can be used to characterize an entity image. An entity is a thing that is distinguishable and exists independently; entities may include people, robots, animals, plants, and so on. The embodiment of the present invention mainly takes a person as an example to illustrate the target image sequence; target image sequences corresponding to other entities can be treated by analogy. The entity image corresponding to a person may be called a portrait.
From the perspective of entity state, the above image features may include entity state features, which can reflect the features of an image sequence in terms of entity state.
Optionally, the above entity state features may include at least one of the following:
expression features;
lip features; and
limb features.
Expression is the outward showing of one's sentiments and affections, and can refer to the thoughts and feelings shown on the face. Expression features are usually directed at the entire face. Lip features are directed specifically at the lips and match the textual content, voice, and articulation of the target voice sequence, so they can improve the naturalness of the expression corresponding to the image sequence.
Limb features convey a character's thoughts through the coordinated activity of body parts such as the head, eyes, neck, hands, elbows, arms, torso, hips, and feet, so as to communicate views visually. Limb features may include turning the head, shrugging, gestures, and so on, and can improve the richness of the expression corresponding to the image sequence. For example, at least one arm may hang naturally when speaking, and at least one arm may rest naturally on the abdomen when silent.
The speech feature sequence may include: linguistic features and/or acoustic features.
Linguistic features may include phoneme features. A phoneme is the smallest unit of speech divided according to the natural attributes of speech; analyzed in terms of the articulatory actions within a syllable, one action constitutes one phoneme. Phonemes may include vowels and consonants.
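As a minimal sketch of phoneme features, a text can be decomposed into a phoneme sequence through a pronunciation lexicon. The two-word lexicon and the ARPAbet-style symbols below are invented for illustration; a real system would use a full lexicon or a grapheme-to-phoneme model.

```python
# Toy pronunciation lexicon; entries and symbols are illustrative only.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def to_phonemes(text):
    """Map each word of the text to its phoneme sequence via the lexicon;
    unknown words yield a placeholder symbol."""
    seq = []
    for word in text.lower().split():
        seq.extend(LEXICON.get(word, ["<unk>"]))
    return seq

print(to_phonemes("Hello world"))
# ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
```

The resulting phoneme sequence is one possible form of the linguistic features named above.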
Acoustic features characterize speech from the perspective of sound production. Acoustic features may include, but are not limited to, the following:
prosodic features (supra-segmental / paralinguistic features), specifically including duration-related features, fundamental-frequency-related features, energy-related features, and so on;
voice quality features;
spectrum-based correlation features, which embody the correlation between vocal tract shape changes and articulatory movements; the main spectrum-based features at present include linear prediction cepstral coefficients (LPCC), mel-frequency cepstral coefficients (MFCC), and so on.
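For illustration, one of the simplest acoustic features named above, short-time (frame-level) energy, can be computed directly from the waveform. The frame length and hop values are arbitrary choices for the example.

```python
import numpy as np

def frame_energy(signal, frame_len, hop):
    """Short-time energy, one value per analysis frame -- a simple
    instance of the 'energy-related' prosodic features in the text."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array([float(np.sum(f ** 2)) for f in frames])

sig = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 0.0])
print(frame_energy(sig, 2, 2))  # [0. 2. 0.]
```

Fundamental-frequency and duration features would be extracted per frame in the same fashion, yielding one acoustic feature vector per frame of the speech feature sequence.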
The embodiment of the present invention can obtain the mapping relation between speech feature sequences and image feature sequences according to speech samples and image samples aligned on the time axis.
There are rules governing the relationship between speech feature sequences and image feature sequences. For example, a specific phoneme feature corresponds to a specific lip feature; as another example, a specific prosodic feature corresponds to a specific expression feature; alternatively, a specific phoneme feature corresponds to a specific limb feature; and so on.
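The phoneme-to-lip rule mentioned above can be sketched, in its simplest table-lookup form, as follows. The phoneme symbols and lip-feature labels are invented for illustration; the actual mapping relation is learned from aligned samples rather than hand-written.

```python
# Illustrative rule table: each phoneme feature maps to a lip feature.
PHONEME_TO_LIP = {
    "AA": "wide_open",   # open vowel -> mouth wide open
    "M":  "closed",      # bilabial nasal -> lips closed
    "UW": "rounded",     # rounded vowel -> lips rounded
}

def image_feature_sequence(speech_features):
    """Map a speech (phoneme) feature sequence to an image (lip) feature
    sequence, frame by frame; unknown phonemes fall back to neutral."""
    return [PHONEME_TO_LIP.get(p, "neutral") for p in speech_features]

print(image_feature_sequence(["M", "AA", "UW"]))
# ['closed', 'wide_open', 'rounded']
```

A learned mapping relation generalizes this idea: instead of a fixed table, a model predicts an image feature vector for each speech feature vector.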
Therefore, the embodiment of the present invention can obtain the mapping relation according to speech samples and image samples aligned on the time axis, so that the mapping relation reflects the rules between speech feature sequences and image feature sequences. The rules reflected by the mapping relation are applicable to any language, and therefore to a text corresponding to at least two languages.
The embodiment of the present invention can use an end-to-end machine learning method to learn from the time-axis-aligned speech samples and image samples, so as to obtain the above mapping relation. The input of the end-to-end machine learning method can be a voice sequence and the output can be an image sequence; by learning from training data, the method obtains the rules between the input features and the output features.
Broadly speaking, machine learning is a method of endowing machines with the ability to learn, allowing them to accomplish functions that direct programming cannot. In a practical sense, machine learning is a method of training a model with data and then making predictions with the model. Machine learning methods may include decision tree methods, linear regression methods, logistic regression methods, neural network methods, and so on; it can be understood that the embodiment of the present invention places no restriction on the specific machine learning method.
Aligning the speech samples and image samples on the time axis can improve the synchronization between speech features and image features.
In one embodiment of the present invention, the speech samples and the image samples may originate from the same video file, whereby their alignment on the time axis is realized. For example, recorded video files may be collected, each of which may include the voice of a sounding body and the video pictures of the sounding body.
In another embodiment of the present invention, the speech samples and image samples may originate from different files; specifically, the speech samples may originate from an audio file, and the image samples may originate from a video file or an image file, where an image file may include multiple frames of images. In such cases, time-axis alignment can be performed on the speech samples and image samples to obtain time-axis-aligned speech samples and image samples.
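One simple way to realize the time-axis alignment described above is to associate each audio analysis frame with its nearest video frame. The hop size and frame rate below are assumptions for the example, and integer arithmetic is used so the result is exact; the text does not fix a particular alignment method.

```python
def align_frames(num_audio_frames, hop_ms, fps):
    """For each audio analysis frame (hop given in milliseconds), return
    the index of the nearest video frame (ties round up), so that speech
    features and image features line up on the time axis."""
    return [(i * hop_ms * fps + 500) // 1000 for i in range(num_audio_frames)]

# 10 ms audio hop against 25 fps video: audio frame i lies at 0.25*i
# video-frame units, so roughly four audio frames share one video frame.
print(align_frames(8, 10, 25))  # [0, 0, 1, 1, 1, 1, 2, 2]
```

After such an alignment, each (speech feature, image feature) pair can serve as one training example for learning the mapping relation.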
It can be appreciated that the above end-to-end machine learning method is merely an optional embodiment of the method for determining the mapping relation; in fact, those skilled in the art can determine the mapping relation by other methods according to practical application requirements, such as statistical methods. The embodiment of the present invention places no restriction on the specific method for determining the mapping relation.
The target image sequence of the embodiment of the present invention can be obtained on the basis of a target entity image; in other words, the embodiment of the present invention can endow the target entity image with the image features (entity state features) corresponding to the target voice sequence, so as to obtain the target image sequence. The target entity image can be specified by the user; for example, the target entity image may be the image of a celebrity (such as a host).
In summary, the target voice sequence of the embodiment of the present invention can match the timbre of the target sounding body, and the target image sequence can be obtained on the basis of the target entity image; the obtained target video can thus realize the expression of the text by the target entity image in the timbre of the target sounding body. Since the above target video can be generated by machine, the generation time of the target video can be shortened and its timeliness improved, so that the target video is applicable to content expression scenes with high timeliness requirements, such as breaking news scenes.
Moreover, since the target video expresses the text through the target entity image in the timbre of the target sounding body, labor costs can be saved and the working efficiency of related industries improved, relative to expressing the text manually.
In addition, the target image sequence corresponding to the target voice sequence is obtained according to the mapping relation; the rules between speech feature sequences and image feature sequences reflected by the mapping relation are applicable to any language, and therefore to a text corresponding to at least two languages.
The data processing method provided by the embodiment of the present invention can be applied in an application environment comprising a client and a server. The client and the server are located in a wired or wireless network, through which the client and the server interact.
Optionally, the client may run on a terminal. The above terminal specifically includes, but is not limited to: a smart phone, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, a vehicle-mounted computer, a desktop computer, a set-top box, a smart TV set, a wearable device, and so on.
A client is a program that corresponds to a server and provides local services to the user. The client in the embodiment of the present invention can provide the target video; the target video can be generated by the client or by the server, and the embodiment of the present invention places no restriction on the specific client.
In one embodiment of the present invention, the client can determine, through a question-and-answer interactive operation, the target sounding body information and target entity image information selected by the user, receive the user's text, and upload the text, the target sounding body information, and the target entity image information to the server, so that the server generates the target video corresponding to the text, the target sounding body, and the target entity image; moreover, the client can output the target video to the user.
Method Embodiment 1
Referring to Fig. 1, a flow chart of the steps of embodiment 1 of a data processing method of the present invention is shown, which may specifically include the following steps:
Step 101: receive a text; the text may involve at least two languages;
Step 102: determine a target voice sequence corresponding to the text;
Step 103: determine, according to a mapping relation between speech feature sequences and image feature sequences, a target image sequence corresponding to the target voice sequence;
in the mapping relation, the speech feature sequence and the image feature sequence are aligned on the time axis; the mapping relation may be obtained according to speech samples and image samples aligned on the time axis; the speech samples may involve one language, or the speech samples may involve multiple languages;
Step 104: fuse the target voice sequence and the target image sequence to obtain the corresponding target video.
In step 101, for a client, a text uploaded by the user can be received; for a server, a text sent by the client can be received. It can be appreciated that any first device can receive a text from a second device, and the embodiment of the present invention places no restriction on the specific transmission mode of the text.
In step 102, TTS technology can be used to convert the text into the target voice corresponding to the target voice sequence, and the target voice sequence can be characterized in the form of a waveform.
Optionally, the target voice sequence corresponding to the text can be determined according to the timbre parameter corresponding to the target sounding body information, so that a target voice sequence matching the timbre of the target sounding body can be obtained. The target sounding body information may include an identifier of a person, such as that of a celebrity; alternatively, the target sounding body information may include audio of the target sounding body.
The process of determining the target voice sequence corresponding to the text in step 102 may include: determining target linguistic features corresponding to the text, and determining the target voice sequence corresponding to the target linguistic features.
The embodiment of the present invention can determine the target voice sequence corresponding to the target linguistic features using the following determination methods:
Determination method 1: search a first speech bank for first speech units matching the target linguistic features, and splice the first speech units to obtain the target voice sequence.
Determination method 2: determine target acoustic features corresponding to the target linguistic features, search a second speech bank for second speech units matching the target acoustic features, and splice the second speech units to obtain the target voice sequence.
Determination method 3: use an end-to-end speech synthesis method; the source of the end-to-end speech synthesis method may include the text or the target linguistic features corresponding to the text, and the target may be the target voice sequence in waveform form.
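The unit-splicing idea of determination method 1 can be sketched as below. The unit bank, phoneme names, and waveform fragments are all invented for illustration; a real speech bank stores recorded waveform segments indexed by linguistic features.

```python
# Hypothetical first speech bank: phoneme -> waveform fragment.
UNIT_BANK = {
    "HH": [0.1, 0.2],
    "AY": [0.3, 0.4, 0.3],
}

def splice_units(phonemes):
    """Look up the speech unit matching each target linguistic feature
    and splice the units into one target voice sequence (waveform)."""
    wave = []
    for p in phonemes:
        wave.extend(UNIT_BANK[p])
    return wave

print(splice_units(["HH", "AY"]))  # [0.1, 0.2, 0.3, 0.4, 0.3]
```

Determination method 2 follows the same splicing pattern, except that the bank is searched by acoustic features rather than linguistic features.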
In an optional embodiment of the present invention, the end-to-end speech synthesis method can use a neural network, which may include a single-layer RNN (Recurrent Neural Network) and a dual active layer, where the dual active layer is used to predict 16-bit voice output. The state of the RNN is divided into two parts: a first (high 8 bits) state and a second (low 8 bits) state. The first state and the second state are input into their respective active layers; the second state is obtained based on the first state, and the first state is obtained based on the 16 bits of the previous moment. By designing the first state and the second state into one network structure, this neural network can accelerate training and simplify the training process, so the amount of computation of the neural network can be reduced, which in turn makes the end-to-end speech synthesis method suitable for mobile terminals with limited computing resources, such as mobile phones.
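The split of each 16-bit sample into a high-8-bit "first state" and a low-8-bit "second state" (a design that resembles the published WaveRNN architecture, though the text does not name it) can be illustrated as follows. The sample value is arbitrary.

```python
def split_sample(sample16):
    """Split an unsigned 16-bit sample into its high-8-bit (first state)
    and low-8-bit (second state) parts."""
    return sample16 >> 8, sample16 & 0xFF

def join_sample(high, low):
    """Recombine the two 8-bit states into the original 16-bit sample."""
    return (high << 8) | low

h, l = split_sample(0xABCD)
print(h, l)                          # 171 205
print(join_sample(h, l) == 0xABCD)   # True
```

Predicting the low 8 bits conditioned on the already-predicted high 8 bits is what lets each active layer work with a 256-way output instead of a 65536-way one.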
It can be appreciated that those skilled in the art can adopt any one or a combination of the above determination methods 1 to 3 according to practical application requirements; the embodiment of the present invention places no restriction on the specific process of determining the target voice sequence corresponding to the target linguistic features.
In step 103, the target image sequence corresponding to the target voice sequence is obtained according to the mapping relation; the rules between speech feature sequences and image feature sequences reflected by the mapping relation are applicable to any language, and therefore to a text corresponding to at least two languages.
In an optional embodiment of the present invention, determining the target image sequence corresponding to the target voice sequence may specifically include: determining, according to the target speech feature sequence corresponding to the target voice sequence and the mapping relation between speech feature sequences and image feature sequences, the target image feature sequence corresponding to the target speech feature sequence, from which the target image sequence corresponding to the target image feature sequence can then be determined.
A speech feature characterizes the voice and may specifically include: linguistic features and/or acoustic features. An image feature characterizes the entity image and may specifically include: the aforementioned entity state features.
In an optional embodiment of the present invention, determining the target image sequence corresponding to the target image feature sequence may specifically include: synthesizing the target entity image with the target image feature sequence to obtain the target image sequence, whereby the target image feature sequence can be endowed to the target entity image.
The target entity image can be specified by the user; for example, the target entity image may be the image of a celebrity (such as a host).
The target entity image may carry no entity state; synthesizing the target entity image with the target image feature sequence makes the target image sequence carry an entity state matching the text, which in turn can improve the naturalness and richness of the entity state in the target video.
In the embodiment of the present invention, optionally, a three-dimensional model corresponding to the target entity image can be synthesized with the target image feature sequence to obtain the target image sequence. The three-dimensional model can be obtained by three-dimensional reconstruction from multiple frames of the target entity image.
In practical applications, an entity usually exists in the form of a three-dimensional geometric body. A traditional two-dimensional image creates a visual sense of spatial depth through light-and-shade contrast and perspective, but cannot produce a natural, immersive three-dimensional perception. A three-dimensional image, whose spatial shape is close to the prototype, not only has the three-dimensional geometric features of height, width, and depth, but also carries lifelike state information, surpassing the sense of reality that a flat picture cannot provide and giving a warm, lifelike feeling.
In computer graphics, an entity is usually modeled with a three-dimensional model, which corresponds to the entity in space and can be displayed by a computer or other video device.
The features corresponding to a three-dimensional model may include: geometric features, texture features, entity state features, and so on; the entity state features may include expression features, lip features, limb features, and so on. Among these, geometric features are usually represented with polygons or voxels. Polygons express the geometric part of the three-dimensional model, that is, they represent or approximate the curved surfaces of the entity. Their basic objects are vertices in three-dimensional space; a straight line connecting two vertices is called an edge, and three vertices connected by three edges form a triangle, the simplest polygon in Euclidean space. Multiple triangles can compose more complex polygons, or generate a single entity with more than three vertices. Quadrilaterals and triangles are the most common shapes in polygon-expressed three-dimensional models; in expressing three-dimensional models, the triangle-mesh model has become a popular choice because of features such as its simple data structure and ease of rendering by all graphics hardware devices. Each triangle in it is a surface, so a triangle is also called a triangular patch.
The three-dimensional model can carry a default entity state and densely corresponding point cloud data; the default entity state may include a neutral expression, a closed-lip state, a drooping-arm state, and so on.
Synthesizing the three-dimensional model corresponding to the target entity image with the target image feature sequence can be realized by modifying vertex positions on the three-dimensional model and the like. The synthesis methods adopted may specifically include: the keyframe interpolation method, the parameterization method, and so on. The keyframe interpolation method interpolates the image features of keyframes. The parameterization method describes changes of entity state through parameters of the three-dimensional model; different entity states are obtained by adjusting these parameters.
With the keyframe interpolation method, the embodiment of the present invention can obtain interpolation vectors according to the target image feature sequence. With the parameterization method, the embodiment of the present invention can obtain parameter vectors according to the target image feature sequence.
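The keyframe interpolation method can be sketched as a linear blend between the vertex positions of two keyframes. The vertex coordinates below are invented for illustration; real models interpolate thousands of vertices per frame.

```python
def interpolate_vertices(key_a, key_b, t):
    """Linearly interpolate between two keyframe vertex positions of the
    3D model, with t in [0, 1]; intermediate frames of the target image
    sequence can be generated this way."""
    return [a + (b - a) * t for a, b in zip(key_a, key_b)]

# Halfway between a closed-lip and an open-lip vertex configuration.
print(interpolate_vertices([0.0, 1.0, 2.0], [2.0, 3.0, 2.0], 0.5))
# [1.0, 2.0, 2.0]
```

Under the parameterization method, the same intermediate state would instead be produced by setting a model parameter (e.g. a mouth-opening coefficient) to an intermediate value.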
It can be appreciated that the above keyframe interpolation method and parameterization method are merely optional embodiments of the synthesis method; in fact, those skilled in the art can adopt the required synthesis method according to practical application requirements, and the embodiment of the present application places no restriction on the specific synthesis method.
In step 103, the rules between speech feature sequences and image feature sequences are utilized in determining the image features corresponding to the target image sequence. The image features therein may include at least one of expression features, lip features, and limb features.
To improve the accuracy of the image features corresponding to the target image sequence, the embodiment of the present invention can also extend or adjust the image features corresponding to the target image sequence.
In an optional embodiment of the present invention, the limb features corresponding to the target image sequence can be obtained according to the semantic representation corresponding to the text. Since the embodiment of the present invention uses the semantic representation corresponding to the text in determining the limb features, the accuracy of the limb features can be improved.
In the embodiment of the present invention, optionally, any of the direction, position, speed, and strength parameters of a limb feature is related to the semantic representation corresponding to the text.
Optionally, the above semantic representation may involve affective features. Limb features can be classified according to affective features, so as to obtain the limb features corresponding to a class of affective features.
Optionally, the affective features may include: positive-affirmative, negative, neutral, and so on.
The position band of a limb feature may include: an upper band, a middle band, and a lower band. The upper band, above the shoulders, can express positive-affirmative affective features such as ideals, wishes, happiness, and congratulations. The middle band, from the shoulders to the waist, can describe things, illustrate reasoning, and express neutral emotions. The lower band, below the waist, can express negative emotions such as loathing, opposition, criticism, and disappointment.
Besides the position band, limb features may also include direction. For example, a palm turned upwards can express a positive-affirmative affective feature; as another example, a palm turned downwards can express a negative emotion.
In the embodiment of the present invention, the types of semantic representation may include: keywords, one-hot vectors, word embedding (WordEmbedding) vectors, and so on. Word embedding finds a mapping or function that generates an expression in a new space, and that expression is the word representation.
The embodiment of the present invention can determine, through a mapping relation between semantic representations and limb features, the limb features corresponding to the semantic representation of the text. The mapping relation between semantic representations and limb features can be obtained by statistical methods, or by end-to-end methods.
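The affective-feature-to-limb-feature mapping described above can be sketched as a lookup combining the position-band and palm-direction conventions from the preceding paragraphs. The table entries are illustrative assumptions, not a learned mapping.

```python
# Hypothetical mapping: affective feature -> limb feature
# (position band + palm direction), following the band convention above.
EMOTION_GESTURES = {
    "positive": {"band": "upper",  "palm": "up"},
    "neutral":  {"band": "middle", "palm": "side"},
    "negative": {"band": "lower",  "palm": "down"},
}

def limb_feature(affect):
    """Return the limb feature for an affective feature; unknown
    affects fall back to the neutral gesture."""
    return EMOTION_GESTURES.get(affect, EMOTION_GESTURES["neutral"])

print(limb_feature("positive"))  # {'band': 'upper', 'palm': 'up'}
```

A statistical or end-to-end method would learn such an association from data instead of hand-writing it.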
In step 104, since the mapping relation can be obtained according to speech samples and image samples aligned on the time axis, the target voice sequence and the target image sequence can be aligned on the time axis, and can therefore be fused to obtain the target video. Optionally, a multi-modal fusion technique can be used to fuse the target voice sequence and the target image sequence. It can be appreciated that the embodiment of the present invention places no restriction on the specific fusion method.
After the target video is obtained, it can be saved or output. For example, the server can send the target video to the client; as another example, the client can output the target video to the user.
In summary, in the data processing method of the embodiment of the present invention, the target voice sequence can match the timbre of the target sounding body, and the target image sequence can be obtained on the basis of the target entity image; the obtained target video can thus realize the expression of the text by the target entity image in the timbre of the target sounding body. Since the above target video can be generated by machine, the generation time of the target video can be shortened and its timeliness improved, so that the target video is applicable to content expression scenes with high timeliness requirements, such as breaking news scenes.
Moreover, since the target video expresses the text through the target entity image in the timbre of the target sounding body, labor costs can be saved and the working efficiency of related industries improved, relative to expressing the text manually.
In addition, the target image sequence corresponding to the target voice sequence is obtained according to the mapping relation; the rules between speech feature sequences and image feature sequences reflected by the mapping relation are applicable to any language, and therefore to a text corresponding to at least two languages.
Method Embodiment 2
Referring to Fig. 2, a flow chart of the steps of embodiment 2 of a data processing method of the present invention is shown, which may specifically include the following steps:
Step 201: receive a text; the text may involve at least two languages;
Step 202: determine a target voice sequence corresponding to the text;
Step 203: determine, according to a mapping relation between speech feature sequences and image feature sequences, a target image sequence corresponding to the target voice sequence;
in the mapping relation, the speech feature sequence and the image feature sequence are aligned on the time axis; the mapping relation may be obtained according to speech samples and image samples aligned on the time axis; the speech samples may involve one language, or the speech samples may involve multiple languages;
Step 204: compensate the boundary of a preset region in the target image sequence;
Step 205: fuse the target voice sequence and the compensated target image sequence to obtain the corresponding target video.
In determining the target image sequence corresponding to the target voice sequence, the embodiment of the present invention usually uses the three-dimensional model of the target entity image, and the limitations of the reconstruction method of the three-dimensional model and of the method for synthesizing the three-dimensional model with the image feature sequence easily cause detail-missing problems in the polygons of the three-dimensional model, so the target entity image corresponding to the target image sequence may be incomplete to some extent, for example with some teeth or parts of the nose missing. The embodiment of the present invention compensates the boundary of the preset region in the target image sequence, which can improve the integrity of the preset region.
The above preset region can characterize a part of the entity, such as the face or limbs; correspondingly, the above preset region may specifically include at least one of the following regions:
a facial region;
a dress region; and
a limb region.
In one embodiment of the present invention, compensating the boundary of a tooth region in the target image sequence can repair incomplete teeth or supplement teeth that did not appear, and can therefore improve the integrity of the tooth region.
It in practical applications, can be with reference to the target entity image including complete predeterminable area, to the target image sequence
The boundary of predeterminable area compensates in column, and the embodiment of the present invention is without restriction for specific compensation process.
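Since the patent leaves the compensation process unrestricted, the following is only one hypothetical reading of step 204: pixels missing inside a preset region are filled from a reference entity image that contains the complete region. The frame layout and the use of `None` for missing detail are assumptions for illustration.

```python
# Hypothetical boundary compensation: fill missing pixels of a preset region
# (e.g. the tooth region) from a reference image with the complete region.

def compensate_region(frame, reference, region):
    """Fill None pixels of `frame` inside `region` from `reference`.

    `frame` and `reference` are 2D lists of equal shape;
    `region` is (row_start, col_start, row_end, col_end).
    """
    r0, c0, r1, c1 = region
    out = [row[:] for row in frame]  # leave the input frame untouched
    for r in range(r0, r1):
        for c in range(c0, c1):
            if out[r][c] is None:          # detail lost in the 3D render
                out[r][c] = reference[r][c]  # borrow from the complete reference
    return out

frame = [[1, 1], [None, 1]]        # one missing pixel inside the region
reference = [[9, 9], [7, 9]]       # reference with the complete region
patched = compensate_region(frame, reference, (0, 0, 2, 2))
```

A production system would more likely use image inpainting or blending along the region boundary; the pixel-copy rule above only makes the compensation idea concrete.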
Method Embodiment Three
Referring to Fig. 3, a flow chart of the steps of a third embodiment of a data processing method of the present invention is shown, which may specifically include the following steps:
Step 301: receive a question-related text; the question-related text may involve at least two languages.
Step 302: determine the target speech sequence corresponding to the question-related text.
Step 303: according to the mapping relationship between speech feature sequences and image feature sequences, determine the target image sequence corresponding to the target speech sequence. The mode corresponding to the target image sequence may include an answering mode or a listening mode.
In the mapping relationship, the speech feature sequence and the image feature sequence are aligned on a time axis; the mapping relationship can be obtained according to time-axis-aligned speech samples and image samples. The speech samples may involve one language, or the speech samples may involve multiple languages.
Step 304: fuse the target speech sequence and the target image sequence to obtain the corresponding target video.
The embodiment of the present invention can be applied to question-and-answer interaction scenarios, such as customer service scenarios and video conference scenarios. In the embodiment of the present invention, the mode corresponding to the target image sequence may include an answering mode or a listening mode, which can improve the intelligence of the target image sequence in customer service scenarios.
The answering mode can refer to a mode in which a question is answered through the target video, and can correspond to a first entity state. In the answering mode, the target entity image corresponding to the target video can read aloud the answer to the question through the target speech sequence, and express emotion while reading the answer aloud through the first entity state corresponding to the target image sequence.
The listening mode can refer to a mode of listening to a question being input by a user, and can correspond to a second entity state. In the listening mode, the target entity image corresponding to the target video can express emotion while listening through the second entity state corresponding to the target image sequence. The second entity state may include features such as nodding. Optionally, in the listening mode, listening-state text such as "uh-huh" or "please continue" can also be expressed through the target speech sequence.
The question-related text may include answer text or listening-state text, where the answer text can correspond to the answering mode and the listening-state text can correspond to the listening mode.
In an alternative embodiment of the present invention, during the input of the question, the mode corresponding to the target image sequence is the listening mode; or,
after the input of the question is completed, the mode corresponding to the target image sequence can be the answering mode.
The embodiment of the present invention can switch the mode corresponding to the target image sequence according to whether the input of the question is complete. Optionally, if no input is received from the user within a preset duration, the input of the question can be considered complete.
In an alternative embodiment of the present invention, the mode corresponding to the target image sequence can be switched according to linking image samples, so as to improve the fluency of switching.
The linking image samples may include a first linking image sample. The first linking image sample may include an image sample corresponding to the listening mode followed by an image sample corresponding to the answering mode. By learning from the first linking image sample, a rule for switching from the listening mode to the answering mode can be obtained, thereby improving the fluency of switching from the listening mode to the answering mode.
The linking image samples may include a second linking image sample. The second linking image sample may include an image sample corresponding to the answering mode followed by an image sample corresponding to the listening mode. By learning from the second linking image sample, a rule for switching from the answering mode to the listening mode can be obtained, thereby improving the fluency of switching from the answering mode to the listening mode.
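One way to read the linking-sample idea is as a preprocessing step: scan a time-ordered stream of (mode, frame) samples and cut out the frames surrounding each mode change, yielding first and second linking samples as training material for the switching rule. The data layout and the context-window size below are assumptions; the actual learning procedure is not modeled here.

```python
# Illustrative extraction of linking image samples around mode transitions.

def extract_linking_segments(samples, context=1):
    """Return (from_mode, to_mode, frames) for each mode transition.

    `samples` is a list of (mode, frame) pairs ordered in time;
    `context` frames on each side of the transition are kept.
    """
    segments = []
    for i in range(1, len(samples)):
        prev_mode, next_mode = samples[i - 1][0], samples[i][0]
        if prev_mode != next_mode:
            lo, hi = max(0, i - context), min(len(samples), i + context)
            frames = [f for _, f in samples[lo:hi]]
            segments.append((prev_mode, next_mode, frames))
    return segments

stream = [("listen", "f0"), ("listen", "f1"), ("answer", "f2"), ("listen", "f3")]
segs = extract_linking_segments(stream)
```

Here `segs` contains one listen-to-answer segment (a first linking sample) and one answer-to-listen segment (a second linking sample).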
An example of a data processing method of the present invention may specifically include the following steps:
Step S1: in the listening mode, play a first target video and receive a question input by the user.
The first target video can correspond to the listening mode, and can be obtained from a first target speech sequence and a first target image sequence, where the first target image sequence can correspond to the listening mode.
Step S2: determine whether the input of the question is complete; if so, execute step S3, otherwise return to step S1.
Step S3: set the mode corresponding to the target image sequence to the answering mode, and play a second target video.
The determination process of the second target video may include:
Step S31: determine the answer text;
Step S32: determine a second target speech sequence corresponding to the answer text;
Step S33: according to the mapping relationship between speech feature sequences and image feature sequences, determine a second target image sequence corresponding to the second target speech sequence, where the second target image sequence can correspond to the answering mode;
Step S34: fuse the second target speech sequence and the second target image sequence to obtain the corresponding second target video.
Step S4: after the second target video finishes playing, set the mode corresponding to the target image sequence back to the listening mode.
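The S1 through S4 loop above amounts to a small two-state machine. The class below models only the mode switching; playing videos and producing answers (steps S31 through S34) are stubbed with a placeholder string, and all names are illustrative assumptions.

```python
# Sketch of the S1-S4 interaction loop as a listening/answering state machine.

class QAInteraction:
    def __init__(self):
        self.mode = "listening"   # S1: start in the listening mode
        self.buffer = []

    def receive(self, fragment):
        """S1/S2: accumulate question input while listening."""
        if self.mode == "listening":
            self.buffer.append(fragment)

    def complete_input(self):
        """S2 true branch, then S3: switch to answering and build the video."""
        self.mode = "answering"
        question = " ".join(self.buffer)
        self.buffer = []
        return f"video(answer to: {question})"  # stand-in for steps S31-S34

    def playback_finished(self):
        """S4: return to the listening mode after the answer video ends."""
        self.mode = "listening"

qa = QAInteraction()
qa.receive("what")
qa.receive("time is it")
video = qa.complete_input()
qa.playback_finished()
```

After the answer video finishes, the machine is back in the listening mode, ready for the next question, matching the return to S1.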
It can be appreciated that outputting the target video as described above is only an alternative embodiment; in fact, the embodiment of the present invention can output a link to the target video to the user, so that the user can decide whether to play the target video.
Optionally, the embodiment of the present invention can also output the target speech sequence, or a link to the target speech sequence, to the user.
Optionally, the embodiment of the present invention can also output the question-related text to the user. The question-related text may include answer text or listening-state text, where the answer text can correspond to the answering mode and the listening-state text can correspond to the listening mode.
In an alternative embodiment of the present invention, the above question-and-answer interaction can correspond to a communication window, and at least one of the following items of information can be displayed in the communication window: the link to the target speech sequence, the answer text, and the link to the target video. The link to the target video can be displayed in the identification area of the communication party, where the identification area can be used to display information such as the nickname, ID (identity), and avatar of the communication party.
It should be noted that, for simplicity of description, the method embodiments are described as a series of action combinations; however, those skilled in the art should understand that the embodiments of the present invention are not limited by the described sequence of actions, because according to the embodiments of the present invention, some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also be aware that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present invention.
Device Embodiment
Referring to Fig. 4, a structural block diagram of an embodiment of a data processing device of the present invention is shown, which may specifically include:
a receiving module 401, configured to receive a text, the text involving at least two languages;
a voice determining module 402, configured to determine the target speech sequence corresponding to the text;
an image determining module 403, configured to determine, according to the mapping relationship between speech feature sequences and image feature sequences, the target image sequence corresponding to the target speech sequence, where in the mapping relationship the speech feature sequence and the image feature sequence are aligned on a time axis, the mapping relationship can be obtained according to time-axis-aligned speech samples and image samples, and the speech samples involve one language or the speech samples involve multiple languages; and
a fusion module 404, configured to fuse the target speech sequence and the target image sequence to obtain the corresponding target video.
Optionally, the speech samples and the image samples originate from the same video file; or
the speech samples originate from an audio file, and the image samples originate from a video file or an image file.
Optionally, the image features corresponding to the target image sequence may include at least one of the following features:
expression features;
lip features; and
limb features.
Optionally, the limb features corresponding to the target image sequence are obtained according to the semantic representation corresponding to the text.
Optionally, the device may also include:
a compensating module, configured to compensate the boundary of a preset region in the target image sequence before the fusion module fuses the target speech sequence and the target image sequence.
Optionally, the preset region may include at least one of the following regions:
a facial region;
a clothing region; and
a limb region.
Optionally, the text may include a question-related text in a question-and-answer interaction; and
the mode corresponding to the target image sequence may include an answering mode or a listening mode.
Optionally, during the input of the question, the mode corresponding to the target image sequence is the listening mode; or,
after the input of the question is completed, the mode corresponding to the target image sequence is the answering mode.
Optionally, the device may also include:
a first output module, configured to output the target video to the user; or
a second output module, configured to output a link to the target video to the user; or
a third output module, configured to output the target speech sequence or a link to the target speech sequence to the user; or
a fourth output module, configured to output the question-related text to the user.
Since the device embodiments are basically similar to the method embodiments, their description is relatively simple; for relevant details, refer to the description of the method embodiments.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments can be referred to each other.
Regarding the devices in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the related method, and will not be elaborated here.
Fig. 5 is a structural block diagram of a device 900 for data processing according to an exemplary embodiment. For example, the device 900 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like.
Referring to Fig. 5, the device 900 may include one or more of the following components: a processing component 902, a memory 904, a power supply component 906, a multimedia component 908, an audio component 910, an input/output (I/O) interface 912, a sensor component 914, and a communication component 916.
The processing component 902 typically controls the overall operations of the device 900, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component 902 may include one or more processors 920 to execute instructions, so as to perform all or part of the steps of the methods described above. In addition, the processing component 902 may include one or more modules to facilitate interaction between the processing component 902 and other components; for example, the processing component 902 may include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.
The memory 904 is configured to store various types of data to support operation of the device 900. Examples of such data include instructions for any application or method operated on the device 900, contact data, phone book data, messages, pictures, video, and so on. The memory 904 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The power supply component 906 provides power to the various components of the device 900. The power supply component 906 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 900.
The multimedia component 908 includes a screen providing an output interface between the device 900 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 908 includes a front camera and/or a rear camera. When the device 900 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focusing and optical zoom capabilities.
The audio component 910 is configured to output and/or input audio signals. For example, the audio component 910 includes a microphone (MIC); when the device 900 is in an operation mode, such as a call mode, a recording mode, or a voice recognition mode, the microphone is configured to receive external audio signals. The received audio signals may be further stored in the memory 904 or sent via the communication component 916. In some embodiments, the audio component 910 also includes a loudspeaker for outputting audio signals.
The I/O interface 912 provides an interface between the processing component 902 and peripheral interface modules, which may be keyboards, click wheels, buttons, and the like. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 914 includes one or more sensors for providing state assessments of various aspects of the device 900. For example, the sensor component 914 can detect the open/closed state of the device 900 and the relative positioning of components, such as the display and keypad of the device 900; the sensor component 914 can also detect a change in position of the device 900 or a component of the device 900, the presence or absence of contact between the user and the device 900, the orientation or acceleration/deceleration of the device 900, and a change in temperature of the device 900. The sensor component 914 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 914 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 914 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 916 is configured to facilitate wired or wireless communication between the device 900 and other devices. The device 900 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 916 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 916 also includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module can be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the device 900 can be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, such as the memory 904 including instructions, and the above instructions can be executed by the processor 920 of the device 900 to complete the above methods. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
Fig. 6 is a structural block diagram of a server in some embodiments of the present invention. The server 1900 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 1922 (for example, one or more processors), a memory 1932, and one or more storage media 1930 (for example, one or more mass storage devices) storing application programs 1942 or data 1944. The memory 1932 and the storage medium 1930 may provide transient storage or persistent storage. The programs stored in the storage medium 1930 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server. Furthermore, the central processing unit 1922 can be configured to communicate with the storage medium 1930 and to execute, on the server 1900, the series of instruction operations in the storage medium 1930.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
A non-transitory computer-readable storage medium is provided; when the instructions in the storage medium are executed by a processor of a device (a device or a server), the device is enabled to execute a data processing method, the method comprising: receiving a text, the text involving at least two languages; determining a target speech sequence corresponding to the text; determining, according to a mapping relationship between speech feature sequences and image feature sequences, a target image sequence corresponding to the target speech sequence, wherein in the mapping relationship the speech feature sequence and the image feature sequence are aligned on a time axis, the mapping relationship is obtained according to time-axis-aligned speech samples and image samples, and the speech samples involve one language or the speech samples involve multiple languages; and fusing the target speech sequence and the target image sequence to obtain a corresponding target video.
Embodiments of the present invention disclose A1, a data processing method, comprising:
receiving a text, the text involving at least two languages;
determining a target speech sequence corresponding to the text;
determining, according to a mapping relationship between speech feature sequences and image feature sequences, a target image sequence corresponding to the target speech sequence, wherein in the mapping relationship the speech feature sequence and the image feature sequence are aligned on a time axis, the mapping relationship is obtained according to time-axis-aligned speech samples and image samples, and the speech samples involve one language or the speech samples involve multiple languages; and
fusing the target speech sequence and the target image sequence to obtain a corresponding target video.
A2. The method according to A1, wherein the speech samples and the image samples originate from the same video file; or
the speech samples originate from an audio file, and the image samples originate from a video file or an image file.
A3. The method according to A1, wherein the image features corresponding to the target image sequence include at least one of the following features:
expression features;
lip features; and
limb features.
A4. The method according to A1, wherein the limb features corresponding to the target image sequence are obtained according to a semantic representation corresponding to the text.
A5. The method according to any one of A1 to A4, wherein before the fusing of the target speech sequence and the target image sequence, the method further comprises:
compensating a boundary of a preset region in the target image sequence.
A6. The method according to A5, wherein the preset region includes at least one of the following regions:
a facial region;
a clothing region; and
a limb region.
A7. The method according to any one of A1 to A4, wherein the text includes a question-related text in a question-and-answer interaction; and
a mode corresponding to the target image sequence includes an answering mode or a listening mode.
A8. The method according to A7, wherein during input of the question, the mode corresponding to the target image sequence is the listening mode; or
after the input of the question is completed, the mode corresponding to the target image sequence is the answering mode.
A9. The method according to A7, the method further comprising:
outputting the target video to a user; or
outputting a link to the target video to a user; or
outputting the target speech sequence or a link to the target speech sequence to a user; or
outputting the question-related text to a user.
Embodiments of the present invention disclose B10, a data processing device, comprising:
a receiving module, configured to receive a text, the text involving at least two languages;
a voice determining module, configured to determine a target speech sequence corresponding to the text;
an image determining module, configured to determine, according to a mapping relationship between speech feature sequences and image feature sequences, a target image sequence corresponding to the target speech sequence, wherein in the mapping relationship the speech feature sequence and the image feature sequence are aligned on a time axis, the mapping relationship is obtained according to time-axis-aligned speech samples and image samples, and the speech samples involve one language or the speech samples involve multiple languages; and
a fusion module, configured to fuse the target speech sequence and the target image sequence to obtain a corresponding target video.
B11. The device according to B10, wherein the speech samples and the image samples originate from the same video file; or
the speech samples originate from an audio file, and the image samples originate from a video file or an image file.
B12. The device according to B10, wherein the image features corresponding to the target image sequence include at least one of the following features:
expression features;
lip features; and
limb features.
B13. The device according to B10, wherein the limb features corresponding to the target image sequence are obtained according to a semantic representation corresponding to the text.
B14. The device according to any one of B10 to B13, the device further comprising:
a compensating module, configured to compensate a boundary of a preset region in the target image sequence before the fusion module fuses the target speech sequence and the target image sequence.
B15. The device according to B14, wherein the preset region includes at least one of the following regions:
a facial region;
a clothing region; and
a limb region.
B16. The device according to any one of B10 to B13, wherein the text includes a question-related text in a question-and-answer interaction; and
a mode corresponding to the target image sequence includes an answering mode or a listening mode.
B17. The device according to B16, wherein during input of the question, the mode corresponding to the target image sequence is the listening mode; or
after the input of the question is completed, the mode corresponding to the target image sequence is the answering mode.
B18. The device according to B16, the device further comprising:
a first output module, configured to output the target video to a user; or
a second output module, configured to output a link to the target video to a user; or
a third output module, configured to output the target speech sequence or a link to the target speech sequence to a user; or
a fourth output module, configured to output the question-related text to a user.
Embodiments of the present invention disclose C19, a device for data processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs include instructions for performing the following operations:
receiving a text, the text involving at least two languages;
determining a target speech sequence corresponding to the text;
determining, according to a mapping relationship between speech feature sequences and image feature sequences, a target image sequence corresponding to the target speech sequence, wherein in the mapping relationship the speech feature sequence and the image feature sequence are aligned on a time axis, the mapping relationship is obtained according to time-axis-aligned speech samples and image samples, and the speech samples involve one language or the speech samples involve multiple languages; and
fusing the target speech sequence and the target image sequence to obtain a corresponding target video.
C20. The device according to C19, wherein the speech samples and the image samples originate from the same video file; or
the speech samples originate from an audio file, and the image samples originate from a video file or an image file.
C21. The device according to C19, wherein the image features corresponding to the target image sequence include at least one of the following features:
expression features;
lip features; and
limb features.
C22. The device according to C19, wherein the limb features corresponding to the target image sequence are obtained according to a semantic representation corresponding to the text.
C23. The device according to any one of C19 to C22, wherein before the fusing of the target speech sequence and the target image sequence, the device is further configured such that execution of the one or more programs by the one or more processors includes instructions for performing the following operation:
compensating a boundary of a preset region in the target image sequence.
C24. The device according to C23, wherein the preset region includes at least one of the following regions:
a facial region;
a clothing region; and
a limb region.
C25. The device according to any one of C19 to C22, wherein the text includes a question-related text in a question-and-answer interaction; and
a mode corresponding to the target image sequence includes an answering mode or a listening mode.
C26. The device according to C25, wherein during input of the question, the mode corresponding to the target image sequence is the listening mode; or
after the input of the question is completed, the mode corresponding to the target image sequence is the answering mode.
C27. The device according to C25, wherein the device is further configured such that execution of the one or more programs by the one or more processors includes instructions for performing the following operations:
outputting the target video to a user; or
outputting a link to the target video to a user; or
outputting the target speech sequence or a link to the target speech sequence to a user; or
outputting the question-related text to a user.
Embodiments of the present invention disclose D28, a machine-readable medium having instructions stored thereon which, when executed by one or more processors, cause a device to execute the data processing method according to any one of A1 to A9.
Other embodiments of the present invention will be readily apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed here. The present invention is intended to cover any variations, uses, or adaptations of the invention that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The description and examples are to be considered as exemplary only, with the true scope and spirit of the invention being indicated by the following claims.
It should be understood that the present invention is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims.
The foregoing is merely the preferred embodiments of the present invention and is not intended to limit the present invention; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.
A data processing method, a data processing device, and a device for data processing provided by the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is only intended to help understand the method of the present invention and its core ideas. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementation and application scope according to the ideas of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.