CN111899738A - Dialogue generating method, device and storage medium - Google Patents

Dialogue generating method, device and storage medium

Info

Publication number
CN111899738A
CN111899738A (application CN202010742806.3A)
Authority
CN
China
Prior art keywords
signal
features
neural network
modal
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010742806.3A
Other languages
Chinese (zh)
Inventor
李武波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Didi Infinity Technology and Development Co Ltd
Original Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Didi Infinity Technology and Development Co Ltd filed Critical Beijing Didi Infinity Technology and Development Co Ltd
Priority to CN202010742806.3A priority Critical patent/CN111899738A/en
Publication of CN111899738A publication Critical patent/CN111899738A/en
Pending legal-status Critical Current

Classifications

    • G10L 15/22 Speech recognition; procedures used during a speech recognition process, e.g. man-machine dialogue
    • G06F 16/3329 Information retrieval; querying; natural language query formulation or dialogue systems
    • G06F 16/3343 Information retrieval; querying; query execution using phonetics
    • G06N 3/045 Neural networks; architecture; combinations of networks
    • G06N 3/08 Neural networks; learning methods
    • G06V 10/462 Image or video recognition; salient features, e.g. scale invariant feature transform [SIFT]
    • G06V 10/56 Image or video recognition; extraction of features relating to colour
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 25/30 Speech or voice analysis characterised by the analysis technique using neural networks
    • G10L 25/57 Speech or voice analysis specially adapted for comparison or discrimination, for processing of video signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a dialog generation method, apparatus, and storage medium. The method acquires a multi-modal signal in a target dialogue scene, determines the signal features of the multi-modal signal, performs feature enhancement on those features, inputs the enhanced features into a preset neural network for high-level feature extraction, and inputs the extracted high-level features into another neural network to generate the target dialogue sentence. The multi-modal signal includes two or more of a voice signal, an image signal, and a text signal, so the acquired information is more comprehensive. Because the signal features of the multi-modal signal are enhanced, the enhanced features carry richer information, and because high-level features are extracted by one neural network, the other neural network's ability to understand and reason over the multi-modal information improves. The generated dialogue sentences therefore have higher accuracy and relevance, and the performance of a dialogue system based on the embodiments of the application improves.

Description

Dialogue generating method, device and storage medium
Technical Field
The present application relates to computer technologies, and in particular, to a method and an apparatus for generating a dialog, and a storage medium.
Background
With the rapid development of science, technology, and the economy, society is gradually becoming service-oriented so as to serve users better. The intelligent dialogue systems popular today grew out of this idea. After receiving a question initiated by a user, an intelligent dialogue system answers it automatically, and each round of question and answer forms a dialogue between a person and a machine.
In the related art, during a man-machine conversation the intelligent dialogue system generally generates the reply content from voice information alone. Taking car navigation as an example, a user initiates a question such as "What is the route to location A?", and the dialogue system in the navigation software generates a reply from the voice information, for example by performing semantic analysis on it, extracting the two entities "location A" and "route", and then replying according to those two entities.
However, in the above man-machine conversation the intelligent dialogue system generates its reply from the voice information alone. The information it obtains is limited, and the features extracted from the voice information carry little information, so the generated reply is easily wrong, which degrades the performance of the dialogue system.
Disclosure of Invention
In order to solve the problems in the prior art, the present application provides a dialog generation method, apparatus, and storage medium.
In a first aspect, an embodiment of the present application provides a dialog generation method, including:
acquiring a multi-modal signal in a target dialogue scene, wherein the multi-modal signal comprises two or more of a voice signal, an image signal and a text signal;
determining signal features of the multi-modal signal;
performing feature enhancement on the signal features to obtain enhanced features;
inputting the enhanced features into a first preset neural network, wherein the first preset neural network is obtained through training of signal features and dialogue sentences of multi-modal signals in a dialogue scene;
and acquiring the target dialogue statement output by the first preset neural network.
In one possible implementation, the performing feature enhancement on the signal feature includes:
and if the multi-modal signal comprises a speech signal, performing speech feature enhancement on signal features of the speech signal, wherein the speech feature enhancement comprises one or more of time domain warping, frequency domain masking and time domain masking.
In one possible implementation, the performing feature enhancement on the signal feature includes:
and if the multi-modal signal comprises an image signal, performing image feature enhancement on the signal features of the image signal, wherein the image feature enhancement comprises one or more of picture cropping, Gaussian blur processing, contrast adjustment, Gaussian noise processing and affine transformation.
In one possible implementation, the performing feature enhancement on the signal feature includes:
performing text feature enhancement on signal features of the text signal if the multi-modal signal comprises a text signal, wherein the text feature enhancement comprises one or more of synonym replacement and context-based word replacement.
In one possible implementation manner, before the inputting the enhanced features into the first preset neural network, the method further includes:
inputting the enhanced features into a second preset neural network, wherein the second preset neural network is obtained through signal feature and high-level feature training;
acquiring a target high-level feature output by the second preset neural network;
the inputting the enhanced features into a first preset neural network comprises:
and inputting the target high-level features into the first preset neural network.
In one possible implementation, the high-level features include one or more of VGGish features of speech, I3D Red-Green-Blue (RGB) features and I3D Flow features of images, and word vectors of text.
In one possible implementation, the determining signal features of the multi-modal signal includes:
and if the multi-modal signal comprises a voice signal, performing voice preprocessing on the voice signal to obtain the signal features of the voice signal, wherein the voice preprocessing comprises one or more of Voice Activity Detection (VAD), short-time Fourier transform (STFT) and filter-bank (F-BANK) feature extraction.
In one possible implementation, the determining signal features of the multi-modal signal includes:
and if the multi-modal signal comprises an image signal, performing image preprocessing on the image signal to obtain the signal characteristics of the image signal, wherein the image preprocessing comprises one or more of image enhancement and normalization.
In one possible implementation, the determining signal features of the multi-modal signal includes:
if the multi-modal signal comprises a voice signal, inputting the voice signal into a third preset neural network, wherein the third preset neural network is trained on voice signals and the signal features of those voice signals;
and acquiring the signal characteristics of the voice signal output by the third preset neural network.
In one possible implementation, the determining signal features of the multi-modal signal includes:
if the multi-modal signal comprises an image signal, inputting the image signal into a fourth preset neural network, wherein the fourth preset neural network is trained on image signals and the signal features of those image signals;
and acquiring the signal characteristics of the image signal output by the fourth preset neural network.
In a second aspect, an embodiment of the present application provides a dialog generating apparatus, including:
a first acquisition module, configured to acquire a multi-modal signal in a target dialogue scene, wherein the multi-modal signal comprises two or more of a voice signal, an image signal and a text signal;
a determination module to determine signal characteristics of the multi-modal signal;
the enhancement module is used for carrying out feature enhancement on the signal features to obtain enhanced features;
the first input module is used for inputting the enhanced features into a first preset neural network, wherein the first preset neural network is obtained by training signal features and dialogue sentences of multi-modal signals in a dialogue scene;
and the second acquisition module is used for acquiring the target dialogue statement output by the first preset neural network.
In a possible implementation manner, the enhancing module is specifically configured to:
and if the multi-modal signal comprises a speech signal, performing speech feature enhancement on signal features of the speech signal, wherein the speech feature enhancement comprises one or more of time domain warping, frequency domain masking and time domain masking.
In a possible implementation manner, the enhancing module is specifically configured to:
and if the multi-modal signal comprises an image signal, performing image feature enhancement on the signal features of the image signal, wherein the image feature enhancement comprises one or more of picture cropping, Gaussian blur processing, contrast adjustment, Gaussian noise processing and affine transformation.
In a possible implementation manner, the enhancing module is specifically configured to:
performing text feature enhancement on signal features of the text signal if the multi-modal signal comprises a text signal, wherein the text feature enhancement comprises one or more of synonym replacement and context-based word replacement.
In a possible implementation manner, the apparatus further includes:
the second input module is used for inputting the enhanced features into a second preset neural network before the first input module inputs the enhanced features into the first preset neural network, wherein the second preset neural network is obtained through signal feature and high-level feature training;
the third acquisition module is used for acquiring the target high-level features output by the second preset neural network;
the first input module is specifically configured to:
and inputting the target high-level features into the first preset neural network.
In one possible implementation, the high-level features include one or more of VGGish features for speech, I3D RGB features and I3D Flow features for images, and word vectors for text.
In a possible implementation manner, the determining module is specifically configured to:
and if the multi-modal signal comprises a voice signal, performing voice preprocessing on the voice signal to obtain the signal features of the voice signal, wherein the voice preprocessing comprises one or more of VAD, STFT and F-BANK.
In a possible implementation manner, the determining module is specifically configured to:
and if the multi-modal signal comprises an image signal, performing image preprocessing on the image signal to obtain the signal characteristics of the image signal, wherein the image preprocessing comprises one or more of image enhancement and normalization.
In a possible implementation manner, the determining module is specifically configured to:
if the multi-modal signal comprises a voice signal, inputting the voice signal into a third preset neural network, wherein the third preset neural network is trained on voice signals and the signal features of those voice signals;
and acquiring the signal characteristics of the voice signal output by the third preset neural network.
In a possible implementation manner, the determining module is specifically configured to:
if the multi-modal signal comprises an image signal, inputting the image signal into a fourth preset neural network, wherein the fourth preset neural network is trained on image signals and the signal features of those image signals;
and acquiring the signal characteristics of the image signal output by the fourth preset neural network.
In a third aspect, an embodiment of the present application provides a server, including:
a processor;
a memory; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor, the computer program comprising instructions for performing the method of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored, and the computer program causes a server to execute the method according to the first aspect.
Compared with the prior art, which acquires only voice information, the dialog generation method and apparatus provided by the embodiments of the application acquire more comprehensive information. In addition, the signal features of the multi-modal signal are enhanced, so the enhanced features carry richer information, which improves the neural network's ability to understand and reason over the multi-modal information. The generated dialogue sentences therefore have higher accuracy and relevance, and the performance of a dialogue system based on the embodiments of the application improves.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a schematic diagram of a dialog generation system architecture provided in an embodiment of the present application;
fig. 2 is a schematic flowchart of a dialog generation method according to an embodiment of the present application;
fig. 3 is a schematic flowchart of another dialog generation method according to an embodiment of the present application;
fig. 4 is a schematic flowchart of another dialog generation method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a dialog generation provided by an embodiment of the present application;
fig. 6 is a schematic structural diagram of a dialog generating device according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of another dialog generating device according to an embodiment of the present application;
FIG. 8A is a diagram of one possible basic hardware architecture of a dialog generating device according to an embodiment of the present application;
fig. 8B is another possible basic hardware architecture diagram of a dialog generating device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," and "fourth," if any, in the description and claims of this application and the above-described figures are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Dialog generation in the embodiments of the present application refers to acquiring a multi-modal signal in a dialogue scene, where the multi-modal signal includes two or more of a voice signal, an image signal and a text signal, and then performing feature enhancement on the signal features of the multi-modal signal so that a neural network generates dialogue sentences from the enhanced signal features. This improves the neural network's understanding of and reasoning over the multi-modal information, and the generated dialogue sentences have higher accuracy and relevance.
The dialog generation method provided by the embodiment of the application can be applied to application scenes such as an intelligent terminal auxiliary system, automobile navigation, an intelligent sound box and a human-computer interaction robot, and the embodiment of the application is not particularly limited.
Optionally, fig. 1 is a schematic diagram of a dialog generation system architecture. In fig. 1, taking car navigation as an example, the architecture includes a processing device 11 and a plurality of information acquiring devices, such as a voice acquiring device, an image acquiring device and a text acquiring device; this embodiment of the application does not particularly limit them. Here, the processing device 11 may be disposed in the navigation system of a car, and the information acquiring devices are exemplified by a voice acquiring device 12, an image acquiring device 13 and a text acquiring device 14.
It is to be understood that the illustrated structure of the embodiments of the present application does not constitute a specific limitation on the dialog generation architecture. In other possible embodiments of the present application, the architecture may include more or fewer components than shown, combine some components, split some components, or arrange components differently, as determined by the practical application scenario; this is not limited here. The components shown in fig. 1 may be implemented in hardware, software, or a combination of software and hardware.
In a specific implementation process, the number and placement of the voice acquiring device 12, the image acquiring device 13 and the text acquiring device 14 may be determined according to the actual situation; this embodiment of the application does not particularly limit them. In this application scenario, when a user converses with the navigation system of a car while driving, the processing device 11 in the navigation system may acquire the multi-modal signal in the dialogue scene. Specifically, taking the multi-modal signal to include a voice signal, an image signal and a text signal, the processing device 11 may acquire the voice signal through the voice acquiring device 12, the image signal through the image acquiring device 13, and the text signal through the text acquiring device 14. The processing device 11 may then perform feature enhancement on the signal features of the multi-modal signal and generate a dialogue sentence from the enhanced signal features through a neural network. Because the processing device 11 acquires a multi-modal signal, it obtains the information in the dialogue scene more comprehensively; because it enhances the signal features, the enhanced features carry richer information. This improves the neural network's understanding of and reasoning over the multi-modal information, so the generated dialogue sentences are more accurate and more relevant, the dialogue performance of the navigation system improves, and the user obtains accurate navigation information and a better experience from conversing with the navigation system.
In addition, the system architecture and the service scenario described in the embodiment of the present application are for more clearly illustrating the technical solution of the embodiment of the present application, and do not constitute a limitation to the technical solution provided in the embodiment of the present application, and it can be known by a person skilled in the art that along with the evolution of the system architecture and the appearance of a new service scenario, the technical solution provided in the embodiment of the present application is also applicable to similar technical problems.
The technical solutions of the present application are described below with several embodiments as examples, and the same or similar concepts or processes may not be described in detail in some embodiments.
Fig. 2 is a flowchart illustrating a dialog generating method according to an embodiment of the present application, where the dialog generating method according to the embodiment of the present application may be executed by the processing device 11 in fig. 1, and the device may be implemented by software and/or hardware. As shown in fig. 2, the dialog generating method provided in the embodiment of the present application includes the following steps:
S201: A multi-modal signal in a target dialogue scene is acquired, the multi-modal signal including two or more of a voice signal, an image signal, and a text signal.
The target dialogue scene may be determined according to an actual situation, for example, in fig. 1, the user has a dialogue with a navigation system on an automobile during driving, which is not particularly limited in the embodiment of the present application.
Modality refers to the way something occurs or exists, such as sound, images, and text. Here, the multi-modal signal includes two or more of a voice signal, an image signal, and a text signal, where the image signal includes pictures and/or video.
Illustratively, the manner of acquiring the multi-modal signal in the target dialogue scene may be determined according to the actual situation. For example, in fig. 1, the processing device 11 acquires the voice signal in the dialogue scene through the voice acquiring device 12, the image signal through the image acquiring device 13, and the text signal through the text acquiring device 14; this embodiment of the application does not particularly limit the manner.
S202: signal features of the multi-modal signal are determined.
Here, taking as an example that the multi-modal signal includes a speech signal, the signal features of the speech signal include a spectrogram, F-bank features, and the like.
In one possible implementation, if the multi-modal signal includes a speech signal, the speech signal may be subjected to speech pre-processing to obtain signal characteristics of the speech signal, wherein the speech pre-processing includes one or more of VAD, STFT, and F-BANK.
In addition, if the multi-modal signal includes a speech signal, the speech signal may be input to a third preset neural network trained on speech signals and the signal features of those speech signals, so as to obtain the signal features of the speech signal output by the third preset neural network.
In the embodiment of the present application, the signal features of the voice signal may be extracted through VAD, STFT, F-BANK or the like, or through a deep learning method such as a neural network; the choice depends on the situation, and the embodiment of the present application does not particularly limit it.
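As an illustration of the preprocessing route, the following is a minimal Python sketch that computes STFT and log filter-bank (F-BANK) features for one utterance. The 16 kHz sample rate, the window and hop sizes, and the use of librosa's energy-based trimming as a crude stand-in for VAD are illustrative assumptions, not values fixed by this application.

    import numpy as np
    import librosa

    def speech_signal_features(path, sr=16000, n_fft=400, hop=160, n_mels=40):
        """Log mel filter-bank (F-BANK) features for one utterance."""
        wave, _ = librosa.load(path, sr=sr)              # resample to 16 kHz
        wave, _ = librosa.effects.trim(wave, top_db=30)  # energy trim, a crude stand-in for VAD
        power = np.abs(librosa.stft(wave, n_fft=n_fft, hop_length=hop)) ** 2  # STFT power spectrum
        mel = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)          # mel filter bank
        fbank = np.log(mel @ power + 1e-10)              # log F-BANK, shape (n_mels, frames)
        return fbank.T                                   # (frames, n_mels)

The power matrix computed above is itself a spectrogram, matching the signal-feature examples given earlier.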
In one possible implementation, if the multi-modal signal includes an image signal, the image signal may be subjected to image pre-processing to obtain a signal characteristic of the image signal, wherein the image pre-processing includes one or more of image enhancement and normalization.
In addition, if the multi-modal signal includes an image signal, the image signal may be input to a fourth preset neural network trained on image signals and the signal features of those image signals, so as to obtain the signal features of the image signal output by the fourth preset neural network.
Here, the signal features of the image signal may be extracted through methods such as image enhancement and normalization, or through a neural network such as VGGish or a network pre-trained on ImageNet; the choice depends on the situation, and this embodiment of the present application does not particularly limit it.
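As an illustration of the image route, here is a minimal sketch of preprocessing one frame by resizing and normalization; the 224x224 input size and the ImageNet channel statistics are conventional assumptions rather than values specified by this application.

    import numpy as np
    from PIL import Image

    def image_signal_features(path, size=(224, 224)):
        """Resize one frame and normalize it to the range a CNN expects."""
        img = Image.open(path).convert("RGB").resize(size)
        x = np.asarray(img, dtype=np.float32) / 255.0    # scale to [0, 1]
        mean = np.array([0.485, 0.456, 0.406])           # ImageNet channel means (assumed)
        std = np.array([0.229, 0.224, 0.225])            # ImageNet channel stds (assumed)
        return (x - mean) / std                          # per-channel normalization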
S203: and performing characteristic enhancement on the signal characteristics to obtain enhanced characteristics.
Illustratively, taking the multi-modal signal to include a voice signal, an image signal and a text signal, after determining the signal features of the voice, image and text signals, the processing device 11 enhances the determined signal features to obtain enhanced voice, image and text features, that is, the enhanced features. The enhanced features carry richer information, which improves the subsequent neural network's understanding of and reasoning over the multi-modal information and yields more accurate dialogue sentences.
S204: and inputting the enhanced features into a first preset neural network, wherein the first preset neural network is obtained by training signal features and dialogue sentences of multi-modal signals in a dialogue scene.
S205: and acquiring a target dialogue statement output by the first preset neural network.
The processing device 11 trains a first preset neural network by using a large number of signal features and dialogue sentences of multi-modal signals in a dialogue scene, and inputs the enhanced features into the first preset neural network after the training is completed, so as to obtain a target dialogue sentence output by the first preset neural network.
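How that training might look in code, as a minimal sketch: pairing signal features with gold dialogue sentences is from the text above, while teacher forcing and a cross-entropy loss are standard assumptions, not details given by the application.

    import torch
    import torch.nn as nn

    # model maps (previous tokens, multi-modal features) to next-token logits;
    # one plausible architecture is sketched later in this description.
    def train_step(model, opt, batch):
        """One teacher-forced training step on (features, sentence) pairs."""
        tokens, speech_f, image_f, text_f = batch          # tokens: (B, T) gold sentence ids
        logits = model(tokens[:, :-1], speech_f, image_f, text_f)
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()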
In the embodiment of the application, a multi-modal signal in a target dialogue scene is acquired, the signal features of the multi-modal signal are determined, feature enhancement is performed on those features to obtain enhanced features, and the enhanced features are input into a first preset neural network, from which the target dialogue sentence is obtained. The multi-modal signal includes two or more of a voice signal, an image signal and a text signal, so the acquired information is more comprehensive than in the prior art, which acquires only voice information. In addition, the signal features of the multi-modal signal are enhanced, so the enhanced features carry richer information, which improves the neural network's understanding of and reasoning over the multi-modal information. The generated dialogue sentences therefore have higher accuracy and relevance, and the performance of a dialogue system based on this embodiment of the application improves.
In the embodiment of the present application, feature enhancement is performed differently depending on whether the multi-modal signal includes a voice signal, an image signal or a text signal. Fig. 3 is a flowchart illustrating another dialog generation method according to an embodiment of the present application. As shown in fig. 3, the method includes:
S301: A multi-modal signal in a target dialogue scene is acquired, the multi-modal signal including two or more of a voice signal, an image signal, and a text signal.
S302: signal features of the multi-modal signal are determined.
The steps S301 to S302 are the same as the steps S201 to S202, and are not described herein again.
S303: And if the multi-modal signal comprises a voice signal, performing voice feature enhancement on the signal features of the voice signal, wherein the voice feature enhancement comprises one or more of time-domain warping, frequency-domain masking and time-domain masking.
Here, time-domain warping randomly applies a nonlinear deformation to the signal features of the voice signal along the time axis, thereby enhancing the signal features of the voice signal.
Frequency-domain masking applies a mask over the frequency axis of the signal features of the voice signal, with the window size and position of the mask set randomly. For example, with a window length of 5 and 1-2 windows, the features within a selected frequency band are set to 0, erasing those signal features and thereby enhancing the signal features of the voice signal. Similarly, time-domain masking applies a mask over the time axis, again with random window size and position; for example, with a window length of 10 ms and 1-2 windows, the features within a selected time span are set to 0, erasing the signal features of that time span and thereby enhancing the signal features of the voice signal.
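The two masking operations can be sketched as follows; the mask counts and window sizes mirror the examples above, and time-domain warping is omitted because its exact nonlinear deformation is not specified here. This is a sketch under those assumptions, not the application's exact procedure.

    import numpy as np

    def mask_fbank(fbank, n_masks=2, max_f=5, max_t=10, rng=None):
        """Random frequency- and time-domain masks on a (frames, n_mels) matrix."""
        rng = rng or np.random.default_rng()
        out = fbank.copy()
        frames, mels = out.shape                               # assumes frames > max_t, mels > max_f
        for _ in range(n_masks):
            f0 = int(rng.integers(0, mels - max_f))            # random frequency band start
            out[:, f0:f0 + int(rng.integers(1, max_f + 1))] = 0.0   # erase up to 5 mel bins
            t0 = int(rng.integers(0, frames - max_t))          # random time span start
            out[t0:t0 + int(rng.integers(1, max_t + 1)), :] = 0.0   # erase up to 10 frames (about 10 ms each)
        return out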
S304: If the multi-modal signal comprises an image signal, performing image feature enhancement on the signal features of the image signal, wherein the image feature enhancement comprises one or more of picture cropping, Gaussian blur processing, contrast adjustment, Gaussian noise processing and affine transformation.
Here, taking the example that the image signal includes a video signal, the picture cropping is to crop each frame in the video with a certain probability, and the contrast adjustment is to adjust the contrast of each frame of the image, so as to enhance the signal characteristics of the image signal.
The gaussian blur processing is to add gaussian blur to each frame of image according to a certain probability (for example, 50%), and similarly, the gaussian noise processing is to add gaussian noise to each frame of image to achieve the purpose of enhancing the signal characteristics of the image signals.
An affine transformation applies changes including translation, rotation, scaling and shearing to each frame of the image, thereby enhancing the signal features of the image signal.
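One possible realization of these image augmentations uses torchvision, applied frame by frame; the probabilities and parameter ranges below are illustrative assumptions, and Gaussian noise is added by hand because torchvision has no built-in noise transform.

    import torch
    from torchvision import transforms

    frame_augment = transforms.Compose([
        transforms.RandomResizedCrop(224),                             # picture cropping
        transforms.RandomApply([transforms.GaussianBlur(5)], p=0.5),   # Gaussian blur with 50% probability
        transforms.ColorJitter(contrast=0.4),                          # contrast adjustment
        transforms.RandomAffine(degrees=10, translate=(0.1, 0.1),
                                scale=(0.9, 1.1), shear=5),            # affine transformation
        transforms.ToTensor(),
    ])

    def add_gaussian_noise(x, std=0.02):
        """Add clipped Gaussian noise to a [0, 1] image tensor."""
        return (x + std * torch.randn_like(x)).clamp(0.0, 1.0)

Each video frame (a PIL image) would be passed through frame_augment and then through add_gaussian_noise.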
S305: and if the multi-modal signal comprises a text signal, performing text feature enhancement on the signal feature of the text signal, wherein the text feature enhancement comprises one or more of synonym replacement and context-based word replacement.
Here, synonym replacement replaces words of the text signal with synonyms, and context-based word replacement replaces words of the text signal based on the surrounding context, thereby enhancing the signal features of the text signal.
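A toy sketch of synonym replacement follows; the synonym table is invented for illustration, and a real system would draw synonyms from a thesaurus such as WordNet and use a masked language model for the context-based variant.

    import random

    SYNONYMS = {"route": ["path", "way"], "location": ["place", "spot"]}  # hypothetical table

    def synonym_replace(sentence, p=0.3, seed=0):
        """Replace each word that has a known synonym with probability p."""
        rng = random.Random(seed)
        words = sentence.split()
        return " ".join(
            rng.choice(SYNONYMS[w]) if w in SYNONYMS and rng.random() < p else w
            for w in words
        )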
In addition, in addition to the above-mentioned manner of performing feature enhancement on the signal feature, the embodiment of the present application may also perform feature enhancement on the signal feature by using other techniques, which may be determined according to practical situations, and this is not limited in particular by the embodiment of the present application.
S306: after the feature enhancement is carried out, an enhanced feature is obtained, and the enhanced feature is input into a first preset neural network, wherein the first preset neural network is obtained through training of signal features and dialogue sentences of multi-modal signals in a dialogue scene.
S307: and acquiring a target dialogue statement output by the first preset neural network.
The implementation of steps S306 to S307 is similar to that of steps S204 to S205, and is not described herein again.
In the embodiment of the application, the signal features of each modality are enhanced in a modality-specific way, which meets the different requirements of various application scenarios, where the multi-modal signal includes two or more of a voice signal, an image signal and a text signal.
In addition, before the enhanced features are input into the first preset neural network, the enhanced features are also input into the second preset neural network, and high-level features are extracted. Fig. 4 is a flowchart illustrating another dialog generation method according to an embodiment of the present application. As shown in fig. 4, the method includes:
S401: A multi-modal signal in a target dialogue scene is acquired, the multi-modal signal including two or more of a voice signal, an image signal, and a text signal.
S402: signal features of the multi-modal signal are determined.
S403: and performing characteristic enhancement on the signal characteristics to obtain enhanced characteristics.
The steps S401 to S403 are the same as the steps S201 to S203, and are not described herein again.
S404: And inputting the enhanced features into a second preset neural network, wherein the second preset neural network is trained on signal features and high-level features, and the high-level features comprise one or more of VGGish features of speech, I3D RGB features and I3D Flow features of images, and word vectors of text.
In the embodiment of the application, after the signal features of the multi-modal signal are determined and enhanced, high-level feature extraction is performed by the second preset neural network. The high-level features include, but are not limited to, VGGish features of speech, I3D RGB and I3D Flow features of images, word vectors of text, and the like. As a result, the features fed into the subsequent first neural network carry richer information, which improves the first neural network's understanding of and reasoning over the multi-modal information and yields accurate dialogue sentences.
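To make the fused input concrete, the sketch below projects the per-modality high-level features into one shared space before they reach the first preset neural network. The dimensions (128-d VGGish frames, 1024-d I3D RGB and Flow clips, 300-d word vectors) are conventional sizes for those extractors, assumed here rather than specified by the application.

    import torch
    import torch.nn as nn

    class HighLevelProjection(nn.Module):
        """Project per-modality high-level features into a shared d_model space."""
        def __init__(self, d_model=512):
            super().__init__()
            self.speech = nn.Linear(128, d_model)    # VGGish frame features (assumed 128-d)
            self.video = nn.Linear(2048, d_model)    # I3D RGB + Flow, concatenated (assumed 1024-d each)
            self.text = nn.Linear(300, d_model)      # word vectors (assumed 300-d)

        def forward(self, vggish, i3d_rgb, i3d_flow, word_vecs):
            video = torch.cat([i3d_rgb, i3d_flow], dim=-1)
            return self.speech(vggish), self.video(video), self.text(word_vecs)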
S405: And acquiring the target high-level features output by the second preset neural network, and inputting the target high-level features into the first preset neural network, wherein the first preset neural network is trained on the signal features of multi-modal signals in a dialogue scene and the corresponding dialogue sentences.
S406: and acquiring a target dialogue statement output by the first preset neural network.
Illustratively, as shown in fig. 5, take the target high-level features to be the VGGish features of speech, the I3D RGB and I3D Flow features of an image, and the word vectors of text. These target high-level features are input into the first preset neural network, which may be a multi-layer attention model, and the first preset neural network outputs the target dialogue sentence.
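A minimal sketch of such a multi-layer attention model is given below: a transformer decoder whose cross-attention ranges over the concatenated modality features. The vocabulary size, depth, and head count are placeholders, and positional encodings are omitted for brevity; this is one plausible reading of the model in fig. 5, not the application's definitive architecture.

    import torch
    import torch.nn as nn

    class AttentionDialogGenerator(nn.Module):
        """Generate dialogue tokens by attending over multi-modal features."""
        def __init__(self, vocab_size=30000, d_model=512, n_layers=4, n_heads=8):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, n_layers)
            self.out = nn.Linear(d_model, vocab_size)

        def forward(self, prev_tokens, speech_f, image_f, text_f):
            memory = torch.cat([speech_f, image_f, text_f], dim=1)   # (B, T_mem, d_model)
            t = prev_tokens.size(1)
            causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
            h = self.decoder(self.embed(prev_tokens), memory, tgt_mask=causal)
            return self.out(h)                                       # next-token logits

At inference time, prev_tokens starts from a start-of-sentence token and is grown one token at a time from the logits, by greedy decoding or beam search.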
Compared with the prior art, in which only voice information is acquired, the information acquired in the embodiment of the application is more comprehensive. The signal features of the multi-modal signal are enhanced, and high-level feature extraction is performed by the second preset neural network, so the features fed into the subsequent first neural network carry richer information, which improves the first neural network's understanding of and reasoning over the multi-modal information and yields accurate dialogue sentences.
Fig. 6 is a schematic structural diagram of a dialog generating device according to an embodiment of the present application, corresponding to the dialog generation method of the foregoing embodiments. For convenience of explanation, only the portions related to the embodiments of the present application are shown. The dialog generating device 60 includes: a first obtaining module 601, a determining module 602, an enhancing module 603, a first input module 604, and a second obtaining module 605. The dialog generating device here may be the processing device itself described above, or a chip or an integrated circuit that implements the functions of the processing device. It should be noted that the division into these modules is only a division of logical functions; physically, they may be integrated or independent.
The first obtaining module 601 is configured to obtain a multi-modal signal in a target dialogue scene, where the multi-modal signal includes two or more of a voice signal, an image signal, and a text signal.
A determining module 602 for determining signal features of the multi-modal signal.
An enhancing module 603, configured to perform feature enhancement on the signal feature to obtain an enhanced feature.
The first input module 604 is configured to input the enhanced features into a first preset neural network, where the first preset neural network is obtained by training signal features of a multi-modal signal in a dialog scene and a dialog statement.
A second obtaining module 605, configured to obtain the target dialog statement output by the first preset neural network.
In one possible design, the enhancement module 603 is specifically configured to:
and if the multi-modal signal comprises a speech signal, performing speech feature enhancement on signal features of the speech signal, wherein the speech feature enhancement comprises one or more of time domain warping, frequency domain masking and time domain masking.
In one possible design, the enhancement module 603 is specifically configured to:
and if the multi-modal signal comprises an image signal, performing image feature enhancement on the signal features of the image signal, wherein the image feature enhancement comprises one or more of picture cropping, Gaussian blur processing, contrast adjustment, Gaussian noise processing and affine transformation.
In one possible design, the enhancement module 603 is specifically configured to:
performing text feature enhancement on signal features of the text signal if the multi-modal signal comprises a text signal, wherein the text feature enhancement comprises one or more of synonym replacement and context-based word replacement.
In one possible design, the determining module 602 is specifically configured to:
and if the multi-modal signal comprises a voice signal, performing voice preprocessing on the voice signal to obtain the signal features of the voice signal, wherein the voice preprocessing comprises one or more of VAD, STFT and F-BANK.
In one possible design, the determining module 602 is specifically configured to:
and if the multi-modal signal comprises an image signal, performing image preprocessing on the image signal to obtain the signal characteristics of the image signal, wherein the image preprocessing comprises one or more of image enhancement and normalization.
In a possible implementation manner, the determining module 602 is specifically configured to:
if the multi-modal signal comprises a voice signal, inputting the voice signal into a third preset neural network, wherein the third preset neural network is trained on voice signals and the signal features of those voice signals;
and acquiring the signal characteristics of the voice signal output by the third preset neural network.
In a possible implementation manner, the determining module 602 is specifically configured to:
if the multi-modal signal comprises an image signal, inputting the image signal into a fourth preset neural network, wherein the fourth preset neural network is trained on image signals and the signal features of those image signals;
and acquiring the signal characteristics of the image signal output by the fourth preset neural network.
The apparatus provided in the embodiment of the present application may be configured to implement the technical solution of the method embodiment, and the implementation principle and the technical effect are similar, which are not described herein again in the embodiment of the present application.
Fig. 7 is a schematic structural diagram of another dialog generating device according to an embodiment of the present application. As shown in fig. 7, in addition to fig. 6, the dialog generating device 60 further includes: a second input module 606 and a third acquisition module 607.
The second input module 606 is configured to input the enhanced features into a second preset neural network before the first input module 604 inputs the enhanced features into the first preset neural network, where the second preset neural network is obtained through signal feature and high-level feature training.
A third obtaining module 607, configured to obtain the target high-level feature output by the second preset neural network.
The first input module 604 is specifically configured to:
and inputting the target high-level features into the first preset neural network.
In one possible design, the high-level features include one or more of VGGish features for speech, I3D RGB features and I3D Flow features for images, and word vectors for text.
The apparatus provided in the embodiment of the present application may be configured to implement the technical solution of the method embodiment, and the implementation principle and the technical effect are similar, which are not described herein again in the embodiment of the present application.
Alternatively, fig. 8A and 8B each schematically provide one possible basic hardware architecture of the dialog generating device described in the present application.
Referring to fig. 8A and 8B, a dialog generating device 800 comprises at least one processor 801 and a communication interface 803. Further optionally, a memory 802 and a bus 804 may also be included.
The dialog generating device 800 may be the processing device described above, and the present application is not limited to this. In the dialog generating device 800, the number of processors 801 may be one or more; fig. 8A and 8B illustrate only one processor 801. Optionally, the processor 801 may be a central processing unit (CPU), a graphics processing unit (GPU), or a digital signal processor (DSP). If the dialog generating device 800 has multiple processors 801, they may be of different types or of the same type. Optionally, the multiple processors 801 of the dialog generating device 800 may also be integrated as a multi-core processor.
Memory 802 stores computer instructions and data; the memory 802 may store computer instructions and data necessary to implement the dialog generation methods provided herein, e.g., the memory 802 stores instructions for implementing the steps of the dialog generation methods described above. The memory 802 may be any one or any combination of the following storage media: nonvolatile memory (e.g., Read Only Memory (ROM), Solid State Disk (SSD), hard disk (HDD), optical disk), volatile memory.
The communication interface 803 may provide information input/output for the at least one processor. It may also include any one or any combination of the following devices: a network interface (e.g., an Ethernet interface), a wireless network card, or another device with a network access function.
Optionally, the communication interface 803 may also be used for the dialog generating device 800 to communicate data with other computing devices or terminals.
Further alternatively, fig. 8A and 8B show the bus 804 by a thick line. A bus 804 may connect the processor 801 with the memory 802 and the communication interface 803. Thus, via bus 804, processor 801 may access memory 802 and may also interact with other computing devices or terminals using communication interface 803.
In the present application, the dialog generating device 800 executes computer instructions in the memory 802, so that the dialog generating device 800 implements the dialog generating method provided by the present application, or so that the dialog generating device 800 deploys the dialog generating means described above.
From the perspective of logical functional division, illustratively, as shown in fig. 8A, the memory 802 may include a first obtaining module 601, a determining module 602, an enhancing module 603, a first input module 604, and a second obtaining module 605. "Include" here merely means that the instructions stored in the memory can, when executed, implement the functions of the first obtaining module, the determining module, the enhancing module, the first input module, and the second obtaining module; it does not limit the physical structure.
The first obtaining module 601 is configured to obtain a multi-modal signal in a target dialogue scene, where the multi-modal signal includes two or more of a voice signal, an image signal, and a text signal.
A determining module 602 for determining signal features of the multi-modal signal.
An enhancing module 603, configured to perform feature enhancement on the signal feature to obtain an enhanced feature.
The first input module 604 is configured to input the enhanced features into a first preset neural network, where the first preset neural network is obtained by training signal features of a multi-modal signal in a dialog scene and a dialog statement.
A second obtaining module 605, configured to obtain the target dialog statement output by the first preset neural network.
In one possible design, the enhancement module 603 is specifically configured to:
and if the multi-modal signal comprises a speech signal, performing speech feature enhancement on signal features of the speech signal, wherein the speech feature enhancement comprises one or more of time domain warping, frequency domain masking and time domain masking.
In one possible design, the enhancement module 603 is specifically configured to:
and if the multi-modal signal comprises an image signal, performing image feature enhancement on the signal features of the image signal, wherein the image feature enhancement comprises one or more of picture cropping, Gaussian blur processing, contrast adjustment, Gaussian noise processing and affine transformation.
In one possible design, the enhancement module 603 is specifically configured to:
performing text feature enhancement on signal features of the text signal if the multi-modal signal comprises a text signal, wherein the text feature enhancement comprises one or more of synonym replacement and context-based word replacement.
In one possible design, the determining module 602 is specifically configured to:
and if the multi-modal signal comprises a voice signal, performing voice preprocessing on the voice signal to obtain the signal features of the voice signal, wherein the voice preprocessing comprises one or more of VAD, STFT and F-BANK.
In one possible design, the determining module 602 is specifically configured to:
and if the multi-modal signal comprises an image signal, performing image preprocessing on the image signal to obtain the signal characteristics of the image signal, wherein the image preprocessing comprises one or more of image enhancement and normalization.
In a possible implementation manner, the determining module 602 is specifically configured to:
if the multi-modal signal comprises a voice signal, inputting the voice signal into a third preset neural network, wherein the third preset neural network is trained on voice signals and the signal features of those voice signals;
and acquiring the signal characteristics of the voice signal output by the third preset neural network.
In a possible implementation manner, the determining module 602 is specifically configured to:
if the multi-modal signal comprises an image signal, inputting the image signal into a fourth preset neural network, wherein the fourth preset neural network is trained on image signals and the signal features of those image signals;
and acquiring the signal characteristics of the image signal output by the fourth preset neural network.
Illustratively, as shown in fig. 8B, the memory 802 may further include a second input module 606 and a third obtaining module 607. "Include" here merely means that the instructions stored in the memory can, when executed, implement the functions of the second input module and the third obtaining module; it does not limit the physical structure.
The second input module 606 is configured to input the enhanced features into a second preset neural network before the first input module 604 inputs the enhanced features into the first preset neural network, where the second preset neural network is obtained through signal feature and high-level feature training.
A third obtaining module 607, configured to obtain the target high-level feature output by the second preset neural network.
The first input module 604 is specifically configured to:
input the target high-level features into the first preset neural network.
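Put together, the two-stage flow is a simple composition; the sketch below assumes second_net and first_net are trained modules along the lines of the earlier sketches (with second_net mapping enhanced features to the high-level features expected by first_net) — no real checkpoints are implied.

```python
# enhanced_features: tensor(s) produced by the enhancement module above.
high_level = second_net(enhanced_features)    # second preset neural network
logits = first_net(high_level)                # first preset neural network
reply_token_ids = logits.argmax(dim=-1)       # greedy decode to token ids
```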
In one possible design, the high-level features include one or more of VGGish features for speech, I3D RGB features and I3D Flow features for images, and word vectors for text.
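For concreteness, the sketch below concatenates one embedding per listed feature type into a single vector, using the commonly published sizes (128-d VGGish, 1024-d per I3D stream, 300-d word vectors); the patent does not fix these dimensions, so they are assumptions.

```python
import torch

vggish = torch.randn(1, 128)     # speech: VGGish embedding
i3d_rgb = torch.randn(1, 1024)   # image: I3D RGB-stream feature
i3d_flow = torch.randn(1, 1024)  # image: I3D Flow-stream feature
word_vec = torch.randn(1, 300)   # text: averaged word vectors

fused = torch.cat([vggish, i3d_rgb, i3d_flow, word_vec], dim=-1)  # (1, 2476)
```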
In addition, the dialogue generating device may be implemented in software, as in fig. 8A and 8B, or in hardware, as a hardware module or a circuit unit.
The present application provides a computer program product, the computer program product comprising computer instructions stored in a computer-readable storage medium, where the computer instructions instruct a computing device to perform the dialogue generation method provided herein.
The present application provides a chip comprising at least one processor and a communication interface providing information input and/or output for the at least one processor. Further, the chip may also include at least one memory for storing computer instructions. The at least one processor is configured to call and execute the computer instructions to perform the dialog generation method provided in the present application.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a logical division, and other divisions are possible in practice; a plurality of units or components may be combined or integrated into another system, and some features may be omitted or not executed. In addition, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of an embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.

Claims (20)

1. A dialog generation method, comprising:
acquiring multi-modal signals in a target dialogue scene, wherein the multi-modal signals comprise two or more of a voice signal, an image signal, and a text signal;
determining signal features of the multi-modal signal;
performing feature enhancement on the signal features to obtain enhanced features;
inputting the enhanced features into a first preset neural network, wherein the first preset neural network is trained on signal features of multi-modal signals in a dialogue scene and corresponding dialogue sentences; and
acquiring the target dialogue sentence output by the first preset neural network.
2. The method of claim 1, wherein the performing feature enhancement on the signal features comprises:
if the multi-modal signal comprises a speech signal, performing speech feature enhancement on the signal features of the speech signal, wherein the speech feature enhancement comprises one or more of time-domain warping, frequency-domain masking, and time-domain masking.
3. The method of claim 1, wherein the performing feature enhancement on the signal features comprises:
if the multi-modal signal comprises an image signal, performing image feature enhancement on the signal features of the image signal, wherein the image feature enhancement comprises one or more of picture cropping, Gaussian blur processing, contrast adjustment, Gaussian noise processing, and affine transformation.
4. The method of claim 1, wherein the performing feature enhancement on the signal features comprises:
if the multi-modal signal comprises a text signal, performing text feature enhancement on the signal features of the text signal, wherein the text feature enhancement comprises one or more of synonym replacement and context-based word replacement.
5. The method of any one of claims 1 to 4, further comprising, prior to said inputting the enhanced features into a first preset neural network:
inputting the enhanced features into a second preset neural network, wherein the second preset neural network is trained on signal features and corresponding high-level features;
acquiring a target high-level feature output by the second preset neural network;
the inputting the enhanced features into a first preset neural network comprises:
and inputting the target high-level features into the first preset neural network.
6. The method of claim 5, wherein the high-level features comprise one or more of VGGish features for speech, I3D red-green-blue (RGB) features and I3D Flow features for images, and word vectors for text.
7. The method of any of claims 1 to 4, wherein said determining signal features of the multi-modal signal comprises:
if the multi-modal signal comprises a voice signal, performing voice preprocessing on the voice signal to obtain the signal features of the voice signal, wherein the voice preprocessing comprises one or more of voice activity detection (VAD), short-time Fourier transform (STFT), and filter-bank (F-BANK) feature extraction.
8. The method of any of claims 1 to 4, wherein said determining signal features of the multi-modal signal comprises:
if the multi-modal signal comprises an image signal, performing image preprocessing on the image signal to obtain the signal features of the image signal, wherein the image preprocessing comprises one or more of image enhancement and normalization.
9. The method of any of claims 1 to 4, wherein said determining signal features of the multi-modal signal comprises:
if the multi-modal signal comprises a voice signal, inputting the voice signal into a third preset neural network, wherein the third preset neural network is trained on voice signals and the signal features of those voice signals;
and acquiring the signal characteristics of the voice signal output by the third preset neural network.
10. The method of any of claims 1 to 4, wherein said determining signal features of the multi-modal signal comprises:
if the multi-modal signal comprises an image signal, inputting the image signal into a fourth preset neural network, wherein the fourth preset neural network is trained on image signals and the signal features of those image signals;
and acquiring the signal characteristics of the image signal output by the fourth preset neural network.
11. A dialog generation device, comprising:
the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring multi-modal signals in a target conversation scene, and the multi-modal signals comprise a plurality of voice signals, image signals and text signals;
a determination module to determine signal characteristics of the multi-modal signal;
the enhancement module is used for carrying out feature enhancement on the signal features to obtain enhanced features;
the first input module is used for inputting the enhanced features into a first preset neural network, wherein the first preset neural network is obtained by training signal features and dialogue sentences of multi-modal signals in a dialogue scene;
and the second acquisition module is used for acquiring the target dialogue statement output by the first preset neural network.
12. The apparatus according to claim 11, wherein the enhancement module is specifically configured to:
if the multi-modal signal comprises a speech signal, perform speech feature enhancement on the signal features of the speech signal, wherein the speech feature enhancement comprises one or more of time-domain warping, frequency-domain masking, and time-domain masking.
13. The apparatus according to claim 11, wherein the enhancement module is specifically configured to:
if the multi-modal signal comprises an image signal, perform image feature enhancement on the signal features of the image signal, wherein the image feature enhancement comprises one or more of picture cropping, Gaussian blur processing, contrast adjustment, Gaussian noise processing, and affine transformation.
14. The apparatus according to claim 11, wherein the enhancement module is specifically configured to:
if the multi-modal signal comprises a text signal, perform text feature enhancement on the signal features of the text signal, wherein the text feature enhancement comprises one or more of synonym replacement and context-based word replacement.
15. The apparatus of any one of claims 11 to 14, further comprising:
the second input module is used for inputting the enhanced features into a second preset neural network before the first input module inputs the enhanced features into the first preset neural network, wherein the second preset neural network is obtained through signal feature and high-level feature training;
the third acquisition module is used for acquiring the target high-level features output by the second preset neural network;
the first input module is specifically configured to:
and inputting the target high-level features into the first preset neural network.
16. The apparatus of claim 15, wherein the high-level features comprise one or more of VGGish features for speech, I3D RGB features and I3D Flow features for images, and word vectors for text.
17. The apparatus according to any one of claims 11 to 14, wherein the determining module is specifically configured to:
if the multi-modal signal comprises a voice signal, perform voice preprocessing on the voice signal to obtain the signal features of the voice signal, wherein the voice preprocessing comprises one or more of voice activity detection (VAD), short-time Fourier transform (STFT), and filter-bank (F-BANK) feature extraction.
18. The apparatus according to any one of claims 11 to 14, wherein the determining module is specifically configured to:
if the multi-modal signal comprises an image signal, perform image preprocessing on the image signal to obtain the signal features of the image signal, wherein the image preprocessing comprises one or more of image enhancement and normalization.
19. A dialog generating device, comprising:
a processor;
a memory; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1-10.
20. A computer-readable storage medium, characterized in that it stores a computer program that causes a server to execute the method of any one of claims 1-10.
CN202010742806.3A 2020-07-29 2020-07-29 Dialogue generating method, device and storage medium Pending CN111899738A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010742806.3A CN111899738A (en) 2020-07-29 2020-07-29 Dialogue generating method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010742806.3A CN111899738A (en) 2020-07-29 2020-07-29 Dialogue generating method, device and storage medium

Publications (1)

Publication Number Publication Date
CN111899738A (en) 2020-11-06

Family

ID=73182515

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010742806.3A Pending CN111899738A (en) 2020-07-29 2020-07-29 Dialogue generating method, device and storage medium

Country Status (1)

Country Link
CN (1) CN111899738A (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130132308A1 (en) * 2011-11-22 2013-05-23 Gregory Jensen Boss Enhanced DeepQA in a Medical Environment
US20170155631A1 (en) * 2015-12-01 2017-06-01 Integem, Inc. Methods and systems for personalized, interactive and intelligent searches
CN105574133A (en) * 2015-12-15 2016-05-11 苏州贝多环保技术有限公司 Multi-mode intelligent question answering system and method
CN109949821A (en) * 2019-03-15 2019-06-28 慧言科技(天津)有限公司 A method of far field speech dereverberation is carried out using the U-NET structure of CNN
CN110196930A (en) * 2019-05-22 2019-09-03 山东大学 A kind of multi-modal customer service automatic reply method and system
CN110263217A (en) * 2019-06-28 2019-09-20 北京奇艺世纪科技有限公司 A kind of video clip label identification method and device
CN111061847A (en) * 2019-11-22 2020-04-24 中国南方电网有限责任公司 Dialogue generation and corpus expansion method and device, computer equipment and storage medium
CN111344717A (en) * 2019-12-31 2020-06-26 深圳市优必选科技股份有限公司 Interactive behavior prediction method, intelligent device and computer-readable storage medium
CN111312292A (en) * 2020-02-18 2020-06-19 北京三快在线科技有限公司 Emotion recognition method and device based on voice, electronic equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116383365A (en) * 2023-06-01 2023-07-04 广州里工实业有限公司 Learning material generation method and system based on intelligent manufacturing and electronic equipment
CN116383365B (en) * 2023-06-01 2023-09-08 广州里工实业有限公司 Learning material generation method and system based on intelligent manufacturing and electronic equipment

Similar Documents

Publication Publication Date Title
CN106887225B (en) Acoustic feature extraction method and device based on convolutional neural network and terminal equipment
US20240021202A1 (en) Method and apparatus for recognizing voice, electronic device and medium
CN108492818B (en) Text-to-speech conversion method and device and computer equipment
CN110174942B (en) Eye movement synthesis method and device
WO2022062800A1 (en) Speech separation method, electronic device, chip and computer-readable storage medium
CN112365878A (en) Speech synthesis method, device, equipment and computer readable storage medium
CN103514882A (en) Voice identification method and system
US20230114386A1 (en) Textual Echo Cancellation
CN116797695A (en) Interaction method, system and storage medium of digital person and virtual whiteboard
CN110970030A (en) Voice recognition conversion method and system
CN111899738A (en) Dialogue generating method, device and storage medium
CN110781329A (en) Image searching method and device, terminal equipment and storage medium
CN113012680B (en) Speech technology synthesis method and device for speech robot
JP2022153600A (en) Voice synthesis method and device, electronic apparatus and storage medium
CN111813989B (en) Information processing method, apparatus and storage medium
CN114121010A (en) Model training, voice generation, voice interaction method, device and storage medium
CN114566156A (en) Keyword speech recognition method and device
CN112951274A (en) Voice similarity determination method and device, and program product
CN114595314A (en) Emotion-fused conversation response method, emotion-fused conversation response device, terminal and storage device
CN113051426A (en) Audio information classification method and device, electronic equipment and storage medium
CN112331209A (en) Method and device for converting voice into text, electronic equipment and readable storage medium
US11769323B2 (en) Generating assistive indications based on detected characters
US20230081543A1 (en) Method for synthetizing speech and electronic device
US11889168B1 (en) Systems and methods for generating a video summary of a virtual event
CN114360535B (en) Voice conversation generation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20201106)