WO2023082752A1 - Multimodal feature-based speech dialogue processing method, apparatus and electronic device - Google Patents

Multimodal feature-based speech dialogue processing method, apparatus and electronic device

Info

Publication number
WO2023082752A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
voice
speech
feature information
input
Prior art date
Application number
PCT/CN2022/113640
Other languages
English (en)
French (fr)
Inventor
王培英
杨久东
陈蒙
Original Assignee
京东科技信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 京东科技信息技术有限公司
Publication of WO2023082752A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue

Definitions

  • The present application relates to the field of computer technology, and in particular to a multimodal feature-based speech dialogue processing method, apparatus, and electronic device.
  • In a voice dialogue system, while the user is speaking the system needs to judge the right time to take over the right to speak. That is, the system switches back and forth between the roles of listener and speaker so that the human-computer interaction is smooth and natural.
  • VAD: Voice Activity Detection.
  • The present application proposes a multimodal feature-based speech dialogue processing method, apparatus, and electronic device.
  • An embodiment of one aspect of the present application proposes a multimodal feature-based speech dialogue processing method, including: acquiring first voice information currently input by the user during dialogue interaction with the user, wherein the first voice information includes a silent segment; determining semantic feature information of the text information according to the text information of the first voice information and the historical context information of the first voice information; determining voice feature information of the first voice information according to the voice segment before the silent segment in the first voice information; acquiring time feature information of the first voice information; and determining whether the user has finished the voice input according to the semantic feature information, the voice feature information, and the time feature information.
  • Determining the semantic feature information of the text information according to the text information of the first voice information and the historical context information of the first voice information includes: performing speech recognition on the first voice information to obtain the text information of the first voice information; acquiring the historical context information of the first voice information; and inputting the text information and the historical context information into a semantic representation model to obtain the semantic feature information of the text information.
  • Determining the voice feature information of the first voice information according to the voice segment before the silent segment in the first voice information includes: acquiring a voice segment of a first preset time length before the silent segment in the first voice information; segmenting the voice segment according to a second preset time length to obtain multiple voice segments; extracting the acoustic feature information corresponding to each of the multiple voice segments, and splicing the acoustic feature information corresponding to each voice segment to obtain the splicing feature corresponding to each of the multiple voice segments; and inputting the splicing features into a deep residual network to obtain the voice feature information of the first voice information.
  • Acquiring the time feature information of the first voice information includes: acquiring the voice duration, speech rate, and text length of the first voice information; and inputting the voice duration, the speech rate, and the text length into a pre-trained multi-layer perceptron (MLP) model to obtain the time feature information of the first voice information.
  • Determining whether the user has finished the voice input according to the semantic feature information, the voice feature information, and the time feature information includes: inputting the semantic feature information, the voice feature information, and the time feature information into a multimodal fusion model; and determining, according to the output result of the multimodal fusion model, whether the user has finished the voice input.
  • The method further includes: when it is determined that the user has finished the voice input, determining first reply voice information corresponding to the first voice information, and outputting the first reply voice information.
  • The method further includes: when it is determined that the user has not finished the voice input, acquiring second voice information input again by the user; determining corresponding second reply voice information according to the first voice information and the second voice information; and outputting the second reply voice information.
  • An embodiment of another aspect of the present application proposes a multimodal feature-based speech dialogue processing apparatus, including: a first acquiring module, configured to acquire first voice information currently input by the user during dialogue interaction with the user, wherein the first voice information includes a silent segment; a first determining module, configured to determine semantic feature information of the text information according to the text information of the first voice information and the historical context information of the first voice information; a second determining module, configured to determine voice feature information of the first voice information according to the voice segment before the silent segment in the first voice information; a second acquiring module, configured to acquire time feature information of the first voice information; and a third determining module, configured to determine whether the user has finished the voice input according to the semantic feature information, the voice feature information, and the time feature information.
  • The first determining module is specifically configured to: perform speech recognition on the first voice information to obtain the text information of the first voice information; acquire the historical context information of the first voice information; and input the text information and the historical context information into a semantic representation model to obtain the semantic feature information of the text information.
  • The second determining module is specifically configured to: acquire a voice segment of a first preset time length before the silent segment in the first voice information; segment the voice segment according to a second preset time length to obtain multiple voice segments; extract the acoustic feature information corresponding to each of the multiple voice segments, and splice the acoustic feature information corresponding to each voice segment to obtain the splicing feature corresponding to each of the multiple voice segments; and input the splicing features into a deep residual network to obtain the voice feature information of the first voice information.
  • The second acquiring module is specifically configured to: acquire the voice duration, speech rate, and text length of the first voice information; and input the voice duration, the speech rate, and the text length into a pre-trained multi-layer perceptron (MLP) model to obtain the time feature information of the first voice information.
  • The third determining module includes: a multimodal processing unit, configured to input the semantic feature information, the voice feature information, and the time feature information into the multimodal fusion model; and a determining unit, configured to determine whether the user has finished the voice input according to the output result of the multimodal fusion model.
  • The apparatus further includes: a first processing module, configured to determine the first reply voice information corresponding to the first voice information when it is determined that the user has finished the voice input, and to output the first reply voice information.
  • The apparatus further includes: a third acquiring module, configured to acquire second voice information input again by the user when it is determined that the user has not finished the voice input; and a second processing module, configured to determine corresponding second reply voice information according to the first voice information and the second voice information, and to output the second reply voice information.
  • Another embodiment of the present application proposes an electronic device, including a memory and a processor. Computer instructions are stored in the memory, and when the computer instructions are executed by the processor, the multimodal feature-based speech dialogue processing method of the embodiment of the present application is implemented.
  • Another embodiment of the present application provides a non-transitory computer-readable storage medium storing computer instructions, where the computer instructions are used to cause a computer to execute the multimodal feature-based speech dialogue processing method disclosed in the embodiment of the present application.
  • Another embodiment of the present application provides a computer program product.
  • When instructions in the computer program product are executed by a processor, the multimodal feature-based speech dialogue processing method in the embodiment of the present application is implemented.
  • Fig. 1 is a schematic flowchart of a method for processing a speech dialogue based on multimodal features according to an embodiment of the present application.
  • Fig. 2 is an example diagram for describing a speech dialogue processing method in combination with a model framework diagram according to a specific embodiment of the present application.
  • Fig. 3 is a schematic structural diagram of an apparatus for processing speech dialogue based on multimodal features according to an embodiment of the present application.
  • Fig. 4 is a schematic structural diagram of an apparatus for processing speech dialogue based on multimodal features according to another embodiment of the present application.
  • FIG. 5 is a block diagram of an electronic device according to one embodiment of the present application.
  • Fig. 1 is a schematic flowchart of a method for processing a speech dialogue based on multimodal features according to an embodiment of the present application.
  • The method provided in the embodiment of the present application is executed by a multimodal feature-based speech dialogue processing apparatus, which may be implemented in software and/or hardware.
  • The multimodal feature-based speech dialogue processing apparatus in the embodiment of the present application may be a voice dialogue system, and the voice dialogue system may be configured in an electronic device. Electronic devices may include terminal devices or servers.
  • The multimodal feature-based speech dialogue processing method may include steps 101 to 105.
  • Step 101: During dialogue interaction with a user, acquire first voice information currently input by the user, wherein the first voice information includes a silent segment.
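The text does not say how the silent segment in Step 101 is located. As a minimal sketch, assuming a simple frame-energy threshold (the function name `trailing_silence_ms`, the 20 ms frame size, and the threshold value are all hypothetical and not taken from the application), the trailing silence of an utterance could be measured like this:

```python
import numpy as np

def trailing_silence_ms(samples, sample_rate, frame_ms=20, energy_threshold=1e-4):
    """Return the length in ms of the low-energy run at the end of `samples`.

    Assumption: a frame counts as silent when its mean squared amplitude
    falls below `energy_threshold`. A production VAD would be more robust.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    silent_ms = 0
    # Walk backwards from the last full frame until a non-silent frame is found.
    for i in range(n_frames - 1, -1, -1):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        if np.mean(frame ** 2) >= energy_threshold:
            break
        silent_ms += frame_ms
    return silent_ms

# Synthetic stand-in: 1 s of tone ("speech") followed by 0.5 s of silence.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
speech = 0.5 * np.sin(2 * np.pi * 220 * t)
audio = np.concatenate([speech, np.zeros(sample_rate // 2)])
print(trailing_silence_ms(audio, sample_rate))
```

The sample index where the silence starts is what the later acoustic-feature step needs in order to cut out the audio that precedes the silent segment.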
  • Step 102: Determine semantic feature information of the text information according to the text information of the first voice information and the historical context information of the first voice information.
  • Speech recognition can be performed on the first voice information to obtain its text information, the historical context information of the first voice information can be acquired, and the text information and the historical context information can be input into the semantic representation model to obtain the semantic feature information of the text information.
  • The above-mentioned semantic representation model may be a Transformer model based on the self-attention mechanism.
  • A Transformer model may include multiple encoding layers.
  • Each encoding layer includes a Transformer-based encoding structure, which encodes its input and passes the output to the next encoding layer for processing.
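To make the layer-by-layer encoding concrete, here is a deliberately simplified sketch of stacked self-attention encoding layers. It is not the model of the application: it uses a single attention head, omits layer normalization and positional encoding, uses random weights in place of trained ones, and mean-pools the output as an assumed way of getting one semantic feature vector. All names and dimensions are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def encoding_layer(x, wq, wk, wv, w1, w2):
    """One simplified encoding layer: self-attention, then a feed-forward step."""
    q, k, v = x @ wq, x @ wk, x @ wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (tokens, tokens) attention weights
    x = x + attn @ v                                 # residual connection
    return x + np.maximum(0.0, x @ w1) @ w2          # position-wise feed-forward + residual

def encode(x, layers):
    """Feed the output of each encoding layer into the next one."""
    for params in layers:
        x = encoding_layer(x, *params)
    return x

rng = np.random.default_rng(0)
d_model, d_ff, n_tokens = 8, 16, 5

def random_layer():
    shapes = [(d_model, d_model)] * 3 + [(d_model, d_ff), (d_ff, d_model)]
    return [rng.normal(scale=0.1, size=s) for s in shapes]

layers = [random_layer() for _ in range(2)]
tokens = rng.normal(size=(n_tokens, d_model))   # stand-in for embedded context + text
encoded = encode(tokens, layers)
semantic_features = encoded.mean(axis=0)        # assumed pooling, not from the source
print(encoded.shape, semantic_features.shape)
```

Because every token attends to every other token, a token from an early dialogue turn can directly influence the representation of the current utterance.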
  • An exemplary way of acquiring the historical context information of the first voice information is to acquire multiple pieces of historical voice dialogue information preceding the first voice information and derive the historical context information of the first voice information from them.
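One plausible reading of this step is that the transcripts of the most recent turns are joined with the current text before encoding. The sketch below assumes exactly that; the function name `build_model_input`, the `[SEP]` separator, and the turn limit are hypothetical choices, not details from the application.

```python
def build_model_input(history, current_text, max_turns=3, sep=" [SEP] "):
    """Join the last `max_turns` historical turns with the current text."""
    recent = list(history)[-max_turns:] if max_turns > 0 else []
    return sep.join(recent + [current_text])

history = [
    "system: Hello, is this a good time to confirm your order?",
    "user: Yes.",
    "system: Which address should we deliver to?",
]
print(build_model_input(history, "user: The one in, um,", max_turns=2))
```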
  • Step 103: Determine voice feature information of the first voice information according to the voice segment before the silent segment in the first voice information.
  • A voice segment of a first preset time length before the silent segment in the first voice information can be acquired; the voice segment is segmented according to a second preset time length to obtain multiple voice segments; the acoustic feature information corresponding to each of the multiple voice segments is extracted and spliced to obtain the splicing feature corresponding to each voice segment; and the splicing features are input into a deep residual network to obtain the voice feature information of the first voice information.
  • The first preset time length is set in advance. For example, it can be 2 seconds; that is, a 2-second audio clip immediately preceding the silent segment can be extracted from the first voice information.
  • The second preset time length is also set in advance, and the first preset time length is longer than the second. For example, the first preset time length is 2 seconds and the second preset time length can be 50 milliseconds (ms).
  • After a voice segment 2 seconds long is acquired, it may be split every 50 ms to obtain multiple voice segments, each 50 ms long.
  • The acoustic feature information may include, but is not limited to, energy, volume, pitch, zero-crossing rate, and the like.
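The framing and splicing described above can be sketched as follows, using the 2-second and 50 ms values given as examples in the text. The feature definitions are assumptions: energy is taken as mean squared amplitude, volume as peak absolute amplitude, and pitch is left out for brevity. The function names are hypothetical, and the deep residual network that consumes the result is not shown.

```python
import numpy as np

def tail_before_silence(samples, sample_rate, silence_start, tail_s=2.0):
    """Cut out up to `tail_s` seconds of audio ending where the silence begins."""
    start = max(0, silence_start - int(tail_s * sample_rate))
    return samples[start:silence_start]

def frame_features(segment, sample_rate, frame_ms=50):
    """Split `segment` into frames and splice per-frame acoustic features."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(segment) // frame_len
    frames = segment[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.mean(frames ** 2, axis=1)            # assumed definition of energy
    volume = np.max(np.abs(frames), axis=1)          # assumed proxy for volume
    signs = np.sign(frames)
    zcr = np.mean(signs[:, 1:] != signs[:, :-1], axis=1)  # zero-crossing rate
    # Splice (concatenate) the features of each frame into one row per frame.
    return np.stack([energy, volume, zcr], axis=1)

sample_rate = 16000
t = np.arange(3 * sample_rate) / sample_rate
speech = 0.5 * np.sin(2 * np.pi * 220 * t)           # 3 s synthetic stand-in
audio = np.concatenate([speech, np.zeros(sample_rate // 2)])
silence_start = len(speech)

segment = tail_before_silence(audio, sample_rate, silence_start)
features = frame_features(segment, sample_rate)
print(features.shape)   # one row per 50 ms frame, one column per feature
```

With these example values the 2-second clip yields 40 frames, so the network input is a 40-row matrix of spliced features.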
  • Step 104: Acquire time feature information of the first voice information.
  • The voice duration, speech rate, and text length of the first voice information may be acquired and input into a pre-trained multi-layer perceptron (MLP) model to obtain the time feature information of the first voice information.
  • MLP: Multi-Layer Perceptron.
  • The text length may be determined based on the text information corresponding to the first voice information.
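The three raw quantities can be derived from the audio length and the recognized text. The sketch below assumes that text length is counted in non-space characters and that speech rate is characters per second; the application does not define either, so both are illustrative choices, as is the function name.

```python
def raw_time_features(num_samples, sample_rate, text):
    """Return (duration in seconds, speech rate, text length) for one utterance."""
    duration_s = num_samples / sample_rate
    text_length = len("".join(text.split()))      # assumed: count non-space characters
    speech_rate = text_length / duration_s if duration_s > 0 else 0.0
    return duration_s, speech_rate, text_length

print(raw_time_features(32000, 16000, "yes please"))
```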
  • Step 105: Determine whether the user has finished the voice input according to the semantic feature information, the voice feature information, and the time feature information.
  • The semantic feature information, voice feature information, and time feature information can be input into the multimodal fusion model, and whether the user has finished the voice input is determined according to the output result of the multimodal fusion model.
  • When the multimodal fusion model receives the semantic feature information, voice feature information, and time feature information, it can obtain the weight corresponding to each of them, weight the three kinds of feature information accordingly, and input the weighted result into the activation function of the multimodal fusion model to obtain the output result of the model.
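A minimal sketch of the weighting-then-activation idea follows. The weight values, feature dimensions, the 0.5 decision threshold, and the choice of a sigmoid activation here are illustrative assumptions; in practice the weights would be learned.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fuse(features, weights, bias=0.0):
    """Weight each modality's feature vector, sum the results, apply the activation."""
    z = bias + sum(np.dot(weights[name], features[name]) for name in features)
    return float(sigmoid(z))

# Toy feature vectors and weights (hypothetical values, one entry per modality).
features = {
    "semantic": np.array([1.0, 0.0]),
    "speech": np.array([0.5]),
    "time": np.array([1.0]),
}
weights = {
    "semantic": np.array([2.0, 0.0]),
    "speech": np.array([1.0]),
    "time": np.array([-0.5]),
}
probability = fuse(features, weights)
user_finished = probability >= 0.5   # assumed decision threshold
print(probability, user_finished)
```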
  • When the output result of the multimodal fusion model indicates that the user has finished the voice input, it can be determined that the user has finished, and the dialogue system can then take over the right to speak. In some other embodiments, when the output result indicates that the user has not finished the voice input, it can be determined that the user has not finished. In that case the dialogue system can continue to listen and reply after determining that the user has finished the input.
  • In the speech dialogue processing method of the embodiment of the present application, during dialogue interaction with the user, the semantic feature information of the text information is determined by combining the text information of the first voice information currently input by the user with the historical context information of the first voice information; the voice feature information of the first voice information is determined according to the voice segment before the silent segment in the first voice information; the time feature information of the first voice information is acquired; and whether the user has finished the voice input is determined according to the semantic feature information, the voice feature information, and the time feature information.
  • Thus, by combining semantic feature information, voice feature information, and time feature information, it is accurately determined whether the system can take over the right to speak.
  • To enable the dialogue system to reply accurately to the voice information input by the user, in some embodiments, when it is determined that the user has finished the voice input, the first reply voice information corresponding to the first voice information is determined and output.
  • When it is determined that the user has not finished the voice input, the second voice information input again by the user is acquired; the corresponding second reply voice information is determined according to the first voice information and the second voice information, and the second reply voice information is output.
  • In this way, an accurate reply is made by combining the first voice information currently input by the user with the second voice information input afterwards.
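The two branches (reply now, or keep listening and reply to the combined input) can be sketched as a small control-flow function. Text strings stand in for voice information, and the callables `wait_for_more` and `make_reply` are hypothetical placeholders for the listening and reply-generation components, which the application does not specify.

```python
def handle_turn(first_text, user_finished, wait_for_more, make_reply):
    """Reply immediately if the user is done; otherwise listen again first."""
    if user_finished:
        # First reply: based on the first voice information alone.
        return make_reply(first_text)
    # Second reply: based on the first and the second voice information together.
    second_text = wait_for_more()
    return make_reply(f"{first_text} {second_text}")

reply = handle_turn(
    "I would like to, um,",
    user_finished=False,
    wait_for_more=lambda: "change my address",
    make_reply=lambda text: f"Understood: {text}",
)
print(reply)
```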
  • In the embodiment of the present application, features in three different dimensions (semantic feature information, voice feature information, and time feature information) are used to determine whether the user has finished the voice input, that is, whether the dialogue system can take over the right to speak and output a corresponding reply.
  • Semantic feature information comes from the text obtained by speech recognition, and its importance to the decision about the right to speak is self-evident, especially since "semantic completeness" is a basic element of switching the right to speak: once it is determined that the user has fully expressed their intention, the system can usually take over the right to speak. Semantic completeness is generally judged in conjunction with the context, as in the following simple example:
  • The user gave a definite answer with clear semantics.
  • The dialogue system can take over the right to speak.
  • The user hesitated briefly, but from the content entered so far it can be determined that the user has not finished speaking.
  • The dialogue system can choose to continue listening and wait for the user to finish speaking.
  • The process of dialogue interaction between the user and the dialogue system is summarized as follows.
  • After the voice information currently input by the user is obtained, speech recognition can be performed on it to obtain the current text information, and the historical context information of the currently input voice information and the current text information can be encoded to obtain the semantic feature information of the text information.
  • A Transformer model based on a self-attention mechanism may be used to encode the historical context information of the currently input voice information together with the current text information.
  • The self-attention mechanism in the Transformer model can capture the long-distance dependencies between the historical context information and the text information.
  • the final semantic features are expressed as:
  • Time features (such as the duration of a speech segment, speech rate, and text length) also play a role in judging whether to switch the right to speak. For example, in a system-led outbound dialogue scenario, in most cases the system can take over the right to speak after the user makes a short reply, while the cases that require the system to keep listening are mostly those in which the user, because of hesitation and other factors, gives a relatively long answer.
  • To accurately determine whether the dialogue system can take over the right to speak, during dialogue interaction with the user the voice duration, speech rate, and text length of the voice information currently input by the user can be acquired and divided into buckets, and the bucketed values are input into the MLP model to obtain low-dimensional time feature information of the voice information.
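Bucketing followed by an MLP can be sketched as below. The bucket boundaries, the one-hot encoding of bucket indices, the layer sizes, and the random weights (standing in for a pre-trained model) are all assumptions made for illustration; the application gives none of these values.

```python
import numpy as np

# Hypothetical bucket boundaries for each raw quantity.
EDGES = {
    "duration_s": np.array([1.0, 2.0, 4.0, 8.0]),
    "speech_rate": np.array([2.0, 4.0, 6.0]),
    "text_length": np.array([2, 5, 10, 20]),
}

def bucketize(duration_s, speech_rate, text_length):
    """Map each value to a bucket index and one-hot encode it."""
    values = {"duration_s": duration_s, "speech_rate": speech_rate,
              "text_length": text_length}
    parts = []
    for name, edges in EDGES.items():
        one_hot = np.zeros(len(edges) + 1)
        one_hot[np.digitize(values[name], edges)] = 1.0
        parts.append(one_hot)
    return np.concatenate(parts)

def mlp(x, w1, b1, w2, b2):
    """Two-layer perceptron with a ReLU hidden layer."""
    hidden = np.maximum(0.0, x @ w1 + b1)
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
x = bucketize(duration_s=2.0, speech_rate=4.5, text_length=9)
w1, b1 = rng.normal(size=(x.size, 8)), np.zeros(8)   # random stand-ins for trained weights
w2, b2 = rng.normal(size=(8, 4)), np.zeros(4)
time_features = mlp(x, w1, b1, w2, b2)
print(x, time_features.shape)
```

Bucketing turns three quantities on very different scales into a sparse vector, which the MLP then compresses into a low-dimensional representation.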
  • After the feature representation of each modality is obtained, the representations are input into the multimodal fusion model, which fuses the three different features to decide the right to speak:
  • σ(·) refers to the sigmoid function.
  • y is the predicted binary classification label: 1 means the user has finished speaking and the system takes over the right to speak; 0 means the system should continue to listen to the user's reply. b represents the bias value.
  • The above-mentioned multimodal fusion model may be established based on a feed-forward neural network.
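The fusion formula itself is missing from this text. Purely as an illustration, one form consistent with the surrounding description (a sigmoid, a bias b, and a binary label y) is given below. The symbols h_s, h_a, and h_t for the semantic, voice, and time feature vectors, the weight matrix W, the intermediate probability p, and the 0.5 threshold are all assumptions, not notation from the application.

```latex
p = \sigma\left( W \, [\, h_s ;\, h_a ;\, h_t \,] + b \right),
\qquad
y =
\begin{cases}
1, & p \ge 0.5 \\
0, & \text{otherwise}
\end{cases}
```

Here [ · ; · ; · ] denotes concatenation of the three feature vectors, which is one common way to combine modalities in a feed-forward network.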
  • Corresponding to the multimodal feature-based speech dialogue processing methods provided in the above embodiments, an embodiment of the present application also provides a multimodal feature-based speech dialogue processing apparatus. Because the apparatus corresponds to the methods provided in the above embodiments, the implementations of the method also apply to the apparatus provided in this embodiment.
  • Fig. 3 is a schematic structural diagram of an apparatus for processing speech dialogue based on multimodal features according to an embodiment of the present application.
  • The multimodal feature-based speech dialogue processing apparatus 300 includes a first acquiring module 301, a first determining module 302, a second determining module 303, a second acquiring module 304, and a third determining module 305.
  • The first acquiring module 301 is configured to acquire the first voice information currently input by the user during dialogue interaction with the user, wherein the first voice information includes a silent segment.
  • The first determining module 302 is configured to determine semantic feature information of the text information according to the text information of the first voice information and the historical context information of the first voice information.
  • The second determining module 303 is configured to determine voice feature information of the first voice information according to the voice segment before the silent segment in the first voice information.
  • The second acquiring module 304 is configured to acquire time feature information of the first voice information.
  • The third determining module 305 is configured to determine whether the user has finished the voice input according to the semantic feature information, the voice feature information, and the time feature information.
  • The first determining module 302 is specifically configured to: perform speech recognition on the first voice information to obtain the text information of the first voice information; acquire the historical context information of the first voice information; and input the text information and the historical context information into the semantic representation model to obtain the semantic feature information of the text information.
  • The second determining module 303 is specifically configured to: acquire a voice segment of the first preset time length before the silent segment in the first voice information; segment the voice segment according to the second preset time length to obtain multiple voice segments; extract the acoustic feature information corresponding to each of the multiple voice segments, and splice it to obtain the splicing feature corresponding to each voice segment; and input the splicing features into the deep residual network to obtain the voice feature information of the first voice information.
  • The second acquiring module 304 is specifically configured to: acquire the voice duration, speech rate, and text length of the first voice information; and input the voice duration, speech rate, and text length into the pre-trained multi-layer perceptron (MLP) model to obtain the time feature information of the first voice information.
  • The third determining module 305 may include a multimodal processing unit 3051 and a determining unit 3052.
  • The multimodal processing unit 3051 is configured to input the semantic feature information, voice feature information, and time feature information into the multimodal fusion model.
  • The determining unit 3052 is configured to determine whether the user has finished the voice input according to the output result of the multimodal fusion model.
  • The multimodal feature-based speech dialogue processing apparatus 300 further includes a first processing module 306.
  • The first processing module 306 is configured to determine the first reply voice information corresponding to the first voice information and output the first reply voice information when it is determined that the user has finished the voice input.
  • The multimodal feature-based speech dialogue processing apparatus 300 further includes a third acquiring module 307 and a second processing module 308.
  • The third acquiring module 307 is configured to acquire second voice information input again by the user when it is determined that the user has not finished the voice input.
  • The second processing module 308 is configured to determine corresponding second reply voice information according to the first voice information and the second voice information, and to output the second reply voice information.
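The module structure can be mirrored in software by composing callables, one per module. This is only a structural sketch: the class name `Apparatus300`, the type aliases, and the stub implementations are hypothetical, and the reply-related modules 306 to 308 are left out.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Audio = Sequence[float]
History = Sequence[str]
Features = Sequence[float]

@dataclass
class Apparatus300:
    acquire_voice: Callable[[], Audio]                             # module 301
    determine_semantic: Callable[[Audio, History], Features]       # module 302
    determine_speech: Callable[[Audio], Features]                  # module 303
    acquire_time: Callable[[Audio], Features]                      # module 304
    determine_end: Callable[[Features, Features, Features], bool]  # module 305

    def user_finished(self, history: History) -> bool:
        """Run the five modules in order and return the end-of-input decision."""
        audio = self.acquire_voice()
        semantic = self.determine_semantic(audio, history)
        speech = self.determine_speech(audio)
        time = self.acquire_time(audio)
        return self.determine_end(semantic, speech, time)

# Stub modules, used only to show how the pieces connect.
apparatus = Apparatus300(
    acquire_voice=lambda: [0.1, 0.2, 0.0, 0.0],
    determine_semantic=lambda audio, history: [float(len(history))],
    determine_speech=lambda audio: [max(audio)],
    acquire_time=lambda audio: [float(len(audio))],
    determine_end=lambda s, a, t: s[0] + a[0] + t[0] > 1.0,
)
print(apparatus.user_finished(["system: hello"]))
```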
  • During dialogue interaction with the user, the speech dialogue processing apparatus determines the semantic feature information of the text information by combining the text information of the first voice information currently input by the user with the historical context information of the first voice information; determines the voice feature information of the first voice information according to the voice segment before the silent segment in the first voice information; acquires the time feature information of the first voice information; and determines whether the user has finished the voice input according to the semantic feature information, the voice feature information, and the time feature information.
  • Thus, by combining semantic feature information, voice feature information, and time feature information, it is accurately determined whether the system can take over the right to speak.
  • the present application also provides an electronic device and a readable storage medium.
  • FIG. 5 is a block diagram of an electronic device according to an embodiment of the present application.
  • The electronic device includes a memory 501, a processor 502, and computer instructions stored in the memory 501 and executable on the processor 502.
  • When the processor 502 executes the instructions, the multimodal feature-based speech dialogue processing method provided in the foregoing embodiments is implemented.
  • The electronic device further includes a communication interface 503 for communication between the memory 501 and the processor 502.
  • The memory 501 is used to store computer instructions that can be executed on the processor 502.
  • The memory 501 may include high-speed RAM and may also include non-volatile memory, such as at least one disk memory.
  • The processor 502 is configured to implement the multimodal feature-based speech dialogue processing method of the above-mentioned embodiment when executing the program.
  • the bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus.
  • the bus can be divided into address bus, data bus, control bus and so on. For ease of representation, only one thick line is used in FIG. 5 , but it does not mean that there is only one bus or one type of bus.
  • when integrated on a single chip, the memory 501, the processor 502, and the communication interface 503 may communicate with each other through an internal interface.
  • the processor 502 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.
  • the present application also proposes a computer program product, which implements the multimodal feature-based speech dialogue processing method of the embodiment of the present application when instructions in the computer program product are executed by a processor.
  • the terms "first" and "second" are used for descriptive purposes only, and cannot be interpreted as indicating or implying relative importance or implicitly specifying the quantity of indicated technical features.
  • the features defined as “first” and “second” may explicitly or implicitly include at least one of these features.
  • “plurality” means at least two, such as two, three, etc., unless otherwise specifically defined.
  • a "computer-readable medium" may be any apparatus that can contain, store, communicate, propagate, or transmit a program for use by, or in conjunction with, an instruction execution system, apparatus, or device.
  • more specific examples of computer-readable media include the following: an electrical connection with one or more wires (electronic device), a portable computer disk case (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), fiber optic devices, and portable compact disc read-only memory (CD-ROM).
  • the computer-readable medium may even be paper or another suitable medium on which the program is printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it if necessary, and then stored in computer memory.
  • each part of the present application may be realized by hardware, software, firmware or a combination thereof.
  • various steps or methods may be implemented by software or firmware stored in memory and executed by a suitable instruction execution system.
  • for example, if implemented in hardware, as in another embodiment, they can be implemented by any one or a combination of the following techniques known in the art: discrete logic circuits with logic gates for implementing logic functions on data signals, application-specific integrated circuits (ASICs) with suitable combinational logic gates, programmable gate arrays (PGA), field-programmable gate arrays (FPGA), etc.
  • each functional unit in each embodiment of the present application may be integrated into one processing module, each unit may exist separately physically, or two or more units may be integrated into one module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or in the form of software function modules. If the integrated modules are realized in the form of software function modules and sold or used as independent products, they can also be stored in a computer-readable storage medium.
  • the storage medium mentioned above may be a read-only memory, a magnetic disk or an optical disk, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • User Interface Of Digital Computer (AREA)
  • Machine Translation (AREA)

Abstract

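A speech dialogue processing method, apparatus (300), and electronic device based on multimodal features. The method includes: during dialogue interaction with a user, acquiring first voice information currently input by the user, where the first voice information includes a silent segment (101); determining semantic feature information of text information according to the text information of the first voice information and historical context information of the first voice information (102); determining voice feature information of the first voice information according to the voice segment before the silent segment in the first voice information (103); acquiring time feature information of the first voice information (104); and determining, according to the semantic feature information, the voice feature information, and the time feature information, whether the user has ended voice input (105).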

Description

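Speech dialogue processing method, apparatus, and electronic device based on multimodal features
Cross-reference to related applications
This application is based on, and claims priority to, Chinese patent application No. 202111337746.8, filed on November 9, 2021, the entire contents of which are incorporated herein by reference.
Technical field
This application relates to the field of computer technology, and in particular to a speech dialogue processing method, apparatus, and electronic device based on multimodal features.
Background
In a speech dialogue system, while the user is speaking, the system needs to decide on the right moment to take over the turn (the right to speak). That is, the system switches back and forth between the roles of listener and speaker so that human-computer interaction is smooth and natural.
At present, most speech dialogue systems use voice activity detection (VAD) to measure how long the user has been silent, and the system takes over the turn when the silence exceeds a threshold (for example, 0.8 s to 1 s). This fixed-silence-duration approach has two problems. If the user has not finished and is pausing to think, but the silence exceeds the threshold, the system responds too quickly and too sensitively. If the user's interaction is quick and concise, the system still waits for the silence to reach the threshold before taking over the turn, so the response is sluggish and the user may repeat the answer. Therefore, how to determine when a speech dialogue system should take over the turn is a problem that urgently needs to be solved.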
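The fixed-threshold rule criticized above can be sketched in a few lines. This is only an illustration of the baseline behavior; the function name and the 0.8 s default are assumptions chosen to match the example range in the text.

```python
def fixed_threshold_takeover(silence_seconds, threshold=0.8):
    """Baseline rule: take over the turn once silence exceeds the threshold.

    The name and default value are hypothetical; the text only gives
    0.8 s to 1 s as an example range.
    """
    return silence_seconds > threshold


# A user who pauses 1.2 s to think is interrupted too early.
thinking_pause = fixed_threshold_takeover(1.2)
# After a short, complete answer the system still waits out the threshold.
quick_answer = fixed_threshold_takeover(0.3)
```

The rule looks only at silence duration, so it cannot tell a thinking pause from a finished answer, which is the gap the multimodal method is meant to close.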
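Summary
This application proposes a speech dialogue processing method, apparatus, and electronic device based on multimodal features.
An embodiment of one aspect of this application provides a speech dialogue processing method based on multimodal features, including: during dialogue interaction with a user, acquiring first voice information currently input by the user, where the first voice information includes a silent segment; determining semantic feature information of the text information according to text information of the first voice information and historical context information of the first voice information; determining voice feature information of the first voice information according to the voice segment before the silent segment in the first voice information; acquiring time feature information of the first voice information; and determining, according to the semantic feature information, the voice feature information, and the time feature information, whether the user has ended voice input.
In an embodiment of this application, determining the semantic feature information of the text information according to the text information of the first voice information and the historical context information of the first voice information includes: performing speech recognition on the first voice information to obtain the text information of the first voice information; acquiring the historical context information of the first voice information; and inputting the text information and the historical context information into a semantic representation model to obtain the semantic feature information of the text information.
In an embodiment of this application, determining the voice feature information of the first voice information according to the voice segment before the silent segment in the first voice information includes: acquiring, from the first voice information, a voice segment of a first preset duration before the silent segment; splitting the voice segment according to a second preset duration to obtain multiple voice segments; extracting the acoustic feature information corresponding to each of the multiple voice segments, and concatenating the acoustic feature information of each voice segment to obtain a concatenated feature for each of the multiple voice segments; and inputting the concatenated features into a deep residual network to obtain the voice feature information of the first voice information.
In an embodiment of this application, acquiring the time feature information of the first voice information includes: acquiring the voice duration, speech rate, and text length of the first voice information; and inputting the voice duration, the speech rate, and the text length into a pre-trained multilayer perceptron (MLP) model to obtain the time feature information of the first voice information.
In an embodiment of this application, determining whether the user has ended voice input according to the semantic feature information, the voice feature information, and the time feature information includes: inputting the semantic feature information, the voice feature information, and the time feature information into a multimodal fusion model; and determining, according to the output of the multimodal fusion model, whether the user has ended voice input.
In an embodiment of this application, the method further includes: when it is determined that the user has ended voice input, determining first reply voice information corresponding to the first voice information, and outputting the first reply voice information.
In an embodiment of this application, the method further includes: when it is determined that the user has not ended voice input, acquiring second voice information input again by the user; and determining corresponding second reply voice information according to the first voice information and the second voice information, and outputting the second reply voice information.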
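An embodiment of another aspect of this application provides a speech dialogue processing apparatus based on multimodal features, including: a first acquisition module, configured to acquire, during dialogue interaction with a user, first voice information currently input by the user, where the first voice information includes a silent segment; a first determination module, configured to determine semantic feature information of the text information according to text information of the first voice information and historical context information of the first voice information; a second determination module, configured to determine voice feature information of the first voice information according to the voice segment before the silent segment in the first voice information; a second acquisition module, configured to acquire time feature information of the first voice information; and a third determination module, configured to determine, according to the semantic feature information, the voice feature information, and the time feature information, whether the user has ended voice input.
In an embodiment of this application, the first determination module is specifically configured to: perform speech recognition on the first voice information to obtain the text information of the first voice information; acquire the historical context information of the first voice information; and input the text information and the historical context information into a semantic representation model to obtain the semantic feature information of the text information.
In an embodiment of this application, the second determination module is specifically configured to: acquire, from the first voice information, a voice segment of a first preset duration before the silent segment; split the voice segment according to a second preset duration to obtain multiple voice segments; extract the acoustic feature information corresponding to each of the multiple voice segments, and concatenate the acoustic feature information of each voice segment to obtain a concatenated feature for each of the multiple voice segments; and input the concatenated features into a deep residual network to obtain the voice feature information of the first voice information.
In an embodiment of this application, the second acquisition module is specifically configured to: acquire the voice duration, speech rate, and text length of the first voice information; and input the voice duration, the speech rate, and the text length into a pre-trained multilayer perceptron (MLP) model to obtain the time feature information of the first voice information.
In an embodiment of this application, the third determination module includes: a multimodal processing unit, configured to input the semantic feature information, the voice feature information, and the time feature information into a multimodal fusion model; and a determination unit, configured to determine, according to the output of the multimodal fusion model, whether the user has ended voice input.
In an embodiment of this application, the apparatus further includes: a first processing module, configured to determine, when it is determined that the user has ended voice input, first reply voice information corresponding to the first voice information, and to output the first reply voice information.
In an embodiment of this application, the apparatus further includes: a third acquisition module, configured to acquire, when it is determined that the user has not ended voice input, second voice information input again by the user; and a second processing module, configured to determine corresponding second reply voice information according to the first voice information and the second voice information, and to output the second reply voice information.
An embodiment of another aspect of this application provides an electronic device, including a memory and a processor. The memory stores computer instructions which, when executed by the processor, implement the speech dialogue processing method based on multimodal features of the embodiments of this application.
An embodiment of another aspect of this application provides a non-transitory computer-readable storage medium storing computer instructions, where the computer instructions are used to cause a computer to execute the speech dialogue processing method based on multimodal features disclosed in the embodiments of this application.
An embodiment of another aspect of this application provides a computer program product which, when instructions in the computer program product are executed by a processor, implements the speech dialogue processing method based on multimodal features of the embodiments of this application.
Other effects of the above optional implementations are described below in conjunction with specific embodiments.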
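Brief description of the drawings
The drawings are provided for a better understanding of the solution and do not limit this application. In the drawings:
FIG. 1 is a schematic flowchart of a speech dialogue processing method based on multimodal features according to an embodiment of this application.
FIG. 2 is an example diagram that describes the speech dialogue processing method together with the model framework, according to a specific embodiment of this application.
FIG. 3 is a schematic structural diagram of a speech dialogue processing apparatus based on multimodal features according to an embodiment of this application.
FIG. 4 is a schematic structural diagram of a speech dialogue processing apparatus based on multimodal features according to another embodiment of this application.
FIG. 5 is a block diagram of an electronic device according to an embodiment of this application.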
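Detailed description
Embodiments of the present disclosure are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals denote, throughout, the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain this application and should not be construed as limiting it.
The speech dialogue processing method, apparatus, and electronic device based on multimodal features of the embodiments of this application are described below with reference to the drawings.
FIG. 1 is a schematic flowchart of a speech dialogue processing method based on multimodal features according to an embodiment of this application. It should be noted that the method provided in the embodiments of this application is executed by a speech dialogue processing apparatus based on multimodal features, which may be implemented in software and/or hardware. In the embodiments of this application, the apparatus may be configured in a speech dialogue system, and the speech dialogue system may be configured in an electronic device. The electronic device may include a terminal device, a server, or the like.
As shown in FIG. 1, the speech dialogue processing method based on multimodal features may include steps 101 to 105.
Step 101: during dialogue interaction with a user, acquire first voice information currently input by the user, where the first voice information includes a silent segment.
Step 102: determine semantic feature information of the text information according to the text information of the first voice information and the historical context information of the first voice information.
In an embodiment of this application, speech recognition may be performed on the first voice information to obtain its text information, the historical context information of the first voice information may be acquired, and the text information and the historical context information may be input into a semantic representation model to obtain the semantic feature information of the text information.
In some embodiments, in order to capture the long-distance dependencies between the text information and the historical context information, and to determine the semantic feature information of the text information accurately from those dependencies, the semantic representation model may be a Transformer model based on the self-attention mechanism.
In some embodiments, the Transformer model may include multiple encoding layers. Each encoding layer contains a Transformer-based encoding structure, which encodes its input and passes its output to the next encoding layer for processing.
In some embodiments, an exemplary way to acquire the historical context information of the first voice information is to acquire multiple pieces of historical voice dialogue information that precede the first voice information, and to derive the historical context information of the first voice information from them.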
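As a minimal sketch of how the history and the current text might be combined before encoding, the snippet below joins the most recent dialogue turns with the current recognized text. The function name, the `[SEP]` separator, and the `max_turns` limit are all assumptions made for illustration; the source does not specify how the model input is assembled, and the Transformer encoder itself is not shown.

```python
def build_semantic_input(history_turns, current_text, max_turns=3, sep=" [SEP] "):
    """Join the most recent history turns with the current text.

    history_turns: earlier dialogue turns as text, oldest first.
    max_turns and sep are hypothetical choices, not taken from the source.
    """
    recent = list(history_turns[-max_turns:]) if max_turns > 0 else []
    return sep.join(recent + [current_text])


history = ["Hello, is this Mr. Wang?", "Yes.", "Is this a good time to talk?"]
model_input = build_semantic_input(history, "Well, I am", max_turns=2)
```

The resulting string would then be tokenized and passed to the semantic representation model, so that self-attention can relate the incomplete current text to the question that preceded it.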
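Step 103: determine voice feature information of the first voice information according to the voice segment before the silent segment in the first voice information.
In some embodiments, a voice segment of a first preset duration before the silent segment may be taken from the first voice information; the voice segment is split according to a second preset duration to obtain multiple voice segments; the acoustic feature information corresponding to each of the multiple voice segments is extracted and concatenated to obtain a concatenated feature for each voice segment; and the concatenated features are input into a deep residual network to obtain the voice feature information of the first voice information.
In some embodiments, the first preset duration is set in advance. For example, the first preset duration may be 2 seconds, which means that a 2-second voice segment immediately before the silent segment is cut from the first voice information.
In some embodiments, the second preset duration is set in advance, and the first preset duration is greater than the second preset duration. For example, if the first preset duration is 2 seconds, the second preset duration may be 50 milliseconds (ms). In some embodiments, after the 2-second voice segment is obtained, it may be split every 50 ms to obtain multiple voice segments, each 50 ms long.
In some embodiments, the acoustic feature information may include, but is not limited to, energy, volume, pitch, and zero-crossing rate.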
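The windowing and per-frame feature step can be sketched as follows. This is a simplified illustration under stated assumptions: the audio is a plain list of float samples, the helper names are hypothetical, and only two of the listed features (RMS energy and zero-crossing rate) are computed. Pitch extraction and the deep residual network are omitted.

```python
import math


def last_window(samples, sample_rate, silence_start_s, window_s=2.0):
    """Return up to window_s seconds of samples that end where the silence begins."""
    end = round(silence_start_s * sample_rate)
    start = max(0, end - round(window_s * sample_rate))
    return samples[start:end]


def split_frames(samples, sample_rate, frame_s=0.05):
    """Split samples into non-overlapping frames of frame_s seconds, dropping a short tail."""
    size = round(frame_s * sample_rate)
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


def frame_features(frame):
    """Concatenate two acoustic features of one frame: RMS energy and zero-crossing rate."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return [rms, crossings / (len(frame) - 1)]


# 3 s of a synthetic 100 Hz tone sampled at 1 kHz, with silence starting at 3.0 s.
rate = 1000
audio = [math.sin(2 * math.pi * 100 * n / rate) for n in range(3 * rate)]
window = last_window(audio, rate, silence_start_s=3.0)
F = [frame_features(f) for f in split_frames(window, rate)]  # one vector per 50 ms frame
```

With a 2 s window and 50 ms frames, `F` holds 40 feature vectors, which corresponds to the frame sequence that the text feeds into the residual network.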
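Step 104: acquire time feature information of the first voice information.
In some embodiments, the voice duration, speech rate, and text length of the first voice information may be obtained and input into a pre-trained multilayer perceptron (MLP) model to obtain the time feature information of the first voice information.
In some embodiments, the text length may be determined from the text information corresponding to the first voice information.
Step 105: determine, according to the semantic feature information, the voice feature information, and the time feature information, whether the user has ended voice input.
In some embodiments, to determine accurately whether the user has ended voice input, the semantic feature information, the voice feature information, and the time feature information may be input into a multimodal fusion model, and whether the user has ended voice input is determined from the output of the multimodal fusion model.
In some embodiments, after receiving the semantic feature information, the voice feature information, and the time feature information, the multimodal fusion model may obtain the weight corresponding to each of them, weight the three kinds of feature information accordingly, and input the weighted result into the activation function of the multimodal fusion model to obtain the output of the model.
In some embodiments, when the output of the multimodal fusion model indicates that the user has ended voice input, it is determined that the user has ended voice input, and the dialogue system can take over the turn. In other embodiments, when the output of the multimodal fusion model indicates that the user has not ended voice input, it is determined that the user has not ended voice input; the dialogue system can then keep listening and reply after it determines that the user's input has ended.
In the speech dialogue processing method based on multimodal features of the embodiments of this application, during dialogue interaction with the user, the semantic feature information of the text information is determined from the text information of the first voice information currently input by the user and the historical context information of the first voice information; the voice feature information of the first voice information is determined from the voice segment before the silent segment in the first voice information; the time feature information of the first voice information is acquired; and whether the user has ended voice input is determined from the semantic feature information, the voice feature information, and the time feature information. In this way, during dialogue interaction with the user, the semantic, voice, and time feature information are combined to determine accurately whether the system can take over the turn.
On the basis of the above embodiments, so that the dialogue system can reply accurately to the voice information input by the user, in some embodiments, when it is determined that the user has ended voice input, first reply voice information corresponding to the first voice information is determined and output.
In other embodiments, when it is determined that the user has not ended voice input, second voice information input again by the user is acquired; corresponding second reply voice information is determined according to the first voice information and the second voice information, and the second reply voice information is output. In this way, an accurate reply is given based on both the first voice information currently input by the user and the second voice information input afterwards.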
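The two reply branches described above can be summarized in a short control-flow sketch. It is illustrative only: the inputs are treated as recognized text rather than audio, and `handle_turn`, `listen_again`, and `reply_to` are hypothetical names standing in for components the source does not detail.

```python
def handle_turn(first_utterance, user_finished, listen_again, reply_to):
    """Reply at once if the user is done; otherwise collect more input first.

    user_finished: the end-of-input decision for the first utterance.
    listen_again: callable that returns the user's second utterance.
    reply_to: callable that maps the utterance text to a reply.
    All three parameters are placeholders for this sketch.
    """
    if user_finished:
        # First reply: based on the first voice information alone.
        return reply_to(first_utterance)
    # Second reply: based on the first and second voice information together.
    second_utterance = listen_again()
    return reply_to(first_utterance + " " + second_utterance)


def echo(text):
    return "ack: " + text


direct = handle_turn("yes", True, lambda: "", echo)
merged = handle_turn("I want to", False, lambda: "cancel my order", echo)
```

Merging the two inputs before generating the reply keeps the system from answering a half-finished request.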
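So that those skilled in the art can clearly understand this application, the method of the embodiments of this application is further explained below with reference to FIG. 2.
As FIG. 2 shows, in determining whether the user has ended voice input, the embodiments of this application use features from three different dimensions: semantic feature information, voice feature information, and time feature information. In other words, these three kinds of features are used to determine whether the dialogue system can take over the turn, that is, whether the dialogue system should output a corresponding reply.
The processes of obtaining the semantic feature information, the voice feature information, and the time feature information are described below in turn.
1) Obtaining semantic feature information.
The semantic feature information comes from the text obtained by speech recognition. Its importance for the turn-taking decision is self-evident, especially because "semantic completeness" is a basic element of turn switching: once it is determined that the user has fully expressed their intent, the system can usually take over the turn. Semantic completeness is generally judged in light of the context, as in the following simple example: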
Figure PCTCN2022113640-appb-000001
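In the example on the left, the user gives a definite answer with clear semantics, so the dialogue system can take over the turn. In the example on the right, the user hesitates briefly, but the content the user has input so far shows that the user has not finished, so the dialogue system should choose to keep listening and wait for the user to finish.
To model this semantic completeness, during dialogue interaction between the user and the dialogue system, after the voice information currently input by the user is acquired, speech recognition may be performed on it to obtain the current text information. The historical context information of the currently input voice information and the current text information may then be encoded to obtain the semantic feature information of the text information.
In some embodiments, a Transformer model based on the self-attention mechanism may be used to encode the historical context information of the currently input voice information together with the current text information.
It can be understood that the self-attention mechanism in the Transformer model can capture the long-distance dependencies between the historical context information and the text information. The final semantic feature representation is:
$$r_s = \mathrm{Transformer}(e)$$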
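2) Obtaining voice feature information
It can be understood that, during a dialogue, voice features such as changes in pitch and the level of the volume are important cues for deciding whether to switch the turn. Therefore, during dialogue with the user, after the voice information currently input by the user is acquired, a stretch of audio (2 s) just before the user's silence may be cut from the voice information and then split into short pieces of fixed length, that is, frames (50 ms per frame). Next, the acoustic features of each frame, such as energy, volume, pitch, and zero-crossing rate, are extracted and concatenated into a one-dimensional vector, giving the feature representation $f_i$ of each frame. Finally, the features of the frame sequence, $F = [f_1, f_2, \ldots, f_n]$, may be input into a multi-layer deep residual network (Residual Network, ResNet) to obtain the final voice feature representation:
$$r_a = \mathrm{ResNet}(F)$$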
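3) Time features
It should be understood that time features (such as the duration of the voice segment, the speech rate, and the text length) also play a role in deciding whether to switch the turn. For example, in an outbound-call dialogue scenario led by the system, in most cases the system can take over the turn after the user gives a short reply; the cases in which the system needs to keep listening are mostly those where the user produces a relatively long answer because of hesitation or similar factors. Therefore, to determine accurately whether the dialogue system can take over the turn, during dialogue interaction with the user, the voice duration, speech rate, and text length of the voice information currently input by the user may be obtained and each bucketed, and the bucketed voice duration, speech rate, and text length are input into the MLP model to obtain low-dimensional time feature information of the voice information.
The low-dimensional feature representation is extracted by the multilayer perceptron network:
$$r_t = \mathrm{MLP}(t)$$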
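The bucketing and MLP step can be sketched as below. Everything specific here is an assumption for illustration: the bucket boundaries, the one-hot encoding of bucket indices, the speech rate computed as text length divided by duration, and the toy weights. The source states only that the three quantities are bucketed and passed through a pre-trained MLP.

```python
import bisect

# Hypothetical bucket boundaries; the source does not give concrete values.
DURATION_EDGES = [1.0, 3.0, 6.0]   # seconds
RATE_EDGES = [2.0, 4.0, 6.0]       # characters per second
LENGTH_EDGES = [5, 15, 30]         # characters


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def time_inputs(duration_s, text_len):
    """Bucket duration, speech rate, and text length, then concatenate one-hot codes."""
    rate = text_len / duration_s if duration_s > 0 else 0.0
    vec = []
    for value, edges in ((duration_s, DURATION_EDGES), (rate, RATE_EDGES), (text_len, LENGTH_EDGES)):
        vec += one_hot(bisect.bisect_right(edges, value), len(edges) + 1)
    return vec


def mlp(x, layers):
    """Apply dense layers given as (weights, biases), with ReLU on all but the last."""
    for k, (W, b) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]
        if k < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


t = time_inputs(duration_s=2.0, text_len=8)
# Toy weights only; a real model would load pre-trained parameters.
r_t = mlp(t, [([[0.1] * 12, [-0.1] * 12], [0.0, 0.0]), ([[1.0, 1.0]], [0.0])])
```

Bucketing turns raw magnitudes into coarse categories, which keeps outliers such as a very long utterance from dominating the learned representation.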
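4) Multimodal feature fusion
In some embodiments, after the feature representation of each modality is obtained, the representations are input into the multimodal fusion model, which fuses the three different features to make the turn-taking decision:
$$y = \sigma(W_s r_s + W_a r_a + W_t r_t + b)$$
where $\sigma(\cdot)$ is the sigmoid function, $b$ is the bias, and $y$ is the predicted binary label: 1 means the user has finished speaking and the system takes over the turn; 0 means the system should continue listening to the user's reply.
In some embodiments, the multimodal fusion model may be built on a feedforward neural network.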
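The fusion formula can be written out directly. The sketch below assumes each weight is a single row vector so that $y$ is a scalar, uses made-up feature and weight values, and thresholds the sigmoid output at 0.5 to obtain the binary label; the 0.5 cut-off and the function names are assumptions, not something the source specifies.

```python
import math


def dot(w, r):
    return sum(wi * ri for wi, ri in zip(w, r))


def fuse(r_s, r_a, r_t, W_s, W_a, W_t, b):
    """Compute y = sigmoid(W_s r_s + W_a r_a + W_t r_t + b) with row-vector weights."""
    z = dot(W_s, r_s) + dot(W_a, r_a) + dot(W_t, r_t) + b
    return 1.0 / (1.0 + math.exp(-z))


def user_finished(y, threshold=0.5):
    """Map the fused score to the binary label (threshold is a hypothetical choice)."""
    return 1 if y >= threshold else 0


# Toy feature vectors and weights, for illustration only.
y = fuse(r_s=[0.9, 0.2], r_a=[0.4], r_t=[0.3],
         W_s=[2.0, -1.0], W_a=[1.5], W_t=[1.0], b=-1.0)
label = user_finished(y)
```

Because the three weighted terms are simply summed before the sigmoid, a strong cue in one modality (for example, a semantically complete sentence) can outweigh a weak or ambiguous cue in another.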
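Corresponding to the speech dialogue processing method based on multimodal features provided by the above embodiments, an embodiment of this application also provides a speech dialogue processing apparatus based on multimodal features. Because the apparatus corresponds to the method provided by the above embodiments, the implementations of the method also apply to the apparatus provided by this embodiment.
FIG. 3 is a schematic structural diagram of a speech dialogue processing apparatus based on multimodal features according to an embodiment of this application.
As shown in FIG. 3, the speech dialogue processing apparatus 300 based on multimodal features includes a first acquisition module 301, a first determination module 302, a second determination module 303, a second acquisition module 304, and a third determination module 305.
The first acquisition module 301 is configured to acquire, during dialogue interaction with a user, first voice information currently input by the user, where the first voice information includes a silent segment.
The first determination module 302 is configured to determine semantic feature information of the text information according to the text information of the first voice information and the historical context information of the first voice information.
The second determination module 303 is configured to determine voice feature information of the first voice information according to the voice segment before the silent segment in the first voice information.
The second acquisition module 304 is configured to acquire time feature information of the first voice information.
The third determination module 305 is configured to determine, according to the semantic feature information, the voice feature information, and the time feature information, whether the user has ended voice input.
In an embodiment of this application, the first determination module 302 is specifically configured to: perform speech recognition on the first voice information to obtain the text information of the first voice information; acquire the historical context information of the first voice information; and input the text information and the historical context information into a semantic representation model to obtain the semantic feature information of the text information.
In an embodiment of this application, the second determination module 303 is specifically configured to: acquire, from the first voice information, a voice segment of a first preset duration before the silent segment; split the voice segment according to a second preset duration to obtain multiple voice segments; extract the acoustic feature information corresponding to each of the multiple voice segments, and concatenate the acoustic feature information of each voice segment to obtain a concatenated feature for each of the multiple voice segments; and input the concatenated features into a deep residual network to obtain the voice feature information of the first voice information.
In an embodiment of this application, the second acquisition module 304 is specifically configured to: acquire the voice duration, speech rate, and text length of the first voice information; and input the voice duration, speech rate, and text length into a pre-trained multilayer perceptron (MLP) model to obtain the time feature information of the first voice information.
In an embodiment of this application, on the basis of the apparatus embodiment shown in FIG. 3, as shown in FIG. 4, the third determination module 305 may include a multimodal processing unit 3051 and a determination unit 3052.
The multimodal processing unit 3051 is configured to input the semantic feature information, the voice feature information, and the time feature information into a multimodal fusion model.
The determination unit 3052 is configured to determine, according to the output of the multimodal fusion model, whether the user has ended voice input.
In an embodiment of this application, as shown in FIG. 4, the speech dialogue processing apparatus 300 based on multimodal features further includes a first processing module 306.
The first processing module 306 is configured to determine, when it is determined that the user has ended voice input, first reply voice information corresponding to the first voice information, and to output the first reply voice information.
In an embodiment of this application, as shown in FIG. 4, the speech dialogue processing apparatus 300 based on multimodal features further includes a third acquisition module 307 and a second processing module 308.
The third acquisition module 307 is configured to acquire, when it is determined that the user has not ended voice input, second voice information input again by the user.
The second processing module 308 is configured to determine corresponding second reply voice information according to the first voice information and the second voice information, and to output the second reply voice information.
In the speech dialogue processing apparatus based on multimodal features of the embodiments of this application, during dialogue interaction with the user, the semantic feature information of the text information is determined from the text information of the first voice information currently input by the user and the historical context information of the first voice information; the voice feature information of the first voice information is determined from the voice segment before the silent segment in the first voice information; the time feature information of the first voice information is acquired; and whether the user has ended voice input is determined from the semantic feature information, the voice feature information, and the time feature information. In this way, during dialogue interaction with the user, the semantic, voice, and time feature information are combined to determine accurately whether the system can take over the turn.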
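According to embodiments of this application, this application also provides an electronic device and a readable storage medium.
FIG. 5 is a block diagram of an electronic device according to an embodiment of this application.
As shown in FIG. 5, the electronic device includes a memory 501, a processor 502, and computer instructions stored in the memory 501 and executable on the processor 502.
When the processor 502 executes the instructions, the speech dialogue processing method based on multimodal features provided in the above embodiments is implemented.
Further, the electronic device includes a communication interface 503 for communication between the memory 501 and the processor 502.
The memory 501 is used to store computer instructions that can run on the processor 502.
The memory 501 may include high-speed RAM and may also include non-volatile memory, for example at least one disk memory.
The processor 502 is configured to implement the speech dialogue processing method based on multimodal features of the above embodiments when executing the program.
If the memory 501, the processor 502, and the communication interface 503 are implemented independently, they may be connected to one another and communicate through a bus. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is drawn in FIG. 5, but this does not mean that there is only one bus or one type of bus.
In some embodiments, if the memory 501, the processor 502, and the communication interface 503 are integrated on a single chip, they may communicate with one another through an internal interface.
The processor 502 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of this application.
This application also proposes a computer program product which, when instructions in the computer program product are executed by a processor, implements the speech dialogue processing method based on multimodal features of the embodiments of this application.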
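In this specification, descriptions that refer to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" mean that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of this application. In this specification, such expressions do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, provided they do not contradict one another, those skilled in the art may combine the different embodiments or examples described in this specification and the features of those embodiments or examples.
In addition, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or as implicitly specifying the number of the technical features indicated. A feature qualified by "first" or "second" may therefore explicitly or implicitly include at least one such feature. In the description of this application, "multiple" means at least two, for example two or three, unless otherwise specifically and clearly defined.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing the steps of a custom logic function or process. The scope of the preferred implementations of this application includes additional implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of this application belong.
The logic and/or steps represented in a flowchart or otherwise described herein may, for example, be regarded as an ordered list of executable instructions for implementing logic functions, and may be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any apparatus that can contain, store, communicate, propagate, or transmit a program for use by, or in conjunction with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include the following: an electrical connection with one or more wires (electronic device), a portable computer disk case (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), fiber optic devices, and portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program can be printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in computer memory.
It should be understood that each part of this application may be implemented in hardware, software, firmware, or a combination of them. In the above implementations, multiple steps or methods may be implemented by software or firmware that is stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another implementation, they may be implemented by any one or a combination of the following techniques known in the art: discrete logic circuits with logic gates for implementing logic functions on data signals, application-specific integrated circuits with suitable combinational logic gates, programmable gate arrays (PGA), field-programmable gate arrays (FPGA), and so on.
Those of ordinary skill in the art can understand that all or part of the steps of the methods in the above embodiments can be completed by a program instructing the related hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination of them.
In addition, the functional units in the embodiments of this application may be integrated into one processing module, each unit may exist physically on its own, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like. Although embodiments of this application have been shown and described above, it can be understood that the above embodiments are exemplary and should not be construed as limiting this application; those of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of this application.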

Claims (17)

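  1. A speech dialogue processing method based on multimodal features, comprising:
    during dialogue interaction with a user, acquiring first voice information currently input by the user, wherein the first voice information includes a silent segment;
    determining semantic feature information of the text information according to text information of the first voice information and historical context information of the first voice information;
    determining voice feature information of the first voice information according to the voice segment before the silent segment in the first voice information;
    acquiring time feature information of the first voice information; and
    determining, according to the semantic feature information, the voice feature information, and the time feature information, whether the user has ended voice input.
  2. The method according to claim 1, wherein determining the semantic feature information of the text information according to the text information of the first voice information and the historical context information of the first voice information comprises:
    performing speech recognition on the first voice information to obtain the text information of the first voice information;
    acquiring the historical context information of the first voice information; and
    inputting the text information and the historical context information into a semantic representation model to obtain the semantic feature information of the text information.
  3. The method according to claim 1, wherein determining the voice feature information of the first voice information according to the voice segment before the silent segment in the first voice information comprises:
    acquiring, from the first voice information, a voice segment of a first preset duration before the silent segment;
    splitting the voice segment according to a second preset duration to obtain multiple voice segments;
    extracting the acoustic feature information corresponding to each of the multiple voice segments, and concatenating the acoustic feature information of each voice segment to obtain a concatenated feature for each of the multiple voice segments; and
    inputting the concatenated features into a deep residual network to obtain the voice feature information of the first voice information.
  4. The method according to claim 1, wherein acquiring the time feature information of the first voice information comprises:
    acquiring the voice duration, speech rate, and text length of the first voice information; and
    inputting the voice duration, the speech rate, and the text length into a pre-trained multilayer perceptron (MLP) model to obtain the time feature information of the first voice information.
  5. The method according to claim 1, wherein determining whether the user has ended voice input according to the semantic feature information, the voice feature information, and the time feature information comprises:
    inputting the semantic feature information, the voice feature information, and the time feature information into a multimodal fusion model; and
    determining, according to the output of the multimodal fusion model, whether the user has ended voice input.
  6. The method according to any one of claims 1 to 5, further comprising:
    when it is determined that the user has ended voice input, determining first reply voice information corresponding to the first voice information, and outputting the first reply voice information.
  7. The method according to any one of claims 1 to 5, further comprising:
    when it is determined that the user has not ended voice input, acquiring second voice information input again by the user; and
    determining corresponding second reply voice information according to the first voice information and the second voice information, and outputting the second reply voice information.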
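  8. A speech interaction processing apparatus based on multimodal features, comprising:
    a first acquisition module, configured to acquire, during dialogue interaction with a user, first voice information currently input by the user, wherein the first voice information includes a silent segment;
    a first determination module, configured to determine semantic feature information of the text information according to text information of the first voice information and historical context information of the first voice information;
    a second determination module, configured to determine voice feature information of the first voice information according to the voice segment before the silent segment in the first voice information;
    a second acquisition module, configured to acquire time feature information of the first voice information; and
    a third determination module, configured to determine, according to the semantic feature information, the voice feature information, and the time feature information, whether the user has ended voice input.
  9. The apparatus according to claim 8, wherein the first determination module is specifically configured to:
    perform speech recognition on the first voice information to obtain the text information of the first voice information;
    acquire the historical context information of the first voice information; and
    input the text information and the historical context information into a semantic representation model to obtain the semantic feature information of the text information.
  10. The apparatus according to claim 8, wherein the second determination module is specifically configured to:
    acquire, from the first voice information, a voice segment of a first preset duration before the silent segment;
    split the voice segment according to a second preset duration to obtain multiple voice segments;
    extract the acoustic feature information corresponding to each of the multiple voice segments, and concatenate the acoustic feature information of each voice segment to obtain a concatenated feature for each of the multiple voice segments; and
    input the concatenated features into a deep residual network to obtain the voice feature information of the first voice information.
  11. The apparatus according to claim 8, wherein the second acquisition module is specifically configured to:
    acquire the voice duration, speech rate, and text length of the first voice information; and
    input the voice duration, the speech rate, and the text length into a pre-trained multilayer perceptron (MLP) model to obtain the time feature information of the first voice information.
  12. The apparatus according to claim 8, wherein the third determination module comprises:
    a multimodal processing unit, configured to input the semantic feature information, the voice feature information, and the time feature information into a multimodal fusion model; and
    a determination unit, configured to determine, according to the output of the multimodal fusion model, whether the user has ended voice input.
  13. The apparatus according to any one of claims 8 to 12, further comprising:
    a first processing module, configured to determine, when it is determined that the user has ended voice input, first reply voice information corresponding to the first voice information, and to output the first reply voice information.
  14. The apparatus according to any one of claims 8 to 12, further comprising:
    a third acquisition module, configured to acquire, when it is determined that the user has not ended voice input, second voice information input again by the user; and
    a second processing module, configured to determine corresponding second reply voice information according to the first voice information and the second voice information, and to output the second reply voice information.
  15. An electronic device, comprising a memory and a processor, wherein the memory stores computer instructions which, when executed by the processor, implement the speech dialogue processing method based on multimodal features according to any one of claims 1 to 7.
  16. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to execute the speech dialogue processing method based on multimodal features according to any one of claims 1 to 7.
  17. A computer program product, wherein, when instructions in the computer program product are executed by a processor, the speech dialogue processing method based on multimodal features according to any one of claims 1 to 7 is implemented.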
PCT/CN2022/113640 2021-11-09 2022-08-19 基于多模态特征的语音对话处理方法、装置和电子设备 WO2023082752A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111337746.8A CN114078474A (zh) 2021-11-09 2021-11-09 基于多模态特征的语音对话处理方法、装置和电子设备
CN202111337746.8 2021-11-09

Publications (1)

Publication Number Publication Date
WO2023082752A1 true WO2023082752A1 (zh) 2023-05-19

Family

ID=80283747

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/113640 WO2023082752A1 (zh) 2021-11-09 2022-08-19 基于多模态特征的语音对话处理方法、装置和电子设备

Country Status (2)

Country Link
CN (1) CN114078474A (zh)
WO (1) WO2023082752A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114078474A (zh) * 2021-11-09 2022-02-22 京东科技信息技术有限公司 基于多模态特征的语音对话处理方法、装置和电子设备
CN114418038A (zh) * 2022-03-29 2022-04-29 北京道达天际科技有限公司 基于多模态融合的天基情报分类方法、装置及电子设备

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180113854A1 (en) * 2016-10-24 2018-04-26 Palo Alto Research Center Incorporated System for automatic extraction of structure from spoken conversation using lexical and acoustic features
US20200042595A1 (en) * 2018-08-03 2020-02-06 International Business Machines Corporation Conversation boundary determination
CN111105782A (zh) * 2019-11-27 2020-05-05 深圳追一科技有限公司 会话交互处理方法、装置、计算机设备和存储介质
CN112101045A (zh) * 2020-11-02 2020-12-18 北京淇瑀信息科技有限公司 一种多模态语义完整性识别方法、装置及电子设备
CN112825248A (zh) * 2019-11-19 2021-05-21 阿里巴巴集团控股有限公司 语音处理方法、模型训练方法、界面显示方法及设备
CN113035180A (zh) * 2021-03-22 2021-06-25 建信金融科技有限责任公司 语音输入完整性判断方法、装置、电子设备和存储介质
CN113160854A (zh) * 2020-01-22 2021-07-23 阿里巴巴集团控股有限公司 语音交互***、相关方法、装置及设备
CN114078474A (zh) * 2021-11-09 2022-02-22 京东科技信息技术有限公司 基于多模态特征的语音对话处理方法、装置和电子设备

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180113854A1 (en) * 2016-10-24 2018-04-26 Palo Alto Research Center Incorporated System for automatic extraction of structure from spoken conversation using lexical and acoustic features
US20200042595A1 (en) * 2018-08-03 2020-02-06 International Business Machines Corporation Conversation boundary determination
CN112825248A (zh) * 2019-11-19 2021-05-21 阿里巴巴集团控股有限公司 语音处理方法、模型训练方法、界面显示方法及设备
CN111105782A (zh) * 2019-11-27 2020-05-05 深圳追一科技有限公司 会话交互处理方法、装置、计算机设备和存储介质
CN113160854A (zh) * 2020-01-22 2021-07-23 阿里巴巴集团控股有限公司 语音交互***、相关方法、装置及设备
CN112101045A (zh) * 2020-11-02 2020-12-18 北京淇瑀信息科技有限公司 一种多模态语义完整性识别方法、装置及电子设备
CN113035180A (zh) * 2021-03-22 2021-06-25 建信金融科技有限责任公司 语音输入完整性判断方法、装置、电子设备和存储介质
CN114078474A (zh) * 2021-11-09 2022-02-22 京东科技信息技术有限公司 基于多模态特征的语音对话处理方法、装置和电子设备

Also Published As

Publication number Publication date
CN114078474A (zh) 2022-02-22

Similar Documents

Publication Publication Date Title
WO2023082752A1 (zh) 基于多模态特征的语音对话处理方法、装置和电子设备
CN111028827B (zh) 基于情绪识别的交互处理方法、装置、设备和存储介质
KR102289917B1 (ko) 화행 정보를 이용한 대화 처리 방법 및 그 장치
US10679613B2 (en) Spoken language understanding system and method using recurrent neural networks
US9154629B2 (en) System and method for generating personalized tag recommendations for tagging audio content
CN109840052B (zh) 一种音频处理方法、装置、电子设备及存储介质
US20240203400A1 (en) Speaker awareness using speaker dependent speech model(s)
Addlesee et al. A comprehensive evaluation of incremental speech recognition and diarization for conversational AI
JP2024502946A (ja) 音声認識トランスクリプトの句読点付け及び大文字化
KR20190064314A (ko) 지능형 대화 에이전트를 위한 대화 태스크 처리 방법 및 그 장치
US20220399013A1 (en) Response method, terminal, and storage medium
US8868419B2 (en) Generalizing text content summary from speech content
JP2021096847A (ja) ユーザの発言に基づくマルチメディア推奨
CN115935182A (zh) 模型训练方法、多轮对话中的话题分割方法、介质及装置
KR20190074508A (ko) 챗봇을 위한 대화 모델의 데이터 크라우드소싱 방법
CN108962228B (zh) 模型训练方法和装置
CN110909135A (zh) 对话代理的操作方法和对话代理设备
CN114328867A (zh) 一种人机对话中智能打断的方法及装置
US11544504B1 (en) Dialog management system
KR20220040813A (ko) 인공지능 음성의 컴퓨팅 탐지 장치
CN112102807A (zh) 语音合成方法、装置、计算机设备和存储介质
CN114758665B (zh) 音频数据增强方法、装置、电子设备及存储介质
KR20200119035A (ko) 대화 시스템, 전자장치 및 대화 시스템의 제어 방법
US11646035B1 (en) Dialog management system
CN111414468A (zh) 话术选择方法、装置和电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22891559

Country of ref document: EP

Kind code of ref document: A1