WO2023222090A1 - Deep learning-based information push method and device - Google Patents

Deep learning-based information push method and device

Info

Publication number
WO2023222090A1
WO2023222090A1 (PCT/CN2023/095083)
Authority
WO
WIPO (PCT)
Prior art keywords
text
data
speech
text data
historical
Prior art date
Application number
PCT/CN2023/095083
Other languages
English (en)
French (fr)
Inventor
曾谁飞
孔令磊
张景瑞
李敏
刘卫强
Original Assignee
青岛海尔电冰箱有限公司
海尔智家股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 青岛海尔电冰箱有限公司 and 海尔智家股份有限公司
Publication of WO2023222090A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/953 Querying, e.g. by the use of web search engines
    • G06F 16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems

Definitions

  • the present invention relates to the field of computer technology, and specifically to an information push method and device based on deep learning.
  • the purpose of the present invention is to provide an information push method and device based on deep learning.
  • the present invention provides an information push method based on deep learning, which includes the steps:
  • Fusion features are obtained by fusing the text features of the real-time speech data and the text features of the historical text data;
  • the transcribing the real-time speech data into speech text data, and extracting the text features of the speech text data specifically include:
  • the first speech text vector is input into a bidirectional long short-term memory (BiLSTM) network model to obtain a speech text context feature vector containing context feature information based on the speech text data.
  • the extraction of real-time voice data features specifically includes:
  • extracting text features of the historical text data specifically includes:
  • the historical text word vector is input into a bidirectional long short-term memory network model to obtain a historical text context feature vector containing contextual feature information based on the historical text data.
  • the text features of the speech text data and the historical text data are enhanced.
  • the text features of the speech text data and historical text data are enhanced, specifically including:
  • fusing the text features of the real-time speech data and the text features of the historical text data into a fusion feature vector specifically includes:
  • the fusion feature vector is obtained by fusing the speech text attention feature vector and the historical text attention feature vector.
  • the entity extraction and intent recognition of the fused features are performed to generate a session state tracking task, which specifically includes:
  • the fused feature vector is input into a combined model of a bidirectional long short-term memory network and a convolutional neural network to perform entity extraction and intent recognition to generate the session state tracking task.
  • computing result information based on the session state tracking task specifically includes:
  • the step of transcribing the real-time voice data into voice text data, extracting text features of the voice text data, and extracting text features of the historical text data also includes:
  • Obtain the configuration data stored in the external cache, and perform deep neural network calculations on the speech text data and the historical text data based on the configuration data to perform text transcription and extract text features.
  • the obtaining of real-time voice data specifically includes:
  • the real-time voice data transmitted from the client terminal is obtained.
  • obtaining historical text data specifically includes:
  • Preprocessing the real-time voice data includes: framing and windowing the real-time voice data,
  • Preprocessing the historical text data includes: cleaning, annotating, word-segmenting, and removing stop words from the speech text data.
  • the outputting the result information includes:
  • obtaining the context information and weight information of the real-time voice data and the historical text data specifically includes:
  • Obtain the configuration data stored in the external cache, and perform deep neural network calculations on the speech text data and the historical text data based on the configuration data to obtain the context information and weight information of the real-time speech data and the historical text data.
  • the present invention also provides an information push device based on deep learning, including:
  • Data acquisition module, used to acquire real-time voice data and historical text data;
  • a transcription module, used to transcribe the real-time voice data into voice text data;
  • Feature extraction module, used to extract text features of the speech text data and extract text features of the historical text data;
  • a fusion module used to fuse the text features of the real-time speech data and the text features of the historical text data to obtain fusion features
  • a result calculation module configured to perform entity extraction and intent recognition on the fusion features to generate a session state tracking task, and calculate result information based on the session state tracking task;
  • An output module is used to output the result information.
  • the present invention completes the task of identifying and classifying the acquired speech data, and by acquiring historical text data and using it as part of the data set of the pre-training and prediction models, text semantic feature information is obtained more comprehensively.
  • the historical text data is used as supplementary data to make up for the problem of speech data carrying less textual semantic information, effectively improving the accuracy of text classification and thereby the accuracy of related information push.
  • moreover, the accuracy of real-time speech recognition is improved by constructing a neural network model that integrates an ASR component and a contextual information mechanism; by constructing a neural network model that integrates a contextual information mechanism and self- and mutual-attention mechanisms, text semantic feature information can be extracted more fully.
  • the overall model structure has excellent deep learning representation capabilities, high speech recognition accuracy, and high accuracy in classifying speech text, which greatly improves the accuracy and efficiency of information push.
  • Figure 1 is a structural block diagram of a model involved in an information push method based on deep learning in an embodiment of the present invention.
  • Figure 2 is a schematic diagram of the steps of an information push method based on deep learning in an embodiment of the present invention.
  • Figure 3 is a schematic diagram of the steps of acquiring real-time voice data and acquiring historical text data in an embodiment of the present invention.
  • Figure 4 is a schematic diagram of the steps of transcribing the real-time voice data into voice text data and extracting text features of the voice text data in an embodiment of the present invention.
  • Figure 5 is a schematic diagram of the steps of extracting text features of the historical text data in an embodiment of the present invention.
  • Figure 6 is a schematic structural diagram of an information push device based on deep learning in an embodiment of the present invention.
  • this article uses terms indicating relative positions in space, such as “upper”, “lower”, “back”, “front”, etc., to describe one unit or feature shown in the drawings relative to another.
  • Spatially relative terms may refer to different orientations of the device in use or operation in addition to the orientation illustrated in the figures. For example, if the device in the diagram is turned over, elements described as “below” or “above” other elements or features would then be oriented “below” or “above” the other elements or features.
  • the exemplary term “below” may encompass both spatial orientations, below and above.
  • FIG. 1 is a structural block diagram of the model involved in an information push method based on deep learning provided by the present invention.
  • FIG. 2 is a schematic diagram of the steps of the information push method based on deep learning, which includes:
  • S1 Acquire real-time voice data and historical text data.
  • S2 Transcribe the real-time voice data into voice text data, and extract text features of the voice text data.
  • S3 Extract text features of the historical text data.
  • S4 Fuse the text features of the real-time voice data and the text features of the historical text data to obtain fused features.
  • S5 Perform entity extraction and intent recognition on the fused features to generate a session state tracking task.
  • S6 Calculate result information based on the session state tracking task.
  • S7 Output the result information.
  • the method provided by the present invention can be used by an intelligent electronic device to implement functions such as real-time interaction or message push with the user based on the user's real-time voice input.
  • a smart refrigerator is taken as an example, and the method is explained in combination with a pre-trained deep learning model.
  • based on the user's voice input, the smart refrigerator classifies the text content corresponding to the user's voice, and calculates the result information that needs to be output based on the classification result information.
  • step S1 it specifically includes:
  • the real-time voice data transmitted from the client terminal is obtained.
  • the real-time voice mentioned here refers to the inquiry or instructional statements currently spoken by the user to the intelligent electronic device or to the client terminal device that is communicatively connected to the intelligent electronic device.
  • the user can ask questions such as "What vegetables are in the refrigerator today?" or "What recipes are recommended today?", or issue commands such as "Remind me about the yogurt in the refrigerator that is about to expire" or "List the fruits in season".
  • the processor of the smart refrigerator performs voice recognition through the method provided by the present invention, and then performs real-time voice interaction or pushes relevant information with the user.
  • the historical text data mentioned here refers to the speech text data transcribed from the user's real-time speech during previous use; furthermore, it may also include historical text data input by the user himself. Specifically, in this embodiment, it may include: text transcribed from the questions and instructions the user asked or issued in the past; text transcribed from explanatory speech uttered by the user about items put in during past use, such as "I put in a watermelon today" or "There are 3 bottles of yogurt left in the refrigerator"; text transcribed from the user's past comments on ingredients, such as "The chili I put in today is very spicy" or "A certain brand of yogurt tastes good"; or other text data entered by the user during past use. In different implementations, one or more of the above historical texts can be selected as needed as the historical text data required by this method.
  • the user's real-time voice can be collected through voice collection devices such as pickups and microphone arrays installed in the smart refrigerator; during use, when the user needs to interact with the smart refrigerator, they can simply speak to it directly.
  • the transmitted real-time voice of the user can also be obtained through the client terminal connected to the smart refrigerator based on the wireless communication protocol.
  • the client terminal is an electronic device with an information sending function, such as a mobile phone, tablet computer, smart speaker, smart bracelet, Bluetooth headset, or other smart electronic device; during use, the user speaks directly to the client terminal, which collects the voice and transmits it to the smart refrigerator through wireless communication such as Wi-Fi or Bluetooth.
  • when users have interaction needs, they can send real-time voice through any convenient channel, which can significantly improve user convenience.
  • one or more of the above real-time voice acquisition methods may also be used, or the real-time voice may be acquired through other channels based on existing technology, and the present invention does not impose specific limitations on this.
  • the historical text data can be obtained by reading the historical text stored in the internal memory of the smart refrigerator. Moreover, the historical text data can also be obtained by reading the historical text stored in the external storage device configured in the smart refrigerator.
  • the external storage device is a removable storage device such as a USB flash drive or SD card; configuring an external storage device can further expand the smart refrigerator's storage space.
  • the historical text data stored in a client terminal such as a mobile phone or a tablet computer or an application software server can also be obtained.
  • the realization of multi-channel historical text acquisition channels can greatly increase the data volume of historical text information, thereby improving the accuracy of subsequent speech recognition.
  • one or more of the above methods for obtaining historical text data may also be used, or the historical text data may be obtained through other channels based on existing technologies; the present invention places no specific limit on this.
  • the smart refrigerator is configured with an external cache, and at least part of the historical text data is stored in the external cache. As usage time increases, the historical text data grows; by storing part of the data in the external cache, the internal storage space of the smart refrigerator can be saved, and when performing neural network calculations, the historical text data stored in the external cache can be read directly, which can improve algorithm efficiency.
  • the Redis component is used as the external cache.
  • Redis is a currently widely used distributed cache system with a key/value storage structure; it can serve as a database, cache, and message queue broker.
  • other external caches such as Memcached may also be used in other embodiments of the present invention, and the present invention places no specific limitations on this.
  • through steps S11 and S12, real-time voice data and historical text data containing item information can be flexibly obtained through multiple channels, which not only improves the user experience, but also ensures the amount of data and effectively improves algorithm efficiency.
  • step S1 also includes the step of preprocessing the data, which includes:
  • S13 Preprocess the real-time voice data, including: performing frame processing and windowing processing on the real-time voice data.
  • S14 Preprocess the historical text data, including cleaning, annotating, word segmenting, and removing stop words on the speech text data.
  • in step S13, the speech is segmented according to a specified length (time period or number of samples) and structured into a programmable data structure, completing the framing of the speech to obtain speech signal data. Then, the speech signal data is multiplied by a window function so that the originally non-periodic speech signal exhibits some characteristics of a periodic function, completing the windowing process. Furthermore, pre-emphasis processing can be performed before framing to emphasize the high-frequency part of the speech and eliminate the influence of lip radiation during voicing, thereby compensating for the high-frequency part of the speech signal suppressed by the articulation system and highlighting the high-frequency formants.
  • after windowing, steps such as filtering audio noise and enhancing the vocal signal can be performed to complete the enhancement of the real-time voice data, extract the characteristic parameters of the real-time voice, and make the real-time voice data meet the input requirements of subsequent neural network models.
  • in step S14, irrelevant data and duplicate data in the historical text data set are deleted, and outlier and missing-value data are processed, initially screening out information irrelevant to the classification and cleaning the historical text data. Then, the historical text data is labeled with category labels using methods based on rule statistics, and is segmented into words using word segmentation methods based on string matching, understanding, statistics, and rules. Afterwards, stop words are removed and the preprocessing of the historical text data is completed, so that the historical text data meets the input requirements of the subsequent neural network model.
  • step S13 and step S14 the specific algorithm used to preprocess the real-time voice data and the historical text data may refer to the current technology in the field, and will not be described again here.
  • step S2 it specifically includes the following steps:
  • S23 Input the first speech text vector into a bidirectional long short-term memory network model to obtain a speech text context feature vector containing context feature information based on the speech text data.
  • in step S21, extracting the real-time voice data features specifically includes:
  • extracting the real-time speech data features to obtain their Mel-scale Frequency Cepstral Coefficient (MFCC) features.
  • for example, step S21 may include:
  • subjecting the preprocessed real-time speech data to a fast Fourier transform to obtain the energy spectrum of each frame of the real-time speech signal, and passing the energy spectrum through a set of Mel-scale triangular filter banks to smooth the spectrum and eliminate the effect of harmonics, highlighting the formants of the real-time speech; the MFCC features are then obtained through further logarithmic operations and a discrete cosine transform.
  • in other embodiments, characteristic parameters such as the Perceptual Linear Predictive (PLP) or Linear Predictive Coding (LPC) features of the real-time speech data can also be obtained through different algorithm steps in place of the MFCC features.
  • in step S22, the text content of the real-time speech data is transcribed through the deep neural network model of the Automatic Speech Recognition (ASR) component to obtain the first speech text vector.
  • speech recognition is completed through a deep neural network model.
  • compared with models such as the Gaussian mixture model, the deep neural network model avoids the assumption that acoustic features must obey independent and identical distributions.
  • unlike the network input in a Gaussian mixture model, the input to the deep neural network model is obtained by splicing and overlapping several adjacent frames, so that it can better utilize context information, obtain more speech feature information, and achieve higher speech recognition accuracy.
  • the Bi-directional Long Short-Term Memory (BiLSTM) network is composed of a forward Long Short-Term Memory (LSTM) network and a backward LSTM network.
  • the LSTM model can better capture the long-distance dependencies of text semantics, and on this basis the BiLSTM model can better capture the bidirectional semantics of text.
  • the first speech text vector is input into the BiLSTM model; after the forward LSTM and backward LSTM, the hidden-layer states representing effective information output at each time step are obtained, and the speech text context feature vector carrying contextual information is output.
  • the real-time speech data can also be transcribed into the speech text data by constructing neural network models of other structures or by using models such as Gaussian mixture models, as long as the real-time speech data can be transcribed into the voice text data.
  • step S2 may also include:
  • Obtain the configuration data stored in the external cache, and perform deep neural network calculations on the speech text data based on the configuration data to perform text transcription and extract text features.
  • by storing the model configuration data in the external cache, it can be read and updated quickly and efficiently, thereby improving computing efficiency and effectively solving problems such as time response and spatial computation complexity caused by the large amount of historical text data, thereby improving the user experience.
  • Redis component can be used as the external cache.
  • the text transcription and feature extraction of the real-time voice data are thus completed through step S2.
  • step S3 it specifically includes:
  • S32 Input the historical text word vector into a bidirectional long short-term memory network model to obtain a historical text context feature vector containing contextual feature information based on the historical text data.
  • in step S31, in order to convert the text data into a vectorized form that can be recognized and processed by the computer, the historical text data can be converted into the historical text word vectors through the Word2Vec algorithm, or the word vectors can be obtained through other existing algorithms in this field such as the GloVe algorithm; the present invention places no specific restrictions on this.
  • in step S32, similar to the above, the BiLSTM model is used to obtain the historical text context feature vector carrying contextual information.
  • a common recurrent network model in the field such as a Gated Recurrent Unit (GRU) network can also be used to extract contextual feature information, and the present invention does not impose specific limitations on this.
  • GRU Gated Recurrent Unit
  • step S3 may also include:
  • Obtain the configuration data stored in the external cache, and perform deep neural network calculations on the historical text data based on the configuration data to extract text features.
  • Redis component can be used as the external cache.
  • through steps S2 and S3, the feature extraction of the speech text data and the historical text data is completed respectively, different semantic feature information is obtained, and contextual feature information is extracted, thereby improving the accuracy of item classification, avoiding the loss or filtering of useful information, and improving the performance of the model.
  • step S3 the following steps are also included:
  • S3a Based on the attention mechanism model, enhance the text features of the speech text data and the historical text data.
  • step S3a includes:
  • the attention mechanism can guide the neural network to focus on more critical information and suppress other non-critical information; therefore, by introducing the attention mechanism, the local key features or weight information of the output text data can be obtained, further reducing the irregular error alignment of sequences during model training.
  • the input speech text context feature vector and the historical text context feature vector are given their own weight information, thereby better obtaining the internal weight information of the text semantic features of the speech text data and the historical text data, enhancing the importance of different parts of the text semantic feature information and further improving the interpretability of the model.
  • on the basis of step S3a, a mutual attention mechanism model can also be used to assign correlation weight information to the speech text context feature vector and the historical text context feature vector, thereby obtaining correlation weight information between the speech text data and the historical text data.
  • other algorithm models can be used to enhance the text features of the speech text data and the historical text data.
  • the order of the layers of the deep neural network can be adjusted or some layers omitted as needed, as long as the text classification of the speech text data and the historical text data can be completed; the present invention places no specific limit on this.
  • step S4 it specifically includes:
  • the fusion feature vector is obtained by fusing the speech text attention feature vector and the historical text attention feature vector.
  • the fusion feature vector of multi-modal fusion integrates the optimal representation capabilities such as context information of text semantics and historical data features, and has rich semantic feature information, so that excellent text and speech representation capabilities can be obtained.
  • step S4 may also be:
  • the speech text attention feature vector and the historical text attention feature vector are jointly mapped to a unified multi-modal vector space for joint representation to obtain the joint feature vector.
  • Multi-modal fusion and multi-modal joint feature representation are intended to combine the real-time speech data and the historical text to better extract and represent the feature information of both.
  • step S5 it specifically includes:
  • the fused feature vector is input into a combined model of a bidirectional long short memory network and a convolutional neural network to perform entity extraction and intent recognition to generate the session state tracking task.
  • in step S5, the convolutional neural network used consists of two convolutional layers and one max-pooling layer.
  • step S6 it specifically includes:
  • the next action information to be executed is calculated to obtain the result information used for feedback.
  • the method provided by the present invention sequentially completes the recognition and classification tasks of the acquired speech data through the above steps, and by acquiring the historical text data and using it as part of the data set of the pre-training and prediction models, the text semantic feature information is obtained more comprehensively.
  • the historical text data is used as supplementary data to make up for the problem of the speech data carrying less textual semantic information.
  • the accuracy of text classification is effectively improved.
  • the accuracy of real-time speech recognition is improved by constructing a neural network model that integrates ASR components and contextual information mechanisms; by constructing a neural network model that integrates contextual information mechanisms and self- and mutual-attention mechanisms, text semantic feature information can be extracted more fully.
  • the calculation efficiency of the model is improved by obtaining externally stored configuration data for calculation.
  • the overall model structure has excellent deep learning representation capabilities, which improves the accuracy of speech text classification, thus greatly improving the accuracy and generalization ability of classifying item categories.
  • step S7 it specifically includes:
  • in step S7, after the classification result information is obtained through the previous steps and the result information is determined, it can be converted into voice and broadcast through the sound playback device built into the smart refrigerator, thereby interacting with the user by voice directly; or the result information can be converted into text and displayed directly through the display device configured on the smart refrigerator.
  • the result information can also be transmitted as voice to the client terminal for output.
  • the client terminal is an electronic device with an information receiving function; for example, the voice can be transmitted to a mobile phone, smart speaker, Bluetooth headset, or other device for broadcast, or the classification result text can be transmitted via communication such as text messages and emails to client terminals such as mobile phones and tablets, or to application software installed on the client terminal, for users to review.
  • a multi-channel and multi-type classification result information output method is realized.
  • the user is not limited to only obtaining relevant information near the smart refrigerator.
  • combined with the multi-channel and multi-type real-time voice acquisition methods provided by the present invention, the user can interact with the smart refrigerator directly and remotely, which is extremely convenient and greatly improves the user experience.
  • only one or more of the above classification result information output methods may be used, or the classification result information may be output through other channels based on existing technology; the present invention places no specific restrictions on this.
  • the present invention provides an information push method based on deep learning, which obtains real-time voice data containing item information through multiple channels; after transcribing the real-time voice data into text, it fully extracts the semantic features of the text through a deep neural network model in combination with historical text data, obtains the result information, and outputs it through multiple channels, which significantly improves the accuracy of speech recognition and item category judgment while making the interaction more convenient and diverse, greatly improving the user experience.
  • the present invention also provides an information push device 8 based on deep learning, which includes:
  • Data acquisition module 81, used to acquire real-time voice data and acquire historical text data;
  • Transcription module 82, used to transcribe the real-time voice data into voice text data;
  • Feature extraction module 83, used to extract text features of the voice text data and extract text features of the historical text data;
  • the fusion module 84 is used to fuse the text features of the real-time voice data and the text features of the historical text data to obtain fusion features;
  • the result calculation module 85 is used to perform entity extraction and intent recognition on the fusion features to generate a session state tracking task, and calculate result information based on the session state tracking task;
  • the output module 86 is used to output the result information.
  • based on the same inventive concept, the present invention also provides an electrical appliance, which includes:
  • Memory used to store executable instructions
  • the processor is configured to implement the above-mentioned deep learning-based information push method when running executable instructions stored in the memory.
  • the present invention also provides a refrigerator, which includes:
  • Memory used to store executable instructions
  • the processor is configured to implement the above-mentioned deep learning-based information push method when running executable instructions stored in the memory.
  • the present invention also provides a computer-readable storage medium that stores executable instructions, characterized in that when the executable instructions are executed by a processor, the above-mentioned deep learning-based information push method is implemented.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a deep learning-based information push method and device, relating to the field of computer technology. The method includes the steps of: acquiring real-time speech data, and acquiring historical text data; transcribing the real-time speech data into speech text data, and extracting text features of the speech text data; extracting text features of the historical text data; fusing the text features of the real-time speech data with the text features of the historical text data to obtain fused features; performing entity extraction and intent recognition on the fused features to generate a session state tracking task; computing result information based on the session state tracking task; and outputting the result information.

Description

Deep Learning-Based Information Push Method and Device

Technical Field
The present invention relates to the field of computer technology, and specifically to a deep learning-based information push method and device.
Background Art
With the rapid development of intelligent speech technology and the maturing of its application scenarios, current refrigerators generally face two problems in ingredient selection and push: first, the application software used by smart refrigerators pushes ingredient information inefficiently, resulting in a poor user experience; second, the accuracy of the pushed ingredient topics is low, or the response time is slow. These problems make it difficult to meet people's basic needs when using refrigerators in daily life, and may even lead to inaccurate or asymmetric pushed ingredient information. Therefore, how to use intelligent speech technology to push ingredient content has become a key technology and pressing problem for refrigerator intelligence and integration. In particular, as interaction between humans and machines becomes ever more frequent, simple and convenient human-computer interaction has become a fundamental feature of core AI technology and a convenience of daily life. Such interaction is inseparable from multi-modal data such as speech, text, and images; how to make good use of this multi-modal data and how to fuse the most effective feature representations, so as to provide users with a better experience, has become a key problem facing academia and industry.
Summary of the Invention
The purpose of the present invention is to provide a deep learning-based information push method and device.
The present invention provides a deep learning-based information push method, comprising the steps of:
acquiring real-time speech data, and acquiring historical text data;
transcribing the real-time speech data into speech text data, and extracting text features of the speech text data;
extracting text features of the historical text data;
fusing the text features of the real-time speech data with the text features of the historical text data to obtain fused features;
performing entity extraction and intent recognition on the fused features to generate a session state tracking task;
computing result information based on the session state tracking task;
outputting the result information.
As a further improvement of the present invention, transcribing the real-time speech data into speech text data and extracting text features of the speech text data specifically includes:
extracting features of the real-time speech data to obtain speech features;
inputting the speech features into a deep neural network model of a speech recognition component for transcription to obtain a first speech text vector;
inputting the first speech text vector into a bidirectional long short-term memory network model to obtain a speech text context feature vector containing context feature information based on the speech text data.
As a further improvement of the present invention, extracting features of the real-time speech data specifically includes:
extracting features of the real-time speech data to obtain its Mel-frequency cepstral coefficient features.
As a further improvement of the present invention, extracting text features of the historical text data specifically includes:
converting the historical text data into historical text word vectors;
inputting the historical text word vectors into a bidirectional long short-term memory network model to obtain a historical text context feature vector containing context feature information based on the historical text data.
As a further improvement of the present invention, the method further comprises the step of:
enhancing the text features of the speech text data and the historical text data based on an attention mechanism model.
As a further improvement of the present invention, enhancing the text features of the speech text data and the historical text data based on an attention mechanism model specifically includes:
inputting the speech text context feature vector and the historical text context feature vector respectively into a fusion model of a self-attention mechanism and a fully connected layer;
obtaining a speech text attention feature vector containing the speech text data's own weight information;
obtaining a historical text attention feature vector containing the historical text data's own weight information.
As a further improvement of the present invention, fusing the text features of the real-time speech data and the text features of the historical text data into a fusion feature vector specifically includes:
fusing the speech text attention feature vector and the historical text attention feature vector to obtain the fusion feature vector.
As a further improvement of the present invention, performing entity extraction and intent recognition on the fused features to generate a session state tracking task specifically includes:
inputting the fusion feature vector into a combined model of a bidirectional long short-term memory network and a convolutional neural network to perform entity extraction and intent recognition and generate the session state tracking task.
As a further improvement of the present invention, computing result information based on the session state tracking task specifically includes:
according to the session state tracking task, computing the result information used for feedback through a decision library formed from the system's own and historically accumulated entity information and intent recognition, and an engine library for executing action commands.
As a further improvement of the present invention, transcribing the real-time speech data into speech text data and extracting text features of the speech text data, and extracting text features of the historical text data, further includes:
acquiring configuration data stored in an external cache, and performing deep neural network computation on the speech text data and the historical text data based on the configuration data to carry out text transcription and extract text features.
As a further improvement of the present invention, acquiring real-time speech data specifically includes:
acquiring the real-time speech data collected by a speech collection device, and/or
acquiring the real-time speech data transmitted from a client terminal.
As a further improvement of the present invention, acquiring historical text data specifically includes:
acquiring internally stored historical text as the historical text data, and/or
acquiring externally stored historical text as the historical text data, and/or
acquiring historical text transmitted from a client terminal as the historical text data.
As a further improvement of the present invention, the method further comprises the steps of:
preprocessing the real-time speech data, including: framing and windowing the real-time speech data;
preprocessing the historical text data, including: cleaning, annotating, word-segmenting, and removing stop words from the speech text data.
As a further improvement of the present invention, outputting the result information includes:
converting the result information into speech for output, and/or
converting the result information into speech and transmitting it to a client terminal for output, and/or
converting the result information into text for output, and/or
converting the result information into text and transmitting it to a client terminal for output.
As a further improvement of the present invention, obtaining the context information and weight information of the real-time speech data and the historical text data specifically includes:
acquiring configuration data stored in an external cache, and performing deep neural network computation on the speech text data and the historical text data based on the configuration data to obtain the context information and weight information of the real-time speech data and the historical text data.
The present invention also provides a deep learning-based information push device, comprising:
a data acquisition module, used to acquire real-time speech data and acquire historical text data;
a transcription module, used to transcribe the real-time speech data into speech text data;
a feature extraction module, used to extract text features of the speech text data and extract text features of the historical text data;
a fusion module, used to fuse the text features of the real-time speech data with the text features of the historical text data to obtain fused features;
a result calculation module, used to perform entity extraction and intent recognition on the fused features to generate a session state tracking task, and to compute result information based on the session state tracking task;
an output module, used to output the result information.
The beneficial effects of the present invention are as follows. The present invention accomplishes the tasks of recognizing and classifying the acquired speech data, and by acquiring historical text data and using it as part of the dataset of the pre-training and prediction models, text semantic feature information is obtained more comprehensively. By jointly using the speech text data and the historical text data, with the historical text data serving as supplementary data, the problem of the speech data carrying relatively little textual semantic information is remedied, effectively improving text classification accuracy and thereby the accuracy of related information push. Moreover, the accuracy of real-time speech recognition is improved by constructing a neural network model that integrates an ASR component and a context information mechanism, and text semantic feature information is extracted more fully by constructing a neural network model that integrates a context information mechanism and self- and mutual-attention mechanisms. In addition, performing computation with configuration data obtained from external storage improves the computational efficiency of the model and thus reduces the response time of information push. The overall model structure has excellent deep learning representation capability, high speech recognition precision, and high accuracy in classifying speech text, greatly improving the accuracy and efficiency of information push.
Brief Description of the Drawings
Figure 1 is a structural block diagram of the model involved in a deep learning-based information push method in an embodiment of the present invention.
Figure 2 is a schematic diagram of the steps of a deep learning-based information push method in an embodiment of the present invention.
Figure 3 is a schematic diagram of the steps of acquiring real-time speech data and acquiring historical text data in an embodiment of the present invention.
Figure 4 is a schematic diagram of the steps of transcribing the real-time speech data into speech text data and extracting text features of the speech text data in an embodiment of the present invention.
Figure 5 is a schematic diagram of the steps of extracting text features of the historical text data in an embodiment of the present invention.
Figure 6 is a schematic structural diagram of a deep learning-based information push device in an embodiment of the present invention.
Detailed Description of the Embodiments
To make the purpose, technical solutions, and advantages of the present application clearer, the technical solutions of the present application are described clearly and completely below in combination with specific embodiments of the present application and the corresponding drawings. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present application.
Embodiments of the present invention are described in detail below, with examples shown in the drawings, where identical or similar reference numerals throughout denote identical or similar elements or elements with identical or similar functions. The embodiments described below with reference to the drawings are exemplary, intended only to explain the present invention, and are not to be construed as limiting the present invention.
For convenience of description, this document uses terms indicating relative spatial position, such as "upper", "lower", "rear", and "front", to describe one unit or feature shown in the drawings relative to another unit or feature. Spatially relative terms may encompass different orientations of the device in use or operation in addition to the orientation shown in the figures. For example, if the device in a figure is turned over, units described as "below" or "above" other units or features would then be oriented "below" or "above" the other units or features. Thus, the exemplary term "below" may encompass both the below and above spatial orientations.
Figure 1 shows a structural block diagram of the model involved in a deep learning-based information push method provided by the present invention, and Figure 2 shows a schematic diagram of the steps of the deep learning-based information push method, which includes:
S1: acquiring real-time speech data, and acquiring historical text data.
S2: transcribing the real-time speech data into speech text data, and extracting text features of the speech text data.
S3: extracting text features of the historical text data.
S4: fusing the text features of the real-time speech data with the text features of the historical text data to obtain fused features.
S5: performing entity extraction and intent recognition on the fused features to generate a session state tracking task.
S6: computing result information based on the session state tracking task.
S7: outputting the result information.
The method provided by the present invention can be used by an intelligent electronic device to implement functions such as real-time interaction or message push with the user based on the user's real-time speech input. By way of example, in this embodiment a smart refrigerator is taken as the example, and the method is explained in combination with a pre-trained deep learning model. Based on the user's speech input, the smart refrigerator classifies the text content corresponding to the user's speech, and computes the result information to be output according to the classification result information.
As shown in Figure 3, step S1 specifically includes:
S11: acquiring the real-time speech data collected by a speech collection device, and/or
acquiring the real-time speech data transmitted from a client terminal.
S12: acquiring internally stored historical text as the historical text data, and/or
acquiring externally stored historical text as the historical text data, and/or
acquiring historical text transmitted from a client terminal as the historical text data.
The real-time speech referred to here means the inquiry or instruction statements currently spoken by the user to the intelligent electronic device or to a client terminal device communicatively connected to the intelligent electronic device. In this embodiment, for example, the user may ask questions such as "What vegetables are in the refrigerator today?" or "What recipes are recommended today?", or issue commands such as "Remind me about the yogurt in the refrigerator that is about to expire" or "List the fruits in season". Based on the above information, the processor of the smart refrigerator performs speech recognition through the method provided by the present invention, and then conducts real-time voice interaction with the user or pushes relevant information.
The historical text data referred to here means the speech text data transcribed from the user's real-time speech during past use; further, it may also include historical text data entered by the user. Specifically, in this embodiment it may include: text transcribed from the user's past questions and instructions; text transcribed from explanatory speech uttered by the user about items placed in the refrigerator during past use, such as "I put in a watermelon today" or "There are 3 bottles of yogurt left in the refrigerator"; text transcribed from the user's past comments on ingredients, such as "The chili I put in today is very spicy" or "A certain brand of yogurt tastes good"; or other text data entered by the user during past use. In different embodiments, one or more of the above historical texts can be selected as needed as the historical text data required by this method.
As described in step S11, in this embodiment the user's real-time speech can be collected by speech collection devices such as pickups and microphone arrays installed in the smart refrigerator; during use, when the user needs to interact with the smart refrigerator, they can simply speak to it directly. In addition, the transmitted real-time speech of the user can also be obtained through a client terminal connected to the smart refrigerator via a wireless communication protocol. The client terminal is an electronic device with an information sending function, such as a mobile phone, tablet computer, smart speaker, smart bracelet, Bluetooth headset, or other intelligent electronic device; during use, the user speaks directly to the client terminal, which collects the speech and transmits it to the smart refrigerator via wireless communication such as Wi-Fi or Bluetooth. This realizes multi-channel real-time speech acquisition, so the user is not limited to speaking toward the smart refrigerator. When users have interaction needs, they can send real-time speech through any convenient channel, which significantly improves convenience. In other embodiments of the present invention, one or more of the above real-time speech acquisition methods may be used, or the real-time speech may be acquired through other channels based on existing technology; the present invention places no specific limitation on this.
As described in step S12, in this embodiment the historical text data can be obtained by reading historical text stored in the internal memory of the smart refrigerator. The historical text data can also be obtained by reading historical text stored in an external storage device configured for the smart refrigerator; the external storage device is a removable storage device such as a USB flash drive or SD card, and configuring an external storage device can further expand the smart refrigerator's storage space. The historical text data stored on a client terminal such as a mobile phone or tablet computer, or on an application software server, can also be obtained. Realizing multi-channel historical text acquisition can greatly increase the amount of historical text data, thereby improving the accuracy of subsequent speech recognition. In other embodiments of the present invention, one or more of the above methods of obtaining historical text data may be used, or the historical text data may be obtained through other channels based on existing technology; the present invention places no specific limitation on this.
Further, in this embodiment the smart refrigerator is configured with an external cache, and at least part of the historical text data is stored in the external cache. As usage time increases, the historical text data grows; storing part of the data in the external cache saves internal storage space in the smart refrigerator, and when performing neural network computations, the historical text data stored in the external cache can be read directly, which improves algorithm efficiency.
Specifically, in this embodiment the Redis component is used as the external cache. Redis is a currently widely used distributed caching system with a key/value storage structure, which can serve as a database, cache, and message queue broker. Other external caches such as Memcached may also be used in other embodiments of the present invention; the present invention places no specific limitation on this.
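To make the external-cache idea concrete, the following is a minimal Python sketch using the redis-py client; the host, key names, and stored values are illustrative assumptions rather than details from the patent.

```python
import json
import redis

# Connect to the external Redis cache (host and port are illustrative assumptions).
cache = redis.Redis(host="localhost", port=6379, db=0)

# Store model configuration data and part of the historical text data externally.
cache.set("model:config", json.dumps({"hidden_size": 128, "num_layers": 2}))
cache.rpush("history:texts", "I put in a watermelon today",
            "There are 3 bottles of yogurt left in the refrigerator")

# At computation time, read the configuration and cached history directly.
config = json.loads(cache.get("model:config"))
history = [t.decode("utf-8") for t in cache.lrange("history:texts", 0, -1)]
```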
In summary, in steps S11 and S12, real-time speech data and historical text data containing item information can be flexibly acquired through multiple channels, which improves the user experience while ensuring the amount of data and effectively improving algorithm efficiency.
Further, step S1 also includes steps of preprocessing the data, which include:
S13: preprocessing the real-time speech data, including: framing and windowing the real-time speech data.
S14: preprocessing the historical text data, including: cleaning, annotating, word-segmenting, and removing stop words from the speech text data.
Specifically, in step S13, the speech is segmented according to a specified length (a time period or number of samples) and structured into a programmable data structure, completing the framing of the speech to obtain speech signal data. Next, the speech signal data is multiplied by a window function, so that the originally non-periodic speech signal exhibits some characteristics of a periodic function, completing the windowing. Further, pre-emphasis can be performed before framing to emphasize the high-frequency part of the speech and eliminate the influence of lip radiation during voicing, thereby compensating for the high-frequency components of the speech signal suppressed by the articulation system and highlighting the high-frequency formants. After windowing, steps such as filtering audio noise and enhancing the vocal signal can also be performed, completing the enhancement of the real-time speech data, extracting the characteristic parameters of the real-time speech, and making the real-time speech data meet the input requirements of the subsequent neural network model.
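The framing and windowing just described can be sketched as follows; this is a minimal NumPy illustration assuming a signal of at least one full frame, with 25 ms frames and 10 ms hops at a 16 kHz sampling rate, not the exact implementation of this embodiment.

```python
import numpy as np

def preprocess_speech(signal, frame_len=400, hop=160, alpha=0.97):
    """Pre-emphasis, framing, and Hamming windowing of a 1-D speech signal."""
    # Pre-emphasis: boost the high-frequency part suppressed by the articulation system.
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Framing: segment the signal into fixed-length, overlapping frames.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop:i * hop + frame_len] for i in range(n_frames)])
    # Windowing: multiply each frame by a window function so the non-periodic
    # signal exhibits some characteristics of a periodic function.
    return frames * np.hamming(frame_len)
```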
Specifically, in step S14, irrelevant data and duplicate data in the historical text dataset are deleted, and outlier and missing-value data are handled, initially filtering out information irrelevant to classification and cleaning the historical text data. Next, the historical text data is labeled with category labels using methods such as rule-based statistics, and is segmented into words using methods such as string-matching-based, understanding-based, statistics-based, and rule-based word segmentation. Afterwards, stop words are removed, completing the preprocessing of the historical text data so that it meets the input requirements of the subsequent neural network model.
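A minimal sketch of the cleaning, word segmentation, and stop-word removal pipeline is given below, assuming the jieba segmentation library and an illustrative stop-word list; category labeling and deduplication are omitted for brevity.

```python
import re
import jieba  # a common Chinese word-segmentation library; any segmenter could be substituted

STOP_WORDS = {"的", "了", "吗", "啊"}  # illustrative stop-word list

def preprocess_text(doc):
    """Clean one historical text entry, segment it, and remove stop words."""
    doc = re.sub(r"\s+", " ", doc).strip()   # cleaning: collapse whitespace
    tokens = jieba.lcut(doc)                 # word segmentation
    return [t for t in tokens if t.strip() and t not in STOP_WORDS]

print(preprocess_text("今天放入了一个西瓜"))
```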
For the specific algorithms used in steps S13 and S14 to preprocess the real-time speech data and the historical text data, reference may be made to the existing technology in this field, which will not be repeated here.
As shown in Figure 4, step S2 specifically includes the steps:
S21: extracting features of the real-time speech data to obtain speech features.
S22: inputting the speech features into a deep neural network model of a speech recognition component for transcription to obtain a first speech text vector.
S23: inputting the first speech text vector into a bidirectional long short-term memory network model to obtain a speech text context feature vector containing context feature information based on the speech text data.
In step S21, extracting features of the real-time speech data specifically includes:
extracting features of the real-time speech data to obtain its Mel-scale Frequency Cepstral Coefficient (MFCC) features. MFCCs are a discriminative component of speech signals: cepstral parameters extracted in the Mel-scale frequency domain, where the Mel scale describes the nonlinear frequency characteristics of human hearing. MFCC parameters take into account the human ear's sensitivity to different frequencies, making them particularly suitable for speech recognition and speaker identification.
By way of example, step S21 may include:
subjecting the preprocessed real-time speech data to a fast Fourier transform to obtain the energy spectrum of each frame of the real-time speech signal, passing the energy spectrum through a set of Mel-scale triangular filter banks to smooth the spectrum and eliminate the effect of harmonics, thereby highlighting the formants of the real-time speech, and then obtaining the MFCC features through further logarithmic operations and a discrete cosine transform.
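For illustration, this MFCC pipeline is available off the shelf; the sketch below uses the librosa library, and the file name and parameter values are assumptions.

```python
import librosa

# Load speech at 16 kHz and extract 13 MFCCs; librosa internally performs the
# FFT, Mel-scale filter-bank smoothing, logarithmic operation, and discrete
# cosine transform described above.
y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)
print(mfcc.shape)  # (13, number_of_frames)
```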
In other embodiments of the present invention, characteristic parameters such as the Perceptual Linear Predictive (PLP) features or Linear Predictive Coding (LPC) features of the real-time speech data may also be obtained through different algorithm steps in place of the MFCC features; the specific choice can be made based on the actual model parameters and the field in which this method is applied, and the present invention places no specific limitation on this.
For the specific algorithm steps involved in the above steps, reference may be made to the existing technology in this field, which will not be repeated here.
In step S22, the text content of the real-time speech data is transcribed through the deep neural network model of the Automatic Speech Recognition (ASR) component to obtain the first speech text vector.
In this embodiment, speech recognition is accomplished through a deep neural network model. Compared with models such as the Gaussian mixture model commonly used in the prior art, the deep neural network model avoids the assumption that acoustic features must obey independent and identical distributions; unlike the network input in a Gaussian mixture model, the input to the deep neural network model is obtained by splicing and overlapping several adjacent frames, so that it can better utilize context information, obtain more speech feature information, and achieve higher speech recognition accuracy.
In step S23, the Bi-directional Long Short-Term Memory (BiLSTM) network is composed of a forward Long Short-Term Memory (LSTM) network and a backward LSTM network. The LSTM model can better capture long-distance dependencies in text semantics, and on this basis the BiLSTM model can better capture the bidirectional semantics of text. The first speech text vector is input into the BiLSTM model; after the forward LSTM and backward LSTM, the hidden-layer states representing effective information output at each time step are obtained, and the speech text context feature vector carrying contextual information is output.
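A minimal PyTorch sketch of such a BiLSTM context encoder follows; all dimensions are illustrative assumptions rather than the embodiment's actual configuration.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """BiLSTM that maps a sequence of text vectors to context feature vectors."""
    def __init__(self, input_size=128, hidden_size=64):
        super().__init__()
        self.bilstm = nn.LSTM(input_size, hidden_size,
                              batch_first=True, bidirectional=True)

    def forward(self, x):            # x: (batch, seq_len, input_size)
        out, _ = self.bilstm(x)      # forward and backward hidden states per step
        return out                   # (batch, seq_len, 2 * hidden_size)

encoder = ContextEncoder()
first_speech_text_vector = torch.randn(1, 20, 128)   # placeholder input
context_features = encoder(first_speech_text_vector)
```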
In other embodiments of the present invention, the real-time speech data may also be transcribed into the speech text data by constructing neural network models of other structures or by using models such as Gaussian mixture models, as long as the real-time speech data can be transcribed into the speech text data.
Further, in this embodiment, step S2 may also include:
acquiring configuration data stored in the external cache, and performing deep neural network computation on the speech text data based on the configuration data to carry out text transcription and extract text features.
Storing the configuration data of the relevant models in the external cache, together with an interface for accessing the external cache, allows the model configuration data to be read and updated quickly and efficiently, improving computational efficiency and effectively resolving problems such as the time response and spatial computation complexity caused by the large amount of historical text data, thereby improving the user experience.
Similarly to the above, the Redis component can be used as the external cache.
In summary, the text transcription and feature extraction of the real-time speech data are completed through step S2.
As shown in Figure 5, step S3 specifically includes:
S31: converting the historical text data into historical text word vectors.
S32: inputting the historical text word vectors into a bidirectional long short-term memory network model to obtain a historical text context feature vector containing context feature information based on the historical text data.
In step S31, in order to convert the text data into a vectorized form that can be recognized and processed by a computer, the historical text data can be converted into the historical text word vectors through the Word2Vec algorithm, or the word vectors can be obtained through other existing algorithms in this field such as the GloVe algorithm; the present invention places no specific limitation on this.
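A short gensim sketch of producing such word vectors with Word2Vec is shown below; the toy corpus and parameters are illustrative.

```python
from gensim.models import Word2Vec

# Train word vectors on segmented historical texts (toy corpus for illustration).
corpus = [["今天", "放入", "一个", "西瓜"],
          ["冰箱", "还剩", "3", "瓶", "酸奶"]]
w2v = Word2Vec(sentences=corpus, vector_size=128, window=5, min_count=1, sg=1)
vector = w2v.wv["西瓜"]   # a 128-dimensional historical text word vector
```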
In step S32, similarly to the above, the historical text context feature vector carrying contextual information is obtained through the BiLSTM model.
In other embodiments of the present invention, recurrent network models common in this field, such as a Gated Recurrent Unit (GRU) network, may also be used to extract the context feature information; the present invention places no specific limitation on this.
Further, in this embodiment, step S3 may also include:
acquiring configuration data stored in the external cache, and performing deep neural network computation on the historical text data based on the configuration data to extract text features.
Similarly to the above, the Redis component can be used as the external cache.
Thus, through steps S2 and S3, feature extraction of the speech text data and the historical text data is completed respectively, different semantic feature information is obtained, and context feature information is further extracted, improving the accuracy of item classification, avoiding the loss or filtering of useful information, and improving the performance of the model.
Further, in some embodiments of the present invention, the following step is also included after step S3:
S3a: enhancing the text features of the speech text data and the historical text data based on an attention mechanism model.
Specifically, step S3a includes:
inputting the speech text context feature vector and the historical text context feature vector respectively into a fusion model of a self-attention mechanism and a fully connected layer;
obtaining a speech text attention feature vector containing the speech text data's own weight information;
obtaining a historical text attention feature vector containing the historical text data's own weight information.
The attention mechanism can guide the neural network to focus on more critical information and suppress other non-critical information. Therefore, by introducing the attention mechanism, the local key features or weight information of the output text data can be obtained, further reducing the irregular error alignment of sequences during model training.
Here, the model fusing a self-attention mechanism with a fully connected layer assigns the input speech text context feature vector and historical text context feature vector their own weight information, thereby better obtaining the internal weight information of the text semantic features of the speech text data and the historical text data, enhancing the importance of different parts of the text semantic feature information and further improving the interpretability of the model.
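One possible reading of this self-attention-plus-fully-connected fusion model is sketched below in PyTorch; the scoring function and dimensions are assumptions, not the exact architecture of this embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionPooling(nn.Module):
    """Scores each time step, weights the context features by their own
    importance, and projects the result through a fully connected layer."""
    def __init__(self, feat_dim=128, out_dim=128):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)    # one importance score per step
        self.fc = nn.Linear(feat_dim, out_dim)

    def forward(self, h):                      # h: (batch, seq_len, feat_dim)
        weights = F.softmax(self.score(h), dim=1)   # the vector's own weight info
        attended = (weights * h).sum(dim=1)         # weighted sum over time
        return self.fc(attended)                    # attention feature vector
```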
Further, in other embodiments of the present invention, on the basis of step S3a, a mutual-attention mechanism model may also be used to assign correlation weight information between the speech text context feature vector and the historical text context feature vector, thereby obtaining the correlation weight information between the speech text data and the historical text data; or other algorithm models may be used to enhance the text features of the speech text data and the historical text data.
In other embodiments of the present invention, the order of the layers of the deep neural network can be adjusted, or some layers omitted, as needed, as long as the text classification of the speech text data and the historical text data can be completed; the present invention places no specific limitation on this.
Step S4 specifically includes:
fusing the speech text attention feature vector and the historical text attention feature vector to obtain the fusion feature vector. The multi-modally fused fusion feature vector integrates optimal representation capabilities such as the contextual information of text semantics and the features of historical data, and carries rich semantic feature information, so that excellent text and speech representation capability can be obtained.
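For illustration, simple concatenation is one common way to realize this fusion; the vectors below are placeholders standing in for the two attention feature vectors.

```python
import torch

speech_attn = torch.randn(1, 128)     # speech text attention feature vector
history_attn = torch.randn(1, 128)    # historical text attention feature vector
fusion_vector = torch.cat([speech_attn, history_attn], dim=-1)  # (1, 256)
```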
It should be noted that in current neural network models there is no longer a clear boundary between multi-modal fusion and multi-modal joint feature representation. Therefore, in some embodiments of the present invention, step S4 may also be: jointly mapping the speech text attention feature vector and the historical text attention feature vector into a unified multi-modal vector space for joint representation to obtain the joint feature vector. Both multi-modal fusion and multi-modal joint feature representation are intended to combine the real-time speech data and the historical text so as to better extract and represent the feature information of both.
Step S5 specifically includes:
inputting the fusion feature vector into a combined model of a bidirectional long short-term memory network and a convolutional neural network to perform entity extraction and intent recognition and generate the session state tracking task.
Specifically, in this embodiment, the convolutional neural network used in step S5 consists of two convolutional layers and one max-pooling layer.
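A PyTorch sketch of a combined BiLSTM + CNN model with this stated structure (two convolutional layers and one max-pooling layer) follows; the dimensions and the two output heads for entities and intents are illustrative assumptions, and the fused features are treated as a sequence.

```python
import torch
import torch.nn as nn

class EntityIntentModel(nn.Module):
    """BiLSTM followed by two convolutional layers and one max-pooling layer,
    with separate heads for entity extraction and intent recognition."""
    def __init__(self, feat_dim=256, hidden=64, n_entities=10, n_intents=5):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.conv = nn.Sequential(
            nn.Conv1d(2 * hidden, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),           # the single max-pooling layer
        )
        self.entity_head = nn.Linear(64, n_entities)
        self.intent_head = nn.Linear(64, n_intents)

    def forward(self, fused):                  # fused: (batch, seq_len, feat_dim)
        out, _ = self.bilstm(fused)
        feats = self.conv(out.transpose(1, 2)).squeeze(-1)
        return self.entity_head(feats), self.intent_head(feats)

model = EntityIntentModel()
entity_logits, intent_logits = model(torch.randn(1, 20, 256))
```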
Step S6 specifically includes:
according to the session state tracking task, computing the next action information to be executed through a decision library formed from the system's own and historically accumulated entity information and intent recognition, and an engine library for executing action commands, to obtain the result information used for feedback.
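As a toy sketch of this decision step, one dictionary below stands in for the decision library that maps recognized (intent, entity) pairs to actions, and a second stands in for the engine library of action commands; every entry is hypothetical.

```python
# Hypothetical decision library: (intent, entity) -> action name.
DECISION_LIBRARY = {
    ("query_stock", "vegetable"): "list_vegetables",
    ("remind_expiry", "yogurt"): "push_expiry_reminder",
}

# Hypothetical engine library: action name -> executable command.
ACTION_ENGINE = {
    "list_vegetables": lambda: "Today the refrigerator has tomatoes and cucumbers.",
    "push_expiry_reminder": lambda: "The yogurt will expire in 2 days.",
}

def compute_result(intent, entity):
    """Compute the next action and return the result information for feedback."""
    action = DECISION_LIBRARY.get((intent, entity))
    return ACTION_ENGINE[action]() if action else "Sorry, no matching action."

print(compute_result("query_stock", "vegetable"))
```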
In summary, the method provided by the present invention completes, through the above steps in sequence, the tasks of recognizing and classifying the acquired speech data. By acquiring the historical text data and using it as part of the dataset of the pre-training and prediction models, text semantic feature information is obtained more comprehensively; by jointly using the speech text data and the historical text data, with the historical text data serving as supplementary data, the problem of the speech data carrying relatively little textual semantic information is remedied, effectively improving text classification accuracy. Moreover, the accuracy of real-time speech recognition is improved by constructing a neural network model that integrates an ASR component and a context information mechanism, and text semantic feature information is extracted more fully by constructing a neural network model that integrates a context information mechanism and self- and mutual-attention mechanisms. In addition, performing computation with configuration data obtained from external storage improves the computational efficiency of the model. The overall model structure has excellent deep learning representation capability and improves the accuracy of speech text classification, thereby greatly improving the accuracy and generalization capability of classifying item categories.
Step S7 specifically includes:
converting the result information into speech for output, and/or
converting the result information into speech and transmitting it to a client terminal for output, and/or
converting the result information into text for output, and/or
converting the result information into text and transmitting it to a client terminal for output.
As described in step S7, in this embodiment, after the classification result information is obtained through the preceding steps and the result information is determined, it can be converted into speech and broadcast through the sound playback device built into the smart refrigerator, thereby interacting with the user by voice directly; or the result information can be converted into text and displayed directly on the display device configured on the smart refrigerator. The result information can also be transmitted as speech to a client terminal for output; here, the client terminal is an electronic device with an information receiving function, for example transmitting the speech to a mobile phone, smart speaker, Bluetooth headset, or other device for broadcast, or transmitting the classification result text via SMS, email, or other communication to client terminals such as mobile phones and tablet computers, or to application software installed on the client terminal, for the user to review. This realizes a multi-channel, multi-type output of classification result information: the user is not limited to obtaining relevant information only near the smart refrigerator, and combined with the multi-channel, multi-type real-time speech acquisition provided by the present invention, the user can interact with the smart refrigerator directly and remotely, which is extremely convenient and greatly improves the user experience. In other embodiments of the present invention, only one or several of the above classification result output methods may be used, or the classification result information may be output through other channels based on existing technology; the present invention places no specific limitation on this.
In summary, the present invention provides a deep learning-based information push method that acquires real-time speech data containing item information through multiple channels; after transcribing the real-time speech data into text, it fully extracts text semantic features through a deep neural network model in combination with historical text data, obtains the result information, and outputs it through multiple channels, significantly improving speech recognition precision and the accuracy of item category judgment while making the interaction more convenient and diverse and greatly improving the user experience.
Based on the same inventive concept, the present invention also provides a deep learning-based information push device 8, which includes:
a data acquisition module 81, used to acquire real-time speech data and acquire historical text data;
a transcription module 82, used to transcribe the real-time speech data into speech text data;
a feature extraction module 83, used to extract text features of the speech text data and extract text features of the historical text data;
a fusion module 84, used to fuse the text features of the real-time speech data with the text features of the historical text data to obtain fused features;
a result calculation module 85, used to perform entity extraction and intent recognition on the fused features to generate a session state tracking task, and to compute result information based on the session state tracking task;
an output module 86, used to output the result information.
Based on the same inventive concept, the present invention also provides an electrical appliance, which includes:
a memory, used to store executable instructions;
a processor, used to implement the above deep learning-based information push method when running the executable instructions stored in the memory.
Based on the same inventive concept, the present invention also provides a refrigerator, which includes:
a memory, used to store executable instructions;
a processor, used to implement the above deep learning-based information push method when running the executable instructions stored in the memory.
Based on the same inventive concept, the present invention also provides a computer-readable storage medium storing executable instructions, characterized in that when the executable instructions are executed by a processor, the above deep learning-based information push method is implemented.
It should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. This manner of narration is adopted only for clarity; those skilled in the art should treat the specification as a whole, and the technical solutions in the embodiments may also be appropriately combined to form other embodiments understandable to those skilled in the art.
The series of detailed descriptions listed above are only specific descriptions of feasible embodiments of the present invention and are not intended to limit the scope of protection of the present invention; any equivalent embodiments or modifications made without departing from the technical spirit of the present invention shall be included within the scope of protection of the present invention.

Claims (16)

  1. A deep learning-based information push method, characterized by comprising the steps of:
    acquiring real-time speech data, and acquiring historical text data;
    transcribing the real-time speech data into speech text data, and extracting text features of the speech text data;
    extracting text features of the historical text data;
    fusing the text features of the real-time speech data with the text features of the historical text data to obtain fused features;
    performing entity extraction and intent recognition on the fused features to generate a session state tracking task;
    computing result information based on the session state tracking task;
    outputting the result information.
  2. The deep learning-based information push method according to claim 1, characterized in that transcribing the real-time speech data into speech text data and extracting text features of the speech text data specifically comprises:
    extracting features of the real-time speech data to obtain speech features;
    inputting the speech features into a deep neural network model of a speech recognition component for transcription to obtain a first speech text vector;
    inputting the first speech text vector into a bidirectional long short-term memory network model to obtain a speech text context feature vector containing context feature information based on the speech text data.
  3. The deep learning-based information push method according to claim 2, characterized in that extracting features of the real-time speech data specifically comprises:
    extracting features of the real-time speech data to obtain its Mel-frequency cepstral coefficient features.
  4. The deep learning-based information push method according to claim 2, characterized in that extracting text features of the historical text data specifically comprises:
    converting the historical text data into historical text word vectors;
    inputting the historical text word vectors into a bidirectional long short-term memory network model to obtain a historical text context feature vector containing context feature information based on the historical text data.
  5. The deep learning-based information push method according to claim 4, characterized by further comprising the step of:
    enhancing the text features of the speech text data and the historical text data based on an attention mechanism model.
  6. The deep learning-based information push method according to claim 5, characterized in that enhancing the text features of the speech text data and the historical text data based on an attention mechanism model specifically comprises:
    inputting the speech text context feature vector and the historical text context feature vector respectively into a fusion model of a self-attention mechanism and a fully connected layer;
    obtaining a speech text attention feature vector containing the speech text data's own weight information;
    obtaining a historical text attention feature vector containing the historical text data's own weight information.
  7. The deep learning-based information push method according to claim 6, characterized in that fusing the text features of the real-time speech data and the text features of the historical text data into a fusion feature vector specifically comprises:
    fusing the speech text attention feature vector and the historical text attention feature vector to obtain the fusion feature vector.
  8. The deep learning-based information push method according to claim 7, characterized in that performing entity extraction and intent recognition on the fused features to generate a session state tracking task specifically comprises:
    inputting the fusion feature vector into a combined model of a bidirectional long short-term memory network and a convolutional neural network to perform entity extraction and intent recognition and generate the session state tracking task.
  9. The deep learning-based information push method according to claim 7, characterized in that computing result information based on the session state tracking task specifically comprises:
    according to the session state tracking task, computing the result information used for feedback through a decision library formed from the system's own and historically accumulated entity information and intent recognition, and an engine library for executing action commands.
  10. The deep learning-based information push method according to claim 1, characterized in that transcribing the real-time speech data into speech text data and extracting text features of the speech text data, and extracting text features of the historical text data, further comprises:
    acquiring configuration data stored in an external cache, and performing deep neural network computation on the speech text data and the historical text data based on the configuration data to carry out text transcription and extract text features.
  11. The deep learning-based information push method according to claim 1, characterized in that acquiring real-time speech data specifically comprises:
    acquiring the real-time speech data collected by a speech collection device, and/or
    acquiring the real-time speech data transmitted from a client terminal.
  12. The deep learning-based information push method according to claim 1, characterized in that acquiring historical text data specifically comprises:
    acquiring internally stored historical text as the historical text data, and/or
    acquiring externally stored historical text as the historical text data, and/or
    acquiring historical text transmitted from a client terminal as the historical text data.
  13. The deep learning-based information push method according to claim 1, characterized by further comprising the steps of:
    preprocessing the real-time speech data, including: framing and windowing the real-time speech data;
    preprocessing the historical text data, including: cleaning, annotating, word-segmenting, and removing stop words from the speech text data.
  14. The deep learning-based information push method according to claim 1, characterized in that outputting the result information comprises:
    converting the result information into speech for output, and/or
    converting the result information into speech and transmitting it to a client terminal for output, and/or
    converting the result information into text for output, and/or
    converting the result information into text and transmitting it to a client terminal for output.
  15. The deep learning-based information push method according to claim 1, characterized in that obtaining the context information and weight information of the real-time speech data and the historical text data specifically comprises:
    acquiring configuration data stored in an external cache, and performing deep neural network computation on the speech text data and the historical text data based on the configuration data to obtain the context information and weight information of the real-time speech data and the historical text data.
  16. A deep learning-based information push device, characterized by comprising:
    a data acquisition module, used to acquire real-time speech data and acquire historical text data;
    a transcription module, used to transcribe the real-time speech data into speech text data;
    a feature extraction module, used to extract text features of the speech text data and extract text features of the historical text data;
    a fusion module, used to fuse the text features of the real-time speech data with the text features of the historical text data to obtain fused features;
    a result calculation module, used to perform entity extraction and intent recognition on the fused features to generate a session state tracking task, and to compute result information based on the session state tracking task;
    an output module, used to output the result information.
PCT/CN2023/095083 2022-05-20 2023-05-18 Deep learning-based information push method and device WO2023222090A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210554860.4 2022-05-20
CN202210554860.4A CN115098765A (zh) Deep learning-based information push method, apparatus, device and storage medium

Publications (1)

Publication Number Publication Date
WO2023222090A1 true WO2023222090A1 (zh) 2023-11-23

Family

ID=83289971

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/095083 WO2023222090A1 (zh) 2022-05-20 2023-05-18 基于深度学习的信息推送方法和装置

Country Status (2)

Country Link
CN (1) CN115098765A (zh)
WO (1) WO2023222090A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115098765A (zh) * 2022-05-20 2022-09-23 青岛海尔电冰箱有限公司 Deep learning-based information push method, apparatus, device and storage medium
CN116070020A (zh) * 2022-12-31 2023-05-05 青岛海尔电冰箱有限公司 Knowledge graph-based ingredient recommendation method, device and storage medium
CN116186258A (zh) * 2022-12-31 2023-05-30 青岛海尔电冰箱有限公司 Text classification method, device and storage medium based on a multi-modal knowledge graph

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210082398A1 (en) * 2019-09-13 2021-03-18 Mitsubishi Electric Research Laboratories, Inc. System and Method for a Dialogue Response Generation System
CN110675860A (zh) * 2019-09-24 2020-01-10 山东大学 Speech information recognition method and system based on an improved attention mechanism combined with semantics
CN113590769A (zh) * 2020-04-30 2021-11-02 阿里巴巴集团控股有限公司 State tracking method and apparatus in a task-driven multi-turn dialogue system
CN112560506A (zh) * 2020-12-17 2021-03-26 中国平安人寿保险股份有限公司 Text semantic parsing method, apparatus, terminal device and storage medium
CN114944156A (zh) * 2022-05-20 2022-08-26 青岛海尔电冰箱有限公司 Deep learning-based item classification method, apparatus, device and storage medium
CN115062143A (zh) * 2022-05-20 2022-09-16 青岛海尔电冰箱有限公司 Speech recognition and classification method, apparatus, device, refrigerator and storage medium
CN115098765A (zh) * 2022-05-20 2022-09-23 青岛海尔电冰箱有限公司 Deep learning-based information push method, apparatus, device and storage medium

Also Published As

Publication number Publication date
CN115098765A (zh) 2022-09-23

Similar Documents

Publication Publication Date Title
WO2023222088A1 (zh) Speech recognition and classification method and device
WO2023222089A1 (zh) Deep learning-based item classification method and device
WO2023222090A1 (zh) Deep learning-based information push method and device
CN113408385B (zh) Audio-video multi-modal emotion classification method and system
WO2021174757A1 (zh) Speech emotion recognition method and apparatus, electronic device, and computer-readable storage medium
WO2021082941A1 (zh) Video person recognition method and apparatus, storage medium, and electronic device
CN111933129A (zh) Audio processing method, language model training method, apparatus, and computer device
CN109509470A (zh) Voice interaction method, apparatus, computer-readable storage medium, and terminal device
CN110097870B (zh) Speech processing method, apparatus, device, and storage medium
CN111968679A (zh) Emotion recognition method, apparatus, electronic device, and storage medium
WO2024140434A1 (zh) Text classification method based on a multi-modal knowledge graph, device, and storage medium
WO2024140430A1 (zh) Text classification method based on multi-modal deep learning, device, and storage medium
Gupta et al. Pitch-synchronous single frequency filtering spectrogram for speech emotion recognition
WO2024140432A1 (zh) Knowledge graph-based ingredient recommendation method, device, and storage medium
CN113129867A (zh) Speech recognition model training method, speech recognition method, apparatus, and device
Kumar et al. Machine learning based speech emotions recognition system
CN116431806A (zh) Natural language understanding method and refrigerator
CN114399995A (zh) Speech model training method, apparatus, device, and computer-readable storage medium
CN117041430B (zh) Method and apparatus for improving the outbound-call quality and robustness of an intelligent coordinated outbound-call system
CN115798459B (zh) Audio processing method, apparatus, storage medium, and electronic device
CN112199498A (zh) Human-machine dialogue method, apparatus, medium, and electronic device for elderly-care services
CN112150103B (zh) Schedule setting method, apparatus, and storage medium
Sartiukova et al. Remote Voice Control of Computer Based on Convolutional Neural Network
CN114373443A (zh) Speech synthesis method and apparatus, computing device, storage medium, and program product
KR20210085182A (ko) System, server, and method for recognizing user utterance intent

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23807040

Country of ref document: EP

Kind code of ref document: A1