WO2018153200A1 - Acoustic modeling method, apparatus and storage medium based on HLSTM model - Google Patents

Acoustic modeling method, apparatus and storage medium based on HLSTM model Download PDF

Info

Publication number
WO2018153200A1
WO2018153200A1 (Application No. PCT/CN2018/073887)
Authority
WO
WIPO (PCT)
Prior art keywords
model
hlstm
training
state
lstm
Prior art date
Application number
PCT/CN2018/073887
Other languages
English (en)
French (fr)
Inventor
张鹏远
董振江
张宇
贾霞
李洁
张恒生
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2018153200A1 publication Critical patent/WO2018153200A1/zh

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/08: Speech classification or search
    • G10L 15/14: Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/16: Speech classification or search using artificial neural networks

Definitions

  • the present disclosure relates to the field of speech recognition technologies, and in particular, to an acoustic modeling method, apparatus, and storage medium based on a Highway Long Short Time Memory (HLSTM) model.
  • HLSTM Highway Long Short Time Memory
  • Subsequently, the Long Short Time Memory (LSTM) model was introduced into acoustic modeling; the LSTM model has stronger acoustic modeling capability than a simple feed-forward network. Because the amount of data keeps increasing, the number of layers of the acoustic model neural network needs to be increased to improve modeling capability. However, as the number of network layers in the LSTM model grows, training the network becomes more difficult and the vanishing-gradient problem arises. To avoid vanishing gradients, an HLSTM model based on the LSTM model was proposed, which introduces direct connections between the memory cells of adjacent layers of the LSTM model.
  • the proposed HLSTM model enables deeper network structures to be practically applied in recognition systems and greatly improves recognition accuracy.
  • although the deep HLSTM model has stronger modeling capability, the increased depth and the newly introduced connections (the above-mentioned direct connections) also give the acoustic model a more complex network structure, so the forward computation takes longer, eventually leading to slower decoding. Therefore, how to improve performance without increasing the complexity of the acoustic model becomes a problem to be solved.
  • Embodiments of the present disclosure provide an acoustic modeling method, apparatus, and storage medium based on an HLSTM model.
  • Embodiments of the present disclosure provide an acoustic modeling method based on an HLSTM model, including:
  • the randomly initialized HLSTM model is trained based on a preset function, and the training result is optimized;
  • the network parameters of the HLSTM model are the same as those of the LSTM model.
  • the training is performed on the randomly initialized HLSTM model based on a preset function, and the training result is optimized, including:
  • the HLSTM model obtained by the training is optimized according to the state-level minimum Bayesian risk criterion.
  • the cross-entropy objective function is F_{CE} = -\sum_{t=1}^{N}\sum_{y=1}^{S}\hat{y}_{t}\log p(y|X_{t}), where F_{CE} represents the cross-entropy objective function; \hat{y}_{t} is the label value of the speech feature at time t at the output point of state y; p(y|X_{t}) is the output of the neural network for state point y given the speech feature at time t; X represents the training data; S is the number of output state points; and N is the total duration of the speech features.
  • the objective function corresponding to the state-level minimum Bayes risk criterion is F_{SMBR} = \sum_{u}\frac{\sum_{W}p(O_{u}|S)^{k}P(W)A(W,W_{u})}{\sum_{W'}p(O_{u}|S)^{k}P(W')}, where W_{u} is the annotated text of the speech; W and W' are labels corresponding to decoding paths of the seed model; p(O_{u}|S) is the acoustic likelihood; A(W,W_{u}) is the number of correctly labelled states in the decoded state sequence; the seed model is the HLSTM model obtained after the optimization; u is the index of the utterance in the training data; k is the acoustic score coefficient; O_{u} is the speech features of the u-th utterance; S represents the state sequence of the decoding path; and P(W) and P(W') are both language model probability scores.
  • the number of network layers of the HLSTM model is greater than or equal to the number of network layers of the LSTM model.
  • the training of the randomly initialized LSTM model based on the result of the forward computation and the preset function includes: obtaining the per-frame output results of the forward computation, and training the randomly initialized LSTM model based on the per-frame output results and the cross-entropy objective function.
  • An embodiment of the present disclosure further provides an acoustic modeling device based on an HLSTM model, including:
  • the HLSTM model processing module is configured to train the randomly initialized HLSTM model based on a preset function and optimize the training result;
  • a calculation module configured to perform forward computation on training data through the optimized HLSTM model;
  • an LSTM model processing module configured to train the randomly initialized long short-term memory (LSTM) model based on the result of the forward computation and the preset function, the obtained model being an acoustic model of the speech recognition system;
  • the network parameters of the HLSTM model are the same as those of the LSTM model.
  • the HLSTM model processing module includes:
  • a first training unit configured to train the randomly initialized HLSTM model using a cross entropy objective function
  • An optimization unit configured to optimize the HLSTM model obtained by the training according to a state-level minimum Bayesian risk criterion.
  • the cross-entropy objective function is F_{CE} = -\sum_{t=1}^{N}\sum_{y=1}^{S}\hat{y}_{t}\log p(y|X_{t}), where F_{CE} represents the cross-entropy objective function; \hat{y}_{t} is the label value of the speech feature at time t at the output point of state y; p(y|X_{t}) is the output of the neural network for state point y given the speech feature at time t; X represents the training data; S is the number of output state points; and N is the total duration of the speech features.
  • the objective function corresponding to the state-level minimum Bayes risk criterion is F_{SMBR} = \sum_{u}\frac{\sum_{W}p(O_{u}|S)^{k}P(W)A(W,W_{u})}{\sum_{W'}p(O_{u}|S)^{k}P(W')}, where W_{u} is the annotated text of the speech; W and W' are labels corresponding to decoding paths of the seed model; p(O_{u}|S) is the acoustic likelihood; A(W,W_{u}) is the number of correctly labelled states in the decoded state sequence; the seed model is the HLSTM model obtained after the optimization; u is the index of the utterance in the training data; k is the acoustic score coefficient; O_{u} is the speech features of the u-th utterance; S represents the state sequence of the decoding path; and P(W) and P(W') are both language model probability scores.
  • the LSTM model processing module includes:
  • An obtaining unit configured to obtain an output result of each frame obtained by the forward calculation
  • a second training unit configured to train the randomly initialized LSTM model based on the per-frame output results and the cross-entropy objective function, wherein the \hat{y}_{t} in the cross-entropy objective function is the per-frame output obtained by the forward computation.
  • Embodiments of the present disclosure further provide a storage medium having stored thereon a computer program that, when executed by a processor, implements the steps of any of the above methods.
  • the HLSTM model-based acoustic modeling method, apparatus and storage medium train a randomly initialized HLSTM model based on a preset function and optimize the training result; pass the training data through the optimized HLSTM model for forward computation; and train the randomly initialized LSTM model based on the result of the forward computation and the preset function, the obtained model being an acoustic model of the speech recognition system; wherein the HLSTM model and the LSTM model have the same network parameters.
  • the embodiment of the present disclosure transmits the network information of the optimized HLSTM model to the LSTM network through the posterior probability, thereby improving the performance of the LSTM baseline model without increasing the complexity of the model.
  • FIG. 1 is a schematic flow chart of an acoustic modeling method based on an HLSTM model according to an embodiment of the present disclosure
  • FIG. 2 is a network structure diagram of a bidirectional HLSTM model according to an embodiment of the present disclosure
  • FIG. 3 is a schematic structural diagram of an acoustic modeling device based on an HLSTM model according to an embodiment of the present disclosure
  • FIG. 4 is a schematic structural diagram of an HLSTM model processing module according to an embodiment of the present disclosure.
  • FIG. 5 is a schematic structural diagram of an LSTM model processing module according to an embodiment of the present disclosure.
  • FIG. 1 is a schematic flowchart of an acoustic modeling method based on an HLSTM model according to an embodiment of the present disclosure. As shown in FIG. 1 , the method includes:
  • Step 101 Train the randomly initialized HLSTM model based on a preset function, and optimize the training result.
  • Step 102: Perform forward computation on the training data using the HLSTM model obtained by the optimization;
  • Step 103 Train the randomly initialized LSTM model based on the result of the forward calculation and the preset function, and obtain the model as an acoustic model of the speech recognition system;
  • the network parameters of the HLSTM model are the same as those of the LSTM model.
  • the HLSTM model and the LSTM model may both be bidirectional or both unidirectional.
  • the network parameters may include: the number of input layer nodes, the number of output layer nodes, the input observation vector, the number of hidden layer nodes, the recursive delay, and the projection (mapping) layer connected after each hidden layer.
  • the embodiment of the present disclosure transmits the network information of the optimized HLSTM model to the LSTM network through the posterior probability, thereby improving the performance of the LSTM baseline model without increasing the complexity of the model.
  • the randomly initialized HLSTM model is shown in FIG. 2, and the dotted line box is an inter-layer memory unit connection (direct connection) set on the basis of the LSTM model, as shown in FIG. 2.
  • the direct connection between adjacent layer memory cells is introduced in the HLSTM model, the problem of gradient disappearance can be avoided, and the difficulty of network training is reduced, so that a deeper structure can be used in practical applications.
  • the number of network layers cannot be infinitely deepened because the larger parameter quantity model causes over-fitting compared to the amount of training data. In actual use, the number of network layers of the HLSTM model can be adjusted based on the amount of training data available.
  • the training is performed on the randomly initialized HLSTM model based on a preset function, and the training result is optimized, including:
  • the HLSTM model obtained by the training is optimized according to the state-level minimum Bayesian risk criterion.
  • the cross-entropy objective function is F_{CE} = -\sum_{t=1}^{N}\sum_{y=1}^{S}\hat{y}_{t}\log p(y|X_{t}), where F_{CE} represents the cross-entropy objective function; \hat{y}_{t} is the label value of the speech feature at time t at the output point of state y; p(y|X_{t}) is the output of the neural network for state point y given the speech feature at time t; X represents the training data; S is the number of output state points; and N is the total duration of the speech features.
  • the objective function corresponding to the state-level minimum Bayes risk criterion is F_{SMBR} = \sum_{u}\frac{\sum_{W}p(O_{u}|S)^{k}P(W)A(W,W_{u})}{\sum_{W'}p(O_{u}|S)^{k}P(W')}, where W_{u} is the annotated text of the speech; W and W' are labels corresponding to decoding paths of the seed model; p(O_{u}|S) is the acoustic likelihood; A(W,W_{u}) is the number of correctly labelled states in the decoded state sequence; the seed model is the HLSTM model obtained after the optimization; u is the index of the utterance in the training data; k is the acoustic score coefficient; O_{u} is the speech features of the u-th utterance; S represents the state sequence of the decoding path; and P(W) and P(W') are both language model probability scores.
  • the number of network layers of the HLSTM model is greater than or equal to the number of network layers of the LSTM model.
  • the training of the randomly initialized LSTM model based on the result of the forward computation and the preset function includes: obtaining the per-frame output results of the forward computation, and training the randomly initialized LSTM model based on the per-frame output results and the cross-entropy objective function.
  • the embodiment of the present disclosure further provides an acoustic modeling device based on the HLSTM model, which is used to implement the above embodiments and specific implementations, and has not been described again.
  • the terms "module" and "unit" may refer to a combination of software and/or hardware that implements a predetermined function.
  • the device comprises:
  • the HLSTM model processing module 301 is configured to train the randomly initialized HLSTM model based on a preset function, and optimize the training result;
  • the calculation module 302 is configured to perform forward computation on the training data using the optimized HLSTM model;
  • the LSTM model processing module 303 is configured to train the randomly initialized LSTM model based on the result of the forward calculation and the preset function, and the obtained model is an acoustic model of the speech recognition system;
  • the network parameters of the HLSTM model are the same as those of the LSTM model.
  • the embodiment of the present disclosure transmits the network information of the optimized HLSTM model to the LSTM network through the posterior probability, thereby improving the performance of the LSTM baseline model without increasing the complexity of the model.
  • the randomly initialized HLSTM model is shown in FIG. 2, and the dotted line box is an inter-layer memory unit connection (direct connection) set on the basis of the LSTM model, and the connection formula is as shown in FIG. 2.
  • the direct connection between adjacent layer memory cells is introduced in the HLSTM model, the problem of gradient disappearance can be avoided, and the difficulty of network training is reduced, so that a deeper structure can be used in practical applications.
  • the number of network layers cannot be infinitely deepened because the larger parameter quantity model causes over-fitting compared to the amount of training data. In actual use, the number of network layers of the HLSTM model can be adjusted based on the amount of training data available.
  • the HLSTM model processing module 301 includes:
  • the first training unit 3011 is configured to train the randomly initialized HLSTM model by using a cross entropy objective function
  • the optimization unit 3012 is configured to optimize the HLSTM model obtained by the training according to the state-level minimum Bayesian risk criterion.
  • the cross-entropy objective function is F_{CE} = -\sum_{t=1}^{N}\sum_{y=1}^{S}\hat{y}_{t}\log p(y|X_{t}), where F_{CE} represents the cross-entropy objective function; \hat{y}_{t} is the label value of the speech feature at time t at the output point of state y; p(y|X_{t}) is the output of the neural network for state point y given the speech feature at time t; X represents the training data; S is the number of output state points; and N is the total duration of the speech features.
  • the objective function corresponding to the state-level minimum Bayes risk criterion is F_{SMBR} = \sum_{u}\frac{\sum_{W}p(O_{u}|S)^{k}P(W)A(W,W_{u})}{\sum_{W'}p(O_{u}|S)^{k}P(W')}, where W_{u} is the annotated text of the speech; W and W' are labels corresponding to decoding paths of the seed model; p(O_{u}|S) is the acoustic likelihood; A(W,W_{u}) is the number of correctly labelled states in the decoded state sequence; the seed model is the HLSTM model obtained after the optimization; u is the index of the utterance in the training data; k is the acoustic score coefficient; O_{u} is the speech features of the u-th utterance; S represents the state sequence of the decoding path; and P(W) and P(W') are both language model probability scores.
  • the LSTM model processing module 303 includes:
  • the obtaining unit 3031 is configured to acquire an output result of each frame obtained by the forward calculation
  • a second training unit 3032 configured to train the randomly initialized LSTM model based on the per-frame output results and the cross-entropy objective function, wherein the \hat{y}_{t} in the cross-entropy objective function is the per-frame output obtained by the forward computation.
  • the number of network layers of the HLSTM model is greater than or equal to the number of network layers of the LSTM model.
  • the HLSTM model processing module 301, the calculation module 302, the LSTM model processing module 303, the first training unit 3011, the optimization unit 3012, the obtaining unit 3031 and the second training unit 3032 may be implemented by a processor in the HLSTM model-based acoustic modeling apparatus.
  • a deep bidirectional HLSTM model with stronger modeling capability is trained as the "teacher" model;
  • a randomly initialized bidirectional LSTM model is used as the "student" model;
  • the "teacher" model is used to train the "student" model, which has a relatively small number of parameters. The specific method is described as follows:
  • the HLSTM model is randomly initialized.
  • the network structure of the HLSTM model is shown in Figure 2. Since HLSTM introduces direct connection between adjacent layer memory cells, the problem of gradient disappearance is avoided, and the difficulty of network training is reduced. Therefore, a deeper structure can be used in practical applications.
  • the number of network layers cannot be infinitely deepened because the excessive parameter quantity model causes over-fitting compared to the amount of training data. In actual use, the number of HLSTM network layers can be adjusted according to the amount of training data available.
  • the training data can be 300h (hours)
  • the HLSTM model used is 6 layers, namely: an input layer, an output layer, and four hidden layers between them.
  • the HLSTM model is iteratively updated using the CrossEntropy (CE) objective function.
  • CE objective function formula is as follows:
  • F_{CE} = -\sum_{t=1}^{N}\sum_{y=1}^{S}\hat{y}_{t}\log p(y|X_{t}), where F_{CE} represents the cross-entropy objective function; \hat{y}_{t} is the label value of the speech feature at time t at the output point of state y; p(y|X_{t}) is the output of the neural network for state point y given the speech feature at time t; X represents the training data; S is the number of output state points; and N is the total duration of the speech features.
  • the HLSTM model generated based on CE objective function training has better recognition performance.
  • the model is further optimized by the discriminative sequence-level training criterion, namely: State-level Minimum Bayes Risk (SMBR) criterion.
  • SMBR State-level Minimum Bayes Risk
  • the difference from acoustic model training under the CE criterion is that the discriminative sequence-level training criterion, by optimizing a function related to the recognition rate of the system, tries to learn more class-discrimination information from both positive and negative training samples on a limited training set.
  • Its objective function is as follows:
  • F_{SMBR} = \sum_{u}\frac{\sum_{W}p(O_{u}|S)^{k}P(W)A(W,W_{u})}{\sum_{W'}p(O_{u}|S)^{k}P(W')}, where W_{u} is the annotated text of the speech; W and W' are labels corresponding to decoding paths of the seed model; p(O_{u}|S) is the acoustic likelihood; A(W,W_{u}) is the number of correctly labelled states in the decoded state sequence; the seed model is the HLSTM model obtained after the optimization; u is the index of the utterance in the training data; k is the acoustic score coefficient; O_{u} is the speech features of the u-th utterance; S represents the state sequence of the decoding path; and P(W) and P(W') are both language model probability scores.
  • comparative experiments between HLSTM and LSTM show that the performance gain obtained by discriminative training of the model with the newly introduced connections (the HLSTM model) is clearly larger than the gain obtained for the LSTM model, so discriminative training is very meaningful for the HLSTM model. The model trained in this way is the "teacher" model.
  • the information transfer method of the embodiment of the present disclosure is to pass the training data through the "teacher" model for forward computation, obtain the output corresponding to each input frame, use the obtained outputs as labels, and train the "student" model with the CE criterion mentioned above as the objective function; the trained LSTM model is used as the acoustic model of the speech recognition system.
  • An advantage of embodiments of the present disclosure is to improve LSTM baseline model performance without increasing model complexity.
  • the HLSTM model has stronger modeling capabilities and higher recognition performance, the decoding real-time rate is also one of the indicators for evaluating the performance of the recognition system.
  • the HLSTM model has a higher parameter size and model complexity than the LSTM model, which inevitably slows down the decoding speed.
  • the network information of the HLSTM model is transferred to the LSTM network through posterior probabilities, thereby improving the performance of the LSTM baseline model.
  • although there is an unavoidable performance loss during the information transfer, i.e. the "student" model performs below the "teacher" model, it still performs better than a directly trained LSTM model.
  • Step 1 Extract the speech features of the training data.
  • the EM algorithm is used to iteratively update the means and variances of the GMM-HMM system, and the GMM-HMM system is used to force-align the feature data to obtain triphone clustering state labels.
  • Step 2 Train the two-way HLSTM model based on the cross entropy criterion.
  • a six-layer bidirectional HLSTM model is used, and the parameter quantity of the model is 190M.
  • the specific configuration is as follows: the input layer has 260 nodes; the input observation vector is extended by 2 frames of context on each side; each of the four hidden layers has 1024 nodes, with recursive delays of 1, 2, 3 and 4 respectively; a 512-dimensional projection (mapping) layer is connected after each hidden layer to reduce dimensionality and the number of parameters.
  • the number of nodes in the output layer is 2821, which corresponds to 2821 triphone clustering states.
  • Step 3 The model generated in step 2 is used as a seed model, and the bidirectional HLSTM model is iteratively updated based on the state-level minimum Bayesian risk criterion.
  • Step 4 Perform the forward calculation by using the two-way HLSTM model generated in step three to obtain the output vector.
  • Step 5: Use the output vectors obtained in step 4 as labels for the corresponding input features, and train a bidirectional LSTM model with three hidden layers; the parameter size is 120M.
  • the network parameters of the model are consistent with the HLSTM model in step 2.
  • embodiments of the present disclosure can be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of a hardware embodiment, a software embodiment, or a combination of software and hardware aspects. Moreover, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage and optical storage, etc.) including computer usable program code.
  • these computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • these computer program instructions can also be loaded onto a computer or other programmable data processing device, such that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • an embodiment of the present disclosure further provides a storage medium, in particular a computer readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the steps of the method in the embodiment of the present disclosure are implemented.
  • the solution provided by the embodiments of the present disclosure trains a randomly initialized HLSTM model based on a preset function and optimizes the training result; passes the training data through the optimized HLSTM model for forward computation; and trains the randomly initialized LSTM model based on the result of the forward computation and the preset function, the obtained model being an acoustic model of the speech recognition system; wherein the HLSTM model has the same network parameters as the LSTM model.
  • the embodiment of the present disclosure transmits the network information of the optimized HLSTM model to the LSTM network through the posterior probability, thereby improving the performance of the LSTM baseline model without increasing the complexity of the model.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Telephonic Communication Services (AREA)
  • Machine Translation (AREA)

Abstract

An acoustic modeling method, apparatus and storage medium based on a highway long short-term memory (HLSTM) model. The method includes: training a randomly initialized HLSTM model based on a preset function and optimizing the training result (101); performing forward computation on training data with the optimized HLSTM model (102); training a randomly initialized long short-term memory (LSTM) model based on the result of the forward computation and the preset function, the obtained model being the acoustic model of a speech recognition system (103); wherein the HLSTM model and the LSTM model have the same network parameters.

Description

Acoustic modeling method, apparatus and storage medium based on HLSTM model
CROSS-REFERENCE TO RELATED APPLICATIONS
This application is based on and claims priority to Chinese patent application No. 201710094191.6, filed on February 21, 2017, the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
The present disclosure relates to the field of speech recognition technologies, and in particular to an acoustic modeling method, apparatus and storage medium based on a Highway Long Short Time Memory (HLSTM) model.
BACKGROUND
In recent years, large-vocabulary continuous speech recognition systems have made significant progress. Traditional speech recognition systems use a Hidden Markov Model (HMM) to express the time-varying characteristics of the speech signal and a Gaussian Mixture Model (GMM) to model the pronunciation diversity of the speech signal. Later, deep learning techniques were introduced into speech recognition research, significantly improving the performance of speech recognition systems and truly pushing speech recognition to a commercially usable level. Because of the great practical value of speech recognition technology, the field has become a research focus of technology giants, Internet companies and well-known universities. After the Deep Neural Network (DNN) was introduced into speech recognition, sequence discriminative training of neural networks and the application of the Convolutional Neural Network (CNN) to speech recognition were further studied.
Subsequently, the Long Short Time Memory (LSTM) model was introduced into acoustic modeling; compared with a simple feed-forward network, the LSTM model has stronger acoustic modeling capability. As the amount of data keeps growing, the number of layers of the acoustic model neural network needs to be increased to improve the modeling capability. However, as the number of network layers of the LSTM model increases, training the network becomes more difficult and the vanishing-gradient problem arises. To avoid vanishing gradients, an HLSTM model based on the LSTM model was proposed, in which direct connections are introduced between the memory cells of adjacent layers of the LSTM model.
The HLSTM model enables deeper network structures to be put into practical use in recognition systems and greatly improves recognition accuracy. Although the deep HLSTM model has stronger modeling capability, the increased depth and the newly introduced connections (the above direct connections) also give the acoustic model a more complex network structure, so the forward computation takes longer, which ultimately slows down decoding. Therefore, how to improve performance without increasing the complexity of the acoustic model becomes a problem to be solved.
SUMMARY
Embodiments of the present disclosure provide an acoustic modeling method, apparatus and storage medium based on an HLSTM model.
The technical solutions of the embodiments of the present disclosure are implemented as follows.
An embodiment of the present disclosure provides an acoustic modeling method based on an HLSTM model, including:
training a randomly initialized HLSTM model based on a preset function, and optimizing the training result;
performing forward computation on training data with the HLSTM model obtained by the optimization;
training a randomly initialized LSTM model based on the result of the forward computation and the preset function, the obtained model being an acoustic model of a speech recognition system;
wherein the HLSTM model and the LSTM model have the same network parameters.
In the above solution, training the randomly initialized HLSTM model based on the preset function and optimizing the training result includes:
training the randomly initialized HLSTM model with a cross-entropy objective function;
optimizing the trained HLSTM model according to a state-level minimum Bayes risk criterion.
In the above solution, the cross-entropy objective function is:
F_{CE} = -\sum_{t=1}^{N}\sum_{y=1}^{S}\hat{y}_{t}\log p(y|X_{t})
where F_{CE} represents the cross-entropy objective function; \hat{y}_{t} is the label value of the speech feature at time t at the output point of state y; p(y|X_{t}) is the output of the neural network for state point y given the speech feature at time t; X represents the training data; S is the number of output state points; and N is the total duration of the speech features.
In the above solution, the objective function corresponding to the state-level minimum Bayes risk criterion is:
F_{SMBR} = \sum_{u}\frac{\sum_{W}p(O_{u}|S)^{k}P(W)A(W,W_{u})}{\sum_{W'}p(O_{u}|S)^{k}P(W')}
where W_{u} is the annotated text of the speech; W and W' are labels corresponding to decoding paths of the seed model; p(O_{u}|S) is the acoustic likelihood; A(W,W_{u}) represents the number of correctly labelled states in the decoded state sequence; the seed model is the HLSTM model obtained after the optimization; u is the index of the utterance in the training data; k is the acoustic score coefficient; O_{u} is the speech features of the u-th utterance; S represents the state sequence of the decoding path; and P(W) and P(W') are both language model probability scores.
In the above solution, the number of network layers of the HLSTM model is greater than or equal to the number of network layers of the LSTM model.
In the above solution, training the randomly initialized LSTM model based on the result of the forward computation and the preset function includes:
obtaining the per-frame output results of the forward computation;
training the randomly initialized LSTM model based on the per-frame output results and the cross-entropy objective function, wherein the \hat{y}_{t} in the cross-entropy objective function is the per-frame output obtained by the forward computation.
An embodiment of the present disclosure further provides an acoustic modeling apparatus based on an HLSTM model, including:
an HLSTM model processing module, configured to train a randomly initialized HLSTM model based on a preset function and optimize the training result;
a calculation module, configured to perform forward computation on training data with the HLSTM model obtained by the optimization;
an LSTM model processing module, configured to train a randomly initialized long short-term memory (LSTM) model based on the result of the forward computation and the preset function, the obtained model being an acoustic model of a speech recognition system;
wherein the HLSTM model and the LSTM model have the same network parameters.
In the above solution, the HLSTM model processing module includes:
a first training unit, configured to train the randomly initialized HLSTM model with a cross-entropy objective function;
an optimization unit, configured to optimize the trained HLSTM model according to a state-level minimum Bayes risk criterion.
In the above solution, the cross-entropy objective function is:
F_{CE} = -\sum_{t=1}^{N}\sum_{y=1}^{S}\hat{y}_{t}\log p(y|X_{t})
where F_{CE} represents the cross-entropy objective function; \hat{y}_{t} is the label value of the speech feature at time t at the output point of state y; p(y|X_{t}) is the output of the neural network for state point y given the speech feature at time t; X represents the training data; S is the number of output state points; and N is the total duration of the speech features.
In the above solution, the objective function corresponding to the state-level minimum Bayes risk criterion is:
F_{SMBR} = \sum_{u}\frac{\sum_{W}p(O_{u}|S)^{k}P(W)A(W,W_{u})}{\sum_{W'}p(O_{u}|S)^{k}P(W')}
where W_{u} is the annotated text of the speech; W and W' are labels corresponding to decoding paths of the seed model; p(O_{u}|S) is the acoustic likelihood; A(W,W_{u}) represents the number of correctly labelled states in the decoded state sequence; the seed model is the HLSTM model obtained after the optimization; u is the index of the utterance in the training data; k is the acoustic score coefficient; O_{u} is the speech features of the u-th utterance; S represents the state sequence of the decoding path; and P(W) and P(W') are both language model probability scores.
In the above solution, the LSTM model processing module includes:
an obtaining unit, configured to obtain the per-frame output results of the forward computation;
a second training unit, configured to train the randomly initialized LSTM model based on the per-frame output results and the cross-entropy objective function, wherein the \hat{y}_{t} in the cross-entropy objective function is the per-frame output obtained by the forward computation.
An embodiment of the present disclosure further provides a storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of any one of the above methods.
According to the HLSTM model-based acoustic modeling method, apparatus and storage medium provided by the embodiments of the present disclosure, a randomly initialized HLSTM model is trained based on a preset function and the training result is optimized; forward computation is performed on the training data with the optimized HLSTM model; a randomly initialized LSTM model is trained based on the result of the forward computation and the preset function, and the obtained model is an acoustic model of a speech recognition system; wherein the HLSTM model and the LSTM model have the same network parameters. The embodiments of the present disclosure transfer the network information of the optimized HLSTM model to the LSTM network through posterior probabilities, thereby improving the performance of the LSTM baseline model without increasing the complexity of the model.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic flowchart of an acoustic modeling method based on an HLSTM model according to an embodiment of the present disclosure;
FIG. 2 is a network structure diagram of a bidirectional HLSTM model according to an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of an acoustic modeling apparatus based on an HLSTM model according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of an HLSTM model processing module according to an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of an LSTM model processing module according to an embodiment of the present disclosure.
DETAILED DESCRIPTION
The present disclosure is described in detail below with reference to specific embodiments.
FIG. 1 is a schematic flowchart of an acoustic modeling method based on an HLSTM model according to an embodiment of the present disclosure. As shown in FIG. 1, the method includes:
Step 101: train a randomly initialized HLSTM model based on a preset function, and optimize the training result;
Step 102: perform forward computation on training data with the HLSTM model obtained by the optimization;
Step 103: train a randomly initialized LSTM model based on the result of the forward computation and the preset function, the obtained model being an acoustic model of a speech recognition system;
wherein the HLSTM model and the LSTM model have the same network parameters.
Here, the HLSTM model and the LSTM model may both be bidirectional, or both unidirectional. The network parameters may include: the number of input layer nodes, the number of output layer nodes, the input observation vector, the number of hidden layer nodes, the recursive delay, the projection (mapping) layer connected after each hidden layer, and so on.
The embodiments of the present disclosure transfer the network information of the optimized HLSTM model to the LSTM network through posterior probabilities, thereby improving the performance of the LSTM baseline model without increasing the complexity of the model.
As an example, the randomly initialized HLSTM model is shown in FIG. 2; the dashed boxes are the inter-layer memory cell connections (direct connections) added on the basis of the LSTM model, as shown in FIG. 2. Because direct connections between the memory cells of adjacent layers are introduced in the HLSTM model, the vanishing-gradient problem can be avoided and the difficulty of network training is reduced, so a deeper structure can be used in practical applications. On the other hand, the number of network layers cannot be increased without limit owing to the constraint on the number of parameters, because a model with too many parameters relative to the amount of training data causes over-fitting. In actual use, the number of network layers of the HLSTM model can be adjusted according to the amount of training data available.
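The connection formula between adjacent-layer memory cells is only shown in FIG. 2 and is not reproduced in this text, so the following minimal sketch is an illustration rather than the patent's exact equation: it follows the commonly used highway-LSTM formulation in which a depth (carry) gate lets the memory cell of layer l-1 flow directly into the memory cell of layer l. All parameter names (W_i, W_f, W_o, W_g, W_d, w_dc and the biases) are placeholders, and the bidirectional case used in this disclosure would apply the same step in both time directions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hlstm_cell_step(x, h_prev, c_prev, c_lower, p):
    """One time step of layer l of a (unidirectional) highway LSTM cell.

    x       : input vector coming from the layer below at time t
    h_prev  : hidden state of this layer at time t-1
    c_prev  : memory cell of this layer at time t-1
    c_lower : memory cell of layer l-1 at time t (the "direct connection")
    p       : dict of weight matrices and bias vectors (placeholder names)
    """
    z = np.concatenate([x, h_prev])
    i = sigmoid(p["W_i"] @ z + p["b_i"])        # input gate
    f = sigmoid(p["W_f"] @ z + p["b_f"])        # forget gate
    o = sigmoid(p["W_o"] @ z + p["b_o"])        # output gate
    g = np.tanh(p["W_g"] @ z + p["b_g"])        # candidate memory content
    # Depth ("carry") gate: controls how much of the lower layer's memory
    # cell is written straight into this layer's memory cell.
    d = sigmoid(p["W_d"] @ z + p["w_dc"] * c_lower + p["b_d"])
    c = f * c_prev + i * g + d * c_lower        # inter-layer direct connection term
    h = o * np.tanh(c)
    return h, c
```

Dropping the d * c_lower term recovers an ordinary LSTM cell update, which is how the student LSTM described below differs structurally from the HLSTM.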
In the embodiment of the present disclosure, training the randomly initialized HLSTM model based on the preset function and optimizing the training result includes:
training the randomly initialized HLSTM model with a cross-entropy objective function;
optimizing the trained HLSTM model according to a state-level minimum Bayes risk criterion.
The cross-entropy objective function is:
F_{CE} = -\sum_{t=1}^{N}\sum_{y=1}^{S}\hat{y}_{t}\log p(y|X_{t})
where F_{CE} represents the cross-entropy objective function; \hat{y}_{t} is the label value of the speech feature at time t at the output point of state y; p(y|X_{t}) is the output of the neural network for state point y given the speech feature at time t; X represents the training data; S is the number of output state points; and N is the total duration of the speech features.
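As a concrete illustration (not part of the original disclosure), the following minimal sketch evaluates the frame-level cross-entropy objective above when the labels \hat{y}_{t} are one-hot state annotations obtained from forced alignment; in that case the double sum collapses to picking the posterior of the labelled state at each frame.

```python
import numpy as np

def frame_cross_entropy(posteriors, state_labels):
    """F_CE = -sum_t sum_y y_hat_t(y) * log p(y | X_t) with one-hot targets.

    posteriors   : (N, S) array of network outputs p(y | X_t),
                   one row per frame, each row summing to 1.
    state_labels : (N,) array of aligned state indices, one per frame.
    """
    eps = 1e-10                                    # numerical safety for log
    picked = posteriors[np.arange(len(state_labels)), state_labels]
    return -np.sum(np.log(picked + eps))

# Example with 3 frames and 4 output states:
post = np.array([[0.7, 0.1, 0.1, 0.1],
                 [0.2, 0.6, 0.1, 0.1],
                 [0.1, 0.1, 0.1, 0.7]])
print(frame_cross_entropy(post, np.array([0, 1, 3])))  # approx. 1.224
```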
The objective function corresponding to the state-level minimum Bayes risk criterion is:
F_{SMBR} = \sum_{u}\frac{\sum_{W}p(O_{u}|S)^{k}P(W)A(W,W_{u})}{\sum_{W'}p(O_{u}|S)^{k}P(W')}
where W_{u} is the annotated text of the speech; W and W' are labels corresponding to decoding paths of the seed model; p(O_{u}|S) is the acoustic likelihood; A(W,W_{u}) represents the number of correctly labelled states in the decoded state sequence; the seed model is the HLSTM model obtained after the optimization; u is the index of the utterance in the training data; k is the acoustic score coefficient; O_{u} is the speech features of the u-th utterance; S represents the state sequence of the decoding path; and P(W) and P(W') are both language model probability scores.
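In practice the SMBR objective is accumulated over decoding lattices with a forward-backward pass. The toy sketch below (an illustration under simplifying assumptions, not the patent's implementation) evaluates the same expected-state-accuracy expression over an explicit list of competing hypotheses per utterance; the field names and the acoustic scale value are invented for the example.

```python
import numpy as np

def smbr_objective(utterances, k=0.1):
    """F_SMBR as a posterior-weighted state accuracy, summed over utterances.

    utterances : list of dicts, one per utterance u, each containing
        'log_acoustic' : log p(O_u | S) for every competing path W,
        'log_lm'       : log P(W) for every competing path,
        'accuracy'     : A(W, W_u), correctly labelled states per path.
    k : acoustic score scaling coefficient (value chosen for illustration).
    """
    total = 0.0
    for utt in utterances:
        score = k * np.asarray(utt["log_acoustic"]) + np.asarray(utt["log_lm"])
        w = np.exp(score - score.max())         # shift for numerical stability
        acc = np.asarray(utt["accuracy"], dtype=float)
        total += np.sum(w * acc) / np.sum(w)    # expected accuracy for utterance u
    return total
```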
In the embodiment of the present disclosure, the number of network layers of the HLSTM model is greater than or equal to the number of network layers of the LSTM model.
In the embodiment of the present disclosure, training the randomly initialized LSTM model based on the result of the forward computation and the preset function includes:
obtaining the per-frame output results of the forward computation;
training the randomly initialized LSTM model based on the per-frame output results and the cross-entropy objective function, wherein the \hat{y}_{t} in the cross-entropy objective function is the per-frame output obtained by the forward computation.
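When the student LSTM is trained, the per-frame outputs of the teacher HLSTM take the place of the one-hot alignment labels \hat{y}_{t} in the cross-entropy objective. A minimal sketch of that soft-target cross-entropy is given below; it illustrates the idea rather than reproducing the patent's code, and the function and argument names are invented.

```python
import numpy as np

def soft_target_cross_entropy(student_logits, teacher_posteriors):
    """Cross-entropy of the student against the teacher's frame posteriors.

    student_logits     : (N, S) unnormalised student outputs, one row per frame.
    teacher_posteriors : (N, S) outputs of the optimised HLSTM forward pass,
                         used as the y_hat_t targets of the CE objective.
    """
    z = student_logits - student_logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))  # log-softmax
    return -np.sum(teacher_posteriors * log_probs)
```

Minimising this quantity drives the student's per-frame posterior distribution toward the teacher's, which is the posterior-probability transfer of network information referred to throughout this disclosure.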
Comparative experiments between HLSTM and LSTM show that the performance gain obtained by discriminative training of the LSTM model with the introduced direct connections, i.e. the HLSTM model, is significantly larger than the gain obtained for the LSTM model; therefore, discriminative training is very meaningful for improving the performance of the HLSTM model.
An embodiment of the present disclosure further provides an acoustic modeling apparatus based on the HLSTM model, which is used to implement the above embodiments and specific implementations; what has already been described is not repeated. As used below, the terms "module" and "unit" may refer to a combination of software and/or hardware that implements a predetermined function. As shown in FIG. 3, the apparatus includes:
an HLSTM model processing module 301, configured to train a randomly initialized HLSTM model based on a preset function and optimize the training result;
a calculation module 302, configured to perform forward computation on training data with the HLSTM model obtained by the optimization;
an LSTM model processing module 303, configured to train a randomly initialized LSTM model based on the result of the forward computation and the preset function, the obtained model being an acoustic model of a speech recognition system;
wherein the HLSTM model and the LSTM model have the same network parameters.
The embodiments of the present disclosure transfer the network information of the optimized HLSTM model to the LSTM network through posterior probabilities, thereby improving the performance of the LSTM baseline model without increasing the complexity of the model.
As an example, the randomly initialized HLSTM model is shown in FIG. 2; the dashed boxes are the inter-layer memory cell connections (direct connections) added on the basis of the LSTM model, and the connection formula is as shown in FIG. 2. Because direct connections between the memory cells of adjacent layers are introduced in the HLSTM model, the vanishing-gradient problem can be avoided and the difficulty of network training is reduced, so a deeper structure can be used in practical applications. On the other hand, the number of network layers cannot be increased without limit owing to the constraint on the number of parameters, because a model with too many parameters relative to the amount of training data causes over-fitting. In actual use, the number of network layers of the HLSTM model can be adjusted according to the amount of training data available.
In the embodiment of the present disclosure, as shown in FIG. 4, the HLSTM model processing module 301 includes:
a first training unit 3011, configured to train the randomly initialized HLSTM model with a cross-entropy objective function;
an optimization unit 3012, configured to optimize the trained HLSTM model according to a state-level minimum Bayes risk criterion.
The cross-entropy objective function is:
F_{CE} = -\sum_{t=1}^{N}\sum_{y=1}^{S}\hat{y}_{t}\log p(y|X_{t})
where F_{CE} represents the cross-entropy objective function; \hat{y}_{t} is the label value of the speech feature at time t at the output point of state y; p(y|X_{t}) is the output of the neural network for state point y given the speech feature at time t; X represents the training data; S is the number of output state points; and N is the total duration of the speech features.
The objective function corresponding to the state-level minimum Bayes risk criterion is:
F_{SMBR} = \sum_{u}\frac{\sum_{W}p(O_{u}|S)^{k}P(W)A(W,W_{u})}{\sum_{W'}p(O_{u}|S)^{k}P(W')}
where W_{u} is the annotated text of the speech; W and W' are labels corresponding to decoding paths of the seed model; p(O_{u}|S) is the acoustic likelihood; A(W,W_{u}) represents the number of correctly labelled states in the decoded state sequence; the seed model is the HLSTM model obtained after the optimization; u is the index of the utterance in the training data; k is the acoustic score coefficient; O_{u} is the speech features of the u-th utterance; S represents the state sequence of the decoding path; and P(W) and P(W') are both language model probability scores.
In the embodiment of the present disclosure, as shown in FIG. 5, the LSTM model processing module 303 includes:
an obtaining unit 3031, configured to obtain the per-frame output results of the forward computation;
a second training unit 3032, configured to train the randomly initialized LSTM model based on the per-frame output results and the cross-entropy objective function, wherein the \hat{y}_{t} in the cross-entropy objective function is the per-frame output obtained by the forward computation.
In the embodiment of the present disclosure, the number of network layers of the HLSTM model is greater than or equal to the number of network layers of the LSTM model.
Comparative experiments between HLSTM and LSTM show that the performance gain obtained by discriminative training of the LSTM model with the introduced direct connections, i.e. the HLSTM model, is significantly larger than the gain obtained for the LSTM model; therefore, discriminative training is very meaningful for improving the performance of the HLSTM model.
In practical applications, the HLSTM model processing module 301, the calculation module 302, the LSTM model processing module 303, the first training unit 3011, the optimization unit 3012, the obtaining unit 3031 and the second training unit 3032 may be implemented by a processor in the HLSTM model-based acoustic modeling apparatus.
The present disclosure is described below with reference to a specific scenario embodiment.
In this embodiment, a trained deep bidirectional HLSTM model with stronger modeling capability is used as the "teacher" model, a randomly initialized bidirectional LSTM model is used as the "student" model, and the "teacher" model is used to train the "student" model, which has a relatively small number of parameters. The specific method is described as follows.
1. Training the "teacher" model
First, the HLSTM model is randomly initialized; the network structure of the HLSTM model is shown in FIG. 2. Because HLSTM introduces direct connections between the memory cells of adjacent layers, the vanishing-gradient problem is avoided and the difficulty of network training is reduced, so a deeper structure can be used in practical applications. On the other hand, the number of network layers cannot be increased without limit owing to the constraint on the number of parameters, because a model with too many parameters relative to the amount of training data causes over-fitting. In actual use, the number of HLSTM network layers can be adjusted according to the amount of training data available. In this embodiment the training data can be 300 h (hours), and the HLSTM model used has 6 layers, namely an input layer, an output layer, and four hidden layers between them.
The HLSTM model is trained by iterative updates with the Cross Entropy (CE) objective function, whose formula is as follows:
F_{CE} = -\sum_{t=1}^{N}\sum_{y=1}^{S}\hat{y}_{t}\log p(y|X_{t})
where F_{CE} represents the cross-entropy objective function; \hat{y}_{t} is the label value of the speech feature at time t at the output point of state y; p(y|X_{t}) is the output of the neural network for state point y given the speech feature at time t; X represents the training data; S is the number of output state points; and N is the total duration of the speech features.
The HLSTM model trained with the CE objective function already has fairly good recognition performance. On this basis, the model is further optimized with a discriminative sequence-level training criterion, namely the State-level Minimum Bayes Risk (SMBR) criterion. The difference from acoustic model training under the CE criterion is that the discriminative sequence-level training criterion, by optimizing a function related to the recognition rate of the system, tries to learn more class-discrimination information from both positive and negative training samples on a limited training set. Its objective function is as follows:
F_{SMBR} = \sum_{u}\frac{\sum_{W}p(O_{u}|S)^{k}P(W)A(W,W_{u})}{\sum_{W'}p(O_{u}|S)^{k}P(W')}
where W_{u} is the annotated text of the speech; W and W' are labels corresponding to decoding paths of the seed model; p(O_{u}|S) is the acoustic likelihood; A(W,W_{u}) represents the number of correctly labelled states in the decoded state sequence; the seed model is the HLSTM model obtained after the optimization; u is the index of the utterance in the training data; k is the acoustic score coefficient; O_{u} is the speech features of the u-th utterance; S represents the state sequence of the decoding path; and P(W) and P(W') are both language model probability scores. Comparative experiments between HLSTM and LSTM show that the performance gain obtained by discriminative training of the model with the newly introduced connections (the HLSTM model) is significantly larger than the gain obtained for the LSTM model; therefore, discriminative training is very meaningful for improving the performance of the HLSTM model. At this point, the trained model is the "teacher" model.
2. Training the "student" model
An LSTM model with three hidden layers is randomly initialized; the other parameters of the model are consistent with the "teacher" model. Next, the information learned by the HLSTM model needs to be transferred to the LSTM model. The information transfer method of the embodiment of the present disclosure is to pass the training data through the "teacher" model for forward computation, obtain the output corresponding to each input frame, use the obtained outputs as labels, and train the "student" model with the CE criterion mentioned above as the objective function; the trained LSTM model serves as the acoustic model used by the speech recognition system.
The advantage of the embodiments of the present disclosure is to improve the performance of the LSTM baseline model without increasing the model complexity. Although the HLSTM model has stronger modeling capability and higher recognition performance, the decoding real-time factor is also one of the indicators for evaluating the performance of a recognition system. The HLSTM model exceeds the LSTM model in both parameter size and model complexity, which inevitably slows down decoding. The network information of the HLSTM model is therefore transferred to the LSTM network through posterior probabilities so as to improve the performance of the LSTM baseline model; although an unavoidable performance loss occurs during the information transfer, i.e. the performance of the "student" model is lower than that of the "teacher" model, it is still higher than the performance of a directly trained LSTM model.
The method embodiment is described below with reference to specific model parameters.
Step 1: Extract the speech features of the training data. The EM algorithm is used to iteratively update the means and variances of the GMM-HMM system, and the GMM-HMM system is used to force-align the feature data to obtain triphone clustering state labels.
Step 2: Train the bidirectional HLSTM model based on the cross-entropy criterion.
In this embodiment a six-layer bidirectional HLSTM model is used, with a parameter size of 190M. The specific configuration is as follows: the input layer has 260 nodes; the input observation vector is extended by 2 frames of context on each side; each of the four hidden layers has 1024 nodes, with recursive delays of 1, 2, 3 and 4 respectively; a 512-dimensional projection (mapping) layer is connected after each hidden layer to reduce dimensionality and the number of parameters. The output layer has 2821 nodes, corresponding to 2821 triphone clustering states.
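The layer sizes given in step 2 can be summarised with the sketch below. It uses standard torch.nn.LSTM layers with output projection as a stand-in; the HLSTM inter-layer memory cell connections and the per-layer recursive delays of 1, 2, 3 and 4 would require a custom cell and are not modelled, so this is only an approximation of the described topology rather than the patent's network.

```python
import torch
import torch.nn as nn

class ProjectedBiLSTMStack(nn.Module):
    """Approximation of the step-2 topology: 260-dim input (features with
    two frames of context on each side), four bidirectional hidden layers of
    1024 cells, a 512-dim projection after each hidden layer, and an output
    layer with 2821 triphone clustering states."""

    def __init__(self, feat_dim=260, hidden=1024, proj=512,
                 num_layers=4, num_states=2821):
        super().__init__()
        layers, in_dim = [], feat_dim
        for _ in range(num_layers):
            layers.append(nn.LSTM(in_dim, hidden, proj_size=proj,
                                  bidirectional=True, batch_first=True))
            in_dim = 2 * proj                  # forward + backward projections
        self.layers = nn.ModuleList(layers)
        self.output = nn.Linear(in_dim, num_states)

    def forward(self, x):                      # x: (batch, frames, 260)
        for lstm in self.layers:
            x, _ = lstm(x)
        return self.output(x)                  # per-frame state scores
```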
Step 3: Use the model generated in step 2 as a seed model, and iteratively update the bidirectional HLSTM model based on the state-level minimum Bayes risk criterion.
Step 4: Pass the training data through the bidirectional HLSTM model generated in step 3 for forward computation to obtain the output vectors.
Step 5: Use the output vectors obtained in step 4 as labels for the corresponding input features, and train a bidirectional LSTM model with three hidden layers, with a parameter size of 120M. The network parameters of the model are consistent with the HLSTM model in step 2.
Those skilled in the art should understand that the embodiments of the present disclosure may be provided as a method, a system, or a computer program product. Therefore, the present disclosure may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present disclosure may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage and optical storage) containing computer-usable program code.
The present disclosure is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to the embodiments of the present disclosure. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Based on this, an embodiment of the present disclosure further provides a storage medium, specifically a computer-readable storage medium, on which a computer program is stored; when the computer program is executed by a processor, the steps of the method in the embodiments of the present disclosure are implemented.
The above are only preferred embodiments of the present disclosure and are not intended to limit the protection scope of the present disclosure.
INDUSTRIAL APPLICABILITY
According to the solution provided by the embodiments of the present disclosure, a randomly initialized HLSTM model is trained based on a preset function and the training result is optimized; forward computation is performed on the training data with the optimized HLSTM model; a randomly initialized LSTM model is trained based on the result of the forward computation and the preset function, and the obtained model is an acoustic model of a speech recognition system; wherein the HLSTM model and the LSTM model have the same network parameters. The embodiments of the present disclosure transfer the network information of the optimized HLSTM model to the LSTM network through posterior probabilities, thereby improving the performance of the LSTM baseline model without increasing the complexity of the model.

Claims (12)

  1. An acoustic modeling method based on a highway long short-term memory (HLSTM) model, comprising:
    training a randomly initialized HLSTM model based on a preset function, and optimizing the training result;
    performing forward computation on training data with the HLSTM model obtained by the optimization;
    training a randomly initialized long short-term memory (LSTM) model based on the result of the forward computation and the preset function, the obtained model being an acoustic model of a speech recognition system;
    wherein the HLSTM model and the LSTM model have the same network parameters.
  2. The method according to claim 1, wherein training the randomly initialized HLSTM model based on the preset function and optimizing the training result comprises:
    training the randomly initialized HLSTM model with a cross-entropy objective function;
    optimizing the trained HLSTM model according to a state-level minimum Bayes risk criterion.
  3. The method according to claim 2, wherein the cross-entropy objective function is:
    F_{CE} = -\sum_{t=1}^{N}\sum_{y=1}^{S}\hat{y}_{t}\log p(y|X_{t})
    where F_{CE} represents the cross-entropy objective function; \hat{y}_{t} is the label value of the speech feature at time t at the output point of state y; p(y|X_{t}) is the output of the neural network for state point y given the speech feature at time t; X represents the training data; S is the number of output state points; and N is the total duration of the speech features.
  4. The method according to claim 2, wherein the objective function corresponding to the state-level minimum Bayes risk criterion is:
    F_{SMBR} = \sum_{u}\frac{\sum_{W}p(O_{u}|S)^{k}P(W)A(W,W_{u})}{\sum_{W'}p(O_{u}|S)^{k}P(W')}
    where W_{u} is the annotated text of the speech; W and W' are labels corresponding to decoding paths of the seed model; p(O_{u}|S) is the acoustic likelihood; A(W,W_{u}) represents the number of correctly labelled states in the decoded state sequence; the seed model is the HLSTM model obtained after the optimization; u is the index of the utterance in the training data; k is the acoustic score coefficient; O_{u} is the speech features of the u-th utterance; S represents the state sequence of the decoding path; and P(W) and P(W') are both language model probability scores.
  5. The method according to any one of claims 1 to 4, wherein the number of network layers of the HLSTM model is greater than or equal to the number of network layers of the LSTM model.
  6. The method according to claim 3, wherein training the randomly initialized LSTM model based on the result of the forward computation and the preset function comprises:
    obtaining the per-frame output results of the forward computation;
    training the randomly initialized LSTM model based on the per-frame output results and the cross-entropy objective function, wherein the \hat{y}_{t} in the cross-entropy objective function is the per-frame output obtained by the forward computation.
  7. An acoustic modeling apparatus based on a highway long short-term memory (HLSTM) model, comprising:
    an HLSTM model processing module, configured to train a randomly initialized HLSTM model based on a preset function and optimize the training result;
    a calculation module, configured to perform forward computation on training data with the HLSTM model obtained by the optimization;
    an LSTM model processing module, configured to train a randomly initialized long short-term memory (LSTM) model based on the result of the forward computation and the preset function, the obtained model being an acoustic model of a speech recognition system;
    wherein the HLSTM model and the LSTM model have the same network parameters.
  8. The apparatus according to claim 7, wherein the HLSTM model processing module comprises:
    a first training unit, configured to train the randomly initialized HLSTM model with a cross-entropy objective function;
    an optimization unit, configured to optimize the trained HLSTM model according to a state-level minimum Bayes risk criterion.
  9. The apparatus according to claim 8, wherein the cross-entropy objective function is:
    F_{CE} = -\sum_{t=1}^{N}\sum_{y=1}^{S}\hat{y}_{t}\log p(y|X_{t})
    where F_{CE} represents the cross-entropy objective function; \hat{y}_{t} is the label value of the speech feature at time t at the output point of state y; p(y|X_{t}) is the output of the neural network for state point y given the speech feature at time t; X represents the training data; S is the number of output state points; and N is the total duration of the speech features.
  10. The apparatus according to claim 8, wherein the objective function corresponding to the state-level minimum Bayes risk criterion is:
    F_{SMBR} = \sum_{u}\frac{\sum_{W}p(O_{u}|S)^{k}P(W)A(W,W_{u})}{\sum_{W'}p(O_{u}|S)^{k}P(W')}
    where W_{u} is the annotated text of the speech; W and W' are labels corresponding to decoding paths of the seed model; p(O_{u}|S) is the acoustic likelihood; A(W,W_{u}) represents the number of correctly labelled states in the decoded state sequence; the seed model is the HLSTM model obtained after the optimization; u is the index of the utterance in the training data; k is the acoustic score coefficient; O_{u} is the speech features of the u-th utterance; S represents the state sequence of the decoding path; and P(W) and P(W') are both language model probability scores.
  11. The apparatus according to claim 9, wherein the LSTM model processing module comprises:
    an obtaining unit, configured to obtain the per-frame output results of the forward computation;
    a second training unit, configured to train the randomly initialized LSTM model based on the per-frame output results and the cross-entropy objective function, wherein the \hat{y}_{t} in the cross-entropy objective function is the per-frame output obtained by the forward computation.
  12. A storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
PCT/CN2018/073887 2017-02-21 2018-01-23 Acoustic modeling method, apparatus and storage medium based on HLSTM model WO2018153200A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710094191.6 2017-02-21
CN201710094191.6A CN108461080A (zh) 2017-02-21 2017-02-21 一种基于hlstm模型的声学建模方法和装置

Publications (1)

Publication Number Publication Date
WO2018153200A1 true WO2018153200A1 (zh) 2018-08-30

Family

ID=63222056

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/073887 WO2018153200A1 (zh) 2017-02-21 2018-01-23 基于hlstm模型的声学建模方法、装置和存储介质

Country Status (2)

Country Link
CN (1) CN108461080A (zh)
WO (1) WO2018153200A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110517679A (zh) * 2018-11-15 2019-11-29 腾讯科技(深圳)有限公司 一种人工智能的音频数据处理方法及装置、存储介质
US11158303B2 (en) 2019-08-27 2021-10-26 International Business Machines Corporation Soft-forgetting for connectionist temporal classification based automatic speech recognition

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569700B (zh) * 2018-09-26 2020-11-03 创新先进技术有限公司 优化损伤识别结果的方法及装置
CN111709513B (zh) * 2019-03-18 2023-06-09 百度在线网络技术(北京)有限公司 长短期记忆网络lstm的训练***、方法及电子设备
CN110751941B (zh) * 2019-09-18 2023-05-26 平安科技(深圳)有限公司 语音合成模型的生成方法、装置、设备及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104538028A (zh) * 2014-12-25 2015-04-22 清华大学 一种基于深度长短期记忆循环神经网络的连续语音识别方法
CN105529023A (zh) * 2016-01-25 2016-04-27 百度在线网络技术(北京)有限公司 语音合成方法和装置
CN105810193A (zh) * 2015-01-19 2016-07-27 三星电子株式会社 训练语言模型的方法和设备及识别语言的方法和设备
CN106098059A (zh) * 2016-06-23 2016-11-09 上海交通大学 可定制语音唤醒方法及***
CN106170800A (zh) * 2014-09-12 2016-11-30 微软技术许可有限责任公司 经由输出分布来学习学生dnn
CN106328122A (zh) * 2016-08-19 2017-01-11 深圳市唯特视科技有限公司 一种利用长短期记忆模型递归神经网络的语音识别方法

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106170800A (zh) * 2014-09-12 2016-11-30 微软技术许可有限责任公司 经由输出分布来学习学生dnn
CN104538028A (zh) * 2014-12-25 2015-04-22 清华大学 一种基于深度长短期记忆循环神经网络的连续语音识别方法
CN105810193A (zh) * 2015-01-19 2016-07-27 三星电子株式会社 训练语言模型的方法和设备及识别语言的方法和设备
CN105529023A (zh) * 2016-01-25 2016-04-27 百度在线网络技术(北京)有限公司 语音合成方法和装置
CN106098059A (zh) * 2016-06-23 2016-11-09 上海交通大学 可定制语音唤醒方法及***
CN106328122A (zh) * 2016-08-19 2017-01-11 深圳市唯特视科技有限公司 一种利用长短期记忆模型递归神经网络的语音识别方法

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110517679A (zh) * 2018-11-15 2019-11-29 腾讯科技(深圳)有限公司 一种人工智能的音频数据处理方法及装置、存储介质
CN110517679B (zh) * 2018-11-15 2022-03-08 腾讯科技(深圳)有限公司 一种人工智能的音频数据处理方法及装置、存储介质
US11158303B2 (en) 2019-08-27 2021-10-26 International Business Machines Corporation Soft-forgetting for connectionist temporal classification based automatic speech recognition

Also Published As

Publication number Publication date
CN108461080A (zh) 2018-08-28

Similar Documents

Publication Publication Date Title
WO2018153200A1 (zh) 基于hlstm模型的声学建模方法、装置和存储介质
WO2022022163A1 (zh) 文本分类模型的训练方法、装置、设备及存储介质
CN107273355B (zh) 一种基于字词联合训练的中文词向量生成方法
CN110532355B (zh) 一种基于多任务学习的意图与槽位联合识别方法
CN110717334A (zh) 基于bert模型和双通道注意力的文本情感分析方法
CN111344779A (zh) 训练和/或使用编码器模型确定自然语言输入的响应动作
CN106649514A (zh) 用于受人启发的简单问答(hisqa)的***和方法
CN110321418A (zh) 一种基于深度学习的领域、意图识别和槽填充方法
CN107408111A (zh) 端对端语音识别
CN106502985A (zh) 一种用于生成标题的神经网络建模方法及装置
US10255910B2 (en) Centered, left- and right-shifted deep neural networks and their combinations
JP2019159654A (ja) 時系列情報の学習システム、方法およびニューラルネットワークモデル
WO2021208455A1 (zh) 一种面向家居口语环境的神经网络语音识别方法及***
US10529322B2 (en) Semantic model for tagging of word lattices
Lee et al. Joint learning of phonetic units and word pronunciations for ASR
Sreevidya et al. Sentiment analysis by deep learning approaches
CN114429143A (zh) 一种基于强化蒸馏的跨语言属性级情感分类方法
Sadoughi et al. Detecting section boundaries in medical dictations: toward real-time conversion of medical dictations to clinical reports
Wen Intelligent English translation mobile platform and recognition system based on support vector machine
Li et al. Biomedical named entity recognition based on the two channels and sentence-level reading control conditioned LSTM-CRF
Zhu et al. Prior knowledge driven label embedding for slot filling in natural language understanding
CN116561592A (zh) 文本情感识别模型的训练方法和文本情感识别方法及装置
Chen et al. A self-attention joint model for spoken language understanding in situational dialog applications
Zhao et al. Tibetan Multi-Dialect Speech and Dialect Identity Recognition.
CN111026848B (zh) 一种基于相似上下文和强化学习的中文词向量生成方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18757896

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18757896

Country of ref document: EP

Kind code of ref document: A1