CN112489623A - Language identification model training method, language identification method and related equipment - Google Patents

Language identification model training method, language identification method and related equipment

Info

Publication number
CN112489623A
Authority
CN
China
Prior art keywords
language
language identification
voice
identification model
spectrogram
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011287099.XA
Other languages
Chinese (zh)
Inventor
邓艳江
罗超
胡泓
李巍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ctrip Computer Technology Shanghai Co Ltd
Original Assignee
Ctrip Computer Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ctrip Computer Technology Shanghai Co Ltd filed Critical Ctrip Computer Technology Shanghai Co Ltd
Priority to CN202011287099.XA priority Critical patent/CN112489623A/en
Publication of CN112489623A publication Critical patent/CN112489623A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/005: Language recognition
    • G10L15/04: Segmentation; Word boundary detection
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/08: Speech classification or search
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: characterised by the type of extracted parameters
    • G10L25/18: the extracted parameters being spectral information of each sub-band
    • G10L25/27: characterised by the analysis technique
    • G10L25/30: using neural networks
    • G10L25/45: characterised by the type of analysis window
    • G10L25/48: specially adapted for particular use
    • G10L25/51: for comparison or discrimination
    • G10L25/60: for measuring the quality of voice signals
    • G10L25/90: Pitch determination of speech signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of speech processing, and provides a language identification model training method, a language identification method and related equipment. The language identification model training method comprises the following steps: obtaining sample data, comprising: obtaining initial speech and its target language; and preprocessing the initial speech to obtain a spectrogram; and training a language identification model, comprising: extracting spatial features of the spectrogram through a convolutional neural network; extracting temporal features from the spatial features through a recurrent neural network; performing a fully-connected operation on the spatial features based on the temporal features, and predicting language probabilities through a classifier; and adjusting parameters of the language identification model according to the language probabilities and the target language until the language identification model converges. The invention can classify the language of speech efficiently and accurately and provide data support for subsequent speech recognition.

Description

Language identification model training method, language identification method and related equipment
Technical Field
The invention relates to the technical field of speech processing, and in particular to a language identification model training method, a language identification method and related equipment.
Background
With the development of artificial intelligence technology, speech recognition has been deployed in numerous industrial scenarios.
However, some industrial scenarios involve multiple languages, while a given speech recognition model currently supports only a single language. Therefore, when the data source contains multiple languages, the language of the speech must be determined before transcription, so that the speech recognition model corresponding to that language can be selected.
At present, language discrimination is usually performed manually, by listening to the pitch, timbre and other characteristics of the sound, which is inefficient and inaccurate.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the invention, and therefore may include information that does not constitute prior art already known to a person of ordinary skill in the art.
Disclosure of Invention
In view of this, the present invention provides a language identification model training method, a language identification method and related devices, which can classify the language of speech efficiently and accurately and provide data support for subsequent speech recognition.
One aspect of the present invention provides a language identification model training method, including: obtaining sample data, comprising: obtaining initial speech and its target language; and preprocessing the initial speech to obtain a spectrogram; and training a language identification model, comprising: extracting spatial features of the spectrogram through a convolutional neural network; extracting temporal features from the spatial features through a recurrent neural network; performing a fully-connected operation on the spatial features based on the temporal features and predicting language probabilities through a classifier; and adjusting parameters of the language identification model according to the language probabilities and the target language until the language identification model converges.
In some embodiments, the training method further comprises: after the temporal features are extracted, aggregating the temporal features through an attention mechanism; the subsequent fully-connected operation on the spatial features is then performed based on the aggregated temporal features.
In some embodiments, aggregating the temporal features through the attention mechanism comprises: obtaining a hidden vector corresponding to each temporal feature; performing an attention calculation on the hidden vectors to obtain a context vector; and aggregating the temporal features according to the context vector.
In some embodiments, the attention calculation on the hidden vectors is:

$$u_{it} = \tanh(W_w h_{it} + b_w)$$

$$\alpha_{it} = \frac{\exp(u_{it}^{\top} u_w)}{\sum_{t} \exp(u_{it}^{\top} u_w)}$$

$$s_i = \sum_{t} \alpha_{it} h_{it}$$

where \(h_{it}\) is the hidden vector of the temporal feature output by the recurrent neural network at time t, \(\alpha_{it}\) is the normalized attention weight, \(s_i\) is the context vector at time i obtained by the attention calculation, and \(W_w\), \(b_w\) and \(u_w\) are parameters.
In some embodiments, the convolutional neural network comprises three layers and the recurrent neural network comprises two layers.
In some embodiments, preprocessing the initial speech includes: performing a fast Fourier transform on the initial speech frame by frame to obtain the spectrum of each frame; and combining the spectra of the frames into a spectrogram along the time sequence.
Another aspect of the present invention provides a language identification method, including: obtaining an effective voice segment of the speech to be recognized; preprocessing the effective voice segment to obtain a spectrogram; and inputting the spectrogram into a language identification model to obtain a language identification result, wherein the language identification model is generated by training through the training method of any of the above embodiments.
In some embodiments, obtaining the effective voice segment of the speech to be recognized includes: performing endpoint detection on the speech to be recognized and filtering out non-effective frames to obtain a voice segment; and padding the voice segment to a preset duration to form the effective voice segment.
Another aspect of the present invention provides a language identification model training apparatus, including: a sample data acquisition module configured to: obtain initial speech and its target language; and preprocess the initial speech to obtain a spectrogram; and a language identification model training module configured to: extract spatial features of the spectrogram through a convolutional neural network; extract temporal features from the spatial features through a recurrent neural network; perform a fully-connected operation on the spatial features based on the temporal features and predict language probabilities through a classifier; and adjust parameters of the language identification model according to the language probabilities and the target language until the language identification model converges.
Still another aspect of the present invention provides a language identification device, including: a preprocessing module, configured to obtain an effective voice segment of the speech to be recognized; a spectrogram generating module, configured to preprocess the effective voice segment to obtain a spectrogram; and a language identification module, configured to input the spectrogram into a language identification model to obtain a language identification result, the language identification model being generated by training through the training method of any of the above embodiments.
Yet another aspect of the present invention provides an electronic device, comprising: a processor; a memory having executable instructions stored therein; wherein, when executed by the processor, the executable instructions implement the language identification model training method and/or the language identification method according to any of the above embodiments.
Still another aspect of the present invention provides a computer-readable storage medium storing a program that, when executed, implements a language identification model training method and/or a language identification method according to any of the above embodiments.
Compared with the prior art, the invention has the beneficial effects that:
a spectrogram of the speech is obtained based on its frequency-domain information; the spatial and temporal features of the spectrogram are extracted through a deep-learning-based convolutional neural network and recurrent neural network; and the language of the speech is then identified through the fully-connected operation and the classifier, so that speech languages are classified efficiently and accurately and data support is provided for subsequent speech recognition.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention. It is obvious that the drawings described below are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a schematic diagram illustrating steps of a training method for language identification models according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating a network structure of a language identification model according to an embodiment of the present invention;
FIG. 3 is a block diagram of a training apparatus for language identification models according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating steps of a language identification method according to an embodiment of the present invention;
FIG. 5 is a block diagram of a language identification device according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention; and
FIG. 7 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present invention.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the drawings are merely schematic illustrations of the invention and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The step numbers in the following embodiments are merely used to indicate different execution contents, and the execution order between the steps is not strictly limited. The use of "first," "second," and similar terms in the detailed description is not intended to imply any order, quantity, or importance, but rather is used to distinguish one element from another. It should be noted that features of the embodiments of the invention and of the different embodiments may be combined with each other without conflict.
Fig. 1 shows the main steps of the language identification model training method in this embodiment. Referring to fig. 1, the training method includes: in step S110, obtaining sample data, including: S110-10, obtaining initial speech and its target language; and S110-20, preprocessing the initial speech to obtain a spectrogram; in step S120, training the language identification model, including: S120-10, extracting spatial features of the spectrogram through a convolutional neural network; S120-20, extracting temporal features from the spatial features through a recurrent neural network; S120-30, performing a fully-connected operation on the spatial features based on the temporal features and predicting language probabilities through a classifier; and S120-40, adjusting the parameters of the language identification model according to the language probabilities and the target language until the language identification model converges.
The target language is the true language corresponding to the initial speech. Preprocessing the initial speech to obtain a spectrogram specifically comprises: performing a fast Fourier transform on the initial speech frame by frame to obtain the spectrum of each frame; and combining the spectra of the frames into a spectrogram along the time sequence. Before the fast Fourier transform, the sound signal of the initial speech is also subjected to pre-operations such as pre-emphasis, framing and windowing.
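As a rough illustration of this preprocessing pipeline, the following sketch computes a spectrogram with NumPy. The frame length, hop size and pre-emphasis coefficient are assumed values chosen for illustration; the patent does not specify them.

```python
import numpy as np

def make_spectrogram(signal, frame_len=400, hop=160, pre_emphasis=0.97):
    """Pre-emphasis, framing, windowing, frame-wise FFT, and stacking.

    frame_len=400 and hop=160 (25 ms / 10 ms at 16 kHz) are assumed
    defaults; the patent does not disclose these values.
    """
    # Pre-emphasis boosts the high-frequency components.
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
    # Split the signal into overlapping frames (signal assumed >= one frame).
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    # A Hamming window reduces spectral leakage at the frame edges.
    frames = frames * np.hamming(frame_len)
    # Fast Fourier transform of each frame -> per-frame magnitude spectrum.
    spectra = np.abs(np.fft.rfft(frames, axis=1))
    # Stacking the frame spectra along time yields the spectrogram,
    # shape (time, frequency).
    return np.log(spectra + 1e-8)
```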
Speech features are diverse. Prosodic features are common, such as the fundamental frequency, formants, speech rate and energy; they reflect the continuous characteristics of speech. There are also spectral features, such as the spectrum, Mel-frequency cepstral coefficients and linear prediction cepstral coefficients, which reflect the short-term characteristics of speech. Most of these features are based on frequency-domain information, and the features that distinguish languages are mainly concentrated in the frequency domain. Therefore, the time-domain signal of the initial speech is Fourier-transformed to the frequency domain before the feature extraction of the subsequent steps. The initial speech is Fourier-transformed frame by frame to obtain the frame spectra, and all the spectra are then spliced into a spectrogram in time order to facilitate feature extraction.
As for feature extraction, once a speech signal has been processed into a spectrogram it carries both spatial and temporal structure, so the spatial and temporal features of the spectrogram need to be extracted together. Deep learning is widely applied in the field of artificial intelligence, and deep models achieve good feature extraction as their layers widen and deepen. In this embodiment, a convolutional neural network serves as the extractor of spatial features, and a recurrent neural network serves as the extractor of temporal features.
In a specific implementation, the spectrogram is treated as an image, and a deep convolutional neural network extracts its spatial features. In a specific example, multiple layers of convolution kernels can be adopted: the multiple kernels within each layer make the extracted spatial features richer, while stacking the layers makes the features more discriminative and reduces their dimensionality. A deep recurrent neural network then extracts temporal features on the basis of the spatial features extracted in the preceding step.
Furthermore, after the temporal features are extracted, they are aggregated through an attention mechanism, so that the subsequent fully-connected operation on the spatial features is performed based on the aggregated temporal features.
The process of aggregating the temporal features through the attention mechanism specifically includes: obtaining the hidden vector corresponding to each temporal feature; performing an attention calculation on the hidden vectors to obtain a context vector; and aggregating the temporal features according to the context vector.
The attention calculation on the hidden vectors is as follows:

$$u_{it} = \tanh(W_w h_{it} + b_w)$$

$$\alpha_{it} = \frac{\exp(u_{it}^{\top} u_w)}{\sum_{t} \exp(u_{it}^{\top} u_w)}$$

$$s_i = \sum_{t} \alpha_{it} h_{it}$$

where \(h_{it}\) is the hidden vector of the temporal feature output by the recurrent neural network at time t, \(\alpha_{it}\) is the normalized attention weight, and \(s_i\) is the context vector at time i obtained by the attention calculation; \(W_w\), \(b_w\) and \(u_w\) are parameters, \(\tanh\) is the hyperbolic tangent function, \(u_{it}^{\top}\) is the transpose of the vector \(u_{it}\), and \(\exp\) is the exponential function.
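For concreteness, this attention aggregation could be implemented as in the following PyTorch sketch; it is a hypothetical implementation consistent with the formulas above, not code disclosed by the patent.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Aggregates the GRU hidden vectors h_t into one context vector s,
    following u_t = tanh(W_w h_t + b_w), alpha_t = softmax_t(u_t^T u_w),
    s = sum_t alpha_t * h_t."""

    def __init__(self, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)      # W_w and b_w
        self.u_w = nn.Parameter(torch.randn(hidden_dim))   # context query u_w

    def forward(self, h):                 # h: (batch, time, hidden_dim)
        u = torch.tanh(self.proj(h))      # (batch, time, hidden_dim)
        scores = u @ self.u_w             # (batch, time)
        alpha = torch.softmax(scores, dim=1)
        # Weighted sum over time screens and aggregates the hidden vectors.
        return (alpha.unsqueeze(-1) * h).sum(dim=1)   # (batch, hidden_dim)
```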
After the features extracted in the preceding steps have been screened and aggregated by the attention mechanism, language identification is carried out by a classifier, which may employ the SoftMax function.
Fig. 2 shows the network structure of the language identification model in this embodiment. Referring to fig. 2, in a specific example the language identification model 200 includes a three-layer convolutional neural network 210 and a two-layer recurrent neural network 220. In this embodiment, the convolutional neural network 210 is specifically Conv1d (one-dimensional convolution), and the recurrent neural network 220 is specifically a GRU (Gated Recurrent Unit) network, which has the advantages of few parameters and fast computation. The language identification model 200 also includes an attention mechanism 230, a Dense layer (fully-connected neural network layer) 240, and a SoftMax classifier 250.
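The structure in fig. 2 could be sketched as below, reusing the AttentionPool module from the previous sketch. The channel counts, kernel sizes, strides and number of languages are assumptions for illustration; the patent does not disclose these hyperparameters.

```python
class LanguageIdModel(nn.Module):
    """Three Conv1d layers -> two-layer GRU -> attention -> Dense -> SoftMax.
    All layer sizes here are assumed values, not taken from the patent."""

    def __init__(self, n_freq_bins=201, hidden_dim=128, n_languages=4):
        super().__init__()
        # Conv1d treats frequency bins as input channels and convolves over
        # time; stride > 1 compresses the time dimension, saving computation.
        self.conv = nn.Sequential(
            nn.Conv1d(n_freq_bins, 256, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv1d(256, hidden_dim, kernel_size=5, stride=2), nn.ReLU(),
        )
        # The GRU extracts temporal features from the spatial features.
        self.gru = nn.GRU(hidden_dim, hidden_dim, num_layers=2,
                          batch_first=True)
        self.attention = AttentionPool(hidden_dim)
        self.dense = nn.Linear(hidden_dim, n_languages)

    def forward(self, spec):                   # spec: (batch, time, freq)
        x = self.conv(spec.transpose(1, 2))    # (batch, channels, time')
        h, _ = self.gru(x.transpose(1, 2))     # (batch, time', hidden)
        s = self.attention(h)                  # (batch, hidden)
        return self.dense(s)                   # logits; SoftMax in the loss
```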
During training, first, the speech signal of the initial speech is preprocessed, including pre-emphasis, framing, windowing and fast Fourier transform, to obtain the spectrum of each frame. Second, the frame spectra are spliced along the time sequence into a spectrogram. Third, the 3-layer Conv1d 210 extracts the spatial features of the spectrogram; in addition, the Conv1d 210 gathers the temporal features, compressing the time dimension and saving subsequent computation. Fourth, the 2-layer GRU network 220 extracts temporal features based on the preceding spatial features. Fifth, the attention mechanism 230 screens and aggregates the hidden vectors of the time steps. Sixth, after the calculation of the fully-connected layer 240, classification is performed with the SoftMax function 250 to predict the probability that the speech signal belongs to each language; the relevant model parameters are then adjusted according to the prediction and the true language of the initial speech until the language identification model converges and outputs predictions that match the true language.
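One training step under this procedure might look like the sketch below. Cross-entropy (which folds in the SoftMax) and Adam are assumed choices; the patent does not name a loss function or an optimizer.

```python
model = LanguageIdModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()  # applies SoftMax + negative log-likelihood

def train_step(spectrograms, target_languages):
    """spectrograms: (batch, time, freq); target_languages: (batch,) ints."""
    logits = model(spectrograms)
    # Compare the predicted language probabilities with the true language.
    loss = criterion(logits, target_languages)
    optimizer.zero_grad()
    loss.backward()        # gradients used to adjust the model parameters
    optimizer.step()
    return loss.item()
```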
The language identification model training method described in each of the above embodiments obtains a spectrogram of the speech based on its frequency-domain information; extracts the spatial and temporal features of the spectrogram through a deep-learning-based convolutional neural network and recurrent neural network; and then identifies the language of the speech through the fully-connected operation and the classifier, thereby classifying speech languages efficiently and accurately and providing data support for subsequent speech recognition.
An embodiment of the invention further provides a language identification model training apparatus for implementing the language identification model training method described in any of the above embodiments.
Fig. 3 shows the main modules of the language identification model training apparatus. Referring to fig. 3, the training apparatus 300 in this embodiment includes: a sample data acquisition module 310 configured to: obtain initial speech and its target language; and preprocess the initial speech to obtain a spectrogram; and a language identification model training module 320 configured to: extract spatial features of the spectrogram through a convolutional neural network; extract temporal features from the spatial features through a recurrent neural network; perform a fully-connected operation on the spatial features based on the temporal features and predict language probabilities through a classifier; and adjust the parameters of the language identification model according to the language probabilities and the target language until the language identification model converges.
For the specific principles of the modules, reference may be made to the above language identification model training method embodiments; the description is not repeated here.
An embodiment of the invention further provides a language identification method, which can identify the language of speech through a language identification model generated by training according to any of the above language identification model training method embodiments.
Fig. 4 shows the main steps of the language identification method. Referring to fig. 4, the language identification method in this embodiment includes: step S410, obtaining an effective voice segment of the speech to be recognized; step S420, preprocessing the effective voice segment to obtain a spectrogram; and step S430, inputting the spectrogram into the language identification model to obtain a language identification result, the language identification model being generated by training according to any of the above language identification model training method embodiments.
Obtaining the effective voice segment of the speech to be recognized specifically includes: performing endpoint detection, namely Voice Activity Detection (VAD), on the head and tail of the speech, and filtering out non-effective frames to obtain a voice segment; and padding the voice segment to a preset duration, that is, for speech shorter than the preset duration, copying the existing voice segment to form the effective voice segment. The spectrogram is obtained in the same way as in the language identification model training method embodiments, through operations such as framing, windowing and Fourier transform. The spectrogram is input into the language identification model, and the language with the maximum predicted probability is taken as the language identification result.
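The padding step can be sketched as follows; the preset duration is an assumed value, and the VAD itself is left to an external tool since the patent does not specify an algorithm.

```python
import numpy as np

def pad_to_duration(segment, sample_rate=16000, preset_seconds=10.0):
    """Repeat an effective voice segment until it reaches the preset
    duration, then truncate. preset_seconds=10.0 is an assumed value."""
    target_len = int(sample_rate * preset_seconds)
    if len(segment) == 0:
        return np.zeros(target_len)
    # Copy (tile) the existing segment for speech shorter than the preset
    # duration, as described above.
    repeats = int(np.ceil(target_len / len(segment)))
    return np.tile(segment, repeats)[:target_len]
```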
The language identification method of this embodiment can be applied, in particular, to hotel telephone recordings of an online travel agency, to analyze their language and provide a basis for selecting the speech recognition model of the corresponding language.
In the language identification method of this embodiment, a spectrogram of the speech is obtained based on its frequency-domain information; the spatial and temporal features of the spectrogram are extracted through a deep-learning-based convolutional neural network and recurrent neural network; and the language of the speech is then identified through the fully-connected operation and the classifier, so that speech languages are classified efficiently and accurately and data support is provided for subsequent speech recognition.
An embodiment of the invention further provides a language identification device for implementing the language identification method described in the above embodiments.
Fig. 5 shows the main modules of the language identification device. Referring to fig. 5, the language identification device 500 in this embodiment includes: a preprocessing module 510, configured to obtain an effective voice segment of the speech to be recognized; a spectrogram generating module 520, configured to preprocess the effective voice segment to obtain a spectrogram; and a language identification module 530, configured to input the spectrogram into a language identification model to obtain a language identification result, the language identification model being generated by training according to the above language identification model training method embodiments.
For the specific principles of the modules, reference may be made to the above language identification method embodiments; the description is not repeated here.
An embodiment of the present invention further provides an electronic device, which includes a processor and a memory, where the memory stores executable instructions, and when the executable instructions are executed by the processor, the method for training a language identification model and/or a language identification method described in any of the above embodiments is implemented.
As described above, the electronic device of the present invention can obtain a spectrogram of the speech based on its frequency-domain information; extract the spatial and temporal features of the spectrogram through a deep-learning-based convolutional neural network and recurrent neural network; and then identify the language of the speech through the fully-connected operation and the classifier, thereby classifying speech languages efficiently and accurately and providing data support for subsequent speech recognition.
Fig. 6 is a schematic structural diagram of an electronic device in an embodiment of the present invention. It should be understood that fig. 6 only schematically illustrates the various modules, which may be virtual software modules or actual hardware modules; combining or splitting these modules, or adding further modules, falls within the scope of the present invention.
As shown in fig. 6, the electronic device 600 is embodied in the form of a general purpose computing device. The components of the electronic device 600 include, but are not limited to: at least one processing unit 610, at least one memory unit 620, a bus 630 connecting the different platform components (including the memory unit 620 and the processing unit 610), a display unit 640, etc.
The storage unit stores a program code, and the program code can be executed by the processing unit 610, so that the processing unit 610 executes the steps of the language identification model training method and/or the language identification method described in any of the embodiments. For example, processing unit 610 may perform the steps shown in fig. 1 and 4.
The storage unit 620 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM)6201 and/or a cache memory unit 6202, and may further include a read-only memory unit (ROM) 6203.
The memory unit 620 may also include programs/utilities 6204 including one or more program modules 6205, such program modules 6205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 630 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 600 may also communicate with one or more external devices 700, which may be one or more of a keyboard, a pointing device, a bluetooth device, and the like. The external devices 700 enable a user to interact with the electronic device 600. The electronic device 600 may also communicate with one or more other computing devices through devices such as routers and modems. Such communication may occur via an input/output (I/O) interface 650. Also, the electronic device 600 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via the network adapter 660. The network adapter 660 may communicate with other modules of the electronic device 600 via the bus 630. It should be appreciated that, although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage platforms.
The embodiment of the present invention further provides a computer-readable storage medium, which is used for storing a program, and when the program is executed, the method for training a language identification model and/or the method for language identification described in any of the above embodiments are implemented. In some possible embodiments, the aspects of the present invention may also be implemented in the form of a program product, which includes program code for causing a terminal device to execute the language identification model training method and/or the language identification method described in any of the above embodiments, when the program product runs on the terminal device.
As described above, the computer-readable storage medium of the present invention can obtain a spectrogram of the speech based on its frequency-domain information; extract the spatial and temporal features of the spectrogram through a deep-learning-based convolutional neural network and recurrent neural network; and then identify the language of the speech through the fully-connected operation and the classifier, thereby classifying speech languages efficiently and accurately and providing data support for subsequent speech recognition.
Fig. 7 is a schematic structural diagram of a computer-readable storage medium of the present invention. Referring to fig. 7, a program product 800 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of readable storage media include, but are not limited to: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium other than a readable storage medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device, such as through the internet using an internet service provider.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (12)

1. A training method of language identification model is characterized by comprising the following steps:
obtaining sample data, comprising:
obtaining initial voice and target language thereof;
preprocessing the initial voice to obtain a spectrogram;
training a language identification model, comprising:
extracting the spatial features of the spectrogram through a convolutional neural network;
extracting time sequence characteristics of the spatial characteristics through a recurrent neural network;
performing full-connection operation on the spatial features based on the time sequence features, and predicting language probability through a classifier; and
and adjusting parameters of the language identification model according to the language probability and the target language until the language identification model converges.
2. The training method of claim 1, further comprising:
after the time sequence features are extracted, aggregating the time sequence features through an attention mechanism;
and when the spatial features are subjected to full connection operation, performing full connection operation on the spatial features based on the aggregated time sequence features.
3. The training method of claim 2, wherein the aggregating the timing features by an attention mechanism comprises:
obtaining a hidden vector corresponding to each time sequence feature;
performing attention calculation on the hidden vector to obtain a context vector; and
and aggregating the time sequence characteristics according to the context vector.
4. A training method as claimed in claim 3, wherein the formula for calculating the attention of the hidden vector is:
$$u_{it} = \tanh(W_w h_{it} + b_w)$$

$$\alpha_{it} = \frac{\exp(u_{it}^{\top} u_w)}{\sum_{t} \exp(u_{it}^{\top} u_w)}$$

$$s_i = \sum_{t} \alpha_{it} h_{it}$$

wherein \(h_{it}\) is the hidden vector corresponding to the time sequence feature output by the recurrent neural network at time t, \(s_i\) is the context vector at time i obtained by the attention calculation, and \(W_w\), \(b_w\) and \(u_w\) are parameters.
5. The training method of claim 1, wherein the convolutional neural network comprises three layers and the recurrent neural network comprises two layers.
6. The training method of claim 1, wherein the pre-processing the initial speech comprises:
performing fast Fourier transform on the initial voice according to frames to obtain frequency spectrums of the frames; and
and combining the frequency spectrums of each frame into a spectrogram along a time sequence.
7. A language identification method, comprising:
obtaining an effective voice segment of a voice to be recognized;
preprocessing the effective voice fragments to obtain a spectrogram; and
inputting the spectrogram into a language identification model to obtain a language identification result, wherein the language identification model is generated by training according to the training method of any one of claims 1 to 6.
8. The language identification method of claim 7, wherein obtaining the effective voice segment of the speech to be recognized comprises:
performing endpoint detection on the speech to be recognized, and filtering out non-effective frames to obtain a voice segment;
and padding the voice segment to a preset duration to form an effective voice segment.
9. A training device for language identification models, comprising:
a sample data acquisition module configured to:
obtaining initial voice and target language thereof;
preprocessing the initial voice to obtain a spectrogram;
a language recognition model training module configured to:
extracting the spatial features of the spectrogram through a convolutional neural network;
extracting time sequence characteristics of the spatial characteristics through a recurrent neural network;
performing full-connection operation on the spatial features based on the time sequence features, and predicting language probability through a classifier; and
and adjusting parameters of the language identification model according to the language probability and the target language until the language identification model converges.
10. A language identification device, comprising:
a preprocessing module, configured to obtain an effective voice segment of the speech to be recognized;
a spectrogram generating module, configured to preprocess the effective voice segment to obtain a spectrogram; and
a language identification module, configured to input the spectrogram into a language identification model to obtain a language identification result, wherein the language identification model is generated by training according to the training method of any one of claims 1 to 6.
11. An electronic device, comprising:
a processor;
a memory having executable instructions stored therein;
wherein the executable instructions, when executed by the processor, implement a method of training a language identification model according to any one of claims 1 to 6, and/or implement a language identification method according to claim 7 or 8.
12. A computer-readable storage medium storing a program which, when executed, implements a language identification model training method according to any one of claims 1 to 6, and/or implements a language identification method according to claim 7 or 8.
CN202011287099.XA 2020-11-17 2020-11-17 Language identification model training method, language identification method and related equipment Pending CN112489623A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011287099.XA CN112489623A (en) 2020-11-17 2020-11-17 Language identification model training method, language identification method and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011287099.XA CN112489623A (en) 2020-11-17 2020-11-17 Language identification model training method, language identification method and related equipment

Publications (1)

Publication Number Publication Date
CN112489623A true CN112489623A (en) 2021-03-12

Family

ID=74931613

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011287099.XA Pending CN112489623A (en) 2020-11-17 2020-11-17 Language identification model training method, language identification method and related equipment

Country Status (1)

Country Link
CN (1) CN112489623A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109523993A (en) * 2018-11-02 2019-03-26 成都三零凯天通信实业有限公司 A kind of voice languages classification method merging deep neural network with GRU based on CNN
CN110782872A (en) * 2019-11-11 2020-02-11 复旦大学 Language identification method and device based on deep convolutional recurrent neural network
CN110909131A (en) * 2019-11-26 2020-03-24 携程计算机技术(上海)有限公司 Model generation method, emotion recognition method, system, device and storage medium
CN111009262A (en) * 2019-12-24 2020-04-14 携程计算机技术(上海)有限公司 Voice gender identification method and system
CN111554281A (en) * 2020-03-12 2020-08-18 厦门中云创电子科技有限公司 Vehicle-mounted man-machine interaction method for automatically identifying languages, vehicle-mounted terminal and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113327584A (en) * 2021-05-28 2021-08-31 平安科技(深圳)有限公司 Language identification method, device, equipment and storage medium
CN113327584B (en) * 2021-05-28 2024-02-27 平安科技(深圳)有限公司 Language identification method, device, equipment and storage medium
CN114429766A (en) * 2022-01-29 2022-05-03 北京百度网讯科技有限公司 Method, device and equipment for adjusting playing volume and storage medium
CN115831094A (en) * 2022-11-08 2023-03-21 北京数美时代科技有限公司 Multilingual voice recognition method, system, storage medium and electronic equipment
CN115831094B (en) * 2022-11-08 2023-08-15 北京数美时代科技有限公司 Multilingual voice recognition method, multilingual voice recognition system, storage medium and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination