CN115188366A - Language identification method and device based on deep learning and readable storage medium - Google Patents

Language identification method and device based on deep learning and readable storage medium Download PDF

Info

Publication number
CN115188366A
Authority
CN
China
Prior art keywords
language
identification
model
real
call
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210519620.0A
Other languages
Chinese (zh)
Inventor
黄诗雅
罗睦军
朱栩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Yunqu Information Technology Co ltd
Original Assignee
Guangzhou Yunqu Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Yunqu Information Technology Co ltd filed Critical Guangzhou Yunqu Information Technology Co ltd
Priority to CN202210519620.0A priority Critical patent/CN115188366A/en
Publication of CN115188366A publication Critical patent/CN115188366A/en
Pending legal-status Critical Current


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/005: Language recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/08: Speech classification or search
    • G10L 15/16: Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure provides a language identification method and device based on deep learning, and a readable storage medium. The method comprises the following steps: a language tag obtaining step of obtaining a historical call recording set, identifying the language category of each call recording in the set, and tagging the corresponding call recording with a language tag according to the identified category, to obtain an audio data set comprising a plurality of language-tagged call recordings; a model training step of training the model parameters of a set model on the audio data set to obtain a recognition model dedicated to language identification; and a language identification step of inputting real-time speech into the recognition model for language identification, and obtaining and outputting the language category of the real-time speech. The method achieves higher accuracy and faster recognition in a subdivided field, requires no large-scale manual language labeling, does not depend on operators to provide training corpora, and saves substantial labor and time costs.

Description

Language identification method and device based on deep learning and readable storage medium
Technical Field
The present disclosure relates to the fields of telecommunications, speech recognition, data feature processing, and deep learning, and more particularly to a language identification method and device based on deep learning, and a readable storage medium.
Background
Existing agent-assistant products need to identify the language spoken while a calling user interacts with an agent. Only when the language is identified correctly can other configured functions correctly capture the content of the user's request, shortening the time the agent spends entering information manually and relieving the agent's workload. Existing language identification methods mainly operate on the text produced by real-time ASR (Automatic Speech Recognition) transcription of the call, identifying the language from that text.
Using call transcription text for language identification has two drawbacks: 1. the recognition accuracy depends heavily on whether the ASR transcription is accurate; 2. identification can only begin after the real-time ASR transcription completes, so it is far slower than identifying the language directly from speech, and the delay of text-based language identification is clearly perceptible to the agent.
Alternatively, an external language identification API interface can be used for real-time language identification. Although recognition is fast, such an API targets general service scenarios; in a subdivided industry, such as the customer-service field or where speakers have heavy regional accents, its accuracy drops significantly, to only about 64%, which cannot meet the requirements of a specific subdivided field.
In addition, although deep-learning models achieve high accuracy and support high concurrency online, they typically rely on extensive manual labeling of language categories to produce training corpora; at least 30,000 to 40,000 labeled samples are needed, consuming large amounts of labor and time.
Disclosure of Invention
In view of the slow speed of current language identification, the large amount of manual labeling required to produce training corpora, and the poor performance of external language identification APIs in subdivided fields, the present disclosure combines deep learning with speech recognition to provide a method for realizing language identification based on deep learning.
According to a first aspect of the present disclosure, there is provided a method for implementing language identification based on deep learning, the method including:
a language tag obtaining step, namely obtaining a historical call recording set, identifying the language type of each call recording in the historical call recording set, and marking a language tag for the corresponding call recording according to the identified language type to obtain an audio data set comprising a plurality of call recordings with the language tags;
a model training step, in which model parameters of a set model are trained through the audio data set to obtain a recognition model special for language recognition;
and a language identification step, namely inputting real-time voice into the identification model for language identification to obtain and output the language category of the real-time voice.
Therefore, the training corpus can be obtained without manual labeling, and language identification can be performed without converting speech into text, saving labor and time.
Optionally, the set model is a speech feature model implemented based on a deep neural network.
Optionally, the model training step further includes a feature extraction step of extracting PLP feature parameters of the audio data set, and performing model training on the set model by using the PLP feature parameters.
Optionally, in the language identification step, the PLP feature parameters of the real-time speech are extracted, and the extracted PLP feature parameters of the real-time speech are input to the identification model to perform language identification on the real-time speech.
Optionally, the method further includes a filtering step before the model training step, where the filtering step includes at least one of an error category rejection step, a recognition degree screening step, and a duration screening step; wherein:
an error category removing step of deleting the call record with a specific language tag from the audio data set, wherein the specific language tag is a tag which does not belong to a preset language category set;
an identification degree screening step, namely deleting the call records with the language category identification accuracy rate lower than a set threshold value from the audio data set;
and a time length screening step, namely reading the time length of each call record in the audio data set by using a librosa audio processing library, and deleting the call records with the time length less than 3 seconds.
By the filtering step, the training corpora can be screened aiming at the subdivision field, so that modeling is completed more efficiently, and higher identification accuracy is obtained.
Optionally, the feature extraction step includes:
reading, namely reading each call record by using a librosa audio processing library;
and a parameter acquisition step, namely acquiring PLP characteristic parameters for each read call record by using a PLP technology.
Optionally, in the language identification step, the set model obtains probabilities that the real-time speech belongs to each language, and outputs a language category corresponding to a maximum value of the probabilities as an identification result.
Optionally, in the language identification step, when the maximum value of the probability is greater than a preset output threshold, the language category corresponding to the maximum value of the probability is output as the identification result.
According to a second aspect of the present disclosure, there is provided a language identification apparatus based on deep learning, the apparatus including:
the system comprises a language tag acquisition module, a language tag storage module and a language tag processing module, wherein the language tag acquisition module is used for acquiring a historical call recording set, identifying the language category of each call recording in the historical call recording set, and marking a language tag for the corresponding call recording according to the identified language category to obtain an audio data set comprising a plurality of call recordings with the language tags;
the model training module is used for training the model parameters of the set model through the audio data set to obtain a recognition model special for language recognition;
and the language identification module is used for inputting the real-time voice into the recognition model to carry out language identification, and obtaining and outputting the language category of the real-time voice.
According to a third aspect of the present disclosure, there is provided a readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method according to the first aspect of the present disclosure.
By the method of this embodiment, production of the training corpus can be completed automatically through speech noise reduction and an existing language identification API; the training corpus is feature-modeled through deep learning, and finally real-time call recordings are analyzed for language identification. A model for a subdivided field can thereby be established, achieving higher identification accuracy and faster identification, reducing the pressure of manual listening, and saving manpower.
In addition, because existing language identification API interfaces are charged for, using them for language identification in a service scenario continuously generates fees. With the method of this embodiment, a fee is incurred only when the language tags are obtained; once the training corpus has been produced, recognition is performed with the recognition model obtained according to the present disclosure, without relying on the language identification API interface, so no further fees are incurred in that respect and cost is reduced.
Other features of the present disclosure and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a hardware configuration diagram of a language identification device that can be used to implement an embodiment of the present disclosure.
FIG. 2 illustrates a schematic diagram of a language identification method according to the present disclosure.
FIG. 3 illustrates a flow chart of language identification using the language identification method according to the present disclosure.
Fig. 4 is a system architecture diagram of an embodiment of the present disclosure.
Fig. 5 shows a functional block diagram of a language identification device according to the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
< hardware configuration >
Fig. 1 is a schematic hardware configuration diagram of a language identification apparatus 1000 that can be used to implement the language identification method according to an embodiment of the present disclosure.
The language identification device 1000 may include, but is not limited to, a processor 1100, a memory 1200, an interface unit 1300, a communication unit 1400, a display unit 1500, an input unit 1600, a speaker 1700, a microphone 1800, and the like. The processor 1100 may be a central processing unit CPU, a graphics processing unit GPU, a microprocessor MCU, or the like, and is configured to execute a computer program, where the computer program may be written by using an instruction set of architectures such as x86, arm, RISC, MIPS, and SSE. The memory 1200 includes, for example, a ROM (read only memory), a RAM (random access memory), a nonvolatile memory such as a hard disk, and the like. The interface unit 1300 includes, for example, a USB interface, a serial interface, a parallel interface, and the like. The communication unit 1400 is capable of wired communication using an optical fiber or a cable, for example, or wireless communication, and specifically may include WiFi communication, bluetooth communication, 2G/3G/4G/5G communication, and the like. The display unit 1500 is, for example, a liquid crystal display, a touch display, or the like. The input unit 1600 may include, for example, a touch screen, a keyboard, a somatosensory input, and the like. The speaker 1700 is used to output an audio signal. The microphone 1800 is used to collect audio signals.
The memory 1200 of the language identification device 1000 is used to store a computer program for controlling the processor 1100 to operate to implement the language identification method according to the embodiment of the present disclosure. The skilled person can design the computer program according to the disclosed solution of the present disclosure. How the computer program controls the processor to operate is well known in the art and will not be described in detail here. The language identification device 1000 may be installed with an intelligent operating system (e.g., windows, linux, android, IOS, etc.) and application software.
It should be understood by those skilled in the art that although a plurality of units of the language identification apparatus 1000 are illustrated in fig. 1, the language identification apparatus 1000 of the embodiment of the present disclosure may refer to only some of the units, for example, only the processor 1100 and the memory 1200, etc.
Various embodiments and examples according to the present invention are described below with reference to the accompanying drawings.
< method examples >
The principle and system structure of the language identification method according to the present disclosure will be described below with reference to fig. 2 and 4. FIG. 2 illustrates a schematic diagram of a language identification method according to the present disclosure. Fig. 4 shows a system architecture diagram of an embodiment of the present disclosure. The language identification method according to the present disclosure includes a language label acquisition step (S201), a model training step (S202), and a language identification step (S203).
In step S201, historical call recordings are downloaded via FTP to obtain a historical call recording set. The language category of each call recording in the set is then identified using a language identification API interface; preferably, when the language category is obtained through the API interface, the accuracy of that identification is obtained as well. Language tags are then applied to the corresponding call recordings according to the identified categories, yielding an audio data set comprising a plurality of language-tagged call recordings. In the following description, this audio data set is also referred to as the "training corpus".
Model training is performed in step S202, including: feature extraction is performed on the corpus obtained in step S201, feature parameters such as PLP are obtained from the speech signal, and the model parameters of a set model (for example, a speech feature model based on a fully-connected feedforward neural network) are trained with the PLP feature parameters to obtain a recognition model dedicated to language identification.
In step S203, the real-time speech is acquired through, for example, FTP, the PLP feature parameters of the real-time speech are extracted and input to the recognition model obtained in step S202 for language recognition, and the language category of the real-time speech is obtained and output.
The language identification method according to the present disclosure is explained in detail below with reference to fig. 3. FIG. 3 illustrates a flow chart of language identification using the language identification method according to the present disclosure.
In step S301, historical call recording files are downloaded via FTP to obtain the historical call recordings.
In step S302, the language category of each historical call recording is identified through the language identification API interface, and a language tag is applied to the recording. The recognition accuracy can be obtained through the API at the same time. For example, the result "Mandarin, 0.9" may be returned through the language identification API interface, where "Mandarin" is the language category and "0.9" is the recognition accuracy.
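A minimal Python sketch of this labeling flow follows. The function identify_language is a hypothetical stand-in for the external language identification API interface (the patent does not name a vendor or client library); it is assumed to return a (category, accuracy) pair such as ("Mandarin", 0.9).

import os

def identify_language(wav_path):
    """Hypothetical wrapper around the external language identification API.

    Assumed to return a (language_category, recognition_accuracy) pair,
    e.g. ("Mandarin", 0.9), as in the example above.
    """
    raise NotImplementedError("replace with the vendor's API call")

def build_corpus(recording_dir):
    corpus = []
    for name in sorted(os.listdir(recording_dir)):
        if not name.endswith(".wav"):
            continue
        path = os.path.join(recording_dir, name)
        language, accuracy = identify_language(path)
        # Keep the tag and the reported accuracy; both are used when filtering.
        corpus.append({"path": path, "language": language, "accuracy": accuracy})
    return corpus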
In step S303, the corpus is filtered. Filtering may be performed by any one of the following steps, or by any combination of them.
(1) An error category rejection step. A language category set is predefined; for example, it includes only "Mandarin" and "Cantonese". If a recording was tagged with, for example, "English" in step S302, then since "English" does not belong to the predefined language category set, the call recording tagged "English" is deleted from the corpus. The error category rejection step allows speech features to be extracted for the subdivided field, making model training more efficient.
(2) A recognition degree screening step: call recordings whose language category recognition accuracy is below a set threshold are deleted from the training corpus. For example, with the accuracy threshold set to 0.8, all corpus entries with recognition accuracy below 0.8 are deleted in this step. A more accurate corpus is obtained as a result.
(3) A duration screening step: the duration of each call recording in the training corpus is read using the librosa audio processing library, and call recordings shorter than 3 seconds are deleted.
Through the corpus filtering step, a high-quality corpus is obtained, and feature-model training can be completed more accurately and quickly. A minimal sketch of this filtering pipeline is given below.
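The following Python sketch strings the three filtering steps together. It assumes the corpus entries produced by the labeling sketch above; the {"Mandarin", "Cantonese"} category set, the 0.8 accuracy threshold, and the 3-second minimum are the example values from the text, not fixed by the claims.

import librosa

ALLOWED_LANGUAGES = {"Mandarin", "Cantonese"}  # predefined language category set
MIN_ACCURACY = 0.8                             # recognition degree threshold
MIN_DURATION_S = 3.0                           # duration threshold in seconds

def filter_corpus(corpus):
    kept = []
    for entry in corpus:
        # (1) error category rejection: drop tags outside the predefined set
        if entry["language"] not in ALLOWED_LANGUAGES:
            continue
        # (2) recognition degree screening: drop low-confidence labels
        if entry["accuracy"] < MIN_ACCURACY:
            continue
        # (3) duration screening via librosa (>= 0.10; older versions use filename=)
        if librosa.get_duration(path=entry["path"]) < MIN_DURATION_S:
            continue
        kept.append(entry)
    return kept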
The details will be described below.
In step S304, a PLP (perceptual linear prediction) technique is used to extract speech features from the filtered corpus, and PLP feature parameters are obtained.
The extraction of the PLP feature parameters comprises the following steps (a code sketch is given after the list):
(1) A reading step: each call recording is read into memory using the librosa audio processing library;
(2) A parameter acquisition step: PLP technology is used to acquire PLP feature parameters for each call recording that has been read.
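A sketch of the two steps, under stated assumptions: librosa itself has no PLP extractor, so the third-party spafe package's plp function is used here as one possible implementation (an assumption; any PLP implementation would do). Thirteen cepstra plus first- and second-order deltas give the 39 dimensions per frame used in the splicing described below.

import librosa
import numpy as np
from spafe.features.rplp import plp  # assumed third-party PLP implementation

def extract_plp(wav_path):
    # (1) reading step: load the call recording into memory with librosa
    signal, sr = librosa.load(wav_path, sr=8000)  # call audio is commonly 8 kHz
    # (2) parameter acquisition step: 13 PLP cepstra per frame (spafe default)
    ceps = plp(signal, fs=sr)                     # shape: (num_frames, 13)
    d1 = librosa.feature.delta(ceps, axis=0)      # first-order deltas
    d2 = librosa.feature.delta(ceps, order=2, axis=0)
    return np.hstack([ceps, d1, d2])              # shape: (num_frames, 39)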
In step S305, model training is performed using the PLP feature parameters. This embodiment uses a fully-connected neural network (FNN) model comprising an input layer, hidden layers, and an output layer. The FNN performs a series of linear operations and activation operations on inputs and weights; the neurons in each layer receive the outputs of the neurons in the previous layer and feed the neurons in the next layer. Along each connection, the previous layer's output is weighted and passed through a nonlinear activation function, realizing a mapping from the input space to the output space.
When model training is performed, the PLP feature parameters extracted in step S304 are fed into the FNN input layer using a multi-frame splicing method: the current frame is spliced with, for example, the following 10 frames, so that 11 consecutive frames are spliced into 1 frame; the same splicing is then repeated starting from the next frame. Each frame has 39 dimensions, so 11 consecutive frames (i.e., 1 spliced frame) give 39 × 11 = 429 dimensions of spliced PLP feature parameters.
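A sketch of this splicing, assuming the 39-dimensional PLP frames from the extract_plp sketch above: each network input stacks the current frame with the following 10 frames.

import numpy as np

def splice_frames(feats, context=10):
    # feats: (num_frames, 39) PLP features; output: (num_frames - context, 429)
    num_frames = feats.shape[0]
    spliced = [feats[i:i + context + 1].reshape(-1)  # 11 frames x 39 dims = 429
               for i in range(num_frames - context)]
    return np.stack(spliced)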
Finally, the spliced PLP feature parameters enter the hidden layers, where each neuron applies a weighting. Taking, for example, two hidden layers in total, the number of weights w1 from the input layer to the first hidden layer is computed as:

w1 = v × h

where v is the number of neurons in the input layer and h is the number of neurons in a hidden layer.

The number of weights w2 from the first hidden layer to the second hidden layer is computed as:

w2 = h × h

The number of weights w3 from the second hidden layer to the output layer is computed as:

w3 = h × s

where s is the output layer dimension.
Each hidden layer applies the ReLU activation function to the input signals from the previous layer's neurons, preserving features and mapping them to the next layer. The output of the hidden layers enters a Softmax layer, where a Softmax function outputs the probability of belonging to each language category. The error between the Softmax layer's output and the target value is then computed, training is iterated, and training ends when the error is equal to or less than a preset expected value, yielding the recognition model.
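A minimal PyTorch sketch of such a model follows. The 429-dimensional input, two ReLU hidden layers, and Softmax output come from the description; the hidden size of 1024 and the two output categories (matching the Mandarin/Cantonese example) are assumptions.

import torch
import torch.nn as nn

class LanguageFNN(nn.Module):
    def __init__(self, input_dim=429, hidden=1024, num_langs=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden),  # v x h weights: input to hidden 1
            nn.ReLU(),
            nn.Linear(hidden, hidden),     # h x h weights: hidden 1 to hidden 2
            nn.ReLU(),
            nn.Linear(hidden, num_langs),  # h x s weights: hidden 2 to output
        )

    def forward(self, x):
        return self.net(x)  # logits; softmax is applied at the output stage

model = LanguageFNN()
# CrossEntropyLoss combines log-softmax with the error against the target value
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)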
In step S306, the language category of real-time speech is identified. A real-time call recording (real-time speech) is obtained through FTP; its PLP feature parameters are extracted and passed to the recognition model trained in step S305; the Softmax layer of the recognition model yields the probability of belonging to each language category; the probabilities are compared, and the language category with the highest probability is output as the recognition result. Preferably, an output threshold (e.g., 0.8) is set, and a result is output only when the maximum probability exceeds 0.8; otherwise, the real-time speech continues to be identified, and no result is output until some probability exceeds 0.8.
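A sketch of this thresholded inference, reusing the hypothetical extract_plp, splice_frames, and LanguageFNN pieces above. Averaging the frame-level Softmax outputs into one utterance-level decision is an assumption; the patent does not specify how per-frame probabilities are combined.

import torch

LANGS = ["Mandarin", "Cantonese"]  # assumed category order of the output layer
OUTPUT_THRESHOLD = 0.8             # preset output threshold from the example

def identify(wav_path, model):
    feats = torch.from_numpy(splice_frames(extract_plp(wav_path))).float()
    with torch.no_grad():
        probs = torch.softmax(model(feats), dim=1).mean(dim=0)  # average frames
    top = int(torch.argmax(probs))
    # Output a result only when the maximum probability clears the threshold;
    # otherwise keep identifying the real-time speech and output nothing yet.
    if probs[top] > OUTPUT_THRESHOLD:
        return LANGS[top]
    return None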
< apparatus embodiment >
Fig. 5 shows a functional block diagram of a language identification apparatus 1000 according to the present disclosure. As shown in fig. 5, the language identification device includes:
a language tag obtaining module 4100, configured to obtain a historical call record set, identify a language type of each call record in the historical call record set, and tag a language tag for the corresponding call record according to the identified language type to obtain an audio data set including a plurality of call records with language tags;
a model training module 4200 for training model parameters of the set model through the audio data set to obtain a recognition model dedicated to language recognition;
and the language identification module 4300 is used for inputting the real-time voice into the recognition model to perform language identification, and obtaining and outputting the language category of the real-time voice.
In one embodiment, the set-up model is a speech feature model implemented based on a deep neural network.
In one embodiment, the model training module further includes a feature extraction module, which extracts PLP feature parameters of the audio data set, and performs model training on the set model using the PLP feature parameters.
In one embodiment, the language identification module extracts PLP feature parameters of the real-time speech, and inputs the extracted PLP feature parameters of the real-time speech into the identification model to perform language identification on the real-time speech.
In one embodiment, the language identification device further comprises a filtering module which, before the feature parameters are extracted, performs at least one of error category rejection, recognition degree screening, and duration screening; wherein:
removing error categories, namely deleting the call records with specific language tags from the audio data set, wherein the specific language tags are tags which do not belong to a preset language category set;
screening the recognition degree, namely deleting the call records with the recognition accuracy rate of language categories lower than a set threshold value from the audio data set;
and (3) time length screening, namely reading the time length of each call record in the audio data set by using a librosa audio processing library, and deleting the call records with the time length less than 3 seconds.
In one embodiment, the extraction of PLP feature parameters comprises:
reading, namely reading each call record by using a librosa audio processing library;
and acquiring parameters, namely acquiring PLP characteristic parameters for each read call record by using a PLP technology.
In one embodiment, in the language identification module, the set model obtains probabilities that the real-time speech belongs to each language, and the language category corresponding to the maximum probability is output as the identification result.
In one embodiment, in the language identification module, when the maximum value of the probability is greater than a preset output threshold, the language category corresponding to the maximum value of the probability is output as the identification result.
It will be appreciated by those skilled in the art that the language identification apparatus 1000 may be implemented in various ways. For example, the language identification apparatus 1000 may be implemented by configuring a processor with instructions. For example, the instructions may be stored in ROM and read from ROM into a programmable device when the device starts, to implement the language identification apparatus 1000. For example, the language identification apparatus 1000 may be solidified into a dedicated device (e.g., an ASIC). The language identification apparatus 1000 may be divided into mutually independent units, or these units may be combined together for implementation. The language identification apparatus 1000 may be implemented by one of the above implementations, or by a combination of two or more of them.
In this embodiment, the language identification apparatus 1000 may take various implementation forms, for example, any functional module running in a software product or an application that provides a control service, or an add-on, plug-in, or patch of such a software product or application, or the software product or application itself.
< readable storage Medium >
In this embodiment, there is also provided a readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the language identification method according to any embodiment of the present disclosure.
The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), can execute the computer-readable program instructions by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, thereby implementing aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, by software, and by a combination of software and hardware are equivalent.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or technical improvements to the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the present disclosure is defined by the appended claims.

Claims (10)

1. A language identification method based on deep learning is characterized by comprising the following steps:
a language tag obtaining step, namely obtaining a historical call recording set, identifying the language type of each call recording in the historical call recording set, and marking a language tag for the corresponding call recording according to the identified language type to obtain an audio data set comprising a plurality of call recordings with the language tags;
a model training step, in which model parameters of a set model are trained through the audio data set to obtain a recognition model special for language recognition;
and a language identification step, namely inputting real-time voice into the identification model for language identification to obtain and output the language category of the real-time voice.
2. The method of claim 1, wherein the set model is a speech feature model implemented based on a deep neural network.
3. The method of claim 1, wherein the model training step further comprises a feature extraction step of extracting PLP feature parameters of the audio data set, and performing model training on the set model using the PLP feature parameters.
4. The method according to claim 3, wherein in the language identification step, the PLP feature parameters of the real-time speech are extracted, and the extracted PLP feature parameters of the real-time speech are input to the identification model to perform the language identification on the real-time speech.
5. The method of claim 1, further comprising a filtering step before the model training step, wherein the filtering step comprises at least one of an error category rejection step, a recognition degree screening step, and a duration screening step; wherein:
an error category removing step of deleting the call record with a specific language tag from the audio data set, wherein the specific language tag is a tag which does not belong to a preset language category set;
an identification degree screening step, namely deleting the call records with the language category identification accuracy rate lower than a set threshold value from the audio data set;
and a time length screening step, namely reading the time length of each call record in the audio data set by using a librosa audio processing library, and deleting the call records with the time length less than 3 seconds.
6. The method of claim 3, wherein the feature extraction step comprises:
a reading step, reading each call record by using a librosa audio processing library;
and a parameter acquisition step, namely acquiring PLP characteristic parameters for each read call record by using a PLP technology.
7. The method according to claim 1, wherein in said language identification step, said set model obtains probabilities that the real-time speech belongs to each language, and outputs a language category corresponding to a maximum value of the probabilities as an identification result.
8. The method according to claim 7, wherein in the language identification step, when the maximum value of the probability is larger than a preset output threshold value, a language category corresponding to the maximum value of the probability is output as an identification result.
9. A language identification device based on deep learning, the device comprising:
a language tag acquisition module, which is used for acquiring a historical call recording set, identifying the language category of each call recording in the historical call recording set, and marking a language tag for the corresponding call recording according to the identified language category to obtain an audio data set comprising a plurality of call recordings with language tags;
the model training module is used for training the model parameters of the set model through the audio data set to obtain a recognition model special for language recognition;
and the language identification module is used for inputting the real-time voice into the recognition model to carry out language identification, and obtaining and outputting the language category of the real-time voice.
10. A readable storage medium, having stored thereon a computer program which, when executed by a processor, implements the method of any one of claims 1 to 8.
CN202210519620.0A 2022-05-12 2022-05-12 Language identification method and device based on deep learning and readable storage medium Pending CN115188366A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210519620.0A CN115188366A (en) 2022-05-12 2022-05-12 Language identification method and device based on deep learning and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210519620.0A CN115188366A (en) 2022-05-12 2022-05-12 Language identification method and device based on deep learning and readable storage medium

Publications (1)

Publication Number Publication Date
CN115188366A (en) 2022-10-14

Family

ID=83513580

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210519620.0A Pending CN115188366A (en) 2022-05-12 2022-05-12 Language identification method and device based on deep learning and readable storage medium

Country Status (1)

Country Link
CN (1) CN115188366A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101894548A (en) * 2010-06-23 2010-11-24 清华大学 Modeling method and modeling device for language identification
CN105336324A (en) * 2015-11-17 2016-02-17 百度在线网络技术(北京)有限公司 Language identification method and device
US20190318725A1 (en) * 2018-04-13 2019-10-17 Mitsubishi Electric Research Laboratories, Inc. Methods and Systems for Recognizing Simultaneous Speech by Multiple Speakers
CN110517663A (en) * 2019-08-01 2019-11-29 北京语言大学 A kind of Language Identification and identifying system
CN111783437A (en) * 2020-06-03 2020-10-16 广州云趣信息科技有限公司 Method for realizing language identification based on deep learning
CN111986650A (en) * 2020-08-07 2020-11-24 云知声智能科技股份有限公司 Method and system for assisting speech evaluation by means of language identification
CN114360506A (en) * 2021-12-14 2022-04-15 苏州驰声信息科技有限公司 Language identification method and device

Similar Documents

Publication Publication Date Title
CN110085251B (en) Human voice extraction method, human voice extraction device and related products
CN107103903B (en) Acoustic model training method and device based on artificial intelligence and storage medium
CN109889920B (en) Network course video editing method, system, equipment and storage medium
CN111160004B (en) Method and device for establishing sentence-breaking model
CN107679032A (en) Voice changes error correction method and device
CN111081279A (en) Voice emotion fluctuation analysis method and device
CN110277088B (en) Intelligent voice recognition method, intelligent voice recognition device and computer readable storage medium
CN110942763B (en) Speech recognition method and device
CN109087667B (en) Voice fluency recognition method and device, computer equipment and readable storage medium
CN113488024B (en) Telephone interrupt recognition method and system based on semantic recognition
CN111160003B (en) Sentence breaking method and sentence breaking device
CN112818680B (en) Corpus processing method and device, electronic equipment and computer readable storage medium
CN113903363B (en) Violation behavior detection method, device, equipment and medium based on artificial intelligence
CN111009238A (en) Spliced voice recognition method, device and equipment
CN114360557B (en) Voice tone conversion method, model training method, device, equipment and medium
CN111276129A (en) Method, device and equipment for segmenting audio frequency of television series
CN114639386A (en) Text error correction and text error correction word bank construction method
CN110706710A (en) Voice recognition method and device, electronic equipment and storage medium
US20180268736A1 (en) Communication tone coach
CN111326142A (en) Text information extraction method and system based on voice-to-text and electronic equipment
CN112885379A (en) Customer service voice evaluation method, system, device and storage medium
CN111554270A (en) Training sample screening method and electronic equipment
CN115188366A (en) Language identification method and device based on deep learning and readable storage medium
CN112466286A (en) Data processing method and device and terminal equipment
CN115440238A (en) Noise screening method and system in voice automatic labeling data

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20221014

RJ01 Rejection of invention patent application after publication