CN111783437A

CN111783437A - Method for realizing language identification based on deep learning

Info

Publication number: CN111783437A
Application number: CN202010496197.8A
Authority: CN
Inventors: 黄诗雅; 罗睦军; 邓从健
Original assignee: Guangzhou Yunqu Information Technology Co ltd
Current assignee: Guangzhou Yunqu Information Technology Co ltd
Priority date: 2020-06-03
Filing date: 2020-06-03
Publication date: 2020-10-16

Abstract

The invention discloses a method for realizing language identification based on deep learning, comprising the following steps: after a call recording file is obtained, generating a language text data set through the Aliyun ASR and language identification interface; performing language text noise reduction on the returned recognition results; extracting language texts under each category for verification and determining the language of each category, thereby completing the training corpus; mapping the words of the training corpus to index representations and constructing a vocabulary-index mapping table, constructing a label-index mapping table for the language labels, reading word vectors from a pre-trained word vector model and feeding them into the model as initialization values, converting the language texts and language labels to index representations through the mapping tables, padding the index representations to a fixed length, and submitting them to a deep learning classifier for training; and using the deep learning classifier to analyze and predict the language text to be tested and find the language category with the highest probability. The invention reduces the burden of manually re-listening to recordings, saves manpower, and is efficient, automated and accurate.

Description

Method for realizing language identification based on deep learning

Technical Field

The invention relates to the field of telecommunication, in particular to a method for realizing language identification based on deep learning.

Background

Currently, language data for customer service hotlines is lacking, even though user attributes such as language type, address information and service requirements can be mined from call content. Subsequent service analysis needs to start from these basic user attributes in order to track how the service requirements of each user group change, to improve the complaint monitoring system for each user group, and to provide data support for refined user operation and maintenance. Without a basic user attribute index (the language used), recording data has to be labeled manually. However, because the customer hotline receives a very large number of consultation calls every day, a telecom operator that wants to classify hotline calls by language must have staff re-listen to and label call recordings daily, and doing this purely by hand consumes a great deal of manpower and time.

Disclosure of Invention

The technical problem to be solved by the present invention is to provide a method for realizing language identification based on deep learning that reduces the burden of manually re-listening to recordings, saves manpower, and is efficient, automated and accurate.

The technical scheme adopted by the invention for solving the technical problems is as follows: a method for realizing language identification based on deep learning is constructed, and comprises the following steps:

A) after a call recording file is obtained, generating a language text data set through the Aliyun ASR and language identification interface;

B) performing language text noise reduction on the returned recognition results;

C) extracting language texts under each category by manual sampling for verification, and determining the language of each category, to complete the training corpus;

D) mapping the words of the training corpus to index representations and constructing a vocabulary-index mapping table, constructing a label-index mapping table for the language labels, reading word vectors from a pre-trained word vector model and feeding them into the model as initialization values, and finally converting the language texts and language labels to index representations through the vocabulary-index mapping table and the label-index mapping table, padding the index representations to a fixed length, and submitting them to a deep learning classifier for training;

E) using the deep learning classifier to analyze and predict the language text to be tested and find the language category with the highest probability.

In the method for implementing language identification based on deep learning according to the present invention, the step B) further includes:

B1) screening for language texts whose language identification accuracy is higher than a set value, removing by conditional checks the texts outside the business domain whose language was identified incorrectly, and keeping only the language texts with high identification accuracy;

B2) performing word segmentation on the language text, then matching the words against a stop word list and filtering out the stop words.

In the method for implementing language identification based on deep learning according to the present invention, the step D) further includes:

D1) reading the training corpus into a memory, and performing word segmentation processing on each document;

D2) calculating the word frequency of each word in the documents, filtering out words whose frequency is lower than the lowest threshold or higher than the highest threshold, mapping the remaining distinct words to index representations, that is, constructing a vocabulary-index mapping table, and constructing a label-index mapping table for all distinct language labels;

D3) using the open-source word2vec word vector model from Tencent AI, reading the word vectors corresponding to the vocabulary-index mapping table as the initial values for the model;

D4) converting the words of each document to indices through the vocabulary-index mapping table, and, because document lengths differ, applying fixed-length processing: documents longer than the highest threshold are truncated and documents shorter than the lowest threshold are padded with <PAD>; the vocabulary-index mapping table and the word vectors are stored in a configuration file.
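Steps D2) and D4) can be illustrated with a minimal sketch. It assumes already-segmented documents, uses a single target length `seq_len` in place of separate highest and lowest length thresholds, and introduces a hypothetical `<UNK>` token for words removed by the frequency filter; none of the function names, thresholds or sample tokens come from the source.

```python
from collections import Counter

PAD = "<PAD>"  # padding token named in the text
UNK = "<UNK>"  # hypothetical token for words dropped from the vocabulary


def build_vocab(docs, min_freq, max_freq):
    """Map each word whose corpus frequency lies in [min_freq, max_freq] to an index."""
    counts = Counter(word for doc in docs for word in doc)
    vocab = {PAD: 0, UNK: 1}
    for word in sorted(counts):  # sorted only to make the index order reproducible
        if min_freq <= counts[word] <= max_freq:
            vocab[word] = len(vocab)
    return vocab


def build_label_map(labels):
    """Map each distinct language label to an index."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}


def encode(doc, vocab, seq_len):
    """Convert a segmented document to indices, then truncate or pad to seq_len."""
    ids = [vocab.get(word, vocab[UNK]) for word in doc][:seq_len]
    return ids + [vocab[PAD]] * (seq_len - len(ids))


# Toy corpus with made-up tokens and labels.
docs = [["hello", "ask", "bill"], ["hello", "query", "data", "plan", "bill"]]
labels = ["language_a", "language_b"]

vocab = build_vocab(docs, min_freq=1, max_freq=1)  # drops "hello" and "bill" (frequency 2)
label_map = build_label_map(labels)
encoded = [encode(doc, vocab, seq_len=4) for doc in docs]
print(encoded)  # [[1, 2, 1, 0], [1, 5, 3, 4]]
```

The first document is padded with index 0 and the second is truncated to four tokens, so every document reaches the classifier with the same length.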

In the method for realizing language identification based on deep learning, the deep learning classifier adopts a TEXTCNN text classifier.

In the method for realizing language identification based on deep learning, the TEXTCNN text classifier performs language text classification prediction: the language text data is converted into a text sequence of fixed length and then fed into a CNN network structure for training.

In the method for realizing language identification based on deep learning, the CNN network structure is composed of an input layer, a convolution layer, a pooling layer and a fully connected layer.

In the method for realizing language identification based on deep learning, a text sequence c with a fixed length n is input in the input layer, where n is an integer and n ≥ 1; each word is represented by a word vector $x_i$ with embedding dimension k, and the sentence is represented as $x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n$, where $\oplus$ denotes concatenation.

The word vector $x_i$ uses pre-trained word2vec as the input of the input layer and is not fine-tuned during model training.

In the method for realizing language identification based on deep learning, the convolution layer uses m convolution kernels of different sizes, where m is an integer and m ≥ 1. The height h of a convolution kernel is the window size and takes a value from 2 to 8; the width of a convolution kernel equals the word vector dimension k, so a convolution kernel is $\omega \in \mathbb{R}^{hk}$. The result at each sliding-window position is $c_i = f(\omega \cdot x_{i:i+h-1}) + b$, where $b \in \mathbb{R}$ and $f$ is a non-linear function. One pass over a language text c requires $n-h+1$ window positions, and the convolution result for the language text c is $c = [c_1, c_2, \ldots, c_{n-h+1}]$.

In the method for realizing language identification based on deep learning, a max-pooling layer (Max-pool) is adopted, namely $\hat{c} = \max\{c_1, c_2, \ldots, c_{n-h+1}\}$.

With m convolution kernels, the pooled data is $z = [\hat{c}_1, \hat{c}_2, \ldots, \hat{c}_m]$.

Each pooling operation yields the global maximum of its feature map.

In the method for realizing language identification based on deep learning, a single fully connected layer is used, $y = \omega \cdot z + b$; that is, the extracted feature z is input into an LR (logistic regression) classifier for classification.
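As a worked illustration of the dimensions implied by these layers, the following uses assumed values that do not come from the source: sequence length $n = 10$, a single kernel height $h = 3$ (within the stated range of 2 to 8, although the text allows kernels of different heights), $m = 4$ kernels, and $L = 3$ language categories, where $L$ is a symbol introduced here for illustration. In the last line, $\omega$ and $b$ denote the parameters of the fully connected layer rather than of a convolution kernel.

```latex
\begin{align*}
\text{window positions per kernel} &= n - h + 1 = 10 - 3 + 1 = 8, \\
c &= [c_1, c_2, \ldots, c_8] \in \mathbb{R}^{8}, \\
z &\in \mathbb{R}^{m} = \mathbb{R}^{4} \quad \text{(one maximum per kernel)}, \\
y &= \omega z + b \in \mathbb{R}^{L} = \mathbb{R}^{3}, \quad \omega \in \mathbb{R}^{3 \times 4},\ b \in \mathbb{R}^{3}.
\end{align*}
```

Because each kernel contributes exactly one pooled value, the size of z depends only on the number of kernels and not on the sequence length or the kernel heights.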

The method for realizing language identification based on deep learning has the following beneficial effects: first, after a user call recording file of a telecom operator is obtained, the recording is transcribed and its language type identified through the Aliyun speech recognition interface, and noise reduction retains only the texts with high recognition accuracy, which completes the training corpus; then, a network structure is built with deep learning to model and train on the corpus; finally, the resulting feature model automatically identifies the language of the operator's daily call recordings. The invention reduces the burden of manually re-listening to recordings, saves manpower, and is efficient, automated and accurate.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a flowchart illustrating an embodiment of a method for implementing language identification based on deep learning according to the present invention;

FIG. 2 is a flow chart of a method for implementing language identification by deep learning in the embodiment;

FIG. 3 is a detailed flowchart of performing language text noise reduction on the returned recognition results in the embodiment;

FIG. 4 is a detailed flowchart of the generation of the word vector model in the embodiment.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In this embodiment, a flowchart of the method for realizing language identification based on deep learning is shown in FIG. 1, and FIG. 2 is a flow chart of implementing language identification by deep learning in this embodiment. As shown in FIG. 1, the method for realizing language identification based on deep learning includes the following steps:

Step S01: after the call recording file is obtained, generating a language text data set through the Aliyun ASR and language identification interface. In this step, after the call recording file of a telecom operator's user is obtained, the speech language recognition system downloads the call recording file through FTP and generates a language text data set through the Aliyun speech recognition interface, that is, the Aliyun ASR and language identification interface.

Step S02: performing language text noise reduction on the returned recognition results. Specifically, the speech language recognition system reads the call recording file, first obtains the transcribed language text and the language label through the Aliyun ASR transcription and language identification API, then removes language texts outside the business domain by conditional checks, and finally removes noise data by performing noise reduction on the language texts.

Step S03: extracting language texts under each category by manual sampling for verification, and determining the language of each category, to complete the training corpus. Specifically, the speech language identification system stores the language text files judged with high accuracy into the same file; the service personnel of the speech language identification system spot-check each language text and rename it according to its true language, thereby completing the training corpus.

Step S04: building the mapping tables, initializing the word vectors, and training the deep learning classifier. In this step, the text language identification system loads the training corpus, extracts the classification result from the training corpus, and stores the classification result in the model file. Specifically, the words of the training corpus are mapped to index representations and a vocabulary-index mapping table is constructed; a label-index mapping table is constructed for the language labels; word vectors are read from the pre-trained word vector model and fed into the model as initialization values; finally, the language texts and language labels are converted to index representations through the vocabulary-index mapping table and the label-index mapping table, padded to a fixed length, and submitted to the deep learning classifier for training. The deep learning classifier is a text classifier implemented on the basis of TEXTCNN, that is, a TEXTCNN text classifier.

Step S05: the deep learning classifier analyzes and predicts the language text to be tested and finds the language category with the highest probability. In this step, the text language identification system downloads, through FTP, the call recording text (language text) to be analyzed, performs identification and prediction on it with the TEXTCNN text classifier, and finally finds the language category with the highest probability, that is, the most probable identification result.
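The final selection in step S05 amounts to taking the most probable class and translating its index back to a language label through the label-index mapping table. A minimal sketch follows; the function name, the probabilities and the label names are hypothetical, since the source does not list the languages involved.

```python
def predict_language(probabilities, label_map):
    """Return the label with the highest probability and that probability."""
    index_to_label = {idx: label for label, idx in label_map.items()}
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return index_to_label[best], probabilities[best]


# Hypothetical label-index mapping table and classifier output.
label_map = {"language_a": 0, "language_b": 1, "language_c": 2}
probabilities = [0.1, 0.7, 0.2]

print(predict_language(probabilities, label_map))  # ('language_b', 0.7)
```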

The call recording file is transcribed and its language type identified through the Aliyun speech recognition interface, and noise reduction retains only the language texts with high recognition accuracy, which completes the training corpus. A network structure is then built with deep learning to model and train on the corpus, and finally the resulting feature model automatically identifies the language of the operator's daily call recordings. The method for realizing language identification based on deep learning solves the problem that operators currently need to manually label millions of call recordings per day, which consumes a great deal of manpower. The method is based on natural language processing and deep learning and features high reliability, strong modeling capability and high accuracy; the whole process requires very little manual work and does not depend on the operator to provide a training corpus, which saves the operator substantial manpower and time.

Text classification prediction is performed by the TEXTCNN classifier. The TEXTCNN method converts language text data into a text sequence of fixed length and then feeds the text sequence into a CNN network structure for training. The CNN network structure is mainly composed of four parts: an input layer, a convolution layer, a pooling layer and a fully connected layer. The specific prediction steps are:

(1) Input layer (word embedding layer): a text sequence c with a fixed length n is input in the input layer, where n is an integer and n ≥ 1; each word is represented by a word vector $x_i$ with embedding dimension k, and the sentence is represented as $x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n$, where $\oplus$ denotes concatenation.

The word vector $x_i$ uses pre-trained word2vec as the input of the input layer and is not fine-tuned during model training.

(2) Convolution layer: m convolution kernels of different sizes are used, where m is an integer and m ≥ 1. The height h of a convolution kernel is the window size and takes a value from 2 to 8; the width of a convolution kernel equals the word vector dimension k, so a convolution kernel is $\omega \in \mathbb{R}^{hk}$. The result at each sliding-window position is $c_i = f(\omega \cdot x_{i:i+h-1}) + b$, where $b \in \mathbb{R}$ and $f$ is a non-linear function. One pass over a language text c requires $n-h+1$ window positions, and the convolution result for the language text c is $c = [c_1, c_2, \ldots, c_{n-h+1}]$.

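(3) Pooling layer: a max-pooling layer (Max-pool) is used, namely $\hat{c} = \max\{c_1, c_2, \ldots, c_{n-h+1}\}$.

With m convolution kernels, the pooled data is $z = [\hat{c}_1, \hat{c}_2, \ldots, \hat{c}_m]$.

Each pooling operation yields the global maximum of its feature map.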

(4) Fully connected layer: a single fully connected layer is used, $y = \omega \cdot z + b$, and the extracted feature z is input into an LR (logistic regression) classifier for classification.
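The four layers above can be sketched in plain Python as a forward pass only, with no training. This is a minimal sketch under stated assumptions, not the patented implementation: ReLU is assumed for the non-linear function f, softmax is assumed as the multi-class form of the LR classifier, and all function names and toy weights are hypothetical. The bias is added outside f to follow the formula as printed in this document; many TextCNN implementations instead add it before the non-linearity.

```python
import math


def relu(value):
    """Assumed choice for the non-linear function f."""
    return max(0.0, value)


def window_dot(window, kernel):
    """Dot product between an h x k window of word vectors and an h x k kernel."""
    return sum(
        x_val * w_val
        for x_row, w_row in zip(window, kernel)
        for x_val, w_val in zip(x_row, w_row)
    )


def feature_map(x, kernel, bias):
    """Slide the kernel over the n word vectors, giving n - h + 1 values c_i."""
    h = len(kernel)
    return [relu(window_dot(x[i:i + h], kernel)) + bias
            for i in range(len(x) - h + 1)]


def softmax(logits):
    """Turn the fully connected outputs into class probabilities."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def textcnn_forward(x, kernels, biases, fc_weights, fc_bias):
    # Max pooling: one global maximum per kernel gives the feature vector z.
    z = [max(feature_map(x, k, b)) for k, b in zip(kernels, biases)]
    # Fully connected layer y = w z + b, followed by softmax.
    y = [sum(w * v for w, v in zip(row, z)) + b
         for row, b in zip(fc_weights, fc_bias)]
    return z, softmax(y)


# Toy input: n = 4 words, embedding dimension k = 2.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],              # height h = 2
    [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],  # height h = 3
]
biases = [0.0, 0.5]
fc_weights = [[1.0, 0.0], [0.0, 1.0]]  # two hypothetical language classes
fc_bias = [0.0, 0.0]

z, probs = textcnn_forward(x, kernels, biases, fc_weights, fc_bias)
print(z)  # [2.0, 4.5]
```

The kernel of height 2 visits 4 - 2 + 1 = 3 window positions and the kernel of height 3 visits 2, yet each contributes a single pooled value, so z has one entry per kernel.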

For the present embodiment, the step S02 can be further refined, and the detailed flowchart is shown in fig. 3. In fig. 3, the step S02 further includes:

Step S21: screening for language texts whose language identification accuracy is higher than a set value. In this step, texts outside the business domain whose language was identified incorrectly are removed by conditional checks, and only the language texts with high identification accuracy are kept.

Step S22: performing word segmentation on the language text, then matching the words against the stop word list and filtering out the stop words.
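Steps S21 and S22 can be sketched as a small filter. The record fields (`text`, `language`, `confidence`), the threshold value and the whitespace tokenizer are assumptions made for illustration; the source does not describe the response format of the recognition interface or name a word segmenter, and the business-domain check is omitted here because its conditions are not specified.

```python
def tokenize(text):
    """Placeholder word segmentation: split on whitespace.

    A real system would call a proper word segmenter here.
    """
    return text.split()


def denoise(records, stop_words, min_confidence=0.9):
    """Keep confidently identified records and strip stop words from their tokens."""
    cleaned = []
    for rec in records:
        # S21: keep only texts whose identification confidence exceeds the set value.
        if rec["confidence"] <= min_confidence:
            continue
        # S22: segment the text and filter out stop words.
        tokens = [t for t in tokenize(rec["text"]) if t not in stop_words]
        if tokens:
            cleaned.append({"language": rec["language"], "tokens": tokens})
    return cleaned


# Made-up records standing in for recognition results.
records = [
    {"text": "um I want to check my bill", "language": "language_a", "confidence": 0.97},
    {"text": "uh hello hello", "language": "language_a", "confidence": 0.42},
]
stop_words = {"um", "uh", "I", "to", "my"}

print(denoise(records, stop_words))
# [{'language': 'language_a', 'tokens': ['want', 'check', 'bill']}]
```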

For the present embodiment, the step S04 can be further refined, and the detailed flowchart is shown in fig. 4. In fig. 4, the step S04 further includes:

Step S41: reading the training corpus into memory and performing word segmentation on each document.

Step S42: calculating the word frequency of each word appearing in the documents, filtering out words whose frequency is lower than the lowest threshold or higher than the highest threshold, and mapping the remaining distinct words to index representations, that is, constructing a vocabulary-index mapping table. In addition, a label-index mapping table is constructed for all distinct language labels.

Step S43: using the open-source word2vec word vector model from Tencent AI, reading the word vectors corresponding to the vocabulary-index mapping table and using them as the initial values for the model.

Step S44: converting the words of each document to indices through the vocabulary-index mapping table. In addition, because document lengths differ, fixed-length processing is applied: documents longer than the highest threshold are truncated and documents shorter than the lowest threshold are padded with <PAD>. The vocabulary-index mapping table and the word vectors are stored in a configuration file.
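Step S43 can be pictured as building one row of initial values per vocabulary index. The sketch below assumes the pre-trained vectors are already loaded into a dictionary and that vocabulary indices run from 0 to len(vocab) - 1; giving words without a pre-trained vector a small random vector, and keeping the `<PAD>` row at zero, are choices made here for illustration and are not stated in the source.

```python
import random


def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Return one initial vector per vocabulary index."""
    rng = random.Random(seed)
    matrix = [[0.0] * dim for _ in range(len(vocab))]
    for word, idx in vocab.items():
        if word in pretrained:
            # Copy the pre-trained vector as the initial value.
            matrix[idx] = list(pretrained[word])
        elif word != "<PAD>":
            # Assumed fallback for words missing from the pre-trained model.
            matrix[idx] = [rng.uniform(-0.25, 0.25) for _ in range(dim)]
    return matrix


vocab = {"<PAD>": 0, "bill": 1, "plan": 2}
pretrained = {"bill": [0.1, 0.2, 0.3]}  # stand-in for vectors read from a word2vec file

matrix = build_embedding_matrix(vocab, pretrained, dim=3)
print(matrix[1])  # [0.1, 0.2, 0.3]
```

Since the text states that the word vectors are not fine-tuned during training, a matrix built this way would stay fixed while the convolution and fully connected layers are trained.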

In summary, the invention relates to the fields of telecommunications, deep learning and natural language processing, and is a method for identifying the language of operator text based on deep learning. With deep learning, the training corpus can be produced using an existing language identification API and text noise reduction while keeping up-front manual labeling to a minimum; the training corpus is then modeled through deep learning, and finally unstructured text analysis and language identification are performed on call recording texts, which reduces the burden of manual listening and saves labor.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A method for realizing language identification based on deep learning is characterized by comprising the following steps:

2. The method for realizing language identification based on deep learning according to claim 1, wherein the step B) further comprises:

3. The method for realizing language identification based on deep learning according to claim 2, wherein the step D) further comprises:

4. The method for realizing language identification based on deep learning of claim 3, wherein the deep learning classifier adopts TEXTCNN text classifier.

5. The method of claim 4, wherein the text classifier of TEXTCNN is used to predict the classification of the language text, and the language text data is transformed into a text sequence with a fixed length and then placed into a CNN network structure for training.

6. The method for realizing language identification based on deep learning of claim 5, wherein the CNN network structure is composed of an input layer, a convolutional layer, a pooling layer and a full connection layer.

7. The method for realizing language identification based on deep learning of claim 6, wherein a text sequence c with a fixed length n is input in the input layer, n is an integer and n ≥ 1; each word is represented by a word vector $x_i$ with embedding dimension k, and a sentence is represented as $x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n$, wherein the word vector $x_i$ uses pre-trained word2vec as the input of the input layer and is not fine-tuned during model training.

8. The method for realizing language identification based on deep learning of claim 7, wherein the convolution layer uses m convolution kernels of different sizes, m is an integer and m ≥ 1, the height h of a convolution kernel is the window size and takes a value from 2 to 8, the width of a convolution kernel equals the word vector dimension k, and a convolution kernel is $\omega \in \mathbb{R}^{hk}$; the result at each sliding-window position is $c_i = f(\omega \cdot x_{i:i+h-1}) + b$, where $b \in \mathbb{R}$ and $f$ is a non-linear function; one pass over a language text c requires $n-h+1$ window positions, and the convolution result for the language text c is $c = [c_1, c_2, \ldots, c_{n-h+1}]$.

9. The method for realizing language identification based on deep learning of claim 8, wherein a max-pooling layer (Max-pool) is adopted, namely $\hat{c} = \max\{c_1, c_2, \ldots, c_{n-h+1}\}$.

With m convolution kernels, the pooled data is $z = [\hat{c}_1, \hat{c}_2, \ldots, \hat{c}_m]$.

Each pooling operation yields the global maximum of its feature map.

10. The method of claim 9, wherein a single fully connected layer is used, $y = \omega \cdot z + b$; that is, the extracted feature z is input into an LR classifier for classification.