CN112818118A - Reverse translation-based Chinese humor classification model - Google Patents

Reverse translation-based Chinese humor classification model

Info

Publication number
CN112818118A
CN112818118A
Authority
CN
China
Prior art keywords
chinese
text
humor
model
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110088848.4A
Other languages
Chinese (zh)
Other versions
CN112818118B (en)
Inventor
孙世昶
孟佳娜
刘玉宁
朱彦霖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Minzu University
Original Assignee
Dalian Minzu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Minzu University filed Critical Dalian Minzu University
Priority to CN202110088848.4A priority Critical patent/CN112818118B/en
Publication of CN112818118A publication Critical patent/CN112818118A/en
Application granted granted Critical
Publication of CN112818118B publication Critical patent/CN112818118B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/268Morphological analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

A reverse translation-based Chinese humor classification model belongs to the field of natural language processing and comprises: S1, a text input layer; S2, a BERT embedding layer; S3, a Chinese pinyin feature embedding layer; S4, a text part-of-speech feature embedding layer; S5, a feature fusion layer; S6, a BiGRU layer; and S7, a fully connected layer that finally produces the humor classification output for the Chinese text. The beneficial effects are that: on the basis of a reverse translation technique combined with linguistic humor theory, a base model, BERT-BiGRU-Softmax, is proposed for classifying humor in Chinese text; different humor features are added to the model step by step, yielding the feature fusion model BERT+POS+Homophony-BiGRU-Softmax, which is effective for the final judgment of whether a Chinese text is humorous.

Description

Reverse translation-based Chinese humor classification model
Technical Field
The invention belongs to the field of natural language processing, and relates to a Chinese humor classification model based on reverse translation.
Background
Humor, one of the important forms of emotional expression, has long accompanied people's lives. With the rapid development of science and technology and the large-scale spread of the internet and communication devices, applications in the internet and artificial intelligence fields have shifted dramatically from reading-oriented to interactive, and "interactive" humor has emerged with them. Humor brings people joy and can also improve social ability, work efficiency and so on. At present, most chat robots, the representative interactive application, collect various network resources and integrate information to interact with their users, but few of them have any humor capability, and such robots feel cold: a chat robot should not be merely mechanical, but should have humanized thinking, sensitivity to its user, and the ability to express itself, that is, a sense of humor. "Humor" therefore has special meaning for chat robots. A chat service robot needs to possess humor and to understand the humorous component of the speaker, and the basis for realizing this function is enabling it to classify humorous sentences.
The classification of humor in Chinese text is an important research area within domestic natural language processing; the techniques involved mainly come from cognitive science, linguistics, machine learning and information retrieval, and research at home and abroad has become increasingly active in recent years. The study of Chinese text humor classification mainly divides the utterances expressed in a text into humorous and non-humorous according to the attitude or humorous tendency of the speaker.
The humor classification task originated in Western countries and, after many years of research and development, has become a popular topic in natural language processing; research abroad on the task has gradually matured. Raskin proposed the first formal theory of humor, the Semantic Script Theory of Humor (SSTH), in 1985; it has become the basic theory, and the cornerstone, of computational humor analysis in artificial intelligence. Subsequently, Attardo and Raskin extended and revised Raskin's base theory to propose the General Theory of Verbal Humor, which identifies six main humor elements: script opposition, logical mechanism, situation, target, narrative strategy and language, arranged on six levels from concrete to abstract. This has been of great significance for the development of humor theory. With the gradual development of artificial intelligence technology, the demands of high-performance deep neural network models on data scale have grown, requiring support from large, high-quality training sets. However, the data sets publicly available in many fields, such as emotion classification, named entity recognition and image analysis, are often not of sufficient quality or size to fully exploit high-performance models, and data augmentation technology has therefore emerged.
In recent years, since the natural language processing field also suffers from insufficient text training sets or low-quality training samples, data augmentation technology has been widely applied there as well. Inspired by the success of generative adversarial networks (GAN) in image processing, many researchers have applied GAN networks to text data augmentation tasks. The release of the GPT-2 model by OpenAI in 2019, and its Chinese adaptation [28], have had a positive impact on data augmentation in natural language processing. At present, the main data augmentation methods in natural language processing include noise injection, EDA and reverse translation, which have achieved excellent results in different fields.
The humor conveyed by a humorous sentence differs from reader to reader, and recognizing it requires a large store of background knowledge. At present, research on humor classification and recognition makes little use of linguistic humor theory, and such theory has not been well combined with deep learning. How to better combine linguistic humor theory to extract humor features from text, trace humor back to its source, and thus complete the humor classification task is therefore a challenge.
Besides the lack of humor theory, data is also deficient. Because Chinese humor classification started later than research abroad, there is little existing high-quality Chinese humor text data. If data augmentation techniques from other fields are migrated directly to this task, they will not necessarily generalize well across fields, so noise is likely to be introduced: a small amount of noise may even help model performance, but a large amount will necessarily harm it. The Chinese text humor data set therefore also has a certain influence on the learning of the model.
The humor classification task started late and lags behind early research on text emotion classification. Early humor classification research was based on English data, and because high-quality Chinese humor data sets were scarce, Chinese humor classification has developed only in recent years. In addition, compared with English humor, the forms and characteristics of Chinese humor differ in syntactic structure and syntactic form, so machine learning cannot easily capture the meaning of its humorous forms or select targeted features to judge whether a text is a humorous sentence.
Disclosure of Invention
In order to judge whether a Chinese text is humorous, the invention provides the following technical scheme: a reverse translation-based Chinese humor classification model, comprising:
S1, a text input layer;
S2, a BERT embedding layer;
S3, a Chinese pinyin feature embedding layer;
S4, a text part-of-speech feature embedding layer;
S5, a feature fusion layer;
S6, a BiGRU layer;
and S7, a fully connected layer, which finally produces the humor classification output for the Chinese text.
Further, the text input layer takes sentences as input.
Further, the pinyin feature embedding layer comprises the following steps:
converting Chinese characters into pinyin: each Chinese character in the sentence to be characterized is converted into its Chinese pinyin;
acquiring a unique character set: each character corresponds to an integer serving as its ID;
pinyin vectorization: the text to be converted is vectorized according to the two preceding steps (see the sketch after this list).
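The following is a minimal, hypothetical sketch of these three pinyin steps; it is not taken from the patent. It assumes the pypinyin package, and the ID mapping, padding length and example sentences are illustrative choices only.

# Hypothetical sketch of the pinyin embedding steps described above (not from the patent).
from pypinyin import lazy_pinyin

def build_pinyin_vocab(sentences):
    """Step 2: collect the unique pinyin tokens and assign each an integer ID."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for sent in sentences:
        for syllable in lazy_pinyin(sent):       # Step 1: Chinese character -> pinyin
            vocab.setdefault(syllable, len(vocab))
    return vocab

def pinyin_vectorize(sentence, vocab, max_len=32):
    """Step 3: map a sentence to a fixed-length sequence of pinyin IDs."""
    ids = [vocab.get(s, vocab["<unk>"]) for s in lazy_pinyin(sentence)]
    ids = ids[:max_len] + [vocab["<pad>"]] * max(0, max_len - len(ids))
    return ids

corpus = ["他是我的同学", "今天天气很好"]
vocab = build_pinyin_vocab(corpus)
print(pinyin_vectorize(corpus[0], vocab))

The resulting ID sequence can then be fed to an embedding layer to obtain the pinyin feature vectors used in the fusion layer.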
Furthermore, in the text part-of-speech feature embedding layer, the jieba tool is used together with an imported stop-word lexicon to segment the sentences in the text into words; all parts of speech are then extracted and converted into part-of-speech feature vectors.
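A hypothetical sketch of this step is shown below; it assumes the jieba package (jieba.posseg) and a local stop-word file, and the choice of counting coarse POS tags as the feature vector is an illustrative simplification rather than the patented procedure.

# Hypothetical sketch of the part-of-speech feature step (not from the patent).
import jieba.posseg as pseg

def load_stopwords(path="stopwords.txt"):          # illustrative stop-word lexicon file
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def pos_features(sentence, stopwords, tagset=("n", "v", "a", "d", "r", "m")):
    """Segment the sentence, drop stop words, and count coarse POS tags
    to form a simple part-of-speech feature vector."""
    counts = dict.fromkeys(tagset, 0)
    for word, flag in pseg.cut(sentence):
        if word in stopwords:
            continue
        coarse = flag[0]                            # jieba tags begin with 'n', 'v', 'a', ...
        if coarse in counts:
            counts[coarse] += 1
    return [counts[t] for t in tagset]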
Furthermore, in the feature fusion layer, the feature vector matrix extracted by the BERT model, the Chinese pinyin features obtained by comparison through the reverse translation method, and the text part-of-speech feature vectors are fused to form a multi-feature representation, which is then trained in the deep learning model. Let V be the feature vector matrix generated by passing a sample sentence of the text input layer through the BERT model; the feature fusion corresponding to the sample sentence can then be expressed by formula 4.1:
(Formula 4.1 appears as an image in the original publication and is not reproduced here.)
In the above formula, W denotes the newly generated feature vector, f1 denotes the word-vector feature, and f2 denotes the Hanyu Pinyin feature.
Further, the BiGRU layer comprises a forward GRU layer and a backward GRU layer; context learning is performed on the feature vector matrix W output by the feature fusion layer through the forward and backward neural networks, extracting deeper features from the text (a hypothetical sketch of the fusion and BiGRU stages is given below).
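The following PyTorch sketch illustrates one possible realization of the fusion, BiGRU and fully connected stages; it is not the patented implementation. The feature dimensions, the concatenation-style fusion and the use of the last time step are all assumptions made for illustration.

# Hypothetical PyTorch sketch of feature fusion + BiGRU + softmax (not from the patent).
import torch
import torch.nn as nn

class FusionBiGRUClassifier(nn.Module):
    def __init__(self, bert_dim=768, pinyin_dim=64, pos_dim=16, hidden=128, n_classes=2):
        super().__init__()
        fused_dim = bert_dim + pinyin_dim + pos_dim
        self.bigru = nn.GRU(fused_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)    # forward + backward hidden states

    def forward(self, bert_vecs, pinyin_vecs, pos_vecs):
        # bert_vecs: (batch, seq, 768); pinyin_vecs: (batch, seq, 64); pos_vecs: (batch, seq, 16)
        fused = torch.cat([bert_vecs, pinyin_vecs, pos_vecs], dim=-1)   # feature fusion layer
        out, _ = self.bigru(fused)                                      # BiGRU context learning
        logits = self.fc(out[:, -1, :])                                 # last time step
        return torch.softmax(logits, dim=-1)                            # humor / non-humor

model = FusionBiGRUClassifier()
probs = model(torch.randn(2, 32, 768), torch.randn(2, 32, 64), torch.randn(2, 32, 16))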
Furthermore, the reverse translation method comprises the following steps: the Chinese humor data set is translated into an English data set by a machine translation method, and the English data set is then translated back into a Chinese data set.
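A minimal sketch of this reverse translation loop is given below. The patent does not specify which machine translation system is used, so the translate function here is only a placeholder to be backed by any MT API or model; keeping only sentences whose wording changed is an illustrative filtering choice.

# Hypothetical sketch of the reverse (back-) translation step (not from the patent).
def translate(text: str, src: str, dst: str) -> str:
    """Placeholder: call an external machine translation system of your choice."""
    raise NotImplementedError

def back_translate(zh_sentences):
    augmented = []
    for zh in zh_sentences:
        en = translate(zh, src="zh", dst="en")       # Chinese -> English
        zh_back = translate(en, src="en", dst="zh")  # English -> Chinese
        if zh_back != zh:                            # keep only rephrased sentences
            augmented.append(zh_back)
    return augmented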
The beneficial effects are that: using deep learning technology, the salient features of the text data set are studied and extracted. On the basis of a reverse translation technique combined with linguistic humor theory, a base model, BERT-BiGRU-Softmax, is proposed for classifying humor in Chinese text; different humor features are added to the model step by step in experiments, realizing the construction and training of the feature fusion model BERT+POS+Homophony-BiGRU-Softmax (hereinafter the BPH-BiGRU-Softmax model), which finally judges whether a Chinese text is humorous. According to the experimental data in the specific embodiments, with a public data set and a self-built data set used as the texts, experiments were carried out in four directions: the influence of multiple features on Chinese text humor classification, comparison with other network models, study of important model hyper-parameters, and comparison of different data augmentation techniques. The BPH-BiGRU-Softmax model, which combines the BERT base features, the part-of-speech features and the Chinese pinyin features, obtained the best results in all of them. This verifies the effectiveness of the model, reduces the time cost of experiments, improves the memory utilization of the machine, and allows a more accurate descent direction to be found so as to reduce the oscillation of the model. Reverse translation changes the expressive structure and mode of a sentence, so the augmented data can have a structure different from the original sentence while sometimes retaining the correct semantic information despite the changed grammatical structure; this increases the diversity of the text corpus and improves the robustness and generalization ability of the model.
Drawings
FIG. 1 is a schematic structural diagram of a CBOW model;
FIG. 2 is a schematic diagram of the Skip-Gram model structure;
FIG. 3 is a schematic diagram of a basic structure of machine learning;
FIG. 4 is a schematic diagram of a two-dimensional SVM partitioning method;
FIG. 5 is a TextCNN schematic diagram;
FIG. 6 is a diagram of RNN structure;
FIG. 7 is a development view of the circulation layer;
FIG. 8 is a diagram of an LSTM network architecture;
FIG. 9 is a general block diagram;
FIG. 10 is a diagram of a reverse translation;
FIG. 11 is a syntax visualization of an original sentence;
FIG. 12 is a syntax visualization after reverse translation;
FIG. 13 is a model design framework diagram;
FIG. 14 is a vector representation matrix V;
FIG. 15 is a GRU base model diagram;
FIG. 16 is a graph showing the effect of Batchsize on the model effect.
Detailed Description
1.1 Problems to be solved
The invention provides a Chinese humor classification model based on reverse translation. Against the background of artificial intelligence and the rapid development of human-computer interaction in today's society, there is an urgent need for machines to have more emotion and a sense of humor, and the humor classification task has arisen accordingly. This work analyzes the humorous tendency of Chinese text, systematically reviews related research on text humor classification, and explores the different capacities of different deep learning representations at different depths in the humor classification task. Using deep learning technology, the salient features of the data set are studied and extracted; on the basis of a reverse translation technique combined with linguistic humor theory, a base model, BERT-BiGRU-Softmax, is proposed for classifying humor in Chinese text; different humor features are added to the model step by step in experiments, realizing the construction and training of the feature fusion model BERT+POS+Homophony-BiGRU-Softmax (the BPH-BiGRU-Softmax model), and finally judging whether a Chinese text is humorous.
2.1 humorous classification
2.1.1 text humor Classification
Text humor classification is a branch of the emotion analysis research field and one of the important tasks in artificial intelligence. The main research object of the task is the "subjective factor" of the text, that is, whether the subjective tendency expressed by the publisher or author includes a humorous effect, and the classification result indicates whether a specific text conveys humor. With the rapid development of science and technology, three main humor classification approaches have emerged: classification based on statistics and grammatical analysis, classification based on machine learning, and classification based on deep learning. The first two are traditional methods, while deep learning-based classification has been the more active approach in recent years.
2.1.2 text preprocessing
In data mining or natural language processing tasks, text preprocessing is indispensable, and its thoroughness directly affects, to a great extent, the basic performance of the experimental model. Text preprocessing mainly comprises Chinese word segmentation, stop-word removal, part-of-speech tagging, dependency parsing, and similar steps. Generally speaking, text preprocessing first removes designated useless symbols and keeps only the Chinese characters in the text, then performs Chinese word segmentation, removes stop words (words with weak emotional color or without practical meaning), and performs part-of-speech tagging or dependency parsing on the text, so that the computer gains the ability to analyze emotional color automatically.
(1) Removing designated symbols
Generally, at the beginning of text preprocessing, whether the data set is Chinese or English, and because many data sets are crawled from websites or other sources, designated useless symbols need to be removed from the data set. This allows the subsequent text preprocessing operations to be performed better.
(2) Chinese word segmentation
Currently, Chinese word segmentation techniques can be divided into two main categories: segmentation based on character-string matching and segmentation based on statistics. The character-string matching method, also called the mechanical segmentation method, matches the Chinese character string to be analyzed against the entries of a sufficiently large machine dictionary according to a certain strategy; if a string is found in the dictionary, the match succeeds and a word is identified. According to the scanning direction, string-matching segmentation can be divided into forward matching and reverse matching; according to which length is matched preferentially, into maximum (longest) matching and minimum (shortest) matching; and according to whether it is combined with part-of-speech tagging, into pure segmentation methods and integrated methods that combine segmentation with part-of-speech tagging. Common string-matching methods include the forward maximum matching method, the reverse maximum matching method, the minimum segmentation method and the bidirectional maximum matching method. These algorithms are fast and simple to implement, but their effect on ambiguous or out-of-vocabulary words is less than ideal. Statistics-based segmentation uses a statistical machine learning model to learn segmentation rules from a large amount of already segmented text and then segments unknown text; examples are the maximum-probability segmentation method and the maximum-entropy segmentation method. With the construction of large-scale corpora and the development of statistical machine learning, statistics-based Chinese word segmentation has gradually become the mainstream approach. The main statistical models are the n-gram model, the Hidden Markov Model (HMM), the Maximum Entropy model (ME), the Conditional Random Field model (CRF), and so on.
In practical applications, statistics-based segmentation systems still use a segmentation dictionary for string-matching segmentation while using statistical methods to identify new words; that is, string frequency statistics are combined with string matching, exploiting the speed and efficiency of dictionary matching together with the advantages of dictionary-free segmentation, which uses context to recognize new words and resolve ambiguity automatically. The jieba Chinese word segmentation tool widely used in recent years applies dynamic programming to search for the maximum-probability path and finds the best segmentation combination based on word frequency; for unknown words it uses an HMM model based on the word-forming ability of Chinese characters, with the Viterbi algorithm, to obtain better results. The invention therefore adopts the jieba Chinese word segmentation technique to segment the pre-training corpus.
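As a minimal illustration of this dictionary-plus-statistics segmentation, the sketch below uses the jieba package; the example sentence and the choice of segmentation modes are illustrative only.

# Minimal illustration of Chinese word segmentation with jieba (example is illustrative).
import jieba

sentence = "结巴分词采用基于词频的最大概率路径进行切分"
print("/".join(jieba.cut(sentence)))                 # default (precise) mode
print("/".join(jieba.cut(sentence, cut_all=True)))   # full mode: all possible words
print("/".join(jieba.cut_for_search(sentence)))      # search-engine mode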
(3) Stop word
In natural language processing, stop-word removal is generally used to save storage space and improve retrieval efficiency. Stop words generally fall into two categories: the first is extremely common function words, which have little practical meaning compared with other words; the second is very widely used content words that contribute little to a specific task. The stop-word list used in the invention is the Harbin Institute of Technology (HIT) stop-word list.
(4) Part-of-speech tagging
Part-of-speech tagging is the process by which a computer automatically labels the parts of speech of the words in a sentence, such as nouns, verbs and adjectives. Part-of-speech tagging is not always required in text preprocessing, but it can sometimes simplify other work. For example, during part-of-speech tagging, words of certain parts of speech that are unnecessary for a specific task can be removed from the sentence, achieving a better preprocessing effect.
(5) Dependency parsing
Dependency parsing is one of the key technologies in natural language processing. It was first proposed by the French linguist L. Tesnière in his foundational work on structural syntax, which has had a profound influence on the development of linguistics and is also highly regarded in computational linguistics. Its basic task is to determine the syntactic structure of a sentence, or the dependency relations between the words in the sentence. It mainly involves two aspects: one is determining the grammar system of a language, that is, giving a formal definition of the grammatical structure of legal sentences in that language; the other is the parsing technique itself, that is, automatically deriving the syntactic structure of a sentence according to the given grammar, analyzing the syntactic units contained in the sentence and the relations between them.
2.1.3 word vector representation
(1) One-hot representation
One-Hot encoding is a relatively common word representation method. It uses an N-bit status register to encode N states; each state has its own independent register bit, and only one bit is active at any time, that is, only one bit is 1 and the rest are 0. One-hot encoding is mainly used in classification tasks to normalize categorical features. For example, for a gender feature with the values male and female, two new features are constructed, gender-male and gender-female; the position corresponding to the original value is set to 1 and the other to 0. Although one-hot encoding solves problems such as classifiers handling discrete data poorly, its disadvantages for text feature representation are pronounced. First, it is a bag-of-words model that ignores word order. Second, it assumes that words are independent of one another. Finally, the features it produces are discrete and sparse.
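A toy illustration of one-hot word representation follows; the vocabulary is arbitrary and the sketch is not part of the patent.

# Toy illustration of the one-hot word representation.
vocab = ["我", "喜欢", "幽默", "句子"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = [0] * len(vocab)          # N-bit register: exactly one position is 1
    vec[index[word]] = 1
    return vec

print(one_hot("幽默"))   # [0, 0, 1, 0]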
(2) Distributed representation
The distributed representation was first proposed by Hinton et al., originating from cognitive science; it is the most important representation for deep learning, from which combined information of learned feature values can be derived. Unlike the one-hot representation, the distributed representation expresses the intrinsic meaning of a word with a dense real-valued vector. Suppose data samples A and B are each represented by a 2-dimensional feature vector; then A and B can be represented in a distributed manner as, for example, A: [1.79683, 0.326625816] and B: [2.6215376, 0.9257021]. Compared with the one-hot representation, the advantages of the distributed representation are evident. First, the one-hot representation cannot express associations between texts, whereas the distributed representation can effectively express semantic similarity and therefore better represent the information between words. Second, the distributed representation helps the model generalize better. Most notably, the distributed representation has very powerful feature expression capability: for example, an N-dimensional vector with k values per dimension can represent k^N distinct pieces of information.
2.1.4 word vector representation
(1) Word representation generation model
As research on word representations progressed, researchers increasingly found that better word representations can be obtained on larger data sets using simple network models and context. Word2vec thus opened a new era. It is an open-source tool proposed by Google that converts the words in a corpus into vectors so that various computations can be performed on those word vectors. The Word2vec approach includes two training models, the CBOW model and the Skip-Gram model, whose structures are shown in FIG. 1 and FIG. 2 respectively.
The training input of the CBOW model is the word vectors of the context words surrounding a particular word, and the output is the word vector of that word. The Skip-Gram model takes the opposite approach: the input is the word vector of a particular word, and the output is the word vectors of its context.
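As an illustration of these two training modes, the following sketch uses the gensim library (not referenced in the patent); it assumes gensim version 4 or later, and the toy corpus and hyper-parameters are arbitrary.

# Hypothetical sketch of training CBOW and Skip-Gram word vectors with gensim.
from gensim.models import Word2Vec

corpus = [["我", "喜欢", "幽默"], ["他", "讲", "笑话"]]   # already-segmented sentences

cbow = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=0)  # sg=0: CBOW
skip = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)  # sg=1: Skip-Gram

print(cbow.wv["幽默"].shape)          # (100,) dense word vector
print(skip.wv.most_similar("幽默"))   # context-based neighbours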
In 2014, Pennington et al., inspired by earlier work, proposed the GloVe word vectors, which, like Word2Vec, are static vectors. It was not until 2018 that the release of the BERT model attracted unprecedented attention. The BERT model improves the generalization ability of word vector models and represents words and sentences well.
2.1.5 machine learning related techniques
At present, humor text classification tasks implemented with machine learning mostly use supervised classification methods: sentence features are extracted from a large number of labeled data sets, a machine learning algorithm is used to learn model parameters and generate a model, and the model is finally used to classify and recognize texts. This subsection introduces the relevant machine learning knowledge and algorithms for the humor text classification task. Machine learning is a general term for a class of algorithms mainly applied in artificial intelligence. These algorithms attempt to mine the laws implied in a large set of historical data and use them for prediction or classification. More specifically, machine learning can be viewed as searching for a function whose input is sample data from a corpus and whose output is the result we expect, except that the function is too complex to be conveniently formalized. Notably, the goal of machine learning is for the learned function to fit "new samples" well, rather than merely performing well on the training samples, that is, to maximize the generalization ability of the model. FIG. 3 shows the basic structure of machine learning. Machine learning models prominent in the humor classification task include the naive Bayes, decision tree and support vector machine models, which are described below.
(1) Naive Bayes
Naive Bayes [45] is a very common and widely used text classification algorithm whose rationale, Bayes' theorem, was proposed by the British mathematician Thomas Bayes. The naive Bayes algorithm is a simple but extremely powerful predictive modeling algorithm. The usual procedure is first to determine the feature attributes and specify what the predicted values are, to divide each feature attribute appropriately, and then to manually label part of the data to form a training sample; by inputting the feature attributes and the training sample, the frequency of each class in the training sample and the conditional probability of each feature-attribute partition for each class are calculated, and a classifier is output. Finally, the classifier is used to classify new data and output the result.
In order to avoid the difficulty of estimating the joint probability in Bayes' formula directly, the naive Bayes classifier adopts the attribute conditional independence assumption: for a known class, all attributes are assumed to be mutually independent, that is, each attribute influences the classification result independently, giving formula 2.1:
P(c|x) = (P(c) / P(x)) · ∏_{i=1}^{d} P(x_i | c)    (2.1)
where d is the number of attributes and x_i is the value of x on the i-th attribute. Since P(x) is uniquely determined by the sample set, i.e. it is the same for all classes, the expression of the naive Bayes classifier is given by formula 2.2:
h_nb(x) = argmax_{c ∈ Y} P(c) ∏_{i=1}^{d} P(x_i | c)    (2.2)
The training process of the naive Bayes classifier estimates the class prior probability P(c) from the training set D and estimates the conditional probability P(x_i | c) for each attribute. Let D_c denote the set of class-c samples in the training set D; given sufficient independent and identically distributed samples, the class prior probability can easily be estimated by formula 2.3:
P(c) = |D_c| / |D|    (2.3)
For discrete attributes, let D_{c,x_i} denote the set of samples in D_c whose value on the i-th attribute is x_i; the conditional probability P(x_i | c) is then given by formula 2.4:
P(x_i | c) = |D_{c,x_i}| / |D_c|    (2.4)
For continuous attributes, a probability density is assumed, as in formula 2.5:
p(x_i | c) ~ N(μ_{c,i}, σ²_{c,i})    (2.5)
where μ_{c,i} and σ²_{c,i} are respectively the mean and variance of the class-c samples on attribute i (the corresponding continuous variable is assumed here to follow a normal distribution), as shown in formula 2.6:
p(x_i | c) = (1 / (√(2π) σ_{c,i})) · exp( -(x_i - μ_{c,i})² / (2σ²_{c,i}) )    (2.6)
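A minimal sketch of a naive Bayes text classifier is given below; it is not the patent's model. It assumes scikit-learn, pre-segmented (space-joined) text, and toy labels chosen purely for illustration.

# Hypothetical sketch of a naive Bayes humor classifier (illustrative data only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["戒烟 容易 戒掉 你 很难", "今天 天气 很好", "老鼠 扛刀 满街 找猫", "会议 下午 三点 开始"]
labels = [1, 0, 1, 0]                       # 1 = humorous, 0 = non-humorous (toy labels)

vec = CountVectorizer()                      # word-count features (the attributes x_i)
X = vec.fit_transform(texts)

clf = MultinomialNB()                        # estimates P(c) and P(x_i | c) from counts
clf.fit(X, labels)
print(clf.predict(vec.transform(["满街 找猫 扛刀"])))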
(2) decision tree model
Decision trees (Decision trees) are a basic classification and regression method, called classification trees when they are used for classification and regression trees when they are used for regression. Since the classification task is discussed herein, the classification tree is mainly presented herein.
A classification tree is a tree structure that describes the classification of instances. When a classification tree is used for classification, a certain feature of the instance is tested starting from the root node, and the instance is assigned to a child node according to the test result; each child node corresponds to one value of that feature. The instance is tested and assigned recursively in this way until a leaf node is reached, and the instance is finally classified into the class of that leaf node. The goal of classification tree learning is to construct, from a given training data set, a decision tree model that can classify instances correctly. Decision tree learning essentially induces a set of classification rules from the training data set. There may be many decision trees consistent with the training data set (i.e. decision trees that classify the training data correctly), or there may be none; what is needed is a decision tree that contradicts the training data as little as possible while having good generalization ability. From another perspective, decision tree learning estimates a conditional probability model from the training data set. There are infinitely many conditional probability models of the classes based on partitions of the feature space; the selected model should fit the training data well and predict unknown data well. This objective is expressed by a loss function, usually a regularized maximum likelihood function, and the strategy of decision tree learning is to minimize this loss function as the objective function. Once the loss function is determined, the learning problem becomes the problem of selecting the optimal decision tree in the sense of that loss function.
(3) Support vector machine model
The SVM is a binary classification model and a supervised statistical learning method; it minimizes the empirical error while maximizing the geometric margin, and is therefore called a maximum-margin classifier. It can be used for classification and regression analysis. As shown in FIG. 4, the learning strategy of the support vector machine is margin maximization, which can be formulated as a convex quadratic programming problem, equivalent to minimizing a regularized hinge loss. The learning algorithm of the support vector machine is an optimization algorithm for solving convex quadratic programming.
Assume a linearly separable training data set over the feature space:
T = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}
where x_i ∈ R^n and y_i ∈ {+1, -1}, i = 1, 2, ..., n; x_i is the i-th feature vector, also called an instance, and y_i is the class label of x_i. When y_i = +1, x_i is called a positive example; when y_i = -1, x_i is called a negative example; and (x_i, y_i) is called a sample point.
The separating hyperplane obtained by learning through margin maximization, or equivalently by solving the corresponding convex quadratic programming problem, is:
w* · x + b* = 0    (2.7)
where w is the normal vector of the classification hyperplane and b is the intercept (also the offset); the plane is determined by the normal vector w and the intercept b. The separating hyperplane divides the space into two parts, one positive and one negative, with the side the normal vector points toward being the positive class. According to the point-to-plane distance formula, in order to maximize the final classification margin the problem can be transformed into the following (formulas 2.8 and 2.9):
min_{w,b}  (1/2) ||w||²    (2.8)
s.t.  y_i (w · x_i + b) - 1 ≥ 0,  i = 1, 2, ..., n    (2.9)
For convenience of solution, the above problem can be converted into its dual problem (formulas 2.10 and 2.11), which finally yields the optimal solution w*, b*:
min_α  (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j (x_i · x_j) - Σ_{i=1}^{n} α_i    (2.10)
s.t.  Σ_{i=1}^{n} α_i y_i = 0    (2.11)
where α_i ≥ 0, i = 1, 2, 3, ..., n.
Non-linear problems are often hard to solve directly, so it is desirable to reduce them to linear classification problems. The approach is to apply a non-linear transformation that converts the non-linear problem into a linear one, and to solve the original non-linear problem by solving the transformed linear problem. The "kernel" refers to the kernel function, which reduces the amount of computation in the high-dimensional space. Commonly used kernel functions include the following three: linear kernels, polynomial kernels and Gaussian kernels.
The linear kernel is the simplest kernel; its mathematical expression is shown in formula 2.12:
k(x, y) = x^T y    (2.12)
Polynomial kernel function:
k(x, z) = (x · z + 1)^p    (2.13)
The corresponding support vector machine is a polynomial classifier of degree p, and the classification decision function becomes:
f(x) = sign( Σ_{i=1}^{n} α_i* y_i (x_i · x + 1)^p + b* )    (2.14)
Gaussian kernel function:
K(x, z) = exp( -||x - z||² / (2σ²) )    (2.15)
The corresponding support vector machine is a Gaussian radial basis function classifier, and the classification decision function becomes:
f(x) = sign( Σ_{i=1}^{n} α_i* y_i exp( -||x - x_i||² / (2σ²) ) + b* )    (2.16)
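The following sketch shows support vector machines trained with the three kernels discussed above; it assumes scikit-learn, and the toy feature vectors and labels are illustrative only.

# Hypothetical sketch of SVMs with linear, polynomial and Gaussian (RBF) kernels.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.2, 0.1], [0.4, 0.3], [1.8, 2.0], [2.2, 1.9]])   # toy feature vectors x_i
y = np.array([-1, -1, 1, 1])                                      # labels y_i in {+1, -1}

for name, clf in [("linear", SVC(kernel="linear")),
                  ("polynomial, degree p=3", SVC(kernel="poly", degree=3)),
                  ("Gaussian RBF", SVC(kernel="rbf", gamma=0.5))]:
    clf.fit(X, y)
    print(name, clf.predict([[2.0, 2.1]]))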
2.1.6 deep learning related techniques
(1) Convolutional neural network
A convolutional neural network differs from other network structures such as fully connected networks and has three distinctive structural characteristics: 1) local connection: each neuron is no longer connected to all neurons in the previous layer, but only to a small fraction of them; 2) weight sharing: a group of connections can share the same weights instead of each connection having its own, which greatly reduces the number of parameters; 3) down-sampling: the amount of data to be processed can be reduced while useful information is retained. Kim et al. [48] proposed applying the CNN model to the text classification task in 2014, which can be regarded as a pioneering work. The CNN model mainly consists of convolutional layers, pooling layers and fully connected layers, as shown in FIG. 5.
The input layer converts a piece of text into the input format required by the convolutional layer, typically a vector matrix of the sentence. Let n denote the number of words in the text. Because the number of words varies, the input text is preprocessed to a fixed length; common choices are the longest text length in the data, or a length, based on the statistical distribution of text lengths, that covers most of the texts. Let k denote the dimension of the word embedding; GloVe word vectors, Word2Vec word vectors, BERT and so on can generally be used to improve natural language processing tasks. The sentence matrix input to the model can then be expressed as formula 2.17:
x_{1:n} = x_1 ⊕ x_2 ⊕ ... ⊕ x_n    (2.17)
where ⊕ denotes the concatenation (splicing) operation.
A convolution operation is then performed on the words within a window to obtain the required feature value. Specifically, let c_i denote the feature value obtained after the convolution operation on the window of words x_{i:i+h-1}, as shown in formula 2.18:
c_i = f(ω · x_{i:i+h-1} + b)    (2.18)
where b ∈ R is a bias term and f is a non-linear function.
The convolution operation is performed on every possible window in the sentence matrix to obtain the feature map, as shown in formula 2.19:
c = [c_1, c_2, ..., c_{n-h+1}]    (2.19)
Finally, a maximum pooling operation is performed on the feature map c in the pooling layer to obtain the final feature, as shown in formula 2.20:
ĉ = max{c}    (2.20)
the above describes the process of extracting a feature from a convolution kernel. In general, in an experiment, different feature representations can be obtained by using a plurality of convolution kernels, and then the feature representations are transferred to a fully connected softmax layer, so that the probability distribution of text labels can be output, and a process of text classification is completed.
(2) Recurrent neural networks
As early as 1982, the physicist John Hopfield of the California Institute of Technology invented the single-layer feedback Hopfield network to solve combinatorial optimization problems; this is the earliest prototype of the RNN. After continuous innovation and improvement, the present Recurrent Neural Network (RNN) model was proposed.
The great advantage of the recurrent neural network is that it can model sequence information well. A simple recurrent neural network consists of an input layer, a hidden layer and an output layer, as shown in FIG. 6; unfolding the recurrent layer along the time axis yields FIG. 7.
This network receives an input x_t at time t; the value of the hidden layer is then s_t and the output value is o_t, where s_t depends not only on x_t but also on s_{t-1}. Specifically, the computation of the recurrent neural network can be expressed by the following formulas:
o_t = g(v · s_t)    (2.21)
s_t = f(u · x_t + w · s_{t-1})    (2.22)
where x_t denotes the input, u the weight from the input layer to the hidden layer, v the weight from the hidden layer to the output layer, w the weight applied to the previous hidden-layer value at the current input, and s_t the value of the hidden layer at time t.
The core part of the recurrent neural network is the memory unit. However, in some cases the distance between the relevant information and the position where it is needed is very large, and as this distance increases the recurrent neural network becomes unable to connect the relevant information; this is its disadvantage. With further research on recurrent neural networks, the long short-term memory network derived from them solves this problem.
(3) Long and short term memory network
The network structure of the long-short term memory network is shown in fig. 8.
The core of the long short-term memory network is the cell state. Information can be added to or removed from the cell state through structures called "gates", which mainly comprise an input gate, a forget gate and an output gate. In the first step, the long short-term memory network decides what information to discard from the cell state; this is determined by a neural layer with the sigmoid activation function, i.e. the forget gate. For each number, the forget gate outputs a value in the interval [0, 1], where 1 means "keep completely" and 0 means "forget completely". Let the current time be t; the forget gate f_t decides what information to discard according to formula 2.23, where h_{t-1} is the output at the previous time, x_t is the input at time t, w_f is a parameter matrix and b_f is a bias vector.
f_t = σ(w_f · [h_{t-1}, x_t] + b_f)    (2.23)
The second step decides which new information will be stored in the cell state, denoted i_t. First, the sigmoid layer called the "input gate" decides which values will be updated, as in formula 2.24, where w_i is a parameter matrix and b_i is a bias vector.
i_t = σ(w_i · [h_{t-1}, x_t] + b_i)    (2.24)
Then a new candidate value vector, denoted C̃_t, is created with a tanh activation layer and added to the candidate cell state, as in formula 2.25, where w_c is a parameter matrix and b_c is a bias vector.
C̃_t = tanh(w_c · [h_{t-1}, x_t] + b_c)    (2.25)
Next, the old cell state is updated to the new cell state: the input-gate value i_t and the forget-gate value f_t are multiplied with the current candidate cell state C̃_t and the previous cell state c_{t-1} respectively, and the results are summed to obtain the current state c_t, as in formula 2.26.
c_t = f_t * c_{t-1} + i_t * C̃_t    (2.26)
The output gate determines what is output, denoted o_t, as in formula 2.27, where w_o is a parameter matrix and b_o is a bias vector.
o_t = σ(w_o · [h_{t-1}, x_t] + b_o)    (2.27)
Finally, the cell state is passed through a tanh function (mapping values into the interval [-1, 1]) and multiplied by the output of the sigmoid output gate, so that only the selected part is output, as in formula 2.28.
h_t = o_t * tanh(c_t)    (2.28)
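The NumPy sketch below mirrors one LSTM time step following formulas 2.23 to 2.28; it is for illustration only, with random placeholder weights and arbitrary dimensions.

# Hypothetical NumPy sketch of one LSTM time step (formulas 2.23-2.28).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    hx = np.concatenate([h_prev, x_t])                 # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ hx + b["f"])                # forget gate, formula 2.23
    i_t = sigmoid(W["i"] @ hx + b["i"])                # input gate, formula 2.24
    c_hat = np.tanh(W["c"] @ hx + b["c"])              # candidate state, formula 2.25
    c_t = f_t * c_prev + i_t * c_hat                   # new cell state, formula 2.26
    o_t = sigmoid(W["o"] @ hx + b["o"])                # output gate, formula 2.27
    h_t = o_t * np.tanh(c_t)                           # output, formula 2.28
    return h_t, c_t

d_in, d_h = 8, 16
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((d_h, d_h + d_in)) for k in "fico"}
b = {k: np.zeros(d_h) for k in "fico"}
h, c = lstm_step(rng.standard_normal(d_in), np.zeros(d_h), np.zeros(d_h), W, b)
print(h.shape, c.shape)   # (16,) (16,)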
2.1.7 summary of this chapter
This chapter mainly introduces the theoretical background of humor classification, including text preprocessing, word representation methods and their development, and the classical machine learning and deep learning techniques involved, providing the theoretical and technical basis for the subsequent research.
3.1 Research on the reverse translation-based data augmentation technique
Reverse translation means translating the Chinese humor data set into an English data set with a machine translation method and then translating the English data set back into a Chinese data set; this is the reverse translation process. Zhang Jie [50] studied the differences between the Chinese and English languages, pointing out that the two languages have different structures owing to cultural differences. Similarly, Zhang Huimei [51] pointed out, in research on English-Chinese inter-translation, that a communicative expression in one culture is not necessarily effective in another because of differences in culture and custom. Therefore, by exploiting the differences between Chinese and English in expression, culture, custom and translation, and by combining earlier linguistic theories of humor with the reverse translation technique, this work traces the source of humor, identifies the humor features of the data set, and explores the influence of reverse translation on humor feature theory. The specific overall structure is shown in FIG. 9.
3.1.1 overview of reverse translation technology
In recent years, with the realization of more data augmentation techniques, the reverse translation technique has gradually entered the natural language processing field. Reverse translation is a simple, convenient and effective data augmentation technique. It mainly uses a machine translation tool to translate a data set in the original language into a desired target language (which can be English or another foreign language), and then uses the machine translation tool to translate the target-language data set back into a corpus in the original language; this constitutes reverse translation. The method here is based on Chinese-English translation in the reverse translation process and achieves differentiation of the data set through the differences between the languages. This is not only helpful for analyzing humor features, but also helpful for generating a new data set from the limited Chinese humor data, achieving another effect, namely data augmentation. FIG. 10 shows a block diagram of the reverse translation technique.
3.1.2 Overview of the humor categories in this data set
The types of humor are varied, but humor language has its own most basic linguistic characteristics. One study [52] of the linguistic functions and behavioral characteristics of humor explored humor language from the perspective of grammar and rhetoric and summarized eight characteristics, including contradiction, exaggeration, evading the main point, answering beside the question, ambiguity, making something out of nothing, citation, feigned foolishness and identity confusion. Ran Mingzhi [53] analyzed incongruity and its characteristics in humorous utterances. Based on incongruity theory and the linguistic characteristics of humor, Zhang et al. designed five categories, namely phonetic characteristics, morphological and syntactic characteristics, lexical and semantic characteristics, pragmatic characteristics and emotional characteristics, covering more than 50 kinds of humor features in total.
Based on these scholars' studies of Chinese humor linguistics, four types of humor theory were found to fit this data set: the phonetic feature theory, the Chinese structural feature theory, the lexical semantic feature theory, and the new-buzzword and dialect feature theory. These four classes of theory are described in detail below.
(1) Theory of speech characteristics
In Chinese, English and other languages, one of the most common phenomena is humor generated by speech sounds, mainly in the form of homophones (harmonic characters). Harmonic sounds are widely used in Chinese as a common language phenomenon: they express meaning by means of identical or similar pronunciations. In the humor domain, harmonic sounds usually appear as alliteration (head rhyme) or end rhyme (tail rhyme). In interpersonal communication, the speaker induces humor through the incongruity of the sounds produced. In English, for example (Example 1), a phrase such as "as fit as a fiddle" carries no humor at the semantic level, but the repetition of the initial consonant f creates humor from a phonetic feature. In Chinese, for example (Example 2): "The hair goes missing, and the person becomes more outstanding!" The word for "missing" is pronounced zong and the word for "outstanding" is pronounced zhong; the shared final "ong" forms an end rhyme, giving a witty twist and producing a humorous effect. These examples show that harmonic sounds are a strong source of humor among the phonetic features; in many actual humorous texts, even when the humor at the semantic level is weak, applying the phonetic features of head rhyme or tail rhyme can still create wit or strengthen the expression of humor.
(2) Theory of structural features
Expressions that exploit sentence structure are also common in language. Our traditional Chinese Spring Festival couplets, for example, are a unique literary form characterized by upper and lower lines of equal length matched word for word. Sentence-structure expression plays a great role in Chinese and in humorous sentences. In humorous interpersonal communication, the upper and lower clauses often have the same number of characters, frequently accompanied by matching pronunciation. For example (Example 3): "Quitting smoking is easy; quitting you is hard." Even without considering the semantic information, the clauses have the same number of characters in the Chinese expression, making the sentence catchy and easy to understand, and this alone can produce a humorous effect. Some sentences combine the equal-character-count characteristic with the harmonic-sound characteristic of section 3.1.1, which makes the humor stronger. For example (Example 4): "The mouse carries a knife and searches the streets for the cat." The pronunciations "dao" (knife) and "mao" (cat) share the final "ao", forming a tail rhyme; at the same time the upper and lower clauses have the same number of characters, and the combination of the two features makes the humorous expression stronger.
(3) Lexical semantic feature theory
The lexical semantic feature means that the same character or word expresses different meanings in the same sentence, causing semantic ambiguity and thereby inducing humor. In many cases the different meanings are accompanied by a change of part of speech. For example (Example 5): Teacher: Are you a boy or a girl? Pupil: I was born of my mother! Here the same character, sheng, does not carry the same meaning for the pupil and the teacher: in "boy" and "girl" it acts as a noun component, while in the pupil's answer it is used as the verb "to be born". The same character thus takes on different parts of speech and different meanings, producing ambiguity that easily generates a humorous effect.
(4) Theory of new trend and dialect features
Sometimes people naturally use new buzzwords or dialect in communication. For example, when a southerner first hears a northeastern dialect expression, it may sound funny at once, perhaps without any actual meaning, simply because the listener is caught off guard. The same is true of new buzzwords. However, if the parties are already familiar with such expressions, the humorous effect produced by some dialect words or buzzwords is not very prominent. Therefore, the new-buzzword and dialect feature theory may produce humor only in certain situations.
3.1.3 Effect of reverse translation on different types of humor data
Reverse translation is not only a popular means of data enhancement; it also plays an important role in the humor data field. A back-translated data set can be compared and analyzed against the original data set: using the differences between Chinese and English and applying linguistic theory, the nature of humor can be interpreted in depth, providing a theoretical basis for future humor research and for automatic humor judgment by computers. The following shows that visualization with the reverse translation technique together with syntactic analysis can reveal the diversity of sentences, as shown in fig. 11 and 12. Original sentence: "The road is not smooth; give a roar and keep going forward." This study compares the existing data set, sentence by sentence, with its back-translated counterpart. For each type, three humorous sentences are shown for analysis, and each example lists the original sentence, the back-translated sentence, whether the humor survives after reverse translation, an analysis of the laugh point, and related methods by which a computer could automatically recognize that type. Finally, the analysis results for each type are summarized to illustrate the influence of reverse translation on recognizing humor features.
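The back-translation loop itself is simple: each Chinese sentence is translated into English and the English is translated back into Chinese. A minimal sketch follows; the translate() helper is a hypothetical placeholder, since the document does not tie the method to any particular machine-translation engine.

# Minimal back-translation sketch. translate() is hypothetical and would wrap
# whatever machine-translation service is actually available.
def translate(text: str, src: str, tgt: str) -> str:
    """Placeholder for a Chinese<->English machine-translation call."""
    raise NotImplementedError("plug in a real MT service here")

def back_translate(zh_sentence: str) -> str:
    """Chinese -> English -> Chinese, as used for the comparisons below."""
    en = translate(zh_sentence, src="zh", tgt="en")
    return translate(en, src="en", tgt="zh")

# Comparing the original and back-translated sentences side by side is what
# reveals which humor cues (rhyme, structure, ambiguity) survive translation.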
(1) Harmonic character
As shown in Table 3.1, the original Chinese sentence is back-translated to obtain a new Chinese sentence, and comparing the two identifies the homophonic laugh-point features of the sentence. Taking sentence 1 as an example, the humor is lost after the sentence is back-translated. The laugh point of the original sentence plays on the Nongfu Spring ("farmer mountain spring") advertisement, whose slogan is: "Nongfu Spring, a bit sweet." The interplay between "farmer", "mountain spring" and the advertising slogan creates words that sound alike but differ in meaning, which forms the sense of humor. To recognize this kind of humor automatically, word lists and rules for homophones can be formulated.
TABLE 3.1 harmonic phonetic analysis
From Table 3.1 above we can see that humor caused by homophonic characters sharing the same pinyin finals is common, so pinyin features can be extracted from the humorous text data set to improve the humor classification effect.
(2) Structural symmetry
As shown in Table 3.2, the original Chinese sentence is back-translated to obtain a new Chinese sentence, and comparing the two identifies the structurally symmetric laugh-point features of the sentence. Taking sentence 1 as an example, the humor is lost after back-translation, together with the language structure and the homophonic sounds. The laugh point of the original sentence is that the upper and lower clauses are structurally symmetric, have the same number of characters, and carry a homophonic effect, so the sentence rolls off the tongue and produces humor. To recognize this kind of humor, syntactic analysis or homophone features are needed to judge the relationship between the upper and lower clauses.
TABLE 3.2 structural symmetry analysis
From Table 3.2 above we can see that structurally symmetric humorous sentences often co-occur with homophonic effects; since pronunciation and structure appear together in many cases, structural features and homophone features can both be extracted for humor classification and recognition.
(3) One word with multiple meanings (polysemy)
As shown in Table 3.3, the original Chinese sentence is back-translated to obtain a new Chinese sentence, and comparing the two identifies the polysemy-based laugh-point features of the sentence. Taking sentence 1 as an example, the humor is lost after back-translation, and the polysemy disappears in the translated version. The laugh point of the original sentence is that "red" describes the leader's face turning red in one clause, while in the other clause the employee's "red" refers to the money handed over; the former usage leans toward a verb, the latter toward a noun. The two occurrences of "red" take entirely different meanings in their different contexts, which produces the humorous effect. To recognize this kind of humor, the semantic distance between the two occurrences must be computed so as to judge the semantic shift between the upper and lower clauses.
TABLE 3.3 one-word polysemous analysis
As can be seen from Table 3.3 above, in Chinese humor it is common for the ambiguity of a polysemous word to create the humorous effect. Features can be extracted with a model that can resolve word-sense ambiguity; the BERT model, for example, can handle polysemy. In many cases, however, polysemy is accompanied by a change in part of speech, so part-of-speech features can also be added to the feature engineering for humor classification.
(4) New trend words and dialects
As shown in table 3.4, the original Chinese sentence is back-translated to obtain a new Chinese sentence, and comparing the two identifies the new trend word and dialect laugh-point features of the sentence. Taking sentence 1 as an example, the humor is lost after back-translation, along with the translated new trend word. The laugh point of the original sentence is the trendy reference to HELLO KITTY, which creates the humorous effect; recognizing this kind of humor requires building a lexicon and rules for new trend words. Taking sentence 2 as an example, the humor is likewise lost after back-translation, along with the dialect feature. The laugh point of the original sentence lies in a Northeastern dialect expression, which forms the humorous effect; recognizing this kind of humor requires building a dialect lexicon and rules.
TABLE 3.4 analysis of the New trend and dialect
As can be seen from the comparative analysis in table 3.4 above, new trend words and dialect expressions can also create humorous effects, but such data are relatively scarce in the data set, so no separate feature extraction is performed for new trend words and dialects in the feature extraction stage.
3.1.4 summary of this chapter
This chapter has summarized humor theory and used a novel data enhancement method, the reverse translation technique, to identify deep features of humorous material. By comparing with the original data set and tracing humor back to its source, several features that trigger humor in real interpersonal interaction were found: harmonic (homophonic) character features, structural features, polysemy features, and so on. The reverse translation technique can thus serve not only as a data enhancement technique but also as a tool that helps identify the deep features of many humorous texts.
Harmonic character features and homophone features belong to the category of phonetic features, and the corresponding features can be extracted using the characteristics of Chinese pinyin. The analysis shows that structural features very often appear together with harmonic character features, so feature extraction can again rely on Chinese pinyin characteristics or on deeper feature-vector tools. For polysemy, features can be extracted through part-of-speech features or a word-sense feature extraction model. These feature extractions trace humor back to its origin, rather than chasing higher scores with ever more advanced automated models. The analysis shows that the reverse translation technique can mine the deep information of humor, providing a solid theoretical argument and an experimental foundation for analyzing how different features influence humor and for humor classification.
4.1 Humor classification study based on data enhancement
Traditional humor classification methods, such as the classic machine learning algorithms (support vector machine, decision tree, random forest) and the deep learning algorithms CNN and RNN, can to some extent classify short humorous texts into humorous and non-humorous, but they cannot explain the humor of a text from the standpoint of humor theory. Meanwhile, when traditional word vectors such as GloVe and Word2Vec are used alone, only static word vectors can be obtained from the text, and words are poorly represented when the context is complex and changeable and the same word may carry different meanings. This section therefore designs a BERT-based BPH-BiGRU-Softmax Chinese text humor classification model. Feature vectors extracted by the BERT model serve as the base vectors; the features of humorous text summarized in chapter 3 through the reverse translation technique are introduced into the model as a multi-feature representation of the input; a bidirectional GRU network performs deep feature extraction; and the result is finally fed into a Softmax classifier. This further improves the performance of the model and allows the influence of multi-feature fusion on the classification task to be analyzed. The specific design framework is shown in fig. 13.
4.1.1 BERT model representation
The BERT model is a pre-trained model. A pre-trained model, simply put, learns knowledge and network parameters on one task and saves them. When a new task arrives, if the same model structure is adopted, the parameters are initialized by loading the previously pre-trained weights and the network is then trained on the new data set. Pre-trained models often show their strongest advantage, and achieve good results, when the data set of the new task is small. Commonly used word vectors currently include GloVe, Word2Vec, ELMo and BERT. From one perspective, a classification task in today's natural language processing can be divided into two parts: feature vector extraction and feature vector computation. The feature vector extraction part uses these word vector tools to turn the characters of the experimental texts into vectors on which mathematical operations can be performed. The feature vector computation part belongs to the downstream task: the vectors are fed into classifier models to carry out the text classification.
Word2Vec is static: it can only produce context-free word vectors, so in downstream tasks, for instance when a humorous text contains several identical words with different meanings in the same sentence, the classification effect of the experiment is easily harmed. ELMo achieves context dependence by using Bi-LSTM as the encoder, but it is not a fully bidirectional model, and the next word to be predicted is already present in the given sequence, so the trained word vectors are not necessarily optimal. The BERT model differs from ELMo in that BERT uses the Transformer structure as its encoder, whose advantage is that it supports deeper networks and better parallelism. BERT further increases the generalization ability of the word vector model and fully describes character-level, word-level, sentence-level and even inter-sentence relationship features, so it can handle polysemy and learn deeper semantic information. For these reasons, BERT is used as the basic feature vector representation in the humor classification experiments of this chapter.
4.1.2 humorous classification method based on multi-feature fusion model
The analysis based on the reverse translation technique showed that the Chinese text humor data set of this experiment mainly contains harmonic character features, structural features and lexical meaning features. First, the most direct expression of harmonic character features in Chinese pronunciation is the pinyin, so the PyPinyin library in Python is used as the pinyin feature extraction tool. For the structural features, the analysis showed that humorous sentences with structural features almost always also contain harmonic character features, because a typical Chinese pattern is that the upper and lower clauses share the same structure and form a head rhyme or tail rhyme; the same harmonic-feature extraction tool is therefore used for this aspect. For lexical meaning, Chinese humor usually arises from word-sense ambiguity, which produces the humorous expression. The feature vectors extracted by BERT have proven highly effective, and their most prominent strength is the ability to capture polysemy, context-dependent word embeddings and other information, which is very helpful for handling word-sense ambiguity. The comparison in chapter 3 also showed that the part of speech of some words changes when ambiguity occurs. Therefore, for the lexical meaning features, the BERT model is adopted as the basic feature vector extraction tool, and text part-of-speech features are additionally extracted with the jieba tool. The feature vectors extracted by these three parts are then concatenated, and the concatenated feature vector is fed into the BiGRU model as input. The BiGRU model first reduces the dimensionality of the concatenated feature vector and then performs deep feature extraction to retain the important features. Finally, the important feature information output by the BiGRU model is fed into a Softmax layer, which serves as the classifier and outputs the final probability of each class, yielding the experimental classification result.
4.1.3 Structure overview of the BPH-BiGRU-Softmax model
As shown in fig. 13, the humor classification model of this chapter mainly comprises a text input layer, a BERT-Embedding word embedding layer, a Chinese pinyin feature embedding layer, a text part-of-speech feature embedding layer, a feature fusion layer, a BiGRU layer and a fully connected layer; the fully connected layer finally completes the classified output of the Chinese humorous text.
(1) Text input layer
Unlike other models, the text input layer of BERT takes whole sentences as input; the input is passed to the next layer, where it is converted into a vector matrix.
(2) BERT embedding layer
The BERT model is a pre-trained model with strong generalization ability and can also serve as a bidirectional, deep text representation model. This embedding layer uses BERT as the text representation: it converts the sentences from the text input layer into vectors and passes them on to the classification layers.
The embedding layer of BERT consists of three parts: Token Embeddings, Segment Embeddings and Position Embeddings.
Token Embeddings layer: converts each token into a fixed-dimension vector. In BERT, each token is converted into a 768-dimensional vector representation.
Segment Embeddings: BERT can handle classification tasks over pairs of input sentences. The two sentences of a pair are simply spliced together and fed into the model, and the Segment Embeddings are what allow the model to distinguish the two sentences within the pair.
Position Embeddings: represent the position information of each token within the sentence.
When a sentence is represented as text, the BERT model applies fixed tags to it at the text input layer: CLS is added at the beginning of the sentence to mark its start, SEP is added at the end to mark its end, and the sentence index is added. If each token of a sample sentence from the text input layer is denoted by w, the sentence S can be written as S = {w1, w2, w3, ..., wn}, where n denotes the sequence length of the sample sentence. The matrix V generated by representing the sentence as vectors through the BERT model is shown in fig. 14.
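As one concrete illustration of this step (not part of the original text), the BERT representation can be obtained with the HuggingFace transformers library and the public bert-base-chinese checkpoint; the library, the checkpoint and the example sentence are assumptions of this sketch, not choices stated in the document.

# Possible realization of the BERT embedding layer (assumed libraries/checkpoint).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

sentence = "我爱中国"                                # sample sentence
encoded = tokenizer(sentence, return_tensors="pt")  # adds the [CLS] and [SEP] tags

with torch.no_grad():
    outputs = bert(**encoded)

V = outputs.last_hidden_state   # the matrix V: shape (1, sequence length, 768)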
(3) Chinese pinyin embedding layer
Python provides a library for converting Chinese characters to pinyin, PyPinyin. It can be used for phonetic annotation, sorting, retrieval and similar tasks, and was developed on the basis of the hotoo/pinyin library. The algorithm for extracting features with Chinese pinyin consists of the following steps (a short code sketch follows the steps):
step 1: chinese character phonetic transcription
That is, each Chinese character in the sentence to be characterized is converted into pinyin. For example, the sentence "I love China" is converted into the form "wo ai zhong guo".
Step 2: obtaining a unique character set
In text processing work a dictionary (the collection of all tokens contained in a corpus) is often used, and here we likewise need to build this "dictionary" (character set) first. Each character corresponds to an integer that serves as its ID; the role of the dictionary is to convert between characters/words and numbers.
Step 3: pinyin vectorization
Building on the previous two steps, all of the basic prerequisites for forming word vectors are in place, and the text to be converted is vectorized as pinyin.
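A minimal sketch of these three steps, assuming the PyPinyin library's lazy_pinyin helper and an illustrative ID scheme (the exact vectorization details are not spelled out in the document):

# Pinyin feature sketch using PyPinyin (ID mapping is an illustrative assumption).
from pypinyin import lazy_pinyin

sentence = "我爱中国"

# Step 1: convert every character to pinyin, e.g. ['wo', 'ai', 'zhong', 'guo']
pinyin_seq = lazy_pinyin(sentence)

# Step 2: build the "dictionary" of unique pinyin tokens and assign integer IDs
vocab = {p: i for i, p in enumerate(sorted(set(pinyin_seq)))}

# Step 3: vectorize the sentence as a sequence of pinyin IDs
pinyin_ids = [vocab[p] for p in pinyin_seq]
print(pinyin_seq, pinyin_ids)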
(4) Text part-of-speech embedding layer
The jieba tool used for text part-of-speech feature extraction is currently the most widely used Chinese text processing tool in the Python community. The specific procedure is to first load a stop-word list, perform word segmentation on the sentences of the text, and then extract all parts of speech and convert them into part-of-speech feature vectors.
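As an illustration (not from the original text), the segmentation and tagging step could look like the following; the stop-word handling and the tag-to-ID mapping are assumptions of the sketch.

# Part-of-speech feature sketch with jieba's posseg module.
import jieba.posseg as pseg

stopwords = set()  # in practice, loaded from a stop-word file

sentence = "我爱中国"
pairs = [(w, flag) for w, flag in pseg.cut(sentence) if w not in stopwords]

# Map each part-of-speech tag (e.g. 'r', 'v', 'ns') to an integer feature
pos_vocab = {}
pos_ids = [pos_vocab.setdefault(flag, len(pos_vocab)) for _, flag in pairs]
print(pairs, pos_ids)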
(5) Feature fusion layer
To trace the origin of textual humor and improve the humor recognition effect, this layer fuses the feature vector matrix extracted by the BERT model with the Chinese pinyin features and text part-of-speech feature vectors identified in chapter 3 through the reverse translation comparison, forming a multi-feature representation for training in the deep learning model. If the feature vector matrix generated by passing a sample sentence of the text input layer through the BERT model is V, the feature-fusion representation of that sample sentence can be expressed by formula 4.1:
W = f1 ⊕ f2    (4.1)
In the above formula, W represents the newly generated fused feature vector, ⊕ denotes vector concatenation (splicing), f1 represents the word vector features, and f2 represents the Chinese pinyin features.
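Under the reading that fusion means concatenating the BERT matrix with the pinyin and part-of-speech vectors described above, the step could look like the following sketch; the tensor shapes are illustrative assumptions.

# Feature fusion as vector concatenation (dimensions are assumptions).
import torch

n_tokens = 10
V     = torch.randn(1, n_tokens, 768)   # BERT feature matrix for one sentence
f_pin = torch.randn(1, n_tokens, 32)    # Chinese pinyin feature vectors
f_pos = torch.randn(1, n_tokens, 32)    # part-of-speech feature vectors

W = torch.cat([V, f_pin, f_pos], dim=-1)  # fused representation fed to the BiGRU
print(W.shape)                            # torch.Size([1, 10, 832])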
(6) BiGRU layer
The GRU (Gated Recurrent Unit) is a variant of the Long Short-Term Memory network (LSTM). Like LSTM, it was proposed to address long-term dependency and the gradient problems of back-propagation, and it belongs to the family of improved recurrent neural network (RNN) models. Unlike LSTM, the GRU has only two gates, an update gate and a reset gate, denoted zt and rt. Because a single GRU only captures forward context and ignores backward context, a bidirectional GRU network is used here, obtaining context information in both directions to improve the accuracy of feature extraction. The basic GRU model is shown in fig. 15.
The BiGRU layer mainly consists of a forward GRU layer and a backward GRU layer. The forward and backward networks perform context learning on the feature vector matrix W output by the feature fusion layer, extracting deeper features from the text.
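A minimal PyTorch rendering of this layer, with the hidden size and pooling choice made purely for illustration:

# BiGRU layer sketch (hidden size is an assumption).
import torch
import torch.nn as nn

bigru = nn.GRU(input_size=832, hidden_size=128,
               batch_first=True, bidirectional=True)

W = torch.randn(1, 10, 832)          # fused feature matrix from the previous layer
H, _ = bigru(W)                      # H: (1, 10, 256), forward + backward states
sentence_repr = H[:, -1, :]          # one simple pooling choice for classification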
(7) Softmax layer
The Softmax layer is used very widely in machine learning and deep learning. In this layer, the features output by the previous layer are converted into probabilities over the labels; that is, the feature vector is mapped to a probability sequence. If Vi denotes the i-th element of the vector V, the corresponding output value can be expressed by equation 4.2 as follows:
Softmax(V)i = exp(Vi) / Σj exp(Vj)    (4.2)
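A direct numeric rendering of equation 4.2 (the example logits are arbitrary values chosen only for illustration):

# Softmax from equation 4.2.
import numpy as np

def softmax(v: np.ndarray) -> np.ndarray:
    e = np.exp(v - v.max())          # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 0.5])        # scores for the classes humor / non-humor
print(softmax(logits))               # class probabilities summing to 1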
4.1.4 Experimental setup and results analysis
(1) Experimental Environment
The hardware operating environment configured in this experiment is shown in table 4.1 below:
table 4.1 experimental environment hardware configuration table
Figure BDA0002911682640000163
The software operating environment configured in this experiment is shown in table 4.2 below:
TABLE 4.2 Experimental Environment software operating Environment
Figure BDA0002911682640000164
Figure BDA0002911682640000171
(2) Evaluation index of experimental performance
At present, in machine learning and deep learning, models are generally built to solve specific problems, but judging the quality of a model, i.e. its generalization ability, requires evaluation indexes such as accuracy, recall, the F1 value, ROC and AUC, which are widely applied in tasks such as information retrieval (e.g. search engines), natural language processing, and detection and classification. This experiment mainly uses accuracy and the F1 value as the principal indexes. To keep the experiments fair, every evaluation result reported in this text is the average over 10 runs.
The predicted and actual results of the experiment are expressed with a confusion matrix, from which the corresponding evaluation indexes can be computed; the classification outcomes are shown in table 4.3 below.
TABLE 4.3 Classification results
Predicted positive, actually positive: true positive (TP); predicted positive, actually negative: false positive (FP); predicted negative, actually positive: false negative (FN); predicted negative, actually negative: true negative (TN).
1) Precision
Precision is the ratio of correctly predicted positive texts to all texts predicted as positive, i.e. the probability that a sample predicted as positive is actually positive. In most cases, the higher the precision, the better the model performs. The formula is as follows:
P = TP / (TP + FP)
2) Recall
Recall is the ratio of correctly predicted positive texts to all texts that are actually positive, i.e. the probability that an actual positive sample is predicted as positive. The formula is as follows:
R = TP / (TP + FN)
3) F1 value
The F1 score is the harmonic mean of precision and recall. When training a deep learning model, both precision and recall need to be taken into account, and a single unified value is wanted to evaluate the training effect; the F1 value therefore reflects the overall performance of the model. The formula is as follows:
F1 = 2 * P * R / (P + R)
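These indexes can be computed directly with scikit-learn; the label vectors below are toy values used only for illustration.

# Evaluation indexes with scikit-learn (toy labels).
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))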
(3) Experimental corpus
The corpus for this experimental study is divided into two parts: a public corpus and a self-built corpus. The exploration of the influence of reverse translation on humor described in chapter 3 is based on the public corpus, from which two significant features of Chinese humor, pronunciation features and part-of-speech features, were analyzed. The purpose of the self-built corpus is to verify that the humor features we analyze are not accidental but general.
1) Public corpus
The public corpus of this experiment comes from the Chinese corpus of the Eighteenth China National Conference on Computational Linguistics (CCL2019). It is divided into two classes, humorous and non-humorous, with humorous labeled 1 and non-humorous labeled 0. The corpus is split into a training set of 16420 sentences and a test set of 4105 sentences. Table 4.4 shows the distribution of the Chinese humor data set.
TABLE 4.4 Chinese humor data set distribution Table
Part of the content of the Chinese humor data set is shown in table 4.5.
TABLE 4.5 examples of humorous data sets in Chinese
2) Self-built corpus
The self-built data set consists of texts crawled from joke and funny-story websites on the Internet; nearly 20,000 items were collected with a web crawler. Because parts of the crawled data contained spelling problems, special symbols or malformed sentences, the data were preprocessed and arranged into a standard format, leaving a final self-built data set of 12078 items.
The data set is divided into two classes, humorous and non-humorous. Using the random splitting method in the Sklearn library, the whole data set is divided into a training set and a test set at a ratio of 3:1.
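A sketch of that split with scikit-learn's train_test_split; the variable names and the random seed are illustrative assumptions, not values stated in the document.

# 3:1 random train/test split with scikit-learn.
from sklearn.model_selection import train_test_split

texts  = ["sample sentence 1", "sample sentence 2", "sample sentence 3"]  # crawled corpus
labels = [1, 0, 1]                                                        # 1 = humor, 0 = non-humor

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42)   # 3:1 train/test ratio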
The data set partitioning and positive and negative case classification are shown in table 4.6.
TABLE 4.6 data partitioning case Table
After preprocessing, the data noise has been removed; specific examples are shown in table 4.7.
TABLE 4.7 sample examples of the post-pretreatment data set
(4) Experimental parameter settings
In a deep learning network structure, one core element is the parameters and hyper-parameters, which are the ultimate learning targets when training a deep neural network. The choice of hyper-parameters reflects the capability of the model and directly affects its performance, so considerable time must be spent tuning them. In our experiments, the common hyper-parameters include the learning rate, batch size, dropout rate, number of network layers, number of neurons, number of iterations, and so on, as shown in table 4.8.
TABLE 4.8 model parameter settings Table
(5) Public data set experimental results and analysis
1) Effect of multiple features on the results of humorous classification of Chinese text
In this section's experiment, on the basis of a BERT-BiGRU-Softmax model that extracts only the character-level meaning features of the text with BERT, the features identified by the chapter-3 reverse translation comparison, namely the part-of-speech features and the pinyin features, are added one at a time. Finally the three features are fused together to produce the BPH-BiGRU-Softmax model, and a comparison experiment is carried out; the resulting experimental data are shown in table 4.9 below:
TABLE 4.9 comparison of results of fusing different characteristics
From the experimental results we can see that character-level feature extraction based on the BERT model alone reaches an accuracy of 85.77% and an F1 value of 83.67%. The chapter-3 analysis of the existing data set identified part-of-speech and pinyin features as characteristics of Chinese textual humor. Adding text part-of-speech features on top of the BERT features gives an accuracy of 86.46% and an F1 value of 84.17%, an accuracy improvement of 0.69% over the basic BERT features, which indicates that part-of-speech features are effective for classifying and recognizing Chinese textual humor. Adding Chinese pinyin features on top of the BERT features gives an accuracy of 86.71% and an F1 value of 84.27%, higher than both previous experiments: nearly 1% above the basic BERT features and 0.25% above the part-of-speech variant. Finally, combining the BERT base features, the part-of-speech features and the Chinese pinyin features gives the best experimental result, with accuracy and F1 values of 87.09% and 84.61% respectively, an accuracy improvement of 1.32% over the BERT-BASE features and an F1 improvement of nearly 1%. The table shows that accuracy, F1, recall and precision all increase steadily as further language features are added on top of the BERT features. The results after adding the Chinese pinyin features are better than after adding the part-of-speech features, which reflects the fact that a major characteristic of Chinese textual humor is the use of homophony to express humor, and the best results are obtained when all three features are fused together. This provides a theoretical and experimental basis for analyzing the characteristics of Chinese textual humor from a linguistic standpoint and verifies the effectiveness of the model.
2) Comparison experiment with other network models
To validate the classification model based on multi-feature fusion with the reverse translation technique proposed here, it is compared on the present experimental data set with the following classical network models, with the SVM method applied by Khandelwal et al. [15], and with the TEXT-CNN method used by Chen et al. [18]. The comparison uses the best-performing configuration, the BPH-BiGRU-Softmax model that fuses the BERT features, part-of-speech features and Chinese pinyin features. The results of the experiment are shown in Table 4.10.
TABLE 4.10 comparison of experimental models
The experimental results in the table show that accuracy improves steadily from the SVM model applied by Khandelwal et al. to the TEXT-RCNN model: TEXT-RCNN improves on the SVM method by nearly 8.8% and on the TEXT-CNN model applied by Chen et al. by nearly 1.2%. Compared with these, the multi-feature fusion model based on reverse translation proposed here is still better: its accuracy exceeds the TEXT-RCNN result by 4.87%, making it the best model. This indicates that on this data set an ordinary model cannot capture deep semantic features well enough to achieve a better result, whereas our model integrates several humor features drawn from linguistic theory and traces whether a Chinese text is humorous back to its source, thereby improving the humor classification result.
3) Exploring important model hyper-parameters
In deep learning, training a good model requires finding appropriate parameters. If the model parameters are chosen badly, the network may fail to show its best capability or may even behave in the opposite way, for example unsatisfactory training results caused by over-fitting, excessive time and cost, or poor convergence. The experiment in this subsection focuses on the effect of the Batchsize value on model training.
Deep learning data sets are now large, and if the number of samples is large it is impractical to train on all of the data at once. The mini-batch training method is therefore generally adopted: at each step, Batchsize text items from the whole data set form one batch and serve as the input of that step. The output is compared with the expected values of that batch of samples, the loss is computed with the loss function, the weights and biases are updated, and the new parameters become the initial values for the next batch. Because the data differ between updates, there is a certain randomness, and through continuous iterative learning the performance of the network model gradually approaches a stable state.
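A minimal mini-batch loop in PyTorch illustrating this procedure; the stand-in linear model and the random tensors are assumptions used only to keep the sketch self-contained, not the actual BPH-BiGRU-Softmax configuration.

# Mini-batch training loop sketch (model and data are placeholders).
import torch
from torch.utils.data import DataLoader, TensorDataset

features = torch.randn(256, 832)          # fused feature vectors (placeholder)
labels   = torch.randint(0, 2, (256,))    # humor / non-humor labels

loader = DataLoader(TensorDataset(features, labels), batch_size=32, shuffle=True)

model     = torch.nn.Linear(832, 2)       # stands in for the full classifier
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn   = torch.nn.CrossEntropyLoss()

for batch_x, batch_y in loader:           # one pass over all mini-batches
    optimizer.zero_grad()
    loss = loss_fn(model(batch_x), batch_y)
    loss.backward()
    optimizer.step()                      # weights updated after every batch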
Generally, once the Batchsize has grown to a certain point, the descent direction it determines barely changes any more; increasing it further only consumes more memory and may even reduce the generalization ability of the model, while a very small Batchsize makes convergence difficult.
This experiment is based on the BPH-BiGRU-Softmax model and the humor data set of this chapter; different Batchsize values are set to explore their influence on the model, and a line graph of the experimental results is shown in FIG. 15.
The graph shows that when the Batchsize is set to 1, the required time is longest, the time cost rises sharply, and the accuracy is not ideal. When the Batchsize reaches 64, the accuracy has already begun to decline even though less time is spent, and when it exceeds 64 the batches become too large for the limited memory of the machine. Because the batches differ from one another, the gradient obtained after each iteration is easily left uncorrected, so choosing a suitable Batchsize within a reasonable range and balancing effect against time cost is important; this is the mini-batch gradient descent commonly used in experiments. Selecting a suitable batch size not only reduces the time cost of the experiment and improves memory utilization, but also finds a more accurate descent direction and reduces the oscillation of the model. In this experiment, a Batchsize of 32 gives the best balance between time cost and accuracy.
4) Contrast experiment of different data enhancement techniques
Based on the reverse translation technique, laugh points can be recognized by linguistically comparing the data set translated back into the source language with the original data set; reverse translation can also serve as an excellent data enhancement technique, expanding the original data set to improve the robustness and generalization ability of the model and thereby obtain a better classification result. The mainstream data enhancement techniques at present are the sentence-level reverse translation technique and the word-level EDA technique. EDA performs random word replacement, random deletion, random insertion and random swapping on the sentences of the original data set, and has achieved good results on many large data sets. This section therefore also compares the two data enhancement techniques: the data are augmented by the two different methods, merged with the original data into a new data set, and fed into the model for the Chinese text humor classification experiment.
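For concreteness, here is a simplified sketch of two of the word-level EDA operations (random swap and random deletion); the probabilities and the pre-segmented example sentence are assumptions, and synonym replacement and random insertion are omitted.

# Simplified EDA sketch: random swap and random deletion at the word level.
import random

def random_swap(tokens, n_swaps=1):
    tokens = tokens[:]
    for _ in range(n_swaps):
        if len(tokens) < 2:
            break
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1):
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else [random.choice(tokens)]

words = ["老鼠", "扛", "刀", "满街", "找", "猫"]   # a pre-segmented example sentence
print(random_swap(words), random_deletion(words))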
A sentence representation using the reverse translation technique and the EDA data enhancement technique is shown in table 4.11.
Table 4.11 data enhanced sentence example
In this experiment, both data enhancement methods were applied with 15% data augmentation. The results are shown in Table 4.12.
TABLE 4.12 Table of results of contrast enhancement experiments with different data
The experimental results in the table show that the reverse translation data enhancement technique performs slightly better than the EDA technique for humor classification; at the same time, accuracy and the F1 value improve by 0.44% and 0.4% respectively compared with the model without any data enhancement. A likely reason is that, in the Chinese humor data set, reverse translation changes the original structure of the data and enriches semantic diversity with less noise than EDA, so the machine can learn by comparison with the original sentence, which better increases the robustness and generalization ability of the model.
(6) Self-built data set experimental results and analysis
Because this humor classification study is oriented to Chinese, two significant features of Chinese humor, the phonetic feature and the part-of-speech feature, were found by reverse-translating the public CCL2019 Chinese humor corpus and comparing it against linguistic humor theory, and they proved effective in experiments on the public CCL2019 Chinese humor data set. To further verify that these two significant features are general, this section performs the same experimental analysis on the self-built data set.
1) Effect of multiple features on the results of Chinese text humor classification
As with the multi-feature exploration on the public data set, this section mainly analyzes the influence of the multiple features on the self-built data set. The specific experimental parameters are the same as for the public data set. The experimental model again takes BERT-BiGRU-Softmax as the base model, gradually adds the features and fuses them to produce the BPH-BiGRU-Softmax model. The results on the self-built data set are shown in table 4.13.
Table 4.13 self-created data set experimental results
The results table shows that the basic feature model BERT reaches an accuracy of 97.43% and an F1 value of 97.33%. After the first feature, the part-of-speech feature, is added and fused, accuracy and F1 improve by 0.14% and 0.18% respectively over the basic BERT model. After the second feature, the pinyin feature, is added and fused, accuracy and F1 improve by 0.17% and 0.21% respectively over the basic BERT model; both features are therefore effective in the fusion. After the basic BERT features are fused with both the part-of-speech and pinyin features, accuracy and F1 reach their best values of 97.89% and 97.85%, increases of 0.46% and 0.52% over the basic BERT model. The fusion of the three features shows a clear advantage and a prominent effect, verifying the effectiveness of the feature method.
2) Comparison experiment with other network models
To verify the effectiveness of the classification model based on multi-feature fusion with the reverse translation technique, this section's experiment on the self-built data set again compares against classical TEXT emotion classification network models such as TEXT-RNN, TEXT-RCNN and DPCNN, against the SVM method applied by Khandelwal et al. [15] in humor classification experiments, and against the TEXT-CNN model method used by Chen et al. [18]. The comparison again uses the best-performing BPH-BiGRU-Softmax model that fuses the BERT features, part-of-speech features and Chinese pinyin features. The experimental results are shown in table 4.14.
TABLE 4.14 comparison of experimental models
Comparing the experimental results in the table above, we can clearly see that accuracy improves steadily from the SVM model to the TEXT-CNN model. The SVM model is a classical machine learning model for text emotion classification, and the models after it are classical deep learning models for the same task. The deep learning model TEXT-CNN improves accuracy by 4.76% over the machine learning SVM, and accuracy rises with model complexity, with TEXT-CNN reaching 94.34%. The BPH-BiGRU-Softmax model proposed here, however, greatly exceeds the results of both the classical network models and the models commonly used for humor text classification, improving on the TEXT-CNN accuracy by 3.36% and making it the best model on the self-built data set. The results show that the model method and the features obtained by comparing large numbers of sentences through the reverse translation technique also apply to the self-built data set; the proposed model and features are therefore general in the Chinese humor classification task and achieve better experimental results.
3) Exploring important model hyper-parameters
As with the important parameter exploration on the public data set, the influence of the Batchsize on model performance also arises for the self-built data set. If a data set is small, Full Batch Learning can be used: the direction determined by the full data set better represents the sample population and points more accurately toward the extremum, although because the gradient values of different weights differ greatly, it is difficult to choose a single global learning rate. Since the self-built data set is large, however, the specific choice of Batchsize has to be tuned experimentally. FIG. 16 below shows the experimental results for different Batchsize values on the self-built data set.
The comparison shows that different Batchsize values differ significantly in both the running time of the model and its accuracy. In general, as the Batchsize grows from 1 to 256, the running time decreases; but running time is only one side of model training, and the accuracy of the model matters most, so whether a Batchsize is appropriate must also take the accuracy into account. The figure shows that when the Batchsize is set to 32 the accuracy is highest and the performance is best, while the running time remains within an acceptable range, so 32 is again chosen as this important parameter of the model on the self-built data set.
(4) Contrast experiment of different data enhancement techniques
The experiment in this subsection again compares the reverse translation data enhancement technique with the EDA technique to verify which data enhancement method suits Chinese text humor classification. To ensure fairness, the data expansion ratio was set to 15% of the data set; the specific results are shown in table 4.15.
TABLE 4.15 Table of results of contrast enhancement experiments with different data
The table above shows that the reverse translation data enhancement technique is still slightly better than the EDA technique, improving accuracy and the F1 value by 0.33% and 0.35% respectively over the proposed model without data enhancement. Because EDA randomly deletes, replaces and shifts data, some noise is introduced, and with a large data set its performance is limited. Reverse translation suffers less from these drawbacks: it can change the expression structure and pattern of a sentence, so the augmented data have structures different from the original sentence while often preserving correct semantic information even when the grammatical structure changes. This increases the diversity of the text corpus and better improves the robustness and generalization ability of the model.
4.1.5 summary of this chapter
This chapter uses BERT word vectors as the base vectors and, drawing on the humor features of the data set analyzed in chapter 3 with the reverse translation technique, extracts pinyin feature vectors and part-of-speech feature vectors with the corresponding extraction tools and fuses them with the BERT word vectors to represent the text. On this basis, a bidirectional gated recurrent unit network (BiGRU) extracts deeper feature information so the network model can understand the text features in depth, and the result is finally fed into a Softmax classifier to complete the Chinese text humor classification experiment; this is the proposed BPH-BiGRU-Softmax model. The chapter first introduces the experimental environment and the performance evaluation indexes, then describes in detail the public and self-built data sets used in the experiments, and then details the experimental parameter settings. Finally, experimental comparisons and analyses are carried out on both the public data set and the self-built data set. The results show that the three-feature fusion method based on BERT word vectors improves substantially over the basic BERT feature model on both data sets, with a clear effect, and that the results of the proposed model greatly surpass the support vector machine (SVM) and TEXT-CNN models recently proposed by other researchers for humor classification. The public data set was also compared using the reverse translation technique against the currently popular EDA data enhancement technique, and the experimental results show that our method is effective.

Claims (7)

1. A reverse translation-based chinese humor classification model, comprising:
s1, a text input layer;
s2, embedding a BERT layer;
s3, embedding Chinese phonetic character into a layer;
s4, embedding text part-of-speech characteristics into a layer;
s5, a characteristic fusion layer;
s6, a BiGRU layer;
and S7, a fully connected layer, which finally completes the classified output of the Chinese humorous text.
2. The reverse translation-based Chinese humor classification model of claim 1, wherein the text input layer takes sentences as input.
3. The reverse translation-based Chinese humor classification model according to claim 1, wherein the Chinese pinyin feature embedding layer comprises the following steps:
converting Chinese characters into pinyin: converting each Chinese character in the sentence to be characterized into Chinese pinyin;
acquiring a unique character set: each character corresponds to an integer as its ID;
and pinyin vectorization: based on the previous two steps, performing pinyin vectorization on the text to be converted.
4. The reverse translation-based Chinese humor classification model according to claim 1, wherein in the text part-of-speech feature embedding layer, the jieba tool is used to load a stop-word list, perform word segmentation on the sentences of the text, and then extract all parts of speech and convert them into part-of-speech feature vectors.
5. The reverse translation-based Chinese humor classification model according to claim 1, wherein in the feature fusion layer, feature fusion is performed on the feature vector matrix extracted by the BERT model, the Chinese pinyin features obtained through comparison by the reverse translation method, and the text part-of-speech feature vectors to form a multi-feature representation, which is trained in the deep learning model; if the feature vector matrix generated by passing a sample sentence of the text input layer through the BERT model is V, the feature-fusion representation of the sample sentence can be expressed by formula 4.1:
W = f1 ⊕ f2    (4.1)
in the above formula, W represents the newly generated fused feature vector, ⊕ denotes vector concatenation, f1 represents the word vector features, and f2 represents the Chinese pinyin features.
6. The reverse translation-based Chinese humor classification model according to claim 1, wherein the BiGRU layer comprises a forward GRU layer and a backward GRU layer, and the forward and backward neural networks perform context learning on the feature vector matrix W output by the feature fusion layer, thereby performing deeper feature extraction on the text.
7. The reverse translation-based Chinese humor classification model according to claim 5, wherein the reverse translation method comprises translating the Chinese humor data set into an English data set by machine translation and then translating the English data set back into a Chinese data set.
CN202110088848.4A 2021-01-22 2021-01-22 Reverse translation-based Chinese humor classification model construction method Active CN112818118B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110088848.4A CN112818118B (en) 2021-01-22 2021-01-22 Reverse translation-based Chinese humor classification model construction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110088848.4A CN112818118B (en) 2021-01-22 2021-01-22 Reverse translation-based Chinese humor classification model construction method

Publications (2)

Publication Number Publication Date
CN112818118A true CN112818118A (en) 2021-05-18
CN112818118B CN112818118B (en) 2024-05-21

Family

ID=75858924

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110088848.4A Active CN112818118B (en) 2021-01-22 2021-01-22 Reverse translation-based Chinese humor classification model construction method

Country Status (1)

Country Link
CN (1) CN112818118B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255342A (en) * 2021-06-11 2021-08-13 云南大学 Method and system for identifying product name of 5G mobile service
CN113673201A (en) * 2021-07-15 2021-11-19 北京三快在线科技有限公司 Text representation vector generation method and device, storage medium and electronic equipment
CN113688622A (en) * 2021-09-05 2021-11-23 安徽清博大数据科技有限公司 Method for identifying situation comedy conversation humor based on NER
CN115512368A (en) * 2022-08-22 2022-12-23 华中农业大学 Cross-modal semantic image generation model and method
CN117315379A (en) * 2023-11-29 2023-12-29 中电科大数据研究院有限公司 Deep learning-oriented medical image classification model fairness evaluation method and device


Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030163316A1 (en) * 2000-04-21 2003-08-28 Addison Edwin R. Text to speech
US20180032870A1 (en) * 2015-10-22 2018-02-01 Tencent Technology (Shenzhen) Company Limited Evaluation method and apparatus based on text analysis, and storage medium
JP2019046476A (en) * 2017-09-04 2019-03-22 大国創新智能科技(東莞)有限公司 Emotion interaction method and robot system based on humor recognition
CN108874896A (en) * 2018-05-22 2018-11-23 大连理工大学 A kind of humorous recognition methods based on neural network and humorous feature
US20200065384A1 (en) * 2018-08-26 2020-02-27 CloudMinds Technology, Inc. Method and System for Intent Classification
CN109741247A (en) * 2018-12-29 2019-05-10 四川大学 A kind of portrait-cartoon generation method neural network based
CN109918556A (en) * 2019-03-08 2019-06-21 北京工业大学 A kind of comprehensive microblog users social networks and microblogging text feature depressive emotion recognition methods
CN110046250A (en) * 2019-03-17 2019-07-23 华南师范大学 Three embedded convolutional neural networks model and its more classification methods of text
CN110347823A (en) * 2019-06-06 2019-10-18 平安科技(深圳)有限公司 Voice-based user classification method, device, computer equipment and storage medium
US20200395008A1 (en) * 2019-06-15 2020-12-17 Very Important Puppets Inc. Personality-Based Conversational Agents and Pragmatic Model, and Related Interfaces and Commercial Models
US20210012199A1 (en) * 2019-07-04 2021-01-14 Zhejiang University Address information feature extraction method based on deep neural network model
CN110472052A (en) * 2019-07-31 2019-11-19 西安理工大学 A kind of Chinese social platform sentiment analysis method based on deep learning
CN110717334A (en) * 2019-09-10 2020-01-21 上海理工大学 Text emotion analysis method based on BERT model and double-channel attention

Non-Patent Citations (13)

* Cited by examiner, † Cited by third party
Title
Donghai Zhang et al.: "Investigations in automatic humor recognition", ACM, 31 December 2005 (2005-12-31), pages 531 - 538 *
Renxian Zhang et al.: "Recognizing humor on Twitter", ACM, pages 889 - 891 *
Renxian Zhang et al.: "Recognizing Humor on Twitter", ACM, pages 889 - 898 *
Yi Liu et al.: "Sentiment analysis for e-commerce product reviews by deep learning model of Bert-BiGRU-Softmax", AIMS, vol. 17, no. 6, pages 7819 - 7824 *
Yu Bengong et al.: "Research on Chinese short text classification based on CP-CNN", Application Research of Computers, vol. 35, no. 4, 30 April 2018 (2018-04-30), pages 1001 - 1004 *
Wu Jun et al.: "Research on Chinese technical term extraction based on a BERT-embedded BiLSTM-CRF model", Journal of the China Society for Scientific and Technical Information, no. 4, 24 April 2020 (2020-04-24), pages 69 - 78 *
Yao Ni; Gao Zhengyuan; Lou Kun; Zhu Fubao: "Research on sentiment classification of online review texts based on BERT and BiGRU", Journal of Light Industry, no. 05, pages 86 - 92 *
Sun Run: "A brief review of humor research at home and abroad", Rural Economy and Science-Technology, vol. 27, no. 20, 30 October 2016 (2016-10-30), pages 223 - 229 *
Xu Linhong; Lin Hongfei; Qi Ruihua; Yang Liang: "A homophonic pun recognition model based on multi-dimensional semantic relations", Scientia Sinica Informationis, no. 11, pages 48 - 58 *
Yang Yong et al.: "Humor recognition based on linguistic features and a hierarchical attention mechanism", Computer Engineering, vol. 46, no. 8, 15 August 2020 (2020-08-15), pages 70 - 77 *
Yang Yong et al.: "Humor recognition based on linguistic features and a hierarchical attention mechanism", Computer Engineering, vol. 46, no. 8, pages 35 - 47 *
Lin Hongfei et al.: "Humor computing and its applications", Journal of Shandong University (Natural Science), vol. 51, no. 7, pages 1 - 10 *
Xie Jinbao et al.: "Chinese text classification with multi-feature fusion based on a semantic-understanding attention neural network", Journal of Electronics & Information Technology, vol. 40, no. 5, 31 May 2018 (2018-05-31), pages 1258 - 1265 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255342A (en) * 2021-06-11 2021-08-13 云南大学 Method and system for identifying product name of 5G mobile service
CN113673201A (en) * 2021-07-15 2021-11-19 北京三快在线科技有限公司 Text representation vector generation method and device, storage medium and electronic equipment
CN113688622A (en) * 2021-09-05 2021-11-23 安徽清博大数据科技有限公司 Method for identifying humor in sitcom dialogue based on NER
CN115512368A (en) * 2022-08-22 2022-12-23 华中农业大学 Cross-modal semantic image generation model and method
CN115512368B (en) * 2022-08-22 2024-05-10 华中农业大学 Cross-modal semantic image generation model and method
CN117315379A (en) * 2023-11-29 2023-12-29 中电科大数据研究院有限公司 Deep learning-oriented medical image classification model fairness evaluation method and device
CN117315379B (en) * 2023-11-29 2024-03-12 中电科大数据研究院有限公司 Deep learning-oriented medical image classification model fairness evaluation method and device

Also Published As

Publication number Publication date
CN112818118B (en) 2024-05-21

Similar Documents

Publication Publication Date Title
Wang et al. Application of convolutional neural network in natural language processing
Yao et al. An improved LSTM structure for natural language processing
CN112818118B (en) Reverse translation-based Chinese humor classification model construction method
CN106599032B (en) Text event extraction method combining sparse coding and a structured perceptron
CN111931506B (en) Entity relationship extraction method based on graph information enhancement
Bellegarda et al. State of the art in statistical methods for language and speech processing
CN109308353B (en) Training method and device for word embedding model
CN112541356B (en) Method and system for recognizing biomedical named entities
CN108874896B (en) Humor identification method based on neural network and humor characteristics
Wahid et al. Topic2Labels: A framework to annotate and classify the social media data through LDA topics and deep learning models for crisis response
Irsoy et al. Bidirectional recursive neural networks for token-level labeling with structure
Rashid et al. Emotion detection of contextual text using deep learning
CN111414481A (en) Chinese semantic matching method based on pinyin and BERT embedding
Belinkov On internal language representations in deep learning: An analysis of machine translation and speech recognition
Rendel et al. Using continuous lexical embeddings to improve symbolic-prosody prediction in a text-to-speech front-end
Banik et al. Gru based named entity recognition system for bangla online newspapers
CN108536781B (en) Social network emotion focus mining method and system
CN114428850A (en) Text retrieval matching method and system
Diao et al. Heterographic pun recognition via pronunciation and spelling understanding gated attention network
CN114265936A (en) Method for text mining of science and technology projects
CN116757195B (en) Implicit emotion recognition method based on prompt learning
Jayaraman et al. Sarcasm Detection in News Headlines using Supervised Learning
CN116108840A (en) Text fine granularity emotion analysis method, system, medium and computing device
Lee Natural Language Processing: A Textbook with Python Implementation
CN115510230A (en) Mongolian emotion analysis method based on multi-dimensional feature fusion and comparative reinforcement learning mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant