CN112818118B - Reverse translation-based Chinese humor classification model construction method - Google Patents

Reverse translation-based Chinese humor classification model construction method

Info

Publication number
CN112818118B
Authority
CN
China
Prior art keywords
humor
chinese
feature
text
model
Prior art date
Legal status
Active
Application number
CN202110088848.4A
Other languages
Chinese (zh)
Other versions
CN112818118A (en)
Inventor
孙世昶
孟佳娜
刘玉宁
朱彦霖
Current Assignee
Dalian Minzu University
Original Assignee
Dalian Minzu University
Priority date
Filing date
Publication date
Application filed by Dalian Minzu University filed Critical Dalian Minzu University
Priority to CN202110088848.4A
Publication of CN112818118A
Application granted
Publication of CN112818118B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/353 Clustering; Classification into predefined classes
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/268 Morphological analysis
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

A method for constructing a Chinese humor classification model based on reverse translation belongs to the field of natural language processing and comprises the following steps: S1, a text input layer; S2, a BERT embedding layer; S3, a Chinese pinyin feature embedding layer; S4, a text part-of-speech feature embedding layer; S5, a feature fusion layer; S6, a BiGRU layer; S7, a fully connected layer, which finally completes the classified output of the humor of the Chinese text. Beneficial effects: on the basis of a reverse translation technique combined with linguistic humor theory, a basic model, BERT-BiGRU-Softmax, is proposed for studying the classification of humor in Chinese text; different humor features are gradually added to the model, yielding the feature fusion model BERT+POS+Homophony-BiGRU-Softmax, and the model is effective for finally judging whether a Chinese text is humorous.

Description

Reverse translation-based Chinese humor classification model construction method
Technical Field
The invention belongs to the field of natural language processing, and relates to a method for constructing a Chinese humor classification model based on reverse translation.
Background
Humor, as one of the important ways of expressing emotion, has always accompanied people's lives. With the rapid development of science and technology and the large-scale popularization of the internet and communication equipment, applications in the internet and artificial intelligence fields are changing rapidly from "reading" to "interaction", and humorous "interaction" has emerged as well. Humor not only brings pleasure to people, but also improves social ability, work efficiency and the like. At present, the representative "interactive" application is the chat robot, which mostly collects various network resources and then integrates the information to interact with the service object; however, chat robots with a humor function are few. A chat robot should not merely be a mechanical robot without "temperature": it should have humanized thinking, perceive the user's feelings, and express itself accordingly, that is, it should have the capability of humor. Humor therefore has special significance for chat robots. A chat service robot needs to possess and understand the humorous component of the speaker's utterances, and the basis for realizing this function is enabling it to classify sentences as humorous or not.
The task of classifying humor in Chinese text is an important research field in domestic natural language processing. The technology in this field mainly involves cognitive science, linguistics, machine learning, information retrieval and other techniques, and research at home and abroad has gradually become active in recent years. Research on Chinese text humor classification mainly divides the utterances expressed in a text into humorous and non-humorous according to the attitude or humorous tendency of the speaker.
Research on the humor classification task originated in western countries and, after years of research and development, has become a popular topic in natural language processing, with some foreign research gradually maturing. The first humor theory was proposed by Raskin, who put forward the Semantic Script Theory of Humor (SSTH) in 1985; it became the basic theory, and a cornerstone, of humor computation and analysis in artificial intelligence. Attardo and Raskin then extended and revised Raskin's basic theory and proposed the general theory of verbal humor with six main humor elements: script opposition, logical mechanism, situation, target, narrative strategy and language, arranged in six levels from concrete to abstract, which was of great significance to the development of humor theory. With the gradual development of artificial intelligence technology, the demands of high-performance deep neural network models on data scale have gradually increased, requiring large, high-quality training sets. However, the publicly available data sets in many fields, such as emotion classification, named entity recognition and image analysis, are often not of sufficiently high quality to match high-performance models well, and data enhancement technology arose from this problem.
In recent years, because the field of natural language processing also suffers from insufficient text training data or low-quality training samples, data enhancement techniques have been widely applied in natural language processing. Many researchers, inspired by the success of generative adversarial networks (GAN) in image processing, have applied GAN networks to text data enhancement tasks. The GPT-2 model released by OpenAI in 2019 and its Chinese adaptations have also had a positive influence on data enhancement in natural language processing. At present, the main data enhancement techniques in natural language processing include noise injection, EDA and reverse translation, which have achieved excellent results in different fields.
The humor conveyed by the same humorous sentence differs in different eyes, which requires the recognizer to have a large stock of background knowledge. At present, research in the field of humor classification and recognition rarely uses the humor theory of linguistics, and theory and deep learning are not well combined. Therefore, how to better combine linguistic humor theory to extract the humor features in texts and trace back to the origin of humor to complete the humor classification task is a great challenge.
In addition to humor theory, there are also shortcomings in terms of data. Because Chinese humor classification developed later than foreign research, high-quality Chinese humor text data are scarce. If data enhancement techniques from other fields are migrated directly to text, noise problems are likely to arise because the techniques do not necessarily generalize well across fields, and more noise degrades the performance of the model. Therefore, the Chinese text humor data set also has a certain influence on the learning of the model.
The humor classification task started later than the early study of text emotion classification. Early humor classification research was based on English data, and research on Chinese humor classification has only developed gradually in recent years because high-quality Chinese humor data sets are relatively few. In addition, compared with humorous language in English, the humorous forms and characteristics of Chinese differ in syntactic structure and grammatical form, so machine learning cannot easily acquire the meaning of the humorous language, and features cannot be selected in a targeted manner to judge whether a text corpus is a humorous sentence.
Disclosure of Invention
In order to judge whether the Chinese text is humorous, the invention provides the following technical scheme: the method for constructing the Chinese humor classification model based on reverse translation is characterized by comprising the following steps of:
S1, a text input layer;
S2, a BERT embedding layer;
S3, a Chinese pinyin feature embedding layer;
S4, a text part-of-speech feature embedding layer;
S5, a feature fusion layer;
S6, a BiGRU layer;
S7, a fully connected layer, which finally completes the classified output of the humor of the Chinese text.
Further, the text input layer takes sentences as input.
Further, the Pinyin feature embedding layer comprises the following steps:
Chinese characters are converted into pinyin: converting each Chinese character in the sentence to be characterized into Chinese pinyin;
acquiring a unique character set: each character corresponds to an integer as its ID;
and performing pinyin vectorization: vectorizing the pinyin of the text to be converted according to the two preceding steps, as sketched below.
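A minimal sketch of the three pinyin steps above, assuming the third-party pypinyin package provides the Chinese-to-pinyin conversion; the function name and sample sentences are illustrative, not taken from the patent:

```python
from pypinyin import lazy_pinyin

def pinyin_vectorize(sentences):
    # Step 1: convert every Chinese character of each sentence into its pinyin syllable.
    pinyin_sents = [lazy_pinyin(s) for s in sentences]
    # Step 2: obtain the unique syllable set and give each syllable an integer ID (0 = padding).
    vocab = {p: i for i, p in enumerate(sorted({p for sent in pinyin_sents for p in sent}), 1)}
    # Step 3: pinyin vectorization -- replace every syllable with its ID.
    return [[vocab[p] for p in sent] for sent in pinyin_sents], vocab

ids, vocab = pinyin_vectorize(["自然语言处理", "幽默分类"])
print(ids)
```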
Further, in the text part-of-speech feature embedding layer, the jieba tool and a stop-word lexicon are used: the sentences in the text are segmented into words, stop words are removed, and then the part of speech of every remaining word is extracted and converted into a part-of-speech feature vector, as sketched below.
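A hedged sketch of this part-of-speech feature step, using jieba's posseg module; the stop-word file path and the tag-to-ID mapping are illustrative assumptions:

```python
import jieba.posseg as pseg

def pos_feature_ids(sentence, stopword_path="stopwords.txt"):
    # Load the stop-word lexicon (file path is illustrative).
    with open(stopword_path, encoding="utf-8") as f:
        stopwords = {line.strip() for line in f}
    # Segment the sentence and keep the POS tag of every non-stop word.
    tagged = [(w, flag) for w, flag in pseg.cut(sentence) if w not in stopwords]
    # Map every POS tag to an integer ID to form the part-of-speech feature vector.
    tag2id = {t: i for i, t in enumerate(sorted({flag for _, flag in tagged}), 1)}
    return [tag2id[flag] for _, flag in tagged]
```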
Further, in the feature fusion layer, the feature vectors extracted by the BERT model, the Chinese pinyin features obtained by comparison through the reverse translation method, and the text part-of-speech feature vectors are fused into a multi-feature representation and trained in the deep learning model. Let V be the feature vector matrix generated by the BERT model from a sample sentence of the text input layer; the feature fusion for that sentence can be expressed by formula 4.1:

W = f1 ⊕ f2 (4.1)

where W represents the newly generated feature vector, f1 represents the word vector feature, f2 represents the pinyin feature, and ⊕ denotes concatenation.
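For illustration, a minimal sketch of the fusion in formula 4.1 as a simple concatenation of the two feature matrices; the dimensions (768 for BERT, 64 for pinyin) are assumptions, not values from the patent:

```python
import numpy as np

f1 = np.random.rand(32, 768)            # BERT word-vector features for a 32-token sentence
f2 = np.random.rand(32, 64)             # pinyin features for the same tokens
W = np.concatenate([f1, f2], axis=-1)   # fused feature matrix W, shape (32, 832)
print(W.shape)
```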
Further, the BiGRU layer includes a forward GRU layer and a backward GRU layer; the forward and backward networks perform context learning on the feature vector matrix W output by the feature fusion layer, so as to carry out deeper feature extraction on the text.
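A minimal Keras sketch of such a BiGRU layer followed by the fully connected softmax output; the layer sizes are illustrative assumptions:

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(None, 832))                              # fused feature matrix W
h = tf.keras.layers.Bidirectional(tf.keras.layers.GRU(128))(inputs)     # forward + backward GRU
outputs = tf.keras.layers.Dense(2, activation="softmax")(h)             # humor / non-humor
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```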
Further, the reverse translation method is as follows: the Chinese humor data set is translated into an English data set by machine translation, and the English data set is then translated back into a Chinese data set.
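A hedged sketch of this reverse translation step; `translate` stands in for whatever machine translation API is used and is a hypothetical helper, not named in the patent:

```python
def back_translate(chinese_sentences, translate):
    """translate(text, src, dest) is a placeholder for any machine-translation API."""
    english = [translate(s, src="zh", dest="en") for s in chinese_sentences]  # Chinese -> English
    return [translate(s, src="en", dest="zh") for s in english]              # English -> Chinese
```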
The beneficial effects are that: by using deep learning technology, the salient features of the data set are emphasized and extracted. On the basis of the reverse translation technique combined with linguistic humor theory, a basic model, BERT-BiGRU-Softmax, is proposed for studying the classification of humor in Chinese text; different humor features are gradually added to the model in experiments, and the feature fusion model BERT+POS+Homophony-BiGRU-Softmax (hereinafter the BPH-BiGRU-Softmax model) is constructed and trained, which is effective for finally judging whether a Chinese text is humorous. From the experimental data in the specific embodiments, in which public and self-built data sets are used as texts, the influence of multiple features on the Chinese text humor classification result is compared against other network models, the hyperparameters of the important models are explored, and different data enhancement techniques are compared in these four directions, the BPH-BiGRU-Softmax model generated by combining the BERT basic features, the part-of-speech features and the Chinese pinyin features performs best. This not only verifies the effectiveness of the model, but also reduces the time cost of the experiments, improves the memory utilization of the machine, and finds a more accurate descent direction, thereby reducing the oscillation amplitude of the model. Reverse translation can change the expression structure and mode of a sentence, so that the augmented data has a structure different from that of the original sentence while retaining the correct semantic information; this increases the diversity of the text corpus and better improves the robustness and generalization ability of the model.
Drawings
FIG. 1 is a schematic diagram of the CBOW model structure;
FIG. 2 is a schematic diagram of the Skip-Gram model structure;
FIG. 3 is a schematic diagram of the basic structure of machine learning;
FIG. 4 is a schematic diagram of a two-dimensional partitioning of an SVM;
FIG. 5 is a schematic diagram of TextCNN;
FIG. 6 is an RNN block diagram;
FIG. 7 is a cyclic layer expansion;
FIG. 8 is an LSTM network architecture diagram;
FIG. 9 is a general block diagram;
FIG. 10 is a block diagram of reverse translation;
FIG. 11 is a syntactic visualization of an original sentence;
FIG. 12 is a syntactic visualization after reverse translation;
FIG. 13 is a diagram of the model design framework;
FIG. 14 is the vector representation matrix V;
FIG. 15 is a diagram of the basic GRU model;
FIG. 16 is a schematic view of the effect of batch size on the model.
Detailed Description
1.1 Solving the problems
The invention provides a method for constructing a Chinese humor classification model based on reverse translation. Against the background of artificial intelligence and the rapid development of human-computer interaction in today's society, it is urgent for machines to have more emotion and a sense of humor, which gives rise to the humor classification task. The invention analyzes the humorous tendency of Chinese text, systematically surveys related research on text humor classification, and explores the different influences of different deep-learning representations on the humor classification task. Deep learning techniques are used to study and extract the salient features of the data set; on the basis of the reverse translation technique combined with linguistic humor theory, the basic model BERT-BiGRU-Softmax is proposed for studying the classification of humor in Chinese text; different humor features are gradually added to the model in experiments, and the feature fusion model BERT+POS+Homophony-BiGRU-Softmax (hereinafter the BPH-BiGRU-Softmax model) is constructed and trained, finally judging whether the Chinese text is humorous.
2.1 Humor classification
2.1.1 Text humor classification
Text humor classification is a branch of the emotion analysis research field and one of the important tasks in artificial intelligence. The main research object of the text humor classification task is the "subjective factor" of the text, i.e. whether the subjective tendency that the publisher or author wants to express contains a humorous effect, and the result of classification is information on whether a particular text is humorous. With the rapid development of science and technology, three main humor classification methods have emerged: classification based on statistics and grammatical analysis, classification based on machine learning, and classification based on deep learning. The first two are traditional classification methods, while classification based on deep learning has been the more active method in recent years.
2.1.2 Text pretreatment
In the data mining task or the natural language processing task, text preprocessing is an indispensable work, and the fineness of the text preprocessing can directly influence the basic performance of an experimental model to a great extent. The text preprocessing work mainly comprises Chinese word segmentation, word stopping, part-of-speech tagging, dependency syntactic analysis and the like. Generally, text preprocessing works mainly comprise the steps of performing Chinese word segmentation operation on the basis of removing specified useless symbols to enable a text to only reserve Chinese characters, removing stop words, namely words with weak emotion colors or no practical meaning, and performing part-of-speech labeling or dependency syntax analysis on the text, so that a computer has the capability of automatically analyzing emotion colors.
(1) Removing designated symbols
Whether the data set is Chinese or English, text preprocessing usually begins by removing the specified useless symbols, since many data sets are crawled from websites or other sources; this allows the subsequent text preprocessing operations to be carried out better.
(2) Chinese word segmentation
Currently, Chinese word segmentation techniques can be divided into two main categories: word segmentation methods based on character string matching and word segmentation methods based on statistics. The word segmentation method based on character string matching, also called the mechanical word segmentation method, matches the Chinese character string to be analyzed against the entries of a sufficiently large machine dictionary according to a certain strategy; if a certain character string is found in the dictionary, the match succeeds, i.e. a word is identified. According to the scanning direction, string matching segmentation can be divided into forward matching and reverse matching; according to which length is matched preferentially, it can be divided into maximum (longest) matching and minimum (shortest) matching; according to whether it is combined with part-of-speech tagging, it can be divided into the simple word segmentation method and the integrated method combining word segmentation and part-of-speech tagging. Common string matching methods include the forward maximum matching method, the reverse maximum matching method, the minimum segmentation method and the bidirectional maximum matching method. These algorithms are fast and simple to implement, but their effect on ambiguous words or out-of-vocabulary words is less than ideal. The word segmentation method based on statistics learns segmentation rules with a statistical machine learning model on the premise of a large number of already segmented texts, so as to segment unknown texts, for example the maximum probability word segmentation method and the maximum entropy word segmentation method. With the establishment of large-scale corpora and the development of statistical machine learning methods, Chinese word segmentation based on statistics has gradually become the mainstream method. The main statistical models are the N-gram model, the Hidden Markov Model (HMM), the Maximum Entropy model (ME), Conditional Random Fields (CRF) and the like.
In practical application, a statistics-based word segmentation system also needs to use a word segmentation dictionary to match and segment character strings, while using statistical methods to identify new words, i.e. combining character string frequency statistics with character string matching; this exploits the speed and efficiency of dictionary matching while taking advantage of dictionary-free segmentation, which uses context to identify new words and disambiguate automatically. In recent years, Chinese word segmentation has used dynamic programming to search for the maximum probability path so as to find the maximum segmentation combination based on word frequency; for unregistered words, an HMM model based on the word-forming capability of Chinese characters is adopted, and the Viterbi algorithm is used to obtain better results. Therefore, the invention adopts the jieba Chinese word segmentation tool to segment the pre-training corpus, as sketched below.
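An illustrative jieba call showing the two segmentation modes on a toy sentence (the sentence is not from the patent's corpus):

```python
import jieba

sentence = "幽默分类是自然语言处理的重要任务"
print("/".join(jieba.cut(sentence)))                 # precise mode (default, HMM enabled)
print("/".join(jieba.cut(sentence, cut_all=True)))   # full mode, for comparison
```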
(3) Removing stop words
In natural language processing we usually perform the stop-word removal operation in order to save storage space and improve retrieval efficiency. These stop words generally fall into two classes: the first is extremely common function words, which carry little actual meaning compared with other words; the second is words that are used very broadly and do not help a particular task. The stop-word list used in the invention is the Harbin Institute of Technology (HIT) stop-word list.
(4) Part of speech tagging
Part of speech tagging, i.e., a process of tagging parts of speech of words in sentences, such as nouns, verbs, adjectives, etc., using computer automation. But part-of-speech tagging is not necessarily done in text preprocessing, but it can sometimes help us simplify some of the work. For example, during part-of-speech tagging, words with certain parts of speech in unwanted sentences may be removed for a particular task to achieve better text preprocessing.
(5) Dependency syntax analysis
Dependency syntactic analysis is one of the key technologies in the field of natural language processing. It was first proposed by the French linguist L. Tesniere in his work on structural syntax, has had a profound influence on the development of linguistics, and is also favored in computational linguistics. Its basic task is to determine the syntactic structure of a sentence or the dependency relations between the words in a sentence. It mainly includes two aspects: one is determining the grammar system of a language, i.e. giving a formal definition of the grammatical structure of legal sentences in the language; the other is the syntactic analysis technique itself, i.e. automatically deriving the syntactic structure of a sentence from the given grammar, analyzing the syntactic units contained in the sentence and the relations between these units.
2.1.3 Word vector representations
(1) Single heat representation
One-hot encoding (One-Hot Representation) is a relatively common word representation method. It encodes N states using an N-bit state register, each state having its own register bit, and at any time only one bit is valid; that is, only one bit is 1 and the rest are 0. One-hot encoding is mainly used in classification tasks to normalize categorical features. For example, the gender feature has two possible values; after encoding, the position corresponding to the original value is 1 and the other positions are 0. Although one-hot encoding solves the problem that classifiers handle discrete data poorly, its shortcomings in text feature representation are obvious. First, it is a bag-of-words model that ignores word order. Second, it assumes that words are independent of each other. Finally, the features it produces are discrete and sparse.
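A small illustrative one-hot encoding of a three-word vocabulary, assuming numpy; each word occupies its own bit and only that bit is 1:

```python
import numpy as np

vocab = ["幽默", "分类", "文本"]          # a 3-word toy vocabulary
one_hot = np.eye(len(vocab), dtype=int)   # one register bit per word, only one bit is 1
print(dict(zip(vocab, map(list, one_hot))))
```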
(2) Distributed representation
The distributed representation was originally proposed by Hinton et al. and derives from cognitive representation; it is one of the most important properties of deep learning and can capture the combined information of learned feature values. Unlike the one-hot representation, the distributed representation expresses the intrinsic meaning of a word as a dense real-valued vector. Assuming data samples A = ["he"] and B = ["magnitude"] and a 2-dimensional feature vector, samples A and B can be represented distributively as A: [1.79683, 0.326625816], B: [2.6215376, 0.9257021]. The advantages of the distributed representation over the one-hot representation are apparent. First, the one-hot representation cannot express the associations in a text, whereas the distributed representation can effectively express semantic similarity and therefore better represent the information between words. Second, the distributed representation helps the model achieve better generalization ability. Most notably, distributed representations have very powerful feature expression capabilities: for example, an N-dimensional vector with k possible values per dimension can represent k^N different pieces of information.
2.1.4 Word vector representations
(1) Word representation generation model
With the intensive research into word representation generation, researchers have increasingly found that better word representations can be obtained with simple network models and contexts over larger data sets. Thus, word2vec opened a new era. The method is an open source tool proposed by GOOGLE, namely, words in a corpus are converted into vectors so as to carry out various subsequent calculations on the basis of the word vectors. The Word2vec model includes two training models, namely CBOW model and Skip-Gram model. The specific model structure is shown in fig. 1 and 2, respectively.
The training input of the CBOW model is a word vector corresponding to the context-dependent word of a particular feature word, and the output is a word vector of that particular word. The ideas of Skip-Gram models and CBOW are reversed, i.e., the input is a word vector for a particular word, and the output is a context word vector for the particular word.
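A gensim sketch of the two training modes just described; the toy corpus and vector size are illustrative assumptions (in gensim, sg=0 selects CBOW and sg=1 selects Skip-Gram):

```python
from gensim.models import Word2Vec

corpus = [["我", "爱", "自然", "语言", "处理"],
          ["幽默", "分类", "是", "情感", "分析", "的", "分支"]]
cbow = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=0)       # CBOW: context -> word
skipgram = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)   # Skip-Gram: word -> context
print(cbow.wv["幽默"].shape)
```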
In 2014, Pennington et al., inspired by previous work, proposed the GloVe word vectors, which, like Word2Vec, are static vectors. Then in 2018 the release of the BERT model received unprecedented attention. The BERT model improves the generalization ability of word vector models and represents words and sentences well.
2.1.5 Machine learning related Art
At present, in humor text classification tasks, when machine learning technology is used, most approaches are supervised classification methods: sentence features are extracted from a large number of labeled data sets, a machine learning algorithm learns the model parameters and generates a model, and finally the model is used to classify and recognize texts. This section introduces the relevant machine learning knowledge and algorithms for the humor text classification task. Machine learning is a general term for a class of algorithms mainly applied in the field of artificial intelligence. These algorithms attempt to mine underlying rules from a large set of historical data and use them for prediction or classification. More specifically, machine learning can be seen as finding a function whose input is sample data in the corpus and whose output is what we expect, except that this function is too complex to be conveniently formalized. Notably, the goal of machine learning is to make the learned function fit "new samples" well rather than just perform well on the training samples, i.e. to maximize the generalization ability of the model. Fig. 3 is the basic block diagram of machine learning. In humor classification tasks, representative machine learning models include the naive Bayes model, the decision tree model and the support vector machine model, and these algorithms are described below.
(1) Naive Bayes
Naive Bayes is a very common and widely used text classification algorithm; its basic principle (the Bayes principle) was proposed by the British mathematician Thomas Bayes. The naive Bayes algorithm is a simple but extremely powerful predictive modeling algorithm. The usual procedure is to first determine the feature attributes and the value to be predicted, divide each feature attribute appropriately, and manually label a portion of the data to form a training sample; the feature attributes and the training sample are then input, the frequency of each class in the training sample and the conditional probability of each feature attribute division for each class are calculated, and a classifier is output. Finally, the classifier classifies new data and outputs the result.
To avoid the training obstacle of estimating the joint probability over all attributes in the Bayes formula, the naive Bayes classifier adopts the "attribute conditional independence assumption": for a known class, all attributes are assumed to be mutually independent, i.e. each attribute influences the classification result independently. The Bayes formula then becomes formula 2.1:

P(c|x) = P(c)P(x|c)/P(x) = (P(c)/P(x)) ∏(i=1..d) P(xi|c) (2.1)

where d is the number of attributes and xi is the value of x on the i-th attribute. Because P(x) is uniquely determined by the sample set, i.e. it is the same for all classes, the expression of the naive Bayes classifier is formula 2.2:

h_nb(x) = argmax(c∈Y) P(c) ∏(i=1..d) P(xi|c) (2.2)

The training process of the naive Bayes classifier is to estimate the class prior probability P(c) from the training set D and to estimate the conditional probability P(xi|c) for each attribute. Let Dc denote the set of samples of class c in the training set D; if there are enough independent and identically distributed samples, the class prior probability can be estimated by formula 2.3:

P(c) = |Dc| / |D| (2.3)

For a discrete attribute, let Dc,xi denote the set of samples in Dc that take the value xi on the i-th attribute; then the conditional probability P(xi|c) is formula 2.4:

P(xi|c) = |Dc,xi| / |Dc| (2.4)

For a continuous attribute, assume formula 2.5:

p(xi|c) ~ N(μc,i, σ²c,i) (2.5)

where μc,i and σ²c,i are the mean and variance of the values of the class-c samples on attribute i (assuming the corresponding continuous variable obeys a normal distribution); then formula 2.6 is as follows:

p(xi|c) = (1/(√(2π)·σc,i)) exp(-(xi - μc,i)²/(2σ²c,i)) (2.6)
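A scikit-learn sketch of the naive Bayes workflow described above (word-count feature attributes, class priors and conditional probabilities estimated from a labeled training sample); the toy texts and labels are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["这个 笑话 真 有趣", "今天 开会 讨论 报表"]   # pre-segmented toy sentences
train_labels = [1, 0]                                        # 1 = humor, 0 = non-humor
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_texts)                    # word-count feature attributes
clf = MultinomialNB().fit(X, train_labels)                   # estimates P(c) and P(xi|c)
print(clf.predict(vectorizer.transform(["有趣 的 报表"])))
```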
(2) Decision tree model
Decision trees (Decision trees) are a basic classification and regression method, called classification trees when used for classification and regression trees when used for regression. Since classification tasks are discussed herein, classification trees are primarily presented herein.
A classification tree is a tree structure that describes classifying instances. When classifying by using the classification tree, starting from the root node, testing a certain characteristic of the instance, and distributing the instance to the child nodes according to the test result. At this time, each sub-node corresponds to a value of the feature. Instances are tested and assigned recursively as such until leaf nodes are reached. Finally, the instances are classified into leaf node classes. The goal of classification tree learning is to construct a decision tree model from a given training dataset so that it can classify instances correctly. Decision tree learning essentially generalizes a set of classification rules from a training dataset. There may be multiple decision trees (i.e., decision trees that can correctly classify the training data) that are not inconsistent with the training data set, or there may be none or one. What is needed is a decision tree that contradicts little with training data, while having good generalization ability. From another perspective, decision tree learning is estimating a conditional probability model from a training dataset. There are infinite number of conditional probability models based on classes of feature space division, and we choose a conditional probability model that should not only fit well to training data, but also predict well to unknown data. Decision tree learning represents this goal in terms of a loss function, which is typically a regularized maximum likelihood function, and the strategy for decision tree learning is minimization with the loss function as the objective function. After the loss function is determined, the learning problem becomes a problem of selecting an optimal decision tree in the sense of the loss function.
(3) Support vector machine model
SVM is a classification model, a supervised statistical learning method, capable of minimizing empirical errors and maximizing geometric edges, called maximum interval classifier, useful for classification and regression analysis. As shown in fig. 4, the learning strategy of the support vector machine is interval maximization, which can be formed as a problem of solving convex quadratic programming, and is also equivalent to the problem of minimizing regularized hinge loss function. The learning algorithm of the support vector machine is an optimization algorithm for solving convex quadratic programming.
Assume a linearly separable training data set on the feature space, T = {(x1, y1), (x2, y2), ..., (xn, yn)},

where xi ∈ R^n and yi ∈ {+1, -1}, i = 1, 2, ..., n; xi is the i-th feature vector, also called an instance, and yi is the class label of xi. When yi = +1, xi is called a positive instance; when yi = -1, xi is called a negative instance; (xi, yi) is a sample point.

The separating hyperplane learned by maximizing the margin, or equivalently by solving the corresponding convex quadratic programming problem, is:

w*·x + b* = 0 (2.7)

where w is the normal vector of the classification hyperplane and b is the intercept, also the offset; the plane is determined by the normal vector w and the intercept b. The separating hyperplane divides the space into two parts, one positive and one negative, the side pointed to by the normal vector being the positive part. To ensure that the final classification margin is maximal, based on the point-to-plane distance formula the problem can be converted into the following problem (formulas 2.8 and 2.9):

min(w,b) (1/2)||w||² (2.8)

s.t. yi(w·xi + b) - 1 ≥ 0, i = 1, 2, ..., n (2.9)

For convenience of solution, the above problem can be converted into its dual problem (formulas 2.10 and 2.11), and finally the optimal solution w*, b* is obtained:

min(α) (1/2) Σ(i=1..n) Σ(j=1..n) αi·αj·yi·yj·(xi·xj) - Σ(i=1..n) αi (2.10)

s.t. Σ(i=1..n) αi·yi = 0 (2.11)

where αi ≥ 0, i = 1, 2, 3, ..., n.
Nonlinear problems tend to be poorly solved, so it is desirable to solve this problem by solving the linear classification problem. The method is to perform a nonlinear transformation to transform the nonlinear problem into a linear problem, and solve the original nonlinear problem by a method of solving the transformed linear problem. The core is to reference the kernel function, thereby reducing the calculation amount in the high-dimensional space. Common kernel functions include the following three, linear kernel functions, polynomial kernel functions, gaussian kernel functions.
The linear kernel function is the simplest kernel function; its mathematical expression is shown in formula 2.12:

k(x, y) = x·y (2.12)

Polynomial kernel function:

k(x, z) = (x·z + 1)^p (2.13)

The corresponding support vector machine is a p-th order polynomial classifier, and in this case the classification decision function becomes formula 2.14:

f(x) = sign( Σ(i=1..n) αi*·yi·(xi·x + 1)^p + b* ) (2.14)

Gaussian kernel function:

k(x, z) = exp( -||x - z||² / (2σ²) ) (2.15)

The corresponding support vector machine is a Gaussian radial basis function classifier. In this case, the classification decision function becomes formula 2.16:

f(x) = sign( Σ(i=1..n) αi*·yi·exp( -||x - xi||² / (2σ²) ) + b* ) (2.16)
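A scikit-learn sketch comparing the three kernels listed above on toy data; the data and hyperparameters are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC

X = np.random.rand(20, 5)                 # toy feature vectors
y = np.array([0] * 10 + [1] * 10)         # toy binary labels
for clf in (SVC(kernel="linear"),
            SVC(kernel="poly", degree=3),          # polynomial kernel, p = 3
            SVC(kernel="rbf", gamma="scale")):     # Gaussian radial basis kernel
    clf.fit(X, y)
    print(clf.kernel, clf.score(X, y))
```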
2.1.6 deep learning related Art
(1) Convolutional neural network
Unlike other network structures such as fully connected networks, convolutional neural networks have three unique structural characteristics: 1) local connection: each neuron is no longer connected to all neurons of the previous layer, but only to a small portion of them; 2) weight sharing: a group of connections can share the same weights instead of each connection having a different weight, which greatly reduces the number of parameters; 3) downsampling: the amount of data to be processed can be reduced while useful information is retained. In 2014 Kim et al. proposed applying the CNN model to the text classification task, which can be regarded as a pioneering work. The CNN model mainly includes a convolution layer, a pooling layer and a fully connected layer, as shown in fig. 5.
The input layer converts a piece of text into the input format required by the convolution layer, typically producing a vector matrix of the sentence. Here n denotes the number of words in the text. Since text lengths vary, the input text needs to be preprocessed to a fixed length; common practice is either to take the longest text length in the data or to compute the distribution of text lengths and take a length that covers most of the texts. k denotes the length of the word embedding; GloVe word vectors, Word2Vec word vectors, BERT and the like can generally be used, which help the effect of natural language processing tasks. The sentence matrix input to the model can then be expressed as formula 2.17:

x1:n = x1 ⊕ x2 ⊕ ... ⊕ xn (2.17)

where ⊕ represents the concatenation operation.

Then, the required feature value is obtained by performing the convolution operation on the words in a window. Specifically, ci represents the feature value obtained after the convolution operation is performed on the h words xi:i+h-1 in a window, as shown in formula 2.18:

ci = f(ω·xi:i+h-1 + b) (2.18)

where ω is the convolution kernel, b is a bias unit and f is a nonlinear function.

Convolving every possible window in the sentence matrix yields the feature map set, as shown in formula 2.19:

c = [c1, c2, ..., cn-h+1] (2.19)

Finally, the pooling layer performs the max-pooling operation on the feature map set c to obtain the final feature, as shown in formula 2.20:

ĉ = max{c} (2.20)
the above describes the process of extracting a feature from a convolution kernel. Typically in experiments, we can obtain different feature representations using multiple convolution kernels, and then pass the feature representations to the fully-connected softmax layer, which can output the probability distribution of the text labels, thus completing the text classification process.
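A Keras sketch of the TextCNN pipeline just described: an embedding layer, parallel convolution kernels with several window sizes h, max pooling, and a softmax output; the vocabulary size, dimensions and window sizes are illustrative assumptions:

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(50,))                     # n = 50 padded token IDs
emb = tf.keras.layers.Embedding(5000, 128)(inputs)       # k = 128 word-embedding length
pooled = []
for h in (3, 4, 5):                                      # convolution window sizes
    c = tf.keras.layers.Conv1D(100, h, activation="relu")(emb)   # feature map per window size
    pooled.append(tf.keras.layers.GlobalMaxPooling1D()(c))       # max pooling, as in formula 2.20
merged = tf.keras.layers.Concatenate()(pooled)
outputs = tf.keras.layers.Dense(2, activation="softmax")(merged) # label probability distribution
model = tf.keras.Model(inputs, outputs)
model.summary()
```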
(2) Recurrent neural network
As early as 1982, John Hopfield, a physicist at the California Institute of Technology in the USA, invented the single-layer feedback neural network known as the Hopfield network to solve combinatorial optimization problems. This was the earliest prototype of the RNN. After continuous innovation and improvement, the recurrent neural network (RNN) model used today was established.
The great advantage of the recurrent neural network is that it can model sequence information well. A simple recurrent neural network, for example, consists of an input layer, a hidden layer and an output layer, as shown in fig. 6; expanding the recurrent layer along the time line yields the diagram of fig. 7:
After the network receives the input xt at time t, the value of the hidden layer is st and the output value is ot, where the value of st depends not only on xt but also on st-1. Specifically, the computation of the recurrent neural network can be expressed by the following formulas:

ot = g(v·st) (2.21)

st = f(u·xt + w·st-1) (2.22)

where xt denotes the input, u denotes the weights from the input layer to the hidden layer, v denotes the weights from the hidden layer to the output layer, w denotes the weight with which the previous hidden-layer value is fed back as input at the current time, and st denotes the value of the hidden layer at time t.
The core part of the recurrent neural network is the memory unit, but in some cases the distance between the relevant information and the position where it is needed is very large; as this distance grows, the recurrent neural network can no longer connect the relevant information. With further research on recurrent neural networks, the long short-term memory network built upon the RNN solves this problem.
(3) Long-short term memory network
The network structure of the long and short term memory network is shown in fig. 8.
The core of the long short-term memory network is the cell state; information can be added to or removed from the cell state, and this is regulated by structures called "gates", mainly an input gate, a forget gate and an output gate. In the first step the long short-term memory network needs to decide what information to discard from the cell state, which is determined by a neural layer with the sigmoid activation function, i.e. the forget gate. The information is fed to the forget gate, and for each number the forget gate outputs a value in the interval [0, 1], where an output of 1 means "keep completely" and an output of 0 means "forget completely". Let the current time be t; the forget gate ft decides what information to discard, and the derivation is formula 2.23, where ht-1 represents the output at the previous time, xt is the input at time t, wf is a parameter matrix and bf is a bias vector.

ft = σ(wf·[ht-1, xt] + bf) (2.23)

The second step is to decide what new information needs to be stored in the cell state, defined as it. First, a layer called the "input gate", whose activation function is sigmoid, decides which values will be updated; the derivation is formula 2.24, where wi is a parameter matrix and bi is a bias vector.

it = σ(wi·[ht-1, xt] + bi) (2.24)

Then, a tanh activation layer creates a new candidate value vector, defined as c̃t, to be added to the cell state; the derivation is formula 2.25, where wc is a parameter matrix and bc is a bias vector.

c̃t = tanh(wc·[ht-1, xt] + bc) (2.25)

Next, the old cell state is updated to the new cell state: the input gate value it is multiplied by the current candidate cell state c̃t, the forget gate value ft is multiplied by the cell state ct-1 at the previous time, and the two products are summed to obtain the current state ct, as in formula 2.26.

ct = ft*ct-1 + it*c̃t (2.26)

The output gate determines what content is to be output, defined as ot; the derivation is formula 2.27, where wo is a parameter matrix and bo is a bias vector.

ot = σ(wo·[ht-1, xt] + bo) (2.27)

Finally, the cell state passed through the tanh function (which maps values into the interval [-1, 1]) is multiplied by the output of the sigmoid gate, so that only the decided part is output; the derivation is formula 2.28.

ht = ot*tanh(ct) (2.28)
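A numpy sketch of a single LSTM step following formulas 2.23 to 2.28 above; the weights are random placeholders and the dimensions are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h_dim, x_dim = 4, 3
rng = np.random.default_rng(0)
wf, wi, wc, wo = (rng.standard_normal((h_dim, h_dim + x_dim)) for _ in range(4))
bf = bi = bc = bo = np.zeros(h_dim)

def lstm_step(h_prev, c_prev, x_t):
    z = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f_t = sigmoid(wf @ z + bf)              # forget gate, formula 2.23
    i_t = sigmoid(wi @ z + bi)              # input gate, formula 2.24
    c_tilde = np.tanh(wc @ z + bc)          # candidate state, formula 2.25
    c_t = f_t * c_prev + i_t * c_tilde      # new cell state, formula 2.26
    o_t = sigmoid(wo @ z + bo)              # output gate, formula 2.27
    h_t = o_t * np.tanh(c_t)                # hidden output, formula 2.28
    return h_t, c_t

h, c = lstm_step(np.zeros(h_dim), np.zeros(h_dim), rng.standard_normal(x_dim))
print(h)
```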
2.1.7 Chapter summary
This chapter mainly introduces the theoretical research on humor classification technology, including text preprocessing, word representation methods and their development, and classical machine learning and deep learning techniques, providing a theoretical and technical foundation for the subsequent research.
3.1 Data enhancement technology study based on reverse translation
Reverse translation means translating our Chinese humor data set into an English data set by a machine translation method and then translating the English data set back into a Chinese data set; this round trip is the process of reverse translation. Zhang Jie studied the differences between the Chinese and English languages and showed that the two languages have different structures because of cultural differences. Likewise, Zhang Huimei pointed out, in a study of Chinese-English inter-translation, that owing to differences of culture and custom, an alternative expression in one culture is not necessarily effective in another. Therefore, this section exploits the differences between Chinese and English in expression, culture, custom and translation, combines the predecessors' theory of humor in language with the reverse translation technique, traces back to the origin of humor, identifies the humorous features of the data set, and explores the influence of reverse translation on humor feature theory. The specific general structure is shown in fig. 9.
3.1.1 Overview of reverse translation techniques
In recent years, with the implementation of more data enhancement techniques, the reverse translation technique has gradually entered the field of natural language processing. It is one of the simplest and most intuitive data enhancement techniques. A machine translation tool is used to translate the data set in the original language into the desired target language (English or another foreign language), and a machine translation tool is then used to translate the target-language data set back into the original language; this constitutes the reverse translation technique. In this invention the reverse translation process is based on Chinese-English translation, and the language differences differentiate the data sets, which facilitates the analysis of humor features; at the same time, new data are generated from the limited Chinese humor data set, achieving the other effect, namely data enhancement. Fig. 10 is a block diagram of the reverse translation technique.
3.1.2 Humor category overview of the present dataset
Although the ways of defining humor categories are many and varied, humorous language has its most basic linguistic characteristics. Li Yuan, in a study of the functional and behavioral characteristics of humorous language, found that exploring humorous language from the grammatical aspect involves eight features, such as contradiction, exaggeration, playfulness, ambiguity, self-mockery, citation, clever insight and confusion of identity. Ran Mingzhi analyzed the characteristics of incongruity and inconsistency in humorous utterances. Zhang et al., based on the incongruity theory of humor and linguistic features, designed five categories, including phonetic features, morphological and syntactic features, lexical-semantic features and emotional features, with up to 50 humor features in total.
From these scholars' studies of the linguistic features of Chinese humor, roughly four types of humor theory are applicable to the present data set: the phonetic feature theory, the Chinese structural feature theory, the lexical semantic feature theory, and the theory of trendy words and dialect features. These four general categories are described in detail below.
(1) Theory of speech characteristics
Humor can be produced by pronunciation in a language; the phenomenon is common in Chinese, English and other languages. This kind of linguistic pronunciation mainly refers to the harmonic (homophonic) word type. Harmonic sounds are widely used by Chinese speakers as a common phenomenon of the Chinese language: they refer to the linguistic phenomenon of expressing Chinese meaning by means of phonetic features that are the same or similar in pronunciation. In the humor field, harmonic sounds usually appear in the form of head rhyme or tail rhyme. In interpersonal communication, the phonetic incongruity produced by a speaker's use of speech sounds causes humor. Take English, for example. Example 1: AS FIT AS A FIDDLE. The short expression shows no humor at the semantic level, but the repeated letter F with the same pronunciation in the phrase forms a head rhyme, producing humor with a phonetic feature. For a Chinese example, Example 2: the hair is without a trace, and the dandruff is even more superior! The character for "trace" (踪) is pronounced "zong" in Chinese pinyin and the character for "superior" (众) is pronounced "zhong"; the shared final "ong" of the two characters forms a tail rhyme, giving a catchy, humorous and harmonious feeling and producing a humorous expressive effect. From the examples we can see that harmonic sounds among the phonetic features are very strong in producing a humorous effect; in the actual expression of many humorous texts, even if the humorous effect at the semantic level is not strong, the phonetic features of head rhyme or tail rhyme in harmonic sounds can be used to achieve a comic effect or make the humorous expression stronger.
(2) Theory of structural features
Expression by means of sentence structure is also common in language. For example, the traditional Chinese Spring Festival couplets are a unique literary form, aiming at upper and lower lines with equal numbers of characters, antithesis and matched tones. Sentence-structure expression has long been handed down in Chinese and plays a great role in humorous sentence expression. In humorous interpersonal interaction, the feature of the upper and lower clauses having the same number of characters is often used, combined with features such as pronunciation. For example, Example 3: it is easy to quit smoking, and it is too difficult to quit smoking. Even without considering the semantic information, merely from the Chinese mode of expression the sentence has the characteristic that the upper and lower clauses have the same number of characters; it is easy to read and understand, both clauses begin with the word for "quit", and a humorous effect can also be produced. Some sentences have both the same number of characters and the harmonic-sound characteristic described above, and their humorous effect is even stronger. For example, Example 4: a rat carrying a knife searches the whole street for cats. The pronunciation "dao" of "knife" and the pronunciation "mao" of "cat" both contain "ao", which forms a tail rhyme; at the same time the sentence has the characteristic that the upper and lower clauses have the same number of characters, and the combination of the two features makes the humorous expression even stronger.
(3) Lexical semantic feature theory
The lexical semantic features mean that the same character or word expresses different meanings in the same sentence, causing semantic ambiguity and thereby humor. In many cases, when different meanings are expressed, the part of speech also changes. For example, Example 5: Teacher: are you a boy or a girl? Pupil: I was born of my mother! From the semantics we can understand that the character "生" (sheng) does not have the same meaning for the pupil and for the teacher: the "生" of "boy/girl" tends toward the nominal part of speech, whereas the "生" of "born of mother" is verbal. Therefore, the ambiguity caused by the same character having different meanings derives from the same character having different parts of speech, which easily produces a humorous effect.
(4) Feature theory of trendy words and dialects
Sometimes, when people interact, they easily use trendy new words or dialect words. For example, the northeast dialect expression "gan ha" (roughly "what are you up to") sounds very amusing to southerners; what it actually means may not matter, but the sudden, unguarded encounter produces a humorous effect. The same is true of trendy words. However, if the people interacting are familiar with them, the humorous effect of some dialects or new words may not be very pronounced. Therefore, the trendy word and dialect feature theory produces humor only in certain specific situations.
3.1.3 Influence of reverse translation on different types of humor data
Reverse translation not only serves as a popular means of data enhancement, but also plays an important role in the humor data field. The back-translated data set can be compared and analyzed against the original data set; by using the differences between Chinese and English and applying linguistic theory, the humor theory can be read more deeply, the nature of humor can be found, and a theoretical basis is provided for the study of humor and its automatic judgment by computer. The following shows that, for one sentence, the reverse translation technique and a syntactic analysis visualization can exhibit the sentence differences, as shown in fig. 11 and fig. 12. Original sentence: seeing injustice on the road, one gives a roar and steps forward. This study compares each back-translated sentence with the original sentence of our existing data set. For each type, 3 humorous sentences are shown for analysis, and each example is divided into the original sentence, the back-translated sentence, whether humor remains after reverse translation, an analysis of the laughing point, and a method related to automatic recognition of that type by computer. Finally, the influence of reverse translation on the recognition of humor features is described by summarizing the analysis results of each type.
(1) Harmonic words
As shown in Table 3.1, reverse translation of the original Chinese sentences yields new Chinese sentences, and the laughing-point features of the harmonic (homophone) words in the sentences are compared and identified. Taking sentence 1 as an example, the sentence loses its humor after reverse translation. The laughing point of the original sentence plays on an advertisement for Nongfu Spring (farmer's mountain spring), whose slogan says that the water tastes a bit sweet. The sentence and the advertisement form a case of same sound but different word sense, creating a humorous feeling. To identify such humor, a vocabulary and rules of homonyms can be formulated.
TABLE 3.1 harmonic word analysis
From the comparative analysis of Table 3.1 above, we can find that it is quite common for harmonic words to produce humor through the pronunciation of the pinyin finals. We can therefore extract pinyin features from the humor text dataset during feature extraction, thereby improving the humor classification effect.
(2) Symmetrical structure
As shown in Table 3.2, reverse translation of the original Chinese sentence yields a new Chinese sentence, and the laughing-point features of structurally symmetric sentences are compared and identified. Taking sentence 1 as an example, the humor is lost after reverse translation, and the translated sentence loses both its structure and its harmonic sound. The laughing point of the original sentence is that the upper and lower clauses are structurally symmetric with the same number of characters, accompanied by a harmonic sound effect and a catchy, rolls-off-the-tongue feeling, which produces the humorous effect. To identify such humor, syntactic analysis or harmonic features are needed to distinguish the relationship between the upper and lower clauses.
TABLE 3.2 structural symmetry analysis
From the comparative analysis of Table 3.2 above, we can find that structurally symmetric humorous sentences very often co-occur with harmonic sound effects, because pronunciation and structure frequently appear together. Therefore, when extracting features, we can extract not only structural features but also harmonic features for humor classification and recognition.
(3) Word multi-meaning
As shown in Table 3.3, reverse translation of the original Chinese sentence yields a new Chinese sentence, and the laughing-point features of word polysemy are compared and identified. Taking sentence 1 as an example, the humor is lost after reverse translation, and the ambiguity of the character disappears in the translation. The laughing point of the original sentence is that the "red" applied to the leader describes a state (the leader's face turning red) and leans toward a verbal part of speech, whereas the "red" applied to the employee refers to money paid out and is a noun. The same character "red", matched with different contexts, takes on completely different meanings and produces the humorous effect. To identify such humor, the semantic distance between the two uses of "red" must be calculated in order to judge the semantic deviation between the upper and lower clauses.
TABLE 3.3 Word polysemy analysis
As can be seen from the comparative analysis of Table 3.3 above, humor arising from the ambiguity of a polysemous character is quite common in Chinese humor. A feature extraction model capable of resolving character ambiguity, for example the BERT model, can handle this problem well. Moreover, in many cases character polysemy is accompanied by a change in part of speech, so humor can also be classified by adding part-of-speech features to the feature engineering.
(4) New trend word and dialect
As shown in Table 3.4, reverse translation of the original Chinese sentence yields a new Chinese sentence, and the laughing-point features of trendy words and dialect are compared and identified. Taking sentence 1 as an example, the humor is lost after reverse translation, and the trendy word is lost in the translation. The laughing point of the original sentence is "HELLO KITTY" used as a trendy word, which produces the humorous effect. To identify such humor, a word library and rules for trendy words need to be established. Taking sentence 2 as an example, the humor is lost after reverse translation, and the dialect features are lost in the translation. The laughing point of the original sentence is the northeast dialect expression in the clause (roughly, "you count for nothing"), which creates the humorous effect. To identify such humor, a word library and rules for dialects need to be established.
TABLE 3.4 Trendy word and dialect analysis
As can be seen from the comparative analysis of Table 3.4 above, trendy words and dialect can also be humorous, but in the present dataset such data are relatively scarce, so no separate feature extraction operation is performed for the trendy-word and dialect portions at the feature extraction stage.
3.1.4 Chapter summary
This chapter gives an overview of humor theory. Using an emerging data enhancement method, the reverse translation technique, deep-level features of the humor corpus were identified by comparison with the original dataset, their origins were traced, and some features that trigger humor in real interpersonal interaction were found: harmonic character features, identical-sound character features, structural features, character polysemy features, and so on. The reverse translation technique can thus serve both as a data enhancement technique and as a means of identifying deep features in many humor corpora.
The harmonic character features and identical-sound character features belong to the category of phonetic features, and the corresponding features can be extracted using the characteristics of Chinese pinyin. Analysis shows that structural features very often appear together with harmonic character features, so the characteristics of Chinese pinyin can likewise be used, or a deeper feature vector tool can be used, for their extraction. For character polysemy, features can be extracted by means of part-of-speech features or a word-sense feature extraction model. These feature extractions can be traced back to the origin of the humor, rather than merely chasing result metrics with some advanced automated model. The analysis shows that the reverse translation technique can indeed mine deep information about humor, and it provides a solid theoretical and experimental foundation for analyzing the influence of different features and for humor classification.
4.1 Humor classification study based on data enhancement
Traditional humor classification methods, such as the classical machine learning algorithms support vector machine, decision tree and random forest, and the deep learning algorithms CNN and RNN, can to some extent classify short humorous texts into humorous and non-humorous, but they cannot distinguish the humor of a text from the origin of humor theory. Meanwhile, traditional word vector representations such as GloVe and Word2Vec used alone yield only static word vectors, which cannot represent text well when the context is complex and changeable and the same word may have different meanings. Therefore, this section designs a BERT-based BPH-BiGRU-Softmax Chinese text humor classification model that takes the feature vectors extracted by the BERT model as basic vectors. The features contained in humorous text, analyzed and summarized in Chapter 3 by means of the reverse translation technique, are introduced into the model as a multi-feature representation serving as its input; deep features are then extracted from this representation by a bidirectional GRU network and finally fed into a Softmax classifier, further improving the performance of the model, while the influence of multi-feature fusion on the classification task is analyzed. The specific design framework is shown in fig. 13.
4.1.1 BERT model characterization
The BERT model is a pre-trained model. Pre-training a model simply means learning knowledge and network parameters on one task and saving them. When a new task needs to be performed, if the same model structure is used, the model parameters are initialized by loading the previously pre-trained parameters, and network training is then carried out on the new dataset. When the dataset for the new task is small, a pre-trained model often has a strong advantage and achieves good results. Commonly used word vector representations include GloVe, Word2Vec, ELMo and BERT. At a certain level, the current natural language processing field can be divided into two parts in the training of classification tasks: feature vector extraction and feature vector operation. The feature vector extraction part uses these word vector tools to represent the words of the experimental text as vectors that can be used in mathematical operations. The feature vector operation part, which belongs to the downstream task, inputs the vectors into classifier models to perform text classification.
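As a minimal sketch of this pretrain-then-fine-tune pattern, the snippet below loads previously pre-trained BERT parameters and attaches a new classification head to be trained on the new dataset. The checkpoint name "bert-base-chinese" and the learning rate are assumptions for illustration only.

# Sketch: reuse pre-trained parameters, then fine-tune on the new task.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-chinese")            # loads the pre-trained parameters

classifier_head = torch.nn.Linear(bert.config.hidden_size, 2)    # humor / non-humor head
optimizer = torch.optim.Adam(
    list(bert.parameters()) + list(classifier_head.parameters()), lr=2e-5)  # illustrative learning rate
# ... a fine-tuning loop would then update both the loaded BERT weights and the new head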
Word2Vec is static and can only obtain context-independent word vectors, so in downstream tasks, when several words contained in a humorous text have different meanings in the same sentence, the classification performance of the experiment is easily affected. ELMo achieves context dependence by using Bi-LSTM as its encoder, but it is not a completely bidirectional model, and the next word to be predicted already appears in the given sequence, so the trained word vectors do not necessarily work best. The BERT model differs from ELMo in that BERT uses the Transformer structure as its encoder, which allows deeper layers and better parallelism. The BERT model further increases the generalization ability of the word vector model and fully captures character-level, word-level, sentence-level and even inter-sentence relationship features, so it can resolve word ambiguity and learn deeper semantic information. Therefore, in the humor classification experiments of this chapter, BERT is used as our basic feature vector representation.

4.1.2 Humor classification method based on the multi-feature fusion model
Through the reverse-translation-based technique, the Chinese text humor dataset of this experiment was analyzed and found to mainly contain harmonic character features, structural features and word-sense features. The most obvious expression of harmonic character features in Chinese pronunciation is in Chinese pinyin, so the Chinese pinyin feature extraction tool PyPinyin from the Python library is used to extract harmonic character features. Regarding structural features, humorous sentences with structural features mostly also contain harmonic character features, because their Chinese mode of expression requires the upper and lower clauses to share structural words and to form a head rhyme or end rhyme; for feature extraction the same harmonic-character extraction tool is therefore still used. As for word sense, ambiguity in Chinese humor is usually produced by a character or word having multiple meanings, which yields the humorous expression. The feature vectors extracted by BERT remain highly effective; their most outstanding characteristic is the ability to capture word polysemy, context-dependent word embeddings and other forms of information, which helps handle lexical word-sense ambiguity. Meanwhile, the comparative results of Chapter 3 also show that when some words are ambiguous, the part of speech also changes. Therefore, for word-sense features the BERT model is used as the basic feature vector extraction tool, and text part-of-speech features are additionally extracted with the jieba tool. Feature vectors are extracted separately by these three parts and then concatenated. The concatenated feature vector is then fed as input into the BiGRU model. The BiGRU model first performs a dimension-reduction operation on the concatenated feature vector, then performs deep feature extraction while retaining the important features. Finally, the important feature information output by the BiGRU model is fed into a Softmax layer, which acts as the classifier and outputs the final classification probability of each category, yielding the experimental classification result.
4.1.3 Structural overview of the method based on the BPH-BiGRU-Softmax model
As shown in fig. 13, the humor classification model of this chapter mainly includes: a text input layer, a BERT-Embedding word embedding layer, a Chinese phonetic feature embedding layer, a text part-of-speech feature embedding layer, a feature fusion layer, a BiGRU layer and a fully connected layer; the fully connected layer finally completes the classification output for the Chinese text humor.
(1) Text entry layer
Unlike other models, the text input layer of BERT takes whole sentences as input; they are passed to the next layer and converted into vector matrices.
(2) BERT embedded layer
The BERT model is a pre-trained model with strong generalization capability and can also serve as a bidirectional, deep text representation model. This embedding layer uses the BERT model for text representation, converting the sentences input by the text input layer into vectors that are then passed on for the classification task.
The embedded layer of BERT includes three parts: token Embeddings, segment Embeddings and Position Embeddings.
Token Embeddings layers: the words are converted into vectors of fixed dimensions. In BERT, each word is converted into 768-dimensional vector representations.
Segment Embeddings: BERT is able to handle classification tasks on pairs of input sentences. The two sentences of a pair are simply concatenated and then fed into the model, and Segment Embeddings are used to distinguish the two sentences within the pair.
Position Embeddings: the position information of each word in a sentence can be represented.
When representing text, the BERT model adds fixed tokens to each sentence at the text input layer: [CLS] is added at the beginning of the sentence to mark its start, and [SEP] is added at the end to mark its end and index the sentence. If each word of a sample sentence from the text input layer is denoted by w, then the sentence S can be represented as S = {w1, w2, w3, …, wn}, where n denotes the sequence length of the sample sentence. The matrix V generated by the BERT model's vector representation of the sentence is shown in fig. 14.
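As a small sketch under the same assumptions (the "bert-base-chinese" checkpoint is illustrative), the snippet below shows the [CLS]/[SEP] wrapping and the resulting matrix V of 768-dimensional vectors.

# Sketch: obtain the sentence matrix V described above.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

sentence = "我爱中国"                                  # sample sentence used later in this section
encoded = tokenizer(sentence, return_tensors="pt")     # adds [CLS] at the start and [SEP] at the end
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0]))   # ['[CLS]', '我', '爱', '中', '国', '[SEP]']

with torch.no_grad():
    V = bert(**encoded).last_hidden_state              # matrix V, shape (1, sequence length, 768)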
(3) Chinese phonetic alphabet embedding layer
Python provides PyPinyin, a library for converting Chinese characters to pinyin. It can be used for phonetic annotation, sorting, retrieval and other Chinese character processing, and is developed on the basis of the hotoo/pinyin library. The algorithm for extracting features using Chinese pinyin comprises the following steps:
Step 1: Convert Chinese characters to pinyin
That is, each Chinese character in the sentence to be characterized is converted into pinyin. For example, to convert the sentence "I love China" (我爱中国), we obtain the form "wo ai zhong guo".
Step2: acquiring unique character sets
In the work of text processing, a dictionary (a collection of all words contained in a corpus) is often used, and we also need to find this "dictionary" (character set) first. Each character corresponds to an integer as his ID. The dictionary functions to mutually convert characters/words and numbers.
Step3: pinyin vectorization
Based on the two steps, the basic conditions needed by forming the word vector are provided, and the text which is needed to be converted is subjected to pinyin vectorization.
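A minimal sketch of the three steps with the PyPinyin library follows; the integer-ID encoding of the pinyin tokens is an illustrative assumption, since the text does not fix a particular vector form.

# Step 1: characters -> pinyin; Step 2: unique pinyin "dictionary"; Step 3: vectorization.
from pypinyin import lazy_pinyin

corpus = ["我爱中国", "今天天气很好"]                    # illustrative sentences

pinyin_corpus = [lazy_pinyin(sent) for sent in corpus]   # Step 1
print(pinyin_corpus[0])                                  # ['wo', 'ai', 'zhong', 'guo']

vocab = {tok: i + 1 for i, tok in                        # Step 2: each pinyin token gets an integer ID
         enumerate(sorted({t for s in pinyin_corpus for t in s}))}

def vectorize(sentence):                                 # Step 3: pinyin vectorization
    return [vocab.get(tok, 0) for tok in lazy_pinyin(sentence)]   # 0 reserved for unknown/padding

print(vectorize("我爱中国"))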
(4) Text part-of-speech embedding layer
The jieba tool used to extract text part-of-speech features is currently the most widely used text processing tool in the Python community. The specific method is to first import a stop-word lexicon, perform word segmentation on the sentences in the text, and then extract all parts of speech and convert them into part-of-speech feature vectors.
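A minimal sketch of this part-of-speech step with jieba's posseg module follows; the stop-word set and the mapping of tags to IDs are illustrative assumptions.

# Sketch: segment, drop stop words, and turn part-of-speech tags into feature IDs.
import jieba.posseg as pseg

stopwords = {"的", "了", "是"}                            # illustrative stop-word lexicon

def pos_pairs(sentence):
    pairs = [(p.word, p.flag) for p in pseg.cut(sentence)]   # (word, POS tag) pairs
    return [(w, t) for w, t in pairs if w not in stopwords]

pairs = pos_pairs("今天天气很好")
print(pairs)                                              # e.g. [('今天', 't'), ('天气', 'n'), ...]

tag_vocab = {}                                            # map each POS tag to an integer ID
tag_ids = [tag_vocab.setdefault(t, len(tag_vocab) + 1) for _, t in pairs]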
(5) Feature fusion layer
To trace the source of text humor and improve humor recognition, the feature vector matrix extracted by the BERT model is fused with the Chinese pinyin features and the text part-of-speech feature vectors obtained from the Chapter 3 comparison via the reverse translation technique, forming a multi-feature representation for training in the deep learning model. If the feature vector matrix generated by the BERT model from a sample sentence of the text input layer is V, the fused feature of the corresponding sample sentence can be expressed as formula 4.1:
In the above formula, W represents the newly generated feature vector, f1 represents the word vector feature, and f2 represents the pinyin feature.
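Since formula 4.1 itself is not reproduced in this text, the sketch below only illustrates the concatenation-style fusion ("splicing") that the surrounding description implies; the feature dimensions are illustrative assumptions.

# Sketch: fuse the BERT matrix with pinyin and part-of-speech embeddings along the feature dimension.
import torch

n  = 10                                   # illustrative sequence length
f1 = torch.randn(1, n, 768)               # word vector feature from BERT (the matrix V)
f2 = torch.randn(1, n, 64)                # Chinese pinyin feature (illustrative dimension)
f3 = torch.randn(1, n, 32)                # text part-of-speech feature (illustrative dimension)

W = torch.cat([f1, f2, f3], dim=-1)       # fused multi-feature matrix W
print(W.shape)                            # torch.Size([1, 10, 864])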
(6) BiGRU layers
The GRU (Gated Recurrent Unit) is a variant of the Long Short-Term Memory network (LSTM), proposed to address long-term memory and the gradient problems of back-propagation; LSTM is itself an improved model of the recurrent neural network (RNN). Unlike LSTM, the GRU model has only two gates, an update gate and a reset gate, namely z t and r t in the figure. Because a GRU can only obtain forward context information and ignores backward context information, this work uses a bidirectional GRU network, which obtains context information from both directions simultaneously to improve the accuracy of feature extraction. A basic model diagram of the GRU is shown in fig. 15.
The BiGRU layer mainly comprises a forward GRU layer and a backward GRU layer; the forward and backward networks perform context learning on the fused feature vector matrix W output by the feature fusion layer, carrying out deeper feature extraction on the text.
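A minimal sketch of such a bidirectional GRU over the fused matrix W follows; the hidden size and the pooling of the outputs are illustrative assumptions.

# Sketch: BiGRU over the fused feature matrix W.
import torch

W = torch.randn(1, 10, 864)                      # fused features from the previous layer (illustrative)
bigru = torch.nn.GRU(input_size=864, hidden_size=128,
                     bidirectional=True, batch_first=True)

outputs, h_n = bigru(W)                          # outputs: (1, 10, 256) = forward and backward states
sentence_repr = outputs[:, -1, :]                # one simple pooling choice (an assumption)
print(outputs.shape, sentence_repr.shape)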
(7) Softmax layer
The Softmax layer has very wide application in machine learning and deep learning. In this layer, Softmax converts the features output by the previous layer into the probabilities used to decide the label, i.e. it maps the feature vector into a probability sequence. If Vi denotes the i-th element of vector V, then the value of this element can be expressed by equation 4.2 as Softmax(Vi) = exp(Vi) / Σj exp(Vj).
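A small sketch of this final step follows: a fully connected layer followed by Softmax (equation 4.2) mapping the BiGRU output to humor/non-humor probabilities; the layer sizes are illustrative assumptions.

# Sketch: fully connected layer + Softmax producing the class probabilities.
import torch

sentence_repr = torch.randn(1, 256)              # pooled BiGRU output (illustrative)
fc = torch.nn.Linear(256, 2)                     # two classes: humor (1) and non-humor (0)

logits = fc(sentence_repr)
probs = torch.softmax(logits, dim=-1)            # equation 4.2: exp(V_i) / sum_j exp(V_j)
print(probs, probs.argmax(dim=-1))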
4.1.4 Experimental setup and results analysis
(1) Experimental environment
The hardware operating environment of the present experimental configuration is shown in the following table 4.1:
table 4.1 experiment environment hardware configuration table
The software operating environment of the present experimental configuration is shown in table 4.2 below:
TABLE 4.2 experiment Environment software running Environment
(2) Experimental Performance evaluation index
Currently, in machine learning and deep learning, models are usually built to solve specific problems, but evaluation indexes such as accuracy, recall, F1 value, ROC and AUC are needed to assess the quality of a model, i.e. its generalization ability; they are commonly applied in tasks such as information retrieval (e.g. search engines), natural language processing, and detection and classification. Accuracy and the F1 value are mainly adopted as the principal indexes of experimental evaluation here. For better fairness, the evaluation index results herein are averaged over 10 experimental runs.
The predicted results and the actual results of the experiment are expressed with a confusion matrix, from which the corresponding evaluation indexes can be calculated; the classification result relationships are shown in Table 4.3 below.
TABLE 4.3 classification results
1) Accuracy rate
The accuracy rate here refers to the ratio between the number of texts correctly predicted as positive and the total number of texts predicted as positive, i.e. the probability that a sample predicted to be positive actually is positive. In most cases, the higher the accuracy rate, the better the performance of the model. The formula is as follows: P = TP / (TP + FP).
2) Recall rate
Recall refers to the ratio between the number of texts correctly predicted as positive and the number of texts that are actually positive, i.e. the probability that an actually positive sample is predicted to be positive. The formula is as follows: R = TP / (TP + FN).
3) F1 value
The F1 score is the harmonic mean of precision and recall. When training a deep learning model, we want to take both precision and recall into account and use a single unified evaluation index to assess the training effect of the model. The F1 value can therefore reflect how good or bad the model performance is. The formula is as follows: F1 = 2 × P × R / (P + R).
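These three indexes can be computed directly from the confusion-matrix counts or with scikit-learn, following the verbal definitions above; the labels and predictions below are illustrative only.

# Sketch: accuracy rate (precision-style), recall and F1 from predictions.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1]                      # illustrative gold labels (1 = humor)
y_pred = [1, 0, 1, 0, 0, 1]                      # illustrative model predictions

p  = precision_score(y_true, y_pred)             # TP / (TP + FP)
r  = recall_score(y_true, y_pred)                # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                    # 2 * p * r / (p + r)
print(f"P={p:.4f}  R={r:.4f}  F1={f1:.4f}")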
(3) Experimental corpus
The corpus of this experimental study is divided into two parts: a public corpus and a self-built corpus. The exploration of the influence of reverse translation on humor introduced in Chapter 3 is based on the public corpus, from which two significant features of Chinese humor, the pinyin pronunciation feature and the part-of-speech feature, were analyzed. The purpose of the self-built corpus is to verify that the humor features we analyzed are not accidental but general, i.e. that the features are generic.
1) Public corpus
The corpus for this experiment comes from the Chinese humor corpus of the Eighteenth China National Conference on Computational Linguistics (CCL 2019) and is divided into two categories, humor and non-humor; the humor label is 1 and the non-humor label is 0. The corpus is divided into a training set and a test set: the training set contains 16420 sentences and the test set contains 4105 sentences. The distribution of the Chinese humor dataset is shown in Table 4.4.
Table 4.4 chinese humor dataset distribution table
See Table 4.5 for a partial presentation of the contents of this Chinese humor dataset.
Table 4.5 Chinese humor dataset sample
2) Self-built corpus
The self-built dataset consists of text crawled from joke websites and short-humor websites on the internet; nearly 20,000 items were collected with crawler technology. Because some of the crawled data had problems with writing, special symbols or sentence quality, the data were preprocessed and organized into a standard format; the final self-built dataset contains 12078 items.
The dataset falls into two categories, humor and non-humor. The whole dataset is divided into a training set and a test set at a ratio of 3:1 using the random dataset partitioning method of the Sklearn library.
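A minimal sketch of this 3:1 random split with the Sklearn library follows; the toy corpus and the random seed are illustrative assumptions.

# Sketch: split the self-built corpus into training and test sets at a 3:1 ratio.
from sklearn.model_selection import train_test_split

texts  = ["句子一", "句子二", "句子三", "句子四", "句子五", "句子六", "句子七", "句子八"]  # illustrative corpus
labels = [1, 0, 1, 0, 1, 0, 1, 0]                                                      # 1 = humor, 0 = non-humor

train_x, test_x, train_y, test_y = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels)
print(len(train_x), len(test_x))                 # 6 and 2, i.e. a 3:1 split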
The dataset partitioning and the distribution of positive and negative examples are shown in Table 4.6.
Table 4.6 data partitioning case table
After the data were preprocessed, the purpose of removing data noise was achieved; specific examples are shown in Table 4.7.
Table 4.7 data set samples after pretreatment
(4) Experimental parameter setting
In a deep learning network structure, one core element is the parameters and hyperparameters, which are the final learning targets when training the deep neural network. The choice of hyperparameters directly affects the performance of the model and therefore requires a significant amount of time to adjust. In our experimental process, the common hyperparameters are roughly the learning rate, batch size, dropout size, number of network layers, number of neurons, number of iterations, etc., as shown in Table 4.8.
Table 4.8 model parameter set table
(5) Disclosure of data set experimental results and analysis
1) Effect of multiple features on the results of humor classification of chinese text
In this section's experiments, starting from the BERT-BiGRU-Softmax model that uses only BERT to extract word-sense features, the features analyzed and compared in this work, namely the part-of-speech feature and the pinyin feature, are added respectively for comparison tests. Finally the three features are fused together to produce the BPH-BiGRU-Softmax model and a comparative experiment is carried out; the experimental results obtained are shown in Table 4.9 below:
table 4.9 comparison of results of different features fused
From the experimental result data we can see that the accuracy of character-level feature extraction based on the BERT model alone is 85.77%, with an F1 value of 83.67%. Having analyzed in Chapter 3 the part-of-speech and pinyin characteristics of the Chinese text humor in the existing dataset, we find that adding the text part-of-speech feature on top of the BERT feature gives an accuracy of 86.46% and an F1 value of 84.17%, an accuracy improvement of 0.69% over the basic BERT feature; the part-of-speech feature is thus an effective feature in Chinese text humor classification and recognition. Adding the Chinese pinyin feature on top of the BERT feature gives an accuracy of 86.71% and an F1 value of 84.27%, an improvement over both previous experiments: nearly 1% over the basic BERT feature and 0.25% over the text part-of-speech feature. Finally, when the BERT basic feature, the part-of-speech feature and the pinyin feature are fused together, the experimental result is highest, with accuracy and F1 values of 87.09% and 84.61% respectively, an accuracy improvement of 1.32% over the BERT-BASE basic feature result and an F1 improvement of nearly 1%. From the table we can see that accuracy and F1, as well as recall and precision, rise steadily as further language features are added on top of the BERT feature. It can also be seen that adding the Chinese pinyin feature gives better results than adding the part-of-speech feature, which reflects a major characteristic of Chinese text humor: humor is often expressed through harmonic sounds in the Chinese language. The experimental result is best when all three features are fused together. This provides a theoretical and experimental basis for analyzing the characteristics of Chinese text humor on a linguistic foundation and verifies the effectiveness of the model.
2) Comparing experiments with other network models
To verify the effectiveness of the multi-feature fusion classification model based on the reverse translation technique presented herein, the experimental dataset was compared against the following classical network models as well as the SVM method used by Khandelwal et al. and the TEXT-CNN method used by Chen et al. The comparison uses the best-performing BPH-BiGRU-Softmax model, which fuses BERT features, part-of-speech features and Chinese pinyin features. The experimental results are shown in Table 4.10.
Table 4.10 experimental model comparison results table
From the experimental results in the table we can find that the experimental accuracy improves steadily from the SVM model applied by Khandelwal et al. up to the TEXT-RCNN model: TEXT-RCNN improves by nearly 8.8% over the SVM method and by nearly 1.2% over the TEXT-CNN model applied by Chen et al. The model based on reverse-translation multi-feature fusion proposed here, however, still outperforms the TEXT-RCNN model, improving the result by 4.87%, and is thus the optimal model. The reason is that, on this dataset, common models cannot capture deep semantic features well enough to obtain better results, whereas the proposed model integrates several humor features grounded in linguistic theory and traces whether the recognition of the Chinese text goes back to the origin of its humor, thereby improving the experimental results of humor classification.
3) Exploring the super parameters of important models
In deep learning, training a good model requires finding appropriate parameters. If the model parameters are not chosen properly, the network model may not perform optimally or may even have the opposite effect, such as over-fitting, excessive time and cost, or poor convergence, leading to unsatisfactory training results. The experiments in this section focus on exploring the effect of the Batchsize value on model training.
Currently, datasets for deep learning are large overall; if the number of samples and the amount of content are large, it is impractical to train on all the data at once. Therefore, a mini-batch training method is generally adopted during training: the entire dataset is split into batches of Batchsize text items, each batch serving as the input of one step. The output is compared with the expected values of that batch of samples, the loss is calculated with a loss function, the weights and biases are updated, and the new parameters serve as the initial values for the next batch. The data differ from batch to batch, so there is a certain randomness; with continuous iterative learning, the performance of the network model gradually tends to a stable state.
Typically, once Batchsize has increased to a certain extent, the determined descent direction is essentially unchanged; increasing Batchsize further consumes excessive memory and can even reduce the generalization ability of the model, while if Batchsize is too small, convergence is very difficult.
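A minimal sketch of this mini-batch procedure with a PyTorch DataLoader follows; the stand-in model, loss and learning rate are illustrative assumptions, while the batch size of 32 matches the value chosen later in this section.

# Sketch: mini-batch training with Batchsize items per step.
import torch
from torch.utils.data import DataLoader, TensorDataset

features = torch.randn(256, 864)                 # illustrative fused feature vectors
labels   = torch.randint(0, 2, (256,))           # illustrative humor / non-humor labels
loader   = DataLoader(TensorDataset(features, labels), batch_size=32, shuffle=True)

model     = torch.nn.Linear(864, 2)              # stand-in for the full classification model
loss_fn   = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # illustrative learning rate

for x_batch, y_batch in loader:                  # one step per batch of Batchsize samples
    optimizer.zero_grad()
    loss = loss_fn(model(x_batch), y_batch)      # compare output with the batch's expected values
    loss.backward()                              # compute gradients of the loss
    optimizer.step()                             # update weights and biases before the next batch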
The experiment is based on a BPH-BiGRU-Softmax model, the data set is the humor data set of the chapter, different values are respectively set for Batchsize to explore the influence condition of the model, and the experimental result line diagram is shown in fig. 15.
As is clear from the graph, when Batchsize is set to 1 the required time is longest, the time cost greatly increases, and the accuracy is not ideal. When Batchsize reaches 64, the accuracy has already begun to decline even though less time is spent. When the Batchsize value exceeds 64, the machine becomes overloaded because its memory is limited relative to the large dataset. Because each batch of data is different, the gradient obtained after each iteration can be hard to correct, so choosing a suitable Batchsize value within a reasonable range and balancing effect against time cost is important; this is the mini-batch gradient descent method commonly used in experiments. During an experiment, choosing an appropriate sample scale can reduce the time cost, improve the machine's memory utilization, and find a more accurate descent direction to reduce the oscillation amplitude of the model. Through the experiments of this section, when the Batchsize value is set to 32, both the time cost and the accuracy of the experiment reach their best balance.
4) Contrast experiments of different data enhancement technologies
The reverse translation technique can be used to identify laughing points by linguistically comparing the back-translated dataset with the original dataset, and it can also serve as an excellent data enhancement technique to augment the original dataset, improving the robustness and generalization ability of the model and thereby achieving a better classification effect. Currently, the dominant data enhancement techniques include the sentence-level reverse translation technique and the word-level EDA technique. EDA refers to random word replacement, random deletion, random insertion and random swap operations on sentences in the original dataset, and it has achieved good results on large datasets. Therefore, this section also carries out a comparative experiment on data enhancement techniques. The specific method is to enhance the data with the two different methods, combine the enhanced data with the original dataset to form a new dataset, and input it into the model to carry out the Chinese text humor classification experiment.
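For reference, a minimal sketch of the four word-level EDA operations on an already-segmented sentence follows; the toy tokens and the synonym pool are illustrative assumptions, since the original EDA method draws replacements from a synonym dictionary.

# Sketch: random replacement, deletion, insertion and swap on a token list.
import random

def random_swap(tokens):
    t = tokens[:]
    i, j = random.sample(range(len(t)), 2)
    t[i], t[j] = t[j], t[i]
    return t

def random_deletion(tokens, p=0.1):
    kept = [w for w in tokens if random.random() > p]
    return kept or [random.choice(tokens)]        # never return an empty sentence

def random_insertion(tokens, synonyms):
    t = tokens[:]
    t.insert(random.randrange(len(t) + 1), random.choice(synonyms))
    return t

def random_replacement(tokens, synonyms):
    t = tokens[:]
    t[random.randrange(len(t))] = random.choice(synonyms)
    return t

tokens   = ["今天", "天气", "很", "好"]            # illustrative segmented sentence
synonyms = ["非常", "晴朗"]                        # illustrative synonym pool
print(random_swap(tokens))
print(random_replacement(tokens, synonyms))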
Sentence examples using the reverse translation technique and the EDA data enhancement technique are shown in table 4.11.
Table 4.11 data enhanced sentence examples
In this section's experiment, both data enhancement methods each performed 15% data enhancement. The experimental results are shown in Table 4.12.
Table 4.12 different data enhancement comparison experiment results table
As can be seen from the experimental results in the table, the reverse translation data enhancement technique is slightly better than the EDA data enhancement technique for humor classification; meanwhile, compared with this paper's model without any data enhancement, the accuracy and F1 value are improved by 0.44% and 0.4% respectively. Analyzing the potential reasons, on a Chinese humor dataset reverse translation changes the original structure of the data and enhances semantic diversity while generating less noise than the EDA technique, so the machine can learn automatically from the comparison with the original sentences, better improving the robustness and generalization ability of the model.

(6) Experimental results and analysis of the self-built dataset
Since this humor classification research is oriented to Chinese, through reverse translation of the public Chinese humor corpus in CCL2019 and comparison with linguistic humor theory, two significant features of Chinese humor, the phonetic feature and the part-of-speech feature, were found and experimentally verified to be effective on the CCL2019 public Chinese humor dataset. To continue verifying and analyzing whether these two significant features are universal, the experiments in this section perform experimental analysis on the self-built dataset.

1) Effect of multiple features on the results of humor classification of Chinese text
As with the multi-feature impact study conducted on the public dataset, this section mainly analyzes the impact of multiple features on the self-built dataset. The specific experimental parameters are the same as for the public dataset. The experimental model adopts BERT-BiGRU-Softmax as the basic model; features are gradually added and fused to produce the BPH-BiGRU-Softmax model. The experimental results on the self-built dataset are shown in Table 4.13.
TABLE 4.13 self-established dataset experimental results
As is evident from the experimental results table, the accuracy based on the basic feature model BERT is 97.43% and the F1 value is 97.33%. After the first feature, the part-of-speech feature, is added and fused, the accuracy and F1 value improve by 0.14% and 0.18% respectively over the basic BERT model. After the second feature, the pinyin feature, is added and fused, the accuracy and F1 value improve by 0.17% and 0.21% respectively over the basic BERT model. Both features are therefore effective in feature fusion. After the basic BERT feature, the part-of-speech feature and the pinyin feature are all fused, the accuracy and F1 value reach their optimal values of 97.89% and 97.85%, improvements of 0.46% and 0.52% respectively over the basic BERT model. The fusion advantage of the three features is obvious and the effect is outstanding, verifying the effectiveness of the feature method.
2) Comparing experiments with other network models
To verify the effectiveness of the multi-feature fusion classification model based on the reverse translation technique presented herein, the experiments in this section on the self-built dataset are again compared with the following network models commonly used for classical text emotion classification, such as TEXT-RNN, TEXT-RCNN and DPCNN, as well as with the SVM method applied by Khandelwal et al. and the TEXT-CNN model applied by Chen et al. in humor classification experiments. The comparison again uses the best-performing BPH-BiGRU-Softmax model, which fuses BERT features, part-of-speech features and Chinese pinyin features. The experimental results are shown in Table 4.14.
Table 4.14 experimental model comparison results table
By comparing the experimental results in the table, we can clearly find that the experimental accuracy improves progressively from the SVM model to the TEXT-CNN model. The SVM model is a classical machine learning model for text emotion classification tasks, and the subsequent models are classical deep learning models for the same task. The accuracy of the deep learning TEXT-CNN model is 4.76% higher than that of the machine learning SVM model, and as model complexity increases the experimental accuracy rises, with TEXT-CNN reaching 94.34%. Compared with these, the accuracy of the proposed BPH-BiGRU-Softmax model greatly exceeds the results of the classical network models and of the network models commonly used in humor text classification, improving on the TEXT-CNN model by 3.36%, and it is the optimal model on the self-built dataset. The experimental results show that the model method and the features obtained by comparing a large number of sentences via the reverse translation technique also apply to the self-built dataset; the model and features are general for the Chinese humor classification task and can achieve better experimental results.
3) Exploring the super parameters of important models
As with the key parameter exploration on the public dataset, the size of batchsize also affects model performance on the self-built dataset. If a dataset is small, it is entirely possible to use the full-dataset form (Full Batch Learning), which has the advantage that the direction determined by the full dataset better represents the sample population and thus points more accurately toward the extremum; its drawback is that, because the gradient values of different weights differ greatly, it is difficult to choose a global learning rate. However, since the self-built dataset is large, the specific choice of batchsize requires experimental adjustment. Fig. 16 shows the experimental results of different batchsize values on the self-built dataset.
From the above experimental comparison graph, it can be seen that different batchsize values make a large difference to the running time and the accuracy of the model. Overall, as the batchsize value increases from 1 to 256, the model running time gradually decreases. However, running time is only one aspect of the training task; the accuracy of the model is important. Therefore batchsize should be chosen while also taking the accuracy of the model into account. As the figure shows, when batchsize is set to 32 the accuracy is highest and the performance is best, while the running time is also within an acceptable range; therefore, on the self-built dataset, the batchsize value of 32 is still selected as an important parameter of the model.

4) Contrast experiments of different data enhancement technologies
This section's experiment again compares the reverse translation data enhancement technique with the EDA data enhancement technique, to verify the data enhancement method suitable for Chinese text humor classification. To ensure fairness, the data augmentation ratio was again set at 15% of the dataset; the specific experimental results are shown in Table 4.15.
Table 4.15 different data enhancement comparison experiment results table
As can be seen from the table, the reverse translation data enhancement technique is still slightly better than the EDA data enhancement technique; compared with the proposed model without data enhancement, the accuracy and F1 value improve by 0.33% and 0.35% respectively. Because EDA adds, deletes, replaces and shifts data broadly and randomly, noise is introduced, and with a large dataset the performance of EDA is limited. The reverse translation technique suffers less from these drawbacks: reverse translation can change the expression structure and mode of sentences, so the enhanced data can have a structure different from the original sentences while retaining the correct semantic information under the changed grammatical structure, thereby increasing the data diversity of the text corpus and better increasing the robustness and generalization ability of the model.
4.1.5 Chapter summary
This chapter takes BERT word vectors as the basic vectors. Based on the humor features of the dataset analyzed via the reverse translation technique in Chapter 3, pinyin feature vectors and part-of-speech feature vectors are extracted with pinyin and part-of-speech feature extraction tools and fused with the BERT word vectors to represent the text information. A bidirectional gated recurrent unit network (BiGRU) then performs deep extraction of the feature information on this basis, enabling the network model to understand the text features deeply, and finally the representation is fed into a Softmax classifier to complete the Chinese text humor classification experiments; the BPH-BiGRU-Softmax model is thereby proposed. The chapter first introduces the experimental environment configuration and performance evaluation indexes, then describes in detail the public dataset and the self-built dataset used in the experiments, followed by the experimental parameter settings. Finally, experimental comparisons and related analyses are performed on the public dataset and the self-built dataset. The results show that, compared with the basic feature model BERT, the proposed three-feature fusion method based on BERT word vectors achieves substantial improvements on both the public dataset and the self-built dataset, with an obvious effect. Compared with the support vector machine (SVM) and TEXT-CNN models proposed by other researchers who have recently worked on humor classification, the results of our method are far superior. Meanwhile, the reverse translation technique is also compared and analyzed on the public dataset against the EDA data enhancement technique that has been popular in recent years. The experimental results indicate that our method is effective.

Claims (4)

1. The method for constructing the Chinese humor classifying model based on reverse translation is characterized in that the Chinese humor classifying model is constructed according to the following steps:
s1.1, a text input layer is used for receiving Chinese texts to be classified;
s1.2 BERT embedding layer, which is responsible for converting the Chinese text of the text input layer into context-aware vector representation;
S1.3, a Chinese phonetic feature embedding layer, which comprises the following steps:
s1.3.1 Chinese characters are converted into pinyin: converting each Chinese character in the sentence to be characterized into Chinese pinyin;
S1.3.2 obtain a unique character set: assigning an integer to each character as its ID;
s1.3.3 pinyin vectorization: according to S1.3.1 and S1.3.2 steps, performing pinyin vectorization on the text to be converted;
s1.4, importing a disabled word stock through a jieba tool, performing word segmentation operation on the text, and extracting all parts of speech to form part of speech feature vectors;
S1.5, a feature fusion layer fuses a feature vector matrix extracted by the BERT model, chinese pinyin features obtained by a reverse translation method and text part-of-speech feature vectors to form a multi-feature mode; in the feature fusion layer, feature vectors extracted by the BERT model, chinese pinyin features obtained by comparison through a reverse translation method and text part-of-speech feature vectors are subjected to feature fusion to form a multi-feature mode, and training is performed in the deep learning model; the feature vector matrix generated by the sample sentence of the text input layer through the BERT model is V, and the formula of the territory feature fusion sentence corresponding to the sample sentence can be expressed as follows by the formula 4.1:
In the above formula, W represents the generated new feature vector, f 1 represents the word vector feature, and f 2 represents the Chinese pinyin feature;
s1.6 BiGRU layers, including a forward GRU layer and a backward GRU layer, performing context learning on the feature vector matrix output by the feature fusion layer through a forward and backward neural network, and extracting deep semantic features; the BiGRU layers comprise a forward GRU layer and a backward GRU layer, the forward and backward neural network is utilized to perform context learning on the feature vector matrix W fused and output by the feature fusion layer, and deeper feature extraction operation is performed on the text;
The Softmax layer converts the feature output by the previous layer into the probability of deciding the tag, i.e. maps the feature vector into a probability sequence, and expresses Vi as the i-th element of the vector V, then the value of this element is expressed as formula 4.2 as follows:
s1.7, processing the characteristics output by the BiGRU layers through the full-connection layer, and finally finishing the classified output of the humor of the Chinese text.
2. The method for constructing a reverse translation-based Chinese humor classification model according to claim 1, wherein in the step S1.1, the text input layer takes sentences as input, extracts contrast characteristics of original Chinese humor corpus and reverse translated non-humor corpus, and uses supervised learning of harmony, structural symmetry, character ambiguity and new tide words and dialects in combination with humor theory in linguistics.
3. The method for constructing a reverse translation-based Chinese humor classification model according to claim 1, wherein the part-of-speech feature is embedded in the text layer, a jieba tool is used to introduce the text into a deactivated thesaurus, a word segmentation operation is performed on sentences in the text, and then all parts of speech are extracted and converted into part-of-speech feature vectors.
4. The method for constructing a reverse translation-based Chinese humor classification model according to claim 1, wherein the reverse translation method is to translate a Chinese humor data set into an English data set by using a machine translation method, and then translate the English data set back into the Chinese data set.
CN202110088848.4A 2021-01-22 2021-01-22 Reverse translation-based Chinese humor classification model construction method Active CN112818118B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110088848.4A CN112818118B (en) 2021-01-22 2021-01-22 Reverse translation-based Chinese humor classification model construction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110088848.4A CN112818118B (en) 2021-01-22 2021-01-22 Reverse translation-based Chinese humor classification model construction method

Publications (2)

Publication Number Publication Date
CN112818118A CN112818118A (en) 2021-05-18
CN112818118B true CN112818118B (en) 2024-05-21

Family

ID=75858924

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110088848.4A Active CN112818118B (en) 2021-01-22 2021-01-22 Reverse translation-based Chinese humor classification model construction method

Country Status (1)

Country Link
CN (1) CN112818118B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255342B (en) * 2021-06-11 2022-09-30 云南大学 Method and system for identifying product name of 5G mobile service
CN113673201A (en) * 2021-07-15 2021-11-19 北京三快在线科技有限公司 Text representation vector generation method and device, storage medium and electronic equipment
CN113688622A (en) * 2021-09-05 2021-11-23 安徽清博大数据科技有限公司 Method for identifying situation comedy conversation humor based on NER
CN115512368B (en) * 2022-08-22 2024-05-10 华中农业大学 Cross-modal semantic generation image model and method
CN117315379B (en) * 2023-11-29 2024-03-12 中电科大数据研究院有限公司 Deep learning-oriented medical image classification model fairness evaluation method and device

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108874896A (en) * 2018-05-22 2018-11-23 大连理工大学 A kind of humorous recognition methods based on neural network and humorous feature
JP2019046476A (en) * 2017-09-04 2019-03-22 大国創新智能科技(東莞)有限公司 Emotion interaction method and robot system based on humor recognition
CN109741247A (en) * 2018-12-29 2019-05-10 四川大学 A kind of portrait-cartoon generation method neural network based
CN109918556A (en) * 2019-03-08 2019-06-21 北京工业大学 A kind of comprehensive microblog users social networks and microblogging text feature depressive emotion recognition methods
CN110046250A (en) * 2019-03-17 2019-07-23 华南师范大学 Three embedded convolutional neural networks model and its more classification methods of text
CN110347823A (en) * 2019-06-06 2019-10-18 平安科技(深圳)有限公司 Voice-based user classification method, device, computer equipment and storage medium
CN110472052A (en) * 2019-07-31 2019-11-19 西安理工大学 A kind of Chinese social platform sentiment analysis method based on deep learning
CN110717334A (en) * 2019-09-10 2020-01-21 上海理工大学 Text emotion analysis method based on BERT model and double-channel attention

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6865533B2 (en) * 2000-04-21 2005-03-08 Lessac Technology Inc. Text to speech
CN106611375A (en) * 2015-10-22 2017-05-03 北京大学 Text analysis-based credit risk assessment method and apparatus
US10832003B2 (en) * 2018-08-26 2020-11-10 CloudMinds Technology, Inc. Method and system for intent classification
US20200395008A1 (en) * 2019-06-15 2020-12-17 Very Important Puppets Inc. Personality-Based Conversational Agents and Pragmatic Model, and Related Interfaces and Commercial Models
WO2021000362A1 (en) * 2019-07-04 2021-01-07 浙江大学 Deep neural network model-based address information feature extraction method

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019046476A (en) * 2017-09-04 2019-03-22 大国創新智能科技(東莞)有限公司 Emotion interaction method and robot system based on humor recognition
CN108874896A (en) * 2018-05-22 2018-11-23 大连理工大学 A kind of humorous recognition methods based on neural network and humorous feature
CN109741247A (en) * 2018-12-29 2019-05-10 四川大学 A kind of portrait-cartoon generation method neural network based
CN109918556A (en) * 2019-03-08 2019-06-21 北京工业大学 A kind of comprehensive microblog users social networks and microblogging text feature depressive emotion recognition methods
CN110046250A (en) * 2019-03-17 2019-07-23 华南师范大学 Three embedded convolutional neural networks model and its more classification methods of text
CN110347823A (en) * 2019-06-06 2019-10-18 平安科技(深圳)有限公司 Voice-based user classification method, device, computer equipment and storage medium
CN110472052A (en) * 2019-07-31 2019-11-19 西安理工大学 A kind of Chinese social platform sentiment analysis method based on deep learning
CN110717334A (en) * 2019-09-10 2020-01-21 上海理工大学 Text emotion analysis method based on BERT model and double-channel attention

Non-Patent Citations (13)

* Cited by examiner, † Cited by third party
Title
Investigations in automatic humor recognition; Donghai Zhang et al.; ACM; 2005-12-31; pp. 531-538 *
Recognizing humor on twitter; Renxian Zhang et al.; ACM; pp. 889-891 *
Sentiment analysis for e-commerce product reviews by deep learning model of Bert-BiGRU-Softmax; Yi Liu et al.; AIMS; Vol. 17 (No. 6); pp. 7819-7824 *
Research on Chinese technical term extraction based on a BERT-embedded BiLSTM-CRF model; Wu Jun et al.; Journal of the China Society for Scientific and Technical Information; 2020-04-24 (No. 4); pp. 69-78 *
Research on Chinese short text classification based on CP-CNN; Yu Bengong et al.; Application Research of Computers; 2018-04-30; Vol. 35 (No. 4); pp. 1001-1004 *
A homophonic pun recognition model based on multi-dimensional semantic relations; Xu Linhong; Lin Hongfei; Qi Ruihua; Yang Liang; Scientia Sinica Informationis (11); pp. 48-58 *
Multi-feature fusion Chinese text classification based on a semantic-understanding attention neural network; Xie Jinbao et al.; Journal of Electronics & Information Technology; 2018-05; Vol. 40 (No. 5); pp. 1258-1265 *
Humor recognition based on linguistic features and a hierarchical attention mechanism; Yang Yong et al.; Computer Engineering; Vol. 46 (No. 8); pp. 35-47 *
Yao Ni et al. Research on sentiment classification of online review texts based on BERT and BiGRU. Journal of Light Industry. 2020, (No. 5), pp. 86-92. *
A brief review of humor research at home and abroad; Sun Run; Rural Economy and Science-Technology; 2016-10-30; Vol. 27 (No. 20); pp. 223-229 *
Humor computing and its applications; Lin Hongfei et al.; Journal of Shandong University (Natural Science); Vol. 51 (No. 7); pp. 1-10 *
Yang Yong et al. Humor recognition based on linguistic features and a hierarchical attention mechanism. Computer Engineering. 2020, Vol. 46 (No. 8), pp. 35-47. *
Tao Youlan. Corpora and Translation. Fudan University Press, 2017, p. 160. *

Also Published As

Publication number Publication date
CN112818118A (en) 2021-05-18

Similar Documents

Publication Publication Date Title
Wang et al. Application of convolutional neural network in natural language processing
Yao et al. An improved LSTM structure for natural language processing
CN112818118B (en) Reverse translation-based Chinese humor classification model construction method
CN112541356B (en) Method and system for recognizing biomedical named entities
Wahid et al. Topic2Labels: A framework to annotate and classify the social media data through LDA topics and deep learning models for crisis response
CN108874896B (en) Humor identification method based on neural network and humor characteristics
Rendel et al. Using continuous lexical embeddings to improve symbolic-prosody prediction in a text-to-speech front-end
CN111414481A (en) Chinese semantic matching method based on pinyin and BERT embedding
CN112163425A (en) Text entity relation extraction method based on multi-feature information enhancement
CN112199503B (en) Feature-enhanced unbalanced Bi-LSTM-based Chinese text classification method
Banik et al. Gru based named entity recognition system for bangla online newspapers
CN111191464A (en) Semantic similarity calculation method based on combined distance
CN112036178A (en) Distribution network entity related semantic search method
CN117236338B (en) Named entity recognition model of dense entity text and training method thereof
Chen et al. Clause sentiment identification based on convolutional neural network with context embedding
CN110134950A (en) A kind of text auto-collation that words combines
Guo et al. Implicit discourse relation recognition via a BiLSTM-CNN architecture with dynamic chunk-based max pooling
CN114428850A (en) Text retrieval matching method and system
Yan et al. Implicit emotional tendency recognition based on disconnected recurrent neural networks
CN114265936A (en) Method for realizing text mining of science and technology project
CN116757195B (en) Implicit emotion recognition method based on prompt learning
CN116108840A (en) Text fine granularity emotion analysis method, system, medium and computing device
CN111813927A (en) Sentence similarity calculation method based on topic model and LSTM
Preetham et al. Comparative Analysis of Research Papers Categorization using LDA and NMF Approaches
CN115510230A (en) Mongolian emotion analysis method based on multi-dimensional feature fusion and comparative reinforcement learning mechanism

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant