CN111930947A

CN111930947A - System and method for identifying authors of modern Chinese written works

Info

Publication number: CN111930947A
Application number: CN202010866981.3A
Authority: CN
Inventors: 施建军
Original assignee: Individual
Current assignee: Individual
Priority date: 2020-08-26
Filing date: 2020-08-26
Publication date: 2020-11-13

Abstract

The invention discloses a system and method for identifying the authors of modern Chinese written works. The system comprises a sample data processing module for author-known and author-unknown samples, a sample writing feature extraction and vectorization module, a multilayer neural network model training module, an 'anonymous literary work' author discrimination module, and the like. The sample data processing module prepares the training sample data and discrimination sample data required by the author identification system. The writing feature extraction and vectorization module extracts language features that reflect the writing habits of authors and produces training sample vectors and discrimination sample vectors. The multilayer neural network model training module trains the multilayer neural network model on the training sample vector data to establish a discrimination model. The discrimination module uses the discrimination model to determine the author of an anonymous literary work from its writing-habit language feature vector. Within a designated range of candidate authors, the invention identifies the author of an anonymous literary work with very high accuracy, which can approach 100 percent.

Description

System and method for identifying authors of modern Chinese written works

Technical Field

The invention relates to a system and a method for identifying authors of modern Chinese written works in a given range.

Background

Questions of authorship have surrounded Chinese writing since ancient times. Such questions exist for the four great classical Chinese novels, of which the authorship of "Dream of Red Mansions" is the best known. Who exactly wrote "Dream of Red Mansions", and were the first 80 chapters and the last 40 chapters written by the same author? These questions have long been debated in academia and are also of general public concern. Scholars have long studied the authorship of "Dream of Red Mansions" in various ways, of which the textual research methods advocated by Hu-Han and the statistical methods advocated by Gao Benhan are the most representative.

In recent years, copyright disputes arising from questions of authorship have been increasing, and society now needs technology for identifying the authors of modern Chinese written works. With the progress of modern science and technology, especially since the advent of computers, word processing has become more intelligent and more efficient, so it has become very easy to forge written works for various purposes and difficult to trace their source. The internet has developed rapidly over the past 10 years, and social media of all kinds provide convenient platforms and channels for spreading anonymous written works. These anonymous works are becoming a public problem that plagues the internet and self-media, which also creates a technical need to identify their authors.

Modern Chinese anonymous written works have three characteristics. First, most are electronic texts typed on a computer, so traditional methods of author identification are difficult to apply. Second, because they rely on information technology, they are produced quickly and in large quantities. Third, they spread through the internet, so they propagate rapidly, reach a wide audience, and have a large social influence. For these reasons, it is difficult to trace the author or source of an anonymous work by conventional means, which also poses a new research topic for academia.

At present, research and technology for detecting academic plagiarism are relatively mature, but identifying the author of an anonymous electronic text differs from identifying academic plagiarism. Academic plagiarism presents the content of another person's research results as the plagiarist's own written work, so plagiarism detection checks whether the content of the suspect work is consistent with the content of the plagiarized work. An anonymous written work under author identification has no content similarity with known works; what must be identified is the consistency of writing habits across works with different content. Techniques for detecting plagiarism therefore cannot be applied to author identification.

Research on author identification began with studies of the authorship of literary works. Some Western mathematicians used statistical methods, examining the distribution of word lengths, to study the authorship of the Bible and of Shakespeare's works. Chinese scholars have studied the authorship of "Dream of Red Mansions" using hypothesis testing, classification and clustering, support vector machines, and other methods. These are academic explorations; no mature technology for identifying the authors of modern Chinese written works has yet been reported.

Disclosure of Invention

The invention uses the term "anonymous literary work" for any written work that is unsigned or whose signature does not match its actual author. In view of the problems in the prior art, the object of the invention is to provide a technique and method that can effectively identify the authors of modern Chinese "anonymous literary works" within a given range. The idea is to use a multilayer neural network to learn each author's writing habits from sample data whose author is known, establish a discrimination model, and then use that model to attribute an anonymous literary work to an author according to the writing habits it exhibits.

In order to achieve the purpose, the technical scheme of the invention is as follows:

An identification system for authors of modern Chinese written works comprises: a sample data processing module for author-known and author-unknown Chinese written works, a sample writing feature extraction and vectorization module, a multilayer neural network model training module, and an 'anonymous literary work' author discrimination module. The sample data processing module processes modern Chinese written works whose authors are known and unknown, labels each sample as author-known or author-unknown, and prepares the training sample data and discrimination sample data required by the author identification system: author-known samples are labeled as training data, and author-unknown samples are labeled as discrimination data. The writing feature extraction and vectorization module extracts language features that reflect the writing habits of authors from the training sample data and discrimination sample data, digitizes them, and expresses them as vectors, producing training sample vectors and discrimination sample vectors for the model training module and the discrimination module. The multilayer neural network model training module trains the multilayer neural network model on the training sample vectors to establish a discrimination model for author identification. The 'anonymous literary work' author discrimination module uses the discrimination model established by the training module to classify an author-unknown 'anonymous literary work' according to its writing feature vector and obtain an author discrimination result.

Further, the author-known samples and author-unknown samples refer, respectively, to the author-known sample data that the system uses for multilayer neural network model training and the author-unknown sample data to be identified. The author-known sample data must include work samples from at least two different authors, A and B; each of the two known authors must be represented by no fewer than 5 samples (the more the better), and each sample must be no shorter than 500 characters (counting punctuation marks but not spaces; the longer the better). An author-unknown sample is one whose author is known to be one of A and B, without it being known whether the author is A or B.
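The sample requirements above can be checked mechanically before training. The following is a minimal sketch, not the patented implementation; the function names and the dictionary layout (author name mapped to a list of sample texts) are assumptions made for illustration.

```python
MIN_SAMPLES_PER_AUTHOR = 5   # stated minimum; more is better
MIN_SAMPLE_LENGTH = 500      # characters, punctuation included, spaces excluded


def sample_length(text: str) -> int:
    """Count characters including punctuation but excluding whitespace."""
    return sum(1 for ch in text if not ch.isspace())


def check_training_set(samples_by_author: dict[str, list[str]]) -> list[str]:
    """Return a list of problems; an empty list means the set meets the minimums."""
    problems = []
    if len(samples_by_author) < 2:
        problems.append("at least two known authors are required")
    for author, texts in samples_by_author.items():
        if len(texts) < MIN_SAMPLES_PER_AUTHOR:
            problems.append(f"author {author}: only {len(texts)} samples")
        for i, text in enumerate(texts, start=1):
            if sample_length(text) < MIN_SAMPLE_LENGTH:
                problems.append(f"author {author}, sample {i}: too short")
    return problems
```

Treating every whitespace character (including the full-width space) as a "space" is an assumption; the text only says that spaces are not counted.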

Further, labeling the author-known and author-unknown sample data means marking, with specific symbols, which samples are author-known and used for model training and which are author-unknown and to be identified, and, among the author-known samples used for model training, which are by known author A, which are by known author B, and so on. For example, an author-known sample for model training can be labeled "0" and an author-unknown sample to be distinguished can be labeled "1"; a sample by author A can be labeled "0" and a sample by author B can be labeled "1".

Further, author identification of an anonymous literary work means that the system identifies an author-unknown sample within a specified range; that is, the author-unknown sample is assumed to come from one of several known authors. Using a classification model trained on the author-known sample data, the system classifies the author-unknown work into the category of one of the author-known samples, thereby completing the discrimination of the author of the anonymous literary work.

Further, the term "discrimination result" means the system's classification of an author-unknown sample into the category of a known author's samples. Specifically, the character feature vector of the author-unknown sample is assigned the label of a known author, and the classification label and the sample number are output as the discrimination result and fed back to the user.

The method for identifying the author of the modern Chinese written works comprises the following steps:

1) processing training sample data and sample data to be distinguished provided by a user, and labeling labels of information such as the training sample, the distinguishing sample, a known author A, a known author B, a sample source and the like;

2) extracting language features reflecting the writing habits of the authors from the training sample data obtained in the step 1) and the sample data to be distinguished and vectorizing the language features;

3) establishing a multilayer neural network model, and training the model by using the training sample language feature vector obtained in the step 2) to obtain a classification model;

4) distinguishing the samples to be subjected to author identification obtained in the steps 1) and 2) by using the classification model obtained in the step 3);

5) outputting the discrimination result to obtain the author attribution of the author-unknown work.

Further, the data processing in step 1) means processing the original sample data provided by the user so that it conforms to the data format required by the system. The system processes data files in text file format, encoded in UTF-8.

Further, labeling information such as the training sample, the discrimination sample, known author A, known author B, and the sample source in step 1) means establishing a labeling rule for this information and applying it to the names of the sample files provided by the user. The labeling rule provided by the method is: "sample type information, author information, sample number + sample original file name".

Sample category information: "0" represents a training sample, and "1" represents a text to be distinguished;

author information: "0" represents a known author A, "1" represents a known author B;

sample number: numbering the sample files by integers;

sample original filename: the original file name of the sample provided by the user is retained.

For example, a file name that begins with "0.0.10" indicates that the sample is a training sample for training the model, is a written work of author A, and has sample number 10; the remainder of the name, ending in ".txt", is the original file name of the sample. "1.0.1 new-born. txt" in Chinese language indicates a sample to be distinguished whose sample number is 1. Note that the manually labeled author information of a sample to be distinguished is meaningless but must be retained, so the "0" in the middle of that sample name does not indicate that the author is A.
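The labeling rule can be read back from a file name with a small parser. This is a sketch under assumptions: the three label fields are taken to be separated by half-width periods and followed by the original file name, and the names `SampleLabel`, `parse_sample_name`, and the example file name `essay.txt` are hypothetical.

```python
import re
from typing import NamedTuple


class SampleLabel(NamedTuple):
    is_unknown: bool     # False: training sample ("0"); True: sample to be distinguished ("1")
    author: int          # 0: author A, 1: author B (meaningless when is_unknown is True)
    number: int          # integer sample number
    original_name: str   # original file name supplied by the user


# Assumed layout: "<type>.<author>.<number> <original file name>"
_LABEL_RE = re.compile(r"^([01])\.([01])\.(\d+)\s*(.+)$")


def parse_sample_name(file_name: str) -> SampleLabel:
    """Split a labeled sample file name into its four parts."""
    match = _LABEL_RE.match(file_name)
    if match is None:
        raise ValueError(f"file name does not follow the labeling rule: {file_name!r}")
    kind, author, number, original = match.groups()
    return SampleLabel(kind == "1", int(author), int(number), original)
```

For instance, `parse_sample_name("0.0.10 essay.txt")` yields a training sample by author A with sample number 10.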

Further, step 2) extracts language features reflecting the writing habits of the author from the training samples and the samples to be distinguished, and vectorizes them. The method takes the single Chinese characters and punctuation marks that appear in the training samples and the samples to be distinguished, together with their frequencies in each sample, as the language features reflecting the writing habits of the sample's author. Chinese characters can express meaning independently, and characters that are common to all samples and occur at high frequency usually reflect an author's writing habits; this is even more true of punctuation marks. The method therefore uses these language features as the basis for discriminating authors, and uses the frequencies of the Chinese characters and punctuation marks in each sample as the feature vector reflecting that sample's writing habits. A distinguishing property of this feature extraction method is that the resulting feature vectors are task-specific: no fixed set of language features is used for every task.
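One way to obtain a task-specific feature set is to keep only the characters and punctuation marks that occur in every sample of the task. The sketch below illustrates that idea; the function name `shared_features` is hypothetical, and ignoring whitespace while keeping every other character is an assumption, since the text does not say how non-Chinese characters are treated.

```python
def shared_features(samples: list[str]) -> list[str]:
    """Return the characters and punctuation marks that occur in every sample, sorted."""
    if not samples:
        return []
    # Start from the first sample and intersect with each remaining sample.
    common = {ch for ch in samples[0] if not ch.isspace()}
    for text in samples[1:]:
        common &= set(text)
    return sorted(common)
```

Because the set is recomputed from the samples at hand, a different pair of authors or a different unknown text generally produces a different feature list, which matches the task-specific behavior described above.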

Further, step 3) establishes the discrimination model by introducing a multilayer neural network. The method constructs a multilayer perceptron with three hidden layers of 512, 1024, and 512 neural units respectively; the input layer and the hidden layers use "relu" as the activation function, and the output layer uses "sigmoid" as the activation function. The multilayer neural network model is trained on the training sample feature vectors obtained in step 2) to obtain a discrimination model for practical use.

Further, step 4) uses the discrimination model trained in step 3) and the language feature vectors of the author-unknown samples obtained in step 2) to perform author identification on the samples to be distinguished.
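To get a sense of the size of such a network, the number of trainable parameters can be computed from the layer widths. This is a worked arithmetic sketch that assumes fully connected layers with biases, a single output unit, and an input layer that only passes the feature vector through; the input dimension of 300 used in the note below is a hypothetical value.

```python
HIDDEN_UNITS = (512, 1024, 512)  # hidden-layer widths given in the text
OUTPUT_UNITS = 1                 # single sigmoid output unit


def count_parameters(input_dim: int) -> int:
    """Weights plus biases of the fully connected network for a given input dimension."""
    sizes = (input_dim, *HIDDEN_UNITS, OUTPUT_UNITS)
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))
```

With 300 input features, `count_parameters(300)` gives 1,204,737 parameters, which is very large compared with the ten or twenty training samples typically available.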

Further, step 5) outputs the discrimination result obtained in step 4). The discrimination result contains the following three fields: the system discrimination result label, the manually preset label, and the name of the sample source file.

System discrimination result label: the value is 0 or 1, indicating whether the sample to be distinguished belongs to author A or author B;

manually preset label: the label preset manually when author information is annotated on the training samples and the samples to be distinguished in step 1). This label is mainly used for checking during model training and for evaluating the model's discrimination accuracy on the training samples; it has no significance for the samples to be distinguished.
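The three output fields and the accuracy check they enable can be sketched as follows. The record type `Result` and the function `accuracy` are hypothetical names; the comparison is only meaningful for samples whose preset label is the true author, such as known-author samples held back for testing.

```python
from typing import NamedTuple


class Result(NamedTuple):
    predicted: int     # label assigned by the system: 0 (author A) or 1 (author B)
    preset: int        # manually preset label taken from the file name
    source_name: str   # name of the sample source file


def accuracy(results: list[Result]) -> float:
    """Share of records whose predicted label equals the preset label."""
    if not results:
        raise ValueError("no results to evaluate")
    hits = sum(1 for r in results if r.predicted == r.preset)
    return hits / len(results)
```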

The method for identifying the author of an anonymous literary work within a specified range requires few training samples, identifies authors with high accuracy, consumes few resources, and runs quickly. When each known author used for model training is represented by at least 10 samples, each longer than 1000 characters, the discrimination accuracy of the method is close to 100%; when each known author is represented by at least 5 samples, each longer than 500 characters, the method still achieves good results.

Drawings

FIG. 1 is a schematic block diagram of the present invention of a modern Chinese written work author identification system.

FIG. 2 is an exemplary diagram of training samples and discrimination samples of the method for identifying authors of modern Chinese written works.

FIG. 3 is an exemplary diagram of feature vectors of training samples and discrimination samples of the author identification method of modern Chinese written works.

FIG. 4 is a flow chart of the method for identifying authors of modern Chinese written works.

FIG. 5 is a schematic diagram of a neural network model of the method for identifying authors of modern Chinese written works.

Detailed Description

The invention is further explained below with reference to the drawings and the examples.

As shown in FIG. 1, an identification system for authors of modern Chinese written works includes: a sample data processing module for author-known and author-unknown samples, a sample writing feature extraction and vectorization module, a multilayer neural network model training module, and an 'anonymous literary work' author discrimination module. The sample data processing module processes modern Chinese written works whose authors are known and unknown, labels each sample as author-known or author-unknown, and prepares the training sample data and discrimination sample data required by the author identification system: author-known samples are labeled as training data, and author-unknown samples are labeled as data to be discriminated. The writing feature extraction and vectorization module extracts language features that reflect the writing habits of authors from the training sample data and discrimination sample data, digitizes them, and expresses them as vectors, producing training sample vectors and discrimination sample vectors for the model training module and the discrimination module. The multilayer neural network model training module trains the multilayer neural network model on the training sample vector data and establishes a discrimination model for identifying author-unknown samples. The 'anonymous literary work' author discrimination module uses the discrimination model established by the model training module to classify an author-unknown written work sample according to its writing feature vector; the system gives '0' or '1' as the discrimination result for the author.

FIG. 2 shows the raw text, comprising author-known samples of modern Chinese written works, which are used to train the neural network model, and author-unknown samples to be identified. The samples must be stored as text files encoded in UTF-8. The original text files are named according to the rule "sample type information.author information.sample number sample source file name.txt". For example, "0.0.1 luxun.txt" indicates that the file is a training sample, the author information is "0", the sample number is "1", the sample source is a work by Lu Xun, and the sample file format is TXT.

Samples of written works by two or more known authors must be provided. Each author must be represented by at least 5 samples, preferably more than 10, and each sample must be at least 500 characters long, preferably more than 1000. The text whose author is to be identified must also be at least 500 characters long, preferably more than 1000, and its author must be one of the two or more known authors. For example, 10 samples are known to be works of Lu Xun, another 10 samples are known to be works of Dianthus superbus, and the author of the text "New Chinese language" is known only to be either Lu Xun or Dianthus superbus, so it is necessary to identify which of them wrote it.

FIG. 3 shows the language feature vectors, derived from the original text, that express the writing habits of the samples. Each writing feature must be a single Chinese character or punctuation mark that appears in all samples, including both the author-known samples and the author-unknown samples to be identified. The writing features (Chinese characters or punctuation marks) required by the invention are not fixed; their type and number vary with the samples of each identification task. The feature vector is composed of the frequency of occurrence of each qualifying Chinese character or punctuation mark in each sample. The frequency is calculated as follows:

Frequency of Chinese character $i$: $P_i = \dfrac{\text{number of occurrences of character } i \text{ in a sample}}{\text{total number of characters in that sample}} \times 1000$.
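The frequency formula can be applied to one sample as in the sketch below. The function name `feature_vector` is hypothetical, the list of feature characters is assumed to be supplied by the caller, and excluding whitespace from the total character count is an assumption.

```python
from collections import Counter


def feature_vector(text: str, features: list[str]) -> list[float]:
    """Per-thousand frequency of each feature character in one sample."""
    chars = [ch for ch in text if not ch.isspace()]
    total = len(chars)
    if total == 0:
        raise ValueError("sample is empty")
    counts = Counter(chars)
    # P_i = occurrences of character i / total characters * 1000
    return [counts[f] / total * 1000 for f in features]
```

Scaling by 1000 expresses each frequency as occurrences per thousand characters, so samples of different lengths become comparable.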

FIG. 4 shows the overall process of "anonymous literary work" author identification according to the invention. After an author identification task is received, the sample data are first labeled, with the following information added to each sample file name:

sample number: numbering the sample files by integers;

Next, the usage frequency of Chinese characters (including punctuation marks) is counted over all samples, and the usage frequencies of the Chinese characters and punctuation marks that appear in every sample are extracted as the feature vector representing the writing habits of each sample.

Next, according to the labeling information, the sample feature vectors are divided into two parts: the feature vectors of author-known samples and the feature vectors of author-unknown samples. The feature vectors of the author-known samples are used to train the multilayer neural network model shown in FIG. 5 and generate a discrimination model.

Next, the feature vectors of the author-unknown samples are input into the discrimination model to identify the author of each sample.

Finally, the author identification result for the anonymous literary work is output.
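The split by labeling information can be sketched as below. It assumes that the feature vectors are held in a dictionary keyed by labeled file name and that the first two period-separated fields of the name are the sample type and the author label; the function name `split_by_label` is hypothetical.

```python
def split_by_label(vectors: dict[str, list[float]]):
    """Split {file name: feature vector} into training data and data to be identified."""
    train_x, train_y, unknown = [], [], {}
    for name, vector in sorted(vectors.items()):
        kind, author = name.split(".", 2)[:2]
        if kind == "0":              # author-known sample: goes to the training set
            train_x.append(vector)
            train_y.append(int(author))
        else:                        # author-unknown sample: kept for discrimination
            unknown[name] = vector
    return train_x, train_y, unknown
```

The author label of an unknown sample is deliberately not used, since the text states that it carries no meaning for samples to be distinguished.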

FIG. 5 shows the multilayer neural network model for author identification that is the core of the invention. The model is a perceptron composed of 5 layers: 1 input layer, 2 hidden layers of 512 neural units each, 1 hidden layer of 1024 neural units, and 1 output layer. The number of neural units in the input layer is not fixed; the system sets it automatically to the dimensionality of the writing feature vectors of the samples in each discrimination task. The output layer consists of 1 neural unit whose output is 0 or 1, and the author attribution of the anonymous literary work is determined from this output. The input layer and each hidden layer use the relu activation function, and the output layer uses the sigmoid function for binary classification. An "anonymous literary work" author identification task can usually supply only a few training samples, so the model has many parameters relative to the data, which makes the trained model prone to overfitting and to losing its predictive ability. To prevent overfitting, the invention applies dropout to each hidden layer. The multilayer neural network model constructed by the invention classifies an author-unknown text sample according to its writing feature vector and assigns it an author label (0 or 1), thereby identifying the author of the author-unknown sample.
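A network of this shape could be written with the Keras API roughly as follows. This is a sketch, not the inventor's code: the choice of Keras, the dropout rate, the optimizer, and the loss function are all assumptions that the text does not state, and the hidden-layer order 512, 1024, 512 follows the earlier description of step 3).

```python
HIDDEN_UNITS = (512, 1024, 512)  # hidden-layer widths from the description
DROPOUT_RATE = 0.5               # assumed value; the text does not give a rate


def build_model(input_dim: int):
    """Build the described network; requires TensorFlow/Keras to be installed."""
    from tensorflow import keras  # imported lazily so the constants load without it

    model = keras.Sequential()
    model.add(keras.Input(shape=(input_dim,)))  # width follows the feature vector
    for units in HIDDEN_UNITS:
        model.add(keras.layers.Dense(units, activation="relu"))
        model.add(keras.layers.Dropout(DROPOUT_RATE))  # dropout on each hidden layer
    model.add(keras.layers.Dense(1, activation="sigmoid"))
    # Optimizer and loss are assumptions suited to binary classification.
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```

The sigmoid output is a value between 0 and 1; thresholding it at 0.5 to obtain the 0 or 1 author label is a common convention and is likewise an assumption here.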

When the invention trains the neural network model, the parameters are adjusted according to the following principle:

Validation sample proportion: at least 10 percent and no more than 20 percent, namely 0.1 ≤ validation_split ≤ 0.2;

number of training cycles: at least 30 and no more than 100, namely 30 ≤ epochs ≤ 100;

number of samples per batch: at least 2 and no more than the total number of samples participating in training, namely 2 ≤ batch_size ≤ total number of training samples.

Because few samples are usually available for training in an author identification task, the number of samples selected in one training batch does not exceed the total number of training samples.
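The parameter ranges above can be enforced with a small guard before training starts. This is a minimal sketch; the function name `check_training_params` and the argument `n_train` (the total number of training samples) are hypothetical.

```python
def check_training_params(validation_split: float, epochs: int,
                          batch_size: int, n_train: int) -> None:
    """Raise ValueError if a setting falls outside the ranges given in the text."""
    if not 0.1 <= validation_split <= 0.2:
        raise ValueError("validation_split must be between 0.1 and 0.2")
    if not 30 <= epochs <= 100:
        raise ValueError("epochs must be between 30 and 100")
    if not 2 <= batch_size <= n_train:
        raise ValueError("batch_size must be between 2 and the number of training samples")
```

For example, `check_training_params(0.1, 50, 4, 10)` passes silently, while a batch size of 11 with only 10 training samples raises an error.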

Because the problem of attributing an "anonymous literary work" among several authors can be decomposed into several two-author discrimination tasks, the method, suitably extended, can also identify the author of an "anonymous literary work" among multiple candidate authors.
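One possible way to carry out this decomposition is to run a two-author discrimination for every pair of candidates and take the author who wins most often. The text does not say how the pairwise results are combined, so the majority vote below is an assumption, and `classify_pair` is a hypothetical callback that trains and applies a two-author model for the unknown sample and returns the winning author's name.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Sequence


def identify_among_many(authors: Sequence[str],
                        classify_pair: Callable[[str, str], str]) -> str:
    """Run one two-author discrimination per pair and return the majority winner."""
    if len(authors) < 2:
        raise ValueError("at least two candidate authors are required")
    votes = Counter()
    for a, b in combinations(authors, 2):
        votes[classify_pair(a, b)] += 1
    # Ties go to the author who was counted first (an arbitrary choice).
    winner, _ = votes.most_common(1)[0]
    return winner
```

With n candidate authors this runs n(n-1)/2 pairwise tasks, which stays manageable for the small candidate sets the method targets.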

Table 1 shows the output of a performance test of the invention using samples of works by Lu Xun and Dianthus superbus. For the test, 12 samples of Lu Xun's works were collected and given author labels and numbers according to the rule in step 1) of the method, with the author label of Lu Xun's works set to 0; 9 samples of Dianthus superbus's works were collected and numbered and labeled in the same way, with the author label of Dianthus superbus set to 1. Samples numbered 1-5 were selected as training data, and the remaining samples were used as author-unknown samples for the author identification test. The first column in Table 1 is the author label assigned to each test sample by the invention. To verify the test results, the second column in Table 1 gives the manually assigned author label. Comparing columns 1 and 2 therefore allows the performance of the invention to be assessed. The discrimination results for the works of Lu Xun and Dianthus superbus are consistent with the manual labels:

TABLE 1 example of the output results of the present invention of the system for identifying authors of modern Chinese written works

Claims

1. A system for identifying authors of modern Chinese written works, comprising a sample data processing module for author-known and author-unknown samples, a writing feature extraction and vectorization module, a multilayer neural network model training module, and an 'anonymous literary work' author discrimination module, wherein:

the sample data processing module for author-known and author-unknown samples is used for labeling training samples and discrimination samples as author-known or author-unknown to prepare training sample data and discrimination sample data;

the writing feature extraction and vectorization module is used for extracting language features representing the writing habits of authors from the training sample data and discrimination sample data, digitizing the language features, and vectorizing them;

the multilayer neural network model training module is used for training the neural network model on the training sample vector data and establishing an identification model for authors of written works;

the 'anonymous literary work' author discrimination module is used for performing author identification on an author-unknown written work sample by using the neural network model established by the training module to obtain an author identification result.

2. The system of claim 1, wherein the sample data processing module for author-known and author-unknown samples names and labels the sample data files according to specific rules, labeling information such as training sample, discrimination sample, author, sample number, and sample source.

3. The system as claimed in claim 1, wherein the writing feature extraction and vectorization module extracts language features representing the writing habits of an author from author-known and author-unknown text samples, and converts the language features into feature vectors that can be processed by the multilayer neural network model.

4. The system as claimed in claim 1, wherein the multilayer neural network model training module uses a multilayer neural network to construct a machine learning model, and uses the language feature vectors reflecting the writing habits of authors as training data to train and construct a classification model capable of efficiently identifying the author of a written work.

5. The system for identifying authors of modern Chinese written works as claimed in claim 1, wherein the author discrimination module efficiently and accurately identifies the authors of author-unknown text samples according to the author classification model trained in claim 4 and the language feature vectors of claim 3.

6. A method for identifying authors of modern Chinese written works, characterized by comprising the following steps:

1) processing the training sample data and the sample data to be distinguished provided by a user, and labeling information such as training sample, discrimination sample, known author A, known author B, and sample source;

3) establishing a discrimination model, training the multilayer neural network model by using the training sample language feature vector obtained in the step 2), and establishing the discrimination model for author identification;

4) judging the author of the anonymous literary works sample by using the judging model obtained in the step 3) according to the sample feature vector to be subjected to author identification obtained in the step 2);

5) outputting the discrimination result to complete the author identification task.

7. The method of identifying authors of modern Chinese written works according to claim 6, wherein step 1) requires at least 5, preferably more than 10, work samples from each known author; the length of each known-author work sample is at least 500 Chinese characters, preferably more than 1000 Chinese characters; the sample data files of the written works are named and labeled according to the following rules: a training sample is labeled 0 and a discrimination sample is labeled 1; a sample by author A is labeled 0 and a sample by author B is labeled 1; the samples of each author are numbered with integers and followed by their original file names; the sample files are stored in text file format, encoded in UTF-8; and the label fields are separated by a half-width character.

8. The method for identifying the author of the modern Chinese written work according to claim 6, wherein step 2) extracts, from the author-known work samples and the author-unknown work samples, the Chinese characters and punctuation marks that appear in all samples as the writing features of each sample, and uses the usage frequencies of these Chinese characters and punctuation marks in each sample as writing feature vectors, which serve as model training data and author discrimination data; the extracted writing features are not fixed and differ from one discrimination task to another.

9. The method for identifying the author of the modern Chinese written work according to claim 6, wherein step 3) establishes a multilayer neural network perceptron comprising 5 layers: 1 input layer, 2 hidden layers of 512 neural units each, 1 hidden layer of 1024 neural units, and 1 output layer of 1 neural unit; the number of input layer neural units is not fixed and differs from task to task; dropout is used on each hidden layer to prevent overfitting, relu activation functions are used for the input layer and the hidden layers, and a sigmoid activation function is used for the output layer.

10. The method for identifying the author of the modern Chinese written works according to claim 6, wherein the parameters for training the multilayer neural network perceptron model in step 3) are set as follows: the proportion of validation samples is at least 10% and at most 20% (0.1 ≤ validation_split ≤ 0.2); the number of training cycles is at least 30 and at most 100 (30 ≤ epochs ≤ 100); and the number of samples selected in one training batch is at least 2 and at most the total number of training samples.