CN111897958B

CN111897958B - Ancient poetry classification method based on natural language processing

Info

Publication number: CN111897958B
Application number: CN202010684783.5A
Authority: CN
Inventors: 邓桦; 闫灵芝; 孙娟娟; 魏增辉
Original assignee: Individual
Current assignee: Individual
Priority date: 2020-07-16
Filing date: 2020-07-16
Publication date: 2024-03-12
Anticipated expiration: 2040-07-16
Also published as: CN111897958A

Abstract

The invention discloses an ancient poetry classification method based on natural language processing, comprising the following steps: inputting poetry data to be classified; performing word segmentation on the poetry data according to a preset word stock, wherein the preset word stock comprises at least a first data set and a second data set; matching the poetry data against the second data set to obtain a first matching result, wherein the first matching result represents all single characters appearing in the poetry data; matching phrases in the first data set according to the first matching result to obtain a second matching result, wherein the second matching result represents the parts of speech and classification labels of all phrases in the poetry data to be classified; and classifying the poetry data according to the parts of speech and/or the classification labels of all phrases in the poetry data to be classified. With this method, ancient poetry can be segmented by a computer algorithm and the parts of speech and preset classification labels of its phrases obtained, so that input ancient poetry can be classified efficiently.

Description

Ancient poetry classification method based on natural language processing

Technical Field

The invention relates to a text classification method, in particular to a natural language processing-based ancient poetry classification method.

Background

Ancient Chinese poetry is a crystallization of five thousand years of Chinese thought and adds a rich, vivid stroke to our national culture. In ancient times, poetic talent was an important measure of a person's ability and was included in the examinations used to select talent. After the New Culture Movement, poetry began to turn toward modern verse. Compared with classical poetry, the language of modern poetry is plainer, simpler, more natural and easier to understand. Meanwhile, with the passage of time, classical poetry has become uncommon in daily life. As a result, many modern people regard classical poetry as a traditional, even rigid, form of expression, and some are unwilling to engage with it. Nevertheless, classical Chinese poetry retains practical significance. First, it nourishes the mind. When we use classical poetry to express the joys, interests and feelings of real life, we find that life becomes artistic, our emotions are sublimated, and the mind is enriched. Second, classical poetry is a symbol of Chinese culture. In today's world, competition in cultural soft power is unavoidable, and classical Chinese poetry is a symbol of the breadth and depth of Chinese culture and a bond connecting ancient and modern culture. Finally, classical poetry plays a unique role in cultivating character. Composing classical poetry is an art of savoring the "beauty" in life and making that aesthetic experience last. Classical poetry has a unique advantage in capturing the aesthetic experience of human life: it creates beauty of rhyme, rhythm and imagery through devices such as rhyme, antithesis, cadence and syllable count, and these devices heighten its artistic conception.

Given the practical significance of ancient poetry discussed above, a deep knowledge of ancient poetry is valuable for modern people. However, apart from a few widely circulated poems, most poems are difficult for ordinary people to learn and understand systematically. Systematic classification is therefore necessary to facilitate learning. The commonly accepted categories of poetry include: landscape and pastoral poems, farewell poems, homesickness poems, frontier poems, poems on history and remembrance of the past, and poems on objects. Although electronic devices are now widespread, no ancient poetry classification method based on a computer algorithm exists at present.

Disclosure of Invention

In view of the foregoing problems of the prior art, one aspect of the present invention is to provide a method for classifying ancient poetry based on natural language processing. The method can automatically classify large volumes of ancient poetry by means of a natural language processing algorithm, making it convenient for users to find and study poems.

In order to achieve the above object, one embodiment of the present invention provides a method for classifying ancient poetry based on natural language processing, including:

inputting poetry data to be classified;

performing word segmentation processing on the poetry data according to a preset word stock, wherein the preset word stock at least comprises a first data set and a second data set, the first data set is a finite set and comprises all ancient Chinese phrase information, and the ancient Chinese phrase information at least comprises part of speech and classification labels; the second data set is a finite set, which contains all the single characters of ancient Chinese;

matching the poetry data with the second data set to obtain a first matching result, wherein the first matching result represents all single characters appearing in the poetry data;

matching the phrases in the first data set according to the first matching result to obtain a second matching result, wherein the second matching result represents parts of speech and classification labels of all the phrases in the poetry data to be classified;

and classifying the poetry data according to the parts of speech and/or the classification labels of all the phrases in the poetry data to be classified.

Preferably, before the poetry data to be classified is input, the poetry data is preprocessed according to a third data set, wherein the third data set is a finite set containing all function words of ancient Chinese, and the preprocessing removes function words from the poetry data to be processed.

Preferably, the preset word stock further comprises a fourth data set, wherein the fourth data set comprises all single ancient Chinese characters that are contained in the second data set but not in the first data set, together with their parts of speech and classification labels; while phrases in the first data set are matched according to the first matching result, single characters in the fourth data set are also matched according to the first matching result and their parts of speech and classification labels are acquired.

Compared with the prior art, the ancient poetry classification method based on natural language processing can segment ancient poetry by means of a computer algorithm and obtain the parts of speech and preset classification labels of its phrases, so that input ancient poetry can be classified efficiently. The method makes it easier for modern people to study ancient poetry systematically.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

This document provides an overview of various implementations or examples of the technology described in this disclosure, and is not a comprehensive disclosure of the full scope or all of the features of the disclosed technology.

Drawings

FIG. 1 is a flow chart of the ancient poetry classification method based on natural language processing of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present disclosure. It will be apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments, which can be made by one of ordinary skill in the art without the need for inventive faculty, are within the scope of the present disclosure, based on the described embodiments of the present disclosure.

Unless defined otherwise, technical or scientific terms used in this disclosure should be given the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The use of the terms "comprising" or "includes" and the like in this disclosure is intended to cover an element or article listed after that term and equivalents thereof without precluding other elements or articles. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may also include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may also be changed when the absolute position of the object to be described is changed.

In order to keep the following description of the embodiments of the present disclosure clear and concise, the present disclosure omits detailed description of known functions and known components.

As shown in FIG. 1, a method for classifying ancient poetry based on natural language processing according to an embodiment of the present invention includes:

S1, inputting poetry data to be classified. The ancient poetry classification method of the present invention can be applied to a computer system based on a client/server (C/S) architecture, so the poetry data to be classified may be entered by a client through a terminal or obtained directly from a poetry database on a local or cloud server. The poetry data refers to traditional ancient Chinese poetry, represented by ancient-style poetry, modern-style (regulated) poetry and metrical verse, such as Tang poems and Song ci.

S2, performing word segmentation processing on the poetry data according to a preset word stock, wherein the preset word stock at least comprises a first data set and a second data set, the first data set is a finite set and comprises all ancient Chinese phrase information, and the ancient Chinese phrase information at least comprises part of speech and classification labels; the second data set is a finite set, which contains all the single characters of ancient Chinese. Specifically, in the present invention, the preset word stock is derived from a published ancient Chinese reference work; for example, the first data set and the second data set are both derived from the Ancient Chinese Dictionary, The Commercial Press, ISBN: 978-7-100-01549-3.
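As a rough illustration of how such a preset word stock might be held in memory, the following Python sketch stores the first data set as a mapping from phrase to part of speech and classification label, and the second data set as a set of single characters. The names `PhraseInfo`, `FIRST_DATA_SET` and `SECOND_DATA_SET`, the label strings and the handful of entries are assumptions made for illustration; they are not taken from the dictionary cited above.

```python
from typing import Dict, NamedTuple, Set


class PhraseInfo(NamedTuple):
    """Information stored for one phrase (hypothetical layout)."""
    pos: str    # part of speech, e.g. "noun"
    label: str  # classification label, e.g. "landscape-pastoral"


# First data set: phrase -> part of speech and classification label.
# Toy entries only; a real word stock would be built from a dictionary.
FIRST_DATA_SET: Dict[str, PhraseInfo] = {
    "空山": PhraseInfo("noun", "landscape-pastoral"),
    "明月": PhraseInfo("noun", "landscape-pastoral"),
    "清泉": PhraseInfo("noun", "landscape-pastoral"),
}

# Second data set: every single character known to the word stock.
# Here it is derived from the toy phrases plus a few extra characters.
SECOND_DATA_SET: Set[str] = (
    {ch for phrase in FIRST_DATA_SET for ch in phrase} | set("松竹莲")
)

print(FIRST_DATA_SET["空山"].label)
print("松" in SECOND_DATA_SET)
```

Keeping both sets as hash-based containers makes each lookup in the later matching steps a constant-time operation, which matters if the word stock is as large as a full dictionary.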

S3, matching the poetry data with the second data set to obtain a first matching result, wherein the first matching result represents all single characters appearing in the poetry data. Since the second data set contains only single ancient Chinese characters, matching splits the poetry data into individual characters; that is, the first matching result is the set of single ancient Chinese characters appearing in the poetry data.
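Step S3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `match_characters` is hypothetical, the second data set is a toy set, and the result is kept as an ordered list (rather than an unordered set) so that later steps can still see where each character occurred.

```python
from typing import List, Set


def match_characters(poem: str, second_data_set: Set[str]) -> List[str]:
    """Return, in reading order, every character of the poem that is
    present in the second data set. Punctuation and unknown symbols
    are dropped because they are not in the set."""
    return [ch for ch in poem if ch in second_data_set]


# Toy second data set (assumption): just the characters of one couplet.
second_data_set = set("空山新雨后天气晚来秋")
first_match = match_characters("空山新雨后,天气晚来秋。", second_data_set)
print(first_match)
```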

S4, matching the phrases in the first data set according to the first matching result to obtain a second matching result, wherein the second matching result represents the parts of speech and classification labels of all phrases in the poetry data to be classified. Specifically, in this step, the phrases formed from each matched single character are looked up in the first data set. For example, for the character "空" (empty), a set of related phrases can be obtained, such as {hollow, empty room, empty mountain, empty illusion, empty silence, empty port, empty spirit, empty text, empty void}; this is given by way of example only, and the invention is not limited thereto. Here the single ancient Chinese character "空" serves as a root from which phrases are formed. By analogy, root-based phrase matching is performed for every ancient Chinese character in the poetry data, and the parts of speech and classification labels of the matched phrases are acquired at the same time for the subsequent word segmentation. The parts of speech include the content-word classes (nouns, verbs, adjectives, numerals, measure words and pronouns) and may also include the function-word classes (adverbs, prepositions, conjunctions, particles, interjections and onomatopoeia). The classification labels include landscape and pastoral poems, homesickness poems, frontier poems, poems on history and remembrance of the past, and poems on objects, and may also include ci tune names such as Ding Feng Bo, Nian Nu Jiao, Lang Tao Sha, Qing Ping Yue, Ru Meng Ling, Qin Yuan Chun, Huan Xi Sha and Pu Sa Man.
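The root-based lookup of step S4 might look roughly like the sketch below. It assumes, for illustration only, that the root is the first character of a phrase, that a candidate phrase counts as matched when it actually occurs in the poem at the root's position, and that the first data set is a plain dictionary. The function name, the toy entries and their labels are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Toy first data set (assumption): phrase -> (part of speech, label).
FIRST_DATA_SET: Dict[str, Tuple[str, str]] = {
    "空山": ("noun", "landscape-pastoral"),
    "空房": ("noun", "homesickness"),  # candidate for the root 空, absent from the poem
    "新雨": ("noun", "landscape-pastoral"),
    "明月": ("noun", "landscape-pastoral"),
    "清泉": ("noun", "landscape-pastoral"),
}


def match_phrases(poem: str, first_match: List[str],
                  first_data_set: Dict[str, Tuple[str, str]]
                  ) -> List[Tuple[str, str, str]]:
    """Return (phrase, part of speech, label) for each phrase found in the poem."""
    # Index candidate phrases by their root (first) character.
    by_root = defaultdict(list)
    for phrase in first_data_set:
        by_root[phrase[0]].append(phrase)

    roots = set(first_match)
    hits = []
    for i, ch in enumerate(poem):
        if ch not in roots:
            continue  # punctuation or a character outside the word stock
        for phrase in by_root.get(ch, ()):
            if poem.startswith(phrase, i):
                pos, label = first_data_set[phrase]
                hits.append((phrase, pos, label))
    return hits


poem = "空山新雨后,天气晚来秋。明月松间照,清泉石上流。"
first_match = [ch for ch in poem if ch not in ",。"]
second_match = match_phrases(poem, first_match, FIRST_DATA_SET)
print(second_match)
```

Scanning the original text instead of the joined character list keeps punctuation as a natural boundary, so a phrase is never assembled from characters on either side of a comma.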

S5, classifying the poetry data according to the parts of speech and/or the classification labels of all the phrases in the poetry data to be classified. For example, take Wang Wei's "Autumn Evening in the Mountains" (山居秋暝); its full text, after segmentation, is as follows:

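空山/新雨/后,天气/晚/来/秋。 (empty mountain / fresh rain / after, weather / evening / comes / autumn.)

明月/松/间/照,清泉/石/上/流。 (bright moon / pines / among / shines, clear spring / stones / over / flows.)

竹/喧/归/浣女,莲/动/下/渔舟。 (bamboo / rustles / return / washerwomen, lotus / stirs / descends / fishing boat.)

随意/春芳/歇,王孙/自/可/留。 (as it will / spring fragrance / fades, the noble wanderer / himself / may / stay.)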

After steps S3 and S4, phrases including "empty mountain" (空山), "fresh rain" (新雨), "weather" (天气), "bright moon" (明月) and "clear spring" (清泉) are obtained. In terms of parts of speech, the part-of-speech labels in the first data set show that most keywords in this poem are nouns. Frequency statistics are then computed over the classification label of each phrase; after sorting, the label "landscape and pastoral poem" occurs most often and can therefore be used as the basis for classification. That is, Wang Wei's "Autumn Evening in the Mountains" is classified as a landscape and pastoral poem.
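The frequency statistics of step S5 amount to a majority vote over classification labels. The sketch below is one possible reading, not the patented implementation: the function name, the optional part-of-speech filter `keep_pos`, and the toy labels (including the "farewell" label attached to 王孙) are assumptions. The text does not say how ties are resolved, so this sketch simply returns whichever label `Counter` reports first.

```python
from collections import Counter
from typing import Iterable, Optional, Set, Tuple


def classify_by_label_frequency(
    tagged_phrases: Iterable[Tuple[str, str, str]],
    keep_pos: Optional[Set[str]] = None,
):
    """Count labels over (phrase, part of speech, label) triples and
    return the most frequent label together with the full counts.
    If keep_pos is given, only phrases with those parts of speech vote."""
    counts = Counter(
        label
        for _phrase, pos, label in tagged_phrases
        if keep_pos is None or pos in keep_pos
    )
    if not counts:
        return None, counts
    return counts.most_common(1)[0][0], counts


# Toy second matching result (assumption).
second_match = [
    ("空山", "noun", "landscape-pastoral"),
    ("新雨", "noun", "landscape-pastoral"),
    ("明月", "noun", "landscape-pastoral"),
    ("清泉", "noun", "landscape-pastoral"),
    ("王孙", "noun", "farewell"),
]
best, counts = classify_by_label_frequency(second_match, keep_pos={"noun"})
print(best, dict(counts))
```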

In addition, conventional Chinese word segmentation algorithms generally fall into three types. The first is dictionary-based segmentation, which includes the forward maximum matching algorithm (FMM), the backward maximum matching algorithm (BMM) and the bidirectional maximum matching algorithm (BM). The second is segmentation based on statistical models, such as segmentation based on an N-gram language model. The third is segmentation based on sequence labeling, which includes HMM-based segmentation, CRF-based segmentation and end-to-end segmentation based on deep learning. However, the grammar and sentence punctuation of ancient Chinese are quite complicated, and blindly applying existing modern Chinese word segmentation techniques does not yield accurate segmentation results. The method adopted by the invention is close to the FMM algorithm but differs in that it uses at least a first data set and a second data set: single characters are obtained by matching against the second data set, each single character is then used as a root to match phrases, the parts of speech and classification labels of the phrases in the ancient poem are obtained, and the final classification result is given according to the occurrence frequency of the classification labels. This differs from all of the existing modern Chinese word segmentation algorithms described above.
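For comparison, a textbook forward maximum matching (FMM) segmenter can be sketched as follows. This is the conventional baseline that the paragraph above contrasts with, not the method of the invention; the function name, the `max_len` window of four characters and the toy vocabulary are assumptions.

```python
from typing import List, Set


def forward_maximum_matching(text: str, vocabulary: Set[str],
                             max_len: int = 4) -> List[str]:
    """Greedy left-to-right segmentation: at each position take the
    longest vocabulary word that starts there, falling back to a
    single character when nothing longer matches."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


tokens = forward_maximum_matching("空山新雨后", {"空山", "新雨"})
print(tokens)
```

FMM starts from a window of text and shrinks it until a vocabulary entry is found, whereas the method described here starts from each single character and expands to the phrases it can form, collecting part-of-speech and classification labels along the way.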

Further, as a preferred mode, before the poetry data to be classified is input, the poetry data can be preprocessed according to a third data set, wherein the third data set is a finite set containing all function words of ancient Chinese, and the preprocessing removes function words from the poetry data to be processed. Because a function word cannot serve as a root, that is, it cannot form a phrase with other characters, removing function words can greatly improve the execution efficiency of the method.
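The preprocessing step can be pictured as a simple filter. In the sketch below, the function name and the four-character third data set are assumptions chosen for illustration; a real third data set would list the function words of ancient Chinese exhaustively.

```python
from typing import Set


def remove_function_words(poem: str, third_data_set: Set[str]) -> str:
    """Drop every character that appears in the third data set
    (function words), leaving all other characters in place."""
    return "".join(ch for ch in poem if ch not in third_data_set)


# Toy third data set (assumption): a few common function words.
third_data_set = {"之", "乎", "也", "而"}
cleaned = remove_function_words("结庐在人境,而无车马喧。", third_data_set)
print(cleaned)
```

Removing these characters up front means that later steps never have to look up phrases for them, which is where the efficiency gain described above would come from.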

In other embodiments, preferably, the preset word stock further includes a fourth data set, the fourth data set including all single ancient Chinese characters that are contained in the second data set but not in the first data set, together with their parts of speech and classification labels; while phrases in the first data set are matched according to the first matching result, single characters in the fourth data set are also matched according to the first matching result and their parts of speech and classification labels are acquired. For example, still in Wang Wei's "Autumn Evening in the Mountains", the characters for pine (松), bamboo (竹) and lotus (莲) all have definite parts of speech and representative classification labels. In this embodiment, therefore, after the function words are removed, parts of speech and classification labels can also be obtained for single characters that do not serve as the root of any matched phrase, and these are counted and sorted together with the parts of speech and classification labels of the phrases to obtain the final classification result. It can be appreciated that classification accuracy can be further improved in this way.
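A minimal sketch of this combined vote is given below. It assumes that a single character only votes when it is not already covered by a matched phrase; the function name, both toy data sets and their labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Tuple

Entry = Tuple[str, str]  # (part of speech, classification label)


def classify_with_single_characters(poem: str,
                                    first_data_set: Dict[str, Entry],
                                    fourth_data_set: Dict[str, Entry]):
    """Vote with phrase labels first, then with labels of single
    characters that are not part of any matched phrase."""
    labels = []
    covered = set()  # text positions already explained by a phrase
    for i in range(len(poem)):
        for phrase, (_pos, label) in first_data_set.items():
            if poem.startswith(phrase, i):
                labels.append(label)
                covered.update(range(i, i + len(phrase)))
    for i, ch in enumerate(poem):
        if i not in covered and ch in fourth_data_set:
            labels.append(fourth_data_set[ch][1])
    counts = Counter(labels)
    best = counts.most_common(1)[0][0] if counts else None
    return best, counts


# Toy data sets (assumptions).
first_data_set = {
    "明月": ("noun", "landscape-pastoral"),
    "清泉": ("noun", "landscape-pastoral"),
}
fourth_data_set = {
    "松": ("noun", "landscape-pastoral"),
    "竹": ("noun", "landscape-pastoral"),
    "莲": ("noun", "landscape-pastoral"),
}
poem = "明月松间照,清泉石上流。竹喧归浣女,莲动下渔舟。"
best, counts = classify_with_single_characters(poem, first_data_set, fourth_data_set)
print(best, dict(counts))
```

With the toy data above, the three single characters add three votes to the two phrase votes, which illustrates how the fourth data set can strengthen the evidence for a label.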

Of course, what has been described above is a preferred embodiment of the present invention. It should be noted that those skilled in the art can make modifications and adaptations without departing from the principle of the present invention, and these modifications and adaptations are also considered to fall within the scope of protection of the present invention.

Claims

1. The ancient poetry classification method based on natural language processing comprises the following steps:

inputting poem data to be classified;

2. The ancient poetry classification method based on natural language processing as claimed in claim 1, wherein, before the poetry data to be classified is input, the poetry data is preprocessed according to a third data set, wherein the third data set is a finite set containing all function words of ancient Chinese, and the preprocessing removes the function words from the poetry data to be processed.

3. The ancient poetry classification method based on natural language processing as claimed in claim 2, wherein the preset word stock further comprises a fourth data set, the fourth data set including all single ancient Chinese characters that are contained in the second data set but not in the first data set, together with their parts of speech and classification labels; and while phrases in the first data set are matched according to the first matching result, single characters in the fourth data set are also matched according to the first matching result and their parts of speech and classification labels are acquired.