CN111897958B - Ancient poetry classification method based on natural language processing - Google Patents

Ancient poetry classification method based on natural language processing

Info

Publication number
CN111897958B
CN111897958B CN202010684783.5A CN202010684783A CN111897958B CN 111897958 B CN111897958 B CN 111897958B CN 202010684783 A CN202010684783 A CN 202010684783A CN 111897958 B CN111897958 B CN 111897958B
Authority
CN
China
Prior art keywords
poetry
data
data set
ancient
matching result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010684783.5A
Other languages
Chinese (zh)
Other versions
CN111897958A (en)
Inventor
邓桦
闫灵芝
孙娟娟
魏增辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN202010684783.5A
Publication of CN111897958A
Application granted
Publication of CN111897958B
Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an ancient poetry classification method based on natural language processing, which comprises the following steps: inputting poetry data to be classified; performing word segmentation on the poetry data according to a preset word stock, wherein the preset word stock comprises at least a first data set and a second data set; matching the poetry data against the second data set to obtain a first matching result, wherein the first matching result represents all single characters appearing in the poetry data; matching phrases in the first data set according to the first matching result to obtain a second matching result, wherein the second matching result represents the parts of speech and classification labels of all phrases in the poetry data to be classified; and classifying the poetry data according to the parts of speech and/or the classification labels of all phrases in the poetry data to be classified. With this method, an ancient poem can be segmented by a computer algorithm and the parts of speech and preset classification labels of its phrases can be obtained, so that the input poem can be classified efficiently.

Description

Ancient poetry classification method based on natural language processing
Technical Field
The invention relates to a text classification method, in particular to a natural language processing-based ancient poetry classification method.
Background
Classical Chinese poetry is the crystallization of five thousand years of Chinese thought and adds rich color to the national culture. In ancient times, poetic talent was an important measure of a person's ability and was included in the examinations used to select officials. After the New Culture Movement, poetry began to turn toward modern verse, whose language is plainer, more natural and easier to understand than that of classical verse. As times changed, classical poems also became less common in daily life. For these reasons, many modern people regard classical poetry as a traditional, even rigid, form of expression, and some are reluctant to engage with it. Nevertheless, classical Chinese poetry still has practical significance. First, it offers spiritual nourishment: when we express the joys, pleasures and lived experience of life through classical poems, we find that life becomes artistic, our emotions are sublimated, and the mind is enriched. Second, classical poetry is a symbol of Chinese culture. Competition in cultural soft power is unavoidable in today's world, and classical Chinese poetry symbolizes the breadth and depth of Chinese culture and is a bond linking ancient and modern culture. Finally, classical poetry plays a unique role in shaping and cultivating character. Writing classical poetry is an art that savors the "beauty" in life and makes that aesthetic feeling lasting. Classical poetry has a unique advantage in capturing the aesthetic experience of human life: through devices such as rhyme, antithesis, cadence and syllable patterns, it creates rhythm and imagery that together evoke an artistic conception.
Given the practical significance of ancient poetry discussed above, a deep knowledge of ancient poetry is valuable for modern people. However, apart from a few widely circulated poems, most poems are difficult for ordinary readers to learn and understand systematically, so systematic classification is needed to support learning. The commonly accepted categories of poetry include: landscape and pastoral ("mountain-water garden") poems, farewell poems, homesickness poems, frontier poems, poems on history and reflections on the past, and poems on objects. Despite the spread of electronic devices, no ancient poetry classification method based on a computer algorithm exists at present.
Disclosure of Invention
In view of the foregoing problems in the prior art, one aspect of the present invention provides an ancient poetry classification method based on natural language processing. The method can automatically classify large numbers of ancient poems by means of a natural language processing algorithm, making them easier for users to find and learn.
In order to achieve the above object, one embodiment of the present invention provides a method for classifying ancient poetry based on natural language processing, including:
inputting poetry data to be classified;
performing word segmentation processing on the poetry data according to a preset word stock, wherein the preset word stock at least comprises a first data set and a second data set, the first data set is a finite set and comprises all ancient Chinese phrase information, and the ancient Chinese phrase information at least comprises part of speech and classification labels; the second data set is a finite set, which contains all the single characters of ancient Chinese;
matching the poetry data with the second data set to obtain a first matching result, wherein the first matching result represents all single characters appearing in the poetry data;
matching the phrases in the first data set according to the first matching result to obtain a second matching result, wherein the second matching result represents parts of speech and classification labels of all the phrases in the poetry data to be classified;
and classifying the poetry data according to the parts of speech and/or the classification labels of all the phrases in the poetry data to be classified.
Preferably, before the poetry data to be classified is input, the poetry data is preprocessed according to a third data set, wherein the third data set is a finite set containing all ancient Chinese function words, and the preprocessing removes the function words from the poetry data to be processed.
Preferably, the preset word stock further comprises a fourth data set, wherein the fourth data set comprises all single ancient Chinese characters that are contained in the second data set but not in the first data set, together with their parts of speech and classification labels; while phrases in the first data set are matched according to the first matching result, single characters in the fourth data set are also matched according to the first matching result, and their parts of speech and classification labels are acquired.
Compared with the prior art, the ancient poetry classification method based on natural language processing can segment an ancient poem by a computer algorithm and obtain the parts of speech and preset classification labels, so that the input poem can be classified efficiently. This makes it easier for modern readers to study ancient poetry systematically.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
This document provides an overview of various implementations or examples of the technology described in this disclosure, and is not a comprehensive disclosure of the full scope or all of the features of the disclosed technology.
Drawings
FIG. 1 is a flow chart of the ancient poetry classification method based on natural language processing of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present disclosure. It will be apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments, which can be made by one of ordinary skill in the art without the need for inventive faculty, are within the scope of the present disclosure, based on the described embodiments of the present disclosure.
Unless defined otherwise, technical or scientific terms used in this disclosure should be given the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The use of the terms "comprising" or "includes" and the like in this disclosure is intended to cover an element or article listed after that term and equivalents thereof without precluding other elements or articles. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may also include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may also be changed when the absolute position of the object to be described is changed.
In order to keep the following description of the embodiments of the present disclosure clear and concise, the present disclosure omits detailed description of known functions and known components.
As shown in fig. 1, a method for classifying ancient poetry based on natural language processing according to an embodiment of the present invention includes:
S1, inputting poetry data to be classified. The classification method can be applied to a computer system based on a client/server (C/S) architecture, so the poetry data to be classified may be entered by a client through a terminal, or obtained directly from a poetry database on a local or cloud server. The poetry data refers to traditional Chinese poetry, represented by ancient-style poetry, recent-style poetry and regulated verse, such as Tang poems and Song ci.
S2, performing word segmentation on the poetry data according to a preset word stock, wherein the preset word stock comprises at least a first data set and a second data set; the first data set is a finite set containing all ancient Chinese phrase information, and the ancient Chinese phrase information includes at least a part of speech and a classification label; the second data set is a finite set containing all single characters of ancient Chinese. Specifically, in the present invention the preset word stock is derived from a published ancient Chinese reference book; for example, the first data set and the second data set are both derived from the Ancient Chinese Dictionary, The Commercial Press, ISBN: 978-7-100-01549-3.
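The word stock described in S2 can be pictured as two simple containers. The following is a minimal Python sketch under stated assumptions: the `Entry` class, the label strings, the handful of toy entries and the `check_word_stock` helper are all hypothetical and are not taken from the patent or from the dictionary it names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Part of speech and classification label attached to a lexicon item."""
    pos: str
    label: str


# First data set: phrases mapped to part of speech and classification label.
# These three entries are toy examples, not real dictionary data.
FIRST_DATA_SET = {
    "空山": Entry("noun", "landscape-pastoral"),
    "明月": Entry("noun", "landscape-pastoral"),
    "清泉": Entry("noun", "landscape-pastoral"),
}

# Second data set: the single characters known to the word stock (toy subset).
SECOND_DATA_SET = frozenset("空山明月清泉松间照石上流")


def check_word_stock(phrases, characters):
    """Return the phrase characters that are missing from the character set."""
    return {ch for phrase in phrases for ch in phrase if ch not in characters}


# An empty set means every character used in a phrase is also a known single character.
print(check_word_stock(FIRST_DATA_SET, SECOND_DATA_SET))
```

A dictionary keyed by phrase keeps the part-of-speech and label lookup constant-time, which suits the repeated matching performed in the later steps.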
S3, matching the poetry data against the second data set to obtain a first matching result, wherein the first matching result represents all single characters appearing in the poetry data. Since the second data set contains only single ancient Chinese characters, matching divides the poetry data into single characters; that is, the first matching result is the set of single ancient Chinese characters appearing in the poetry data.
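Step S3 amounts to filtering the poem through the character set. The sketch below is one possible reading, assuming the second data set is held as a Python `frozenset`; the function name `match_single_characters` and the choice to keep first-occurrence order are illustrative assumptions, since the source only says the result is a set of characters.

```python
def match_single_characters(poem, second_data_set):
    """Return the characters of `poem` found in the second data set.

    Characters are returned in order of first appearance, without repeats.
    Punctuation and unknown characters are skipped because they are not
    members of the second data set.
    """
    seen = set()
    matched = []
    for ch in poem:
        if ch in second_data_set and ch not in seen:
            seen.add(ch)
            matched.append(ch)
    return matched


# Toy second data set covering only the first couplet of the example poem.
SECOND_DATA_SET = frozenset("空山新雨后天气晚来秋")
first_matching_result = match_single_characters("空山新雨后,天气晚来秋。", SECOND_DATA_SET)
print(first_matching_result)
```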
S4, matching phrases in the first data set according to the first matching result to obtain a second matching result, wherein the second matching result represents the parts of speech and classification labels of all phrases in the poetry data to be classified. Specifically, in this step, the single ancient Chinese characters already matched are used to search the first data set for phrases that contain them. For example, for the character "空" ("empty"), a set of phrases containing that character can be obtained, for example {hollow, empty room, empty mountain, empty illusion, empty silence, empty port, empty spirit, empty text, empty void}; this set is given by way of example only, and the invention is not limited to it. Here the single ancient Chinese character "空" acts as a root from which phrases are formed. By analogy, single-character-based phrase matching is performed for every ancient Chinese character in the poetry data, and the part of speech and classification label of each matched phrase are acquired at the same time, so that natural-language-based word segmentation can be carried out subsequently. The parts of speech include nouns, verbs, adjectives, numerals, measure words and pronouns, and may also include the function words: adverbs, prepositions, conjunctions, auxiliary words, interjections and onomatopoeia. The classification labels include landscape and pastoral ("mountain-water garden") poems, homesickness poems, frontier poems, poems on history and reflections on the past, and poems on objects, and may also include ci tune names such as Ding Feng Bo, Nian Nu Jiao, Lang Tao Sha, Qing Ping Yue, Ru Meng Ling, Qin Yuan Chun, Huan Xi Sha and Pu Sa Man.
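The root-based lookup in S4 can be sketched as follows. This is an illustration only: the toy first data set, its labels and both function names are hypothetical, and the rule that a candidate phrase is kept only if it actually occurs as a substring of the poem is an assumption, because the source does not spell out how candidates are filtered.

```python
# Toy first data set: phrase -> (part of speech, classification label).
FIRST_DATA_SET = {
    "空山": ("noun", "landscape-pastoral"),
    "空房": ("noun", "homesickness"),
    "新雨": ("noun", "landscape-pastoral"),
}


def candidates_for_root(root, first_data_set):
    """All phrases in the first data set that contain the root character."""
    return {phrase for phrase in first_data_set if root in phrase}


def match_phrases(poem, matched_chars, first_data_set):
    """Map each phrase that occurs in the poem to its (part of speech, label)."""
    result = {}
    for root in matched_chars:
        for phrase in candidates_for_root(root, first_data_set):
            # Assumption: a candidate counts only if it appears in the poem itself.
            if phrase in poem:
                result[phrase] = first_data_set[phrase]
    return result


poem = "空山新雨后"
second_matching_result = match_phrases(poem, list(poem), FIRST_DATA_SET)
print(second_matching_result)
```

Scanning the whole first data set for every root is quadratic in the worst case; an index from character to phrases would be a natural optimization for a full-size dictionary.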
S5, classifying the poetry data according to the parts of speech and/or the classification labels of all phrases in the poetry data to be classified. Take Wang Wei's "Autumn Evening in the Mountains" (山居秋暝) as an example; its full text, segmented, is as follows:
空山/新雨/后,天气/晚/来/秋。 (empty mountain / new rain / after, weather / evening / comes / autumn.)
明月/松/间/照,清泉/石/上/流。 (bright moon / pines / among / shines, clear spring / rocks / over / flows.)
竹/喧/归/浣女,莲/动/下/渔舟。 (bamboo / rustles / return / washerwomen, lotus / stirs / descends / fishing boat.)
随意/春芳/歇,王孙/自/可/留。 (at will / spring fragrance / fades, noble wanderer / himself / may / stay.)
After steps S3 and S4, phrases such as "empty mountain", "new rain", "weather", "bright moon" and "clear spring" are obtained. According to the part-of-speech labels in the first data set, most keywords in this poem are nouns. Frequency statistics are then computed over the classification labels of the phrases; after sorting, the landscape and pastoral ("mountain-water garden") label occurs most often, so it is used as the basis for classification: Wang Wei's "Autumn Evening in the Mountains" is classified as a landscape and pastoral poem.
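The frequency statistics described for S5 map naturally onto a label count followed by a sort. The sketch below assumes the second matching result is a dictionary from phrase to a (part of speech, label) pair; the sample labels, including the one assigned to "王孙", are hypothetical, and tie-breaking between equally frequent labels is not specified in the source.

```python
from collections import Counter


def classify(second_matching_result):
    """Return the most frequent classification label, or None if nothing matched."""
    counts = Counter(label for _pos, label in second_matching_result.values())
    if not counts:
        return None
    # most_common sorts labels by descending frequency; take the top one.
    return counts.most_common(1)[0][0]


# Hypothetical second matching result for the example poem.
second_matching_result = {
    "空山": ("noun", "landscape-pastoral"),
    "新雨": ("noun", "landscape-pastoral"),
    "明月": ("noun", "landscape-pastoral"),
    "清泉": ("noun", "landscape-pastoral"),
    "王孙": ("noun", "farewell"),
}
print(classify(second_matching_result))
```

The same counter could be restricted to a chosen part of speech (for instance nouns only) before voting, which is one way to read the "parts of speech and/or classification labels" wording.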
In addition, conventional Chinese word segmentation algorithms generally fall into three types. The first is word segmentation based on a word list, including the forward maximum matching algorithm (FMM), the reverse maximum matching algorithm (BMM) and the bidirectional maximum matching algorithm (BM). The second is word segmentation based on a statistical model, such as segmentation based on an N-gram language model. The third is word segmentation based on sequence labeling, including HMM-based segmentation, CRF-based segmentation and end-to-end segmentation based on deep learning. However, the grammar and sentence division of ancient Chinese are quite complicated, and blindly applying existing modern Chinese word segmentation techniques does not yield accurate segmentation results. The method adopted by the invention is close to the FMM algorithm, but differs in that it uses at least a first data set and a second data set: single characters are obtained by matching against the second data set, each single character is then used as a root to match phrases, the parts of speech and classification labels of the phrases in the ancient poem are obtained, and the final classification result is given according to the frequency of the classification labels. This differs from every one of the existing modern Chinese word segmentation algorithms listed above.
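For comparison, the forward maximum matching (FMM) algorithm mentioned above can be sketched in a few lines. This is the textbook greedy technique, not the patented method; the function name, the `max_len` parameter and the toy vocabulary are assumptions made for illustration.

```python
def forward_maximum_matching(text, vocabulary, max_len=None):
    """Greedy left-to-right segmentation that prefers the longest vocabulary word."""
    longest = max(1, max_len or max((len(w) for w in vocabulary), default=1))
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink it until a word is found.
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens


vocabulary = {"空山", "新雨", "天气"}
print(forward_maximum_matching("空山新雨后", vocabulary))
```

FMM commits to one segmentation of the text from left to right, whereas the method in the description first collects single characters and then looks up phrases by root, which is the difference the paragraph above points out.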
Furthermore, as a preferred mode, before the poetry data to be classified is input, the poetry data may be preprocessed according to a third data set, wherein the third data set is a finite set containing all ancient Chinese function words, and the preprocessing removes the function words from the poetry data to be processed. Because a function word cannot serve as a root, that is, it cannot form a phrase with other characters, removing function words can greatly improve the execution efficiency of the method.
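The preprocessing step can be illustrated with a simple character filter. In this sketch the third data set is a small hypothetical sample of single-character function words, and `remove_function_words` is an illustrative name; a real function-word list would come from the word stock, and multi-character function words would need a different treatment.

```python
# Toy third data set: a few single-character function words (hypothetical sample).
THIRD_DATA_SET = frozenset("之乎者也矣焉哉而")


def remove_function_words(poem, third_data_set):
    """Drop every character that belongs to the function-word set."""
    return "".join(ch for ch in poem if ch not in third_data_set)


line = "念天地之悠悠,独怆然而涕下"
print(remove_function_words(line, THIRD_DATA_SET))
```

Fewer characters reach steps S3 and S4 after this filter, which is where the efficiency gain described above would come from.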
In other embodiments, preferably, the preset word stock further includes a fourth data set, the fourth data set including all single ancient Chinese characters that are contained in the second data set but not in the first data set, together with their parts of speech and classification labels; while phrases in the first data set are matched according to the first matching result, single characters in the fourth data set are also matched according to the first matching result, and their parts of speech and classification labels are acquired. For example, still in Wang Wei's "Autumn Evening in the Mountains", the characters for pine, bamboo and lotus each have a definite part of speech and a representative classification label. In this embodiment, therefore, after the function words are removed, parts of speech and classification labels are also obtained for the single characters that did not form a phrase as a root, and these are counted and sorted together with the parts of speech and classification labels of the phrases to obtain the final classification result. In this way, classification accuracy can be further improved.
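Combining phrase labels with single-character labels from the fourth data set might look like the sketch below. The data, the labels and the function name `count_labels` are hypothetical, and the rule that a character is counted only when no matched phrase covers it is an assumption drawn from the wording "single characters which do not form the phrase".

```python
from collections import Counter


def count_labels(poem, phrase_hits, fourth_data_set):
    """Count labels from matched phrases plus uncovered single characters."""
    # Characters already accounted for by a matched phrase.
    covered = {ch for phrase in phrase_hits for ch in phrase}
    counts = Counter(label for _pos, label in phrase_hits.values())
    for ch in poem:
        if ch not in covered and ch in fourth_data_set:
            _pos, label = fourth_data_set[ch]
            counts[label] += 1
    return counts


poem = "明月松间照,清泉石上流。竹喧归浣女,莲动下渔舟。"
phrase_hits = {
    "明月": ("noun", "landscape-pastoral"),
    "清泉": ("noun", "landscape-pastoral"),
}
# Toy fourth data set: single characters with their own labels.
fourth_data_set = {
    "松": ("noun", "landscape-pastoral"),
    "竹": ("noun", "objects"),
    "莲": ("noun", "objects"),
}
counts = count_labels(poem, phrase_hits, fourth_data_set)
print(counts.most_common())
```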
Of course, the above is a preferred embodiment of the present invention. It should be noted that those skilled in the art can make modifications and adaptations without departing from the principle of the present invention, and these modifications and adaptations are also considered to fall within the scope of protection of the present invention.

Claims (3)

1. The ancient poetry classification method based on natural language processing comprises the following steps:
inputting poetry data to be classified;
performing word segmentation processing on the poetry data according to a preset word stock, wherein the preset word stock at least comprises a first data set and a second data set, the first data set is a finite set and comprises all ancient Chinese phrase information, and the ancient Chinese phrase information at least comprises part of speech and classification labels; the second data set is a finite set, which contains all the single characters of ancient Chinese;
matching the poetry data with the second data set to obtain a first matching result, wherein the first matching result represents all single characters appearing in the poetry data;
matching the phrases in the first data set according to the first matching result to obtain a second matching result, wherein the second matching result represents parts of speech and classification labels of all the phrases in the poetry data to be classified;
and classifying the poetry data according to the parts of speech and/or the classification labels of all the phrases in the poetry data to be classified.
2. The ancient poetry classification method based on natural language processing as claimed in claim 1, wherein, before the poetry data to be classified is input, the poetry data is preprocessed according to a third data set, wherein the third data set is a finite set containing all ancient Chinese function words, and the preprocessing removes the function words from the poetry data to be processed.
3. The ancient poetry classification method based on natural language processing as claimed in claim 2, wherein the preset word stock further comprises a fourth data set, the fourth data set including all single ancient Chinese characters that are contained in the second data set but not in the first data set, together with their parts of speech and classification labels; while phrases in the first data set are matched according to the first matching result, single characters in the fourth data set are also matched according to the first matching result, and their parts of speech and classification labels are acquired.
CN202010684783.5A 2020-07-16 2020-07-16 Ancient poetry classification method based on natural language processing Active CN111897958B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010684783.5A CN111897958B (en) 2020-07-16 2020-07-16 Ancient poetry classification method based on natural language processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010684783.5A CN111897958B (en) 2020-07-16 2020-07-16 Ancient poetry classification method based on natural language processing

Publications (2)

Publication Number Publication Date
CN111897958A CN111897958A (en) 2020-11-06
CN111897958B CN111897958B (en) 2024-03-12

Family

ID=73189137

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010684783.5A Active CN111897958B (en) 2020-07-16 2020-07-16 Ancient poetry classification method based on natural language processing

Country Status (1)

Country Link
CN (1) CN111897958B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112434137B (en) * 2020-12-11 2023-04-11 乐山师范学院 Poetry retrieval method and system based on artificial intelligence

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104778171A (en) * 2014-01-10 2015-07-15 携程计算机技术(上海)有限公司 Character string matching system and method
CN107688596A (en) * 2017-06-09 2018-02-13 平安科技(深圳)有限公司 Happen suddenly topic detecting method and burst topic detection equipment
CN107918605A (en) * 2017-11-22 2018-04-17 北京百度网讯科技有限公司 Participle processing method, device, mobile terminal and computer-readable recording medium
CN109471936A (en) * 2018-10-11 2019-03-15 上海叔本华智能科技有限公司 A kind of method and system for plant maintenance information progress tagsort
CN109885836A (en) * 2019-02-21 2019-06-14 陈包容 A method of precisely segment
CN109918509A (en) * 2019-03-12 2019-06-21 黑龙江世纪精彩科技有限公司 Scene generating method and scene based on information extraction generate the storage medium of system
CN110188781A (en) * 2019-06-06 2019-08-30 焦点科技股份有限公司 A kind of ancient poetry text automatic identifying method based on deep learning
CN110276052A (en) * 2019-06-10 2019-09-24 北京科技大学 A kind of archaic Chinese automatic word segmentation and part-of-speech tagging integral method and device
CN110825850A (en) * 2019-11-07 2020-02-21 哈尔滨工业大学(深圳) Natural language theme classification method and device
WO2020082562A1 (en) * 2018-10-25 2020-04-30 平安科技(深圳)有限公司 Symbol identification method, apparatus, device, and storage medium
CN111160026A (en) * 2019-12-18 2020-05-15 北京明略软件***有限公司 Model training method and device, and method and device for realizing text processing
CN111221943A (en) * 2020-01-13 2020-06-02 口口相传(北京)网络技术有限公司 Query result matching degree calculation method and device

Also Published As

Publication number Publication date
CN111897958A (en) 2020-11-06

Similar Documents

Publication Publication Date Title
Black et al. Statistically-driven computer grammars of English: The IBM/Lancaster approach
CN111832275B (en) Text creation method, device, equipment and storage medium
CN109635297B (en) Entity disambiguation method and device, computer device and computer storage medium
CN110765759B (en) Intention recognition method and device
CN107368474B (en) Automatic efficient translation and conversion method from Chinese to braille
CN100568225C (en) The Words symbolization processing method and the system of numeral and special symbol string in the text
CN110609983B (en) Structured decomposition method for policy file
CN102272755A (en) Method for semantic processing of natural language using graphical interlingua
CN111143571B (en) Entity labeling model training method, entity labeling method and device
CN112069826A (en) Vertical domain entity disambiguation method fusing topic model and convolutional neural network
CN112948543A (en) Multi-language multi-document abstract extraction method based on weighted TextRank
WO2009046612A1 (en) System for synthetically cognizing entire semantic information and applications thereof
CN112528649A (en) English pinyin identification method and system for multi-language mixed text
CN116092472A (en) Speech synthesis method and synthesis system
CN113609840B (en) Chinese law judgment abstract generation method and system
CN111897958B (en) Ancient poetry classification method based on natural language processing
CN111597302B (en) Text event acquisition method and device, electronic equipment and storage medium
CN103336803A (en) Method for generating name-embedded spring festival scrolls through computer
CN111178009B (en) Text multilingual recognition method based on feature word weighting
CN108491384A (en) A kind of auxiliary writing system of patent application document
Sacher Interactions in Chinese: designing interfaces for Asian languages
CN108763487B (en) Mean Shift-based word representation method fusing part-of-speech and sentence information
Cristea et al. From scan to text. Methodology, solutions and perspectives of deciphering old cyrillic Romanian documents into the Latin script
CN114722829A (en) Automatic generation method of ancient poems based on language model
CN104866607B (en) A kind of Dongba character textual research and explain database building method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant