CN117473078A

CN117473078A - Visual reading system of long literature based on cross-domain named entity recognition

Info

Publication number: CN117473078A
Application number: CN202311298279.1A
Authority: CN
Inventors: 刘颖凡; 曹云昀; 耿岱琳; 梅文娟; 程涛; 高明
Original assignee: East China Normal University
Current assignee: East China Normal University
Priority date: 2023-10-09
Filing date: 2023-10-09
Publication date: 2024-01-30

Abstract

The invention provides a visual reading system for long literary works based on cross-domain named entity recognition, comprising: a data acquisition module, which acquires the source text of a literary work through a web crawler program; a literature entity recognition module, which uses an open-source named entity recognition model to recognize character entities, place entities and family entities in the source text and generate a coarse-grained entity data set; a literature entity optimization module, which trains a parameter-transfer-based cross-domain named entity recognition model on the coarse-grained entity data set, optimizes the model results through a rule weight network, and then recognizes the source text again to generate fine-grained character, place and family entity data sets; and a literature-entity-oriented visual analysis and display module, which uses open-source visualization tools to analyze and display the fine-grained literature entity data set and comprises four units: character relationship network, character movement trajectory, character emotion change, and character appearance frequency.

Description

Visual reading system of long literature based on cross-domain named entity recognition

Technical Field

The invention relates to the technical field of natural language processing and data visualization, in particular to a long literature visual reading system based on cross-domain named entity recognition.

Background

In the internet age of "information explosion", the tension between vast, complex information and readers' limited time and attention has led people to pursue fragmented information more and more, while paying less attention to traditional literature, especially long literary works. Long literary works are usually lengthy, with complex character relationships and winding plots; they are often obscure and hard to follow, so readers find it difficult to persevere. With the rapid development of information science, the literary field has gradually begun to explore the application value and development potential of text visual analysis technology. Text analysis can draw on natural language processing techniques, while visualization conveys information clearly and effectively in graphical form. Combining text analysis and visualization in the literary field brings together the core strengths of both, and lets readers understand the content and characteristics of a literary work deeply and quickly through interaction.

One existing visualization method for literary works is disclosed in publication number CN116151255A. That patent discloses a text analysis and visualization method and system, but the method only visualizes how often characters appear; its form is relatively limited and cannot meet readers' needs.

Disclosure of Invention

The invention aims to provide a visual reading system for long literary works based on cross-domain named entity recognition. Taking the text of a novel as the research object, the system performs character relationship analysis, character trajectory analysis, emotion change analysis, appearance frequency analysis and the like on the original text, summarizes the factual patterns and emotional factors it contains, and presents them in a more efficient and intuitive way, so that users can read the novel more conveniently and understand its characters, themes and emotions more clearly. The invention builds a visualization platform focused on intelligent analysis of, and interaction with, long literary works, thereby improving on the traditional way of reading, increasing interest in reading, improving readers' overall understanding of literary works, making lengthy and complex stories clear and easy to follow, and drawing attention to literary works.

The specific technical scheme for realizing the aim of the invention is as follows:

a visual reading system for long literature based on cross-domain named entity recognition, comprising:

a data acquisition module, which acquires the source text of a literary work from electronic resources using a Python web crawler program;

a literature entity recognition module, which uses an open-source named entity recognition model to recognize character entities, place entities and family entities in the source text and generate a coarse-grained entity data set;

a literature entity optimization module, which trains a parameter-transfer-based cross-domain named entity recognition model on the coarse-grained entity data set, designs a rule weight network by introducing context constraint rules for specific entity types in order to optimize the model results, and then recognizes the source text of the literary work again to generate fine-grained character entity, place entity and family entity data sets;

a literature-entity-oriented visual analysis and display module, which uses open-source visualization tools to analyze and display the fine-grained literature entity data set, and comprises four units: character relationship network visualization, character movement trajectory visualization, character emotion change visualization and character appearance frequency visualization.

Preferably, in the literature entity optimization module of the visual reading system for long literature based on cross-domain named entity recognition, the context constraint rules for specific entity types include: rules for identifying character entities, rules for identifying place entities, and rules for identifying family entities.

Preferably, the literature-entity-oriented visual analysis and display module of the visual reading system for long literature based on cross-domain named entity recognition comprises:

a character relationship network visualization unit, used for displaying the complex character relationship network in a literary work. The strength of the relationship between two characters is evaluated by counting how many times they appear together in the same sentence. Using the open-source visualization tool D3.js, character names serve as nodes and the relationship strength values between characters serve as edge weights, so that the complex character relationship network is presented visually.

Preferably, the literature-entity-oriented visual analysis and display module of the visual reading system for long literature based on cross-domain named entity recognition further comprises:

a character movement trajectory visualization unit, used for displaying the movement trajectories and important events of characters in a literary work. Place entities in the chapters in which a character appears are extracted by character name and in order of appearance to construct a character trajectory data set; important events are extracted by a rule-based matching method; and a line chart of each character's movement trajectory is drawn using the open-source visualization tool ECharts. The unit also integrates interactive functions linking the storylines and places of different chapters, providing a reading navigation tool that helps readers understand and track complex character relationships and story development;

a character emotion change visualization unit, used for showing how a character's emotions change as the plot of a literary work develops. The unit uses the open-source tool NLTK to extract sentences describing each character, performs emotion analysis on those sentences, and calculates each character's score on different emotion dimensions; the open-source visualization tool ECharts is then used to present the changes in a character's emotions across chapters as a line chart;

a character appearance frequency visualization unit, used for showing how the appearance frequency of characters changes as the chapters of a literary work progress. A character appearance frequency data set is constructed by counting the number of times each character appears in each chapter, and a line chart of appearance frequency is drawn using the open-source visualization tool ECharts. By dragging a time axis, the user can check how many times a given character appears across the whole novel, which helps in understanding how the character's importance changes as the plot develops.

Compared with the prior art, the invention has at least the following advantages or beneficial effects:

(1) Visual presentation of novel text: by processing and analyzing the text, the invention builds a complete visualization system for literary works. Through charts, graphs and other visual elements, readers can intuitively understand the plot, character relationships and other important elements of a novel in a new form. This visual approach offers a new reading experience and makes literary works more vivid and easier to understand and appreciate.

(2) Automated entity extraction: in the aspect of the extraction of fictitious characters and place entities of the novel, the invention adopts an automatic method. Through a cross-domain named entity recognition technology, entities and relations in novels can be efficiently and accurately extracted. Compared with the traditional manual extraction mode, the automatic extraction method greatly reduces the manual workload and improves the efficiency.

(3) Lowering the reading threshold: the invention enables readers to learn about the content of novels in a faster way, thereby lowering the threshold of reading. By processing and analyzing the novel text, key information can be extracted, and a brief summary or abstract is constructed. The reader does not need to fully read the entire book to understand its main content. The method not only saves reading time, but also increases the interest and interactivity of reading.

Drawings

FIG. 1 is a schematic diagram of the structure of the present invention;

FIG. 2 is a flow chart of the literature entity optimization module of the present invention;

FIG. 3 is a diagram of a relationship network of people provided by an embodiment of the present invention;

FIG. 4 is a diagram showing a change in the movement track of a person according to an embodiment of the present invention;

FIG. 5 is a graph of a person's emotion change provided by an embodiment of the present invention;

FIG. 6 is a graph of changes in character appearance frequency provided by an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the following specific examples and drawings. Except for the content specifically mentioned below, the procedures, conditions, experimental methods and the like for carrying out the present invention are general knowledge and common knowledge in the art, and the present invention is not particularly limited in these respects.

Examples

FIG. 1 is a schematic diagram of the visual reading system for long literature based on cross-domain named entity recognition. As shown in FIG. 1, the system comprises a data acquisition module, a literature entity recognition module, a literature entity optimization module, and a literature-entity-oriented visual analysis and display module.

In this embodiment, the selected literary work is the English-language novel series A Song of Ice and Fire, and the literature entities include the character names, place names and family names in the work.

In this embodiment, the text of A Song of Ice and Fire is captured with an open-source crawler tool, and the literature entity recognition module uses the open-source natural language processing library spaCy to perform entity recognition, constructing a novel text entity data set named GOT that contains entities such as character names, place names and family names.

In this embodiment, the literature entity optimization module (shown in FIG. 2) is implemented in the following steps:

Step 1: Using the CoNLL-2003 data set as the source data set, several models are pre-trained on it, including models based on BERT, BiLSTM and BiLSTM+CRF. By learning the named entity features in the source data set, these models obtain a shared semantic representation.

Step 2: Taking GOT as the target data set, the pre-trained models are fine-tuned. During fine-tuning, the GOT data set is used for supervised training so that the models are better suited to entity recognition tasks in the literary domain.

Step 3: Context constraint rules for specific entity types are introduced. The rules for identifying character entities comprise a matching rule and a mention rule: the matching rule states that if an entity appears before a verb as a subject or object, it is likely to be a character entity; the mention rule states that if an entity is mentioned in a sentence together with a known character entity, it is likely to be a character entity. The rules for identifying place entities comprise a matching rule and a description rule: the matching rule states that if an entity appears after a place preposition (e.g., on, at), it is likely to be a place entity; the description rule states that if an entity is mentioned in a sentence together with a known place entity, it is likely to be a place entity. The rules for identifying family entities comprise a description rule, which states that if an entity is mentioned in a sentence together with a known family entity, it is likely to be a family entity.

Step 4: A rule weight network is designed. A fully connected neural network serves as the weight network; it takes the feature representation of a rule as input and outputs the weight of that rule. The weights are added as coefficients during fine-tuning to optimize the model results.

Step 5: After testing a series of model transfer methods, the BiLSTM+CRF-based pre-trained model was found to perform best, and it was therefore chosen as the final pre-trained model.

Step 6: The BiLSTM+CRF-based cross-domain named entity recognition model optimized in Steps 1 to 4 is used to recognize the entities in the text of A Song of Ice and Fire again, yielding a fine-grained entity data set for the subsequent visual analysis and display module.
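As an illustration of Steps 3 and 4, the following is a minimal Python sketch under stated assumptions, not the implementation used in the embodiment. The seed entity sets, the small verb list (standing in for a part-of-speech tagger), the preposition "in", the network size and the fixed parameter values are all hypothetical. The sketch turns the five context constraint rules into binary features, passes them through a small fully connected network to obtain one weight per rule, and sums the weighted rule evidence per entity type.

```python
import math

# Hypothetical seed entities; in practice these could come from the coarse-grained data set.
KNOWN_CHARACTERS = {"Jon", "Arya"}
KNOWN_PLACES = {"Winterfell"}
KNOWN_FAMILIES = {"Stark"}
PLACE_PREPOSITIONS = {"on", "at", "in"}          # "on", "at" are from the text; "in" is an assumption
VERBS = {"said", "rode", "smiled", "walked"}     # assumed stand-in for a POS tagger

# Entity type that each of the five rules votes for, in feature order.
RULE_TYPES = ["PER", "PER", "LOC", "LOC", "FAM"]


def rule_features(tokens, index):
    """Return one 0/1 feature per rule for the candidate entity at tokens[index]."""
    entity = tokens[index]
    others = set(tokens) - {entity}
    nxt = tokens[index + 1].lower() if index + 1 < len(tokens) else ""
    prev = tokens[index - 1].lower() if index > 0 else ""
    return [
        1.0 if nxt in VERBS else 0.0,               # character matching rule
        1.0 if others & KNOWN_CHARACTERS else 0.0,  # character mention rule
        1.0 if prev in PLACE_PREPOSITIONS else 0.0, # place matching rule
        1.0 if others & KNOWN_PLACES else 0.0,      # place description rule
        1.0 if others & KNOWN_FAMILIES else 0.0,    # family description rule
    ]


def dense(x, weights, bias):
    """One fully connected layer: weights is a list of rows, one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def rule_weights(features, params):
    """Fully connected weight network: rule features in, one weight in (0, 1) per rule out."""
    hidden = [max(0.0, h) for h in dense(features, params["w1"], params["b1"])]  # ReLU
    logits = dense(hidden, params["w2"], params["b2"])
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]                          # sigmoid


def rule_scores(features, weights):
    """Sum the weighted rule evidence for each entity type."""
    scores = {"PER": 0.0, "LOC": 0.0, "FAM": 0.0}
    for feature, weight, entity_type in zip(features, weights, RULE_TYPES):
        scores[entity_type] += feature * weight
    return scores


# Untrained placeholder parameters (5 inputs -> 3 hidden units -> 5 rule weights).
PARAMS = {
    "w1": [[0.1] * 5 for _ in range(3)], "b1": [0.0] * 3,
    "w2": [[0.2] * 3 for _ in range(5)], "b2": [0.0] * 5,
}

tokens = "Jon rode to Winterfell with Robb".split()
features = rule_features(tokens, tokens.index("Robb"))
weights = rule_weights(features, PARAMS)
scores = rule_scores(features, weights)
print(features, scores)
```

In this sketch the candidate "Robb" triggers both the character mention rule and the place description rule, which is the kind of conflict that learned rule weights are meant to resolve. How the weights enter the fine-tuning loss is not specified in the text and is left out here.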

In this embodiment, the visual analysis and display module for literature entities includes:

In a specific example, the character relationship network visualization unit is implemented in the following steps:

Step 1: Construct the character relationships. Based on the fine-grained literature entity data set, the number of interactions between characters is counted. For example, if two character entities appear in the same sentence of the novel, one interaction between the two characters is recorded.

Step 2: Form the character relationship data set. The character interaction data obtained from the statistics are organized into a database table structure suitable for visualization, so that they can be read and presented at display time.

Step 3: Visual display. An interactive character relationship network diagram is built with the open-source visualization library D3.js, as shown in FIG. 3. Character nodes represent character entities and integrate basic information about each character, including name, nickname, family information and character profile. Relationships between characters are represented by lines drawn between character nodes, and the lines are labeled to enrich the content of the relationship network.

Step 4: Interaction. During visual display, the user can click a node in the network diagram to view a character's information, including name, nickname, family information and character profile; the user can also click an edge between nodes to see the strength of the relationship between that character and others, and thus understand the complexity of the character relationships.

In this embodiment, the visual analysis module for literature entities further includes:

In a specific example, the character movement trajectory visualization unit is implemented in the following steps:

Step 1: Extract character trajectory data. For A Song of Ice and Fire, named entity recognition is used to extract character names from the chapter titles, and the places where each character appears are extracted from the corresponding chapter text by matching against the place names listed on the A Song of Ice and Fire wiki.

Step 2: Find a story map of A Song of Ice and Fire and manually mark each place on the map, including its coordinate information.

Step 3: Organize the character, place and family information obtained through named entity recognition into JSON format, which serves as the input data source for the character trajectory visualization.

Step 4: Implement an interactive line chart and scatter plot (shown in FIG. 4) with the open-source visualization library ECharts to show how character trajectories change. Each character has its own trajectory; where trajectories intersect, the characters involved are linked by some event.

Step 5: The trajectory visualization page also integrates interactive functions for chapters and places, so that readers can quickly find places and events and locate the relevant chapters. This function serves as a navigation tool that guides the reader through the reading process.
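The data preparation in Steps 1 to 3 might look like the following Python sketch. All names and values are assumptions for illustration: the coordinates are placeholder pixel positions, not real map annotations, and the input is a hypothetical list of (chapter number, character, places) tuples already produced by entity recognition.

```python
import json

# Hypothetical hand-labelled map coordinates (placeholder values, see Step 2).
PLACE_COORDS = {
    "Winterfell": (120, 80),
    "Castle Black": (130, 30),
    "King's Landing": (160, 300),
}


def build_trajectories(chapters):
    """Build an ordered list of visited places per character.

    chapters: list of (chapter_no, character, [place, ...]) in reading order.
    Places without map coordinates are skipped, as are consecutive repeats.
    """
    tracks = {}
    for chapter_no, character, places in chapters:
        for place in places:
            if place not in PLACE_COORDS:
                continue  # cannot be drawn on the map
            track = tracks.setdefault(character, [])
            if track and track[-1]["place"] == place:
                continue  # the character has not moved
            x, y = PLACE_COORDS[place]
            track.append({"chapter": chapter_no, "place": place, "x": x, "y": y})
    return tracks


chapters = [
    (1, "Jon", ["Winterfell"]),
    (2, "Arya", ["Winterfell"]),
    (3, "Jon", ["Winterfell", "Castle Black"]),
    (4, "Arya", ["King's Landing", "Harrenhal"]),
]
tracks = build_trajectories(chapters)
print(json.dumps(tracks, indent=2))
```

Each character's list can be fed to a line series as coordinate pairs, with the `chapter` field available for tooltips and chapter navigation.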

In this embodiment, the visual analysis and display module for literature entities further includes:

a character emotion change visualization unit, used for showing how a character's emotions change as the plot of the literary work develops. Sentences describing each character are extracted with the open-source tool NLTK, emotion analysis is performed on those sentences, and each character's score on different emotion dimensions is calculated; the open-source visualization tool ECharts is then used to present the changes in a character's emotions across chapters as a line chart;

In a specific example, the character emotion change visualization unit is implemented in the following steps:

Step 1: Sentences describing characters are extracted from the text with the open-source tool NLTK, and emotion analysis is performed. The NRC emotion lexicon published by the National Research Council of Canada is used; it covers the emotion categories joy, fear, sadness, anger, surprise, disgust, trust and anticipation. Each character's score on the different emotion dimensions is calculated by identifying the emotion categories present in the sentences.

Step 2: The character emotion analysis results are organized into an emotion data set containing character names and the corresponding emotion scores.

Step 3: ECharts is used to implement an interactive line chart (shown in FIG. 5) that intuitively displays each character's emotion scores and how they change. Each character has a separate emotion line chart, and the height of the line reflects the magnitude of a given emotion score.

Step 4: The system provides a sliding time axis, so that the user can select different time ranges as needed and observe how plot development affects a character's emotions.
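The lexicon-based scoring in Steps 1 and 2 can be sketched in Python as below. This is a simplified illustration, not the embodiment's code: a regular expression replaces NLTK sentence splitting, `TOY_LEXICON` is a tiny hypothetical stand-in for the NRC lexicon, and a sentence is attributed to every character named in it.

```python
import re
from collections import Counter, defaultdict

# Hypothetical miniature word-to-emotions lexicon (the real NRC lexicon is far larger).
TOY_LEXICON = {
    "afraid": {"fear"},
    "wept": {"sadness"},
    "laughed": {"joy"},
    "betrayed": {"anger", "sadness"},
}


def emotion_scores(chapters, characters):
    """Return scores[character][chapter_no][emotion] = number of lexicon hits.

    chapters: list of (chapter_no, text) pairs.
    """
    scores = defaultdict(lambda: defaultdict(Counter))
    for chapter_no, text in chapters:
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            words = re.findall(r"[A-Za-z']+", sentence)
            mentioned = [name for name in characters if name in words]
            for word in words:
                for emotion in TOY_LEXICON.get(word.lower(), ()):
                    # Credit the emotion to every character named in the sentence.
                    for name in mentioned:
                        scores[name][chapter_no][emotion] += 1
    return scores


chapters = [
    (1, "Arya laughed. Sansa wept and felt betrayed."),
    (2, "Arya was afraid."),
]
scores = emotion_scores(chapters, ["Arya", "Sansa"])
for name, per_chapter in scores.items():
    print(name, {chapter: dict(counts) for chapter, counts in per_chapter.items()})
```

The per-chapter counts for one emotion form the data points of that emotion's line for a given character. Normalizing by the number of sentences per chapter would be a reasonable refinement that the text does not specify.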

In a specific example, the character appearance frequency visualization unit is implemented in the following steps:

Step 1: Following the chapter division of the novel, the number of times each character appears in each chapter is counted, and a character appearance frequency data set is constructed.

Step 2: ECharts is used to draw a line chart of character appearance frequency, as shown in FIG. 6, in which the horizontal axis represents chapters and the vertical axis represents appearance frequency. The user can drag the time axis to see how often a specific character appears as the plot develops across the whole novel, which helps readers understand how the character's importance changes over the course of the plot.
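A minimal Python sketch of these two steps follows, assuming single-token character names and chapter texts already split into a list. The function names are hypothetical. The resulting dictionary follows the general shape of an ECharts option object (`xAxis`, `yAxis`, `series`, and a slider-type `dataZoom` for the draggable axis) and would be serialized to JSON for the front end.

```python
import json
import re
from collections import Counter


def appearance_counts(chapters, characters):
    """Count mentions of each character in each chapter, in chapter order."""
    series = {name: [] for name in characters}
    for text in chapters:
        words = Counter(re.findall(r"[A-Za-z]+", text))
        for name in characters:
            series[name].append(words[name])
    return series


def to_line_chart_option(series, chapter_labels):
    """Arrange the counts as a line-chart option dictionary."""
    return {
        "xAxis": {"type": "category", "data": chapter_labels},
        "yAxis": {"type": "value"},
        "dataZoom": [{"type": "slider"}],  # draggable range along the chapter axis
        "series": [
            {"name": name, "type": "line", "data": counts}
            for name, counts in series.items()
        ],
    }


chapters = ["Jon saw Arya. Jon left.", "Arya ran.", "Jon returned."]
series = appearance_counts(chapters, ["Jon", "Arya"])
option = to_line_chart_option(series, ["Ch. 1", "Ch. 2", "Ch. 3"])
print(json.dumps(option, indent=2))
```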

Claims

1. A visual reading system for long literature based on cross-domain named entity recognition, comprising:

a literature entity optimization module, used for training a parameter-transfer-based cross-domain named entity recognition model on a coarse-grained entity data set, designing a rule weight network by introducing context constraint rules for specific entity types in order to optimize the model results, and then recognizing the source text of the literary work again to generate fine-grained character entity, place entity and family entity data sets;

2. The visual reading system for long literature based on cross-domain named entity recognition of claim 1, wherein the context constraint rules for specific entity types in the literature entity optimization module include rules for identifying character entities, rules for identifying place entities, and rules for identifying family entities.

3. The visual reading system for long literature based on cross-domain named entity recognition of claim 1, wherein the literature-entity-oriented visual analysis and display module comprises:

a character relationship network visualization unit, used for showing the relationship network among different characters in a literary work, wherein the strength of the relationship between two characters is evaluated by counting the number of times they appear together in the same sentence, and, using the open-source visualization tool D3.js, character names serve as nodes and the relationship strength values between characters serve as edge weights, so that the character relationship network is displayed visually;

a character movement trajectory visualization unit, used for displaying the movement trajectories and important events of characters in a literary work, wherein place entities in the chapters in which a character appears are extracted by character name and in order of appearance to construct a character trajectory data set; a line chart of each character's movement trajectory is drawn using the open-source visualization tool ECharts; important events in the literary work are extracted by a rule-based matching method, and the summary, place and chapter information of each important event is obtained; and the summary, place and chapter of each important event are integrated into the character movement trajectory chart to provide a reading navigation tool that helps readers understand and track character trajectories and plot development;

a character emotion change visualization unit, used for showing how a character's emotions change as the plot of a literary work develops, wherein sentences describing each character are extracted using the open-source tool NLTK and emotion analysis is performed on the extracted sentences, that is, each character's score on different emotion dimensions is calculated; and the open-source visualization tool ECharts is then used to present the changes in a character's emotions across chapters as a line chart;

a character appearance frequency visualization unit, used for showing how the appearance frequency of characters changes as the chapters of a literary work progress, wherein a character appearance frequency data set is constructed by counting the number of times each character appears in each chapter; a line chart of appearance frequency is drawn using the open-source visualization tool ECharts; and the user checks the number of times a character appears across the whole novel by dragging a time axis, which helps readers understand how the character's number of appearances changes as the plot develops.