CN111191010B - Movie script multi-element information extraction method - Google Patents

Movie script multi-element information extraction method

Info

Publication number
CN111191010B
CN111191010B (application CN201911416307.9A)
Authority
CN
China
Prior art keywords
event
scene
model
determining
bilstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911416307.9A
Other languages
Chinese (zh)
Other versions
CN111191010A (en)
Inventor
刘宏伟 (Liu Hongwei)
刘宏蕊 (Liu Hongrui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Foreign Studies University
Guangdong University of Technology
Original Assignee
Tianjin Foreign Studies University
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Foreign Studies University and Guangdong University of Technology
Priority to CN201911416307.9A
Publication of CN111191010A
Application granted
Publication of CN111191010B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/34 Browsing; Visualisation therefor
    • G06F 16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/901 Indexing; Data structures therefor; Storage structures
    • G06F 16/9024 Graphs; Linked lists
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure provides a method for extracting multi-element information from a movie script. The method comprises the following steps: extracting one or more scenes from the text; determining the events contained in each scene and the event information of those events; determining the scenario type of the scene according to the events it contains; and storing the scene, the event information and the scenario type correspondingly in a graph database. The method can extract multi-element information, including semantic-level information, from text, helping readers preview the text content more effectively.

Description

Movie script multi-element information extraction method
Technical Field
The disclosure relates to the field of computer software, and in particular to a method for extracting multi-element information from a movie script.
Background
To extract the main information from a long text so that readers can preview its content quickly, text information is generally extracted from the text format using rules or regular expressions. This approach has drawbacks: it ignores semantic-level information in the text and struggles to extract multiple kinds of information. How to extract multi-element information, including semantic-level information, from text so that readers can preview the content more effectively has therefore become an urgent technical problem.
Disclosure of Invention
The embodiments of the present disclosure aim to provide a method for extracting multi-element information from a movie script, so as to extract multi-element information, including semantic-level information, from text and help readers preview the content more effectively.
To achieve the above objective, an embodiment of the present disclosure provides a method for extracting multi-element information from a movie script, the method comprising:
extracting one or more scenes from the text;
determining an event contained in the scene and event information of the event;
determining the scenario type of the scene according to the event contained in the scene;
and storing the scene, the event information and the scenario type in a graph database correspondingly.
The embodiments of the present disclosure also provide a movie script multi-element information extraction apparatus, comprising:
the scene extraction module is used for extracting one or more scenes from the text;
the event determining module is used for determining an event contained in the scene and event information of the event;
the scenario type determining module is used for determining the scenario type of the scene according to the event contained in the scene;
and the data storage module is used for storing the scene, the event information and the scenario type correspondingly in a graph database.
The embodiments of the present disclosure also provide a computer device comprising a processor and a memory for storing instructions executable by the processor, where the instructions, when executed by the processor, implement the steps of the movie script multi-element information extraction method of any of the above embodiments.
The embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon computer instructions that, when executed, implement the steps of the movie script multi-element information extraction method described in any of the above embodiments.
As can be seen from the technical solutions provided by the embodiments of the present disclosure, the disclosure determines the events contained in each scene of a text and their event information, and then determines the scenario type of each scene from the events it contains. Multi-element information including semantic-level information is thereby extracted, allowing readers to preview the text content more effectively.
Drawings
FIG. 1 is a flowchart of a movie script multi-element information extraction method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a movie script format provided by an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a data storage structure provided by an embodiment of the present disclosure;
FIG. 4 is a block diagram of a movie script multi-element information extraction apparatus provided by an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a computer device provided by an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a computer-readable storage medium provided by an embodiment of the present disclosure.
Detailed Description
The embodiment of the disclosure provides a method for extracting multi-element information of a movie script.
In order that those skilled in the art may better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of the present disclosure. All other embodiments obtained by one of ordinary skill in the art, based on the embodiments in this disclosure and without inventive effort, are intended to fall within the scope of this disclosure.
Referring to FIG. 1, the movie script multi-element information extraction method provided by an embodiment of the present disclosure may include the following steps:
s1: one or more scenes are extracted from the text.
In this embodiment, the scene information may be extracted using a regular expression.
In some implementations, the text is a movie script, and the scene information in the movie script typically starts with "ext." or "int.", and therefore, regular expressions can be used to locate sentences that start with "ext." or "int.", to determine the scene information of the event.
For example, referring to the movie script shown in FIG. 2, the scene heading begins with the string "EXT.", so the sentence can be located by the regular expression and the scene information extracted: "We're flying once again over Robin Hood Trail..."
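A minimal Python sketch of this step is given below (the patent provides no code; the function name and the sample script text are illustrative):

```python
import re

# Scene headings ("sluglines") in movie scripts conventionally begin with
# "EXT." (exterior) or "INT." (interior). The pattern anchors on those
# prefixes at the start of a line.
SCENE_HEADING = re.compile(r"^(?:EXT\.|INT\.)", re.IGNORECASE | re.MULTILINE)

def extract_scenes(script_text: str) -> list[str]:
    """Split a script into scenes, each starting at an EXT./INT. heading."""
    starts = [m.start() for m in SCENE_HEADING.finditer(script_text)]
    ends = starts[1:] + [len(script_text)]
    return [script_text[s:e].strip() for s, e in zip(starts, ends)]

script = """EXT. ROBIN HOOD TRAIL - FLYOVER - DAWN

We're flying once again over Robin Hood Trail.

INT. FITTS HOUSE - NIGHT

JANE enters."""

for scene in extract_scenes(script):
    print(scene.splitlines()[0])  # prints each scene heading
```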
S2: and determining the event information of the event contained in the scene.
In some embodiments, to determine the events contained in a scene, the sentences may first be part-of-speech tagged to identify the verbs in each sentence; each verb is then matched against the event types in the pre-established ACE (Automatic Content Extraction) library to obtain the event type and event subtype matching the verb.
For example, Table 1 lists the matching between a portion of the ACE event types and event subtypes and their trigger words (for instance, trigger verbs such as "attack" match the type Conflict, subtype Attack, and verbs such as "travel" match the type Movement, subtype Transport).
TABLE 1 (not reproduced in this text)
Specifically, the sentences may be part-of-speech tagged using spaCy or StanfordNLP.
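As a sketch of this matching step, the following assumes spaCy with its small English model; the trigger dictionary is an illustrative stand-in for the patent's Table 1, built from well-known ACE 2005 type/subtype pairs:

```python
import spacy

# Illustrative stand-in for Table 1: a few ACE event types/subtypes
# keyed by trigger-verb lemmas.
ACE_TRIGGERS = {
    "travel": ("Movement", "Transport"),
    "go": ("Movement", "Transport"),
    "meet": ("Contact", "Meet"),
    "attack": ("Conflict", "Attack"),
    "marry": ("Life", "Marry"),
}

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def match_events(sentence: str):
    """POS-tag a sentence and match its verbs against ACE trigger words."""
    doc = nlp(sentence)
    events = []
    for token in doc:
        if token.pos_ == "VERB" and token.lemma_ in ACE_TRIGGERS:
            etype, esubtype = ACE_TRIGGERS[token.lemma_]
            events.append((token.text, etype, esubtype))
    return events

print(match_events("Lester travels to the office and meets Jane."))
# e.g. [('travels', 'Movement', 'Transport'), ('meets', 'Contact', 'Meet')]
```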
In this embodiment, the event information may include the person, time and place involved, but may also include other content; the present disclosure is not limited in this respect. The event information may be determined by a deep learning model, such as an RNN-CRF model, a CNN-CRF model, a maximum entropy model, or a BiLSTM-CRF model.
The following describes how to obtain event information, taking the BiLSTM-CRF model as an example.
First step: map the words in the text to word vectors using a word embedding layer.
Second step: input the word vectors obtained in the first step into the BiLSTM layer, which outputs, for each word, a predicted BIO label together with a score for that label.
Third step: using a pre-trained CRF model, output a legal BIO label sequence based on the label scores from the second step, subject to the learned constraints. The learned constraints include: the first word of a sentence must start with a "B-" or "O" label, and in a label sequence "B-label1 I-label2 I-label3 ...", label1, label2 and label3 must be of the same type.
For example, one BIO annotation of the sentence "Dabao helps the Chinese team win" is:
Dabao helps the Chinese team win
B-PER O O B-ORG I-ORG O
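To make the three steps concrete, here is a compact model sketch in PyTorch; the patent names no framework, so PyTorch and the third-party pytorch-crf package are assumptions, and all dimensions are illustrative:

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # third-party pytorch-crf package (assumed available)

class BiLSTMCRF(nn.Module):
    """Sketch of the three-step pipeline: embedding -> BiLSTM -> CRF."""

    def __init__(self, vocab_size, num_tags, embed_dim=100, hidden_dim=128):
        super().__init__()
        # Step 1: the word embedding layer maps word ids to word vectors.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Step 2: the BiLSTM layer produces contextual features per word.
        self.bilstm = nn.LSTM(embed_dim, hidden_dim // 2,
                              bidirectional=True, batch_first=True)
        # A linear layer emits a score for each BIO tag per word.
        self.emissions = nn.Linear(hidden_dim, num_tags)
        # Step 3: the CRF layer enforces BIO transition constraints
        # (e.g., a sentence starts with "B-" or "O") and decodes a legal path.
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, word_ids):
        feats = self.emissions(self.bilstm(self.embedding(word_ids))[0])
        return self.crf.decode(feats)  # best legal BIO tag sequence

# Toy usage: 1000-word vocabulary, 5 tags (B-PER, I-PER, B-ORG, I-ORG, O).
model = BiLSTMCRF(vocab_size=1000, num_tags=5)
print(model(torch.randint(0, 1000, (1, 7))))  # one 7-word sentence
```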
S3: and determining the scenario type of the scene according to the event contained in the scene.
To further integrate the extracted events, they are generalized into a scenario type; the scenario type of a scene may be determined by an LDA (Latent Dirichlet Allocation) model or by a clustering algorithm.
The relationship between event type and scenario type is described below using a specific scenario as an example.
For example, at home, a person packs luggage, buys a ticket, and calls a taxi. This series of actions occurs in the same scene and can trigger events such as "packing", "carrying" and "calling", all of which belong to the "travel" scenario type.
Taking the LDA model as an example, a topic model over the event description paragraphs is used to determine the scenario type.
Specifically, regard the text d of each event in the event set D as a word sequence <w_1, w_2, ..., w_n>, where w_i denotes the i-th word and d contains n words. All the distinct words appearing in D form a large set VOCABULARY (VOC for short). LDA takes the set D as input and trains two result vectors: for each text d in D, the probabilities θ_d = <pt_1, ..., pt_k> that d corresponds to the different topics in the topic set T, and for each topic t in T, the probabilities φ_t = <pw_1, ..., pw_m> that t generates the different words.
In θ_d = <pt_1, ..., pt_k> for a text d in D, pt_i represents the probability that d corresponds to the i-th topic in T: pt_i = nt_i / n, where nt_i is the number of words in d assigned to the i-th topic and n is the total number of words in d.
In φ_t = <pw_1, ..., pw_m> for a topic t in T, pw_i represents the probability that t generates the i-th word in the VOC: pw_i = Nw_i / N, where Nw_i is the number of occurrences of the i-th VOC word assigned to topic t, and N is the total number of words assigned to topic t.
Further, using the formula p(w|d) = p(w|t) × p(t|d), with the topics as an intermediate layer, the current θ_d and φ_t give the probability of a word w appearing in a text d, where p(t|d) is computed from θ_d and p(w|t) is computed from φ_t.
Using the current θ_d and φ_t, p(w|d) can be computed for a word in a text under any topic, and the topic to which that word should correspond is then updated based on these results. If the update changes the word's topic assignment, it in turn affects θ_d and φ_t.
At the start of the LDA model, θ_d and φ_t are randomly initialized. The above process is repeated until convergence, and the converged θ_d and φ_t are the output of LDA.
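As a sketch, this step could be implemented with the gensim library (an assumption; the patent does not name an implementation). The event documents below are illustrative placeholders:

```python
from gensim import corpora, models

# Each "document" is the tokenized description of the events in one scene;
# the contents here are illustrative placeholders.
event_docs = [
    ["pack", "luggage", "buy", "ticket", "call", "taxi"],
    ["draw", "gun", "chase", "suspect", "fire", "shot"],
    ["board", "train", "carry", "bag", "depart", "station"],
]

dictionary = corpora.Dictionary(event_docs)            # the VOC word set
corpus = [dictionary.doc2bow(doc) for doc in event_docs]

# Train LDA; num_topics corresponds to k scenario types.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      passes=10, random_state=0)

# theta_d: the topic distribution of document 0; the argmax topic can be
# read as that scene's scenario type.
print(lda.get_document_topics(corpus[0]))
```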
S4: and storing the scene, the event information and the scenario type in a graph database correspondingly.
Referring to FIG. 3, the scene, the event information and the scenario type may be stored in the graph database in the form of triples.
For example, the scene: (BEAUTY, scene, FITTS HOUSE); the time: (FITTS HOUSE, time, NIGHT); a character: (JANE, appear, FITTS HOUSE).
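A minimal sketch of the storage step follows, assuming Neo4j and its official Python driver (the patent does not name a particular graph database; the connection details and node label are placeholders):

```python
from neo4j import GraphDatabase  # official Neo4j Python driver (assumed)

# Connection details are placeholders.
driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

def store_triple(tx, head, relation, tail):
    # MERGE keeps nodes unique; the relation name becomes the edge type.
    # Relationship types cannot be Cypher parameters, so the name is
    # interpolated; only do this with trusted, fixed relation names.
    tx.run(
        "MERGE (h:Entity {name: $head}) "
        "MERGE (t:Entity {name: $tail}) "
        f"MERGE (h)-[:{relation}]->(t)",
        head=head, tail=tail,
    )

triples = [
    ("BEAUTY", "SCENE", "FITTS_HOUSE"),   # script  -> scene
    ("FITTS_HOUSE", "TIME", "NIGHT"),     # scene   -> time
    ("JANE", "APPEAR", "FITTS_HOUSE"),    # character -> scene
]

with driver.session() as session:
    for h, r, t in triples:
        session.execute_write(store_triple, h, r, t)
driver.close()
```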
Referring to FIG. 4, the present disclosure further provides a movie script multi-element information extraction apparatus, the apparatus comprising:
a scene extraction module 100, configured to extract one or more scenes from the text;
an event determining module 200, configured to determine an event included in the scene and event information of the event;
a scenario type determining module 300, configured to determine a scenario type of the scene according to an event included in the scene;
and the data storage module 400 is used for storing the scene, the event information and the scenario type into a graph database correspondingly.
Referring to FIG. 5, the present disclosure further provides a computer device comprising a processor and a memory for storing instructions executable by the processor, where the instructions, when executed by the processor, implement the steps of the movie script multi-element information extraction method of any of the above embodiments.
Referring to FIG. 6, an embodiment of the present disclosure further provides a computer-readable storage medium having stored thereon computer instructions that, when executed, implement the steps of the movie script multi-element information extraction method described in any of the above embodiments.
In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (e.g., an improvement to a circuit structure such as a diode, transistor or switch) or an improvement in software (an improvement to a method flow). With the development of technology, however, many improvements to method flows today can be regarded as direct improvements to hardware circuit structures, since designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be realized by a hardware entity module. For example, a programmable logic device (PLD), such as a field-programmable gate array (FPGA), is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a single PLD, without needing a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually manufacturing integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must likewise be written in a specific programming language, called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logic method flow can readily be obtained merely by slightly logic-programming the method flow into an integrated circuit using one of the above hardware description languages.
Those skilled in the art will also appreciate that, in addition to implementing a controller purely in computer-readable program code, it is entirely possible to implement the same functionality by logically programming the method steps so that the controller takes the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included within it for performing the various functions may also be regarded as structures within the hardware component. Or even the means for performing the various functions may be regarded both as software modules implementing the method and as structures within the hardware component.
The apparatus and modules illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product having a certain function.
For convenience of description, the above devices are described as being functionally divided into various modules, respectively. Of course, the functions of the various modules may be implemented in the same one or more pieces of software and/or hardware when implementing the present disclosure.
From the description of the embodiments above, it will be apparent to those skilled in the art that the present disclosure may be implemented by software plus a necessary general-purpose hardware platform. Based on such an understanding, the essence of the technical solutions of the present disclosure, or the part contributing to the prior art, may be embodied in the form of a software product. In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory. The computer software product may include instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the various embodiments, or parts of embodiments, of the present disclosure. The computer software product may be stored in a memory, which may include volatile memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium. Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
In this specification, the embodiments are described in a progressive manner; identical and similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments.
The disclosure is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The disclosure may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The disclosure may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The foregoing is merely illustrative of embodiments of the present disclosure and is not intended to limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of the present disclosure shall fall within its scope of protection.

Claims (9)

1. A method for extracting multi-element information from a movie script, characterized by comprising the following steps:
extracting one or more scenes from the text using a regular expression;
determining an event contained in the scene and event information of the event;
determining the scenario type of the scene according to the event contained in the scene;
storing the scene, the event information and the scenario type in a graph database correspondingly;
the event information of the event is determined through a deep learning model, wherein the deep learning model comprises an RNN-CRF model, a CNN-CRF model, a maximum entropy model and a BiLSTM-CRF model;
wherein determining event information for the event using the BiLSTM-CRF model comprises:
mapping words in the text into word vectors by using a word embedding layer of the BiLSTM model;
inputting the word vector into a BiLSTM layer of the BiLSTM model, and outputting a predicted BIO label and a score value corresponding to the BIO label for each word;
and outputting a legal BIO label sequence based on the score value corresponding to the BIO label under the learned constraint by utilizing a pre-trained CRF model.
2. The method of claim 1, wherein said determining the event contained in the scene comprises:
respectively marking parts of speech of sentences in the scene, and determining verbs in the sentences;
and matching the verb with an event type in a pre-established automatic content extraction library to obtain the event type matched with the verb.
3. The method of claim 1, wherein the scenario type of the scene is determined by a latent Dirichlet allocation model or a clustering algorithm.
4. The method of claim 1, wherein the scene, the event information, and the scenario type are stored in the graph database in the form of triples.
5. The method of claim 1, wherein the event information includes a person, a time, and a place.
6. A movie script multi-element information extraction apparatus, comprising:
the scene extraction module is used for extracting one or more scenes from the text;
the event determining module is used for determining an event contained in the scene and event information of the event;
the scenario type determining module is used for determining the scenario type of the scene according to the event contained in the scene;
the data storage module is used for storing the scene, the event information and the scenario type correspondingly in a graph database;
the event information of the event is determined through a deep learning model, wherein the deep learning model comprises an RNN-CRF model, a CNN-CRF model, a maximum entropy model and a BiLSTM-CRF model;
wherein determining event information for the event using the BiLSTM-CRF model comprises:
mapping words in the text into word vectors by using a word embedding layer of the BiLSTM model;
inputting the word vector into a BiLSTM layer of the BiLSTM model, and outputting a predicted BIO label and a score value corresponding to the BIO label for each word;
and outputting a legal BIO label sequence based on the score value corresponding to the BIO label under the learned constraint by utilizing a pre-trained CRF model.
7. The apparatus of claim 6, wherein the event determination module comprises:
a part-of-speech tagging unit, used for part-of-speech tagging the sentences in the scene respectively and determining the verbs in the sentences;
and the event matching unit is used for matching the verb with the event type in the automatic content extraction library established in advance to obtain the event type matched with the verb.
8. A computer device comprising a processor and a memory for storing processor-executable instructions, which when executed by the processor implement the steps of the method of any one of claims 1-5.
9. A computer readable storage medium having stored thereon computer instructions which when executed perform the steps of the method of any of claims 1-5.
CN201911416307.9A 2019-12-31 2019-12-31 Movie script multi-element information extraction method Active CN111191010B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911416307.9A CN111191010B (en) 2019-12-31 2019-12-31 Movie script multi-element information extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911416307.9A CN111191010B (en) 2019-12-31 2019-12-31 Movie script multi-element information extraction method

Publications (2)

Publication Number Publication Date
CN111191010A CN111191010A (en) 2020-05-22
CN111191010B (en) 2023-08-08

Family

ID=70709722

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911416307.9A Active CN111191010B (en) 2019-12-31 2019-12-31 Movie script multi-element information extraction method

Country Status (1)

Country Link
CN (1) CN111191010B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101316362A (en) * 2007-05-29 2008-12-03 中国科学院计算技术研究所 Movie action scene detection method based on story line development model analysis
CN102005231A (en) * 2010-09-08 2011-04-06 东莞电子科技大学电子信息工程研究院 Storage method of rich-media scene flows
CN102207948A (en) * 2010-07-13 2011-10-05 天津海量信息技术有限公司 Method for generating incident statement sentence material base
CN105389304A (en) * 2015-10-27 2016-03-09 小米科技有限责任公司 Event extraction method and apparatus
CN107977359A (en) * 2017-11-27 2018-05-01 西安影视数据评估中心有限公司 A kind of extracting method of video display drama scene information

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130166303A1 (en) * 2009-11-13 2013-06-27 Adobe Systems Incorporated Accessing media data using metadata repository


Also Published As

Publication number Publication date
CN111191010A (en) 2020-05-22

Similar Documents

Publication Publication Date Title
Wang et al. An overview of image caption generation methods
US11030416B2 (en) Latent ambiguity handling in natural language processing
CN110457449B (en) Method, device, equipment and storage medium for training model online
CN112905735A (en) Method and apparatus for natural language processing
TWI688903B (en) Social content risk identification method, device and equipment
US10769373B2 (en) Contextual validation of synonyms in otology driven natural language processing
US11687714B2 (en) Systems and methods for generating text descriptive of digital images
CN111222315B (en) Movie scenario prediction method
CN117331561B (en) Intelligent low-code page development system and method
CN113836295A (en) Text abstract extraction method, system, terminal and storage medium
WO2020207086A1 (en) Information processing system, method, apparatus and device
US11158319B2 (en) Information processing system, method, device and equipment
CN111209389B (en) Movie story generation method
CN111191010B (en) Movie script multi-element information extraction method
CN113887234B (en) Model training and recommending method and device
Moniz et al. ReALM: Reference Resolution As Language Modeling
CN111033493A (en) Electronic content insertion system and method
CN117369783B (en) Training method and device for security code generation model
CN114817469B (en) Text enhancement method, training method and training device for text enhancement model
US20240256948A1 (en) Intelligent orchestration of multimodal components
CN112541350B (en) Variant text reduction method, device and equipment
CN114330342A (en) Named entity identification method, device and equipment
US20240048821A1 (en) System and method for generating a synopsis video of a requested duration
CN108804603B (en) Man-machine written dialogue method and system, server and medium
CN114579813A (en) Data processing method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant