CN108108482A

CN108108482A - A method for enhancing scene realism in text-to-scene conversion

Info

Publication number: CN108108482A
Application number: CN201810011163.8A
Authority: CN
Inventors: 杨富平; 刘凯
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2018-01-05
Filing date: 2018-01-05
Publication date: 2018-06-01
Anticipated expiration: 2038-01-05
Also published as: CN108108482B

Abstract

The present invention claims a method for enhancing scene realism in text-to-scene conversion. The method includes: Step 1: obtain multiple Chinese documents describing a given scene from the internet and build a scene corpus. Step 2: segment the documents in the corpus into words without deduplication, then remove stop words from the segmented documents. Step 3: using the processed documents, statistically analyze the entity nouns in the scene descriptions. Step 4: analyze the features of the scene category using the statistical indicators. Step 5: represent the scene conceptually by its entity nouns and build a scene concept dictionary. The aim of the invention is to establish a "word category" association between entity words and scene words, analyze the general features of the "category", and thereby obtain a conceptual representation of a given scene word. This supports the analysis of scene entity elements in text-to-scene conversion, so that the generated scene matches people's common-sense understanding and has a complete background environment, which enhances the realism of the scene.

Description

A method for enhancing scene realism in text-to-scene conversion

Technical field

The invention belongs to the field of text visualization, and in particular relates to a method for enhancing scene realism in text-to-scene conversion.

Background technology

Text-to-scene conversion is a frontier research topic. In essence, it converts the symbols of textual information into a visual, simulated representation. Several motivations drive the modeling of this conversion process. One significant application is modeling a person's mental state; another is aiding the understanding of a story. A third related field is cognitive modeling, in which a passage of description is interpreted using a large amount of diverse knowledge. Text visualization must conform not only to the description in the text but, even more, to the actual conditions of the scene and its entities.

According to the scientific literature in China and abroad, existing research on text-to-scene conversion focuses mainly on text semantics and on analyzing relations, such as spatial relations, between the entities described. The "scene" described by the text has not been investigated in depth. In existing text-to-scene conversion systems the input is mostly plain text, and the generated scene contains only the entities named in the text; the scene has no evident category features or background environment, lacks realism, and is of limited practical value.

The present invention aims to solve the above problems of the prior art. A text-to-scene conversion system associates knowledge with scene images and builds a bridge from knowledge to scene. A scene is composed of a group of associated scene objects, and the entity objects in the generated scene image correspond to the scene entity nouns in the text description. Scenes differ by category: different scenes are composed of different scene objects. In a desert scene one expects to see a vast desert, withered branches and leaves, cacti or camels, but not a boundless ocean. This gives scene classification a basis in reality. In image understanding and classification, three approaches to scene classification are currently popular: (1) object-based scene classification; (2) region-based scene classification; (3) context-based scene classification. Object-based and context-based methods show that scenes can be classified by studying their semantic objects, and that scenes of different categories have different sets of semantic objects. On this basis, the set of semantic objects of a scene category is defined here as the conceptual representation of its category word, and a concept model linking category words to semantic objects is established. From the perspective of text, the relation between scene category words and semantic objects is analyzed; from common sense, the analysis focuses on which entities a given scene contains in daily life. The scene generated by the text-to-scene conversion system is supplemented accordingly, so that it better matches people's common-sense understanding and its realism is enhanced.

Summary of the invention

The present invention aims to solve the above problems of the prior art and proposes a method for enhancing the realism of a scene. The technical solution of the invention is as follows:

A method for enhancing scene realism in text-to-scene conversion, comprising the following steps:

1) Obtain multiple Chinese documents describing a given scene from the internet and build a scene corpus; the invention currently targets Chinese text-to-scene conversion systems.

2) Segment the Chinese document set describing the scene into words without deduplication, then remove stop words from the segmented documents.

3) Using the stop-word-filtered segmentation result of step 2), apply word-frequency statistics to the entity nouns in the result to obtain statistical indicators for the entity nouns.

4) Using the entity-noun statistical indicators of step 3), build the feature word list of the scene category corresponding to the document set.

5) Using the scene category feature word list of step 4), analyze and extract the best scene category feature words and build the scene entity dictionary.

Further, the scene corpus of step 1) is built from documents of the same scene category; it is a document set with evident scene features.

Further, the scene concept model of step 1) represents a scene category conceptually by a word vector of entity nouns, $\vec{w} = (w_1, w_2, \ldots, w_t)$, where $w_t$ denotes an entity noun. Each scene category corresponds to one group of related word vectors. The subscript t is defined as the threshold of the concept dictionary and is also the length (modulus) of the word vector. A large number of documents of the same category are obtained, and the entity nouns that occur frequently in the documents and are associated with category C are collected into a word vector, where m is defined as the number of entity nouns; from this vector the scene concept dictionary of scene category C is determined.

Further, in step 2) the Chinese documents in the scene corpus are segmented into words without deduplication, and stop words are then removed from the segmented documents, specifically:

The acquired documents are first denoised by removing advertising phrases and English links, and are then segmented with the ROST Chinese word segmentation tool.

Further, in step 3) word-frequency statistics are applied to the entity nouns in the stop-word-filtered segmentation result of step 2) to obtain the statistical indicators of the entity nouns, specifically:

The traditional text-feature TF-IDF model mainly considers the term frequency (TF) and the inverse document frequency (IDF) of a feature item. The term frequency is the number of times a feature item occurs in a document. For the scene concept model, n documents of a category C are obtained to form a document set A; the number of times an entity noun w occurs in the document set of category C is one of the key references for obtaining the scene concept dictionary.

For each document set A, the frequency of each entity noun occurring in the n documents is counted from the stop-word-filtered Chinese documents.

The word frequency $f_i$ of word $w_i$ in A is defined as

$f_i = \mathrm{count}(w_i, A) / \mathrm{size}(A), \quad 0 < f_i < 1$

where $\mathrm{count}(w_i, A)$ is the number of times word $w_i$ occurs in document set A, and $\mathrm{size}(A)$ is the total number of occurrences of all entity nouns in A.

The inverse document frequency is then applied. IDF quantifies how a feature item is distributed across the document set. It is computed as follows: let the total number of documents in document set A be N and the number of documents containing word w be n; the inverse document frequency in the scene model is then defined as:

Further, in step 4) the scene category features are analyzed using the statistical indicators of the entity nouns from step 3), specifically: the list is defined as a word vector $\vec{w}$ of entity nouns, which represents the scene category conceptually.

Each document is examined under the assumption that its scene is composed of multiple entity nouns. For a document scene $\vec{w}$, with $p(w_n)$ denoting the probability of scene entity word $w_n$, the probability of generating the scene $\vec{w}$ described by the document is:

$p(\vec{w}) = \prod_{n=1}^{N} p(w_n)$

In the scene corpus, the scene of each selected document is described mainly through its scenery and is unique. Assume that a document first selects a scene s and then generates the entities needed by the described scene according to s, the scene described by the document being unique. Assuming the scene categories are $s_1, s_2, \ldots, s_k$, the probability of generating the document scene is:

$p(\vec{w}) = p(s_1)\prod_{n=1}^{N} p(w_n \mid s_1) + \cdots + p(s_k)\prod_{n=1}^{N} p(w_n \mid s_k) = \sum_{s} p(s)\prod_{n=1}^{N} p(w_n \mid s)$

After a value of t is selected, the first t items of the scene category feature word list are randomly divided into two scene categories $s_1$ and $s_2$, and a probability analysis is performed on each document. Suppose that for some document $\vec{w}$ the generation probability is:

$p(\vec{w}) = p(s_1)\prod_{n=1}^{U} p(w_n \mid s_1) + p(s_2)\prod_{n=1}^{V} p(w_n \mid s_2)$

where N = U + V, U is the number of entity nouns belonging to scene category $s_1$ and V is the number belonging to scene category $s_2$; then the chosen value of t is unsuitable.

Further, when multiple sub-scenes appear among the first t items, the value of t is still judged by a traversal bisection: the split point ranges over [2, t-1], and each split is traversed to judge whether the value of t is suitable.

The advantages and beneficial effects of the invention are as follows:

Compared with the prior art, the invention establishes a scene concept model: given a document set describing a scene category, the entity nouns corresponding to that category are obtained, which establishes an association between scene words and entity nouns at the level of scene category. This supports the analysis of scene entity elements in text-to-scene conversion, so that the generated scene matches people's common-sense understanding and has a complete background environment, which enhances the realism of the scene.

Description of the drawings

Fig. 1 is a flowchart of the method for establishing a scene concept model in text-to-scene conversion according to a preferred embodiment of the invention;

Fig. 2 is a system function diagram of the scene concept model of the present application.

Specific embodiment

The technical solutions in the embodiments of the invention are described clearly and in detail below with reference to the drawings. The described embodiments are only some of the embodiments of the invention.

The technical solution by which the invention solves the above technical problem is:

A method for enhancing scene realism in text-to-scene conversion, including:

Step 1: Obtain multiple Chinese documents describing a given scene from the internet and build a scene corpus.

Step 2: Segment the documents in the corpus into words without deduplication, then remove stop words from the segmented documents.

Step 3: Using the stop-word-filtered segmentation result of step 2, apply word-frequency statistics to the entity nouns in the result to obtain statistical indicators for the entity nouns.

Step 4: Using the entity-noun statistical indicators of step 3, build the feature word list of the scene category corresponding to the document set.

Step 5: Using the scene category feature word list of step 4, analyze and extract the best scene category feature words and build the scene entity dictionary.

In the method for establishing a scene concept model in text-to-scene conversion, step 1 includes:

For a scene category C, multiple documents describing scene category C are crawled from the internet, a hundred or more. The example uses 200 compositions describing the "birthday" scene, crawled from Baidu Compositions.

In the method for establishing a scene concept model in text-to-scene conversion, step 2 includes:

The acquired documents are first denoised by removing advertising phrases and English links, and are then segmented with the ROST Chinese word segmentation tool.
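The preprocessing described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: segmentation is assumed to have been done by an external tool (ROST is not scripted here), and `LINK_PATTERN`, `AD_PHRASES` and `STOPWORDS` are hypothetical placeholders that a real system would replace with much larger lists.

```python
import re

# Hypothetical noise patterns and stop words, for illustration only.
LINK_PATTERN = re.compile(r"(?:https?://|www\.)[A-Za-z0-9./?=&_%:#~-]+")
AD_PHRASES = ["click to download", "more compositions"]
STOPWORDS = {"the", "a", "is", "of"}

def denoise(text):
    """Remove English links and known advertising phrases from raw text."""
    text = LINK_PATTERN.sub("", text)
    for phrase in AD_PHRASES:
        text = text.replace(phrase, "")
    return text

def remove_stopwords(tokens):
    """Drop stop words but keep repeated tokens (no deduplication)."""
    return [tok for tok in tokens if tok and tok not in STOPWORDS]

raw = "today is the birthday of mother www.example.com click to download"
clean = denoise(raw).strip()

# Tokens as they might come back from a segmenter; "mother" is kept twice.
tokens = remove_stopwords(["today", "is", "mother", "the", "birthday", "mother"])
```

Repeated tokens are kept on purpose, because the later frequency statistics depend on how often each entity noun occurs.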

In the method for establishing a scene concept model in text-to-scene conversion, step 3 includes:

Step 3 performs the statistical analysis of entity nouns; traditional text-feature methods offer a good approach to feature extraction. The TF-IDF model mainly considers the term frequency (TF) and the inverse document frequency (IDF) of a feature item. The term frequency is the number of times a feature item occurs in a document. Documents of different categories differ greatly in the occurrence probability of certain feature items. For the scene concept model, n documents of a category C are obtained to form a document set A; the number of times an entity noun w occurs in the document set of category C is one of the key references for obtaining the scene concept dictionary.

For each document set A, the frequency of each entity noun occurring in the n documents is counted from the result of step 2.

The word frequency $f_i$ of word $w_i$ in A is defined as

$f_i = \mathrm{count}(w_i, A) / \mathrm{size}(A), \quad 0 < f_i < 1$

where $\mathrm{count}(w_i, A)$ is the number of times word $w_i$ occurs in document set A, and $\mathrm{size}(A)$ is the total number of occurrences of all entity nouns in A.
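As a minimal sketch of this word-frequency definition, assuming each document has already been reduced to a list of entity nouns (the function name `noun_frequencies` and the sample documents are illustrative, not from the patent):

```python
from collections import Counter

def noun_frequencies(docs):
    """Return f_i = count(w_i, A) / size(A) for every entity noun in A.

    docs: list of documents, each a list of entity nouns, not deduplicated.
    """
    counts = Counter(noun for doc in docs for noun in doc)
    total = sum(counts.values())  # size(A): all entity-noun occurrences in A
    return {noun: c / total for noun, c in counts.items()}

# Toy document set A with three documents.
docs = [
    ["mother", "cake", "mother"],
    ["cake", "candle"],
    ["mother", "present", "cake"],
]
freq = noun_frequencies(docs)
```

Because every count is divided by the same total, the frequencies over all entity nouns sum to 1.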

The frequency of an entity noun is not sufficient to handle all cases, because classification may be incomplete. For example, if an entity noun w appears in only one document of the category document set A but occurs many times there, the word is probably related only to that article and has little to do with the category of A; the frequency indicator alone cannot represent the features of category C. The inverse document frequency (IDF) quantifies how a feature item is distributed across the document set. A common way to compute IDF is:

$\mathrm{IDF}(T_k) = \log(N / n_k)$

where N is the total number of documents in the document set and $n_k$ is the number of documents in which feature item $T_k$ occurs.

The core idea of IDF is that a feature item occurring in most documents is less important than one occurring in only a small share of the documents. IDF weakens high-frequency feature items that occur in most documents and strengthens low-frequency feature items that occur in few documents. For the scene concept model, however, the task is to analyze the features of a scene category given a data set of that single category; these features should be the feature items that the documents in the data set mainly involve when describing the scene. The scene concept model therefore keeps feature items that occur in most documents and filters out feature items that occur many times in only a few documents.

Let the total number of documents in document set A be N and the number of documents containing word w be n; the inverse document frequency in the scene model (SIDF) is then defined as:
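The formula itself is not reproduced in this text. As an illustration only, the sketch below assumes a score of the form `log10(1 + n / N)`, chosen because it grows with the share of documents containing the word, which is the behaviour the text asks for. This form is an assumption and should not be read as the patent's definition. The textbook `idf` is included for contrast.

```python
import math

def document_frequency(docs, word):
    """n: the number of documents that contain `word` at least once."""
    return sum(1 for doc in docs if word in doc)

def idf(n, total_docs):
    """Textbook IDF: larger for words found in fewer documents."""
    return math.log10(total_docs / n)

def sidf_assumed(n, total_docs):
    """Assumed scene-oriented score: larger for words found in more documents."""
    return math.log10(1 + n / total_docs)

# Toy corpus: "motherland" is frequent, but only inside a single document.
docs = [
    ["mother", "cake"],
    ["mother", "candle", "cake"],
    ["mother", "motherland", "motherland", "motherland"],
    ["mother", "present"],
]
N = len(docs)
n_cake = document_frequency(docs, "cake")
n_motherland = document_frequency(docs, "motherland")
```

Under the assumed score, "cake" (two of four documents) outranks "motherland" (one document), while the textbook IDF orders them the other way round.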

For example, the "birthday" scene mentioned above is modeled as follows:

Entity noun	Word frequency	Documents containing the word	SIDF
Mother	403	79	0.25
Cake	158	72	0.23
Father	147	57	0.19
Present	107	40	0.14
Classmate	77	31	0.117
China	60	7	0.029
Motherland	50	10	0.04
Candle	49	27	0.103

Table 1

As the table shows, the words "China" and "Motherland", although they occur many times, score lower than "Candle" under the SIDF indicator; for the birthday scene this change is favorable.
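The effect can be checked directly on the values listed in the table. The snippet below only re-sorts those rows; the variable names are illustrative.

```python
# Rows copied from the table: (noun, word frequency, documents containing it, SIDF).
rows = [
    ("Mother", 403, 79, 0.25),
    ("Cake", 158, 72, 0.23),
    ("Father", 147, 57, 0.19),
    ("Present", 107, 40, 0.14),
    ("Classmate", 77, 31, 0.117),
    ("China", 60, 7, 0.029),
    ("Motherland", 50, 10, 0.04),
    ("Candle", 49, 27, 0.103),
]

# Rank once by raw word frequency and once by the SIDF column.
by_frequency = [r[0] for r in sorted(rows, key=lambda r: r[1], reverse=True)]
by_sidf = [r[0] for r in sorted(rows, key=lambda r: r[3], reverse=True)]
```

Sorted by word frequency, "Candle" comes last; sorted by SIDF, it moves ahead of both "Motherland" and "China", which drop to the end of the list.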

In the method for establishing a scene concept model in text-to-scene conversion, step 4 includes:

Step 4 analyzes the features of the scene category using the statistical indicators. The preceding steps yield a ranked list of feature words for the scene category; how many of the leading items of this list should be chosen to represent the scene category conceptually is the question to be settled. The list is defined as a word vector $\vec{w}$ of entity nouns, which represents the scene category conceptually.

For the scene concept model, because incomplete classification exists, choosing the length t of the word vector faces the following problems:

1) If t is too small, the scene category is not represented accurately enough.

2) If t is too large, multiple scene categories are mixed.

Consider case 2): when t is too large, multiple scene categories become mixed. In Table 1 above, after sorting by SIDF, when the initial value of t is greater than 7 the scene features are mixed, namely the birthday scene (mother, father, cake, present, candle, classmate) and descriptions associated with National Day (China, motherland). The choice of t therefore requires reasonable handling and judgment.

A scene in a document consists of the occasion (social environment) and the scenery (natural environment); scene writing combines occasion description and scenery description. In the scene corpus, the scene described by each document is mainly scenery description and is unique.

The scene corpus contains more than one scene s: documents about birthdays may describe several different scenes. From the existing birthday document set, the scene categories observed include celebrating a birthday, reflecting on growing up, missing someone, National Day, reflecting on the college entrance examination, and so on. Judging by term frequency and inverse document frequency, documents whose scene is celebrating a birthday are the majority. This suggests an approach to the problem. For the scene concept model, the scene documents obtained are primary-school compositions, whose scenes are mostly unique.

A scene category C corresponds to multiple documents d, each of which describes a unique scene. Scene category C in turn corresponds to many fine-grained sub-scenes, and documents correspond one-to-one with scenes. The sub-scenes are therefore classified: for a given t, if the classification yields one class, the value of t is acceptable; if it yields two or more classes, the value of t is too large.

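In the scene corpus, the scene of each selected document is described mainly through its scenery and is unique. Assume that a document first selects a scene s and then generates the entities needed by the described scene according to s, the scene described by the document being unique. Assuming the scene categories are $s_1, s_2, \ldots, s_k$, the probability of generating the document scene is:

$p(\vec{w}) = p(s_1)\prod_{n=1}^{N} p(w_n \mid s_1) + \cdots + p(s_k)\prod_{n=1}^{N} p(w_n \mid s_k) = \sum_{s} p(s)\prod_{n=1}^{N} p(w_n \mid s)$

After a value of t is selected, the first t items of the scene category feature word list are divided into two scene categories $s_1$ and $s_2$, and a probability analysis is performed on each document. Suppose that for some document $\vec{w}$ the generation probability is:

$p(\vec{w}) = p(s_1)\prod_{n=1}^{U} p(w_n \mid s_1) + p(s_2)\prod_{n=1}^{V} p(w_n \mid s_2)$

where N = U + V, U is the number of entity nouns belonging to scene category $s_1$ and V is the number belonging to scene category $s_2$.

Then the chosen value of t is unsuitable.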
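A small numerical sketch of the mixture probability $\sum_s p(s)\prod_n p(w_n \mid s)$ may help. All probabilities below are made-up illustrative values, and the function names are hypothetical; the patent does not specify how these probabilities are estimated.

```python
import math

def scene_mixture_probability(words, scene_priors, word_given_scene):
    """p(w) = sum over scenes s of p(s) * product over n of p(w_n | s)."""
    total = 0.0
    for scene, prior in scene_priors.items():
        likelihood = math.prod(word_given_scene[scene].get(w, 0.0) for w in words)
        total += prior * likelihood
    return total

def contributing_scenes(words, scene_priors, word_given_scene):
    """Scenes whose term in the sum is non-zero for this document."""
    return [
        scene
        for scene, prior in scene_priors.items()
        if prior * math.prod(word_given_scene[scene].get(w, 0.0) for w in words) > 0
    ]

# Illustrative values only.
scene_priors = {"birthday": 0.8, "national_day": 0.2}
word_given_scene = {
    "birthday": {"mother": 0.5, "cake": 0.5},
    "national_day": {"motherland": 1.0},
}
doc = ["mother", "cake"]
p = scene_mixture_probability(doc, scene_priors, word_given_scene)
scenes = contributing_scenes(doc, scene_priors, word_given_scene)
```

For this document only the birthday term is non-zero, which matches the assumption that each document describes a single scene.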

Table 2

As Table 2 shows, when t is 7, no document is expressed as a linear combination of $s_1$ and $s_2$. When multiple sub-scenes appear among the first t items, the value of t is still judged by a traversal bisection: the split point ranges over [2, t-1], and each split is traversed to judge whether the value of t is suitable.
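One possible reading of this traversal can be sketched as follows. The criterion is an assumption made for illustration, not the patent's stated test: a split of the top t words is treated as revealing two sub-scenes when every document draws on at most one side of the split and both sides are used by some document. The names `separates` and `t_is_suitable` are hypothetical.

```python
def separates(docs, group_a, group_b):
    """True if no document mixes the two word groups and both groups are used."""
    used_a = used_b = False
    for doc in docs:
        words = set(doc)
        in_a, in_b = bool(words & group_a), bool(words & group_b)
        if in_a and in_b:
            return False  # this document combines words from both groups
        used_a = used_a or in_a
        used_b = used_b or in_b
    return used_a and used_b

def t_is_suitable(docs, ranked_words, t):
    """Try every split point in [2, t-1]; t fails if any split separates the documents."""
    top = ranked_words[:t]
    for split in range(2, t):
        if separates(docs, set(top[:split]), set(top[split:])):
            return False
    return True

# Toy ranked feature list and documents (illustrative only).
ranked = ["mother", "cake", "present", "candle", "motherland", "china"]
docs = [
    ["mother", "cake", "candle"],
    ["cake", "present", "mother"],
    ["motherland", "china"],
    ["mother", "present"],
]
largest_t = max(t for t in range(3, len(ranked) + 1) if t_is_suitable(docs, ranked, t))
```

In this toy data the first four words are shared across the birthday documents, so t = 4 passes; adding "motherland" lets a split isolate the third document, so t = 5 and t = 6 fail.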

The above embodiments should be understood as merely illustrating the invention and not as limiting its scope of protection. After reading the disclosure of the invention, a person skilled in the art may make various changes or modifications to the invention, and such equivalent changes and modifications likewise fall within the scope of the claims of the invention.

Claims

1. A method for enhancing scene realism in text-to-scene conversion, characterized by comprising the following steps:

1) Obtain multiple Chinese documents describing a given scene from the internet and build a scene corpus;

2) Segment the Chinese document set describing the scene into words without deduplication, then remove stop words from the segmented Chinese documents;

3) Using the stop-word-filtered segmentation result of step 2), apply word-frequency statistics to the entity nouns in the result to obtain statistical indicators for the entity nouns;

4) Using the entity-noun statistical indicators of step 3), build the feature word list of the scene category corresponding to the document set;

5) Using the scene category feature word list of step 4), analyze and extract the best scene category feature words and build the scene entity dictionary.

2. The method for enhancing scene realism in text-to-scene conversion according to claim 1, characterized in that the scene corpus of step 1) is built from documents of the same scene category, the scene corpus being a document set with evident scene features.

3. The method for enhancing scene realism in text-to-scene conversion according to one of claims 1-2, characterized in that the scene entity model of step 1) represents a scene category as an entity concept by a word vector of entity nouns, $\vec{w} = (w_1, w_2, \ldots, w_t)$, where $w_t$ denotes an entity noun; each scene category corresponds to one group of related word vectors; the subscript t is defined as the threshold of the concept dictionary and is also the length (modulus) of the word vector; a large number of documents of the same category are obtained, and the entity nouns that occur frequently in the documents and are associated with category C are collected into a word vector, where m is defined as the number of entity nouns; from this vector the scene entity dictionary of scene category C is determined.

4. The method for enhancing scene realism in text-to-scene conversion according to claim 1, characterized in that in step 2) the Chinese documents in the scene corpus are segmented into words without deduplication and stop words are then removed from the segmented Chinese documents, specifically:

The acquired documents are first denoised by removing advertising phrases and English links, and are then segmented with the ROST Chinese word segmentation tool.

5. The method for enhancing scene realism in text-to-scene conversion according to claim 1, characterized in that in step 3) word-frequency statistics are applied to the entity nouns in the segmentation result to obtain the statistical indicators of the entity nouns, specifically:

The traditional text-feature TF-IDF model mainly considers the term frequency (TF) and the inverse document frequency (IDF) of a feature item; the term frequency is the number of times a feature item occurs in a document; for the scene concept model, n documents of a category C are obtained to form a document set A, and the number of times an entity noun w occurs in the document set of category C is one of the key references for obtaining the scene concept dictionary;

For each document set A, the frequency of each entity noun occurring in the n documents is counted from the stop-word-filtered Chinese documents;

The word frequency $f_i$ of word $w_i$ in A is defined as

$f_i = \mathrm{count}(w_i, A) / \mathrm{size}(A), \quad 0 < f_i < 1$

where $\mathrm{count}(w_i, A)$ is the number of times word $w_i$ occurs in document set A, and $\mathrm{size}(A)$ is the total number of occurrences of all entity nouns in A;

The inverse document frequency is then applied; IDF quantifies how a feature item is distributed across the document set, and is computed as follows: let the total number of documents in document set A be N and the number of documents containing word w be n; the inverse document frequency in the scene model is then defined as:

6. The method for enhancing scene realism in text-to-scene conversion according to claim 1, characterized in that in step 4) the scene category features are analyzed using the statistical indicators of the entity nouns from step 3), specifically: the list is defined as a word vector $\vec{w}$ of entity nouns, which represents the scene category conceptually; with $p(w_n)$ denoting the probability of scene entity word $w_n$, the probability of generating the scene $\vec{w}$ described by a document is:

$p(\vec{w}) = \prod_{n=1}^{N} p(w_n)$

In the scene corpus, the scene of each selected document is described mainly through its scenery and is unique; assume that a document first selects a scene s and then generates the entities needed by the described scene according to s, the scene described by the document being unique; assuming the scene categories are $s_1, s_2, \ldots, s_k$, the probability of generating the document scene is:

$p(\vec{w}) = p(s_1)\prod_{n=1}^{N} p(w_n \mid s_1) + \cdots + p(s_k)\prod_{n=1}^{N} p(w_n \mid s_k) = \sum_{s} p(s)\prod_{n=1}^{N} p(w_n \mid s)$

After a value of t is selected, the first t items of the scene category feature word list are randomly divided into two scene categories $s_1$ and $s_2$; a probability analysis is then performed on each document; suppose that for some document $\vec{w}$ the generation probability is:

$p(\vec{w}) = p(s_1)\prod_{n=1}^{U} p(w_n \mid s_1) + p(s_2)\prod_{n=1}^{V} p(w_n \mid s_2)$

where N = U + V, U is the number of entity nouns belonging to scene category $s_1$ and V is the number belonging to scene category $s_2$; then the chosen value of t is unsuitable.

7. The method for enhancing scene realism in text-to-scene conversion according to claim 6, characterized in that when multiple sub-scenes appear among the first t items, the value of t is still judged by a traversal bisection, the split point ranging over [2, t-1], and each split is traversed to judge whether the value of t is suitable.