KR20170025424A

KR20170025424A - Paraphrase sentence generation method for a Korean language sentence

Info

Publication number: KR20170025424A
Application number: KR1020150121843A
Authority: KR
Inventors: 최호진; 오교중; 김종명; 권가진; 김현기; 허정; 류법모
Original assignee: 한국과학기술원
Priority date: 2015-08-28
Filing date: 2015-08-28
Publication date: 2017-03-08
Also published as: KR101757222B1

Abstract

A method of generating a paraphrase sentence according to an embodiment of the present invention includes: classifying substantial morphemes in a Korean sentence; replacing a root word classified as a substantial morpheme with a synonym based on linguistic features of the Korean sentence; classifying formal morphemes in the Korean sentence; and grammatically transforming at least one of a postposition and an ending, classified as a formal morpheme in the synonym-substituted Korean sentence, in accordance with postposition and ending transformation rules. This makes it easier for a machine to grasp the intent of a question.

Description

{PARAPHRASE SENTENCE GENERATION METHOD FOR A KOREAN LANGUAGE SENTENCE}

The present invention relates to a method for generating a paraphrase sentence for a Korean sentence.

Extensive research has been conducted in the field of natural language processing, such as analyzing human voice signals, converting words into sentences, and analyzing the constituents of sentences. Through these studies, the technology for inputting human language into machines has advanced considerably.

Understanding the meaning of a natural language sentence, or the intent of a question, is being attempted through various approaches (rule-based, knowledge-based, and machine-learning-based) in search engines, speech recognition, and question answering systems. However, the technology for understanding the meaning of a sentence that a person speaks, or the intent of a question, remains at an early stage. Therefore, there is an increasing need for techniques that allow a machine to understand the meaning of a sentence that a person speaks and, in particular, to grasp the intent of a question more easily.

Korean Patent Publication No. 10-2011-0017129 (published Feb. 21, 2011)

The present invention is intended to provide a technique that allows a machine to understand the meaning of a Korean sentence that a person has spoken or entered and, in particular, to easily understand the intent of a question.

A method for generating a paraphrase sentence according to an embodiment of the present invention includes: classifying substantial morphemes in a Korean sentence; replacing a root word classified as a substantial morpheme with a synonym based on linguistic features of the Korean sentence; classifying formal morphemes in the Korean sentence; and grammatically transforming at least one of a postposition and an ending, classified as a formal morpheme in the synonym-substituted Korean sentence, according to postposition and ending transformation rules.

According to the embodiment of the present invention, it is possible to provide a method of generating a paraphrase of a natural language sentence that enables a machine to understand the meaning of a Korean sentence that a person utters or inputs and, in particular, to easily grasp the intent of a question.

Also, according to the embodiment of the present invention, it is possible to generate a paraphrase sentence of a question-type Korean sentence.

In addition, according to the embodiment of the present invention, it is possible to extract synonyms whose ambiguity within the sentence has been eliminated, based on synonym information and the linguistic features of the root word.

Also, according to the embodiment of the present invention, it is possible to generate a paraphrase sentence with high grammatical accuracy by using the grammatical elements of the Korean sentence together with postposition and ending transformation rules, abbreviation rules, and spacing rules.

Further, according to the embodiment of the present invention, when a machine fails to understand a user's expression in speech recognition, question answering, and the like, it can generate a paraphrase sentence that is easier to understand, which can help in finding the grounds for an answer.

FIG. 1 is a flowchart of a paraphrase sentence generation method for a Korean sentence according to an embodiment of the present invention.
FIGS. 2A, 2B, and 2C respectively show an input sentence, the sentence classified into morphemes, and the sentence analyzed for dependency relations, in a method of generating a paraphrase sentence for a Korean sentence according to an embodiment of the present invention.
FIGS. 3A, 3B, and 3C respectively show the sentence of FIGS. 2A, 2B, and 2C after synonym substitution, after grammatical transformation, and as the final paraphrased sentence.
FIG. 4 illustrates the structure of a paraphrase sentence generation system according to an embodiment of the present invention.

The following detailed description of the invention refers to the accompanying drawings, which illustrate, by way of illustration, specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. It should be understood that the various embodiments of the present invention are different, but need not be mutually exclusive. For example, certain features, structures, and characteristics described herein may be implemented in other embodiments without departing from the spirit and scope of the invention in connection with one embodiment. It is also to be understood that the position or arrangement of the individual components within each disclosed embodiment may be varied without departing from the spirit and scope of the invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is to be limited only by the appended claims, along with the full scope of equivalents to which such claims are entitled, if properly explained. In the drawings, like reference numerals refer to the same or similar functions throughout the several views.

The present invention relates to a method for generating a paraphrase sentence for an input sentence in natural language form. Specifically, it relates to a technique in which the ambiguity of the vocabulary is removed using linguistic features so that a paraphrase sentence with the same meaning as the input sentence is generated.

In particular, the present invention relates to a technique for converting a person's sentence into another Korean sentence with the same meaning so that a machine can better understand the person's utterance. A question-type sentence such as a query generally carries little information, so to grasp the meaning and intent of the question, the amount of information is increased with the paraphrase generation technique so that the sentence can be understood and/or an answer can be found. This specification focuses on paraphrasing question-type sentences, which clearly contain the intention of the questioner, so that the technique can be widely used in search engines, speech recognition, and question answering systems.

Hereinafter, the present invention will be described based on the Korean language, but it will be obvious to those skilled in the art that the present invention can be applied to other languages in the same or a similar manner. Likewise, the generation of a paraphrase sentence is described below for a question-type sentence, but it will be obvious to those skilled in the art that the present invention can also be applied to other types of sentences.

Hereinafter, a method of generating a paraphrase sentence for a Korean sentence according to an embodiment of the present invention will be described with reference to the accompanying drawings.

FIG. 1 is a flowchart of a paraphrase sentence generation method for a Korean sentence according to an embodiment of the present invention. The method may be written in a programming language so that each of its steps can be executed on a computer.

Referring to FIG. 1, a method for generating a paraphrase sentence for a Korean sentence according to an embodiment of the present invention may include: extracting linguistic features of an input sentence (S100); classifying substantial morphemes (S200); analyzing synonym knowledge for the vocabulary corresponding to the substantial morphemes and usage feature statistics for each word (S300); replacing the vocabulary corresponding to the substantial morphemes with disambiguated vocabulary (S400); classifying formal morphemes (S500); grammatically transforming the synonym-substituted sentence according to postposition and/or ending transformation rules (S600); and post-processing the transformed sentence according to abbreviation and/or spacing rules (S700).

In this specification, a paraphrased sentence refers to a sentence generated by paraphrasing an input sentence, for example a question-type Korean sentence. It is produced by replacing part of the expression in the input sentence, such as a question, with synonyms and adjusting the result according to the grammar, so that a machine can better understand Korean natural language. The machine may be a computer, a computer-based robot, or a computer-based device. The paraphrased sentence helps the machine understand the sentence and/or find the answer to a question more easily.

The step of extracting the linguistic features of the input sentence (S100) can be performed using various general sentence feature analysis techniques. In the embodiment of the present invention, the feature extraction step S100 may be performed as a preprocessing step and may include morphological analysis, lexical semantic analysis, entity name recognition, syntax analysis, semantic role recognition, cross-reference analysis, part-of-speech (POS) tagging, and dependency parsing. The feature extraction used in this specification can be performed, for example, with a language analysis system developed by ETRI (Electronics and Telecommunications Research Institute). In this specification, the results of techniques such as morphological analysis, entity name recognition, syntax analysis, and semantic role recognition are used as features for generating paraphrase sentences according to the present invention.

In the embodiment of the present invention, the type of a morpheme means its tag in a morpheme tag set. For example, the 45 tags of the Sejong tag set from the Sejong project are used, which distinguish categories such as general noun, proper noun, dependent noun, and pronoun.

In the embodiment of the present invention, the dependency relation means a dependency relation label, which can be one of 18 kinds such as sentence, noun phrase, verb phrase, and subject.

In the embodiment of the present invention, semantic role recognition (semantic role determination) is a technique that, centering on the predicate of a sentence, recognizes the parts of the sentence that play a semantic role with respect to that predicate and determines those roles. For example, information about an agent, a causer, a theme, a location (loc), etc. may be analyzed.

In the feature extraction step S100, a sentence can be decomposed into morpheme units, and information such as the tag indicating each morpheme's type and its position in the sentence can be analyzed. Proper nouns, compound words, etc. can be extracted through the entity name recognition technique. In addition, through the syntax analysis technique, the tag indicating each constituent's role in the sentence, its dependent word information, and a reliability value can be analyzed. Through the semantic role recognition technique, dependency relations such as subject, object, and modifier corresponding to semantic roles can be defined as a tree structure centered on a predicate (verb or adjective).
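The per-morpheme output of the feature extraction step S100 might be held in a record like the one below. This is a minimal sketch under stated assumptions: the class and field names (`Morpheme`, `pos_tag`, `dep_label`, `semantic_role`, `is_entity`) are hypothetical and are not taken from the ETRI analyzer, and the example values are hand-written rather than real analyzer output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Morpheme:
    """One morpheme plus the S100 features (all field names are assumptions)."""
    surface: str                         # morpheme text
    pos_tag: str                         # morpheme type, e.g. a Sejong-style tag
    position: int                        # 1-based position in the sentence
    dep_label: Optional[str] = None      # dependency relation label, if any
    semantic_role: Optional[str] = None  # e.g. "agent", "theme", "loc"
    is_entity: bool = False              # True if part of a recognized entity name


def by_position(morphemes):
    """Return the morphemes sorted by their position in the sentence."""
    return sorted(morphemes, key=lambda m: m.position)


# Hand-written illustration, not real analyzer output.
sentence = [
    Morpheme("가", "JKS", 2),
    Morpheme("누구", "NP", 1, dep_label="NP_SBJ", semantic_role="agent"),
]
ordered = by_position(sentence)
```

Keeping every feature on one immutable record makes it straightforward for later steps to filter by tag, dependency label, or semantic role without re-running the analysis.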

In the feature extraction step S100, existing lexical resources (for example, the Korean lexical semantic network (KorLex) produced by Pusan National University, the Korean word network (Wordnet) produced by Ulsan University, and the Sejong corpus) and the results of language modeling on documents (news, blogs, etc.) can be used for analysis. In addition to the language analysis system, dictionary resources such as a synonym dictionary may be used in this specification for lexical substitution and the like.

Step S200 classifies the substantial morphemes, which correspond to roots, in the input sentence. This step searches the vocabulary of the input sentence for words that can be substituted. Morphemes such as general nouns, proper nouns, dependent nouns, pronouns, numerals, verbs, auxiliary verbs, negative copulas, determiners, conjunctive adverbs, general adverbs, and interjections can be designated as substantial morphemes.

According to the embodiment, a morpheme that is part of an entity name expression extracted by the entity name recognition technique can be excluded from the substantial morphemes. An entity name may be, for example, a person name, a place name, or an institution name. Such an entity name may be the most important evidence for finding the correct answer in a question answering system that uses the paraphrase sentence generation method according to the embodiment. In many cases, the correct answer to the question cannot be found once the entity name expression has been replaced. Therefore, in the paraphrase sentence generation method according to the embodiment of the present invention, entity names may be excluded from the substantial morphemes and left unreplaced. In the embodiment of the present invention, such entity names can be extracted by dictionary-based preprocessing and postprocessing together with rules based on lexico-semantic patterns. In addition to the language analysis system, proper nouns, compound nouns, and noun phrases (nouns combined with a determiner, prefix, suffix, adnominal ending, or dependent noun) may also be extracted and analyzed.
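The classification of step S200, with entity names excluded, can be sketched as a tag lookup. The sketch below assumes Sejong-style tag names (`NNG`, `NNP`, `JKS`, `EF`, and so on); the exact membership of `SUBSTANTIAL_TAGS`, the function name `classify`, and the hand-tagged example are illustrative assumptions rather than the patent's implementation.

```python
# Tags treated as substantial morphemes in this sketch (an assumed set that
# mirrors the categories listed in the text: nouns, pronouns, numerals,
# verbs, auxiliary verbs, negative copula, determiners, adverbs, interjections).
SUBSTANTIAL_TAGS = {
    "NNG", "NNP", "NNB", "NP", "NR",
    "VV", "VX", "VCN", "MM", "MAJ", "MAG", "IC",
}
# In the Sejong-style tag set, postposition tags start with "J" and
# ending tags start with "E".
FORMAL_PREFIXES = ("J", "E")


def classify(morphemes, entity_positions=frozenset()):
    """Split morpheme indices into (substantial, formal).

    morphemes: list of (surface, tag) pairs.
    entity_positions: indices covered by a recognized entity name; these
    are never offered for synonym substitution.
    """
    substantial, formal = [], []
    for index, (_surface, tag) in enumerate(morphemes):
        if tag.startswith(FORMAL_PREFIXES):
            formal.append(index)
        elif tag in SUBSTANTIAL_TAGS and index not in entity_positions:
            substantial.append(index)
    return substantial, formal


# Hand-tagged toy input; index 0 is marked as an entity name (a place name).
toy = [("경주", "NNP"), ("에서", "JKB"), ("누구", "NP"), ("가", "JKS"),
       ("이기", "VV"), ("었", "EP"), ("니", "EF")]
substantial, formal = classify(toy, entity_positions={0})
```

Passing the entity positions in from outside keeps the classifier independent of whichever entity name recognizer is used.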

The morpheme categories above follow the morpheme classification defined for the Sejong corpus of the National Institute of Korean Language. If a different sentence feature analysis technique is used, the morphemes can be classified according to other classification standards and/or names.

In step S300, which analyzes the synonym knowledge of the root vocabulary and the usage feature statistics for each word, synonym knowledge is first extracted for the vocabulary of the substantial morphemes, that is, the root vocabulary. The synonyms can be extracted using, for example, a synonym knowledge dictionary. Such a synonym knowledge dictionary may be stored in a database inside or outside the apparatus that executes the paraphrase sentence generation method of the present invention.

When paraphrase sentences are generated using synonym knowledge alone, word sense ambiguity arises if a word is replaced with a polysemous word, a hypernym, or a hyponym, so a sentence whose meaning differs from the original may be generated. Therefore, in the embodiment of the present invention, usage statistics for each word are analyzed over a large corpus based on various features, and the ambiguity is resolved using this statistical information. For example, meanings can be distinguished by statistically processing and analyzing the linguistic features extracted in step S100.

The different senses of a homonym can be distinguished by which morpheme type the word is mainly used as and which dependency labels it is mainly tagged with. In addition, semantic role features can be used to distinguish contextual differences in how a word is associated with particular verbs. For example, the Korean word "경주" (gyeongju) can be the place name Gyeongju or the common noun meaning "race". First, the two can be distinguished by morpheme type, as a proper noun versus a general noun. The two also differ in how frequently they appear under each dependency label. Furthermore, the "race" sense may have a statistically stronger association with verbs such as "이기다" (win) or "지다" (lose). In the embodiment of the present invention, ambiguity can be resolved through this kind of statistical feature analysis.
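One way to collect such usage feature statistics is to count, for every word, how often it occurs with each morpheme type and each dependency label. The sketch below is illustrative only: the function names, the triple format of the corpus, and the four toy observations are assumptions, not data from the patent's news corpus.

```python
from collections import Counter, defaultdict


def usage_statistics(tagged_corpus):
    """Count morpheme-type and dependency-label frequencies per word.

    tagged_corpus: iterable of (word, morpheme_type, dependency_label).
    Returns two mappings: word -> Counter of morpheme types, and
    word -> Counter of dependency labels.
    """
    morph_counts = defaultdict(Counter)
    dep_counts = defaultdict(Counter)
    for word, morph_type, dep_label in tagged_corpus:
        morph_counts[word][morph_type] += 1
        dep_counts[word][dep_label] += 1
    return morph_counts, dep_counts


def relative(counter):
    """Turn raw counts into relative frequencies that sum to 1."""
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()}


# Toy observations (made up): two proper-noun uses, two general-noun uses.
toy_corpus = [
    ("경주", "NNP", "NP_AJT"),
    ("경주", "NNP", "NP_AJT"),
    ("경주", "NNG", "NP_OBJ"),
    ("경주", "NNG", "NP_SBJ"),
]
morph_counts, dep_counts = usage_statistics(toy_corpus)
morph_profile = relative(morph_counts["경주"])
```

A profile like `morph_profile` gives each word a distribution over feature values, which is the kind of information a fitness score can compare between a word and its candidate synonyms.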

A large corpus is required to analyze usage feature statistics for each word in order to eliminate ambiguity in synonym substitution. In this specification, a corpus of news documents collected over nine months in 2013 was used. This is merely an example; other sources such as Wikipedia documents or web documents such as blogs may also be used as the corpus. The corpus may be stored in a database inside or outside the apparatus that executes the paraphrase sentence generation method of the present invention.

In step S400, the vocabulary corresponding to a substantial morpheme is replaced with the optimal synonym so that ambiguity is eliminated, based on the extracted synonym knowledge and the analysis of usage feature statistics for each word. In this step, a fitness score can be computed for each candidate synonym from the statistical information, according to features such as morpheme type, dependency relation, and semantic role labeling (SRL). Fitness may be computed only for candidate synonyms that have the same semantic role feature as the vocabulary to be replaced. The fitness of each synonym can be calculated according to Equation (1) below, and the synonym with the highest score may be selected as the substitute word.

Equation (1)

Here, t is the type of the selected morpheme, w_i is the i-th word, Fm_t(w_i) is the frequency of morpheme type t for the i-th word, Fd_t(w_i) is the frequency of dependency type t for the i-th word, Fm_total(w_i) = Σ_t Fm_t(w_i), and Fd_total(w_i) = Σ_t Fd_t(w_i). More specifically, Equation (1) calculates the similarity of the morpheme and dependency features between the vocabulary to be replaced and a synonym. In this specification, a synonym with high fitness is one that, according to the feature analysis, is used with the same morpheme types and dependency relations as the vocabulary to be replaced, at similar frequencies. That is, a synonym that is used frequently with similar morpheme types and/or dependency relations receives a high fitness score.
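Equation (1) itself is not reproduced in this text, so the following is not the patent's formula. It is a hypothetical stand-in that illustrates the idea described above: compare the relative frequencies Fm_t / Fm_total and Fd_t / Fd_total of the target word and of each candidate, and prefer the candidate whose distributions overlap most. The overlap measure (sum of per-type minima), the equal weighting of the two features, and all names and numbers are assumptions.

```python
def relative(freqs):
    """Relative frequencies F_t / F_total for a {type: count} table."""
    total = sum(freqs.values())
    return {t: n / total for t, n in freqs.items()} if total else {}


def overlap(p, q):
    """Histogram overlap in [0, 1]: sum over types of the smaller share."""
    return sum(min(p.get(t, 0.0), q.get(t, 0.0)) for t in set(p) | set(q))


def fitness(target, candidate):
    """Assumed score: mean overlap of morpheme-type and dependency profiles.

    target, candidate: dicts with keys "Fm" and "Fd", each a
    {type: count} frequency table.
    """
    fm = overlap(relative(target["Fm"]), relative(candidate["Fm"]))
    fd = overlap(relative(target["Fd"]), relative(candidate["Fd"]))
    return (fm + fd) / 2


def best_synonym(target, candidates):
    """Return the name of the candidate with the highest fitness."""
    return max(candidates, key=lambda name: fitness(target, candidates[name]))


# Made-up frequency tables for a target word and two candidate synonyms.
target = {"Fm": {"NNG": 8, "NNP": 2}, "Fd": {"NP_OBJ": 5, "NP_SBJ": 5}}
candidates = {
    "a": {"Fm": {"NNG": 9, "NNP": 1}, "Fd": {"NP_OBJ": 6, "NP_SBJ": 4}},
    "b": {"Fm": {"NNP": 10}, "Fd": {"NP_AJT": 10}},
}
chosen = best_synonym(target, candidates)
```

Candidate "a" is used with nearly the same morpheme types and dependency labels as the target, so it scores far higher than "b", which matches the stated intent that synonyms used at similar frequencies with similar features should be preferred.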

In the embodiment, step S500 classifies the formal morphemes of the input sentence, that is, the morphemes corresponding to postpositions and/or endings. To improve the grammaticality of the Korean paraphrase, not only the substantial morphemes but also the formal morphemes corresponding to postpositions and/or endings are classified.

In the embodiment, step S600 grammatically transforms the synonym-substituted sentence in accordance with postposition and/or ending transformation rules. These rules may be stored in a database inside or outside the apparatus that executes the paraphrase sentence generation method of the present invention. The rules used in the embodiment can be compiled from the relevant literature published by the National Institute of Korean Language. The step of grammatically transforming a sentence according to these rules can be implemented as an automated program.
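As an illustration of the kind of rule step S600 could apply, the sketch below re-selects a postposition after the preceding word has been replaced, based on whether that word ends in a final consonant. The rule database of the patent is not reproduced here; the pairs 이/가, 을/를, and 은/는 are standard Korean alternations, while the function names and the table layout are assumptions. The final-consonant test relies on the arithmetic layout of precomposed Hangul syllables in Unicode (U+AC00 to U+D7A3, 28 final-consonant slots per vowel).

```python
HANGUL_BASE = 0xAC00          # first precomposed Hangul syllable
HANGUL_COUNT = 11172          # number of precomposed Hangul syllables


def has_final_consonant(syllable: str) -> bool:
    """True if a precomposed Hangul syllable ends in a final consonant."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < HANGUL_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    return offset % 28 != 0


# Each postposition maps to (form after a final consonant, form after a vowel).
ALTERNATIONS = {}
for after_consonant, after_vowel in [("이", "가"), ("을", "를"), ("은", "는")]:
    ALTERNATIONS[after_consonant] = (after_consonant, after_vowel)
    ALTERNATIONS[after_vowel] = (after_consonant, after_vowel)


def adjust_postposition(word: str, postposition: str) -> str:
    """Pick the postposition form that agrees with the end of `word`."""
    if postposition not in ALTERNATIONS:
        return postposition  # no alternation known for this postposition
    after_consonant, after_vowel = ALTERNATIONS[postposition]
    return after_consonant if has_final_consonant(word[-1]) else after_vowel


subject_marker = adjust_postposition("누구", "이")
```

This covers only one family of rules; ending transformations and irregular cases would need additional tables, which is consistent with the text's description of a separately stored rule database.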

In the embodiment, the post-processing step S700 may be performed on the sentence that was grammatically transformed in step S600. In this post-processing step, the sentence can be modified according to abbreviation and/or spacing rules. These rules may be stored in a database inside or outside the apparatus executing the paraphrase sentence generation method of the present invention. The rules used in the embodiment can be compiled from the relevant literature published by the National Institute of Korean Language. The step of modifying a sentence according to these rules can be implemented as an automated program.
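The post-processing of step S700 can be pictured as two passes over the morpheme sequence: contract adjacent morphemes that have an abbreviated form, then join the result with spacing. The sketch below is a toy under stated assumptions: the two contraction entries are ordinary Korean contractions chosen for illustration, the spacing rule (formal morphemes attach to the preceding token, everything else starts a new word) is a simplification, and none of it is taken from the patent's rule database.

```python
# (left surface, right surface) -> contracted surface; hand-written examples.
CONTRACTIONS = {
    ("누구", "가"): "누가",
    ("이기", "었"): "이겼",
}


def contract(morphemes):
    """Merge adjacent morphemes that have a contracted form.

    morphemes: list of (surface, is_formal) pairs. A merged token keeps
    the is_formal flag of its left-hand part.
    """
    out = []
    for surface, is_formal in morphemes:
        if out and (out[-1][0], surface) in CONTRACTIONS:
            merged = CONTRACTIONS[(out[-1][0], surface)]
            out[-1] = (merged, out[-1][1])
        else:
            out.append((surface, is_formal))
    return out


def apply_spacing(morphemes):
    """Join tokens: formal morphemes attach directly, others follow a space."""
    text = ""
    for surface, is_formal in morphemes:
        if text and not is_formal:
            text += " "
        text += surface
    return text


toy = [("레이스", False), ("에서", True), ("누구", False), ("가", True),
       ("이기", False), ("었", True), ("니", True)]
result = apply_spacing(contract(toy))
```

Running contraction before spacing matters here, because the contracted token inherits the spacing behavior of its left-hand morpheme.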

As described above, according to the embodiment of the present invention, a Korean paraphrase sentence can be generated for, for example, a question-type sentence. The method of generating a Korean paraphrase sentence according to an embodiment of the present invention can be used in search engines, speech recognition, question answering systems, and the like so that a machine can understand a person's utterance and analyze its intent, which can help find the answer.

FIGS. 2A, 2B, and 2C illustrate, for the method of generating a paraphrase sentence for a natural language sentence according to an embodiment of the present invention, the input sentence, the sentence classified according to the morphological analysis result, and the sentence analyzed for dependency relations and semantic roles, respectively.

The sentence input in FIG. 2A is the question-type Korean sentence "Who won the race?". FIG. 2B illustrates the sentence divided into morphemes. Here, morphemes 1 (race), 3 (any), 4 (person), and 6 (win) may be substantial morphemes representing roots in the embodiment. Morphemes 2 (in), 5 (i), 7 (i), and 8 (i) may be formal morphemes representing a postposition or an ending in the embodiment. FIG. 2C shows the sentence classified according to the dependency analysis and the semantic role determination.

FIGS. 3A, 3B, and 3C respectively show the sentence of FIGS. 2A, 2B, and 2C after synonym substitution, after grammatical transformation of postpositions and endings and application of the abbreviation rule, and as the final paraphrased sentence. FIG. 3A illustrates that substantial morpheme 1 (race) is replaced by the disambiguated synonym "lace", substantial morphemes 3 and 4 (some person) are replaced by "who", and substantial morpheme 6 is replaced by the disambiguated synonym "victory". FIG. 3B illustrates that morpheme 5 (i) is converted to "a" according to the postposition transformation rule and morpheme 7 (i) is converted to "was" according to the ending transformation rule. FIG. 3B also illustrates that, according to the abbreviation rule, "someone" and "a" are contracted to "who", and "to" and "to" are combined and converted to "to". FIG. 3C illustrates the paraphrased sentence "Who won in the race?" obtained after these substitutions and transformations.

Depending on the embodiment, additional rules may be applied in the post-processing step S700 so that the paraphrased sentence illustrated in FIG. 3C is further modified. For example, when a word-order change grammar is applied, the sentence "Who won in the race?" can be converted to "Who won the race?". In this case, the sentence "Who won the race?" is the paraphrase sentence according to the embodiment.

Such a paraphrased sentence is more likely to be understood by the machine, which in turn makes it more likely that the machine provides an accurate answer to the question.

A method for generating a paraphrase sentence according to an embodiment of the present invention may be written in a programming language and executed on a computer. For example, the method can be performed by the paraphrase sentence generation system 100 illustrated in FIG. 4.

The paraphrase sentence generation system 100 according to the embodiment of the present invention may include a morpheme classifying unit 10, a synonym substitution unit 20, a sentence modification unit 30, and a database 40. The morpheme classifying unit 10, the synonym substitution unit 20, and the sentence modification unit 30 may each be configured as a separate module, or the components may be combined into one module. The database 40 may be included in the paraphrase sentence generation system 100 or may be an external database to which the paraphrase sentence generation system 100 can connect.

The morpheme classifying unit 10 of the paraphrase sentence generation system 100 can perform the morpheme classification steps S200 and S500 of the paraphrase sentence generation method according to the embodiment of the present invention. Within the morpheme classifying unit 10, the step of classifying substantial morphemes (S200) and the step of classifying formal morphemes (S500) may be implemented in separate components. The synonym substitution unit 20 can perform the step of analyzing the synonym knowledge of the root vocabulary and the usage feature statistics for each word (S300) and the step of replacing the root vocabulary with a disambiguated synonym (S400). Likewise, each step in the synonym substitution unit 20 can be implemented in a separate component. The sentence modification unit 30 can perform the step of grammatically transforming the sentence according to the postposition and/or ending transformation rules (S600) and the post-processing step (S700). Each step in the sentence modification unit 30 can also be implemented in a separate component. The data and knowledge required to perform each step in the morpheme classifying unit 10, the synonym substitution unit 20, and/or the sentence modification unit 30 may be stored in the database 40. At least some of the necessary data and knowledge may instead be held in the morpheme classifying unit 10, the synonym substitution unit 20, and the sentence modification unit 30 themselves.
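The division of work among units 10, 20, and 30 and database 40 could be wired together roughly as follows. This is only a structural sketch: the class name `ParaphraseSystem`, the callable signatures, the dictionary used as a database, and the three toy unit functions are all hypothetical stand-ins chosen to keep the example runnable.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ParaphraseSystem:
    classify: Callable     # stands in for morpheme classifying unit 10 (S200, S500)
    substitute: Callable   # stands in for synonym substitution unit 20 (S300, S400)
    transform: Callable    # stands in for sentence modification unit 30 (S600, S700)
    database: Dict = field(default_factory=dict)  # stands in for database 40

    def run(self, morphemes):
        """Run the three units in order on a list of (surface, tag) pairs."""
        substantial, formal = self.classify(morphemes)
        replaced = self.substitute(morphemes, substantial, self.database)
        return self.transform(replaced, formal, self.database)


def toy_classify(morphemes):
    """Toy unit 10: tags starting with J or E count as formal morphemes."""
    formal = [i for i, (_, tag) in enumerate(morphemes) if tag.startswith(("J", "E"))]
    substantial = [i for i in range(len(morphemes)) if i not in formal]
    return substantial, formal


def toy_substitute(morphemes, substantial, database):
    """Toy unit 20: plain dictionary lookup, with no disambiguation."""
    synonyms = database.get("synonyms", {})
    return [(synonyms.get(s, s), tag) if i in substantial else (s, tag)
            for i, (s, tag) in enumerate(morphemes)]


def toy_transform(morphemes, formal, database):
    """Toy unit 30: concatenate surfaces; real rules would be applied here."""
    return "".join(surface for surface, _ in morphemes)


system = ParaphraseSystem(toy_classify, toy_substitute, toy_transform,
                          database={"synonyms": {"시합": "경기"}})
output = system.run([("시합", "NNG"), ("에서", "JKB")])
```

Injecting the units as callables mirrors the text's point that each step may live in its own component or be merged into one module, and that the database may be internal or external.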

The features, structures, effects, and the like described in the embodiments are included in at least one embodiment of the present invention and are not necessarily limited to only one embodiment. Further, the features, structures, effects, and the like illustrated in each embodiment can be combined or modified for other embodiments by persons having ordinary skill in the art to which the embodiments belong. Therefore, such combinations and modifications should be construed as falling within the scope of the present invention.

While the present invention has been described with reference to exemplary embodiments, these are given by way of illustration and example only and do not limit the present invention. Those skilled in the art will appreciate that various modifications and applications not illustrated above are possible. For example, each component specifically shown in the embodiments can be modified and implemented. All changes and modifications that come within the meaning and range of equivalency of the claims are intended to be embraced therein.

100: paraphrase sentence generation system
10: morpheme classifying unit
20: synonym substitution unit
30: sentence modification unit
40: database

Claims

A method for generating a paraphrase sentence, comprising:
classifying substantial morphemes in a Korean sentence;
replacing a root word classified as a substantial morpheme with a synonym based on linguistic features of the Korean sentence;
classifying formal morphemes in the Korean sentence; and
grammatically transforming at least one of a postposition and an ending, classified as a formal morpheme in the synonym-substituted Korean sentence, according to postposition and ending transformation rules.

The method according to claim 1,
wherein, in the step of classifying the substantial morphemes,
an entity name extracted by an entity name recognition technique is excluded from the substantial morphemes.

The method according to claim 1,
wherein, in the step of replacing with the synonym,
the root word is replaced with a synonym whose ambiguity has been resolved based on usage feature statistics of the root word.

The method according to any one of claims 1 to 3,
wherein the linguistic features include at least one of morpheme type, dependency relation, and semantic role.

The method according to any one of claims 1 to 3,
wherein, in the step of replacing with the synonym,
the synonym with the highest fitness based on dependency relation and morpheme type is selected from among synonym candidates having the same semantic role.

The method according to any one of claims 1 to 3,
further comprising modifying the transformed sentence according to at least one of an abbreviation rule and a spacing rule.

A computer-readable medium having recorded thereon a program for causing a computer to execute, in the paraphrase sentence generation method according to any one of claims 1 to 3, the step of grammatically transforming at least one of a postposition and an ending in a Korean sentence according to postposition and ending transformation rules.

A computer-readable medium having recorded thereon a program for causing a computer to execute, in the paraphrase sentence generation method according to claim 6, the step of modifying a Korean sentence according to at least one of an abbreviation rule and a spacing rule.