CN110362691B - Syntax tree bank construction system - Google Patents

Publication number
CN110362691B
Authority
CN
China
Prior art keywords
word
syntax tree
labeling
sentence
chunk
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910656652.3A
Other languages
Chinese (zh)
Other versions
CN110362691A (en)
Inventor
王伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Yuzhixing Technology Co ltd
Original Assignee
Dalian Yuzhixing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Yuzhixing Technology Co ltd
Priority to CN201910656652.3A
Publication of CN110362691A
Application granted
Publication of CN110362691B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a syntax tree library construction system that mainly comprises a word segmentation labeling module, a word sense labeling module, a chunk connection module, a component identification and component relation labeling module, and a syntax tree proofreading module. The system enables more people to take part in syntax tree construction, so that a large-scale, multi-domain, high-quality syntax tree library can be built. It addresses the high cost, low efficiency, poor consistency, small scale, narrow domain coverage, and slow updating of traditional syntax tree construction methods, as well as the limitation that labeling can only be carried out on a larger screen.

Description

Syntax tree bank construction system
Technical Field
The invention relates to the technical field of syntactic analysis in natural language processing, and in particular to a syntax tree library construction system.
Background
A syntax tree represents the syntactic analysis of a natural language sentence as a tree structure; each node carries rich annotation that reflects the granularity of the analysis. A syntax tree library built from a large number of syntax trees is an important resource for automatic syntactic analysis; in particular, a supervised-learning parser can only be used after it has been trained on such a library. Several manually constructed syntax tree libraries already exist, such as the Penn Treebank PTB (English) and CTB (Chinese), the Tsinghua Chinese Treebank TCT, and the Taiwan Sinica Chinese treebank, and each uses a different labeling scheme.
The size and quality of a syntax tree library are critical to the performance of an automatic parser: the larger and better the library, the better the automatic analysis. Existing syntax tree libraries, however, are generally small in scale and narrow in domain. The main reason is that the traditional labeling method requires annotators to be linguists or to have some linguistic background, and they can label syntax trees only after memorizing the symbols specific to a labeling scheme. The Tsinghua Chinese Treebank TCT, for example, has a part-of-speech tag set (noun n, verb v, adjective a, adverb d, and so on), a component identifier set (NP, VP, PP, DP, and so on), and a component relation identifier set (VP-SB, VP-RT, fj-BL, fj-LG, and so on). To guarantee quality, dedicated proofreaders must also check the labeling results by hand. The method therefore places high demands on both annotators and proofreaders; when sentences from a specialized field are labeled, both must have linguistic as well as domain knowledge. Few people meet these conditions, which severely limits how many can take part in labeling. A conventional syntax tree labeling process is shown in Fig. 1.
Disclosure of Invention
In view of the high cost, low efficiency, poor consistency, small scale, narrow domain coverage, and slow updating of traditional syntax tree construction methods, and the fact that they cannot be used on mobile devices with small screens, the invention provides a syntax tree library construction system that enables more people to take part in syntax tree construction, so that a large-scale, multi-domain, high-quality syntax tree library can be built.
The invention adopts the following technical means:
A syntax tree library construction system mainly comprises:
a word segmentation labeling module, which labels the word segmentation of a pre-segmented sentence;
a word sense labeling module, which labels the word senses of the sentence after word segmentation labeling;
a chunk connection module, which connects the words of the sense-labeled sentence into chunks and converts the chunk connection information into a syntax tree;
a component identification and component relation labeling module, which automatically labels the syntactic component identifiers and component relations after chunking;
and a syntax tree proofreading module, which automatically proofreads the labeling results to obtain the final labeling result.
Further, the word segmentation labeling module is configured to combine morphemes into a word in response to a first mode operation, and/or to break a word down into morphemes in response to a second mode operation.
Further, the word sense labeling module is configured to select the corresponding candidate word sense from the candidate sense list of an ambiguous word in response to a third mode operation.
Further, before word sense labeling, the word sense labeling module builds the candidate sense list of each ambiguous word from a word sense dictionary.
Further, the chunk connection module comprises a chunk connection part and a syntax tree generation part;
the chunk connection part is configured, in response to a fourth mode operation, to combine at least two adjacent words into a chunk and then to combine at least two adjacent words or chunks again, repeating until the whole sentence forms one complete chunk;
the syntax tree generation part stores all the information about the completed chunk connections and converts it into a syntax tree.
Further, the component identification and component relation labeling module is specifically configured to:
train, by machine learning, on a training library formed from the word sense information and chunk identifiers in a small number of syntax trees labeled manually in advance, and then label sentence component identifiers automatically;
train, by machine learning, on a training library formed from the word sense information and syntactic component relations in a small number of syntax trees labeled manually in advance, and then label sentence component relations automatically.
Further, the syntax tree proofreading module is specifically configured to filter and classify the labeling results and accumulate votes; once the absolute number of votes is enough to determine one labeling result as final, further labeling of the sentence stops, which completes the automatic proofreading of the labeling results.
Compared with the prior art, the invention has the following advantages:
1. The invention lowers the threshold for syntax tree labeling. An annotator does not need advanced linguistic knowledge and can work as long as they understand their native language normally, so more people can take part in labeling and a larger syntax tree library can be built more easily.
2. When labeling with the system of the invention, the annotator does not need to master the many complex part-of-speech and component identifiers, so identifiers cannot be mislabeled. The operation is simple and labeling is efficient, so a large-scale, high-quality syntax tree library can be built more quickly. For text from a specialized field, anyone who can read text in that field can label it, so large-scale domain-specific syntax tree libraries can be built.
3. The syntax tree labeling method can be used for all languages belonging to the mapping class, so multilingual syntax tree libraries can be built easily, and syntax tree libraries for the world's low-resource languages, including the languages of China's ethnic minorities, can be built more quickly.
4. Automatic machine proofreading reduces the need for highly qualified proofreaders and avoids the bottleneck in which proofreading cannot keep pace when a large-scale syntax tree library is built with too few proofreaders. The same sentence is labeled by enough people until one labeling result wins by an absolute margin, which ensures the accuracy of the result.
5. Labeling takes place only within the screen area occupied by the words of the sentence, so it can be done on the small screen of a mobile phone. Annotators can use fragments of spare time and, given a network connection, take part anytime and anywhere, which makes it easy to build a large-scale, multi-domain, promptly updated syntax tree library.
The invention can supply a variety of syntax tree library resources to the natural language processing field, improving the performance of parsers and of other application systems that rely on syntax tree library information. For these reasons, the invention can be widely applied in natural language processing.
Drawings
To illustrate the embodiments of the invention and the technical solutions of the prior art more clearly, the drawings used in their description are briefly introduced below. The drawings described below show only some embodiments of the invention; a person skilled in the art can derive other drawings from them without inventive effort.
Fig. 1 is a block diagram of a conventional syntax tree construction process.
Fig. 2 is a block diagram of the syntax tree construction process of the invention.
Fig. 3 is an example of an input sentence.
Fig. 4 is an example of the word segmentation labeling process for an input sentence.
Fig. 5 is the result of automatic word sense labeling of an input sentence.
Fig. 6 shows the selection of a candidate word sense for an input sentence.
Fig. 7 is the result after word segmentation and word sense labeling of an input sentence are complete.
Fig. 8 shows the connection operations on a sentence from the first chunk to the final complete chunk.
Fig. 9 is a flow chart for combining two nodes connected on the screen into a new node.
Fig. 10 is a basic flow chart for recursively generating a syntax tree from the generated node array.
Fig. 11 shows the recursively generated syntax tree in bracket notation.
Fig. 12 shows the recursively generated syntax tree in visual form.
Fig. 13 shows the process of deriving parts of speech, component identifiers, and component relations from semantic code sequences.
Fig. 14 is a flow chart of the automatic proofreading of syntax tree labeling results.
Fig. 15 is a block diagram of the client/server system of the syntax tree labeling platform.
Detailed Description
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other. The invention will be described in detail below with reference to the drawings in connection with embodiments.
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. The following description of at least one exemplary embodiment is merely exemplary in nature and is in no way intended to limit the invention, its application, or uses. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
As shown in Fig. 2, the invention provides a syntax tree library construction system that mainly comprises the following modules.
The word segmentation labeling module labels the word segmentation of a pre-segmented sentence. It is configured to combine morphemes into a word in response to a first mode operation, and/or to break a word down into morphemes in response to a second mode operation.
Specifically, a sentence from the text to be labeled is input; in general, the input sentence has not been segmented. Word segmentation labeling normally uses an existing, mature segmentation tool for pre-segmentation, after which the annotator corrects the errors in the pre-segmentation result, that is, an "automatic plus a small amount of manual work" approach. This step is skipped for sentences that are already segmented and for languages that need no segmentation, such as English. In a preferred embodiment, the first mode operation is quickly clicking or dragging across adjacent morphemes, or quickly clicking the first and last characters of the adjacent morphemes that make up a word, which combines them into a word. A long press on the target word serves as the second mode operation, which decomposes the word into morphemes.
The word sense labeling module labels the word senses of the sentence after word segmentation labeling. It is configured to select the corresponding candidate word sense from the candidate sense list of an ambiguous word in response to a third mode operation. Before word sense labeling, the module builds the candidate sense list of each ambiguous word from a word sense dictionary.
Specifically, the words in the sentence are sense-labeled with a word sense dictionary such as Tongyici Cilin (a Chinese synonym thesaurus), HowNet, or WordNet. Words with a single sense are labeled automatically by machine, and ambiguous words are then labeled manually, again an "automatic plus a small amount of manual work" approach. Word sense labels also provide the basis for the later labeling related to syntactic components. The corresponding step in traditional labeling is part-of-speech tagging, which requires annotators to memorize and distinguish parts of speech and is inconvenient and error-prone. In a preferred embodiment, a confirming click serves as the third mode operation, which selects the candidate word sense.
The chunk connection module connects the words of the sense-labeled sentence into chunks and converts the chunk connection information into a syntax tree. It comprises a chunk connection part and a syntax tree generation part. The chunk connection part is configured, in response to a fourth mode operation, to combine at least two adjacent words into a chunk and then to combine at least two adjacent words or chunks again, repeating until the whole sentence forms one complete chunk. The syntax tree generation part stores all the information about the completed chunk connections and converts it into a syntax tree.
Specifically, in a preferred embodiment, the fourth mode operation is a connection operation that combines two or more adjacent words or chunks into a new chunk; it can be performed by a quick click or drag on the screen. Chunk connection for a sentence starts from the words of the sentence and continues until the whole sentence is assembled into one complete chunk. The annotator only needs to decide which positions are to be connected directly into chunks; the later labeling steps need no manual involvement. This operation can be carried out in a small screen space, such as a mobile phone screen. Traditional labeling methods display the tree structure directly, which takes up more screen space, so they are generally used on a computer with a larger screen; this makes labeling on the move inconvenient and discourages wider participation.
The component identification and component relation labeling module automatically labels the syntactic component identifiers and component relations after chunking. It trains, by machine learning, on a training library formed from the word sense information and chunk identifiers in a small number of syntax trees labeled manually in advance, and then labels sentence component identifiers automatically; likewise, it trains on a training library formed from the word sense information and syntactic component relations in a small number of manually pre-labeled syntax trees, and then labels sentence component relations automatically.
Specifically, to label syntactic component identifiers after chunking, a model is first trained by machine learning on a training library formed from the word sense information and chunk identifiers in a small number of manually pre-labeled syntax trees, and labeling is then automatic. Traditional component identifier labeling is almost entirely manual: complicated component identifiers have to be memorized, the operation is inconvenient and error-prone, and labeling efficiency is low. Component relation labeling follows similar steps to component identifier labeling, but the information involved is different. To label syntactic component relations after chunking, a model is trained by machine learning on a training library formed from the word sense information and syntactic component relations in a small number of manually pre-labeled syntax trees, and labeling is then automatic. Traditional component relation labeling is likewise almost entirely manual, with the same drawbacks: complicated identifiers to memorize, inconvenient and error-prone operation, and low efficiency.
The syntax tree proofreading module automatically proofreads the labeling results to obtain the final labeling result. It is specifically configured to filter and classify the labeling results and accumulate votes; once the absolute number of votes is enough to determine one labeling result as final, further labeling of the sentence stops, which completes the automatic proofreading.
Specifically, the labeling results from several different annotators are filtered and classified automatically and their votes are accumulated. When one labeling result can be determined as final from the absolute number of votes, labeling of the sentence stops, and the automatic proofreading of the labeling result is complete. Traditional proofreading is done by hand and places high demands on proofreaders; large volumes of labeling results need more proofreaders, and the qualifications, number, quality, scale, and efficiency of proofreading all strongly affect the construction of a large-scale syntax tree library.
The scheme of the invention is further described below through a specific implementation example.
Example 1
As shown in Fig. 15, this embodiment provides a client/server syntax tree labeling platform that makes mass participation easy. It allows many people to label at the same time and can run on a local server or a remote cloud server. This embodiment takes phrase-structure syntax trees as an example; the working process is as follows:
1. Input sentence
Assume that a sentence of n words, "W1, W2, W3, …, Wn", is input, as in Fig. 3; it has already been pre-segmented by a word segmentation tool. For ease of explanation, assume that the word "W2" consists of 2 characters, the word "W5" consists of 3 characters, and every other word consists of 1 character.
2. Word segmentation annotation
Although a word segmentation tool has already segmented the sentence automatically, the tool cannot guarantee that every segmentation is correct, so word segmentation labeling is a checking and correcting step. If the segmentation is entirely correct, the annotator proceeds directly to the next step. If there are individual segmentation errors, the annotator can quickly click or drag morphemes to combine them into a word, or long-press a word to split it into several morphemes; the process is illustrated in Fig. 4. For example, quickly clicking "W3" and "W4" combines them into "W34"; clicking the first and last elements of a word, such as "W6" and "W8", combines them into the word "W678"; and long-pressing "W2" splits it into "W2-1" and "W2-2".
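The merge and split corrections described above can be sketched as two list operations. This is only an illustration under stated assumptions: the function names `merge_span` and `split_word`, the list-of-strings representation, and the placeholder letters standing in for W1 to W8 are hypothetical and do not come from the patent.

```python
from typing import List


def merge_span(tokens: List[str], first: int, last: int) -> List[str]:
    """Merge tokens[first..last] (inclusive) into one word.

    Models the first mode operation: the annotator clicks the first and
    last elements of the span that should become a single word.
    """
    if not 0 <= first < last < len(tokens):
        raise ValueError("need 0 <= first < last < len(tokens)")
    merged = "".join(tokens[first:last + 1])
    return tokens[:first] + [merged] + tokens[last + 1:]


def split_word(tokens: List[str], index: int, cut: int) -> List[str]:
    """Split tokens[index] into two morphemes at character offset `cut`.

    Models the second mode operation (long press). The patent does not say
    how the cut position is chosen, so it is passed in explicitly here.
    """
    word = tokens[index]
    if not 0 < cut < len(word):
        raise ValueError("cut must fall strictly inside the word")
    return tokens[:index] + [word[:cut], word[cut:]] + tokens[index + 1:]


# Placeholder sentence: W2 has 2 characters, W5 has 3, the rest have 1.
tokens = ["a", "bc", "d", "e", "fgh", "i", "j", "k"]
print(merge_span(tokens, 2, 3))   # W3 + W4 -> "W34"
print(merge_span(tokens, 5, 7))   # W6..W8 -> "W678"
print(split_word(tokens, 1, 1))   # W2 -> "W2-1", "W2-2"
```

Both functions return a new list rather than editing in place, which would make it straightforward to keep an undo history of annotator corrections.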
3. Word sense labeling
1) Words with a single sense are labeled automatically by machine according to the designated semantic classification dictionary. Words are distinguished by background color: for example, green (shown without diagonal hatching here, for black-and-white reproduction) marks a word that needs no labeling, and red (shown with diagonal hatching) marks an ambiguous word that needs word sense labeling. The initial automatic word sense labeling result for a sentence is illustrated in Fig. 5.
2) Following the semantic dictionary, clicking an ambiguous word displays one row of example words for each of its candidate senses; clicking the row for the intended sense determines the word sense directly, see Fig. 6. For example, the word "W2" has m candidate senses, corresponding to example-word rows "S1" to "Sm". The sense codes themselves are not actually displayed (indicated by the dotted-grid background); only several example words for each sense are shown, which makes selection easier. If the context shows that the sense of "W2" is the one represented by the example words in row 2, clicking anywhere in row 2 automatically assigns the sense code "S2" to "W2", completing the word sense labeling.
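A minimal sketch of this two-stage sense labeling follows. The dictionary contents, the sense codes, and the function names are all hypothetical; a real system would load a resource such as the ones named earlier. Treating a word that is missing from the dictionary as "needs manual attention" is also an assumption, since the patent does not cover that case.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy dictionary: word -> list of (sense_code, example_words).
SenseDict = Dict[str, List[Tuple[str, List[str]]]]

SENSE_DICT: SenseDict = {
    "bank": [("S1", ["shore", "riverside"]), ("S2", ["lender", "credit union"])],
    "river": [("S3", ["stream", "creek"])],
}


def pre_annotate(words: List[str], sense_dict: SenseDict) -> List[Tuple[str, Optional[str]]]:
    """Auto-label single-sense words; leave the others as None for the annotator."""
    result = []
    for word in words:
        senses = sense_dict.get(word, [])
        code = senses[0][0] if len(senses) == 1 else None
        result.append((word, code))
    return result


def example_rows(word: str, sense_dict: SenseDict) -> List[List[str]]:
    """Rows shown to the annotator: example words only, sense codes stay hidden."""
    return [examples for _code, examples in sense_dict[word]]


def choose_sense(word: str, row: int, sense_dict: SenseDict) -> str:
    """Map the clicked row index back to the hidden sense code."""
    return sense_dict[word][row][0]


print(pre_annotate(["river", "bank"], SENSE_DICT))
print(example_rows("bank", SENSE_DICT))
print(choose_sense("bank", 1, SENSE_DICT))
```

Keeping the sense code out of `example_rows` mirrors the point made above: the annotator chooses among familiar example words and never has to read or remember a code.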
4. Chunk connection
Two (or more) adjacent words or adjacent chunks are combined by connection into a new chunk, through a quick click or drag at the corresponding position. So that syntax tree labeling can be completed within a fixed screen area (the area covered by all the words of the sentence), the boundary between two words or two chunks is indicated by a change of background color (shown here, for black-and-white reproduction, by a dotted background in the word box). Fig. 7 shows the initial state of a sentence after word segmentation and word sense labeling. Chunk connection proceeds in two steps, described in 1) and 2) below.
1) Connecting chunks step by step until all connections are complete
First, chunks are formed from the word level by quick clicks or drags. For example, clicking "W2" and "W3" at the start forms the chunk "W2W3", and the background colors between chunks change automatically; here the background color of "W1" changes, see Fig. 8. The chunk connection operation is essentially the same gesture as the word-combining operation in the word segmentation labeling step, but the result it produces is fundamentally different. Chunk connection then continues over the rest of the sentence until the final complete chunk is formed. Fig. 8 shows the process from the initial state, in which the words of the sentence alternate between two background colors, to the final complete chunk, which has a single color.
2) Converting all the information of the complete chunk connection into a syntax tree
a. Each time two nodes connected on the screen are combined into a new node, the information is written into a node array and the resulting chunk is displayed; the flow chart is shown in Fig. 9. The flow chart describes the binary-tree case with two nodes; the operation can be extended to a ternary tree, which is not described further here.
b. The written node array is converted automatically into a syntax tree. The conversion is described here recursively; the flow chart is shown in Fig. 10. Other algorithms can also be used, which are not described further here.
For example, for the sentence "W1, W2, W3, …, Wn" in Fig. 8, the step-by-step chunking operations produce a node array, and recursive calls on that array produce the syntax tree. Fig. 11 shows the result in bracket notation and Fig. 12 shows it in visual form.
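Steps a and b can be sketched together for the binary case. The actual procedures are given only in Figs. 9 and 10, which are not reproduced in this text, so the `Node` layout, the adjacency check based on word spans, and the function names below are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    start: int                    # index of the first word covered
    end: int                      # index of the last word covered
    word: Optional[str] = None    # set for leaves only
    left: Optional[int] = None    # child positions in the node array
    right: Optional[int] = None


def make_leaves(words: List[str]) -> List[Node]:
    """Start the node array with one leaf per word."""
    return [Node(i, i, word=w) for i, w in enumerate(words)]


def connect(nodes: List[Node], left: int, right: int) -> int:
    """Step a: record the join of two adjacent nodes as a new node."""
    if nodes[left].end + 1 != nodes[right].start:
        raise ValueError("only adjacent words or chunks can be connected")
    nodes.append(Node(nodes[left].start, nodes[right].end, left=left, right=right))
    return len(nodes) - 1


def to_brackets(nodes: List[Node], index: int) -> str:
    """Step b: recursively convert the node array into bracket notation."""
    node = nodes[index]
    if node.word is not None:
        return node.word
    return f"({to_brackets(nodes, node.left)} {to_brackets(nodes, node.right)})"


nodes = make_leaves(["W1", "W2", "W3", "W4"])
a = connect(nodes, 1, 2)         # W2 + W3
b = connect(nodes, 0, a)         # W1 + (W2 W3)
root = connect(nodes, b, 3)      # ... + W4
print(to_brackets(nodes, root))  # ((W1 (W2 W3)) W4)
```

The sketch does not check whether a node has already been consumed by an earlier join; a production version would also need that guard.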
5. Component identification labeling
Because word sense labels already exist, every word has a corresponding semantic code. The component identifier of a connected chunk is formed by extracting the semantic codes of its two words (the codes can be simplified, as described below). The component identifier of a new chunk can in turn be combined with other words or chunks to form the identifier of a further new chunk. For ease of explanation, the example sentence "Humans invented writing five thousand years ago." is used; see Fig. 13. For example, the word-level codes "Dn04" and "Dn05" on the first layer are combined and labeled "Dn" on the second layer, and this new chunk identifier "Dn" and the word-level code "Ca18" on the third layer form the new chunk identifier "Dn Ca" on the fourth layer. The identifier content extracted during chunking can be simplified as circumstances allow, provided that the identifiers in the whole set remain effectively distinguishable. For example, only the first two characters, "Dn", of the semantic code "Dn04" are kept. The simplification of chunk identifiers is carried out by machine learning, on the premise that the whole identifier set can still be effectively distinguished, for example "Dn04 + Dn05 => Dn" (layer 3) and "Gb02 + Dk05 => Gb Dk" (layer 3). A syntax tree whose component identifiers are represented entirely by semantic code sequences is called a simplest semantic syntax tree. Compared with the traditional syntax tree representation, it has a low labeling error rate, good consistency, and high information value, because no manual operation is involved.
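The way chunk identifiers are composed from semantic codes can be illustrated with a fixed rule. The patent says the simplification is obtained by machine learning; the sketch below replaces that with a hand-written rule (keep a two-character prefix of each code and collapse adjacent repeats), which is an assumption chosen only because it reproduces the examples quoted above. The function name `chunk_identifier` is hypothetical.

```python
def chunk_identifier(left: str, right: str, keep: int = 2) -> str:
    """Combine two identifiers (word codes or chunk identifiers) into one.

    Each space-separated piece is cut to its first `keep` characters, and a
    piece equal to the one just before it is dropped.
    This is a hand-written stand-in for the learned simplification.
    """
    parts = []
    for identifier in (left, right):
        for piece in identifier.split():
            short = piece[:keep]
            if not parts or parts[-1] != short:
                parts.append(short)
    return " ".join(parts)


print(chunk_identifier("Dn04", "Dn05"))   # Dn
print(chunk_identifier("Dn", "Ca18"))     # Dn Ca
print(chunk_identifier("Gb02", "Dk05"))   # Gb Dk
```

A fixed prefix length is the simplest choice; whether two characters keep the identifier set distinguishable depends on the dictionary's coding scheme, which is why the patent leaves that decision to a learned procedure.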
Because the traditional syntax tree representation may still need to be supported, the simplest semantic syntax tree can be converted into a traditional syntax tree. The specific method is as follows: word senses are labeled on a traditional syntax tree bank of a certain scale, and the component identifiers of the corresponding chunks of the simplest semantic syntax tree are then formed according to the chunk connection relations of the original syntax trees. This yields a training syntax tree library in which traditional syntactic component identifiers correspond to the component identifiers of the simplest semantic syntax tree, and the mapping between the two is obtained by machine learning. When a syntax tree is labeled, the semantic code sequences of the chunks are converted by machine learning into the parts of speech, syntactic components and so on of the traditional representation. The component identifier can be derived from the semantic codes, for example "Gb02 + Dk05 => VP + NP"; "Gb02" clearly carries much richer information than "VP". Traditional part-of-speech tags can be derived directly from the semantic codes in the same way as syntactic components. Because semantic codes contain more information than parts of speech, part-of-speech information may not be needed at all; parts of speech are extracted and converted into traditional syntactic identifiers only for compatibility with the structural information of traditional syntax trees.
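How a mapping from semantic identifiers to traditional identifiers might be obtained from aligned training pairs can be sketched with a simple frequency count. This is a deliberately simplified stand-in for the machine learning step, which the description does not specify; the function name `learn_mapping` and the toy training counts are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Dict, Iterable, Tuple


def learn_mapping(pairs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each semantic identifier to its most frequent traditional label.

    A majority-count baseline used here only as a stand-in for the
    unspecified machine learning method.
    """
    counts: DefaultDict[str, Counter] = defaultdict(Counter)
    for semantic_id, traditional_label in pairs:
        counts[semantic_id][traditional_label] += 1
    return {sid: c.most_common(1)[0][0] for sid, c in counts.items()}


if __name__ == "__main__":
    # Toy aligned pairs; the counts are invented for illustration.
    training = [
        ("Gb02", "VP"),
        ("Gb02", "VP"),
        ("Gb02", "NP"),
        ("Dk05", "NP"),
    ]
    mapping = learn_mapping(training)
    print(mapping["Gb02"], mapping["Dk05"])
```

A real system would presumably condition on context rather than on a single identifier; the sketch only shows the shape of the training data, in which each semantic identifier is paired with the traditional identifier found at the same position in the original tree.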
6. Component relationship labeling
As with component identification labeling, component relations are converted directly from the semantic code sequences during chunking by machine learning, for example "Gb02 + Dk05 => verb-object"; the details are not repeated here.
7. Syntax tree collation
The labeling results that multiple annotators produce for a sentence are automatically filtered, classified and tallied by cumulative voting until one labeling result can be determined as final by its absolute number of votes, at which point labeling of that sentence stops. This completes the automatic collation of the syntax tree labeling results. The basic flow of the collation process is shown in fig. 14.
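The filter, classify and vote loop can be sketched as below. The stopping rule is an assumption: the description speaks only of an "absolute" vote count, so this sketch stops once one result has at least `min_votes` votes and more than half of all valid votes. The function name `collate` and both thresholds are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional


def collate(annotations: Iterable[str], min_votes: int = 3) -> Optional[str]:
    """Accumulate annotators' bracketed trees until one result clearly wins.

    Assumed stopping rule (not specified in the source): the leading
    result needs at least `min_votes` votes and a strict majority.
    """
    votes: Counter = Counter()
    for result in annotations:
        if not result.strip():
            # Filtering: skip empty or invalid submissions.
            continue
        # Classification and accumulation: identical trees share one bucket.
        votes[result] += 1
        top, top_count = votes.most_common(1)[0]
        total = sum(votes.values())
        if top_count >= min_votes and top_count * 2 > total:
            # Decided: labeling of this sentence can stop.
            return top
    # Undecided: more annotators are needed.
    return None


if __name__ == "__main__":
    submitted = ["(a (b c))", "((a b) c)", "(a (b c))", "(a (b c))"]
    print(collate(submitted))
```

Returning `None` signals that the sentence should remain in the labeling queue, which corresponds to continuing to collect labeling results until a winner emerges.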
The invention lowers the threshold for syntax tree labeling. Annotators do not need advanced linguistic knowledge; any ordinary person who understands the language as a native speaker can do the work, so more people can take part in labeling and a larger syntax tree library is easier to build. The labeling process does not require mastering complex part-of-speech and component identifiers, which avoids identifier labeling errors; the operation is simple and efficient, so a large-scale, high-quality syntax tree library can be built more quickly. Text from a specialized field can be labeled by anyone able to read and understand text in that field, so large-scale domain-specific syntax tree libraries can be built. The labeling method can be applied to all languages belonging to the mapping class, so syntax tree libraries for multiple languages are easy to obtain, and libraries of corresponding scale can be built more quickly for the world's low-resource languages, including the minority languages of China. The automatic machine collation mechanism reduces the number of highly qualified proofreaders required and avoids the bottleneck in which proofreading cannot keep up with the volume of labeling results when a large-scale syntax tree library is built with too few proofreaders. Because enough annotators label each sentence until one labeling result wins by an absolute vote count, the correctness of the labeling result can be ensured.
Labeling takes place only within the screen area occupied by the words of the sentence, so the work can be done on the small screen of a mobile phone. Annotators can use fragments of spare time and take part anytime and anywhere a network connection is available, which makes it easy to build a large-scale, multi-domain syntax tree library that is updated in a timely manner. The invention can provide a variety of syntax tree library resources for the natural language processing field, improving the performance of syntactic parsers and of other application systems that rely on syntax tree library information.
Finally, it should be noted that the above embodiments only illustrate the technical solution of the invention and do not limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in those embodiments can still be modified, and some or all of their technical features can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims (2)

1. A syntax tree library construction system, comprising:
a word segmentation labeling module, used for labeling the word segmentation of sentences that have been pre-segmented, the module combining morphemes into words in response to a first mode operation and/or breaking words down into morphemes in response to a second mode operation;
a word sense labeling module, used for labeling word senses in sentences whose word segmentation has been labeled, the module selecting the corresponding candidate word sense from the candidate word sense list of an ambiguous word in response to a third mode operation, and using a word sense dictionary to construct the candidate word sense list of the ambiguous word before word senses are labeled;
a chunk connection module, used for chunking sentences whose word senses have been labeled and converting the chunk connection information into a syntax tree, the module comprising a chunk connection part and a syntax tree generation part, wherein the chunk connection part, in response to a fourth mode operation, combines at least two adjacent words into a chunk, then combines at least two adjacent words or chunks again, and repeats the combination until the whole sentence is combined into one complete chunk; and the syntax tree generation part stores all the information of the completed chunk connection and converts it into a syntax tree;
a component identification and component relation labeling module, used for automatically labeling syntactic component identifiers and component relations after chunking, and specifically used for: training in advance, by machine learning, on a training library formed from the word sense information and chunk identifiers in a small number of manually labeled syntax trees, so as to label sentence component identifiers automatically; and training in advance, by machine learning, on a training library formed from the word sense information and syntactic component relations in a small number of manually labeled syntax trees, so as to label sentence component relations automatically;
a syntax tree correction module, specifically used for filtering, classifying and cumulatively voting on the labeling results until one labeling result can be determined as the final result by its absolute number of votes, whereupon further labeling of the sentence stops and automatic correction of the labeling result is complete.
2. The syntax tree library construction system according to claim 1, wherein the chunk connection module chunking the sentence whose word senses have been labeled and converting the chunk connection information into a syntax tree specifically comprises: combining two or more adjacent words or chunks into a new chunk by a connection operation serving as the fourth mode operation, wherein the connection operation is performed by a quick click or drag on the screen, and the chunk connection of a sentence to be labeled starts from the words of the sentence and continues until the whole sentence is finally combined into one complete chunk.
CN201910656652.3A 2019-07-19 2019-07-19 Syntax tree bank construction system Active CN110362691B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910656652.3A CN110362691B (en) 2019-07-19 2019-07-19 Syntax tree bank construction system

Publications (2)

Publication Number Publication Date
CN110362691A CN110362691A (en) 2019-10-22
CN110362691B true CN110362691B (en) 2023-06-02

Family

ID=68221300

Country Status (1)

Country Link
CN (1) CN110362691B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111191425A (en) * 2020-01-02 2020-05-22 北京科技大学 Add husband's grammar analysis drawing system
CN112528670B (en) * 2020-12-01 2022-08-30 清华大学 Word meaning processing method and device, electronic equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103500160A (en) * 2013-10-18 2014-01-08 大连理工大学 Syntactic analysis method based on sliding semantic string matching

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103853824B (en) * 2014-03-03 2017-05-24 沈之锐 In-text advertisement releasing method and system based on deep semantic mining
CN106202037B (en) * 2016-06-30 2019-05-14 昆明理工大学 Vietnamese phrase tree constructing method based on chunking
CN109359303B (en) * 2018-12-10 2023-04-07 枣庄学院 Word sense disambiguation method and system based on graph model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant