CN113704383A - Method, system and device for labeling discourse semantics - Google Patents

Info

Publication number
CN113704383A
Authority
CN
China
Prior art keywords
semantic
discourse
document
chapter
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110987422.2A
Other languages
Chinese (zh)
Inventor
张学龙
谭培波
刘锋
刘伟华
马青
马学兰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhitong Yunlian Technology Co Ltd
Original Assignee
Beijing Zhitong Yunlian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhitong Yunlian Technology Co Ltd
Priority to CN202110987422.2A
Publication of CN113704383A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method, a system and a device for labeling discourse semantics. The method comprises: obtaining a raw corpus document and establishing a hierarchical semantic structure of chapter-section-paragraph-sentence-slot for the raw corpus document; fusing the corresponding levels of the hierarchical semantic structure into a unified discourse semantic document; modifying the levels and sequence numbers of the fused discourse semantic document according to correct logical thinking; and storing the modified discourse semantic document in the annotated corpus, which completes the labeling of discourse semantics.

Description

Method, system and device for labeling discourse semantics
Technical Field
The invention relates to the field of discourse semantics, and in particular to a method, a system and a device for labeling discourse semantics.
Background
Natural language research studies how people think. The most important characteristics of human thinking are hierarchy and abstraction: people are best at intuitively grasping the abstract, high-level relations among things. The semantics studied in natural language research represents the abstract connections among things as perceived by the human brain, so a semantics that truly represents human thinking should also be hierarchical and abstract. Semantic technology implements abstraction in a predefined way; for example, part-of-speech tagging relies on a predefined part-of-speech system and does not discover new parts of speech from the corpus. However, existing semantic theory and technology define the semantic hierarchy only up to the sentence level. There is almost no definition of, or labeling method for, discourse semantics, and no discourse semantic labeling method that can be used in engineering.
The defects of the existing methods are that sentence-level semantics cannot meet the engineering need to semantically describe business activities that span large spaces, multiple levels and long periods of time, and that semantic annotation based on single sentences and the words within them cannot annotate discourse semantics that has a complex hierarchical structure.
Disclosure of Invention
The invention aims to provide a method, a system and a device for labeling discourse semantics, so as to solve the problem of labeling discourse semantics.
The invention provides a method for labeling discourse semantics, which comprises the following steps:
S1, acquiring a raw corpus document, and establishing a hierarchical semantic structure of chapter-section-paragraph-sentence-slot for the raw corpus document;
S2, fusing the corresponding levels of the hierarchical semantic structure into a unified discourse semantic document;
S3, modifying the levels and sequence numbers of the fused discourse semantic document according to correct logical thinking;
S4, storing the modified discourse semantic document in the annotated corpus to complete the labeling of discourse semantics.
The invention also provides a system for labeling discourse semantics, which comprises:
a semantic structure module, used for acquiring a raw corpus document and establishing a hierarchical semantic structure of chapter-section-paragraph-sentence-slot for the raw corpus document;
a fusion module, used for fusing the corresponding levels of the hierarchical semantic structure into a unified discourse semantic document;
a modification module, used for modifying the levels and sequence numbers of the fused discourse semantic document according to correct logical thinking;
a storage module, used for storing the modified discourse semantic document in the annotated corpus to complete the labeling of discourse semantics.
The embodiment of the invention also provides a device for labeling discourse semantics, which comprises: a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the method for labeling discourse semantics.
The embodiment of the invention also provides a computer readable storage medium, wherein an implementation program for information transmission is stored on the computer readable storage medium, and the implementation program realizes the steps of the method when being executed by a processor.
By adopting the embodiment of the invention, the requirement of semantic description on large-space, multi-level and long-time business activities in engineering is met, and discourse semantic annotation with a complex hierarchical structure is realized.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flowchart of a discourse semantic annotation method according to an embodiment of the present invention;
FIG. 2 is a diagram of a document structure in the prior art;
FIG. 3 is a schematic diagram of the logical structure of the discourse semantic annotation method according to the embodiment of the present invention;
FIG. 4 is a schematic format diagram of a raw corpus document of the discourse semantic annotation method according to the embodiment of the present invention;
FIG. 5 is a schematic format diagram of a discourse semantic document of the discourse semantic annotation method according to the embodiment of the present invention;
FIG. 6 is a diagram illustrating the annotated corpus format of the discourse semantic annotation method according to the embodiment of the present invention;
FIG. 7 is a schematic diagram of the discourse semantic structure of the discourse semantic annotation method according to the embodiment of the present invention;
FIG. 8 is a schematic diagram of discourse semantic fusion in the discourse semantic annotation method according to the embodiment of the present invention;
FIG. 9 is a detailed flowchart of the discourse semantic annotation method according to the embodiment of the present invention;
FIG. 10 is a schematic diagram of a discourse semantic annotation system according to an embodiment of the invention;
FIG. 11 is a diagram of a discourse semantic annotation device according to an embodiment of the present invention.
Description of reference numerals:
101: a semantic structure module; 102: a fusion module; 103: a modification module; 104: a storage module.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the following embodiments, and it should be understood that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", and the like, indicate orientations and positional relationships based on those shown in the drawings, and are used only for convenience of description and simplicity of description, and do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be considered as limiting the present invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise. Furthermore, the terms "mounted" and "connected" are to be construed broadly: a connection may, for example, be fixed, detachable or integral; it may be mechanical or electrical; and it may be direct, indirect through an intervening medium, or an internal communication between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific case.
Method embodiment
According to an embodiment of the invention, a method for labeling discourse semantics is provided.
FIG. 1 is a flowchart of the discourse semantic annotation method according to an embodiment of the present invention. As shown in FIG. 1, the method specifically includes:
S1, acquiring a raw corpus document, and establishing a hierarchical semantic structure of chapter-section-paragraph-sentence-slot for the raw corpus document;
S1 specifically includes: acquiring a raw corpus document and establishing the hierarchical semantic structure of the article, namely article, chapter, section, paragraph, sentence and slot, according to the corresponding dictionaries, wherein the dictionaries take sentence semantics as the basic unit.
S2, fusing the corresponding levels of the hierarchical semantic structure into a unified discourse semantic document;
S2 specifically includes: fusing the semantic levels of the articles into a unified discourse semantic document in the form of a table.
S3, modifying the levels and sequence numbers of the fused discourse semantic document according to correct logical thinking;
S3 specifically includes: modifying the levels and sequence numbers of the fused discourse semantic document, and manually defining, level by level, the chapters that are still undefined.
S4, storing the modified discourse semantic document in the annotated corpus to complete the labeling of discourse semantics.
A specific implementation of the method is as follows.
Sentence-level semantics cannot meet the engineering need for semantic description of business activities that span large spaces, multiple levels and long periods of time.
FIG. 2 is a diagram of a document structure in the prior art. As shown in fig. 2, the document "J10043_H well fracturing project design.docx" describes the overall process of a well fracturing project and covers a large range of time and space.
Fracturing is a very complex project, especially at depths of 8 kilometers underground, which people cannot reach. Whether fracturing succeeds or fails therefore depends entirely on the advance design, which must consider a large range of time and space, and describing such a complex project inevitably produces a report with a complex structure. The engineering design report comprises 9 chapters and 5 levels of headings, with text paragraphs, charts and tables beneath them. Even if the semantics of every sentence can be identified, for example identifying that the sentence "in the pad fluid stage, guar gum is used to crack the main fracture, matched with slickwater slug processing ……" has the semantic name "detecting the working state of the pad fluid stage", the semantics of the whole design document cannot be understood from sentence semantics alone, and the overall reasoning behind fracturing one well, which the document is meant to express, cannot be described. Engineering therefore needs discourse semantics of larger granularity in order to describe people's production and social activities at a larger granularity.
Semantic annotation methods based on single sentences and the words within them cannot annotate discourse semantics with a complex hierarchical structure.
Semantic annotation is generally performed on sentences or on the words within them, for example using spaces to mark word segmentation and a "/" sign to separate a word from its entity name. Such simple sentence-granularity labeling cannot meet the need to label a semantic structure with complex levels. For example, what is the semantics of the chapter "2 geological engineering analysis" to be labeled in fig. 1? The chapter contains multiple sections, sentences and words, and labeling a single word or sentence does not yield an understanding of the whole document. Labeling only "pad stage/process uses guar gum to crack the main fracture, matched with the slickwater slug ……", and thereby finding that "pad stage" belongs to the business object type "process", does not help in understanding the whole engineering design process.
Just as the properties of water (H2O) at the molecular level cannot be understood from the properties of the oxygen atom O and the hydrogen atom H alone, things at different levels have characteristics that are unique to each level yet interconnected and inseparable. Defining and labeling semantics only at the word and sentence level therefore cannot capture the high-level semantics of a discourse. For engineering practice, an integrated method for defining and labeling discourse semantics, with distinct but inseparable levels, needs to be established.
The embodiment of the invention provides a discourse semantic annotation method based on hierarchical alignment. Its basic idea is to merge the common parts according to the hierarchy and the order within each level; the merged hierarchical structure is the discourse semantics of the whole business. It is described as follows:
first, a hierarchical semantic document of chapter-section-paragraph-sentence-slot is established for each single document, where the semantics of each level is obtained by querying the corresponding semantic dictionary. Paragraph semantics connects the chapter structure of the traditional document with the sentence semantics of traditional semantic theory, establishing the layered discourse semantics of the document, that is, of the business; equivalently, the traditional chapter structure is extended down to sentence frame semantics to form the discourse semantics;
second, the discourse semantic records of all documents in the corpus are gathered and are merged and sorted level by level according to chapter-section-paragraph-sentence-slot; chapter descriptions that remain undefined are defined manually, level by level;
finally, the merged discourse semantic document is checked level by level, and node sequence numbers are reassigned to the items of each level so that the logic of the discourse semantics conforms to human thinking.
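As a minimal sketch of the merge-and-renumber idea described above, the following Python snippet merges the heading paths of several documents into one ordered tree and then reassigns hierarchical sequence numbers. The function names `merge_outlines` and `number_tree`, the tuple-of-titles path representation and the sample headings are all assumptions made for illustration; the patent does not specify a data structure.

```python
def merge_outlines(outlines):
    """Merge per-document heading paths into one nested tree.

    Each outline is a list of paths; each path is a tuple of titles from the
    top level down. Items keep the order in which they are first seen.
    """
    root = {}
    for outline in outlines:
        for path in outline:
            node = root
            for title in path:
                node = node.setdefault(title, {})
    return root


def number_tree(tree, prefix=""):
    """Assign sequence numbers (1, 1.1, 1.2, 2, ...) level by level."""
    rows = []
    for index, (title, children) in enumerate(tree.items(), start=1):
        number = f"{prefix}{index}"
        rows.append((number, title))
        rows.extend(number_tree(children, prefix=number + "."))
    return rows


# Hypothetical outlines of two documents from the same business.
doc_a = [("Geological analysis", "Reservoir"),
         ("Fracturing design", "Pad fluid stage")]
doc_b = [("Geological analysis", "Faults"),
         ("Fracturing design", "Pad fluid stage"),
         ("Fracturing design", "Flowback")]

for number, title in number_tree(merge_outlines([doc_a, doc_b])):
    print(number, title)
```

In this sketch the merged order is simply first-seen order; the manual check described in the last step would then reorder items before `number_tree` is run again.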
FIG. 3 is a schematic diagram of the logical structure of the discourse semantic annotation method according to the embodiment of the present invention. As shown in FIG. 3, the discourse semantic annotation method consists of a data layer 1, a method layer 2 and an application layer 3. The data layer stores, reads, writes and modifies files, and comprises the raw corpus document, the semantic dictionary, the discourse semantic document and the document annotated corpus; the method layer 2 processes, fuses and converts the format of the raw corpus to form the final discourse semantic text, and comprises 4 parts: establishing the chapter-section-paragraph-sentence-slot semantic structure of a single document, fusing the document into the unified discourse semantic document, modifying the levels and sequence numbers of the fused discourse semantic document, and saving the discourse semantic document; the application layer 3 handles interaction with the user and comprises 3 parts: reading documents, editing Word documents and saving annotated documents, where multi-level editing of the documents is done in Word.
The data layer 1 consists of the raw corpus document 1-1, the semantic dictionary 1-2, the discourse semantic document 1-3 and the document annotated corpus 1-4. The format of the raw corpus document 1-1 is shown in fig. 1. It is generally a Word docx document; documents in other formats, such as doc and pdf, must be converted to docx in advance. The chapter-section structure of the Word document is marked by Word's heading levels.
FIG. 4 is a schematic format diagram of a raw corpus document of the discourse semantic annotation method according to the embodiment of the present invention.
FIG. 4 lists the dictionary names and their corresponding fields. The chapter-section dictionary must include the level of each chapter or section. The paragraph semantic dictionary must contain a list of sentence semantic combinations, and paragraph semantics can be recognized in 2 ways: from the original text or from the combination of sentence semantics. Because paragraphs are sometimes long, recognition from the original text is often impractical, and recognition from sentence semantic combinations is more feasible.
FIG. 5 is a schematic format diagram of a discourse semantic document of the discourse semantic annotation method according to the embodiment of the present invention.
The discourse semantic document 1-3 is the template of what all documents have in common, made up of the headings of all documents, and is the abstract knowledge structure of the whole corpus, as shown in fig. 5. The chapter-section levels are identified in Word by heading levels, a hierarchical structure that Word provides; the paragraph-sentence-slot levels structure the body text and follow chapter-section in the hierarchy, so that together they form the complete chapter-section-paragraph-sentence-slot hierarchy that describes the discourse semantics. The discourse annotated corpus 1-4 holds all the results for each document.
FIG. 6 is a diagram illustrating the annotated corpus format of the discourse semantic annotation method according to an embodiment of the present invention.
The annotated corpus is stored as a table whose fields include: Word text-table ordering, heading level, text paragraph, table sequence number, paragraph text, picture number, heading n, heading template, heading n-chapter semantics, paragraph semantics, split sentence text, sentence semantics, NER-pattern and object semantics, object, and sentence-picture-table semantics. Together these fields form the full information table obtained by splitting and labeling the Word document, and through the annotated corpus, conversion between different document formats, such as the way a picture is expressed, can be realized.
The method layer 2 consists of 4 parts: establishing the chapter-section-paragraph-sentence-slot semantic structure of a single document 2-1, fusing the document into the unified discourse semantic document 2-2, modifying the levels and sequence numbers of the fused discourse semantic document 2-3, and saving the discourse semantic document 2-4. Part 2-1 preprocesses the input Word document and splits the hierarchical structure of the Word format into a structure that takes the sentence as its unit.
FIG. 7 is a schematic diagram of the discourse semantic structure of the discourse semantic annotation method according to the embodiment of the present invention.
The structure is shown in fig. 7 and enables natural language processing in units of sentences. Fig. 7 differs from fig. 5 in that it adds the original sentence to the chapter-section-paragraph-sentence-slot semantic structure, reflecting the sentence-based analysis approach; the text is split only down to sentences because sentence-level natural language processing technology is mature. Part 2-2, fusing the document into the unified discourse semantic document, fuses the semantic structure of a new document with the saved discourse semantic structure.
FIG. 8 is a schematic diagram of discourse semantic fusion in the discourse semantic annotation method according to the embodiment of the present invention. As shown in FIG. 8, the parts that cannot be fused are picked out as material to be annotated.
Because semantic fusion is a computation driven by hierarchical and logical requirements, table-based fusion is more suitable than graph-based fusion: a graph usually assumes that each object is unique, and real objects, such as products from the same batch, do not satisfy this assumption. Part 2-3, modifying the levels and sequence numbers of the fused discourse semantic document, means finding in Word the items of the discourse structure and the discourse slots whose semantics is undefined, marked "NNNN", and outputting the sentences labeled "NNNN" to the corresponding semantic dictionary of fig. 3, which closes the discourse semantic processing loop. Part 2-4, saving the discourse semantic document, means updating the processing result for the whole document, such as the document shown in fig. 6, which completes the semantic fusion process. Because a document has a complex structure, the fused data is a large table stored sentence by sentence; in effect a complex hierarchical attribute frame is added to each sentence, and this frame describes the meaning of the sentence more accurately than single-sentence semantics does.
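The loop that collects undefined items could look roughly like the sketch below. It assumes the fused table is held as a list of dictionaries, one per sentence, and that undefined semantics carry the "NNNN" marker mentioned in the text; the function name `undefined_items` and the column names are hypothetical.

```python
UNDEFINED = "NNNN"  # marker for semantics that no dictionary entry defines yet


def undefined_items(rows, semantic_columns):
    """Return the sentences that still have an undefined semantic column.

    Each result names the sentence and the columns that need a dictionary
    entry, so it can be handed to whoever maintains the semantic dictionary.
    """
    pending = []
    for row in rows:
        missing = [col for col in semantic_columns if row.get(col) == UNDEFINED]
        if missing:
            pending.append({"sentence": row["sentence"], "missing": missing})
    return pending


# Hypothetical fused rows; only the second one needs manual definition.
rows = [
    {"sentence": "The well is 8 km deep.",
     "paragraph_semantics": "Well overview", "slot": "depth"},
    {"sentence": "Guar gum opens the main fracture.",
     "paragraph_semantics": UNDEFINED, "slot": UNDEFINED},
]
print(undefined_items(rows, ["paragraph_semantics", "slot"]))
```

Once the missing entries have been added to the dictionary, the document would be processed again, which is the cycle the paragraph above describes.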
The application layer 3 comprises 3 parts: reading documents 3-1, editing Word documents 3-2 and saving annotated documents 3-3.
The reading-document part 3-1 implements document selection and reads a Word document into the software platform. Editing the Word document 3-2 uses Word's editing functions directly to check and modify the levels of the fused semantic document and to order the items of each level, so that the discourse semantics is consistent in hierarchy and logic. In engineering application software, natural language processing needs to be combined with professional tools to make use of their strengths, for example engineering software such as AutoCAD and NX. Here Word, as a professional text editing tool, is integrated into the discourse semantic annotation platform so that each can play to its strengths.
FIG. 9 is a detailed flowchart of the discourse semantic annotation method according to the embodiment of the present invention. As shown in FIG. 9, the flow of the discourse semantic annotation method based on hierarchical alignment comprises 2 major, consecutive flows, the original document processing flow and the discourse semantic fusion flow, with the following steps:
Step 1: reading the original Word document
The input document is in Word docx format, mainly to make use of Word's powerful editing functions, in particular its heading level settings, which provide the most suitable tool for the hierarchical structure of discourse semantics. Input documents in other formats, such as doc, txt and pdf, must be converted to docx in advance.
The tables in the document are labeled by setting a scale variable to correspond to the tables in docx.
Step 2: aligning paragraphs and tables together in the XML
The python-docx module processes the text and pictures in a Word file only paragraph by paragraph and does not parse the tables of the document along with them, yet charts and tables are the main content of engineering documents. In the embodiment of the invention the XML is therefore parsed from the underlying layer of the Word file, and paragraphs and tables are ordered together line by line, so that the text, tables and pictures in Word can be processed uniformly.
The table number can be obtained from the document object of the docx module and corresponds to the table number obtained in step 1, which gives the specific content of each table within the parsed text. The line of text immediately before a table's position is used as the table name; if several tables are at the same position, a numeric suffix must be added to the table name.
A figure is named with the caption on the line immediately after the figure's position.
Both figures and tables are treated as entities in discourse semantics.
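One way to obtain paragraphs and tables in their original order, sketched here with only the Python standard library, is to read `word/document.xml` from the docx archive and walk the children of the body element. The function name `read_blocks` and the returned dictionary layout are illustrative assumptions. The sketch ignores pictures, nested tables and the numeric suffix for same-position tables; it only shows the ordering idea and the naming of a table by the non-empty text line before it.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by the elements in word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _text(element):
    """Concatenate the text of all w:t runs below an element."""
    return "".join(t.text or "" for t in element.iter(W + "t"))


def read_blocks(docx):
    """Return the paragraphs and tables of a .docx in body order.

    `docx` is a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    blocks, last_text = [], ""
    for child in root.find(W + "body"):
        if child.tag == W + "p":
            text = _text(child)
            if text.strip():
                last_text = text  # candidate name for a following table
            blocks.append({"kind": "paragraph", "text": text})
        elif child.tag == W + "tbl":
            rows = [[_text(cell) for cell in row.findall(W + "tc")]
                    for row in child.findall(W + "tr")]
            blocks.append({"kind": "table", "name": last_text, "rows": rows})
    return blocks
```

Calling `read_blocks("report.docx")` (a hypothetical file name) would return one list in which each table sits between the paragraphs that surround it in the document.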
Step 3: arranging the heading levels in a hierarchical relationship
The headings obtained from the docx module form 2 columns, [heading level, heading text]. The heading levels express a hierarchical relationship but are recorded in a single column, which hinders understanding and format conversions, such as conversion from Excel to Word or the construction of triples. This step therefore arranges the headings as a hierarchy.
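A small sketch of this rearrangement, assuming the headings arrive as `(level, text)` pairs with levels starting at 1, is shown below. The function name `expand_headings` is hypothetical, and only "2 Geological engineering analysis" comes from the text; the other sample headings are invented for illustration.

```python
def expand_headings(headings, depth=None):
    """Turn [(level, text), ...] into rows with one column per heading level.

    Assumes a non-empty list with 1-based levels. A new heading at level k
    overwrites column k and clears every deeper column.
    """
    depth = depth or max(level for level, _ in headings)
    current = [""] * depth
    rows = []
    for level, text in headings:
        current[level - 1] = text
        for deeper in range(level, depth):
            current[deeper] = ""
        rows.append(tuple(current))
    return rows


headings = [
    (1, "2 Geological engineering analysis"),
    (2, "2.1 Reservoir"),      # hypothetical section
    (2, "2.2 Faults"),         # hypothetical section
    (1, "3 Fracturing design"),
]
for row in expand_headings(headings):
    print(row)
```

Each output row now carries its full path of ancestors, which makes a row self-describing when the table is later expanded to sentence granularity.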
Step 4: paragraph processing
This step comprises steps 4-1 to 4-4, whose purpose is to name the paragraph. Paragraphs lie between the commonly used chapter structure and sentences, for which processing techniques are mature, and are the key to discourse semantics.
Step 4-1: paragraph semantic recognition
This step looks up the semantic name of the paragraph in the semantic dictionary. A paragraph, as defined here, is made up of sentences that end with a period "."; text joined by commas, colons and other punctuation marks is considered one sentence rather than 2 sentences or paragraphs, because those marks indicate that the text is semantically coherent.
The paragraph semantic dictionary shown in fig. 4 contains 3 parts: paragraph semantics, paragraph text and sentence semantics. A paragraph name is determined in 2 ways: by looking up the original text directly, or by looking up the sentence semantics. The original text is generally long and inefficient to process, whereas the combination of sentence semantics better matches the intent of the paragraph.
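The two lookup routes can be sketched as follows. The dictionary entry layout (`semantics`, `text`, `sentence_semantics`), the function name `paragraph_semantics` and the use of "NNNN" as the fallback for an unknown paragraph are assumptions made for this example; the sample entry is invented.

```python
UNDEFINED = "NNNN"  # assumed marker for a paragraph the dictionary cannot name


def paragraph_semantics(paragraph_text, sentence_semantics, dictionary):
    """Name a paragraph by its original text, else by its sentence semantics."""
    by_text = {e["text"]: e["semantics"] for e in dictionary if e.get("text")}
    by_combination = {tuple(e["sentence_semantics"]): e["semantics"]
                      for e in dictionary}
    if paragraph_text in by_text:
        return by_text[paragraph_text]
    return by_combination.get(tuple(sentence_semantics), UNDEFINED)


# Hypothetical dictionary with a single entry.
dictionary = [{
    "semantics": "Pad fluid stage design",
    "text": "",
    "sentence_semantics": ["Detecting the working state of the pad fluid stage",
                           "Slickwater slug processing"],
}]
print(paragraph_semantics(
    "some long paragraph text",
    ["Detecting the working state of the pad fluid stage",
     "Slickwater slug processing"],
    dictionary,
))
```

Matching on the ordered combination of sentence semantics avoids comparing long raw strings, which is the efficiency argument made in the text.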
Step 4-2: splitting paragraphs into sentences
Each paragraph is split into sentences at the period ".", which lengthens the table; each sentence produced by the split keeps the full hierarchy of the chapters and sections above it.
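A minimal sketch of this row expansion is given below. It assumes each table row is a dictionary with a `paragraph` field, and it takes the terminator as a single-character parameter; the default of the full-width period "。" is an assumption based on the corpus being Chinese, and "." can be passed instead. The function name `split_paragraph` and the column names are hypothetical.

```python
import re


def split_paragraph(row, terminator="。"):
    """Expand one paragraph row into one row per sentence.

    Every new row copies the chapter/section columns of the original row and
    adds a `sentence` column. `terminator` must be a single character.
    """
    t = re.escape(terminator)
    pieces = re.findall(f"[^{t}]+{t}?", row["paragraph"])
    sentences = [piece.strip() for piece in pieces]
    return [{**row, "sentence": s} for s in sentences if s]


row = {
    "chapter": "3 Fracturing design",        # hypothetical headings
    "section": "3.1 Pad fluid stage",
    "paragraph": "Guar gum opens the main fracture. Slickwater slugs follow.",
}
for sentence_row in split_paragraph(row, terminator="."):
    print(sentence_row["section"], "|", sentence_row["sentence"])
```

Note that splitting on an ASCII "." would also cut decimal numbers such as "14.5", so real text would need a more careful rule than this sketch uses.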
Step 5: sentence processing
Sentence processing analyzes the semantics, pattern and corresponding entities of a sentence. The sentence is the minimum unit of natural language processing, and all processing above the sentence level is based on it.
Step 5-1: sentence semantic recognition
The semantics of a sentence, or the name of a sentence, is defined to correspond to a table name in the database; that is, the sentence describes fields of a certain table.
In engineering projects the underlying database is relatively complete and has many fields. For example, the carbonate rock cast thin section identification table has 179 attributes and the drilling geological information table has 104, very different from frame semantics, where a frame usually has only a few attributes. Because the engineering tables are rich and comprehensive, the structured forms need not be rebuilt except in special cases. This is both an advantage and a difficulty of engineering.
Step 5-2: sentence slot identification
This step identifies the sentence pattern, that is, the pattern formed by what remains of the sentence after its entities have been replaced. The pattern reflects a human way of thinking and is a very important part of natural language processing. For example, the original sentence "wells JHW023 and JHW025 adopt large-displacement construction of 14 m³/min throughout the whole process" has the sentence pattern "O adopt O construction throughout the whole process". The positions of the two O placeholders are the slots, and "O adopt O construction throughout the whole process" is the pattern of the whole sentence, representing how people think. Each O can be replaced by a specific type, which makes the semantics of the whole sentence pattern more definite.
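The idea of replacing entities with O to expose the sentence pattern can be sketched with regular expressions. The two entity patterns below (well names such as JHW023, and quantities in m3/min) and the function `extract_pattern` are assumptions chosen to fit the example sentence; the description does not say how entities are actually detected.

```python
import re

# Hypothetical entity detectors, ordered only for readability.
ENTITY_PATTERNS = [
    # One or more well names joined by "and" count as a single slot.
    ("well", re.compile(r"[A-Z]{2,}\d+(?:\s+and\s+[A-Z]{2,}\d+)*")),
    ("displacement", re.compile(r"\d+(?:\.\d+)?\s*m3/min")),
]


def extract_pattern(sentence, placeholder="O"):
    """Return (sentence pattern, list of (slot type, original text))."""
    matches = []
    for slot_type, regex in ENTITY_PATTERNS:
        for m in regex.finditer(sentence):
            matches.append((m.start(), m.end(), slot_type, m.group()))
    matches.sort()  # process entities from left to right

    pieces, slots, cursor = [], [], 0
    for start, end, slot_type, text in matches:
        if start < cursor:  # ignore a match that overlaps an earlier one
            continue
        pieces.append(sentence[cursor:start])
        pieces.append(placeholder)
        slots.append((slot_type, text))
        cursor = end
    pieces.append(sentence[cursor:])
    return "".join(pieces), slots
```

Keeping the slot type next to each removed entity is what later allows an O to be "replaced by a specific type", as the paragraph above puts it.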
Step 5-3: in-sentence entity recognition
In-sentence entity recognition disambiguates the result of step 5-2. For example, the question "how deep is the triple well of the capillary dam" admits two readings for a non-professional, but only one for a professional, because the "triple well of the capillary dam" is not a well but a gathering station, so "how deep" refers to the depth (average, maximum, minimum, etc.) of all the wells of that gathering station. The ambiguity of an entity in a sentence therefore has to be resolved with higher-level knowledge in order to determine what the entity really means.
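One way to picture "higher-level knowledge" is a small lookup of entity types that decides how a depth question is interpreted. Everything in this sketch is hypothetical: the entity names, the depth values, and the function `depth_query` are invented for illustration and do not come from the description.

```python
# Hypothetical domain knowledge; names and depths are made up.
ENTITY_TYPES = {"Station-X": "gathering station", "Well-1": "well", "Well-2": "well"}
STATION_WELLS = {"Station-X": ["Well-1", "Well-2"]}
WELL_DEPTH_M = {"Well-1": 1000.0, "Well-2": 1500.0}


def depth_query(entity):
    """Interpret 'how deep is <entity>' according to the entity's known type."""
    entity_type = ENTITY_TYPES.get(entity)
    if entity_type == "well":
        depths = [WELL_DEPTH_M[entity]]
    elif entity_type == "gathering station":
        # A station has no depth of its own: the question is about its wells.
        depths = [WELL_DEPTH_M[well] for well in STATION_WELLS[entity]]
    else:
        raise KeyError(f"unknown entity: {entity}")
    return {
        "entity": entity,
        "type": entity_type,
        "min_m": min(depths),
        "max_m": max(depths),
        "mean_m": sum(depths) / len(depths),
    }
```

The point of the sketch is only that the entity type, not the surface wording, selects the reading of the question.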
Step 6: adding graphs and tables as objects to an entity
For the names of the figures and tables already identified in step 2, "figure" and "table" are used as the names of the corresponding sentences.
Step 7: saving preprocessed files
The Word file is parsed into a structured file in the format of Table 5 and saved. The smallest granularity is the sentence, and every other column is an attribute of the sentence at a different level, so that the complex chapter-section-sentence-slot hierarchy is a hierarchical way of expressing discourse semantics.
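A minimal sketch of such a sentence-granular structured file is given below, written as CSV so that it stays self-contained. The column names are assumptions; the actual layout of Table 5 is not reproduced in this section.

```python
import csv
import io

# Hypothetical columns: one row per sentence, all other columns are level attributes.
COLUMNS = ["chapter", "section", "paragraph", "sentence", "slot", "text"]


def rows_to_csv(rows):
    """Serialize sentence rows (dicts keyed by COLUMNS) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # missing keys are written as empty cells
    return buffer.getvalue()
```

Because each row carries its own chapter, section and paragraph attributes, the file can be filtered or regrouped at any level without re-parsing the original document.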
Step 8: reading the text semantic Excel document
Once step 7 has finished the processing flow for the original document, that document is archived. In this step the separately processed document is read back so that it can be fused, level by level, with the stored semantic document.
Step 9: reading the discourse semantic Excel document
The first discourse semantic document has the same form as the preprocessing result of a document, but the discourse semantic document is shared by all articles: it is an abstract document that is isomorphic with the original documents and serves as the header of all documents. Because document types differ, their chapter structures and sentences may differ, so the structure of discourse semantics is very complex. The situation is comparable to a part-of-speech system, which is a complex hierarchy defined by at least 40 parts of speech, although any single sentence is generally described with no more than 10 of them. Discourse semantics is therefore a header with a complex hierarchical structure, and the annotated document of each individual document fills in only a small part of the discourse semantic document.
Step 10: appending the original Excel file to the semantic Excel file
First, the original Excel file is appended to the end of the semantic Excel file using the append method of a DataFrame, forming one large stacked table. The document is then extracted by the selected columns, keeping only the levels defined by the discourse semantic document.
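The stacking and column selection can be sketched with pandas as follows. Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so this sketch uses `pd.concat`, which gives the same row-wise stacking. The level column names and the function `stack_levels` are assumptions.

```python
import pandas as pd

# Hypothetical level columns defined by the discourse semantic document.
LEVELS = ["chapter", "section", "paragraph", "sentence", "slot"]


def stack_levels(semantic_df, original_df):
    """Stack the original table under the semantic table and keep only level columns."""
    stacked = pd.concat([semantic_df, original_df], ignore_index=True)
    # Selecting LEVELS drops the original-text column, which fusion does not need.
    return stacked[LEVELS]
```

In practice the two frames would come from `pd.read_excel`; they are passed in here so that the sketch does not depend on any file.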
In the semantic fusion process, the original text does not need to be added.
Step 11: fused file deduplication
The exact same rows are removed, but the original order is preserved.
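Order-preserving deduplication can be illustrated in plain Python, assuming each row is represented as a tuple of level names. With a pandas DataFrame, `drop_duplicates()` with its default `keep="first"` has the same effect; the helper name `dedupe_rows` is hypothetical.

```python
def dedupe_rows(rows):
    """Remove exact duplicate rows while keeping the first occurrence of each, in order."""
    # dict keys are unique and remember insertion order (Python 3.7+).
    return list(dict.fromkeys(rows))
```

Rows must be hashable for this to work, which is why tuples are used rather than lists or dicts.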
Step 12: sorting the fused file in ascending order by chapter-section-paragraph-sentence-slot
Sorting is performed with the sort_values method of a DataFrame. The rows are reordered by the chapter, section, sentence and slot levels, so that fusion level by level is achieved; the result is shown in FIG. 7.
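A minimal sketch of the hierarchical sort is shown below. It assumes that each level has a numeric sequence-number column; the column names and the function `sort_by_hierarchy` are hypothetical.

```python
import pandas as pd

# Hypothetical sequence-number columns, from the outermost level to the innermost.
SORT_KEYS = ["chapter_no", "section_no", "paragraph_no", "sentence_no", "slot_no"]


def sort_by_hierarchy(df):
    """Sort rows in ascending order, level by level, and renumber the index."""
    return df.sort_values(by=SORT_KEYS, ascending=True, ignore_index=True)
```

Sorting on sequence numbers rather than on level names is a choice of this sketch: names sort alphabetically, which would not generally reproduce the intended document order.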
Step 13: converting the fused file to Word format
A document in Excel form is easy to analyze but inconvenient to read and understand, so the structured file needs to be converted into Word form. When paragraph text is added, every layer uses the same indentation increment, which ensures that items on the same layer are aligned.
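The equal-indentation rule can be sketched by rendering the sorted rows as an indented outline. Plain text stands in for the Word document here, and the row layout (a tuple of level names from outermost to innermost) and the function `render_outline` are assumptions.

```python
def render_outline(rows, indent="    "):
    """Render hierarchical rows as an outline with one fixed indent per level."""
    lines, previous = [], ()
    for row in rows:
        # Find the first level at which this row differs from the previous row.
        depth = 0
        while depth < len(previous) and depth < len(row) and row[depth] == previous[depth]:
            depth += 1
        # Emit only the levels that changed, each indented by its depth.
        for level in range(depth, len(row)):
            lines.append(indent * level + row[level])
        previous = row
    return "\n".join(lines)
```

Because the indent is multiplied by the level index, all titles of the same level line up regardless of where they occur in the document.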
Step 14: modifying the merged file according to the hierarchy and the sequence
In the Word interface, copy the sentences shown as "original text → NNNNNN". Steps 4 and 5 have saved these sentences in the semantic dictionary shown in FIG. 3, so their corresponding names can be filled in there. Then repeat steps 1-13 until the hierarchical structure of the Word document meets the requirements.
If the order of the titles at any level of the Word document is incorrect, the titles must also be renumbered, and the corresponding sequence numbers must then be updated throughout the semantic dictionary.
In short, the document is understood in Word and annotated in the Excel semantic dictionary, and steps 1-13 are repeated until the discourse semantics are satisfactory in both hierarchy and logic.
Step 15: replacing the original text semantic Excel document
After the adjustments of the preceding 14 steps are complete, the adjusted text semantic Excel document is saved over the same file, which completes one update of the text semantic structure.
Starting from the isomorphism of document structures, the invention fuses and extends the discourse semantic structure, which broadens people's abstract understanding of discourse semantics and widens the range of engineering problems that natural language processing can address. For the HW152 well fracturing engineering design document, only the physical associations among 62 objects can be seen without discourse semantics, whereas with discourse semantics the associations among 202 objects at different levels can be seen. The relationships among objects can be found along the chapter, section, paragraph and sentence dimensions, the modes of thinking behind the different levels can be analyzed, and the number of association nodes of the article is enlarged by 4 times.
The method meets the need for semantic description of large-scale, multi-level, long-duration business activities in engineering, and realizes the annotation of discourse semantics with complex hierarchical structures.
System embodiment
According to an embodiment of the present invention, a system for discourse semantic annotation is provided. FIG. 10 is a schematic diagram of the system, which, as shown in FIG. 10, specifically includes:
a semantic structure module, configured to acquire a raw corpus document and establish a chapter-section-paragraph-sentence-slot hierarchical semantic structure for the raw corpus document;
the semantic structure module is specifically configured to: acquire a raw corpus document and establish the hierarchical semantic structure of the article according to the corresponding chapter, section, paragraph, sentence and slot dictionaries, wherein these dictionaries take sentence semantics as the basic unit;
a fusion module, configured to fuse the corresponding levels of the hierarchical semantic structure into a unified discourse semantic document;
the fusion module is specifically configured to: fuse the semantic levels of the articles into a unified discourse semantic document in the form of a table;
a modification module, configured to modify the levels and sequence numbers of the fused discourse semantic document based on correct logical thinking;
a storage module, configured to store the modified discourse semantic document in the annotated corpus, completing the labeling of discourse semantics.
The embodiment of the present invention is a system embodiment corresponding to the above method embodiment, and specific operations of each module may be understood with reference to the description of the method embodiment, which is not described herein again.
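How the four modules might be wired together is sketched below. The class and method names, the tuple row representation, and the `corrections` mapping that stands in for the manual modification step are all assumptions of this sketch, not details given in the embodiment.

```python
from dataclasses import dataclass, field


class SemanticStructureModule:
    def build(self, raw_rows):
        """Turn raw rows into hashable (chapter, section, paragraph, sentence, slot) tuples."""
        return [tuple(row) for row in raw_rows]


class FusionModule:
    def fuse(self, discourse_rows, document_rows):
        """Stack both row sets, drop exact duplicates, and sort level by level."""
        stacked = list(discourse_rows) + list(document_rows)
        return sorted(dict.fromkeys(stacked))


class ModificationModule:
    def modify(self, rows, corrections):
        """Apply human corrections, given as a mapping from old row to new row."""
        return [corrections.get(row, row) for row in rows]


@dataclass
class StorageModule:
    corpus: list = field(default_factory=list)

    def save(self, rows):
        """Replace the stored discourse semantic document with the new version."""
        self.corpus = list(rows)


def annotate(raw_rows, discourse_rows, corrections, storage):
    """Run the modules in the order structure, fusion, modification, storage."""
    structured = SemanticStructureModule().build(raw_rows)
    fused = FusionModule().fuse(discourse_rows, structured)
    modified = ModificationModule().modify(fused, corrections)
    storage.save(modified)
    return storage.corpus
```

The modification step is modelled as data passed in from outside because, in the method embodiment, it is carried out by a person working in Word and in the Excel semantic dictionary.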
Apparatus embodiment one
The embodiment of the present invention provides an apparatus for labeling discourse semantics, as shown in FIG. 11, including: a memory 110, a processor 112, and a computer program stored on the memory 110 and executable on the processor 112, the computer program, when executed by the processor 112, implementing the steps of the above-described method embodiments.
Apparatus embodiment two
The embodiment of the present invention provides a computer-readable storage medium on which an implementation program for information transmission is stored; when executed by the processor 112, the program implements the steps of the above method embodiments.
The computer-readable storage medium of this embodiment includes, but is not limited to: ROM, RAM, magnetic or optical disks, and the like.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general-purpose computing device. They may be centralized on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. In some cases the steps shown or described may be performed in an order different from that described herein, or the modules or steps may be fabricated as individual integrated circuit modules, or several of them may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; however, these modifications or alternative technical solutions of the embodiments of the present invention do not depart from the scope of the present invention.

Claims (10)

1. A method for labeling discourse semantics, characterized by comprising the following steps:
S1, acquiring a raw corpus document, and establishing a chapter-section-paragraph-sentence-slot hierarchical semantic structure for the raw corpus document;
S2, fusing the corresponding levels of the hierarchical semantic structure into a unified discourse semantic document;
S3, modifying the levels and sequence numbers of the fused discourse semantic document based on correct logical thinking;
S4, storing the modified discourse semantic document in the annotated corpus to complete the labeling of discourse semantics.
2. The method according to claim 1, wherein S1 specifically comprises: acquiring a raw corpus document, and establishing the hierarchical semantic structure of the article according to the corresponding chapter, section, paragraph, sentence and slot dictionaries, wherein these dictionaries take sentence semantics as the basic unit.
3. The method according to claim 2, wherein S2 specifically comprises: fusing the semantic levels of the articles into a unified discourse semantic document in the form of a table.
4. The method according to claim 3, wherein S3 specifically comprises: modifying the levels and sequence numbers of the fused discourse semantic document, and obtaining, according to the hierarchy, definitions for the chapters that have not been manually defined.
5. A system for labeling discourse semantics, characterized by comprising:
a semantic structure module, configured to acquire a raw corpus document and establish a chapter-section-paragraph-sentence-slot hierarchical semantic structure for the raw corpus document;
a fusion module, configured to fuse the corresponding levels of the hierarchical semantic structure into a unified discourse semantic document;
a modification module, configured to modify the levels and sequence numbers of the fused discourse semantic document based on correct logical thinking;
a storage module, configured to store the modified discourse semantic document in the annotated corpus to complete the labeling of discourse semantics.
6. The system of claim 5, wherein the semantic structure module is specifically configured to: acquire a raw corpus document, and establish the hierarchical semantic structure of the article according to the corresponding chapter, section, paragraph, sentence and slot dictionaries, wherein these dictionaries take sentence semantics as the basic unit.
7. The system of claim 6, wherein the fusion module is specifically configured to: fuse the semantic levels of the articles into a unified discourse semantic document in the form of a table.
8. The system of claim 7, wherein the modification module is specifically configured to: modify the levels and sequence numbers of the fused discourse semantic document, and obtain, according to the hierarchy, definitions for the chapters that have not been manually defined.
9. An apparatus for labeling discourse semantics, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the discourse semantic annotation method of any one of claims 1 to 4.
10. A computer-readable storage medium, wherein an information transfer implementation program is stored on the computer-readable storage medium, and when executed by a processor, the program implements the steps of the discourse semantic annotation method according to any one of claims 1 to 4.
CN202110987422.2A 2021-08-26 2021-08-26 Method, system and device for labeling discourse semantics Pending CN113704383A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110987422.2A CN113704383A (en) 2021-08-26 2021-08-26 Method, system and device for labeling discourse semantics

Publications (1)

Publication Number Publication Date
CN113704383A (en) 2021-11-26

Family

ID=78655085

Country Status (1)

Country Link
CN (1) CN113704383A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110738033A (en) * 2018-07-03 2020-01-31 百度在线网络技术(北京)有限公司 Report template generation method, device and storage medium
CN112541337A (en) * 2020-12-16 2021-03-23 格美安(北京)信息技术有限公司 Document template automatic generation method and system based on recurrent neural network language model

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114254158A (en) * 2022-02-25 2022-03-29 北京百度网讯科技有限公司 Video generation method and device, and neural network training method and device
CN114254158B (en) * 2022-02-25 2022-06-10 北京百度网讯科技有限公司 Video generation method and device, and neural network training method and device
CN115249015A (en) * 2022-09-21 2022-10-28 中科雨辰科技有限公司 Labeling consistency test method and medium based on chapter clustering and sentence fusion

Similar Documents

Publication Publication Date Title
US8166037B2 (en) Semantic reconstruction
CN113704383A (en) Method, system and device for labeling discourse semantics
CN106528583A (en) Method for extracting and comparing web page main body
JP2007095102A (en) Document processor and document processing method
CN103559199B (en) Method for abstracting web page information and device
CN105677638B (en) Web information abstracting method
CN103440232A (en) Scientific paper standardization automatic detecting and editing method
CN110795932B (en) Geological report text information extraction method based on geological ontology
CN103440233A (en) Scientific paper standardization automatic detecting and editing system
CN114372153A (en) Structured legal document warehousing method and system based on knowledge graph
CN111833981A (en) Structured report making and compiling method
CN116226349A (en) Question and answer method and system based on table semantic fasttet question analysis
US9619445B1 (en) Conversion of content to formats suitable for digital distributions thereof
CN112199960A (en) Standard knowledge element granularity analysis system
KR102034392B1 (en) Method and apparatus for generating internet genealogy using string data
CN107491524B (en) Method and device for calculating Chinese word relevance based on Wikipedia concept vector
CN114169336A (en) User-defined multi-mode distributed semi-automatic labeling system
JPH09282218A (en) Html document book form shaping method and device therefor
CN106649219A (en) Automatic generation method for communication satellite design documents
JP2003288332A (en) Method and system for supporting structured document creation
Ramesh et al. Automatically identify and label sections in scientific journals using conditional random fields
CN117151442B (en) Population health field scientific data management generation method based on mind map
Iwashokun et al. Structural vetting of academic proposals
Faulhaber PhiloBiblon y el mundo wiki
US11921797B2 (en) Computer service for indexing threaded comments with pagination support

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination