CN113139033A - Text processing method, device, equipment and storage medium - Google Patents


Publication number
CN113139033A
CN113139033A CN202110524156.XA CN202110524156A CN113139033A CN 113139033 A CN113139033 A CN 113139033A CN 202110524156 A CN202110524156 A CN 202110524156A CN 113139033 A CN113139033 A CN 113139033A
Authority
CN
China
Prior art keywords
text
entity
label
entity text
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110524156.XA
Other languages
Chinese (zh)
Inventor
王水桃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An International Smart City Technology Co Ltd
Original Assignee
Ping An International Smart City Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An International Smart City Technology Co Ltd
Priority to CN202110524156.XA
Publication of CN113139033A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/322 Trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/328 Management therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the application disclose a text processing method, apparatus, device and storage medium applied to the field of natural language processing. The method comprises the following steps: acquiring a text to be processed, and determining, from the text to be processed, a plurality of entity texts and the labeling information of each of the entity texts; storing the text to be processed in the root node of a tree-shaped label storage structure, and storing the labeling information of each entity text in the child nodes at each level of the tree-shaped label storage structure; and labeling each entity text in the text to be processed according to the text to be processed and the labeling information of each entity text stored in the tree-shaped label storage structure, so as to facilitate the lookup or screening of each entity text. With the embodiments of the application, the labeling information of the text to be processed can be stored in a tree-shaped storage structure and the text can be labeled according to that stored labeling information, which can improve text labeling efficiency and broaden the application scenarios of text labeling.

Description

Text processing method, device, equipment and storage medium
Technical Field
The present application relates to the field of natural language processing, and in particular, to a text processing method, apparatus, device, and storage medium.
Background
With the development of artificial intelligence, Natural Language Processing (NLP) is widely used in many scenarios, such as sentiment analysis, text similarity, review opinion extraction, text classification, and lexical analysis. In these applications, the NLP model needs to be trained with a large amount of labeled text. In the prior art, a common approach is to label text with Excel or other word segmentation labeling tools. These tools generally store labels in a linear linked-list storage structure, under which only one label of an entity text can be displayed at a time. In practice, however, one entity text often corresponds to multiple labels, or a sub-text of an entity text also corresponds to a label. Common text labeling tools can neither directly store the multiple labels of one entity text and the labels of its sub-texts, nor directly mark such nested labels, which greatly limits text labeling efficiency and usage scenarios.
Disclosure of Invention
The embodiments of the application provide a text processing method, apparatus, device and storage medium, in which a tree-shaped storage structure can be used to store the labeling information of the text to be processed, and the text is labeled according to the labeling information stored in the tree-shaped storage structure, so that text labeling efficiency is improved and the application scenarios of text labeling are broadened.
In a first aspect, an embodiment of the present application provides a text processing method, where the method includes:
acquiring a text to be processed, and determining a plurality of entity texts and the label information of each entity text in the entity texts from the text to be processed;
storing the text to be processed into a root node of a tree-shaped label storage structure, and storing the label information of each entity text into each level of child nodes in the tree-shaped label storage structure to obtain the text to be processed stored in the tree-shaped label storage structure and the label information of each entity text, wherein one child node of the tree-shaped label storage structure is used for storing the label information of one entity text;
and labeling and/or displaying each entity text in the text to be processed according to the text to be processed stored in the tree-shaped label storage structure and the labeling information of each entity text so as to facilitate the lookup or screening of each entity text.
In the embodiment of the application, the text to be processed is obtained, and a plurality of entity texts and the labeling information of each of the entity texts are then determined from the text to be processed. The text to be processed is stored in the root node of the tree-shaped label storage structure, and the labeling information of each entity text is stored in the child nodes at each level of the tree-shaped label storage structure, so as to obtain the text to be processed and the labeling information of each entity text stored in the tree-shaped label storage structure. Here, one child node of the tree-shaped label storage structure stores the labeling information of one entity text, and the labeling information of one entity text includes all labels (i.e., one or more labels) corresponding to that entity text. Based on the text to be processed and the labeling information of each entity text stored in the tree-shaped label storage structure, each entity text in the text to be processed can be labeled and/or displayed (for example, the entity texts corresponding to a certain category of labels are highlighted in a corresponding color), so that each entity text can be looked up or screened conveniently. With the scheme provided by the embodiment of the application, the labeling information of the text to be processed can be stored in a tree-shaped storage structure and the text can be labeled according to that stored labeling information, so that text labeling efficiency is improved and the application scenarios of text labeling are broadened.
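As a minimal sketch of the storage structure described above, the following Python example models a root node holding the whole text and child nodes each holding one entity text together with all of its labels. The class name LabelNode, the helper walk, the sample sentence, and the label "city" are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LabelNode:
    """One node of a hypothetical tree-shaped label storage structure."""
    text: str                                            # root: whole text; child: one entity text
    labels: List[str] = field(default_factory=list)      # all labels of this entity text
    children: List["LabelNode"] = field(default_factory=list)

    def add_child(self, node: "LabelNode") -> "LabelNode":
        """Attach a subordinate node and return it so calls can be chained."""
        self.children.append(node)
        return node


def walk(node, depth=0):
    """Depth-first traversal yielding (depth, text, labels) for every node."""
    yield depth, node.text, tuple(node.labels)
    for child in node.children:
        yield from walk(child, depth + 1)


# The root node stores the text to be processed and carries no labels.
root = LabelNode("Company A sold X in Nanshan District, Shenzhen in 2020")
# A first-level entity text with one label.
address = root.add_child(LabelNode("Nanshan District, Shenzhen", ["operation address"]))
# A nested entity text with two labels, stored one level lower.
address.add_child(LabelNode("Shenzhen", ["operation address", "city"]))
root.add_child(LabelNode("2020", ["sales time"]))

for depth, text, labels in walk(root):
    print("  " * depth + text, list(labels))
```

Because each node keeps a list of labels and a list of children, one entity text can carry several labels and can also contain labeled sub-texts, which is the situation the linear structure in the background section cannot represent.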
With reference to the first aspect, in a possible implementation manner, the storing the label information of each entity text in each level child node of the tree-shaped label storage structure includes:
determining start-stop character information and text length of each entity text based on the label information of each entity text;
determining, based on the start-stop character information and the text length of each entity text, the subordinate entity texts of the entity texts at each level in the text to be processed, including the first-level entity texts, and storing the labeling information of the subordinate entity texts of an entity text at any level in the subordinate child nodes of the child node that stores the labeling information of that entity text in the tree-shaped label storage structure, so that the labeling information of each entity text is stored in the child nodes at each level of the tree-shaped label storage structure;
wherein a first-level entity text is a subordinate entity text of the text to be processed, and its labeling information is stored in a subordinate child node of the root node in the tree-shaped label storage structure; and the subordinate entity texts of any text comprise each entity text within that text that shares no characters with the other entity texts within that text, and each entity text within that text that shares characters with other entity texts and whose text length is greater than that of the other entity texts with which it shares characters.
In the embodiment of the application, the subordinate entity texts of the entity texts at each level in the text to be processed, including the first-level entity texts, are determined based on the start-stop character information and the text length of each entity text, and the labeling information of the subordinate entity texts of an entity text at any level is stored in the subordinate child nodes of the child node that stores the labeling information of that entity text, so that the labeling information of each entity text is stored in the child nodes at each level of the tree-shaped label storage structure. With this scheme, the labeling information of each entity text can be stored in the child nodes of the tree-shaped label storage structure according to the position of the entity text in the text to be processed (that is, the labeling information of the top-level entity texts in the text to be processed is stored in the uppermost child nodes of the tree-shaped label storage structure), so that the storage structure mirrors the text to be processed more closely and each entity text can be conveniently labeled and/or displayed according to the stored text and labeling information. For example, the labeling and/or display of every entity text can be completed in a single traversal of the labeling information stored in the child nodes of the tree-shaped storage structure, which improves text labeling efficiency.
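One possible way to place labeling information into nested child nodes from start-stop positions is sketched below. The sort-and-stack approach, the names SpanNode and build_tree, and the sample annotations are assumptions made for illustration; the application does not prescribe this procedure. Positions are taken to be 1-based and inclusive, and spans are assumed to nest cleanly without partial overlap.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SpanNode:
    start: int                                           # position of the first character (inclusive)
    end: int                                             # position of the last character (inclusive)
    labels: List[str] = field(default_factory=list)
    children: List["SpanNode"] = field(default_factory=list)


def build_tree(text_length: int, annotations: List[Tuple[int, int, str]]) -> SpanNode:
    """Nest (start, end, label) annotations under a root covering the whole text."""
    root = SpanNode(1, text_length)
    # Sort by start; for equal starts put the longer span first so that a
    # containing span is always processed before the spans it contains.
    ordered = sorted(annotations, key=lambda a: (a[0], -a[1]))
    stack = [root]
    for start, end, label in ordered:
        # Pop until the node on top of the stack contains the current span.
        while len(stack) > 1 and not (stack[-1].start <= start and end <= stack[-1].end):
            stack.pop()
        top = stack[-1]
        if top is not root and (top.start, top.end) == (start, end):
            # Same entity text annotated again: keep one node, add the label.
            top.labels.append(label)
            continue
        node = SpanNode(start, end, [label])
        top.children.append(node)
        stack.append(node)
    return root


# Hypothetical annotations, not taken from the application.
annotations = [
    (1, 6, "address"),
    (1, 3, "province"),
    (4, 6, "city"),
    (4, 6, "tax region"),
    (8, 9, "year"),
]
tree = build_tree(10, annotations)
```

In this sketch the spans (1, 6) and (8, 9) become first-level child nodes, (1, 3) and (4, 6) become subordinate child nodes of (1, 6), and the two labels attached to (4, 6) share a single node.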
With reference to the first aspect, in a possible implementation manner, the determining, based on the start-stop character information and the text length of each entity text, a subordinate entity text of each level of entity text including a first level entity text in the text to be processed includes:
determining the starting and ending character positions of each entity text according to the starting and ending character information of each entity text;
among the entity texts within any text, if the start-stop character positions of a target entity text do not include the start-stop character positions of any non-target entity text, determining the target entity text as a subordinate entity text of that text, wherein that text is the text to be processed or an entity text in the text to be processed, and a non-target entity text is an entity text other than the target entity text among the entity texts within that text;
and if the start-stop character positions of the target entity text include the start-stop character positions of a non-target entity text and the text length of the target entity text is greater than the text length of the overlapping non-target entity text, determining the target entity text as a subordinate entity text of that text, wherein the overlapping non-target entity text is the non-target entity text whose start-stop character positions are included in the start-stop character positions of the target entity text.
In the embodiment of the application, the starting and stopping character positions of each entity text are determined through the starting and stopping character information of each entity text, and then the text inclusion relationship between each entity text is determined through the starting and stopping character positions of each entity text, so that the primary entity text of the text to be processed and the secondary entity text of any level of entity text can be more easily determined, and the text labeling efficiency is further improved.
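The inclusion test above can be sketched as follows. This is one reading of the rule, under the assumption that a subordinate entity text of a given text is an entity text inside that text that is not itself contained in a longer entity text inside the same text. The function names and the sample spans are hypothetical; positions are 1-based and inclusive.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end), inclusive character positions


def contains(outer: Span, inner: Span) -> bool:
    """True when the positions of `outer` include the positions of `inner`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def length(span: Span) -> int:
    return span[1] - span[0] + 1


def subordinate_spans(parent: Span, spans: List[Span]) -> List[Span]:
    """Return the entity spans that sit directly below `parent`."""
    # Candidate entity texts: every entity span located inside the parent text.
    inside = [s for s in spans if s != parent and contains(parent, s)]
    result = []
    for target in inside:
        # A target is skipped when a longer candidate includes it; the length
        # check mirrors the text-length comparison described in the text.
        covered = any(
            other != target and contains(other, target) and length(other) > length(target)
            for other in inside
        )
        if not covered:
            result.append(target)
    return result


# Hypothetical spans inside a text of 20 characters.
spans = [(2, 5), (7, 15), (7, 9), (10, 15), (13, 15)]
first_level = subordinate_spans((1, 20), spans)
second_level = subordinate_spans((7, 15), spans)
```

Applying the function first to the whole text and then to each span it returns walks the hierarchy level by level.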
With reference to the first aspect, in a possible implementation manner, each level of child nodes in the tree-like tag storage structure includes a single-tag child node and a multi-tag child node, and the storing the label information of each entity text into each level of child nodes in the tree-like tag storage structure includes:
determining the label of each entity text according to the label information of each entity text;
if any entity text corresponds to a label, determining that the entity text is a single-label entity text, storing the label information of the single-label entity text into the child nodes of the tree-shaped label storage structure, and setting the child nodes storing the label information of the single-label entity text as the single-label child nodes;
and if any entity text corresponds to a plurality of labels, determining that the entity text is a multi-label entity text, storing the label information of the multi-label entity text into the child nodes of the tree-shaped label storage structure, and setting the child nodes storing the label information of the multi-label entity text as the multi-label child nodes.
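A short sketch of the single-label and multi-label distinction follows. It assumes annotations arrive as (start, end, label) tuples and that annotations with identical positions refer to the same entity text; the function names and the label "tax jurisdiction" are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Span = Tuple[int, int]


def group_labels(annotations: List[Tuple[int, int, str]]) -> Dict[Span, List[str]]:
    """Collect all labels of each entity text, keyed by its start-stop positions."""
    grouped = defaultdict(list)
    for start, end, label in annotations:
        if label not in grouped[(start, end)]:   # ignore repeated identical labels
            grouped[(start, end)].append(label)
    return dict(grouped)


def node_kind(labels: List[str]) -> str:
    """Classify the child node that would store these labels."""
    return "single-label" if len(labels) == 1 else "multi-label"


# Hypothetical annotations: the span (10, 18) carries two labels.
annotations = [
    (10, 18, "operation address"),
    (4, 8, "sales time"),
    (10, 18, "tax jurisdiction"),
]
grouped = group_labels(annotations)
kinds = {span: node_kind(labels) for span, labels in grouped.items()}
```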
With reference to the first aspect, in a possible implementation manner, the labeling and/or displaying, according to the to-be-processed text and the labeling information of each entity text stored in the tree label storage structure, each entity text in the to-be-processed text includes:
performing single-layer rendering and/or single-layer highlighting on the single-label entity texts corresponding to the label information stored in each single-label sub-node of the tree-shaped label storage structure to identify that one single-label entity text corresponds to one label;
and performing multi-layer rendering and/or multi-layer highlighting on the multi-label entity text corresponding to the label information stored in each multi-label sub-node of the tree-shaped label storage structure to identify one multi-label entity text corresponding to a plurality of labels, wherein one layer of rendering and/or one layer of highlighting of any multi-label entity text is used for identifying one label in the plurality of labels corresponding to any multi-label entity text.
With reference to the first aspect, in one possible implementation, the method further includes:
and performing blank marking on other texts which are not entity texts in the texts to be processed to indicate that no corresponding labels exist in the other texts.
In this embodiment of the present application, the labeling information of each entity text may be stored in the child nodes at each level of a tree-shaped label storage structure, where one child node of the tree-shaped label storage structure stores the labeling information of one entity text, and the labeling information of one entity text includes all labels corresponding to that entity text (i.e., the one label of a single-label entity text, or the multiple labels of a multi-label entity text). A child node that stores the labeling information of a single-label entity text is determined to be a single-label child node, and a child node that stores the labeling information of a multi-label entity text is determined to be a multi-label child node. Single-layer rendering and/or single-layer highlighting is performed on the single-label entity text corresponding to the labeling information stored in each single-label child node, and multi-layer rendering and/or multi-layer highlighting is performed on the multi-label entity text corresponding to the labeling information stored in each multi-label child node. Here, one layer of rendering and/or one layer of highlighting of a multi-label entity text identifies one of the labels to which that entity text corresponds. For example, the entity texts corresponding to a certain category of labels may be highlighted in a corresponding color, and the entity texts corresponding to several categories of labels may likewise be highlighted in multiple colors, so as to facilitate the lookup or screening of the entity texts. It is further understood that other texts in the text to be processed that are not entity texts may be blank-marked to indicate that they have no corresponding labels.
By adopting the scheme provided by the embodiment of the application, the label information of the text to be processed can be stored by utilizing the tree-shaped storage structure, and the text can be labeled according to the label information of the text to be processed stored by the tree-shaped storage structure, so that the label efficiency of the text is improved, and the application scenes of text labeling are enriched.
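The layered display can be illustrated with a small sketch in which each label maps to one highlight layer and non-entity text receives an empty (blank) mark. The palette, the function names, the sample sentence, and the label "city" are assumptions; the sketch uses 0-based, end-exclusive positions and non-overlapping entities to stay short.

```python
from typing import Dict, List, Tuple

Entity = Tuple[int, int, List[str]]  # (start, end_exclusive, labels)


def highlight_layers(labels: List[str], palette: Dict[str, str]) -> List[str]:
    """One highlight layer (here: a color name) per label of the entity text."""
    return [palette.get(label, "gray") for label in labels]


def mark_text(text: str, entities: List[Entity], palette: Dict[str, str]):
    """Split the text into (segment, layers); non-entity segments get no layers."""
    segments = []
    cursor = 0
    for start, end, labels in sorted(entities, key=lambda e: e[0]):
        if cursor < start:
            segments.append((text[cursor:start], []))    # blank mark: no label
        segments.append((text[start:end], highlight_layers(labels, palette)))
        cursor = end
    if cursor < len(text):
        segments.append((text[cursor:], []))             # trailing non-entity text
    return segments


text = "A sold X in Shenzhen in 2020"
city_start = text.index("Shenzhen")
year_start = text.index("2020")
entities = [
    (year_start, year_start + 4, ["sales time"]),                  # single-label
    (city_start, city_start + 8, ["operation address", "city"]),   # multi-label
]
palette = {"operation address": "yellow", "city": "blue", "sales time": "green"}
segments = mark_text(text, entities, palette)
```

A single-label entity text ends up with one layer and a multi-label entity text with one layer per label, while the remaining text keeps an empty layer list.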
In a second aspect, an embodiment of the present application provides a text processing apparatus, including:
the text acquisition module is used for acquiring texts to be processed and determining a plurality of entity texts and the label information of each entity text in the entity texts from the texts to be processed;
an information storage module, configured to store the to-be-processed text into a root node of a tree-shaped tag storage structure, and store the label information of each entity text into each level of child nodes in the tree-shaped tag storage structure, so as to obtain the to-be-processed text stored in the tree-shaped tag storage structure and the label information of each entity text, where one child node of the tree-shaped tag storage structure is used to store the label information of one entity text;
and the text labeling module is used for labeling and/or displaying each entity text in the text to be processed according to the text to be processed stored in the tree-shaped label storage structure and the labeling information of each entity text so as to facilitate the lookup or screening of each entity text.
With reference to the second aspect, in a possible implementation manner, the information storage module includes an information confirmation unit and a hierarchical storage unit, where:
the information confirming unit is used for confirming the start-stop character information and the text length of each entity text based on the label information of each entity text;
the hierarchical storage unit is configured to determine, based on the start-stop character information and the text length of each entity text, the subordinate entity texts of the entity texts at each level in the text to be processed, including the first-level entity texts, and to store the labeling information of the subordinate entity texts of an entity text at any level in the subordinate child nodes of the child node that stores the labeling information of that entity text in the tree-shaped label storage structure, so that the labeling information of each entity text is stored in the child nodes at each level of the tree-shaped label storage structure;
wherein a first-level entity text is a subordinate entity text of the text to be processed, and its labeling information is stored in a subordinate child node of the root node in the tree-shaped label storage structure; and the subordinate entity texts of any text comprise each entity text within that text that shares no characters with the other entity texts within that text, and each entity text within that text that shares characters with other entity texts and whose text length is greater than that of the other entity texts with which it shares characters.
In a third aspect, an embodiment of the present application provides a terminal device, where the terminal device includes a processor and a memory, and the processor and the memory are connected to each other. The memory is configured to store a computer program that supports the terminal device to execute the method provided by the first aspect and/or any one of the possible implementation manners of the first aspect, where the computer program includes program instructions, and the processor is configured to call the program instructions to execute the method provided by the first aspect and/or any one of the possible implementation manners of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium, in which a computer program is stored, the computer program including program instructions, which, when executed by a processor, cause the processor to perform the method provided by the first aspect and/or any one of the possible implementation manners of the first aspect.
In the embodiment of the application, the text to be processed is obtained, and a plurality of entity texts and the labeling information of each of the entity texts are then determined from the text to be processed. The text to be processed is stored in the root node of the tree-shaped label storage structure, and the labeling information of each entity text is stored in the child nodes at each level of the tree-shaped label storage structure, so as to obtain the text to be processed and the labeling information of each entity text stored in the tree-shaped label storage structure. Here, one child node of the tree-shaped label storage structure stores the labeling information of one entity text, and the labeling information of one entity text includes all labels (i.e., one or more labels) corresponding to that entity text. Based on the text to be processed and the labeling information of each entity text stored in the tree-shaped label storage structure, each entity text in the text to be processed can be labeled and/or displayed (for example, the entity texts corresponding to a certain category of labels are highlighted in a corresponding color), so that each entity text can be looked up or screened conveniently. With the scheme provided by the embodiment of the application, the labeling information of the text to be processed can be stored in a tree-shaped storage structure and the text can be labeled according to that stored labeling information, so that text labeling efficiency is improved and the application scenarios of text labeling are broadened.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flow chart of a text processing method according to an embodiment of the present application;
Fig. 2 is a schematic diagram illustrating a storage flow of a tree-shaped label storage structure provided in an embodiment of the present application;
Fig. 3 is a schematic diagram of a partial structure of a tree-shaped label storage structure provided in an embodiment of the present application;
Fig. 4 is another schematic flow chart of a text processing method provided in an embodiment of the present application;
Fig. 5 is a schematic diagram illustrating entity text labeling information provided in an embodiment of the present application;
Fig. 6 is a schematic structural diagram of a text processing apparatus according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Currently, with the development of artificial intelligence, Natural Language Processing (NLP) is widely applied in many scenarios, such as sentiment analysis, text similarity, review opinion extraction, text classification, and lexical analysis. In these applications, the NLP model needs to be trained with a large amount of labeled text. For example, in the field of production and operation, text processing is performed on the financial report text of an enterprise, and the entity texts in the financial report text are labeled and/or displayed. This can help the managers of an enterprise quickly understand the operation and tax situation of a certain enterprise and screen the related operation information for use as corpus samples in constructing an enterprise operation condition model, or help them analyze the related operation information so as to make operation decisions. Similarly, in the medical field, text processing is performed on the diagnosis report of a patient, and the entity texts in the diagnosis report are labeled and/or displayed. This can help a doctor quickly understand the presenting symptoms and treatment effects of a certain disease and screen the related diagnosis and treatment information for use as corpus samples in constructing a disease treatment model, or help the doctor analyze the related diagnosis and treatment information so as to form treatment opinions.
It can be seen that determining a plurality of entity texts and the labeling information of each entity text from the text to be processed, storing that labeling information (for example, in the tree-shaped label storage structure), and then labeling and/or displaying each entity text according to the stored labeling information has a very wide range of applications. It is to be understood that any terminal or device having a text processing function may serve as the execution subject of the text processing method provided in the present application; for convenience of description, the solution provided in the present application is described below with a text processing device as the execution subject.
Specifically, taking text processing of an enterprise's financial report text in production and operation as an example, the text processing device may obtain the financial report text of the enterprise, determine the entity texts corresponding to labels such as operation address, sales amount, sales time, and research and development investment in the financial report text, and determine the labeling information of each entity text (for example, the label corresponding to each entity text) from the financial report text. The financial report text and the labeling information of each entity text in it are stored in the tree-shaped label storage structure, and each entity text is labeled and/or displayed according to the labeling information stored there (for example, the label corresponding to each entity text). Such labeling can help the managers of an enterprise quickly understand the operation and tax situation of a certain enterprise and screen the related operation information (for example, the entity texts corresponding to labels such as operation address, sales amount, and sales time) for use as corpus samples in constructing an enterprise operation condition model; or it can help the managers analyze the related operation information (for example, the entity texts corresponding to labels such as sales amount, sales time, best-selling product, and product receiving research and development investment) so as to make operation decisions. A text processing method provided in an embodiment of the present application is described below with reference to Fig. 1.
Referring to fig. 1, fig. 1 is a flow chart illustrating a text processing method according to an embodiment of the present disclosure.
As shown in fig. 1, the method provided by the embodiment of the present application may include the following steps:
S101: Acquire the text to be processed, and determine from the text to be processed a plurality of entity texts and the labeling information of each entity text in the plurality of entity texts.
In some possible implementations, the text processing device may obtain the text to be processed (for example, "Company A's sales in Nanshan District, Shenzhen City, Guangdong Province reached X billion in 2020, and the research and development investment in the best-selling product X was 1% of sales").
The text processing device may determine, by a text recognition method (or a word segmentation and labeling tool), a plurality of entity texts in the text to be processed and the labeling information of each entity text in the plurality of entity texts. The labeling information may include the label corresponding to the entity text, the start-stop character position information of the entity text, and the text length of the entity text. The character positions and text lengths in the examples are counted on the original Chinese-language sentence. For example, entity text 1 is "2020", its label is sales time, its start-stop character positions are 4 and 8, and its text length is 5. Entity text 2 is "Nanshan District, Shenzhen City, Guangdong Province", its label is business address, its start-stop character positions are 10 and 18, and its text length is 9. Entity text 3 is "Guangdong Province", its label is business address, its start-stop character positions are 10 and 12, and its text length is 3. Entity text 4 is "Nanshan District, Shenzhen City", its label is business address, its start-stop character positions are 13 and 18, and its text length is 6. Entity text 5 is "Nanshan District", its label is business address, its start-stop character positions are 16 and 18, and its text length is 3. Entity text 6 is "product X", its labels are best-selling product and research and development investment product, its start-stop character positions are 33 and 35, and its text length is 3.
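The labeling information described in S101 can be pictured as one small record per entity text. The following is a minimal sketch, not the patent's implementation: the class name `EntityAnnotation` and its field names are illustrative assumptions, the entity strings are English renderings of the example, and positions are taken to be 1-based and inclusive so that the text length follows from the start-stop positions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class EntityAnnotation:
    """Labeling information of one entity text (names are illustrative)."""
    entity: str              # entity text, shown here in English translation
    labels: Tuple[str, ...]  # one or more labels for the entity text
    start: int               # position of the first character (1-based, inclusive)
    end: int                 # position of the last character (1-based, inclusive)

    @property
    def length(self) -> int:
        # Text length derived from the inclusive start-stop positions.
        return self.end - self.start + 1


ANNOTATIONS = [
    EntityAnnotation("2020", ("sales time",), 4, 8),
    EntityAnnotation("Nanshan District, Shenzhen City, Guangdong Province",
                     ("business address",), 10, 18),
    EntityAnnotation("Guangdong Province", ("business address",), 10, 12),
    EntityAnnotation("Nanshan District, Shenzhen City", ("business address",), 13, 18),
    EntityAnnotation("Nanshan District", ("business address",), 16, 18),
    EntityAnnotation("product X",
                     ("best-selling product",
                      "research and development investment product"), 33, 35),
]

for a in ANNOTATIONS:
    print(a.entity, a.labels, (a.start, a.end), a.length)
```

Deriving the length from the positions, rather than storing it separately, is a design choice of this sketch that keeps the two values from drifting apart.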
S102: Store the text to be processed as the root node of a tree-shaped label storage structure, and store the labeling information of each entity text in the child nodes at each level of the tree-shaped label storage structure, so as to obtain the text to be processed and the labeling information of each entity text stored in the tree-shaped label storage structure.
In some possible embodiments, after the text to be processed is stored as the root node of the tree-shaped label storage structure, the labeling information of each entity text may be stored as follows. Referring also to fig. 2, fig. 2 is a schematic diagram of a storage flow of the tree-shaped label storage structure provided in an embodiment of the present application. As shown in fig. 2, the method for storing the text to be processed and the labeling information of each entity text in the tree-shaped label storage structure may include the implementations provided in the following steps S201 to S203.
S201: Determine the start-stop character information and the text length of each entity text based on the labeling information of each entity text.
In some possible embodiments, the text processing device may determine the start-stop character information and the text length of each entity text based on the labeling information of each entity text. For example, entity text 1 is "2020", its start-stop character positions are 4 and 8, and its text length is 5. Entity text 2 is "Nanshan District, Shenzhen City, Guangdong Province", its start-stop character positions are 10 and 18, and its text length is 9. Entity text 3 is "Guangdong Province", its start-stop character positions are 10 and 12, and its text length is 3. Entity text 4 is "Nanshan District, Shenzhen City", its start-stop character positions are 13 and 18, and its text length is 6. Entity text 5 is "Nanshan District", its start-stop character positions are 16 and 18, and its text length is 3. Entity text 6 is "product X", its start-stop character positions are 33 and 35, and its text length is 3.
S202: Based on the start-stop character information and the text length of each entity text, determine the subordinate entity texts of the entity texts at every level, including the first-level entity texts in the text to be processed.
The first-level entity texts and the subordinate entity texts of the entity texts at every level are determined as follows:
In some possible embodiments, the text processing device may determine the start-stop character positions of each entity text according to the start-stop character information of each entity text. For example, entity text 1 is "2020", its start-stop character positions are 4 and 8, and its text length is 5. Entity text 2 is "Nanshan District, Shenzhen City, Guangdong Province", its start-stop character positions are 10 and 18, and its text length is 9. Entity text 3 is "Guangdong Province", its start-stop character positions are 10 and 12, and its text length is 3. Entity text 4 is "Nanshan District, Shenzhen City", its start-stop character positions are 13 and 18, and its text length is 6. Entity text 5 is "Nanshan District", its start-stop character positions are 16 and 18, and its text length is 3. Entity text 6 is "product X", its start-stop character positions are 33 and 35, and its text length is 3.
Specifically, consider the entity texts contained in one text (for example, entity text 2, "Nanshan District, Shenzhen City, Guangdong Province", which contains entity text 3 "Guangdong Province", entity text 4 "Nanshan District, Shenzhen City", and entity text 5 "Nanshan District"). If the start-stop character positions (10-12) of a target entity text (entity text 3) share no characters with the start-stop character positions (13-18 and 16-18) of the non-target entity texts (entity text 4 and entity text 5), the target entity text (entity text 3) is determined to be a subordinate entity text of that text (entity text 2). Here, the non-target entity texts (entity text 4 and entity text 5) are the entity texts contained in that text (entity text 3, entity text 4, and entity text 5) other than the target entity text (entity text 3).
Further, if the start-stop character positions (13-18) of a target entity text (entity text 4, "Nanshan District, Shenzhen City") include the start-stop character positions (16-18) of a non-target entity text (entity text 5, "Nanshan District") among the entity texts of that text, and the text length (6) of the target entity text is greater than the text length (3) of the coincident non-target entity text, the target entity text (entity text 4) is determined to be a subordinate entity text of that text (entity text 2, "Nanshan District, Shenzhen City, Guangdong Province"). Here, the coincident non-target entity text (entity text 5) is the non-target entity text whose start-stop character positions (16-18) are included in the start-stop character positions (13-18) of the target entity text (entity text 4).
It is understood that the "one text" here may be the text to be processed ("Company A's sales in Nanshan District, Shenzhen City, Guangdong Province reached X billion in 2020, and the research and development investment in the best-selling product X was 1% of sales") or any entity text in the text to be processed (entity text 1, entity text 2, entity text 3, entity text 4, entity text 5, or entity text 6). Thus, after determining the subordinate entity texts of the entity texts at every level, including the first-level entity texts in the text to be processed, based on the start-stop character information and the text length of each entity text, the text processing device may execute step S203.
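The determination in S202 can be sketched as a small function. This is one possible reading of the rule, not the patent's own code: an entity text contained in a given text is a direct subordinate if it shares no characters with the other contained entity texts, or if it is longer than every contained entity text it shares characters with. The function names and the root span `(1, 35)` are assumptions made for illustration (the root only needs to cover all entity texts).

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, stop) character positions, 1-based and inclusive

SPANS: Dict[str, Span] = {
    "entity text 1": (4, 8),
    "entity text 2": (10, 18),
    "entity text 3": (10, 12),
    "entity text 4": (13, 18),
    "entity text 5": (16, 18),
    "entity text 6": (33, 35),
}


def length(span: Span) -> int:
    return span[1] - span[0] + 1


def overlaps(a: Span, b: Span) -> bool:
    # Two inclusive spans share at least one character position.
    return a[0] <= b[1] and b[0] <= a[1]


def subordinates(parent: Span, spans: Dict[str, Span]) -> List[str]:
    """Return the names of the direct subordinate entity texts of `parent`."""
    # Entity texts contained in the parent, excluding the parent itself.
    inside = {n: s for n, s in spans.items()
              if parent[0] <= s[0] and s[1] <= parent[1] and s != parent}
    result = []
    for name, span in inside.items():
        rivals = [o for n, o in inside.items() if n != name and overlaps(span, o)]
        # No rivals: no shared characters. Otherwise it must be the longest.
        if all(length(span) > length(o) for o in rivals):
            result.append(name)
    return result


ROOT: Span = (1, 35)  # assumed span of the whole text to be processed
print(subordinates(ROOT, SPANS))
print(subordinates(SPANS["entity text 2"], SPANS))
print(subordinates(SPANS["entity text 4"], SPANS))
```

Applied to the example, the root yields entity texts 1, 2, and 6, entity text 2 yields entity texts 3 and 4, and entity text 4 yields entity text 5.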
S203: Store the labeling information of the first-level entity texts in subordinate child nodes of the root node in the tree-shaped label storage structure, and store the labeling information of the subordinate entity texts of the entity text at any level in subordinate child nodes of the child node in which the labeling information of that entity text is stored.
In some possible implementations, referring also to fig. 3, fig. 3 is a schematic partial structure diagram of a tree-shaped label storage structure provided in an embodiment of the present application. As shown in fig. 3, the first-level entity texts (entity text 1, entity text 2, and entity text 6) are the subordinate entity texts of the text to be processed, and the text processing device may store the labeling information of the first-level entity texts (entity text 1, entity text 2, and entity text 6) in the subordinate child nodes of the root node in the tree-shaped label storage structure (i.e., first-level child node A, first-level child node B, and first-level child node C). The text processing device may further store the labeling information of the subordinate entity texts (entity text 3 and entity text 4) of a first-level entity text (entity text 2) in the subordinate child nodes (second-level child node a and second-level child node b) of the child node (first-level child node B) in which the labeling information of that first-level entity text (entity text 2) is stored. The text processing device may further store the labeling information of the subordinate entity text (entity text 5) of a second-level entity text (entity text 4) in the subordinate child node (third-level child node b1) of the child node (second-level child node b) in which the labeling information of that second-level entity text (entity text 4) is stored.
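The storage step S203 can be sketched as building a tree from the subordinate relationships determined in S202. This is a minimal illustration under assumptions: the `Node` class, the `SUBORDINATES` mapping, and the helper names are hypothetical, and each node holds only a name where a real structure would hold the full labeling information.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    content: str                                   # text to be processed, or one entity text
    children: List["Node"] = field(default_factory=list)


# Result of S202 for the example: each text mapped to its direct subordinates.
SUBORDINATES: Dict[str, List[str]] = {
    "text to be processed": ["entity text 1", "entity text 2", "entity text 6"],
    "entity text 2": ["entity text 3", "entity text 4"],
    "entity text 4": ["entity text 5"],
}


def build(name: str) -> Node:
    """Store `name` in a node and its subordinates in subordinate child nodes."""
    node = Node(name)
    for child in SUBORDINATES.get(name, []):
        node.children.append(build(child))
    return node


def level_of(node: Node, target: str, level: int = 0) -> Optional[int]:
    """Return the level of `target` below `node` (root is level 0)."""
    if node.content == target:
        return level
    for child in node.children:
        found = level_of(child, target, level + 1)
        if found is not None:
            return found
    return None


root = build("text to be processed")
print([c.content for c in root.children])
print(level_of(root, "entity text 5"))
```

In this sketch entity texts 1, 2, and 6 land in first-level child nodes, entity texts 3 and 4 in second-level child nodes under entity text 2, and entity text 5 in a third-level child node under entity text 4, matching the arrangement described for fig. 3.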
In the embodiments of the present application, the text processing device can determine the start-stop character positions of each entity text from the start-stop character information of each entity text, and then determine the text inclusion relationships among the entity texts from those positions. This makes it easier to determine the first-level entity texts of the text to be processed and the subordinate entity texts of the entity text at any level, which in turn improves text labeling efficiency. In addition, the text processing device stores the labeling information of each entity text in a child node of the tree-shaped label storage structure according to the level of that entity text in the text to be processed (that is, the labeling information of the highest-level entity texts in the text to be processed is stored in the highest-level child nodes of the tree-shaped label storage structure), so that the storage structure closely mirrors the text to be processed. It is understood that, after obtaining the text to be processed and the labeling information of each entity text stored in the tree-shaped label storage structure, the text processing device may execute step S103.
S103: Label and/or display each entity text in the text to be processed according to the text to be processed and the labeling information of each entity text stored in the tree-shaped label storage structure.
In some possible embodiments, the text processing device may label and/or display each entity text in the text to be processed according to the text to be processed and the labeling information of each entity text stored in the tree-shaped label storage structure, so as to facilitate review or screening of each entity text. For example, the labeling and/or display of every entity text in the text to be processed can be completed in a single traversal of the labeling information stored in the child nodes of the tree-shaped label storage structure, which improves text labeling efficiency.
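The single traversal mentioned above can be sketched as a depth-first walk that visits every child node exactly once. The tuple layout of `TREE` and the function name `walk` are assumptions made for this sketch; the patent does not prescribe a traversal order, and pre-order is used here only because it visits a containing entity text before the entity texts it contains.

```python
from typing import Iterator, List, Optional, Tuple

# Each node: (name, labels, (start, stop) or None for the root, children)
TREE = ("text to be processed", [], None, [
    ("entity text 1", ["sales time"], (4, 8), []),
    ("entity text 2", ["business address"], (10, 18), [
        ("entity text 3", ["business address"], (10, 12), []),
        ("entity text 4", ["business address"], (13, 18), [
            ("entity text 5", ["business address"], (16, 18), []),
        ]),
    ]),
    ("entity text 6",
     ["best-selling product", "research and development investment product"],
     (33, 35), []),
])


def walk(node, level: int = 0) -> Iterator[Tuple[int, str, List[str], Tuple[int, int]]]:
    """Yield (level, name, labels, span) for every child node, pre-order."""
    name, labels, span, children = node
    if span is not None:          # the root stores the text itself, not a label
        yield level, name, labels, span
    for child in children:
        yield from walk(child, level + 1)


visited = list(walk(TREE))
for level, name, labels, span in visited:
    print("  " * level + f"{name} {span}: {', '.join(labels)}")
```

Because each node is yielded once, the cost of labeling or displaying all entity texts grows linearly with the number of stored nodes.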
It can be further understood that, after the text processing device stores the financial report text and the labeling information of each entity text in the financial report text in the tree-shaped label storage structure, it can label each entity text and/or display it to an enterprise operating condition analysis device (or another device for analyzing enterprise operating conditions, or another device capable of displaying labeling information) according to the labeling information of each entity text in the tree-shaped label storage structure. By emphasizing the labeled entity texts for the enterprise operating condition analysis device, the text processing device can help enterprise managers quickly understand the operating and tax status of an enterprise and the relevant operating information (e.g., entity texts corresponding to labels such as business address, sales amount, and sales time), and the enterprise operating condition analysis device can screen the entity texts to obtain corpus samples for constructing an enterprise operating condition model. The text processing device can also label and/or display each entity text on its own user interaction interface (or that of another device having a user interaction interface) according to the labeling information of each entity text in the tree-shaped label storage structure (for example, the label corresponding to each entity text), so as to help enterprise managers quickly understand the operating and tax status of an enterprise and analyze the relevant operating information (for example, entity texts corresponding to labels such as sales amount, sales time, best-selling product, and research and development investment product) in order to make operating decisions.
In the embodiments of the present application, the text to be processed is obtained, and a plurality of entity texts and the labeling information of each entity text in the plurality of entity texts are determined from the text to be processed. The text to be processed is stored in the root node of the tree-shaped label storage structure, and the labeling information of each entity text is stored in the child nodes at each level of the tree-shaped label storage structure, so as to obtain the text to be processed and the labeling information of each entity text stored in the tree-shaped label storage structure. Here, one child node of the tree-shaped label storage structure stores the labeling information of one entity text, and the labeling information of an entity text includes all labels (i.e., one or more labels) corresponding to that entity text. Based on the text to be processed and the labeling information of each entity text stored in the tree-shaped label storage structure, each entity text in the text to be processed can be labeled and/or displayed (for example, the entity texts corresponding to a certain category of label are highlighted in a corresponding color), so that each entity text can be conveniently reviewed or screened. With the scheme provided by the embodiments of the present application, the labeling information of the text to be processed can be stored in a tree-shaped storage structure, and the text can be labeled according to the labeling information stored in that structure, which improves text labeling efficiency and enriches the application scenarios of text labeling.
Referring to fig. 4, fig. 4 is another schematic flow chart of a text processing method according to an embodiment of the present application.
S301: Acquire the text to be processed, and determine from the text to be processed a plurality of entity texts and the labeling information of each entity text in the plurality of entity texts.
In some possible implementations, the text processing device may obtain the text to be processed (for example, "Company A's sales in Nanshan District, Shenzhen City, Guangdong Province reached X billion in 2020, and the research and development investment in the best-selling product X was 1% of sales").
The text processing device may determine, by a text recognition method (or a word segmentation and labeling tool), a plurality of entity texts in the text to be processed and the labeling information of each entity text in the plurality of entity texts. The labeling information may include the label corresponding to the entity text, the start-stop character position information of the entity text, and the text length of the entity text. For example, entity text 1 is "2020", its label is sales time, its start-stop character positions are 4 and 8, and its text length is 5. Entity text 2 is "Nanshan District, Shenzhen City, Guangdong Province", its label is business address, its start-stop character positions are 10 and 18, and its text length is 9. Entity text 3 is "Guangdong Province", its label is business address, its start-stop character positions are 10 and 12, and its text length is 3. Entity text 4 is "Nanshan District, Shenzhen City", its label is business address, its start-stop character positions are 13 and 18, and its text length is 6. Entity text 5 is "Nanshan District", its label is business address, its start-stop character positions are 16 and 18, and its text length is 3. Entity text 6 is "product X", its labels are best-selling product and research and development investment product, its start-stop character positions are 33 and 35, and its text length is 3.
S302: Store the text to be processed in the root node of the tree-shaped label storage structure, and determine the label or labels of each entity text according to the labeling information of each entity text.
In some possible embodiments, the text processing device may store the text to be processed ("Company A's sales in Nanshan District, Shenzhen City, Guangdong Province reached X billion in 2020, and the research and development investment in the best-selling product X was 1% of sales") in the root node of the tree-shaped label storage structure, and determine the label or labels of each entity text according to the labeling information of each entity text. For example, entity text 1 is "2020", and its label is sales time. Entity text 2 is "Nanshan District, Shenzhen City, Guangdong Province", and its label is business address. Entity text 3 is "Guangdong Province", and its label is business address. Entity text 4 is "Nanshan District, Shenzhen City", and its label is business address. Entity text 5 is "Nanshan District", and its label is business address. Entity text 6 is "product X", and its labels are best-selling product and research and development investment product.
S303: If an entity text corresponds to one label, determine that the entity text is a single-label entity text, store the labeling information of the single-label entity text in a child node of the tree-shaped label storage structure, and set the child node storing the labeling information of the single-label entity text as a single-label child node.
In some possible embodiments, as shown in fig. 3, the text processing device may determine the number of labels corresponding to each entity text. If an entity text (e.g., entity text 1, entity text 2, entity text 3, entity text 4, or entity text 5) corresponds to one label, the text processing device determines that the entity text is a single-label entity text, stores the labeling information of the single-label entity text in a child node of the tree-shaped label storage structure, and sets the child nodes storing the labeling information of single-label entity texts (first-level child node A, first-level child node B, second-level child node a, second-level child node b, and third-level child node b1) as single-label child nodes.
S304: If an entity text corresponds to a plurality of labels, determine that the entity text is a multi-label entity text, store the labeling information of the multi-label entity text in a child node of the tree-shaped label storage structure, and set the child node storing the labeling information of the multi-label entity text as a multi-label child node.
In some possible embodiments, as shown in fig. 3, the text processing device may determine the number of labels corresponding to each entity text. If an entity text (e.g., entity text 6) corresponds to a plurality of labels (best-selling product and research and development investment product), the text processing device determines that the entity text is a multi-label entity text, stores the labeling information of the multi-label entity text in a child node of the tree-shaped label storage structure, and sets the child node storing the labeling information of the multi-label entity text (first-level child node C) as a multi-label child node.
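Steps S303 and S304 amount to classifying each child node by the number of labels it stores. The sketch below illustrates that check only; the function name `node_kind` and the dictionary layout are assumptions, not part of the described method.

```python
from typing import Dict, List

# Labels of each entity text in the example.
LABELS: Dict[str, List[str]] = {
    "entity text 1": ["sales time"],
    "entity text 2": ["business address"],
    "entity text 3": ["business address"],
    "entity text 4": ["business address"],
    "entity text 5": ["business address"],
    "entity text 6": ["best-selling product",
                      "research and development investment product"],
}


def node_kind(labels: List[str]) -> str:
    """Classify a child node by how many labels its entity text has."""
    if not labels:
        raise ValueError("an entity text must have at least one label")
    return "single-label" if len(labels) == 1 else "multi-label"


KINDS = {name: node_kind(labels) for name, labels in LABELS.items()}
for name, kind in KINDS.items():
    print(f"{name}: {kind} child node")
```

Here entity texts 1 to 5 map to single-label child nodes and entity text 6 maps to a multi-label child node.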
S305: Perform single-layer rendering and/or single-layer highlighting on the single-label entity text corresponding to the labeling information stored in each single-label child node of the tree-shaped label storage structure.
S306: Perform multi-layer rendering and/or multi-layer highlighting on the multi-label entity text corresponding to the labeling information stored in each multi-label child node of the tree-shaped label storage structure.
In some possible implementations, referring also to fig. 5, fig. 5 is a schematic illustration of the display of entity text labeling information provided in an embodiment of the present application. As shown in fig. 5, the text processing device may perform single-layer rendering and/or single-layer highlighting on the single-label entity texts (entity text 1, entity text 2, entity text 3, entity text 4, and entity text 5) corresponding to the labeling information stored in the single-label child nodes (first-level child node A, first-level child node B, second-level child node a, second-level child node b, and third-level child node b1) of the tree-shaped label storage structure. Further, the text processing device may perform multi-layer rendering and/or multi-layer highlighting on the multi-label entity text (entity text 6) corresponding to the labeling information stored in each multi-label child node (first-level child node C) of the tree-shaped label storage structure. Here, each layer of rendering and/or highlighting of a multi-label entity text identifies one of the plurality of labels corresponding to that multi-label entity text. That is, the text processing device may highlight the entity texts corresponding to a certain category of label in a corresponding color; it is understood that entity texts corresponding to several categories of labels may be highlighted in several colors, so as to facilitate review or screening of the entity texts.
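The layered highlighting of S305 and S306 can be sketched as one highlight layer per label. Everything specific here is an assumption made for illustration: the color palette, the bracket notation standing in for real rendering, and the function names are hypothetical, and an actual display would use whatever rendering the user interface provides.

```python
from typing import Dict, List, Tuple

# Hypothetical color assigned to each label category.
PALETTE: Dict[str, str] = {
    "sales time": "yellow",
    "business address": "green",
    "best-selling product": "red",
    "research and development investment product": "blue",
}


def highlight_layers(labels: List[str]) -> List[Tuple[str, str]]:
    """One (label, color) layer per label; a single label gives a single layer."""
    return [(label, PALETTE[label]) for label in labels]


def render(text: str, labels: List[str]) -> str:
    """Wrap the text in one bracketed layer per label, first label outermost."""
    out = text
    for _label, color in reversed(highlight_layers(labels)):
        out = f"[{color}:{out}]"
    return out


print(render("2020", ["sales time"]))
print(render("product X", ["best-selling product",
                           "research and development investment product"]))
print(render("reached", []))  # non-entity text stays unmarked
```

A single-label entity text receives one layer, a multi-label entity text receives one layer per label, and text with no labels passes through unchanged, which corresponds to leaving non-entity text unmarked.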
In some possible embodiments, as shown in fig. 6, the text processing device may apply a blank label to the other text in the text to be processed that is not entity text, to indicate that the other text has no corresponding label. It can be further understood that, after the text processing device stores the financial report text and the labeling information of each entity text in the financial report text in the tree-shaped label storage structure, it can label each entity text and/or display it to an enterprise operating condition analysis device (or another device for analyzing enterprise operating conditions, or another device capable of displaying labeling information) according to the labeling information of each entity text in the tree-shaped label storage structure. By emphasizing the labeled entity texts for the enterprise operating condition analysis device, the text processing device can help enterprise managers quickly understand the operating and tax status of an enterprise and the relevant operating information (e.g., entity texts corresponding to labels such as business address, sales amount, and sales time), and the enterprise operating condition analysis device can screen the entity texts to obtain corpus samples for constructing an enterprise operating condition model.
The text processing device can also label and/or display each entity text on its own user interaction interface (or that of another device having a user interaction interface) according to the labeling information of each entity text in the tree-shaped label storage structure (for example, the label corresponding to each entity text), so as to help enterprise managers quickly understand the operating and tax status of an enterprise and analyze the relevant operating information (for example, entity texts corresponding to labels such as sales amount, sales time, best-selling product, and research and development investment product) in order to make operating decisions.
In this embodiment of the present application, the labeling information of each entity text may be stored in the child nodes at each level of a tree-shaped label storage structure, where one child node of the tree-shaped label storage structure stores the labeling information of one entity text, and the labeling information of an entity text includes all labels corresponding to that entity text (i.e., the one label corresponding to a single-label entity text, or the plurality of labels corresponding to a multi-label entity text). A child node storing the labeling information of a single-label entity text is determined to be a single-label child node, and a child node storing the labeling information of a multi-label entity text is determined to be a multi-label child node. Single-layer rendering and/or single-layer highlighting is performed on the single-label entity text corresponding to the labeling information stored in each single-label child node of the tree-shaped label storage structure, and multi-layer rendering and/or multi-layer highlighting is performed on the multi-label entity text corresponding to the labeling information stored in each multi-label child node. Here, each layer of rendering and/or highlighting of a multi-label entity text identifies one of the plurality of labels corresponding to that multi-label entity text. For example, the entity texts corresponding to a certain category of label may be highlighted in a corresponding color; it is understood that entity texts corresponding to several categories of labels may be highlighted in several colors, so as to facilitate review or screening of the entity texts. It is further understood that the other text in the text to be processed that is not entity text may be given a blank label to indicate that the other text has no corresponding label.
With the scheme provided by the embodiments of the present application, the labeling information of the text to be processed can be stored in a tree-shaped storage structure, and the text can be labeled according to the labeling information stored in that structure, which improves text labeling efficiency and enriches the application scenarios of text labeling.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a text processing apparatus according to an embodiment of the present application.
The text obtaining module 401 is configured to obtain a text to be processed, and determine, from the text to be processed, a plurality of entity texts and the labeling information of each entity text in the plurality of entity texts.
In some possible implementations, the text obtaining module 401 may obtain the text to be processed (for example, "Company A's sales in Nanshan District, Shenzhen City, Guangdong Province reached X billion in 2020, and the research and development investment in the best-selling product X was 1% of sales").
The text processing device may determine, by a text recognition method (or a word segmentation and labeling tool), a plurality of entity texts in the text to be processed and the labeling information of each entity text in the plurality of entity texts. The labeling information may include the label corresponding to the entity text, the start-stop character position information of the entity text, and the text length of the entity text. For example, entity text 1 is "2020", its label is sales time, its start-stop character positions are 4 and 8, and its text length is 5. Entity text 2 is "Nanshan District, Shenzhen City, Guangdong Province", its label is business address, its start-stop character positions are 10 and 18, and its text length is 9. Entity text 3 is "Guangdong Province", its label is business address, its start-stop character positions are 10 and 12, and its text length is 3. Entity text 4 is "Nanshan District, Shenzhen City", its label is business address, its start-stop character positions are 13 and 18, and its text length is 6. Entity text 5 is "Nanshan District", its label is business address, its start-stop character positions are 16 and 18, and its text length is 3. Entity text 6 is "product X", its labels are best-selling product and research and development investment product, its start-stop character positions are 33 and 35, and its text length is 3.
The information storage module 402 is configured to store the text to be processed into a root node of the tree-shaped tag storage structure, and store the label information of each entity text into each level of child nodes in the tree-shaped tag storage structure, so as to obtain the text to be processed stored in the tree-shaped tag storage structure and the label information of each entity text, where one child node of the tree-shaped tag storage structure is used to store the label information of one entity text.
In some possible implementations, the information storage module 402 includes an information confirmation unit and a hierarchical storage unit. The information confirmation unit is configured to determine the start-stop character information and the text length of each entity text based on the labeling information of each entity text. The hierarchical storage unit is configured to determine, based on the start-stop character information and the text length of each entity text, the subordinate entity texts of the entity texts at every level, including the first-level entity texts in the text to be processed, and to store the labeling information of the subordinate entity texts of an entity text at any level into the child nodes directly below the child node that stores the labeling information of that entity text, so that the labeling information of each entity text is stored into the child nodes at each level of the tree-shaped label storage structure. The first-level entity texts are the subordinate entity texts of the text to be processed, and their labeling information is stored in the child nodes directly below the root node of the tree-shaped label storage structure. Among the entity texts contained in any given text, the subordinate entity texts of that text are the entity texts that share no characters with any other entity text, and the entity texts that share characters with other entity texts and whose text length is greater than that of the entity texts they share characters with.
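One way to realize such hierarchical storage is sketched below. It is an illustration under stated assumptions, not the method fixed by the application: the names `Node` and `build_tree` are hypothetical, positions are taken as inclusive, the text length of 40 is a placeholder, and entity spans are assumed to be either nested or disjoint (partially overlapping spans are not handled specially). Spans are sorted by start position with longer spans first, and a stack of open ancestors decides under which node each span is stored.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """A tree node: the root stands for the whole text, children for entity texts."""
    name: str
    start: int
    end: int  # inclusive
    children: List["Node"] = field(default_factory=list)


def build_tree(text_len: int, spans: List[Tuple[str, int, int]]) -> Node:
    root = Node("root", 0, text_len)
    # A containing span sorts before the spans it contains.
    ordered = sorted(spans, key=lambda s: (s[1], -(s[2] - s[1] + 1)))
    stack = [root]
    for name, start, end in ordered:
        # Drop open ancestors that do not fully contain the new span;
        # the root is never dropped.
        while len(stack) > 1 and not (
            stack[-1].start <= start and end <= stack[-1].end
        ):
            stack.pop()
        node = Node(name, start, end)
        stack[-1].children.append(node)
        stack.append(node)
    return root


# Spans of entity texts 1 to 6 from the example (text length is a placeholder).
tree = build_tree(40, [
    ("e1", 4, 8), ("e2", 10, 18), ("e3", 10, 12),
    ("e4", 13, 18), ("e5", 16, 18), ("e6", 33, 35),
])
```

For the example, entity texts 1, 2 and 6 become first-level nodes, entity texts 3 and 4 are stored below entity text 2, and entity text 5 is stored below entity text 4.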
The text labeling module 403 is configured to label and/or display each entity text in the text to be processed according to the text to be processed and the labeling information of each entity text stored in the tree-shaped label storage structure, so as to facilitate the lookup or screening of each entity text.
In some possible embodiments, the text labeling module 403 may label and/or display each entity text in the text to be processed according to the text to be processed stored in the tree-shaped tag storage structure and the labeling information of each entity text, so as to facilitate the lookup or screening of each entity text. For example, the labeling and/or display of each entity text in the text to be processed can be completed by performing one-time traversal on the labeling information of the text to be processed stored in each child node of the tree-like storage structure, so that the labeling efficiency of the text is improved.
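The single traversal mentioned above can be illustrated with a pre-order walk that visits every node exactly once. This is a sketch only: the dictionary layout (`text`, `labels`, `children`), the helper `leaf`, and the placeholder root text are assumptions made for the example, and the entity texts are given in English.

```python
from typing import Dict, Iterator, List, Tuple


def walk(node: Dict, depth: int = 0) -> Iterator[Tuple[int, str, List[str]]]:
    """Pre-order traversal: yields (depth, text, labels) once per node."""
    yield depth, node["text"], node["labels"]
    for child in node["children"]:
        yield from walk(child, depth + 1)


def leaf(text: str, labels: List[str]) -> Dict:
    return {"text": text, "labels": labels, "children": []}


tree = {
    "text": "<text to be processed>",  # placeholder for the root text
    "labels": [],
    "children": [
        leaf("2020", ["time of sale"]),
        {
            "text": "Nanshan District, Shenzhen, Guangdong Province",
            "labels": ["business address"],
            "children": [
                leaf("Guangdong Province", ["business address"]),
                {
                    "text": "Nanshan District, Shenzhen",
                    "labels": ["business address"],
                    "children": [leaf("Nanshan District", ["business address"])],
                },
            ],
        },
        leaf("product X", ["best-selling product",
                           "research and development investment product"]),
    ],
}

visited = list(walk(tree))
```

Each visited node carries everything needed to label or highlight its entity text, so no second pass over the tree is required.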
In this embodiment of the application, the text processing device may obtain the text to be processed, determine the multiple entity texts and the label information of each entity text in the multiple entity texts from the text to be processed, store the text to be processed into a root node of the tree-shaped label storage structure, and store the label information of each entity text into each level of sub-nodes in the tree-shaped label storage structure, so as to obtain the text to be processed stored in the tree-shaped label storage structure and the label information of each entity text. Here, one child node of the tree-shaped tag storage structure is configured to store the label information of one entity text, where the label information of one entity text includes all tags (i.e., one or more tags) corresponding to the entity text. Based on the text to be processed and the labeling information of each entity text stored in the tree-shaped tag storage structure, the text processing device can label and/or display each entity text in the text to be processed (for example, highlight the entity text corresponding to a certain category of tags in a corresponding color) so as to facilitate the lookup or screening of each entity text. By adopting the scheme provided by the embodiment of the application, the text processing device can utilize the tree-shaped storage structure to store the marking information of the text to be processed, and marks the text according to the marking information of the text to be processed stored by the tree-shaped storage structure, so that the marking efficiency of the text is improved, and the application scene of text marking is enriched.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a terminal device provided in an embodiment of the present application. As shown in fig. 7, the terminal device in this embodiment may include: one or more processors 501 and memory 502. The processor 501 and the memory 502 are connected by a bus 503. The memory 502 is used for storing a computer program comprising program instructions, and the processor 501 is used for executing the program instructions stored in the memory 502 to perform the following operations:
acquiring a text to be processed, and determining a plurality of entity texts and the label information of each entity text in the entity texts from the text to be processed;
storing the text to be processed into a root node of a tree-shaped label storage structure, and storing the label information of each entity text into each level of child nodes in the tree-shaped label storage structure to obtain the text to be processed stored in the tree-shaped label storage structure and the label information of each entity text, wherein one child node of the tree-shaped label storage structure is used for storing the label information of one entity text;
and labeling and/or displaying each entity text in the text to be processed according to the text to be processed stored in the tree-shaped label storage structure and the labeling information of each entity text so as to facilitate the lookup or screening of each entity text.
In some possible embodiments, the processor 501 is further configured to:
determining start-stop character information and text length of each entity text based on the label information of each entity text;
determining, based on the start-stop character information and the text length of each entity text, the subordinate entity texts of the entity texts at every level, including the first-level entity texts in the text to be processed, and storing the label information of the subordinate entity texts of an entity text at any level into the child nodes directly below the child node that stores the label information of that entity text, so as to store the label information of each entity text into the child nodes at each level of the tree-shaped label storage structure;
wherein the first-level entity texts are the subordinate entity texts of the text to be processed, the label information of the first-level entity texts is stored in the child nodes directly below the root node of the tree-shaped label storage structure, and the subordinate entity texts of any text comprise, among the entity texts contained in that text, the entity texts that share no characters with other entity texts, and the entity texts that share characters with other entity texts and have a text length greater than that of the entity texts they share characters with.
In some possible embodiments, the processor 501 is configured to:
determining the starting and ending character positions of each entity text according to the starting and ending character information of each entity text;
in an entity text of any text, if the start-stop character position of a target entity text does not include the start-stop character position of the non-target entity text, determining the target entity text as a subordinate entity text of the any text, wherein the any text is the text to be processed or an entity text in the text to be processed, and the non-target entity text is an entity text except the target entity text in the entity text of the any text;
and if the start-stop character positions of the target entity text comprise the start-stop character positions of the non-target entity text and the text length of the target entity text is greater than the text length of the overlapped non-target entity text, determining the target entity text as a subordinate entity text of any one of the texts, wherein the overlapped non-target entity text is the non-target entity text of which the start-stop character positions are included in the start-stop character positions of the target entity text.
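The selection of subordinate entity texts within one text can be sketched as a pairwise comparison of spans. The sketch below follows one reading of the rule, namely that a target entity text is selected when no other candidate shares characters with it, or when it is longer than every candidate it shares characters with. The function names are hypothetical, positions are assumed to be inclusive, and candidates with identical spans are not handled.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # inclusive (start, end) character positions


def length(span: Span) -> int:
    return span[1] - span[0] + 1


def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share at least one character position."""
    return a[0] <= b[1] and b[0] <= a[1]


def direct_subordinates(candidates: List[Span]) -> List[Span]:
    selected = []
    for target in candidates:
        rivals = [c for c in candidates if c != target and overlaps(target, c)]
        # No rivals: the target stands alone. Otherwise it must be the longest.
        if all(length(target) > length(r) for r in rivals):
            selected.append(target)
    return selected


# Entity texts 1 to 6 of the example, considered within the text to be processed.
first_level = direct_subordinates(
    [(4, 8), (10, 18), (10, 12), (13, 18), (16, 18), (33, 35)]
)
# Entity texts 3 to 5, considered within entity text 2.
second_level = direct_subordinates([(10, 12), (13, 18), (16, 18)])
```

Applying the same function again to the entity texts contained in each selected span yields the next level, which is how the levels of the tree can be filled one after another.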
In some possible embodiments, the processor 501 is configured to:
determining the label of each entity text according to the label information of each entity text;
if any entity text corresponds to a label, determining that the entity text is a single-label entity text, storing the label information of the single-label entity text into the child nodes of the tree-shaped label storage structure, and setting the child nodes storing the label information of the single-label entity text as the single-label child nodes;
and if any entity text corresponds to a plurality of labels, determining that the entity text is a multi-label entity text, storing the label information of the multi-label entity text into the child nodes of the tree-shaped label storage structure, and setting the child nodes storing the label information of the multi-label entity text as the multi-label child nodes.
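The distinction between single-label and multi-label child nodes can be sketched by recording a node kind when the node is created. The names `NodeKind`, `ChildNode` and `make_child_node` are illustrative assumptions and do not appear in the application.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence, Tuple


class NodeKind(Enum):
    SINGLE_LABEL = "single-label"
    MULTI_LABEL = "multi-label"


@dataclass(frozen=True)
class ChildNode:
    text: str
    labels: Tuple[str, ...]
    kind: NodeKind


def make_child_node(text: str, labels: Sequence[str]) -> ChildNode:
    """Store the labels of one entity text and mark the node by label count."""
    if not labels:
        raise ValueError("an entity text needs at least one label")
    kind = NodeKind.SINGLE_LABEL if len(labels) == 1 else NodeKind.MULTI_LABEL
    return ChildNode(text, tuple(labels), kind)


single = make_child_node("2020", ["time of sale"])
multi = make_child_node(
    "product X",
    ["best-selling product", "research and development investment product"],
)
```

Storing the kind on the node lets a later rendering step choose between single-layer and multi-layer display without counting labels again.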
In some possible embodiments, the processor 501 is configured to:
performing single-layer rendering and/or single-layer highlighting on the single-label entity texts corresponding to the label information stored in each single-label sub-node of the tree-shaped label storage structure to identify that one single-label entity text corresponds to one label;
and performing multi-layer rendering and/or multi-layer highlighting on the multi-label entity text corresponding to the label information stored in each multi-label sub-node of the tree-shaped label storage structure to identify one multi-label entity text corresponding to a plurality of labels, wherein one layer of rendering and/or one layer of highlighting of any multi-label entity text is used for identifying one label in the plurality of labels corresponding to any multi-label entity text.
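As an illustration of layered rendering, the sketch below wraps an entity text in one HTML `span` element per label, so that a single-label entity text gets one layer and a multi-label entity text gets several. HTML output, the class name `label-layer`, and the `data-label` attribute are assumptions chosen for the example; the application does not prescribe a rendering target.

```python
from html import escape
from typing import Sequence


def render_entity(text: str, labels: Sequence[str]) -> str:
    """Wrap the entity text in one <span> layer per label.

    The first label becomes the innermost layer, the last label the outermost.
    """
    html = escape(text)
    for label in labels:
        html = (
            f'<span class="label-layer" data-label="{escape(label)}">'
            f"{html}</span>"
        )
    return html


single_layer = render_entity("2020", ["time of sale"])
multi_layer = render_entity(
    "product X", ["best-selling product", "R&D investment product"]
)
```

A stylesheet could then assign a distinct background or underline to each layer, so that every label of a multi-label entity text stays visible.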
In some possible embodiments, the processor 501 is configured to:
and performing blank marking on other texts which are not entity texts in the texts to be processed to indicate that no corresponding labels exist in the other texts.
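Blank marking can be sketched as splitting the text into consecutive segments, where the segments that lie between entity texts carry an empty label tuple. The sketch makes several assumptions: the function name `segments` is hypothetical, spans use Python's 0-based, end-exclusive slice convention rather than the inclusive positions of the example, and only non-overlapping first-level entity texts are passed in.

```python
from typing import List, Sequence, Tuple

Entity = Tuple[int, int, Sequence[str]]  # (start, end_exclusive, labels)


def segments(text: str, entities: List[Entity]) -> List[Tuple[str, Tuple[str, ...]]]:
    """Split text into labeled entity segments and blank (unlabeled) segments."""
    result = []
    cursor = 0
    for start, end, labels in sorted(entities, key=lambda e: e[0]):
        if start > cursor:
            result.append((text[cursor:start], ()))  # blank marking
        result.append((text[start:end], tuple(labels)))
        cursor = end
    if cursor < len(text):
        result.append((text[cursor:], ()))  # trailing blank segment
    return result


text = "In 2020 company A sold product X"
year_start = text.index("2020")
product_start = text.index("product X")
parts = segments(text, [
    (product_start, product_start + len("product X"), ["best-selling product"]),
    (year_start, year_start + len("2020"), ["time of sale"]),
])
```

Joining the segment texts in order reproduces the original text, which gives a simple check that no character was dropped or labeled twice.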
In some possible embodiments, the processor 501 may be a Central Processing Unit (CPU), or it may be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 502 may include both read-only memory and random access memory, and provides instructions and data to the processor 501. A portion of the memory 502 may also include non-volatile random access memory. For example, the memory 502 may also store device type information.
In a specific implementation, the terminal device may execute, through its built-in functional modules, the text processing method provided in the steps of fig. 1, fig. 2, and fig. 4. For details, reference may be made to the implementations provided for those steps, which are not described herein again.
In the embodiment of the application, the text to be processed is obtained, and a plurality of entity texts and the labeling information of each of the entity texts are determined from the text to be processed. The text to be processed is stored into a root node of the tree-shaped label storage structure, and the labeling information of each entity text is stored into the child nodes at each level of the tree-shaped label storage structure, so as to obtain the text to be processed and the labeling information of each entity text stored in the tree-shaped label storage structure. Here, one child node of the tree-shaped label storage structure is configured to store the labeling information of one entity text, where the labeling information of one entity text includes all labels (i.e., one or more labels) corresponding to the entity text. Based on the text to be processed and the labeling information of each entity text stored in the tree-shaped label storage structure, each entity text in the text to be processed can be labeled and/or displayed (for example, the entity text corresponding to a certain category of labels is highlighted in a corresponding color), so that each entity text can be looked up or screened conveniently. By adopting the scheme provided by the embodiment of the application, the labeling information of the text to be processed can be stored by using the tree-shaped storage structure, and the text can be labeled according to the labeling information stored in the tree-shaped storage structure, so that the labeling efficiency of the text is improved and the application scenarios of text labeling are enriched.
An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, where the computer program includes program instructions, and when the program instructions are executed by a processor, the text processing method provided in each step in fig. 1, fig. 2, and fig. 4 is implemented.
The computer-readable storage medium may be an internal storage unit of the text processing apparatus provided in any of the foregoing embodiments or of the terminal device, such as a hard disk or a memory of an electronic device. The computer-readable storage medium may also be an external storage device of the electronic device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, and the like, provided on the electronic device. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the electronic device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the electronic device. The computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.
The terms "first", "second", "third", "fourth", and the like in the claims and in the description and drawings of the present application are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus. Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments. The term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and that the components and steps of the examples have been described above generally in terms of their functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The method and the related apparatus provided by the embodiments of the present application are described with reference to the flowchart and/or the structural diagram of the method provided by the embodiments of the present application, and each flow and/or block of the flowchart and/or the structural diagram of the method, and the combination of the flow and/or block in the flowchart and/or the block diagram can be specifically implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block or blocks of the block diagram. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block or blocks of the block diagram. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block or blocks.

Claims (10)

1. A method of text processing, the method comprising:
acquiring a text to be processed, and determining a plurality of entity texts and the labeling information of each entity text in the entity texts from the text to be processed;
storing the text to be processed into a root node of a tree-shaped label storage structure, and storing the labeling information of each entity text into each level of child nodes in the tree-shaped label storage structure to obtain the text to be processed stored in the tree-shaped label storage structure and the labeling information of each entity text, wherein one child node of the tree-shaped label storage structure is used for storing the labeling information of one entity text;
and labeling and/or displaying each entity text in the text to be processed according to the text to be processed stored in the tree-shaped label storage structure and the labeling information of each entity text, so as to facilitate the lookup or screening of each entity text.
2. The method according to claim 1, wherein said storing the label information of each entity text into each level of child nodes of the tree-like label storage structure comprises:
determining start-stop character information and text length of each entity text based on the labeling information of each entity text;
determining subordinate entity texts of entity texts at all levels including a first-level entity text in the text to be processed based on the start-stop character information and the text length of each entity text, and storing the label information of the subordinate entity text of any level of entity text to subordinate child nodes of child nodes stored in the label information of any level of entity text in the tree-shaped label storage structure so as to store the label information of each entity text to each level of child nodes in the tree-shaped label storage structure;
the primary entity text is a subordinate entity text of the text to be processed, the label information of the primary entity text is stored in a subordinate child node of the root node in the tree-shaped label storage structure, the subordinate entity text of any text comprises an entity text which has no repeated characters with other entity texts in the entity text of any text, and an entity text which has repeated characters with other entity texts in the entity text of any text and has a text length larger than that of other entity texts with repeated characters.
3. The method according to claim 2, wherein the determining, based on the start-stop character information and the text length of each entity text, subordinate entity texts of each level of entity texts including a primary entity text in the text to be processed comprises:
determining the starting and ending character positions of each entity text according to the starting and ending character information of each entity text;
in an entity text of any text, if the start-stop character positions of a target entity text do not include the start-stop character positions of a non-target entity text, determining the target entity text as a subordinate entity text of the any text, wherein the any text is the text to be processed or an entity text in the text to be processed, and the non-target entity text is an entity text in the entity text of the any text except the target entity text;
and if the start-stop character positions of the target entity text comprise the start-stop character positions of the non-target entity text and the text length of the target entity text is greater than the text length of the overlapped non-target entity text, determining the target entity text as a subordinate entity text of any text, wherein the overlapped non-target entity text is the non-target entity text of which the start-stop character positions are included in the start-stop character positions of the target entity text.
4. The method according to any one of claims 1-3, wherein each level of child nodes in the tree-like tag storage structure includes a single-tag child node and a multi-tag child node, and the storing the labeling information of each entity text into each level of child nodes in the tree-like tag storage structure comprises:
determining the label of each entity text according to the labeling information of each entity text;
if any entity text corresponds to a label, determining that the entity text is a single-label entity text, storing the labeling information of the single-label entity text into the child nodes of the tree-shaped label storage structure, and setting the child nodes storing the labeling information of the single-label entity text as single-label child nodes;
if any entity text corresponds to a plurality of labels, determining that the entity text is a multi-label entity text, storing the labeling information of the multi-label entity text into the child nodes of the tree-shaped label storage structure, and setting the child nodes storing the labeling information of the multi-label entity text as multi-label child nodes.
5. The method according to claim 4, wherein the labeling and/or displaying the entity texts in the to-be-processed text according to the to-be-processed text stored in the tree tag storage structure and the labeling information of the entity texts comprises:
performing single-layer rendering and/or single-layer highlighting on the single-label entity texts corresponding to the labeling information stored in each single-label sub-node of the tree-shaped label storage structure to identify that one single-label entity text corresponds to one label;
and performing multi-layer rendering and/or multi-layer highlighting on the multi-label entity text corresponding to the labeling information stored in each multi-label sub-node of the tree-shaped label storage structure to identify one multi-label entity text corresponding to a plurality of labels, wherein one layer of rendering and/or one layer of highlighting of any multi-label entity text is used for identifying one label in the plurality of labels corresponding to any multi-label entity text.
6. The method according to any one of claims 1-5, further comprising:
and performing blank marking on other texts which are not entity texts in the texts to be processed to indicate that no corresponding labels exist in the other texts.
7. A text processing apparatus, characterized in that the apparatus comprises:
the text acquisition module is used for acquiring texts to be processed and determining a plurality of entity texts and the label information of each entity text in the entity texts from the texts to be processed;
the information storage module is used for storing the text to be processed into a root node of a tree-shaped label storage structure and storing the labeling information of each entity text into each level of child nodes in the tree-shaped label storage structure so as to obtain the text to be processed and the labeling information of each entity text which are stored in the tree-shaped label storage structure, wherein one child node of the tree-shaped label storage structure is used for storing the labeling information of one entity text;
and the text labeling module is used for labeling and/or displaying the entity texts in the texts to be processed according to the texts to be processed and the labeling information of the entity texts stored in the tree-shaped label storage structure so as to facilitate the lookup or screening of the entity texts.
8. The apparatus of claim 7, wherein the information storage module comprises an information confirmation unit and a hierarchical storage unit, wherein:
the information confirming unit is used for confirming the start-stop character information and the text length of each entity text based on the marking information of each entity text;
the hierarchical storage unit is configured to determine, based on the start-stop character information and the text length of each entity text, a subordinate entity text of each level of entity text including a first level of entity text in the text to be processed, and store label information of the subordinate entity text of any level of entity text to a subordinate child node of a child node in the tree-shaped label storage structure, where the label information of any level of entity text is stored, so as to store the label information of each entity text to each level of child node in the tree-shaped label storage structure;
the primary entity text is a subordinate entity text of the text to be processed, the label information of the primary entity text is stored in a subordinate child node of the root node in the tree-shaped label storage structure, the subordinate entity text of any text comprises an entity text which has no repeated characters with other entity texts in the entity text of any text, and an entity text which has repeated characters with other entity texts in the entity text of any text and has a text length larger than that of other entity texts with repeated characters.
9. A terminal device, characterized in that it comprises a processor and a memory, said processor and memory being interconnected, wherein said memory is adapted to store a computer program comprising program instructions, said processor being configured to invoke said program instructions to perform the method according to any one of claims 1-6.
10. A computer-readable storage medium, characterized in that the computer storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method according to any of claims 1-6.
CN202110524156.XA 2021-05-13 2021-05-13 Text processing method, device, equipment and storage medium Pending CN113139033A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110524156.XA CN113139033A (en) 2021-05-13 2021-05-13 Text processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113139033A true CN113139033A (en) 2021-07-20

Family

ID=76817724

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110524156.XA Pending CN113139033A (en) 2021-05-13 2021-05-13 Text processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113139033A (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120109637A1 (en) * 2010-11-01 2012-05-03 Yahoo! Inc. Extracting rich temporal context for business entities and events
CN104298662A (en) * 2014-04-29 2015-01-21 中国专利信息中心 Machine translation method and translation system based on organism named entities
CN104679867A (en) * 2015-03-05 2015-06-03 深圳市华傲数据技术有限公司 Address knowledge processing method and device based on graphs
CN108399150A (en) * 2018-02-07 2018-08-14 深圳壹账通智能科技有限公司 Text handling method, device, computer equipment and storage medium
CN110489416A (en) * 2019-07-23 2019-11-22 中国平安财产保险股份有限公司 A kind of information storage means and relevant device based on data processing
CN110489415A (en) * 2019-07-23 2019-11-22 平安科技(深圳)有限公司 A kind of data-updating method and relevant device
CN111078878A (en) * 2019-12-06 2020-04-28 北京百度网讯科技有限公司 Text processing method, device and equipment and computer readable storage medium
CN111309872A (en) * 2020-03-26 2020-06-19 北京百度网讯科技有限公司 Search processing method, device and equipment
CN111813948A (en) * 2019-04-11 2020-10-23 阿里巴巴集团控股有限公司 Information processing method and device and electronic equipment
CN112256821A (en) * 2020-09-23 2021-01-22 北京捷通华声科技股份有限公司 Method, device, equipment and storage medium for complementing Chinese address
CN112464667A (en) * 2020-11-18 2021-03-09 北京华彬立成科技有限公司 Text entity identification method and device, electronic equipment and storage medium
CN112613322A (en) * 2020-12-17 2021-04-06 平安科技(深圳)有限公司 Text processing method, device, equipment and storage medium
CN112749561A (en) * 2020-04-17 2021-05-04 腾讯科技(深圳)有限公司 Entity identification method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
牛宁昌: "Tree-structured storage of a large number of file name records", pages 1 - 5, Retrieved from the Internet <URL:https://cloud.tencent.com/developer/article/1450312?shareByChannel=link> *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116257622A (en) * 2023-05-16 2023-06-13 之江实验室 Label rendering method and device, storage medium and electronic equipment
CN116257622B (en) * 2023-05-16 2023-07-11 之江实验室 Label rendering method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN111813963B (en) Knowledge graph construction method and device, electronic equipment and storage medium
US20190243848A1 (en) Generating a structured document guiding view
CN113886584A (en) Information detection method, device and equipment for application program
CN111831629B (en) Data processing method and device
CN111475612A (en) Construction method, device and equipment of early warning event map and storage medium
CN104106066A (en) System to view and manipulate artifacts at temporal reference point
CN104115145A (en) Generating visualizations of display group of tags representing content instances in objects satisfying search criteria
CN114020256A (en) Front-end page generation method, device and equipment and readable storage medium
JP2014199569A (en) Source program analysis system, source program analysis method, and program
CN108710636B (en) Medical record screening method, terminal equipment and computer readable storage medium
CN115080406A (en) Code log generation method, device, equipment and storage medium
CN112732949A (en) Service data labeling method and device, computer equipment and storage medium
CN113283216A (en) Webpage content display method, device, equipment and storage medium
CN109325217B (en) File conversion method, system, device and computer readable storage medium
CN113434542B (en) Data relationship identification method and device, electronic equipment and storage medium
CN113139033A (en) Text processing method, device, equipment and storage medium
CN111859862A (en) Text data labeling method and device, storage medium and electronic device
US8903754B2 (en) Programmatically identifying branding within assets
CN112328246A (en) Page component generation method and device, computer equipment and storage medium
CN116757812A (en) Method, device, electronic equipment and storage medium for detecting abnormal data
CN113687827B (en) Data list generation method, device and equipment based on widget and storage medium
CN115482075A (en) Financial data anomaly analysis method and device, electronic equipment and storage medium
JP2010170287A (en) Data extraction system
JP2016057715A (en) Graphic type program analyzer
JP7034426B2 (en) Character string list extraction management software in figure

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination