CN117034864A - Visual labeling method, visual labeling device, computer equipment and storage medium

Info

Publication number: CN117034864A
Application number: CN202311154476.6A
Authority: CN
Inventors: 禹健
Original assignee: Guangzhou Xingu Electronic Technology Co ltd
Current assignee: Guangzhou Xingu Electronic Technology Co ltd
Priority date: 2023-09-07
Filing date: 2023-09-07
Publication date: 2023-11-10
Anticipated expiration: 2043-09-07
Also published as: CN117034864B

Abstract

The application relates to the technical field of text labeling, and in particular to a visual labeling method, a visual labeling device, computer equipment and a storage medium. The visual labeling method comprises the following steps: performing marking pretreatment on a text data set, automatically extracting basic information, and highlighting the extracted information according to its type to obtain pretreatment elements; outputting the text data set subjected to the marking pretreatment to a text display area, and supplementing unlabeled element information by manual labeling with a marking tool in the text display area; obtaining a final labeling result according to the element information supplementing result, and storing the final labeling result into a database to serve as sample data; and inputting the sample data into a machine training model, obtaining a machine labeling result according to the training result, and obtaining a labeling comparison result by comparing the machine labeling result with the manual labeling result.

Description

Visual labeling method, visual labeling device, computer equipment and storage medium

Technical Field

The present application relates to the field of text labeling, and in particular, to a visual labeling method, a visual labeling device, a computer device, and a storage medium.

Background

At present, as big data is adopted and built out across industries, the work of business departments has gradually become electronic and data-driven. However, this transition also generates large amounts of unstructured text data, including office documents, informational clues, case summaries, briefs, transcript files, and the like. Such data is typically scattered across folders or systems in the form of documents. Although the documents can be consulted and analyzed by filtering, querying and similar means, the information in them can only be interpreted by reading each document manually from beginning to end, which depends heavily on the personal ability of police officers, is inefficient, and is prone to error. More importantly, the same content mentioned in different documents, such as persons, organizations and events, cannot be correlated for overall analysis. This situation wastes police manpower, makes it easy to overlook key information, complicates data analysis, and leaves a large amount of valuable data underused.

Existing text labeling methods can quickly identify and label entities. However, data analysis in this field mainly revolves around people, events and objects, and the entity recognition schemes of the existing methods are not fully adapted to the service processing requirements of professional departments. They also lack a clean and attractive visual operation interface, so the operation is complex and unsuitable for the daily work of professional departments that have heavy workloads and high efficiency requirements.

Disclosure of Invention

In order to improve the efficiency of analysis of clue files by related departments, the application provides a visual labeling method, a visual labeling device, computer equipment and a storage medium.

The first object of the present application is achieved by the following technical solutions:

a visual annotation method, the visual annotation method comprising:

acquiring a text data set to be marked, and performing format preprocessing on the text data set to obtain a text data set with uniform format;

marking and preprocessing the text data set with uniform format, automatically extracting basic information, and highlighting and marking the extracted information according to the corresponding type to obtain a preprocessed element;

outputting the text data set subjected to the marking pretreatment to a text display area, and supplementing unlabeled element information by using a marking tool in the text display area through manual marking;

obtaining a final labeling result according to the element information supplementing result, and storing the final labeling result into a database to serve as sample data;

inputting the sample data into a machine training model, obtaining a machine labeling result according to a training result, and obtaining a labeling comparison result by comparing the machine labeling result with a manual labeling result;

and iterating and updating the labeling pretreatment model according to the updated sample data and the labeling comparison result in the database.

By adopting the technical scheme, the text data set to be marked is first subjected to format processing so that its text format is unified; the marking pretreatment and the machine training therefore work on the same text format, which makes it possible to optimize the marking pretreatment through the machine training model. The marking pretreatment automatically extracts and marks the basic information in the text data set, the text data set after marking pretreatment is output to a text display area, and unlabeled element information is supplemented by manual labeling with a marking tool in the text display area; operation is thus visual, manual labeling becomes simpler, the labeling result is clear at a glance, and the document processing efficiency of a professional department is improved. Storing the final labeling result into a database facilitates subsequent analysis of the labeling data and allows business personnel to search and review the labeled text data set at any time. The final labeling result is input as sample data into a machine training model to obtain a machine labeling result, a labeling comparison result is obtained by comparing the machine labeling result with the manual labeling result, and the labeling pretreatment model is iterated and updated according to the updated sample data in the database and the labeling comparison result; the text data set is thereby labeled and checked in combination with the machine training model, and the differences between the manually labeled information and the machine labeling result are compared so as to improve the accuracy and consistency of the labeling pretreatment. By automatically extracting basic information from the text data, manually supplementing labeling information, and storing the final labeling result, unstructured text is converted into structured data, which provides a foundation for subsequent services such as document classification, cluster screening, map analysis and the like.

The present application may be further configured in a preferred example to: the method comprises the steps of marking the text data set with uniform format, automatically extracting basic information, and highlighting the extracted information according to the corresponding type to obtain a preprocessed element, wherein the method specifically comprises the following steps:

acquiring a feature word stock, and automatically extracting basic information including names, telephone numbers, certificate numbers, addresses and ages by performing feature matching on the text data sets with unified formats through the feature word stock;

marking the extracted basic information according to the corresponding type, and manually screening the required marking information according to the marking result;

and obtaining the final pretreatment element according to the manual screening result.

By adopting the technical scheme, the basic information in the text data set, namely names, telephone numbers, certificate numbers, addresses and ages, is automatically extracted, which better matches the way data analysis work revolves around people, events and objects, facilitates subsequent summarization, and allows the main information in the text to be quickly understood and consulted. Business personnel can select among the basic information automatically extracted by the marking pretreatment, keeping the information that is needed and deleting information that is unnecessary or wrongly extracted; their screening results can be saved, so that the accuracy of the automatic extraction in the marking pretreatment can be improved.

The present application may be further configured in a preferred example to: the method comprises the steps of obtaining a feature word stock, automatically extracting basic information by feature matching of the text data set with unified format through the feature word stock, wherein the basic information comprises names, telephone numbers, certificate numbers, addresses and ages, and specifically comprises the following steps:

acquiring a surname word stock from the characteristic word stock, performing surname matching in the text data set with uniform format according to the surname word stock, recognizing corresponding words according to the semantics of the surname matching result, and combining the words with the surname matching result to obtain name information;

acquiring a telephone number feature word stock from the feature word stock, carrying out telephone number matching in the text data set with unified format according to the telephone number feature word stock, and obtaining telephone number information according to the telephone number matching result;

obtaining a certificate number feature word stock from the feature word stock, carrying out certificate number matching in the text data set with uniform format according to the certificate number feature word stock, and obtaining certificate number information according to the certificate number matching result;

acquiring an address feature word stock from the feature word stock, performing address matching in the text data set with unified format according to the address feature word stock, and obtaining address information according to the address matching result;

and obtaining an age feature word stock from the feature word stock, performing age matching in the text data set with uniform format according to the age feature word stock, and obtaining age information according to the age matching result.

By adopting the technical scheme, the features of each kind of basic information are acquired from the feature word stock and combined with natural language processing for matching in the text, so as to obtain names, telephone numbers, certificate numbers, addresses and ages. Combining these features with natural language processing allows entity information in the text to be identified more rapidly and the key information about people, events and objects that a professional department needs for its work to be extracted more accurately, which facilitates subsequent summarization and allows the main information in the text to be quickly understood and consulted.

The present application may be further configured in a preferred example to: outputting the text data set subjected to the marking pretreatment to a text display area, and supplementing unlabeled element information by using a marking tool in the text display area through manual marking, wherein the method specifically comprises the following steps of:

selecting a preprocessing element in the text data set as an entity through an entity labeling tool, selecting a corresponding entity type according to the entity, and labeling different entity types by adopting different background colors;

labeling the entity with attributes through an attribute labeling tool according to the entity labeling result;

after the entity attributes are marked, the relation among the entities is marked through a relation marking tool connecting line;

acquiring preset tags in a preset tag set, and associating the selected phrases or sentences with the corresponding preset tags through a tag marking tool;

and finishing the supplementary annotation of the unlabeled element information in the text data set through the entity annotation tool, the attribute annotation tool, the relation annotation tool and the label annotation tool.

By adopting the technical scheme, the entity labeling tool is used for supplementing the unlabeled element information in the text data set, different background colors are adopted for labeling different entity types, attribute information is added to the labeled entity through the attribute labeling tool, the relationship between the entity and the entity is labeled through the connection of the relationship labeling tool, the phrase or sentence in the text is matched with the pre-defined label through the label labeling tool, and the labeling result is presented in a visual mode, such as different background colors and connection lines, so that the main content in the unstructured text is clear at a glance, the comprehensibility of the data is further improved, the visual analysis efficiency is further improved, the business personnel calling the text data next can quickly and accurately grasp the important content of the text information, and powerful support is provided for further data analysis and decision.

The present application may be further configured in a preferred example to: obtaining a final labeling result according to the element information supplementing result, and storing the final labeling result into a database to serve as sample data, wherein the method specifically comprises the following steps:

acquiring corresponding entity, attribute, relation and label data according to the final labeling result, and classifying and storing the final labeling result according to the entity, attribute, relation and label data;

receiving a data searching instruction, acquiring the corresponding entity, attribute, relation and label data in the database and the corresponding marked text data set according to the data searching instruction, and generating a searching result.

By adopting the technical scheme, the marked text data is stored in a classified manner according to the marked entities, attributes, relations and labels, so that when business personnel search for and retrieve text data they can also be led to the people, events and objects mentioned in other documents. This broadens the direction of analysis and the available information, gives a more comprehensive picture of each aspect of the target, and speeds up work such as case detection and intelligence analysis.

The second object of the present application is achieved by the following technical solutions:

A visual annotation device, the visual annotation device comprising:

the text format adjustment module is used for acquiring a text data set to be marked, and carrying out format preprocessing on the text data set to obtain a text data set with uniform format;

the information extraction module is used for carrying out marking pretreatment on the text data set with the uniform format, automatically extracting basic information, and carrying out highlighting marking on the extracted information according to the corresponding type to obtain pretreatment elements;

the marking supplementing module is used for outputting the text data set subjected to marking pretreatment to a text display area, and supplementing unlabeled element information by using a marking tool through manual marking in the text display area;

the data storage module is used for obtaining a final labeling result according to the element information supplementing result and storing the final labeling result into a database to serve as sample data;

the model training module is used for inputting the sample data into a machine training model, obtaining a machine labeling result according to a training result, and obtaining a labeling comparison result by comparing the machine labeling result with a manual labeling result;

and the model iteration module is used for iterating and updating the labeling pretreatment model according to the updated sample data and the labeling comparison result in the database.

The third object of the present application is achieved by the following technical solutions:

a computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the visual annotation method described above when the computer program is executed by the processor.

The fourth object of the present application is achieved by the following technical solutions:

a computer readable storage medium storing a computer program which, when executed by a processor, performs the steps of the visual annotation method described above.

In summary, the present application includes at least one of the following beneficial technical effects:

The final labeling result is input as sample data into a machine training model to obtain a machine labeling result; the machine labeling result is compared with the manual labeling result to obtain a labeling comparison result; and the labeling pretreatment model is iterated and updated according to the updated sample data in the database and the labeling comparison result. The text data set is thus labeled and verified in combination with the machine training model, and the differences between the manual labeling information and the machine labeling result are compared to improve the accuracy and consistency of the labeling pretreatment. By automatically extracting basic information from the text data, manually supplementing labeling information, and storing the final labeling result, unstructured text is converted into structured data, which provides a foundation for subsequent services such as document classification, cluster screening, map analysis and the like.

Drawings

FIG. 1 is a flow chart of a visual annotation method according to an embodiment of the application;

FIG. 2 is a schematic diagram showing the results of a text dataset after extracting basic information in visual annotation according to an embodiment of the present application;

FIG. 3 is a flowchart showing the implementation of method step S20 in visual annotation according to one embodiment of the application;

FIG. 4 is a flowchart showing the implementation of method step S21 in visual annotation according to one embodiment of the application;

FIG. 5 is a flowchart showing the implementation of method step S30 in visual annotation according to one embodiment of the application;

FIG. 6 is a flowchart showing the implementation of method step S40 in visual annotation according to one embodiment of the application;

FIG. 7 is a schematic block diagram of a visual annotation device according to one embodiment of the application;

FIG. 8 is a schematic diagram of an apparatus in an embodiment of the application.

Detailed Description

The present application will be described in further detail with reference to the accompanying drawings.

In an embodiment, as shown in fig. 1 and fig. 2, the application discloses a visual labeling method, which specifically comprises the following steps:

S10: Acquiring a text data set to be marked, and performing format preprocessing on the text data set to obtain a text data set with uniform format.

In this embodiment, the text data set to be annotated refers to the initial clue text collected.

Specifically, the relevant clue text is collected either through uploads by the user or by connecting to a relevant database, and is then used as the text data set to be annotated; for example, the text data set may be compiled from intercepted conversations, clue reports and the like.

Further, after the text data set is obtained, its data is subjected to segmentation processing, format unification and the like, yielding a text data set with uniform format.
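The application does not specify how the segmentation and format unification are carried out. As a minimal sketch only, assuming that "uniform format" means normalized character widths, unified line endings and paragraph-level segmentation, the step could look like the following; the function name `normalize_document` is hypothetical.

```python
import re
import unicodedata


def normalize_document(raw: str) -> list[str]:
    """Return the cleaned paragraphs of one raw document (illustrative only)."""
    # NFKC folds full-width digits and letters into their ASCII forms,
    # which keeps later pattern matching simple.
    text = unicodedata.normalize("NFKC", raw)
    # Unify line endings before splitting into paragraphs.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        # Collapse runs of whitespace inside a paragraph.
        cleaned = re.sub(r"\s+", " ", block).strip()
        if cleaned:
            paragraphs.append(cleaned)
    return paragraphs
```

Normalizing before extraction means that the later matching rules only need to handle one representation of digits and whitespace.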

S20: Performing marking pretreatment on the text data set with uniform format, automatically extracting basic information, and highlighting the extracted information according to its type to obtain preprocessing elements.

Specifically, a preset algorithm for preprocessing the text data set is obtained, and basic information is extracted from the text data set according to that algorithm. To set up the algorithm, the types of data to be extracted, such as person names, telephone numbers and certificate information, are first confirmed; a corresponding information extraction rule is then set according to the character features of each type of data, which yields the preset algorithm used to extract the basic information from the text data set.

Further, as shown in FIG. 2, when the basic information is extracted from the text data set, the basic information of each type is labeled according to the data type defined in the algorithm, for example by highlighting it with a background color specific to that type, so as to obtain the preprocessing elements.
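One way to picture the type-specific highlighting is to render the extracted spans as HTML. The sketch below is an assumption about presentation, not the interface of the application: the `COLOURS` table, the `(start, end, kind)` span format and the use of `<mark>` elements are all hypothetical.

```python
import html

# Hypothetical type-to-colour table; the application does not fix specific colours.
COLOURS = {"name": "#ffd6d6", "phone": "#ffe08a", "age": "#b5e3ff"}


def highlight(text, spans):
    """Render text as HTML, wrapping each (start, end, kind) span in a coloured <mark>."""
    out, pos = [], 0
    for start, end, kind in sorted(spans):
        if start < pos:  # ignore a span that overlaps one already rendered
            continue
        out.append(html.escape(text[pos:start]))
        out.append(f'<mark style="background:{COLOURS[kind]}">'
                   f"{html.escape(text[start:end])}</mark>")
        pos = end
    out.append(html.escape(text[pos:]))
    return "".join(out)
```

For example, `highlight("Li, 34, tel 13812345678", [(0, 2, "name"), (4, 6, "age"), (12, 23, "phone")])` wraps the three spans in differently coloured marks and leaves the rest of the text untouched.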

S30: Outputting the text data set subjected to marking pretreatment to a text display area, and supplementing unlabeled element information by manual labeling with a marking tool in the text display area.

Specifically, before model training, the text data set is automatically labeled according to preset rules, with labels applied according to the matching results. However, the semantic associations between the preprocessing elements obtained by matching can only be learned through artificial intelligence training. The text data set labeled with the preprocessing elements is therefore displayed in a preset text display area, where it is presented to the relevant personnel.

Further, when the relevant personnel supplement the elements that are not yet marked in the text data set labeled with the preprocessing elements, they use the corresponding marking tools to add labels according to the relations among the different semantics, thereby obtaining the final labeling result corresponding to the text data set.

S40: Obtaining a final labeling result according to the element information supplementing result, and storing the final labeling result into a database to serve as sample data.

Specifically, after the final labeling results of the text data set to be labeled are obtained, storing all the final labeling results into a preset database, and further obtaining sample data for training.
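The application does not name a database or a schema. Purely as an illustration, the sketch below stores annotations in SQLite, classified by the four categories the application mentions (entity, attribute, relation, label), and looks up the documents that share a given value. The table layout and the names `open_store`, `save` and `find_documents` are assumptions.

```python
import sqlite3

# Hypothetical schema: one row per annotation, tagged with its category.
SCHEMA = """
CREATE TABLE annotation (
    id         INTEGER PRIMARY KEY,
    doc_id     TEXT NOT NULL,
    kind       TEXT NOT NULL
               CHECK (kind IN ('entity', 'attribute', 'relation', 'label')),
    value      TEXT NOT NULL,
    span_start INTEGER,
    span_end   INTEGER
);
CREATE INDEX idx_annotation_value ON annotation (kind, value);
"""


def open_store(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def save(conn, doc_id, annotations):
    """Store (kind, value, span_start, span_end) tuples for one document."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO annotation (doc_id, kind, value, span_start, span_end) "
            "VALUES (?, ?, ?, ?, ?)",
            [(doc_id, k, v, s, e) for k, v, s, e in annotations],
        )


def find_documents(conn, kind, value):
    """Return the ids of all documents that carry the given annotation."""
    rows = conn.execute(
        "SELECT DISTINCT doc_id FROM annotation "
        "WHERE kind = ? AND value = ? ORDER BY doc_id",
        (kind, value),
    )
    return [row[0] for row in rows]
```

Indexing on the category and value is what lets a search for one person or label return every document that mentions it.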

S50: Inputting the sample data into a machine training model, obtaining a machine labeling result according to the training result, and obtaining a labeling comparison result by comparing the machine labeling result with the manual labeling result.

Specifically, the sample data is input into an initial model, namely the machine training model, for training, so as to obtain the corresponding training result, namely a trained labeling pretreatment model.

Further, the text data set to be marked is input into the labeling pretreatment model for recognition and labeling; after a machine labeling result is obtained, the data in the machine labeling result is compared with the manual labeling results in the sample data to obtain the corresponding labeling comparison result.
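The application does not define the form of the labeling comparison result. One plausible reading, offered here only as a sketch, treats each label as a `(start, end, type)` tuple and reports which labels agree, which the machine missed, and which it added, together with precision and recall. The function name and the choice of metrics are assumptions.

```python
def compare_labels(machine, manual):
    """Compare two collections of (start, end, type) labels for one document."""
    machine, manual = set(machine), set(manual)
    agreed = machine & manual
    return {
        "agreed": sorted(agreed),
        "missed_by_machine": sorted(manual - machine),
        "extra_from_machine": sorted(machine - manual),
        # Share of machine labels that a human also made.
        "precision": len(agreed) / len(machine) if machine else 0.0,
        # Share of human labels that the machine reproduced.
        "recall": len(agreed) / len(manual) if manual else 0.0,
    }
```

The lists of missed and extra labels are the items a reviewer would inspect, and they are also natural candidates for the next round of training data.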

S60: Iterating and updating the labeling pretreatment model according to the updated sample data and the labeling comparison result in the database.

Specifically, the labeling comparison result and the sample data are input into the labeling pretreatment model again for training, so that iteration and updating are carried out on the labeling pretreatment model, and the accuracy and consistency of labeling are improved continuously.

In one embodiment, as shown in fig. 3, in step S20, the marking pretreatment is performed on the text data set with uniform format to automatically extract basic information, and the extracted information is marked with a highlighting mark according to a corresponding type to obtain a pretreated element, which specifically includes:

S21: Acquiring a feature word library, and automatically extracting basic information, including names, telephone numbers, certificate numbers, addresses and ages, by performing feature matching on the text data set with uniform format through the feature word library.

In this embodiment, the feature word library refers to a database storing matching rules corresponding to each type of basic information.

Specifically, after a text data set with uniform format is obtained, a matching query is carried out in the text data set through the feature word library. The matching is carried out separately for each type of basic information in the feature word library, such as name, telephone number, certificate number, address, age and other keywords that need to be matched, and the basic information is thereby obtained.

S22: Marking the extracted basic information according to its type, and manually screening the required marking information according to the marking result.

Specifically, after the basic information is extracted, it is classified by the type of matching rule that produced it, that is, basic information obtained through the same matching rule is grouped into one class, and each item is then automatically labeled according to its class.

Further, the labeling result of the automatic labeling is displayed for related personnel to carry out manual screening.

S23: Obtaining the final preprocessing elements according to the result of the manual screening.

Specifically, during manual screening, incorrect labels are deleted and missing labels are added, after which the preprocessing elements are obtained.

In one embodiment, as shown in fig. 4, in step S21, a feature word library is obtained, and basic information including a name, a phone number, a certificate number, an address and an age is automatically extracted by performing feature matching on a text data set with uniform format through the feature word library, which specifically includes:

S211: Acquiring a surname word stock from the feature word stock, performing surname matching in the text data set with uniform format according to the surname word stock, recognizing the corresponding words according to the semantics around the surname matching result, and combining those words with the surname matching result to obtain name information.

Specifically, when the feature word stock for extracting names is constructed, a surname word stock is built in advance, and the text data set with uniform format is searched for the corresponding Chinese characters. Once a surname is matched, the next one or more words are assumed to be part of the given name and are recognized in combination with the context, and the name information is obtained by combining the matched surname with the recognition result. For example, after a surname character is identified, that character together with the next word is taken as the name to be identified and is semantically recognized in context. If the semantic recognition fails, that is, the name to be identified is incomplete or is not a person's name, the following word is appended to form a new name to be identified and semantic recognition is performed again. If this recognition succeeds, the new name to be identified is taken as the corresponding name information; if it fails, the phrase is judged not to be name information.
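The grow-and-retry procedure for names can be sketched as follows. The semantic recognition step is not specified in the application, so it is represented here by a caller-supplied function `looks_like_name(candidate, context)`, which is a hypothetical stand-in; the four-entry `SURNAMES` set is likewise only an illustrative lexicon, and each character is treated as one "word".

```python
from typing import Callable

SURNAMES = {"张", "王", "李", "欧阳"}  # tiny illustrative lexicon


def find_names(text: str,
               looks_like_name: Callable[[str, str], bool],
               max_given: int = 2) -> list[tuple[int, int, str]]:
    """Return (start, end, name) for each surname match that passes recognition."""
    results = []
    i = 0
    while i < len(text):
        # Prefer the longest surname that matches at this position.
        surname = next((s for s in sorted(SURNAMES, key=len, reverse=True)
                        if text.startswith(s, i)), None)
        if surname is None:
            i += 1
            continue
        matched = None
        # Try surname + 1 character first, then surname + 2 characters.
        for n in range(1, max_given + 1):
            end = i + len(surname) + n
            if end > len(text):
                break
            candidate = text[i:end]
            if looks_like_name(candidate, text):
                matched = (i, end, candidate)
                break
        if matched:
            results.append(matched)
            i = matched[1]
        else:
            i += 1  # not a name; continue scanning
    return results
```

With a recognizer that accepts only "张三" and "李小明", the text "嫌疑人张三与李小明联系" yields both names: the second one is found on the retry, after "李小" is rejected.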

S212: Acquiring a telephone number feature word stock from the feature word stock, carrying out telephone number matching in the text data set with uniform format according to the telephone number feature word stock, and obtaining telephone number information according to the telephone number matching result.

Specifically, common fixed-line telephone number formats and mobile phone number formats are set as the telephone number feature word stock; when consecutive digits appear in the text data set, the telephone number feature word stock is applied and the telephone number information is obtained through regular expression matching.
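The exact patterns in the telephone number feature word stock are not given. As a hedged sketch, the two patterns below assume mainland Chinese conventions: an 11-digit mobile number beginning with 1 and a second digit from 3 to 9, and a fixed-line number made of an area code starting with 0 followed by 7 or 8 digits. Both patterns are assumptions for illustration.

```python
import re

MOBILE = r"1[3-9]\d{9}"          # assumed mobile format
LANDLINE = r"0\d{2,3}-?\d{7,8}"  # assumed area code + subscriber number

# The lookarounds keep the pattern from matching inside a longer digit run,
# such as an 18-digit certificate number.
PHONE_RE = re.compile(rf"(?<!\d)(?:{MOBILE}|{LANDLINE})(?!\d)")


def find_phone_numbers(text):
    return [m.group() for m in PHONE_RE.finditer(text)]
```

The digit-boundary checks matter in this setting because certificate numbers and telephone numbers often appear in the same sentence.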

S213: Acquiring a certificate number feature word stock from the feature word stock, carrying out certificate number matching in the text data set with uniform format according to the certificate number feature word stock, and obtaining certificate number information according to the certificate number matching result.

Specifically, the certificate number feature word stock is set based on the specific formats of identity card numbers, driving license numbers and other certificate numbers; matching inquiry is carried out in the text data set through the certificate number feature word stock, and the result of the matching inquiry is taken as the certificate number information.
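For the identity card case, a format rule could be combined with the check character that 18-digit resident identity numbers carry. The sketch below covers only that one certificate type, and the checksum step goes beyond what the application describes, which mentions format matching only. The weights and check characters follow the published rule for these numbers; the function names are hypothetical.

```python
import re

# 17 digits followed by a digit or X, not embedded in a longer token.
ID_RE = re.compile(r"(?<![0-9A-Za-z])\d{17}[\dXx](?![0-9A-Za-z])")
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CHARS = "10X98765432"


def has_valid_check_char(number):
    """Verify the 18th character against the weighted sum of the first 17 digits."""
    total = sum(int(d) * w for d, w in zip(number[:17], WEIGHTS))
    return CHECK_CHARS[total % 11] == number[17].upper()


def find_id_numbers(text):
    return [m.group() for m in ID_RE.finditer(text)
            if has_valid_check_char(m.group())]
```

Filtering on the check character removes most random 18-digit strings, which reduces the number of wrong labels a reviewer has to delete by hand.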

S214: Acquiring an address feature word stock from the feature word stock, performing address matching in the text data set with uniform format according to the address feature word stock, and obtaining address information according to the address matching result.

Specifically, keyword matching or grammar rules are used to extract address information, for example by matching words such as "province", "city" and "district" to locate the start and end positions of an address, and then extracting the address information between them.
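The keyword approach can be illustrated on English renderings of addresses. This is a simplified sketch under stated assumptions: place names are single capitalized words, and the keywords are the literal words "Province", "City" and "District"; the original text would use the corresponding Chinese administrative suffixes, for which the pattern would have to be adapted.

```python
import re

# Optional province, required city, optional district, in that order.
ADDRESS_RE = re.compile(
    r"(?:[A-Z][a-z]+ Province,? )?"
    r"[A-Z][a-z]+ City"
    r"(?:,? [A-Z][a-z]+ District)?"
)


def find_addresses(text):
    return [m.group() for m in ADDRESS_RE.finditer(text)]
```

The keywords anchor both ends of the match: the first administrative unit marks where the address begins and the last keyword found marks where it ends.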

S215: Acquiring an age feature word stock from the feature word stock, performing age matching in the text data set with uniform format according to the age feature word stock, and obtaining age information according to the age matching result.

Specifically, an age usually appears together with a specific keyword such as "age" or "years old", so the number or numeric phrase next to the keyword is taken as the age; in addition, the text in the text data set may give a person's year of birth, in which case the age is calculated by subtracting the birth year from the current year.
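Both routes to an age, the keyword next to a number and the birth year, can be sketched with two patterns. The English keywords ("age", "aged", "years old", "born in") are stand-ins for whatever the age feature word stock actually contains, and the function name is hypothetical.

```python
import re
from datetime import date

# A number after "age"/"aged"/"age:" or before "years old".
AGE_RE = re.compile(
    r"\b(?:age[d:]?\s*(\d{1,3})|(\d{1,3})\s*years?\s+old)\b", re.IGNORECASE)
# A four-digit birth year after "born in".
BIRTH_RE = re.compile(r"\bborn\s+in\s+((?:19|20)\d{2})\b", re.IGNORECASE)


def find_ages(text, current_year=None):
    """Return ages stated directly, followed by ages derived from birth years."""
    if current_year is None:
        current_year = date.today().year
    ages = [int(m.group(1) or m.group(2)) for m in AGE_RE.finditer(text)]
    ages += [current_year - int(m.group(1)) for m in BIRTH_RE.finditer(text)]
    return ages
```

Passing `current_year` explicitly keeps the calculation reproducible, for example when a document is re-processed in a later year.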

In one embodiment, as shown in fig. 5, in step S30, a text data set after the labeling preprocessing is output to a text display area, and unlabeled element information is supplemented by manually labeling in the text display area by using a labeling tool, which specifically includes:

S31: Selecting the preprocessing elements in the text data set as entities through the entity labeling tool, selecting the corresponding entity type for each entity, and labeling different entity types with different background colors.

Specifically, an entity labeling tool is provided with which the user selects content in the text as an entity by dragging the mouse; once the selection is complete, a list of all selectable entity types is displayed for the user to choose from, and entities of different types are distinguished by different background colors.

S32: Labeling the entities with attributes through an attribute labeling tool according to the entity labeling result.

Specifically, an attribute labeling tool is provided so that attributes can be labeled for the element information obtained through screening: the user selects the attribute text with the mouse and drags it onto the corresponding entity to assign the attribute.

S33: After the entity attributes are labeled, label the relationships among the entities by drawing connecting lines with the relationship labeling tool.

Specifically, a relationship labeling tool is provided for labeling the relationships between entities, and a relationship is represented by a solid line connecting the entities. During labeling, the relationship options that correspond to the entity types are displayed to the user for selection, so that the user can label a relationship by clicking the entity from which the relationship starts and then, while holding down the mouse button, dragging to the target entity.
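The idea of offering relationship options according to the types of the two entities can be sketched as a lookup keyed by the source and target entity types. The option table, the fallback option and the function names below are hypothetical and serve only to illustrate the step.

```python
# Hypothetical mapping from (source type, target type) to selectable relations.
RELATION_OPTIONS = {
    ("person", "organization"): ["participating", "organizing"],
    ("person", "event"): ["participating", "organizing"],
    ("organization", "event"): ["organizing", "associating"],
}


def relation_options(source_type, target_type):
    """Return the relations the user may choose for this pair of entity types."""
    return RELATION_OPTIONS.get((source_type, target_type), ["associating"])


def label_relation(source, target, relation):
    """Record a relation drawn from a source entity to a target entity."""
    allowed = relation_options(source["type"], target["type"])
    if relation not in allowed:
        raise ValueError(f"{relation!r} is not offered for this entity pair")
    return {"source": source["id"], "target": target["id"], "relation": relation}
```

Restricting the choice to the options for the given type pair is one way to keep manually drawn relations consistent across annotators.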

S34: Obtain the preset labels in a preset label set, and associate selected phrases or sentences with the corresponding preset labels through the label labeling tool.

Specifically, labels of the corresponding types are preset so that phrases or sentences in the text can be matched with the predefined labels and presented to the client in the form of label cards. When using the tool, the user box-selects the corresponding text with the mouse, opens the label function in the menu, and selects a suitable label in the pop-up window to associate with the selection, which completes the labeling.

S35: Complete the supplementary labeling of the unlabeled element information in the text data set through the entity labeling tool, the attribute labeling tool, the relationship labeling tool and the label labeling tool.

Specifically, each element labeled by the user is obtained through the entity labeling tool, the associations between entity elements are obtained through the attribute labeling tool and the relationship labeling tool, and the type associated with the user-labeled elements is obtained through the label labeling tool, thereby completing the supplement of the unlabeled element information in the text data.
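One possible way to hold the output of the four labeling tools for a single text is sketched below. The class names, field names and color values are assumptions made for illustration; the application does not specify a data structure.

```python
from dataclasses import dataclass, field

# Hypothetical background colors per entity type.
ENTITY_COLORS = {"person": "#ffd54f", "organization": "#81d4fa", "event": "#a5d6a7"}


@dataclass
class Entity:
    id: int
    start: int
    end: int
    entity_type: str
    attributes: dict = field(default_factory=dict)

    @property
    def color(self):
        """Background color used to display this entity type."""
        return ENTITY_COLORS.get(self.entity_type, "#e0e0e0")


@dataclass
class Annotations:
    text: str
    entities: list = field(default_factory=list)
    relations: list = field(default_factory=list)  # (source_id, relation, target_id)
    labels: list = field(default_factory=list)     # (start, end, label)

    def add_entity(self, start, end, entity_type):
        entity = Entity(len(self.entities) + 1, start, end, entity_type)
        self.entities.append(entity)
        return entity

    def add_attribute(self, entity, name, start, end):
        # The attribute value is the text span the user dragged onto the entity.
        entity.attributes[name] = self.text[start:end]

    def add_relation(self, source, relation, target):
        self.relations.append((source.id, relation, target.id))

    def add_label(self, start, end, label):
        self.labels.append((start, end, label))
```

Storing character offsets rather than copied strings keeps every annotation tied to its position in the displayed text.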

According to the method provided by this embodiment, a user can label unlabeled elements through the following operations:

1. Entity element definition: In the entity element management module of the system management end, new entity elements are added to represent the specific content to be described, such as "personnel", "organizations", and "events". The defined content includes the entity name, type, display color, and the like.

2. Entity attribute definition: Related attributes are defined for each entity element. Attribute items, each including a name, description, type, and the like, are added for the entity on the entity management page, and suitable attribute types such as text, numerical value, and date are defined as required.

3. Entity relationship definition: Relationships between the entity elements are defined. Relationships between the current entity and other entities, such as "participating", "organizing", and "associating", can be added on the entity management page.

4. Labeling: Text sentences or segments in the text are selected and associated with tags from a preset tag set, which narrows the calculation range for subsequent machine learning. This specifically includes the following:

A. Tag set preset

In the tag management module, a set of tags is preset by adding new tags for selection during the labeling process. The tag definition contains information such as tag names, tag classifications, and the like.

The initial tags include:

Behavior tags, crowd tags, related event types, sensitive nodes, related regions, business feature tags, and the like.

B. Text to tag association

The user box-selects the corresponding text with the mouse, opens the tag function in the menu, and selects a suitable tag in the pop-up window to associate with the selection, which completes the labeling.
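The tag preset and text-to-tag association steps can be sketched as follows. The `Tag` and `TagSet` classes, the `associate` function and the sample tag names are hypothetical; only the tag name and tag classification fields come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    name: str
    classification: str


class TagSet:
    """A preset collection of tags that can be offered during labeling."""

    def __init__(self):
        self._tags = {}

    def add(self, name, classification):
        tag = Tag(name, classification)
        self._tags[name] = tag
        return tag

    def get(self, name):
        # Raises KeyError when the tag was never preset.
        return self._tags[name]


def associate(text, start, end, tag_set, tag_name):
    """Link the selected span text[start:end] to a preset tag."""
    if not 0 <= start < end <= len(text):
        raise ValueError("selection is outside the text")
    tag = tag_set.get(tag_name)
    return {
        "start": start,
        "end": end,
        "text": text[start:end],
        "tag": tag.name,
        "classification": tag.classification,
    }
```

Only tags that were preset can be associated, which mirrors the constraint that the user picks from the pop-up list instead of typing free text.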

In one embodiment, as shown in fig. 6, step S40, that is, obtaining a final labeling result according to the element information supplementing result and storing the final labeling result in a database as sample data, specifically includes:

S41: Acquire the corresponding entity, attribute, relationship and label data according to the final labeling result, and classify and store the final labeling result according to the entity, attribute, relationship and label data.

Specifically, after the final labeling result is obtained, the entities, the attributes, the semantic association relationships between entities, the corresponding labels, and other types of data are extracted from it, and the final labeling result is classified and stored by type to facilitate subsequent machine learning.

S42: Receive a data search instruction, acquire the corresponding entity, attribute, relationship and label data in the database, together with the corresponding labeled text data set, according to the data search instruction, and generate a search result.

Specifically, after a user-triggered data search instruction is obtained, the entity, attribute, relationship, label, or other type to be viewed is determined from the instruction, and the corresponding search result is then matched from the database.
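Steps S41 and S42 can be sketched with a single table in which each row carries the kind of annotation it belongs to. The schema, the table and column names, and the use of SQLite are assumptions chosen for a self-contained example; the application does not name a database or a schema.

```python
import sqlite3

# The four categories by which the final labeling result is classified.
KINDS = ("entity", "attribute", "relationship", "label")


def open_store(path=":memory:"):
    """Open (or create) the annotation store."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS annotation ("
        " doc_id TEXT, kind TEXT, value TEXT,"
        " start_pos INTEGER, end_pos INTEGER)"
    )
    return conn


def save(conn, doc_id, kind, value, start_pos, end_pos):
    """Store one annotation under its category (S41)."""
    if kind not in KINDS:
        raise ValueError(f"unknown annotation kind: {kind!r}")
    conn.execute(
        "INSERT INTO annotation VALUES (?, ?, ?, ?, ?)",
        (doc_id, kind, value, start_pos, end_pos),
    )


def search(conn, kind, value):
    """Return (doc_id, start_pos, end_pos) for matching annotations (S42)."""
    rows = conn.execute(
        "SELECT doc_id, start_pos, end_pos FROM annotation"
        " WHERE kind = ? AND value = ? ORDER BY doc_id, start_pos",
        (kind, value),
    )
    return rows.fetchall()
```

Because each result row includes the document identifier and offsets, the corresponding labeled text can be retrieved and highlighted alongside the matched annotation.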

It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and the sequence numbers should not limit the implementation process of the embodiments of the present application.

In an embodiment, a visual labeling device is provided, which corresponds one-to-one to the visual labeling method in the above embodiment. As shown in fig. 7, the visual labeling device comprises a text format adjustment module, an information extraction module, a labeling supplementing module, a data storage module, a model training module and a model iteration module. The functional modules are described in detail as follows:

the information extraction module is used for carrying out marking pretreatment on the text data set with uniform format, automatically extracting basic information, and carrying out highlighting marking on the extracted information according to the corresponding type to obtain pretreatment elements;

the marking supplementing module is used for outputting the text data set subjected to marking pretreatment to a text display area, and supplementing unlabeled element information by using a marking tool in the text display area through manual marking;

the data storage module is used for obtaining a final labeling result according to the element information supplementing result and storing the final labeling result into the database to serve as sample data;

The model training module is used for inputting sample data into a machine training model, obtaining a machine labeling result according to the training result, and obtaining a labeling comparison result by comparing the machine labeling result with a manual labeling result;
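The comparison between the machine labeling result and the manual labeling result could, for example, be expressed as set operations over labeled spans. This is only a sketch: the span representation `(start, end, type)`, the function name `compare_labels`, and the use of precision and recall as summary figures are assumptions, since the application does not state how the comparison result is computed.

```python
def compare_labels(manual, machine):
    """Compare two collections of (start, end, type) spans."""
    manual, machine = set(manual), set(machine)
    agreed = manual & machine
    return {
        "agreed": sorted(agreed),
        "missed_by_machine": sorted(manual - machine),
        "extra_from_machine": sorted(machine - manual),
        # Share of machine labels that match a manual label.
        "precision": len(agreed) / len(machine) if machine else 0.0,
        # Share of manual labels that the machine reproduced.
        "recall": len(agreed) / len(manual) if manual else 0.0,
    }
```

The lists of missed and extra spans indicate where the two labeling results disagree.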

Optionally, the information extraction module includes:

the feature extraction sub-module is used for obtaining a feature word stock, automatically extracting basic information including names, telephone numbers, certificate numbers, addresses and ages by performing feature matching on text data sets with uniform formats through the feature word stock;

the supplementary marking sub-module is used for marking the extracted basic information according to the corresponding type and manually screening the required marking information according to the marking result;

and the result screening sub-module is used for obtaining a final pretreatment element according to the result of manual screening.

Optionally, the feature extraction submodule includes:

the name extraction unit is used for acquiring a surname word stock from the characteristic word stock, carrying out surname matching in a text data set with uniform formats according to the surname word stock, recognizing corresponding words according to the semantics of the result of surname matching, and combining the words with the result of surname matching to obtain name information;

The telephone extraction unit is used for acquiring a telephone number feature word stock from the feature word stock, carrying out telephone number matching in a text data set with uniform format according to the telephone number feature word stock, and obtaining telephone number information according to a telephone number matching result;

the certificate number extraction unit is used for obtaining a certificate number feature word stock from the feature word stock, carrying out certificate number matching in a text data set with uniform format according to the certificate number feature word stock, and obtaining certificate number information according to a certificate number matching result;

the address extraction unit is used for acquiring an address feature word stock from the feature word stock, performing address matching in a text data set with uniform format according to the address feature word stock, and obtaining address information according to an address matching result;

the age extraction unit is used for obtaining an age feature word stock from the feature word stock, carrying out age matching in the text data set with uniform format according to the age feature word stock, and obtaining age information according to an age matching result.

Optionally, the labeling supplementing module includes:

the entity labeling sub-module is used for selecting preprocessing elements in the text data set as entities through the entity labeling tool, selecting corresponding entity types according to the entities, and labeling different entity types by adopting different background colors;

The attribute marking sub-module is used for marking attributes for the entities through the attribute marking tool according to the entity marking results;

the association relation marking sub-module is used for marking the relation between the entities through a relation marking tool after the entity attributes are marked;

the label marking sub-module is used for acquiring preset labels in a preset label set and associating the selected phrases or sentences with the corresponding preset labels through a label marking tool;

and the annotation supplementing sub-module is used for supplementing and labeling the unlabeled element information in the text data set through the entity labeling tool, the attribute labeling tool, the relation labeling tool and the label labeling tool.

Optionally, the data storage module includes:

the classification storage sub-module is used for acquiring corresponding entity, attribute, relation and label data according to the final labeling result, and classifying and storing the final labeling result according to the entity, attribute, relation and label data;

and the search sub-module is used for receiving a data search instruction, acquiring the corresponding entity, attribute, relationship and label data in the database and the corresponding labeled text data set according to the data search instruction, and generating a search result.

For specific limitations of the visual labeling device, reference may be made to the above limitations of the visual labeling method, which are not repeated here. The modules in the visual labeling device may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in or independent of a processor in the computer device in the form of hardware, or may be stored in a memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.

In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 8. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a visual annotation method.

In one embodiment, a computer device is provided comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:

Marking pretreatment is carried out on the text data set with uniform format, basic information is automatically extracted, and the extracted information is marked prominently according to the corresponding type to obtain pretreatment elements;

outputting the text data set subjected to marking pretreatment to a text display area, and supplementing unlabeled element information by using a marking tool in the text display area through manual marking;

inputting sample data into a machine training model, obtaining a machine labeling result according to the training result, and obtaining a labeling comparison result by comparing the machine labeling result with a manual labeling result;

In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:

Those skilled in the art will appreciate that all or part of the methods described above may be implemented by a computer program stored on a non-transitory computer readable storage medium, which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. The non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.

It will be apparent to those skilled in the art that, for convenience and brevity of description, the division of the functional units and modules described above is given only as an example. In practical applications, the above functions may be assigned to different functional units and modules as needed, that is, the internal structure of the device may be divided into different functional units or modules to perform all or part of the functions described above.

The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims

1. The visual labeling method is characterized by comprising the following steps of:

2. The visual labeling method according to claim 1, wherein the labeling preprocessing of the text data set with uniform format automatically extracts basic information, and the extracting information is highlighted according to a corresponding type to obtain a preprocessed element, and specifically comprises:

3. The visual labeling method according to claim 2, wherein the obtaining a feature word library, and performing feature matching on the text data set with unified format through the feature word library, automatically extracts basic information, where the basic information includes a name, a phone number, a certificate number, an address and an age, and specifically includes:

4. The visual labeling method according to claim 1, wherein the outputting the text data set after the labeling pretreatment to a text display area, and the manually labeling the text display area to supplement unlabeled element information by using a labeling tool, specifically comprises:

5. The method for visual annotation according to claim 1, wherein the obtaining a final annotation result according to the element information supplement result, and storing the final annotation result in a database as sample data, specifically includes:

6. A visual annotation device, comprising:

7. The visual annotation device of claim 6, wherein the information extraction module comprises:

the feature extraction sub-module is used for obtaining a feature word stock, and automatically extracting basic information including names, telephone numbers, certificate numbers, addresses and ages by performing feature matching on the text data sets with unified formats through the feature word stock;

and the result screening sub-module is used for obtaining the final pretreatment element according to the result of the manual screening.

8. The visual annotation device of claim 6, wherein the feature extraction submodule comprises:

the name extraction unit is used for acquiring a surname word stock from the characteristic word stock, carrying out surname matching in the text data set with unified formats according to the surname word stock, recognizing corresponding words according to the semantics of the result of surname matching, and combining the words with the result of surname matching to obtain name information;

the telephone extraction unit is used for acquiring a telephone number feature word stock from the feature word stock, carrying out telephone number matching in the text data set with unified format according to the telephone number feature word stock, and obtaining telephone number information according to the telephone number matching result;

The certificate number extraction unit is used for obtaining a certificate number feature word stock from the feature word stock, carrying out certificate number matching in the text data set with unified format according to the certificate number feature word stock, and obtaining certificate number information according to the certificate number matching result;

the address extraction unit is used for acquiring an address feature word stock from the feature word stock, performing address matching in the text data set with unified format according to the address feature word stock, and obtaining address information according to the address matching result;

the age extraction unit is used for obtaining an age feature word stock from the feature word stock, carrying out age matching in the text data set with unified format according to the age feature word stock, and obtaining age information according to the age matching result.

9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the visual annotation method according to any of claims 1 to 5 when the computer program is executed.

10. A computer-readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the steps of the visual annotation method according to any of claims 1 to 5.