CN115391519A - NLP technology-based enterprise automatic labeling model generation method, system, equipment and storage medium - Google Patents
- Publication number
- CN115391519A (application number CN202210859622.4A)
- Authority
- CN
- China
- Prior art keywords
- model
- enterprise
- data
- training
- data source
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/313—Selection or weighting of terms for indexing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
An enterprise automatic labeling model generation method, system, equipment and storage medium based on NLP (natural language processing) technology, belonging to the technical field of artificial intelligence. The invention addresses the problems that existing labeling relies on manual work and therefore suffers from low efficiency, low accuracy, high labor cost, and excessive influence of experts' subjective factors. The method comprises the following steps: S1, capturing Internet enterprise information to form a basic data source; S2, processing the basic data source accordingly, and extracting enterprise key information from the processed basic data source using NLP technology; S3, performing model training on the enterprise key information and the label data in combination with the enterprise's original label data; S4, based on the model training results, adjusting model parameters and changing input data, and iterating the model multiple times to generate a training model; and S5, supplementing model rules according to the actual situation to generate an automatic labeling model.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to an enterprise automatic labeling model generation method, system, equipment and storage medium based on NLP technology.
Background
At present, enterprises are generally classified and labeled by traditional manual selection, relying on the experience of business experts. This approach suffers from low efficiency, high labor cost, and a heavy dependence on experts' subjective judgment. Moreover, a growing number of enterprises now carry multiple labels, and manual selection is prone to omissions and misjudgments. The volume of enterprise data that needs to be labeled also keeps increasing, which makes traditional manual labeling increasingly difficult.
In summary, the existing labeling approach has the following defects: because it relies on manual work, its efficiency and accuracy are low, its labor cost is high, and the influence of experts' subjective factors is excessive.
Disclosure of Invention
The invention solves the problems of the existing labeling approach, which, because it relies on manual work, suffers from low efficiency, low accuracy, high labor cost, and excessive influence of experts' subjective factors.
The invention relates to an enterprise automatic labeling model generation method based on NLP technology, which comprises the following steps:
S1, capturing Internet enterprise information to form a basic data source;
S2, processing the basic data source accordingly, and extracting enterprise key information from the processed basic data source using NLP (natural language processing) technology;
S3, performing model training on the enterprise key information and the label data in combination with the enterprise's original label data;
S4, based on the model training results, adjusting model parameters and changing input data, and iterating the model multiple times to generate a training model;
and S5, supplementing model rules according to the actual situation to generate an automatic labeling model.
Further, in an embodiment of the present invention, in step S1, the Internet enterprise information is obtained through web crawler collection and from historical enterprise tag library data.
Further, in an embodiment of the present invention, in step S2, processing the basic data source includes the following steps:
step S201, cleaning the data in the basic data source and removing interference items from the data;
step S202, performing word segmentation on the cleaned data in the basic data source;
and step S203, managing and supplementing the professional vocabulary and the stop-word list according to the word segmentation result of step S202.
Further, in an embodiment of the present invention, in step S2, the weights of some of the professional vocabulary in the enterprise key information extracted from the processed basic data source using NLP technology are adjusted.
Further, in an embodiment of the present invention, in step S3, the XGBoost algorithm is used for the model training.
Further, in an embodiment of the present invention, in step S3, performing model training on the enterprise tag data includes the following steps:
step S301, taking the enterprise tag data as the result set, and extracting vectorized data of the enterprise tag data using NLP technology;
and step S302, splitting out a training set, a validation set and a cross-validation set in combination with the result set, and then training the model.
The invention also relates to an enterprise automatic labeling model generation system based on NLP technology, which comprises the following modules:
a capturing module, used for capturing Internet enterprise information to form a basic data source;
a processing module, used for processing the basic data source accordingly and extracting enterprise key information from the processed basic data source using NLP (natural language processing) technology;
a model module, used for performing model training on the enterprise key information and the label data in combination with the enterprise's original label data;
an iteration module, used for adjusting model parameters and changing input data based on the model training results, and iterating the model multiple times to generate a training model;
and a generating module, used for supplementing model rules according to the actual situation and generating the automatic labeling model.
The electronic equipment according to the invention comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
the memory is used for storing a computer program;
the processor is used for implementing the method steps of any of the above methods when executing the program stored in the memory.
A computer-readable storage medium according to the present invention stores a computer program which, when executed by a processor, carries out the method steps of any of the above-mentioned methods.
The invention solves the problems of the existing labeling approach, which, because it relies on manual work, suffers from low efficiency, low accuracy, high labor cost, and excessive influence of experts' subjective factors. The specific beneficial effects are as follows:
1. In the enterprise automatic labeling model generation method based on NLP technology, an enterprise basic information database is first formed by capturing enterprise basic information; key data are extracted through data cleaning and iterative word segmentation; and professional vocabulary weighting is introduced before Chinese text vectorization, so that the data model calculation is more accurate. In addition, the best-performing model calculation method is adopted, the data model is trained repeatedly in an iterative manner, and a business rule model is added at the end. The result is an enterprise automatic labeling service that meets business requirements more accurately, and that effectively addresses the low efficiency, low accuracy, high labor cost, and excessive influence of experts' subjective factors caused by manual labeling.
2. In the enterprise automatic labeling model generation method based on NLP technology, the enterprise basic information data are cleaned, interference items in the data are removed, and data fields that are not suitable for the model are deleted, which improves the accuracy of the data.
3. In the enterprise automatic labeling model generation method based on NLP technology, a rule model is established from business data and expert suggestions and is used to supplement the generated training model, so that the results output by the model meet the relevant business requirements.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flowchart of an enterprise automatic labeling model generation method based on NLP technology according to an embodiment.
FIG. 2 is a diagram of a basic data block according to an embodiment.
FIG. 3 is a flowchart of enterprise basic information data processing according to an embodiment.
Detailed Description
Various embodiments of the present invention will now be described more fully hereinafter with reference to the accompanying drawings. The embodiments described by referring to the drawings are exemplary and intended to be illustrative of the invention and are not to be construed as limiting the invention.
The method for generating an enterprise automatic labeling model based on NLP technology according to this embodiment comprises the following steps:
S1, capturing Internet enterprise information to form a basic data source;
S2, processing the basic data source accordingly, and extracting enterprise key information from the processed basic data source using NLP (natural language processing) technology;
S3, performing model training on the enterprise key information and the label data in combination with the enterprise's original label data;
S4, based on the model training results, adjusting model parameters and changing input data, and iterating the model multiple times to generate a training model;
and S5, supplementing model rules according to the actual situation to generate an automatic labeling model.
In this embodiment, in step S1, the Internet enterprise information is obtained through web crawler collection and from historical enterprise tag library data.
In this embodiment, in step S2, processing the basic data source includes the following steps:
step S201, cleaning the data in the basic data source and removing interference items from the data;
step S202, performing word segmentation on the cleaned data in the basic data source;
and step S203, managing and supplementing the professional vocabulary and the stop-word list according to the word segmentation result of step S202.
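Steps S201 to S203 can be illustrated with a minimal sketch. The patent does not name a cleaning rule set or a word segmenter, so everything here is an assumption made for illustration: the regular expressions, the sample vocabularies, and the use of forward maximum matching as a stand-in segmenter. The names `clean`, `segment`, `preprocess`, `PROFESSIONAL_VOCAB` and `STOP_WORDS` are hypothetical.

```python
import re

# Hypothetical vocabularies. In step S203 these would be reviewed and
# extended after each segmentation pass.
PROFESSIONAL_VOCAB = {"人工智能", "新能源汽车", "锂电池"}
STOP_WORDS = {"的", "和", "及", "有限公司"}


def clean(text):
    """Step S201 (assumed rules): drop HTML tags, URLs and punctuation."""
    text = re.sub(r"<[^>]+>", " ", text)        # HTML remnants from crawling
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    # Keep only CJK characters, ASCII letters and digits.
    text = re.sub(r"[^\u4e00-\u9fffA-Za-z0-9]+", " ", text)
    return text.strip()


def segment(text, vocab):
    """Step S202 (stand-in): forward maximum matching against a vocabulary."""
    max_len = max((len(word) for word in vocab), default=1)
    tokens = []
    for chunk in text.split():
        i = 0
        while i < len(chunk):
            if chunk[i].isascii():
                # Keep a run of ASCII letters/digits together as one token.
                j = i
                while j < len(chunk) and chunk[j].isascii():
                    j += 1
                tokens.append(chunk[i:j])
                i = j
                continue
            # Try the longest candidate first; fall back to one character.
            for size in range(min(max_len, len(chunk) - i), 0, -1):
                piece = chunk[i:i + size]
                if size == 1 or piece in vocab:
                    tokens.append(piece)
                    i += size
                    break
    return tokens


def preprocess(text):
    """Clean, segment, then filter stop words."""
    vocab = PROFESSIONAL_VOCAB | STOP_WORDS
    return [t for t in segment(clean(text), vocab) if t not in STOP_WORDS]
```

Single characters left over after matching are a hint that a term is missing from the vocabulary, which is one way the iterative supplementing in step S203 could be driven.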
In this embodiment, in step S2, the weights of some of the professional vocabulary in the enterprise key information extracted from the processed basic data source using NLP technology are adjusted.
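One way to picture the weight adjustment is as a per-term multiplier applied during vectorization. The sketch below assumes TF-IDF as the vectorization scheme, which the patent does not specify; the function name `tfidf_vectors`, the `term_weights` mapping and the smoothed IDF formula are illustrative choices, not the patent's method.

```python
import math
from collections import Counter


def tfidf_vectors(docs, term_weights=None, default_weight=1.0):
    """Vectorize tokenized documents with TF-IDF, scaling selected terms.

    docs: list of token lists (output of word segmentation).
    term_weights: hypothetical mapping from a professional term to a
        multiplier; terms not listed use default_weight.
    """
    term_weights = term_weights or {}
    vocab = sorted({token for doc in docs for token in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(token for doc in docs for token in set(doc))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}

    vectors = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc) or 1
        vectors.append([
            (tf[t] / total) * idf[t] * term_weights.get(t, default_weight)
            for t in vocab
        ])
    return vocab, vectors
```

For example, `tfidf_vectors(docs, {"lithium_battery": 2.0})` doubles the contribution of that term in every document that contains it and leaves all other dimensions unchanged.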
In this embodiment, in step S3, the XGBoost algorithm is used for the model training.
In this embodiment, in step S3, performing model training on the enterprise tag data includes the following steps:
step S301, taking the enterprise tag data as the result set, and extracting vectorized data of the enterprise tag data using NLP technology;
and step S302, splitting out a training set, a validation set and a cross-validation set in combination with the result set, and then training the model.
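The split in step S302 is not spelled out in detail. The sketch below takes one plausible reading as an assumption: a held-out validation set is set aside first, and the remaining training indices are divided into k folds for cross-validation. The function name `split_dataset`, the 20% validation fraction, the 5 folds and the fixed seed are all hypothetical.

```python
import random


def split_dataset(samples, val_fraction=0.2, n_folds=5, seed=42):
    """Return (train_idx, val_idx, cv_splits) over the sample indices.

    cv_splits is a list of (fit_idx, held_out_idx) pairs built from the
    training indices only, so the validation set is never seen during
    cross-validation.
    """
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # reproducible shuffle

    n_val = int(len(indices) * val_fraction)
    val_idx = indices[:n_val]
    train_idx = indices[n_val:]

    # Deal the training indices into n_folds roughly equal folds.
    folds = [train_idx[k::n_folds] for k in range(n_folds)]
    cv_splits = []
    for k in range(n_folds):
        held_out = folds[k]
        fit = [i for j, fold in enumerate(folds) if j != k for i in fold]
        cv_splits.append((fit, held_out))
    return train_idx, val_idx, cv_splits
```

Splitting by index rather than by value keeps the feature vectors and the label result set aligned, since both can be looked up with the same indices.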
The system for generating an enterprise automatic labeling model based on NLP technology according to this embodiment comprises the following modules:
a capturing module, used for capturing Internet enterprise information to form a basic data source;
a processing module, used for processing the basic data source accordingly and extracting enterprise key information from the processed basic data source using NLP (natural language processing) technology;
a model module, used for performing model training on the enterprise key information and the label data in combination with the enterprise's original label data;
an iteration module, used for adjusting model parameters and changing input data based on the model training results, and iterating the model multiple times to generate a training model;
and a generation module, used for supplementing model rules according to the actual situation and generating the automatic labeling model.
The electronic device according to this embodiment includes a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface and the memory communicate with one another through the communication bus;
the memory is used for storing a computer program;
the processor is used for implementing the method steps of any of the above embodiments when executing the program stored in the memory.
A computer-readable storage medium according to this embodiment stores a computer program which, when executed by a processor, implements the method steps of any of the above embodiments.
Based on the method for generating an enterprise automatic labeling model based on NLP technology described above, and as can be better understood with reference to FIG. 1, this embodiment provides a practical example:
step S1: establishing a basic data source: capturing Internet enterprise information to form a basic data source;
step S2: extracting key information: extracting key information of the enterprise using NLP technology;
step S3: initial model training: performing model training in combination with the label data;
step S4: iterating the model: iterating the model according to the model parameters and the state of the data;
step S5: supplementing model rules: supplementing the model rules in combination with business expert suggestions;
step S6: generating the final automatic labeling model.
The basic data mainly consists of two parts: enterprise basic information data collected by a web crawler, and historical enterprise tag library data. As shown in FIG. 2, word segmentation, key information extraction and vectorization are then performed on the enterprise basic information using NLP technology; the key information and label data are then trained into a model in combination with the company's original enterprise label data.
As shown in FIG. 3, data cleaning is performed first: interference items in the data are removed, and data fields that are not suitable for the model are deleted, which improves the accuracy of the data. Word segmentation follows; this is an iterative process in which the professional vocabulary and the stop-word list are managed and supplemented according to the segmentation results. Key information for each industry is then extracted using NLP technology. Next, the weights of some professional vocabulary are adjusted as appropriate so that the data are better suited to model calculation, and finally Chinese text vectorization is carried out with a suitable algorithm.
Enterprise information labeling is essentially a multi-class classification task, so the XGBoost algorithm is adopted for model training. The enterprise label data are taken as the result set; using the vectorized data extracted by the NLP module together with the result set, a training set, a validation set and a cross-validation set are split out, and model training is then carried out. Based on the model training results, the model is iterated by adjusting parameters as appropriate and changing the input data, which generates the training model.
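The iteration described here, adjusting parameters and changing input data until a satisfactory model is found, can be sketched as a search over parameter combinations and input variants. This is only an illustration under stated assumptions: `iterate_model`, `train_and_score`, `param_grid` and `input_variants` are hypothetical names, and the training callable is left abstract. In a real pipeline it might fit an `xgboost.XGBClassifier` with the given parameters (for instance `max_depth`, `learning_rate`, `n_estimators`) and return a validation score, but that wiring is not shown and is not taken from the patent.

```python
from itertools import product


def iterate_model(train_and_score, param_grid, input_variants):
    """Try every parameter combination on every input variant.

    train_and_score: callable (params: dict, data) -> float, higher is better.
        Assumed to train a model and evaluate it on held-out data.
    param_grid: dict mapping a parameter name to a list of candidate values.
    input_variants: dict mapping a variant name to the data for that variant
        (for example, different sets of enterprise fields).
    Returns (best, history); each entry is (variant_name, params, score).
    """
    names = sorted(param_grid)
    best = None
    history = []
    for variant_name, data in input_variants.items():
        for values in product(*(param_grid[name] for name in names)):
            params = dict(zip(names, values))
            score = train_and_score(params, data)
            history.append((variant_name, params, score))
            if best is None or score > best[2]:
                best = (variant_name, params, score)
    return best, history
```

Keeping the full `history` rather than only the winner makes it possible to see whether a gain came from a parameter change or from a change in the input data.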
A rule model is then established from business data and expert suggestions and is used to supplement the training model, so that the results output by the model meet the relevant business requirements. Finally, the model is provided as a service whose input is the basic information of an enterprise and whose output is the enterprise's labels, which completes the automatic labeling of the enterprise.
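The patent does not describe the form of the rule model. As a minimal sketch, assuming rules are simple condition and action pairs applied after the trained model has predicted its labels, the supplementing step could look like the following. The function `apply_rules`, the rule dictionary layout and the example fields are all hypothetical.

```python
def apply_rules(company, predicted_labels, rules):
    """Adjust model-predicted labels with expert-defined business rules.

    company: dict of enterprise basic information.
    predicted_labels: iterable of labels produced by the trained model.
    rules: list of dicts with a "condition" callable taking the company,
        plus optional "add" and "remove" label collections.
    """
    labels = set(predicted_labels)
    for rule in rules:
        if rule["condition"](company):
            labels |= set(rule.get("add", ()))
            labels -= set(rule.get("remove", ()))
    return sorted(labels)


# Hypothetical rule: a company with fewer than 100 employees cannot be
# labeled as a large enterprise, whatever the model predicted.
EXAMPLE_RULES = [
    {
        "condition": lambda c: c.get("employees", 0) < 100,
        "add": ["Small and micro enterprise"],
        "remove": ["Large enterprise"],
    },
]
```

Applying rules after prediction keeps the expert knowledge outside the trained model, so a rule can be changed without retraining.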
In summary, the invention first captures enterprise basic information to form an enterprise basic information database. Key data are extracted using NLP technology through data cleaning and iterative word segmentation, and professional vocabulary weighting is introduced before Chinese text vectorization, so that the data model calculation is more accurate. In addition, the best-performing model calculation method is adopted and the data model is trained repeatedly in an iterative manner. Finally, a business rule model is added, providing an enterprise automatic labeling service that meets business requirements more accurately.
The method, system, equipment and storage medium for generating an enterprise automatic labeling model based on NLP technology have been described in detail above. A specific example is used herein to explain the principle and implementation of the invention, and the description of the embodiment is intended only to aid understanding of the method and its core idea. For a person skilled in the art, the specific embodiments and the scope of application may vary in accordance with the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.
Claims (9)
1. An enterprise automatic labeling model generation method based on NLP technology, characterized by comprising the following steps:
S1, capturing Internet enterprise information to form a basic data source;
S2, processing the basic data source accordingly, and extracting enterprise key information from the processed basic data source using NLP (natural language processing) technology;
S3, performing model training on the enterprise key information and the label data in combination with the enterprise's original label data;
S4, based on the model training results, adjusting model parameters and changing input data, and iterating the model multiple times to generate a training model;
and S5, supplementing model rules according to the actual situation to generate an automatic labeling model.
2. The method for generating an enterprise automatic labeling model based on NLP technology according to claim 1, wherein in step S1, the Internet enterprise information is obtained through web crawler collection and from historical enterprise tag library data.
3. The method for generating an enterprise automatic labeling model based on NLP technology according to claim 1, wherein in step S2, processing the basic data source includes the following steps:
step S201, cleaning the data in the basic data source and removing interference items from the data;
step S202, performing word segmentation on the cleaned data in the basic data source;
and step S203, managing and supplementing the professional vocabulary and the stop-word list according to the word segmentation result of step S202.
4. The method for generating an enterprise automatic labeling model based on NLP technology according to claim 1, wherein in step S2, weight adjustment is performed on some of the professional vocabulary in the enterprise key information extracted from the processed basic data source using NLP technology.
5. The method for generating an enterprise automatic labeling model based on NLP technology according to claim 1, wherein in step S3, the XGBoost algorithm is used for the model training.
6. The method according to claim 3, wherein in step S3, performing model training on the enterprise tag data includes the following steps:
step S301, taking the enterprise tag data as the result set, and extracting vectorized data of the enterprise tag data using NLP technology;
and step S302, splitting out a training set, a validation set and a cross-validation set in combination with the result set, and then training the model.
7. An enterprise automatic labeling model generation system based on NLP technology, characterized by comprising the following modules:
a capturing module, used for capturing Internet enterprise information to form a basic data source;
a processing module, used for processing the basic data source accordingly and extracting enterprise key information from the processed basic data source using NLP (natural language processing) technology;
a model module, used for performing model training on the enterprise key information and the label data in combination with the enterprise's original label data;
an iteration module, used for adjusting model parameters and changing input data based on the model training results, and iterating the model multiple times to generate a training model;
and a generating module, used for supplementing model rules according to the actual situation and generating the automatic labeling model.
8. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
the memory is used for storing a computer program;
the processor is used for implementing the method steps of any one of claims 1 to 6 when executing the program stored in the memory.
9. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, and the computer program, when executed by a processor, carries out the method steps of any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210859622.4A CN115391519A (en) | 2022-07-21 | 2022-07-21 | NLP technology-based enterprise automatic labeling model generation method, system, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115391519A (en) | 2022-11-25 |
Family
ID=84116815
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210859622.4A Pending CN115391519A (en) | 2022-07-21 | 2022-07-21 | NLP technology-based enterprise automatic labeling model generation method, system, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115391519A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109783818A (en) * | 2019-01-17 | 2019-05-21 | 上海三零卫士信息安全有限公司 | A kind of enterprises ' industry multi-tag classification method |
CN112287075A (en) * | 2020-12-25 | 2021-01-29 | 北京智源人工智能研究院 | Method and device for automatically acquiring enterprise multi-level classification training data |
CN112632980A (en) * | 2020-12-30 | 2021-04-09 | 广州友圈科技有限公司 | Enterprise classification method and system based on big data deep learning and electronic equipment |
CN113312476A (en) * | 2021-02-03 | 2021-08-27 | 珠海卓邦科技有限公司 | Automatic text labeling method and device and terminal |
CN114491209A (en) * | 2022-01-24 | 2022-05-13 | 南京中新赛克科技有限责任公司 | Method and system for mining enterprise business label based on internet information capture |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115599965A (en) * | 2022-12-13 | 2023-01-13 | 山东中慧强企信息科技有限公司(Cn) | Data economic informatization management system |
CN115599965B (en) * | 2022-12-13 | 2023-08-11 | 山东中慧强企信息科技有限公司 | Data economy informatization management system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110377759B (en) | Method and device for constructing event relation graph | |
CN111444340A (en) | Text classification and recommendation method, device, equipment and storage medium | |
CN106776538A (en) | The information extracting method of enterprise's noncanonical format document | |
CN110569359B (en) | Training and application method and device of recognition model, computing equipment and storage medium | |
CN116629275B (en) | Intelligent decision support system and method based on big data | |
CN111914555B (en) | Automatic relation extraction system based on Transformer structure | |
CN111309910A (en) | Text information mining method and device | |
CN110751234B (en) | OCR (optical character recognition) error correction method, device and equipment | |
CN106528616A (en) | Language error correcting method and system for use in human-computer interaction process | |
CN114239574A (en) | Miner violation knowledge extraction method based on entity and relationship joint learning | |
CN110705272A (en) | Named entity identification method for automobile engine fault diagnosis | |
CN116245097A (en) | Method for training entity recognition model, entity recognition method and corresponding device | |
CN114265937A (en) | Intelligent classification analysis method and system of scientific and technological information, storage medium and server | |
CN115391519A (en) | NLP technology-based enterprise automatic labeling model generation method, system, equipment and storage medium | |
CN113360654B (en) | Text classification method, apparatus, electronic device and readable storage medium | |
CN112749556B (en) | Multi-language model training method and device, storage medium and electronic equipment | |
CN112836013A (en) | Data labeling method and device, readable storage medium and electronic equipment | |
CN110472231B (en) | Method and device for identifying legal document case | |
CN110362828B (en) | Network information risk identification method and system | |
CN116739408A (en) | Power grid dispatching safety monitoring method and system based on data tag and electronic equipment | |
CN107657060B (en) | Feature optimization method based on semi-structured text classification | |
CN117786427B (en) | Vehicle type main data matching method and system | |
CN111402012B (en) | E-commerce defective product identification method based on transfer learning | |
CN114036946B (en) | Text feature extraction and auxiliary retrieval system and method | |
CN110427615B (en) | Method for analyzing modification tense of financial event based on attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||