CN115114399A

CN115114399A - Method for implementing text data governance preprocessing based on NLP technology

Info

Publication number: CN115114399A
Application number: CN202210674200.XA
Authority: CN
Inventors: 田一鸣; 徐寒亭; 朱震; 赵翔; 林潇; 胡松
Original assignee: Anhui Transport Consulting and Design Institute Co Ltd; Highway Traffic Energy Saving and Environmental Protection Technology and Equipment Transportation Industry R&D Center
Current assignee: Anhui Transport Consulting and Design Institute Co Ltd; Highway Traffic Energy Saving and Environmental Protection Technology and Equipment Transportation Industry R&D Center
Priority date: 2022-06-15
Filing date: 2022-06-15
Publication date: 2022-09-27

Abstract

The invention discloses a method for implementing text data governance preprocessing based on NLP technology, comprising the following steps: S1, collecting a document set; S2, obtaining a target set from the document set obtained in step S1 based on service keywords; S3, performing semantic analysis on the service keywords and the target set based on NLP technology to obtain an analysis result; S4, classifying all service data according to the analysis result obtained in step S3; S5, extracting features and information from each class of service data based on NLP technology; and S6, performing validity processing on the features and information extracted in step S5, and retaining or rejecting the corresponding service data according to the validity processing result. The invention applies NLP technology from the field of artificial intelligence to text data governance preprocessing, and can greatly improve the speed, efficiency and accuracy of classifying massive data and extracting data from it.

Description

Method for implementing text data governance preprocessing based on NLP technology

Technical field:

The invention relates to data governance methods for big data centers, and in particular to a method for implementing text data governance preprocessing based on NLP technology.

Background art:

In recent years, with strong national support for "new infrastructure", information infrastructure represented by big data centers and artificial intelligence has played an important role in the digital transformation of enterprises, and data processing is an essential step in that transformation. In the transportation industry, the design, construction, maintenance and operation of transportation infrastructure continuously generate massive structured, semi-structured and unstructured data. These data have rich dimensions, span diverse service scenarios and are highly specialized. At present, data preprocessing relies mainly on conventional tools and manual work. This approach depends heavily on the personal ability and service proficiency of the staff, involves a huge workload, is inefficient, and cannot meet the demands of massive data processing.

Disclosure of Invention

The invention aims to provide a method for implementing text data governance preprocessing based on NLP technology, so as to solve the problems of existing data governance technology in the field of traffic informatization.

In order to achieve the purpose, the technical scheme adopted by the invention is as follows:

A method for implementing text data governance preprocessing based on NLP technology comprises the following steps:

S1, collecting a plurality of documents, each formed from a service data set, thereby obtaining a document set;

S2, searching the service data in the document set obtained in step S1 by service keywords to obtain a target set;

S3, performing semantic analysis on the service keywords and the target set based on NLP technology to obtain an analysis result;

S4, classifying all the service data in the target set according to the analysis result obtained in step S3 and preset rules that match the service characteristics. Service data generated by different services, such as design, construction, maintenance and operation, differ greatly, so the service characteristics must be preset separately for each kind of service data;

S5, extracting features and information from each class of service data classified in step S4 based on NLP technology;

S6, performing validity processing on the features and information extracted in step S5; if they pass the validity processing, the corresponding service data are retained, and otherwise the corresponding service data are rejected.

Further, the service data set in step S1 includes a text data set, and the documents formed from it include text documents.

Further, the service keyword search in step S2 only includes integrating the text documents, and the integration methods include a binning method based on keyword query rules and a clustering method based on data set grouping.

Further, in step S2, the type of a service keyword includes at least one of, or a combination of more than one of, Chinese characters, numbers, English letters and specific symbols.

Further, the semantic analysis in step S3 includes a keyword matching analysis and a text semantic recognition analysis, and the analysis result obtained thereby includes a keyword matching result and a semantic recognition result.

Further, the semantic analysis in step S3 further includes a dependency syntax analysis, and the analysis result obtained thereby further includes a dependency syntax analysis result.

Further, the preset rules matching the service characteristics in step S4 include at least one of, or a combination of, a keyword matching degree rule, a dependency syntax relation rule and a text semantic result rule, wherein:

the keyword matching degree rule means that keywords matching the service characteristics are extracted according to the text characteristics of the specified service, and text data matching those keywords are grouped into one class;

the dependency syntax relation rule means that, according to the text characteristics of the specified service, the syntactic structure of the service text and the relations between sentences are extracted and expressed in an easily understood structured form;

the text semantic result rule means that NLP technology is used to perform semantic recognition on the specified service text to obtain a semantic recognition result, and important information in the text is extracted in light of the characteristics of the service text, the important information including service-related data such as numbers, dates, service names and special numbers.

Further, in step S5, the objects of feature extraction and information extraction for each class of service data are determined by the characteristics of that service.

Further, the data validity processing in step S6 includes a null value check, a value range check, a format check and a uniqueness constraint check.

Further, in step S6, the extracted data include original documents and real-time data, arranged by name and time. An original document is a document already stored in a storage area; real-time data are service data generated by a system in real time and acquired directly from the service system.

Compared with the prior art, the invention has the advantages that:

The method for implementing text data governance preprocessing based on NLP technology takes the input keywords and directly performs semantic analysis on the keywords and the target set to obtain an analysis result. The data are then classified according to rules formulated for the characteristics of each service, and NLP technology is used to extract features and information from each class of data. This improves the efficiency and accuracy with which an enterprise big data center in the traffic informatization field governs massive source data, greatly reduces the difficulty and complexity of governing the source data, and facilitates rapid adoption in each service field.

Description of the drawings:

To illustrate the technical solutions in the embodiments of the invention more clearly, the drawing used in the description of the embodiments is briefly introduced below. The drawing described below shows only some embodiments of the invention; those skilled in the art can obtain other drawings from it without creative effort.

FIG. 1 is a block flow diagram of the method of the present invention.

Detailed description of the embodiments:

The invention is further described below with reference to the drawing and an embodiment.

As shown in FIG. 1, a method for implementing text data governance preprocessing based on NLP technology includes the following steps:

S1, collecting a plurality of documents, each formed from a service data set, thereby obtaining a document set. In the invention the service data set includes only text data sets, and the documents formed include only text documents.

Taking highway operation as an example, the method collects a toll service data set, a vehicle monitoring service data set and a meteorological monitoring service data set.

S2, searching the service data in the document set obtained in step S1 by service keywords to obtain a target set.

The searching for the service keywords in step S2 includes integrating the text documents, and the integration method includes a binning method based on a keyword query rule and a clustering method based on data set grouping.

In step S2, the type of a service keyword includes at least one of, or a combination of, Chinese characters, numbers, English letters and specific symbols, giving the following cases:

P1, keyword text containing only one type, such as Chinese characters only or English only.

P2, keyword text containing two types, such as "Chinese characters + numbers", "Chinese characters + English" and "English + numbers".

P3, keyword text containing three types, such as "Chinese characters + English + numbers", "Chinese characters + specific symbols + numbers", "Chinese characters + English + specific symbols" and "English + numbers + specific symbols".

P4, keyword text containing all four types, that is, "Chinese characters + English + specific symbols + numbers".
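By way of non-limiting illustration, the cases P1 to P4 can be distinguished by counting the character types present in a keyword. The following is a minimal sketch only; the function names `keyword_types` and `keyword_case`, the use of the CJK Unified Ideographs range to detect Chinese characters, and the treatment of every other non-space character as a "specific symbol" are assumptions of this sketch and are not specified by the invention.

```python
def keyword_types(keyword):
    """Return the set of character types that occur in a keyword."""
    types = set()
    for ch in keyword:
        if "\u4e00" <= ch <= "\u9fff":        # CJK Unified Ideographs (assumed range)
            types.add("chinese")
        elif ch.isascii() and ch.isalpha():   # English letters
            types.add("english")
        elif ch.isdigit():                    # numbers
            types.add("number")
        elif not ch.isspace():                # anything else counts as a specific symbol
            types.add("symbol")
    return types


def keyword_case(keyword):
    """Map a keyword to case P1..P4 according to how many types it combines."""
    return f"P{len(keyword_types(keyword))}"


examples = ["收费站", "G50", "皖A12345", "皖A-12345"]
cases = {kw: keyword_case(kw) for kw in examples}
```

In this sketch a license-plate-like keyword such as "皖A12345" combines Chinese characters, English letters and numbers and therefore falls under case P3.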

In the highway operation example, the keywords are set as follows: date, toll station, amount, license plate number, vehicle type, mileage and the like for the toll service data set; license plate number, vehicle type, vehicle speed, road section and the like for the vehicle monitoring service data set; and wind speed, wind direction, air temperature, air humidity, illumination and the like for the meteorological monitoring service data set.
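As one possible illustration of the binning method based on keyword query rules, documents can be assigned to the service whose keywords they mention most often, and documents that match no keyword can be left out of the target set. The sketch below is hypothetical: the name `bin_by_keywords`, the English stand-in keywords, the sample documents and the "most hits wins" rule are assumptions made for illustration and are not taken from the invention.

```python
from collections import defaultdict

# Hypothetical keyword table, modeled on the highway operation example.
SERVICE_KEYWORDS = {
    "toll": ["toll station", "amount", "license plate", "mileage"],
    "vehicle_monitoring": ["license plate", "vehicle speed", "road section"],
    "weather_monitoring": ["wind speed", "wind direction", "air temperature"],
}


def bin_by_keywords(documents, keyword_table):
    """Put each document into the service bin with the most keyword hits.

    Documents with no keyword hit at all are excluded from the target set.
    """
    bins = defaultdict(list)
    for doc in documents:
        text = doc.lower()
        hits = {
            service: sum(text.count(kw) for kw in keywords)
            for service, keywords in keyword_table.items()
        }
        best = max(hits, key=hits.get)
        if hits[best] > 0:
            bins[best].append(doc)
    return dict(bins)


docs = [
    "Toll station A: amount 35.00, license plate AB1234, mileage 42 km",
    "Road section K12: vehicle speed 96 km/h, license plate AB1234",
    "Wind speed 5.2 m/s, wind direction NE, air temperature 18 C",
    "Minutes of the weekly staff meeting",
]
target_set = bin_by_keywords(docs, SERVICE_KEYWORDS)
```

Here the last document matches no service keyword and is therefore not part of `target_set`.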

S3, performing semantic analysis on the service keywords and the target set based on NLP technology to obtain an analysis result.

The semantic analysis in step S3 includes keyword matching analysis, text semantic recognition analysis, and dependency syntax analysis, and the analysis results obtained thereby include keyword matching result, semantic recognition result, and dependency syntax analysis result.

The general process of semantic analysis with NLP technology is as follows:

P1, content determination: the text types and scope included in the target set are determined according to the service type;

P2, word segmentation: the text in the target set is segmented to obtain a data structure in units of words;

P3, part-of-speech tagging: the segmented words are tagged with parts of speech according to their meaning and context;

P4, named entity recognition: based on the segmentation result, entities with special meaning in the service and domain are recognized;

P5, model training: the processed results are modeled to obtain a recognition model;

P6, semantic analysis: the service data are analyzed with the recognition model to obtain the keyword matching result, the semantic recognition result and the dependency syntax analysis result.
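The flow of steps P2, P3, P4 and P6 can be sketched, purely for illustration, with regular expressions standing in for a trained model. This is not the recognition model of the invention: model training (P5) and dependency syntax analysis are omitted, and the entity patterns, the tag names and the functions `segment`, `tag` and `analyze` are assumptions of this sketch.

```python
import re

# Hypothetical entity patterns for a toll record; a real system would use a
# trained named-entity recognizer instead of regular expressions.
ENTITY_PATTERNS = {
    "DATE": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "AMOUNT": re.compile(r"\d+\.\d{2}"),
    "PLATE": re.compile(r"[A-Z]{2}\d{4}"),
}


def segment(text):
    """P2: split text into word-level tokens, keeping dates and decimals whole."""
    return re.findall(r"\d{4}-\d{2}-\d{2}|\d+\.\d+|\w+", text)


def tag(tokens):
    """P3/P4: label each token with an entity type, or a coarse NUM/WORD tag."""
    tagged = []
    for tok in tokens:
        label = next(
            (name for name, pat in ENTITY_PATTERNS.items() if pat.fullmatch(tok)),
            None,
        )
        if label is None:
            label = "NUM" if tok.isdigit() else "WORD"
        tagged.append((tok, label))
    return tagged


def analyze(text, keywords):
    """P6: return the keyword matching result and the recognized entities."""
    tagged = tag(segment(text))
    words = {tok.lower() for tok, _ in tagged}
    matched = [kw for kw in keywords if kw.lower() in words]
    entities = {label: tok for tok, label in tagged if label in ENTITY_PATTERNS}
    return {"keyword_match": matched, "entities": entities}


result = analyze(
    "2022-06-15 toll amount 35.00 plate AB1234",
    ["toll", "amount", "mileage"],
)
```

In practice the segmentation, tagging and entity recognition would be delegated to an NLP toolkit suited to the language of the service text; the dictionary returned by `analyze` merely shows one possible shape for the analysis result.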

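S4, classifying all the service data in the target set according to the analysis result obtained in step S3 and preset rules that match the service characteristics.

The preset rules matching the service characteristics in step S4 include at least one of, or a combination of, a keyword matching degree rule, a dependency syntax relation rule and a text semantic result rule, wherein:

the keyword matching degree rule means that keywords matching the service characteristics are extracted according to the text characteristics of the specified service, and text data matching those keywords are grouped into one class;

the dependency syntax relation rule means that, according to the text characteristics of the specified service, the syntactic structure of the service text and the relations between sentences are extracted and expressed in an easily understood structured form;

the text semantic result rule means that NLP technology is used to perform semantic recognition on the specified service text to obtain a semantic recognition result, and important information in the text is extracted in light of the characteristics of the service text, the important information including service-related data such as numbers, dates, service names and special numbers.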
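One way to picture rules being used alone or in combination in step S4 is as a table of checks applied to an analysis result. The sketch below is an assumption-laden illustration: the shape of the analysis dictionary, the helper names `keyword_rule`, `semantic_rule` and `classify`, the 0.5 matching degree threshold and the sample rule table are all hypothetical, and the dependency syntax relation rule is not modeled.

```python
def keyword_rule(required, min_degree):
    """Keyword matching degree rule: enough of the required keywords matched."""
    def check(analysis):
        matched = set(analysis["keyword_match"]) & set(required)
        return len(matched) / len(required) >= min_degree
    return check


def semantic_rule(required_entities):
    """Text semantic result rule: all listed entity types were recognized."""
    def check(analysis):
        return set(required_entities) <= set(analysis["entities"])
    return check


# Hypothetical rule table: a class applies only if all of its rules pass.
RULES = {
    "toll": [keyword_rule(["toll", "amount"], 0.5), semantic_rule(["AMOUNT", "PLATE"])],
    "weather": [keyword_rule(["wind", "temperature"], 0.5)],
}


def classify(analysis, rules=RULES):
    """Return every class whose rules all pass for one analysis result."""
    return [name for name, checks in rules.items() if all(c(analysis) for c in checks)]


toll_analysis = {
    "keyword_match": ["toll", "amount"],
    "entities": {"AMOUNT": "35.00", "PLATE": "AB1234"},
}
weather_analysis = {"keyword_match": ["wind"], "entities": {}}
unknown_analysis = {"keyword_match": [], "entities": {}}

toll_classes = classify(toll_analysis)
weather_classes = classify(weather_analysis)
unknown_classes = classify(unknown_analysis)
```

Keeping each rule as a small function makes it straightforward to preset a different combination of rules for each service, as the method requires.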

S5, extracting features and information from each class of service data classified in step S4 based on NLP technology. The objects of feature extraction and information extraction for each class of service data are determined by the characteristics of that service.

The general process of feature extraction and information extraction with NLP technology is as follows:

P1, counting the total number of words in the target set and the occurrence frequency of each word;

P2, calculating the chi-square value of each word and sorting the words by it;

P3, converting the text information into quantifiable feature vectors.
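The three steps above can be sketched as follows, under stated assumptions. The invention does not say how the chi-square value is computed; this sketch assumes the common 2x2 contingency form used for text feature selection, chi2 = N(AD - BC)^2 / ((A+C)(B+D)(A+B)(C+D)), where A and C count documents of the target class with and without the word, and B and D count the other documents with and without it. The sample documents, the class labels and the choice of a two-word vocabulary are hypothetical.

```python
from collections import Counter


def chi_square(term, docs, labels, target):
    """Chi-square statistic of the 2x2 table: term present/absent x target/other class."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        present = term in tokens
        if label == target:
            a += present
            c += not present
        else:
            b += present
            d += not present
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


# Hypothetical, already segmented documents with their service classes.
docs = [
    ["toll", "amount", "plate"],
    ["toll", "mileage", "plate"],
    ["wind", "speed", "plate"],
    ["wind", "temperature", "plate"],
]
labels = ["toll", "toll", "weather", "weather"]

# P1: total number of words and frequency of each word.
freq = Counter(tok for doc in docs for tok in doc)
total_words = sum(freq.values())

# P2: chi-square value of each word for the "toll" class, sorted in descending
# order (ties broken alphabetically so the ranking is deterministic).
scores = {w: chi_square(w, docs, labels, "toll") for w in freq}
ranked = sorted(scores, key=lambda w: (-scores[w], w))

# P3: represent each document as a count vector over the top-ranked words.
vocab = ranked[:2]
vectors = [[doc.count(w) for w in vocab] for doc in docs]
```

A word such as "plate" that occurs in every document scores zero and is ranked last, which is why it is not selected as a feature in this example.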

S6, performing validity processing on the features and information extracted in step S5; if they pass the validity processing, the corresponding service data are retained, and otherwise the corresponding service data are rejected.

The data validity processing in step S6 includes a null value check, a value range check, a format check and a uniqueness constraint check, wherein:

the null value check covers four cases: the parameter is absent, the parameter is present but null, the parameter is present but has zero length, and the parameter is present but blank;

the value range check verifies the validity of the value, range and length of the data;

the format check verifies the validity of the data format;

the uniqueness constraint check ensures that a field or data item is unique within a set.
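The four checks, and the retain-or-reject decision that follows them, can be illustrated with the minimal sketch below. The record fields (`plate`, `date`, `amount`), the amount bounds, the date pattern and the choice of (plate, date) as the uniqueness key are assumptions made for this example only.

```python
import re

DATE_FORMAT = re.compile(r"\d{4}-\d{2}-\d{2}")
REQUIRED = ("plate", "date", "amount")


def is_null(record, field):
    """Null value check: field absent, None, zero-length, or whitespace only."""
    if field not in record or record[field] is None:
        return True
    value = record[field]
    return isinstance(value, str) and value.strip() == ""


def is_valid(record, seen_keys):
    """Run the four checks in order; remember the uniqueness key on success."""
    if any(is_null(record, f) for f in REQUIRED):
        return False                                  # null value check
    if not 0 <= record["amount"] <= 10_000:
        return False                                  # value range check (assumed bounds)
    if not DATE_FORMAT.fullmatch(record["date"]):
        return False                                  # format check
    key = (record["plate"], record["date"])
    if key in seen_keys:
        return False                                  # uniqueness constraint check
    seen_keys.add(key)
    return True


def filter_valid(records):
    """Split records into those retained and those rejected."""
    seen, kept, rejected = set(), [], []
    for rec in records:
        (kept if is_valid(rec, seen) else rejected).append(rec)
    return kept, rejected


records = [
    {"plate": "AB1234", "date": "2022-06-15", "amount": 35.0},
    {"plate": "AB1234", "date": "2022-06-15", "amount": 35.0},  # duplicate key
    {"plate": "  ", "date": "2022-06-15", "amount": 12.0},      # blank plate
    {"plate": "CD5678", "date": "15/06/2022", "amount": 12.0},  # wrong date format
    {"plate": "EF9012", "date": "2022-06-16", "amount": -1.0},  # out of range
    {"date": "2022-06-16", "amount": 8.0},                      # plate absent
]
kept, rejected = filter_valid(records)
```

Only the first record is retained; each of the others fails exactly one of the four checks.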

In step S6, the extracted data include original documents and real-time data, arranged by name and time. An original document is a document already stored in a storage area; real-time data are service data generated by a system in real time and acquired directly from the service system.
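Arranging the two sources by name and time might look like the following sketch. The record layout, the ISO 8601 timestamps, the added `source` field and the function name `arrange` are illustrative assumptions, not details of the invention.

```python
from datetime import datetime


def arrange(original_docs, realtime_data):
    """Merge both sources and sort by name, then by timestamp."""
    merged = [dict(item, source="original") for item in original_docs]
    merged += [dict(item, source="realtime") for item in realtime_data]
    return sorted(
        merged,
        key=lambda item: (item["name"], datetime.fromisoformat(item["time"])),
    )


original = [
    {"name": "toll", "time": "2022-06-15T09:00:00"},
    {"name": "weather", "time": "2022-06-14T12:00:00"},
]
realtime = [{"name": "toll", "time": "2022-06-15T08:30:00"}]

arranged = arrange(original, realtime)
order = [(item["name"], item["source"]) for item in arranged]
```

The real-time toll record is placed before the stored toll document because its timestamp is earlier, and the weather document follows because names are compared first.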

The embodiments described above are only preferred embodiments of the invention and do not limit its concept and scope. Any modifications and improvements made to the technical solution of the invention by those skilled in the art without departing from its design concept shall fall within the scope of protection of the invention. The technical content for which protection is claimed is fully set out in the claims.

Claims

1. A method for implementing text data governance preprocessing based on NLP technology, characterized by comprising the following steps:

S1, collecting a plurality of documents, each formed from a service data set, thereby obtaining a document set;

S2, searching the service data in the document set obtained in step S1 by service keywords to obtain a target set;

S3, performing semantic analysis on the service keywords and the target set based on NLP technology to obtain an analysis result;

S4, classifying all the service data in the target set according to the analysis result obtained in step S3 and preset rules that match the service characteristics;

S5, extracting features and information from each class of service data classified in step S4 based on NLP technology;

S6, performing validity processing on the features and information extracted in step S5; if they pass the validity processing, the corresponding service data are retained, and otherwise the corresponding service data are rejected.

2. The method according to claim 1, wherein the service data set in step S1 includes only a text data set, and the documents formed include only text documents.

3. The method according to claim 1, wherein the service keyword search in step S2 only includes integrating the text documents, and the integration methods include a binning method based on keyword query rules and a clustering method based on data set grouping.

4. The method according to claim 1, wherein in step S2, the type of a service keyword includes at least one of, or a combination of more than one of, Chinese characters, numbers, English letters and specific symbols.

5. The method according to claim 1, wherein the semantic analysis in step S3 includes keyword matching analysis and text semantic recognition analysis, and the analysis results obtained thereby include keyword matching result and semantic recognition result.

6. The method for implementing text data governance preprocessing based on NLP technology as claimed in claim 5, wherein the semantic analysis in step S3 further includes dependency syntax analysis, and the analysis result obtained thereby further includes a dependency syntax analysis result.

7. The method according to claim 6, wherein the preset rules matching the service characteristics in step S4 include at least one of, or a combination of, a keyword matching degree rule, a dependency syntax relation rule and a text semantic result rule, and wherein:

the keyword matching degree rule means that keywords matching the service characteristics are extracted according to the text characteristics of the specified service, and text data matching those keywords are grouped into one class;

the dependency syntax relation rule means that, according to the text characteristics of the specified service, the syntactic structure of the service text and the relations between sentences are extracted and expressed in an easily understood structured form;

the text semantic result rule means that NLP technology is used to perform semantic recognition on the specified service text to obtain a semantic recognition result, and important information in the text is extracted in light of the characteristics of the service text, the important information including service-related data such as numbers, dates, service names and special numbers.

8. The method for preprocessing text data based on NLP technology as claimed in claim 1, wherein in step S5, the objects for feature extraction and information extraction from various types of business data are determined based on each business feature.

9. The method for implementing text data governance preprocessing based on NLP technology as claimed in claim 1, wherein the data validity processing in step S6 includes a null value check, a value range check, a format check and a uniqueness constraint check.

10. The method according to claim 1, wherein the extracting data in step S6 includes extracting original documents and real-time data, and arranging the extracted documents according to name and time, wherein the original documents refer to documents already stored in the storage area, and the real-time data refer to service data generated by the system in real time and directly obtained from the service system.