CN115114399A - Method for realizing text data treatment preprocessing based on NLP technology - Google Patents

Publication number
CN115114399A
Authority
CN
China
Prior art keywords
service
data
text
nlp technology
analysis
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210674200.XA
Other languages
Chinese (zh)
Inventor
田一鸣
徐寒亭
朱震
赵翔
林潇
胡松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Transport Consulting and Design Institute Co Ltd
Highway Traffic Energy Saving and Environmental Protection Technology and Equipment Transportation Industry R&D Center
Original Assignee
Anhui Transport Consulting and Design Institute Co Ltd
Highway Traffic Energy Saving and Environmental Protection Technology and Equipment Transportation Industry R&D Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2022-06-15
Filing date
2022-06-15
Publication date
2022-09-27
Application filed by Anhui Transport Consulting and Design Institute Co Ltd and Highway Traffic Energy Saving and Environmental Protection Technology and Equipment Transportation Industry R&D Center
Priority to CN202210674200.XA
Publication of CN115114399A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for realizing text data governance preprocessing based on NLP technology, comprising the following steps: S1, collecting a document set; S2, obtaining a target set from the document set obtained in step S1 based on service keywords; S3, performing semantic analysis on the service keywords and the target set based on NLP technology to obtain an analysis result; S4, classifying all service data according to the analysis result obtained in step S3; S5, extracting features and information from each class of service data based on NLP technology; and S6, performing validity processing on the features and information extracted in step S5, and retaining or rejecting the corresponding service data according to the validity processing result. The method applies NLP technology from the field of artificial intelligence to text data governance preprocessing, and can greatly improve the speed, efficiency, and accuracy of classifying massive data and extracting data from it.

Description

Method for realizing text data treatment preprocessing based on NLP technology
Technical Field
The invention relates to data governance methods for big data centers, and in particular to a method for realizing text data governance preprocessing based on NLP technology.
Background Art
In recent years, with strong national support for "new infrastructure", information infrastructure represented by big data centers and artificial intelligence has played an important role in the digital transformation of enterprises, and data processing is an essential step in that process. In the transportation industry, the design, construction, maintenance, and operation of transportation infrastructure continuously generate massive amounts of structured, semi-structured, and unstructured data. These data have rich dimensions, span varied service scenarios, and are highly specialized. At present, data preprocessing relies mainly on conventional tools and manual work. This approach depends heavily on the individual ability and service proficiency of the staff, involves a huge workload, is inefficient, and cannot meet the demands of massive data processing.
Disclosure of Invention
The invention aims to provide a method for realizing text data governance preprocessing based on NLP technology, so as to solve the problems of existing data governance technology in the field of transportation informatization.
To achieve this purpose, the invention adopts the following technical solution:
A method for realizing text data governance preprocessing based on NLP technology comprises the following steps:
S1, collecting a plurality of documents, each formed from a service data set, thereby obtaining a document set;
S2, searching the service data in the document set obtained in step S1 for service keywords to obtain a target set;
S3, performing semantic analysis on the service keywords and the target set based on NLP technology to obtain an analysis result;
S4, classifying all the service data in the target set according to the analysis result obtained in step S3 and preset rules that fit the service characteristics. The service data generated by different services such as design, construction, maintenance, and operation differ greatly, so the service characteristics must be preset and determined separately for each kind of service data;
S5, extracting features and information from each class of service data classified in step S4, based on NLP technology;
S6, performing validity processing on the features and information extracted in step S5; if they pass the validity processing, the corresponding service data are retained, and if they do not, the corresponding service data are rejected.
Further, the service data sets in step S1 include only text data sets, and the documents thus formed include only text documents.
Further, the service keyword search in step S2 includes only integrating the text documents, and the integration methods include a binning method based on keyword query rules and a clustering method based on data set grouping.
Further, in step S2, a service keyword consists of at least one of, or a combination of, Chinese characters, digits, English letters, and specific symbols.
Further, the semantic analysis in step S3 includes keyword matching analysis and text semantic recognition analysis, and the analysis result thus obtained includes a keyword matching result and a semantic recognition result.
Further, the semantic analysis in step S3 also includes dependency syntax analysis, and the analysis result thus obtained also includes a dependency syntax analysis result.
Further, the preset rules in step S4 that fit the service characteristics include at least one of, or a combination of, a keyword matching degree rule, a dependency syntax relation rule, and a text semantic result rule, where:
the keyword matching degree rule extracts keywords that fit the service characteristics according to the text characteristics of a specified service, and treats text data that match those keywords as one class;
the dependency syntax relation rule extracts the grammatical structure of the service text and the relations between sentences according to the text characteristics of the specified service, and expresses them in an easily understood structured form;
the text semantic result rule uses NLP technology to perform semantic recognition on the specified service text to obtain a semantic recognition result, and extracts important information from the text in combination with the characteristics of the service text, the important information including service-related data such as numbers, dates, service names, and special numbers.
Further, in step S5, the objects of feature extraction and information extraction for each class of service data are determined according to the characteristics of each service.
Further, the data validity processing in step S6 includes null value checking, value range checking, format checking, and uniqueness constraint checking.
Further, in step S6, data extraction includes extracting original documents and real-time data and arranging them by name and time. An original document is a document already stored in a storage area; real-time data are service data generated by a system in real time and obtained directly from the service system.
Compared with the prior art, the invention has the advantages that:
The method uses NLP technology to perform semantic analysis directly on the input keywords and the target set and obtain an analysis result. The data are then classified according to rules formulated for the characteristics of each service, and NLP technology is used again to extract features and information from each class of data. This improves the efficiency and accuracy with which an enterprise big data center in the transportation informatization field governs massive source data, greatly reduces the difficulty and complexity of governing that data, and makes the method easy to extend to other service fields.
Description of the Drawings
To illustrate the technical solutions in the embodiments of the invention more clearly, the drawing used in the description of the embodiments is briefly introduced below. The drawing shows only some embodiments of the invention; those skilled in the art can derive other drawings from it without creative effort.
FIG. 1 is a block flow diagram of the method of the present invention.
Detailed Description
The invention is further described below with reference to the drawing and an embodiment.
As shown in FIG. 1, a method for realizing text data governance preprocessing based on NLP technology comprises the following steps:
S1, collecting a plurality of documents, each formed from a service data set, thereby obtaining a document set. In the invention, the service data sets include only text data sets, and the documents formed include only text documents.
Taking highway operation as an example, a charging service data set, a vehicle monitoring service data set, and a meteorological monitoring service data set are collected.
S2, searching the service data in the document set obtained in step S1 for service keywords to obtain a target set.
The service keyword search in step S2 includes integrating the text documents; the integration methods include a binning method based on keyword query rules and a clustering method based on data set grouping.
In step S2, a service keyword consists of one or more of the following types: Chinese characters, digits, English letters, and specific symbols. The following cases arise:
P1, keyword text of a single type, such as Chinese characters only or English only.
P2, keyword text of two types, such as "Chinese characters + digits", "Chinese characters + English", and "English + digits".
P3, keyword text of three types, such as "Chinese characters + English + digits", "Chinese characters + specific symbols + digits", "Chinese characters + English + specific symbols", and "English + digits + specific symbols".
P4, keyword text of all four types, that is, "Chinese characters + English + specific symbols + digits".
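The four keyword cases above can be told apart by counting how many character types a keyword contains. The following is a minimal sketch of that idea, not the patented implementation: the function names, the Unicode range used for Chinese characters, and the rule that any other non-space character counts as a "specific symbol" are all assumptions made for illustration.

```python
import re

# Character classes for three of the four keyword types named in the text.
# The exact ranges are assumptions made for this sketch.
_PATTERNS = {
    "chinese": re.compile(r"[\u4e00-\u9fff]"),
    "number": re.compile(r"[0-9]"),
    "english": re.compile(r"[A-Za-z]"),
}
# Anything that is not Chinese, a digit, a Latin letter, or whitespace
# is treated here as a "specific symbol".
_SYMBOL = re.compile(r"[^\u4e00-\u9fff0-9A-Za-z\s]")


def keyword_types(keyword):
    """Return the set of character types that occur in a keyword."""
    found = {name for name, pat in _PATTERNS.items() if pat.search(keyword)}
    if _SYMBOL.search(keyword):
        found.add("symbol")
    return found


def keyword_case(keyword):
    """Map a keyword to case P1..P4 by counting its character types."""
    count = len(keyword_types(keyword))
    if count == 0:
        raise ValueError("keyword contains no recognizable characters")
    return f"P{count}"
```

For example, a license plate written as one Chinese character followed by Latin letters and digits falls under P3, while a plain Chinese term such as 收费站 (toll station) falls under P1.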
The keywords are set per data set: the charging service data set uses keywords such as date, toll station, amount, license plate number, vehicle type, and mileage; the vehicle monitoring service data set uses keywords such as license plate number, vehicle type, vehicle speed, and road section; and the meteorological monitoring service data set uses keywords such as wind speed, wind direction, air temperature, air humidity, and illumination.
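One possible reading of the keyword-based binning in step S2 is sketched below: each document is placed in the bin of every service whose keywords it contains, and documents that match nothing stay out of the target set. The source does not spell out the binning rule, so the function name, the data layout, and the English keyword renderings are assumptions.

```python
from collections import defaultdict

# Keyword lists per service data set, loosely following the highway
# example in the text. The English renderings are illustrative.
BUSINESS_KEYWORDS = {
    "charging": ["toll station", "amount", "license plate", "mileage"],
    "vehicle_monitoring": ["license plate", "vehicle speed", "road section"],
    "weather_monitoring": ["wind speed", "wind direction", "air temperature"],
}


def build_target_set(documents, business_keywords=BUSINESS_KEYWORDS):
    """Bin documents by the service keywords they contain.

    documents: mapping of document name -> text.
    Returns a mapping of service name -> list of (document name, hits),
    where hits lists the matched keywords. A document that matches no
    keyword of a service is left out of that service's bin.
    """
    bins = defaultdict(list)
    for name, text in documents.items():
        lowered = text.lower()
        for business, keywords in business_keywords.items():
            hits = [kw for kw in keywords if kw in lowered]
            if hits:
                bins[business].append((name, hits))
    return dict(bins)
```

A document can land in more than one bin here (a toll record that mentions a license plate matches both the charging and the vehicle monitoring keywords); resolving such overlaps is left to the later classification step.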
S3, performing semantic analysis on the service keywords and the target set based on NLP technology to obtain an analysis result.
The semantic analysis in step S3 includes keyword matching analysis, text semantic recognition analysis, and dependency syntax analysis, and the analysis results thus obtained include a keyword matching result, a semantic recognition result, and a dependency syntax analysis result.
The general process of semantic analysis with NLP technology is as follows:
P1, content determination: determining the text types and scope included in the target set according to the service type;
P2, word segmentation: segmenting the text in the target set to obtain a data structure whose unit is the word;
P3, part-of-speech tagging: tagging each segmented word with its part of speech according to its meaning and context;
P4, named entity recognition: based on the segmentation result, recognizing entities that have a special meaning in the service and field;
P5, model training: building a model from the processed results to obtain a recognition model;
P6, semantic analysis: analyzing the service data with the recognition model to obtain the keyword matching result, the semantic recognition result, and the dependency syntax analysis result.
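Steps P2 to P4 can be illustrated with a toy example. The sketch below swaps in dictionary-based forward maximum matching for word segmentation and a regular expression for entity recognition; the source does not say which segmenter, tagger, or recognition model is used, so the lexicon, the tags, and the license plate pattern are all hypothetical, and a real system would use trained models instead.

```python
import re

# Hypothetical domain lexicon: word -> coarse part-of-speech tag.
LEXICON = {
    "收费站": "n",   # toll station
    "车辆": "n",     # vehicle
    "通过": "v",     # pass through
    "合肥": "ns",    # Hefei (place name)
}
MAX_WORD_LEN = max(len(word) for word in LEXICON)

# Hypothetical pattern for a license plate, used as the domain-specific
# named entity in this sketch.
PLATE = re.compile(r"[\u4e00-\u9fff][A-Z][A-Z0-9]{5}")


def segment(text):
    """Segment text into (token, tag) pairs.

    Entities matching PLATE are recognized first (P4); otherwise the
    longest lexicon word starting at the current position is taken
    (P2, forward maximum matching) and tagged from the lexicon (P3).
    """
    tokens, i = [], 0
    while i < len(text):
        match = PLATE.match(text, i)
        if match:
            tokens.append((match.group(), "PLATE"))
            i = match.end()
            continue
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            word = text[i:i + size]
            if word in LEXICON:
                tokens.append((word, LEXICON[word]))
                i += size
                break
        else:
            # No lexicon word starts here: emit one unknown character.
            tokens.append((text[i], "x"))
            i += 1
    return tokens
```

The resulting list of tagged tokens is the kind of word-level data structure that the later steps (model training and semantic analysis) would consume.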
S4, classifying all the service data in the target set according to the analysis result obtained in step S3 and preset rules that fit the service characteristics. The service data generated by different services such as design, construction, maintenance, and operation differ greatly, so the service characteristics must be preset and determined separately for each kind of service data.
The preset rules in step S4 that fit the service characteristics include at least one of, or a combination of, a keyword matching degree rule, a dependency syntax relation rule, and a text semantic result rule, where:
the keyword matching degree rule extracts keywords that fit the service characteristics according to the text characteristics of a specified service, and treats text data that match those keywords as one class;
the dependency syntax relation rule extracts the grammatical structure of the service text and the relations between sentences according to the text characteristics of the specified service, and expresses them in an easily understood structured form;
the text semantic result rule uses NLP technology to perform semantic recognition on the specified service text to obtain a semantic recognition result, and extracts important information from the text in combination with the characteristics of the service text, the important information including service-related data such as numbers, dates, service names, and special numbers.
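Of the three rules, the keyword matching degree rule is the simplest to sketch. The example below assumes that "matching degree" means the fraction of a class's keywords found in a text, and that a text is assigned to the class with the highest degree if it reaches a cut-off. Both the definition and the `threshold` value are assumptions; the source gives neither.

```python
def matching_degree(text, keywords):
    """Fraction of a class's keywords that occur in the text."""
    if not keywords:
        return 0.0
    return sum(kw in text for kw in keywords) / len(keywords)


def classify(text, class_keywords, threshold=0.5):
    """Return the class with the highest matching degree, or None.

    class_keywords: mapping of class name -> list of keywords.
    threshold: hypothetical cut-off below which the text is unclassified.
    """
    best_class, best_score = None, 0.0
    for name, keywords in class_keywords.items():
        score = matching_degree(text, keywords)
        if score > best_score:
            best_class, best_score = name, score
    return best_class if best_score >= threshold else None
```

Returning `None` for weak matches keeps unclassifiable text out of every class rather than forcing it into the nearest one, which seems closer to the intent of a governance step.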
S5, extracting features and information from each class of service data classified in step S4, based on NLP technology. The objects of feature extraction and information extraction for each class of service data are determined according to the characteristics of each service.
The general process of feature extraction and information extraction with NLP technology is as follows:
P1, counting the total number of words in the target set and the frequency of occurrence of each word;
P2, calculating the chi-square value of each word and sorting the words by it;
P3, converting the text information into a quantifiable feature vector.
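The three steps above can be sketched as follows. The source does not say how the chi-square value is computed, so this example assumes the common document-level formulation for a term t and a class c: chi2 = N(AD - BC)^2 / ((A+C)(B+D)(A+B)(C+D)), where A and B count documents containing t inside and outside c, and C and D count documents lacking t inside and outside c. The function names are illustrative.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/class contingency table."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def rank_terms(docs, labels, target):
    """P1 and P2: count document frequencies, then rank terms by chi-square.

    docs: list of token lists; labels: one class label per document.
    Returns (term, score) pairs, highest score first, ties by term.
    """
    in_class, out_class = Counter(), Counter()
    n_in = sum(1 for y in labels if y == target)
    n_out = len(labels) - n_in
    for tokens, y in zip(docs, labels):
        (in_class if y == target else out_class).update(set(tokens))
    scores = {}
    for term in set(in_class) | set(out_class):
        a, b = in_class[term], out_class[term]
        scores[term] = chi_square(a, b, n_in - a, n_out - b)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def vectorize(tokens, vocabulary):
    """P3: term-frequency vector over the selected vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]
```

In use, the top-ranked terms from `rank_terms` would form the vocabulary passed to `vectorize`, so that each document becomes a fixed-length numeric vector.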
S6, performing validity processing on the features and information extracted in step S5; if they pass the validity processing, the corresponding service data are retained, and if they do not, the corresponding service data are rejected.
The data validity processing in step S6 includes a null value check, a value range check, a format check, and a uniqueness constraint check, where:
the null value check covers four cases: the parameter is absent, the parameter is present but null, the parameter is present but has zero length, and the parameter is present but blank;
the value range check verifies the validity of the value, range, and length of the data;
the format check verifies the validity of the data format;
the uniqueness constraint check ensures that a field or record is unique within a set.
In step S6, data extraction includes extracting original documents and real-time data and arranging them by name and time. An original document is a document already stored in a storage area; real-time data are service data generated by a system in real time and obtained directly from the service system.
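The four validity checks can be sketched for a single record type as shown below. The schema is hypothetical: the field names `plate`, `date`, and `amount`, the date format, the amount range, and the choice of (plate, date) as the uniqueness key are assumptions made for illustration, not details from the source.

```python
import re

# Hypothetical date format for the format check.
DATE_FORMAT = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_null(record, field):
    """Null check: field absent, None, zero-length, or whitespace only."""
    if field not in record:
        return True
    value = record[field]
    if value is None:
        return True
    return isinstance(value, str) and value.strip() == ""


def validate(records, required=("plate", "date", "amount"),
             amount_range=(0.0, 10000.0)):
    """Split records into (kept, rejected) using the four checks."""
    low, high = amount_range
    kept, rejected, seen = [], [], set()
    for rec in records:
        ok = not any(is_null(rec, f) for f in required)          # null check
        if ok:
            ok = bool(DATE_FORMAT.fullmatch(str(rec["date"])))   # format check
        if ok:
            amount = rec["amount"]                                # value range check
            ok = isinstance(amount, (int, float)) and low <= amount <= high
        key = (rec.get("plate"), rec.get("date"))
        if ok and key in seen:                                    # uniqueness check
            ok = False
        if ok:
            seen.add(key)
            kept.append(rec)
        else:
            rejected.append(rec)
    return kept, rejected
```

Records that fail any check go to the rejected list, mirroring the retain-or-reject outcome of step S6; a production pipeline would likely also record which check failed.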
The embodiments described above only illustrate preferred implementations of the invention and do not limit its concept or scope. Any modification or improvement that those skilled in the art make to the technical solution without departing from the design concept of the invention falls within its scope of protection. The technical content for which protection is sought is fully set out in the claims.

Claims (10)

1. A method for realizing text data governance preprocessing based on NLP technology, characterized by comprising the following steps:
S1, collecting a plurality of documents, each formed from a service data set, thereby obtaining a document set;
S2, searching the service data in the document set obtained in step S1 for service keywords to obtain a target set;
S3, performing semantic analysis on the service keywords and the target set based on NLP technology to obtain an analysis result;
S4, classifying all the service data in the target set according to the analysis result obtained in step S3 and preset rules that fit the service characteristics;
S5, extracting features and information from each class of service data classified in step S4, based on NLP technology;
S6, performing validity processing on the features and information extracted in step S5; if they pass the validity processing, retaining the corresponding service data, and if they do not, rejecting the corresponding service data.
2. The method according to claim 1, wherein the service data sets in step S1 include only text data sets, and the documents formed include only text documents.
3. The method according to claim 1, wherein the service keyword search in step S2 includes only integrating the text documents, and the integration methods include a binning method based on keyword query rules and a clustering method based on data set grouping.
4. The method according to claim 1, wherein in step S2, a service keyword consists of at least one of, or a combination of, Chinese characters, digits, English letters, and specific symbols.
5. The method according to claim 1, wherein the semantic analysis in step S3 includes keyword matching analysis and text semantic recognition analysis, and the analysis result thus obtained includes a keyword matching result and a semantic recognition result.
6. The method according to claim 5, wherein the semantic analysis in step S3 further includes dependency syntax analysis, and the analysis result thus obtained further includes a dependency syntax analysis result.
7. The method according to claim 6, wherein the preset rules in step S4 that fit the service characteristics include at least one of, or a combination of, a keyword matching degree rule, a dependency syntax relation rule, and a text semantic result rule, and wherein:
the keyword matching degree rule extracts keywords that fit the service characteristics according to the text characteristics of a specified service, and treats text data that match those keywords as one class;
the dependency syntax relation rule extracts the grammatical structure of the service text and the relations between sentences according to the text characteristics of the specified service, and expresses them in an easily understood structured form;
the text semantic result rule uses NLP technology to perform semantic recognition on the specified service text to obtain a semantic recognition result, and extracts important information from the text in combination with the characteristics of the service text, the important information including service-related data such as numbers, dates, service names, and special numbers.
8. The method according to claim 1, wherein in step S5, the objects of feature extraction and information extraction for each class of service data are determined according to the characteristics of each service.
9. The method according to claim 1, wherein the data validity processing in step S6 includes a null value check, a value range check, a format check, and a uniqueness constraint check.
10. The method according to claim 1, wherein data extraction in step S6 includes extracting original documents and real-time data and arranging them by name and time, wherein an original document is a document already stored in a storage area, and real-time data are service data generated by a system in real time and obtained directly from the service system.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202210674200.XA | 2022-06-15 | 2022-06-15 | Method for realizing text data treatment preprocessing based on NLP technology (published as CN115114399A (en))

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202210674200.XA | 2022-06-15 | 2022-06-15 | Method for realizing text data treatment preprocessing based on NLP technology (published as CN115114399A (en))

Publications (1)

Publication Number | Publication Date
CN115114399A (en) | 2022-09-27

Family

ID=83327935

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202210674200.XA (Pending; published as CN115114399A (en)) | Method for realizing text data treatment preprocessing based on NLP technology | 2022-06-15 | 2022-06-15

Country Status (1)

Country | Link
CN (1) | CN115114399A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN116644351A (en) * | 2023-06-13 | 2023-08-25 | Shijiazhuang University (石家庄学院) | Data processing method and system based on artificial intelligence
CN116644351B (en) * | 2023-06-13 | 2024-04-02 | Shijiazhuang University (石家庄学院) | Data processing method and system based on artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination