CN113836131A - Big data cleaning method and device, computer equipment and storage medium

Info

Publication number: CN113836131A (granted as CN113836131B)
Application number: CN202111151699.8A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: data, cleaning, sub-server, original data
Inventor: 吴智炜
Original/Current Assignee: Ping An Technology Shenzhen Co Ltd
Legal status: Granted, Active
Legal events: Application filed by Ping An Technology Shenzhen Co Ltd; publication of CN113836131A; application granted; publication of CN113836131B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval of structured data, e.g. relational data
    • G06F 16/21 Design, administration or maintenance of databases
    • G06F 16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources to service a request
    • G06F 9/5027 Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F 9/5083 Techniques for rebalancing the load in a distributed system
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application discloses a big data cleaning method and device, computer equipment and a storage medium, and belongs to the technical field of big data. The method includes: configuring a mapping relation between service data types and cleaning rules and generating a cleaning rule matching table; constructing a distributed cluster comprising a plurality of sub-servers, each sub-server correspondingly processing one type of service data; distributing the cleaning rules to the corresponding sub-servers according to the service data types; determining the target service data type corresponding to original data; searching the target sub-server corresponding to the target service data type; and performing data cleaning on the original data at the target sub-server to obtain cleaning data. In addition, the present application relates to blockchain technology, and the original data can be stored in a blockchain. By constructing a distributed cluster to automatically clean different types of service data, the method and the device have high universality and adaptability and facilitate unified management of data cleaning rules.

Description

Big data cleaning method and device, computer equipment and storage medium
Technical Field
The application belongs to the technical field of big data, and particularly relates to a big data cleaning method and device, computer equipment and a storage medium.
Background
Data cleansing refers to the final procedure of finding and correcting recognizable errors in data files, including checking data consistency and processing invalid and missing values. Unlike questionnaire review, the cleaning of data after entry is typically done by computer rather than manually.
At present, in data cleaning scenarios, the data volume, data validity period and data management rules of the data generated by different business departments or different business scenarios can differ completely. Existing data cleaning schemes usually require corresponding data cleaning rules to be developed separately for each business requirement. Such schemes consume considerable manpower and material resources in the development stage of the data cleaning rules and cannot reuse shared data cleaning rules, which wastes development resources and is not conducive to the management of the data cleaning rules.
Disclosure of Invention
An embodiment of the application aims to provide a big data cleaning method, a big data cleaning device, computer equipment and a storage medium, so as to solve the technical problems that, in existing big data cleaning schemes, shared data cleaning rules cannot be reused, the development resources spent on data cleaning rules are wasted, and the data cleaning rules are difficult to manage.
In order to solve the above technical problem, an embodiment of the present application provides a big data cleaning method, which adopts the following technical solutions:
a big data cleaning method comprises the following steps:
creating a preset number of cleaning rules, configuring a mapping relation between the service data type and the cleaning rules, and generating a cleaning rule matching table;
constructing a distributed cluster, wherein the distributed cluster comprises a plurality of sub-servers, and each sub-server respectively and correspondingly processes one type of service data;
uploading the cleaning rule matching table to the distributed cluster, and distributing the cleaning rule to the corresponding sub-server according to the type of the service data;
receiving original data and determining a target service data type corresponding to the original data;
and searching a target sub-server corresponding to the target service data type in the distributed cluster, and performing data cleaning on the original data by using the target sub-server to obtain cleaning data.
Further, the step of receiving the original data and determining the type of the target service data corresponding to the original data specifically includes:
receiving original data, importing the original data into the distributed cluster, and determining a field to be cleaned in the original data;
and extracting keywords of the field to be cleaned, and determining a target data type corresponding to the original data based on the keywords.
Further, the step of receiving the original data, importing the original data into the distributed cluster, and determining a field to be cleaned in the original data specifically includes:
acquiring a requirement document corresponding to the original data, wherein the requirement document records specific requirements of data cleaning;
identifying a data structure of the original data to obtain structure information of the original data;
segmenting original data based on the structural information to obtain a plurality of data fields;
and performing semantic recognition on each data field, and obtaining the field to be cleaned in the original data based on the semantic recognition and the requirement document.
Further, the step of extracting the keywords of the field to be cleaned and determining the target data type corresponding to the original data based on the keywords specifically includes:
performing keyword identification on the fields to be cleaned to obtain keywords of all the fields to be cleaned;
integrating the extracted keywords to generate a keyword combination of the original data;
and determining the type represented by the keyword combination as the target service data type corresponding to the original data.
Further, the step of integrating the extracted keywords and generating a keyword combination of the original data specifically includes:
calculating the weight of each keyword based on a preset TF-IDF algorithm;
ranking the weights of all keywords to obtain a keyword weight sequence;
combining the keywords based on the keyword weight sequence to generate a keyword combination of the original data.
Further, the step of searching for a target sub-server corresponding to the target service data type in the distributed cluster, and performing data cleaning on the original data by using the target sub-server to obtain cleaned data specifically includes:
acquiring a business data label corresponding to each sub-server;
in the distributed cluster, comparing the target service data type with the service data label corresponding to each sub-server one by one;
and determining a target sub-server corresponding to the original data according to the comparison result, and performing data cleaning on the original data through the target sub-server to obtain cleaning data.
Further, the step of performing data cleaning on the original data through the target sub-server to obtain cleaned data specifically includes:
formatting the original data to obtain formatted data;
detecting repeated data in the formatted data, and cleaning the repeated data to obtain duplication-removing data;
and detecting error data in the duplicate removal data, and cleaning the error data to obtain cleaning data.
In order to solve the above technical problem, an embodiment of the present application further provides a big data cleaning device, which adopts the following technical scheme:
a big data washing apparatus, comprising:
the rule configuration module is used for creating a preset number of cleaning rules, configuring the mapping relation between the service data types and the cleaning rules and generating a cleaning rule matching table;
the cluster building module is used for building a distributed cluster, wherein the distributed cluster comprises a plurality of sub-servers, and each sub-server respectively and correspondingly processes one type of service data;
the rule distribution module is used for uploading the cleaning rule matching table to the distributed cluster and distributing the cleaning rule to the corresponding sub-server according to the type of the service data;
the data preprocessing module is used for receiving original data and determining a target service data type corresponding to the original data;
and the data cleaning module is used for searching a target sub-server corresponding to the target service data type in the distributed cluster, and performing data cleaning on the original data by using the target sub-server to obtain cleaning data.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which adopts the following technical solutions:
a computer device comprising a memory having computer readable instructions stored therein and a processor that when executed implements the steps of a big data cleansing method as described above.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, which adopts the following technical solutions:
a computer readable storage medium having computer readable instructions stored thereon which, when executed by a processor, implement the steps of a big data washing method as described above.
Compared with the prior art, the embodiment of the application mainly has the following beneficial effects:
the application discloses a big data cleaning method and device, computer equipment and a storage medium, and belongs to the technical field of big data. The distributed cluster is constructed, and corresponding data cleaning rules are configured for sub-servers in the distributed cluster according to the service data types, wherein each sub-server is configured with a data cleaning rule corresponding to one service data type. When the service data needs to be cleaned, the server identifies the data type of the service data to be cleaned and distributes the service data to be cleaned to the corresponding sub-servers in the distributed cluster for processing according to the data type of the service data to be cleaned. According to the method and the device, the distributed clusters are constructed to realize automatic cleaning of different types of service data, the universality and the adaptability are high, the consumption of a data cleaning rule development stage can be effectively reduced, the reuse rate of a public data cleaning rule is improved, and meanwhile unified management of the data cleaning rule is facilitated.
Drawings
In order to more clearly illustrate the solution of the present application, the drawings needed for describing the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
FIG. 1 illustrates an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 illustrates a flow diagram of one embodiment of a big data cleansing method according to the present application;
FIG. 3 illustrates a schematic structural diagram of one embodiment of a big data cleaning apparatus according to the present application;
FIG. 4 shows a schematic block diagram of one embodiment of a computer device according to the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop portable computers, desktop computers, and the like.
The server 105 may be a server that provides various services, for example, a background server that provides support for pages displayed on the terminal devices 101, 102, and 103, and may be an independent server, or a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a web service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), and a big data and artificial intelligence platform.
It should be noted that the big data cleaning method provided by the embodiment of the present application is generally executed by a server, and accordingly, the big data cleaning apparatus is generally disposed in the server.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow diagram of one embodiment of a big data cleaning method according to the present application is shown. The embodiment of the application can acquire and process related data based on artificial intelligence technology. Artificial Intelligence (AI) is the theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like. The big data cleaning method comprises the following steps:
s201, creating a preset number of cleaning rules, configuring a mapping relation between the service data types and the cleaning rules, and generating a cleaning rule matching table.
Specifically, before the distributed cluster is constructed, a preset number of cleaning rules are created through the server, where the cleaning rules include formatting rules, deduplication rules, correction rules and the like. A mapping relation between each service data type and its cleaning rules is then defined according to the requirements of the business scenario; for example, text data requires at least word segmentation, formatting and deduplication operations, so at least corresponding word segmentation, formatting and deduplication rules need to be configured for text data. Finally, the mapping relations between all the service data types and the cleaning rules are integrated to generate a cleaning rule matching table.
In this embodiment, a cleaning rule matching table is generated by creating a cleaning rule and configuring a mapping relationship between a service data type and the cleaning rule, so as to facilitate uniform management of the data cleaning rule.
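As a rough illustration of such a mapping, a minimal cleaning rule matching table might look like the sketch below; the rule names, service data types and the rules_for helper are illustrative assumptions rather than structures taken from the patent.

```python
# A minimal sketch of a cleaning rule matching table (all names are assumptions).
CLEANING_RULES = {
    "word_segmentation": lambda text: text.split(),            # word segmentation rule
    "formatting": lambda text: text.strip().lower(),           # formatting rule
    "deduplication": lambda rows: list(dict.fromkeys(rows)),   # de-duplication rule
}

# Mapping relation between service data types and the cleaning rules they need.
RULE_MATCHING_TABLE = {
    "text_data": ["word_segmentation", "formatting", "deduplication"],
    "numerical_data": ["formatting", "deduplication"],
}

def rules_for(service_data_type: str):
    """Look up the cleaning rules configured for one service data type."""
    return [CLEANING_RULES[name] for name in RULE_MATCHING_TABLE[service_data_type]]
```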
S202, constructing a distributed cluster, wherein the distributed cluster comprises a plurality of sub-servers, and each sub-server respectively and correspondingly processes one type of service data.
The distributed cluster can be built on a Spark cluster architecture. The Spark cluster architecture includes the open-source distributed storage system Tachyon, the open-source distributed resource management framework Mesos, the resource manager YARN, the large-scale parallel query engine BlinkDB, and the like. Tachyon is a memory-based distributed file system, which makes it convenient for tasks to share data and reduces the load on the JVM during computation; Mesos is a cluster manager that uses the coordination service ZooKeeper to achieve cluster fault tolerance; BlinkDB is a massively parallel query engine that allows query response times to be shortened by trading off data precision.
Specifically, the server builds a distributed cluster for realizing data cleaning based on a Spark cluster architecture, wherein the distributed cluster comprises a plurality of sub-servers, each sub-server respectively and correspondingly processes one type of service data, the distributed cluster can automatically clean different types of service data, the distributed cluster has strong universality and adaptability, and the distributed cluster can respond to the instant processing requirement of large-scale data.
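For orientation only, the sketch below shows how a cleaning job might attach to such a Spark-backed cluster from Python; the master setting, resource values and input path are placeholders, not configuration taken from the patent.

```python
from pyspark.sql import SparkSession

# A minimal sketch: open a session against the distributed cluster used for cleaning.
# "yarn" is one of the resource managers named above; the other settings are placeholders.
spark = (
    SparkSession.builder
    .appName("big-data-cleaning")
    .master("yarn")
    .config("spark.executor.instances", "4")
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)

# Hypothetical location of the raw service data to be cleaned.
raw_df = spark.read.json("hdfs:///data/raw/service_data/")
```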
S203, uploading the cleaning rule matching table to the distributed cluster, and distributing the cleaning rule to the corresponding sub-server according to the type of the business data.
Specifically, after completing the basic construction of the distributed cluster, the server uploads the cleaning rule matching table to the distributed cluster, and distributes the corresponding cleaning rule on the cleaning rule matching table to the corresponding sub-server according to the type of the service data. For example, in a specific embodiment of the present application, a cleansing rule corresponding to text data is assigned to the sub-server a, and a cleansing rule corresponding to numerical data is assigned to the sub-server B. In a more specific embodiment of the present application, the service data is policy service data, and the cleaning rule corresponding to the text data in the policy service data is assigned to the sub-server a1, and the cleaning rule corresponding to the numerical data in the policy service data is assigned to the sub-server B1.
In this embodiment, different types of service data cleaning are realized by constructing the distributed cluster and configuring corresponding data cleaning rules for each sub-server in the distributed cluster according to the service data types, and the method has strong universality and adaptability.
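Continuing the earlier sketch, rule distribution can be pictured as handing each sub-server the subset of the matching table for the service data type it is responsible for; the sub-server identifiers below are invented for the example.

```python
# Which service data type each sub-server is responsible for (assumed values).
SUB_SERVER_TYPES = {
    "sub-server-A": "text_data",
    "sub-server-B": "numerical_data",
}

def distribute_rules(rule_matching_table: dict, sub_server_types: dict) -> dict:
    """Return, per sub-server, the names of the cleaning rules it should load."""
    return {
        server: rule_matching_table[data_type]
        for server, data_type in sub_server_types.items()
    }

assignments = distribute_rules(
    {"text_data": ["word_segmentation", "formatting", "deduplication"],
     "numerical_data": ["formatting", "deduplication"]},
    SUB_SERVER_TYPES,
)
# e.g. {"sub-server-A": [...], "sub-server-B": ["formatting", "deduplication"]}
```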
S204, receiving the original data and determining the type of the target service data corresponding to the original data.
Specifically, when a data cleaning requirement exists, the server receives a data cleaning instruction and receives original data and a requirement document uploaded by the client, wherein the requirement document records the specific requirement of data cleaning. The server imports the original data uploaded by the client into the distributed cluster, predetermines the target data type corresponding to the original data, searches the sub-servers corresponding to the service data type in the distributed cluster, and finally imports the original data into the sub-servers for data cleaning.
In this embodiment, the electronic device (for example, the server shown in fig. 1) on which the big data cleaning method operates may receive the data cleaning instruction through a wired connection or a wireless connection. It should be noted that the wireless connection may include, but is not limited to, a 3G/4G connection, a WiFi connection, a Bluetooth connection, a WiMAX connection, a ZigBee connection, a UWB (Ultra Wideband) connection, and other wireless connection means now known or developed in the future.
S205, searching a target sub-server corresponding to the target service data type in the distributed cluster, and performing data cleaning on the original data by using the target sub-server to obtain cleaning data.
Specifically, after the server allocates the cleansing rule to the corresponding sub-server according to the service data type, the server generates a corresponding service data tag for each sub-server, for example, in the above embodiment, the service data tag corresponding to the sub-server a1 is "policy service data-text data", and the service data tag corresponding to the sub-server B1 is "policy service data-numerical data". When the service data is cleaned, after the server determines a target service data type corresponding to the original data, the server compares the target service data type with a service data label corresponding to each sub-server one by one, and when the target service data type is matched with the service data label corresponding to one of the sub-servers, the sub-server is used as a target sub-server, and the original data is input to the target sub-server for data cleaning, so that the cleaning data is obtained.
In the embodiment, the distributed cluster is constructed, and the corresponding data cleaning rule is configured for each sub-server in the distributed cluster according to the service data type, wherein each sub-server is configured with the data cleaning rule corresponding to one service data type. When the service data needs to be cleaned, the server identifies the data type of the service data to be cleaned and distributes the service data to be cleaned to the corresponding sub-servers in the distributed cluster for processing according to the data type of the service data to be cleaned. According to the method and the device, the distributed clusters are constructed to realize automatic cleaning of different types of service data, the universality and the adaptability are high, the consumption of a data cleaning rule development stage can be effectively reduced, the reuse rate of a public data cleaning rule is improved, and meanwhile unified management of the data cleaning rule is facilitated.
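The tag comparison described above can be sketched as a simple lookup; the tag strings mirror the example labels in the preceding paragraph and are otherwise assumptions.

```python
# Service data tags generated for each sub-server (values follow the example above).
SERVER_TAGS = {
    "sub-server-A1": "policy service data - text data",
    "sub-server-B1": "policy service data - numerical data",
}

def find_target_sub_server(target_type: str, server_tags: dict) -> str:
    """Compare the target service data type with each sub-server's tag, one by one."""
    for server, tag in server_tags.items():
        if tag == target_type:
            return server
    raise LookupError(f"no sub-server configured for type {target_type!r}")

target = find_target_sub_server("policy service data - text data", SERVER_TAGS)
# -> "sub-server-A1"; the original data is then routed to that sub-server for cleaning.
```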
Further, the step of receiving the original data and determining the type of the target service data corresponding to the original data specifically includes:
receiving original data, importing the original data into the distributed cluster, and determining a field to be cleaned in the original data;
and extracting keywords of the field to be cleaned, and determining a target data type corresponding to the original data based on the keywords.
Specifically, after receiving the original data, the server imports the original data into the distributed cluster, performs field segmentation on the original data in the distributed cluster, determines the fields to be cleaned among the segmented fields according to the requirement document, extracts the keywords of the fields to be cleaned, and performs semantic analysis on the extracted keywords to determine the target data type corresponding to the original data.
Further, the step of receiving the original data, importing the original data into the distributed cluster, and determining a field to be cleaned in the original data specifically includes:
acquiring a requirement document corresponding to the original data, wherein the requirement document records specific requirements of data cleaning;
identifying a data structure of the original data to obtain structure information of the original data;
segmenting original data based on the structural information to obtain a plurality of data fields;
and performing semantic recognition on each data field, and obtaining the field to be cleaned in the original data based on the semantic recognition and the requirement document.
Specifically, the server identifies the data structure of the original data to obtain the structure information of the original data, and performs field segmentation on the original data based on the structure information to obtain a plurality of data fields. For example, if the data structure of the original data is a multi-drop structure, the original data is segmented into fields according to the distribution of that structure. The data fields obtained by segmenting the original data include both fields to be cleaned and data fields that do not need to be cleaned. Finally, semantic recognition is performed on each data field, and the fields to be cleaned in the original data are determined based on the semantic recognition result and the requirement document.
In the embodiment, the original data is divided into a plurality of standard data fields by acquiring the structural information of the original data and dividing the original data according to the structural information of the original data, so that the fields to be cleaned in the original data can be acquired by performing semantic identification subsequently.
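A toy sketch of the segmentation and field-selection step is given below; the record layout, structure description and requirement keywords are assumptions made for illustration, and the name filter stands in for real semantic recognition.

```python
def split_fields(raw_record: dict, structure_info: list) -> dict:
    """Segment the original data into named data fields according to its structure info."""
    return {name: raw_record.get(name) for name in structure_info}

def fields_to_clean(data_fields: dict, requirement_keywords: set) -> dict:
    """Keep only the fields whose recognised meaning matches the requirement document."""
    return {
        name: value
        for name, value in data_fields.items()
        if name in requirement_keywords  # stand-in for semantic recognition
    }

record = {"policy_no": " P001 ", "holder_name": "Zhang San", "remark": None}
fields = split_fields(record, ["policy_no", "holder_name", "remark"])
to_clean = fields_to_clean(fields, {"policy_no", "remark"})
```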
Further, the step of extracting the keywords of the field to be cleaned and determining the target data type corresponding to the original data based on the keywords specifically includes:
performing keyword identification on the fields to be cleaned to obtain keywords of all the fields to be cleaned;
integrating the extracted keywords to generate a keyword combination of the original data;
and determining the type represented by the keyword combination as the target service data type corresponding to the original data.
Specifically, the server performs keyword recognition on the field to be cleaned to obtain keywords of all the fields to be cleaned, wherein the keyword recognition can be realized by adopting OCR field scanning. After finishing extracting the keywords, the server calculates the weight of each keyword, integrates the keywords based on the calculated weights, generates a keyword combination of the original data, and determines the target service data type corresponding to the original data based on the keyword combination.
In this embodiment, all the keywords of the fields to be cleaned are obtained through OCR field scanning, the weights of the keywords are calculated, and the target service data type corresponding to the original data is determined according to the keyword weights in the original data.
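A very small sketch of the final determination step, assuming a fixed mapping from keyword combinations to service data types; both the combinations and the type labels are invented for the example.

```python
# Assumed mapping from keyword combinations to target service data types.
TYPE_BY_KEYWORD_COMBINATION = {
    ("policy", "holder", "premium"): "policy service data - text data",
    ("premium", "amount", "rate"): "policy service data - numerical data",
}

def infer_target_type(keyword_combination: tuple) -> str:
    """Determine the service data type represented by a keyword combination."""
    return TYPE_BY_KEYWORD_COMBINATION.get(keyword_combination, "unknown")

infer_target_type(("policy", "holder", "premium"))
# -> "policy service data - text data"
```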
Further, the step of integrating the extracted keywords and generating a keyword combination of the original data specifically includes:
calculating the weight of each keyword based on a preset TF-IDF algorithm;
ranking the weights of all keywords to obtain a keyword weight sequence;
combining the keywords based on the keyword weight sequence to generate a keyword combination of the original data.
Specifically, the server calculates the weight of each keyword based on a preset TF-IDF algorithm, sorts the weights of all keywords in descending order to obtain a keyword weight sequence, selects the top-ranked keywords from the keyword weight sequence according to the requirements of the requirement document, and combines the top-ranked keywords to obtain the keyword combination of the original data.
Among them, TF-IDF (term frequency-inverse document frequency) is a commonly used weighting technique for information retrieval and information exploration. TF-IDF is a statistical method to evaluate the importance of a word to one of a set of documents or a corpus. The importance of a word increases in proportion to the number of times it appears in a document, but at the same time decreases in inverse proportion to the frequency with which it appears in the corpus. Various forms of TF-IDF weighting are often applied by search engines as a measure or rating of the degree of relevance between a document and a user query. In addition to TF-IDF, search engines on the internet use a ranking method based on link analysis to determine the order in which documents appear in the search results.
The method for calculating the weight of each keyword based on the preset TF-IDF algorithm specifically comprises the following steps:
calculating the word frequency of the keyword and calculating the inverse document frequency of the keyword;
and calculating the weight of the keyword based on the word frequency of the keyword and the inverse document frequency of the keyword.
The calculating of the word frequency of the keyword and the calculating of the inverse document frequency of the keyword specifically include:
determining a field to be cleaned where the keyword is located to obtain a target field;
counting the number of occurrences of the keyword in the target field to obtain a first word count, and counting the sum of the occurrences of all keywords in each field to be cleaned to obtain a second word count;
calculating the word frequency of the keyword based on the first word count and the second word count;
counting the number of target fields to obtain a first document count, and counting the total number of fields to be cleaned to obtain a second document count;
and calculating the inverse document frequency of the keyword based on the first document count and the second document count.
Specifically, the word frequency TF is calculated as:
tf_{i,j} = n_{i,j} / Σ_k n_{k,j}
where tf_{i,j} is the word frequency of the keyword t_i, n_{i,j} is the number of occurrences of the keyword t_i in a given field to be cleaned d_j, and Σ_k n_{k,j} is the sum of the occurrences of the k keywords in the fields to be cleaned.
The inverse document frequency IDF is calculated as:
idf_i = log( |D| / |{ j : t_i ∈ d_j }| )
where idf_i is the inverse document frequency of the keyword t_i, |D| is the total number of fields to be cleaned, and |{ j : t_i ∈ d_j }| is the number of fields to be cleaned that contain the keyword t_i.
In this embodiment, the weights of the keywords are calculated, the weights of the keywords are arranged in a descending order, the keywords with the highest rank in the ordering result are selected to be combined to obtain a keyword combination, and the more important keywords in the original data are combined together through the calculation and the ordering of the weights of the keywords, so that the type of the target service data corresponding to the original data can be determined more accurately.
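The weight calculation and ranking can be sketched directly from the formulas above; treating each field to be cleaned as one "document", keeping the maximum per-field weight of a keyword and cutting off at top_k are assumptions of this example, not details given in the patent.

```python
import math
from collections import Counter

def tf_idf_weights(fields: list) -> dict:
    """Compute a TF-IDF weight per keyword, with each field to be cleaned as one document."""
    doc_count = len(fields)
    doc_freq = Counter()                      # number of fields containing each keyword
    for field in fields:
        doc_freq.update(set(field))

    weights = {}
    for field in fields:
        counts = Counter(field)
        total = sum(counts.values())          # sum of keyword occurrences in this field
        for word, n in counts.items():
            tf = n / total
            idf = math.log(doc_count / doc_freq[word])
            weights[word] = max(weights.get(word, 0.0), tf * idf)
    return weights

def keyword_combination(fields: list, top_k: int = 3) -> tuple:
    """Rank keywords by weight in descending order and combine the top-ranked ones."""
    weights = tf_idf_weights(fields)
    ranked = sorted(weights, key=weights.get, reverse=True)
    return tuple(ranked[:top_k])

fields = [["policy", "holder", "policy"], ["premium", "amount"], ["policy", "premium"]]
keyword_combination(fields)
```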
Further, the step of searching for a target sub-server corresponding to the target service data type in the distributed cluster, and performing data cleaning on the original data by using the target sub-server to obtain cleaned data specifically includes:
acquiring a business data label corresponding to each sub-server;
in the distributed cluster, comparing the target service data type with the service data label corresponding to each sub-server one by one;
and determining a target sub-server corresponding to the original data according to the comparison result, and performing data cleaning on the original data through the target sub-server to obtain cleaning data.
Specifically, after a server distributes cleaning rules to corresponding sub-servers according to service data types, a corresponding service data label is generated for each sub-server, when the service data is cleaned, the server determines a target service data type corresponding to original data, the target service data type is compared with the service data label corresponding to each sub-server one by one, when the target service data type is matched with the service data label corresponding to one of the sub-servers, the sub-server is used as a target sub-server, and the original data is input to the target sub-server for data cleaning, so that cleaning data is obtained.
In this embodiment, when performing the service data cleaning, the server compares the target service data type with the service data label corresponding to each sub-server one by one, and inputs the original data to the corresponding sub-server according to the comparison result to perform the data cleaning.
Further, the step of performing data cleaning on the original data through the target sub-server to obtain cleaned data specifically includes:
formatting the original data to obtain formatted data;
detecting repeated data in the formatted data, and cleaning the repeated data to obtain duplication-removing data;
and detecting error data in the duplicate removal data, and cleaning the error data to obtain cleaning data.
Specifically, the server formats original data by calling a data formatting rule stored in a target sub-server, so that the original data are unified into standard data meeting requirements, then calls a data deduplication rule to perform deduplication processing on the duplicated data in the original data, removes redundant duplicated data, finally retrieves error data existing in the deduplication data, and calls a data correction rule to perform error content cleaning on the error data, so as to generate final cleaning data.
In this embodiment, the server selects a corresponding data cleaning rule according to a cleaning requirement on the requirement document, and cleans the original data according to the data cleaning rule to obtain the cleaning data.
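Under the same Spark assumption as before, the three cleaning stages could be expressed over a DataFrame as in the sketch below; the column names, date format and the simple validity filter used as the "correction rule" are illustrative assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cleaning-example").getOrCreate()

df = spark.createDataFrame(
    [(" P001 ", "2021-09-29", "120"),
     ("P001",   "2021-09-29", "120"),
     ("P002",   "not-a-date", "-5")],
    ["policy_no", "sign_date", "premium"],
)

# 1. Formatting rule: unify the raw fields into standard, typed values.
formatted = (
    df.withColumn("policy_no", F.trim("policy_no"))
      .withColumn("sign_date", F.to_date("sign_date", "yyyy-MM-dd"))
      .withColumn("premium", F.col("premium").cast("double"))
)

# 2. De-duplication rule: remove fully repeated records.
deduped = formatted.dropDuplicates()

# 3. Correction rule: detect and remove records that are recognisably wrong
#    (here simply rows with an unparseable date or a negative premium).
cleaned = deduped.filter(F.col("sign_date").isNotNull() & (F.col("premium") >= 0))
```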
The application discloses a big data cleaning method, and belongs to the technical field of big data. The distributed cluster is constructed, and corresponding data cleaning rules are configured for sub-servers in the distributed cluster according to the service data types, wherein each sub-server is configured with a data cleaning rule corresponding to one service data type. When the service data needs to be cleaned, the server identifies the data type of the service data to be cleaned and distributes the service data to be cleaned to the corresponding sub-servers in the distributed cluster for processing according to the data type of the service data to be cleaned. According to the method and the device, the distributed clusters are constructed to realize automatic cleaning of different types of service data, the universality and the adaptability are high, the consumption of a data cleaning rule development stage can be effectively reduced, the reuse rate of a public data cleaning rule is improved, and meanwhile unified management of the data cleaning rule is facilitated.
It is emphasized that the original data may also be stored in a node of a blockchain in order to further ensure the privacy and security of the original data.
The block chain referred by the application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware associated with computer readable instructions, which can be stored in a computer readable storage medium, and when executed, can include processes of the embodiments of the methods described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of execution of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be performed at different times, and which are not necessarily executed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
With further reference to fig. 3, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a big data cleaning apparatus, which corresponds to the method embodiment shown in fig. 2 and can be applied to various electronic devices.
As shown in fig. 3, the big data cleaning apparatus according to the present embodiment includes:
the rule configuration module 301 is configured to create a preset number of cleaning rules, configure a mapping relationship between the service data types and the cleaning rules, and generate a cleaning rule matching table;
a cluster building module 302, configured to build a distributed cluster, where the distributed cluster includes a plurality of sub-servers, and each sub-server respectively processes one type of service data correspondingly;
a rule distribution module 303, configured to upload the cleaning rule matching table to the distributed cluster, and distribute the cleaning rule to a corresponding sub-server according to the service data type;
a data preprocessing module 304, configured to receive original data and determine a target service data type corresponding to the original data;
a data cleaning module 305, configured to search a target sub-server corresponding to the target service data type in the distributed cluster, and perform data cleaning on the original data by using the target sub-server to obtain cleaned data.
Further, the data preprocessing module 304 specifically includes:
the field identification submodule is used for receiving original data, importing the original data into the distributed cluster, and determining a field to be cleaned in the original data;
and the keyword identification submodule is used for extracting the keywords of the field to be cleaned and determining the target data type corresponding to the original data based on the keywords.
Further, the field identification submodule specifically includes:
the requirement document acquisition unit is used for acquiring a requirement document corresponding to the original data, wherein the requirement document records specific requirements of data cleaning;
the structure identification unit is used for identifying the data structure of the original data to obtain the structure information of the original data;
the field segmentation unit is used for segmenting the original data based on the structural information to obtain a plurality of data fields;
and the semantic identification unit is used for performing semantic identification on each data field and obtaining the field to be cleaned in the original data based on the semantic identification and the requirement document.
Further, the keyword recognition sub-module specifically includes:
the keyword identification unit is used for carrying out keyword identification on the fields to be cleaned and acquiring keywords of all the fields to be cleaned;
the keyword combination unit is used for integrating the extracted keywords and generating a keyword combination of the original data;
and the service type judging unit is used for determining the type represented by the keyword combination as the target service data type corresponding to the original data.
Further, the keyword combination unit specifically includes:
the weight calculation subunit is used for calculating the weight of each keyword based on a preset TF-IDF algorithm;
the weight sorting subunit is used for sorting the weights of all the keywords to obtain a keyword weight sequence;
and the keyword combination subunit is used for combining the keywords based on the keyword weight sequence to generate the keyword combination of the original data.
Further, the data cleansing module 305 specifically includes:
the service tag acquisition submodule is used for acquiring a service data tag corresponding to each sub-server;
the label comparison submodule is used for comparing the target service data type with the service data labels corresponding to each sub-server one by one in the distributed cluster;
and the data cleaning submodule is used for determining a target sub-server corresponding to the original data according to the comparison result, and performing data cleaning on the original data through the target sub-server to obtain cleaning data.
Further, the data cleaning sub-module specifically includes:
the formatting unit is used for formatting the original data to obtain formatted data;
the first cleaning unit is used for detecting the repeated data in the formatted data and cleaning the repeated data to obtain the duplicate removal data;
and the second cleaning unit is used for detecting error data in the duplicate removal data and cleaning the error data to obtain cleaning data.
The application discloses a big data cleaning apparatus, which belongs to the technical field of big data. The distributed cluster is constructed, and corresponding data cleaning rules are configured for sub-servers in the distributed cluster according to the service data types, wherein each sub-server is configured with a data cleaning rule corresponding to one service data type. When the service data needs to be cleaned, the server identifies the data type of the service data to be cleaned and distributes the service data to be cleaned to the corresponding sub-servers in the distributed cluster for processing according to the data type of the service data to be cleaned. According to the method and the device, the distributed clusters are constructed to realize automatic cleaning of different types of service data, the universality and the adaptability are high, the consumption of a data cleaning rule development stage can be effectively reduced, the reuse rate of a public data cleaning rule is improved, and meanwhile unified management of the data cleaning rule is facilitated.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 4, fig. 4 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 4 comprises a memory 41, a processor 42 and a network interface 43, which are communicatively connected to each other via a system bus. It is noted that only the computer device 4 having components 41-43 is shown, but it should be understood that not all of the illustrated components need be implemented, and that more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory 41 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or a memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the computer device 4. Of course, the memory 41 may also include both internal and external storage devices of the computer device 4. In this embodiment, the memory 41 is generally used for storing an operating system installed in the computer device 4 and various types of application software, such as computer readable instructions of a big data washing method. Further, the memory 41 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute the computer readable instructions stored in the memory 41 or to process data, for example, to execute the computer readable instructions of the big data cleaning method.
The network interface 43 may comprise a wireless network interface or a wired network interface, and the network interface 43 is generally used for establishing communication connection between the computer device 4 and other electronic devices.
The application discloses a computer device, which belongs to the technical field of big data. The distributed cluster is constructed, and corresponding data cleaning rules are configured for sub-servers in the distributed cluster according to the service data types, wherein each sub-server is configured with a data cleaning rule corresponding to one service data type. When the service data needs to be cleaned, the server identifies the data type of the service data to be cleaned and distributes the service data to be cleaned to the corresponding sub-servers in the distributed cluster for processing according to the data type of the service data to be cleaned. According to the method and the device, the distributed clusters are constructed to realize automatic cleaning of different types of service data, the universality and the adaptability are high, the consumption of a data cleaning rule development stage can be effectively reduced, the reuse rate of a public data cleaning rule is improved, and meanwhile unified management of the data cleaning rule is facilitated.
The present application further provides another embodiment, namely a computer-readable storage medium storing computer-readable instructions executable by at least one processor to cause the at least one processor to perform the steps of the big data cleaning method described above.
The application discloses a storage medium, and belongs to the technical field of big data. The distributed cluster is constructed, and corresponding data cleaning rules are configured for sub-servers in the distributed cluster according to the service data types, wherein each sub-server is configured with a data cleaning rule corresponding to one service data type. When the service data needs to be cleaned, the server identifies the data type of the service data to be cleaned and distributes the service data to be cleaned to the corresponding sub-servers in the distributed cluster for processing according to the data type of the service data to be cleaned. According to the method and the device, the distributed clusters are constructed to realize automatic cleaning of different types of service data, the universality and the adaptability are high, the consumption of a data cleaning rule development stage can be effectively reduced, the reuse rate of a public data cleaning rule is improved, and meanwhile unified management of the data cleaning rule is facilitated.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
It is to be understood that the above-described embodiments are merely illustrative of some, but not all, embodiments of the present application, and that the appended drawings illustrate preferred embodiments of the application without limiting its scope. This application may be embodied in many different forms, and these embodiments are provided so that the disclosure of the application will be thorough. Although the present application has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may still be made to the technical solutions described in the foregoing embodiments, or equivalent replacements may be made to some of their features. All equivalent structures made by using the contents of the specification and the drawings of the present application, whether applied directly or indirectly in other related technical fields, fall within the protection scope of the present application.

Claims (10)

1. A big data cleaning method is characterized by comprising the following steps:
creating a preset number of cleaning rules, configuring a mapping relation between the service data type and the cleaning rules, and generating a cleaning rule matching table;
constructing a distributed cluster, wherein the distributed cluster comprises a plurality of sub-servers, and each sub-server correspondingly processes one type of service data;
uploading the cleaning rule matching table to the distributed cluster, and distributing the cleaning rule to the corresponding sub-server according to the type of the service data;
receiving original data and determining a target service data type corresponding to the original data;
and searching a target sub-server corresponding to the target service data type in the distributed cluster, and performing data cleaning on the original data by using the target sub-server to obtain cleaned data.
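As an illustrative sketch only (the claim does not prescribe any programming interface), the overall flow of claim 1 could be organized along the following lines; the SubServer class, the dispatch function, and the record format are assumptions introduced for this example.

```python
from typing import Callable

# Hypothetical sketch: one sub-server per service data type, each holding the
# cleaning rules distributed to it from the cleaning rule matching table.
class SubServer:
    def __init__(self, data_type: str,
                 rules: list[Callable[[list[dict]], list[dict]]]):
        self.data_type = data_type
        self.rules = rules

    def clean(self, records: list[dict]) -> list[dict]:
        # Apply each configured cleaning rule in order.
        for rule in self.rules:
            records = rule(records)
        return records

def dispatch(cluster: dict[str, SubServer],
             original_records: list[dict],
             target_type: str) -> list[dict]:
    # Look up the sub-server responsible for the detected service data type
    # and let it perform the actual data cleaning.
    return cluster[target_type].clean(original_records)
```

A cluster here would be built by instantiating one SubServer per service data type with the rules distributed from the cleaning rule matching table, after which dispatch routes received original data to the matching sub-server.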
2. The data cleaning method according to claim 1, wherein the step of receiving the original data and determining the type of the target service data corresponding to the original data specifically comprises:
receiving original data, importing the original data into the distributed cluster, and determining a field to be cleaned in the original data;
and extracting keywords of the field to be cleaned, and determining a target service data type corresponding to the original data based on the keywords.
3. The data cleaning method according to claim 2, wherein the step of receiving the original data, importing the original data into the distributed cluster, and determining the field to be cleaned in the original data specifically comprises:
acquiring a requirement document corresponding to the original data, wherein the requirement document records specific requirements of data cleaning;
identifying a data structure of the original data to obtain structure information of the original data;
segmenting the original data based on the structure information to obtain a plurality of data fields;
and performing semantic recognition on each data field, and obtaining the field to be cleaned in the original data based on the semantic recognition and the requirement document.
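Purely as an illustration of claim 3 (and not the recorded implementation), the structure-based segmentation and requirement-driven field selection might be sketched as follows, assuming delimiter-separated original data whose header row serves as the structure information and a requirement document reduced to a set of keywords.

```python
import csv
import io

def split_into_fields(raw_text: str, delimiter: str = ",") -> list[dict]:
    """Segment delimiter-separated original data into named data fields,
    using the header row as the structure information (an assumption)."""
    return list(csv.DictReader(io.StringIO(raw_text), delimiter=delimiter))

def fields_to_clean(records: list[dict], requirement_keywords: set[str]) -> set[str]:
    """Select the fields whose names appear in the requirement document,
    standing in for the semantic recognition step of claim 3."""
    if not records:
        return set()
    return {name for name in records[0] if name.lower() in requirement_keywords}
```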
4. The data cleaning method according to claim 2, wherein the step of extracting the keywords of the field to be cleaned and determining the target service data type corresponding to the original data based on the keywords specifically comprises:
performing keyword identification on the fields to be cleaned to obtain keywords of all the fields to be cleaned;
integrating the extracted keywords to generate a keyword combination of the original data;
and determining the type represented by the keyword combination as the target service data type corresponding to the original data.
5. The data cleaning method of claim 4, wherein the step of integrating the extracted keywords and generating the keyword combination of the original data specifically comprises:
calculating the weight of each keyword based on a preset TF-IDF algorithm;
ranking the weights of all keywords to obtain a keyword weight sequence;
combining the keywords based on the keyword weight sequence to generate a keyword combination of the original data.
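A minimal sketch of the TF-IDF weighting and keyword combination in claims 4 and 5 follows, under the assumption that each previously processed record supplies one keyword list to a reference corpus and that the top-weighted keywords form the combination; the smoothing and the cut-off are illustrative choices, not part of the claims.

```python
import math
from collections import Counter

def tfidf_weights(keywords: list[str], corpus: list[list[str]]) -> dict[str, float]:
    """Weight each keyword by its term frequency in the current record times
    its inverse document frequency over a reference corpus of keyword lists."""
    counts = Counter(keywords)
    total = len(keywords)
    n_docs = len(corpus)
    weights = {}
    for word, count in counts.items():
        tf = count / total
        df = sum(1 for doc in corpus if word in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF
        weights[word] = tf * idf
    return weights

def keyword_combination(keywords: list[str], corpus: list[list[str]],
                        top_k: int = 3) -> tuple[str, ...]:
    """Rank keywords by weight and keep the top_k as the keyword combination."""
    weights = tfidf_weights(keywords, corpus)
    ranked = sorted(weights, key=weights.get, reverse=True)
    return tuple(ranked[:top_k])
```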
6. The data cleaning method according to any one of claims 1 to 5, wherein the step of searching for a target sub-server corresponding to the target service data type in the distributed cluster and performing data cleaning on the original data by using the target sub-server to obtain cleaned data specifically includes:
acquiring a service data label corresponding to each sub-server;
in the distributed cluster, comparing the target service data type with the service data label corresponding to each sub-server one by one;
and determining a target sub-server corresponding to the original data according to the comparison result, and performing data cleaning on the original data through the target sub-server to obtain cleaned data.
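For illustration of the one-by-one label comparison in claim 6, a possible sketch is shown below; the 'label' field on each sub-server entry is an assumption made for this example.

```python
def find_target_sub_server(cluster: list[dict], target_type: str):
    """Compare the target service data type with each sub-server's service
    data label one by one; return the first matching sub-server, else None."""
    for sub_server in cluster:
        if sub_server["label"] == target_type:   # 'label' field is assumed
            return sub_server
    return None
```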
7. The data cleaning method according to claim 6, wherein the step of performing data cleaning on the original data through the target sub-server to obtain cleaned data specifically comprises:
formatting the original data to obtain formatted data;
detecting repeated data in the formatted data, and cleaning the repeated data to obtain deduplicated data;
and detecting error data in the deduplicated data, and cleaning the error data to obtain cleaned data.
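The three-stage cleaning of claim 7 (formatting, deduplication, error removal) could be sketched as follows; the concrete formatting choices and the caller-supplied validator are assumptions for illustration only.

```python
def format_records(records: list[dict]) -> list[dict]:
    """Formatting step: normalize field names and trim string values (illustrative)."""
    return [{k.strip().lower(): (v.strip() if isinstance(v, str) else v)
             for k, v in rec.items()} for rec in records]

def remove_duplicates(records: list[dict]) -> list[dict]:
    """Deduplication step: keep the first occurrence of each identical record
    (assumes hashable field values)."""
    seen, unique = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

def remove_errors(records: list[dict], is_valid) -> list[dict]:
    """Error-cleaning step: drop records rejected by a caller-supplied validator."""
    return [rec for rec in records if is_valid(rec)]

def clean(records: list[dict], is_valid=lambda rec: True) -> list[dict]:
    """Formatting -> deduplication -> error removal, in the order of claim 7."""
    return remove_errors(remove_duplicates(format_records(records)), is_valid)
```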
8. A big data cleaning device, comprising:
the rule configuration module is used for creating a preset number of cleaning rules, configuring the mapping relation between the service data types and the cleaning rules and generating a cleaning rule matching table;
the cluster building module is used for building a distributed cluster, wherein the distributed cluster comprises a plurality of sub-servers, and each sub-server correspondingly processes one type of service data;
the rule distribution module is used for uploading the cleaning rule matching table to the distributed cluster and distributing the cleaning rule to the corresponding sub-server according to the type of the service data;
the data preprocessing module is used for receiving original data and determining a target service data type corresponding to the original data;
and the data cleaning module is used for searching a target sub-server corresponding to the target service data type in the distributed cluster, and performing data cleaning on the original data by using the target sub-server to obtain cleaned data.
9. A computer device, comprising a memory in which computer readable instructions are stored and a processor which, when executing the computer readable instructions, implements the steps of the big data cleaning method according to any one of claims 1 to 7.
10. A computer readable storage medium having computer readable instructions stored thereon which, when executed by a processor, implement the steps of the big data cleaning method according to any one of claims 1 to 7.
CN202111151699.8A 2021-09-29 2021-09-29 Big data cleaning method and device, computer equipment and storage medium Active CN113836131B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111151699.8A CN113836131B (en) 2021-09-29 2021-09-29 Big data cleaning method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111151699.8A CN113836131B (en) 2021-09-29 2021-09-29 Big data cleaning method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113836131A true CN113836131A (en) 2021-12-24
CN113836131B CN113836131B (en) 2024-02-02

Family

ID=78967266

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111151699.8A Active CN113836131B (en) 2021-09-29 2021-09-29 Big data cleaning method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113836131B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017162083A1 (en) * 2016-03-25 2017-09-28 阿里巴巴集团控股有限公司 Data cleaning method and apparatus
CN107562555A (en) * 2017-08-02 2018-01-09 网宿科技股份有限公司 The cleaning method and server of duplicate data
CN109753496A (en) * 2018-11-27 2019-05-14 天聚地合(苏州)数据股份有限公司 A kind of data cleaning method for big data
WO2021072885A1 (en) * 2019-10-18 2021-04-22 平安科技(深圳)有限公司 Method and apparatus for recognizing text, device and storage medium
WO2021179481A1 (en) * 2020-03-10 2021-09-16 平安科技(深圳)有限公司 Cold start method and apparatus for personalizing and pushing data content, device and storage medium

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114579548A (en) * 2022-03-02 2022-06-03 深圳市猫头鹰智慧科技有限公司 Data cleaning system for data acquisition based on Internet of things
CN115391315A (en) * 2022-07-15 2022-11-25 生命奇点(北京)科技有限公司 Data cleaning method and device
CN114996260A (en) * 2022-08-05 2022-09-02 深圳市深蓝信息科技开发有限公司 Method and device for cleaning AIS data, terminal equipment and storage medium
CN114996260B (en) * 2022-08-05 2022-11-11 深圳市深蓝信息科技开发有限公司 Method and device for cleaning AIS data, terminal equipment and storage medium
CN115330540A (en) * 2022-10-11 2022-11-11 凯美瑞德(苏州)信息科技股份有限公司 Method and device for processing transaction data
CN116644061A (en) * 2023-07-27 2023-08-25 北京全路通信信号研究设计院集团有限公司 Data cleaning method and system for railway signal centralized monitoring system
CN116644061B (en) * 2023-07-27 2023-10-27 北京全路通信信号研究设计院集团有限公司 Data cleaning method and system for railway signal centralized monitoring system
CN117076542A (en) * 2023-08-29 2023-11-17 中国中金财富证券有限公司 Data processing method and related device
CN117076542B (en) * 2023-08-29 2024-06-07 中国中金财富证券有限公司 Data processing method and related device

Also Published As

Publication number Publication date
CN113836131B (en) 2024-02-02

Similar Documents

Publication Publication Date Title
CN113836131B (en) Big data cleaning method and device, computer equipment and storage medium
CN106844407B (en) Tag network generation method and system based on data set correlation
CN111797214A (en) FAQ database-based problem screening method and device, computer equipment and medium
CN111831636A (en) Data processing method, device, computer system and readable storage medium
CN106354856B (en) Artificial intelligence-based deep neural network enhanced search method and device
CA3051572A1 (en) An artificial-intelligence-augmented classification system and method for tender search and analysis
CN112507230A (en) Webpage recommendation method and device based on browser, electronic equipment and storage medium
CN111639077B (en) Data management method, device, electronic equipment and storage medium
CN113407785A (en) Data processing method and system based on distributed storage system
CN112529477A (en) Credit evaluation variable screening method, device, computer equipment and storage medium
CN113051480A (en) Resource pushing method and device, electronic equipment and storage medium
CN113886708A (en) Product recommendation method, device, equipment and storage medium based on user information
CN113627797A (en) Image generation method and device for employee enrollment, computer equipment and storage medium
CN107766537A (en) A kind of position search ordering method and computing device
CN116450723A (en) Data extraction method, device, computer equipment and storage medium
CN112085566B (en) Product recommendation method and device based on intelligent decision and computer equipment
CN113987206A (en) Abnormal user identification method, device, equipment and storage medium
CN113434660A (en) Product recommendation method, device, equipment and storage medium based on multi-domain classification
CN112084408A (en) List data screening method and device, computer equipment and storage medium
CN112597760A (en) Method and device for extracting domain words in document
CN109726882A (en) The method and apparatus that a kind of pair of object is evaluated
CN117520141A (en) Script recommendation method, device, equipment and storage medium based on artificial intelligence
CN117076775A (en) Information data processing method, information data processing device, computer equipment and storage medium
CN116401061A (en) Method and device for processing resource data, computer equipment and storage medium
CN116756539A (en) Project recommendation method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant