CN111221793A

CN111221793A - Data mining method, platform, computer equipment and storage medium

Info

Publication number: CN111221793A
Application number: CN201911416631.0A
Authority: CN
Inventors: 赵立永; 吴新丽; 刘启明; 代继涛; 韩勇; 李丹
Original assignee: Xinhua Net Co ltd
Current assignee: Xinhua Net Co ltd
Priority date: 2019-12-31
Filing date: 2019-12-31
Publication date: 2020-06-02
Anticipated expiration: 2039-12-31
Also published as: CN111221793B

Abstract

The application provides a data mining method, a platform, computer equipment and a storage medium. The method comprises: acquiring a data processing request sent by an application client, the request comprising data to be processed and an identifier of the application client; analyzing the data to be processed and determining a target service type corresponding to it; uploading the data to be processed to the corresponding partition of a distributed file system according to the target service type; mining the data to be processed in the corresponding partition of the distributed file system by using a computing engine to generate a corresponding processing result; and storing the processing result and the identifier of the application client in a preset distributed publish-subscribe message system, so that the application client can acquire the processing result from that system according to its identifier. Because the systems that carry out the data mining processing are independent of one another, each can be upgraded and maintained independently, and the platform is compatible with mining every type of service data.

Description

Data mining method, platform, computer equipment and storage medium

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a data mining method, a data mining platform, a computer device, and a storage medium.

Background

With the advent of 5G technology, the content generated by users on the internet and the mobile internet will increase rapidly, and massive volumes of data will be encountered in daily life. How to process this data quickly and effectively to obtain useful information has become a problem to be solved urgently.

In the related art, in order to implement mining processing on data of different service types, corresponding mining systems need to be developed for different services. This approach not only consumes considerable manpower and material resources, but also results in poor compatibility among the mining systems corresponding to different services.

Disclosure of Invention

The embodiment of the application provides a data mining method, a platform, computer equipment and a computer readable storage medium, which are used for solving the problem that in the related art, in order to implement mining processing on data of different service types, corresponding mining systems need to be developed for different services.

To this end, an embodiment of an aspect of the present application provides a data mining method, where the method includes: acquiring a data processing request sent by an application client, wherein the processing request comprises data to be processed and an identifier of the application client; analyzing the data to be processed, and determining a target service type corresponding to the data to be processed; uploading the data to be processed to a corresponding partition of a distributed file system according to the target service type; mining the data to be processed in the corresponding partitions of the distributed file system by utilizing a computing engine to generate corresponding processing results; and storing the processing result and the identification of the application client to a preset distributed publishing and subscribing message system so that the application client can acquire the processing result from the preset distributed publishing and subscribing message system according to the identification of the application client.

Another embodiment of the present application provides a data mining platform, where the platform includes a gateway, a network proxy system, a distributed file system, a computing engine, and a distributed publish-subscribe message system: the gateway is used for acquiring a data processing request sent by an application client, wherein the processing request comprises data to be processed and an identifier of the application client, and routing the data to be processed and the identifier of the application client to the network agent system; the network agent system is used for analyzing the data to be processed, determining a target service type corresponding to the data to be processed, and uploading the data to be processed to a corresponding partition of the distributed file system according to the target service type; the distributed file system is used for storing the data to be processed with different service types by using different partitions; the computing engine is used for mining the data to be processed in the corresponding partitions of the distributed file system, generating corresponding processing results and sending the processing results and the identification of the application client to the distributed publishing and subscribing message system; the distributed publish-subscribe message system is used for storing the processing result and the identifier of the application client so that the application client can obtain the processing result from the distributed publish-subscribe message system according to the identifier of the application client.

A further embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor executes the computer program to implement the data mining method described in the first embodiment.

A further aspect of the present application provides a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, carrying out the data mining method according to the first aspect of the present application.

The technical scheme disclosed in the application has the following beneficial effects:

A network agent system analyzes the data to be processed sent by an application client, determines the target service type corresponding to the data, and uploads the data to the corresponding partition of a distributed file system according to the target service type. A computing engine obtains the data to be processed from the corresponding partition of the distributed file system and mines it, and the generated processing result and the identifier of the application client are then stored in a preset distributed publish-subscribe message system, so that the application client can obtain the processing result from that system according to its identifier. The cooperation of the network agent system, the distributed file system, the computing engine and the distributed publish-subscribe message system decouples the service processing flow. Because each system handles a different stage of the service independently of the others, each system can be modified, upgraded and maintained as needed, and no new mining system has to be developed for each different service. The cost is therefore low, mining of various types of service data is supported, and the conditions are provided for real-time and batch processing of mass data.

Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.

Drawings

The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a schematic flow chart diagram illustrating a data mining method according to an embodiment of the present application;

FIG. 2 is a schematic flow chart diagram illustrating a data mining method according to another embodiment of the present application;

FIG. 3 is an architecture diagram of a data mining platform according to one embodiment of the present application;

FIG. 4 is a schematic business process diagram of a data mining method according to another embodiment of the present application;

FIG. 5 is a schematic structural diagram of a data mining platform according to an embodiment of the present application;

FIG. 6 is a schematic diagram of a computer device according to an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a computer device according to another embodiment of the present application.

Detailed Description

Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.

In the related art, mining data of different service types requires a corresponding mining system to be developed for each service, which not only consumes considerable manpower and material resources but also results in poor compatibility among the mining systems of different services. To address these technical problems, the present application provides a data mining method.

The data mining method provided by the embodiment of the application is executed by a data mining platform, which may comprise a gateway, a network agent system, a distributed file system, a computing engine and a distributed publish-subscribe message system. The gateway acquires the data processing request sent by an application client, the request comprising data to be processed and an identifier of the application client. The network agent system analyzes the data to be processed, determines the corresponding target service type, and uploads the data to the corresponding partition of the distributed file system according to the target service type. The computing engine acquires the data to be processed from that partition and mines it, and the generated processing result and the identifier of the application client are then stored in the preset distributed publish-subscribe message system, from which the application client acquires the processing result according to its identifier. The cooperation of the network agent system, the distributed file system, the computing engine and the distributed publish-subscribe message system decouples the service processing flow. Because each system handles a different stage of the service independently of the others, each system can be modified, upgraded and maintained as needed, and no new mining system has to be developed for each different service. The cost is therefore low, mining of various types of service data is supported, and the conditions are provided for real-time and batch processing of mass data.

A data mining method, a platform, a device, and a storage medium according to an embodiment of the present application are described below with reference to the drawings.

First, referring to fig. 1, a data mining method provided in an embodiment of the present application is specifically described.

Fig. 1 is a schematic flow chart of a data mining method according to an embodiment of the present application.

As shown in fig. 1, the data mining method of the present application may include the steps of:

Step 101, a data processing request sent by an application client is obtained, and the processing request includes data to be processed and an identifier of the application client.

Step 102, analyzing the data to be processed, and determining a target service type corresponding to the data to be processed.

Specifically, the data mining method provided by the embodiment of the present application may be executed by the data mining platform provided by the embodiment of the present application. The data mining platform can be configured in computer equipment to process data processing requests sent by application clients, so that real-time and batch processing of mass data and mining analysis of the mass data are realized. Specifically, the computer device in the embodiment of the present application may be any hardware device having a data processing function, such as a computer. The computer device may also be a server, which is not limited in this application.

The identifier of the application client is used to uniquely identify the application client, and may be a name, a number, and the like of the application client, which is not limited in this application.

In the embodiment of the application, a gateway included in the data mining platform can interact with an application client to receive a data processing request sent by the application client, send to-be-processed data included in the data processing request and an identifier of the application client to a network proxy system, analyze the to-be-processed data by the network proxy system, and determine a target service type corresponding to the to-be-processed data.

Specifically, the application client may be any of a plurality of types of client, for example a person name recognition class, a place name recognition class, an organization name recognition class, a short text polarity prediction class, a topic evolution analysis class, a Chinese hot word extraction class, a netizen emotion analysis class, an automobile industry hotspot analysis class, an automobile industry class, or a news industry class, which is not limited in the present application.

The network agent system can divide services into multiple service types in advance as needed, so that after receiving the data to be processed included in the data processing request sent by the application client, it can analyze the data and determine the corresponding target service type.

The service types may include, for example, at least: emotion perception calculation, news classification prediction, emotion polarity prediction, Chinese hot word extraction, Chinese news hotspots, English news hotspots, entity word recognition, and the like.
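As a minimal sketch of how a network agent system might resolve the target service type, the following Python assumes that each request carries a hypothetical `task` field naming the requested service. The field name, the `ServiceType` members and the function `determine_service_type` are illustrative assumptions and are not defined by the application, which leaves the analysis technique open.

```python
from enum import Enum


class ServiceType(Enum):
    # Members mirror a few of the service types listed above; the string
    # values are hypothetical labels chosen for this sketch.
    EMOTION_POLARITY = "emotion_polarity"
    NEWS_CLASSIFICATION = "news_classification"
    CHINESE_HOT_WORDS = "chinese_hot_words"
    ENTITY_RECOGNITION = "entity_recognition"


def determine_service_type(payload: dict) -> ServiceType:
    """Resolve the target service type from a hypothetical 'task' field."""
    task = payload.get("task")
    try:
        return ServiceType(task)
    except ValueError:
        # Unknown or missing labels are rejected rather than guessed.
        raise ValueError(f"unsupported service type: {task!r}") from None
```

Rejecting unknown labels keeps data for unsupported services out of the distributed file system; a real implementation could equally infer the type from the content itself.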

The network agent system can adopt a micro-service architecture, and in terms of technical implementation can adopt the Spring cluster framework. Netflix Eureka (service discovery) can implement micro-service registration and discovery: the service registration center and the micro-services for specific services are distinguished by @EnableEurekaServer and @EnableEurekaClient, the registration center and the other micro-services need to be configured with the same eureka.client.serviceUrl parameter, and after the registration center and the Eureka client services are started in sequence, the Eureka clients register automatically. Config can implement the function of the configuration center; the configuration center service is implemented by adding @EnableEurekaServer and adding the repository information of the configuration files in the configuration file.

The micro-services are mainly of two types. One type provides a remote RESTful API (RESTful interface) to receive data and writes the data to be processed into the distributed file system through an RPC (Remote Procedure Call) service based on the distributed file system. The other type implements a timed scheduling service based on @EnableScheduling provided by Spring and, through an RPC interface based on the computing engine, submits processing tasks to the computing engine for execution. Message communication between the two types of micro-service can be implemented through Redis: both types observe different attributes of the same message, and task scheduling is carried out by changing attribute values.
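The attribute-based coordination between the two kinds of micro-service can be illustrated with a small sketch. It is written in Python rather than Spring for brevity, and an in-memory dictionary stands in for Redis; the attribute names `path` and `status`, the status values, and both function names are assumptions made for illustration only.

```python
from typing import Dict, List

# In-memory stand-in for one Redis hash per message. A real deployment
# would use a Redis client here; this is an assumption of the sketch.
MessageStore = Dict[str, Dict[str, str]]


def record_upload(store: MessageStore, message_id: str, path: str) -> None:
    """Receiving service: note where the data was written and mark it uploaded."""
    store[message_id] = {"path": path, "status": "uploaded"}


def schedule_pending(store: MessageStore) -> List[str]:
    """Scheduling service: claim uploaded messages by changing their status attribute."""
    claimed = []
    for attrs in store.values():
        if attrs["status"] == "uploaded":
            attrs["status"] = "submitted"  # the attribute change drives scheduling
            claimed.append(attrs["path"])
    return claimed
```

Because a message is claimed by changing its status, a second scheduling pass does not submit the same data again.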

By using the micro-service architecture, a business processing flow can be decomposed into a series of loosely coupled services, and fine-grained service division and lightweight protocols are followed.

In an exemplary embodiment, because the data volume to be processed by the data mining platform is extremely large, and the processing requirement of massive data may not be met by a single network agent system, in this embodiment of the present application, the data mining platform may be preset to include a plurality of network agent systems, and according to the load of each network agent system and the maximum load that can be borne by the network agent system, an appropriate network agent system is selected to process the currently received data to be processed. That is, after step 102, the method may further include:

determining a currently available target network agent system according to the current load capacity of each network agent system;

accordingly, step 102 may be implemented by:

and analyzing the data to be processed by utilizing the target network agent system.

Specifically, the data mining platform may further include a load balancer, where the load balancer may receive the data to be processed and the identifier of the application client sent by the gateway, and determine the currently available network proxy system as a target network proxy system according to the load of each network proxy system and the maximum load that can be borne by the network proxy system, so that after the gateway obtains the data processing request, the gateway may call the load balancer to route to the target network proxy system, so as to analyze the data to be processed by using the target network proxy system.
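One possible selection rule for the load balancer is sketched below in Python. The application only states that the choice depends on each system's current load and the maximum load it can bear; the lowest-utilisation policy, the `AgentStatus` type and the function name are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class AgentStatus:
    name: str
    current_load: int  # non-negative; the unit (e.g. requests in flight) is an assumption
    max_load: int      # maximum load the network agent system can bear


def select_target_agent(agents: Sequence[AgentStatus]) -> Optional[AgentStatus]:
    """Return the available agent with the lowest utilisation, or None if all are saturated."""
    available = [a for a in agents if a.current_load < a.max_load]
    if not available:
        return None
    # Hypothetical policy: the smallest current_load / max_load ratio wins.
    return min(available, key=lambda a: a.current_load / a.max_load)
```

For example, given agents loaded at 8 of 10, 3 of 10 and 5 of 5, the second is selected and the third is never considered because it is already at its maximum.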

Step 103, uploading the data to be processed to a corresponding partition of the distributed file system according to the target service type.

Step 104, mining the data to be processed in the corresponding partitions of the distributed file system by using a computing engine to generate corresponding processing results.

It can be understood that, in addition to analyzing the data to be processed and determining the target service type corresponding to the data to be processed, the network proxy system may also verify whether the data to be processed is normal, for example, whether the data to be processed meets a preset standard. In addition, the network agent system can also preprocess the data to be processed. Specifically, the preprocessing of the data to be processed by the network proxy system may include processing the data to be processed into data meeting a format requirement of the data processed by the computing engine according to the format requirement. For example, normalizing the grammar in the data to be processed, removing characters which do not meet the specification such as spaces, and the like. Accordingly, the data to be processed uploaded to the distributed file system may be preprocessed data.
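The verification and preprocessing described above might look like the following Python sketch. The preset standard used here (non-empty text within a length limit), the limit itself and the choice of NFKC normalisation are assumptions made for illustration; the application does not fix any of them.

```python
import re
import unicodedata


def is_valid(text: str, max_chars: int = 10_000) -> bool:
    """Hypothetical preset standard: non-blank text no longer than max_chars."""
    return 0 < len(text.strip()) <= max_chars


def preprocess(text: str) -> str:
    """Normalise Unicode forms, then remove whitespace such as spaces and line breaks."""
    normalised = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", "", normalised)
```

Removing all whitespace suits unsegmented Chinese text; for English input a real implementation would more likely collapse runs of whitespace to a single space instead.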

Specifically, in the embodiment of the application, the distributed file system may be HDFS (Hadoop Distributed File System) provided by Hadoop, which is reliable, efficient and scalable.

Specifically, the distributed file system may include a plurality of partitions, and in this embodiment, it may be preset that each partition in the distributed file system may store data to be processed of different service types, so that after the network proxy system determines a target service type corresponding to the data to be processed, the network proxy system may upload the data to be processed to a corresponding partition of the distributed file system according to the target service type. Then, the calculation engine can read the data to be processed in the corresponding partition of the distributed file system, and mine the data to be processed to generate a corresponding processing result.
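The mapping from service type to partition can be sketched as a path-building rule. The Python below assumes that a partition is a directory per service type under a hypothetical root `/mining/input`; the directory layout, the file naming and the function name are illustrative and not taken from the application.

```python
from pathlib import PurePosixPath

# Hypothetical root directory for input partitions on the distributed file system.
PARTITION_ROOT = PurePosixPath("/mining/input")


def partition_path(service_type: str, client_id: str, request_id: str) -> PurePosixPath:
    """Build the upload path for one request inside its service-type partition."""
    for part in (service_type, client_id, request_id):
        # Reject components that could escape the intended partition.
        if not part or "/" in part or part in {".", ".."}:
            raise ValueError(f"invalid path component: {part!r}")
    return PARTITION_ROOT / service_type / client_id / f"{request_id}.json"
```

Keeping the client identifier in the path is one way to let the computing engine later find the data that belongs to a given application client.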

Specifically, the fast, general-purpose engine for processing mass data provided by the Spark framework can be used as the computing engine. On top of Spark Core, Spark provides components such as Spark SQL, Spark Streaming, MLlib and GraphX, which can meet computing requirements including structured data processing, stream processing, machine learning and graph computing. Spark also provides SparkLauncher for submitting Spark applications from code and supports multiple programming languages, so it can meet the application requirements of various scenarios and therefore the mass-data mining requirements of the present application.

In specific implementation, the reading and mining processing of the data to be processed by the computing engine can be started in various ways.

Mode one

And after the computing engine detects that the corresponding partition of the distributed file system contains newly-added data, starting the reading and mining processing of the data to be processed by the computing engine. That is, before step 104, the method may further include:

the compute engine detects that the corresponding partition of the distributed file system contains the newly added data.

Specifically, the calculation engine may be configured to detect each partition of the distributed file system at a preset frequency or at a specific time, and after detecting that the corresponding partition of the distributed file system includes new data, the new data, that is, the data to be processed, may be mined to generate a corresponding processing result.

For example, the computing engine may be configured to detect each partition of the distributed file system every 1 minute, and if the network proxy system uploads the data to be processed to the partition a of the distributed file system according to the target service type, the computing engine may detect that the partition a of the distributed file system includes new data through detecting each partition of the distributed file system, so that the new data may be read, and the new data is mined to generate a corresponding processing result.
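A single detection pass of this kind can be sketched in Python as follows. A local directory tree stands in for the partitions of the distributed file system, and the `seen` set that remembers already-processed files is an assumption of the sketch; the application does not say how newly added data is recognised.

```python
from pathlib import Path
from typing import Dict, List, Set


def detect_new_files(root: Path, seen: Set[Path]) -> Dict[str, List[Path]]:
    """Scan each partition directory under root once and return unseen files per partition."""
    new_by_partition: Dict[str, List[Path]] = {}
    for partition in sorted(p for p in root.iterdir() if p.is_dir()):
        fresh = sorted(f for f in partition.iterdir() if f.is_file() and f not in seen)
        if fresh:
            new_by_partition[partition.name] = fresh
            seen.update(fresh)  # remember them so the next pass skips them
    return new_by_partition
```

A scheduler could call `detect_new_files` once a minute, matching the example above, and hand each returned file to the mining job for its partition.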

Mode two

After the computing engine receives the processing tasks submitted by the network agent system, the computing engine is started to mine the data to be processed. That is, after step 102, the method may further include:

generating a processing task corresponding to data to be processed;

the processing task is submitted to a compute engine.

Accordingly, before step 104, the method may further include:

and reading the data to be processed corresponding to the acquired processing task from the corresponding partition of the distributed file system by the computing engine.

Specifically, the network proxy system may generate a processing task corresponding to the data to be processed after uploading the data to be processed to the corresponding partition of the distributed file system, and submit the processing task to the computing engine, so that the computing engine may read the data to be processed corresponding to the acquired processing task from the corresponding partition of the distributed file system after acquiring the processing task, and perform mining processing on the data to be processed.

The processing task is used to instruct the computing engine to perform mining processing on the data to be processed, and may only include an identifier of the application client, or only include a storage location of the data to be processed in the distributed file system, and the like, which is not limited in the present application.

For example, if the processing task only includes a start instruction, the computing engine may detect whether each partition of the distributed file system includes new data after receiving the processing task, and if so, may read the new data and perform mining on the new data.

Or, assuming that the processing task only includes an identifier of one application client, after receiving the processing task, the computing engine may read data corresponding to the identifier of the application client from each partition of the distributed file system, and perform mining processing on the data corresponding to the application client.
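The different task contents discussed above can be modelled as optional fields. In this Python sketch the `ProcessingTask` type, the precedence given to `storage_path` over `client_id`, and the `files` mapping used as a stand-in for a file-system listing are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ProcessingTask:
    client_id: Optional[str] = None     # identifier of the application client, if given
    storage_path: Optional[str] = None  # location of the data in the file system, if given


def resolve_inputs(task: ProcessingTask, files: Dict[str, str]) -> List[str]:
    """Choose the paths to mine; `files` maps path -> owning client id (listing stand-in)."""
    if task.storage_path is not None:
        return [task.storage_path] if task.storage_path in files else []
    if task.client_id is not None:
        return sorted(p for p, owner in files.items() if owner == task.client_id)
    # Bare start instruction: treat every listed file as newly added data.
    return sorted(files)
```

A task that names a storage location reads exactly that file, a task that names only a client reads that client's data from every partition, and an empty task falls back to scanning.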

Step 105, storing the processing result and the identifier of the application client to a preset distributed publishing and subscribing message system, so that the application client can acquire the processing result from the preset distributed publishing and subscribing message system according to the identifier of the application client.

Specifically, the distributed publish-subscribe message system may perform data interaction with the application client, so that after the computing engine mines data to be processed and generates a corresponding processing result, and stores the processing result and the identifier of the application client in the preset distributed publish-subscribe message system, the application client may obtain the processing result from the preset distributed publish-subscribe message system according to the identifier of the application client.

It can be understood that Kafka (an open-source publish-subscribe message broker) in the related art, whose design is based on transaction logs, has advantages such as high throughput (supporting thousands of messages per transaction), millisecond-level latency, fault tolerance, durability and scalability. The application can therefore use Kafka to store the processing results and the identifiers of application clients and to exchange data with the application clients.

It can be understood that the integrated multi-service-type mining platform provided in the embodiments of the present application includes a gateway, a network agent system, a distributed file system, a computing engine and a distributed publish-subscribe message system, and can provide a big data mining and analysis service for application clients. When processing a service, the platform decouples the service processing flow into the stages of data receiving, data parsing, data mining and result response, and different systems handle the different stages separately. The network agent system and the distributed publish-subscribe message system together provide the data interaction for an application client's data processing request. The network agent system parses the data to be processed according to the data processing request and uploads it to the distributed file system in real time, thereby invoking the platform's intelligent mining and analysis service. The computing engine completes the mining task for the data to be processed and stores the processing result in the distributed publish-subscribe message system, which forms a complete service processing flow. The flow of service data through each stage follows a consistent specification, and result notification to the application client and the triggering of each stage are implemented through a message mechanism. The application client obtains the processing result from the preset distributed publish-subscribe message system according to its identifier, and the cooperation of the systems realizes real-time and batch processing and mining analysis of mass data.

In the whole business processing flow, only the data receiving and result responding stages need to carry out data interaction with the application client, so that only external service interfaces of the data receiving and result responding stages need to be defined, and internal data flow only needs to specify data formats among different stages.
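An agreed data format between stages could be as simple as one serialised record type. The Python sketch below assumes JSON as the interchange format and a hypothetical `StageRecord` with three fields; neither the format nor the field names are specified by the application.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class StageRecord:
    client_id: str     # identifier of the application client
    service_type: str  # target service type determined during parsing
    payload: str       # data to be processed, or a processing result


def encode(record: StageRecord) -> str:
    """Serialise a record for hand-off to the next stage."""
    return json.dumps(asdict(record), ensure_ascii=False, sort_keys=True)


def decode(raw: str) -> StageRecord:
    """Parse a record received from the previous stage."""
    return StageRecord(**json.loads(raw))
```

Because every stage reads and writes only this format, a stage can be reimplemented in another language without affecting its neighbours.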

Because the service processing flow is decoupled and recombined, each system handles a different stage of the service and the stages are processed independently of one another, so any implementation, including different programming languages, can be chosen freely as long as the data input and output specifications remain consistent. In the micro-service mode, a user can flexibly select the data mining services to subscribe to as needed, which makes the whole data mining platform flexible. Load balancing is adopted in the data receiving stage, so the number of network agent systems can be increased according to load, and in the data parsing stage an available network agent system can be selected according to load conditions, which makes the whole platform scalable.

The data mining method provided by the embodiment of the application determines a target service type corresponding to data to be processed by analyzing the data to be processed sent by an application client through a network agent system, uploads the data to be processed to a corresponding partition of a distributed file system according to the target service type, acquires the data to be processed from the corresponding partition of the distributed file system through a computing engine, mines the data to be processed, and stores a generated processing result and an identifier of the client to a preset distributed publishing and subscribing message system so that the application client acquires the processing result from the preset distributed publishing and subscribing message system according to the identifier of the application client. The decoupling of the service processing flow is realized by utilizing the cooperation of the network agent system, the distributed file system, the computing engine and the distributed publishing and subscribing message system, and because the processing processes of the services at different stages are mutually independent by each system, each system can be pertinently modified, upgraded and maintained according to the needs, and a new mining system does not need to be developed aiming at different services, so the cost is low, the mining of various types of service data can be compatible, and the condition is provided for the real-time and batch processing of mass data.

The data mining method of the present application is further described below with reference to fig. 2.

Fig. 2 is a schematic flow chart of a data mining method according to another embodiment of the present application.

As shown in fig. 2, the data mining method of the embodiment of the present application may include the following steps:

Step 201, a data processing request sent by an application client is obtained, and the processing request includes data to be processed and an identifier of the application client.

Step 202, determining a currently available target network proxy system according to the current load capacity of each network proxy system.

Step 203, analyzing the data to be processed by using the target network proxy system, and determining the target service type corresponding to the data to be processed.

Step 204, uploading the data to be processed to a corresponding partition of the distributed file system according to the target service type.

Step 205, generating a processing task corresponding to the data to be processed.

Step 206, submitting the processing task to a computing engine.

Step 207, the computing engine reads the data to be processed corresponding to the acquired processing task from the corresponding partition of the distributed file system.

Step 208, mining the data to be processed corresponding to the processing task by using the computing engine to generate a corresponding processing result.

The detailed implementation process and principle of the steps 201-208 can refer to the detailed description of the above embodiments, and are not described herein again.

Step 209, storing the processing result and the identifier of the application client to the preset distributed publish-subscribe message system, so that the application client obtains the processing result from the preset distributed publish-subscribe message system according to the identifier of the application client.

Specifically, the distributed publish-subscribe message system may include a plurality of message queues, and different message queues may store the processing results of data of different service types. Accordingly, when storing the processing result and the identifier of the application client in the distributed publish-subscribe message system, the computing engine may first determine, according to the target service type, the target message queue in which the processing result is to be stored, and then store the processing result and the identifier of the application client in that target message queue of the preset distributed publish-subscribe message system.

Correspondingly, when the application client obtains the processing result from the preset distributed publish-subscribe message system, the application client can obtain the processing result corresponding to the identifier of the application client from the target message queue.

In an exemplary embodiment, an application client may listen to a message queue on a distributed publish-subscribe messaging system to read processing results corresponding to an identification of the application client.
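The routing described above can be illustrated with a minimal in-memory sketch. It is not the platform's implementation: the class name `ResultBroker`, the service type strings and the client identifiers are hypothetical, and a plain Python queue stands in for the distributed publish-subscribe message system.

```python
from collections import defaultdict, deque


class ResultBroker:
    """In-memory stand-in for a publish-subscribe system with one queue per service type."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def publish(self, service_type, client_id, result):
        # The target queue is chosen from the service type; the result is
        # stored together with the identifier of the application client.
        self._queues[service_type].append((client_id, result))

    def fetch(self, service_type, client_id):
        # Return the results addressed to this client and leave the
        # results of other clients in the queue.
        queue = self._queues[service_type]
        mine = [result for cid, result in queue if cid == client_id]
        remaining = [(cid, result) for cid, result in queue if cid != client_id]
        queue.clear()
        queue.extend(remaining)
        return mine


broker = ResultBroker()
broker.publish("news_classification", "client-a", {"label": "sports"})
broker.publish("news_classification", "client-b", {"label": "finance"})
print(broker.fetch("news_classification", "client-a"))
```

In this sketch each client only sees results stored under its own identifier, which mirrors the lookup by client identifier described in the text.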

The data mining method provided by the present application is further described below with reference to the architecture diagram of the data mining platform shown in fig. 3 and the service flow diagram of the data mining method shown in fig. 4.

As shown in figs. 3 and 4, the data mining platform may include a gateway, a load balancer, a plurality of network proxy systems (only one is illustrated in fig. 4), a computing engine, a distributed file system, and a distributed publish-subscribe message system. The gateway can receive a data processing request sent by the application client, where the data processing request includes data to be processed and an identifier of the application client. The load balancer can determine a currently available target network proxy system according to the current load capacity of each network proxy system, so that the gateway can call the load balancer and route the request to the target network proxy system. After acquiring the data to be processed, the target network proxy system can analyze it, determine the corresponding target service type, preprocess the data according to the data format required by the computing engine, and then upload the preprocessed data to the corresponding partition of the distributed file system.
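As a rough illustration of how a load balancer might pick the target network proxy system, the sketch below selects the least-loaded proxy that is still below a capacity threshold. The selection rule, the function name `select_target_proxy`, the proxy names and the load figures are all assumptions; the application only states that the choice depends on the current load capacity of each proxy.

```python
def select_target_proxy(loads, capacity=1.0):
    """Return the name of the least-loaded proxy below capacity, or None if all are full."""
    available = {name: load for name, load in loads.items() if load < capacity}
    if not available:
        return None
    return min(available, key=available.get)


# Hypothetical load readings, expressed as a fraction of each proxy's capacity.
current_loads = {"proxy-1": 0.82, "proxy-2": 0.35, "proxy-3": 0.60}
target = select_target_proxy(current_loads)
print(target)
```

Other policies, such as round robin or weighted selection, would fit the same interface.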

The application client may be any of a plurality of types of clients, for example, a person name recognition type, a place name recognition type, an organization name recognition type, a short text polarity prediction type, a topic evolution analysis type, a Chinese hot word extraction type, a netizen emotion analysis type, an automobile industry hotspot analysis type, an automobile industry classification type, or a news industry classification type, which is not limited in the present application.

The service types may include at least: emotion perception calculation, news classification prediction, emotion polarity prediction, Chinese hot word extraction, Chinese news hot spots, English news hot spots, entity word recognition, and the like.

The computing engine can detect each partition of the distributed file system at regular time, and when detecting that the corresponding partition of the distributed file system contains newly-added data, can read the data to be processed, and mine the data to be processed to generate a corresponding processing result.

This approach can be implemented based on Spark Streaming. Specifically, Spark Streaming can provide timed scheduling and complete the mining tasks for newly added data by detecting whether new data exists under a specified folder. Spark Streaming can monitor whether each partition of the distributed file system has new data through textFileStream, to implement mining such as classification prediction on batch data; alternatively, a SparkSession can monitor for new tasks through textFile, to implement mining such as hot spot extraction on batch data.
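The timed detection of newly added data can be sketched without Spark. The example below is a plain Python stand-in for the directory monitoring that textFileStream provides, not Spark code; the function name `detect_new_files`, the file name and the use of a local temporary directory in place of a distributed file system partition are assumptions.

```python
import os
import tempfile


def detect_new_files(partition_dir, seen):
    """Return files in partition_dir that were not seen before, and record them in seen."""
    current = sorted(os.listdir(partition_dir))
    new_files = [name for name in current if name not in seen]
    seen.update(new_files)
    return new_files


with tempfile.TemporaryDirectory() as partition_dir:
    seen = set()
    first_scan = detect_new_files(partition_dir, seen)   # partition is still empty
    with open(os.path.join(partition_dir, "batch-001.txt"), "w") as handle:
        handle.write("text to classify\n")
    second_scan = detect_new_files(partition_dir, seen)  # picks up the new file
    third_scan = detect_new_files(partition_dir, seen)   # file was already handled
```

A scheduler would call `detect_new_files` at a fixed interval and hand each new file to the mining step.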

Or after uploading the data to be processed to the distributed file system, the network proxy system may generate a processing task corresponding to the data to be processed, and submit the processing task to the computing engine, and then the computing engine reads the data to be processed from the distributed file system when receiving the processing task, so as to mine the data to be processed, and generate a corresponding processing result.

This approach may be implemented based on Spark Core. Specifically, after the data to be processed is uploaded to the distributed file system, submission and execution of the processing task may be implemented through SparkLauncher.
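One way to picture a processing task is as a small record that is turned into a job submission. The sketch below assembles a `spark-submit` argument list rather than calling the launcher API named in the text, and it does not execute anything. The `ProcessingTask` fields, the script name `mining_job.py`, the HDFS path and the choice of `yarn` as master are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProcessingTask:
    service_type: str
    client_id: str
    input_path: str


def build_submit_command(task, app_resource, master="yarn"):
    """Assemble a spark-submit argument list for one processing task."""
    return [
        "spark-submit",
        "--master", master,
        "--name", f"mining-{task.service_type}",
        app_resource,
        # Application arguments: where to read the data and whose result it is.
        task.input_path,
        task.client_id,
    ]


task = ProcessingTask(
    service_type="hot_word_extraction",
    client_id="client-a",
    input_path="hdfs:///mining/hot_word_extraction/batch-001.txt",
)
command = build_submit_command(task, "mining_job.py")
print(" ".join(command))
```

Passing the client identifier as an application argument lets the job attach it to the result it later publishes.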

After the calculation engine generates the processing result, the processing result and the identifier of the application client can be stored in a preset distributed publish-subscribe message system. The application client can monitor a message queue on a preset distributed publish-subscribe message system, so as to obtain a processing result corresponding to the identifier of the application client from the target message queue.

Specifically, the computing engine may write the processing result into the preset distributed publish-subscribe message system through a Kafka producer, so that the application client can monitor the message queue on the preset distributed publish-subscribe message system in real time through a Kafka consumer to obtain the processing result.
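The application does not specify the message format. As one possible layout, the sketch below serializes each result as a key/value pair of bytes, with the client identifier as the key and a JSON document as the value, and shows the filtering a consumer could apply. No Kafka client library is used; `encode_result`, `results_for` and the field names are assumptions.

```python
import json


def encode_result(client_id, result):
    """Serialize a processing result as (key, value) bytes, keyed by the client identifier."""
    key = client_id.encode("utf-8")
    value = json.dumps({"client_id": client_id, "result": result}).encode("utf-8")
    return key, value


def results_for(client_id, records):
    """Decode the records and keep only the results addressed to client_id."""
    matches = []
    for key, value in records:
        if key.decode("utf-8") == client_id:
            matches.append(json.loads(value.decode("utf-8"))["result"])
    return matches


records = [
    encode_result("client-a", {"hot_words": ["economy", "election"]}),
    encode_result("client-b", {"polarity": "positive"}),
]
mine = results_for("client-a", records)
print(mine)
```

With a real producer and consumer, the same key and value bytes would be passed to the send call and read back from each consumed record.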

In this way, the service processing flow is decoupled into four stages: data receiving, data analysis, data mining, and result response. In the data receiving stage, large-scale data receiving can be implemented based on an HTTP service exposing a RESTful API and on Hadoop RPC calls, and the data is uploaded to the HDFS file system in real time. In the data analysis stage, remote submission of Spark-based mining services can be implemented using the Spring scheduling framework (@EnableScheduling), the in-memory database Redis, and SparkLauncher. In the data mining stage, components such as Spark Streaming, Spark Core and Spark ML can be used together with DataFrame to implement various data mining algorithms and services, and computing resources can be scaled according to the actual application scenario. In the result response stage, a Kafka message queue can be used to return the result.

In the data mining method of this embodiment, a network proxy system analyzes the data to be processed sent by an application client, determines the corresponding target service type, and uploads the data to be processed to the corresponding partition of a distributed file system according to the target service type. A computing engine acquires the data to be processed from the corresponding partition of the distributed file system and mines it, and then stores the generated processing result and the identifier of the application client in a preset distributed publish-subscribe message system, so that the application client can obtain the processing result from the preset distributed publish-subscribe message system according to its identifier. The cooperation of the network proxy system, the distributed file system, the computing engine and the distributed publish-subscribe message system decouples the service processing flow. Because each system handles a different stage of the processing independently, each system can be modified, upgraded and maintained separately as needed, and no new mining system has to be developed for each service. The cost is therefore low, mining of various types of service data is supported, and conditions are provided for real-time and batch processing of mass data.

The data mining platform proposed by the embodiment of the present application is described below with reference to the drawings.

Fig. 5 is a schematic structural diagram of a data mining platform according to an embodiment of the present application.

As shown in fig. 5, the data mining platform includes a gateway 21, a network proxy system 22, a distributed file system 23, a computing engine 24, and a distributed publish-subscribe message system 25.

Specifically, the data mining platform provided in the embodiment of the present application may execute the data mining method provided in the embodiment of the present application to process the data processing request sent by the application client, thereby implementing real-time and batch processing of mass data and mining analysis of mass data. The data mining platform may be configured in a computer device, and the computer device may be any hardware device having a data processing function, such as a computer and the like. The computer device may also be a server, which is not limited in this application.

The gateway 21 is configured to obtain a data processing request sent by an application client, where the processing request includes to-be-processed data and an identifier of the application client, and the gateway is further configured to send the to-be-processed data and the identifier of the application client to the network proxy system 22;

the network proxy system 22 is configured to analyze the data to be processed, determine a target service type corresponding to the data to be processed, and upload the data to be processed to a corresponding partition of the distributed file system 23 according to the target service type;

the distributed file system 23 is used for storing the data to be processed with different service types by using different partitions;

the computing engine 24 is configured to mine data to be processed in a corresponding partition of the distributed file system 23, generate a corresponding processing result, and send the processing result and the identifier of the application client to the distributed publish-subscribe message system 25;

and the distributed publish-subscribe message system 25 is configured to store the processing result and the identifier of the application client, so that the application client obtains the processing result from the distributed publish-subscribe message system according to the identifier of the application client.
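The division of the distributed file system into partitions by service type can be pictured as a mapping from service type to directory. The sketch below only computes the upload path and performs no file system access; the root directory `/mining/incoming`, the directory names and the function name `partition_path` are hypothetical.

```python
import posixpath

# Hypothetical mapping from service type to the partition (directory) that stores its data.
PARTITIONS = {
    "news_classification": "news_classification",
    "emotion_polarity": "emotion_polarity",
    "hot_word_extraction": "hot_word_extraction",
}


def partition_path(service_type, file_name, root="/mining/incoming"):
    """Return the upload path for a file of the given service type."""
    if service_type not in PARTITIONS:
        raise ValueError(f"unknown service type: {service_type}")
    return posixpath.join(root, PARTITIONS[service_type], file_name)


print(partition_path("news_classification", "batch-001.json"))
```

Keeping the mapping in one place means a new service type only needs a new entry, and the computing engine can watch each partition independently.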

As an alternative implementation form, there may be a plurality of network proxy systems 22, and the platform further includes: a load balancer;

the load balancer is used for determining the currently available target network agent system according to the current load capacity of each network agent system;

and the target network agent system is used for analyzing the data to be processed.

As another alternative implementation, the calculation engine 24 is further configured to:

it is detected whether the corresponding partition of the distributed file system 23 contains the newly added data.

As another alternative implementation form, the network proxy system 22 is further configured to:

generating a processing task corresponding to data to be processed;

submitting the processing task to the computing engine 24.

As another alternative implementation form, the computing engine 24 is further configured to:

reading the data to be processed corresponding to the acquired processing task from the corresponding partition of the distributed file system 23.

As another alternative implementation form, the calculation engine 24 is further configured to:

determining a target message queue to be stored with a processing result according to the target service type;

and storing the processing result and the identifier of the application client into a target message queue corresponding to the distributed publish-subscribe message system 25.

It should be noted that, for the implementation process and the technical principle of the data mining platform of this embodiment, reference is made to the foregoing explanation of the data mining method of the first embodiment, and details are not described here again.

The data mining platform provided in the embodiment of the present application uses a network proxy system to analyze the data to be processed sent by an application client and determine the corresponding target service type, and uploads the data to be processed to the corresponding partition of a distributed file system according to the target service type. A computing engine acquires the data to be processed from the corresponding partition of the distributed file system and mines it, and then stores the generated processing result and the identifier of the application client in a preset distributed publish-subscribe message system, so that the application client can obtain the processing result from the preset distributed publish-subscribe message system according to its identifier. The cooperation of the network proxy system, the distributed file system, the computing engine and the distributed publish-subscribe message system decouples the service processing flow. Because each system handles a different stage of the processing independently, each system can be modified, upgraded and maintained separately as needed, and no new mining system has to be developed for each service. The cost is therefore low, mining of various types of service data is supported, and conditions are provided for real-time and batch processing of mass data.

In order to implement the above embodiments, the present application also provides a computer device.

Fig. 6 is a schematic structural diagram of a computer device according to an embodiment of the present application. The computer device shown in fig. 6 is only an example, and should not bring any limitation to the function and the scope of use of the embodiments of the present application.

As shown in fig. 6, the computer device 200 includes a memory 210, a processor 220, and a computer program stored on the memory 210 and executable on the processor 220, wherein the processor 220 implements the data mining method according to the first aspect when executing the program.

In an alternative implementation form, as shown in fig. 7, the computer device 200 may include a memory 210, a processor 220, and a bus 230 connecting different components (including the memory 210 and the processor 220), wherein the memory 210 stores a computer program, and the processor 220 implements the data mining method according to the embodiment of the present application when executing the program.

Bus 230 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.

Computer device 200 typically includes a variety of computer device readable media. Such media may be any available media that is accessible by computer device 200 and includes both volatile and nonvolatile media, removable and non-removable media.

Memory 210 may also include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 240 and/or cache memory 250. The computer device 200 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 260 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 7, commonly referred to as a "hard drive"). Although not shown in FIG. 7, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 230 by one or more data media interfaces. Memory 210 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the application.

A program/utility 280 having a set (at least one) of program modules 270, including but not limited to an operating system, one or more application programs, other program modules, and program data, each of which or some combination thereof may comprise an implementation of a network environment, may be stored in, for example, the memory 210. The program modules 270 generally perform the functions and/or methodologies of the embodiments described herein.

The computer device 200 may also communicate with one or more external devices 290 (e.g., keyboard, pointing device, display 291, etc.), with one or more devices that enable a user to interact with the computer device 200, and/or with any devices (e.g., network card, modem, etc.) that enable the computer device 200 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 292. Also, computer device 200 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet) through network adapter 293. As shown, network adapter 293 communicates with the other modules of computer device 200 via bus 230. It should be appreciated that although not shown in FIG. 7, other hardware and/or software modules may be used in conjunction with computer device 200, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

It should be noted that, for the implementation process and the technical principle of the computer device of this embodiment, reference is made to the foregoing explanation of the data mining method of the first aspect, and details are not described here again.

The computer device provided by the embodiment of the present application uses a network proxy system to analyze the data to be processed sent by an application client and determine the corresponding target service type, and uploads the data to be processed to the corresponding partition of a distributed file system according to the target service type. A computing engine acquires the data to be processed from the corresponding partition of the distributed file system and mines it, and then stores the generated processing result and the identifier of the application client in a preset distributed publish-subscribe message system, so that the application client can obtain the processing result from the preset distributed publish-subscribe message system according to its identifier. The cooperation of the network proxy system, the distributed file system, the computing engine and the distributed publish-subscribe message system decouples the service processing flow. Because each system handles a different stage of the processing independently, each system can be modified, upgraded and maintained separately as needed, and no new mining system has to be developed for each service. The cost is therefore low, mining of various types of service data is supported, and conditions are provided for real-time and batch processing of mass data.

To implement the above embodiments, the present application also provides a computer-readable storage medium.

The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the data mining method described in the embodiments of the first aspect.

In an alternative implementation, the embodiments may be implemented in any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

To achieve the above embodiments, the present application further proposes a computer program product, wherein when the instructions in the computer program product are executed by a processor, the data mining method according to the foregoing embodiments is performed.

In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present application includes other implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.

It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

It will be understood by those skilled in the art that all or part of the steps of the methods in the above embodiments may be implemented by a program instructing related hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.

The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims

1. A method of data mining, comprising:

acquiring a data processing request sent by an application client, wherein the processing request comprises data to be processed and an identifier of the application client;

analyzing the data to be processed, and determining a target service type corresponding to the data to be processed;

uploading the data to be processed to a corresponding partition of a distributed file system according to the target service type;

mining the data to be processed in the corresponding partitions of the distributed file system by utilizing a computing engine to generate corresponding processing results;

and storing the processing result and the identification of the application client to a preset distributed publishing and subscribing message system so that the application client can acquire the processing result from the preset distributed publishing and subscribing message system according to the identification of the application client.

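2. The method of claim 1, wherein after obtaining the data processing request sent by the application client, the method further comprises:

determining a currently available target network proxy system according to the current load capacity of each network proxy system;

the analyzing the data to be processed includes:

analyzing the data to be processed by using the target network proxy system.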

3. The method of claim 1, wherein prior to mining the data to be processed in the corresponding partition of the distributed file system with the compute engine, further comprising:

the compute engine detects that a corresponding partition of the distributed file system contains newly added data.

4. The method of claim 1, wherein after parsing the data to be processed, further comprising:

generating a processing task corresponding to the data to be processed;

submitting the processing task to the compute engine.

5. The method of claim 4, wherein prior to mining the data to be processed in the corresponding partition of the distributed file system using the compute engine, further comprising:

and the computing engine reads the data to be processed corresponding to the acquired processing task from the corresponding partition of the distributed file system.

6. The method according to any one of claims 1 to 5, wherein the storing the processing result and the identifier of the application client to a preset distributed publish-subscribe messaging system comprises:

determining a target message queue to be stored by the processing result according to the target service type;

and storing the processing result and the identification of the application client into a target message queue corresponding to a preset distributed publish-subscribe message system.

7. A data mining platform, characterized by comprising a gateway, a network proxy system, a distributed file system, a computing engine and a distributed publish-subscribe message system, wherein:

the gateway is used for acquiring a data processing request sent by an application client, wherein the processing request comprises data to be processed and an identifier of the application client, and the gateway is also used for sending the data to be processed and the identifier of the application client to the network proxy system;

the network proxy system is used for analyzing the data to be processed, determining a target service type corresponding to the data to be processed, and uploading the data to be processed to a corresponding partition of the distributed file system according to the target service type;

the distributed file system is used for storing the data to be processed with different service types by using different partitions;

the computing engine is used for mining the data to be processed in the corresponding partitions of the distributed file system, generating corresponding processing results and sending the processing results and the identification of the application client to the distributed publishing and subscribing message system;

the distributed publish-subscribe message system is used for storing the processing result and the identifier of the application client so that the application client can obtain the processing result from the distributed publish-subscribe message system according to the identifier of the application client.

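8. The platform of claim 7, wherein the network proxy system is plural in number, the platform further comprising: a load balancer;

the load balancer is used for determining a currently available target network proxy system according to the current load capacity of each network proxy system;

and the target network proxy system is used for analyzing the data to be processed.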

9. The platform of claim 7, wherein the compute engine is further to:

and detecting whether the corresponding partition of the distributed file system contains newly added data.

10. The platform of claim 7, wherein the network proxy system is further configured to:

generating a processing task corresponding to the data to be processed;

submitting the processing task to the compute engine.

11. The platform of claim 10, wherein the compute engine is further configured to:

and reading the data to be processed corresponding to the acquired processing task from the corresponding partition of the distributed file system.

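12. The platform of any one of claims 7-11, wherein the compute engine is further configured to:

determining a target message queue in which the processing result is to be stored according to the target service type;

and storing the processing result and the identification of the application client into the target message queue corresponding to the distributed publish-subscribe message system.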

13. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor when executing the program implementing the data mining method of any one of claims 1 to 6.

14. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the data mining method of any one of claims 1 to 6.