CN112035534A - Real-time big data processing method and device and electronic equipment

Info

Publication number: CN112035534A
Application number: CN202010986807.2A
Authority: CN
Inventors: 田宗耕
Original assignee: Shanghai Yitu Network Science and Technology Co Ltd
Current assignee: Shanghai Yitu Network Science and Technology Co Ltd
Priority date: 2020-09-18
Filing date: 2020-09-18
Publication date: 2020-12-04

Abstract

The application provides a real-time big data processing method, a real-time big data processing device, and electronic equipment. The method comprises the following steps: when the data of a data source is updated, sending the updated data to a first message queue; extracting tags from the updated data in the first message queue through analysis with preset rules or a model, and sending the updated data and the corresponding tags to a second message queue; storing the data in the second message queue that falls within a preset time range, together with the corresponding tags, in a first database; filtering the data in the second message queue based on the tags, and storing the data that matches the tags in a second database; and classifying the data in the second message queue by topic, screening out the data whose topics match preset topics, and storing that data in a third database. With this method, device, and electronic equipment, big data can be processed automatically in real time without manual triggering, and newly added data obtains a processing result in real time.

Description

Real-time big data processing method and device and electronic equipment

Technical Field

The present application relates to the field of big data, and in particular, to a real-time big data processing method and apparatus, and an electronic device.

Background

At present, conventional batch processing of big data requires a program to be started manually once a day to process file data on the order of 10 GB and store the results in a database or database cluster. Because the program runs only when triggered manually, data arriving between two trigger times cannot be processed immediately, so conventional batch processing lags seriously. Reading an oversized file also consumes a large amount of memory, which makes the machine costly to run and slow. In addition, data accumulates day by day, and once the database holds too much data, the query performance of batch processing drops sharply.

Disclosure of Invention

In view of this, the present application provides a real-time big data processing method, apparatus, and electronic device that process big data automatically in real time without manual triggering, so that newly added data obtains a processing result in real time. Memory usage is kept under control, so the method runs efficiently on machines with little memory; large-scale data can be stored while query performance remains efficient, which makes real-time big data processing more efficient overall.

In order to solve the technical problem, the following technical scheme is adopted in the application:

in a first aspect, the present application provides a real-time big data processing method, including:

listening to a data source: monitoring the data source in real time and, when the data of the data source is updated, sending the updated data to a first message queue;

extracting data tags: extracting tags from the updated data in the first message queue through analysis with preset rules or a model, and sending the updated data and the corresponding tags to a second message queue, wherein the tags identify key information of the data;

data classification storage, including:

storing the data in the second message queue that falls within a preset time range, together with the corresponding tags, in a first database;

filtering the data in the second message queue based on the tags, and storing the data that matches the tags in a second database;

and classifying the data in the second message queue by topic, screening out the data whose topics match preset topics, and storing that data in a third database, wherein the topics summarize the meaning of the data.

As an embodiment of the first aspect of the present application, listening to the data source includes:

starting or stopping the monitoring of the data source on a schedule.

As an embodiment of the first aspect of the present application, when a user changes a rule or model for generating the tags, new tags are generated in real time, a timed task is started, data tags are re-extracted from the data in the first database, and data classification storage is performed again, so as to update the first database, the second database, and the third database.

As an embodiment of the first aspect of the present application, the data of the data source may be any one of text data, audio data, and video data.

As an embodiment of the first aspect of the present application, extracting data tags and data classification storage are real-time stream processing tasks, which are performed under a stream processing framework.

As an embodiment of the first aspect of the present application, the stream processing framework may be any one of Flink, Storm, MapReduce, and Spark.

As an embodiment of the first aspect of the present application, the first database is a full database, and the full database is used for storing all data within a preset time range and the tags corresponding to the data; the full database is provided with a data life cycle for data stored in the full database, and the data exceeding the data life cycle is periodically cleared based on the data life cycle;

the second database is an active database, which stores the data filtered from the second message queue based on the tags, and an upper-layer application corresponding to the active database can display or query the active database;

the third database is an archive database, which stores manually screened data, and the data in the archive database is stored permanently.

In a second aspect, an embodiment of the present application provides a real-time big data processing apparatus, where the apparatus includes:

a data source listening module, which monitors the data source in real time and, when the data of the data source is updated, sends the updated data to a first message queue;

a data tag extraction module, which extracts tags from the updated data in the first message queue through analysis with preset rules or a model, and sends the updated data and the corresponding tags to a second message queue, wherein the tags identify key information of the data;

a data classification storage module comprising:

and classifying the data in the second message queue by topic, screening out the data whose topics match preset topics, and storing that data in a third database, wherein the topics summarize the meaning of the data. For example, the topic of an article summarizes the ideas the article expresses.

As an embodiment of the second aspect of the present application, the data source listening module starts or stops the monitoring of the data source on a schedule.

As an embodiment of the second aspect of the present application, when a user changes a rule or model for generating the tags, new tags are generated in real time, a timed task is started, data tags are re-extracted from the data in the first database, and data classification storage is performed again, so as to update the first database, the second database, and the third database.

As an embodiment of the second aspect of the present application, the data of the data source may be any one of text data, audio data, and video data.

As an embodiment of the second aspect of the present application, the data tag extraction module and the data classification storage module execute real-time stream processing tasks, which are performed under a stream processing framework.

As an embodiment of the second aspect of the present application, the stream processing framework may be any one of Flink, Storm, MapReduce, and Spark.

As an embodiment of the second aspect of the present application, the first database is a full database, and the full database is used for storing all data within a preset time range and the tags corresponding to the data; the full database is provided with a data life cycle for data stored in the full database, and the data exceeding the data life cycle is periodically cleared based on the data life cycle;

In a third aspect, an embodiment of the present application provides an electronic device, including a processor and a memory,

wherein the memory stores instructions, and

the processor is used for reading the instructions stored in the memory so as to execute any one of the real-time big data processing methods described above.

The technical scheme of the application has at least one of the following beneficial effects:

According to the real-time big data processing method, device, and electronic equipment of the embodiments of the application, the data source can be listened to automatically without manual triggering, and data is read as a stream, which reduces memory usage and speeds up data processing, so that newly added data obtains a processing result in real time. The embodiments provide three databases that respectively store the full data, the data currently being processed, and the permanently archived data, which supports large-scale data storage while keeping data queries efficient, making real-time big data processing more timely and more efficient.

Drawings

FIG. 1 is a scene diagram of a real-time big data processing method according to an embodiment of the present application;

FIG. 2 is a flow chart of a real-time big data processing method according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a real-time big data processing apparatus according to an embodiment of the present application;

FIG. 4 is a schematic diagram of an architecture of a real-time big data processing apparatus according to an embodiment of the present application.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The following describes embodiments of the present application with reference to specific scenarios.

FIG. 1 is an application scenario diagram of a real-time big data processing method according to an embodiment of the present application. Big data processing arises in many settings; a social network site is used here as the application scenario to illustrate the embodiment. As shown in FIG. 1, social network sites are very popular in modern society: anyone can make friends on them, post their thoughts, and record events in their lives. This means that PB-level (petabyte-level) data is processed every day on a site such as a microblog, and the data on the social network site is updated in real time. Some organizations, such as scientific research institutions and company operations departments, need to find the information they care about in this PB-level data every day for analysis. For example, they may count how often a specific keyword occurs on the social network site, analyze the preferences of a user group, mine topics that users are interested in or articles containing certain keywords, and obtain the corresponding user information through further matching. As another example, a public relations department may perform public opinion analysis on social network data, collecting media news and fan trends in real time so that problems can be found in time and resolved as early as possible.

According to the embodiment of the application, the data that the social network site updates every day is acquired automatically by listening to the data source. When new data is detected, its tags are extracted immediately and the updated data is processed through a stream processing framework, for which the Flink stream processing framework may be selected. The tags can be obtained within a short time, so that a user can analyze the key information of the data. The stream processing framework reads the data content as a stream instead of loading all the data at once, so memory usage stays low while the embodiment runs, and the small amount of data read at a time speeds up computation. Finally, the data is stored in three databases. The full database stores all the acquired data, which facilitates later data tracing, and a data life cycle is set so that the storage space of the full database does not grow too large. The active database stores the data currently being processed and can be queried quickly by an upper-layer application. The archive database stores important data permanently, which protects that data. The real-time big data processing method can therefore listen to the data source automatically without manual triggering, read data as a stream to reduce memory usage and speed up processing, and give newly added data a processing result in real time.

First, a real-time big data processing method provided by the embodiment of the present invention is described below.

As shown in fig. 2, a flowchart of a real-time big data processing method provided in an embodiment of the present invention may include the following steps:

Step S201, listening to a data source: the data source is monitored in real time, and when the data of the data source is updated, the updated data is sent to a first message queue. Specifically, a process is set up to monitor the data source in real time and acquire data as it is generated. For example, for a social network site, the data the site generates every day can be obtained through crawler technology and stored in a file. The file may be in JSON format, which is simple, easy to read and write, and supported by many programming languages. Whenever the data is updated, the listening process detects the change in the data source, reads the data in segments, and imports it into the first message queue. In this way the data source is monitored automatically and updated data is acquired in real time.
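The listening step can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the crawler appends one JSON object per line to a file, and it uses Python's in-process queue.Queue as a stand-in for the first message queue. The function name poll_new_records and the max_records parameter are hypothetical.

```python
import json
import queue


def poll_new_records(path, offset, out_queue, max_records=100):
    """Read up to max_records complete JSON lines appended after `offset`.

    Returns the new byte offset so the next poll resumes where this one stopped.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        for _ in range(max_records):
            line = f.readline()
            if not line.endswith(b"\n"):
                # End of file, or a line still being written: retry on the next poll.
                break
            offset += len(line)
            if line.strip():
                # Stand-in for publishing the record to the first message queue.
                out_queue.put(json.loads(line))
    return offset
```

Calling the function repeatedly with the returned offset reads the file in bounded segments, so memory use does not depend on the size of the file.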

Step S202, extracting data tags: tags are extracted from the updated data in the first message queue through analysis with preset rules or a model, and the updated data and the corresponding tags are sent to a second message queue, wherein the tags identify key information of the data. Tag extraction runs under a stream processing framework, which may be any one of Flink, Storm, MapReduce, and Spark. In an optional embodiment of the present application, the Flink stream processing framework is selected; its advantage is that high throughput and low latency can be achieved with little configuration. Under the Flink stream processing framework, the data of the first message queue, which may be a Kafka message queue, is read as a stream, and the data is mapped to tags by configured rules or a trained model. Processing data with the Flink stream processing framework keeps memory usage low and the amount of data per processing step small, which raises the processing speed and achieves real-time big data processing. After the tags are extracted, the data and the corresponding tags are sent to the second message queue, which may also be a Kafka message queue; Kafka message queues offer high throughput and low latency.
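A rule-based version of this step might look like the sketch below. It is only an illustration under stated assumptions: the keyword rules in TAG_RULES, the record field names, and the use of queue.Queue in place of Kafka and Flink are all hypothetical and chosen so the example runs with the Python standard library alone.

```python
import queue

# Hypothetical rules: tag name -> keywords that trigger the tag.
TAG_RULES = {
    "sports": ("football", "marathon"),
    "finance": ("stock", "interest rate"),
}


def extract_tags(text, rules=TAG_RULES):
    """Return the sorted list of tags whose keywords occur in the text."""
    lowered = text.lower()
    return sorted(tag for tag, keywords in rules.items()
                  if any(k in lowered for k in keywords))


def tag_stream(first_queue, second_queue, rules=TAG_RULES):
    """Drain the first queue one record at a time and forward tagged records."""
    while True:
        try:
            record = first_queue.get_nowait()
        except queue.Empty:
            break
        # Forward the original record together with its tags.
        second_queue.put({**record, "tags": extract_tags(record["text"], rules)})
```

Because each record is handled on its own, only one record is held in memory at a time, which mirrors the record-at-a-time reading described above. A trained model could replace extract_tags without changing tag_stream.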

Step S203, data classification storage, including:

The data in the second message queue that falls within a preset time range, together with the corresponding tags, is stored in the first database. The data in the second message queue is filtered based on the tags, and the data that matches the tags is stored in a second database. The data in the second message queue is classified by topic, and the data whose topics match preset topics is screened out and stored in a third database, wherein the topics summarize the meaning of the data. For example, the topic of an article summarizes the ideas the article expresses.
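The three storage rules of step S203 can be expressed as one routing function. The sketch below is illustrative only: the Stores class uses in-memory lists in place of real databases, and the field names event_time, tags, and topic, as well as the parameter names, are assumptions rather than anything specified in this application.

```python
from dataclasses import dataclass, field


@dataclass
class Stores:
    """In-memory stand-ins for the first, second, and third databases."""
    full: list = field(default_factory=list)
    active: list = field(default_factory=list)
    archive: list = field(default_factory=list)


def route(record, stores, now, retention_seconds, watched_tags, preset_topics):
    """Apply the three storage rules to one tagged record.

    watched_tags and preset_topics are sets; times are seconds since an epoch.
    """
    # Rule 1: data within the preset time range goes to the first database.
    if now - record["event_time"] <= retention_seconds:
        stores.full.append(record)
    # Rule 2: data carrying a watched tag goes to the second database.
    if watched_tags & set(record["tags"]):
        stores.active.append(record)
    # Rule 3: data whose topic matches a preset topic goes to the third database.
    if record.get("topic") in preset_topics:
        stores.archive.append(record)
```

The rules are independent, so a single record may land in one, two, or all three stores.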

According to an optional embodiment of the present application, the first database is a full database, which stores all data within a preset time range and the tags corresponding to the data. A data life cycle is set for the data stored in the full database, and data that exceeds the life cycle is cleared periodically, which keeps the data volume of the full database from growing too large. Keeping the full data makes it possible to reprocess it: when the tags change, the full database can be read directly under the Flink stream processing framework, the tags of the full data can be re-extracted, and the data newly hit by tag filtering can be obtained.
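The periodic clearing of expired data can be sketched as a small purge routine. This assumes, purely for illustration, that the full database is a dictionary keyed by record id and that each record carries a hypothetical stored_at timestamp in seconds; a real deployment would rely on the retention features of the chosen database instead.

```python
def purge_expired(full_db, now, lifecycle_seconds):
    """Delete records older than the life cycle and return how many were removed."""
    # Collect keys first so the dictionary is not modified while iterating.
    expired = [key for key, rec in full_db.items()
               if now - rec["stored_at"] > lifecycle_seconds]
    for key in expired:
        del full_db[key]
    return len(expired)
```

Running this from a scheduled job bounds the size of the full database to roughly one life cycle of data.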

The second database is an active database, which stores the data filtered from the second message queue based on the tags; the upper-layer application corresponding to the active database may display or query it. Taking a social network site as an example, suppose the tag is set to "favorite". The data in the second message queue is filtered based on this tag, and when a piece of data matches, the data and its tag are both stored in the active database. A user can then query the active database from the front-end interface of the upper-layer application. For example, the user sees a prompt on the front-end interface showing that the tag extracted from the data is "favorite", reads the data, and decides whether it needs further attention.

The third database is an archive database, which stores manually screened data, and the data in the archive database is stored permanently. Continuing the example above, after the user sees the prompt on the front-end interface of the upper-layer application and judges that the data is important and deserves more attention, the data can be saved to the archive database for permanent storage.

It should be noted that, in the embodiment of the present application, an ES (Elasticsearch) database may be selected as the database. ES is a distributed document database that is scalable and highly available. ES data is distributed across multiple servers and can handle PB-level data; distributed storage shares the storage load among multiple storage servers and increases data processing speed.

According to one embodiment of the present application, listening for a data source comprises:

starting or stopping the monitoring of the data source on a schedule. For example, listening may be set to start at 24:00 each day, at which point the data updated in the data source is processed.
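One simple way to decide whether listening should currently be on is to test the time of day against a daily window. The sketch below is an assumption about how such a schedule could be checked; the function name and the half-open window convention are hypothetical.

```python
from datetime import time


def listening_active(now, start, stop):
    """Return True if `now` falls inside the daily window [start, stop)."""
    if start <= stop:
        return start <= now < stop
    # The window wraps past midnight, for example 22:00 to 02:00.
    return now >= start or now < stop
```

A scheduler loop could call this with the current time of day and start or stop the listening process whenever the result changes.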

According to an optional embodiment of the application, when a user changes the rules or model for generating the tags, new tags are generated in real time, a timed task is started, data tags are re-extracted from the data in the first database, data classification storage is performed again, and the first database, the second database, and the third database are updated. That is, when the tags the user is interested in change, the embodiment may set a timed task to reprocess the data in the first database, that is, the full database. The data that hits the new tags is obtained again, updated into the active database and the archive database, and displayed on the front-end interface of the upper-layer application corresponding to the active database.
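The reprocessing task can be sketched as a pass over the full database with the new rules. This is a simplified illustration: the dictionary-based database, the keyword-rule format, and the function name reprocess are assumptions, and updating the archive database is omitted for brevity.

```python
def reprocess(full_db, rules, watched_tags):
    """Re-tag every record in the full database and rebuild the active view.

    rules maps a tag name to a tuple of keywords; watched_tags is a set.
    """
    active = {}
    for key, rec in full_db.items():
        text = rec["text"].lower()
        # Overwrite the old tags with tags derived from the new rules.
        rec["tags"] = sorted(tag for tag, keywords in rules.items()
                             if any(k in text for k in keywords))
        if watched_tags & set(rec["tags"]):
            active[key] = rec
    return active
```

Because the full database keeps the original data, changing the rules requires no new crawl: the existing records are simply re-tagged and re-filtered.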

According to an alternative embodiment of the application, the data of the data source may be any one of text data, audio data, video data. The embodiment of the present application may process any one of text data, audio data, and video data, and here, processing of text data is exemplified.

According to one embodiment of the application, the data includes an event time attribute and a processing time attribute. Carrying both attributes makes it possible to handle data that, because of external factors such as the network or the system, cannot reach the stream processing framework in time and therefore arrives out of order or late.
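The application does not say how the two time attributes are used. One common technique, shown here only as an assumed illustration, is a bounded-delay reordering buffer: records are held until the largest event time seen so far exceeds their own event time by at least max_delay, which is a simplified analogue of the watermark mechanisms found in stream processing frameworks. The class and field names are hypothetical.

```python
import heapq


class EventTimeBuffer:
    """Re-order records by event time, tolerating a bounded arrival delay."""

    def __init__(self, max_delay):
        self.max_delay = max_delay
        self._heap = []
        self._seq = 0          # tie-breaker so records themselves are never compared
        self._max_seen = None  # largest event time observed so far

    def add(self, record):
        """Buffer one record and return the records that are now safe to emit."""
        event_time = record["event_time"]
        if self._max_seen is None or event_time > self._max_seen:
            self._max_seen = event_time
        heapq.heappush(self._heap, (event_time, self._seq, record))
        self._seq += 1
        watermark = self._max_seen - self.max_delay
        ready = []
        while self._heap and self._heap[0][0] <= watermark:
            ready.append(heapq.heappop(self._heap)[2])
        return ready

    def flush(self):
        """Emit everything still buffered, in event-time order."""
        return [heapq.heappop(self._heap)[2] for _ in range(len(self._heap))]
```

Records that arrive more than max_delay behind the newest event time are emitted immediately and may therefore appear out of order; a production system would need an explicit policy for such late data.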

The embodiment of the invention also provides a real-time big data processing device. As shown in FIG. 3, the device comprises a data source listening module 3100, a data tag extraction module 3200, and a data classification storage module 3300.

The data source listening module 3100 monitors a data source in real time, and sends updated data to the first message queue when data of the data source is updated.

The data tag extraction module 3200 extracts tags from the updated data in the first message queue through analysis with preset rules or a model, and sends the updated data and the corresponding tags to the second message queue, wherein the tags identify key information of the data. In an optional embodiment of the present application, the Flink stream processing framework is selected; its advantage is that high throughput and low latency can be achieved with little configuration.

The data classification storage module 3300 includes:

storing the data in the second message queue that falls within a preset time range, together with the corresponding tags, in the first database; filtering the data in the second message queue based on the tags, and storing the data that matches the tags in a second database; and classifying the data in the second message queue by topic, screening out the data whose topics match preset topics, and storing that data in a third database, wherein the topics summarize the meaning of the data.

According to an alternative embodiment of the present application, the data source listening module 3100 starts or stops the monitoring of the data source on a schedule.

According to an optional embodiment of the application, when a user changes the rules or model for generating the tags, new tags are generated in real time, a timed task is started, data tags are re-extracted from the data in the first database, data classification storage is performed again, and the first database, the second database, and the third database are updated.

According to an alternative embodiment of the application, the data of the data source may be any one of text data, audio data, video data.

According to an alternative embodiment of the present application, the data tag extraction module 3200 and the data classification storage module 3300 perform real-time stream processing tasks, which are performed under a stream processing framework.

According to an alternative embodiment of the present application, the stream processing framework may be any one of Flink, Storm, MapReduce, and Spark. In practical application of the embodiment, the Flink stream processing framework is selected; its advantage is that high throughput and low latency can be achieved with little configuration.

According to an optional embodiment of the present application, the first database is a full database, and the full database is used for storing all data within a preset time range and tags corresponding to the data; the full database is provided with a data life cycle for data stored in the full database, and data exceeding the data life cycle are periodically cleared based on the data life cycle.

The second database is an active database for storing data filtered from the second message queue based on the tag, and the upper application corresponding to the active database may display or query the active database.

The third database is an archive database, the archive database is used for storing data which is manually screened, and the data in the archive database is permanently stored.

In an optional embodiment of the present application, the database is an ES (Elasticsearch) database. ES is a distributed document database that is scalable and highly available; its data is distributed across multiple servers, and it can handle PB-level data.

The present application further provides an electronic device comprising a processor and a memory,

the memory stores instructions, and the processor is used for reading the instructions stored in the memory to execute the following steps in the real-time big data processing method:

As shown in FIG. 4, listening to a data source: the data source is monitored in real time, and when the data of the data source is updated, the updated data is sent to a first message queue. In the figure, a text data source in JSON format is used as an example, and the collected data enters the first message queue.

Extracting data tags: the data of the first message queue is read, tags are extracted from the updated data through analysis with preset rules or a model, and the updated data and the corresponding tags are sent to a second message queue.

Data classification storage, including: storing the data in the second message queue that falls within a preset time range, together with the corresponding tags, in a first database; filtering the data in the second message queue based on the tags, and storing the data that matches the tags in a second database; and classifying the data in the second message queue by topic, screening out the data whose topics match preset topics, and storing that data in a third database, wherein the topics summarize the meaning of the data.

Therefore, the real-time big data processing method, device, and electronic equipment can listen to the data source automatically without manual triggering, read data as a stream to reduce memory usage and speed up data processing, and give newly added data a processing result in real time.

It is noted that, in the examples and descriptions of this patent, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.

The foregoing is a preferred embodiment of the present application, and it should be noted that, for those skilled in the art, several modifications and refinements can be made without departing from the principle described in the present application, and these modifications and refinements should be regarded as the protection scope of the present application.

Claims

1. A real-time big data processing method is characterized by comprising the following steps:

data classification storage, including:

2. The method of claim 1, wherein listening for the data source comprises:

and starting or closing the monitoring of the data source in a timing mode.

3. The method of claim 1, wherein when a user changes a rule or a model for generating the tag, a new tag is generated in real time, a timing task is started, a data tag is re-extracted from the data in the first database, and the data is classified and stored, and the first database, the second database and the third database are updated.

4. The method according to claim 1, wherein the data of the data source can be any one of text data, audio data, and video data.

5. The method of claim 1, wherein the extracting data tags and the data classification store are real-time streaming tasks that are performed under a streaming framework.

6. The method of claim 5, wherein the stream processing framework can select any one of Flink, Storm, MapReduce, Spark.

7. The method according to claim 1, wherein the first database is a full database for storing all data within a preset time range and the tags corresponding to the data; the full database is provided with a data life cycle for data stored in the full database, and the data exceeding the data life cycle is periodically cleared based on the data life cycle;

8. The method of claim 1, wherein the data comprises an event time attribute of the data and a processing time attribute of the data.

9. A real-time big data processing apparatus, the apparatus comprising:

a data classification storage module comprising:

10. An electronic device comprising a processor and a memory,

the memory stores instructions, and

the processor is used to read the instructions stored in the memory to perform the method of any of claims 1-8.