CN112039968A

CN112039968A - Data processing system

Info

Publication number: CN112039968A
Application number: CN202010865098.2A
Authority: CN
Inventors: 王雪京; 李伟男; 王鑫; 苏超; 乔立新
Original assignee: China Media Group
Current assignee: China Media Group
Priority date: 2020-08-25
Filing date: 2020-08-25
Publication date: 2020-12-04

Abstract

A data processing system, comprising a data calculation scheduling device, a file server, a data analysis engine, an ETL server and a database. The service in the data calculation scheduling device is implemented with the Spring MVC framework and is used for storing collected data to the file server, generating a calculation task according to the collected data, adding the calculation task to a message queue in the data calculation scheduling device, and sending the data to the ETL server and the database. The ETL server is used for processing the data, storing processing marks to the database, and sending calculation results to an ETL result queue in the data calculation scheduling device. The data analysis engine analyzes the data in the queue of the data calculation scheduling device and sends the calculation result to a calculation result queue in the data calculation scheduling device and to the database. With this scheme, a large amount of data can be processed with high quality while the accuracy of the data is ensured.

Description

Data processing system

Technical Field

The present application relates to broadcast television technology, and in particular, to a data processing system.

Background

When analyzing and processing the television service data of a television station, the prior art generally uses the RGui statistical analysis software to collect, analyze, mine and display the data. However, because the volume of viewing data of a television station is very large, RGui analysis consumes a large amount of server memory and is likely to cause server abnormality through improper memory management.

Problem existing in the prior art:

The server becomes abnormal when the volume of viewing data is very large.

Disclosure of Invention

The embodiments of the present application provide a data processing system to solve the above technical problem.

An embodiment of the present application provides a data processing system, including: a data calculation scheduling device, a file server, a data analysis engine, an ETL server and a database, wherein,

the service in the data calculation scheduling device is implemented with the Spring MVC framework and is used for storing the collected data to the file server, generating a calculation task according to the collected data, adding the calculation task to a message queue in the data calculation scheduling device, and sending the data to the ETL server and the database;

the ETL server is used for processing the data, storing the processing marks to a database and sending the calculation results to an ETL result queue in the data calculation scheduling device;

and the data analysis engine analyzes the data in the queue of the data calculation scheduling device and sends the calculation result to a calculation result queue and a database in the data calculation scheduling device.

The data processing system provided by the embodiments of the present application decomposes complex calculation across multiple servers. The data calculation scheduling device, the file server, the data analysis engine, the ETL server and the database cooperate in the calculation, and the indexes are distributed to different modules for calculation. Because no centralized calculation is needed, the memory requirement on any single server is reduced. With this data processing system, a large amount of data can be processed with high quality while the accuracy of the data is ensured.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

FIG. 1 is a block diagram of a data processing system according to the first embodiment of the present application;

FIG. 2 is a schematic structural diagram of a data processing system according to the second embodiment of the present application;

FIG. 3 is a schematic diagram illustrating the relationships between the television analysis indexes in the second embodiment of the present application;

FIG. 4 is a model diagram of index calculation in the second embodiment of the present application.

Detailed Description

In order to make the technical solutions and advantages of the embodiments of the present application clearer, exemplary embodiments of the present application are described in further detail below with reference to the accompanying drawings. Clearly, the described embodiments are only a part of the embodiments of the present application and are not an exhaustive list of all embodiments. It should be noted that the embodiments and the features of the embodiments in the present application may be combined with each other provided there is no conflict.

Embodiment One

FIG. 1 is a schematic structural diagram of a data processing system according to the first embodiment of the present application.

As shown in the figure, the data processing system includes a data calculation scheduling device, a file server, a data analysis engine, an ETL server and a database.

In one embodiment, the data calculation scheduling device includes:

a data maintenance module, used for collecting viewing data, storing the viewing data to the file server, generating a calculation task according to the collected viewing data, and adding the calculation task to a viewing queue in the data calculation scheduling device; and

a viewing module, used for analyzing the viewing data according to the calculation tasks in the viewing queue, sending the analyzed viewing data to the ETL server and the database, and sending the viewing data calculation results in the ETL result queue to the data analysis engine.

In one embodiment, the data analysis engine is further configured to save the state of each stage to the database while analyzing the data in the ETL result queue.

In one embodiment, the data calculation scheduling device further includes:

a new media module, used for analyzing, according to the collected viewing data and preset new media analysis indexes, the new media data in the viewing data that relates to the new media analysis indexes, sending the analyzed new media data to the ETL (extract, transform, load) server and the database, and sending the new media data calculation results in a new media queue of the data calculation scheduling device to the data analysis engine.

In one embodiment, the data calculation scheduling device further includes:

a comprehensive evaluation module, used for analyzing, according to the collected viewing data and preset comprehensive evaluation analysis indexes, the comprehensive evaluation data in the viewing data that relates to the comprehensive evaluation analysis indexes, sending the analyzed comprehensive evaluation data to the ETL server and the database, and sending the comprehensive evaluation data calculation results in a comprehensive evaluation queue of the data calculation scheduling device to the data analysis engine.

In one embodiment, the analysis by the data analysis engine of the data in the queue of the data calculation scheduling device includes:

cleaning the viewing sample data, the viewing characteristic data, the CSM channel table and the CSM program list in the queue of the data calculation scheduling device; and

calculating index metadata according to the viewing sample data, the viewing characteristic data, the CSM channel table and the CSM program list.

In one embodiment, calculating the index metadata according to the viewing sample data, the viewing characteristic data, the CSM channel table and the CSM program list includes:

grouping and aggregating the viewing sample data by the user field uid and the user behavior field mid to obtain user viewing behavior data;

grouping the user viewing behavior data by time to obtain user viewing behavior per minute data;

aggregating the viewing characteristic data by the user field uid to obtain user information table data;

joining the user information table data with the user viewing behavior per minute data to obtain user viewing behavior and weight per minute data;

associating the user viewing behavior and weight per minute data with the CSM channel table and the CSM program list to obtain channel per minute audience flow detail table data; and

determining the per-minute inflow and outflow of users for each channel according to the channel per minute audience flow detail table data.

In one embodiment, calculating the index metadata includes:

joining the user information table data with the user viewing behavior data to obtain user viewing behavior and weight data;

determining user time period viewing information, channel daily viewing information and channel ID according to the user viewing behavior and weight data; and

calculating the audience rating of the channel in the nine major time periods according to the user time period viewing information, and calculating the daily audience size of the channel according to the channel daily viewing information and the channel ID.

In one embodiment, calculating the index metadata includes:

calculating channel per minute basic data according to the user viewing behavior and weight per minute data and the CSM channel table; and

calculating the per-minute audience rating, the viewing duration, the per-minute audience composition and the per-minute inflow and outflow data of the channel according to the channel per minute basic data.

In one embodiment, calculating the index metadata includes:

calculating channel daily viewing data according to the channel per minute basic data and the user information table data; and

determining the daily average viewing duration of the channel according to the channel daily viewing data.

Embodiment Two

In order to facilitate the implementation of the present application, the embodiments of the present application are described with a specific example.

Through the combination of Spring MVC, NFS, ActiveMQ (AMQ) and Oracle, the data processing system provided by the embodiments of the present application can ensure system stability and efficient service processing even when the volume of service data is very large, and can recalculate when the service data is inaccurate.

Fig. 2 is a schematic structural diagram of a data processing system according to a second embodiment of the present application.

As shown in the figure, the data processing system includes a data calculation scheduling module, a file server, a data analysis engine, an ETL server and a database.

The data calculation scheduling module comprises a Web system and an MQ queue set. The Web system comprises the UI of the data maintenance module, a viewing module, a new media module, a comprehensive evaluation module and other functional modules. The MQ queue set comprises a viewing queue, a new media queue, a comprehensive evaluation queue, a calculation result queue, an ETL result queue and other queues. The file server holds archived files and temporary files, the data analysis engine may include one or more computing nodes, and the database may be an Oracle database.

To make the charts more attractive, the UI of the data maintenance module uses the e-chart scheme to visualize data. The back-office management uses Angular to separate the front end from the back end, and the services in the data maintenance module obtain viewing data files through RESTful interface services in the Spring MVC framework.

To ensure that the system calculates stably, ActiveMQ message queues (including a viewing queue, a new media queue, a comprehensive evaluation queue, a basic result queue, an ETL queue and the like) are added to relieve the calculation pressure and keep the servers stable. At the same time, the large number of calculation steps are separated from one another, which reduces the memory requirement on each server.
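The decoupling effect of a message queue can be illustrated with a minimal sketch. It uses Python's standard `queue.Queue` as an in-memory stand-in for the ActiveMQ viewing queue, which is an assumption made only for illustration; the task fields, file names and the placeholder computation are hypothetical and do not come from the application.

```python
import queue
import threading

# Hypothetical in-memory stand-in for the ActiveMQ "viewing queue";
# a real deployment would use a message broker, not queue.Queue.
viewing_queue = queue.Queue(maxsize=100)  # bounded: producers block instead of exhausting memory
results = []

def viewing_worker():
    """Consume calculation tasks one at a time until a None sentinel arrives."""
    while True:
        task = viewing_queue.get()
        if task is None:
            break
        # Placeholder computation: count the records referenced by the task.
        results.append((task["file"], len(task["records"])))

worker = threading.Thread(target=viewing_worker)
worker.start()

# The producer only enqueues small task descriptions and returns immediately.
for day in ("2020-08-01", "2020-08-02"):
    viewing_queue.put({"file": f"viewing_{day}.csv", "records": [1, 2, 3]})
viewing_queue.put(None)  # sentinel: no more tasks
worker.join()

print(results)
```

Because the worker handles one task at a time, peak memory depends on the size of a single task rather than on the total backlog, which is the property the paragraph above attributes to the queue.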

Because television analysis involves a great many indexes that depend on and influence one another, and because query performance must also be ensured, the embodiments of the present application classify the indexes into different modules: the common indexes are placed in the data analysis engine, and each module performs its own in-depth data processing.

All of the above modules are implemented using Spring MVC.

In the embodiments of the present application, the original data files are collected and then placed on the file server, which is implemented with NFS. This allows multiple computing module servers to share the same file server and ensures that the computing servers use consistent original data.

The ETL server is also implemented with Spring MVC. It cleans the collected data, which ensures data quality and removes duplicate data.

An Oracle database is used as the database. Some data redundancy is kept to ensure query efficiency, and complex business calculations are processed on the basis of the relational data.
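As an illustration of the kind of cleaning described here, the following sketch drops incomplete records and exact duplicates. The record layout (`uid`, `mid`, `start`, `end`) and the validity rule are assumptions for the example; the application does not specify the cleaning rules.

```python
def clean_records(records):
    """Drop incomplete rows and exact duplicates, preserving first-seen order."""
    seen = set()
    cleaned = []
    for rec in records:
        uid, mid = rec.get("uid"), rec.get("mid")
        start, end = rec.get("start"), rec.get("end")
        # Assumed validity rule: all fields present and a positive duration.
        if None in (uid, mid, start, end) or end <= start:
            continue
        key = (uid, mid, start, end)
        if key in seen:
            continue  # exact duplicate of an earlier record
        seen.add(key)
        cleaned.append({"uid": uid, "mid": mid, "start": start, "end": end})
    return cleaned

raw = [
    {"uid": "u1", "mid": "c1", "start": 0, "end": 3},
    {"uid": "u1", "mid": "c1", "start": 0, "end": 3},  # duplicate
    {"uid": "u2", "mid": "c1", "start": 5, "end": 5},  # zero duration
    {"uid": "u3", "mid": None, "start": 1, "end": 2},  # missing field
]
print(clean_records(raw))
```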

The business data processing flow comprises the following steps:

1. A user enters raw data (e.g., viewing files) into the system through the Web system.

2. The data calculation scheduling module saves the file to NFS (the file server) as an archived file.

3. Because the calculation amount is large, the viewing calculation task is added to the viewing queue to keep service performance stable.

4. The viewing module receives the task and starts analyzing the data.

5. The analyzed raw data is sent to the Oracle database and the ETL server.

6. The ETL service cleans and processes the data and stores an ETL processing mark in the Oracle database.

7. The ETL calculation result message is sent to the ETL result queue.

8. The viewing module receives the message from the ETL result queue.

9. The viewing module sends the data to the data analysis engine for data analysis.

10. During the analysis, the state of each stage is saved in the Oracle database.

11. The data analysis engine sends the calculation result to the calculation result queue.

12. The calculation result is stored in the database.
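The flow above can be sketched end to end with in-memory stand-ins. Everything concrete in this example is an assumption for illustration: the deques stand in for the MQ queues, the dict stands in for the Oracle database, and the state names and placeholder computations are hypothetical.

```python
from collections import deque

# Hypothetical in-memory stand-ins: deques for the MQ queues, a dict for the database.
viewing_queue, etl_result_queue, calc_result_queue = deque(), deque(), deque()
db = {"raw": [], "etl_marks": [], "stage_states": [], "results": []}

def submit(task_id, rows):
    """Steps 1-3: accept raw data and enqueue a viewing calculation task."""
    viewing_queue.append({"id": task_id, "rows": rows})

def parse_and_etl_step():
    """Steps 4-7: take a task, store the raw data, clean it, publish the ETL result."""
    task = viewing_queue.popleft()
    db["raw"].extend(task["rows"])
    cleaned = sorted(set(task["rows"]))  # placeholder for ETL cleaning
    db["etl_marks"].append((task["id"], "ETL_DONE"))
    etl_result_queue.append({"id": task["id"], "rows": cleaned})

def analysis_step():
    """Steps 8-12: analyze the ETL result, record stage states, store the result."""
    msg = etl_result_queue.popleft()
    db["stage_states"].append((msg["id"], "ANALYZING"))
    result = {"id": msg["id"], "row_count": len(msg["rows"])}  # placeholder analysis
    db["stage_states"].append((msg["id"], "DONE"))
    calc_result_queue.append(result)
    db["results"].append(calc_result_queue.popleft())

submit("task-1", [3, 1, 3, 2])
parse_and_etl_step()
analysis_step()
print(db["results"], db["stage_states"])
```

Recording a state at each stage is what would let an operator see where a task stopped if a later step fails.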

FIG. 3 is a schematic diagram illustrating the relationships between the television analysis indexes in the second embodiment of the present application.

As shown in the figure, after the data is collected and cleaned, four types of original data are obtained in the embodiments of the present application: viewing sample data, viewing characteristic data, the CSM channel table (dictionary) and the CSM program list. Intermediate data is derived by aggregating the original data tables, and the index metadata is finally obtained.

For example, the specific algorithm for the channel per minute inflow and outflow detail index includes:

i. Grouping and aggregating the "viewing sample data" table by the fields user (uid) and user behavior (mid) to calculate "user viewing behavior".

ii. Grouping the user viewing behavior by time to obtain "user viewing behavior per minute" data.

iii. Aggregating the "viewing characteristic data" by the user field (uid) to obtain "user information" table data.

iv. Joining the "user information" table with the "user viewing behavior per minute" data table on user (uid) to obtain the "user viewing behavior and weight per minute" data table.

v. Associating the "user viewing behavior and weight per minute" table with the "CSM channel table (dictionary code)" and the "CSM program list" by a large tag to obtain the "channel per minute audience flow detail" table data.

vi. Observing the channel per minute inflow and outflow details in real time through the "channel per minute audience flow detail" table.
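The steps above can be sketched in a few lines of plain Python. The sketch rests on several assumptions that are not stated in the application: `mid` is treated as a key into the CSM channel table, viewing spans are given as start and end minutes (end exclusive), each user carries a single weight, and inflow and outflow are measured as the weighted users present in one minute but not in the adjacent one. The CSM program list is omitted for brevity.

```python
from collections import defaultdict

# Hypothetical input rows; field meanings beyond uid and mid are assumptions.
viewing_sample = [  # (uid, mid, start_minute, end_minute), end exclusive
    ("u1", "c1", 0, 2),
    ("u1", "c2", 2, 3),
    ("u2", "c1", 1, 3),
]
viewing_features = [("u1", 1000.0), ("u2", 1500.0)]    # (uid, weight)
csm_channels = {"c1": "Channel A", "c2": "Channel B"}  # mid -> channel name

# i-ii. Group by (uid, mid) and expand each viewing span into per-minute rows.
per_minute = set()
for uid, mid, start, end in viewing_sample:
    for minute in range(start, end):
        per_minute.add((uid, mid, minute))

# iii. Aggregate the characteristic data by uid into a user information table.
user_info = dict(viewing_features)

# iv-v. Join weights and channel names onto the per-minute rows.
audience = defaultdict(dict)  # (channel, minute) -> {uid: weight}
for uid, mid, minute in per_minute:
    audience[(csm_channels[mid], minute)][uid] = user_info[uid]

def flow(channel, minute):
    """vi. Weighted inflow and outflow between the previous minute and this one."""
    now = audience.get((channel, minute), {})
    before = audience.get((channel, minute - 1), {})
    inflow = sum(w for u, w in now.items() if u not in before)
    outflow = sum(w for u, w in before.items() if u not in now)
    return inflow, outflow

print(flow("Channel A", 1))  # u2 joins Channel A
print(flow("Channel A", 2))  # u1 leaves Channel A
```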

Channel per minute audience rating, viewing duration, channel per minute audience composition, channel per minute inflow and outflow data and other indexes: "user viewing behavior" is calculated from the "viewing sample data" and analyzed to obtain "user viewing behavior per minute", which is further analyzed to obtain "user viewing behavior and weight per minute". Associating the latter with the "CSM channel table (dictionary code)" yields the "channel per minute basic data" table, which supports each of these indexes.

Channel daily average viewing duration index: "user information" is obtained by analyzing the "viewing characteristic data"; associating it with the "channel per minute basic data" from the preceding steps allows this index to be calculated.

Channel nine major time period audience rating index: the "user viewing behavior" and "user information" data obtained in the preceding steps are associated with the CSM channel table; joining the three tables yields the "user viewing behavior and weight" data table, and aggregating it by time yields the "user time period viewing information" table, which supports this index.

Channel daily audience rating, channel daily audience size and similar indexes: the "user viewing behavior and weight" table from the preceding process is calculated and analyzed on a per-day basis to obtain the "channel daily viewing information and channel ID" data table, which supports these indexes.

As shown in the figure, the audience rating serves as the core data. The average viewing minutes per person and the arrival rate can be calculated by grouping and aggregating it, and these three types of data serve as the basic business data from which the various business indexes are derived.

Audience rating: the audience rating measures the proportion of the overall population that watches a channel or a program during a particular time period. The index is in effect the number of viewers and their viewing duration averaged over the length of the specific time period (program). When the audience is restricted to a part of the overall population (e.g., ages 10-14), the audience rating is called the target audience rating. It is an important basis for program scheduling and adjustment and a main index for program evaluation. The algorithm formula is as follows:
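The formula itself is not reproduced in this text. As an assumption based only on the definition above, and not necessarily the applicant's exact formula, the rating over a period of $T$ minutes could be written as follows, where $U$ is the overall population, $w_i$ is the weight of sample person $i$, and $m_i$ is the number of minutes that person viewed within the period (all symbols are introduced here for illustration):

```latex
\text{Rating}(\%) = \frac{\sum_{i} w_i \, m_i}{U \cdot T} \times 100
```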

Arrival rate (reach): the total number of people (in thousands) or the percentage (%) of the overall television population who meet the reach condition in a particular time period. In one embodiment, the reach condition is "at least 1 minute watched". The algorithm formula is as follows:
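The formula itself is not reproduced in this text. As an assumption based only on the definition above, and not necessarily the applicant's exact formula, the arrival rate with the "at least 1 minute watched" condition could be written as follows, where $U$ is the overall television population, $w_i$ is the weight of sample person $i$, $m_i$ is the number of minutes that person viewed within the period, and $\mathbf{1}[\cdot]$ equals 1 when the condition holds and 0 otherwise (all symbols are introduced here for illustration):

```latex
\text{Arrival rate}(\%) = \frac{\sum_{i} w_i \, \mathbf{1}[m_i \ge 1]}{U} \times 100
```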

The arrival rate is an index that accumulates over time: it counts the number (or proportion) of people who watch a certain channel or program column (or who can be covered by a certain advertising plan) within a specific time period, and reflects the size and breadth of the audience reached.

Average viewing minutes per person: the ratio of the average daily viewing time (in minutes) to the overall television population. It may be calculated for a particular channel or time period. The algorithm formula is as follows:
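The formula itself is not reproduced in this text. As an assumption based only on the definition above, and not necessarily the applicant's exact formula, the index could be written as follows, where $U$ is the overall television population, $w_i$ is the weight of sample person $i$, and $m_i$ is the number of minutes that person viewed per day on the channel or in the time period concerned (all symbols are introduced here for illustration):

```latex
\text{Average viewing minutes per person} = \frac{\sum_{i} w_i \, m_i}{U}
```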

The average viewing minutes per person distributes the total viewing time of the viewing audience evenly over the overall population, not over the viewing audience alone.

Other indexes can be calculated by following the connecting lines in the business calculation model. For example:

Market share: the number of people watching a certain channel or program in a specific time period as a percentage of the number of people watching television in the same period, i.e., the audience rating of a certain channel in a certain period as a percentage of the total audience rating of all channels. The calculation formula is as follows:
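The formula itself is not reproduced in this text. Following the verbal definition above, it can be written as follows, where $\text{Rating}_c$ is the audience rating of channel $c$ in the period and the sum runs over all channels $k$ (the notation is introduced here for illustration):

```latex
\text{Market share}_c(\%) = \frac{\text{Rating}_c}{\sum_{k} \text{Rating}_k} \times 100
```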

Market share is the proportion of people watching a certain channel (program) among all people watching television at that time (the total audience); the larger the value, the stronger the market competitiveness of the channel (program) in that time period.

The above calculations are described using only some of the indexes as examples; corresponding calculation programs may be provided for other indexes according to actual needs, and are not described here.
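The four indexes described above can be computed with a short sketch. The formulas used are assumptions inferred from the verbal definitions, not the applicant's formulas, and the data layout (a dict of viewed minutes per user, a dict of user weights, and a population size) is hypothetical.

```python
def rating(minutes, weights, universe, period_minutes):
    """Weighted viewing minutes averaged over the population and the period, in percent."""
    total = sum(weights[uid] * m for uid, m in minutes.items())
    return 100.0 * total / (universe * period_minutes)

def arrival_rate(minutes, weights, universe, min_minutes=1):
    """Weighted share of the population that watched at least min_minutes, in percent."""
    reached = sum(weights[uid] for uid, m in minutes.items() if m >= min_minutes)
    return 100.0 * reached / universe

def minutes_per_person(minutes, weights, universe):
    """Total weighted viewing minutes spread over the whole population."""
    return sum(weights[uid] * m for uid, m in minutes.items()) / universe

def market_share(channel_rating, all_ratings):
    """Channel rating as a percentage of the total rating of all channels."""
    return 100.0 * channel_rating / sum(all_ratings)

# Hypothetical sample: three panel members representing 5000 people, a 60-minute period.
weights = {"u1": 1000.0, "u2": 1500.0, "u3": 2500.0}
universe = 5000.0
channel_a = {"u1": 30, "u2": 0, "u3": 12}   # minutes viewed on channel A
channel_b = {"u1": 0, "u2": 60, "u3": 0}    # minutes viewed on channel B

rating_a = rating(channel_a, weights, universe, 60)
rating_b = rating(channel_b, weights, universe, 60)
print(rating_a, arrival_rate(channel_a, weights, universe))
print(minutes_per_person(channel_a, weights, universe))
print(market_share(rating_a, [rating_a, rating_b]))
```

Note that `minutes_per_person` divides by the whole population rather than by the number of people who actually watched, which matches the distinction drawn in the text.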

In one implementation, 10 physical servers may be deployed in the embodiments of the present application: 2 (8 cores, 16 GB, 500 GB) for the data upload and presentation service, 2 (8 cores, 16 GB, 500 GB) for the original file storage service, 2 (16 cores, 64 GB, 500 GB) for the message queue service, 2 (16 cores, 64 GB, 500 GB) for the intermediate result calculation service, and 2 for the database service.

1. The Java Web services and the front-end UI are deployed on 2 physical machines in an active/standby pair, so that when one service goes down the other keeps the service available.

2. The NFS cluster is deployed on 2 physical machines, one primary and one standby, so that if the data on one machine is lost, the other still holds a copy.

3. The AMQ service cluster is deployed on 2 physical machines in an active/standby pair to ensure that the service is provided continuously.

4. 2 intermediate result calculation services are deployed; they require high-performance servers and support complex calculation.

5. 2 database services are deployed for storage.

The system is monitored, and when data are inconsistent the Web monitoring page raises a red alarm. The principle is as follows: when the original data, the ActiveMQ production data, the ActiveMQ consumption group data and the data in Oracle are inconsistent, the data can be judged to be lost or inaccurate. At this point, the user may click "retrieve data" on the Web front-end page to recalculate the data. The service flow is as follows: the data, the monitoring information, the intermediate table data and the Oracle data are deleted in that order; task acquisition is then re-executed and the front end consumes the data directly; once it has been manually confirmed that no data is lost, the recalculation is considered successful.
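A minimal sketch of this check and of the recalculation order follows. Comparing record counts across the four stages is an assumption made for illustration, since the application does not say how inconsistency is measured; the stage names, store keys and the `reacquire` callback are likewise hypothetical.

```python
def check_consistency(counts):
    """Return 'OK' when every stage reports the same record count, else 'RED_ALARM'."""
    return "OK" if len(set(counts.values())) == 1 else "RED_ALARM"

def recalculate(store, reacquire):
    """Delete derived data in the described order, then re-run task acquisition."""
    for key in ("data", "monitoring", "intermediate", "oracle"):
        store[key] = []
    reacquire(store)
    return store

def reacquire(store):
    """Hypothetical re-acquisition: reload the same three records into each stage."""
    store["data"] = [1, 2, 3]
    store["oracle"] = [1, 2, 3]

counts = {"raw": 100, "amq_produced": 100, "amq_consumed": 98, "oracle": 98}
print(check_consistency(counts))

store = {"data": [1, 2], "monitoring": ["alarm"], "intermediate": [9], "oracle": [1]}
recalculate(store, reacquire)
print(len(store["data"]) == len(store["oracle"]))
```

In the described flow, the final confirmation that no data is lost is performed manually; the equality check at the end of the sketch only hints at what an operator would verify.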

The viewing data business computing system provided by the embodiments of the present application has the following advantages:

1. Computing stability

Complex calculation is decomposed across multiple servers. The existing RGui technology processes the calculation centrally, places a high memory requirement on a single server, and easily causes memory overflow, which makes the calculation fail. In addition, memory management in the Java language is far superior to that in the R language, which ensures the stability of the calculation.

2. More attractive data visualization

The embodiments of the present application use the e-chart plug-in, whose charts are more attractive than those of RGui.

3. Clearer index calculation

In the present application, a large number of indexes depend on one another. The indexes are classified and distributed to different modules for calculation, and the common indexes are extracted, which reduces the calculation and development pressure and makes the calculation process and method of every index simpler and clearer. RGui cannot readily provide these functions.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The scheme in the embodiments of the present application can be implemented in various computer languages, such as the object-oriented programming language Java and the interpreted scripting language JavaScript.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims

1. A data processing system, comprising: a data calculation scheduling device, a file server, a data analysis engine, an ETL server and a database, wherein,

2. The data processing system of claim 1, wherein the data computation scheduler comprises:

3. The data processing system of claim 1, wherein the data analysis engine is further configured to save the stage states to a database during the analysis of the data in the ETL result queue.

4. The data processing system of claim 2, wherein the data computation scheduler further comprises:

5. The data processing system of claim 2, wherein the data computation scheduler further comprises:

6. The data processing system of claim 1, wherein the data analysis engine analyzes the data in the queue of the data computation scheduler, comprising:

7. The data processing system of claim 6, wherein the index metadata is computed from the viewing sample data, the viewing profile data, the CSM channel table, and the CSM program guide, and comprises:

8. The data processing system of claim 6, wherein the index metadata is computed from the viewing sample data, the viewing profile data, the CSM channel table, and the CSM program guide, and comprises:

9. The data processing system of claim 6, wherein the index metadata is computed from the viewing sample data, the viewing profile data, the CSM channel table, and the CSM program guide, and comprises:

10. The data processing system of claim 1, wherein the index metadata is computed from the viewing sample data, the viewing profile data, the CSM channel table, and the CSM program guide, and comprises: