CN113868525A - Method, device and equipment for determining accumulative independent access amount based on batch streaming coordination

Info

Publication number: CN113868525A
Application number: CN202111138453.7A
Authority: CN
Inventors: 雷锦伟
Original assignee: Alipay Hangzhou Information Technology Co Ltd
Current assignee: Alipay Hangzhou Information Technology Co Ltd
Priority date: 2021-09-27
Filing date: 2021-09-27
Publication date: 2021-12-31

Abstract

The embodiment of the specification discloses a method for determining an accumulated independent access amount based on batch streaming coordination, which comprises the following steps: acquiring a user access stream through a stream data source; extracting user access data from the user access stream according to a preset time interval to obtain a batch data source; creating batch tasks according to the batch data source, and executing the batch tasks to perform deduplication, obtaining an at least partially deduplicated historical access dimension table; and creating a stream task corresponding to the current time period, and executing the stream task to perform deduplication again according to the current historical access dimension table and the stream data corresponding to the current time period in the user access stream, obtaining the accumulated independent access amount in the current time period.

Description

Method, device and equipment for determining accumulative independent access amount based on batch streaming coordination

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a method, an apparatus, and a device for determining an accumulated independent access amount based on batch streaming coordination.

Background

In many cases, the accumulated independent access amount needs to be counted in order to better understand actual effects. For example, a website may need to count the growth of new users, and an activity may need to count the participation of new users.

In a stream computing scenario, the accumulated independent access amount is determined by a stream task. The stream task processes the access data of the current time period in real time to determine the accumulated independent access amount of the current time period, but the accumulated independent access amount obtained by the stream task may be inaccurate, so it is difficult to provide reliable reference data for the user.

Based on this, an accurate way of determining the accumulated independent access amount is needed.

Disclosure of Invention

One or more embodiments of the present specification provide a method, an apparatus, and a device for determining an accumulated independent access amount based on batch streaming coordination, so as to solve the following technical problems:

in a streaming computing scenario, the accumulated independent access volume obtained by relying on only streaming tasks may be inaccurate, and it is difficult to provide reliable reference data to a user.

One or more embodiments of the present disclosure adopt the following technical solutions:

one or more embodiments of the present specification provide a method for determining an accumulated independent access amount based on batch streaming coordination, the method including:

acquiring user access stream through a stream data source;

extracting user access data from the user access stream according to a preset time interval to obtain a batch data source;

creating batch tasks according to the batch data source, and executing the batch tasks to perform deduplication, obtaining an at least partially deduplicated historical access dimension table;

and creating a stream task corresponding to the current time period, and executing the stream task to perform deduplication again according to the current historical access dimension table and the stream data corresponding to the current time period in the user access stream, obtaining the accumulated independent access amount in the current time period.

One or more embodiments of the present specification further provide an apparatus for determining an accumulated independent access amount based on batch streaming coordination, the apparatus including:

the acquisition unit acquires a user access stream through a stream data source;

the extraction unit is used for extracting user access data from the user access stream according to a preset time interval to obtain a batch data source;

the dimension table determining unit is used for creating batch tasks according to the batch data source, and executing the batch tasks to perform deduplication, obtaining an at least partially deduplicated historical access dimension table;

and the accumulation unit is used for creating a stream task corresponding to the current time period, and executing the stream task to perform deduplication again according to the current historical access dimension table and the stream data corresponding to the current time period in the user access stream, obtaining the accumulated independent access amount in the current time period.

One or more embodiments of the present specification further provide a device for determining an accumulated independent access amount based on batch streaming coordination, including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to:

acquiring user access stream through a stream data source;

One or more embodiments of the present specification provide a non-transitory computer storage medium storing computer-executable instructions configured to:

acquiring user access stream through a stream data source;

The embodiment of the specification adopts at least one technical scheme which can achieve the following beneficial effects:

1. In the embodiment of the description, the user access stream is acquired through the stream data source. The user access stream is the full data accumulated in real time, so when the accumulated independent access amount is subsequently determined from the full data, the result can be more accurate and reliable.

2. In the embodiment of the description, user access data are extracted from the user access stream at a preset time interval to obtain batch data sources, wherein the batch data sources are historical access data before a current time period, and the historical access data can be better deduplicated by dividing according to the preset time interval.

3. In the embodiment of the description, the batch data source is deduplicated through the batch tasks to obtain the deduplicated historical access dimension table. Because the batch tasks deduplicate the historical access data segment by segment, the data in the historical access dimension table can be reduced as much as possible, so that the accumulated independent access amount can be better determined subsequently.

4. In the embodiment of the description, according to the current historical access dimension table and the stream data corresponding to the current time period in the user access stream, the stream task of the current time period is executed to perform deduplication again, obtaining the accumulated independent access amount in the current time period.

Drawings

In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only some embodiments described in the present specification, and for those skilled in the art, other drawings can be obtained according to the drawings without any creative effort. In the drawings:

Fig. 1 is a schematic flowchart of a method for determining an accumulated independent access amount based on batch streaming coordination according to one or more embodiments of the present disclosure;

Fig. 2 is a schematic structural diagram of a stream-batch integrated computing system framework provided in one or more embodiments of the present disclosure;

Fig. 3 is a schematic structural diagram of an apparatus for determining an accumulated independent access amount based on batch streaming coordination according to one or more embodiments of the present disclosure;

Fig. 4 is a schematic structural diagram of a device for determining an accumulated independent access amount based on batch streaming coordination according to one or more embodiments of the present disclosure.

Detailed Description

The embodiment of the specification provides a method, a device and equipment for determining accumulative independent access amount based on batch streaming coordination.

In order to make those skilled in the art better understand the technical solutions in the present specification, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only a part of the embodiments of the present specification, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present specification without any creative effort shall fall within the protection scope of the present specification.

The accumulated independent access amount refers to the newly added independent access amount in a specific time period, obtained after deduplication. Deduplicating for the accumulated independent access amount requires not only the access data in the specific time period but also the historical access data before that time period.

The calculation rules for the accumulated independent access amount are illustrated below:

For example, suppose the accumulated independent access amount of website A is counted, where multiple visits by the same user to website A are recorded only once. If the specific time period is from 0:00 to 12:00 on September 1, 2021, the accumulated independent access amount is the independent access amount newly added between 0:00 and 12:00 on September 1, 2021. Suppose user a accessed website A before 0:00 on September 1, 2021, and accessed it again at 10:00 and 11:00 on September 1, 2021. Because user a had already accessed website A before 0:00 on September 1, 2021, user a cannot be counted in the accumulated independent access amount even though user a accessed website A again at 10:00 and 11:00 on that day.

Similarly, for the accumulated independent access amount of website A in the period from 0:00 to 12:00 on September 1, 2021, if user b did not access website A before 0:00 on September 1, 2021 and accessed website A at 8:00 and 10:00 on September 1, 2021, user b can be counted in the accumulated independent access amount, and the access time recorded for that count may be 8:00 on September 1, 2021.

In the scenario of the accumulated independent access amount, the accumulated independent access amount in a specific time period needs to be calculated after deduplication against the full data. In a stream computing scenario, stream computing determines an independent access amount (which is not an accumulated independent access amount): when the stream computing engine starts to calculate, it has no historical access data, so the newly added data is not completely deduplicated. An accurate accumulated independent access amount therefore cannot be provided to the user, and only an independent access amount that has not been deduplicated against the historical access data can be calculated. Here, stream computing refers to real-time processing of a data stream.

The determination of the independent access amount by stream computing is exemplified as follows:

1) Calculating in real time the independent access amount from 0:00 of the current day up to the current time can be accomplished by the following SQL instruction:

select DateUtil(event_time, 'yyyyMMdd'), count(distinct user)
from page_visit
group by DateUtil(event_time, 'yyyyMMdd');

'yyyyMMdd' is a date format accurate to the day, and count(distinct user) counts non-repeated users. For example, if the date is September 1, 2021, the SQL instruction groups the current user access stream by day and counts the non-repeated users whose access time falls on September 1, 2021. The SQL instruction can therefore calculate the independent access amount of the current day, but, owing to the limitation of stream computing, historical access data is not considered, so the accumulated independent access amount of the current day cannot be obtained.

2) Calculating in real time the independent access amount of each hour (0:00, …, 12:00, …, 24:00) can be accomplished by the following SQL instruction:

select DateUtil(event_time, 'yyyyMMddHH'), count(distinct user)
from page_visit
group by DateUtil(event_time, 'yyyyMMddHH');

'yyyyMMddHH' is a date format accurate to the hour; the rest is the same as in the SQL instruction above, and likewise the accumulated independent access amount of each hour cannot be obtained.
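The gap between the two quantities can be illustrated with a small sketch. It is not part of the described system: the sample records and the function names `daily_independent` and `daily_cumulative` are assumptions made for illustration. The first function mirrors count(distinct user) grouped by day; the second counts only users never seen on an earlier day.

```python
from collections import defaultdict

# Hypothetical access records: (user, day) pairs standing in for page_visit rows.
visits = [
    ("a", "20210831"),
    ("a", "20210901"),
    ("a", "20210901"),
    ("b", "20210901"),
]


def daily_independent(records):
    """Equivalent of count(distinct user) grouped by day."""
    users_by_day = defaultdict(set)
    for user, day in records:
        users_by_day[day].add(user)
    return {day: len(users) for day, users in users_by_day.items()}


def daily_cumulative(records):
    """Count, for each day, only the users not seen on any earlier day."""
    seen = set()
    result = {}
    for day in sorted({d for _, d in records}):
        todays_users = {u for u, d in records if d == day}
        result[day] = len(todays_users - seen)
        seen |= todays_users
    return result


independent = daily_independent(visits)
cumulative = daily_cumulative(visits)
```

On this sample, user a is counted in the independent access amount of September 1 although a already visited on August 31, while the accumulated count for that day includes only user b.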

In view of the above-mentioned drawbacks, the technical solutions provided in the present specification will be described in detail below with reference to the accompanying drawings.

Fig. 1 is a schematic flowchart of a method for determining an accumulated independent access amount based on batch streaming coordination according to one or more embodiments of the present disclosure. The flow may be executed by a stream-batch integrated computing system, and some input parameters or intermediate results in the flow allow manual intervention and adjustment to help improve accuracy.

S102, obtaining user access stream through a stream data source.

The stream data source may be the full data accumulated in real time, that is, a record is added to the stream data source each time a user accesses. The user access stream may include at least an independent statistical dimension and an access time. The user access stream may be created by the following SQL instruction:

create table page_visit(
  user STRING COMMENT 'user id',
  event_time BIGINT COMMENT 'visit timestamp',
  dt STRING COMMENT 'day partition'
);

In the SQL instruction, the independent statistical dimension is the user ID and page_visit is the user access stream. When created, page_visit contains the user ID (user), the visit timestamp (event_time, the access time), and the day partition (dt, a time format divided by days).

S104, extracting user access data from the user access stream according to a preset time interval to obtain a batch data source.

In this embodiment, in order to ensure that the accumulated independent access amount is accurate, the batch data source needs to extract data starting from the first user access, so that the accumulated independent access amount is obtained based on the full data.

The preset time interval can be determined according to actual conditions, so that the user access data in each preset time interval is extracted in reasonable amounts. If the daily access amount is lower than a preset value, the preset time interval may be set to 3 days; if the daily access amount is higher than the preset value, the preset time interval may be set to 12 hours; and if the daily access amount equals the preset value, the preset time interval may be set to 1 day.
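The interval rule described above can be sketched as a small function. The function name `choose_interval` and its parameters are hypothetical; only the three cases (3 days, 12 hours, 1 day) come from the text.

```python
from datetime import timedelta


def choose_interval(daily_visits: int, preset_value: int) -> timedelta:
    """Pick the extraction interval from the daily access amount.

    Hypothetical helper: lower traffic gets a longer interval,
    higher traffic a shorter one, as in the example above.
    """
    if daily_visits < preset_value:
        return timedelta(days=3)
    if daily_visits > preset_value:
        return timedelta(hours=12)
    return timedelta(days=1)
```

For instance, with a preset value of 1000 daily visits, a site receiving 200 visits per day would be extracted every 3 days, and a site receiving 5000 visits per day every 12 hours.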

S106, creating batch tasks according to the batch data source, and executing the batch tasks to perform deduplication, obtaining an at least partially deduplicated historical access dimension table.

The batch data source is the user access data of each preset time interval, and the user access data of each preset time interval needs to be deduplicated separately to obtain the deduplicated historical access dimension table. The historical access dimension table is deduplicated within each preset time interval, but the user access data is not deduplicated across preset time intervals; this does not affect the subsequent determination of the accumulated independent access amount, as described in detail later.

Alternatively, the user access data may also be deduplicated across the preset time intervals: it is only necessary to compare the user access data of the different preset time intervals and screen out the repeated user access data, thereby completing deduplication across the preset time intervals.

Further, the batch tasks are created according to the batch data source relative to the current time period: no preset time interval belongs to the current time period, and the user access data of each preset time interval is historical access data before the current time period. The current time period is the time period for which the accumulated independent access amount is to be determined. This is illustrated by the following example:

For example, when a query for the accumulated independent access amount of the time period from 0:00 to 12:00 on September 1, 2021 is issued at 12:00 on September 1, 2021, all the user access data before 0:00 on September 1, 2021 needs to be used. All the user access data before 0:00 on September 1, 2021 is stored in the batch data source as the user access data of each preset time interval, and the user access data of each preset time interval needs to be deduplicated.

It should be noted that a separate batch task is created for the user access data of each preset time interval in the batch data source, and a batch task processes the batch data source through batch computing.

Further, before obtaining the at least partially deduplicated historical access dimension table, the following steps are performed: obtaining independent statistical dimensions corresponding to accumulated independent access volumes to be determined, and creating a historical access dimension table containing only independent statistical dimensions, wherein the historical access dimension table is updated by executing batch tasks, and a plurality of attribute dimensions corresponding to user access streams contain the independent statistical dimensions.

The historical access dimension table only has independent statistical dimensions, so that the memory space is saved, and meanwhile, the duplicate removal operation of the user access data of each preset time interval is smoothly completed. The historical access dimension table can be created by the following SQL instruction:

create table dim_history_visit(
  user STRING COMMENT 'user id'
);

in the SQL instruction, the independent statistical dimension is the user ID.

When the S106 is executed, batch tasks may be respectively created according to the user access data corresponding to each preset time interval in the batch data source and the corresponding third duplicate removal SQL instruction, and each batch task is executed for duplicate removal, so as to obtain the historical access dimensional table in which each time interval is independently deduplicated.

The deduplication of the user access data of each preset time interval can be completed by the following SQL instruction (third deduplication SQL instruction):

insert into dim_history_visit
select distinct user from page_visit where dt = '${bizDate}'

Here the independent statistical dimension is the user dimension, dim_history_visit is the historical access dimension table, and bizDate denotes each preset time interval. The SQL instruction inserts into the historical access dimension table the users of each preset time interval after deduplication, so that the access record of the same user in each preset time interval is retained only once.
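As a runnable illustration of this batch step, the sketch below executes the same insert/select distinct statement once per day partition against an in-memory SQLite database. SQLite, the sample rows, and the helper `run_batch_task` are assumptions for illustration only; the specification does not name a particular engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-ins for the tables defined in the text.
conn.execute('CREATE TABLE page_visit ("user" TEXT, event_time INTEGER, dt TEXT)')
conn.execute('CREATE TABLE dim_history_visit ("user" TEXT)')
conn.executemany(
    "INSERT INTO page_visit VALUES (?, ?, ?)",
    [
        ("a", 1, "20210830"),
        ("a", 2, "20210830"),  # repeat visit inside the same interval
        ("b", 3, "20210830"),
        ("a", 4, "20210831"),  # same user again, in a later interval
    ],
)


def run_batch_task(connection, biz_date):
    """Hypothetical batch task: deduplicate one preset time interval."""
    connection.execute(
        'INSERT INTO dim_history_visit '
        'SELECT DISTINCT "user" FROM page_visit WHERE dt = ?',
        (biz_date,),
    )


for biz_date in ("20210830", "20210831"):
    run_batch_task(conn, biz_date)

rows = sorted(r[0] for r in conn.execute('SELECT "user" FROM dim_history_visit'))
```

After both tasks run, user a appears once per interval (twice in total): the table is deduplicated within each interval but not across intervals, which is the "at least partially deduplicated" state described in S106.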

S108, creating a stream task corresponding to the current time period, and executing the stream task to perform deduplication again according to the current historical access dimension table and the stream data corresponding to the current time period in the user access stream, obtaining the accumulated independent access amount in the current time period.

When the stream task corresponding to the current time period is created, a first deduplication SQL instruction that is independent of stream and batch, sent by a querier of the accumulated independent access amount, is received; the first deduplication SQL instruction is automatically rewritten to obtain a second deduplication SQL instruction; and the stream task corresponding to the current time period is created according to the second deduplication SQL instruction.

The first deduplication SQL instruction is entered by the user to calculate the accumulated independent access amount of the current time period. The user need not know the second deduplication SQL instruction or how to rewrite into it; the first deduplication SQL instruction provides the parameters needed for rewriting into the second deduplication SQL instruction. Of course, a user who does know the second deduplication SQL instruction can still directly enter the first deduplication SQL instruction, which saves the user's time.

The first deduplication SQL instruction may be accomplished by the following SQL instruction:

select
  deduplicate('max 1', user, event_time)
from
  page_visit
group by
  user

In this SQL instruction, 'max 1', user, and event_time are used to restrict the second deduplication SQL instruction to selecting the maximum access time, and page_visit specifies the user access stream of the current time period used in the second deduplication SQL instruction.

It should be noted that automatically rewriting the first deduplication SQL instruction yields the second deduplication SQL instruction and, at the same time, may also yield the third deduplication SQL instruction executed in S106. In this case, the first deduplication SQL instruction provides the corresponding parameters for rewriting into both the second and the third deduplication SQL instructions.

In addition, the third deduplication SQL instruction may also be obtained by automatically rewriting the first deduplication SQL instruction when creating a batch task. Specifically, a preset time interval corresponding to the current time period is obtained first and is used as the latest batch task scheduling time; and automatically rewriting the first duplicate removal SQL instruction according to the latest batch task scheduling time to obtain a third duplicate removal SQL instruction.

When S108 is executed, the stream data in the current time period is joined with the current historical access dimension table; in the join, the stream data in the current time period and the current historical access dimension table are aligned on the independent statistical dimension to obtain a join result; user records that exist in the stream data of the current time period but do not exist in the current historical access dimension table are screened out from the join result; and deduplication is performed on the screening result to complete the execution of the stream task.

Deduplicating the screening result to complete the execution of the stream task may proceed as follows: the user access records contained in the screening result are grouped according to the independent statistical dimension to obtain one or more access record groups; the user access records in each access record group are numbered independently to obtain a record row number for each user access record; each access record group is filtered by the same record row number, so that only one user access record is retained in each filtered access record group; and the execution of the stream task is completed according to the retained user access records.

When the accumulated independent access amount in the current time period is obtained, the method can be carried out according to the following SQL instructions:

select
  user, event_time
from
(
  select
    *,
    Row_Number() over (partition by user order by event_time desc) as rn
  from ( -- dimension table association, filtering historical cold data
    select * from page_visit p left join dim_history_visit h
    on p.user = h.user where h.user is null
  )
) t where t.rn = 1

Here the independent statistical dimension is the user dimension. Row_Number() over (partition by user order by event_time desc) processes the access data in the current time period: the access data are grouped by the user dimension and sorted by access time within each group; desc denotes descending order of access time, that is, from latest to earliest; and as rn names the resulting row number rn. The inner statement select * from page_visit p left join dim_history_visit h on p.user = h.user where h.user is null associates the user access data in the current time period with the dimension table and filters out historical access data: page_visit p gives the access data in the current time period the alias p; dim_history_visit h gives the historical access dimension table the alias h; left join joins the access data in the current time period, as the left table, with the historical access dimension table; on p.user = h.user aligns the two on the user dimension; and where h.user is null selects, based on the user dimension, the user access records of the current time period that do not exist in the historical access dimension table. t is the alias of the stream data of the current time period after filtering against the historical access data, and where t.rn = 1 deduplicates that data again: Row_Number() groups the access data of the same user and generates a row number for each record, so only the record with row number 1 in each group needs to be selected for accumulation.
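The join, screening, and row-number steps can be traced in plain Python. This is a sketch under assumptions, not the rewritten SQL itself: `run_stream_task` and the sample data are hypothetical, records are simplified to (user, event_time) pairs, and the retained record per user is the latest one, following the descending order in the SQL instruction.

```python
from collections import defaultdict


def run_stream_task(stream_records, history_users):
    """Return one (user, event_time) record per newly seen user."""
    history = set(history_users)  # duplicates in the dimension table are harmless

    # Left join + "h.user is null": keep records whose user is not in history.
    screened = [(u, t) for u, t in stream_records if u not in history]

    # Partition by user.
    groups = defaultdict(list)
    for user, event_time in screened:
        groups[user].append(event_time)

    # Order by event_time desc and keep row number 1 of each group.
    kept = []
    for user, times in groups.items():
        ordered = sorted(times, reverse=True)
        kept.append((user, ordered[0]))
    return sorted(kept)


# Stream data of the current time period and the current dimension table.
stream = [("a", 10), ("a", 11), ("b", 8), ("b", 10), ("c", 9)]
history = ["a", "a"]  # user a recorded in two earlier intervals

result = run_stream_task(stream, history)
accumulated_amount = len(result)
```

In this sample, user a is filtered out by the dimension table, users b and c each contribute one record, and the accumulated independent access amount of the period is the number of retained records.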

It should be noted that not deduplicating the user access data across the preset time intervals in S106 does not affect the accuracy of the calculated accumulated independent access amount. In S108 it is only necessary to determine that the access record corresponding to the independent statistical dimension in the user access stream of the current time period does not appear in the historical access dimension table. If the independent statistical dimension is the user dimension, it is only necessary to determine that a given user ID in the user access stream of the current time period does not appear in the historical access dimension table in order to determine that the user ID can be accumulated.

The above solution calculates the accumulated independent access amount for the current time period, where the current time period may be triggered by a user issuing an instruction at any time. In this embodiment of the present specification, the accumulated independent access amount may also be calculated for every time period, as described below:

Stream tasks are created in sequence for each preset time interval according to the user access records from the user access stream. The stream tasks here are created to calculate the accumulated independent access amount for each preset time interval; each such time interval plays the role of the current time period in the above steps and is different from the time interval of S104. Then, the corresponding user access stream is acquired through the stream task corresponding to each preset time interval, and the corresponding at least partially deduplicated historical access dimension table is acquired through the batch task corresponding to each preset time interval. The stream task of each preset time interval is executed to perform deduplication according to the user access stream and the historical access dimension table corresponding to that interval, the accumulated independent access amount in each preset time interval is obtained, and the accumulated independent access amounts of the preset time intervals are stored in an accumulated independent access amount database.
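A minimal sketch of this per-interval computation follows. The function `compute_per_interval`, the interval keys, and the use of a dictionary as the accumulated independent access amount database are all assumptions for illustration; in particular, adding each interval's users to the history set stands in for the batch task that updates the dimension table.

```python
def compute_per_interval(interval_users, initial_history=()):
    """Compute the accumulated independent access amount of each interval.

    interval_users maps a sortable interval key to the user IDs seen in it.
    Returns a dict playing the role of the result database.
    """
    history = set(initial_history)
    database = {}
    for interval_key in sorted(interval_users):
        users = set(interval_users[interval_key])
        # Stream task: count users absent from the historical dimension table.
        database[interval_key] = len(users - history)
        # Batch task stand-in: this interval becomes history for later ones.
        history |= users
    return database


uv_database = compute_per_interval(
    {
        "2021090100": ["a", "b", "b"],
        "2021090112": ["b", "c"],
        "2021090200": ["a", "d", "e"],
    }
)
```

Each stored value counts only the users first seen in that interval, so the values of adjacent intervals can later be added without double counting.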

Furthermore, the embodiment of the specification can be applied to the user conversion promotion activity, and the situation of the newly added user can be well known and the effect of the promotion activity can be known by calculating the accumulated independent access amount, so that reference is provided for the direction of the subsequent activity.

The user conversion promotion activity is described in detail below:

determining user conversion promoting activities starting from a first designated time, and receiving a user conversion effect query instruction at a second designated time, wherein the second designated time is later than the first designated time, and a plurality of preset time intervals are included between the second designated time and the first designated time; analyzing a user conversion effect query instruction, and determining a time interval required to be queried, wherein the time interval comprises at least one preset time interval; and calling the accumulative independent access amount database to obtain the accumulative independent access amount corresponding to the time interval so as to determine the user conversion data.

Through this operation, the accumulated independent access amount of any interval between the first designated time and the second designated time can be queried to determine the user conversion data. The user conversion data indirectly reflects the effect of the promotion activity and provides a reference for deciding whether to change subsequent promotion activities.
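The query step can be sketched as a range lookup over the stored per-interval results. The dictionary `sample_database`, its keys, and the function `query_conversion` are hypothetical; the sketch assumes each stored value counts only users first seen in that interval, so the values of the queried intervals can simply be summed.

```python
# Hypothetical stored results: interval key -> accumulated independent access amount.
sample_database = {
    "2021090100": 2,
    "2021090112": 1,
    "2021090200": 2,
}


def query_conversion(database, start_key, end_key):
    """Sum the stored amounts of all intervals within [start_key, end_key]."""
    return sum(
        amount
        for interval_key, amount in database.items()
        if start_key <= interval_key <= end_key
    )


first_day = query_conversion(sample_database, "2021090100", "2021090112")
whole_range = query_conversion(sample_database, "2021090100", "2021090200")
```

Here `first_day` covers the two intervals of September 1 and `whole_range` covers all three stored intervals.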

The embodiment of the present specification realizes collaboration between stream batches based on a history access dimension table, that is: batch tasks may be used to generate data for user dimensions that have been accessed historically.

The embodiment of the description utilizes a stream-batch integrated computing system framework and can automatically split out the generation of the historical access dimension table into batch tasks, so that historical access data is filtered, the accumulated independent access amount can be calculated in real time, and the problem that a stream task engine cannot store historical access data is solved. The stream-batch integrated computing system framework is described in detail below:

Fig. 2 is a schematic structural diagram of a stream-batch integrated computing system framework according to one or more embodiments of the present specification. In the framework, a stream data source generates a stream task and a batch data source generates a batch task; the stream task and the batch task can be generated by automatically rewriting an SQL instruction that is independent of stream and batch; and after the stream task is executed, the accumulated independent access amount can be calculated and stored in a unified manner.

Fig. 3 is a schematic structural diagram of an accumulated independent access amount determining apparatus based on batch streaming coordination according to one or more embodiments of the present specification, where the apparatus includes: an obtaining unit 302, an extracting unit 304, a dimension table determining unit 306 and an accumulating unit 308.

The obtaining unit 302 obtains the user access stream through the stream data source;

the extraction unit 304 extracts user access data from the user access stream according to a preset time interval to obtain a batch data source;

the dimension table determining unit 306 creates batch tasks according to the batch data sources, executes the batch tasks to perform deduplication, and obtains at least partially deduplicated historical access dimension tables;

the accumulating unit 308 creates a stream task corresponding to the current time period, and executes the stream task to deduplicate again according to the current historical access dimension table and the stream data corresponding to the current time period in the user access stream, so as to obtain the accumulated independent access amount in the current time period.

Further, before the dimension table determining unit 306 obtains the at least partially deduplicated historical access dimension table, the apparatus further includes:

the dimension obtaining unit 310 obtains an independent statistical dimension corresponding to the accumulated independent access amount to be determined;

the dimension table creating unit 312 creates a historical access dimension table whose dimensions include only the independent statistical dimension, where the historical access dimension table is updated by executing the batch task, and the multiple attribute dimensions corresponding to the user access stream include the independent statistical dimension.

Further, when creating the stream task corresponding to the current time period, the accumulating unit 308 is specifically configured to:

receiving a first duplicate removal SQL instruction that is independent of stream or batch mode and is sent by a party querying the accumulated independent access amount;

automatically rewriting the first duplicate removal SQL instruction to obtain a second duplicate removal SQL instruction;

and creating a stream task corresponding to the current time period according to the second duplicate removal SQL instruction.

Further, when creating batch tasks according to the batch data source and executing the batch tasks to perform deduplication to obtain an at least partially deduplicated historical access dimension table, the dimension table determining unit 306 is specifically configured to:

respectively creating batch tasks according to user access data corresponding to each preset time interval in a batch data source and a corresponding third duplicate removal SQL instruction;

and executing each batch task to perform deduplication, obtaining a historical access dimension table in which each time interval has been deduplicated independently.
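The per-interval batch deduplication above can be sketched as follows. This is a minimal illustration, assuming hourly preset time intervals and records that carry a `ts` timestamp and a `user_id` as the independent statistical dimension. The names `interval_start` and `run_batch_tasks` are hypothetical. The sample data shows why the resulting table is only "at least partially" deduplicated: the same user can still appear in more than one interval.

```python
from collections import defaultdict
from datetime import datetime, timedelta

INTERVAL = timedelta(hours=1)  # assumed preset time interval
EPOCH = datetime(1970, 1, 1)


def interval_start(ts, interval=INTERVAL):
    """Floor a timestamp to the start of the preset interval that contains it."""
    return EPOCH + ((ts - EPOCH) // interval) * interval


def run_batch_tasks(batch_source, stat_dimension="user_id"):
    """Run one deduplication per interval; intervals are deduplicated independently."""
    dim_table = defaultdict(set)
    for record in batch_source:
        dim_table[interval_start(record["ts"])].add(record[stat_dimension])
    return dict(dim_table)


batch_source = [
    {"user_id": "u1", "ts": datetime(2021, 9, 27, 10, 5)},
    {"user_id": "u1", "ts": datetime(2021, 9, 27, 10, 40)},
    {"user_id": "u1", "ts": datetime(2021, 9, 27, 11, 10)},
    {"user_id": "u2", "ts": datetime(2021, 9, 27, 11, 20)},
]
history_dim_table = run_batch_tasks(batch_source)
```

Here `u1` is deduplicated within the 10:00 interval but still appears again in the 11:00 interval, so the stream task has to deduplicate once more when it consumes the table.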

Further, when creating a batch task, the dimension table determining unit 306 is specifically configured to:

acquiring a preset time interval corresponding to the current time period as the latest batch task scheduling time;

and automatically rewriting the first duplicate removal SQL instruction according to the latest batch task scheduling time to obtain a third duplicate removal SQL instruction.
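The rewriting step above can be sketched as follows. This section does not give the actual rewriting rules, so the sketch assumes a very simple one: the latest batch task scheduling time is the start of the preset interval containing the current time, and the rewrite bounds the query by that time. The names `latest_batch_schedule_time` and `rewrite_for_batch`, the table `user_access`, and the column `access_time` are hypothetical, and the first SQL instruction is assumed to have no `WHERE` clause of its own.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def latest_batch_schedule_time(now, interval):
    """Assumed rule: the start of the preset interval that contains `now`."""
    return EPOCH + ((now - EPOCH) // interval) * interval


def rewrite_for_batch(first_sql, schedule_time, time_column="access_time"):
    """Hypothetical rewrite: restrict the query to data before the scheduling time.

    Assumes `first_sql` has no WHERE clause; a real rewriter would parse the SQL.
    """
    bound = schedule_time.strftime("%Y-%m-%d %H:%M:%S")
    return f"{first_sql} WHERE {time_column} < '{bound}'"


first_sql = "SELECT DISTINCT user_id FROM user_access"
schedule_time = latest_batch_schedule_time(
    datetime(2021, 9, 27, 11, 25), timedelta(hours=1)
)
third_sql = rewrite_for_batch(first_sql, schedule_time)
```

The point of the sketch is only that the querying party writes one instruction, and the batch variant is derived from it by adding a time bound tied to the batch schedule.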

Further, when executing the stream task to deduplicate again according to the current historical access dimension table and the stream data corresponding to the current time period in the user access stream, the accumulating unit 308 is specifically configured to:

connecting the stream data of the current time period with the current historical access dimension table;

in the connection process, according to the independent statistical dimension, carrying out dimension alignment on the stream data in the current time period and the current historical access dimension table to obtain a connection result;

screening out user records which exist in the streaming data of the current time period but do not exist in the current historical access dimension table from the connection result;

and performing duplicate removal according to the screening result to complete the execution of the stream task.
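The join, screening, and deduplication steps above can be sketched as follows. This is a minimal illustration, assuming `user_id` is the independent statistical dimension and the current historical access dimension table has already been merged into a single set of identifiers. The function name `screen_and_deduplicate` and the final addition that derives the accumulated amount are assumptions, not wording from the specification.

```python
def screen_and_deduplicate(stream_records, history_dim, stat_dimension="user_id"):
    """Keep one record per visitor that is in the stream but not in the history."""
    # Join: align each stream record with the dimension table on the
    # independent statistical dimension (a left join; unmatched -> False).
    joined = [
        (record, record[stat_dimension] in history_dim) for record in stream_records
    ]
    # Screen: records present in the stream but absent from the dimension table.
    screened = [record for record, matched in joined if not matched]
    # Deduplicate: keep the first record seen for each dimension value.
    first_seen = {}
    for record in screened:
        first_seen.setdefault(record[stat_dimension], record)
    return list(first_seen.values())


history_dim = {"u1", "u2"}  # assumed: already merged across past intervals
stream_records = [
    {"user_id": "u2", "page": "/a"},
    {"user_id": "u3", "page": "/a"},
    {"user_id": "u3", "page": "/b"},
    {"user_id": "u4", "page": "/a"},
]
new_visitors = screen_and_deduplicate(stream_records, history_dim)
# Assumption: accumulated amount = historical visitors + newly seen visitors.
accumulated_amount = len(history_dim) + len(new_visitors)
```

Because the historical visitors are removed before deduplication, the stream task only needs to keep state for the current time period.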

Further, when performing deduplication according to the screening result to complete the execution of the stream task, the accumulating unit 308 is specifically configured to:

according to the independent statistical dimension, grouping user access records contained in the screening result to obtain one or more access record groups;

respectively numbering user access records in each access record group independently to obtain record line numbers of the user access records;

filtering the access record groups by using the same record line number respectively, so that only one user access record is reserved in each filtered access record group;

and completing the execution of the stream task according to the retained user access records.
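The grouping and row-number filtering above resembles the common SQL pattern of numbering rows within a partition and keeping a single row number. A minimal sketch, assuming `user_id` is the independent statistical dimension and that row number 1 is the one retained (the function name `dedupe_by_row_number` and the parameter `keep_row` are hypothetical):

```python
from collections import defaultdict


def dedupe_by_row_number(screened, stat_dimension="user_id", keep_row=1):
    """Group by the statistical dimension, number rows per group, keep one number."""
    # Group the screened records by the independent statistical dimension.
    groups = defaultdict(list)
    for record in screened:
        groups[record[stat_dimension]].append(record)
    # Number the records inside each group independently, starting from 1,
    # then filter every group with the same row number.
    retained = []
    for group in groups.values():
        for row_number, record in enumerate(group, start=1):
            if row_number == keep_row:
                retained.append(record)
    return retained


screened = [
    {"user_id": "u3", "page": "/a"},
    {"user_id": "u4", "page": "/a"},
    {"user_id": "u3", "page": "/b"},
]
retained = dedupe_by_row_number(screened)
```

Filtering every group with the same row number guarantees at most one retained record per group, which is what makes the final count an independent (unique) count.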

Further, the apparatus further comprises:

the stream task creating unit 314 creates a stream task for each preset time interval according to each user access record from the user access stream in sequence;

the access stream acquiring unit 316 acquires a corresponding user access stream through the stream task corresponding to each preset time interval, and acquires a corresponding at least partially deduplicated historical access dimension table through the batch task corresponding to each preset time interval;

the cumulative storage unit 318 executes the stream task of each preset time interval to perform deduplication according to the user access stream and the historical access dimension table corresponding to each preset time interval, obtains the cumulative independent access amount in each preset time interval, and stores the cumulative independent access amount of each preset time interval into the cumulative independent access amount database.
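The per-interval processing and storage performed by these units can be sketched as a loop. This is a minimal illustration, assuming each preset time interval is given as a label plus the user identifiers seen in that interval's stream, and that the accumulated independent access amount database is a plain dictionary. The name `accumulate_per_interval` is hypothetical, and folding each interval's new visitors into the history stands in for the batch task that updates the historical access dimension table.

```python
def accumulate_per_interval(intervals):
    """Compute and store the accumulated independent access amount per interval.

    `intervals` is a time-ordered list of (label, user ids seen in the stream).
    """
    history = set()   # stands in for the historical access dimension table
    database = {}     # stands in for the accumulated access amount database
    for label, stream_ids in intervals:
        # Stream task: drop visitors already in the history, then deduplicate.
        new_visitors = set(stream_ids) - history
        database[label] = len(history) + len(new_visitors)
        # Assumed batch step: fold this interval into the history afterwards.
        history |= new_visitors
    return database


intervals = [
    ("10:00", ["u1", "u2", "u1"]),
    ("11:00", ["u2", "u3"]),
    ("12:00", ["u1"]),
]
uv_per_interval = accumulate_per_interval(intervals)
```

In this sample the amount stays at 3 in the last interval because `u1` was already counted, which is the behavior that distinguishes an accumulated independent count from a plain visit count.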

Further, the apparatus further comprises:

the activity determination unit 320 determines a user conversion facilitation activity starting from a first specified time;

the instruction receiving unit 322 receives a user conversion effect query instruction at a second designated time, where the second designated time is later than the first designated time, and a plurality of preset time intervals are included between the second designated time and the first designated time;

the query time determining unit 324 analyzes the user conversion effect query instruction, and determines a time interval to be queried, where the time interval includes at least one preset time interval;

the data determining unit 326 calls the database of the accumulated independent access amount to obtain the accumulated independent access amount corresponding to the time interval, so as to determine the user conversion data.

Fig. 4 is a schematic structural diagram of an accumulated independent access amount determining apparatus based on batch streaming coordination according to one or more embodiments of the present specification, where the apparatus includes:

at least one processor; and a memory communicatively connected to the at least one processor, wherein

the memory stores instructions executable by the at least one processor to cause the at least one processor to:

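acquiring a user access stream through a stream data source;

extracting user access data from the user access stream according to a preset time interval to obtain a batch data source;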

creating batch tasks according to the batch data source, executing the batch tasks to perform deduplication, and obtaining at least partial deduplication historical access dimension tables;

and creating a stream task corresponding to the current time period, and executing the stream task to deduplicate again according to the current historical access dimension table and the stream data corresponding to the current time period in the user access stream, to obtain the accumulated independent access amount in the current time period.

In the 1990s, an improvement to a technology could be clearly distinguished as either a hardware improvement (for example, an improvement to a circuit structure such as a diode, transistor, or switch) or a software improvement (an improvement to a method flow). However, as technology has advanced, many of today's method flow improvements can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be implemented by a hardware entity module. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user programming the device. A designer programs a digital system onto a single PLD without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, instead of manually making integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must be written in a specific programming language called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are the most commonly used at present.
It will also be apparent to those skilled in the art that a hardware circuit implementing a logical method flow can be readily obtained simply by lightly logic-programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.

A controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor and a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer-readable program code, the same functionality can be achieved by logically programming the method steps so that the controller takes the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be considered a hardware component, and the means included in it for performing the various functions may also be considered structures within the hardware component. Alternatively, the means for performing the various functions may even be regarded as both software modules implementing the method and structures within the hardware component.

The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

For convenience of description, the above devices are described as being divided into various units by function. Of course, when implementing this specification, the functions of the units may be implemented in one or more pieces of software and/or hardware.

As will be appreciated by one skilled in the art, the present specification embodiments may be provided as a method, system, or computer program product. Accordingly, embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.

The description has been presented with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the description. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

This description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

The embodiments in the present specification are described in a progressive manner. The same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, the embodiments of the apparatus, the device, and the nonvolatile computer storage medium are substantially similar to the method embodiments, so they are described relatively briefly; for the relevant points, reference may be made to the corresponding description of the method embodiments.

The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

The above description is merely one or more embodiments of the present disclosure and is not intended to limit the present disclosure. Various modifications and alterations to one or more embodiments of the present description will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement or the like made within the spirit and principle of one or more embodiments of the present specification should be included in the scope of the claims of the present specification.

Claims

1. A method for determining accumulated independent access amount based on batch streaming coordination, the method comprising:

acquiring user access stream through a stream data source;

2. The method of claim 1, wherein, prior to obtaining the at least partially deduplicated historical access dimension table, the method further comprises:

obtaining an independent statistical dimension corresponding to the accumulated independent access amount to be determined;

and creating a historical access dimension table with dimensions only including the independent statistical dimensions, wherein the historical access dimension table is updated by executing the batch task, and a plurality of attribute dimensions corresponding to the user access flow include the independent statistical dimensions.

3. The method according to claim 1, wherein the creating of the streaming task corresponding to the current time period specifically comprises:

receiving a first duplicate removal SQL instruction that is independent of stream or batch mode and is sent by a party querying the accumulated independent access amount;

4. The method according to claim 3, wherein the creating a batch task according to the batch data source, executing the batch task to perform deduplication, and obtaining an at least partially deduplicated historical access dimension table specifically includes:

respectively creating batch tasks according to user access data corresponding to each preset time interval in the batch data source and the corresponding third duplicate removal SQL instruction;

and executing each batch task to perform deduplication, obtaining a historical access dimension table in which each time interval has been deduplicated independently.

5. The method according to claim 4, wherein the creating of the batch task specifically comprises:

acquiring the preset time interval corresponding to the current time period as the latest batch task scheduling time;

6. The method according to claim 2, wherein the executing the stream task for re-deduplication according to the current historical access dimension table and stream data corresponding to the current time period in the user access stream specifically includes:

in the connection process, according to the independent statistical dimension, performing dimension alignment on the stream data of the current time period and the current historical access dimension table to obtain a connection result;

and executing the flow task by performing duplicate removal on the screening result.

7. The method according to claim 6, wherein the performing the streaming task is completed by performing deduplication on the result of the screening, and specifically includes:

filtering in each access record group by using the same record line number, so that only one user access record is reserved in each filtered access record group;

and completing the execution of the stream task according to the reserved user access record.

8. The method of claim 1, further comprising:

creating a stream task aiming at each preset time interval according to each user access record from the user access stream in sequence;

acquiring a corresponding user access stream through a stream task corresponding to each preset time interval, and acquiring a corresponding at least partially de-duplicated historical access dimension table through a batch task corresponding to each preset time interval;

and executing the stream tasks of the preset time intervals to perform deduplication according to the user access streams and the historical access dimension table corresponding to the preset time intervals to obtain the accumulated independent access volumes in the preset time intervals, and storing the accumulated independent access volumes of the preset time intervals to an accumulated independent access volume database.

9. The method of claim 8, further comprising:

determining a user conversion facilitation activity starting at a first specified time;

receiving a user conversion effect query instruction at a second designated time, wherein the second designated time is later than the first designated time, and a plurality of preset time intervals are included between the second designated time and the first designated time;

analyzing the user conversion effect query instruction, and determining a time interval required to be queried, wherein the time interval comprises at least one preset time interval;

and calling the accumulative independent access amount database to obtain the accumulative independent access amount corresponding to the time interval so as to determine user conversion data.

10. An accumulated independent access amount determining apparatus based on batch streaming coordination, the apparatus comprising:

11. The apparatus of claim 10, wherein, before the dimension table determining unit obtains the at least partially deduplicated historical access dimension table, the apparatus further comprises:

the dimension acquisition unit is used for acquiring independent statistical dimensions corresponding to the accumulated independent access amount to be determined;

and the dimension table creating unit is used for creating a historical access dimension table which contains only the independent statistical dimension, the historical access dimension table is updated by executing the batch task, and a plurality of attribute dimensions corresponding to the user access flow contain the independent statistical dimension.

12. The apparatus according to claim 10, wherein the accumulating unit, when creating the stream task corresponding to the current time period, is specifically configured to:

13. The apparatus according to claim 12, wherein the dimension table determining unit, when executing a batch task created according to the batch data source, executing the batch task to perform deduplication, and obtaining an at least partially deduplicated historical access dimension table, is specifically configured to:

14. The apparatus according to claim 13, wherein the dimension table determining unit, when creating the batch task, is specifically configured to:

15. The apparatus according to claim 11, wherein the accumulating unit, when executing the stream task to deduplicate again according to the current historical access dimension table and the stream data corresponding to the current time period in the user access stream, is specifically configured to:

16. The apparatus according to claim 15, wherein the accumulating unit performs deduplication on the result of the filtering to complete the execution of the streaming task, and is specifically configured to:

17. The apparatus of claim 10, further comprising:

the stream task creating unit is used for creating stream tasks aiming at preset time intervals according to user access records from the user access stream in sequence;

the access flow acquisition unit is used for acquiring corresponding user access flows through flow tasks corresponding to preset time intervals and acquiring corresponding at least partially de-duplicated historical access dimension tables through batch tasks corresponding to the preset time intervals;

and the accumulative storage unit executes the stream tasks of the preset time intervals to perform deduplication according to the user access streams and the historical access dimension table corresponding to the preset time intervals to obtain accumulative independent access volumes in the preset time intervals, and stores the accumulative independent access volumes of the preset time intervals to the accumulative independent access volume database.

18. The apparatus of claim 17, further comprising:

an activity determination unit that determines a user conversion promotion activity starting from a first specified time;

the instruction receiving unit is used for receiving a user conversion effect query instruction at a second specified time, wherein the second specified time is later than the first specified time, and a plurality of preset time intervals are included between the second specified time and the first specified time;

the query time determining unit is used for analyzing the user conversion effect query instruction and determining a time interval required to be queried, wherein the time interval comprises at least one preset time interval;

and the data determining unit calls the accumulative independent access amount database to obtain the accumulative independent access amount corresponding to the time interval so as to determine the user conversion data.

19. An accumulated independent access amount determining apparatus based on batch streaming coordination, comprising:

at least one processor; and

acquiring user access stream through a stream data source;