CN109981659B

CN109981659B - Network resource prefetching method and system based on data deduplication technology

Info

Publication number: CN109981659B
Application number: CN201910251873.2A
Authority: CN
Inventors: 姚瑶; 王战红; 丁颖; 王会霞
Original assignee: Zhengzhou Institute of Technology
Current assignee: Zhengzhou Institute of Technology
Priority date: 2019-03-29
Filing date: 2019-03-29
Publication date: 2021-07-09
Anticipated expiration: 2039-03-29
Also published as: CN109981659A

Abstract

The invention discloses a network resource prefetching method and system based on data deduplication technology. When a client sends an access request to a server, a proxy server records the user's network access behavior information and extracts an access log. Access rules are obtained by Web mining and analysis of the network log and by extracting network behavior characteristics; a prediction engine then uses a prediction algorithm to determine in advance the network resources most likely to be accessed next and prefetches them into a cache. Because the cache size is limited, the prefetched resources are processed by the data deduplication technology before being stored in the cache. While maintaining prefetching efficiency, the invention can save storage space, improve the data transmission rate, reduce network delay, relieve network traffic pressure during access peaks, save bandwidth and improve system utilization.

Description

Network resource prefetching method and system based on data deduplication technology

Technical Field

The invention relates to the technical field of networks, in particular to a network resource prefetching method and system based on a data deduplication technology.

Background

With the rapid growth of Internet information and users, improving the quality of network service and accelerating the WWW has become an urgent problem. Web caching, Web prefetching and data deduplication can all effectively reduce network delay. Web caching is based on the principle of temporal locality and uses an efficient replacement algorithm to cache in advance the resources that a user is likely to access; it is applied in network environments such as proxy servers, P2P networks and mobile networks, but its benefit is limited by the hit rate. Web prefetching tries to fetch resources actively before the user requests them, which improves the hit rate to a certain extent and reduces access delay. However, prefetching must be carefully controlled; otherwise performance degrades considerably, which defeats its purpose. Data deduplication aims to detect duplicate data and remove the space it occupies. It has been found that when two versions of a resource refer to the same key, about 55% of the data is redundant, and that for academic reference sources the degree of repetition is as high as 87%. By exploiting the information redundancy among data objects, a space utilization far higher than that of traditional compression and incremental backup methods can be obtained, the bandwidth occupied by the transmitted bytes is reduced, and network delay is reduced. Combining Web prefetching with data deduplication is therefore of great significance for effectively reducing network delay.

Existing Web prefetching techniques, which reduce the expected latency, were first applied in the Mozilla Firefox browser and were then adopted by the Google search engine. The Google Web Accelerator software works together with a Firefox or IE browser to realize browser-based prefetching. Commercial software involving prefetching includes Robtex Viking Server and AllegroSurf, but no specific prefetching scheme is disclosed for either. Prefetching is speculative, however, and sometimes consumes additional bandwidth, so a Web prefetching method needs to be used carefully. It is because of this potential bandwidth limitation that Web prefetching has not been very promising for business applications.

Disclosure of Invention

Aiming at the defects and problems in the prior art, the invention provides a network resource prefetching method and system based on a data deduplication technology, which can improve prefetching efficiency and reduce network delay by a method of reducing transmission data redundancy.

The technical scheme adopted by the invention for solving the technical problems is as follows: a network resource prefetching method based on data deduplication technology comprises the following steps:

firstly, connecting a proxy server end between a client end and a server end, and recording network access behavior information of a user and extracting an access log by the proxy server while the client end sends an access request to the server end;

secondly, the proxy server performs Web mining and analysis on the network access log, extracts user behavior characteristics and acquires a network access rule; the step of mining the access preference of the user from the access log so as to extract the network access behavior characteristics of the user comprises the following steps: performing data cleaning pretreatment on the access log, removing records of access failure and objects which cannot be cached in log files, and extracting user browsing characteristics from a pretreated network access sequence;

meanwhile, network resources most possibly accessed by the user at the next time are analyzed in advance by adopting a prediction algorithm through a prediction engine and are pre-fetched into a cache; the prediction engine predicts a page which is likely to be accessed when each resource is requested, generates a series of URLs of the resources which are accessed most recently according to a prediction algorithm, and puts the result into a decision database; the user behavior characteristics can accurately describe the user browsing characteristics through a Markov chain model, a Markov tree is utilized to model the browsing behavior of the user on the webpage, and a prediction algorithm based on the access probability is adopted to predict the most probable access request sent by the user at the next time;

finally, the resources pre-fetched in the cache are stored in the cache after being processed by a data deduplication technology; the step of performing data deduplication processing on the prefetched resources comprises the following steps:

the client-side data deduplication module (CDM) runs in the client browser, stores the latest network resources, and indicates by unique identifier how each stored resource corresponds to the SDM module located at the server side;

the server-side data deduplication module (SDM) assembles the data blocks of the final response; when the SDM receives a request for a given resource, it retrieves the custom request header sent by the CDM, which carries the reference resource identifiers, and then fetches the resource from the server; after the complete response header and data have been received, the SDM allocates a new identifier to the resource, divides the resource data into blocks, and stores the meta-information of the blocks in a data store; in this data store the SDM keeps the meta-information of all blocks of all versions of each resource, indexed by the hash of the block;

after the CDM receives the response, it reconstructs the original resources for all data, including copying the block reference information from the local cache resources and copying the non-redundant data content of the received response.

A network resource prefetching system based on data deduplication technology is characterized in that a simulator system frame is additionally arranged between a user path and a Web server and comprises a client and a proxy server, the client can prefetch user behaviors of a client browser, the client is connected with the proxy server, and the proxy server is connected with the Web server;

the client comprises 6 modules and 2 storage files:

a read path module: reading a request sequence of a user, wherein a data structure is a first-in first-out queue;

the prefetch management module: reading an access queue of a recording module, checking a pre-fetching object pool, and confirming whether resources are pre-fetched or not; if the request is not prefetched, the request is sent to a server, and a prefetching management module creates a plurality of user request threads and waits for a new request; when response resources are received from the server, the prefetching management module checks whether the URL of the server is in a prefetching queue, and if so, the prefetching management module removes the URL and inserts the URL into a prefetching object pool; the prefetching module checks whether the request queue is empty, if so, the prefetching request is allowed to be sent to the server until a new user request comes; when a new client requests to insert into the queue, the prefetching management module implicitly deletes the prefetched resources and clears the implicit queue data storage;

a user request module: receiving a request from the pre-fetching management module and transmitting the request to the request module; when a response resource is received from the server, the user request module inserts the queue of the response header into the prefetch queue data storage and inserts the URL of the resource into the prefetch object pool;

the request module is used for connecting a Web server and is responsible for processing bottom layer communication;

a CDM module: the client data deduplication module intercepts HTTP request messages sent by a client user request module or a pre-fetching request module; inquiring the resource version number to obtain resource identifiers of all resources in the client cache, and informing a communication interception module by CDM;

a communication interception module: adding a custom header 'X-vrs' to the information, attaching the header information to an HTTP request header, transmitting the HTTP request header to a request module, and finally transmitting the HTTP request header to a server;

the pre-fetching queue: the storage server informs the client of the object information needing to be prefetched;

prefetching the object pool: storing all the prefetched objects to act like a browser cache of a user;

the proxy server side includes:

a monitoring module: waiting for a thread queue connected to a client, and giving a port number;

a server connection module: processing the connection between the client and the server, and transmitting the resource data of the server to the client;

a communication interception module: intercepting the most original HTTP response from a server end, storing the HTTP response in a temporary buffer area and preprocessing the HTTP response;

an SDM module: the server-side data deduplication module executes a data splitting process of the message entity transmitted by the communication interception module;

a communication recombination module: preparing and sending a copy version of the response message; combined response header information: updating/creating Content-Length and adding new entity message data Length; adding two new header information resource version number identifiers and metadata lengths;

a prediction engine module: pages that are likely to be visited are predicted each time a resource is requested, a series of URLs for the most recently visited resource are generated according to a prediction algorithm, and the results are placed in a decision database.

The invention has the beneficial effects that: aiming at the defect that the prior Web prefetching is limited by bandwidth, the invention provides an improved method for a Web prefetching system, which can improve the prefetching efficiency and reduce the network delay by reducing the redundancy of transmission data.

On one hand, the invention provides a network resource prefetching method, which comprises the steps of obtaining logs of Web requests through a Squid proxy server, analyzing the logs to obtain access paths, inputting the access paths into a client module, and simulating the access behaviors of users; the client regularly sends requests to the server according to a specified time interval, a server connection module of the server sequentially receives the request messages and transmits the request messages to the server, and then the response messages are intercepted, analyzed and transmitted to a prediction engine module when HTTP response headers are received; the prediction engine module adopts a popular prediction algorithm to calculate a series of pages with the highest frequency of being accessed in the latest section, the result is put into a decision database, and meanwhile, after all response headers are received, the update state database is informed to prepare for next prediction; and finally, the server connection module sends the complete response data to the client.
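By way of non-limiting illustration only, the popularity-based prediction described above (selecting the pages accessed most frequently in the most recent period) might be sketched in Python as follows. The class name `PopularityPredictor`, the `window_size` parameter and the use of a fixed-length window of recent requests are assumptions made for this sketch and are not specified by the embodiment.

```python
from collections import Counter, deque
from typing import Deque, List


class PopularityPredictor:
    """Tracks the most recent requests and reports the most popular URLs."""

    def __init__(self, window_size: int = 1000) -> None:
        # Only the last `window_size` requests are kept (assumed meaning of
        # "the latest section"); older requests fall out automatically.
        self.window: Deque[str] = deque(maxlen=window_size)

    def record(self, url: str) -> None:
        """Register one observed request."""
        self.window.append(url)

    def top_urls(self, k: int = 5) -> List[str]:
        """Return up to k URLs with the highest request count in the window."""
        return [url for url, _ in Counter(self.window).most_common(k)]
```

The list returned by `top_urls` corresponds to the result that the prediction engine module would place in the decision database.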

On the other hand, the invention provides a network resource prefetching system, which can analyze the resource and perform data deduplication processing before prefetching the downloaded resource to the cache, and a communication interception module, an SDM module and a communication recombination module are added at a proxy server; the resource version module, the CDM module and the communication interception module are added at the client, so that the network access requirement of a user can be quickly responded, the network service quality is improved, the access delay is reduced, the bandwidth is effectively saved, and the system utilization rate and the prefetching efficiency are improved.

Drawings

Fig. 1 is a diagram of an application scenario of a prefetching method for a Web page according to an embodiment of the present invention.

Fig. 2 is a flowchart of a prefetching system for Web pages according to an embodiment of the present invention.

Fig. 3 is a diagram of a client module according to an embodiment of the present invention.

Fig. 4 is a block diagram of the proxy server side of the prefetching system according to an embodiment of the present invention.

Fig. 5 is a diagram of a data deduplication architecture.

Fig. 6 is an algorithm diagram of the SDM module in accordance with an embodiment of the present invention.

Detailed Description

The invention is further illustrated with reference to the following figures and examples.

Example 1: a simulator system frame is additionally arranged between a user path and a Web server, wherein the simulator system frame comprises a client and a proxy server end, the client can prefetch the user behavior of a client browser, the client is connected with the proxy server end, and the proxy server end is connected with the Web server.

Fig. 1 is a diagram of an application scenario of a Web page prefetching system according to an embodiment of the present invention. In fig. 1, a client 101 issues a request to a proxy server 102 to access a Web server resource. When the main process of the proxy server 102 detects the request sent by the client 101, it creates a sub-process to handle that request, while the main process continues listening. The created proxy server sub-process establishes a connection with the client 101, reads and parses the client request, and then checks the received request against an access rule list preset on the proxy server. If the request satisfies the rule constraints, the proxy cache is searched for the required information, and subsequent requests for that information are processed locally. The client can therefore obtain the expected resources more quickly, and bandwidth is saved.
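By way of non-limiting illustration only, the rule check and cache lookup performed by the proxy sub-process might be sketched in Python as follows. The function name `handle_request`, the representation of the access rule list as shell-style URL patterns, the 403 status for a rejected request and the `fetch_origin` callable are all assumptions made for this sketch; the embodiment does not specify the form of the rules.

```python
from fnmatch import fnmatchcase
from typing import Callable, Dict, List, Tuple


def handle_request(
    url: str,
    allow_rules: List[str],
    cache: Dict[str, bytes],
    fetch_origin: Callable[[str], bytes],
) -> Tuple[int, bytes]:
    """Check a request against the rule list, then serve it from cache or origin."""
    # Assumed rule semantics: the request is allowed if any pattern matches.
    if not any(fnmatchcase(url, pattern) for pattern in allow_rules):
        return 403, b""
    # Serve locally when the proxy cache already holds the resource.
    if url in cache:
        return 200, cache[url]
    # Otherwise fetch from the Web server and keep a copy for later requests.
    body = fetch_origin(url)
    cache[url] = body
    return 200, body
```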

As shown in fig. 3, the client includes 6 modules and 2 storage files. The modules are a read path module 31, a prefetch management module 32, a user request module 33, a CDM module 34, a communication interception module 35 and a request module 36 in sequence; the storage files are a prefetch queue 37 and a prefetch object pool 38, respectively.

The read path module 31 records the access behavior of the user, including the IP address or host name of the user, the time when the request is sent, the method of the request (GET, POST, etc.), the path (URL) of the access page, the status code returned by the server, and the number of bytes sent in response.

The prefetch management module 32 selects an appropriate prediction algorithm to predict the user requests that need prefetching. It is also responsible for checking the prefetch object pool to determine whether a requested resource has already been prefetched.

The user request module 33 receives the request from the prefetch management module 32 and passes it to the request module.

The CDM module 34 performs client-side data deduplication. By intercepting the HTTP request message sent by the user request module and querying the resource version numbers, it obtains the resource identifiers of all resources in the client cache, and finally notifies the communication interception module 35.

The communication interception module 35 is responsible for adding a custom header "X-vrs" to the request message, transmitting it to the request module and finally to the server.
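By way of non-limiting illustration only, attaching the custom "X-vrs" header to an outgoing request might be sketched in Python as follows. The function name `add_version_header` and the encoding of the header value as a comma-separated list of resource identifiers are assumptions made for this sketch; the embodiment names the header but does not define its value format.

```python
from typing import Dict, Iterable


def add_version_header(headers: Dict[str, str], cached_ids: Iterable[int]) -> Dict[str, str]:
    """Return a copy of the request headers with the X-vrs header attached."""
    out = dict(headers)  # leave the caller's header mapping untouched
    ids = [str(i) for i in cached_ids]
    if ids:
        # Assumed value format: identifiers of cached resources, comma-separated.
        out["X-vrs"] = ",".join(ids)
    return out
```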

The request module 36 is used to connect to a Web server and is responsible for handling the underlying communication.

The prefetch queue 37 is used to store object information that the server tells the client that prefetching is required.

The prefetch object pool 38 is used to store all objects that have been prefetched and functions like a user browser cache.
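By way of non-limiting illustration only, the interplay of the prefetch management module, the prefetch queue and the prefetch object pool might be sketched in Python as follows. The class name `PrefetchManager`, its method names and the `fetch` callable standing in for the request module are assumptions made for this sketch; threading and the underlying communication are omitted.

```python
from collections import deque
from typing import Callable, Deque, Dict, Iterable, Tuple


class PrefetchManager:
    """Serves user requests from the prefetch object pool and prefetches when idle."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self.fetch = fetch                          # hypothetical stand-in for the request module
        self.request_queue: Deque[str] = deque()    # pending user requests (first-in first-out)
        self.prefetch_queue: Deque[str] = deque()   # URLs the server suggested prefetching
        self.object_pool: Dict[str, bytes] = {}     # already prefetched objects

    def add_hints(self, urls: Iterable[str]) -> None:
        """Store prefetch suggestions, skipping URLs already pooled or queued."""
        for url in urls:
            if url not in self.object_pool and url not in self.prefetch_queue:
                self.prefetch_queue.append(url)

    def handle_user_request(self, url: str) -> Tuple[bytes, bool]:
        """Return (body, served_from_pool) for one user request."""
        if url in self.object_pool:
            return self.object_pool[url], True
        return self.fetch(url), False

    def prefetch_when_idle(self) -> int:
        """Issue prefetch requests only while no user request is waiting."""
        fetched = 0
        while self.prefetch_queue and not self.request_queue:
            url = self.prefetch_queue.popleft()
            self.object_pool[url] = self.fetch(url)
            fetched += 1
        return fetched
```

In this sketch a waiting user request always takes priority: `prefetch_when_idle` stops as soon as the request queue is non-empty.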

As shown in fig. 4, the proxy server is used for connecting the server and the client, and includes a listening module 41, a server connection module 42, a request restructuring module 43, an SDM module 44, a communication intercepting module 45, a prediction engine module 46, and a status update library 47.

The monitoring module 41: it is mainly responsible for waiting for the thread queue connected to the client, given a port number.

The server connection module 42: this module is mainly responsible for handling the direct connection between the client and the server. When processing a response, it first receives the original HTTP response header and forwards the header to the client; all resource data from the server side is then forwarded to the client unchanged.

The communication interception module 45: the most original HTTP response from the server side is mainly intercepted and stored in a temporary buffer area and preprocessed.

The SDM module 44: the server-side data deduplication module splits the message entity forwarded by the communication interception module. The splitting is completed in two steps: block partitioning and data deduplication.

The request restructuring module 43: a duplicate version of the response message is prepared and sent. Combined response header information: updating/creating Content-Length and adding new entity message data Length; two new header information X-vrs (resource version number identifier) and X-mtd (metadata length) are added.
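By way of non-limiting illustration only, the assembly of the response header information might be sketched in Python as follows. The function name `build_response_headers` and the assumption that the new entity body consists of the metadata section followed by the non-redundant payload are made for this sketch only.

```python
from typing import Dict


def build_response_headers(
    original: Dict[str, str], resource_id: int, metadata: bytes, payload: bytes
) -> Dict[str, str]:
    """Return response headers updated for the recombined entity body."""
    headers = dict(original)
    # Content-Length is updated (or created) to the new entity length.
    headers["Content-Length"] = str(len(metadata) + len(payload))
    # X-vrs carries the resource version number identifier.
    headers["X-vrs"] = str(resource_id)
    # X-mtd carries the length of the metadata section in bytes.
    headers["X-mtd"] = str(len(metadata))
    return headers
```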

The prediction engine module 46: the main task is to predict the pages that will likely be accessed each time a resource is requested. The module will generate a series of URLs for the most recently accessed resources based on a predictive algorithm and place the results in a decision database.

Example 2:

a network resource prefetching method based on data deduplication technology comprises the following steps:

First, a proxy server is connected between the client and the server. While the client sends an access request to the server, the proxy server records the user's network access behavior information and extracts an access log. The user network access behavior information in the log file mainly comprises the access time of the user access request, the user IP address, the file name or script of the accessed resource, the parameter field and the like.

Secondly, the proxy server performs Web mining and analysis on the network access log, extracts user behavior characteristics and obtains network access rules. Mining the user's access preferences from the access log so as to extract the user's network access behavior characteristics comprises: performing data cleaning preprocessing on the access log, removing records of failed accesses and of objects that cannot be cached from the log file, and extracting the user's browsing characteristics from the preprocessed network access sequence.
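By way of non-limiting illustration only, the data cleaning preprocessing might be sketched in Python as follows. The criteria used here are assumptions made for this sketch: a record counts as a failed access when its status code lies outside 200 to 399, and an object counts as non-cacheable when the request is not a GET, carries a query string, or ends in one of the script suffixes in the hypothetical `NON_CACHEABLE_SUFFIXES`. The embodiment does not define these criteria.

```python
from typing import Dict, List

# Hypothetical suffixes of dynamically generated (non-cacheable) objects.
NON_CACHEABLE_SUFFIXES = (".php", ".cgi", ".asp", ".jsp")


def is_cacheable(record: Dict[str, object]) -> bool:
    """Assumed rule: only GET requests for static, parameter-free URLs are cacheable."""
    url = str(record["url"])
    if record["method"] != "GET" or "?" in url:
        return False
    return not url.lower().endswith(NON_CACHEABLE_SUFFIXES)


def clean_log(records: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Drop failed accesses and non-cacheable objects, preserving record order."""
    cleaned = []
    for record in records:
        status = int(record["status"])
        if not 200 <= status < 400:   # assumed definition of a failed access
            continue
        if is_cacheable(record):
            cleaned.append(record)
    return cleaned


def access_sequences(records: List[Dict[str, object]]) -> Dict[str, List[str]]:
    """Group the cleaned URLs by user IP to obtain per-user access sequences."""
    sequences: Dict[str, List[str]] = {}
    for record in records:
        sequences.setdefault(str(record["ip"]), []).append(str(record["url"]))
    return sequences
```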

Meanwhile, network resources most possibly accessed by the user at the next time are analyzed in advance by adopting a prediction algorithm through a prediction engine and are pre-fetched into a cache; the prediction engine predicts a page which is likely to be accessed when each resource is requested, generates a series of URLs of the resources which are accessed most recently according to a prediction algorithm, and puts the result into a decision database; the user behavior characteristics can accurately describe the user browsing characteristics through a Markov chain model, a Markov tree is utilized to model the browsing behavior of the user to the webpage, and a prediction algorithm based on the access probability is adopted to predict the most probable access request sent by the user at the next time.

Finally, the resources prefetched into the cache are processed by the data deduplication technology before being stored. The client-side data deduplication module (CDM) runs in the client browser, stores the latest network resources, and indicates by unique identifier how each stored resource corresponds to the SDM module located at the server side.

The server-side data deduplication module (SDM) assembles the data blocks of the final response. When the SDM receives a request for a given resource, it retrieves the custom request header sent by the CDM, which carries the reference resource identifiers, and then fetches the resource from the server. After the complete response header and data have been received, the SDM allocates a new identifier to the resource, divides the resource data into blocks, and stores the meta-information of the blocks in a data store. In this data store the SDM keeps the meta-information of all blocks of all versions of each resource, indexed by the hash of the block.

As shown in fig. 1, a client 101 issues a request to a proxy server 102 to access a Web server resource. When the main process of the proxy server 102 detects the request sent by the client 101, it creates a sub-process to handle that request, while the main process continues listening. The created proxy server sub-process establishes a connection with the client 101, reads and parses the client request, and then checks the received request against an access rule list preset on the proxy server. If the request satisfies the rule constraints, the proxy cache is searched for the required information, and subsequent requests for that information are processed locally. The client can therefore obtain the expected resources more quickly, and bandwidth is saved.

As shown in the flowchart of the pre-fetching system for Web pages in fig. 2, in step 201, the proxy server records a user browsing log and monitors the user browsing behavior.

The browsing behavior of the user may refer to the user's access history on static information, and these records are stored in a log file on a Web server or a proxy server. Each record containing the following requested information: the user's IP address or host name, the time the request was issued, the method of the request (GET, POST, etc.), the path to the page (URL), the status code returned by the server, and the number of bytes issued in response. The Web log file is readily available from a Web server or proxy server.
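By way of non-limiting illustration only, extracting the listed fields from one log record might be sketched in Python as follows. The sketch assumes that records are written in the NCSA Common Log Format; the actual log format of the proxy server in the embodiment may differ, and the names `LOG_PATTERN` and `parse_record` are hypothetical.

```python
import re
from typing import Dict, Optional

# Assumed record layout (Common Log Format):
# host ident user [time] "METHOD url protocol" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_record(line: str) -> Optional[Dict[str, object]]:
    """Return the fields of one log line, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    size = match.group("bytes")
    return {
        "host": match.group("host"),           # user's IP address or host name
        "time": match.group("time"),           # time the request was issued
        "method": match.group("method"),       # GET, POST, etc.
        "url": match.group("url"),             # path of the page
        "status": int(match.group("status")),  # status code returned by the server
        "bytes": 0 if size == "-" else int(size),  # number of bytes sent in response
    }
```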

In step 202, the proxy server mines the user's access preference path through the prefetch management module.

By way of example, during the morning office peak period of 8:00-9:30, the network behavior of college users is concentrated on visiting portal websites such as the campus website, Sina and Sohu, and this behavior can be captured in the logs recorded by the proxy server.

A path analyzer in the prefetch management module at the proxy server side carries out the mining of user access paths: it constructs a Markov pattern tree by analyzing the access sequences and thereby generates a transition probability matrix. A prediction method based on the Markov chain model is adopted, in which predictions are derived from the transition probability matrix and the initial state probability vector. All client requests are buffered in a client buffer and are flushed from the buffer once the minimum sample threshold is exceeded or the session ends. Each client is allocated a separate buffer in which its sequence of requests is stored. The Markov chain model is updated as the users' access paths change; the update mainly consists of smoothly adjusting the default or current values of the matrix according to the newly added path sequences.
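By way of non-limiting illustration only, estimating transition probabilities from access sequences and predicting the next request might be sketched in Python as follows. The sketch uses a first-order Markov chain, which is a simplification of the Markov pattern tree described above: the probability of moving from page i to page j is estimated as the count of observed i-to-j transitions divided by the count of all transitions out of i. The class name `MarkovPredictor`, its method names and the `threshold` parameter are assumptions made for this sketch.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Optional


class MarkovPredictor:
    """First-order Markov model of page-to-page transitions."""

    def __init__(self) -> None:
        # counts[i][j] = number of times page j was requested directly after page i
        self.counts: DefaultDict[str, DefaultDict[str, int]] = defaultdict(
            lambda: defaultdict(int)
        )

    def update(self, sequence: List[str]) -> None:
        """Incorporate one access sequence (incremental model update)."""
        for current, following in zip(sequence, sequence[1:]):
            self.counts[current][following] += 1

    def transition_probabilities(self, page: str) -> Dict[str, float]:
        """Return one row of the transition probability matrix."""
        row = self.counts.get(page)
        if not row:
            return {}
        total = sum(row.values())
        return {following: n / total for following, n in row.items()}

    def predict(self, page: str, threshold: float = 0.0) -> Optional[str]:
        """Return the most probable next page, or None if below the threshold."""
        probabilities = self.transition_probabilities(page)
        if not probabilities:
            return None
        best = max(probabilities, key=probabilities.get)
        return best if probabilities[best] >= threshold else None
```

A non-zero `threshold` is one possible way to limit speculative prefetching to transitions that are sufficiently likely.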

In step 203, the prediction engine adopts a prediction algorithm to predict the most likely websites to be accessed at the next time, and the websites are downloaded into the cache in advance when the network is idle.

According to the regularity of the network access behavior, the network proxy cache device analyzes the network operation state for the next occurrence of the same time period (8:00-9:30) and predicts the most likely accessed websites using the prediction algorithm. When the network is idle, the relevant network resources are prefetched in advance from the NetEase website and stored in the cache. When the user makes an access request, the requested resource is obtained directly from the cache, so that the network load is not increased during the peak period, network bandwidth is saved, and the quality of service is ensured.

In step 204, the data deduplication system employed is derived from the dedupHTTP system. The core work of a data deduplication system is to analyze the content of a file; the basic unit of study is an abstract data object called a chunk. The work of the system consists of five stages: chunk division, feature value calculation, identical-or-similar detection, redundancy elimination and data storage. The deduplication system constructed in this embodiment is composed of two modules: a client-side deduplication module (CDM) and a server-side deduplication module (SDM), which run at the client browser and the Web server respectively. The CDM stores the most recent resource versions and indicates to the SDM, by unique identifier, which resource versions it holds.

With respect to data deduplication, as shown in the data deduplication architecture of fig. 5, when the SDM receives a request for a given resource, it retrieves the custom request header sent by the CDM, which carries the reference resource identifiers. The SDM then fetches the resource from the server. After receiving the complete response header and data, the SDM assigns a new identifier to the resource. The resource data is divided into blocks, and the meta-information of the blocks is stored in a data store. In this data store the SDM keeps the meta-information of all blocks of all versions of each resource, indexed by the hash of the block. The SDM traverses all the blocks of the resource. For each block, it looks for a block with the same hash in the reference resources. If a block has no match in the reference resources, the SDM searches among the blocks of the current response. Redundancy is therefore detected not only against the resources cached at the CDM but also within the resource being delivered to the CDM.

The SDM then assembles the final response for the CDM. The response starts with a metadata section, whose size (in bytes) is stored in a custom HTTP response header. The metadata begins with the resource identifier of the response, followed by one four-field tuple per redundant block; each tuple holds the information the CDM needs to find that redundant block in its cache. Specifically, the fields are the offset in the current response (Offset), the identifier of the referenced resource, the offset inside the referenced resource (InOffset) and the length of the block. The tuples are arranged in the order of the corresponding blocks in the original response. After the metadata section, the non-redundant data content is appended, again in the order of the original response. Because the tuples and the non-redundant data are kept in order, the CDM needs only the first offset field (Offset) of each tuple to know how to interleave the redundant data with the non-redundant data.

When the CDM receives the response, it reconstructs the original resource from all the data, copying the referenced blocks from the locally cached resources and copying the non-redundant data content from the received response. The CDM does not store any block meta-information, nor does it need to send or store hashes.
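By way of non-limiting illustration only, the exchange between the SDM and the CDM might be sketched in Python as follows. Several simplifications are assumed for this sketch and are not part of the embodiment: blocks have a fixed size of 4 bytes instead of content-defined boundaries, a 64-bit BLAKE2b digest stands in for the block hash, the metadata is passed as a Python list of (Offset, resource identifier, InOffset, length) tuples rather than in a wire format, and the search for duplicates inside the current response is omitted. The names `SDM`, `CDM`, `encode` and `reconstruct` are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

BLOCK = 4  # assumed fixed block size, chosen small for readability
Ref = Tuple[int, int, int, int]  # (Offset, resource id, InOffset, length)


def block_hash(block: bytes) -> bytes:
    """64-bit digest used as the index key (stand-in hash for this sketch)."""
    return hashlib.blake2b(block, digest_size=8).digest()


class SDM:
    """Server side: indexes blocks of all versions and emits refs plus new data."""

    def __init__(self) -> None:
        # hash -> list of (resource id, offset in that resource, length)
        self.index: Dict[bytes, List[Tuple[int, int, int]]] = {}
        self.next_id = 1

    def encode(self, data: bytes, client_ids: Iterable[int]) -> Tuple[int, List[Ref], bytes]:
        resource_id = self.next_id
        self.next_id += 1
        known = set(client_ids)  # identifiers announced by the CDM
        refs: List[Ref] = []
        literal = bytearray()    # non-redundant data, in original order
        for offset in range(0, len(data), BLOCK):
            block = data[offset:offset + BLOCK]
            digest = block_hash(block)
            hit = next((loc for loc in self.index.get(digest, []) if loc[0] in known), None)
            if hit is not None:
                refs.append((offset, hit[0], hit[1], hit[2]))
            else:
                literal += block
            self.index.setdefault(digest, []).append((resource_id, offset, len(block)))
        return resource_id, refs, bytes(literal)


class CDM:
    """Client side: keeps resources only, no hashes and no block metadata."""

    def __init__(self) -> None:
        self.cache: Dict[int, bytes] = {}

    def reconstruct(self, resource_id: int, refs: List[Ref], literal: bytes) -> bytes:
        out = bytearray()
        used = 0  # bytes of non-redundant data consumed so far
        for offset, ref_id, in_offset, length in refs:
            gap = offset - len(out)  # non-redundant bytes preceding this block
            out += literal[used:used + gap]
            used += gap
            out += self.cache[ref_id][in_offset:in_offset + length]
        out += literal[used:]
        self.cache[resource_id] = bytes(out)
        return bytes(out)
```

In this sketch the client announces `list(cdm.cache)` as the identifiers it holds, which plays the role of the custom request header.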

The algorithm principle of the SDM module in this embodiment is shown in fig. 6. The data deduplication system is in essence a content-addressed storage system: duplicate data is deleted by comparing and analyzing the identity and similarity of data, which saves space. The key is a reasonable choice of data partitioning method.

In the data deduplication system used here, the SDM divides each resource version that the server returns to a client into indexed chunks. The chunk division algorithm employed is the LBFS algorithm.

The first step is as follows: when a client sends a request, the server divides the resource into several indexed chunks. This makes it easy to determine which parts of the resource the client already has and which parts are new and need to be sent.

The process starts by computing a hash at each byte of the resource content. These hashes represent small chunks of the resource and are computed using a sliding (rolling) hash function.

The second step is to select the hashes that represent the resource. Although all the hashes could be chosen, this is impractical in terms of memory, because there is one hash for each byte of content. Therefore, suitable hashes are selected as the boundaries of larger data chunks.

In this step, a window mechanism is selected, which has proven to provide better redundancy detection. A minimum and a maximum chunk size are enforced, because content-based division may otherwise produce chunks that are too large or too small compared with the expected chunk size.

After the chunk boundaries have been selected, the larger chunks are hashed using the 64-bit MurmurHash. This method is preferred over cryptographic hashes such as MD5 or SHA1 while keeping the collision rate low.
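By way of non-limiting illustration only, content-defined chunk division with a minimum and a maximum chunk size might be sketched in Python as follows. The sketch is not the LBFS algorithm itself: a simple polynomial rolling hash over a 16-byte window stands in for the sliding hash function, a boundary is declared where the low 6 bits of the rolling hash are zero, and a 64-bit BLAKE2b digest stands in for the 64-bit MurmurHash, which is not available in the Python standard library. All constants and function names are assumptions made for this sketch.

```python
import hashlib
from typing import List

WINDOW = 16            # bytes covered by the rolling hash (assumed)
BASE = 257             # polynomial base of the rolling hash (assumed)
MOD = (1 << 31) - 1    # modulus of the rolling hash (assumed)
MASK = (1 << 6) - 1    # boundary when the low 6 bits are zero (assumed)
MIN_CHUNK = 32         # minimum chunk size in bytes (assumed)
MAX_CHUNK = 256        # maximum chunk size in bytes (assumed)


def chunk_boundaries(data: bytes) -> List[int]:
    """Return the end offset of every chunk of `data`."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    boundaries: List[int] = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Slide the window: drop the oldest byte before adding the new one.
            rolling = (rolling - data[i - WINDOW] * top) % MOD
        rolling = (rolling * BASE + byte) % MOD
        size = i + 1 - start
        is_candidate = i + 1 >= WINDOW and (rolling & MASK) == 0
        # Cut at a content-defined candidate once the chunk is large enough,
        # or force a cut when the chunk reaches the maximum size.
        if (is_candidate and size >= MIN_CHUNK) or size >= MAX_CHUNK:
            boundaries.append(i + 1)
            start = i + 1
    if start < len(data):
        boundaries.append(len(data))  # trailing chunk, possibly below MIN_CHUNK
    return boundaries


def split_chunks(data: bytes) -> List[bytes]:
    """Split `data` at the selected boundaries."""
    chunks, previous = [], 0
    for end in chunk_boundaries(data):
        chunks.append(data[previous:end])
        previous = end
    return chunks


def chunk_id(chunk: bytes) -> int:
    """64-bit identifier of a chunk (BLAKE2b stand-in for MurmurHash)."""
    return int.from_bytes(hashlib.blake2b(chunk, digest_size=8).digest(), "big")
```

Because the rolling hash depends only on the most recent `WINDOW` bytes, inserting data near the beginning of a resource shifts the boundaries only locally, so later chunks generally keep the same identifiers.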

Claims

1. A network resource prefetching method based on data deduplication technology is characterized in that: the method comprises the following steps:

secondly, the proxy server performs Web mining and analysis on the network access log, extracts user behavior characteristics and acquires a network access rule; the method comprises the following steps of mining the access preference of a user from an access log so as to extract the network access behavior characteristics of the user, wherein the method comprises the following specific steps: performing data cleaning pretreatment on the access log, removing records of access failure and objects which cannot be cached in log files, and extracting user browsing characteristics from a pretreated network access sequence;

meanwhile, network resources most possibly accessed by the user at the next time are analyzed in advance by adopting a prediction algorithm through a prediction engine and are pre-fetched into a cache; the prediction engine predicts a page which is likely to be accessed when each resource is requested, generates a series of URLs of the resources which are accessed most recently according to a prediction algorithm, and puts the result into a decision database; modeling the browsing behavior of a user on a webpage by using a Markov tree, and predicting an access request most possibly sent by the user at the next time by adopting a prediction algorithm based on access probability;

the CDM of the client runs in a client browser and is used for storing the latest network resource and indicating how the corresponding resource corresponds to the SDM module of the data de-duplication model positioned at the server side according to the unique identifier;

2. The network resource prefetching method based on data deduplication technology according to claim 1, wherein the user network access behavior information in the log file comprises the access time of the user access request, the IP address of the user, the file name or script of the accessed resource, and a parameter field.

3. A network resource prefetching system based on data deduplication technology is characterized in that: a simulator system frame is additionally arranged between a user path and a Web server and comprises a client and a proxy server, the client can prefetch the user behavior of a client browser, the client is connected with the proxy server, and the proxy server is connected with the Web server;

the client comprises 6 modules and 2 storage files:

the 6 modules are respectively:

a read path module: reading a request sequence of a user, wherein a data structure generated according to the request sequence is a first-in first-out access queue;

the prefetch management module: reading the access queue, checking a pre-fetching object pool, and confirming whether the resource is pre-fetched; if the request is not prefetched, the request is sent to a server, and a prefetching management module creates a plurality of user request threads and waits for a new request; when response resources are received from the server, the prefetching management module checks whether the URL of the server is in a prefetching queue, and if so, the prefetching management module removes the URL and inserts the URL into a prefetching object pool; the prefetching management module checks whether the request queue is empty, and if the request queue is empty, the prefetching request is allowed to be sent to the server until a new user request comes; when a new client requests to insert into the queue, the prefetching management module implicitly deletes the prefetched resources and clears the implicit queue data storage;

the 2 storage files are respectively:

the proxy server side includes: