CN109981659B - Network resource prefetching method and system based on data deduplication technology - Google Patents

Network resource prefetching method and system based on data deduplication technology

Info

Publication number
CN109981659B
Authority
CN
China
Prior art keywords
request
module
resource
server
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910251873.2A
Other languages
Chinese (zh)
Other versions
CN109981659A (en)
Inventor
姚瑶
王战红
丁颖
王会霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Institute of Technology
Original Assignee
Zhengzhou Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Institute of Technology
Priority to CN201910251873.2A
Publication of CN109981659A
Application granted
Publication of CN109981659B
Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/56 Provisioning of proxy services
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/56 Provisioning of proxy services
    • H04L67/568 Storing data temporarily at an intermediate stage, e.g. caching
    • H04L67/5681 Pre-fetching or pre-delivering data based on network characteristics
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/22 Parsing or analysis of headers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a network resource prefetching method and system based on data deduplication technology. The method comprises the following steps: when a client sends an access request to a server, a proxy server records the user's network access behavior information and extracts an access log; access rules are obtained by Web mining and analysis of the network log and by extracting network behavior characteristics; a prediction engine uses a prediction algorithm to determine in advance the network resources most likely to be accessed next and prefetches them into a cache; and, because the cache size is limited, the prefetched resources are processed by the data deduplication technology before being stored in the cache. On the premise of maintaining prefetching efficiency, the invention can save storage space, improve the data transmission rate, reduce network delay, relieve network traffic pressure during access peaks, save bandwidth, and improve system utilization.

Description

Network resource prefetching method and system based on data deduplication technology
Technical Field
The invention relates to the technical field of networks, in particular to a network resource prefetching method and system based on a data deduplication technology.
Background
With the rapid growth of internet information and users, how to improve the quality of network service and accelerate the WWW is a problem that urgently needs to be solved. Web caching, Web prefetching, and data deduplication can all effectively reduce network delay. Web caching is based on the principle of temporal locality and uses an efficient replacement algorithm to cache resources that a user is likely to access. It is applied in network environments such as proxy servers, P2P networks, and mobile networks, but its benefit is limited by the hit rate. Web prefetching tries to fetch resources actively before the user requests them, which improves the hit rate to some extent and reduces access delay. At the same time, prefetching must be carefully controlled; otherwise performance degrades considerably, which defeats its purpose. Data deduplication aims to detect duplicate data and reclaim the space it occupies. It has been found that when two versions of a resource refer to the same key, 55% of the data is redundant; when the reference source is academic, the degree of repetition is as high as 87%. By exploiting the information redundancy among data objects, a space utilization far higher than that of traditional compression and incremental backup methods can be obtained, the bandwidth consumed by the transmitted bytes is reduced, and network delay is reduced. Combining Web prefetching with data deduplication is therefore of great significance for effectively reducing network delay.
Existing Web prefetching techniques for reducing expected latency were first applied in the Mozilla Firefox browser and were later adopted by the Google search engine. The Google Web Accelerator software works together with a Firefox or IE browser to realize browser-based prefetching. Commercial software involving prefetching includes Robtex Viking Server and AllegroSurf, but no specific prefetching scheme is disclosed for either. Prefetching is, however, speculative and sometimes consumes additional bandwidth, so Web prefetching needs to be used carefully. It is because of this potential bandwidth limitation that Web prefetching has not been very promising for commercial applications.
Disclosure of Invention
Aiming at the defects and problems in the prior art, the invention provides a network resource prefetching method and system based on a data deduplication technology, which can improve prefetching efficiency and reduce network delay by a method of reducing transmission data redundancy.
The technical scheme adopted by the invention for solving the technical problems is as follows: a network resource prefetching method based on data deduplication technology comprises the following steps:
firstly, connecting a proxy server between a client and a server, the proxy server recording the user's network access behavior information and extracting an access log while the client sends an access request to the server;
secondly, the proxy server performs Web mining and analysis on the network access log, extracts user behavior characteristics, and acquires network access rules; the step of mining the user's access preferences from the access log so as to extract the user's network access behavior characteristics comprises: performing data cleaning preprocessing on the access log, removing records of failed accesses and uncacheable objects from the log file, and extracting the user's browsing characteristics from the preprocessed network access sequence;
meanwhile, a prediction engine uses a prediction algorithm to determine in advance the network resources the user is most likely to access next and prefetches them into a cache; each time a resource is requested, the prediction engine predicts the pages likely to be accessed, generates a series of URLs of the most recently accessed resources according to the prediction algorithm, and puts the result into a decision database; the user's browsing characteristics are described by a Markov chain model, a Markov tree is used to model the user's browsing behavior on web pages, and a prediction algorithm based on access probability is used to predict the access request the user is most likely to send next;
finally, the resources prefetched into the cache are processed by the data deduplication technology before being stored in the cache; the step of performing data deduplication processing on the prefetched resources comprises:
the client-side data deduplication module (CDM) runs in the client browser, stores the latest network resources, and indicates to the SDM module on the server side, by unique identifier, which resources it holds;
the server-side data deduplication module (SDM) assembles the data blocks of the final response; when the SDM receives a request for a given resource, it retrieves the custom request header, sent by the CDM, that carries the reference resource identifiers; the SDM then fetches the resource from the server; after the full response header and data have been received, the SDM assigns a new identifier to the resource, divides the resource data into blocks, and stores the meta-information of the blocks in a data store; in this data store the SDM keeps the meta-information of all blocks of all versions of a resource, indexed by block hash;
after the CDM receives the response, it reconstructs the original resource from all the data, copying the referenced blocks from locally cached resources and copying the non-redundant data content from the received response.
A network resource prefetching system based on data deduplication technology is characterized in that a simulator system frame is additionally arranged between a user path and a Web server and comprises a client and a proxy server, the client can prefetch user behaviors of a client browser, the client is connected with the proxy server, and the proxy server is connected with the Web server;
the client comprises 6 modules and 2 storage files:
a read path module: reading a request sequence of a user, wherein a data structure is a first-in first-out queue;
the prefetch management module: reads the access queue of the recording module, checks the prefetch object pool, and confirms whether a resource has been prefetched; if the requested resource has not been prefetched, the request is sent to the server, and the prefetch management module creates a number of user request threads and waits for new requests; when a response resource is received from the server, the prefetch management module checks whether its URL is in the prefetch queue, and if so, removes the URL from the queue and inserts it into the prefetch object pool; the prefetch management module also checks whether the request queue is empty, and if so, prefetch requests are allowed to be sent to the server until a new user request arrives; when a new client request is inserted into the queue, the prefetch management module implicitly deletes the prefetched resources and clears the implicit queue data store;
a user request module: receives a request from the prefetch management module and passes it to the request module; when a response resource is received from the server, the user request module inserts the queue carried in the response header into the prefetch queue data store and inserts the URL of the resource into the prefetch object pool;
the request module: connects to the Web server and is responsible for handling the underlying communication;
a CDM module: the client-side data deduplication module, which intercepts the HTTP request messages sent by the client's user request module or prefetch request module, queries the resource version numbers to obtain the resource identifiers of all resources in the client cache, and then notifies the communication interception module;
a communication interception module: adds a custom header 'X-vrs' to the message, attaches it to the HTTP request headers, and passes the request to the request module, which finally transmits it to the server;
the prefetch queue: stores the object information that the server tells the client to prefetch;
the prefetch object pool: stores all prefetched objects and acts like the user's browser cache;
the proxy server side includes:
a monitoring module: waiting for a thread queue connected to a client, and giving a port number;
a server connection module: processing the connection between the client and the server, and transmitting the resource data of the server to the client;
a communication interception module: intercepting the most original HTTP response from a server end, storing the HTTP response in a temporary buffer area and preprocessing the HTTP response;
an SDM module: the server-side data deduplication module executes a data splitting process of the message entity transmitted by the communication interception module;
a communication recombination module: preparing and sending a copy version of the response message; combined response header information: updating/creating Content-Length and adding new entity message data Length; adding two new header information resource version number identifiers and metadata lengths;
a prediction engine module: pages that are likely to be visited are predicted each time a resource is requested, a series of URLs for the most recently visited resource are generated according to a prediction algorithm, and the results are placed in a decision database.
The invention has the beneficial effects that: aiming at the defect that the prior Web prefetching is limited by bandwidth, the invention provides an improved method for a Web prefetching system, which can improve the prefetching efficiency and reduce the network delay by reducing the redundancy of transmission data.
On one hand, the invention provides a network resource prefetching method, in which logs of Web requests are obtained through a Squid proxy server and analyzed to obtain access paths, and the access paths are input into the client module to simulate the access behavior of users. The client sends requests to the server at a specified time interval; the server connection module of the proxy server receives the request messages in sequence and forwards them to the server; when the HTTP response headers are received, the response messages are intercepted, analyzed, and passed to the prediction engine module. The prediction engine module uses a popularity-based prediction algorithm to compute the pages accessed most frequently in the most recent period and puts the result into a decision database; after all response headers have been received, the state update database is notified so as to prepare for the next prediction. Finally, the server connection module sends the complete response data to the client.
On the other hand, the invention provides a network resource prefetching system, which can analyze the resource and perform data deduplication processing before prefetching the downloaded resource to the cache, and a communication interception module, an SDM module and a communication recombination module are added at a proxy server; the resource version module, the CDM module and the communication interception module are added at the client, so that the network access requirement of a user can be quickly responded, the network service quality is improved, the access delay is reduced, the bandwidth is effectively saved, and the system utilization rate and the prefetching efficiency are improved.
Drawings
Fig. 1 is a diagram of an application scenario of a prefetching method for a Web page according to an embodiment of the present invention.
FIG. 2 is a flowchart of a pre-fetch system for Web pages according to an embodiment of the present invention.
Fig. 3 is a diagram of a client module according to an embodiment of the present invention.
FIG. 4 is a block diagram of a proxy server side of the prefetch system according to an embodiment of the present invention.
Fig. 5 is a diagram of a data deduplication architecture.
Fig. 6 is an algorithm diagram of the SDM module in accordance with an embodiment of the present invention.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
Example 1: a simulator system frame is additionally arranged between a user path and a Web server, wherein the simulator system frame comprises a client and a proxy server end, the client can prefetch the user behavior of a client browser, the client is connected with the proxy server end, and the proxy server end is connected with the Web server.
Fig. 1 is a diagram of an application scenario of a Web page prefetching system according to an embodiment of the present invention. In fig. 1, a client 101 issues a request to a proxy server 102 to access a Web server resource. When the main process of the proxy server 102 monitors that the request is sent by the client A, a sub-process is created to deal with the request sent by the client A; and the master process continues listening. Establishing connection between the created proxy server subprocess and the client 101, reading and analyzing a client request, and then checking a currently received request according to an access rule list preset on the proxy server; if the request satisfies the rule constraint, the proxy cache may be looked up for the presence of the required information and subsequent information requests processed locally. Therefore, the client can obtain the expected resources of the client more quickly, and the bandwidth is saved.
As shown in fig. 3, the client includes 6 modules and 2 storage files. The modules are a read path module 31, a prefetch management module 32, a user request module 33, a CDM module 34, a communication interception module 35 and a request module 36 in sequence; the storage files are a prefetch queue 37 and a prefetch object pool 38, respectively.
The read path module 31 records the access behavior of the user, including the IP address or host name of the user, the time when the request is sent, the method of the request (GET, POST, etc.), the path (URL) of the access page, the status code returned by the server, and the number of bytes sent in response.
The prefetch management module 32 selects an appropriate prediction algorithm to predict the user requests that need prefetching. It is also responsible for checking the prefetch object pool to determine whether a requested resource has already been prefetched.
The user request module 33 receives the request from the prefetch management module 32 and passes it to the request module.
The CDM module 34 performs client-side data deduplication. It intercepts the HTTP request message sent by the user request module, queries the resource version numbers to obtain the resource identifiers of all resources in the client cache, and finally notifies the communication interception module 35.
The communication interception module 35 is responsible for adding a custom header "X-vrs" to the request message, transmitting it to the request module and finally to the server.
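As a minimal sketch of this step, the function below attaches the custom header to a request. The patent names the header "X-vrs" but does not specify how its value is encoded, so the comma-separated list of identifiers, the function name `add_vrs_header`, and the sample identifiers are all assumptions made for illustration.

```python
def add_vrs_header(headers, cached_resource_ids):
    """Return a copy of the request headers with the custom X-vrs header.

    Assumption: identifiers are joined with commas; the wire format of
    the header value is not specified in the source text.
    """
    out = dict(headers)
    if cached_resource_ids:
        # Only announce references when the client cache holds something.
        out["X-vrs"] = ",".join(str(rid) for rid in cached_resource_ids)
    return out


request_headers = add_vrs_header({"Host": "example.org"}, [17, 42])
```

Returning a copy rather than mutating the caller's headers keeps the interception step free of side effects on the original request.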
The request module 36 is used to connect to a Web server and is responsible for handling the underlying communication.
The prefetch queue 37 is used to store object information that the server tells the client that prefetching is required.
The prefetch object pool 38 is used to store all objects that have been prefetched and functions like a user browser cache.
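The interplay between the request queue, the prefetch queue 37 and the prefetch object pool 38 can be illustrated with a small single-threaded model. This is only a sketch under stated assumptions: the class `PrefetchManager`, the `fetch` callable (which returns a body plus the URLs the server suggests prefetching) and the rule that a new user request discards pending prefetch entries are one reading of the description, not the patented implementation, which uses multiple request threads.

```python
from collections import deque


class PrefetchManager:
    """Toy model of the client's prefetch queue and prefetch object pool."""

    def __init__(self, fetch):
        self.fetch = fetch              # hypothetical callable: url -> (body, hints)
        self.requests = deque()         # FIFO queue of user requests
        self.prefetch_queue = deque()   # URLs the server suggested prefetching
        self.pool = {}                  # prefetched objects, like a browser cache

    def user_request(self, url):
        # Assumed policy: a new user request takes priority, so pending
        # prefetch entries are dropped before the request is queued.
        self.prefetch_queue.clear()
        self.requests.append(url)

    def step(self):
        """Serve one user request, or one prefetch if no user request waits."""
        if self.requests:
            url = self.requests.popleft()
            if url in self.pool:
                return ("hit", url, self.pool[url])
            body, hints = self.fetch(url)
            self.prefetch_queue.extend(h for h in hints if h not in self.pool)
            return ("miss", url, body)
        if self.prefetch_queue:
            url = self.prefetch_queue.popleft()
            body, _ = self.fetch(url)
            self.pool[url] = body       # move the object into the pool
            return ("prefetch", url, body)
        return ("idle", None, None)


site = {"/a": ("A", ["/b"]), "/b": ("B", [])}
pm = PrefetchManager(lambda u: site[u])
pm.user_request("/a")
events = [pm.step()[0], pm.step()[0]]   # user miss, then idle-time prefetch
pm.user_request("/b")
events.append(pm.step()[0])             # served from the prefetch object pool
```

In the usage lines, the request for "/a" misses, the idle step prefetches "/b", and the later user request for "/b" is answered from the pool without contacting the server.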
As shown in fig. 4, the proxy server is used for connecting the server and the client, and includes a listening module 41, a server connection module 42, a request restructuring module 43, an SDM module 44, a communication intercepting module 45, a prediction engine module 46, and a status update library 47.
The monitoring module 41 is mainly responsible for waiting, on a given port number, for the thread queue of client connections.
The server connection module 42 is mainly responsible for handling the connection between the client and the server. In particular, when processing a response, it receives the original HTTP response header first and forwards the header to the client; all resource data from the server side is then forwarded to the client unchanged.
The communication interception module 45 mainly intercepts the original HTTP response from the server side, stores it in a temporary buffer, and preprocesses it.
The SDM module 44 is the server-side data deduplication module. It executes the data splitting process on the message entity forwarded by the communication interception module. The splitting process is completed in two steps: block partitioning and data deduplication.
The request restructuring module 43 prepares and sends a duplicate version of the response message. It assembles the response header information by updating or creating Content-Length with the length of the new entity message data and by adding two new headers, X-vrs (resource version number identifier) and X-mtd (metadata length).
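The header changes described here can be sketched as follows. The function name `recombine_headers` and the choice to represent header values as decimal strings are assumptions; the source only states that Content-Length is updated or created and that X-vrs and X-mtd are added.

```python
def recombine_headers(original_headers, resource_id, metadata, payload):
    """Build the headers and body of the rewritten response.

    metadata and payload are bytes; the body sent is metadata + payload.
    Header value formats are assumptions made for this sketch.
    """
    headers = dict(original_headers)
    # Content-Length must describe the new entity, not the original one.
    headers["Content-Length"] = str(len(metadata) + len(payload))
    headers["X-vrs"] = str(resource_id)      # resource version identifier
    headers["X-mtd"] = str(len(metadata))    # metadata length in bytes
    return headers, metadata + payload


new_headers, body = recombine_headers(
    {"Content-Type": "text/html", "Content-Length": "1000"},
    resource_id=7,
    metadata=b"meta",
    payload=b"new-data",
)
```

With X-mtd in hand, the receiver can split the body into the metadata segment and the non-redundant data without any further framing.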
The prediction engine module 46: the main task is to predict the pages that will likely be accessed each time a resource is requested. The module will generate a series of URLs for the most recently accessed resources based on a predictive algorithm and place the results in a decision database.
Example 2:
a network resource prefetching method based on data deduplication technology comprises the following steps:
Firstly, a proxy server is connected between a client and a server. While the client sends an access request to the server, the proxy server records the user's network access behavior information and extracts an access log. The user network access behavior information in the log file mainly comprises the access time of the user's request, the user's IP address, the file name or script of the accessed resource, the parameter field, and the like.
Secondly, the proxy server performs Web mining and analysis on the network access log, extracts user behavior characteristics, and acquires network access rules. The step of mining the user's access preferences from the access log so as to extract the user's network access behavior characteristics comprises: performing data cleaning preprocessing on the access log, removing records of failed accesses and uncacheable objects from the log file, and extracting the user's browsing characteristics from the preprocessed network access sequence.
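The cleaning step can be sketched as a simple filter over parsed log records. The criteria used here are assumptions, since the patent does not define them: a request counts as successful if its status is 2xx or 304, and an object counts as uncacheable if it was not fetched with GET or its URL carries a query string or a script extension.

```python
def is_cacheable(record):
    """Heuristic cacheability test (assumed rules, not from the patent)."""
    if record["method"] != "GET":
        return False
    url = record["url"]
    return "?" not in url and not url.endswith((".cgi", ".php"))


def clean_log(records):
    """Drop failed accesses and uncacheable objects, keeping log order."""
    return [
        r for r in records
        if (200 <= r["status"] < 300 or r["status"] == 304) and is_cacheable(r)
    ]


raw = [
    {"method": "GET", "url": "/index.html", "status": 200},
    {"method": "GET", "url": "/missing.html", "status": 404},
    {"method": "POST", "url": "/login", "status": 200},
    {"method": "GET", "url": "/search?q=x", "status": 200},
    {"method": "GET", "url": "/logo.png", "status": 304},
]
cleaned = clean_log(raw)
```

Order is preserved because the later path mining works on access sequences.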
Meanwhile, a prediction engine uses a prediction algorithm to determine in advance the network resources the user is most likely to access next and prefetches them into a cache. Each time a resource is requested, the prediction engine predicts the pages likely to be accessed, generates a series of URLs of the most recently accessed resources according to the prediction algorithm, and puts the result into a decision database. The user's browsing characteristics are described by a Markov chain model, a Markov tree is used to model the user's browsing behavior on web pages, and a prediction algorithm based on access probability is used to predict the access request the user is most likely to send next.
Finally, the resources prefetched into the cache are processed by the data deduplication technology before being stored in the cache; the step of performing data deduplication processing on the prefetched resources comprises the following steps:
The client-side data deduplication module (CDM) runs in the client browser, stores the latest network resources, and indicates to the SDM module on the server side, by unique identifier, which resources it holds.
The server-side data deduplication module (SDM) assembles the data blocks of the final response. When the SDM receives a request for a given resource, it retrieves the custom request header, sent by the CDM, that carries the reference resource identifiers. The SDM then fetches the resource from the server. After the full response header and data have been received, the SDM assigns a new identifier to the resource, divides the resource data into blocks, and stores the meta-information of the blocks in a data store. In this data store the SDM keeps the meta-information of all blocks of all versions of a resource, indexed by block hash.
After the CDM receives the response, it reconstructs the original resource from all the data, copying the referenced blocks from locally cached resources and copying the non-redundant data content from the received response.
As shown in fig. 1, a client 101 issues a request to a proxy server 102 to access a Web server resource. When the main process of the proxy server 102 monitors that the request is sent by the client A, a sub-process is created to deal with the request sent by the client A; and the master process continues listening. Establishing connection between the created proxy server subprocess and the client 101, reading and analyzing a client request, and then checking a currently received request according to an access rule list preset on the proxy server; if the request satisfies the rule constraint, the proxy cache may be looked up for the presence of the required information and subsequent information requests processed locally. Therefore, the client can obtain the expected resources of the client more quickly, and the bandwidth is saved.
As shown in the flowchart of the pre-fetching system for Web pages in fig. 2, in step 201, the proxy server records a user browsing log and monitors the user browsing behavior.
The browsing behavior of the user may refer to the user's access history on static information, and these records are stored in a log file on a Web server or a proxy server. Each record containing the following requested information: the user's IP address or host name, the time the request was issued, the method of the request (GET, POST, etc.), the path to the page (URL), the status code returned by the server, and the number of bytes issued in response. The Web log file is readily available from a Web server or proxy server.
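The fields listed above match the widely used Common Log Format, so a record can be parsed as in the sketch below. This is an assumption for illustration: the patent does not name the log format, and a Squid proxy writing its native log format would need a different pattern. The sample line and the function name `parse_line` are hypothetical.

```python
import re
from datetime import datetime

# Common Log Format: host ident user [time] "method url protocol" status bytes
CLF = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    m = CLF.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    # A "-" in the size field means no body was sent.
    rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    return rec


sample = '10.0.0.5 - - [28/Mar/2019:08:15:02 +0800] "GET /index.html HTTP/1.1" 200 5120'
record = parse_line(sample)
```

Lines that do not match are returned as None so that a caller can count or discard malformed entries during cleaning.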
In step 202, the proxy server mines the user's access preference path through the prefetch management module.
By way of example, during the morning office peak period of 8:00-9:30, the network behavior of college users is concentrated on visiting portal websites such as the campus home page, Sina, and Sohu, and this behavior can be captured in the logs recorded by the proxy server.
The path analyzer in the prefetch management module on the proxy server side mainly performs the mining of user access paths: a Markov pattern tree is constructed by analyzing access sequences, and a transition probability matrix is generated from it. A prediction method based on the Markov chain model is adopted, in which the transition probability matrix and the initial state probability vector are obtained through the Markov chain model. All client requests are buffered in a client buffer and are flushed from the buffer once the minimum sample threshold is exceeded or the session ends. Each client is allocated a separate buffer, in which its request sequence is stored. The Markov chain model is updated as the user's access paths change; the update mainly adjusts the default or current values of the matrix smoothly according to the newly added path sequences.
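A minimal sketch of the idea, assuming a first-order Markov chain estimated directly from session sequences (a simplification of the Markov pattern tree described above): transition probabilities are relative frequencies of page-to-page moves, and the predictor returns the successors whose probability reaches a threshold. The function names, the sample sessions and the threshold value of 0.3 are hypothetical.

```python
from collections import defaultdict


def build_transitions(sessions):
    """Estimate P(next | current) from a list of URL sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for current, nxt in zip(session, session[1:]):
            counts[current][nxt] += 1
    probs = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        probs[current] = {url: n / total for url, n in followers.items()}
    return probs


def predict_next(probs, current, threshold=0.3):
    """Return likely next URLs, most probable first."""
    followers = probs.get(current, {})
    ranked = sorted(followers.items(), key=lambda kv: (-kv[1], kv[0]))
    return [url for url, p in ranked if p >= threshold]


sessions = [
    ["/home", "/news", "/sports"],
    ["/home", "/news", "/tech"],
    ["/home", "/mail"],
    ["/home", "/news", "/sports"],
]
P = build_transitions(sessions)
candidates = predict_next(P, "/home")
```

Here "/news" follows "/home" in three of four sessions (probability 0.75), so it is the only prefetch candidate above the threshold; "/mail" at 0.25 is left out. Raising the threshold trades hit rate for lower bandwidth overhead, which is the balance the background section warns about.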
In step 203, the prediction engine adopts a prediction algorithm to predict the most likely websites to be accessed at the next time, and the websites are downloaded into the cache in advance when the network is idle.
According to the regularity of network access behavior, the network proxy cache device analyzes the network operating state at the same time point on the next occasion (8:00-9:30) and predicts the most likely accessed websites according to a prediction algorithm. When the network is idle, the relevant network resources are prefetched in advance, for example from the NetEase website, and stored in the cache. When the user makes an access request, the requested resource is obtained directly from the cache, which avoids adding network load in the peak period, saves network bandwidth, and ensures service quality.
In step 204, the data deduplication system employed is derived from the dedupHTTP system. The core work of a data deduplication system is to analyze the content of a file, and its basic unit of study is an abstract data object called a chunk. The main work of the system consists of five stages: block division, feature value calculation, identity or similarity detection, redundancy elimination, and data storage. The deduplication system constructed in this embodiment is composed of two modules: a client-side deduplication module (CDM) and a server-side deduplication module (SDM). The CDM and the SDM run in the client browser and on the Web server, respectively. The CDM stores the most recent resource versions and tells the SDM, by unique identifier, which resource versions it holds.
With respect to data deduplication, as shown in the data deduplication architecture of fig. 5, when the SDM receives a request for a given resource, it retrieves the custom request header, sent by the CDM, that carries the identifiers of the reference resources. The SDM then fetches the resource from the server. After the full response header and data have been received, the SDM assigns a new identifier to the resource. The resource data is divided into blocks, and the meta-information of the blocks is stored in a data store. In this data store the SDM keeps the meta-information of all blocks of all versions of a resource, indexed by block hash. The SDM then traverses all the blocks of the resource. For each block, it looks for a block with the same hash in the reference resources. If a block has no match in any reference resource, the SDM searches among the blocks of the current response. Redundancy is therefore detected not only against resources cached by the CDM, but also within the resource being sent to the CDM. The SDM assembles the final response for the CDM. The response starts with a metadata segment, whose size (in bytes) is stored in a custom HTTP response header. The metadata begins with the resource identifier of the response, followed by one four-field tuple per reference; each tuple carries the information the CDM needs to find a redundant block in its cache. Specifically, the fields are the offset of the block in the current response (Offset), the identifier of the resource that holds the block, the offset of the block inside that resource (InOffset), and the length of the block. The tuples are arranged in the order of the corresponding blocks in the original response. After the metadata segment, the non-redundant data content is appended, in the same order as in the original response.
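The server-side encoding can be sketched as follows. For brevity the sketch uses tiny fixed-size blocks and SHA-1 block hashes; both are assumptions made for illustration (the embodiment divides blocks with the LBFS algorithm), as are the function names and the in-memory representation of the metadata as Python tuples.

```python
import hashlib

BLOCK = 4  # tiny fixed block size, for illustration only (assumption)


def split(data):
    """Yield (offset, block) pairs of fixed-size blocks."""
    return [(off, data[off:off + BLOCK]) for off in range(0, len(data), BLOCK)]


def sdm_encode(new_id, new_data, references):
    """Encode new_data against the resources the client reports holding.

    references: {resource_id: bytes}. Returns (tuples, fresh), where each
    tuple is (offset, resource_id, in_offset, length) and fresh is the
    non-redundant data in original order.
    """
    index = {}
    for rid, data in references.items():
        for off, block in split(data):
            index.setdefault(hashlib.sha1(block).digest(), (rid, off))
    tuples, fresh = [], bytearray()
    for off, block in split(new_data):
        key = hashlib.sha1(block).digest()
        if key in index:
            rid, in_off = index[key]
            tuples.append((off, rid, in_off, len(block)))
        else:
            fresh += block
            # Later blocks of this same response may repeat this one.
            index[key] = (new_id, off)
    return tuples, bytes(fresh)


tuples, fresh = sdm_encode(2, b"AAAAXXXXCCCCXXXX", {1: b"AAAABBBBCCCC"})
```

In the example, the first and third blocks are found in cached resource 1, the second block is new and is sent as data, and the fourth block is a repeat of the second, so it is encoded as a reference into the response itself (resource 2).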
Because the tuples and the non-redundant data both preserve the original order, the CDM needs only the first offset in each tuple (Offset) to know where to splice the redundant data into the non-redundant data. When the CDM receives the response, it reconstructs the original resource from all the data, which includes copying the referenced chunks from locally cached resources and copying the non-redundant data content of the received response. The CDM does not store any chunk meta-information, nor does it need to send or store hashes.
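As a non-limiting illustration, the reconstruction performed by the CDM can be sketched in Python as follows. The function name `cdm_reconstruct` and the dictionary `cache` are hypothetical. The sketch assumes that each tuple is given as (Offset, resource identifier, InOffset, length) and that a tuple carrying the identifier of the response itself refers to bytes already rebuilt earlier in the same response.

```python
def cdm_reconstruct(new_id, refs, literal, cache):
    """Rebuild the original resource from tuples and non-redundant bytes.

    refs    : tuples (offset, resource_id, in_offset, length), in response order
    literal : non-redundant bytes, in response order
    cache   : dict mapping resource identifier -> cached resource bytes
    """
    out = bytearray()
    lit_pos = 0
    for offset, resource_id, in_offset, length in refs:
        # Copy non-redundant bytes until the output reaches this tuple's Offset.
        gap = offset - len(out)
        out += literal[lit_pos:lit_pos + gap]
        lit_pos += gap
        # Copy the redundant chunk from the cache (or from the bytes rebuilt so far).
        source = out if resource_id == new_id else cache[resource_id]
        out += source[in_offset:in_offset + length]
    # Whatever non-redundant data remains follows the last referenced chunk.
    out += literal[lit_pos:]
    cache[new_id] = bytes(out)
    return bytes(out)
```

Note that the sketch uses no hashes and keeps no chunk meta-information on the client side: the Offset field alone determines where each cached chunk is placed.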
The algorithm of the SDM module in this embodiment is shown in Fig. 6. In principle, the data deduplication system is a storage system based on content addressing: data are compared and analyzed for identity and similarity, and duplicate data are deleted so that space is saved. The key is a reasonable choice of data partitioning method.
In the data deduplication system used herein, the SDM divides each resource version that the server returns to the client into indexed chunks. The chunk division algorithm employed is the LBFS algorithm.
Step one: when a client sends a request, the server divides the resource into several indexed chunks. This makes it easy to determine which parts of the resource the client already holds and which parts are new and need to be sent.
The division starts by computing a hash at each byte position of the resource content. These hashes represent small pieces of the resource and are computed with a sliding (rolling) hash function.
Step two: the hashes that represent the resource are selected. Although all hashes could be kept, this is impractical in terms of memory, since there is one hash for each byte of content. Therefore, suitable hashes are selected as the boundaries of larger chunks.
In this step, a window-based selection mechanism is used, which has been shown to provide better redundancy detection. A minimum and a maximum chunk size are enforced because, relative to the expected chunk size, a content-based method may otherwise produce chunks that are too large or too small.
After the chunk boundaries are selected, each of the larger chunks is hashed with the 64-bit MurmurHash. This method is faster than cryptographic hashes such as MD5 or SHA-1 while keeping the collision rate low.
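As a non-limiting illustration, the two steps above can be sketched in Python as follows. This is not the LBFS implementation itself: a simple polynomial rolling hash stands in for the Rabin fingerprint used by LBFS, an 8-byte BLAKE2b digest stands in for the 64-bit MurmurHash (which is not part of the Python standard library), and the constants `WINDOW`, `MIN_SIZE`, `MAX_SIZE` and `MASK`, as well as all function names, are hypothetical values chosen only to keep the example small.

```python
import hashlib

WINDOW = 16           # bytes covered by the rolling hash (hypothetical)
MIN_SIZE = 32         # minimum chunk size in bytes (hypothetical)
MAX_SIZE = 256        # maximum chunk size in bytes (hypothetical)
MASK = (1 << 5) - 1   # boundary when the low 5 bits of the hash are all ones
BASE = 257
MOD = (1 << 61) - 1


def chunk_boundaries(data: bytes):
    """Yield (start, end) spans of content-defined chunks covering data."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the oldest byte in the window
    start = 0
    h = 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Slide the window: drop the byte that falls out of it.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        size = i + 1 - start
        if size < MIN_SIZE:
            continue  # never cut a chunk below the minimum size
        if (h & MASK) == MASK or size >= MAX_SIZE:
            yield start, i + 1
            start = i + 1
    if start < len(data):
        yield start, len(data)  # trailing chunk, possibly shorter than MIN_SIZE


def fingerprint(chunk: bytes) -> str:
    # Stand-in for the 64-bit MurmurHash of a larger chunk.
    return hashlib.blake2b(chunk, digest_size=8).hexdigest()


def chunk_resource(data: bytes):
    """Return a list of (fingerprint, start, end) for one resource version."""
    return [(fingerprint(data[s:e]), s, e) for s, e in chunk_boundaries(data)]
```

Because a boundary depends only on the bytes inside the sliding window, subject to the minimum and maximum sizes, an edit in one part of a resource tends to leave the chunk boundaries in other parts unchanged, which is what allows unchanged chunks of a new version to be matched against cached versions.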

Claims (3)

1. A network resource prefetching method based on data deduplication technology, characterized in that the method comprises the following steps:
first, a proxy server is connected between a client and a server, and while the client sends access requests to the server, the proxy server records the network access behavior information of the user and extracts an access log;
second, the proxy server performs Web mining and analysis on the network access log, extracts user behavior characteristics, and obtains network access patterns; the access preferences of the user are mined from the access log so as to extract the network access behavior characteristics of the user, specifically: the access log is cleaned in a preprocessing step that removes records of failed accesses and of objects that cannot be cached from the log file, and user browsing characteristics are extracted from the preprocessed network access sequence;
meanwhile, a prediction engine uses a prediction algorithm to determine in advance the network resources that the user is most likely to access next, and these resources are prefetched into a cache; each time a resource is requested, the prediction engine predicts the pages that are likely to be accessed, generates, according to the prediction algorithm, a series of URLs of the resources which are accessed most recently, and puts the result into a decision database; the browsing behavior of the user on web pages is modeled with a Markov tree, and a prediction algorithm based on access probability predicts the access request that the user is most likely to send next;
finally, the prefetched resources are processed by the data deduplication technology and then stored in the cache; the data deduplication processing of the prefetched resources comprises the following:
the client-side data deduplication module (CDM) runs in the client browser and is used for storing the latest network resources and for indicating, by means of a unique identifier, how each resource corresponds to the server-side data deduplication module (SDM);
the server-side data deduplication module (SDM) is used for assembling the data chunks of the final response; when the SDM receives a request for a given resource, it retrieves the custom request header, sent by the CDM, that carries the reference resource identifiers; the SDM then fetches the resource from the server; after the complete response header and data are received, the SDM assigns a new identifier to the resource, divides the resource data into chunks, and stores the meta-information of the chunks in a data store; in this data store the SDM keeps the meta-information of all chunks of all versions of the resources, indexed by chunk hash;
after the CDM receives the response, it reconstructs the original resource from all the data, including copying the referenced chunks from the locally cached resources and copying the non-redundant data content of the received response.
2. The network resource prefetching method based on data deduplication technology of claim 1, characterized in that the user network access behavior information in the log file comprises the access time of the user access request, the IP address of the user, the file name or script of the accessed resource, and a parameter field.
3. A network resource prefetching system based on data deduplication technology, characterized in that: a simulator system framework is arranged between a user path and a Web server and comprises a client and a proxy server; the client can prefetch the user behavior of a client browser; the client is connected to the proxy server, and the proxy server is connected to the Web server;
the client comprises 6 modules and 2 storage files:
the 6 modules are respectively:
a read path module: reading a request sequence of a user, wherein a data structure generated according to the request sequence is a first-in first-out access queue;
a prefetch management module: reads the access queue and checks the prefetch object pool to confirm whether the resource has been prefetched; if the resource has not been prefetched, the request is sent to the server, and the prefetch management module creates a plurality of user request threads and waits for new requests; when a response resource is received from the server, the prefetch management module checks whether its URL is in the prefetch queue and, if so, removes the URL from the prefetch queue and inserts it into the prefetch object pool; the prefetch management module checks whether the request queue is empty and, if it is empty, allows prefetch requests to be sent to the server until a new user request arrives; when a new client request is inserted into the queue, the prefetch management module implicitly deletes the prefetched resources and clears the implicit queue data storage;
a user request module: receiving a request from the pre-fetching management module and transmitting the request to the request module; when a response resource is received from the server, the user request module inserts the queue of the response header into the prefetch queue data storage and inserts the URL of the resource into the prefetch object pool;
the request module is used for connecting a Web server and is responsible for processing bottom layer communication;
a CDM module: the client-side data deduplication module, which intercepts HTTP request messages sent by the user request module or the prefetch request module of the client, and queries the resource version numbers to obtain the resource identifiers of all resources in the client cache, which the CDM passes to a communication interception module;
a communication interception module: adds a custom header 'X-vrs' carrying this information, attaches the header to the HTTP request headers, passes the request to the request module, and finally sends it to the server;
the 2 storage files are respectively:
a prefetch queue: stores the information on the objects that the server tells the client to prefetch;
a prefetch object pool: stores all the prefetched objects and acts like the browser cache of a user;
the proxy server side includes:
a monitoring module: waiting for a thread queue connected to a client, and giving a port number;
a server connection module: processing the connection between the client and the server, and transmitting the resource data of the server to the client;
a communication interception module: intercepts the original HTTP response from the server, stores it in a temporary buffer, and preprocesses it;
an SDM module: the server-side data deduplication module, which performs the data splitting of the message entity passed on by the communication interception module;
a communication recombination module: prepares and sends a copy version of the response message, and assembles the response header information by updating or creating Content-Length with the length of the new entity message data and by adding two new headers, namely the resource version number identifier and the metadata length;
a prediction engine module: each time a resource is requested, predicts the pages that are likely to be visited, generates a series of URLs for the most recently visited resources according to a prediction algorithm, and places the results in a decision database.
CN201910251873.2A 2019-03-29 2019-03-29 Network resource prefetching method and system based on data deduplication technology Active CN109981659B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910251873.2A CN109981659B (en) 2019-03-29 2019-03-29 Network resource prefetching method and system based on data deduplication technology

Publications (2)

Publication Number Publication Date
CN109981659A CN109981659A (en) 2019-07-05
CN109981659B true CN109981659B (en) 2021-07-09

Family

ID=67081749

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910251873.2A Active CN109981659B (en) 2019-03-29 2019-03-29 Network resource prefetching method and system based on data deduplication technology

Country Status (1)

Country Link
CN (1) CN109981659B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110609714A (en) * 2019-07-31 2019-12-24 百度在线网络技术(北京)有限公司 Data prefetching method, device and equipment and storage medium
CN111586020B (en) * 2020-04-29 2021-09-10 北京天融信网络安全技术有限公司 Probability model construction method and device, electronic equipment and storage medium
CN112953894B (en) * 2021-01-26 2022-05-20 复旦大学 Multi-path request copying and distributing system and method
CN113064886B (en) * 2021-03-04 2023-08-29 广州中国科学院计算机网络信息中心 Method for storing and marking management of identification resource
CN114221953A (en) * 2021-11-29 2022-03-22 平安证券股份有限公司 Resource acquisition method, device, equipment and storage medium
CN114785858B (en) * 2022-06-20 2022-09-09 武汉格蓝若智能技术有限公司 Active resource caching method and device applied to mutual inductor online monitoring system
CN116112562A (en) * 2023-02-15 2023-05-12 厦门大学 Synergistic block prefetching method based on P2P network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105574004A (en) * 2014-10-10 2016-05-11 阿里巴巴集团控股有限公司 Webpage deduplication method and device
CN106547764A (en) * 2015-09-18 2017-03-29 北京国双科技有限公司 The method and device of web data duplicate removal
CN108769253A (en) * 2018-06-25 2018-11-06 湖北工业大学 A kind of adaptive prefetching control method of distributed system access performance optimization

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8380680B2 (en) * 2010-06-23 2013-02-19 International Business Machines Corporation Piecemeal list prefetch
US10977321B2 (en) * 2016-09-21 2021-04-13 Alltherooms System and method for web content matching

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
End-to-end data deduplication;Ricardo Filipe,Joao Barreto;《2011 IEEE International Symposium on Network Computing and Applications》;20110827;全文 *

Also Published As

Publication number Publication date
CN109981659A (en) 2019-07-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant