CN110995835B - Method for collecting purchased electronic resource database access records in real time - Google Patents

Info

Publication number
CN110995835B
Authority
CN
China
Prior art keywords
page
database
terminal
log
script
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911215102.4A
Other languages
Chinese (zh)
Other versions
CN110995835A (en)
Inventor
方旭光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Meda Electronics Co ltd
Original Assignee
Hangzhou Meda Electronics Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Meda Electronics Co ltd
Priority to CN201911215102.4A
Publication of CN110995835A
Application granted
Publication of CN110995835B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/56 Provisioning of proxy services
    • H04L67/565 Conversion or adaptation of application format or content
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/10 Network architectures or network communication protocols for network security for controlling access to devices or network resources
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/55 Push-based network services

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method for collecting the access records of a purchased electronic resource database in real time, comprising the following steps: (1) the terminal first completes terminal authentication; (2) a request is sent to a database gateway according to the terminal identity and the information of the database to be accessed; (3) the database gateway forwards the request to the actual database website; (4) analysis rules define the positions of the elements of different databases as displayed in the terminal browser, so that the script file can capture pages according to the rules; (5) after the returned page is opened in the terminal browser, a page analysis script runs automatically and extracts the content of the article seen in the page according to the rules defined in the js script; (6) after extraction is finished, a log recording interface is called automatically in the terminal browser to complete log writing. The invention has the following characteristics: (1) high real-time performance; (2) it is not affected by page encryption; (3) because the work runs on the terminal, it can support large numbers of terminals; (4) high applicability; and (5) convenient maintenance.

Description

Method for collecting purchased electronic resource database access records in real time
Technical Field
The invention relates to a technology for tracking terminal usage traces across heterogeneous electronic resource databases. With this technology, the main actions of a terminal accessing different databases, including browsing details and downloading full text, can be recorded in real time, and the content of the specific page viewed can be obtained accurately.
Background
At present, libraries and research institutions spend large sums on purchasing electronic resource databases. These databases come from many vendors and cover many resource types, and each vendor provides its service directly through its own online website. When a library or research institution uses these databases, the legitimacy of users is verified by IP (Internet Protocol) address, so real-name identification is not possible. In addition, every database website differs in structure and style, so the actual use of the databases and documents by terminals cannot be known. The specific defects are as follows:
1: without real-name authentication, it is not possible to know which terminals use the database;
2: the structures of the databases differ, and only some databases provide brief download logs, so it is not possible to know which terminal accessed which article;
3: the recording formats and standards of different databases differ, so the access data of the databases cannot be collected and managed uniformly.
Many companies at home and abroad have tried to solve this problem. The approaches currently in use are as follows:
Network monitoring: software and hardware equipment is added at the network exit of the school to monitor the network access of the whole school and to screen out access records of the relevant websites;
Local navigation system: a terminal access log is recorded through a locally built navigation system, and the databases accessed by the terminal are analyzed from it;
Vendor reports: access statistics are compiled from the access reports provided by the database vendors.
Because electronic resource databases vary in style and their websites generally encrypt content, the current logging approaches generally have the following problems:
(1) logs are not acquired in real time: in the network monitoring approach, access information is recorded in a log file, and an analysis tool collects and parses the log file on a schedule before extracting specific access log information, so the delay is generally several hours;
(2) log noise is high and accurate information is difficult to extract: all network access information, both valid and invalid, is recorded in the log file, and a large number of rules and operations are needed for analysis;
(3) the logged content is not fine-grained enough: log analysis reveals only the page and time of the website access, not the specific content accessed, because page content is transmitted encrypted under the HTTPS protocol;
(4) log analysis cannot be associated with a terminal, so it cannot be determined which terminal made the access;
(5) the logs provided by database vendors contain only the total download counts of journals, not the specific documents accessed or the accessing terminal.
Therefore, a logging method that can accurately record the content accessed by a terminal in real time is urgently needed, so that the terminal user and the library can know the actual usage of the purchased databases in real time.
Disclosure of Invention
The invention aims to provide a method for acquiring the access records of a purchased electronic resource database in real time.
The technical scheme adopted by the invention for solving the technical problems is as follows:
A method for collecting purchased electronic resource database access records in real time is realized by combining a database gateway forwarding technology with a page analysis technology, and comprises the following specific steps:
(1): first, the terminal completes terminal authentication;
(2): a request is sent to a database gateway according to the terminal identity and the information of the database to be accessed;
(3): the database gateway forwards the request to the actual database website;
(4): when the database website returns page data and the data passes through the database gateway, a page analysis script is dynamically embedded, that is, code referencing the script is embedded in the page code; the script file is a js file that can be edited in the cloud and is executed at the terminal to analyze the content of the web page; the js script contains database-specific analysis rules, which define the positions of the elements of different databases as displayed in the terminal browser, so that the script file can capture pages according to the rules;
(5): after the returned page is opened in the terminal browser, the page analysis script runs automatically and extracts the content of the article seen in the page according to the rules defined in the js script;
(6): after extraction is finished, a log recording interface is called automatically in the terminal browser to complete log writing.
In step 2, the ticket is a security code whose rule may be md5({userid} + {privatekey} + (int)(now.getTime()/1000)); the privatekey may be set individually for each user. The group is a role; it may be null, in which case the default permissions apply, and it may be used to apply different access policies to different users.
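The ticket rule can be sketched as follows. This is only an illustration: it assumes the three parts are concatenated as strings, encoded as UTF-8, and that the ticket is the hexadecimal MD5 digest. The function name `make_ticket` and its parameters are hypothetical and do not appear in the source.

```python
import hashlib
import time
from typing import Optional


def make_ticket(userid: str, privatekey: str, now_seconds: Optional[int] = None) -> str:
    """Return md5(userid + privatekey + unix_seconds) as a hex string.

    Assumption: the source formula divides a millisecond timestamp by 1000,
    which is taken here to mean whole seconds since the Unix epoch.
    """
    if now_seconds is None:
        now_seconds = int(time.time())
    raw = f"{userid}{privatekey}{now_seconds}"
    return hashlib.md5(raw.encode("utf-8")).hexdigest()
```

Passing `now_seconds` explicitly makes the result reproducible; omitting it uses the current time, so the ticket changes every second.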
After receiving the request, the database gateway forwards it to the database website, and when the database website returns the page, the terminal information and the analysis and extraction script are dynamically embedded in the page.
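A minimal sketch of the embedding step at the gateway, under stated assumptions: the script reference is inserted just before the closing `</body>` tag, and the terminal information is carried in a `data-userid` attribute. The function name `embed_analysis_script`, the attribute name, and the insertion point are all hypothetical; the source says only that a reference to the script and the terminal information are embedded in the page.

```python
import html
import re


def embed_analysis_script(page_html: str, script_url: str, userid: str) -> str:
    """Insert a <script> reference carrying the terminal identity into a page."""
    tag = '<script src="{}" data-userid="{}"></script>'.format(
        html.escape(script_url, quote=True),
        html.escape(userid, quote=True),
    )
    # Find the last closing body tag, ignoring case and optional whitespace.
    matches = list(re.finditer(r"</body\s*>", page_html, flags=re.IGNORECASE))
    if matches:
        pos = matches[-1].start()
        return page_html[:pos] + tag + page_html[pos:]
    # No body tag found: append the reference at the end of the page.
    return page_html + tag
```

Because only a reference is embedded, the rules inside the referenced js file can change without the gateway code changing, which matches the cloud-maintained rule file described in the source.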
In step 4, the terminal browser receives the data, displays the database content, and starts to run the analysis and extraction script embedded in the page. The script extracts the user information in the page and judges whether the page is one whose content needs to be captured; if it is a specific content page, the script captures the page content according to the rules, thereby implementing page analysis and capture.
In step 4, the elements are key information, namely the article title, journal name, and author.
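The rule-driven extraction can be illustrated with a short sketch. In the invention the rules live in a js file that runs in the terminal browser; the Python below only illustrates the rule logic, assuming rules based on character string features (a start marker and an end marker per field). The rule table `RULES`, the database id `example-db`, the markers, and the function names are all hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical rule table: database id -> field -> (start marker, end marker).
RULES: Dict[str, Dict[str, Tuple[str, str]]] = {
    "example-db": {
        "title": ('<h1 class="title">', "</h1>"),
        "journal": ('<span class="journal">', "</span>"),
        "author": ('<span class="author">', "</span>"),
    }
}


def extract_between(page: str, start: str, end: str) -> Optional[str]:
    """Return the text between the first start marker and the next end marker."""
    i = page.find(start)
    if i < 0:
        return None
    i += len(start)
    j = page.find(end, i)
    if j < 0:
        return None
    return page[i:j].strip()


def extract_article(database_id: str, page: str) -> Optional[Dict[str, str]]:
    """Apply the rules of one database; return None if the page does not match."""
    rules = RULES.get(database_id)
    if rules is None:
        return None
    fields: Dict[str, str] = {}
    for name, (start, end) in rules.items():
        value = extract_between(page, start, end)
        if value is None:
            # Assumption: a missing field means this is not an article page.
            return None
        fields[name] = value
    return fields
```

Keeping one rule set per database is what lets heterogeneous sites yield the same three standardized fields.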
After the content of the article seen in the page has been extracted according to the rules defined in the js script, the analysis and extraction script completes data extraction and pushes the data into the message queue designated by the log platform.
In step 6, the log platform receives the message, parses the pushed log content, extracts the terminal account and the log content, writes the information into the database, and completes the association with the terminal information.
The beneficial effects of the invention are as follows:
(1) high real-time performance: after the terminal finishes opening the page, the log is pushed to the log system in real time, and the terminal and the administrator can view the log in real time;
(2) no effect from page encryption: because data capture is performed at the terminal, the method is not affected by the HTTPS encryption protocol;
(3) low server load: because the work runs on the terminal, little pressure is placed on the server, and large numbers of terminals can be supported;
(4) high applicability: capture rules can be configured independently for different databases, which standardizes the data captured from heterogeneous databases;
(5) convenient maintenance: capture rules are loaded dynamically and maintained directly in the cloud, so a rule that needs adjusting can be changed in the cloud configuration.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings and embodiments.
As shown in fig. 1: (1) a page for user authentication is preferably created to authenticate the terminal identity; a navigation page of the electronic resource databases is established, and after the terminal finishes logging in, clicking the link of an electronic resource database on the page gives access to that database;
(2) a web page proxy access server is constructed to provide proxy access to the databases; the purpose of the proxy service is to implement request forwarding and URL rewriting;
(3) according to the page characteristics of the accessed electronic resource database, the position of the content to be extracted in the page is analyzed and the rules are configured, with positions marked by xpath or by character string features; the rules are written into the page analysis script (js) file;
(4) a log receiving server is built: a mysql database is installed first and a data table for storing log information is created; so that log information can be pushed over the network, a web application server such as tomcat is set up on the server;
(5) a log push interface is implemented to receive logs; it is an http protocol interface that accepts data packaged as json, and after receiving a pushed log it writes the log into the database;
(6) the logs in the database are read to analyze or display the access log of the electronic resource database; the information in a log may include the accessing user's account, access time, access IP, URL of the accessed electronic resource database, title of the accessed article, article author, and whether the access was a browse or a download.
The log push interface definition and the content of the log are as follows.
Interface definition: log pushes are accepted over the HTTPS protocol.
Entry address:
[server]/logmq
Calling method: POST
Parameter format: json, see below
Sample:
[Figure: sample json log content (images BDA0002299293220000051 and BDA0002299293220000061, not reproduced in this text)]
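The original sample is an image that is not reproduced in this text, so the following is only a hypothetical sketch of how a pushed json log might be received and stored. The field names are invented to match the log information listed in the description (account, access time, IP, database URL, article title, author, browse or download). An in-memory sqlite3 database stands in for the mysql database named in the source, and the HTTP layer is omitted.

```python
import json
import sqlite3

# Hypothetical field names; the source lists the information but not the keys.
LOG_FIELDS = ("account", "access_time", "ip", "database_url",
              "title", "author", "action")


def new_log_store() -> sqlite3.Connection:
    """Create an in-memory store with one log table (stand-in for mysql)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE access_log ("
        "account TEXT, access_time TEXT, ip TEXT, database_url TEXT, "
        "title TEXT, author TEXT, action TEXT)"
    )
    return conn


def handle_log_push(conn: sqlite3.Connection, body: str) -> None:
    """Parse one json log pushed to the interface and write it to the table."""
    record = json.loads(body)
    row = tuple(record.get(name) for name in LOG_FIELDS)
    conn.execute("INSERT INTO access_log VALUES (?, ?, ?, ?, ?, ?, ?)", row)
    conn.commit()


# Hypothetical payload; every value here is illustrative.
sample_body = json.dumps({
    "account": "user001",
    "access_time": "2019-12-02 10:00:00",
    "ip": "192.0.2.10",
    "database_url": "https://db.example.invalid/article/1",
    "title": "An Example Article",
    "author": "Example Author",
    "action": "browse",
})
```

Storing the account in the same row as the article fields is one simple way to achieve the association between log content and terminal that the method calls for.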
In the description of the present invention, it should be noted that where terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", and "outer" appear, the orientations or positional relationships they indicate are based on those shown in the drawings and are used only for convenience and simplicity of description; they do not indicate or imply that the referenced device or element must have a specific orientation or be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" as appearing herein are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless otherwise explicitly stated or limited, the terms "mounted," "coupled," and "connected" should be interpreted broadly: a connection may be fixed, detachable, or integral; it may be mechanical or electrical; and elements may be connected directly, connected indirectly through intervening media, or in internal communication with each other. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific case.
Simple substitutions without changing the inventive content of the present invention are considered to be the same. The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

1. A method for collecting the access records of the purchased electronic resource database in real time is characterized in that:
the method is realized by combining a database gateway forwarding technology and a page analysis technology, and comprises the following specific steps:
(1): firstly, using a terminal to complete terminal authentication;
the method comprises the following specific steps:
creating a page for user authentication to authenticate the terminal identity;
establishing a navigation page of an electronic resource database, and after the terminal finishes logging in, clicking the connection of a certain electronic resource database on the page to realize the access to the electronic resource database;
(2): sending the request to a database gateway according to the terminal identity and the database information to be accessed;
the method specifically comprises the following steps:
constructing a webpage proxy access server to realize proxy access when accessing the database; the purpose of proxy service is to implement request forwarding and url rewriting;
the ticket is a security code whose rule may be md5({userid} + {privatekey} + (int)(now.getTime()/1000)); the privatekey may be set individually for each user; the group is a role, which may be null, in which case the default permissions apply, and which may be used to apply different access policies to different users;
after receiving the request again, the database gateway forwards the request to the database website, and when the database website returns the page, the terminal information and the analysis and extraction script are dynamically embedded in the page;
(3): the database gateway forwards the request to the actual database website;
the method specifically comprises the following steps:
analyzing the position of the content to be extracted in the page according to the page characteristics of the accessed electronic resource database, completing the configuration of rules, and labeling through xpath or character string characteristics; writing the rule into a page analysis script file (js) file;
(4): when the database website returns page data and the data passes through the database gateway, a page analysis script is dynamically embedded, that is, code referencing the script is embedded in the page code; the script file is a js file that can be edited in the cloud and is executed at the terminal to analyze the content of the web page; the js script contains database-specific analysis rules, which define the positions of the elements of different databases as displayed in the terminal browser, so that the script file can capture pages according to the rules;
the method specifically comprises the following steps:
building a log receiving server, firstly installing a mysql database, creating a data table for storing log information, and building a web application server on the server in order to push the log information through a network;
(5): after the returned page is opened in a terminal browser, automatically running a page analysis script, and extracting the content of an article seen in the page according to rules defined in the js script;
the method specifically comprises the following steps:
the log pushing interface is used for receiving the log, the receiving form is an http protocol interface, the receiving form is a json-packaged data format, and the interface writes the log into the database after receiving the pushed log;
(6): after extraction is completed, a log recording interface is automatically called in a terminal browser, and log writing is completed;
the method specifically comprises the following steps:
by reading the logs in the database for analyzing or displaying the access logs of the electronic resource database, the specific information of the logs can include an access personnel account number, access time, access IP, access electronic resource database URL, access article title, article author, browse or download.
2. The method of claim 1, wherein: after receiving the request, the database gateway forwards the request to the database website, and when the database website returns the page, the terminal information and the analysis and extraction script are dynamically embedded in the page.
3. The method of claim 1, wherein: in step 4, the terminal browser receives the data, displays the database content, and starts to run the analysis and extraction script embedded in the page; the script extracts the user information in the page and judges whether the page is one whose content needs to be captured, and if it is a specific content page, captures the page content according to the rules, thereby implementing page analysis and capture.
4. The method of claim 1, wherein: in step 4, the elements are key information, namely the article title, journal name, and author.
5. The method of claim 1, wherein: after the content of the article seen in the page has been extracted according to the rules defined in the js script, the analysis and extraction script completes data extraction and pushes the data into the message queue designated by the log platform.
6. The method of claim 1, wherein: in step 6, the log platform receives the message, parses the pushed log content, extracts the terminal account and the log content, writes the information into the database, and completes the association with the terminal information.
CN201911215102.4A 2019-12-02 2019-12-02 Method for collecting purchased electronic resource database access records in real time Active CN110995835B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911215102.4A CN110995835B (en) 2019-12-02 2019-12-02 Method for collecting purchased electronic resource database access records in real time

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911215102.4A CN110995835B (en) 2019-12-02 2019-12-02 Method for collecting purchased electronic resource database access records in real time

Publications (2)

Publication Number Publication Date
CN110995835A CN110995835A (en) 2020-04-10
CN110995835B true CN110995835B (en) 2022-08-19

Family

ID=70089291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911215102.4A Active CN110995835B (en) 2019-12-02 2019-12-02 Method for collecting purchased electronic resource database access records in real time

Country Status (1)

Country Link
CN (1) CN110995835B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004109532A1 (en) * 2003-06-05 2004-12-16 Cubicice (Pty) Ltd A method of collecting data regarding a plurality of web pages visited by at least one user
CN101163046A (en) * 2007-11-22 2008-04-16 北京金山软件有限公司 Distributed website log data acquisition method and distributed website system
CN106469185A (en) * 2016-08-29 2017-03-01 浪潮电子信息产业股份有限公司 A kind of method carrying out data collection in website statistics

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7739308B2 (en) * 2000-09-08 2010-06-15 Oracle International Corporation Techniques for automatically provisioning a database over a wide area network
CN103117892B (en) * 2013-01-21 2016-07-20 深圳市深信服电子科技有限公司 Add method and the device of website visiting record
CN106446228B (en) * 2016-10-08 2020-01-10 中国工商银行股份有限公司 Method and device for collecting and analyzing WEB page data
CN110620782A (en) * 2019-09-29 2019-12-27 深圳市珍爱云信息技术有限公司 Account authentication method and device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
数字图书馆个性化服务与Web日志挖掘数据预处理技术 [Digital library personalized service and Web log mining data preprocessing technology];柳胜国;《现代情报》;20070725(第07期);full text *

Also Published As

Publication number Publication date
CN110995835A (en) 2020-04-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant