CN111814092A - Data preprocessing method for artificial intelligence algorithm based on user internet behavior - Google Patents


Publication number
CN111814092A
Authority
CN
China
Prior art keywords
user
data
access
internet
url
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010705027.6A
Other languages
Chinese (zh)
Inventor
项亮
裴智晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Shuming Artificial Intelligence Technology Co ltd
Original Assignee
Shanghai Shuming Artificial Intelligence Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-07-21
Filing date
2020-07-21
Publication date
2020-10-23
Application filed by Shanghai Shuming Artificial Intelligence Technology Co ltd
Priority to CN202010705027.6A
Publication of CN111814092A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/142 Network analysis or design using statistical or mathematical methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/12 Network monitoring probes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Algebra (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Pure & Applied Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data preprocessing method for an artificial intelligence algorithm based on user internet behavior comprises the steps of: obtaining original information of a user; processing the user basic information; processing the internet access behavior data information; and merging the data information of the access record table according to the user dimension to form a user data table for a preset internet access time period. The invention can therefore process discontinuous, scattered and irregular user internet access data into a data format suitable for subsequent computation, making artificial intelligence analysis of user internet behavior data possible.

Description

Data preprocessing method for artificial intelligence algorithm based on user internet behavior
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a data preprocessing method for an artificial intelligence algorithm based on user internet surfing behavior.
Background
With the rise and wide application of artificial intelligence algorithms, more and more internet enterprises are paying attention to the combination of their own services with technologies such as big data and artificial intelligence. The application and research and development of technologies such as big data and artificial intelligence are becoming important links of the current internet enterprise operation management.
Intelligent optimization and prediction builds on artificial intelligence and prediction science: it analyzes and processes data, and uses artificial intelligence to select a suitable model and parameters to solve practical problems. The most laborious part of an artificial intelligence data analysis project is data acquisition and preprocessing, which can typically account for as much as 80% of the project time.
Real data may contain a large number of missing values and a large amount of noise, and may also contain outliers caused by manual entry errors, all of which is very unfavorable for training an algorithm model. Data cleaning handles each kind of dirty data in an appropriate way to obtain standard, clean and continuous data that can then be used for data statistics, data mining and the like.
Data cleansing can address various issues with data, including but not limited to: accuracy, applicability, timeliness, consistency, and authority. Different approaches may be used to address the various issues described above.
Data cleaning methods are generally specific to a particular application, so it is difficult to generalize a unified method and set of steps; nevertheless, a corresponding cleaning method can usually be given for each kind of data problem. For example:
First, handling missing values
In most cases, missing values must be filled in manually (i.e., cleaned by hand). Some missing values, however, can be derived from the same data source or from other data sources, in which case the missing value can be replaced with a mean, maximum, minimum, or a more sophisticated probability estimate.
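As a minimal sketch of the replacement strategy just described, assuming missing values are represented by `None` and that the mean of the observed values is an acceptable fill (the function name `impute_mean` is hypothetical and not part of the source):

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed (non-None) entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing to derive a fill value from; the caller must decide what to do.
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


filled = impute_mean([2.0, None, 4.0, None])
```

Swapping `mean` for `max` or `min` gives the other simple replacement strategies mentioned above.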
Second, accuracy checking
The data may be checked against a simple rule base (common-sense rules, business-specific rules, etc.), or errors may be detected and cleaned using constraints between different attributes or external data.
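A rule base of this kind can be sketched as a mapping from field name to a validity check. The field names and the 0 to 120 age range below are illustrative assumptions, not rules taken from the source:

```python
# Hypothetical rule base: each rule returns True when the value is acceptable.
RULES = {
    "age": lambda v: v is None or 0 <= v <= 120,
    "gender": lambda v: v in {"male", "female", None},
}


def violations(record, rules=RULES):
    """Return the names of the fields in `record` that break a rule."""
    return [field for field, ok in rules.items() if not ok(record.get(field))]
```

A record with an empty violation list passes the check; any listed field can then be corrected or routed to manual cleaning.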
Third, handling duplicates
Records in the database with the same attribute values are regarded as duplicates. Duplicates are detected by comparing the attribute values of the records, and equal records are merged into one. Merging/removal is the basic method of deduplication.
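The merge/removal approach can be illustrated as follows, assuming each record is a flat dictionary with string keys and hashable values (the function name `merge_duplicates` is hypothetical):

```python
def merge_duplicates(records):
    """Keep the first occurrence of each record whose attribute values are all equal."""
    seen = set()
    unique = []
    for rec in records:
        # Two records are duplicates when every attribute/value pair matches.
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```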
Fourth, handling inconsistencies
Data integrated from multiple data sources may contain semantic conflicts. Integrity constraints can be defined to detect inconsistencies, and relationships can also be discovered by analyzing the data, so that the data is kept consistent.
Fifth, handling noise
Noise is the random error or variance in a measured variable. Binning and regression methods may be used to handle it. Binning smooths ordered data values by consulting their "neighbors" (i.e., the surrounding values); the ordered values are distributed into a number of "buckets" or bins. Regression smooths the data by fitting it to a function. Linear regression finds the "best" line that fits two attributes (or variables) so that one attribute can be used to predict the other. Multiple linear regression is an extension of linear regression that involves more than two attributes, with the data fitted to a multidimensional surface. Using regression, a mathematical equation that fits the data is found, which can help eliminate noise.
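Both smoothing techniques can be sketched in a few lines. The example below assumes equal-frequency bins smoothed by their bin means and an ordinary least-squares line fit; the function names are hypothetical, and the source does not prescribe these particular variants:

```python
from statistics import mean


def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of `bin_size`, and replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bucket = ordered[start:start + bin_size]
        smoothed.extend([mean(bucket)] * len(bucket))
    return smoothed


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for two attributes."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx
```

For the values 4, 8, 15, 21, 21, 24, 25, 28, 34 and a bin size of 3, the three bins are replaced by their means 9, 22 and 29.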
However, for internet operating enterprises, each user generates a large amount of internet behavior data. To use this data in artificial intelligence algorithms, the user behavior data must first be preprocessed so that its quality is sufficient for the computation. For a massive data source such as user internet behavior data, no existing data preprocessing approach is currently available.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a data preprocessing method for an artificial intelligence algorithm based on user internet access behaviors.
The invention discloses a data preprocessing method for an artificial intelligence algorithm based on user internet surfing behavior, which comprises the following steps:
step S1: acquiring original information of a user; the original information comprises user basic information and internet behavior data information, the user basic information comprises gender, age and attribution, and the internet behavior data information comprises internet time, access URL (uniform resource locator) address and access website frequency;
step S2: processing the user basic information; wherein,
grouping the gender of the user according to male, female and unknown states to form three different data groups; dividing the ages of users into M age groups plus an unknown age group, for M+1 groups in total, the age of each user falling into one and only one age group; mapping the home location (attribution) of the user to different data fields according to a division into N location areas, plus an unknown field, so that the home location is divided into N+1 data fields;
step S3: processing the internet surfing behavior data information; the method comprises the following steps:
step S31: simplifying the access URL addresses of all users according to a simplification principle; the simplification principle comprises business simplification and similarity simplification; the business simplification prunes the access URL addresses that are completely irrelevant to the direction the business is concerned with, and the similarity simplification merges the access URL addresses that refer to the same address into a single unique access URL address;
step S32: numbering the simplified access URL addresses, wherein the access URL addresses have unique corresponding URL numbers, and the URL numbers are corresponding to URL data fields;
step S33: accumulating the website access frequency of each user according to the number of times of accessing each URL address in a preset internet surfing time period;
step S34: forming an access record table for all users to access each access URL address in the preset time;
step S4: and merging the data information of the access record table according to the user dimension to form a user data table in a preset internet surfing time period.
Preferably, the step S3 of the data preprocessing method for artificial intelligence algorithm based on user internet surfing behavior further includes the step S35: and performing normalization processing on the data in the access record table by adopting a nonlinear normalization algorithm.
According to the above technical scheme, the data preprocessing method for an artificial intelligence algorithm based on user internet behavior can process discontinuous, scattered and irregular user internet access data into a data format suitable for subsequent computation, making artificial intelligence analysis of user internet behavior data possible.
Drawings
FIG. 1 is a schematic flow chart of a data preprocessing method for an artificial intelligence algorithm based on user internet behavior according to the present invention.
Detailed Description
The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.
It should be noted that the data preprocessing of the present invention addresses the accuracy, applicability and consistency of data. It preprocesses massive user internet behavior data so that the data can be used in the computation of an artificial intelligence algorithm, ensuring that the data quality is sufficient for that computation.
Referring to fig. 1, fig. 1 is a schematic flow chart illustrating a data preprocessing method for an artificial intelligence algorithm based on a user internet behavior according to the present invention. As shown in fig. 1, the method comprises the steps of:
step S1: acquiring original information of a user; the original information comprises user basic information and internet behavior data information, the user basic information can comprise gender, age and attribution, and the internet behavior data information can comprise internet time, access URL (uniform resource locator) address and access website frequency.
In an embodiment of the present invention, the user basic information may mainly include:
the gender of the internet user, which can generally be obtained from the database of the end user's terminal (such as a mobile phone); if the user keeps their gender private, it is classified as unknown;
the age of the internet user, which can likewise be obtained from the database of the end user's terminal (such as a mobile phone); if the user keeps their age private, it is classified as unknown;
the home location (attribution) of the internet user, which is determined by geographic position (for example, the provincial administrative region).
The internet access behavior data information may mainly include:
first, the user's internet access time;
second, the access URL addresses the user clicked within a preset time period;
third, the frequency with which the user clicked each access URL address within a preset time period.
To turn this massive volume of original user information into a data set that an artificial intelligence algorithm can process, the data must be preprocessed.
In the embodiment of the present invention, the preprocessing the data includes processing the user basic information and processing the internet access behavior data information.
Step S2: processing the user basic information; this specifically includes gender information processing, age information processing, and home location (attribution) information processing.
In this embodiment, gender information processing groups users by gender into male, female and unknown, forming three data fields: is male, is female, and is unknown. The resulting data format is shown in Table 1 below:
User gender    Is male    Is female    Is unknown
Male           1          0            0
Female         0          1            0
Unknown        0          0            1
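The three-field encoding in Table 1 can be sketched as follows. The field names `is_male`, `is_female` and `is_unknown` and the input strings are assumptions chosen for illustration:

```python
GENDER_FIELDS = ("is_male", "is_female", "is_unknown")


def encode_gender(gender):
    """Map a gender value to the three 0/1 fields of Table 1; anything unrecognized is unknown."""
    key = {"male": "is_male", "female": "is_female"}.get(gender, "is_unknown")
    return {field: int(field == key) for field in GENDER_FIELDS}
```

Exactly one field is 1 for every input, which matches the rows of the table.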
In an embodiment of the present invention, the ages of users may be divided into M age groups plus an unknown age group, for M+1 groups in total, and the age of each user falls into one and only one age group. For example, users may be grouped into the following age groups: 0-15, 15-20, 20-25, 25-35, 35-45, 45-55, 55-60, over 60, and unknown. In this case M equals 8, and with the unknown group there are 9 age groups. For each user, the data state of the age group the user belongs to is 1, and the data states of all other age groups are 0.
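The age grouping can be sketched as below. Adjacent groups in the example share an endpoint (15, 20, and so on), and the source does not say which group a boundary age belongs to; this sketch assumes each group includes its lower bound and excludes its upper bound, so an age of exactly 15 falls into 15-20 and an age of exactly 60 into the last numeric group. The names `AGE_EDGES`, `AGE_LABELS` and `encode_age` are hypothetical:

```python
from bisect import bisect_right

# Upper bounds of the first seven groups; the eighth group is "over 60" (assumed to start at 60).
AGE_EDGES = [15, 20, 25, 35, 45, 55, 60]
AGE_LABELS = ["0-15", "15-20", "20-25", "25-35", "35-45", "45-55", "55-60", "60+", "unknown"]


def encode_age(age):
    """Return a 9-element 0/1 list with a single 1 at the user's age group."""
    if age is None or age < 0:
        idx = len(AGE_LABELS) - 1  # unknown group
    else:
        idx = bisect_right(AGE_EDGES, age)
    return [int(i == idx) for i in range(len(AGE_LABELS))]
```

Because the intervals are half-open, every age maps to exactly one group, which satisfies the one-and-only-one requirement.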
In this embodiment, home location information processing maps the user's home location to different data fields according to a division into N geographical areas, plus an unknown field, so that the home location is divided into N+1 data fields. For example, the 34 provincial administrative regions of China are mapped to 34 different data fields; in this case N equals 34, and with the unknown field there are 35 fields in total. For a single internet user, the data state of the field corresponding to the user's home province is 1, and the data fields of all other provinces are 0. Taking a Beijing user as an example, the data state of the data field "Beijing" is 1, and the remaining province data fields are 0.
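The N+1 field encoding of the home location can be sketched in the same way. The short region list in the usage below is a placeholder rather than the full list of 34 provincial regions, and `make_region_encoder` is a hypothetical helper:

```python
def make_region_encoder(regions):
    """Build an encoder producing N+1 fields: one per region plus a trailing 'unknown' field."""
    fields = list(regions) + ["unknown"]
    index = {name: i for i, name in enumerate(fields)}

    def encode(region):
        row = [0] * len(fields)
        # Regions outside the configured list fall into the unknown field.
        row[index.get(region, index["unknown"])] = 1
        return row

    return fields, encode


fields, encode_region = make_region_encoder(["Beijing", "Shanghai", "Guangdong"])
```

With this placeholder list, a Beijing user is encoded with a 1 in the "Beijing" field and 0 everywhere else.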
In this embodiment, the URL addresses visited by all users are massive in number, so the full set of URL addresses must be reduced to a limited number. The step S3 of processing the internet access behavior data information may specifically include the following steps:
Step S31: simplifying the access URL addresses of all users according to a simplification principle; the simplification principle comprises business simplification and similarity simplification; the business simplification prunes the access URL addresses that are completely irrelevant to the direction the business is concerned with, and the similarity simplification merges the access URL addresses that refer to the same address into a single unique access URL address.
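The source does not define how relevance or sameness is decided. As one possible reading, the sketch below treats URLs as the same when they share a host name (ignoring a leading `www.`, the path and the query string), and treats a URL as business-relevant when that host appears in a caller-supplied set. Both rules and the names `canonical_site` and `simplify_urls` are assumptions:

```python
from urllib.parse import urlsplit


def canonical_site(url):
    """Similarity simplification (assumed rule): reduce a URL to its lowercased host without 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def simplify_urls(urls, relevant_sites):
    """Business simplification (assumed rule): keep only URLs whose site is in `relevant_sites`."""
    simplified = []
    for url in urls:
        site = canonical_site(url)
        if site in relevant_sites:
            simplified.append(site)
    return simplified
```

The output keeps one entry per visit, so the later counting step still sees every access to a retained site.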
Step S32: numbering the simplified access URL addresses, wherein each access URL address has a unique corresponding URL number, and each URL number corresponds to a URL data field.
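The numbering step can be sketched as assigning consecutive integers in order of first appearance. Starting at 1 and the `url_<n>` field naming are illustrative choices, not requirements stated in the source:

```python
def number_urls(sites):
    """Assign each distinct simplified URL a unique number, starting at 1."""
    numbers = {}
    for site in sites:
        numbers.setdefault(site, len(numbers) + 1)
    return numbers


def field_name(url_number):
    """Hypothetical naming scheme for the data field that corresponds to a URL number."""
    return f"url_{url_number}"
```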
Step S33: accumulating the website access frequency of each user according to the number of times each URL address is accessed within a preset internet access time period.
That is, assuming the preset internet access time period is one day, each user's access frequency is aggregated by day: the number of times a single user accessed a specific URL on a given day is accumulated. The resulting data format is as follows:
user, date, URL1 number, URL1 visit count, URL2 number, URL2 visit count, ..., URLN number, URLN visit count.
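The daily accumulation can be sketched as a nested count keyed by user and date. The event tuple layout `(user, date, url_number)` is an assumption about how the raw access log has been prepared:

```python
from collections import defaultdict


def accumulate_visits(events):
    """Count, per (user, date), how many times each URL number was accessed."""
    totals = defaultdict(lambda: defaultdict(int))
    for user, date, url_number in events:
        totals[(user, date)][url_number] += 1
    # Convert to plain dicts so the result is easy to inspect and compare.
    return {key: dict(counts) for key, counts in totals.items()}
```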
Step S34: forming an access record table of all users' accesses to each access URL address within the preset time.
Specifically, all the user access records on the same day after the processing of the first three steps can be merged, and the data processing is performed according to the following table 2:
[Table 2: access record table; provided as an image in the original (BDA0002594378240000061) and not reproduced here]
Note that if a user has no access record for a given URL, the corresponding field is 0.
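Since the layout of Table 2 is not recoverable from the text, the following is only one plausible shape for the access record table: one row per user and date, one column per URL number, with 0 where the user has no access record. The function name and column naming are hypothetical:

```python
def build_access_table(visit_counts, url_numbers):
    """Turn {(user, date): {url_number: count}} into a header and dense rows, filling gaps with 0."""
    header = ["user", "date"] + [f"url_{n}" for n in url_numbers]
    rows = []
    for (user, date), counts in sorted(visit_counts.items()):
        rows.append([user, date] + [counts.get(n, 0) for n in url_numbers])
    return header, rows
```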
In order to improve the efficiency of the artificial intelligence subsequent algorithm, the data in the table above is normalized, that is, step S35 is executed: and performing normalization processing on the data in the access record table by adopting a nonlinear normalization algorithm.
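The source does not name a specific nonlinear normalization algorithm. As one common choice, the sketch below applies a logarithmic transform and scales by the largest transformed value, which compresses very large visit counts into the range 0 to 1; the function name `log_normalize` and the choice of `log1p` are assumptions:

```python
import math


def log_normalize(counts):
    """Map non-negative counts into [0, 1] using log(1 + x) / log(1 + max)."""
    peak = max(counts, default=0)
    if peak == 0:
        # All counts are zero (or the list is empty): nothing to scale.
        return [0.0 for _ in counts]
    scale = math.log1p(peak)
    return [math.log1p(c) / scale for c in counts]
```

A count of 0 stays at 0 and the largest count maps to 1, so the zero fields of the access record table remain distinguishable after normalization.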
After the above steps are completed, the data information merging step is performed, that is, step S4 is executed: merging the data information of the access record table according to the user dimension to form a user data table for the preset internet access time period. This completes the data preprocessing of the user internet behavior data, and the resulting data can be used in the subsequent computation of the artificial intelligence algorithm.
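How the records are merged along the user dimension is not spelled out in the source. The sketch below assumes that each user's access rows within the period are summed column by column and then appended to that user's encoded basic information; the function name, argument layout and the `default_profile` fallback are all hypothetical:

```python
def merge_by_user(profile_rows, access_rows, default_profile):
    """Combine encoded basic info with per-URL totals, producing one row per user.

    profile_rows: {user: [encoded basic-info fields]}
    access_rows: iterable of (user, date, [count per URL field])
    default_profile: fields to use for a user with no basic information
    """
    totals = {}
    for user, _date, counts in access_rows:
        acc = totals.setdefault(user, [0] * len(counts))
        for i, c in enumerate(counts):
            acc[i] += c
    return {
        user: profile_rows.get(user, default_profile) + counts
        for user, counts in totals.items()
    }
```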
In summary, the present invention provides a data preprocessing method for an artificial intelligence algorithm based on user internet behavior, which preprocesses user internet behavior data so that artificial intelligence algorithms can be applied to it.
The above description is only for the preferred embodiment of the present invention, and the embodiment is not intended to limit the scope of the present invention, so that all the equivalent structural changes made by using the contents of the description and the drawings of the present invention should be included in the scope of the present invention.

Claims (2)

1. A data preprocessing method for an artificial intelligence algorithm based on user internet surfing behavior is characterized by comprising the following steps:
step S1: acquiring original information of a user; the original information comprises user basic information and internet behavior data information, the user basic information comprises gender, age and attribution, and the internet behavior data information comprises internet time, access URL (uniform resource locator) address and access website frequency;
step S2: processing the user basic information; wherein,
grouping the gender of the user according to male, female and unknown states to form three different data groups;
dividing said ages of users into M age groups, plus an unknown age group, into M +1 groups, said age of each user falling into one and only one age group;
the attribution area of the user corresponds to different data fields according to the division of N location areas, and an unknown field is added, namely the attribution area is divided into N +1 data fields;
step S3: processing the internet surfing behavior data information; the method comprises the following steps:
step S31: simplifying the access URL addresses of all users according to a simplification principle; the simplification principle comprises business simplification and similarity simplification; the business simplification prunes the access URL addresses that are completely irrelevant to the direction the business is concerned with, and the similarity simplification merges the access URL addresses that refer to the same address into a single unique access URL address;
step S32: numbering the simplified access URL addresses, wherein the access URL addresses have unique corresponding URL numbers, and the URL numbers are corresponding to URL data fields;
step S33: accumulating the website access frequency of each user according to the number of times of accessing each URL address in a preset internet surfing time period;
step S34: forming an access record table for all users to access each access URL address in the preset time;
step S4: and merging the data information of the access record table according to the user dimension to form a user data table in a preset internet surfing time period.
2. The data preprocessing method for artificial intelligence algorithm based on internet surfing behavior of users as claimed in claim 1, wherein said step S3 further comprises the step S35: and performing normalization processing on the data in the access record table by adopting a nonlinear normalization algorithm.
CN202010705027.6A 2020-07-21 2020-07-21 Data preprocessing method for artificial intelligence algorithm based on user internet behavior Withdrawn CN111814092A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010705027.6A CN111814092A (en) 2020-07-21 2020-07-21 Data preprocessing method for artificial intelligence algorithm based on user internet behavior

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010705027.6A CN111814092A (en) 2020-07-21 2020-07-21 Data preprocessing method for artificial intelligence algorithm based on user internet behavior

Publications (1)

Publication Number Publication Date
CN111814092A true CN111814092A (en) 2020-10-23

Family

ID=72860844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010705027.6A Withdrawn CN111814092A (en) 2020-07-21 2020-07-21 Data preprocessing method for artificial intelligence algorithm based on user internet behavior

Country Status (1)

Country Link
CN (1) CN111814092A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1791022A (en) * 2005-12-26 2006-06-21 阿里巴巴公司 Log analyzing method and system
CN108960975A (en) * 2018-06-15 2018-12-07 广州麦优网络科技有限公司 Personalized Precision Marketing Method, server and storage medium based on user's portrait
CN109145307A (en) * 2018-09-12 2019-01-04 广州视源电子科技股份有限公司 User portrait recognition method, pushing method, device, equipment and storage medium
CN110222272A (en) * 2019-04-18 2019-09-10 广东工业大学 A kind of potential customers excavate and recommended method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 200436 room 406, 1256 and 1258 Wanrong Road, Jing'an District, Shanghai

Applicant after: Shanghai Shuming Artificial Intelligence Technology Co.,Ltd.

Address before: Room 1601-026, 238 JIANGCHANG Third Road, Jing'an District, Shanghai, 200436

Applicant before: Shanghai Shuming Artificial Intelligence Technology Co.,Ltd.

WW01 Invention patent application withdrawn after publication

Application publication date: 20201023
