CN115130008B - Search ordering method based on machine learning model algorithm - Google Patents


Info

Publication number
CN115130008B
CN115130008B (application CN202211050166.5A)
Authority
CN
China
Prior art keywords
model
data
layer
search
service
Prior art date
Legal status
Active
Application number
CN202211050166.5A
Other languages
Chinese (zh)
Other versions
CN115130008A (en)
Inventor
户向伟
王昕
梁培利
于森
Current Assignee
Kasima Beijing Technology Co ltd
Original Assignee
Kasima Beijing Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Kasima Beijing Technology Co ltd
Priority to CN202211050166.5A
Publication of CN115130008A
Application granted
Publication of CN115130008B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/953 Querying, e.g. by the use of web search engines
    • G06F 16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24 Querying
    • G06F 16/245 Query processing
    • G06F 16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F 16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F 16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F 16/284 Relational databases
    • G06F 16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 20/20 Ensemble learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Fuzzy Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a search ranking method based on a machine learning model algorithm, comprising the following steps. Data processing: using a data warehouse to process the static features, statistical features, and interaction features of the commodities and users required for modeling. Model training: performing exploratory data analysis on the processed data in the data warehouse, carrying out machine-learning feature engineering according to the analysis results, selecting a suitable model for parameter-tuning training, evaluating the model, and saving the model as a PMML model file. Data synchronization: synchronizing the feature data to the online database and updating it periodically. Model service: receiving requests from the search service, reading user feature data, loading the PMML model file, predicting with the model service, applying the relevant policy processing to the predicted ranking results, and finally returning the ranking results to the search service, which displays them to the front end.

Description

Search ranking method based on a machine learning model algorithm
Technical Field
The invention relates to the technical field of machine learning, and in particular to a search ranking method based on a machine learning model algorithm.
Background
In addition to the traditional general-purpose search engines Google, *** and Bing, most vertical internet products, such as e-commerce, music, application markets, and short video, also need a search function to meet users' query needs. Compared with a recommendation system, which satisfies needs passively, users actively express their intent by composing a query when searching, so the search intent is relatively clear. Even so, building a search engine remains challenging: the points to consider and the related technologies are difficult in both depth and breadth. For e-commerce malls in particular, search accounts for 80% of the traffic entry, so a well-built search system is especially important. A basic search system can be roughly divided into an offline mining part and an online retrieval part, whose important modules mainly include item content understanding, query understanding, retrieval and recall, and ranking. Existing e-commerce websites usually build their own search systems to improve the user search experience, but user complaints often target the way search results are ranked. How to provide a technical scheme that improves the ranking experience of search results is a technical problem that currently needs to be solved.
Disclosure of Invention
The object of the present invention is to solve at least one of the technical drawbacks mentioned above.
To this end, the invention aims to provide a search ranking method based on a machine learning model algorithm, so as to solve the problems mentioned in the background section and overcome the defects of the prior art.
To achieve the above object, an embodiment of the present invention provides a search ranking method based on a machine learning model algorithm, comprising:
Step S1, data processing: using a data warehouse to process the static features, statistical features, and interaction features of the commodities and users required for modeling;
Step S2, model training: performing exploratory data analysis on the data processed in the data warehouse in step S1, carrying out machine-learning feature engineering according to the analysis results, selecting a suitable model for parameter-tuning training, evaluating the model, and saving the model as a PMML model file;
Step S3, data synchronization: synchronizing the feature data to the online database and updating it periodically;
Step S4, model service: receiving requests from the search service, reading user feature data, loading the PMML model file, predicting with the model service, applying the relevant policy processing to the predicted ranking results, and finally returning the ranking results to the search service, which displays them to the front end.
Preferably, in any of the above schemes, in step S2, model evaluation is performed using offline indicators, which include the AUC value, recall, and accuracy.
Preferably, in step S1, a hive data warehouse is used to extract data in layers; the layers include an ODS layer, a DIM layer, a DWD layer, a DWI layer, a DWS layer, and an APP layer. The data warehouse adopts dimensional modeling, abstracting data by subject into dimension tables and fact tables; fact tables are associated with dimension tables, and dimension tables are not associated with one another;
the ODS layer is the business data storage layer; the DIM layer is the layer where the warehouse dimension tables reside; the DWD layer is the layer where the warehouse fact tables reside; the DWI layer is the light-summary layer of the data warehouse, with its data coming from the DIM and DWD layers; the DWS layer is the summary layer of the data warehouse, with its data derived from the DWI layer; and the APP layer is the data application layer.
Preferably, in any of the above schemes, the data model of the data warehouse is a star model.
Preferably, in any of the above schemes, step S2 comprises:
Step S21, reading data: connecting to hive via JDBC, reading the required training table data from the APP layer of the hive data warehouse, and processing it into a DataFrame;
Step S22, exploratory data analysis (EDA): checking the size and the number of rows and columns of the search ranking data to verify that the read data is accurate and valid; checking the median, mean, and number of unique values of each feature, the value frequencies of each column, and the correlations between features; and, according to the results, applying different processing to each column in feature engineering;
Step S23, feature engineering: deleting features dominated by a single value and features with more than 70% missing values; applying a OneHotEncoder to object-type features with fewer than 5 distinct values and a LabelEncoder to those with 5 or more; discretizing several features related to price and quantity; and selecting features with SelectFromModel so that no more than 100 features enter the model;
Step S24, model training: training an LR model, an xgboost model, and an xgboost + LR hybrid model, comparing the scores of the 3 models, finally selecting the xgboost model as the training model, and performing hyperparameter selection using cross validation and grid search;
Step S25, model evaluation: dividing the data set into a training set and a test set, training the model on the training set, and computing the offline binary-classification evaluation indicators on the test set;
Step S26, model saving: unifying the feature engineering and the model into a pipeline and saving it as a PMML file; PMML allows the model to be used across environments, so that the model service can call the model in a Java environment.
Preferably, in any of the above embodiments, in step S23, missing values are handled as follows: missing values with an actual meaning are left untouched; for columns with few missing values, the affected rows are deleted directly; larger numbers of missing values are filled using the mean, median, mode, or interpolation. Abnormal values are handled by replacing and filling them, and values beyond the upper and lower limits of the normal data range are deleted.
Preferably, in any of the above schemes, in step S23, the feature selection methods include the following three: Filter, Wrapper, and Embedded.
Preferably, in step S25, the AUC value is 0.68, the recall is 0.66, and the accuracy is 0.7.
Preferably, in any of the above schemes, step S4 comprises:
Step S41, receiving a search service request: the front end sends a request to the search service, which uses the Elasticsearch full-text search engine to recall search data and then calls the search ranking model service to rank; the model service is responsible for receiving the search service request;
Step S42, reading feature data: reading the required feature data from the PostgreSQL database according to the parameters passed by the search service;
Step S43, loading the PMML model file: loading the trained PMML model file into memory using the third-party jpmml library;
Step S44, model prediction: predicting with the loaded model file and the uniformly formatted input parameter data, ranking the prediction results, and writing logs; caching the predicted ranking results in Redis for a preset duration to reduce the load from repeated user searches;
Step S45, policy processing: completing commodity information, matching distribution areas, highlighting search keywords, and excluding commodity data with associated platform labels from ranking;
Step S46, returning the result: returning the final processed result to the search service, which uniformly returns it to the front end for display.
Preferably, in any of the above solutions, in the step S42, the separately read feature tables are combined into model parameter data in a unified format.
According to the search ranking method based on a machine learning model algorithm of the embodiments of the invention, data such as the static attributes of users and commodities, their statistical attributes, and the interaction behaviors between users and commodities are used as the basis for ranking search results.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow chart of a search ranking method based on a machine learning model algorithm according to an embodiment of the present invention;
FIG. 2 is a flow diagram of a data processing module according to an embodiment of the present invention;
FIG. 3 is a flowchart of the architecture of the model training module according to an embodiment of the present invention;
FIG. 4 is a flowchart of an architecture of a model service module according to an embodiment of the invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
As shown in fig. 1, the search ranking method based on a machine learning model algorithm according to the embodiment of the present invention includes the following steps:
Step S1, data processing: using the data warehouse to process the static features, statistical features, and interaction features of the commodities and users required for modeling.
Specifically, the static features, statistical features, and interaction features of the commodities and users are processed using the data warehouse. As shown in FIG. 2, data is extracted in layers with a hive data warehouse: HDFS provides hive's underlying storage, YARN handles resource scheduling, and DolphinScheduler handles workflow scheduling. The layers comprise an ODS layer, a DIM layer, a DWD layer, a DWI layer, a DWS layer, and an APP layer. The data warehouse adopts dimensional modeling, abstracting data by subject into dimension tables and fact tables. The data model adopted is a star model: fact tables are associated with dimension tables, and dimension tables are not associated with one another. The ODS layer is the business data storage layer; the DIM layer is the layer where the warehouse dimension tables reside; the DWD layer is the layer where the warehouse fact tables reside; the DWI layer is the light-summary layer of the data warehouse, with data derived from the DIM and DWD layers; the DWS layer is the summary layer, with data derived from the DWI layer; and the APP layer is the data application layer. The training and prediction tables provided for this scheme are at this level.
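As a rough illustration, the star join described above (fact rows carrying foreign keys and numeric measures, widened against dimension tables) can be sketched in pure Python. The table names and fields here are illustrative stand-ins, not the patent's actual schema:

```python
# Minimal sketch of the star model: a fact table whose rows hold foreign
# keys and numeric measures, joined against dimension tables. All names
# and values are toy examples.

user_dim = {1: {"user_name": "alice", "region": "north"}}           # DIM layer
product_dim = {10: {"product_name": "kettle", "category": "home"}}  # DIM layer

# DWD-layer fact rows: foreign keys plus a numeric measure (click count)
click_facts = [
    {"user_id": 1, "product_id": 10, "clicks": 3},
]

def widen(fact):
    """Associate one fact row with its dimension tables (the star join)."""
    row = dict(fact)
    row.update(user_dim[fact["user_id"]])
    row.update(product_dim[fact["product_id"]])
    return row

wide_rows = [widen(f) for f in click_facts]
print(wide_rows[0]["region"], wide_rows[0]["clicks"])
```

Because dimension tables are never joined to one another, each fact row is widened by independent lookups, which is what keeps the star model simple to query.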
The specific functions are as follows:
ODS layer: the business data storage layer. Data is extracted from the online business database with the data synchronization tool Sqoop; tracking-log data is exported through the export mechanism provided for it and imported into the warehouse with hive commands. The load policies for the different tables include incremental load, full load, incremental-and-changed load, and one-time load. Data imported in these various ways is stored in the ODS layer without modification.
DIM layer: the layer where the warehouse dimension tables reside. The dimension tables used by the invention include a time dimension table, a region dimension table, a distribution-area dimension table, a user dimension table (including a research-institute attribute field), a commodity dimension table, and a merchant dimension table. All DIM-layer data comes from the ODS layer. Besides re-partitioning data by dimension and unifying dimension data from multiple sources, the work of the DIM layer also includes cleaning duplicate data, null data, dirty data, and out-of-range data.
DWD layer: the layer where the warehouse fact tables reside, also called the warehouse detail layer. A fact table consists of foreign keys referencing dimension tables plus numeric measure values. The fact tables used by the invention include a login fact table, an add-to-cart fact table, a favorites fact table, an order detail fact table, a commodity click fact table, and a search click fact table. The login, commodity click, and search click fact tables are transaction fact tables; the add-to-cart and favorites fact tables are periodic snapshot fact tables; the order fact table is an accumulating snapshot fact table; different partition loading strategies are adopted for the different fact table types. The DWD-layer data also all comes from the ODS layer, and the work of this layer is to unify fact table data from multiple sources and to clean null and abnormal values.
DWI layer: the light-summary layer of the data warehouse; its data is derived from the DIM and DWD layers. The measures of the fact tables are lightly summarized by dimension; in the invention, the DWI layer is summarized by day. The purpose of light summarization is to precompute results for otherwise repeated, complex aggregations, trading space for time and improving computational efficiency. The light summary tables in the invention include a member subject table, a merchant subject table, and a commodity subject table.
DWS layer: the summary layer of the data warehouse. DWS-layer data is derived from the DWI layer and is a further summarization of the DWI subject tables. Because the DWS subject-table summaries are mostly used for ad-hoc queries, the two subject-table partitions of this layer are saved. The summary layer is characterized by relatively few data tables, with one table covering a broad range of business content. The subject tables of the DWS layer include a member subject table, a merchant subject table, and a commodity subject table.
APP layer: the data application layer, which provides the final data tables for the interfaces used by the downstream applications of the data warehouse, including reports, recommendation systems, and search ranking systems; it is the layer where the data warehouse's final output data resides. This layer's data is generally imported directly into a relational database with the data import tool Sqoop for interface queries.
The related tables include a region table, a member table, a commodity table, a merchant table, a login behavior table, a click behavior table, a search click behavior table, an order submission behavior table, and the like. The relevant fields are as follows:
[Table 1: member-related fields (image not reproduced)]
[Table 2: commodity-related fields (image not reproduced)]
[Table 3: merchant-related fields (image not reproduced)]
[Table 4: user-behavior-related fields (image not reproduced)]
Step S2, model training: performing exploratory data analysis on the data processed in the data warehouse in step S1, carrying out machine-learning feature engineering according to the analysis results, selecting a suitable model for parameter-tuning training, evaluating the model with offline indicators such as the AUC value, recall, and accuracy, and saving the model as a PMML model file, as shown in FIG. 3.
Step S21, reading data: connecting to hive via JDBC, reading the required training data from the APP layer of the hive data warehouse, and processing it into a DataFrame. Useless features are deleted according to the subsequent analysis results.
Step S22, EDA: performing exploratory data analysis, checking the size and the number of rows and columns of the search ranking data, and verifying that the read data is accurate and valid. The business meanings of the 114 features are then studied; the median, mean, number of unique values, and value frequencies of each feature, as well as the correlations between features, are checked, and according to the results different processing is applied to each column in feature engineering. The positive-to-negative label ratio is checked, and the data type of each column is examined: the training data contains three data types, int, float, and object, and object-type data needs to be numericalized in feature engineering.
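The per-feature EDA checks listed above (median, mean, unique-value counts, value frequencies, missing ratio) can be sketched with the Python standard library. The two toy columns below are illustrative stand-ins for the real 114 features:

```python
import statistics
from collections import Counter

# Toy stand-ins for one numeric feature column and one object-type column;
# the real data is read from the hive APP layer.
price = [9.9, 19.9, 19.9, 49.0, None]   # numeric feature with a missing value
category = ["a", "b", "a", "a", "c"]    # object-type feature

present = [v for v in price if v is not None]
print("rows:", len(price))
print("median:", statistics.median(present))
print("mean:", statistics.mean(present))
print("unique values:", len(set(category)))
print("frequencies:", Counter(category))
print("missing ratio:", (len(price) - len(present)) / len(price))
```

In practice these checks decide the later feature-engineering step for each column: a high missing ratio triggers deletion, a skewed frequency table triggers encoding or binning.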
Step S23, feature engineering: deleting features dominated by a single value and features with more than 70% missing values; applying a OneHotEncoder to object-type features with fewer than 5 distinct values and a LabelEncoder to those with 5 or more; discretizing several features related to price and quantity; and selecting features with SelectFromModel so that no more than 100 features enter the model.
Other features with missing values are filled, using the median, mean, or interpolation as required. The purpose of numericalizing object-type features (OneHotEncoder when there are fewer than 5 distinct values, LabelEncoder when there are 5 or more) is to let the model accept the data and better mine the information it contains. Several features related to price and quantity are discretized using equal-frequency binning; discretization lets the model obtain more information and brings a large improvement for linear models.
Feature selection is performed with SelectFromModel, leaving 90 final in-model features. Feature engineering makes the data meet the model's requirements so the model can better extract useful information, improving the model's effect.
Step S24, model training: training an LR model, an xgboost model, and an xgboost + LR hybrid model, comparing the scores of the 3 models, finally selecting the xgboost model as the training model, and performing hyperparameter selection using cross validation and grid search.
In the invention, a logistic regression model, an xgboost model, and an xgboost + LR hybrid model were tried, with hyperparameters selected via cross validation and grid search. The initial hyperparameter search starts from default parameters or values around empirical ones. The xgboost model was finally chosen, with optimal parameters of 200 trees, a tree depth of 6, and a learning rate of 0.3.
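The grid-search-with-cross-validation procedure above can be sketched in pure Python. The scoring function here is a toy stand-in that happens to peak at the parameters the text reports choosing; real training would fit xgboost inside it:

```python
from itertools import product

# Pure-Python sketch of hyperparameter selection via grid search with
# k-fold cross validation. toy_train is an illustrative stand-in for
# real xgboost training, not the patent's implementation.

def k_fold_indices(n, k):
    """Yield (train, test) index lists for k equal folds."""
    fold = n // k
    for i in range(k):
        test = list(range(i * fold, (i + 1) * fold))
        train = [j for j in range(n) if j not in test]
        yield train, test

def cross_val_score(train_fn, data, params, k=3):
    scores = []
    for train_idx, test_idx in k_fold_indices(len(data), k):
        model = train_fn([data[i] for i in train_idx], params)
        scores.append(model([data[i] for i in test_idx]))
    return sum(scores) / len(scores)

def grid_search(train_fn, data, grid, k=3):
    best = None
    for combo in product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        score = cross_val_score(train_fn, data, params, k)
        if best is None or score > best[1]:
            best = (params, score)
    return best

def toy_train(train_rows, params):
    # Toy scorer peaking at 200 trees and depth 6 (the reported optimum);
    # a real version would train xgboost on train_rows here.
    return lambda test_rows: (-abs(params["max_depth"] - 6)
                              - abs(params["n_estimators"] - 200) / 100)

grid = {"n_estimators": [100, 200], "max_depth": [4, 6], "learning_rate": [0.1, 0.3]}
best_params, best_score = grid_search(toy_train, list(range(9)), grid)
print(best_params)
```

The structure mirrors sklearn's GridSearchCV: every parameter combination is scored by the mean of its k fold scores, and the best mean wins.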
Step S25, model evaluation: the data set is divided into a training set and a test set; the model is trained on the training set and evaluated on the test set. The xgboost model performs binary-classification prediction, and the offline binary-classification evaluation indicators include the confusion matrix, AUC, recall, accuracy, and f1 score. In the final offline results of the invention, the AUC is 0.68, the recall is 0.66, and the accuracy is 0.7, which meets the requirement for going online. If the offline effect were not satisfactory, optimization would have to continue before deployment.
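The offline binary-classification indicators named above can be computed from scratch on toy scores and labels, as a sketch of what the evaluation step measures (the numbers here are illustrative, not the patent's 0.68/0.66/0.7 results):

```python
# Sketch of the offline binary-classification metrics: confusion matrix,
# AUC, recall, precision, f1. Toy labels and scores only.

def confusion(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1 if p > n else 0.5 if p == n else 0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 0, 1, 1, 0]
scores = [0.9, 0.4, 0.8, 0.3, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tp, fp, fn, tn = confusion(y_true, y_pred)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)
print(auc(y_true, scores), recall, precision, f1)
```

AUC is threshold-free (it ranks scores), while recall, precision, and f1 depend on the 0.5 cut-off, which is why both kinds of indicator are reported.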
Step S26, model saving: the feature engineering and the model are unified into a pipeline, which combines multiple models and feature-engineering steps into a single model and facilitates online calls. Like a model object, a pipeline can be trained, used for prediction, evaluated, hyperparameter-searched, persisted, and deployed. It is then saved in PMML format (PMML is a serialization format that works across language platforms and can be used to deploy models across language environments), making the model file callable in a Java environment.
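The composition idea behind the pipeline can be illustrated with a pure-Python stand-in: transforms and the model sit behind a single predict() so they can be serialized and called as one unit. The actual scheme uses sklearn-style pipelines serialized to PMML; this sketch only shows the composition, and its steps and fields are hypothetical:

```python
# Pure-Python stand-in for the pipeline idea: feature-engineering steps
# plus a final model unified behind one predict() call.

class Pipeline:
    def __init__(self, steps, model):
        self.steps = steps    # ordered list of transform functions
        self.model = model    # final scoring function

    def predict(self, row):
        for step in self.steps:
            row = step(row)   # each transform rewrites the feature row
        return self.model(row)

# Illustrative step: fill a missing price with 0, then a toy linear score
pipe = Pipeline(
    steps=[lambda r: {**r, "price": r.get("price") or 0.0}],
    model=lambda r: 0.1 * r["price"] + 0.5 * r["clicks"],
)
print(pipe.predict({"price": None, "clicks": 4}))
```

Bundling the transforms with the model guarantees that online prediction applies exactly the same feature engineering as offline training, which is the point of exporting the whole pipeline as one PMML file.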
In the embodiment of the invention, the exploratory data analysis mainly checks and handles missing values, duplicate values, and abnormal values.
The positive and negative sample ratio in feature engineering is balanced, so no data sampling is needed. Missing values are handled as follows: missing values with an actual meaning are left untouched; for columns with few missing values and little influence on the data, the affected rows are deleted directly; larger numbers of missing values are filled using methods such as the mean, median, mode, and interpolation.
Abnormal values are handled by replacing and filling them, and values beyond the upper and lower limits of the normal data range are deleted.
Numericalization methods include encoding ordinal variables with LabelEncoder, encoding nominal variables with OneHotEncoder, and encoding high-cardinality features with approaches such as OneHot + PCA, LabelEncoder with large-value mapping, feature hashing, and mean encoding.
Discretization covers discretizing continuous data and binarizing continuous data; discretization methods for continuous data include equal-width binning, equal-frequency binning, custom binning, and the like. Dimensionless scaling, also called normalization or standardization, mainly serves to speed up training (in gradient- and matrix-centered algorithms such as logistic regression, support vector machines, and neural networks), to unify dimensions, and to improve precision (in distance-based models such as K-nearest neighbors and K-Means clustering); specific methods include StandardScaler, MinMaxScaler, MaxAbsScaler, and the like.
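The two most common dimensionless-scaling methods named above (MinMaxScaler-style and StandardScaler-style) can be sketched from scratch on a toy column:

```python
import statistics

# From-scratch sketch of min-max scaling (maps to [0, 1]) and standard
# scaling (zero mean, unit variance). Toy column, illustrative only.

def min_max_scale(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standard_scale(xs):
    mu = statistics.mean(xs)
    sigma = statistics.pstdev(xs)     # population standard deviation
    return [(x - mu) / sigma for x in xs]

col = [2.0, 4.0, 6.0, 8.0]
print(min_max_scale(col))
print(standard_scale(col))
```

Min-max scaling preserves the shape of the distribution inside a fixed range, while standard scaling centers it, which is why distance-based models usually prefer the latter.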
In an embodiment of the present invention, the feature selection methods include 3 types:
1. Filter: variables are selected based on the association between the independent variable and the target variable, including variance-based selection and selection using indicators such as correlation, chi-square, and mutual information.
2. Wrapper: the model's 'coef_' or 'feature_importances_' attribute is used to obtain variable importances and exclude features, iterating multiple times with one model.
3. Embedded: the model computes a coefficient for each feature against the label, and features below a set threshold are deleted; the model is trained once. The variance-selection method and univariate feature selection belong to Filter, recursive feature elimination belongs to Wrapper, and SelectFromModel belongs to Embedded.
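The simplest Filter method named above, variance-based selection, can be sketched in a few lines: near-constant features carry little information and are dropped below a threshold. The column names and values are toy data:

```python
import statistics

# Sketch of Filter-style feature selection: keep columns whose variance
# exceeds a threshold, dropping near-constant features. Toy columns only.

def variance_filter(columns, threshold=0.0):
    """Return the names of columns with variance above the threshold."""
    return [name for name, values in columns.items()
            if statistics.pvariance(values) > threshold]

columns = {
    "always_one": [1, 1, 1, 1],      # zero variance: dropped
    "price":      [10, 30, 20, 40],  # informative: kept
}
print(variance_filter(columns))
```

Filter methods like this run before any model is fit, which is what distinguishes them from Wrapper and Embedded selection.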
In the embodiment of the invention, the model finally adopted is the xgboost model, which suits the ranking scenario relatively well, and useless features are deleted according to the feature importances obtained from the model.
The feature engineering used to process the data and the model are uniformly packaged into a pipeline and saved in PMML form for the model service to call.
Step S3, data synchronization: the feature data is synchronized from the data warehouse to the online relational database and updated regularly.
Preferably, in this step, the feature data is synchronized to the PostgreSQL database and updated daily.
Step S4, model service: the search service sends a request to the model service; the model service receives it and reads the feature data. The model service loads the PMML model file, predicts, applies the relevant policy processing, and returns the result, which the search service uniformly displays to the front end, as shown in fig. 4.
Step S41, receiving a search service request: the front end sends a request to the search service, which uniformly schedules the other services; the search service first recalls search data with Elasticsearch and then calls the search ranking model service to rank them, passing as parameters a number_id and a list of 1000 product_id values; the model service is responsible for receiving this search service request.
Step S42, reading feature data: the required feature data are read from the online database according to the parameters passed by the search service. To meet online latency requirements, the several prediction tables are not joined in advance, since a pre-joined table would be too large and slow to query; instead, the separately read feature tables are merged into model parameter data of a uniform format at actual query time.
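The query-time merge of separately read feature tables can be sketched as a per-id join; the table contents and column names here are hypothetical:

```python
def merge_feature_tables(product_ids, *tables):
    """Combine feature dicts keyed by product_id into one row per id,
    defaulting missing features to 0 so every row has a uniform schema."""
    all_keys = sorted({k for t in tables for row in t.values() for k in row})
    merged = []
    for pid in product_ids:
        row = {"product_id": pid}
        for key in all_keys:
            row[key] = 0  # uniform default for absent features
        for t in tables:
            row.update(t.get(pid, {}))
        merged.append(row)
    return merged

# Hypothetical tables read separately from the online database.
stats = {101: {"sales_30d": 7}, 102: {"sales_30d": 3}}
static = {101: {"price": 9.9}}
print(merge_feature_tables([101, 102], stats, static))
```

Because each table is fetched by primary key, the separate reads stay fast, and only this in-memory merge pays the cost of assembling the wide row the model expects.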
Step S43, loading a PMML model file: the trained PMML model file is loaded into memory using the third-party jpmml library. The file is loaded only once, when the service starts; subsequent predictions use the PMML object already in memory and need no reloading.
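The load-once behaviour can be sketched with a cached loader; the dict returned below is a hypothetical stand-in for the real jpmml/pypmml model object:

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_model(path):
    """Load the PMML model on first use; later calls with the same path
    reuse the in-memory object, as the service does after startup."""
    print(f"loading {path}")  # in the real service: the jpmml load call
    return {"path": path, "loaded": True}  # stand-in for the model object

m1 = get_model("search_rank.pmml")
m2 = get_model("search_rank.pmml")
print(m1 is m2)  # -> True: the file was loaded only once
```

In a Java service the same effect is achieved by loading the Evaluator in a singleton at startup; the point is simply that file I/O and PMML parsing are paid once, not per request.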
Step S44, model prediction: a prediction is made using the loaded model file and the input parameter data in uniform format, the prediction results are sorted, and a log is printed. The predicted ranking result is cached in Redis and kept for 1 hour to reduce load from repeated searches by the same user; the prediction results are also logged for subsequent analysis.
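The one-hour result cache can be sketched with an in-memory stand-in for Redis; with the real redis-py client the equivalent call would be roughly setex(key, 3600, serialized_ranking):

```python
import time

class TTLCache:
    """Minimal in-memory stand-in for the Redis SETEX pattern."""

    def __init__(self):
        self._store = {}

    def setex(self, key, ttl_seconds, value):
        # Store the value together with its absolute expiry time.
        self._store[key] = (time.monotonic() + ttl_seconds, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # lazily drop expired entries
            return None
        return value

cache = TTLCache()
# Hypothetical key scheme: one ranking list per query, kept 1 hour.
cache.setex("rank:query=shoes", 3600, [103, 101, 102])
print(cache.get("rank:query=shoes"))  # -> [103, 101, 102]
```

Keying the cache on the query (or user plus query) means paging requests hit the cached ranking directly, which is exactly the pressure-reduction the text describes.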
Step S45, policy processing: before the sorted commodities are displayed to the front end, some policy processing is required, including commodity completion, distribution-area matching, highlighting of search keywords, exclusion of commodity data with associated platform tags from ranking, and the like.
Step S46, returning the result: the finally processed result is returned to the search service, which uniformly returns it to the front end for display.
In the embodiment of the invention, the search service is one of the most important services already contained in the shopping mall; it needs to be modified so that it no longer returns the Elasticsearch result directly but instead calls the model service to rank the results and then returns them to the front end.
To meet online real-time requirements, data reading is optimized: the wide feature table is split into several tables, columns of data with little influence on prediction are deleted, the ranking result is cached directly in Redis, and paged queries by the user read the cached data directly.
The essence of calling the model across environments via PMML is that Java, through the third-party jpmml library, performs the invocation and prediction of the model algorithm, with parameters passed in the fixed PMML format to achieve cross-environment calling.
The search ranking solution is one part of the overall search engine: of the engine's three main modules, namely query understanding, Elasticsearch result recall, and ranking, it occupies the last, performing fine ranking on the recalled search results before they are finally returned.
According to the search ranking method based on a machine learning model algorithm, the search results are ranked using data such as the static attributes of users and commodities, the statistical attributes of users and commodities, and the interaction behavior between users and commodities as the ranking basis.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
It will be understood by those skilled in the art that the present invention covers any combination of the summary, the detailed description above and the accompanying drawings; for brevity of this description, not every aspect that may be formed by such combinations is described. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall be included in its protection scope.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made in the above embodiments by those of ordinary skill in the art without departing from the principle and spirit of the present invention. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (9)

1. A search ranking method based on a machine learning model algorithm is characterized by comprising the following steps:
step S1, data processing: using a data warehouse to process the static features, statistical features and interactive features of the commodities and users required for processing; the data warehouse adopts dimensional modeling, abstracting data classification subjects into dimension tables and fact tables, the fact table being used to associate the dimension tables while the dimension tables are not associated with each other;
the system comprises an ODS layer, a DIM layer, a DWD layer, a data warehouse layer and a DWS layer, wherein the ODS layer is a service data storage layer, the DIM layer is a layer where a multi-bin dimension table is located, the DWD layer is a layer where a multi-bin fact table is located, the DWI layer is a data warehouse light summary layer, the DWI layer data comes from the DIM layer and the DWD layer, and the DWS layer is a data warehouse summary layer; the DWS layer data comes from a DWI layer, and the APP layer is a data application layer;
s2, model training: exploratory data analysis is carried out on the processed data in the data warehouse in the step S1, characteristic engineering of machine learning is carried out according to an analysis result, a proper model is selected for parameter adjustment training, model evaluation is carried out, and the model is stored into a PMML model file;
step S3, data synchronization: synchronizing the feature data from the data warehouse to the online database and updating them periodically;
step S4, model service: receiving a request transmitted by the search service, reading the user feature data, loading and reading the PMML model file, performing prediction by the model service, carrying out the relevant policy processing on the predicted ranking result, and finally returning the ranking result to the search service, which displays it to the front end.
2. The machine-learning-model-algorithm-based search ranking method of claim 1, wherein in the step S2, model evaluation is performed using off-line metrics, wherein the off-line metrics include: AUC values, recall, and accuracy.
3. The machine-learning-model-algorithm-based search ranking method of claim 1, wherein the model of the data warehouse is a star model.
4. The machine learning model algorithm based search ranking method of claim 1 wherein, in said step S2,
step S21, reading data: connecting to hive via JDBC, reading the required training table data from the APP layer of the hive data warehouse and processing it into the DataFrame type;
step S22, exploratory data analysis (EDA): checking the size and the numbers of rows and columns of the search ranking data, verifying the accuracy and validity of the read data, checking the median, mean and number of unique values of each feature, the value frequencies of each column of features and the correlation among the features, and, according to the results of these checks, processing each column of data differently in the feature engineering;
step S23, feature engineering: deleting features dominated by a single value and deleting features whose proportion of missing values exceeds 70%;
applying a OneHotEncoder when an object-type feature has fewer than 5 distinct values and a LabelEncoder when it has more than 5, and discretizing several features related to price and quantity; selecting features with SelectFromModel so that no more than 100 features finally enter the model;
step S24, model training: performing model training with an LR model, an xgboost model or an xgboost+LR hybrid model, comparing the scores of the 3 models, finally selecting the xgboost model as the final training model, and performing hyper-parameter selection by means of cross validation and grid search;
step S25, model evaluation: dividing the data set into a training set and a test set, training the model on the training set, and computing the offline binary-classification evaluation indexes on the test set;
step S26, model saving: unifying the feature engineering and the model into pipeline form and storing it as a PMML file, the PMML file using the model in a cross-environment manner so that the model service can call the model in a Java environment.
5. The method according to claim 4, wherein in step S23, missing values are processed as follows: missing values with actual meaning are left unprocessed; for columns with few missing values, the rows containing them are deleted directly; columns with many missing values are filled using the mean, median, mode or interpolation; and outliers are handled by replacing and filling them, and by deleting, according to the normal upper and lower limits of the data, those outside the range.
6. The machine learning model algorithm based search ranking method of claim 4 wherein in said step S23, the feature selection method includes the following three: filter, wrapper and Embedded.
7. The machine-learning-model-algorithm-based search ranking method of claim 4, wherein in said step S25, AUC value is 0.68, recall is 0.66, and accuracy is 0.7.
8. The machine learning model algorithm-based search ranking method of claim 1, wherein, in said step S4,
step S41, receiving a search service request: the front end sends a request to the search service, the search service recalls search data using the Elasticsearch full-text search engine and then calls the search ranking model service to rank them, and the model service is responsible for receiving the search service request;
step S42, reading characteristic data: reading required characteristic data from a PostgreSQL database according to parameters transmitted by search service;
step S43, loading a PMML model file: loading the trained PMML model file into a memory by using a third party jpmml library;
step S44, model prediction: predicting using the loaded model file and the input parameter data in uniform format, ranking the prediction results, and printing a log; caching the predicted ranking result in Redis and retaining it for a preset duration to reduce load from repeated searches by the user;
step S45, policy processing comprises: commodity completion, distribution-area matching, highlighting of search keywords, and exclusion of commodity data with associated platform tags from ranking;
step S46, return result: and returning the finally processed result to the search service, and uniformly returning the result to the front end by the search service for displaying.
9. The machine-learning-model-algorithm-based search ranking method of claim 8, wherein in said step S42, separately read feature tables are merged into model parameter data of a uniform format.
CN202211050166.5A 2022-08-31 2022-08-31 Search ordering method based on machine learning model algorithm Active CN115130008B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211050166.5A CN115130008B (en) 2022-08-31 2022-08-31 Search ordering method based on machine learning model algorithm

Publications (2)

Publication Number Publication Date
CN115130008A CN115130008A (en) 2022-09-30
CN115130008B true CN115130008B (en) 2022-11-25

Family

ID=83386965

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211050166.5A Active CN115130008B (en) 2022-08-31 2022-08-31 Search ordering method based on machine learning model algorithm

Country Status (1)

Country Link
CN (1) CN115130008B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116244310A (en) * 2023-03-06 2023-06-09 深圳今日人才信息科技有限公司 Freely-combinable data billboard manager and bottom-layer implementation method
CN116664219A (en) * 2023-04-14 2023-08-29 喀斯玛(北京)科技有限公司 Scientific research electronic commerce platform intelligent recommendation system based on machine learning

Citations (6)

Publication number Priority date Publication date Assignee Title
CN102306176A (en) * 2011-08-25 2012-01-04 浙江鸿程计算机***有限公司 On-line analytical processing (OLAP) keyword query method based on intrinsic characteristic of data warehouse
CN106777088A (en) * 2016-12-13 2017-05-31 飞狐信息技术(天津)有限公司 The method for sequencing search engines and system of iteratively faster
CN109189904A (en) * 2018-08-10 2019-01-11 上海中彦信息科技股份有限公司 Individuation search method and system
CN111597444A (en) * 2020-05-13 2020-08-28 北京达佳互联信息技术有限公司 Searching method, searching device, server and storage medium
CN112100444A (en) * 2020-09-27 2020-12-18 四川长虹电器股份有限公司 Search result ordering method and system based on machine learning
CN112749325A (en) * 2019-10-31 2021-05-04 北京京东尚科信息技术有限公司 Training method and device for search ranking model, electronic equipment and computer medium

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
US11704566B2 (en) * 2019-06-20 2023-07-18 Microsoft Technology Licensing, Llc Data sampling for model exploration utilizing a plurality of machine learning models
US11599548B2 (en) * 2019-07-01 2023-03-07 Kohl's, Inc. Utilize high performing trained machine learning models for information retrieval in a web store
CN113721898B (en) * 2021-08-30 2024-04-12 平安科技(深圳)有限公司 Machine learning model deployment method, system, computer equipment and storage medium


Non-Patent Citations (1)

Title
Intelligent search result ranking system based on XGBoost; Zhao Han et al.; 《软件导刊》; 2019-07-31 (No. 12); pp. 62-66 *


Similar Documents

Publication Publication Date Title
CN115130008B (en) Search ordering method based on machine learning model algorithm
US11921715B2 (en) Search integration
US20170235820A1 (en) System and engine for seeded clustering of news events
Chen et al. Approximate parallel high utility itemset mining
Leventhal An introduction to data mining and other techniques for advanced analytics
CN110175895B (en) Article recommendation method and device
McKnight Information management: strategies for gaining a competitive advantage with data
CN105159971B (en) A kind of cloud platform data retrieval method
Hammond et al. Cloud based predictive analytics: text classification, recommender systems and decision support
CN114371946B (en) Information push method and information push server based on cloud computing and big data
Kaur Web content classification: a survey
CN116431895A (en) Personalized recommendation method and system for safety production knowledge
Billot et al. Introduction to big data and its applications in insurance
Prakash et al. WS-BD-based two-level match: Interesting sequential patterns and Bayesian fuzzy clustering for predicting the web pages from weblogs
Lu et al. A novel e-commerce customer continuous purchase recommendation model research based on colony clustering
CN111460300A (en) Network content pushing method and device and storage medium
Teng et al. A novel fahp based book recommendation method by fusing apriori rule mining
Zhang Design and implementation of insurance product recommendation system
Vakali et al. New directions in web data management 1
Mohania et al. Active, Real-Time, and Intellective Data Warehousing.
CN113420096B (en) Index system construction method, device, equipment and storage medium
KR102653187B1 (en) web crawling-based learning data preprocessing electronic device and method thereof
Katal et al. Computational Intelligence Techniques for Recommendation System in Big Data
KR102602448B1 (en) System of predicting future demand using deep-learning analysis of convolutional neural network and artificial neural network oh yeah
Wang et al. Research on Consuming Behavior Based on User Search Data: A Case of Xiaomi Mobile Phone

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant