WO2021139255A1 - Model-based method, apparatus, and computer device for predicting data change frequency - Google Patents

Model-based method, apparatus, and computer device for predicting data change frequency

Info

Publication number
WO2021139255A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
data
entity
specified
trained
Prior art date
Application number
PCT/CN2020/118530
Other languages
English (en)
French (fr)
Inventor
张圣
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202010734520.0A external-priority patent/CN111859238B/zh
Application filed by 平安科技(深圳)有限公司
Publication of WO2021139255A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9538 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a method, device and computer equipment for predicting the frequency of data change based on a model.
  • The existing scheme for estimating the change frequency of network data rests mainly on one statistical assumption: changes to network data follow a Poisson distribution.
  • Under this assumption, X/T is an effective estimate of the change frequency, where T is the time interval and X is the number of changes of the network data within T.
  • However, this estimation scheme has the following shortcoming: many network resources do not provide a change history. In that case, a change can only be detected by comparing the data of the same page across two consecutive visits. Even if the data differs between the two visits, the number of times the resource changed within the time interval T still cannot be determined accurately, and without an accurate count the corresponding change frequency estimate is also inaccurate.
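The undercounting described above can be illustrated with a minimal Python sketch. The change times, visit schedule, and function names are hypothetical and chosen only for illustration; the sketch assumes a change is detected whenever at least one change falls between two consecutive visits.

```python
def naive_change_frequency(observed_changes: int, interval: float) -> float:
    """Estimate the change frequency as X / T."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    return observed_changes / interval


def count_detected_changes(change_times, visit_times):
    """Count visit pairs between which a change is detectable.

    Comparing two snapshots only reveals WHETHER the page changed,
    so several changes between two visits are counted as one.
    """
    detected = 0
    previous_visit = visit_times[0]
    for visit in visit_times[1:]:
        if any(previous_visit < t <= visit for t in change_times):
            detected += 1
        previous_visit = visit
    return detected


# Hypothetical data: 6 true changes in T = 10 days, page visited every 5 days.
change_times = [0.5, 1.0, 2.5, 6.0, 7.5, 9.0]
visit_times = [0.0, 5.0, 10.0]

true_rate = naive_change_frequency(len(change_times), 10.0)
observed_rate = naive_change_frequency(
    count_detected_changes(change_times, visit_times), 10.0
)
print(true_rate, observed_rate)
```

With these made-up numbers only two of the six changes are visible to the snapshot comparison, so X/T computed from observations is well below the true rate.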
  • The inventor realized that for newly emerging entities in a knowledge base, such as novel coronavirus pneumonia, little related data and little change history are available. If the Poisson-based estimation scheme is still used to predict the change frequency of such an entity, the prediction accuracy will be low.
  • The main purpose of this application is to provide a model-based method, device, computer equipment, and storage medium for predicting the change frequency of data, aiming to solve the technical problem that using the existing Poisson-distribution-based estimation scheme to predict the change frequency of newly emerging entities yields low prediction accuracy.
  • this application proposes a model-based method for predicting the frequency of data change, the method comprising the steps:
  • the output result is used as the predicted value of the change frequency of the specified entity.
  • this application also provides a model-based device for predicting the frequency of data changes, including:
  • the first obtaining module is used to obtain the initial data in the specified entry page corresponding to the specified entity from the encyclopedia website, where the specified entity is any entity in the preset knowledge base;
  • An extraction module configured to extract designated feature data corresponding to the designated entity from the initial data
  • a calling module for calling a pre-trained prediction model wherein the prediction model is generated after training a preset regression model based on a pre-collected sample label data set;
  • a prediction module configured to input the specified feature data into the prediction model, so as to perform prediction processing on the specified feature data through the prediction model;
  • the second obtaining module is configured to obtain the output result corresponding to the specified entry page output by the preset model
  • the first determining module is configured to use the output result as the predicted value of the change frequency of the designated entity.
  • The present application also provides a computer device, including a memory and a processor, where the memory stores a computer program and the processor, when executing the computer program, implements a model-based method for predicting the frequency of data changes, wherein the method includes the following steps:
  • the output result is used as the predicted value of the change frequency of the specified entity.
  • The present application also provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, a model-based method for predicting the frequency of data changes is implemented, wherein the method includes the following steps:
  • the output result is used as the predicted value of the change frequency of the specified entity.
  • The method, device, computer equipment, and storage medium for predicting the frequency of data change based on the model provided in this application intelligently and conveniently predict the change frequency of entities in the knowledge base and effectively improve the accuracy of that prediction.
  • FIG. 1 is a schematic flowchart of a method for predicting data change frequency based on a model according to an embodiment of the present application
  • FIG. 2 is a schematic structural diagram of an apparatus for predicting data change frequency based on a model according to an embodiment of the present application
  • Fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present application.
  • a model-based method for predicting data change frequency includes:
  • S1 Obtain the initial data in the specified entry page corresponding to the specified entity from the encyclopedia website, where the specified entity is any entity in the preset knowledge base;
  • S4 Input the specified feature data into the prediction model, so as to perform prediction processing on the specified feature data through the prediction model;
  • the execution subject of this method embodiment is a model-based device for predicting the frequency of data change.
  • The aforementioned model-based device for predicting the frequency of data changes can be implemented as a virtual device, such as software code, or as a physical device in which the relevant execution code is written or integrated, and it can interact with the user through a keyboard, mouse, remote control, touchpad, or voice control device.
  • the model-based device for predicting the frequency of data change of this embodiment can quickly and accurately generate the predicted value of the change frequency of any entity in the knowledge base. Specifically, first obtain the initial data in the specified entry page corresponding to the specified entity from the encyclopedia website, where the specified entity is any entity in the preset knowledge base.
  • the above-mentioned entities include words that have independent meanings and can be used to indicate any object in the preset knowledge base.
  • the aforementioned encyclopedia website may be any one or more online encyclopedia websites, such as Baidu Encyclopedia website, Wikipedia website, and so on.
  • the designated entry page may include a designated entity description page corresponding to the designated entity, and a designated entity update history page corresponding to the designated entity.
  • The initial data in the specified entry page corresponding to the specified entity refers to all the attribute data contained in the specified entry page of the encyclopedia website. For example, it can include the description text of the specified entity, the basic information table, and some user-related statistics.
  • An entity in the preset knowledge base usually corresponds to an entry page of an encyclopedia website.
  • For example, the entry page on the encyclopedia website corresponding to the entity "new coronavirus pneumonia" can be: https://baike.***.com/item/new coronavirus pneumonia.
  • one of the main ways to automate the construction of the above-mentioned preset knowledge base is to obtain entity knowledge through high-quality encyclopedia and professional vertical websites.
  • the current knowledge bases in general fields are mostly automatically constructed through encyclopedia websites, such as Wikipedia and Baidu Baike.
  • the knowledge base in the medical field can be automatically constructed using encyclopedia websites and medical professional vertical websites.
  • The problem of predicting the change frequency of entities in the knowledge base can therefore be transformed into estimating the change frequency of the encyclopedia entry page corresponding to each entity, that is, into predicting the change frequency of network data (web pages, etc.).
  • the designated feature data corresponding to the designated entity is extracted.
  • the above-mentioned designated characteristic data may specifically include four types of designated characteristic data, which are respectively basic statistical characteristics, user behavior characteristics, semantic characteristics, and dynamic characteristics.
  • the prediction model is generated after training a preset regression model based on a pre-collected sample label data set.
  • The sample label data set includes feature data related to the entity and a change frequency label value corresponding to the entity. The label value is the change frequency of the entity within a preset period in the future.
  • The change frequency value can be the quotient of the number of changes of an entity within a preset future period and the length of that period. It can be calculated from the entry page corresponding to the entity (including the entity description page and the entity update history page). The preset period is not specifically limited; for example, it can be set to 5 days.
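As a minimal sketch of the label calculation described above, the following Python function counts the edits that fall inside the preset period after a reference time and divides by the period length. The function name, the half-open window convention, and the sample timestamps are assumptions for illustration; the source only specifies the quotient.

```python
from datetime import datetime, timedelta


def change_frequency_label(edit_times, reference_time, period_days=5):
    """Changes in (reference_time, reference_time + period] divided by the period in days."""
    if period_days <= 0:
        raise ValueError("period_days must be positive")
    window_end = reference_time + timedelta(days=period_days)
    changes = sum(1 for t in edit_times if reference_time < t <= window_end)
    return changes / period_days


# Hypothetical update history of one entry page.
edit_times = [
    datetime(2020, 3, 2),
    datetime(2020, 3, 3),
    datetime(2020, 3, 5),
    datetime(2020, 3, 10),  # outside the 5-day window
]
label = change_frequency_label(edit_times, datetime(2020, 3, 1), period_days=5)
print(label)  # 3 changes / 5 days
```

When building a training set, the reference time would have to lie far enough in the past that the whole "future" window is already recorded in the update history.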
  • The above prediction model may be a pre-trained regression model, such as a linear regression model, an SVM regression model, a random forest regression model, or a regression model based on a deep-learning multilayer perceptron (MLP) network.
  • The output result of the prediction model corresponding to the specified entry page is obtained. The output result is the predicted value of the change frequency of the specified entry page within a preset period in the future, and it is used as the predicted value of the change frequency of the designated entity.
  • the predicted value of the change frequency corresponding to the designated entity refers to the quotient between the number of changes of the designated entity in a predetermined period in the future and a predetermined period.
  • In this way, a correspondence is established between predicting the change frequency of entities in the knowledge base and predicting the change frequency of entry pages on the encyclopedia website, so that once the predicted change frequency of an entry page is obtained, the change frequency of the corresponding entity in the knowledge base can be predicted quickly and conveniently.
  • For example, for the entity novel coronavirus pneumonia, the required feature data is obtained from the entry page corresponding to the entity at the current time point and input into the above prediction model. The model then outputs the change frequency of the entry page within a predetermined period in the future, that is, the change frequency of the entity novel coronavirus pneumonia within that period.
  • When this application needs to obtain the predicted value of the change frequency of a designated entity in the knowledge base, it first obtains the initial data in the designated entry page corresponding to the designated entity from the encyclopedia website. Based on that initial data, the corresponding specified feature data is constructed, and a machine learning regression model is used to predict the change frequency of the entry page. The predicted change frequency of the entry page output by the regression model is then used as the predicted value of the change frequency of the designated entity.
  • Because this application uses a machine learning regression model to predict the change frequency of the entry page of the encyclopedia website, the change frequency of entities in the knowledge base can be predicted intelligently and conveniently from the predicted change frequency of entry pages, which effectively improves the prediction accuracy of the change frequency of entities in the knowledge base.
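The overall flow (obtain initial data, extract features, call the model, use the output as the entity's predicted change frequency) can be sketched as follows. All names are hypothetical: `fetch_entry_page`, `extract_features`, and `model` are placeholders passed in by the caller, and the stubs stand in for a real crawler, feature extractor, and trained regression model.

```python
from typing import Callable, Dict, List


def predict_entity_change_frequency(
    entity: str,
    fetch_entry_page: Callable[[str], Dict],
    extract_features: Callable[[Dict], List[float]],
    model: Callable[[List[float]], float],
) -> float:
    """Predict the change frequency of an entity from its encyclopedia entry page."""
    initial_data = fetch_entry_page(entity)    # obtain the initial data of the entry page
    features = extract_features(initial_data)  # extract the specified feature data
    output = model(features)                   # run the pre-trained prediction model
    return output                              # output = predicted change frequency of the entity


# Stubs used only to make the sketch runnable.
def fetch_stub(entity: str) -> Dict:
    return {"entity": entity, "edit_count": 12, "existing_days": 60}


def extract_stub(data: Dict) -> List[float]:
    return [data["edit_count"] / data["existing_days"]]


def model_stub(features: List[float]) -> float:
    return features[0]


predicted = predict_entity_change_frequency(
    "example entity", fetch_stub, extract_stub, model_stub
)
print(predicted)
```

Passing the three stages in as callables keeps the sketch independent of any particular website, feature set, or regression library.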
  • the step S2 of extracting the designated feature data corresponding to the designated entity from the initial data includes:
  • the foregoing step of extracting the specified feature data corresponding to the specified entity from the foregoing initial data may specifically include: first obtaining preset feature category information.
  • the aforementioned feature category information may specifically include four feature category information, which are basic statistical feature information, user behavior feature information, semantic feature information, and dynamic feature information.
  • the specified feature data corresponding to the feature category information is extracted from the initial data based on the feature category information.
  • Based on the basic statistical feature information, user behavior feature information, semantic feature information, and dynamic feature information, only the feature data corresponding to the basic statistical features, user behavior features, semantic features, and dynamic features of the specified entity is extracted from the initial data; no data other than this specified feature data is extracted.
  • The specified feature data can be extracted directly from the designated entry page, or obtained by calculation from related data in the designated entry page.
  • The basic statistical feature information, user behavior feature information, semantic feature information, and dynamic feature information are described in detail below.
  • The basic statistical features are all obtained by simple calculations on information in the entry page corresponding to the entity. They can include the first feature (existing time), the second feature (page text size), and the third feature (entity description text size).
  • The first feature is the time interval between the current time and the entry creation time. The creation time of the specified entity can be obtained from the specified entity update history page; the current time minus the creation time gives the first feature, that is, the existing time. The change frequency of an entity tends to be relatively high at the beginning and should gradually decrease over time.
  • The second feature is the length of all the text in the entry page. Richer text suggests that the entity is more popular and more likely to change. That is, the second feature is the total number of words in the specified entity description page, including both the description text related to the specified entity and text unrelated to the entity, such as reference links.
  • The third feature counts only the text length of the main description of the entity in the entry page, excluding external links, advertisements, and other text. That is, the third feature is the number of words in the description text related to the entity in the specified entity description page.
  • The user behavior features are directly related to user behavior and can include the fourth feature (the number of user edits) and the fifth feature (the number of user views).
  • The fourth feature is the total number of edits. The number of user edits associated with the specified entity can be obtained directly from the specified entity description page.
  • The fifth feature is the number of visits to the entry page corresponding to the entity. Intuitively, the more visits, the more popular the entity and the more likely it is to change. Similarly, the number of user views associated with the specified entity can be obtained directly from the specified entity description page.
  • The semantic features are information directly related to the entity's semantics, obtained from hyperlink information. They can include the sixth feature (the number of hyperlinks) and the seventh feature (the number of hyperlinks to entities).
  • The sixth feature is the number of all hyperlinks on the entry page. Some of the hyperlinks link to other entities and the rest link to external reference information. If the content behind these links changes, the changes are likely to propagate to the entity. That is, the sixth feature is the number of all hyperlinks in the specified entity description page, including hyperlinks to entities and hyperlinks to external references.
  • The seventh feature counts only the hyperlinks that link to entities. Interlinked entities are generally semantically related, so the impact of changes in them is more direct. That is, the seventh feature is the number of hyperlinks to entities in the specified entity description page.
  • The dynamic features are dynamic information about the entity obtained from the historical change record of the entry. They can include the eighth feature (historical change frequency), the ninth feature (the number of changes in the most recent preset period), and the tenth feature (the number of changes in the last four preset periods).
  • The eighth feature is the historical change frequency of the page. The change frequency of an entry generally correlates strongly with its historical change frequency. The eighth feature can be obtained by first reading the number of historical user edits from the specified entity description page and then dividing it by the time the specified entity has existed.
  • The ninth feature is motivated by a time series (TS) perspective: the number of changes in a predetermined future period correlates strongly with the history of past changes. The number of changes in the preset period closest to the current time can be calculated directly from the specified entity update history page.
  • The preset period is not specifically limited; for example, it can be set to 5 days. The tenth feature is the number of changes of the entry page in the last four preset periods.
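The ten features described above can be assembled into one feature vector roughly as follows. This is a sketch under stated assumptions: the `EntryPage` record and its field names are hypothetical, text size is approximated by whitespace-separated word count, and the tenth feature is taken as the total number of changes over the last four periods.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass
class EntryPage:
    """Hypothetical container for data scraped from one entry page."""
    created_at: datetime
    page_text: str                       # all text on the description page
    description_text: str                # entity description text only
    view_count: int
    hyperlinks: List[Tuple[str, bool]]   # (target, links_to_entity)
    edit_times: List[datetime]           # from the update history page


def count_recent(edit_times, now, days):
    """Number of edits in the window (now - days, now]."""
    start = now - timedelta(days=days)
    return sum(1 for t in edit_times if start < t <= now)


def extract_features(page: EntryPage, now: datetime, period_days: int = 5) -> List[float]:
    existing_days = (now - page.created_at).total_seconds() / 86400.0
    total_edits = len(page.edit_times)
    return [
        existing_days,                                                   # 1: existing time
        float(len(page.page_text.split())),                              # 2: page text size
        float(len(page.description_text.split())),                       # 3: description text size
        float(total_edits),                                              # 4: number of user edits
        float(page.view_count),                                          # 5: number of user views
        float(len(page.hyperlinks)),                                     # 6: number of hyperlinks
        float(sum(1 for _, to_entity in page.hyperlinks if to_entity)),  # 7: hyperlinks to entities
        total_edits / existing_days if existing_days > 0 else 0.0,       # 8: historical change frequency
        float(count_recent(page.edit_times, now, period_days)),          # 9: changes in the latest period
        float(count_recent(page.edit_times, now, 4 * period_days)),      # 10: changes in the last four periods
    ]


page = EntryPage(
    created_at=datetime(2020, 1, 1),
    page_text="a b c d e f",
    description_text="a b c",
    view_count=100,
    hyperlinks=[("entity-1", True), ("entity-2", True), ("external-ref", False)],
    edit_times=[datetime(2020, 1, 2), datetime(2020, 3, 5),
                datetime(2020, 3, 18), datetime(2020, 3, 20)],
)
features = extract_features(page, now=datetime(2020, 3, 21))
print(features)
```

For Chinese-language pages a character count would probably be a better size measure than a whitespace split; the word count here is only a placeholder.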
  • In this way, this embodiment can quickly and accurately extract the required target data, that is, the specified feature data corresponding to the feature category information, from the initial data. The specified feature data can then be input into the prediction model, which predicts on it quickly and accurately and outputs the predicted change frequency of the specified entry page, so that the change frequency of entities in the knowledge base can subsequently be estimated from the predicted change frequency of entry pages.
  • data cleaning may be performed on the initial data to clean out impurities/useless data in the initial data.
  • the method before the step S3 of invoking the pre-trained prediction model, the method includes:
  • a determination process of determining the prediction model is also included.
  • the step of invoking a pre-trained prediction model includes: first obtaining a pre-trained regression model.
  • the above regression model is not specifically limited.
  • The above regression model may specifically be a linear regression model, an SVM regression model, a random forest regression model, or a regression model based on a deep-learning multilayer perceptron (MLP) network.
  • the above-mentioned pre-trained regression model is used as the above-mentioned prediction model.
  • this embodiment can pre-train four regression models: linear regression model, SVM regression model, random forest regression model, and MLP-based regression model, and then select any regression model from the four regression models as the above prediction model.
  • a corresponding regression model can be selected as the prediction model according to the actual use intention of the user, or any regression model can be selected by the device itself as the prediction model.
  • MLP: multi-layer perceptron network, that is, a multi-layer fully connected neural network.
  • A trained MLP-based regression model with 3 hidden layers can be selected as the prediction model.
  • The prediction model then outputs the predicted value of the change frequency corresponding to the specified entry page, so that the change frequency of entities in the knowledge base can subsequently be estimated from the predicted change frequency of entry pages, which effectively improves the accuracy of predicting the change frequency of entities in the knowledge base.
  • the method before the step S3 of invoking the pre-trained prediction model, the method includes:
  • S310 Collect the first specified number of entry page information from the encyclopedia website
  • S311 Construct a sample label data set using the entry page information according to a preset feature construction rule, where the sample label data set includes feature data related to an entity and a change frequency label value corresponding to the entity;
  • S313 Use the training data set to train a preset regression model by using a stochastic gradient descent method to generate a trained first initial model
  • S314 Use the test data set to verify the trained first initial model, and determine whether the verification passes;
  • the process of creating the prediction model is also included.
  • the method may further include: first collecting the first specified number of entry page information from the encyclopedia website.
  • The first specified number is not specifically limited and can be set according to actual conditions, for example according to the number of entities in the knowledge base. If there are 200,000 entities in the preset knowledge base, the first specified number can be set to 200,000, and the entry page information corresponding to those entities can be collected from the encyclopedia website.
  • The sample label data set is then constructed using the above entry page information.
  • The feature construction rule constructs, based on the preset feature category information, a sample label data set corresponding to the feature category information of each entity from the entry page information.
  • The sample label data set includes feature data related to the entity and a change frequency label value corresponding to the entity. The label value is the change frequency of the entity within a preset period in the future.
  • The label value can be the quotient of the number of changes of an entity within a preset future period and the length of that period. It can be calculated from the entry page corresponding to the entity (including the entity description page and the entity update history page).
  • A preset proportion of data can be randomly extracted from the sample label data set as the training data set, and the remaining data can be used as the test data set.
  • the above-mentioned preset ratio is not specifically limited, and can be set according to actual needs.
  • For example, the preset ratio can be set to 80%: 80% of the data is randomly selected from the sample label data set as the training data set, and the remaining 20% is used as the test data set.
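A random 80/20 split of this kind can be sketched with the Python standard library. The function name and the fixed seed are assumptions added for reproducibility; the source only states that the training portion is drawn at random.

```python
import random


def split_dataset(samples, train_ratio=0.8, seed=0):
    """Randomly split samples into (training set, test set)."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be between 0 and 1")
    shuffled = list(samples)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for a repeatable split
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical labelled samples: (feature vector, change frequency label).
samples = [([float(i)], i / 10.0) for i in range(10)]
train_set, test_set = split_dataset(samples, train_ratio=0.8)
print(len(train_set), len(test_set))
```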
  • the training data set is then used, and the stochastic gradient descent method is adopted to train the preset regression model to generate the trained first initial model.
  • The model type of the preset regression model is not specifically limited, and may be a linear regression model, an SVM regression model, a random forest regression model, or an MLP-based regression model.
  • the training process of using the above-mentioned stochastic gradient descent method to train the regression model can refer to the existing training process, which will not be repeated here.
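For the simplest of the listed model types, linear regression, training by stochastic gradient descent might look like the sketch below. It is a pure-Python illustration with squared-error loss; the learning rate, epoch count, and synthetic data are assumptions, and real feature data would normally be scaled before training.

```python
import random


def train_linear_sgd(samples, learning_rate=0.1, epochs=500, seed=0):
    """Fit y ~ w . x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    n_features = len(samples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)                  # visit samples in random order each epoch
        for x, y in data:
            prediction = bias + sum(w * xi for w, xi in zip(weights, x))
            error = prediction - y
            # Gradient of 0.5 * error^2 with respect to each parameter.
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


# Synthetic, noise-free data generated from y = 2x + 1 (illustration only).
training_data = [([i / 10.0], 2.0 * (i / 10.0) + 1.0) for i in range(11)]
weights, bias = train_linear_sgd(training_data)
print(weights, bias)
```

The same loop structure carries over to an MLP, where the per-sample gradient would be computed by backpropagation instead of the closed-form expression used here.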
  • The test data set is used to verify the trained first initial model, and it is judged whether the verification passes. If the verification passes, the trained first initial model is used as the prediction model, so that the prediction model can be used directly to accurately output the predicted value of the change frequency corresponding to the specified entry page. The change frequency of entities in the knowledge base can then be estimated intelligently from the predicted change frequency of entry pages, which effectively improves the accuracy of predicting the change frequency of entities in the knowledge base.
  • the method may further include: storing the prediction model in a blockchain network, and verifying the prediction model by using the blockchain.
  • the above-mentioned prediction model generated by training is stored and managed, which can effectively ensure the safety and non-tamperability of the above-mentioned prediction model.
  • the above-mentioned step S314 of verifying the trained first initial model by using the test data set and determining whether the verification is passed includes:
  • The step of verifying the trained first initial model by using the test data set and judging whether the verification passes may specifically include: after the trained first initial model is obtained, each test sample in the test data set is input into the trained first initial model to obtain the test result of each test sample. The accuracy of the trained first initial model is then obtained from these test results. Finally, it is determined whether the accuracy is greater than the preset accuracy threshold.
  • The value of the accuracy threshold is not specifically limited and can be set according to actual needs; for example, it can be set to 0.9. If the accuracy is not greater than the preset accuracy threshold, the verification fails. If the accuracy is greater than the preset accuracy threshold, the verification passes, so that the prediction model can be used directly to accurately output the predicted value of the change frequency corresponding to the specified entry page. The change frequency of entities in the knowledge base can then be estimated intelligently from the predicted change frequency of entry pages, which effectively improves the accuracy of predicting the change frequency of entities in the knowledge base.
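The verification step can be sketched as below. The source does not say when a regression output counts as a correct test result, so this sketch assumes a prediction is correct if it lies within a hypothetical absolute `tolerance` of the label; the model and test samples are made up.

```python
def accuracy(model, test_samples, tolerance=0.1):
    """Fraction of test samples whose prediction is within `tolerance` of the label."""
    if not test_samples:
        raise ValueError("test_samples must not be empty")
    correct = sum(1 for x, y in test_samples if abs(model(x) - y) <= tolerance)
    return correct / len(test_samples)


def passes_verification(model, test_samples, threshold=0.9, tolerance=0.1):
    """Verification passes only if accuracy is strictly greater than the threshold."""
    return accuracy(model, test_samples, tolerance) > threshold


def toy_model(x):
    # Stand-in for a trained regression model.
    return 0.5 * x[0]


# Nine samples the toy model predicts exactly, plus one it gets badly wrong.
mostly_right = [([float(i)], 0.5 * i) for i in range(9)] + [([9.0], 10.0)]
all_right = [([float(i)], 0.5 * i) for i in range(10)]

print(accuracy(toy_model, mostly_right), passes_verification(toy_model, mostly_right))
print(accuracy(toy_model, all_right), passes_verification(toy_model, all_right))
```

Because the comparison is "greater than" rather than "at least", an accuracy of exactly 0.9 does not pass with a threshold of 0.9.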
  • after the step of judging whether the accuracy is greater than the preset accuracy threshold, the method includes:
  • if it is determined that the accuracy is not greater than the preset accuracy threshold, the specified test samples whose test results are wrong are screened out of the test sample set. After the specified test samples are obtained, they are added to the training sample set to generate an updated training sample set.
  • the regression model is trained according to the updated training sample set to generate a trained second initial model.
  • the above-mentioned trained second initial model is used as the above-mentioned prediction model.
  • the foregoing process of training the initial model according to the updated training sample set to generate a trained second initial model, and using the trained second initial model as the prediction model, may specifically include iteratively performing the following steps until the accuracy of the trained second initial model is greater than the accuracy threshold: training the regression model according to the updated training sample set to generate a trained second initial model; using the training sample set to test the trained second initial model, and determining whether its accuracy is greater than the accuracy threshold; if it is not, updating the updated training sample set again according to the test samples in the test sample set whose test results are wrong; and, after the iteration ends, using the trained second initial model generated in the last round of iteration as the prediction model.
  • in this embodiment, when the accuracy of the trained first initial model is not greater than the preset accuracy threshold, the regression model is retrained using the updated training samples to generate a prediction model whose accuracy is greater than the accuracy threshold. The prediction model can then be used to accurately output the predicted value of the change frequency corresponding to the specified entry page, and the change frequency of the entities in the knowledge base can be estimated from the predicted change frequency of the entry page, which effectively improves the accuracy of predicting the change frequency of the entities in the knowledge base.
  • the method before the step S3 of invoking the pre-trained prediction model, the method includes:
  • S321: According to the preset ensemble learning algorithm, use all the sub-learners to train the preset meta-model to generate a trained meta-model;
  • the method may further include: first obtaining a second specified number of pre-trained sub-learners.
  • the above-mentioned second designated number is not specifically limited, and can be set according to actual needs, and it is preferable to set the second designated number to four.
  • the above-mentioned sub-learners may be the above-mentioned regression models.
  • the four sub-learners are used to train the preset meta model to generate a trained meta model.
  • the trained meta-model is used as the prediction model, so that the specified feature data obtained subsequently can be input into the prediction model, which quickly and accurately performs prediction processing on the specified feature data and outputs the predicted value of the change frequency corresponding to the specified entry page.
  • when the second specified number is four, the pre-trained sub-learners are four pre-trained regression models: a linear regression model, an SVM regression model, a random forest regression model, and an MLP-based regression model. To make combined use of these four regression models, Stacking-based ensemble learning is used: a meta-model (meta-regressor) is trained to combine the sub-learners, with the outputs of the sub-learners serving as the input for training the meta-model. The trained meta-model is finally used as the prediction model.
  • the preset meta model used above may be a GDBT (Gradient Boost Decision Tree) model.
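  • $y_l = W_l x$, where $W_l$ is the model parameter of the linear regression model and $y_l$ is the output of the linear regression model;
  • $y_{svm} = \mathrm{SVM}(x)$, where $y_{svm}$ is the output of the SVM regression model;
  • $y_{rf} = \mathrm{RandomForest}(x)$, where $y_{rf}$ is the output of the random forest regression model;
  • $y_{mlp} = W_3 (W_2 (W_1 x))$, where $W_1$, $W_2$, $W_3$ are the parameters of the three hidden layers of the MLP-based regression model and $y_{mlp}$ is the output of the MLP-based regression model.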
  • the outputs of the four regression models are concatenated into a four-dimensional vector of $y_l$, $y_{svm}$, $y_{rf}$, and $y_{mlp}$, which is used as the input of the meta-model GDBT. The meta-model GDBT is trained using the stochastic gradient descent algorithm and the ensemble learning algorithm to obtain the trained meta-model.
  • when the trained meta-model receives the feature vector of an input entity, its output y is the predicted value of the change frequency corresponding to the entity.
  • this embodiment uses an ensemble learning algorithm to combine multiple regression models into a trained meta-model, and uses the trained meta-model as the prediction model to perform prediction processing on the specified feature data corresponding to the input specified entity, which effectively improves the prediction performance of the model.
  • the step S1 of obtaining the initial data in the designated entry page corresponding to the designated entity from the encyclopedia website includes:
  • S101 Obtain initial data in a specified entry page corresponding to the specified entity through the data query interface.
  • the foregoing step of obtaining the initial data in the specified entry page corresponding to the specified entity from the encyclopedia website may specifically include: first calling the data query interface corresponding to the encyclopedia website. After completing the call to the above-mentioned data query interface, the initial data in the designated entry page corresponding to the above-mentioned designated entity is obtained through the above-mentioned data query interface.
  • the above-mentioned initial data in the specified entry page corresponding to the specified entity refers to all the data contained in the specified entry page in the encyclopedia website.
  • the initial data of the specified entry page may include at least the description text information of the specified entity, the basic information table information, some user-related statistical information (such as the number of edits and the number of entry views), the time and the reason of each change to the entry corresponding to the specified entity, the change history information, the hyperlink information, and so on.
  • the hyperlink information implies the relationships between entities (such as semantic relationships). The specified entry page contains many hyperlinks: some of them link to other entities in the encyclopedia website that are different from the specified entity, and the remaining hyperlinks link to external reference information corresponding to the specified entity.
  • in this embodiment, the initial data in the specified entry page corresponding to the specified entity is obtained by calling the data query interface corresponding to the encyclopedia website, so that the specified feature data corresponding to the specified entity can subsequently be extracted quickly and conveniently from the obtained initial data.
  • an embodiment of the present application also provides a model-based device for predicting the frequency of data change, including:
  • the first obtaining module 1 is configured to obtain the initial data in the specified entry page corresponding to the specified entity from the encyclopedia website, where the specified entity is any entity in the preset knowledge base;
  • the extraction module 2 is used to extract the designated feature data corresponding to the designated entity from the initial data
  • the calling module 3 is used to call a pre-trained prediction model, where the prediction model is generated after training a preset regression model based on a pre-collected sample label data set;
  • the prediction module 4 is configured to input the specified feature data into the prediction model, so as to perform prediction processing on the specified feature data through the prediction model;
  • the second obtaining module 5 is configured to obtain the output result corresponding to the specified entry page output by the preset model
  • the first determining module 6 is configured to use the output result as the predicted value of the change frequency of the designated entity.
  • the implementation of the functions and effects of the first obtaining module, extraction module, calling module, prediction module, second obtaining module, and first determining module in the above model-based device for predicting the frequency of data change is detailed in the implementation of the corresponding steps S1 to S6 in the foregoing model-based method for predicting the frequency of data change, and will not be repeated here.
  • the aforementioned extraction module includes:
  • the second acquiring sub-module is used to acquire preset feature category information
  • the extraction sub-module is used to extract the designated feature data corresponding to the feature type information from the initial data according to the feature category information.
  • the implementation of the functions and roles of the second obtaining sub-module and the extraction sub-module in the above model-based device for predicting the frequency of data change is detailed in the implementation of the corresponding steps S200 to S201 in the above model-based method for predicting the frequency of data change, and will not be repeated here.
  • the aforementioned model-based device for predicting the frequency of data change includes:
  • the third acquisition module is used to acquire a pre-trained regression model
  • the second determining module is configured to use the pre-trained regression model as the prediction model.
  • the implementation of the functions and roles of the third obtaining module and the second determining module in the aforementioned model-based device for predicting the frequency of data change is detailed in the implementation of the corresponding steps S300 to S301 in the aforementioned model-based method for predicting the frequency of data change, and will not be repeated here.
  • the aforementioned model-based device for predicting the frequency of data change includes:
  • the collection module is used to collect the first specified number of entry page information from the encyclopedia website
  • the construction module is used to construct a sample label data set using the entry page information according to a preset feature construction rule, wherein the sample label data set includes feature data related to the entity and a change frequency label corresponding to the entity value;
  • a dividing module for dividing the label data set into a training data set and a test data set
  • the first training module is configured to use the training data set and adopt a stochastic gradient descent method to train a preset regression model to generate a trained first initial model;
  • the verification module is configured to verify the trained first initial model by using the test data set, and determine whether the verification is passed;
  • a third determination module configured to use the trained first initial model as the prediction model if the verification is passed;
  • the storage module is used to store the prediction model in the blockchain network.
  • the implementation of the functions and roles of the collection module, the construction module, the dividing module, the first training module, the verification module, the third determining module, and the storage module in the above-mentioned model-based device for predicting the frequency of data change is detailed in the implementation of the corresponding steps S310 to S316 in the above model-based method for predicting the frequency of data change, and will not be repeated here.
  • the above-mentioned verification module includes:
  • An input sub-module for inputting each test sample in the test data set into the trained first initial model to obtain the test result of each test sample
  • the third obtaining submodule is used to obtain the accuracy of the trained first initial model according to the test results of the test samples;
  • the judging sub-module is used to judge whether the accuracy rate is greater than a preset accuracy rate threshold
  • the first determining sub-module is configured to determine that the verification is passed if it is determined that the accuracy rate is greater than a preset accuracy rate threshold;
  • the second determining sub-module is configured to determine that the verification fails if it is determined that the accuracy rate is not greater than a preset accuracy rate threshold.
  • the implementation of the functions and effects of the input sub-module, the third obtaining sub-module, the judging sub-module, the first determining sub-module, and the second determining sub-module in the above-mentioned model-based device for predicting the frequency of data change is detailed in the implementation of the corresponding steps in the above model-based method for predicting the frequency of data change, and will not be repeated here.
  • the above-mentioned verification module includes:
  • the screening sub-module is used to screen out designated test samples in the test sample set with wrong test results if it is determined that the accuracy rate is not greater than a preset accuracy rate threshold;
  • a training sub-module configured to train the regression model according to the updated training sample set to generate a trained second initial model
  • the third determining sub-module is configured to use the trained second initial model as the prediction model.
  • the implementation of the functions and effects of the screening sub-module, generating sub-module, training sub-module, and third determining sub-module in the above-mentioned model-based device for predicting the frequency of data change is detailed in the implementation of the corresponding steps S31420 to S31423 in the aforementioned model-based method for predicting the frequency of data change, and will not be repeated here.
  • the aforementioned model-based device for predicting the frequency of data change includes:
  • the fourth acquisition module is used to acquire a second specified number of pre-trained sub-learners
  • the second training module is configured to use all the sub-learners to train the preset meta-model according to the preset integrated learning algorithm to generate a trained meta-model;
  • the fourth determining module is configured to use the trained meta-model as the prediction model.
  • the implementation of the functions and roles of the fourth obtaining module, the second training module, and the fourth determining module in the above model-based device for predicting the frequency of data change is detailed in the implementation of the corresponding steps S320 to S322 in the above model-based method for predicting the frequency of data change, and will not be repeated here.
  • the above-mentioned first acquisition module includes:
  • the calling sub-module is used to call the data query interface corresponding to the encyclopedia website
  • the first obtaining sub-module is configured to obtain the initial data in the designated entry page corresponding to the designated entity through the data query interface.
  • the implementation of the functions and roles of the calling sub-module and the first obtaining sub-module in the aforementioned model-based device for predicting the frequency of data change is detailed in the implementation of the corresponding steps S100 to S101 in the aforementioned model-based method for predicting the frequency of data change, and will not be repeated here.
  • an embodiment of the present application also provides a computer device.
  • the computer device may be a server, and its internal structure may be as shown in FIG. 3.
  • the computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, a computer program, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
  • the database of the computer equipment is used to store data such as the designated entity, the initial data in the designated entry page, the designated feature data, and the predicted value of the change frequency.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer program is executed by the processor to implement the model-based method for predicting the frequency of data change shown in any of the above exemplary embodiments.
  • the above-mentioned processor executes the steps of the above-mentioned model-based method for predicting the frequency of data change:
  • the output result is used as the predicted value of the change frequency of the specified entity.
  • FIG. 3 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the devices and computer equipment to which the solution of the present application is applied.
  • An embodiment of the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be non-volatile or volatile, and has a computer program stored thereon; when the computer program is executed by a processor, the model-based method for predicting the frequency of data change is implemented.
  • the model-based method for predicting the frequency of data change includes the following steps:
  • the output result is used as the predicted value of the change frequency of the specified entity.
  • the model-based method, device, computer equipment, and storage medium for predicting the frequency of data change establish a correspondence between entities in the knowledge base and entry pages in the encyclopedia website, construct the corresponding feature data from the initial data of the entry page, and use a machine learning regression model to predict the change frequency of the entry page. The change frequency of the entities in the knowledge base can thus be predicted intelligently and conveniently from the predicted change frequency of the entry page, which effectively improves the accuracy of predicting the change frequency of the entities in the knowledge base.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

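A model-based method, device, computer equipment, and storage medium for predicting the frequency of data change, wherein the method includes: obtaining, from an encyclopedia website, initial data in a specified entry page corresponding to a specified entity, where the specified entity is any entity in a preset knowledge base (S1); extracting, from the initial data, specified feature data corresponding to the specified entity (S2); calling a pre-trained prediction model, where the prediction model is generated by training a preset regression model on a pre-collected sample label data set (S3); inputting the specified feature data into the prediction model, so that the prediction model performs prediction processing on the specified feature data (S4); obtaining the output result, corresponding to the specified entry page, that is output by the preset model (S5); and using the output result as the predicted value of the change frequency of the specified entity (S6). With this method, the change frequency of entities in the knowledge base can be predicted intelligently and conveniently on the basis of the predicted change frequency of entry pages.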

Description

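Model-based method, device, and computer equipment for predicting the frequency of data change
This application claims priority to the Chinese patent application No. 202010734520.0, entitled "Model-based method, device, and computer equipment for predicting the frequency of data change", filed with the Chinese Patent Office on July 27, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to a model-based method, device, and computer equipment for predicting the frequency of data change.
Background
Existing schemes for estimating the change frequency of web data mainly rest on a statistical assumption: the change frequency of web data follows a Poisson distribution. Under the Poisson assumption, X/T is a valid estimate of the change frequency (T denotes a time interval, and X denotes the number of times the web data changed within T). This estimation scheme has the following shortcomings, however. Many web resources do not provide a change history; in that case, the only way to know whether a page has changed is to compare the data of the same page across two successive visits. Even if the data differs between the two visits, the number of times the resource changed within T still cannot be obtained accurately, and if that count cannot be obtained accurately, the corresponding change frequency estimate is inaccurate as well. The inventor further realized that for newly emerging entities in a knowledge base, such as novel coronavirus pneumonia, data related to the new entity is currently scarce and its change history is short. If the Poisson-based estimation scheme is still used to predict the change frequency of such a new entity, the prediction accuracy will be low.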
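The X/T baseline and its weakness can be illustrated with a minimal sketch. The function names and the snapshot strings below are hypothetical, introduced only for illustration; the point is that a crawler comparing two visits can detect at most one change per pair of visits, so the observed X undercounts the true number of changes.

```python
from typing import Sequence


def estimate_change_rate(change_count: int, interval_days: float) -> float:
    """Baseline estimate X / T: observed changes per day over the interval."""
    if interval_days <= 0:
        raise ValueError("interval_days must be positive")
    return change_count / interval_days


def detected_changes(snapshots: Sequence[str]) -> int:
    """Count the changes visible when only successive snapshots can be compared."""
    return sum(1 for prev, cur in zip(snapshots, snapshots[1:]) if prev != cur)


# Suppose the page really went v1 -> v2 -> v3 -> v4 (3 changes) within 30 days,
# but it was visited only at the start and at the end of the interval.
true_rate = estimate_change_rate(3, 30.0)
observed_rate = estimate_change_rate(detected_changes(["v1", "v4"]), 30.0)
```

Here `observed_rate` is lower than `true_rate`, which is the inaccuracy the background section describes for resources without a change history.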
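Technical Problem
The main purpose of this application is to provide a model-based method, device, computer equipment, and storage medium for predicting the frequency of data change, aiming to solve the technical problem that using the existing Poisson-based estimation scheme to predict the change frequency of newly emerging entities leads to low prediction accuracy for those entities.
Technical Solution
To achieve the above purpose, in a first aspect, this application provides a model-based method for predicting the frequency of data change, the method including the steps of:
obtaining, from an encyclopedia website, initial data in a specified entry page corresponding to a specified entity, where the specified entity is any entity in a preset knowledge base;
extracting, from the initial data, specified feature data corresponding to the specified entity;
calling a pre-trained prediction model, where the prediction model is generated by training a preset regression model on a pre-collected sample label data set;
inputting the specified feature data into the prediction model, so that the prediction model performs prediction processing on the specified feature data;
obtaining the output result, corresponding to the specified entry page, that is output by the preset model;
using the output result as the predicted value of the change frequency of the specified entity.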
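In a second aspect, this application further provides a model-based device for predicting the frequency of data change, including:
a first obtaining module, configured to obtain, from an encyclopedia website, initial data in a specified entry page corresponding to a specified entity, where the specified entity is any entity in a preset knowledge base;
an extraction module, configured to extract, from the initial data, specified feature data corresponding to the specified entity;
a calling module, configured to call a pre-trained prediction model, where the prediction model is generated by training a preset regression model on a pre-collected sample label data set;
a prediction module, configured to input the specified feature data into the prediction model, so that the prediction model performs prediction processing on the specified feature data;
a second obtaining module, configured to obtain the output result, corresponding to the specified entry page, that is output by the preset model;
a first determining module, configured to use the output result as the predicted value of the change frequency of the specified entity.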
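In a third aspect, this application further provides a computer device, including a memory and a processor, where the memory stores a computer program and the processor implements a model-based method for predicting the frequency of data change when executing the computer program, the model-based method for predicting the frequency of data change including the following steps:
obtaining, from an encyclopedia website, initial data in a specified entry page corresponding to a specified entity, where the specified entity is any entity in a preset knowledge base;
extracting, from the initial data, specified feature data corresponding to the specified entity;
calling a pre-trained prediction model, where the prediction model is generated by training a preset regression model on a pre-collected sample label data set;
inputting the specified feature data into the prediction model, so that the prediction model performs prediction processing on the specified feature data;
obtaining the output result, corresponding to the specified entry page, that is output by the preset model;
using the output result as the predicted value of the change frequency of the specified entity.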
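In a fourth aspect, this application further provides a computer-readable storage medium on which a computer program is stored, where the computer program implements a model-based method for predicting the frequency of data change when executed by a processor, the model-based method for predicting the frequency of data change including the following steps:
obtaining, from an encyclopedia website, initial data in a specified entry page corresponding to a specified entity, where the specified entity is any entity in a preset knowledge base;
extracting, from the initial data, specified feature data corresponding to the specified entity;
calling a pre-trained prediction model, where the prediction model is generated by training a preset regression model on a pre-collected sample label data set;
inputting the specified feature data into the prediction model, so that the prediction model performs prediction processing on the specified feature data;
obtaining the output result, corresponding to the specified entry page, that is output by the preset model;
using the output result as the predicted value of the change frequency of the specified entity.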
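Beneficial Effects
The model-based method, device, computer equipment, and storage medium for predicting the frequency of data change provided in this application make it possible to predict the change frequency of entities in the knowledge base intelligently and conveniently, and effectively improve the accuracy of that prediction.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a model-based method for predicting the frequency of data change according to an embodiment of this application;
FIG. 2 is a schematic structural diagram of a model-based device for predicting the frequency of data change according to an embodiment of this application;
FIG. 3 is a schematic structural diagram of a computer device according to an embodiment of this application.
Best Mode for Carrying Out the Invention
It should be understood that the specific embodiments described here are only intended to explain this application and are not intended to limit it.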
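Referring to FIG. 1, a model-based method for predicting the frequency of data change according to an embodiment of this application includes:
S1: obtaining, from an encyclopedia website, initial data in a specified entry page corresponding to a specified entity, where the specified entity is any entity in a preset knowledge base;
S2: extracting, from the initial data, specified feature data corresponding to the specified entity;
S3: calling a pre-trained prediction model, where the prediction model is generated by training a preset regression model on a pre-collected sample label data set;
S4: inputting the specified feature data into the prediction model, so that the prediction model performs prediction processing on the specified feature data;
S5: obtaining the output result, corresponding to the specified entry page, that is output by the preset model;
S6: using the output result as the predicted value of the change frequency of the specified entity.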
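As described in steps S1 to S6 above, the method of this embodiment is executed by a model-based device for predicting the frequency of data change. In practice, the device may be implemented as a virtual device, such as software code, or as a physical device into which the relevant executable code is written or integrated, and it may interact with a user through a keyboard, mouse, remote control, touchpad, voice-control device, or the like. The device of this embodiment can quickly and accurately generate a predicted value of the change frequency for any entity in the knowledge base.
Specifically, initial data in the specified entry page corresponding to the specified entity is first obtained from the encyclopedia website, where the specified entity is any entity in the preset knowledge base. An entity here is a word in the preset knowledge base that has independent meaning and can be used to denote some object. The encyclopedia website may be any one or more online encyclopedia websites, for example the Baidu Baike website, the *** website, and so on. The specified entry page may include a specified entity description page corresponding to the specified entity and a specified entity update history page corresponding to the specified entity. The initial data in the specified entry page refers to all attribute data contained in that page on the encyclopedia website, and may include, for example, the description text information of the specified entity, the basic information table information, some user-related statistical information, and so on. For example, an entity in the preset knowledge base usually corresponds to one entry page of an encyclopedia website; if the preset knowledge base contains an entity named "新型冠状病毒肺炎" (novel coronavirus pneumonia), the corresponding entry page on the encyclopedia website may be: https://baike.***.com/item/新型冠状病毒肺炎.
In addition, one of the main ways to build the preset knowledge base automatically is to obtain entity knowledge from high-quality encyclopedia websites and specialized vertical websites. For example, general-domain knowledge bases are currently mostly built automatically from encyclopedia websites such as *** and Baidu Baike, and a knowledge base for the medical domain may be built automatically from encyclopedia websites together with specialized medical vertical websites. The problem of predicting the change frequency of an entity in the knowledge base can therefore be converted into estimating the change frequency of the entry page corresponding to that entity on the encyclopedia website, that is, into predicting the change frequency of web data (web pages and the like).
Next, the specified feature data corresponding to the specified entity is extracted from the initial data. The specified feature data may include four kinds of features: basic statistical features, user behavior features, semantic features, and dynamic features. The pre-trained prediction model is then called. The prediction model is generated by training a preset regression model on a pre-collected sample label data set. The sample label data set includes feature data related to entities, together with a change frequency label value for each entity; the change frequency label value is the entity's change frequency over one future preset period. Specifically, it may be the quotient of the number of times the entity changes within one future preset period and the length of that period, and it can be computed from the entry page corresponding to the entity (including the entity description page and the entity update history page). The length of the preset period is not specifically limited; it may, for example, be set to 5 days. The prediction model may be a pre-trained regression model, for example a linear regression model, an SVM regression model, a random forest regression model, or a regression model based on a deep-learning multilayer perceptron (MLP) network. Alternatively, several pre-trained regression models may be combined using an ensemble learning algorithm, and the resulting trained meta-model may be used as the prediction model.
After the prediction model has been called, the specified feature data is input into the prediction model, which performs prediction processing on it. Once the prediction model has finished processing the specified feature data, the output result corresponding to the specified entry page is obtained from the prediction model; this output result is the predicted value of the change frequency of the specified entry page over one future preset period. Finally, the output result is used as the predicted value of the change frequency of the specified entity. The predicted value of the change frequency of the specified entity is the quotient of the number of times the specified entity changes within one future preset period and the length of that period. In this way, the prediction of the change frequency of an entity in the knowledge base is tied to the entry page on the encyclopedia website, so the prediction for the entity can be obtained quickly and conveniently by computing the predicted change frequency of the entry page. For example, for the entity novel coronavirus pneumonia, once the required feature data has been obtained at the current time from the entry page corresponding to the entity and input into the prediction model, the model outputs the change frequency of that entry page over one future preset period, which is also the change frequency of the entity over that period.
In this application, when the predicted value of the change frequency of a specified entity in the knowledge base is needed, the initial data in the corresponding specified entry page is first obtained from the encyclopedia website. The specified feature data is then constructed from that initial data, a machine learning regression model is used to predict the change frequency of the entry page, and the predicted value output by the regression model for the entry page is used as the predicted value of the change frequency of the specified entity. Compared with the existing Poisson-based estimation scheme, using a machine learning regression model to predict the change frequency of entry pages makes it possible to predict the change frequency of entities in the knowledge base intelligently and conveniently, and effectively improves the accuracy of that prediction.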
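The label definition and the S1 to S6 flow described above can be sketched as follows. This is a minimal illustration under stated assumptions: `change_frequency_label`, `predict_change_frequency`, the injected callables, and the stand-in page data are all hypothetical names and values, and the period length of 5 days is the example value mentioned in the text.

```python
from datetime import datetime, timedelta
from typing import Callable, Dict, Sequence


def change_frequency_label(edit_times: Sequence[datetime], now: datetime,
                           period_days: int = 5) -> float:
    """Number of changes in the next preset period divided by the period length."""
    end = now + timedelta(days=period_days)
    changes = sum(1 for t in edit_times if now < t <= end)
    return changes / period_days


def predict_change_frequency(entity: str,
                             fetch_page: Callable[[str], Dict],
                             extract_features: Callable[[Dict], Sequence[float]],
                             model: Callable[[Sequence[float]], float]) -> float:
    initial_data = fetch_page(entity)           # S1: initial data of the entry page
    features = extract_features(initial_data)   # S2: specified feature data
    return model(features)                      # S3 to S6: call model, use its output


# Usage with stand-in data and a stand-in model (assumptions, not real measurements).
now = datetime(2020, 7, 1)
edits = [datetime(2020, 7, 2), datetime(2020, 7, 4), datetime(2020, 7, 10)]
label = change_frequency_label(edits, now)      # 2 changes within 5 days
predicted = predict_change_frequency(
    "example entity",
    fetch_page=lambda name: {"edit_count": 12, "view_count": 300},
    extract_features=lambda page: [page["edit_count"], page["view_count"]],
    model=lambda feats: 0.01 * feats[0] + 0.001 * feats[1],
)
```

Passing the page fetcher, feature extractor, and model in as callables keeps the sketch independent of any particular encyclopedia website or regression library.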
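Further, in an embodiment of this application, step S2 of extracting, from the initial data, the specified feature data corresponding to the specified entity includes:
S200: obtaining preset feature category information;
S201: extracting, from the initial data and according to the feature category information, the specified feature data corresponding to the feature category information.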
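As described in steps S200 to S201 above, the step of extracting the specified feature data corresponding to the specified entity from the initial data may specifically include the following. Preset feature category information is obtained first. The feature category information may include four categories: basic statistical feature information, user behavior feature information, semantic feature information, and dynamic feature information. The specified feature data corresponding to the feature category information is then extracted from the initial data according to that information. Based on these four categories, only the feature data for the basic statistical features, user behavior features, semantic features, and dynamic features of the specified entity is extracted from the initial data; no other data is extracted. Some features are taken directly from the specified entry page, while others are obtained by computing on the relevant data in the page. The four categories are described in detail below.
(1) The basic statistical features are obtained by simple computation on information in the entry page corresponding to the entity, and may include a first feature (time in existence), a second feature (page text size), and a third feature (entity description text size). The first feature is the interval between the current time and the creation time of the entry. The creation time of the specified entity can be obtained from the specified entity update history page; subtracting it from the current time gives the first feature, that is, the time the specified entity has existed. The change frequency of an entity is relatively high at the beginning and should gradually decrease over time. The second feature is the length of all text in the entry page; richer text suggests that the entity is popular and prone to change. That is, the second feature is the total number of characters on the specified entity description page, including both the descriptive text related to the specified entity and text unrelated to the entity, such as reference links. The third feature counts only the length of the body text that describes the entity, excluding text such as external links and advertisements in the entry page. That is, the third feature is the number of characters of entity-related descriptive text on the specified entity description page.
(2) The user behavior features are directly related to user behavior, and may include a fourth feature (number of user edits) and a fifth feature (number of user views). The fourth feature is the total number of edits; intuitively, the more often an entity has been edited, the more likely it is to be edited or changed again. The number of user edits associated with the specified entity can be read directly from the specified entity description page. The fifth feature is the number of visits to the entry page corresponding to the entity; intuitively, more visits indicate a more popular entity that is more likely to change. Similarly, the number of user views associated with the specified entity can be read directly from the specified entity description page.
(3) The semantic features are information, derived from hyperlink information, that is directly semantically related to the entity, and may include a sixth feature (number of hyperlinks) and a seventh feature (number of hyperlinks to entities). The sixth feature is the number of all hyperlinks in the entry page; some of them link to other entities, and the rest link to external reference information. If the content behind these links changes, the change is likely to propagate to the entity. That is, the sixth feature is the number of all hyperlinks on the specified entity description page, including both hyperlinks to entities and hyperlinks to external references. The seventh feature counts only the hyperlinks that link to entities. Linked entities are generally semantically related, so changes in them have a more direct effect. That is, the seventh feature is the number of hyperlinks on the specified entity description page that link to entities.
(4) The dynamic features are dynamic information about the entity obtained from the change history of the entry, and may include an eighth feature (historical change frequency), a ninth feature (number of changes in the most recent preset period), and a tenth feature (number of changes in the most recent four preset periods). The eighth feature is the historical change frequency of the page; the change frequency of an entry is generally strongly correlated with its historical change frequency. It can be obtained by taking the historical number of user edits from the specified entity description page and dividing it by the time the specified entity has existed. The ninth feature reflects the time series (TS) view that the number of changes in the next preset period is strongly correlated with the past change history. It is the number of changes in the preset period closest to the current time, and can be computed directly from the specified entity update history page. The length of the preset period is not specifically limited; it may, for example, be set to 5 days. The tenth feature is the number of changes to the entry page in the most recent four preset periods, and can likewise be computed directly from the specified entity update history page.
In this embodiment, the required target data, that is, the specified feature data corresponding to the feature category information, can be extracted quickly and accurately from the initial data according to the feature category information. The specified feature data can then be input into the preset prediction model, which quickly and accurately performs prediction processing on it and outputs the predicted value of the change frequency corresponding to the specified entry page, so that the change frequency of the entities in the knowledge base can be estimated from the predicted change frequency of the entry page. Further, before the step of extracting the specified feature data corresponding to the feature category information from the initial data, the initial data may first be cleaned to remove noise and useless data, which reduces the amount of data processed during feature extraction and improves its efficiency.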
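The ten features listed above could be computed roughly as in the following sketch. The `EntryPage` record, its field names, and the sample values are hypothetical; the sketch assumes the page data has already been fetched and parsed, and uses the example period length of 5 days.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


@dataclass
class EntryPage:
    """Hypothetical parsed entry page (description page plus update history)."""
    created_at: datetime
    page_text: str                    # all text on the description page
    entity_text: str                  # body text that describes the entity
    edit_count: int
    view_count: int
    links: List[Tuple[str, bool]]     # (target, True if it links to an entity)
    edit_times: List[datetime]


def count_recent(edit_times: List[datetime], now: datetime, days: int) -> int:
    """Number of edits in the window (now - days, now]."""
    start = now - timedelta(days=days)
    return sum(1 for t in edit_times if start < t <= now)


def extract_features(page: EntryPage, now: datetime,
                     period_days: int = 5) -> Dict[str, float]:
    age_days = (now - page.created_at).total_seconds() / 86400
    return {
        "age_days": age_days,                                   # feature 1
        "page_text_len": len(page.page_text),                   # feature 2
        "entity_text_len": len(page.entity_text),               # feature 3
        "edit_count": page.edit_count,                          # feature 4
        "view_count": page.view_count,                          # feature 5
        "link_count": len(page.links),                          # feature 6
        "entity_link_count": sum(1 for _, to_entity in page.links if to_entity),  # 7
        "historical_frequency": page.edit_count / age_days if age_days > 0 else 0.0,  # 8
        "changes_last_period": count_recent(page.edit_times, now, period_days),       # 9
        "changes_last_4_periods": count_recent(page.edit_times, now, 4 * period_days),  # 10
    }


now = datetime(2020, 1, 21)
page = EntryPage(
    created_at=datetime(2020, 1, 1),
    page_text="entity description plus reference links",
    entity_text="entity description",
    edit_count=4,
    view_count=1000,
    links=[("other entity", True), ("external reference", False)],
    edit_times=[datetime(2020, 1, 2), datetime(2020, 1, 10),
                datetime(2020, 1, 17), datetime(2020, 1, 19)],
)
features = extract_features(page, now)
```

Returning a dictionary keyed by feature name keeps the mapping to the first through tenth features explicit; a real pipeline would likely convert it to a fixed-order vector before passing it to the model.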
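Further, in an embodiment of this application, before step S3 of calling the pre-trained prediction model, the method includes:
S300: obtaining a pre-trained regression model;
S301: using the pre-trained regression model as the prediction model.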
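As described in steps S300 to S301 above, before the pre-trained prediction model is called, the prediction model must first be determined. Specifically, before the step of calling the pre-trained prediction model, the method includes: first obtaining a pre-trained regression model. The regression model is not specifically limited; it may be a linear regression model, an SVM regression model, a random forest regression model, or a regression model based on a deep-learning multilayer perceptron (MLP) network. The pre-trained regression model is then used as the prediction model. In this embodiment, four regression models may be trained in advance (a linear regression model, an SVM regression model, a random forest regression model, and an MLP-based regression model), and any one of them may be selected as the prediction model. For example, a regression model may be selected according to the user's preference, or the device may select one by itself. Preferably, the trained MLP-based regression model is used as the prediction model. Deep learning builds structures resembling the neurons of the brain and learns deeper representations of data through the connections between neurons. In theory, a deep learning model can fit any continuous function, which makes it well suited to regression problems. Several multilayer fully connected neural network (MLP) models with different numbers of hidden layers were trained in advance, and the MLP-based regression models with 2 to 5 hidden layers were compared experimentally. The model with two hidden layers did not perform well; the models with 3, 4, and 5 hidden layers all performed well and roughly equally. Following Occam's razor, the trained MLP-based regression model with 3 hidden layers can therefore be selected as the prediction model. By using a pre-trained regression model as the prediction model, this embodiment makes it possible to subsequently input the specified feature data into the prediction model, which quickly and accurately performs prediction processing on it and outputs the predicted value of the change frequency corresponding to the specified entry page. The change frequency of the entities in the knowledge base can then be estimated from the predicted change frequency of the entry page, which effectively improves the accuracy of predicting the change frequency of the entities in the knowledge base.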
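The Occam's razor selection among hidden-layer counts can be expressed as "choose the smallest depth whose score is within a tolerance of the best score". The sketch below is one possible reading; the function name, the tolerance, and the scores are made up for illustration and are not results reported in this application.

```python
from typing import Dict


def pick_simplest(scores: Dict[int, float], tolerance: float) -> int:
    """Return the smallest hidden-layer count scoring within `tolerance` of the best."""
    best = max(scores.values())
    return min(layers for layers, score in scores.items() if best - score <= tolerance)


# Hypothetical validation scores keyed by number of hidden layers:
# 2 layers is clearly worse; 3, 4, and 5 layers are roughly equal.
validation_scores = {2: 0.71, 3: 0.90, 4: 0.91, 5: 0.90}
chosen_layers = pick_simplest(validation_scores, tolerance=0.02)
```

With these illustrative numbers the 3-layer model is chosen, matching the reasoning in the text that the simplest of the comparable models is preferred.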
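Further, in an embodiment of this application, before step S3 of calling the pre-trained prediction model, the method includes:
S310: collecting a first specified number of pieces of entry page information from the encyclopedia website;
S311: constructing a sample label data set from the entry page information according to a preset feature construction rule, where the sample label data set includes feature data related to entities and change frequency label values corresponding to the entities;
S312: dividing the label data set into a training data set and a test data set;
S313: training a preset regression model on the training data set using stochastic gradient descent to generate a trained first initial model;
S314: verifying the trained first initial model using the test data set, and determining whether the verification is passed;
S315: if the verification is passed, using the trained first initial model as the prediction model;
S316: storing the prediction model in a blockchain network.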
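As described in steps S310 to S316 above, before the pre-trained prediction model is called, the prediction model must first be created. Specifically, before the step of obtaining the pre-trained regression model, the method may further include the following. A first specified number of pieces of entry page information is first collected from the encyclopedia website. The first specified number is not specifically limited and can be set according to the actual situation, for example according to the number of entities in the knowledge base. If the preset knowledge base contains 200,000 entities, the first specified number may be set to 200,000, and the corresponding number of pieces of entry page information is collected from the encyclopedia website. A label data set is then constructed from the entry page information according to a preset feature construction rule. The feature construction rule builds, from the entry page information and according to the preset feature category information, a sample label data set corresponding to the feature category information of the entities. The sample label data set includes feature data related to entities, together with a change frequency label value for each entity; the change frequency label value is the entity's change frequency over one future preset period. Specifically, it may be the quotient of the number of times the entity changes within one future preset period and the length of that period, and it can be computed from the entry page corresponding to the entity (including the entity description page and the entity update history page). The label data set is then divided into a training data set and a test data set: a preset proportion of the data may be randomly drawn from the label data set as the training data set, with the remaining data used as the test data set. The preset proportion is not specifically limited and can be set according to actual needs; for example, it may be set to 80%, so that 80% of the data is randomly drawn from the label data set as the training data set and the remaining 20% is used as the test data set. Once the training data set and the test data set have been obtained, the preset regression model is trained on the training data set using stochastic gradient descent to generate the trained first initial model. The type of the preset initial model is not specifically limited; it may be a linear regression model, an SVM regression model, a random forest regression model, or an MLP-based regression model. The procedure for training a regression model with stochastic gradient descent can follow existing training procedures and is not repeated here. Finally, once the trained first initial model has been generated, it is verified using the test data set, and it is determined whether the verification is passed. If the verification is passed, the trained first initial model is used as the prediction model, so that the prediction model can be used directly to accurately output the predicted value of the change frequency corresponding to the specified entry page. The change frequency of the entities in the knowledge base can then be estimated from the predicted change frequency of the entry page, which effectively improves the accuracy of predicting the change frequency of the entities in the knowledge base. Further, after the step of using the trained first initial model as the prediction model if the verification is passed, the method may further include: storing the prediction model in a blockchain network. Using the blockchain to store and manage the prediction model generated by training effectively ensures the security and tamper resistance of the prediction model.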
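The split and training steps (S312 and S313) can be sketched with the standard library alone. The example below assumes a single input feature and a linear regression model to keep it short; the function names, learning rate, epoch count, and the synthetic noise-free data are all illustrative assumptions, not the configuration used in this application.

```python
import random
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (feature value, change frequency label)


def split_dataset(samples: Sequence[Sample], train_ratio: float = 0.8,
                  seed: int = 0) -> Tuple[List[Sample], List[Sample]]:
    """Randomly draw `train_ratio` of the samples for training; the rest is for testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def train_linear_sgd(train_set: Sequence[Sample], lr: float = 0.1,
                     epochs: int = 300, seed: int = 0) -> Tuple[float, float]:
    """Fit y = w * x + b by stochastic gradient descent on the squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(train_set)
    for _ in range(epochs):
        rng.shuffle(order)
        for x, y in order:
            err = (w * x + b) - y      # prediction error for one sample
            w -= lr * err * x          # gradient step for the weight
            b -= lr * err              # gradient step for the bias
    return w, b


# Synthetic, noise-free data generated from y = 2x + 1 (illustrative only).
data = [(i / 100, 2.0 * (i / 100) + 1.0) for i in range(100)]
train_set, test_set = split_dataset(data)
w, b = train_linear_sgd(train_set)
```

On this synthetic data the fitted `w` and `b` approach 2 and 1; with real entry-page features the same loop would run over feature vectors rather than a single value.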
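In an embodiment of this application, step S314 of verifying the trained first initial model using the test data set and determining whether the verification is passed includes:
S3140: inputting each test sample in the test data set into the trained first initial model to obtain the test result of each test sample;
S3141: obtaining the accuracy of the trained first initial model from the test results of the test samples;
S3142: determining whether the accuracy is greater than a preset accuracy threshold;
S3143: if it is determined that the accuracy is greater than the preset accuracy threshold, determining that the verification is passed;
S3144: if it is determined that the accuracy is not greater than the preset accuracy threshold, determining that the verification fails.
As described in steps S3140 to S3144 above, the step of verifying the trained first initial model using the test data set and determining whether the verification is passed may specifically include the following. After the trained first initial model is obtained, each test sample in the test data set is input into the trained first initial model to obtain the test result of each test sample. The accuracy of the trained first initial model is then obtained from the test results of the test samples. Finally, once the accuracy has been obtained, it is determined whether the accuracy is greater than the preset accuracy threshold. The value of the accuracy threshold is not specifically limited and can be set according to actual needs, for example to 0.9. If it is determined that the accuracy is not greater than the preset accuracy threshold, the verification is determined to have failed. If it is determined that the accuracy is greater than the preset accuracy threshold, the verification is determined to have passed, so that the prediction model can be used directly to accurately output the predicted value of the change frequency corresponding to the specified entry page. The change frequency of the entities in the knowledge base can then be estimated from the predicted change frequency of the entry page, which effectively improves the accuracy of predicting the change frequency of the entities in the knowledge base.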
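The verification in steps S3140 to S3144 can be sketched as below. The text does not say how a regression output is judged correct, so this sketch assumes a prediction counts as correct when it lies within a tolerance of the label; that tolerance, the function names, and the sample data are assumptions made for illustration.

```python
from typing import Callable, Sequence, Tuple

Sample = Tuple[float, float]  # (feature value, change frequency label)


def accuracy(model: Callable[[float], float], test_set: Sequence[Sample],
             tolerance: float = 0.1) -> float:
    """Share of test samples whose prediction is within `tolerance` of the label.

    The tolerance criterion is an assumption; the source only speaks of
    test results being right or wrong.
    """
    correct = sum(1 for x, y in test_set if abs(model(x) - y) <= tolerance)
    return correct / len(test_set)


def passes_verification(model: Callable[[float], float], test_set: Sequence[Sample],
                        threshold: float = 0.9, tolerance: float = 0.1) -> bool:
    """Verification passes only if the accuracy is strictly greater than the threshold."""
    return accuracy(model, test_set, tolerance) > threshold


model = lambda x: 2.0 * x   # stand-in for the trained first initial model
test_set = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.05), (4.0, 9.0)]
acc = accuracy(model, test_set)            # the last sample is off by 1.0
passed = passes_verification(model, test_set)
```

The strict comparison mirrors the wording "greater than the preset accuracy threshold": an accuracy exactly equal to the threshold does not pass.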
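In an embodiment of this application, after step S3142 of determining whether the accuracy is greater than the preset accuracy threshold, the method includes:
S31420: if it is determined that the accuracy is not greater than the preset accuracy threshold, screening out the specified test samples in the test sample set whose test results are wrong;
S31421: adding the specified test samples to the training sample set to generate an updated training sample set;
S31422: training the regression model according to the updated training sample set to generate a trained second initial model;
S31423: using the trained second initial model as the prediction model.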
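As described in steps S31420 to S31423 above, when determining whether the accuracy of the trained first initial model is greater than the preset accuracy threshold, the accuracy may turn out not to be greater than the threshold; the regression model then needs to be retrained to generate a prediction model whose accuracy meets the standard. Specifically, after the step of determining whether the accuracy is greater than the preset accuracy threshold, the method includes: if it is determined that the accuracy is not greater than the preset accuracy threshold, screening out the specified test samples in the test sample set whose test results are wrong. After the specified test samples are obtained, they are added to the training sample set to generate an updated training sample set. The regression model is then trained according to the updated training sample set to generate a trained second initial model. Finally, the trained second initial model is used as the prediction model. The process of training the initial model according to the updated training sample set to generate a trained second initial model, and using the trained second initial model as the prediction model, may specifically include iteratively performing the following steps until the accuracy of the trained second initial model is greater than the accuracy threshold: training the regression model according to the updated training sample set to generate a trained second initial model; using the training sample set to test the trained second initial model, and determining whether its accuracy is greater than the accuracy threshold; if it is not, updating the updated training sample set again according to the test samples in the test sample set whose test results are wrong; and, after the iteration ends, using the trained second initial model generated in the last round of iteration as the prediction model. In this embodiment, when the accuracy of the trained first initial model is not greater than the preset accuracy threshold, the regression model is retrained using the updated training samples to generate a prediction model whose accuracy is greater than the accuracy threshold. The prediction model can then be used to accurately output the predicted value of the change frequency corresponding to the specified entry page, and the change frequency of the entities in the knowledge base can be estimated from the predicted change frequency of the entry page, which effectively improves the accuracy of predicting the change frequency of the entities in the knowledge base.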
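The retraining loop in steps S31420 to S31423 might look like the following sketch. Several parts are assumptions: the tolerance-based notion of a wrong test result, the `max_rounds` guard that guarantees termination, and the toy `train_lookup` trainer, which simply memorizes its samples and stands in for training the regression model.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, float]
Model = Callable[[float], float]


def retrain_until_accurate(train: Callable[[Sequence[Sample]], Model],
                           training_set: Sequence[Sample],
                           test_set: Sequence[Sample],
                           threshold: float = 0.9,
                           tolerance: float = 0.1,
                           max_rounds: int = 10) -> Tuple[Model, int]:
    """Retrain with wrongly predicted test samples until accuracy exceeds the threshold."""
    samples: List[Sample] = list(training_set)
    model = train(samples)
    for round_no in range(1, max_rounds + 1):
        # Screen out the test samples whose results are wrong (assumed criterion).
        wrong = [(x, y) for x, y in test_set if abs(model(x) - y) > tolerance]
        acc = 1 - len(wrong) / len(test_set)
        if acc > threshold:
            return model, round_no
        samples.extend(wrong)          # updated training sample set
        model = train(samples)         # trained second initial model
    raise RuntimeError("accuracy threshold not reached within max_rounds")


def train_lookup(samples: Sequence[Sample]) -> Model:
    """Toy trainer (hypothetical): memorizes labels, predicts 0.0 for unseen inputs."""
    table = dict(samples)
    return lambda x: table.get(x, 0.0)


final_model, rounds = retrain_until_accurate(
    train_lookup,
    training_set=[(1.0, 1.0)],
    test_set=[(1.0, 1.0), (2.0, 2.0)],
)
```

In this toy run the first model gets one of two test samples wrong, the wrong sample is added to the training set, and the second model passes, so `rounds` is 2.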
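Further, in an embodiment of the present application, before step S3 of invoking the pre-trained prediction model, the method includes:
S320: obtaining a second specified number of pre-trained sub-learners;
S321: training a preset meta-model with all of the sub-learners according to a preset ensemble learning algorithm to generate a trained meta-model;
S322: using the trained meta-model as the prediction model.
As described in steps S320 to S322, besides using a pre-trained regression model as the prediction model, several regression models may be combined with an ensemble learning algorithm, and a trained meta-model may be used as the prediction model. Specifically, before the step of invoking the pre-trained prediction model, the method may further include the following. First, a second specified number of pre-trained sub-learners is obtained. The second specified number is not specifically limited and may be set according to actual needs; it is preferably four. Each sub-learner may be one of the regression models described above. Then, according to the preset ensemble learning algorithm, the four sub-learners are used to train the preset meta-model, generating a trained meta-model. The trained meta-model is then used as the prediction model, so that the specified feature data can subsequently be input into it and the predicted change frequency value for the specified entry page can be output quickly and accurately.
When the second specified number is four, the pre-trained sub-learners are four pre-trained regression models: a linear regression model, an SVM regression model, a random forest regression model, and an MLP-based regression model. To make combined use of these four regressors, the Stacking ensemble learning technique is applied: a meta-model (meta-regressor) is trained to combine the sub-learners, with the outputs of the sub-learners serving as the input of the meta-model, and the trained meta-model is finally used as the prediction model. The preset meta-model may be a GBDT (Gradient Boosting Decision Tree) model. Specifically, the collected entity-related feature data $x$ is input into each of the four regression models, whose outputs are:
$y_l = W_l x$, where $W_l$ is the model parameter of the linear regression model and $y_l$ is its output;
$y_{svm} = \mathrm{SVM}(x)$, where $y_{svm}$ is the output of the SVM regression model;
$y_{rf} = \mathrm{RandomForest}(x)$, where $y_{rf}$ is the output of the random forest regression model;
$y_{mlp} = W_3 (W_2 (W_1 x))$, where $W_1$, $W_2$, $W_3$ are the parameters of the three hidden layers of the MLP-based regression model and $y_{mlp}$ is its output.
Concatenating the outputs of the four regression models yields a four-dimensional vector $[y_l, y_{svm}, y_{rf}, y_{mlp}]$. This vector is used as the input of the GBDT meta-model, which is trained with the stochastic gradient descent algorithm and the ensemble learning algorithm to obtain the trained meta-model. On receiving the feature vector of an entity, the trained meta-model outputs $y = \mathrm{GBDT}([y_l, y_{svm}, y_{rf}, y_{mlp}])$.
The output $y$ of the GBDT meta-model is the predicted change frequency value for the entity. In this embodiment, an ensemble learning algorithm combines several regression models into a trained meta-model, and the trained meta-model is used as the prediction model to perform prediction processing on the specified feature data of the specified entity, which improves the prediction performance of the model.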
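The stacking scheme above can be sketched in plain Python. The sketch makes two simplifications that are assumptions of the example, not of the application: the pre-trained sub-learners are passed in as arbitrary callables, and the meta-model is a linear regressor fitted by stochastic gradient descent rather than the gradient-boosted decision tree model named in the text. What it does show is the data flow: the sub-learner outputs are concatenated into one vector, and the meta-model is trained on that vector.

```python
import random


def stack_features(base_learners, x):
    """Concatenate the sub-learners' outputs into the meta-model's input vector."""
    return [learner(x) for learner in base_learners]


def train_meta_model(base_learners, samples, lr=0.05, epochs=500, seed=0):
    """Train a linear meta-regressor on the stacked outputs of the sub-learners.

    A linear model stands in for the tree-based meta-model of the text so that
    the example needs no third-party library.
    """
    rng = random.Random(seed)
    meta_inputs = [(stack_features(base_learners, x), y) for x, y in samples]
    w = [0.0] * len(base_learners)
    b = 0.0
    for _ in range(epochs):
        rng.shuffle(meta_inputs)
        for z, y in meta_inputs:
            err = sum(wi * zi for wi, zi in zip(w, z)) + b - y
            w = [wi - lr * err * zi for wi, zi in zip(w, z)]
            b -= lr * err

    def predict(x):
        z = stack_features(base_learners, x)
        return sum(wi * zi for wi, zi in zip(w, z)) + b

    return predict
```

With four sub-learners, `stack_features` returns the four-dimensional vector described in the text, and the callable returned by `train_meta_model` plays the role of the prediction model.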
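Further, in an embodiment of the present application, step S1 of obtaining, from the encyclopedia website, the initial data in the specified entry page corresponding to the specified entity includes:
S100: invoking a data query interface corresponding to the encyclopedia website;
S101: obtaining, through the data query interface, the initial data in the specified entry page corresponding to the specified entity.
As described in steps S100 to S101, the step of obtaining the initial data may specifically include first invoking the data query interface corresponding to the encyclopedia website and then, once the interface has been invoked, obtaining through it the initial data in the specified entry page corresponding to the specified entity. The initial data refers to all the data contained in the specified entry page on the encyclopedia website. It includes at least the descriptive text of the specified entity, the basic information table, user-related statistics (such as the number of edits and the number of page views), the time and reason of every change to the entry corresponding to the specified entity, the change history, hyperlink information, and so on. The hyperlink information implies relationships between entities (for example, semantic relationships). A specified entry page contains many hyperlinks: some link to other entities on the encyclopedia website that differ from the specified entity, and the rest link to external reference information for the specified entity. In this embodiment, invoking the data query interface corresponding to the encyclopedia website to obtain the initial data in the specified entry page makes it possible to subsequently extract the specified feature data for the specified entity from that initial data quickly and conveniently.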
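As an illustration of steps S100 and S101, the sketch below builds a request for an entry's change history. The text does not name the encyclopedia website or its interface, so the example assumes a site that exposes the MediaWiki Action API; the endpoint URL, the function name, and the choice of returned fields are hypothetical.

```python
from urllib.parse import urlencode


def build_history_query(api_url, title, limit=50):
    """Build a MediaWiki-style query URL for an entry's revision history.

    Assumes the encyclopedia offers the MediaWiki Action API; the text only
    refers to "a data query interface".
    """
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp|comment",  # time and stated reason of each change
        "rvlimit": limit,
        "format": "json",
    }
    return f"{api_url}?{urlencode(params)}"
```

The timestamps returned by such a query are what the change frequency label and the change-history features would be computed from.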
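Referring to FIG. 2, an embodiment of the present application further provides a model-based apparatus for predicting data change frequency, including:
a first acquisition module 1, configured to obtain, from an encyclopedia website, initial data in a specified entry page corresponding to a specified entity, wherein the specified entity is any entity in a preset knowledge base;
an extraction module 2, configured to extract, from the initial data, specified feature data corresponding to the specified entity;
an invoking module 3, configured to invoke a pre-trained prediction model, wherein the prediction model is generated by training a preset regression model on a pre-collected sample label dataset;
a prediction module 4, configured to input the specified feature data into the prediction model so that the prediction model performs prediction processing on the specified feature data;
a second acquisition module 5, configured to obtain the output result, corresponding to the specified entry page, that is output by the prediction model;
a first determination module 6, configured to use the output result as the predicted change frequency value of the specified entity.
In this embodiment, for how the functions of the first acquisition module, the extraction module, the invoking module, the prediction module, the second acquisition module, and the first determination module are implemented, see the implementation of the corresponding steps S1 to S6 of the model-based method for predicting data change frequency above; details are not repeated here.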
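The way the modules chain together can be sketched as a small class. The class and parameter names are hypothetical, and each module is reduced to a callable supplied by the caller; the sketch only shows the order of steps S1 to S6, not how any module is implemented.

```python
class ChangeFrequencyPredictor:
    """Wires the acquisition, extraction, and prediction modules together."""

    def __init__(self, fetch_page, extract_features, model):
        self.fetch_page = fetch_page              # first acquisition module (S1)
        self.extract_features = extract_features  # extraction module (S2)
        self.model = model                        # pre-trained prediction model (S3)

    def predict(self, entity):
        initial_data = self.fetch_page(entity)
        features = self.extract_features(initial_data)
        output = self.model(features)             # prediction and output (S4, S5)
        return output                             # predicted change frequency (S6)
```

Keeping the three callables separate mirrors the module split of the apparatus: the prediction model can be swapped, for example for the stacked meta-model, without touching data acquisition or feature extraction.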
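Further, in an embodiment of the present application, the extraction module includes:
a second acquisition submodule, configured to obtain preset feature category information;
an extraction submodule, configured to extract, from the initial data according to the feature category information, the specified feature data corresponding to the feature category information.
In this embodiment, for how the functions of the second acquisition submodule and the extraction submodule are implemented, see the implementation of the corresponding steps S200 to S201 of the method above; details are not repeated here.
Further, in an embodiment of the present application, the model-based apparatus for predicting data change frequency includes:
a third acquisition module, configured to obtain a pre-trained regression model;
a second determination module, configured to use the pre-trained regression model as the prediction model.
In this embodiment, for how the functions of the third acquisition module and the second determination module are implemented, see the implementation of the corresponding steps S300 to S301 of the method above; details are not repeated here.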
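Further, in an embodiment of the present application, the model-based apparatus for predicting data change frequency includes:
a collection module, configured to collect a first specified number of entry page information items from the encyclopedia website;
a construction module, configured to construct a sample label dataset from the entry page information according to a preset feature construction rule, wherein the sample label dataset includes feature data related to each entity and a change frequency label value corresponding to each entity;
a division module, configured to divide the sample label dataset into a training dataset and a test dataset;
a first training module, configured to train the preset regression model on the training dataset using stochastic gradient descent to generate a trained first initial model;
a verification module, configured to verify the trained first initial model with the test dataset and determine whether verification passes;
a third determination module, configured to use the trained first initial model as the prediction model if verification passes;
a storage module, configured to store the prediction model in a blockchain network.
In this embodiment, for how the functions of the collection module, the construction module, the division module, the first training module, the verification module, the third determination module, and the storage module are implemented, see the implementation of the corresponding steps S310 to S316 of the method above; details are not repeated here.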
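Further, in an embodiment of the present application, the verification module includes:
an input submodule, configured to input each test sample in the test dataset into the trained first initial model to obtain a test result for each test sample;
a third acquisition submodule, configured to obtain the accuracy of the trained first initial model from the test results of the test samples;
a judgment submodule, configured to determine whether the accuracy is greater than a preset accuracy threshold;
a first determination submodule, configured to determine that verification passes if the accuracy is greater than the preset accuracy threshold;
a second determination submodule, configured to determine that verification fails if the accuracy is not greater than the preset accuracy threshold.
In this embodiment, for how the functions of the input submodule, the third acquisition submodule, the judgment submodule, the first determination submodule, and the second determination submodule are implemented, see the implementation of the corresponding steps S3140 to S3144 of the method above; details are not repeated here.
Further, in an embodiment of the present application, the verification module includes:
a screening submodule, configured to select from the test dataset, if the accuracy is not greater than the preset accuracy threshold, the specified test samples whose test results are wrong;
a generation submodule, configured to add the specified test samples to the training dataset to generate an updated training dataset;
a training submodule, configured to train the regression model on the updated training dataset to generate a trained second initial model;
a third determination submodule, configured to use the trained second initial model as the prediction model.
In this embodiment, for how the functions of the screening submodule, the generation submodule, the training submodule, and the third determination submodule are implemented, see the implementation of the corresponding steps S31420 to S31423 of the method above; details are not repeated here.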
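Further, in an embodiment of the present application, the model-based apparatus for predicting data change frequency includes:
a fourth acquisition module, configured to obtain a second specified number of pre-trained sub-learners;
a second training module, configured to train a preset meta-model with all of the sub-learners according to a preset ensemble learning algorithm to generate a trained meta-model;
a fourth determination module, configured to use the trained meta-model as the prediction model.
In this embodiment, for how the functions of the fourth acquisition module, the second training module, and the fourth determination module are implemented, see the implementation of the corresponding steps S320 to S322 of the method above; details are not repeated here.
Further, in an embodiment of the present application, the first acquisition module includes:
an invoking submodule, configured to invoke a data query interface corresponding to the encyclopedia website;
a first acquisition submodule, configured to obtain, through the data query interface, the initial data in the specified entry page corresponding to the specified entity.
In this embodiment, for how the functions of the invoking submodule and the first acquisition submodule are implemented, see the implementation of the corresponding steps S100 to S101 of the method above; details are not repeated here.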
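Referring to FIG. 3, an embodiment of the present application further provides a computer device, which may be a server and whose internal structure may be as shown in FIG. 3. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database stores data such as the specified entity, the initial data in the specified entry page, the specified feature data, and the predicted change frequency value. The network interface communicates with external terminals over a network connection. When executed by the processor, the computer program implements the model-based method for predicting data change frequency shown in any of the above exemplary embodiments.
The processor executes the following steps of the model-based method for predicting data change frequency:
obtaining, from an encyclopedia website, initial data in a specified entry page corresponding to a specified entity, wherein the specified entity is any entity in a preset knowledge base;
extracting, from the initial data, specified feature data corresponding to the specified entity;
invoking a pre-trained prediction model, wherein the prediction model is generated by training a preset regression model on a pre-collected sample label dataset;
inputting the specified feature data into the prediction model so that the prediction model performs prediction processing on the specified feature data;
obtaining the output result, corresponding to the specified entry page, that is output by the prediction model;
using the output result as the predicted change frequency value of the specified entity.
Those skilled in the art will understand that the structure shown in FIG. 3 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the apparatus or computer device to which the solution is applied.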
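An embodiment of the present application further provides a computer-readable storage medium, which may be non-volatile or volatile and on which a computer program is stored. When executed by a processor, the computer program implements the model-based method for predicting data change frequency shown in any of the above exemplary embodiments, the method including the following steps:
obtaining, from an encyclopedia website, initial data in a specified entry page corresponding to a specified entity, wherein the specified entity is any entity in a preset knowledge base;
extracting, from the initial data, specified feature data corresponding to the specified entity;
invoking a pre-trained prediction model, wherein the prediction model is generated by training a preset regression model on a pre-collected sample label dataset;
inputting the specified feature data into the prediction model so that the prediction model performs prediction processing on the specified feature data;
obtaining the output result, corresponding to the specified entry page, that is output by the prediction model;
using the output result as the predicted change frequency value of the specified entity.
In summary, the model-based method, apparatus, computer device, and storage medium for predicting data change frequency provided in the embodiments of the present application establish a correspondence between entities in the knowledge base and entry pages on the encyclopedia website, construct feature data from the initial data of the entry pages, and use a machine learning regression model to predict the change frequency of those entry pages. The change frequency of entities in the knowledge base can thus be predicted conveniently from the predicted change frequency of their entry pages, which improves the accuracy of predicting the change frequency of entities in the knowledge base.
Those of ordinary skill in the art will understand that all or part of the processes of the above method embodiments can be carried out by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database, or another medium provided in the present application and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The above are only preferred embodiments of the present application and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of the present application.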

Claims (20)

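  1. A model-based method for predicting data change frequency, comprising:
    obtaining, from an encyclopedia website, initial data in a specified entry page corresponding to a specified entity, wherein the specified entity is any entity in a preset knowledge base;
    extracting, from the initial data, specified feature data corresponding to the specified entity;
    invoking a pre-trained prediction model, wherein the prediction model is generated by training a preset regression model on a pre-collected sample label dataset;
    inputting the specified feature data into the prediction model so that the prediction model performs prediction processing on the specified feature data;
    obtaining an output result, corresponding to the specified entry page, that is output by the prediction model;
    using the output result as a predicted change frequency value of the specified entity.
  2. The model-based method for predicting data change frequency according to claim 1, wherein the step of extracting, from the initial data, the specified feature data corresponding to the specified entity comprises:
    obtaining preset feature category information;
    extracting, from the initial data according to the feature category information, the specified feature data corresponding to the feature category information.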
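  3. The model-based method for predicting data change frequency according to claim 1, wherein before the step of invoking the pre-trained prediction model, the method comprises:
    collecting a first specified number of entry page information items from the encyclopedia website;
    constructing a sample label dataset from the entry page information according to a preset feature construction rule, wherein the sample label dataset includes feature data related to each entity and a change frequency label value corresponding to each entity;
    dividing the sample label dataset into a training dataset and a test dataset;
    training the preset regression model on the training dataset using stochastic gradient descent to generate a trained first initial model;
    verifying the trained first initial model with the test dataset and determining whether verification passes;
    if verification passes, using the trained first initial model as the prediction model;
    storing the prediction model in a blockchain network.
  4. The model-based method for predicting data change frequency according to claim 3, wherein the step of verifying the trained first initial model with the test dataset and determining whether verification passes comprises:
    inputting each test sample in the test dataset into the trained first initial model to obtain a test result for each test sample;
    obtaining the accuracy of the trained first initial model from the test results of the test samples;
    determining whether the accuracy is greater than a preset accuracy threshold;
    if the accuracy is greater than the preset accuracy threshold, determining that verification passes;
    if the accuracy is not greater than the preset accuracy threshold, determining that verification fails.
  5. The model-based method for predicting data change frequency according to claim 4, wherein after the step of determining whether the accuracy is greater than the preset accuracy threshold, the method comprises:
    if the accuracy is not greater than the preset accuracy threshold, selecting from the test dataset the specified test samples whose test results are wrong;
    adding the specified test samples to the training dataset to generate an updated training dataset;
    training the regression model on the updated training dataset to generate a trained second initial model;
    using the trained second initial model as the prediction model.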
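  6. The model-based method for predicting data change frequency according to claim 1, wherein before the step of invoking the pre-trained prediction model, the method comprises:
    obtaining a second specified number of pre-trained sub-learners;
    training a preset meta-model with all of the sub-learners according to a preset ensemble learning algorithm to generate a trained meta-model;
    using the trained meta-model as the prediction model.
  7. The model-based method for predicting data change frequency according to claim 1, wherein the step of obtaining, from the encyclopedia website, the initial data in the specified entry page corresponding to the specified entity comprises:
    invoking a data query interface corresponding to the encyclopedia website;
    obtaining, through the data query interface, the initial data in the specified entry page corresponding to the specified entity.
  8. A model-based apparatus for predicting data change frequency, comprising:
    a first acquisition module, configured to obtain, from an encyclopedia website, initial data in a specified entry page corresponding to a specified entity, wherein the specified entity is any entity in a preset knowledge base;
    an extraction module, configured to extract, from the initial data, specified feature data corresponding to the specified entity;
    an invoking module, configured to invoke a pre-trained prediction model, wherein the prediction model is generated by training a preset regression model on a pre-collected sample label dataset;
    a prediction module, configured to input the specified feature data into the prediction model so that the prediction model performs prediction processing on the specified feature data;
    a second acquisition module, configured to obtain an output result, corresponding to the specified entry page, that is output by the prediction model;
    a first determination module, configured to use the output result as a predicted change frequency value of the specified entity.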
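  9. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements a model-based method for predicting data change frequency,
    wherein the model-based method for predicting data change frequency comprises:
    obtaining, from an encyclopedia website, initial data in a specified entry page corresponding to a specified entity, wherein the specified entity is any entity in a preset knowledge base;
    extracting, from the initial data, specified feature data corresponding to the specified entity;
    invoking a pre-trained prediction model, wherein the prediction model is generated by training a preset regression model on a pre-collected sample label dataset;
    inputting the specified feature data into the prediction model so that the prediction model performs prediction processing on the specified feature data;
    obtaining an output result, corresponding to the specified entry page, that is output by the prediction model;
    using the output result as a predicted change frequency value of the specified entity.
  10. The computer device according to claim 9, wherein the step of extracting, from the initial data, the specified feature data corresponding to the specified entity comprises:
    obtaining preset feature category information;
    extracting, from the initial data according to the feature category information, the specified feature data corresponding to the feature category information.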
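  11. The computer device according to claim 9, wherein before the step of invoking the pre-trained prediction model, the method comprises:
    collecting a first specified number of entry page information items from the encyclopedia website;
    constructing a sample label dataset from the entry page information according to a preset feature construction rule, wherein the sample label dataset includes feature data related to each entity and a change frequency label value corresponding to each entity;
    dividing the sample label dataset into a training dataset and a test dataset;
    training the preset regression model on the training dataset using stochastic gradient descent to generate a trained first initial model;
    verifying the trained first initial model with the test dataset and determining whether verification passes;
    if verification passes, using the trained first initial model as the prediction model;
    storing the prediction model in a blockchain network.
  12. The computer device according to claim 11, wherein the step of verifying the trained first initial model with the test dataset and determining whether verification passes comprises:
    inputting each test sample in the test dataset into the trained first initial model to obtain a test result for each test sample;
    obtaining the accuracy of the trained first initial model from the test results of the test samples;
    determining whether the accuracy is greater than a preset accuracy threshold;
    if the accuracy is greater than the preset accuracy threshold, determining that verification passes;
    if the accuracy is not greater than the preset accuracy threshold, determining that verification fails.
  13. The computer device according to claim 12, wherein after the step of determining whether the accuracy is greater than the preset accuracy threshold, the method comprises:
    if the accuracy is not greater than the preset accuracy threshold, selecting from the test dataset the specified test samples whose test results are wrong;
    adding the specified test samples to the training dataset to generate an updated training dataset;
    training the regression model on the updated training dataset to generate a trained second initial model;
    using the trained second initial model as the prediction model.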
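  14. The computer device according to claim 9, wherein before the step of invoking the pre-trained prediction model, the method comprises:
    obtaining a second specified number of pre-trained sub-learners;
    training a preset meta-model with all of the sub-learners according to a preset ensemble learning algorithm to generate a trained meta-model;
    using the trained meta-model as the prediction model.
  15. The computer device according to claim 9, wherein the step of obtaining, from the encyclopedia website, the initial data in the specified entry page corresponding to the specified entity comprises:
    invoking a data query interface corresponding to the encyclopedia website;
    obtaining, through the data query interface, the initial data in the specified entry page corresponding to the specified entity.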
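  16. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements a model-based method for predicting data change frequency, the method comprising the following steps:
    obtaining, from an encyclopedia website, initial data in a specified entry page corresponding to a specified entity, wherein the specified entity is any entity in a preset knowledge base;
    extracting, from the initial data, specified feature data corresponding to the specified entity;
    invoking a pre-trained prediction model, wherein the prediction model is generated by training a preset regression model on a pre-collected sample label dataset;
    inputting the specified feature data into the prediction model so that the prediction model performs prediction processing on the specified feature data;
    obtaining an output result, corresponding to the specified entry page, that is output by the prediction model;
    using the output result as a predicted change frequency value of the specified entity.
  17. The computer-readable storage medium according to claim 16, wherein the step of extracting, from the initial data, the specified feature data corresponding to the specified entity comprises:
    obtaining preset feature category information;
    extracting, from the initial data according to the feature category information, the specified feature data corresponding to the feature category information.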
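  18. The computer-readable storage medium according to claim 16, wherein before the step of invoking the pre-trained prediction model, the method comprises:
    collecting a first specified number of entry page information items from the encyclopedia website;
    constructing a sample label dataset from the entry page information according to a preset feature construction rule, wherein the sample label dataset includes feature data related to each entity and a change frequency label value corresponding to each entity;
    dividing the sample label dataset into a training dataset and a test dataset;
    training the preset regression model on the training dataset using stochastic gradient descent to generate a trained first initial model;
    verifying the trained first initial model with the test dataset and determining whether verification passes;
    if verification passes, using the trained first initial model as the prediction model;
    storing the prediction model in a blockchain network.
  19. The computer-readable storage medium according to claim 16, wherein before the step of invoking the pre-trained prediction model, the method comprises:
    obtaining a second specified number of pre-trained sub-learners;
    training a preset meta-model with all of the sub-learners according to a preset ensemble learning algorithm to generate a trained meta-model;
    using the trained meta-model as the prediction model.
  20. The computer-readable storage medium according to claim 16, wherein the step of obtaining, from the encyclopedia website, the initial data in the specified entry page corresponding to the specified entity comprises:
    invoking a data query interface corresponding to the encyclopedia website;
    obtaining, through the data query interface, the initial data in the specified entry page corresponding to the specified entity.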
PCT/CN2020/118530 2020-07-27 2020-09-28 Model-based method, apparatus and computer device for predicting data change frequency WO2021139255A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010734520.0A CN111859238B (zh) 2020-07-27 Model-based method, apparatus and computer device for predicting data change frequency
CN202010734520.0 2020-07-27

Publications (1)

Publication Number Publication Date
WO2021139255A1 true WO2021139255A1 (zh) 2021-07-15

Family

ID=72947569

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/118530 WO2021139255A1 (zh) 2020-07-27 2020-09-28 Model-based method, apparatus and computer device for predicting data change frequency

Country Status (1)

Country Link
WO (1) WO2021139255A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104915392A (zh) * 2015-05-26 2015-09-16 国家计算机网络与信息安全管理中心 一种微博转发行为预测方法及装置
JP2017204219A (ja) * 2016-05-13 2017-11-16 日本電信電話株式会社 モデル学習装置、単語抽出装置、方法、及びプログラム
CN108287911A (zh) * 2018-02-01 2018-07-17 浙江大学 一种基于约束化远程监督的关系抽取方法
CN110019840A (zh) * 2018-07-20 2019-07-16 腾讯科技(深圳)有限公司 一种知识图谱中实体更新的方法、装置和服务器
CN111310931A (zh) * 2020-02-05 2020-06-19 北京三快在线科技有限公司 参数生成方法、装置、计算机设备及存储介质
CN111340244A (zh) * 2020-05-15 2020-06-26 支付宝(杭州)信息技术有限公司 预测方法、训练方法、装置、服务器及介质

Also Published As

Publication number Publication date
CN111859238A (zh) 2020-10-30

Similar Documents

Publication Publication Date Title
WO2021004333A1 (zh) 基于知识图谱的事件处理方法、装置、设备和存储介质
US11520812B2 (en) Method, apparatus, device and medium for determining text relevance
WO2020001373A1 (zh) 一种本体构建方法及装置
US20190129732A1 (en) Methods, systems, and computer program product for implementing software applications with dynamic conditions and dynamic actions
US20190130305A1 (en) Methods, systems, and computer program product for implementing an intelligent system with dynamic configurability
US20100241647A1 (en) Context-Aware Query Recommendations
CN110019770A (zh) 训练分类模型的方法与装置
CN107358315A (zh) 一种信息预测方法及终端
CN113139134B (zh) 一种社交网络中用户生成内容的流行度预测方法、装置
WO2022141876A1 (zh) 基于词向量的搜索方法、装置、设备及存储介质
CN112183881A (zh) 一种基于社交网络的舆情事件预测方法、设备及存储介质
JP5276581B2 (ja) トレンド分析装置、トレンド分析方法およびトレンド分析プログラム
CN112380344A (zh) 文本分类的方法、话题生成的方法、装置、设备及介质
CN114118192A (zh) 用户预测模型的训练方法、预测方法、装置及存储介质
CN114818682B (zh) 基于自适应实体路径感知的文档级实体关系抽取方法
CN108647064A (zh) 操作路径导航的方法及装置
CN106803092B (zh) 一种标准问题数据的确定方法及装置
CN112257959A (zh) 用户风险预测方法、装置、电子设备及存储介质
WO2023159756A1 (zh) 价格数据的处理方法和装置、电子设备、存储介质
CN110909975B (zh) 科研平台效益评估方法、装置
CN107357782A (zh) 一种识别用户性别的方法及终端
CN113761193A (zh) 日志分类方法、装置、计算机设备和存储介质
Wang et al. Ipre: a dataset for inter-personal relationship extraction
WO2021139255A1 (zh) 基于模型的预测数据变化频率的方法、装置和计算机设备
JP2022531480A (ja) 訪問予測

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20912701

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20912701

Country of ref document: EP

Kind code of ref document: A1