Recommendation system for streaming listening audio content in vehicle-mounted scene
Technical Field
The invention relates to a system for providing personalized uninterrupted audio content for different users in a vehicle-mounted scene, in particular to a recommendation system for streaming audio listening content in the vehicle-mounted scene.
Background
With the rapid development of the internet, information overload is becoming more serious, and recommendation systems are one of the important means of addressing it. Current recommendation technology mainly serves products built around the strongly interactive mode of mobile phones and PCs. The vehicle-mounted scene has unique characteristics, so current recommendation technology runs into several problems:
1. On mobile phones and PCs, the user is focused on the device and actively gives explicit or implicit feedback on recommendation results, such as ratings, likes and click-to-play. In a vehicle-mounted scene, the user concentrates on driving and listens to content as accompaniment, so user behavior data are sparse.
2. Existing recommendation technology is mainly used in on-demand products, whereas a vehicle-mounted scene needs continuous streaming listening so that the influence on driving is reduced.
3. Existing recommendation techniques are based on user information and behavioral data. In the vehicle-mounted scene, vehicle information (such as road conditions and vehicle speed) and scene information (such as commuting, travel and late-night long-distance driving) also need to be fused.
Disclosure of Invention
The invention aims to provide a system for providing personalized uninterrupted audio content for different users in a vehicle-mounted scene, which can solve the existing problems and solve the problem of sparse active behavior data of the users in the vehicle-mounted scene by utilizing big data and expert knowledge.
In order to achieve this purpose, the invention provides a recommendation system for streaming listening audio content in a vehicle-mounted scene, which is used together with a client, a server, a local file system and a storage system, wherein the recommendation system comprises a real-time data collection subsystem, an offline model training subsystem and an online content delivery subsystem; the real-time data collection subsystem collects related information through the client and reports it to the server, which records it into the local file system, after which it is recorded into the storage system; the offline model training subsystem calculates offline model data from the original data recorded in the storage system, and finally the online content delivery subsystem delivers the offline model data; the related information comprises user behavior data, automobile information and scene information.
The recommendation system for the streaming listening audio content in the vehicle-mounted scene is characterized in that the operation process of the offline model training subsystem comprises two stages, candidate set generation and candidate set ranking; candidate set generation is divided into user active behavior and offline model calculation, and candidate set ranking calculates the user's degree of preference for the candidate sets.
The recommendation system for the streaming listening audio content in the vehicle-mounted scene is characterized in that the user active behavior means that the user actively fills in favorite content labels through corresponding product forms, which comprise a customized playlist and interest selection; the customized playlist is displayed on the product interface and its content is defined by the user on the basis of content classification, content labels and content keywords; the interest selection is an interface presented to the user at registration, on which the user selects the content labels of interest.
In the recommendation system for the streaming listening audio content in the vehicle-mounted scene, the offline model calculation means that the offline model training subsystem analyzes data through algorithms to obtain the content labels that a user likes, wherein the data comprise user information, user behavior, automobile information and scene information; the offline model calculation consists of four parts: episode tracing, user portrait, user attribute recommendation and popular content.
The recommendation system for the streaming listening audio content in the vehicle-mounted scene is characterized in that the episode tracing is performed by the offline model training subsystem analyzing the user listening history stored in the storage system, and the process is as follows: the records are grouped by the unique identifier of each user, the listening records of programs of the continuous-listening type are kept, the list of programs listened to in the last three months is kept in reverse time order, the next episode of each program is looked up, and the result is finally stored.
The recommendation system for the streaming listening audio content in the vehicle-mounted scene is characterized in that the user portrait is obtained as follows: user behavior data and audio information are first obtained; the offline model training subsystem then associates the two types of data by the unique audio identifier, groups them by user, and calculates the user portrait of each user, the label weight of each user being calculated as user label weight = behavior type weight × time decay × behavior count × TF-IDF; the user behavior data comprise audio listening duration, subscription, playlist click, search click, album on demand, next and negative feedback; the audio information comprises duration, the album to which the audio belongs, the labels of the album, and the category to which it belongs; the formula for the user portrait label weight is:

$$\mathrm{norm}\left(W_{behavior} \cdot F_t \cdot C \cdot TF \cdot IDF\right)$$

wherein the behavior type weight is $W_{behavior}$ = {subscription: 5, playlist click: 1.4R, search: 1.3R, album on demand: 1.2R, next: 1R, negative feedback: 0.1}; the album completion rate is $R = \sum PlayTime_{audio} / \sum Duration_{audio}$; the time decay is $F_t = \max\left(1,\ 1 \cdot e^{-0.8 \cdot \max(0,\ (now - playtime)/(24 \cdot 3600))}\right)$, where now is the current time and playtime is the time at which the behavior occurred, in ms; the behavior count C is calculated per day and is the number of times the same behavior type occurs for the same album; the importance of a label is given by TF-IDF: the numerator of the TF formula is the number of times a certain label appears for a user, and the denominator is the total number of labels of that user; the numerator of the argument of the logarithm in the IDF formula is the total number of users, and the denominator is the number of users having that label, plus 1.
The recommendation system for the streaming listening audio content in the vehicle-mounted scene is characterized in that the user attribute recommendation is based on the collected attributes of seed users, their customized playlist information, and operational experience; the offline model training subsystem calculates the degree of preference of users with different attributes for playlist content, and the recommendation is performed with the following formula:
![Figure BDA0002304745260000041](https://patentimages.storage.***apis.com/01/7a/68/c3df6d1e5720b4/BDA0002304745260000041.png)
that is, given the known user attributes u1, u2, ……, un, the relative probability that the user likes label l is calculated. N and n are, respectively, the total number of data records and the number of times label l is liked; Ni and ni are, respectively, the total number of data records under attribute i and the number of times label l is liked under attribute i. Similarly to tf-idf, the first term is a penalty term: the higher the popularity of the label, the lower its value (idf); the second term is the sum of conditional probabilities: the higher the probability that the label occurs under the attribute, the higher its value (tf). (N − α) is the coefficient of the penalty term; α defaults to 1 (no penalty), and its recommended range is 0 ≤ α ≤ 1. β is the weight for weakening hot labels within each attribute; it defaults to 1 (no weakening), and its recommended range is 1 ≤ β ≤ 2. The larger α is, the smaller the penalty on popularity and the more popular the result; the smaller α is, the larger the penalty and the more personalized the result. The larger β is, the stronger the weakening of popularity and the more personalized the result; the smaller β is, the weaker the weakening and the more popular the result.
The recommendation system for the streaming listening audio content in the vehicle-mounted scene is characterized in that the popular content is obtained from statistics of users' album click behavior data; the offline model training subsystem calculates the importance of each content classification in each hour by TF-IDF: the numerator of the TF formula is the number of times a certain content classification appears in a certain hour, and the denominator is the total count of all content classifications in that hour; the numerator of the argument of the logarithm in the IDF formula is the total number of hours in a day, 24, and the denominator is the number of hours in which the content classification appears, plus 1.
In the recommendation system for the streaming listening audio content in the vehicle-mounted scene, the candidate set ranking is implemented by the offline model training subsystem: at the initial stage, when the user has little positive feedback behavior, the user portrait is used and the obtained content label weight serves as the basis of the overall ranking; at a later stage, as positive feedback data increase, a click-through rate estimation model can be used to learn the proportion of each candidate set and the final ranking automatically.
In the recommendation system for the streaming listening audio content in the vehicle-mounted scene, the online content delivery subsystem delivers content online according to the calculation results of the offline model training subsystem, and the online content delivery is divided into two stages, recall and ranking; the recall obtains from the storage system the candidate sets calculated by the offline models of the offline model training subsystem, and then calculates the proportion of each candidate set from the offline statistics; the ranking obtains the related information of the current user and the intermediate data of the offline calculation, extracts features, calculates by a model the ordering of content the user is most likely to like, and delivers the final result.
The recommendation system for the streaming listening audio content in the vehicle-mounted scene has the following advantages:
1. The system adopts streaming listening, which reduces excessive interactive operation by the driver while driving and thus reduces the risk of traffic accidents.
2. The problem that active user behavior data are sparse during streaming listening in a vehicle-mounted scene is solved. On the product side, the user is guided to customize a playlist and to select favorite content labels at registration, and user behavior data are then collected in multiple dimensions by combining subscription, click behavior, search, negative feedback and the like. On the algorithm side, the user attributes and customized playlists of seed users are collected, a model is established, and the degree of preference between user attributes and content labels is calculated, which realizes user attribute recommendation; the importance of each content classification is calculated per hour, which realizes hot recommendation.
3. Positive feedback data are sparse at the early stage, so the content label weight of the user portrait can be used as the criterion for ranking results. When the amount of positive feedback that users give on the content recommended by the system reaches a certain level (usually about 10 times the number of features), a supervised learning model, namely click-through rate estimation, can be used to optimize the ranking of the recommendation results.
4. In the aspect of algorithm modeling, besides information related to users and contents, information of automobiles and scenes is fused, so that the recommended contents are more suitable for vehicle-mounted scenes.
Drawings
Fig. 1 is a schematic architecture diagram of a recommendation system for streaming listening to audio content in a vehicle-mounted scene according to the present invention.
Fig. 2 is a model diagram of click through rate pre-estimation ranking of the recommendation system for streaming audio content listening in a vehicle-mounted scene according to the present invention.
Fig. 3 is a recall flowchart of the recommendation system for streaming audio content in an in-vehicle scenario of the present invention.
Detailed Description
The following further describes embodiments of the present invention with reference to the drawings.
The invention provides a recommendation system for streaming listening audio content in a vehicle-mounted scene, which is used together with a client, a server, a local file system and a storage system. The storage system comprises a distributed cache subsystem, an inverted index subsystem, a relational database subsystem and a distributed file subsystem. The recommendation system also depends on a middleware service system, which comprises an asynchronous communication subsystem based on the Actor model, a distributed real-time processing subsystem, a distributed computing subsystem and a real-time log collection subsystem, as shown in fig. 1.
The real-time data collection subsystem collects related information through a client program and reports it to the HTTP web server, which records it into the local file system; the real-time log collection subsystem then performs operations such as information completion, splitting and cleaning and records the information into the distributed file subsystem of the storage system. The offline model training subsystem calculates offline model data from the original data recorded in the storage system, and finally the online content delivery subsystem delivers the offline model data. The related information includes user behavior data, automobile information, scene information and the like.
The operation process of the off-line model training subsystem comprises two important links of candidate set generation and candidate set sequencing, the candidate set generation is divided into user active behavior and off-line model calculation, and the candidate set sequencing is used for calculating the preference degree of a user to a candidate set.
The user active behavior is that the user actively fills in favorite content labels through corresponding product forms, and the favorite content labels comprise customized playing lists and interest selections; displaying the customized playlist on a product interface, and defining the content of the playlist by a user, wherein the content of the playlist is based on content classification, content labels and content keywords; the interest selection is an interface activated by a user in registration and selects the interested content tags of the user.
The offline model calculation means that the offline model training subsystem analyzes data through algorithms to obtain the content labels that a user likes, wherein the data include user information, user behavior, automobile information, scene information and the like; the offline model calculation consists of four parts: episode tracing, user portrait, user attribute recommendation and popular content.
The episode tracing is implemented by the offline model training subsystem, which uses the distributed computing subsystem to analyze the user listening history stored in the storage system. The process is as follows: the records are grouped by the unique identifier of each user, the listening records of programs of the continuous-listening type (such as novels) are kept, the list of programs listened to in the last three months is kept in reverse time order, the next episode of each program is looked up, and the calculated result is finally stored in the inverted index subsystem.
The user portrait is obtained as follows: user behavior data are first obtained from the distributed file subsystem and audio information from the relational database subsystem; the offline model training subsystem then associates the two types of data by the unique audio identifier, groups them by user, and calculates the user portrait of each user, the label weight of each user being calculated as user label weight = behavior type weight × time decay × behavior count × TF-IDF. The user behavior data include audio listening duration, subscription, playlist click, search click, album on demand, next, negative feedback and the like; the audio information includes duration, the album to which the audio belongs, the labels of the album, the category to which it belongs, and the like. The formula for the user portrait label weight is:

$$\mathrm{norm}\left(W_{behavior} \cdot F_t \cdot C \cdot TF \cdot IDF\right)$$

wherein the behavior type weight is $W_{behavior}$ = {subscription: 5, playlist click: 1.4R, search: 1.3R, album on demand: 1.2R, next: 1R, negative feedback: 0.1}; the album completion rate is $R = \sum PlayTime_{audio} / \sum Duration_{audio}$; the time decay is $F_t = \max\left(1,\ 1 \cdot e^{-0.8 \cdot \max(0,\ (now - playtime)/(24 \cdot 3600))}\right)$, where now is the current time and playtime is the time at which the behavior occurred, in ms; the behavior count C is calculated per day and is the number of times the same behavior type occurs for the same album. The importance of a label is given by TF-IDF: the numerator of the TF formula is the number of times a certain label appears for a user, and the denominator is the total number of labels of that user; the numerator of the argument of the logarithm in the IDF formula is the total number of users, and the denominator is the number of users having that label, plus 1.
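The label weight described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: all function and key names are hypothetical, timestamps are assumed to be in seconds (the text states ms but divides by 24*3600), the logarithm base of the IDF is assumed to be 10, and `norm` is assumed to mean division by the largest weight because the text does not define it. Note that with the outer `max(1, …)` exactly as written in the text, the decay factor is always 1; the `floor` parameter is an assumption that lets a smaller lower bound be tried.

```python
import math

# Behavior-type weights from the text. The flag marks weights that are
# multiplied by the album completion rate R.
BEHAVIOR_WEIGHT = {
    "subscription": (5.0, False),
    "playlist_click": (1.4, True),
    "search": (1.3, True),
    "album_on_demand": (1.2, True),
    "next": (1.0, True),
    "negative_feedback": (0.1, False),
}

def completion_rate(play_seconds, duration_seconds):
    """R = sum(PlayTime_audio) / sum(Duration_audio) for one album."""
    total = sum(duration_seconds)
    return sum(play_seconds) / total if total else 0.0

def behavior_weight(behavior, r):
    base, scaled_by_r = BEHAVIOR_WEIGHT[behavior]
    return base * r if scaled_by_r else base

def time_decay(now, playtime, floor=1.0):
    """F_t as written in the text when floor=1.0 (assumed unit: seconds)."""
    days = max(0.0, (now - playtime) / (24 * 3600))
    return max(floor, math.exp(-0.8 * days))

def tf_idf(user_label_counts, label, total_users, users_with_label):
    tf = user_label_counts[label] / sum(user_label_counts.values())
    idf = math.log10(total_users / (users_with_label + 1))  # base 10 is assumed
    return tf * idf

def raw_label_weight(behavior, r, now, playtime, count, tfidf, floor=1.0):
    """W_behavior * F_t * C * TF-IDF before normalization."""
    return behavior_weight(behavior, r) * time_decay(now, playtime, floor) * count * tfidf

def normalize(weights):
    """Assumed norm(): scale so that the largest weight becomes 1."""
    top = max(weights.values(), default=0.0)
    return {k: (v / top if top else 0.0) for k, v in weights.items()}
```

For example, `raw_label_weight("subscription", 0.5, 86400, 0, 2, 0.75)` multiplies the fixed subscription weight 5 by a decay of 1, a behavior count of 2 and a TF-IDF of 0.75.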
The user attribute recommendation is based on the collected attributes of seed users and their customized playlist information (collected, for example, through a WeChat mini program), together with operational experience; the offline model training subsystem calculates the degree of preference of users with different attributes for playlist content, and the recommendation is performed with the following formula:
that is, given the known user attributes u1, u2, ……, un, the relative probability that the user likes label l is calculated; the process is as follows:
Independence assumption
P(u1u2…un|l=1)=P(u1|l=1)P(u2|l=1)…P(un|l=1)
Bayesian formula
To obtain
Setting up
P(l=1)=p,P(l=0)=1-p,P(l=1|ui)=qi,P(l=0|ui)=1-qi
Finally obtain
N and n are, respectively, the total number of data records and the number of times label l is liked; Ni and ni are, respectively, the total number of data records under attribute i and the number of times label l is liked under attribute i. Similarly to tf-idf, the first term is a penalty term: the higher the popularity of the label, the lower its value (idf); the second term is the sum of conditional probabilities: the higher the probability that the label occurs under the attribute, the higher its value (tf). (N − α) is the coefficient of the penalty term; α defaults to 1 (no penalty), and its recommended range is 0 ≤ α ≤ 1. β is the weight for weakening hot labels within each attribute; it defaults to 1 (no weakening), and its recommended range is 1 ≤ β ≤ 2. The larger α is, the smaller the penalty on popularity and the more popular the result; the smaller α is, the larger the penalty and the more personalized the result. The larger β is, the stronger the weakening of popularity and the more personalized the result; the smaller β is, the weaker the weakening and the more popular the result.
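The outline of this derivation can be written out as a worked sketch. It covers only the standard naive Bayes steps that follow from the stated independence assumption and from the settings p and q_i. Here n is taken to be the number of attributes, and the same independence is assumed to hold for l = 0; both are assumptions. The source's final expression with α and β is not reproduced, since its exact form cannot be recovered from the text.

```latex
\begin{aligned}
P(l=1 \mid u_1,\dots,u_n)
  &= \frac{P(l=1)\,\prod_{i=1}^{n} P(u_i \mid l=1)}{P(u_1,\dots,u_n)}
  && \text{(Bayes' formula and independence)} \\
P(u_i \mid l=1)
  &= \frac{P(l=1 \mid u_i)\,P(u_i)}{P(l=1)} = \frac{q_i\,P(u_i)}{p} \\
\frac{P(l=1 \mid u_1,\dots,u_n)}{P(l=0 \mid u_1,\dots,u_n)}
  &= \left(\frac{p}{1-p}\right)^{1-n} \prod_{i=1}^{n} \frac{q_i}{1-q_i} \\
\log \frac{P(l=1 \mid u_1,\dots,u_n)}{P(l=0 \mid u_1,\dots,u_n)}
  &= -(n-1)\,\log\frac{p}{1-p} \;+\; \sum_{i=1}^{n} \log\frac{q_i}{1-q_i}
\end{aligned}
```

In the logarithmic form the first term decreases as the overall popularity p of the label grows, and the second term is a sum over the attributes. This matches the description of a penalty term and a sum of conditional probabilities, with the coefficient (n − 1) corresponding to the default α = 1 if the penalty coefficient in the text is read as (n − α). Under this reading, p and q_i would be estimated from the counts as n/N and ni/Ni.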
The popular content is obtained from statistics of users' album click behavior data; the offline model training subsystem calculates the importance of each content classification in each hour by TF-IDF:
the numerator of the TF formula is the number of times a certain content classification appears in a certain hour, and the denominator is the total count of all content classifications in that hour; the numerator of the argument of the logarithm in the IDF formula is the total number of hours in a day, 24, and the denominator is the number of hours in which the content classification appears, plus 1.
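The hourly importance calculation can be illustrated with a short Python sketch. The function name and the `(hour, category)` input format are hypothetical; the base-10 logarithm follows the description given in Example 1.

```python
import math
from collections import Counter, defaultdict

HOURS_PER_DAY = 24

def hourly_category_importance(clicks):
    """clicks: iterable of (hour, category) pairs, with hour in 0..23.

    Returns {(hour, category): tf * idf}.
    """
    per_hour = defaultdict(Counter)
    for hour, category in clicks:
        per_hour[hour][category] += 1

    # Number of distinct hours in which each category appears.
    hours_with = Counter()
    for counts in per_hour.values():
        for category in counts:
            hours_with[category] += 1

    importance = {}
    for hour, counts in per_hour.items():
        total = sum(counts.values())
        for category, count in counts.items():
            tf = count / total
            idf = math.log10(HOURS_PER_DAY / (hours_with[category] + 1))
            importance[(hour, category)] = tf * idf
    return importance
```

A category that is clicked in nearly every hour gets an IDF close to zero, so only categories that are concentrated in particular hours stand out as "hot" for those hours.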
Candidate set ranking is implemented by the offline model training subsystem. At the initial stage, when the user has little positive feedback behavior, the user portrait can be used, with the obtained content label weight as the basis of the overall ranking; at a later stage, as positive feedback data increase, a click-through rate estimation model can be used to learn the proportion of each candidate set and the final ranking automatically. The ranking process of the click-through rate estimation model is as follows: user behavior data and service content data are collected; the offline model training subsystem extracts features, including scene features, automobile features, user features, content features and the like, discretizes them, one-hot encodes them, and writes them into the storage system; at the same time, a logistic regression model is trained on the data, to which behavior data on recommended results are added, and the resulting model data are written into the storage system; the features and the model data are then read from the storage system, the click-through rate of each recommended candidate result is calculated in real time, and finally the recommended results are sorted by click-through rate, as shown in fig. 2.
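The click-through rate ranking step can be sketched with plain Python. This is an illustrative toy, assuming the features are already discretized; the feature names, the stochastic gradient descent training loop and its hyperparameters are hypothetical and are not taken from the source, which only names one-hot encoding and logistic regression.

```python
import math

def build_vocab(samples):
    """Map each (feature, value) pair to a column index for one-hot encoding."""
    vocab = {}
    for features in samples:
        for item in sorted(features.items()):
            vocab.setdefault(item, len(vocab))
    return vocab

def one_hot(features, vocab):
    """Sparse one-hot row: the sorted indices of the active columns."""
    return sorted(vocab[item] for item in features.items() if item in vocab)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_ctr(active, weights, bias):
    """Estimated click probability for one sparse one-hot row."""
    return sigmoid(bias + sum(weights[i] for i in active))

def train_logistic_regression(rows, labels, n_features, epochs=200, lr=0.5):
    """Plain SGD on the log loss; rows are sparse one-hot index lists."""
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        for active, label in zip(rows, labels):
            error = predict_ctr(active, weights, bias) - label
            bias -= lr * error
            for i in active:
                weights[i] -= lr * error
    return weights, bias

def rank_by_ctr(candidates, vocab, weights, bias):
    """candidates: list of (name, features); highest estimated CTR first."""
    scored = [(predict_ctr(one_hot(f, vocab), weights, bias), name)
              for name, f in candidates]
    return [name for _, name in sorted(scored, reverse=True)]
```

In practice the rows would combine scene, automobile, user and content features, and the trained weights would be written to storage for the online subsystem to read, as the paragraph above describes.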
The online content delivery subsystem is built as a high-performance, high-availability distributed application on the Actor-model asynchronous communication subsystem. It delivers content online according to the calculation results of the offline model training subsystem, and the process is divided into two stages, recall and ranking. The recall obtains the candidate sets calculated by the offline models of the offline model training subsystem from the distributed cache subsystem, the inverted index subsystem and the relational database subsystem of the storage system, and then calculates the proportion of each candidate set from the offline statistics. The specific process is as follows: the user's customized playlist is looked up first; if it exists, the user's episode tracing is fetched in reverse time order, with episode tracing accounting for no more than 50%, and is then combined with the customized playlist; if it does not exist, other strategies are used, and weights are determined for the candidate sets of self-selected content labels, user episode tracing, user portrait, user attributes and default playlist. The initial weights are self-selected content labels 4, user episode tracing 2, user portrait 2, user attributes 1 and default playlist 1; the weight of each candidate set is set manually by weighing these candidate sets together, and a weight change takes effect immediately, as shown in fig. 3.
The ranking obtains the related information of the current user and the intermediate data of the offline calculation from the distributed real-time processing subsystem, the distributed cache subsystem, the inverted index subsystem and the relational database subsystem, extracts features, calculates by a model the ordering of content the user is most likely to like, and delivers the final result.
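The recall logic described above can be sketched as follows. The function and key names are hypothetical, and two details are assumptions because the text does not specify them: slots lost to integer rounding go to the highest-weighted candidate sets, and episode tracing items are placed before the playlist items.

```python
# Initial candidate-set weights from the text.
INITIAL_WEIGHTS = {
    "self_selected_labels": 4,
    "episode_tracing": 2,
    "user_portrait": 2,
    "user_attributes": 1,
    "default_playlist": 1,
}

def allocate_quota(total, weights):
    """Split `total` slots across candidate sets in proportion to their weights."""
    weight_sum = sum(weights.values())
    quota = {name: total * w // weight_sum for name, w in weights.items()}
    # Hand out the slots lost to integer division, highest weight first.
    leftover = total - sum(quota.values())
    for name in sorted(weights, key=weights.get, reverse=True)[:leftover]:
        quota[name] += 1
    return quota

def recall(total, playlist_items, episode_items, candidate_sets,
           weights=INITIAL_WEIGHTS):
    """Return up to `total` items following the two branches in the text."""
    if playlist_items:
        # Episode tracing (assumed newest first) takes at most 50% of the slots.
        episodes = episode_items[: total // 2]
        return episodes + playlist_items[: total - len(episodes)]
    quota = allocate_quota(total, weights)
    result = []
    for name, count in quota.items():
        result.extend(candidate_sets.get(name, [])[:count])
    return result
```

Because the weights are plain data passed in at call time, changing them takes effect on the next request, in line with the statement that a weight change takes effect immediately.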
The following describes the recommendation system for streaming audio content listening in a car scene according to the present invention with reference to the following embodiments.
Example 1
A recommendation system for streaming audio content listening in a vehicle-mounted scene is matched with a client, a server, a local file system and a storage system for use. The recommendation system comprises a real-time data collection subsystem, an offline model training subsystem and an online content delivery subsystem.
1. Real-time data collection subsystem. The client collects audio playing behavior data and reports them to the nginx web server. The log collection subsystem Flume collects and aggregates the data, supplements the album information, and stores the data by time in the distributed file subsystem HDFS. Nginx (engine x) is a high-performance HTTP and reverse proxy web server that also provides IMAP/POP3/SMTP services. Flume is a highly available, highly reliable, distributed system provided by Cloudera for collecting, aggregating and transmitting massive logs. HDFS (Hadoop Distributed File System) is a distributed file system designed to run on commodity hardware.
2. Offline model training subsystem.
(1) Episode tracing.
A distributed computing program is written with MapReduce. In task 1, the map reads the users' listening records of the last three months, keeps the records classified as novels, and groups them by unique user identifier for the reduce; the reduce sorts the data by time in descending order and keeps the most recent listening record of each album. Task 2 reads the output of task 1 and adds all the audio information of each album; the map groups by unique album identifier for the reduce; the reduce calculates the next audio episode after the one in the listening history. In task 3, the map reads the output of task 2 and groups it by unique user identifier; the reduce stores the grouped data in the inverted index subsystem Elasticsearch. MapReduce is a programming model for parallel computation over large-scale data sets (greater than 1 TB). Elasticsearch is a Lucene-based search server.
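The three tasks can be emulated in memory to make the data flow concrete. This is a single-process sketch, not a Hadoop job: the record fields, the function names and the 90-day reading of "three months" are assumptions, and the final step returns a dictionary where the real task 3 would write to Elasticsearch.

```python
from collections import defaultdict

THREE_MONTHS_SECONDS = 90 * 24 * 3600  # assumption: "three months" taken as 90 days

def latest_per_album(records, now):
    """Task 1: keep recent 'novel' records, then the newest one per (user, album)."""
    by_user = defaultdict(list)
    for rec in records:  # map: filter, then key by user
        if rec["category"] == "novel" and now - rec["time"] <= THREE_MONTHS_SECONDS:
            by_user[rec["user"]].append(rec)
    latest = []
    for recs in by_user.values():  # reduce: newest first, first hit per album wins
        seen = set()
        for rec in sorted(recs, key=lambda r: r["time"], reverse=True):
            if rec["album"] not in seen:
                seen.add(rec["album"])
                latest.append(rec)
    return latest

def next_episodes(latest, album_audios):
    """Task 2: find the audio that follows the last-heard one in each album."""
    result = []
    for rec in latest:
        audios = album_audios.get(rec["album"], [])
        if rec["audio"] in audios:
            idx = audios.index(rec["audio"])
            if idx + 1 < len(audios):
                result.append({"user": rec["user"], "album": rec["album"],
                               "next_audio": audios[idx + 1]})
    return result

def group_by_user(next_items):
    """Task 3: group per user; a real job would index this instead of returning it."""
    grouped = defaultdict(list)
    for item in next_items:
        grouped[item["user"]].append((item["album"], item["next_audio"]))
    return dict(grouped)
```

Chaining the three functions, `group_by_user(next_episodes(latest_per_album(records, now), album_audios))`, yields for each user the next episode of every novel album they listened to recently; albums whose last episode has already been heard produce no entry.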
(2) User portrait calculation.
First, the original data are cleaned: the play-end event data are associated with the audio information in the service database through the unique audio identifier; the new data and the historical data are merged by unique user identifier; for the merged data, the decay weight of each audio playing duration is calculated and accumulated per unique user identifier and unique album identifier; finally, the decay weights are sorted in descending order.
Then the user labels are calculated: a label blacklist and the album information (including content classification and labels) are obtained, the album labels are filtered against the blacklist, and albums containing blacklisted labels are rejected. The data cleaned in the previous step are joined with the album information by unique album identifier. The decay weights are accumulated for each album label under each user, and the final weight is then calculated by a normalization formula.
(3) Hot recommendations, i.e., user attribute recommendations and hot content.
The users' album click data are collected, and the number of album clicks of each category in each hour is counted; this number divided by the number of clicks of all categories in that hour is tf. For each category, 24 is divided by the number of hours in which the category appears plus 1, and the base-10 logarithm of the quotient is idf. tf multiplied by idf is the importance of a category in a given hour. The categories are then grouped into the major classes of entertainment, knowledge, life and information, and the importance of each major class in each hour is calculated. The data are saved in the inverted index subsystem Elasticsearch. When the online delivery subsystem recalls content, it recalls hourly the major class with the highest importance and then allocates content in proportion to the normalized classification importance, which improves the recall rate.
3. Online content delivery subsystem. The user requests the subsystem's service interface and passes in the unique user identifier uid; the system obtains the customized playlist, episode tracing, self-selected content labels, user portrait and user attributes according to the uid. If a customized playlist exists, the related albums are obtained from the inverted index subsystem Elasticsearch through the album labels stored in the playlist and are combined with the episode tracing to form the candidate set. If there is no customized playlist, the album labels the user is likely to like are obtained from the user attributes through the user attribute recommendation model, the self-selected content labels, the user portrait and the hot labels are added, and the related albums are obtained from Elasticsearch to form the candidate sets. Each candidate set is allocated a quantity according to its weight. Finally, the results are ranked by the label weights of the user portrait and recommended.
The invention provides a recommendation system for streaming listening audio content in a vehicle-mounted scene, which is a system and a method for providing personalized uninterrupted audio content for different users in the vehicle-mounted scene, and which solves the problem of sparse active user behavior data in the vehicle-mounted scene by utilizing big data and expert knowledge. A radio station mode with streaming listening is adopted, which reduces the influence on the driver. In addition, automobile information and scene information are fused, so that the recommended audio content better fits the characteristics of the vehicle-mounted scene.
While the present invention has been described in detail with reference to the preferred embodiments, it should be understood that the above description should not be taken as limiting the invention. Various modifications and alterations to this invention will become apparent to those skilled in the art upon reading the foregoing description. Accordingly, the scope of the invention should be determined from the following claims.