CN114764470A - Method, device and equipment for acquiring user portrait and storage medium - Google Patents

Method, device and equipment for acquiring user portrait and storage medium

Info

Publication number
CN114764470A
CN114764470A
Authority
CN
China
Prior art keywords
user
streaming media
target user
sequence
training sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110035190.0A
Other languages
Chinese (zh)
Inventor
缪畅宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110035190.0A priority Critical patent/CN114764470A/en
Publication of CN114764470A publication Critical patent/CN114764470A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43 Querying
    • G06F16/435 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a method, an apparatus, a device and a storage medium for obtaining a user portrait, relates to the field of computer technology, and is used for obtaining a representation vector of a user with respect to streaming media and improving the comprehensiveness and accuracy of the user portrait. According to the method, a training sample set can be constructed from the user behavior sequences of a target user and of non-target users, and a difference learning model is trained iteratively. By the time the model converges, the differences between users have been learned, and the resulting representation vector of the target user can serve as the target user's representation in terms of streaming media. This makes the content of the user portrait richer, improves the comprehensiveness and accuracy of the user portrait, lays a foundation for subsequent streaming media object recommendation and user analysis, and improves the accuracy of recommendations to users.

Description

Method, device and equipment for acquiring user portrait and storage medium
Technical Field
The application relates to the technical field of computers, in particular to the technical field of Artificial Intelligence (AI), and provides a method, a device and equipment for acquiring a user portrait and a storage medium.
Background
A user portrait, in short, is a vector used to represent a user or a user's features. In scenarios involving streaming media objects such as music or video, streaming media objects generally need to be recommended to users. Taking music as an example, accurate music recommendation usually relies on a representation of the user, that is, a user portrait; the user's portrait in terms of music can express the user's music preferences and thereby assist personalized music recommendation. User portraits are also needed in scenarios such as user group division and analysis, and an accurate user portrait is a prerequisite for accurate subsequent recommendation and user analysis, so accurate user portraits are essential.
Therefore, how to improve the accuracy of the user representation is a problem to be considered.
Disclosure of Invention
The embodiment of the application provides a method, a device and equipment for acquiring a user portrait and a storage medium, and is used for acquiring a representation vector of a user relative to a streaming media aspect and improving comprehensiveness and accuracy of the user portrait.
In one aspect, a method of obtaining a user representation is provided, the method comprising:
acquiring user behavior sequences of a target user and a plurality of non-target users; wherein, a user behavior sequence comprises a plurality of streaming media objects operated by a user;
constructing a training sample set according to each user behavior sequence; wherein, a training sample comprises a first object set composed of at least two streaming media objects in the user behavior sequence of the target user and a second object set composed of at least two streaming media objects in the user behavior sequence of the non-target user, and the first object set and the second object set have intersection;
performing iterative training on the difference learning model according to the training sample set; during each training, a plurality of loss values are correspondingly obtained according to the user representation vector of the target user and the first object set and the second object set of each training sample, one loss value is used for representing the difference degree between the target user and the corresponding non-target user, and the user representation vector is updated according to the plurality of loss values;
obtaining a user representation of the target user from the user representation vector upon determining, from the plurality of loss values, that the difference learning model converges.
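The sample-construction step above can be sketched as follows. This is a minimal illustrative Python sketch, not code from the application: the window size, the toy behavior sequences, and the exact pairing rule are assumptions based on the claim wording (the requirement that the two object sets intersect is taken from the claim).

```python
# Illustrative sketch of training-sample construction. Window size and the
# toy sequences are assumptions; the intersection requirement comes from
# the claim ("the first object set and the second object set have
# intersection").
def build_training_samples(target_seq, non_target_seqs, window=2):
    """Pair windows of the target user's behavior sequence with windows of
    each non-target user's sequence, keeping pairs that share an object."""
    samples = []
    for nt_seq in non_target_seqs:
        for i in range(len(target_seq) - window + 1):
            first = target_seq[i:i + window]        # first object set
            for j in range(len(nt_seq) - window + 1):
                second = nt_seq[j:j + window]       # second object set
                if set(first) & set(second):        # sets must intersect
                    samples.append((first, second))
    return samples

target = ["s1", "s2", "s3"]    # target user's operated objects
others = [["s2", "s4", "s5"]]  # one non-target user's sequence
print(build_training_samples(target, others))
```

Each returned pair corresponds to one training sample built from the two behavior sequences.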
In one aspect, a method for recommending a streaming media object is provided, where the method includes:
obtaining a user representation vector of a target user by the method in the aspect;
recommending the streaming media object with the matching degree larger than the set matching degree threshold value to the target user according to the matching result of the user representation vector of the target user and the object representation vector of each streaming media object; or,
and performing similar matching on the user representation vector of the target user and the user representation vectors of other users, determining similar users with the similarity larger than a set similarity threshold, and recommending the streaming media object in the user behavior sequence of the similar users to the target user.
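As an illustration of the first recommendation branch (matching the user representation vector against object representation vectors), the sketch below uses cosine similarity and an arbitrary threshold; the similarity measure and all names are assumptions, since the claim does not fix a particular matching function.

```python
import numpy as np

# Illustrative sketch only: the claim requires a "matching degree" above a
# set threshold but does not prescribe cosine similarity or this threshold.
def recommend(user_vec, object_vecs, threshold=0.5):
    """Return objects whose match with the user representation vector
    exceeds the threshold."""
    recommended = []
    for name, obj_vec in object_vecs.items():
        sim = np.dot(user_vec, obj_vec) / (
            np.linalg.norm(user_vec) * np.linalg.norm(obj_vec))
        if sim > threshold:
            recommended.append(name)
    return recommended

user = np.array([1.0, 0.0])
objects = {"s1": np.array([0.9, 0.1]), "s2": np.array([0.0, 1.0])}
print(recommend(user, objects))  # only "s1" clears the threshold
```

The second branch (finding similar users) could reuse the same similarity function between two user representation vectors.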
In one aspect, an apparatus for capturing a representation of a user is provided, the apparatus comprising:
the user sequence acquisition unit is used for acquiring user behavior sequences of a target user and a plurality of non-target users; wherein, a user behavior sequence comprises a plurality of streaming media objects operated by a user;
the training sample construction unit is used for constructing a training sample set according to each user behavior sequence; wherein, a training sample comprises a first object set composed of at least two streaming media objects in the user behavior sequence of the target user and a second object set composed of at least two streaming media objects in the user behavior sequence of the non-target user, and the first object set and the second object set have intersection;
the training unit is used for carrying out iterative training on the difference learning model according to the training sample set; during each training, a plurality of loss values are correspondingly obtained according to the user representation vector of the target user and the first object set and the second object set of each training sample, one loss value is used for representing the difference degree between the target user and the corresponding non-target user, and the user representation vector is updated according to the plurality of loss values;
a representation unit for obtaining a user representation of the target user from the user representation vector upon determining, from the plurality of loss values, that the difference learning model converges.
Optionally, the first object set is composed of at least two continuous streaming media objects in the user behavior sequence of the target user, and the second object set is composed of at least two continuous streaming media objects in the user behavior sequence of the non-target user.
Optionally, an intersection of the first set of objects and the second set of objects, a difference between the first set of objects and the intersection, and a difference between the second set of objects and the intersection in each training sample constitute a triple;
the training unit is specifically configured to:
and obtaining a triplet loss value corresponding to each training sample according to the user representation vector of the target user and the triplet corresponding to each training sample.
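The triple composition described above can be illustrated with a small sketch. The set arithmetic mirrors the wording directly (intersection, first-set difference, second-set difference); mapping the three parts onto the anchor/positive/negative roles of triplet loss is an assumed reading, and the sample sets are toy data.

```python
# Illustrative sketch of the triple described above. Assigning the
# anchor/positive/negative roles is an assumption based on the triplet-loss
# framing used elsewhere in the text.
def make_triplet(first_set, second_set):
    """Split a training sample into (intersection, first-only, second-only)."""
    anchor = set(first_set) & set(second_set)  # shared objects
    positive = set(first_set) - anchor         # objects only the target user played
    negative = set(second_set) - anchor        # objects only the non-target user played
    return anchor, positive, negative

print(make_triplet({"s2", "s3"}, {"s2", "s4"}))
```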
Optionally, the training unit is specifically configured to:
for each training sample, respectively obtaining a first probability that the user behavior sequence of the target user comprises the first object set and a second probability that the user behavior sequence of the target user comprises the second object set according to the user representation vector and an object representation vector of a streaming media object in each training sample;
and obtaining a loss value of each training sample according to the first probability and the second probability, wherein the loss value is positively correlated with the first probability and negatively correlated with the second probability.
Optionally, the first set of objects includes a first streaming media object and a second streaming media object, and the second set of objects includes the second streaming media object and a third streaming media object;
the training unit is specifically configured to:
acquiring a first degree of association of the first streaming media object and the second streaming media object relative to the user representation vector, and acquiring a second degree of association of the second streaming media object and the third streaming media object relative to the user representation vector;
obtaining the first probability according to the first degree of association and obtaining the second probability according to the second degree of association; wherein the relevance value is positively correlated with the probability value.
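The claims do not fix how the degree of association is computed. The sketch below assumes, purely for illustration, that the association of a pair of objects relative to the user representation vector is the sum of their dot products with it, mapped to a probability by a sigmoid so that a higher association gives a higher probability, as the text requires.

```python
import numpy as np

# Illustrative sketch only: the dot-product-plus-sigmoid formula is an
# assumption, not the application's prescribed computation.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pair_probability(obj_a, obj_b, user_vec):
    """Probability that the user's behavior sequence contains this pair,
    derived from the pair's association with the user representation vector."""
    association = np.dot(obj_a, user_vec) + np.dot(obj_b, user_vec)
    return sigmoid(association)  # higher association -> higher probability

user = np.array([1.0, 0.5])
o1, o2, o3 = np.array([1.0, 0.0]), np.array([0.5, 0.5]), np.array([-1.0, 0.0])
first_prob = pair_probability(o1, o2, user)   # first object set (o1, o2)
second_prob = pair_probability(o2, o3, user)  # second object set (o2, o3)
print(first_prob > second_prob)
```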
Optionally, the convergence condition of the difference learning model includes:
the loss value of each training sample is not less than a set first threshold value; and/or,
the sum of the loss values of all training samples is not less than a set second threshold value, and the second threshold value is greater than the first threshold value.
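A minimal sketch of this convergence test follows. The threshold values are arbitrary (the text only requires the second, total-loss threshold to be larger than the first), and the sketch implements the "and" variant of the "and/or" condition.

```python
# Illustrative sketch of the convergence test. Threshold values are
# assumptions; the "and" combination is one of the variants the text allows.
def converged(sample_losses, first_threshold=0.5, second_threshold=5.0):
    """True when every per-sample loss reaches the first threshold and the
    total loss reaches the (larger) second threshold."""
    return (all(loss >= first_threshold for loss in sample_losses)
            and sum(sample_losses) >= second_threshold)

print(converged([0.6, 0.7, 0.8, 0.9, 1.0, 1.1]))  # both conditions hold
```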
Optionally, the apparatus further includes an object vector obtaining unit, configured to:
respectively acquiring a feature vector sequence of each streaming media object according to the streaming media data of each streaming media object;
and respectively carrying out vector coding according to the characteristic vector sequence of each streaming media object to obtain an object representation vector of each streaming media object.
Optionally, the object vector obtaining unit is specifically configured to:
respectively sampling the audio data of each audio to obtain an audio spectrogram corresponding to each audio;
wherein the audio spectrogram comprises a plurality of time-slice frequency sequences of each audio over continuous time, and each time-slice frequency sequence corresponds to one feature vector in the feature vector sequence.
Optionally, the object vector obtaining unit is specifically configured to:
for each audio, performing time domain sampling on the audio data of each audio according to a set time interval to obtain a plurality of time sequences of each audio in a time domain;
combining the plurality of time sequences according to the set time slice length to obtain a plurality of time sequence combinations;
and for each time sequence combination, performing time-frequency conversion on each time sequence combination, and sampling the frequency domain signals according to a set frequency interval to obtain a frequency sequence corresponding to each time sequence combination.
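The time-slicing and time-frequency conversion described in the steps above can be sketched as a crude short-time spectrogram. The slice length, the number of retained frequency bins and the use of an FFT are illustrative assumptions; the text only specifies time-domain sampling, grouping by time-slice length, time-frequency conversion and frequency-interval sampling.

```python
import numpy as np

# Crude short-time spectrogram sketch; parameters and the FFT are
# assumptions, not taken from the application.
def audio_spectrogram(samples, slice_len=4, freq_bins=3):
    """Group time-domain samples into fixed-length slices and keep the
    magnitudes of a few frequency bins per slice."""
    spectrogram = []
    for start in range(0, len(samples) - slice_len + 1, slice_len):
        time_slice = samples[start:start + slice_len]  # one time-sequence combination
        spectrum = np.abs(np.fft.rfft(time_slice))     # time-frequency conversion
        spectrogram.append(spectrum[:freq_bins])       # frequency-interval sampling
    return np.array(spectrogram)

signal = np.sin(np.linspace(0, 8 * np.pi, 16))  # toy "audio" samples
print(audio_spectrogram(signal).shape)  # one frequency sequence per time slice
```

Each row of the result plays the role of one frequency sequence, i.e. one feature vector in the feature vector sequence.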
In one aspect, an apparatus for recommending streaming media objects is provided, the apparatus comprising:
a user vector acquisition unit configured to acquire a user representation vector of a target user by the method according to the above aspect;
the recommending unit is used for recommending the streaming media object with the matching degree larger than the set matching degree threshold value to the target user according to the matching result of the user representation vector of the target user and the object representation vector of each streaming media object; or carrying out similar matching on the user representation vector of the target user and the user representation vectors of other users, determining similar users with the similarity larger than a set similarity threshold, and recommending the streaming media object in the user behavior sequence of the similar users to the target user.
In one aspect, a computer device is provided, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the above methods when executing the computer program.
In one aspect, a computer storage medium is provided having computer program instructions stored thereon that, when executed by a processor, implement the steps of any of the above-described methods.
In one aspect, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps of any of the methods described above.
In the method for obtaining a user portrait described above, a training sample set can be constructed from the user behavior sequences of the target user and of non-target users, and a difference learning model is trained iteratively. Each training sample comprises a first object set from the target user's behavior sequence and a second object set from a non-target user's behavior sequence, which can represent the preference information of the target user and of the non-target user, respectively. During each round of training, a loss value representing the degree of difference between the target user and a non-target user is obtained from the representation vector of the target user and the representation vectors of the streaming media objects included in each training sample, and the target user's representation vector is then updated according to these loss values. Through this training process the differences between individual users are learned, so the finally updated representation vector of the target user can serve as the target user's representation in terms of streaming media, making the content of the user portrait richer and improving its comprehensiveness and accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only embodiments of the present application, and that those skilled in the art can obtain other drawings from the provided drawings without creative effort.
Fig. 1 is a schematic view of a scenario provided in an embodiment of the present application;
fig. 2 is a schematic view of another scenario provided in an embodiment of the present application;
FIG. 3 is a flowchart illustrating a method for obtaining a user representation according to an embodiment of the present disclosure;
FIG. 4 is a model architecture of a difference learning model based on the triplet loss algorithm according to an embodiment of the present application;
fig. 5 is a schematic diagram of a training process of a difference learning model according to an embodiment of the present application;
fig. 6 is a schematic flowchart of an audio sample acquisition audio frequency spectrogram according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a frequency spectrum after audio decomposition according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of an apparatus for obtaining a user representation according to an embodiment of the present disclosure;
FIG. 9 is a schematic structural diagram of a streaming media object recommendation apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. It is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of them. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application. In the present application, the embodiments and the features of the embodiments may be combined with each other arbitrarily provided there is no conflict. Also, although a logical order is shown in the flow diagrams, in some cases the steps shown or described may be performed in an order different from the one described here.
For the convenience of understanding the technical solutions provided by the embodiments of the present application, some key terms used in the embodiments of the present application are explained first:
Streaming media object: may include objects transmitted using streaming media technology, for example audio, video or images.
User behavior sequence: taking the streaming media object as the audio, the sequence of user behavior may be a plurality of pieces of music played by the user in history, for example, the sequence of music played by the user in history is S1, S2, S3 and S4 in sequence, and then the sequence of user behavior of the user may be { S1, S2, S3, S4 }.
User portrait: in short, a vector used to represent a user or a user's features. In scenarios involving streaming media objects such as music or video, streaming media objects generally need to be recommended to users. Taking music as an example, accurate music recommendation usually relies on a representation of the user, that is, a user portrait; the user's portrait in terms of music can express the user's music preferences and thereby assist personalized music recommendation. User portraits are also needed in scenarios such as user group division and analysis, and an accurate user portrait is a prerequisite for accurate recommendation and user analysis.
At present, a user portrait can be obtained by direct modeling, that is, by processing, extracting and representing the user's basic information and historical data. However, this method can only represent the basic information of the users themselves and ignores the differences between users, so the information carried by the resulting user representation vector is not comprehensive, and subsequent recommendations based on it cannot be made accurately.
Considering that different users' preferences for streaming media objects such as music differ, these preferences can substantially reflect the differences between users; based on the differences between users, personalized recommendation can also be performed more accurately in scenarios such as music recommendation.
In view of this, an embodiment of the present application provides a method for obtaining a user portrait, in which a training sample set may be constructed from the user behavior sequences of a target user and of non-target users, and a difference learning model is trained iteratively. Each training sample comprises a first object set from the target user's behavior sequence and a second object set from a non-target user's behavior sequence, which can characterize the preference information of the target user and of the non-target user, respectively. During each round of training, a loss value representing the degree of difference between the target user and a non-target user is obtained from the representation vector of the target user and the representation vectors of the streaming media objects included in each training sample, and the target user's representation vector is then updated according to these loss values. Through this training process the differences between individual users are learned, so the finally updated representation vector of the target user can serve as the target user's representation in terms of streaming media, making the content of the user portrait richer and improving its comprehensiveness and accuracy.
In addition, the embodiment of the present application uses the idea of triplet loss to maximize the difference between the first object set of the target user and the second object set of a non-target user, so that the target user and the non-target user become easier to distinguish; that is, the difference between the target user and the non-target user is learned, and the learned representation vector can be used to depict the target user's portrait.
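For reference, the standard form of triplet loss this passage alludes to can be sketched as follows; the Euclidean distance and the margin value are conventional choices, not taken from the application.

```python
import numpy as np

# Standard triplet loss for reference; distance metric and margin are
# conventional assumptions, not the application's prescribed formula.
def triplet_loss(anchor, positive, negative, margin=1.0):
    """Encourage the anchor to sit closer to the positive sample than to
    the negative sample by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])  # close to the anchor
n = np.array([2.0, 0.0])  # far from the anchor
print(triplet_loss(a, p, n))  # loss is zero: the margin constraint is met
```

Minimizing this quantity pushes apart what should differ and pulls together what should match, which is the distinguishing effect described above.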
After the idea of the embodiment of the present application is introduced, a brief description will be given below of the technology related to the embodiment of the present application.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. The basic artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing and machine learning/deep learning.
Key technologies of Speech Technology are Automatic Speech Recognition (ASR), speech synthesis (Text-To-Speech, TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, and speech is expected to become one of the most important human-computer interaction modes.
Machine Learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other fields. It specializes in studying how computers simulate or implement human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from demonstration.
The scheme provided by the embodiment of the application mainly relates to voice processing technology and machine learning/deep learning technology belonging to the field of artificial intelligence, and is specifically explained through the subsequent embodiments.
In the following, some simple descriptions are given to application scenarios to which the technical solution of the embodiment of the present application can be applied, and it should be noted that the application scenarios described below are only used for describing the embodiment of the present application and are not limited. In the specific implementation process, the technical scheme provided by the embodiment of the application can be flexibly applied according to actual needs.
The scheme provided by the embodiment of the application can be suitable for most scenes needing to obtain the user portrait in the streaming media object related scenes and user analysis or personalized recommendation scenes.
Please refer to fig. 1, which is a schematic diagram of a scenario in which the embodiment of the present application can be applied, where the scenario includes a server 101 and a plurality of terminals 102.
The server 101 may include, among other things, one or more processors 1011, memory 1012, and an I/O interface 1013 for interacting with the terminals. The server 101 may further configure a database 1014, and the database 1014 may be configured to store the learned expression vectors of the users, the trained model parameters, and the like. The server 101 may be, for example, a background server of a streaming media object playing application such as audio or video.
The server 101 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform, but is not limited thereto.
The memory 1012 of the server 101 may store program instructions of the method for obtaining a user representation according to the embodiment of the present application, and when the program instructions are executed by the processor 1011, the program instructions may be used to implement the steps of the method for obtaining a user representation according to the embodiment of the present application, so as to obtain a representation of a user about a streaming media object. In addition, the memory 1012 of the server 101 may further store program instructions of the streaming media object recommendation method provided in the embodiment of the present application, and when the program instructions are executed by the processor 1011, the program instructions can be used to implement the steps of the streaming media object recommendation method provided in the embodiment of the present application, and a recommendation of a streaming media object personalized for a user can be obtained according to the above method by obtaining a representation vector of the user.
A server 101 and a plurality of terminals 102 may be communicatively coupled directly or indirectly via one or more networks 103. The network 103 may be a wired network or a Wireless network, for example, the Wireless network may be a mobile cellular network, or may be a Wireless-Fidelity (WIFI) network, and of course, may also be other possible networks, which is not limited in this embodiment of the present application.
Referring to fig. 2, another schematic view of a scenario provided in the embodiment of the present application may include a streaming media object server 201, a database 202, a computing cluster 203, a vector storage server 204, and a user terminal 205.
The streaming media object server 201 may be a server providing the streaming media object, for example, an application server which may be an audio application or a video application, or a website server of an audio website or a video website.
The database 202 may be configured to store user behavior data generated when a user performs an operation on a streaming media object and streaming media data of the streaming media object, and when a user portrait is obtained, corresponding user behavior data and streaming media data of the corresponding streaming media object may be obtained from the database 202 according to actual requirements. The database 202 may adopt a Structured Query Language (SQL) database, a Hadoop Distributed File System (HDFS/HIVE) based on HIVE, or a Key Value (Key-Value, KV) non-relational data storage scheme.
The compute cluster 203 may be used to provide computing resources for the retrieval process of the user's representation vectors.
The vector storage server 204 may be configured to store the obtained representation vectors of the users, and may provide functions such as vector query and matching to the streaming media object server 201. When similar user matching is required, the streaming media object server 201 may provide the vector storage server 204 with the user to be matched, and the vector storage server 204 may perform vector matching according to the expression vector of the user and return the matched similar user to the streaming media object server 201.
Or, when a streaming media object needs to be recommended to a user, the streaming media object server 201 may provide the vector storage server 204 with the user to be recommended, and the vector storage server 204 may perform vector matching according to the representation vector of the user and return the matched streaming media object to the streaming media object server 201. The user terminal 205 may install an application corresponding to the streaming media object server 201, and may view the recommended streaming media object in the application.
In practical applications, the streaming media object server 201, the database 202, the computing cluster 203, and the vector storage server 204 may be different devices, or some or all of the devices may be implemented by the same device, for example, the streaming media object server 201 and the vector storage server 204 may be the same device.
In addition, the technical scheme of the embodiment of the application can also be applied to vehicle-mounted scenes, such as user representation and music or video recommendation in a vehicle-mounted music or video platform.
Of course, the method provided in the embodiment of the present application is not limited to the application scenarios shown in fig. 1 or fig. 2 and may also be used in other possible application scenarios, which is not limited in this embodiment of the present application. Referring to fig. 3, a flowchart of a method for obtaining a user portrait according to an embodiment of the present application is shown; the method may be executed by the server 101 in fig. 1 or the computing cluster 203 shown in fig. 2, and the flow of the method is described as follows.
Step 301: and acquiring a user behavior sequence of a target user and a plurality of non-target users.
When obtaining the representation vector of a user, in order to learn the differences between that user and all other users, the user behavior sequences of all users in the system may be obtained first. Of course, for some specific scenarios only part of the users need to be involved, in which case only the user behavior sequences of those users need to be obtained. One user behavior sequence includes a plurality of streaming media objects operated by one user.
Since the portrait acquisition process is similar for each user, one user is described here as an example. The user whose portrait is to be acquired is referred to as the target user; during the portrait acquisition process for the target user, the other users are, relative to it, non-target users.
The operation of the user on the streaming media object can reflect the preference of the user, so that the operation behavior data of the user can be obtained, and the representation vector of the streaming media object is obtained based on the operation behavior data of the user.
Specifically, after the operation behavior data of each user is obtained, the user behavior sequence of each user may be constructed from it. For music, for example, the operation behavior may be a playing operation, so the historical playing data of each user may be obtained and sorted into play order. For instance, if user A historically played music in the order S1, S2, S3, S4, the user behavior sequence of user A may be {S1, S2, S3, S4}; if user B historically played music in the order S2, S5, S6, the user behavior sequence of user B may be {S2, S5, S6}.
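The sequence construction described above can be sketched in a few lines; the function name and the (user_id, object_id) log format here are illustrative assumptions, not part of the embodiment:

```python
from collections import defaultdict

def build_behavior_sequences(play_log):
    """Group a chronologically ordered play log into per-user behavior
    sequences, preserving play order (hypothetical helper)."""
    sequences = defaultdict(list)
    for user_id, object_id in play_log:
        sequences[user_id].append(object_id)
    return dict(sequences)

# The example from the text: user A played S1..S4, user B played S2, S5, S6.
log = [("A", "S1"), ("B", "S2"), ("A", "S2"), ("B", "S5"),
       ("A", "S3"), ("B", "S6"), ("A", "S4")]
# build_behavior_sequences(log) -> {"A": ["S1", "S2", "S3", "S4"],
#                                   "B": ["S2", "S5", "S6"]}
```
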
Step 302: and constructing a training sample set according to each user behavior sequence.
In the embodiment of the application, after the user behavior sequence of each user is obtained, the streaming media objects included in the user behavior sequence of each user can be obtained, and a training sample set is constructed according to the streaming media objects and the user behavior sequence.
The training sample comprises a first object set formed by at least two streaming media objects in a user behavior sequence of a target user and a second object set formed by at least two streaming media objects in a user behavior sequence of a non-target user, and an intersection exists between the first object set and the second object set.
In this embodiment of the present application, since consecutive object operations better reflect the user's preference, the first object set may be composed of at least two consecutive streaming media objects in the user behavior sequence of the target user, and the second object set may be composed of at least two consecutive streaming media objects in the user behavior sequence of a non-target user. Taking music as an example, the first object set is several pieces of music played consecutively by user u, and the second object set is several pieces of music played consecutively by another user different from user u.
In one possible implementation, the first set of objects may include first streaming media object S1 and second streaming media object S2 located in a sequence of user behaviors of the target user, and the second set of objects includes second streaming media object S2 and third streaming media object S3 located in a sequence of user behaviors of the non-target user. Taking music as an example, S1 and S2 are adjacent music of the song listening sequence of the user u, and S2 and S3 are adjacent music of the song listening sequence of other users different from the user u.
Illustratively, for the sequence of user behaviors of the target user { S1, S2, S3, S4}, and the sequence of user behaviors of a non-target user { S2, S5, S6}, then one possible training sample is { S2, S3, S5}, where { S2, S3} is a first set of objects and { S2, S5} is a second set of objects.
Or, according to another training sample construction mode, aiming at a user behavior sequence { S1, S2, S3, S4, S5} of a target user and a user behavior sequence { S2, S3, S4, S5, S6} of a non-target user, then a possible training sample is { S2, S3, S4, S5}, wherein { S2, S3, S4} is a first object set, and { S3, S4, S5} is a second object set.
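The first construction pattern described above (a pair {S1, S2} adjacent in the target user's sequence and a pair {S2, S3} adjacent in a non-target user's sequence, intersecting in S2) can be sketched as an enumeration; the function name and tuple format are assumptions for illustration:

```python
def build_training_samples(target_seq, non_target_seq):
    """Enumerate samples (s1, s2, s3): (s1, s2) adjacent in the target user's
    sequence, (s2, s3) adjacent in the non-target user's sequence, so the
    first and second object sets intersect in s2 (hypothetical helper)."""
    target_pairs = zip(target_seq, target_seq[1:])
    non_target_pairs = list(zip(non_target_seq, non_target_seq[1:]))
    samples = []
    for s1, s2 in target_pairs:
        for a, s3 in non_target_pairs:
            if a == s2 and s3 != s1:
                samples.append((s1, s2, s3))
    return samples

# Target sequence {S1, S2, S3, S4} and non-target sequence {S2, S5, S6}
# share S2, giving the sample (S1, S2, S5), i.e. first object set {S1, S2}
# and second object set {S2, S5}.
```
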
Step 303: and performing iterative training on the difference learning model according to the training sample set.
In this embodiment of the present application, the initial user representation vector of each user may be initialized by random assignment, by an initialization algorithm, or by feature extraction from the user's basic information, where the basic information may include a nickname, age, region, and the user's historical data of operating streaming media objects.
And performing iterative training on the difference learning model through the training sample set and the initial user representation vector of the target user until the difference learning model reaches a convergence condition. The iterative training includes a plurality of training processes, and the data processing of each training process is similar, so that the following description mainly takes one training process as an example.
Specifically, during each training, a plurality of loss values may be obtained according to the user representation vector of the target user and the first object set and the second object set of each training sample, where one loss value is used to represent the degree of difference between the target user and the corresponding non-target user, and the user representation vector is updated according to the plurality of loss values.
In this embodiment, any algorithm capable of calculating a loss value based on two object sets having an intersection may be applied. One possible algorithm is the triplet loss algorithm: the intersection of the first object set and the second object set, the difference between the first object set and the intersection, and the difference between the second object set and the intersection in each training sample form a triplet, and the triplet loss value corresponding to each training sample may then be obtained from the user representation vector of the target user and the triplet corresponding to that training sample. Each member of the triplet may be represented by the object representation vectors of the streaming media objects it contains; for example, the member corresponding to the intersection of the first object set and the second object set may be represented by the object representation vectors of the streaming media objects in the intersection.
The triple loss algorithm will be described below by taking an example that the first object set includes the first streaming media object S1 and the second streaming media object S2, and the second object set includes the second streaming media object S2 and the third streaming media object S3.
Referring to fig. 4, a model architecture of the difference learning model based on the triplet loss algorithm is shown.
Specifically, using the idea of triplet loss, the three streaming media objects S1, S2 and S3 shown from left to right in fig. 4 can be separated with a maximum margin, wherein S1 and S2 are adjacent objects in the user behavior sequence of the target user u, and S2 and S3 are adjacent objects in the user behavior sequence of a non-target user, i.e. of another user different from the target user u.
Fig. 5 is a schematic diagram of a training process of the difference learning model.
Step 501: and respectively predicting a first probability that the user behavior sequence of the target user comprises the first object set and a second probability that the user behavior sequence of the target user comprises the second object set according to the user feature vector of the target user aiming at each training sample.
In the embodiment of the application, the purpose of the difference learning is to learn the differences between users, and these differences are reflected in user behavior: for example, user A is more likely to play certain songs consecutively, while user B is more likely to play other songs consecutively, which reflects the difference between them. Therefore, for a given target user, a first probability that the first object set is located in the user behavior sequence of the target user and a second probability that the second object set is located in the user behavior sequence of the target user can be predicted according to the user feature vector of the target user. In theory, because the first object set is extracted from the user behavior sequence of the target user, the first probability is expected to be as large as possible, while the second object set is extracted from the user behavior sequence of a non-target user, so the second probability is expected to be as small as possible, thereby highlighting the difference between the target user and the non-target user to the maximum extent.
As shown in fig. 4, the feature vector sequence of S1 is vector-encoded by the vector encoding (encoder) layer of the disparity learning model to obtain the object feature vector h1 of S1, the feature vector sequence of S2 is vector-encoded by the vector encoding layer to obtain the object feature vector h2 of S2, and so on. The obtaining of the object feature vector of each streaming media object will be described in detail later, and will not be described in detail herein.
Triple loss is defined as follows:
J = max(0, f(h1, h2) - f(h2, h3) - δ)
wherein J is the loss value of a training sample {S1, S2, S3}; f is a mapping function that maps several vectors to one score value: f(h1, h2) maps the object feature vectors h1 and h2 corresponding to S1 and S2 to one score value, and f(h2, h3) maps the object feature vectors h2 and h3 corresponding to S2 and S3 to another score value. The score values may represent, for example, similarity, distance, or the probability that a behavior belongs to a certain user. δ is a small positive number, also called the margin (interval).
Since S1 and S2 are adjacent objects in the user behavior sequence of the target user u, {S1, S2} may be regarded, with respect to the target user u, as a positive sample that characterizes the behavior of user u; since S2 and S3 are adjacent objects in the user behavior sequence of a non-target user, {S2, S3} may be regarded, with respect to the target user u, as a negative sample. The purpose of J is to make the difference between the score value of the positive sample {S1, S2} and the score value of the negative sample {S2, S3} larger than δ, so as to maximize the difference between positive and negative samples. Since {S1, S2} comes from the target user u and {S2, S3} comes from other users, maximizing the difference between positive and negative samples is essentially equivalent to learning the difference between the target user u and other users, so that a representation vector capable of characterizing this difference can be learned and then used to depict the target user u.
In a specific application, in order to obtain a representation of a user related to a streaming media object, an initial user representation vector hu of an initialized target user u may be input into a triplet loss together with h1 to h3, so that the above formula may be transformed into the following formula:
J = max(0, f(h1, h2, hu) - f(h2, h3, hu) - δ)
wherein f(h1, h2, hu) is the first probability that the user behavior sequence of the target user includes the first object set, and f(h2, h3, hu) is the second probability that the user behavior sequence of the target user includes the second object set.
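The formula can be sketched directly; since the embodiment does not fix the mapping function f, a simple additive dot-product score is assumed here purely for illustration (in practice f could be any learned scoring network):

```python
def score(h_a, h_b, h_u):
    """Assumed mapping f: projects the object-vector pair (h_a, h_b) onto
    the user vector h_u via a dot product (illustrative assumption)."""
    return sum((a + b) * u for a, b, u in zip(h_a, h_b, h_u))

def triplet_loss(h1, h2, h3, h_u, delta=0.1):
    """J = max(0, f(h1, h2, hu) - f(h2, h3, hu) - delta), as in the text."""
    return max(0.0, score(h1, h2, h_u) - score(h2, h3, h_u) - delta)
```

For example, with h1 = [1, 0], h2 = [0, 1], h3 = [-1, 0] and hu = [1, 1], the positive pair scores 2, the negative pair scores 0, and J = 1.9 with δ = 0.1.
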
In the embodiment of the present application, after hu is introduced, the model can know that this triplet loss is specific to the target user u. For example, consider two users u1 and u2: even if some of their samples are identical (say, both played S1 and S2 consecutively but did not play S2 and S3 consecutively), the two users will always differ in other samples, so with hu introduced the model can learn the difference between u1 and u2.
Specifically, when obtaining the first probability and the second probability, a first association degree of the first streaming media object and the second streaming media object with respect to the user feature vector of the target user may be obtained, along with a second association degree of the second streaming media object and the third streaming media object with respect to that user feature vector; the first probability of the first object set is then obtained according to the first association degree, and the second probability of the second object set according to the second association degree, where the association degree is positively correlated with the probability value. That is, the association degree between the first and second streaming media objects, and between the second and third streaming media objects, is considered from the perspective of the target user: in general, the higher the association degree with respect to the target user, the higher the probability that the set of streaming media objects is located in that user's behavior sequence.
Step 502: and obtaining a loss value of each training sample according to the first probability and the second probability, wherein the loss value is in positive correlation with the first probability and in negative correlation with the second probability.
In the embodiment of the present application, after a first probability corresponding to a first object set and a second probability corresponding to a second object set are obtained, a loss value of each training sample may be obtained according to the first probability and the second probability, where the loss value is in positive correlation with the first probability and in negative correlation with the second probability.
Specifically, the loss value may be calculated from f (h1, h2, hu) and f (h2, h3, hu) by the above formula, and the loss value is related to the difference between the first probability and the second probability.
Step 503: it is determined whether a convergence condition is satisfied.
When the determination result in step 503 is yes, that is, the model has satisfied the convergence condition, the training is ended.
Step 504: when the determination result in step 503 is no, the user feature vector of the target user is updated, and the process jumps to step 501.
Through the above process, the loss values corresponding to the training samples can be obtained, and whether the difference learning model has converged can then be judged from these loss values: if the convergence condition is met, the iterative training ends; if not, the user feature vector of the target user is updated and the next training round begins. The user feature vector of the target user can be regarded as a model parameter of the difference learning model that is gradually updated along with the training process; of course, the model may also include other parameters, and the user feature vector of the target user is gradually optimized during training.
Specifically, the convergence condition may include one or more of the following conditions:
(1) the loss value of each training sample is not less than a set first threshold value.
(2) The sum of the loss values of all the training samples is not less than a set second threshold value, and the second threshold value is greater than the first threshold value.
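The two conditions can be checked in a few lines; the text allows either or both, so this sketch, under that assumption, requires both:

```python
def has_converged(losses, first_threshold, second_threshold):
    """Convergence check per the two conditions above: every per-sample loss
    reaches the first threshold and the summed loss reaches the second,
    with second_threshold > first_threshold (hypothetical helper)."""
    assert second_threshold > first_threshold
    return (all(l >= first_threshold for l in losses)
            and sum(losses) >= second_threshold)
```
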
Please continue with fig. 3.
Step 304: and when the difference learning model converges, obtaining the user portrait of the target user according to the user representation vector.
When the difference learning model converges, that is, when the difference between the target user and the non-target users is maximized, the model knows that the triplet loss is specific to the target user u. Finally, the user representation vector of the target user at model convergence can be used as the representation of the target user with respect to streaming media objects and added to the portrait of user u.
Next, an acquisition process of the object feature vectors of the streaming media objects is introduced, and since the acquisition processes of the object feature vectors of the streaming media objects are similar, the following specifically introduces one streaming media object as an example.
In specific application, the streaming media data of each streaming media object can be acquired, and the feature vector sequence of each streaming media object can then be obtained from its streaming media data; this feature vector sequence expresses the information of the streaming media data itself. Vector coding is then performed on the feature vector sequence of each streaming media object to obtain its object representation vector.
When the streaming media object is a video, the corresponding feature vector sequence can be obtained according to the video stream data of each video.
Specifically, for a video, the video stream data can be divided in time into a plurality of video segments, and feature extraction is then performed on each segment, with one feature vector obtained per segment, yielding a feature vector sequence composed of the feature vectors of the video's segments. For feature extraction, a Convolutional Neural Network (CNN) may be used on each video segment; alternatively, a Long Short-Term Memory (LSTM) network may be used to extract the time-sequence information between video segments while performing feature extraction, so that the extracted features are richer.
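The segmentation step alone can be sketched as follows; mean pooling stands in for the learned CNN/LSTM extractor, and all names are illustrative assumptions:

```python
def video_feature_sequence(frame_features, segment_len):
    """Split per-frame feature vectors into fixed-length segments and pool
    each segment into one feature vector; mean pooling here is only a
    stand-in for a learned CNN/LSTM feature extractor."""
    segments = [frame_features[i:i + segment_len]
                for i in range(0, len(frame_features), segment_len)]
    def mean_pool(segment):
        dims = len(segment[0])
        return [sum(f[d] for f in segment) / len(segment) for d in range(dims)]
    return [mean_pool(s) for s in segments]

# Four frames with 2-d features, segments of two frames each:
# video_feature_sequence([[1, 2], [3, 4], [5, 6], [7, 8]], 2)
# -> [[2.0, 3.0], [6.0, 7.0]]
```
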
When the streaming media object is an audio, sampling may be performed according to audio data of each audio when a feature vector sequence of the audio is acquired, so as to obtain an audio spectrogram corresponding to each audio. The audio frequency spectrogram comprises a plurality of time slice frequency sequences of each audio frequency in continuous time, and the time slice frequency sequence corresponds to one feature vector in the feature vector sequence.
Fig. 6 is a schematic flow chart of obtaining an audio spectrogram for an audio sample.
S601: and performing time domain sampling on the audio data of each audio according to a set time interval to obtain a plurality of time sequences of each audio in the time domain.
Generally speaking, an audio signal has two dimensions, the time domain and the frequency domain, and can be expressed as a time sequence or a frequency sequence. The audio signal is therefore sampled in the time dimension, for example every 0.1 s, yielding a discrete time series T1 to Tn in which each value represents the amplitude of the audio at that sampling point.
Of course, in actual application, the frequency domain may also be sampled first, that is, the frequency domain and the time domain are exchanged in actual operation, which is not limited in this embodiment of the present application.
S602: and combining the plurality of time sequences according to the set time segment length to obtain a plurality of time sequence combinations.
After the discrete time series T1 to Tn is obtained, the values are combined according to the set time segment length. For example, with a time segment length of 3 s and a sampling interval of 0.1 s, each group of sequences contains 3 s / 0.1 s = 30 values: T1 to T30 form a group named G1, T31 to T60 form G2, and so on, finally yielding a plurality of time series combinations G1 to Gm.
S603: and aiming at each time sequence combination, performing time-frequency conversion on each time sequence combination, and sampling frequency domain signals according to a set frequency interval to obtain a frequency sequence corresponding to each time sequence combination.
For each time series combination Gi, time-frequency conversion is performed; for example, Fast Fourier Transform (FFT), Mel-Frequency Cepstrum Coefficient (MFCC), or Discrete Fourier Transform (DFT) algorithms may be adopted to obtain the frequency signal of each time series combination Gi, where one frequency signal represents the distribution of the different frequencies contained in a group of time series.
Furthermore, the frequency signal of each time series combination Gi is sampled according to a set frequency interval, for example 10 Hz, to obtain a discrete frequency series. Assuming the frequency ranges from 0 to f, each frequency series contains f/10 values, and each Gi can be represented as such a frequency series. The difference is that different Gi take different values at the same frequency: some Gi have large low-frequency values and others have large high-frequency values, corresponding respectively to bass-heavy and treble-heavy parts of the audio. Assuming n Gi and m frequencies, an m×n matrix is obtained, which is the spectrogram. Referring to fig. 7, a schematic diagram of the frequency spectrum after audio decomposition is shown: the horizontal axis is the time axis, the interval of the divided time segments is about 1.75 s (that is, the time segment length of each Gi is 1.75 s), the vertical axis is the frequency axis representing the frequencies corresponding to each time segment, the upper and lower frequency limits are 110 Hz to 3520 Hz, and the gray scale values represent the values corresponding to different frequencies.
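Steps S601 to S603 can be sketched end to end; a naive DFT stands in for the FFT/MFCC/DFT step, and the function and parameter names are assumptions:

```python
import cmath
import math

def spectrogram(signal, sample_rate, segment_seconds):
    """S601: `signal` is assumed already time-sampled at `sample_rate` Hz.
    S602: group the samples into segments of `segment_seconds` each.
    S603: convert each segment to the frequency domain (naive DFT magnitude
    here, standing in for FFT/MFCC/DFT), one frequency sequence per Gi."""
    seg_len = int(segment_seconds * sample_rate)
    groups = [signal[i:i + seg_len]
              for i in range(0, len(signal) - seg_len + 1, seg_len)]
    def dft_magnitudes(g):
        n = len(g)
        return [abs(sum(g[t] * cmath.exp(-2j * math.pi * k * t / n)
                        for t in range(n)))
                for k in range(n // 2)]
    return [dft_magnitudes(g) for g in groups]

# 6 s of a 2 Hz tone sampled every 0.1 s (10 Hz), 3 s segments -> 2 groups,
# each a 15-bin frequency sequence; bin k corresponds to k * 10 / 30 Hz.
sig = [math.sin(2 * math.pi * 2 * t / 10) for t in range(60)]
spec = spectrogram(sig, sample_rate=10, segment_seconds=3.0)
```

The list of per-segment frequency sequences is the transpose of the m×n matrix described in the text; stacking it column-wise gives the spectrogram.
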
In summary, a portrait related to the streaming media object is generated for each user through the streaming media object and the user behavior, and the obtained portrait is used as a part of the user basic portrait, so that the user preference is more finely described.
In the embodiment of the application, the representing vectors of the users can be obtained through the above process, and the representing vectors of the users can be used in the process of recommending the streaming media object.
Specifically, when a streaming media object needs to be recommended to a user, vector matching may be performed between the representation vector of the target user and the object representation vectors of the streaming media objects for a target user to be recommended, and then the streaming media object whose matching degree is greater than a set matching degree threshold is recommended to the target user.
Alternatively, the streaming media object recommendation may also be made based on similar users. Specifically, the user representation vector of the target user and the user representation vectors of other users may be subjected to similarity matching, a similar user with a similarity greater than a set similarity threshold is determined, and then the streaming media object in the user behavior sequence of the similar user is recommended to the target user.
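Both recommendation routes rely on vector matching. A minimal sketch of the first route follows, assuming cosine similarity as the (unspecified) matching-degree function; all names are illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend(user_vec, object_vecs, threshold):
    """Return ids of streaming media objects whose object representation
    vector matches the user representation vector above `threshold`."""
    return [oid for oid, vec in object_vecs.items()
            if cosine(user_vec, vec) > threshold]

# A user vector close to S1's object vector and far from S2's:
# recommend([1.0, 0.1], {"S1": [1.0, 0.0], "S2": [0.0, 1.0]}, 0.5) -> ["S1"]
```

The similar-user route is the same computation with other users' representation vectors in place of the object vectors.
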
Referring to fig. 8, based on the same inventive concept, the embodiment of the present application further provides an apparatus 80 for obtaining a user portrait, the apparatus including:
a user sequence acquiring unit 801, configured to acquire a user behavior sequence of a target user and a plurality of non-target users; wherein, a user behavior sequence comprises a plurality of streaming media objects operated by a user;
a training sample construction unit 802, configured to construct a training sample set according to each user behavior sequence; the training sample comprises a first object set consisting of at least two streaming media objects in a user behavior sequence of a target user and a second object set consisting of at least two streaming media objects in a user behavior sequence of a non-target user, and the first object set and the second object set have an intersection;
a training unit 803, configured to perform iterative training on the difference learning model according to the training sample set; during each training, a plurality of loss values are correspondingly obtained according to the user representation vector of the target user and the first object set and the second object set of each training sample, one loss value is used for representing the difference degree between the target user and the corresponding non-target user, and the user representation vector is updated according to the plurality of loss values;
a representation unit 804, configured to obtain a user portrait of the target user from the user representation vector when convergence of the difference learning model is determined from the plurality of loss values.
Optionally, the first object set is composed of at least two continuous streaming media objects in the user behavior sequence of the target user, and the second object set is composed of at least two continuous streaming media objects in the user behavior sequence of the non-target user.
Optionally, an intersection of the first object set and the second object set, a difference between the first object set and the intersection, and a difference between the second object set and the intersection in each training sample constitute a triple;
the training unit 803 is specifically configured to:
and obtaining a triplet loss value corresponding to each training sample according to the user representation vector of the target user and the triplet corresponding to each training sample.
Optionally, the training unit 803 is specifically configured to:
aiming at each training sample, respectively obtaining a first probability that a user behavior sequence of a target user comprises a first object set and a second probability that the user behavior sequence comprises a second object set according to a user representation vector and an object representation vector of a streaming media object in each training sample;
and obtaining a loss value of each training sample according to the first probability and the second probability, wherein the loss value is positively correlated with the first probability and negatively correlated with the second probability.
Optionally, the first set of objects includes a first streaming media object and a second streaming media object, and the second set of objects includes a second streaming media object and a third streaming media object;
the training unit 803 is specifically configured to:
acquiring a first association degree of the first streaming media object and the second streaming media object relative to the user representation vector, and acquiring a second association degree of the second streaming media object and the third streaming media object relative to the user representation vector;
obtaining a first probability according to the first relevance and obtaining a second probability according to the second relevance; wherein, the relevance value is positively correlated with the probability value.
Optionally, the convergence condition of the difference learning model includes:
the loss value of each training sample is not less than a set first threshold value; and/or
the sum of the loss values of all the training samples is not less than a set second threshold value, and the second threshold value is greater than the first threshold value.
Optionally, the apparatus further includes an object vector obtaining unit 805, configured to:
respectively acquiring a feature vector sequence of each streaming media object according to the streaming media data of each streaming media object;
and respectively carrying out vector coding according to the characteristic vector sequence of each streaming media object to obtain an object representation vector of each streaming media object.
Optionally, the object vector obtaining unit 805 is specifically configured to:
respectively sampling audio data of each audio to obtain an audio frequency spectrogram corresponding to each audio;
the audio frequency spectrogram comprises a plurality of time slice frequency sequences of each audio frequency in continuous time, wherein one time slice frequency sequence corresponds to one feature vector in the feature vector sequence.
Optionally, the object vector obtaining unit 805 is specifically configured to:
for each audio, performing time domain sampling on the audio data of each audio according to a set time interval to obtain a plurality of time sequences of each audio in a time domain;
combining the plurality of time sequences according to the set time segment length to obtain a plurality of time sequence combinations;
and for each time sequence combination, performing time-frequency conversion on each time sequence combination, and sampling frequency domain signals according to a set frequency interval to obtain a frequency sequence corresponding to each time sequence combination.
The apparatus may be configured to execute the methods of the embodiments shown in fig. 3 to fig. 7; therefore, for the functions that can be implemented by each functional module of the apparatus, reference may be made to the description of those embodiments, which is not repeated here.
Referring to fig. 9, based on the same inventive concept, an embodiment of the present application further provides a streaming media object recommendation apparatus 90, including:
a user vector obtaining unit 901, configured to obtain a user representation vector of a target user by the above method for acquiring a user portrait;
a recommending unit 902, configured to: recommend, to the target user, streaming media objects whose matching degree is greater than a set matching-degree threshold, according to the matching results between the user representation vector of the target user and the object representation vectors of the streaming media objects; or perform similarity matching between the user representation vector of the target user and the user representation vectors of other users, determine similar users whose similarity is greater than a set similarity threshold, and recommend the streaming media objects in the user behavior sequences of the similar users to the target user.
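The two recommendation routes above can be sketched as follows; the patent does not specify the matching function, so cosine similarity is assumed here, and all names are hypothetical:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity as an assumed "matching degree" / "similarity".
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def recommend_objects(user_vec, object_vecs, threshold):
    # Route 1: recommend objects whose matching degree with the user
    # representation vector exceeds the set matching-degree threshold.
    return [oid for oid, vec in object_vecs.items()
            if cosine(user_vec, vec) > threshold]

def similar_users(user_vec, other_user_vecs, threshold):
    # Route 2: find users whose representation vectors are similar to the
    # target user's; objects from their behavior sequences would then be
    # recommended to the target user.
    return [uid for uid, vec in other_user_vecs.items()
            if cosine(user_vec, vec) > threshold]
```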
The device may be configured to execute the steps of the streaming media object recommendation process; therefore, for the functions that can be implemented by each functional module of the device, reference may be made to the description of the embodiment of the streaming media object recommendation process, which is not repeated here.
Referring to fig. 10, based on the same technical concept, an embodiment of the present application further provides a computer apparatus 100, which may include a memory 1001 and a processor 1002.
The memory 1001 is used for storing the computer program executed by the processor 1002. The memory 1001 may mainly include a program storage area and a data storage area: the program storage area may store an operating system, an application program required for at least one function, and the like; the data storage area may store data created according to the use of the computer device, and the like. The processor 1002 may be a central processing unit (CPU), a digital processing unit, or the like. The specific connection medium between the memory 1001 and the processor 1002 is not limited in the embodiments of the present application. In fig. 10, the memory 1001 and the processor 1002 are connected through the bus 1003, which is represented by a thick line; this connection manner is merely illustrative, and the connections between other components are likewise not limited. The bus 1003 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 10, but this does not mean that there is only one bus or one type of bus.
The memory 1001 may be a volatile memory, such as a random-access memory (RAM); the memory 1001 may also be a non-volatile memory, such as, but not limited to, a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 1001 may also be a combination of the above memories.
A processor 1002, configured to execute the method executed by the apparatus in the embodiment shown in fig. 3 to fig. 7 or in the embodiment of the recommendation process of a streaming media object when calling the computer program stored in the memory 1001.
In some possible embodiments, various aspects of the methods provided by the present application may also be implemented in the form of a program product comprising program code. When the program product runs on a computer device, the program code causes the computer device to perform the steps of the methods according to the various exemplary embodiments of the present application described above in this specification; for example, the computer device may perform the methods performed by the devices in the embodiments shown in fig. 3 to fig. 7 or in the embodiment of the streaming media object recommendation process.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (12)

1. A method for acquiring a user portrait, the method comprising:
acquiring user behavior sequences of a target user and a plurality of non-target users; wherein, a user behavior sequence comprises a plurality of streaming media objects operated by a user;
constructing a training sample set according to each user behavior sequence; wherein, a training sample comprises a first object set composed of at least two streaming media objects in the user behavior sequence of the target user and a second object set composed of at least two streaming media objects in the user behavior sequence of the non-target user, and the first object set and the second object set have intersection;
performing iterative training on the difference learning model according to the training sample set; during each training, a plurality of loss values are correspondingly obtained according to the user representation vector of the target user and the first object set and the second object set of each training sample, one loss value is used for representing the difference degree between the target user and the corresponding non-target user, and the user representation vector is updated according to the plurality of loss values;
obtaining a user portrait of the target user according to the user representation vector when it is determined, according to the plurality of loss values, that the difference learning model converges.
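The flow of claim 1 can be sketched numerically. The patent leaves the loss form open; the sketch below assumes the loss is the difference between a mean dot-product score of the first object set and that of the second object set, and updates the user representation vector by gradient ascent on that difference (all names, the scoring function, and the update rule are illustrative assumptions):

```python
import numpy as np

def set_score(user_vec, object_set):
    # Mean dot-product of the user vector with the object representation
    # vectors in a set (an illustrative scoring choice).
    return float(np.mean([np.dot(user_vec, v) for v in object_set]))

def train_user_vector(user_vec, samples, lr=0.1, steps=50):
    """samples: list of (first_object_set, second_object_set) pairs, one per
    non-target user; each set is a list of object representation vectors."""
    u = np.array(user_vec, dtype=float)
    for _ in range(steps):
        for first_set, second_set in samples:
            # The "loss" here represents the difference degree between the
            # target user and this non-target user; gradient ascent pushes u
            # toward the target's own objects and away from the non-target
            # user's objects.
            grad = np.mean(first_set, axis=0) - np.mean(second_set, axis=0)
            u += lr * grad
    return u
```

Convergence would then be checked against the loss values, and the final vector u used to derive the user portrait.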
2. The method of claim 1, wherein the first set of objects consists of at least two streaming media objects that are contiguous in the sequence of user behavior of the target user, and wherein the second set of objects consists of at least two streaming media objects that are contiguous in the sequence of user behavior of the non-target user.
3. The method of claim 1, wherein an intersection of a first set of objects with a second set of objects, a difference between the first set of objects and the intersection, and a difference between the second set of objects and the intersection in each training sample constitute a triplet;
obtaining a plurality of loss values according to the user representation vector of the target user and the first object set and the second object set of each training sample, including:
and obtaining the triplet loss value corresponding to each training sample according to the user representation vector of the target user and the triplet corresponding to each training sample.
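The triplet of claim 3 (the intersection, and the two set differences against it) resembles a standard triplet loss with the shared objects acting as anchor. A hedged sketch, in which the way the user representation vector conditions the distances is an assumption, not the patent's formula:

```python
import numpy as np

def triplet_from_sets(first_set, second_set):
    # Intersection = anchor; the two set differences act as positive
    # (target-user side) and negative (non-target-user side) examples.
    inter = first_set & second_set
    return inter, first_set - inter, second_set - inter

def triplet_loss(user_vec, anchor_vec, pos_vec, neg_vec, margin=1.0):
    # Illustrative triplet-style loss conditioned on the user vector:
    # distances are measured after shifting the anchor by the user
    # representation (a hypothetical conditioning choice).
    d_pos = np.linalg.norm((anchor_vec + user_vec) - pos_vec)
    d_neg = np.linalg.norm((anchor_vec + user_vec) - neg_vec)
    return max(0.0, d_pos - d_neg + margin)
```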
4. The method of claim 1, wherein obtaining a plurality of loss values based on the user representation vector of the target user and the first set of objects and the second set of objects for each training sample comprises:
for each training sample, respectively obtaining a first probability that the user behavior sequence of the target user comprises the first object set and a second probability that the user behavior sequence of the target user comprises the second object set according to the user representation vector and an object representation vector of a streaming media object in each training sample;
and obtaining a loss value of each training sample according to the first probability and the second probability, wherein the loss value is positively correlated with the first probability and negatively correlated with the second probability.
5. The method of claim 4, wherein the first set of objects comprises a first streaming media object and a second streaming media object, the second set of objects comprises the second streaming media object and a third streaming media object;
then, for each training sample, respectively obtaining a first probability that the user behavior sequence of the target user includes the first object set and a second probability that the user behavior sequence includes the second object set according to the user representation vector and the object representation vector of the streaming media object in each training sample, including:
acquiring a first degree of association of the first streaming media object and the second streaming media object relative to the user representation vector, and acquiring a second degree of association of the second streaming media object and the third streaming media object relative to the user representation vector;
obtaining the first probability according to the first degree of association and obtaining the second probability according to the second degree of association; wherein the relevance value is positively correlated with the probability value.
6. The method of claim 1, wherein the convergence condition of the difference learning model comprises:
the loss value of each training sample is not less than a set first threshold value; and/or
the sum of the loss values of all the training samples is not less than a set second threshold value, and the second threshold value is greater than the first threshold value.
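A minimal sketch of the convergence test in claim 6. The claim's "and/or" leaves open whether one or both conditions are required; this sketch assumes both are enforced:

```python
def has_converged(losses, first_threshold, second_threshold):
    """Claim-6-style check: every per-sample loss reaches the first
    threshold AND the summed loss reaches the (larger) second threshold.
    Requiring both conditions is an assumption made for this sketch."""
    assert second_threshold > first_threshold
    return (all(l >= first_threshold for l in losses)
            and sum(losses) >= second_threshold)
```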
7. The method of claim 1, wherein prior to iteratively training a dissimilarity learning model from the set of training samples, the method further comprises:
respectively acquiring a feature vector sequence of each streaming media object according to the streaming media data of each streaming media object;
and respectively carrying out vector coding according to the characteristic vector sequence of each streaming media object to obtain an object representation vector of each streaming media object.
8. The method as claimed in claim 7, wherein if the target streaming media object is audio, the obtaining a feature vector sequence of each streaming media object according to the streaming media data of each streaming media object respectively comprises:
respectively sampling audio data of each audio to obtain an audio frequency spectrogram corresponding to each audio;
wherein the audio spectrogram comprises a plurality of time segment frequency sequences which are continuous in time for each audio, and a time segment frequency sequence corresponds to one feature vector in the feature vector sequence.
9. The method of claim 8, wherein separately sampling audio data of each audio to obtain an audio spectrogram corresponding to each audio comprises:
for each audio, performing time domain sampling on the audio data of each audio according to a set time interval to obtain a plurality of time sequences of each audio in a time domain;
combining the plurality of time sequences according to the set time slice length to obtain a plurality of time sequence combinations;
and for each time sequence combination, performing time-frequency conversion on each time sequence combination, and sampling the frequency domain signals according to a set frequency interval to obtain a frequency sequence corresponding to each time sequence combination.
10. An apparatus for acquiring a user portrait, the apparatus comprising:
the user sequence acquisition unit is used for acquiring user behavior sequences of a target user and a plurality of non-target users; wherein, a user behavior sequence comprises a plurality of streaming media objects operated by a user;
the training sample construction unit is used for constructing a training sample set according to each user behavior sequence; wherein, a training sample comprises a first object set composed of at least two streaming media objects in the user behavior sequence of the target user and a second object set composed of at least two streaming media objects in the user behavior sequence of the non-target user, and an intersection exists between the first object set and the second object set;
the training unit is used for carrying out iterative training on the difference learning model according to the training sample set; during each training, a plurality of loss values are correspondingly obtained according to the user representation vector of the target user and the first object set and the second object set of each training sample, one loss value is used for representing the difference degree between the target user and the corresponding non-target user, and the user representation vector is updated according to the plurality of loss values;
a determining unit, configured to obtain a user portrait of the target user according to the user representation vector when it is determined, according to the plurality of loss values, that the difference learning model converges.
11. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor,
the processor when executing the computer program realizes the steps of the method of any of claims 1 to 9.
12. A computer storage medium having computer program instructions stored thereon, wherein,
the computer program instructions, when executed by a processor, implement the steps of the method of any one of claims 1 to 9.
CN202110035190.0A 2021-01-12 2021-01-12 Method, device and equipment for acquiring user portrait and storage medium Pending CN114764470A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110035190.0A CN114764470A (en) 2021-01-12 2021-01-12 Method, device and equipment for acquiring user portrait and storage medium

Publications (1)

Publication Number Publication Date
CN114764470A true CN114764470A (en) 2022-07-19


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117829968A (en) * 2024-03-06 2024-04-05 南京数策信息科技有限公司 Service product recommendation method, device and system based on user data analysis
CN117829968B (en) * 2024-03-06 2024-05-31 南京数策信息科技有限公司 Service product recommendation method, device and system based on user data analysis


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination