CN107480289B - User attribute acquisition method and device - Google Patents


Info

Publication number
CN107480289B
CN107480289B (application CN201710738930.0A)
Authority
CN
China
Prior art keywords
training
image
text
obtaining
layer
Prior art date
Legal status
Active
Application number
CN201710738930.0A
Other languages
Chinese (zh)
Other versions
CN107480289A (en)
Inventor
杨阳
黄秀
杨子豪
沈复民
谢宁
申恒涛
Current Assignee
Chengdu Aohaichuan Technology Co ltd
Original Assignee
Chengdu Aohaichuan Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Chengdu Aohaichuan Technology Co ltd filed Critical Chengdu Aohaichuan Technology Co ltd
Priority to CN201710738930.0A priority Critical patent/CN107480289B/en
Publication of CN107480289A publication Critical patent/CN107480289A/en
Application granted granted Critical
Publication of CN107480289B publication Critical patent/CN107480289B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • Marketing (AREA)
  • Computational Linguistics (AREA)
  • Economics (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Computing Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Resources & Organizations (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • User Interface Of Digital Computer (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention provides a user attribute obtaining method and device, relating to the field of data processing. The method comprises: obtaining the texts and images in a user's microblogs; obtaining a text input matrix corresponding to the texts; obtaining an image input matrix corresponding to the images; obtaining a total input matrix based on the text input matrix and the image input matrix; and obtaining the topic distribution in the texts and images based on the total input matrix, and obtaining the attributes of the user based on that topic distribution. The method is efficient, accurate, and practical.

Description

User attribute acquisition method and device
Technical Field
The invention relates to the technical field of data processing, in particular to a user attribute acquisition method and device.
Background
At present, existing methods such as the Poisson Gamma Belief Network (PGBN) obtain user attributes by processing text content only and cannot be applied directly in a large-scale social-media environment; as a result, they are inefficient and inaccurate.
Disclosure of Invention
The present invention is directed to a user attribute obtaining method and device that address the above problems. To achieve this purpose, the invention adopts the following technical scheme:
in a first aspect, an embodiment of the present invention provides a method for obtaining a user attribute, where the method includes obtaining a text and an image in a microblog of a user; obtaining a text input matrix corresponding to the text; obtaining an image input matrix corresponding to the image; obtaining a total input matrix based on the text input matrix and the image input matrix; and obtaining the subject distribution condition in the text and the image based on the total input matrix, and obtaining the attribute of the user based on the subject distribution condition.
In a second aspect, an embodiment of the present invention provides a user attribute obtaining apparatus, where the apparatus includes a first obtaining unit, a second obtaining unit, a third obtaining unit, a fourth obtaining unit, and a fifth obtaining unit. The first acquisition unit is used for acquiring texts and images in the microblog of the user. And the second acquisition unit is used for acquiring a text input matrix corresponding to the text. And the third acquisition unit is used for acquiring an image input matrix corresponding to the image. And the fourth acquisition unit is used for acquiring a total input matrix based on the text input matrix and the image input matrix. And the fifth acquiring unit is used for acquiring the subject distribution condition in the text and the image based on the total input matrix and acquiring the attribute of the user based on the subject distribution condition.
According to the method and device for obtaining user attributes, the texts and images in a user's microblogs are obtained; a text input matrix corresponding to the texts and an image input matrix corresponding to the images are then obtained; a total input matrix is obtained based on the text input matrix and the image input matrix; and the topic distribution in the texts and images is obtained based on the total input matrix, from which the attributes of the user are obtained. The method is efficient, accurate, and practical.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
Fig. 1 is a block diagram of an electronic device according to an embodiment of the present invention;
fig. 2 is a flowchart of a user attribute obtaining method according to an embodiment of the present invention;
fig. 3 is a block diagram of a user attribute obtaining apparatus according to an embodiment of the present invention;
fig. 4 is a block diagram of another user attribute obtaining apparatus according to another embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present invention, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
Fig. 1 shows a block diagram of an electronic device 100 applicable to an embodiment of the present invention. As shown in FIG. 1, the electronic device 100 may include a memory 102, a memory controller 104, one or more processors 106 (only one is shown in FIG. 1), a peripherals interface 108, an input-output module 110, an audio module 112, a display module 114, a radio frequency module 116, and a user attribute acquisition apparatus.
The memory 102, the memory controller 104, the processor 106, the peripherals interface 108, the input-output module 110, the audio module 112, the display module 114, and the radio frequency module 116 are electrically connected to one another, directly or indirectly, to realize data transmission and interaction. For example, these components may be electrically connected through one or more communication or signal buses. The user attribute acquisition apparatus includes at least one software functional module that can be stored in the memory 102 in the form of software or firmware, for example as a software functional module or a computer program included in the user attribute acquisition apparatus.
The memory 102 may store various software programs and modules, such as program instructions/modules corresponding to the user attribute obtaining method and apparatus provided in the embodiments of the present application. The processor 106 executes various functional applications and data processing by running software programs and modules stored in the memory 102, that is, implements the user attribute acquisition method in the embodiment of the present application.
The memory 102 may include, but is not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
The processor 106 may be an integrated circuit chip having signal processing capabilities. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. It may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, and so on.
The peripherals interface 108 couples the various input/output devices to the processor 106 and the memory 102. In some embodiments, the peripherals interface 108, the processor 106, and the memory controller 104 may be implemented in a single chip. In other embodiments, they may each be implemented as a separate chip.
The input-output module 110 is used for receiving user input, enabling the user to interact with the electronic device 100. The input-output module 110 may be, but is not limited to, a mouse, a keyboard, and the like.
Audio module 112 provides an audio interface to a user that may include one or more microphones, one or more speakers, and audio circuitry.
The display module 114 provides an interactive interface (e.g., a user interface) between the electronic device 100 and a user, or displays image data for the user's reference. In this embodiment, the display module 114 may be a liquid crystal display or a touch display. A touch display may be a capacitive or resistive touch screen supporting single-point and multi-point touch operations, meaning that the touch display can sense touch operations at one or more locations simultaneously and send the sensed touch operations to the processor 106 for calculation and processing.
The rf module 116 is used for receiving and transmitting electromagnetic waves, and implementing interconversion between the electromagnetic waves and electrical signals, so as to communicate with a communication network or other devices.
It will be appreciated that the configuration shown in FIG. 1 is merely illustrative and that electronic device 100 may include more or fewer components than shown in FIG. 1 or have a different configuration than shown in FIG. 1. The components shown in fig. 1 may be implemented in hardware, software, or a combination thereof.
In the embodiment of the invention, the electronic device 100 may be a user terminal or a server. The user terminal may be a PC (personal computer), a tablet computer, a mobile phone, a notebook computer, a smart television, a set-top box, a vehicle-mounted terminal, or another terminal device.
Referring to fig. 2, an embodiment of the present invention provides a method for obtaining a user attribute, where the method includes: step S200, step S210, step S220, step S230, and step S240.
Step S200: and acquiring texts and images in the microblog of the user.
The texts and images in the user's microblogs can be acquired from Sina Weibo.
Step S210: and obtaining a text input matrix corresponding to the text.
Step S210 may further include: performing word segmentation on the text and counting word frequencies to obtain at least one segmented word and the frequency of each segmented word; and obtaining the text input matrix corresponding to the text based on the segmented words and their frequencies.
In this embodiment, Python is used to segment the text and build a vocabulary of all words, where each row of the vocabulary is one word and the row number is the word's index. Word frequencies (counts) are counted per microblog (document) to generate the text input matrix X_u corresponding to the text, where X_u(i, j) is the frequency with which the i-th word appears in the j-th text.
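The construction of the word-frequency matrix described above can be sketched as follows (a minimal illustration: whitespace tokenization stands in for the Chinese word segmenter, and the function and variable names are assumptions, not from the patent):

```python
from collections import Counter

import numpy as np

def build_text_matrix(texts):
    """Build a vocabulary and a word-by-text count matrix X_u.

    X_u[i, j] is the frequency with which the i-th vocabulary word
    appears in the j-th text.
    """
    # Tokenize each text; a real system would use a Chinese word
    # segmenter (e.g. jieba) here instead of str.split().
    tokenized = [t.split() for t in texts]
    vocab = sorted({w for doc in tokenized for w in doc})
    index = {w: i for i, w in enumerate(vocab)}  # row number = word index
    X_u = np.zeros((len(vocab), len(texts)), dtype=int)
    for j, doc in enumerate(tokenized):
        for w, c in Counter(doc).items():
            X_u[index[w], j] = c
    return vocab, X_u

vocab, X_u = build_text_matrix(["a b a", "b c"])
```

Each column of `X_u` corresponds to one microblog text, matching the definition of X_u(i, j) above.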
Step S220: and obtaining an image input matrix corresponding to the image.
Step S220 may further include: performing SIFT feature extraction on the image to obtain a first feature vector corresponding to the image, and obtaining the image input matrix corresponding to the image based on the first feature vector.
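Assuming the local SIFT descriptors of an image have already been extracted (e.g. with an off-the-shelf implementation such as OpenCV's), the image's input column can be formed by quantizing each descriptor against trained cluster centers and counting the visual words. A minimal numpy sketch (names are illustrative):

```python
import numpy as np

def quantize_descriptors(descriptors, centers):
    """Assign each local descriptor to its nearest cluster center and
    count how often each visual-word class occurs in the image.

    Returns a vector h with h[k] = number of descriptors in cluster k,
    i.e. one column of the image input matrix X_v.
    """
    # Squared Euclidean distance between every descriptor and center.
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)
    return np.bincount(nearest, minlength=len(centers))

centers = np.array([[0.0, 0.0], [10.0, 10.0]])
desc = np.array([[0.1, 0.2], [9.8, 9.9], [0.0, 0.1]])
h = quantize_descriptors(desc, centers)
```

Here `h` plays the role of the first feature vector's quantized form used to fill the image input matrix.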
Step S230: and obtaining a total input matrix based on the text input matrix and the image input matrix.
Step S230 may further include: splicing the text input matrix, the image input matrix, and a preset training-set input matrix to obtain the total input matrix.
The text input matrix, the image input matrix, and the preset training-set input matrix are spliced along the text/image-index dimension to obtain a spliced matrix, and the spliced matrix is summed block-wise along that dimension per user to obtain the total input matrix. In the total input matrix, each column represents one user and each row represents one word or image-feature class.
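The splicing and per-user summation described above can be sketched with numpy (a simplified illustration; the shapes and the `doc_owner` bookkeeping are assumptions, not from the patent):

```python
import numpy as np

# Rows: vocabulary words / feature classes; columns: individual
# texts or images. The training-set matrix is concatenated with the
# test user's matrix along the text/image (column) dimension.
X_train = np.array([[1, 0, 2], [0, 3, 1]])   # 3 training documents
X_user = np.array([[2, 1], [0, 1]])          # 2 documents of one user
X_cat = np.concatenate([X_train, X_user], axis=1)

# Sum the columns belonging to each user, so that in the total input
# matrix every column represents one user.
doc_owner = np.array([0, 0, 1, 2, 2])  # which user owns each column
n_users = doc_owner.max() + 1
X_total = np.zeros((X_cat.shape[0], n_users), dtype=int)
for j, u in enumerate(doc_owner):
    X_total[:, u] += X_cat[:, j]
```

After the block sum, the last column of `X_total` aggregates all of the test user's texts/images, as the patent's total input matrix requires.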
Further, before step S230, the method may further include: acquiring training texts and training images in a plurality of microblogs; obtaining a training text input matrix corresponding to the training text; obtaining a training image matrix corresponding to the training image; and obtaining the input matrix of the training set based on the input matrix of the training text and the training image matrix.
Further, SIFT feature extraction is performed on each training image to obtain a second feature vector corresponding to each training image;
the cluster center of each class and the image features contained in each class are obtained based on a preset clustering algorithm and the second feature vectors corresponding to the training images;
and the number of image features of each class contained in each training image is counted to obtain the training image matrix corresponding to the plurality of training images.
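The clustering step above can be sketched with a small plain-numpy K-means loop (a library implementation such as scikit-learn's KMeans would normally be used; this code is an illustration under that assumption, not the patent's implementation):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain K-means: returns cluster centers and the label of each
    sample, i.e. which visual-word class each SIFT descriptor joins."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign every descriptor to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # Move each center to the mean of its members.
        for c in range(k):
            if (labels == c).any():  # keep empty clusters unchanged
                centers[c] = X[labels == c].mean(axis=0)
    return centers, labels

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
centers, labels = kmeans(X, k=2)
```

Counting `labels` per training image then yields the training image matrix X_v described above.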
Further, after obtaining the training set input matrix based on the training text input matrix and the training image matrix, the method may further include:
setting the maximum theme number of the bottommost layer of the Poisson gamma belief network;
based on the input matrix of the training set, randomly distributing a theme for each training text and each training image, obtaining an initialization matrix and generating an initial value of each probability parameter;
and iteratively training the Poisson gamma belief network based on the initialization matrix and the initial values of all probability parameters to obtain the theme distribution condition in the training text and the training image.
Specifically, training texts and training images from more than 1000 microblogs are obtained from Sina Weibo and used as the training set. Python is used to segment the training texts and build a vocabulary of all words, where each row of the vocabulary is one word and its row number is the word's index value; word frequencies (counts) are counted per microblog (document) to generate the text-word-index training text input matrix X_w, where X_w(i, j) is the frequency with which the i-th word appears in the j-th text. SIFT feature extraction is applied to all training images in turn to obtain a set of second feature vectors; these vectors are concatenated and clustered into K classes with the preset clustering algorithm (which can be the K-means algorithm) to obtain the cluster center of each class and the image features contained in each class. The number of image features of each class contained in each training image is then counted to obtain the training image matrix X_v corresponding to the training images, where X_v(i, j) is the frequency with which the i-th feature class appears in the j-th image.
The maximum topic number K_0max of the bottom layer of the Poisson gamma belief network is set, which determines an upper limit on the number of topics extracted by the first layer (the topic number decreases from the lowest layer to the highest layer, i.e. higher-layer topics are more general). In the training text input matrix X_w and the training image matrix X_v, a topic (out of the K_0max topics) is randomly assigned to each occurrence of a word/feature class in each text/image, giving the initialization matrices: the assignment-count matrices A_w^(1) and A_v^(1), where A_w^(1)(i, j, k) / A_v^(1)(i, j, k) is the number of times the i-th word/feature class in the j-th text/image is assigned to the k-th topic; the bottommost topic-word / topic-feature-class matrices Φ_w^(1) and Φ_v^(1), where Φ_w^(1)(i, k) / Φ_v^(1)(i, k) is the proportion of the i-th word/feature class under the k-th topic, taken over all texts/images; and the topic-proportion matrices θ_w^(1) and θ_v^(1), where θ_w^(1)(k, j) / θ_v^(1)(k, j) is the proportion of the k-th topic in the j-th text/image. Initial values of the probability parameters are also generated (these have no practical meaning and only participate in the computation). In the following process, the meaning of each matrix does not change; only its values do.
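The random initialization described above can be sketched as follows (a simplified single-modality illustration; the matrix names A, Phi, Theta mirror the notation above but the function itself is an assumption, not the patent's code):

```python
import numpy as np

def init_assignments(X, K, seed=0):
    """Randomly assign one of K bottom-layer topics to every word
    occurrence, and accumulate the three matrices:
      A[i, j, k]  - occurrences of word i in text j assigned topic k
      Phi[i, k]   - proportion of word i under topic k (over all texts)
      Theta[k, j] - proportion of topic k inside text j
    """
    rng = np.random.default_rng(seed)
    V, D = X.shape
    A = np.zeros((V, D, K), dtype=int)
    for i in range(V):
        for j in range(D):
            for _ in range(X[i, j]):        # one topic per occurrence
                A[i, j, rng.integers(K)] += 1
    n_ik = A.sum(axis=1)                    # word-topic counts (V, K)
    n_kj = A.sum(axis=0).T                  # topic-text counts (K, D)
    Phi = n_ik / n_ik.sum(axis=0, keepdims=True).clip(min=1)
    Theta = n_kj / n_kj.sum(axis=0, keepdims=True).clip(min=1)
    return A, Phi, Theta

X = np.array([[2, 1], [0, 3]])
A, Phi, Theta = init_assignments(X, K=3)
```

Every word occurrence gets exactly one topic, so the assignment counts preserve the word totals of X.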
Let the current outer layer be T. Starting from the bottom layer and moving upward, a certain number of iterations (B_T + C_T) is executed for each T, training all layers at or below layer T (t ≤ T) in two steps. Each iteration proceeds as follows. First, from the bottommost layer up to the outer current layer T, layer by layer, each layer samples the values of part of its matrices. Let the inner current layer (from the bottommost layer up to the T-th layer) be t. When t = 1, the Gibbs sampling method is adopted: Φ_w^(1), θ_w^(1) and Φ_v^(1), θ_v^(1) are used to re-sample and re-assign the topic of each word/feature-class occurrence. After a certain number of sampling sweeps, the assignments become stable (no longer change with further sampling) and are merged to obtain the topic-word / topic-feature-class matrices Z_w^(1) and Z_v^(1), where Z_w^(1)(k, i) / Z_v^(1)(k, i) is the frequency with which the i-th word/feature class is assigned to the k-th topic, and the topic-text / topic-image matrices m_w^(1) and m_v^(1), where m_w^(1)(k, j) / m_v^(1)(k, j) is the number of words/feature classes in the j-th text/image assigned to the k-th topic.
When t ≥ 2, the topic number K_t of the current layer is first initialized to the topic number K_{t-1} of the previous layer. Following the Gibbs sampling principle, the stabilized occurrences of the previous layer's topics in each microblog text/image are re-sampled under an assignment to the current layer's topics (here a previous-layer topic plays the role that a word/feature class plays at the bottom layer), giving the assignment-count matrices A_w^(t) and A_v^(t), where A_w^(t)(i, j, k) / A_v^(t)(i, j, k) is the number of occurrences of the i-th previous-layer topic in the j-th text/image assigned to the k-th current-layer topic. Merging A_w^(t) / A_v^(t) yields the current-layer-topic-previous-layer-topic matrices Z_w^(t) and Z_v^(t) and the topic-text / topic-image matrices m_w^(t) and m_v^(t), where m_w^(t)(k, j) / m_v^(t)(k, j) is the number of previous-layer topic occurrences in the j-th text/image assigned to the k-th current-layer topic. The obtained Z_w^(t) / Z_v^(t) are then used to sample the proportion of each previous-layer topic under each current-layer topic, i.e. the matrices Φ_w^(t) and Φ_v^(t), where Φ_w^(t)(i, k) / Φ_v^(t)(i, k) is the proportion of the i-th previous-layer topic under the k-th current-layer topic, taken over all texts/images.
Second, the probability parameters are sampled and computed layer by layer. From the outer current layer T down to the bottommost layer, each layer samples the values of its remaining matrices. First, Z_w^(T) / Z_v^(T) are used to sample the weight vector r^(T) over the topics of the outer current layer T (a larger weight means the corresponding topic carries a heavier proportion). Then r^(T) (when t = T) or θ_w^(t+1) and θ_v^(t+1) (when t < T) is used as the probability generation parameter to sample the lower-layer θ_w^(t) and θ_v^(t). Note that, from a certain layer upward, the θ and the associated probability parameters of all higher layers become common to text and image: the common θ is sampled using the matrix obtained by splicing the text and image count matrices along the topic dimension as the probability parameter, and the associated common probability parameters are obtained by sampling from the common θ. Therefore, when those layers are sampled, the common θ of the higher layer is matrix-multiplied with Φ_w^(t) and Φ_v^(t) respectively, and the products are spliced into one matrix that serves as the common probability parameter for sampling. When the iteration count reaches the threshold B_T, inactive topics (topics to which no word/feature class or lower-layer topic has been assigned) are removed, trimming the topic number K_t of the current layer. When the iterative sampling of all layers has finished, training is complete, yielding the distribution of all microblog words and picture feature classes in the training texts and training images under each layer's topics.
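The overall layer-wise scheme above (grow the network upward one outer layer at a time; within each iteration sweep upward to resample counts, then downward to sample r and θ) can be summarized by this schematic control loop (structure only; the `trace` bookkeeping stands in for the actual Gibbs steps and is an illustration, not the patent's implementation):

```python
def train_pgbn(T_max, B, C):
    """Schematic outer training loop of the two-step, layer-wise scheme.

    B[T] + C[T] is the number of iterations run when layer T is the
    current outer layer; B[T] of them are burn-in, after which inactive
    topics would be pruned (pruning is omitted in this sketch).
    """
    trace = []
    for T in range(1, T_max + 1):          # grow the network upward
        for _ in range(B[T] + C[T]):
            for t in range(1, T + 1):      # step 1: upward count resampling
                trace.append(("up", T, t))
            for t in range(T, 0, -1):      # step 2: downward r / theta sampling
                trace.append(("down", T, t))
    return trace

trace = train_pgbn(2, {1: 1, 2: 1}, {1: 0, 2: 0})
```

The trace makes the two-step structure visible: each iteration visits every layer once going up and once going down.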
Step S240: and obtaining the subject distribution condition in the text and the image based on the total input matrix, and obtaining the attribute of the user based on the subject distribution condition.
Step S240 may further include: initializing a first network parameter, a second network parameter, and a preset third network parameter of the Poisson gamma belief network; and taking the total input matrix as the input of the Poisson gamma belief network, sampling layer by layer from the bottommost layer to the topmost layer, and iteratively updating the first, second, and third network parameters until a preset iteration count is reached, obtaining the topic distribution in the text and the image.
Specifically, the first network parameter θ, the second network parameter r, and the preset third network parameter Φ of the Poisson gamma belief network are initialized, where the preset Φ is the Φ obtained by training on the training texts and training images. For a certain number of iterations (B_T + C_T), all layers are trained in two steps, and each iteration executes the following process. Layer by layer from the bottommost layer to the topmost layer, the total input matrix X_w / X_v, the Φ^(t) generated from the training set, and the θ^(t) formed by splicing the training-set and test-set parts along the user dimension (or the common θ^(t)) are used to sample the topic-user matrix θ^(t) of the total data set. The associated probability parameters are generated by sampling from the second layer up to the topmost layer. At the top layer, the top-layer r obtained in training is used as the probability generation parameter to sample θ^(T) (common to text and picture). The tail of θ^(T) (i.e. all columns after a certain column, the columns corresponding to the user) is the distribution of each topic in the user's microblogs. After the iterations finish, the distribution of each topic in the user's microblogs is obtained.
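Reading the user's topic distribution off the tail columns of the inferred top-layer matrix can be sketched as follows (the matrix values and the single-test-user layout are illustrative assumptions):

```python
import numpy as np

# theta_T: top-layer topic-user matrix after inference; the leading
# columns belong to training users, the tail columns to the test user(s).
theta_T = np.array([[0.7, 0.2, 0.6],
                    [0.3, 0.8, 0.4]])
n_test = 1
user_topics = theta_T[:, -n_test:]   # per-topic proportions for the test user
top_topic = int(user_topics[:, 0].argmax())  # dominant topic index
```

The dominant topics of `user_topics` are what the method then maps to the user's attributes.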
According to the method and device for obtaining user attributes, the texts and images in a user's microblogs are obtained; a text input matrix corresponding to the texts and an image input matrix corresponding to the images are then obtained; a total input matrix is obtained based on the text input matrix and the image input matrix; and the topic distribution in the texts and images is obtained based on the total input matrix, from which the attributes of the user are obtained. The method is efficient, accurate, and practical.
Referring to fig. 3, an embodiment of the present invention provides an apparatus 300 for obtaining user attributes, where the apparatus may include: a first acquisition unit 320, a second acquisition unit 330, a third acquisition unit 340, a fourth acquisition unit 350, and a fifth acquisition unit 360.
The first obtaining unit 320 is configured to obtain a text and an image in a microblog of a user.
The second obtaining unit 330 is configured to obtain a text input matrix corresponding to the text.
The second acquisition unit 330 may include a second acquisition sub-unit 331.
The second obtaining subunit 331 is configured to perform word segmentation on the text and count word frequencies to obtain at least one segmented word and the word frequency of each segmented word; and to obtain a text input matrix corresponding to the text based on the at least one segmented word and the word frequency of each segmented word.
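As an illustrative sketch (not from the embodiment; the function name and the toy documents are hypothetical), the segment-and-count step can be realized as a word-by-document count matrix, one row per vocabulary word and one column per microblog document:

```python
from collections import Counter

def build_text_matrix(segmented_docs, vocab=None):
    """Build a word-frequency (bag-of-words) input matrix: one row per
    vocabulary word, one column per document, entries are word counts."""
    counts = [Counter(doc) for doc in segmented_docs]
    if vocab is None:
        vocab = sorted({w for c in counts for w in c})
    matrix = [[c[w] for c in counts] for w in vocab]
    return vocab, matrix

# Two toy "microblog" documents after word segmentation.
docs = [["travel", "photo", "travel"], ["photo", "food"]]
vocab, X_text = build_text_matrix(docs)
```

Passing a fixed `vocab` keeps the row dimension aligned with a previously built training-set matrix, which the later splicing step requires.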
A third obtaining unit 340, configured to obtain an image input matrix corresponding to the image.
The third acquiring unit 340 may include a third acquiring subunit 341.
A third obtaining subunit 341, configured to perform SIFT feature extraction on the image to obtain a first feature vector corresponding to the image, and to obtain an image input matrix corresponding to the image based on the first feature vector.
A fourth obtaining unit 350, configured to obtain a total input matrix based on the text input matrix and the image input matrix.
The fourth acquisition unit 350 may include a fourth acquisition sub-unit 351.
The fourth obtaining subunit 351 is configured to splice the text input matrix, the image input matrix, and a preset training set input matrix to obtain a total input matrix.
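A minimal Python sketch of the splicing step (illustrative only; names are hypothetical) — matrices that share the feature (row) dimension are concatenated along the user (column) dimension, so the test user's columns sit after the training users' columns:

```python
def splice_columns(train_matrix, test_matrix):
    """Concatenate two matrices that have the same row (feature)
    dimension along the column (user) dimension."""
    assert len(train_matrix) == len(test_matrix)
    return [tr + te for tr, te in zip(train_matrix, test_matrix)]

X_train = [[1, 0], [0, 2]]   # 2 features x 2 training users
X_test = [[3], [1]]          # same 2 features x 1 test user
X_total = splice_columns(X_train, X_test)
```

The same column-wise splice applies after the text and image matrices have been stacked along the feature dimension, which is why both must be built against fixed vocabularies (words and visual words, respectively).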
A fifth obtaining unit 360, configured to obtain, based on the total input matrix, the topic distribution in the text and the image, and to obtain the attribute of the user based on the topic distribution.
The fifth acquiring unit 360 may include a fifth acquiring sub-unit 361.
The fifth obtaining subunit 361 is configured to initialize a first network parameter, a second network parameter and a preset third network parameter of the Poisson gamma belief network; take the total input matrix as the input of the Poisson gamma belief network, sample layer by layer from the bottommost layer to the topmost layer of the Poisson gamma belief network, and iteratively update the first network parameter, the second network parameter and the third network parameter until a preset number of iterations is reached, so as to obtain the topic distribution in the text and the image.
Referring to fig. 4, the apparatus 300 may further include: a training unit 310.
A training unit 310, configured to obtain training texts and training images in a plurality of microblogs; obtain a training text input matrix corresponding to the training text; obtain a training image matrix corresponding to the training image; and obtain the training set input matrix based on the training text input matrix and the training image matrix.
The training unit 310 may comprise a training subunit 311.
The training subunit 311 is further configured to perform SIFT feature extraction on each training image to obtain a second feature vector corresponding to each training image; obtain the cluster center of each class and the image features contained in each class based on a preset clustering algorithm and the second feature vector corresponding to each training image; and count the number of image features contained in each training image to obtain a training image matrix corresponding to the plurality of training images.
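For illustration (hypothetical names; toy 2-D descriptors stand in for 128-D SIFT vectors, and the cluster centers are assumed to have come from the preset clustering algorithm), the counting step amounts to a visual-word histogram per image:

```python
def nearest_center(vec, centers):
    """Index of the closest cluster center (squared Euclidean distance)."""
    dists = [sum((v - c) ** 2 for v, c in zip(vec, ctr)) for ctr in centers]
    return dists.index(min(dists))

def visual_word_histogram(descriptors, centers):
    """Count how many local descriptors of one image fall into each
    cluster ('visual word'); the counts form that image's column of
    the training image matrix."""
    hist = [0] * len(centers)
    for d in descriptors:
        hist[nearest_center(d, centers)] += 1
    return hist

# Toy 2-D "descriptors" and two precomputed cluster centers.
centers = [(0.0, 0.0), (10.0, 10.0)]
descs = [(0.5, 0.2), (9.8, 10.1), (0.1, 0.1)]
hist = visual_word_histogram(descs, centers)
```

Because every image is described over the same shared cluster centers, the histograms of all images have equal length and can be assembled column by column into one matrix, mirroring the word-frequency matrix on the text side.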
The training unit 310 is further configured to set the maximum number of topics at the bottommost layer of the Poisson gamma belief network; randomly assign a topic to each training text and each training image based on the training set input matrix, obtain an initialization matrix and generate an initial value of each probability parameter; and iteratively train the Poisson gamma belief network based on the initialization matrix and the initial values of the probability parameters, so as to obtain the topic distribution in the training text and the training images.
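A minimal sketch of the random topic initialization (illustrative; the one-topic-per-document simplification and all names are assumptions, not the embodiment's exact scheme) — each training document gets a random topic, and the assignments are accumulated into an initial topic-by-document count matrix:

```python
import random

def init_topic_counts(n_docs, n_topics, seed=0):
    """Randomly assign one topic to each training document and
    accumulate a topic x document one-hot count matrix as the
    bottom-layer initialization."""
    rng = random.Random(seed)
    assignments = [rng.randrange(n_topics) for _ in range(n_docs)]
    counts = [[0] * n_docs for _ in range(n_topics)]
    for doc, topic in enumerate(assignments):
        counts[topic][doc] = 1
    return assignments, counts

assignments, counts = init_topic_counts(n_docs=4, n_topics=3)
```

Each column of `counts` sums to one (exactly one topic per document), giving the Gibbs sampler a valid starting state that the iterations then refine.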
The above units may be implemented by software codes, and in this case, the above units may be stored in the memory 102. The above units may also be implemented by hardware, for example, an integrated circuit chip.
The user attribute obtaining apparatus 300 according to the embodiment of the present invention has the same implementation principle and technical effect as those of the foregoing method embodiments, and for brief description, reference may be made to corresponding contents in the foregoing method embodiments for parts that are not mentioned in the apparatus embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, the functional modules in the embodiments of the present invention may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes. It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (7)

1. A user attribute obtaining method is applied to an electronic device, and is characterized by comprising the following steps:
acquiring training texts and training images in a plurality of microblogs;
obtaining a training text input matrix corresponding to the training text;
obtaining a training image matrix corresponding to the training image;
obtaining a training set input matrix based on the training text input matrix and the training image matrix;
setting the maximum number of topics at the bottommost layer of the Poisson gamma belief network;
randomly assigning a topic to each training text and each training image based on the training set input matrix, obtaining an initialization matrix and generating an initial value of each probability parameter;
iteratively training the Poisson gamma belief network based on the initialization matrix and the initial values of the probability parameters: sampling, layer by layer upwards from the bottommost layer of the Poisson gamma belief network to the current layer, a plurality of matrix values of each layer; sampling and calculating the probability parameters layer by layer; sampling, layer by layer downwards from the current layer to the bottommost layer, the remaining matrix values of each layer; and obtaining the topic distribution in the training text and the training images when the iterative sampling of all layers is completed;
acquiring texts and images in a microblog of a user;
obtaining a text input matrix corresponding to the text;
obtaining an image input matrix corresponding to the image;
splicing the text input matrix, the image input matrix and a preset training set input matrix to obtain a total input matrix;
and obtaining the topic distribution in the text and the image based on the total input matrix, and obtaining the attribute of the user based on the topic distribution.
2. The method of claim 1, wherein obtaining a text input matrix corresponding to the text comprises:
performing word segmentation on the text and counting word frequencies to obtain at least one segmented word and the word frequency of each segmented word;
and obtaining a text input matrix corresponding to the text based on the at least one segmented word and the word frequency of each segmented word.
3. The method of claim 1, wherein obtaining an image input matrix corresponding to the image comprises:
and performing SIFT feature extraction on the image to obtain a first feature vector corresponding to the image, and obtaining an image input matrix corresponding to the image based on the first feature vector.
4. The method of claim 1, wherein obtaining a training image matrix corresponding to the training image comprises:
performing SIFT feature extraction on each training image to obtain a second feature vector corresponding to each training image;
obtaining the cluster center of each class and the image features contained in each class based on a preset clustering algorithm and the second feature vector corresponding to each training image;
and counting the number of image features contained in each training image to obtain a training image matrix corresponding to the plurality of training images.
5. The method of claim 1, wherein obtaining the distribution of topics in the text and the image based on the total input matrix comprises:
initializing a first network parameter, a second network parameter and a preset third network parameter of the Poisson gamma belief network;
and taking the total input matrix as the input of the Poisson gamma belief network, sampling layer by layer from the bottommost layer to the topmost layer of the Poisson gamma belief network, and iteratively updating the first network parameter, the second network parameter and the third network parameter until a preset number of iterations is reached, so as to obtain the topic distribution in the text and the image.
6. A user attribute acquisition apparatus, characterized in that the apparatus comprises:
the training unit is used for acquiring training texts and training images in a plurality of microblogs; obtaining a training text input matrix corresponding to the training text; obtaining a training image matrix corresponding to the training image; obtaining a training set input matrix based on the training text input matrix and the training image matrix; setting the maximum number of topics at the bottommost layer of the Poisson gamma belief network; randomly assigning a topic to each training text and each training image based on the training set input matrix, obtaining an initialization matrix and generating an initial value of each probability parameter; and iteratively training the Poisson gamma belief network based on the initialization matrix and the initial values of the probability parameters: sampling, layer by layer upwards from the bottommost layer of the Poisson gamma belief network to the current layer, a plurality of matrix values of each layer; sampling and calculating the probability parameters layer by layer; sampling, layer by layer downwards from the current layer to the bottommost layer, the remaining matrix values of each layer; and obtaining the topic distribution in the training text and the training images when the iterative sampling of all layers is completed;
the first acquisition unit is used for acquiring texts and images in the microblog of the user;
the second acquisition unit is used for acquiring a text input matrix corresponding to the text;
the third acquisition unit is used for acquiring an image input matrix corresponding to the image;
the fourth acquiring subunit is used for splicing the text input matrix, the image input matrix and a preset training set input matrix to acquire a total input matrix;
and the fifth acquiring unit is used for acquiring the topic distribution in the text and the image based on the total input matrix and acquiring the attribute of the user based on the topic distribution.
7. The apparatus of claim 6, wherein the second obtaining unit comprises:
the second obtaining subunit is used for performing word segmentation processing on the text and counting word frequency to obtain at least one word segmentation and the word frequency of each word segmentation in the at least one word segmentation; and obtaining a text input matrix corresponding to the text based on the at least one word segmentation and the word frequency of each word segmentation.
CN201710738930.0A 2017-08-24 2017-08-24 User attribute acquisition method and device Active CN107480289B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710738930.0A CN107480289B (en) 2017-08-24 2017-08-24 User attribute acquisition method and device


Publications (2)

Publication Number Publication Date
CN107480289A CN107480289A (en) 2017-12-15
CN107480289B true CN107480289B (en) 2020-06-30

Family

ID=60602525

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710738930.0A Active CN107480289B (en) 2017-08-24 2017-08-24 User attribute acquisition method and device

Country Status (1)

Country Link
CN (1) CN107480289B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108984689B (en) * 2018-07-02 2021-08-03 广东睿江云计算股份有限公司 Multi-copy synchronization method and device in combined file system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103838836A (en) * 2014-02-25 2014-06-04 中国科学院自动化研究所 Multi-modal data fusion method and system based on discriminant multi-modal deep confidence network
CN104361059A (en) * 2014-11-03 2015-02-18 中国科学院自动化研究所 Harmful information identification and web page classification method based on multi-instance learning
CN105426356A (en) * 2015-10-29 2016-03-23 杭州九言科技股份有限公司 Target information identification method and apparatus
CN105760507A (en) * 2016-02-23 2016-07-13 复旦大学 Cross-modal subject correlation modeling method based on deep learning
CN106446117A (en) * 2016-09-18 2017-02-22 西安电子科技大学 Text analysis method based on poisson-gamma belief network


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"A Deep Approach for Multi-modal User Attribute Modeling";Xiu Huang 等;《ADC 2017: Databases Theory and Applications》;20170920;217-230 *
"The Poisson Gamma Belief Network";Mingyuan Zhou;《arXiv:1511.02199v1[stat.ML]》;20151106;1-13 *
"基于多模态社交媒体数据源的用户画像构建的研究";黄秀;《中国优秀硕士学位论文全文数据库 信息科技辑》;20180915(第9期);I139-41 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant