CN107480289B - User attribute acquisition method and device - Google Patents


Info

Publication number
CN107480289B
CN107480289B (application CN201710738930.0A)
Authority
CN
China
Prior art keywords
training
image
text
obtaining
layer
Prior art date
Legal status
Active
Application number
CN201710738930.0A
Other languages
Chinese (zh)
Other versions
CN107480289A (en)
Inventor
杨阳
黄秀
杨子豪
沈复民
谢宁
申恒涛
Current Assignee
Chengdu Aohaichuan Technology Co ltd
Original Assignee
Chengdu Aohaichuan Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Chengdu Aohaichuan Technology Co ltd filed Critical Chengdu Aohaichuan Technology Co ltd
Priority to CN201710738930.0A priority Critical patent/CN107480289B/en
Publication of CN107480289A publication Critical patent/CN107480289A/en
Application granted granted Critical
Publication of CN107480289B publication Critical patent/CN107480289B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • Marketing (AREA)
  • Computational Linguistics (AREA)
  • Economics (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Computing Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Resources & Organizations (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • User Interface Of Digital Computer (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention provides a user attribute obtaining method and device, relating to the field of data processing. The method comprises: obtaining the texts and images in a user's microblogs; obtaining a text input matrix corresponding to the texts; obtaining an image input matrix corresponding to the images; obtaining a total input matrix based on the text input matrix and the image input matrix; and obtaining the topic distribution in the texts and images based on the total input matrix, and obtaining the attributes of the user based on that topic distribution. The method is efficient, accurate, and practical.

Description

User attribute acquisition method and device
Technical Field
The invention relates to the technical field of data processing, in particular to a user attribute acquisition method and device.
Background
At present, existing methods such as the Poisson Gamma Belief Network (PGBN) obtain user attributes by processing text content only and cannot be applied directly in a large-scale social-media environment; as a result, they are inefficient and inaccurate.
Disclosure of Invention
The present invention is directed to a user attribute obtaining method and device that address the above problems. To achieve this purpose, the invention adopts the following technical scheme:
in a first aspect, an embodiment of the present invention provides a method for obtaining a user attribute, where the method includes obtaining a text and an image in a microblog of a user; obtaining a text input matrix corresponding to the text; obtaining an image input matrix corresponding to the image; obtaining a total input matrix based on the text input matrix and the image input matrix; and obtaining the subject distribution condition in the text and the image based on the total input matrix, and obtaining the attribute of the user based on the subject distribution condition.
In a second aspect, an embodiment of the present invention provides a user attribute obtaining apparatus, where the apparatus includes a first obtaining unit, a second obtaining unit, a third obtaining unit, a fourth obtaining unit, and a fifth obtaining unit. The first acquisition unit is used for acquiring texts and images in the microblog of the user. And the second acquisition unit is used for acquiring a text input matrix corresponding to the text. And the third acquisition unit is used for acquiring an image input matrix corresponding to the image. And the fourth acquisition unit is used for acquiring a total input matrix based on the text input matrix and the image input matrix. And the fifth acquiring unit is used for acquiring the subject distribution condition in the text and the image based on the total input matrix and acquiring the attribute of the user based on the subject distribution condition.
According to the method and device for obtaining user attributes, the texts and images in a user's microblogs are obtained; a text input matrix corresponding to the texts and an image input matrix corresponding to the images are then obtained; a total input matrix is obtained based on the text input matrix and the image input matrix; and the topic distribution in the texts and images is obtained based on the total input matrix, from which the attributes of the user are obtained. The method is efficient, accurate, and practical.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
Fig. 1 is a block diagram of an electronic device according to an embodiment of the present invention;
fig. 2 is a flowchart of a user attribute obtaining method according to an embodiment of the present invention;
fig. 3 is a block diagram of a user attribute obtaining apparatus according to an embodiment of the present invention;
fig. 4 is a block diagram of another user attribute obtaining apparatus according to another embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present invention, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
Fig. 1 shows a block diagram of an electronic device 100 applicable to an embodiment of the present invention. As shown in FIG. 1, the electronic device 100 may include a memory 102, a memory controller 104, one or more processors 106 (only one is shown in FIG. 1), a peripherals interface 108, an input-output module 110, an audio module 112, a display module 114, a radio frequency module 116, and a user attribute acquisition apparatus.
The memory 102, the memory controller 104, the processor 106, the peripherals interface 108, the input-output module 110, the audio module 112, the display module 114, and the radio frequency module 116 are electrically connected to one another, directly or indirectly, to realize data transmission and interaction. For example, these components may be electrically connected through one or more communication or signal buses. The user attribute acquisition apparatus includes at least one software functional module that can be stored in the memory 102 in the form of software or firmware, for example as a software functional module or a computer program included in the user attribute acquisition apparatus.
The memory 102 may store various software programs and modules, such as program instructions/modules corresponding to the user attribute obtaining method and apparatus provided in the embodiments of the present application. The processor 106 executes various functional applications and data processing by running software programs and modules stored in the memory 102, that is, implements the user attribute acquisition method in the embodiment of the present application.
The memory 102 may include, but is not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
The processor 106 may be an integrated circuit chip having signal processing capabilities. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. It may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, and so on.
The peripherals interface 108 couples the various input/output devices to the processor 106 and the memory 102. In some embodiments, the peripherals interface 108, the processor 106, and the memory controller 104 may be implemented in a single chip. In other embodiments, they may each be implemented as a separate chip.
The input-output module 110 is used for receiving user input, enabling the user to interact with the electronic device 100. The input-output module 110 may be, but is not limited to, a mouse, a keyboard, and the like.
Audio module 112 provides an audio interface to a user that may include one or more microphones, one or more speakers, and audio circuitry.
The display module 114 provides an interactive interface (e.g., a user interface) between the electronic device 100 and a user, or displays image data for the user's reference. In this embodiment, the display module 114 may be a liquid crystal display or a touch display. A touch display may be a capacitive or resistive touch screen supporting single-point and multi-point touch operations, meaning that the touch display can sense touch operations at one or more locations simultaneously and send the sensed touch operations to the processor 106 for calculation and processing.
The rf module 116 is used for receiving and transmitting electromagnetic waves, and implementing interconversion between the electromagnetic waves and electrical signals, so as to communicate with a communication network or other devices.
It will be appreciated that the configuration shown in FIG. 1 is merely illustrative and that electronic device 100 may include more or fewer components than shown in FIG. 1 or have a different configuration than shown in FIG. 1. The components shown in fig. 1 may be implemented in hardware, software, or a combination thereof.
In the embodiment of the invention, the electronic device 100 may be a user terminal or a server. The user terminal may be a PC (personal computer), a tablet computer, a mobile phone, a notebook computer, a smart television, a set-top box, a vehicle-mounted terminal, or another terminal device.
Referring to fig. 2, an embodiment of the present invention provides a method for obtaining a user attribute, where the method includes: step S200, step S210, step S220, step S230, and step S240.
Step S200: and acquiring texts and images in the microblog of the user.
The texts and images in the user's microblogs can be acquired from Sina Weibo.
Step S210: and obtaining a text input matrix corresponding to the text.
Step S210 may further include: performing word segmentation on the text and counting word frequencies to obtain at least one segmented word and the frequency of each segmented word; and obtaining the text input matrix corresponding to the text based on the segmented words and their frequencies.
In this embodiment, Python is used to segment the text and build a vocabulary of all words, where each row of the vocabulary is one word and the row number is the word's index. Word frequencies (counts) are counted per microblog (document) to generate the text input matrix X_u corresponding to the text, where X_u(i, j) is the frequency with which the i-th word appears in the j-th text.
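The construction of the word-frequency matrix described above can be sketched as follows (a minimal illustration: whitespace tokenization stands in for the Chinese word segmenter, and the function and variable names are assumptions, not from the patent):

```python
from collections import Counter

import numpy as np

def build_text_matrix(texts):
    """Build a vocabulary and a word-by-text count matrix X_u.

    X_u[i, j] is the frequency with which the i-th vocabulary word
    appears in the j-th text.
    """
    # Tokenize each text; a real system would use a Chinese word
    # segmenter (e.g. jieba) here instead of str.split().
    tokenized = [t.split() for t in texts]
    vocab = sorted({w for doc in tokenized for w in doc})
    index = {w: i for i, w in enumerate(vocab)}  # row number = word index
    X_u = np.zeros((len(vocab), len(texts)), dtype=int)
    for j, doc in enumerate(tokenized):
        for w, c in Counter(doc).items():
            X_u[index[w], j] = c
    return vocab, X_u

vocab, X_u = build_text_matrix(["a b a", "b c"])
```

Each column of `X_u` corresponds to one microblog text, matching the definition of X_u(i, j) above.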
Step S220: and obtaining an image input matrix corresponding to the image.
Step S220 may further include: performing SIFT feature extraction on the image to obtain a first feature vector corresponding to the image, and obtaining the image input matrix corresponding to the image based on the first feature vector.
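Assuming the local SIFT descriptors of an image have already been extracted (e.g. with an off-the-shelf implementation such as OpenCV's), the image's input column can be formed by quantizing each descriptor against trained cluster centers and counting the visual words. A minimal numpy sketch (names are illustrative):

```python
import numpy as np

def quantize_descriptors(descriptors, centers):
    """Assign each local descriptor to its nearest cluster center and
    count how often each visual-word class occurs in the image.

    Returns a vector h with h[k] = number of descriptors in cluster k,
    i.e. one column of the image input matrix X_v.
    """
    # Squared Euclidean distance between every descriptor and center.
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)
    return np.bincount(nearest, minlength=len(centers))

centers = np.array([[0.0, 0.0], [10.0, 10.0]])
desc = np.array([[0.1, 0.2], [9.8, 9.9], [0.0, 0.1]])
h = quantize_descriptors(desc, centers)
```

Here `h` plays the role of the first feature vector's quantized form used to fill the image input matrix.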
Step S230: and obtaining a total input matrix based on the text input matrix and the image input matrix.
Step S230 may further include: splicing the text input matrix, the image input matrix, and a preset training-set input matrix to obtain the total input matrix.
The text input matrix, the image input matrix, and the preset training-set input matrix are spliced along the text/image-index dimension to obtain a spliced matrix, and the spliced matrix is summed block-wise along that dimension per user to obtain the total input matrix. In the total input matrix, each column represents one user and each row represents one word or image-feature class.
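The splicing and per-user summation described above can be sketched with numpy (a simplified illustration; the shapes and the `doc_owner` bookkeeping are assumptions, not from the patent):

```python
import numpy as np

# Rows: vocabulary words / feature classes; columns: individual
# texts or images. The training-set matrix is concatenated with the
# test user's matrix along the text/image (column) dimension.
X_train = np.array([[1, 0, 2], [0, 3, 1]])   # 3 training documents
X_user = np.array([[2, 1], [0, 1]])          # 2 documents of one user
X_cat = np.concatenate([X_train, X_user], axis=1)

# Sum the columns belonging to each user, so that in the total input
# matrix every column represents one user.
doc_owner = np.array([0, 0, 1, 2, 2])  # which user owns each column
n_users = doc_owner.max() + 1
X_total = np.zeros((X_cat.shape[0], n_users), dtype=int)
for j, u in enumerate(doc_owner):
    X_total[:, u] += X_cat[:, j]
```

After the block sum, the last column of `X_total` aggregates all of the test user's texts/images, as the patent's total input matrix requires.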
Further, before step S230, the method may further include: acquiring training texts and training images in a plurality of microblogs; obtaining a training text input matrix corresponding to the training text; obtaining a training image matrix corresponding to the training image; and obtaining the input matrix of the training set based on the input matrix of the training text and the training image matrix.
Further, SIFT feature extraction is performed on each training image to obtain a second feature vector corresponding to each training image;
the cluster center of each class and the image features contained in each class are obtained based on a preset clustering algorithm and the second feature vectors corresponding to the training images;
and the number of image features of each class contained in each training image is counted to obtain the training image matrix corresponding to the plurality of training images.
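The clustering step above can be sketched with a small plain-numpy K-means loop (a library implementation such as scikit-learn's KMeans would normally be used; this code is an illustration under that assumption, not the patent's implementation):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain K-means: returns cluster centers and the label of each
    sample, i.e. which visual-word class each SIFT descriptor joins."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign every descriptor to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # Move each center to the mean of its members.
        for c in range(k):
            if (labels == c).any():  # keep empty clusters unchanged
                centers[c] = X[labels == c].mean(axis=0)
    return centers, labels

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
centers, labels = kmeans(X, k=2)
```

Counting `labels` per training image then yields the training image matrix X_v described above.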
Further, after obtaining the training set input matrix based on the training text input matrix and the training image matrix, the method may further include:
setting the maximum theme number of the bottommost layer of the Poisson gamma belief network;
based on the input matrix of the training set, randomly distributing a theme for each training text and each training image, obtaining an initialization matrix and generating an initial value of each probability parameter;
and iteratively training the Poisson gamma belief network based on the initialization matrix and the initial values of all probability parameters to obtain the theme distribution condition in the training text and the training image.
Specifically, training texts and training images from more than 1000 microblogs are obtained from Sina Weibo and used as the training set. Python is used to segment the training texts and build a vocabulary of all words, where each row of the vocabulary is one word and its row number is the word's index value; word frequencies (counts) are counted per microblog (document) to generate the text-word-index training text input matrix X_w, where X_w(i, j) is the frequency with which the i-th word appears in the j-th text. SIFT feature extraction is applied to all training images in turn to obtain a set of second feature vectors; these vectors are concatenated and clustered into K classes with the preset clustering algorithm (which can be the K-means algorithm) to obtain the cluster center of each class and the image features contained in each class. The number of image features of each class contained in each training image is then counted to obtain the training image matrix X_v corresponding to the training images, where X_v(i, j) is the frequency with which the i-th feature class appears in the j-th image.
The maximum topic number K_0max of the bottom layer of the Poisson gamma belief network is set, which determines an upper limit on the number of topics extracted by the first layer (the topic number decreases from the lowest layer to the highest layer, i.e. higher-layer topics are more general). In the training text input matrix X_w and the training image matrix X_v, a topic (out of the K_0max topics) is randomly assigned to each occurrence of a word/feature class in each text/image, giving the initialization matrices: the assignment-count matrices A_w^(1) and A_v^(1), where A_w^(1)(i, j, k) / A_v^(1)(i, j, k) is the number of times the i-th word/feature class in the j-th text/image is assigned to the k-th topic; the bottommost topic-word / topic-feature-class matrices Φ_w^(1) and Φ_v^(1), where Φ_w^(1)(i, k) / Φ_v^(1)(i, k) is the proportion of the i-th word/feature class under the k-th topic, taken over all texts/images; and the topic-proportion matrices θ_w^(1) and θ_v^(1), where θ_w^(1)(k, j) / θ_v^(1)(k, j) is the proportion of the k-th topic in the j-th text/image. Initial values of the probability parameters are also generated (these have no practical meaning and only participate in the computation). In the following process, the meaning of each matrix does not change; only its values do.
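The random initialization described above can be sketched as follows (a simplified single-modality illustration; the matrix names A, Phi, Theta mirror the notation above but the function itself is an assumption, not the patent's code):

```python
import numpy as np

def init_assignments(X, K, seed=0):
    """Randomly assign one of K bottom-layer topics to every word
    occurrence, and accumulate the three matrices:
      A[i, j, k]  - occurrences of word i in text j assigned topic k
      Phi[i, k]   - proportion of word i under topic k (over all texts)
      Theta[k, j] - proportion of topic k inside text j
    """
    rng = np.random.default_rng(seed)
    V, D = X.shape
    A = np.zeros((V, D, K), dtype=int)
    for i in range(V):
        for j in range(D):
            for _ in range(X[i, j]):        # one topic per occurrence
                A[i, j, rng.integers(K)] += 1
    n_ik = A.sum(axis=1)                    # word-topic counts (V, K)
    n_kj = A.sum(axis=0).T                  # topic-text counts (K, D)
    Phi = n_ik / n_ik.sum(axis=0, keepdims=True).clip(min=1)
    Theta = n_kj / n_kj.sum(axis=0, keepdims=True).clip(min=1)
    return A, Phi, Theta

X = np.array([[2, 1], [0, 3]])
A, Phi, Theta = init_assignments(X, K=3)
```

Every word occurrence gets exactly one topic, so the assignment counts preserve the word totals of X.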
Let the current outer layer be T. Starting from the bottom layer and moving upward, a certain number of iterations (B_T + C_T) is executed for each T, training all layers at or below layer T (t ≤ T) in two steps. Each iteration proceeds as follows. First, from the bottommost layer up to the outer current layer T, layer by layer, each layer samples the values of part of its matrices. Let the inner current layer (from the bottommost layer up to the T-th layer) be t. When t = 1, the Gibbs sampling method is adopted: Φ_w^(1), θ_w^(1) and Φ_v^(1), θ_v^(1) are used to re-sample and re-assign the topic of each word/feature-class occurrence. After a certain number of sampling sweeps, the assignments become stable (no longer change with further sampling) and are merged to obtain the topic-word / topic-feature-class matrices Z_w^(1) and Z_v^(1), where Z_w^(1)(k, i) / Z_v^(1)(k, i) is the frequency with which the i-th word/feature class is assigned to the k-th topic, and the topic-text / topic-image matrices m_w^(1) and m_v^(1), where m_w^(1)(k, j) / m_v^(1)(k, j) is the number of words/feature classes in the j-th text/image assigned to the k-th topic.
When t ≥ 2, the topic number K_t of the current layer is first initialized to the topic number K_{t-1} of the previous layer. Following the Gibbs sampling principle, the stabilized occurrences of the previous layer's topics in each microblog text/image are re-sampled under an assignment to the current layer's topics (here a previous-layer topic plays the role that a word/feature class plays at the bottom layer), giving the assignment-count matrices A_w^(t) and A_v^(t), where A_w^(t)(i, j, k) / A_v^(t)(i, j, k) is the number of occurrences of the i-th previous-layer topic in the j-th text/image assigned to the k-th current-layer topic. Merging A_w^(t) / A_v^(t) yields the current-layer-topic-previous-layer-topic matrices Z_w^(t) and Z_v^(t) and the topic-text / topic-image matrices m_w^(t) and m_v^(t), where m_w^(t)(k, j) / m_v^(t)(k, j) is the number of previous-layer topic occurrences in the j-th text/image assigned to the k-th current-layer topic. The obtained Z_w^(t) / Z_v^(t) are then used to sample the proportion of each previous-layer topic under each current-layer topic, i.e. the matrices Φ_w^(t) and Φ_v^(t), where Φ_w^(t)(i, k) / Φ_v^(t)(i, k) is the proportion of the i-th previous-layer topic under the k-th current-layer topic, taken over all texts/images.
Second, the probability parameters are sampled and computed layer by layer. From the outer current layer T down to the bottommost layer, each layer samples the values of its remaining matrices. First, Z_w^(T) / Z_v^(T) are used to sample the weight vector r^(T) over the topics of the outer current layer T (a larger weight means the corresponding topic carries a heavier proportion). Then r^(T) (when t = T) or θ_w^(t+1) and θ_v^(t+1) (when t < T) is used as the probability generation parameter to sample the lower-layer θ_w^(t) and θ_v^(t). Note that, from a certain layer upward, the θ and the associated probability parameters of all higher layers become common to text and image: the common θ is sampled using the matrix obtained by splicing the text and image count matrices along the topic dimension as the probability parameter, and the associated common probability parameters are obtained by sampling from the common θ. Therefore, when those layers are sampled, the common θ of the higher layer is matrix-multiplied with Φ_w^(t) and Φ_v^(t) respectively, and the products are spliced into one matrix that serves as the common probability parameter for sampling. When the iteration count reaches the threshold B_T, inactive topics (topics to which no word/feature class or lower-layer topic has been assigned) are removed, trimming the topic number K_t of the current layer. When the iterative sampling of all layers has finished, training is complete, yielding the distribution of all microblog words and picture feature classes in the training texts and training images under each layer's topics.
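The overall layer-wise scheme above (grow the network upward one outer layer at a time; within each iteration sweep upward to resample counts, then downward to sample r and θ) can be summarized by this schematic control loop (structure only; the `trace` bookkeeping stands in for the actual Gibbs steps and is an illustration, not the patent's implementation):

```python
def train_pgbn(T_max, B, C):
    """Schematic outer training loop of the two-step, layer-wise scheme.

    B[T] + C[T] is the number of iterations run when layer T is the
    current outer layer; B[T] of them are burn-in, after which inactive
    topics would be pruned (pruning is omitted in this sketch).
    """
    trace = []
    for T in range(1, T_max + 1):          # grow the network upward
        for _ in range(B[T] + C[T]):
            for t in range(1, T + 1):      # step 1: upward count resampling
                trace.append(("up", T, t))
            for t in range(T, 0, -1):      # step 2: downward r / theta sampling
                trace.append(("down", T, t))
    return trace

trace = train_pgbn(2, {1: 1, 2: 1}, {1: 0, 2: 0})
```

The trace makes the two-step structure visible: each iteration visits every layer once going up and once going down.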
Step S240: and obtaining the subject distribution condition in the text and the image based on the total input matrix, and obtaining the attribute of the user based on the subject distribution condition.
Step S240 may further include: initializing a first network parameter, a second network parameter, and a preset third network parameter of the Poisson gamma belief network; and taking the total input matrix as the input of the Poisson gamma belief network, sampling layer by layer from the bottommost layer to the topmost layer, and iteratively updating the first, second, and third network parameters until a preset iteration count is reached, obtaining the topic distribution in the text and the image.
Specifically, the first network parameter θ, the second network parameter r, and the preset third network parameter Φ of the Poisson gamma belief network are initialized, where the preset Φ is the Φ obtained by training on the training texts and training images. For a certain number of iterations (B_T + C_T), all layers are trained in two steps, and each iteration executes the following process. Layer by layer from the bottommost layer to the topmost layer, the total input matrix X_w / X_v, the Φ^(t) generated from the training set, and the θ^(t) formed by splicing the training-set and test-set parts along the user dimension (or the common θ^(t)) are used to sample the topic-user matrix θ^(t) of the total data set. The associated probability parameters are generated by sampling from the second layer up to the topmost layer. At the top layer, the top-layer r obtained in training is used as the probability generation parameter to sample θ^(T) (common to text and picture). The tail of θ^(T) (i.e. all columns after a certain column, the columns corresponding to the user) is the distribution of each topic in the user's microblogs. After the iterations finish, the distribution of each topic in the user's microblogs is obtained.
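Reading the user's topic distribution off the tail columns of the inferred top-layer matrix can be sketched as follows (the matrix values and the single-test-user layout are illustrative assumptions):

```python
import numpy as np

# theta_T: top-layer topic-user matrix after inference; the leading
# columns belong to training users, the tail columns to the test user(s).
theta_T = np.array([[0.7, 0.2, 0.6],
                    [0.3, 0.8, 0.4]])
n_test = 1
user_topics = theta_T[:, -n_test:]   # per-topic proportions for the test user
top_topic = int(user_topics[:, 0].argmax())  # dominant topic index
```

The dominant topics of `user_topics` are what the method then maps to the user's attributes.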
According to the method and device for obtaining user attributes, the texts and images in a user's microblogs are obtained; a text input matrix corresponding to the texts and an image input matrix corresponding to the images are then obtained; a total input matrix is obtained based on the text input matrix and the image input matrix; and the topic distribution in the texts and images is obtained based on the total input matrix, from which the attributes of the user are obtained. The method is efficient, accurate, and practical.
Referring to fig. 3, an embodiment of the present invention provides an apparatus 300 for obtaining user attributes, where the apparatus may include: a first acquisition unit 320, a second acquisition unit 330, a third acquisition unit 340, a fourth acquisition unit 350, and a fifth acquisition unit 360.
The first obtaining unit 320 is configured to obtain a text and an image in a microblog of a user.
The second obtaining unit 330 is configured to obtain a text input matrix corresponding to the text.
The second acquisition unit 330 may include a second acquisition sub-unit 331.
The second obtaining subunit 331 is configured to perform word segmentation on the text and count word frequencies to obtain at least one segmented word and the word frequency of each segmented word; and to obtain a text input matrix corresponding to the text based on the at least one segmented word and the word frequency of each segmented word.
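As an illustrative sketch (not from the embodiment; the function name and the toy documents are hypothetical), the segment-and-count step can be realized as a word-by-document count matrix, one row per vocabulary word and one column per microblog document:

```python
from collections import Counter

def build_text_matrix(segmented_docs, vocab=None):
    """Build a word-frequency (bag-of-words) input matrix: one row per
    vocabulary word, one column per document, entries are word counts."""
    counts = [Counter(doc) for doc in segmented_docs]
    if vocab is None:
        vocab = sorted({w for c in counts for w in c})
    matrix = [[c[w] for c in counts] for w in vocab]
    return vocab, matrix

# Two toy "microblog" documents after word segmentation.
docs = [["travel", "photo", "travel"], ["photo", "food"]]
vocab, X_text = build_text_matrix(docs)
```

Passing a fixed `vocab` keeps the row dimension aligned with a previously built training-set matrix, which the later splicing step requires.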
A third obtaining unit 340, configured to obtain an image input matrix corresponding to the image.
The third acquiring unit 340 may include a third acquiring subunit 341.
A third obtaining subunit 341, configured to perform SIFT feature extraction on the image to obtain a first feature vector corresponding to the image, and to obtain an image input matrix corresponding to the image based on the first feature vector.
A fourth obtaining unit 350, configured to obtain a total input matrix based on the text input matrix and the image input matrix.
The fourth acquisition unit 350 may include a fourth acquisition sub-unit 351.
The fourth obtaining subunit 351 is configured to splice the text input matrix, the image input matrix, and a preset training set input matrix to obtain a total input matrix.
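A minimal Python sketch of the splicing step (illustrative only; names are hypothetical) — matrices that share the feature (row) dimension are concatenated along the user (column) dimension, so the test user's columns sit after the training users' columns:

```python
def splice_columns(train_matrix, test_matrix):
    """Concatenate two matrices that have the same row (feature)
    dimension along the column (user) dimension."""
    assert len(train_matrix) == len(test_matrix)
    return [tr + te for tr, te in zip(train_matrix, test_matrix)]

X_train = [[1, 0], [0, 2]]   # 2 features x 2 training users
X_test = [[3], [1]]          # same 2 features x 1 test user
X_total = splice_columns(X_train, X_test)
```

The same column-wise splice applies after the text and image matrices have been stacked along the feature dimension, which is why both must be built against fixed vocabularies (words and visual words, respectively).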
A fifth obtaining unit 360, configured to obtain, based on the total input matrix, the topic distribution in the text and the image, and to obtain the attribute of the user based on the topic distribution.
The fifth acquiring unit 360 may include a fifth acquiring sub-unit 361.
The fifth obtaining subunit 361 is configured to initialize a first network parameter, a second network parameter and a preset third network parameter of the Poisson gamma belief network; take the total input matrix as the input of the Poisson gamma belief network, sample layer by layer from the bottommost layer to the topmost layer of the Poisson gamma belief network, and iteratively update the first network parameter, the second network parameter and the third network parameter until a preset number of iterations is reached, so as to obtain the topic distribution in the text and the image.
Referring to fig. 4, the apparatus 300 may further include: a training unit 310.
A training unit 310, configured to obtain training texts and training images in a plurality of microblogs; obtain a training text input matrix corresponding to the training text; obtain a training image matrix corresponding to the training image; and obtain the training set input matrix based on the training text input matrix and the training image matrix.
The training unit 310 may comprise a training subunit 311.
The training subunit 311 is further configured to perform SIFT feature extraction on each training image to obtain a second feature vector corresponding to each training image; obtain the cluster center of each class and the image features contained in each class based on a preset clustering algorithm and the second feature vector corresponding to each training image; and count the number of image features contained in each training image to obtain a training image matrix corresponding to the plurality of training images.
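For illustration (hypothetical names; toy 2-D descriptors stand in for 128-D SIFT vectors, and the cluster centers are assumed to have come from the preset clustering algorithm), the counting step amounts to a visual-word histogram per image:

```python
def nearest_center(vec, centers):
    """Index of the closest cluster center (squared Euclidean distance)."""
    dists = [sum((v - c) ** 2 for v, c in zip(vec, ctr)) for ctr in centers]
    return dists.index(min(dists))

def visual_word_histogram(descriptors, centers):
    """Count how many local descriptors of one image fall into each
    cluster ('visual word'); the counts form that image's column of
    the training image matrix."""
    hist = [0] * len(centers)
    for d in descriptors:
        hist[nearest_center(d, centers)] += 1
    return hist

# Toy 2-D "descriptors" and two precomputed cluster centers.
centers = [(0.0, 0.0), (10.0, 10.0)]
descs = [(0.5, 0.2), (9.8, 10.1), (0.1, 0.1)]
hist = visual_word_histogram(descs, centers)
```

Because every image is described over the same shared cluster centers, the histograms of all images have equal length and can be assembled column by column into one matrix, mirroring the word-frequency matrix on the text side.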
The training unit 310 is further configured to set the maximum number of topics at the bottommost layer of the Poisson gamma belief network; randomly assign a topic to each training text and each training image based on the training set input matrix, obtain an initialization matrix and generate an initial value of each probability parameter; and iteratively train the Poisson gamma belief network based on the initialization matrix and the initial values of the probability parameters, so as to obtain the topic distribution in the training text and the training images.
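A minimal sketch of the random topic initialization (illustrative; the one-topic-per-document simplification and all names are assumptions, not the embodiment's exact scheme) — each training document gets a random topic, and the assignments are accumulated into an initial topic-by-document count matrix:

```python
import random

def init_topic_counts(n_docs, n_topics, seed=0):
    """Randomly assign one topic to each training document and
    accumulate a topic x document one-hot count matrix as the
    bottom-layer initialization."""
    rng = random.Random(seed)
    assignments = [rng.randrange(n_topics) for _ in range(n_docs)]
    counts = [[0] * n_docs for _ in range(n_topics)]
    for doc, topic in enumerate(assignments):
        counts[topic][doc] = 1
    return assignments, counts

assignments, counts = init_topic_counts(n_docs=4, n_topics=3)
```

Each column of `counts` sums to one (exactly one topic per document), giving the Gibbs sampler a valid starting state that the iterations then refine.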
The above units may be implemented by software codes, and in this case, the above units may be stored in the memory 102. The above units may also be implemented by hardware, for example, an integrated circuit chip.
The user attribute obtaining apparatus 300 according to the embodiment of the present invention has the same implementation principle and technical effect as those of the foregoing method embodiments, and for brief description, reference may be made to corresponding contents in the foregoing method embodiments for parts that are not mentioned in the apparatus embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, the functional modules in the embodiments of the present invention may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes. It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (7)

1. A user attribute obtaining method is applied to an electronic device, and is characterized by comprising the following steps:
acquiring training texts and training images in a plurality of microblogs;
obtaining a training text input matrix corresponding to the training text;
obtaining a training image matrix corresponding to the training image;
obtaining a training set input matrix based on the training text input matrix and the training image matrix;
setting the maximum number of topics at the bottommost layer of the Poisson gamma belief network;
randomly assigning a topic to each training text and each training image based on the training set input matrix, obtaining an initialization matrix and generating an initial value of each probability parameter;
iteratively training the Poisson gamma belief network based on the initialization matrix and the initial values of the probability parameters: sampling, layer by layer upwards from the bottommost layer of the Poisson gamma belief network to the current layer, a plurality of matrix values of each layer; sampling and calculating the probability parameters layer by layer; sampling, layer by layer downwards from the current layer to the bottommost layer, the remaining matrix values of each layer; and obtaining the topic distribution in the training text and the training images when the iterative sampling of all layers is completed;
acquiring texts and images in a microblog of a user;
obtaining a text input matrix corresponding to the text;
obtaining an image input matrix corresponding to the image;
splicing the text input matrix, the image input matrix and a preset training set input matrix to obtain a total input matrix;
and obtaining the topic distribution in the text and the image based on the total input matrix, and obtaining the attribute of the user based on the topic distribution.
2. The method of claim 1, wherein obtaining a text input matrix corresponding to the text comprises:
performing word segmentation on the text and counting word frequencies to obtain at least one segmented word and the word frequency of each segmented word;
and obtaining a text input matrix corresponding to the text based on the at least one segmented word and the word frequency of each segmented word.
3. The method of claim 1, wherein obtaining an image input matrix corresponding to the image comprises:
and performing SIFT feature extraction on the image to obtain a first feature vector corresponding to the image, and obtaining an image input matrix corresponding to the image based on the first feature vector.
4. The method of claim 1, wherein obtaining a training image matrix corresponding to the training image comprises:
performing SIFT feature extraction on each training image to obtain a second feature vector corresponding to each training image;
obtaining the cluster center of each class and the image features contained in each class based on a preset clustering algorithm and the second feature vector corresponding to each training image;
and counting the number of image features contained in each training image to obtain a training image matrix corresponding to the plurality of training images.
5. The method of claim 1, wherein obtaining the distribution of topics in the text and the image based on the total input matrix comprises:
initializing a first network parameter, a second network parameter and a preset third network parameter of the Poisson gamma belief network;
and taking the total input matrix as the input of the Poisson gamma belief network, sampling layer by layer from the bottommost layer to the topmost layer of the Poisson gamma belief network, and iteratively updating the first network parameter, the second network parameter and the third network parameter until a preset number of iterations is reached, so as to obtain the topic distribution in the text and the image.
6. A user attribute acquisition apparatus, characterized in that the apparatus comprises:
the training unit is used for acquiring training texts and training images in a plurality of microblogs; obtaining a training text input matrix corresponding to the training text; obtaining a training image matrix corresponding to the training image; obtaining a training set input matrix based on the training text input matrix and the training image matrix; setting the maximum number of topics at the bottommost layer of the Poisson gamma belief network; randomly assigning a topic to each training text and each training image based on the training set input matrix, obtaining an initialization matrix and generating an initial value of each probability parameter; and iteratively training the Poisson gamma belief network based on the initialization matrix and the initial values of the probability parameters: sampling, layer by layer upwards from the bottommost layer of the Poisson gamma belief network to the current layer, a plurality of matrix values of each layer; sampling and calculating the probability parameters layer by layer; sampling, layer by layer downwards from the current layer to the bottommost layer, the remaining matrix values of each layer; and obtaining the topic distribution in the training text and the training images when the iterative sampling of all layers is completed;
the first acquisition unit is used for acquiring texts and images in the microblog of the user;
the second acquisition unit is used for acquiring a text input matrix corresponding to the text;
the third acquisition unit is used for acquiring an image input matrix corresponding to the image;
the fourth acquiring subunit is used for splicing the text input matrix, the image input matrix and a preset training set input matrix to acquire a total input matrix;
and the fifth acquiring unit is used for acquiring the topic distribution in the text and the image based on the total input matrix and acquiring the attribute of the user based on the topic distribution.
7. The apparatus of claim 6, wherein the second obtaining unit comprises:
the second obtaining subunit is used for performing word segmentation processing on the text and counting word frequency to obtain at least one word segmentation and the word frequency of each word segmentation in the at least one word segmentation; and obtaining a text input matrix corresponding to the text based on the at least one word segmentation and the word frequency of each word segmentation.
CN201710738930.0A 2017-08-24 2017-08-24 User attribute acquisition method and device Active CN107480289B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710738930.0A CN107480289B (en) 2017-08-24 2017-08-24 User attribute acquisition method and device


Publications (2)

Publication Number Publication Date
CN107480289A CN107480289A (en) 2017-12-15
CN107480289B true CN107480289B (en) 2020-06-30

Family

ID=60602525

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710738930.0A Active CN107480289B (en) 2017-08-24 2017-08-24 User attribute acquisition method and device

Country Status (1)

Country Link
CN (1) CN107480289B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108984689B (en) * 2018-07-02 2021-08-03 广东睿江云计算股份有限公司 Multi-copy synchronization method and device in combined file system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103838836A (en) * 2014-02-25 2014-06-04 中国科学院自动化研究所 Multi-modal data fusion method and system based on discriminant multi-modal deep confidence network
CN104361059A (en) * 2014-11-03 2015-02-18 中国科学院自动化研究所 Harmful information identification and web page classification method based on multi-instance learning
CN105426356A (en) * 2015-10-29 2016-03-23 杭州九言科技股份有限公司 Target information identification method and apparatus
CN105760507A (en) * 2016-02-23 2016-07-13 复旦大学 Cross-modal subject correlation modeling method based on deep learning
CN106446117A (en) * 2016-09-18 2017-02-22 西安电子科技大学 Text analysis method based on poisson-gamma belief network


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"A Deep Approach for Multi-modal User Attribute Modeling";Xiu Huang 等;《ADC 2017: Databases Theory and Applications》;20170920;217-230 *
"The Poisson Gamma Belief Network";Mingyuan Zhou;《arXiv:1511.02199v1[stat.ML]》;20151106;1-13 *
"基于多模态社交媒体数据源的用户画像构建的研究";黄秀;《中国优秀硕士学位论文全文数据库 信息科技辑》;20180915(第9期);I139-41 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant