CN107480289A - User attribute acquisition method and device
- Publication number
- CN107480289A (application CN201710738930.0A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
Abstract
An embodiment of the present invention provides a user attribute acquisition method and device, relating to the field of data processing. The method includes: obtaining the text and images in a user's microblog posts; obtaining a text input matrix corresponding to the text; obtaining an image input matrix corresponding to the images; obtaining a total input matrix based on the text input matrix and the image input matrix; and, based on the total input matrix, obtaining the topic distribution in the text and images and deriving the user's attributes from that topic distribution. The method is efficient, accurate and practical.
Description
Technical field
The present invention relates to the technical field of data processing, and in particular to a user attribute acquisition method and device.
Background
At present, existing methods such as the Poisson gamma belief network (PGBN) can obtain user attributes only by processing text content. They cannot be applied directly in a large-scale social media environment, and they are inefficient and inaccurate.
Summary of the invention
The object of the present invention is to provide a user attribute acquisition method and device that alleviate the above problems. To achieve this object, the present invention adopts the following technical solutions.
In a first aspect, an embodiment of the present invention provides a user attribute acquisition method. The method includes: obtaining the text and images in a user's microblog posts; obtaining a text input matrix corresponding to the text; obtaining an image input matrix corresponding to the images; obtaining a total input matrix based on the text input matrix and the image input matrix; and, based on the total input matrix, obtaining the topic distribution in the text and images and obtaining the user's attributes based on that topic distribution.
In a second aspect, an embodiment of the present invention provides a user attribute acquisition device. The device includes a first acquisition unit, a second acquisition unit, a third acquisition unit, a fourth acquisition unit and a fifth acquisition unit. The first acquisition unit is configured to obtain the text and images in the user's microblog posts. The second acquisition unit is configured to obtain the text input matrix corresponding to the text. The third acquisition unit is configured to obtain the image input matrix corresponding to the images. The fourth acquisition unit is configured to obtain the total input matrix based on the text input matrix and the image input matrix. The fifth acquisition unit is configured to obtain, based on the total input matrix, the topic distribution in the text and images, and to obtain the user's attributes based on that topic distribution.
The user attribute acquisition method and device provided by the embodiments of the present invention obtain the text and images in a user's microblog posts, obtain the text input matrix corresponding to the text and the image input matrix corresponding to the images, obtain a total input matrix from the two, and then, based on the total input matrix, obtain the topic distribution in the text and images and derive the user's attributes from it. The approach is efficient, accurate and practical.
Other features and advantages of the present invention will be set forth in the description that follows, and in part will be apparent from the description or may be learned by practicing the embodiments of the invention. The objects and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description, the claims and the accompanying drawings.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings required by the embodiments are briefly introduced below. It should be understood that the following drawings show only certain embodiments of the present invention and should therefore not be regarded as limiting its scope. A person of ordinary skill in the art can derive other related drawings from them without creative effort.
Fig. 1 is a structural block diagram of an electronic device provided by an embodiment of the present invention;
Fig. 2 is a flowchart of a user attribute acquisition method provided by an embodiment of the present invention;
Fig. 3 is a structural block diagram of a user attribute acquisition device provided by an embodiment of the present invention;
Fig. 4 is a structural block diagram of another user attribute acquisition device provided by an embodiment of the present invention.
Detailed description
To make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, not all, of the embodiments of the present invention. The components of the embodiments generally described and illustrated in the drawings may be arranged and designed in a variety of configurations. Therefore, the following detailed description of the embodiments provided in the drawings is not intended to limit the scope of the claimed invention, but merely represents selected embodiments. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; once an item is defined in one drawing, it need not be defined or explained again in subsequent drawings. In the description of the present invention, the terms "first", "second" and so on are used only to distinguish descriptions and are not to be understood as indicating or implying relative importance.
Fig. 1 shows a structural block diagram of an electronic device 100 that can be applied in the embodiments of the present invention. As shown in Fig. 1, the electronic device 100 may include a memory 102, a storage controller 104, one or more processors 106 (only one is shown in Fig. 1), a peripheral interface 108, an input/output module 110, an audio module 112, a display module 114, a radio frequency module 116 and a user attribute acquisition device.
The memory 102, storage controller 104, processor 106, peripheral interface 108, input/output module 110, audio module 112, display module 114 and radio frequency module 116 are electrically connected to one another, directly or indirectly, to transmit or exchange data. For example, these elements may be electrically connected through one or more communication buses or signal buses. The user attribute acquisition device includes at least one software function module that can be stored in the memory 102 in the form of software or firmware, for example the software function modules or computer programs that the device comprises.
The memory 102 can store various software programs and modules, such as the program instructions/modules corresponding to the user attribute acquisition method and device provided by the embodiments of the present application. By running the software programs and modules stored in the memory 102, the processor 106 performs various functional applications and data processing, that is, it implements the user attribute acquisition method of the embodiments of the present application.
The memory 102 may include, but is not limited to, random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and the like.
The processor 106 may be an integrated circuit chip with signal processing capability. It may be a general-purpose processor, including a central processing unit (CPU) or a network processor (NP); it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or execute the methods, steps and logic diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor or any conventional processor.
The peripheral interface 108 couples various input/output devices to the processor 106 and the memory 102. In some embodiments, the peripheral interface 108, the processor 106 and the storage controller 104 may be implemented in a single chip. In other examples, they may each be implemented by a separate chip.
The input/output module 110 allows the user to input data so that the user can interact with the electronic device 100. The input/output module 110 may be, but is not limited to, a mouse, a keyboard, and the like.
The audio module 112 provides an audio interface to the user and may include one or more microphones, one or more speakers and audio circuitry.
The display module 114 provides an interactive interface (for example, a user interface) between the electronic device 100 and the user, or displays image data for the user's reference. In this embodiment, the display module 114 may be a liquid crystal display or a touch display. A touch display may be a capacitive or resistive touch screen that supports single-point and multi-point touch operations. Supporting single-point and multi-point touch operations means that the touch display can sense touch operations produced at one or more positions on it and pass the sensed touch operations to the processor 106 for calculation and processing.
The radio frequency module 116 receives and transmits electromagnetic waves and converts between electromagnetic waves and electrical signals, so as to communicate with a communication network or other devices.
It will be appreciated that the structure shown in Fig. 1 is only illustrative. The electronic device 100 may include more or fewer components than shown in Fig. 1, or have a configuration different from that shown in Fig. 1. Each component shown in Fig. 1 may be implemented in hardware, software or a combination of the two.
In the embodiments of the present invention, the electronic device 100 may serve as a user terminal or as a server. The user terminal may be a terminal device such as a personal computer (PC), tablet computer, mobile phone, notebook computer, smart TV, set-top box or vehicle-mounted terminal.
Referring to Fig. 2, an embodiment of the present invention provides a user attribute acquisition method that includes step S200, step S210, step S220, step S230 and step S240.
Step S200: Obtain the text and images in the user's microblog posts.
The text and images in the user's microblog posts can be obtained from Sina Weibo.
Step S210: Obtain the text input matrix corresponding to the text.
In step S210, further, word segmentation is performed on the text and word frequencies are counted, yielding at least one segmented word and the frequency of each segmented word; the text input matrix corresponding to the text is then obtained from the segmented words and their frequencies.
In this embodiment, Python is used to segment the text and build a vocabulary of all words. Each row of the vocabulary is one word, and the row number is that word's index. Word frequencies (counts) are tallied per microblog post (document) to generate the text input matrix X_u corresponding to the text, where X_u(i, j) is the frequency with which the i-th word occurs in the j-th text.
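The vocabulary and count matrix described in step S210 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `build_text_matrix` is hypothetical, and whitespace splitting stands in for a real Chinese word segmenter, which the source does not name.

```python
from collections import Counter


def build_text_matrix(posts, tokenize=str.split):
    """Return (vocabulary, X) where X[i][j] counts word i in post j.

    `tokenize` is a stand-in for a word segmenter (assumption).
    """
    tokenized = [tokenize(post) for post in posts]
    # One word per vocabulary row; the row number is the word's index.
    vocabulary = sorted({word for doc in tokenized for word in doc})
    index = {word: i for i, word in enumerate(vocabulary)}
    X = [[0] * len(posts) for _ in vocabulary]
    for j, doc in enumerate(tokenized):
        for word, count in Counter(doc).items():
            X[index[word]][j] = count
    return vocabulary, X


posts = ["rain rain today", "sunny today"]
vocabulary, X_u = build_text_matrix(posts)
print(vocabulary)  # ['rain', 'sunny', 'today']
print(X_u)         # [[2, 0], [0, 1], [1, 1]]
```

Rows correspond to words and columns to posts, matching the X_u(i, j) convention used in the text.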
Step S220: Obtain the image input matrix corresponding to the images.
In step S220, further, SIFT feature extraction is performed on the image to obtain a first feature vector corresponding to the image, and the image input matrix corresponding to the image is obtained based on the first feature vector.
Step S230: Obtain the total input matrix based on the text input matrix and the image input matrix.
In step S230, further, the text input matrix, the image input matrix and a preset training set input matrix are concatenated to obtain the total input matrix.
The text input matrix, the image input matrix and the preset training set input matrix are concatenated along the text/image index dimension to obtain a concatenated matrix. The concatenated matrix is then summed block by block per user along that dimension to obtain the total input matrix. In the total input matrix each user occupies a single position along that dimension, so each user is treated as one text or image document.
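The concatenate-then-sum step can be sketched for a single modality as below. The layout is an assumption made for illustration: features are rows and documents are columns, training documents come first, and `owners` (hypothetical) names the user who posted each column. The helper names are not from the source.

```python
def splice_columns(*matrices):
    """Concatenate matrices (lists of rows) along the document (column) axis."""
    n_rows = len(matrices[0])
    if any(len(m) != n_rows for m in matrices):
        raise ValueError("all matrices must share the same feature rows")
    return [sum((m[i] for m in matrices), []) for i in range(n_rows)]


def sum_by_user(matrix, owners):
    """Sum the columns that belong to the same user: one output column per user."""
    users = list(dict.fromkeys(owners))  # unique users in first-seen order
    column_of = {user: c for c, user in enumerate(users)}
    total = [[0] * len(users) for _ in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            total[i][column_of[owners[j]]] += value
    return users, total


X_train = [[1, 0, 2], [0, 3, 1]]   # three training posts
X_user = [[4, 1], [0, 2]]          # two posts of the user being profiled
owners = ["a", "a", "b", "u", "u"]  # who wrote each column (assumption)

spliced = splice_columns(X_train, X_user)
users, X_total = sum_by_user(spliced, owners)
print(users)    # ['a', 'b', 'u']
print(X_total)  # [[1, 2, 5], [3, 1, 2]]
```

After summation each user is represented by a single column, so the model treats a user as one document.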
Further, before step S230, the method may also include: obtaining training text and training images from multiple microblog posts; obtaining a training text input matrix corresponding to the training text; obtaining a training image matrix corresponding to the training images; and obtaining the training set input matrix based on the training text input matrix and the training image matrix.
Further, SIFT feature extraction is performed on each training image to obtain a second feature vector corresponding to each training image;
based on a preset clustering algorithm and the second feature vectors of the training images, the cluster center of each class and the image features contained in each class are obtained;
the number of image features of each class contained in each training image is counted to obtain the training image matrix corresponding to the training images.
Further, after the training set input matrix is obtained from the training text input matrix and the training image matrix, the method may also include:
setting the maximum number of topics of the bottom layer of the Poisson gamma belief network;
based on the training set input matrix, randomly assigning topics to each training text and training image to obtain an initialization matrix, and generating the initial value of each probability parameter;
based on the initialization matrix and the initial values of the probability parameters, iteratively training the Poisson gamma belief network to obtain the topic distribution in the training text and training images.
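For orientation, the generative model commonly written for a T-layer Poisson gamma belief network in the literature is sketched below. The patent text does not state it, so the notation is an assumption: x_j^{(1)} is the count vector of document j, \Phi^{(t)} are the layer-wise topic matrices, \theta_j^{(t)} the layer-wise topic weights, r the top-layer weight vector, and c_j^{(t)} gamma scale parameters whose exact parameterization varies between presentations.

```latex
\begin{aligned}
\theta_j^{(T)} &\sim \mathrm{Gamma}\!\left(r,\; 1/c_j^{(T+1)}\right),\\
\theta_j^{(t)} &\sim \mathrm{Gamma}\!\left(\Phi^{(t+1)}\theta_j^{(t+1)},\; 1/c_j^{(t+1)}\right),
  \qquad t = T-1, \dots, 1,\\
x_j^{(1)} &\sim \mathrm{Poisson}\!\left(\Phi^{(1)}\theta_j^{(1)}\right).
\end{aligned}
```

Read this way, training alternates between inferring topic assignments from the observed counts (bottom to top) and resampling the weights \theta and r (top to bottom).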
Specifically, the training text and training images of more than 1000 microblog posts are obtained from Sina Weibo and used as the training set. Python is used to segment the training text and build a vocabulary of all words; each row of the vocabulary is one word, and the row number is that word's index. Word frequencies (counts) are tallied per microblog post (document) to generate the text-vocabulary (document-word_index) training text input matrix X_w, where X_w(i, j) is the frequency with which the i-th word occurs in the j-th text. SIFT features are extracted from all training images in turn to obtain multiple second feature vectors, which are concatenated. The preset clustering algorithm then clusters the second feature vectors into K classes, giving the cluster center of each class and the image features contained in each class. The number of image features of each class contained in each training image is counted to obtain the training image matrix X_v corresponding to the training images, where X_v(i, j) is the frequency with which the i-th feature class occurs in the j-th image. The preset clustering algorithm may be the K-means algorithm.
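The clustering and counting that produce X_v amount to a bag-of-visual-words histogram, which can be sketched as follows. The sketch assumes the SIFT descriptors have already been extracted (toy 2-D points stand in for real 128-dimensional descriptors) and uses a small hand-written K-means for self-containment; all function names are hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centers, point):
    """Index of the cluster center closest to `point`."""
    return min(range(len(centers)), key=lambda k: squared_distance(centers[k], point))


def kmeans(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(centers, p)].append(p)
        for c, members in enumerate(groups):
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def build_image_matrix(descriptors_per_image, k, seed=0):
    """Return (centers, X_v) where X_v[i][j] counts feature class i in image j."""
    all_descriptors = [d for image in descriptors_per_image for d in image]
    centers = kmeans(all_descriptors, k, seed=seed)
    X_v = [[0] * len(descriptors_per_image) for _ in range(k)]
    for j, image in enumerate(descriptors_per_image):
        for d in image:
            X_v[nearest(centers, d)][j] += 1
    return centers, X_v


# Toy descriptors: two images, two well-separated groups of features.
images = [
    [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1)],
    [(5.1, 5.0), (4.9, 5.0)],
]
centers, X_v = build_image_matrix(images, k=2)
print(X_v)
```

Each column of X_v sums to the number of descriptors in that image, which is what makes it usable as a count matrix alongside X_w.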
The maximum number of topics K_0max of the bottom layer of the Poisson gamma belief network is set. This determines how the number of topics extracted by the first layer changes from the bottom layer to the top layer (it decreases, that is, higher-layer topics are more general). In the training text input matrix X_w and the training image matrix X_v, every occurrence of a word/feature class in each text/image is randomly assigned a topic (K_0max topics in total). This gives the initialization matrices:
the allocation counts of each word/feature class of each microblog text/image under each bottom-layer topic, whose entries give the number of times the i-th word/feature class in the j-th text/image is assigned to the k-th topic;
the bottom-layer topic-word/topic-feature-class matrices Φ_w^(1)/Φ_v^(1), whose entries give the proportion of the i-th word/feature class under the k-th topic, taken over all texts/images;
the proportion matrices θ_w^(1)/θ_v^(1) relating the words/feature classes of each microblog text/image to each topic, whose j-th vectors give the proportion of each topic in the j-th text/image.
The initial values of the probability parameters are also generated (these have no practical meaning and only take part in the computation). In the subsequent process the meaning of each of these matrices stays the same, but their values change.
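The random initialization described above can be sketched as one uniform topic draw per occurrence of a word or feature class. The count layout `counts[i][j][k]` and the function name are assumptions chosen for illustration.

```python
import random


def init_topic_counts(X, n_topics, seed=0):
    """counts[i][j][k] = times word/feature class i in document j gets topic k."""
    rng = random.Random(seed)
    counts = [[[0] * n_topics for _ in row] for row in X]
    for i, row in enumerate(X):
        for j, frequency in enumerate(row):
            for _ in range(frequency):  # one random draw per occurrence
                counts[i][j][rng.randrange(n_topics)] += 1
    return counts


X_w = [[2, 0], [0, 1], [1, 1]]  # words x documents
K0_MAX = 4                      # maximum number of bottom-layer topics
counts = init_topic_counts(X_w, K0_MAX)

# Merging over documents gives topic-word totals; merging over words
# gives topic-document totals.
n_words, n_docs = len(X_w), len(X_w[0])
topic_word = [[sum(counts[i][j][k] for j in range(n_docs)) for i in range(n_words)]
              for k in range(K0_MAX)]
topic_doc = [[sum(counts[i][j][k] for i in range(n_words)) for j in range(n_docs)]
             for k in range(K0_MAX)]
```

Normalizing `topic_word` per topic and `topic_doc` per document would give starting proportions of the kind the text describes.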
Let the outer current layer be T. Starting from the bottom layer and proceeding to the top layer, for each T all layers at or below it (t ≤ T) are trained in two steps for a certain number of iterations (B_T + C_T) according to certain rules. Each iteration includes the following steps. From the bottom layer up to the outer current layer T, each layer samples the values of one part of the matrices. Let the inner current layer (ranging from the bottom layer to layer T) be t. When t = 1, Gibbs sampling is used: the bottom-layer topic-word/topic-feature-class proportion matrices Φ_w^(1)/Φ_v^(1) and the topic proportion matrices θ_w^(1)/θ_v^(1) are first used to resample the topic assignment of each word/feature class. After a certain number of sampling rounds, merging the assignments gives stable (no longer changing with sampling) topic-word/topic-feature-class matrices Z_w^(1)/Z_v^(1), where Z_w^(1)(k, i)/Z_v^(1)(k, i) is the number of times the i-th word/feature class is assigned to the k-th topic, together with topic-text/topic-image matrices, whose entries give the number of words/feature classes contained in the j-th text/image that are assigned to the k-th topic.
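One way to realize the bottom-layer resampling step is the multinomial split usually used for Poisson factor models: each observed count is divided among topics with probabilities proportional to the product of the topic's word proportion and the document's topic weight. The patent describes the step only in general terms, so this specific form, the names `phi` and `theta`, and the index layout are assumptions.

```python
import random


def resample_assignments(X, phi, theta, seed=0):
    """Split each count X[i][j] among topics with weights phi[i][k] * theta[k][j].

    Returns counts[i][j][k]. The weights for a nonzero count must not all be 0.
    """
    rng = random.Random(seed)
    n_topics = len(theta)
    counts = [[[0] * n_topics for _ in row] for row in X]
    for i, row in enumerate(X):
        for j, frequency in enumerate(row):
            if frequency == 0:
                continue
            weights = [phi[i][k] * theta[k][j] for k in range(n_topics)]
            for k in rng.choices(range(n_topics), weights=weights, k=frequency):
                counts[i][j][k] += 1
    return counts


X_w = [[2, 0], [0, 1], [1, 1]]                # words x documents
phi = [[0.7, 0.1], [0.1, 0.6], [0.2, 0.3]]    # words x topics, columns sum to 1
theta = [[1.5, 0.2], [0.5, 2.0]]              # topics x documents

counts = resample_assignments(X_w, phi, theta)
# Merge: topic-word counts Z_w[k][i] and topic-document counts M_w[k][j].
Z_w = [[sum(counts[i][j][k] for j in range(2)) for i in range(3)] for k in range(2)]
M_w = [[sum(counts[i][j][k] for i in range(3)) for j in range(2)] for k in range(2)]
```

Whatever the random draws, the split preserves every observed count, so the merged matrices always account for all word occurrences.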
When t ≥ 2, the number of topics K_t of this layer is first initialized to the number of topics K_(t-1) of the previous layer. Gibbs sampling is then used to sample the stable counts with which the words/feature classes belonging to each topic of the previous layer, in each microblog text/image, are assigned to each topic of the current layer (here the topics of the previous layer can be regarded as the words/feature classes under the topics of the current layer). This gives the text/image-current-layer-topic matrices, whose entries give the number of words in the j-th text/image that are assigned to the i-th topic of the current layer. Merging these counts also gives the current-layer-topic-previous-layer-topic matrices Z_w^(t)/Z_v^(t) and the topic-text/topic-image matrices. The matrices Z_w^(t)/Z_v^(t) are then used to sample, for each topic of the current layer, the proportion of each previous-layer topic's words/feature classes (the entry for the k-th topic of the current layer and the i-th topic of the previous layer is the proportion of the latter's words/feature classes under the former, taken over all texts/images).
The probability parameters are then sampled and computed layer by layer. From the outer current layer T down to the bottom layer, each layer samples the values of another part of the matrices. First, Z_w^(T)/Z_v^(T) are used to sample the weight vector r^(T) of the words/feature classes contained in each topic of the outer current layer T (the larger a topic's weight, the greater its proportion). Then r^(T) (when t = T), or the corresponding matrices of the layer above (when t < T), serve as the probability generation parameter and sampling basis for sampling the matrices of the next lower layer. Note that, starting from a certain layer, θ and the related probability parameters of all layers above it are shared between text and images: the shared θ is sampled using, as its probability parameter, the matrix obtained by concatenating the respective text and image matrices along the topic dimension, and the related shared probability parameters are in turn sampled from the shared θ. When sampling these layers, the shared θ of the higher layer is therefore multiplied by the respective text and image matrices, and the concatenation of the resulting products is used as the shared probability parameter. When the number of iterations reaches a threshold B_T, inactive topics in the current layer (topics that contain no words/feature classes of any topic of the layer below) are removed, which reduces the number of topics K_t of the current layer. Once iterative sampling has been carried out for all layers, training is complete, yielding the distribution of all microblog words and image feature classes in the training text and training images under each topic of each layer.
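The pruning of inactive topics once the iteration count reaches the threshold can be sketched as below. It assumes a count matrix `Z` laid out as current-layer topics by lower-layer items; the function name is hypothetical.

```python
def prune_inactive_topics(Z):
    """Drop topics (rows of Z) that received no counts from the layer below.

    Z[k][i] is the count of lower-layer item i assigned to current-layer topic k.
    Returns the indices of the surviving topics and the reduced matrix.
    """
    active = [k for k, row in enumerate(Z) if any(row)]
    return active, [Z[k] for k in active]


Z = [[3, 0, 1],
     [0, 0, 0],   # inactive: nothing assigned to this topic
     [0, 2, 0]]
active, Z_pruned = prune_inactive_topics(Z)
K_t = len(Z_pruned)
print(active, K_t)  # [0, 2] 2
```

The same index list would also be used to drop the matching entries of the other per-topic matrices of that layer so that all of them stay aligned.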
Step S240: Based on the total input matrix, obtain the topic distribution in the text and images, and obtain the user's attributes based on that topic distribution.
In step S240, the first network parameter, the second network parameter and a preset third network parameter of the Poisson gamma belief network are initialized. With the total input matrix as the input of the Poisson gamma belief network, sampling proceeds layer by layer from the bottom layer to the top layer, iteratively updating the first, second and third network parameters until a preset number of iterations is reached, which yields the topic distribution in the text and images.
Specifically, the first network parameter θ, the second network parameter r and the preset third network parameter Φ of the Poisson gamma belief network are initialized, where the preset third network parameter Φ is the Φ obtained by training on the training text and training images described above. All layers are trained in two steps for a certain number of iterations (B_T + C_T), and each iteration performs the following process. From the bottom layer to the top layer, the total input matrix X_w/X_v, the Φ generated from the training set, and the θ formed by concatenating the training set and the test set along the user dimension (or the shared θ^(t)) are used layer by layer to sample the topic-user matrix of the whole data set. The related probability parameters are sampled from the second layer to the top layer. At the top layer, the top-layer r obtained from training is used as the probability generation parameter to sample θ^(T) (shared by text and images). The trailing part of θ^(T) (all entries from a certain index onward along the user dimension) is the distribution of each topic in the user's microblog posts. After the iterations are complete, the distribution of each topic in the user's microblog posts is obtained.
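Reading a user's topic distribution out of the trailing part of the top-layer weights can be sketched as follows. The layout is an assumption (topics as rows, users as columns, training users first), as are the names. The source does not specify how a topic distribution is mapped to an attribute, so picking the dominant topic at the end is purely illustrative.

```python
def user_topic_distribution(theta_top, n_training_users):
    """Normalize the trailing columns of theta_top (topics x users).

    Columns before `n_training_users` belong to the training set (assumption);
    the remaining columns belong to the users being profiled.
    """
    n_topics = len(theta_top)
    distributions = []
    for u in range(n_training_users, len(theta_top[0])):
        weights = [theta_top[k][u] for k in range(n_topics)]
        total = sum(weights)
        distributions.append([w / total for w in weights])
    return distributions


theta_top = [[0.2, 1.0, 3.0],
             [0.8, 1.0, 1.0]]  # 2 topics, 2 training users + 1 new user
dist = user_topic_distribution(theta_top, n_training_users=2)
dominant = max(range(len(dist[0])), key=dist[0].__getitem__)
print(dist, dominant)  # [[0.75, 0.25]] 0
```

A downstream rule or classifier would then translate the distribution into a concrete attribute.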
The user attribute acquisition method and device provided by the embodiments of the present invention obtain the text and images in a user's microblog posts, obtain the text input matrix corresponding to the text and the image input matrix corresponding to the images, obtain a total input matrix from the two, and then, based on the total input matrix, obtain the topic distribution in the text and images and derive the user's attributes from it. The approach is efficient, accurate and practical.
Referring to Fig. 3, an embodiment of the present invention provides a user attribute acquisition device 300. The device may include a first acquisition unit 320, a second acquisition unit 330, a third acquisition unit 340, a fourth acquisition unit 350, and a fifth acquisition unit 360.
The first acquisition unit 320 is configured to obtain the text and image in a user's microblog posts.
The second acquisition unit 330 is configured to obtain the text input matrix corresponding to the text.
The second acquisition unit 330 may include a second acquisition subunit 331.
The second acquisition subunit 331 is configured to perform word segmentation on the text and count word frequencies, obtaining at least one word segment and the word frequency of each of the at least one word segment, and to obtain the text input matrix corresponding to the text based on the at least one word segment and the word frequency of each segment.
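The word-frequency step performed by the second acquisition subunit can be sketched as follows. This is a minimal illustration that assumes the text has already been segmented into tokens (the source does not name a segmentation tool) and that the matrix is laid out with one row per word and one column per user; the function name `build_text_input_matrix` is hypothetical.

```python
from collections import Counter


def build_text_input_matrix(segmented_posts, vocabulary=None):
    """Build a word-count matrix: one row per vocabulary word, one column per user.

    segmented_posts: one list of word segments per user (segmentation done upstream).
    """
    counts = [Counter(tokens) for tokens in segmented_posts]
    if vocabulary is None:
        # Use every observed word segment, in sorted order, as the vocabulary.
        vocabulary = sorted(set().union(*counts))
    matrix = [[c[word] for c in counts] for word in vocabulary]
    return vocabulary, matrix


posts = [
    ["coffee", "morning", "coffee"],
    ["football", "morning"],
]
vocab, x_w = build_text_input_matrix(posts)
print(vocab)  # ['coffee', 'football', 'morning']
print(x_w)    # [[2, 0], [0, 1], [1, 1]]
```

Passing a fixed `vocabulary` keeps the rows aligned with a matrix built earlier from training text, which matters if the two are later joined.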
The third acquisition unit 340 is configured to obtain the image input matrix corresponding to the image.
The third acquisition unit 340 may include a third acquisition subunit 341.
The third acquisition subunit 341 is configured to perform SIFT feature extraction on the image to obtain the first feature vector corresponding to the image and, based on the first feature vector, to obtain the image input matrix corresponding to the image.
The fourth acquisition unit 350 is configured to obtain a total input matrix based on the text input matrix and the image input matrix.
The fourth acquisition unit 350 may include a fourth acquisition subunit 351.
The fourth acquisition subunit 351 is configured to splice the text input matrix, the image input matrix, and a preset training set input matrix to obtain the total input matrix.
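The splicing performed by the fourth acquisition subunit can be sketched under one possible reading. The source does not spell out the matrix layout; this sketch assumes feature-by-user count matrices that share the same feature rows, with the new user's columns appended after the training users, in line with the description's statement that the training set and test set are spliced along the user dimension. The function name `splice_along_users` is hypothetical.

```python
def splice_along_users(train_matrix, test_matrix):
    """Concatenate two feature-by-user count matrices column-wise, training users first."""
    if len(train_matrix) != len(test_matrix):
        raise ValueError("both matrices must have the same number of feature rows")
    return [train_row + test_row for train_row, test_row in zip(train_matrix, test_matrix)]


train_x = [[2, 0], [0, 1], [1, 1]]  # 3 features x 2 training users
test_x = [[1], [0], [4]]            # 3 features x 1 new user
total_x = splice_along_users(train_x, test_x)
print(total_x)  # [[2, 0, 1], [0, 1, 0], [1, 1, 4]]
```

Under this reading the same operation would be applied once to the text matrices and once to the image matrices.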
The fifth acquisition unit 360 is configured to obtain, based on the total input matrix, the topic distribution in the text and the image, and to obtain the attributes of the user based on the topic distribution.
The fifth acquisition unit 360 may include a fifth acquisition subunit 361.
The fifth acquisition subunit 361 is configured to initialize the first network parameter, the second network parameter, and the preset third network parameter of the Poisson gamma belief network; to use the total input matrix as the input of the Poisson gamma belief network; and to sample layer by layer from the bottom layer of the network to the top layer, iteratively updating the first network parameter, the second network parameter, and the third network parameter until a preset number of iterations is reached, thereby obtaining the topic distribution in the text and the image.
Referring to Fig. 4, the device 300 may further include a training unit 310.
The training unit 310 is configured to obtain training text and training images from multiple microblog posts; obtain the training text input matrix corresponding to the training text; obtain the training image matrix corresponding to the training images; and obtain the training set input matrix based on the training text input matrix and the training image matrix.
The training unit 310 may include a training subunit 311.
The training subunit 311 is further configured to perform SIFT feature extraction on each training image to obtain the second feature vector corresponding to each training image; to obtain, based on a preset clustering algorithm and the second feature vector of each training image, the cluster center of each class and the image features contained in each class; and to count the number of those image features contained in each training image, obtaining the training image matrix corresponding to the multiple training images.
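The counting step of the training subunit resembles a bag-of-visual-words histogram, which can be sketched as below. The sketch assumes the cluster centers have already been produced by some clustering algorithm (the source says only "preset clustering algorithm"), uses squared Euclidean distance for assignment, and lays the result out as clusters by images; these choices and the function names are assumptions for illustration.

```python
def nearest_center(descriptor, centers):
    """Return the index of the cluster center closest to the descriptor (squared Euclidean)."""
    distances = [sum((d - c) ** 2 for d, c in zip(descriptor, center)) for center in centers]
    return distances.index(min(distances))


def build_image_matrix(descriptors_per_image, centers):
    """Count, for each image, how many of its descriptors fall into each cluster.

    Returns a clusters-by-images count matrix.
    """
    matrix = [[0] * len(descriptors_per_image) for _ in centers]
    for image_index, descriptors in enumerate(descriptors_per_image):
        for descriptor in descriptors:
            matrix[nearest_center(descriptor, centers)][image_index] += 1
    return matrix


# Toy 2-D "descriptors" for readability; standard SIFT descriptors have 128 dimensions.
centers = [[0.0, 0.0], [10.0, 10.0]]
images = [
    [[0.5, 1.0], [9.0, 9.5], [0.0, 0.2]],
    [[10.0, 11.0]],
]
print(build_image_matrix(images, centers))  # [[2, 0], [1, 1]]
```

Each column is the visual-word histogram of one image, so the result has the same count-matrix form as the word-frequency matrix built from text.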
The training unit 310 is further configured to set the maximum number of topics for the bottom layer of the Poisson gamma belief network; to randomly assign a topic to each training text and each training image based on the training set input matrix, obtaining an initialization matrix and generating an initial value for each probability parameter; and to iteratively train the Poisson gamma belief network based on the initialization matrix and the initial values of the probability parameters, obtaining the topic distribution in the training text and the training images.
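The random topic assignment used for initialization can be sketched as follows. The source does not say at what granularity topics are assigned; this sketch assumes one random topic per counted token, as is common when initializing sampling-based topic models, and returns two count matrices as the "initialization matrix". The function name `random_topic_init`, the `seed` parameter, and the returned layout are hypothetical.

```python
import random


def random_topic_init(count_matrix, max_topics, seed=0):
    """Randomly assign a topic to every counted token.

    count_matrix: features x users counts (the training set input matrix).
    Returns (feature_topic, topic_user) count matrices.
    """
    rng = random.Random(seed)
    n_features, n_users = len(count_matrix), len(count_matrix[0])
    feature_topic = [[0] * max_topics for _ in range(n_features)]
    topic_user = [[0] * n_users for _ in range(max_topics)]
    for f in range(n_features):
        for u in range(n_users):
            for _ in range(count_matrix[f][u]):
                k = rng.randrange(max_topics)  # uniform random topic for this token
                feature_topic[f][k] += 1
                topic_user[k][u] += 1
    return feature_topic, topic_user


counts = [[2, 0], [1, 3]]
feature_topic, topic_user = random_topic_init(counts, max_topics=3)
# Every token is assigned exactly once, so totals are preserved.
print(sum(map(sum, feature_topic)) == sum(map(sum, counts)))
```

Whatever the random draw, each row total of `feature_topic` equals the corresponding feature's total count, and each user's column total in `topic_user` equals that user's total count.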
Each of the above units may be implemented in software code, in which case the units may be stored in the memory 102. Each of the above units may equally be implemented in hardware, such as an integrated circuit chip.
The implementation principle and technical effects of the user attribute acquisition device 300 provided by the embodiments of the present invention are the same as those of the foregoing method embodiments. For brevity, where the device embodiments omit a detail, refer to the corresponding content in the foregoing method embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed device and method may also be implemented in other ways. The device embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the accompanying drawings show the possible architecture, functions, and operations of devices, methods, and computer program products according to multiple embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and any combination of such blocks, may be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
In addition, the functional modules in the embodiments of the present invention may be integrated to form one independent part, each module may exist separately, or two or more modules may be integrated to form one independent part.
If the functions are implemented as software functional modules and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc. It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the invention. For those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection. It should be noted that similar reference numerals and letters denote similar items in the accompanying drawings; therefore, once an item is defined in one drawing, it need not be further defined or explained in subsequent drawings.
The foregoing describes only specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection of the claims.
Claims (10)
- 1. A user attribute acquisition method, applied to an electronic device, characterized in that the method comprises: obtaining the text and image in a user's microblog posts; obtaining a text input matrix corresponding to the text; obtaining an image input matrix corresponding to the image; obtaining a total input matrix based on the text input matrix and the image input matrix; and obtaining, based on the total input matrix, the topic distribution in the text and the image, and obtaining the attributes of the user based on the topic distribution.
- 2. The method according to claim 1, characterized in that obtaining the text input matrix corresponding to the text comprises: performing word segmentation on the text and counting word frequencies to obtain at least one word segment and the word frequency of each of the at least one word segment; and obtaining the text input matrix corresponding to the text based on the at least one word segment and the word frequency of each segment.
- 3. The method according to claim 1, characterized in that obtaining the image input matrix corresponding to the image comprises: performing SIFT feature extraction on the image to obtain a first feature vector corresponding to the image and, based on the first feature vector, obtaining the image input matrix corresponding to the image.
- 4. The method according to claim 1, characterized in that obtaining the total input matrix based on the text input matrix and the image input matrix comprises: splicing the text input matrix, the image input matrix, and a preset training set input matrix to obtain the total input matrix.
- 5. The method according to claim 4, characterized in that, before the text input matrix, the image input matrix, and the preset training set input matrix are spliced to obtain the total input matrix, the method further comprises: obtaining training text and training images from multiple microblog posts; obtaining a training text input matrix corresponding to the training text; obtaining a training image matrix corresponding to the training images; and obtaining the training set input matrix based on the training text input matrix and the training image matrix.
- 6. The method according to claim 5, characterized in that obtaining the training image matrix corresponding to the training images comprises: performing SIFT feature extraction on each training image to obtain a second feature vector corresponding to each training image; obtaining, based on a preset clustering algorithm and the second feature vector corresponding to each training image, the cluster center of each class and the image features contained in each class; and counting the number of the image features contained in each training image to obtain the training image matrix corresponding to the multiple training images.
- 7. The method according to claim 5, characterized in that, after the training set input matrix is obtained based on the training text input matrix and the training image matrix, the method further comprises: setting the maximum number of topics for the bottom layer of a Poisson gamma belief network; randomly assigning a topic to each training text and each training image based on the training set input matrix, obtaining an initialization matrix and generating an initial value for each probability parameter; and iteratively training the Poisson gamma belief network based on the initialization matrix and the initial values of the probability parameters to obtain the topic distribution in the training text and the training images.
- 8. The method according to claim 1, characterized in that obtaining the topic distribution in the text and the image based on the total input matrix comprises: initializing a first network parameter, a second network parameter, and a preset third network parameter of a Poisson gamma belief network; and using the total input matrix as the input of the Poisson gamma belief network, sampling layer by layer from the bottom layer of the Poisson gamma belief network to the top layer, and iteratively updating the first network parameter, the second network parameter, and the third network parameter until a preset number of iterations is reached, to obtain the topic distribution in the text and the image.
- 9. A user attribute acquisition device, characterized in that the device comprises: a first acquisition unit, configured to obtain the text and image in a user's microblog posts; a second acquisition unit, configured to obtain a text input matrix corresponding to the text; a third acquisition unit, configured to obtain an image input matrix corresponding to the image; a fourth acquisition unit, configured to obtain a total input matrix based on the text input matrix and the image input matrix; and a fifth acquisition unit, configured to obtain, based on the total input matrix, the topic distribution in the text and the image, and to obtain the attributes of the user based on the topic distribution.
- 10. The device according to claim 9, characterized in that the second acquisition unit comprises: a second acquisition subunit, configured to perform word segmentation on the text and count word frequencies to obtain at least one word segment and the word frequency of each of the at least one word segment, and to obtain the text input matrix corresponding to the text based on the at least one word segment and the word frequency of each segment.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710738930.0A CN107480289B (en) | 2017-08-24 | 2017-08-24 | User attribute acquisition method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710738930.0A CN107480289B (en) | 2017-08-24 | 2017-08-24 | User attribute acquisition method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107480289A (en) | 2017-12-15 |
CN107480289B (en) | 2020-06-30 |
Family
ID=60602525
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710738930.0A Active CN107480289B (en) | 2017-08-24 | 2017-08-24 | User attribute acquisition method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107480289B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108984689A (en) * | 2018-07-02 | 2018-12-11 | 广东睿江云计算股份有限公司 | More copies synchronized method and devices in a kind of union file system |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103838836A (en) * | 2014-02-25 | 2014-06-04 | 中国科学院自动化研究所 | Multi-modal data fusion method and system based on discriminant multi-modal deep confidence network |
CN104361059A (en) * | 2014-11-03 | 2015-02-18 | 中国科学院自动化研究所 | Harmful information identification and web page classification method based on multi-instance learning |
CN105426356A (en) * | 2015-10-29 | 2016-03-23 | 杭州九言科技股份有限公司 | Target information identification method and apparatus |
CN105760507A (en) * | 2016-02-23 | 2016-07-13 | 复旦大学 | Cross-modal subject correlation modeling method based on deep learning |
CN106446117A (en) * | 2016-09-18 | 2017-02-22 | 西安电子科技大学 | Text analysis method based on poisson-gamma belief network |
Non-Patent Citations (3)
Title |
---|
MINGYUAN ZHOU: ""The Poisson Gamma Belief Network"", 《ARXIV:1511.02199V1[STAT.ML]》 * |
XIU HUANG 等: ""A Deep Approach for Multi-modal User Attribute Modeling"", 《ADC 2017: DATABASES THEORY AND APPLICATIONS》 * |
黄秀: ""基于多模态社交媒体数据源的用户画像构建的研究"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108984689A (en) * | 2018-07-02 | 2018-12-11 | 广东睿江云计算股份有限公司 | More copies synchronized method and devices in a kind of union file system |
CN108984689B (en) * | 2018-07-02 | 2021-08-03 | 广东睿江云计算股份有限公司 | Multi-copy synchronization method and device in combined file system |
Also Published As
Publication number | Publication date |
---|---|
CN107480289B (en) | 2020-06-30 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||