CN107480289A - User property acquisition methods and device - Google Patents

User property acquisition methods and device

Publication number
CN107480289A
Authority
CN
China
Prior art keywords
text
input matrix
image
training
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710738930.0A
Other languages
Chinese (zh)
Other versions
CN107480289B (en
Inventor
杨阳
黄秀
杨子豪
沈复民
谢宁
申恒涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Macao Haichuan Technology Co Ltd
Original Assignee
Chengdu Macao Haichuan Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Macao Haichuan Technology Co Ltd filed Critical Chengdu Macao Haichuan Technology Co Ltd
Priority to CN201710738930.0A priority Critical patent/CN107480289B/en
Publication of CN107480289A publication Critical patent/CN107480289A/en
Application granted granted Critical
Publication of CN107480289B publication Critical patent/CN107480289B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • Marketing (AREA)
  • Computational Linguistics (AREA)
  • Economics (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Computing Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Resources & Organizations (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • User Interface Of Digital Computer (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present invention provide a user attribute acquisition method and device, relating to the field of data processing. The method includes: obtaining the text and images in a user's microblog posts; obtaining a text input matrix corresponding to the text; obtaining an image input matrix corresponding to the images; obtaining a total input matrix based on the text input matrix and the image input matrix; and then, based on the total input matrix, obtaining the topic distribution in the text and the images and, based on that topic distribution, obtaining the attributes of the user. The method is efficient, accurate, and practical.

Description

User property acquisition methods and device
Technical field
The present invention relates to the technical field of data processing, and in particular to a user attribute acquisition method and device.
Background art
At present, existing methods such as the Poisson gamma belief network (Poisson Gamma Belief Network, PGBN) can obtain the attributes of a user by processing text content, but they cannot be applied directly in a large-scale social media environment, and they are inefficient and inaccurate.
Summary of the invention
An object of the present invention is to provide a user attribute acquisition method and device that remedy the above problems. To achieve this object, the present invention adopts the following technical solutions:
In a first aspect, an embodiment of the present invention provides a user attribute acquisition method. The method includes: obtaining the text and images in a user's microblog posts; obtaining a text input matrix corresponding to the text; obtaining an image input matrix corresponding to the images; obtaining a total input matrix based on the text input matrix and the image input matrix; and, based on the total input matrix, obtaining the topic distribution in the text and the images and, based on the topic distribution, obtaining the attributes of the user.
In a second aspect, an embodiment of the present invention provides a user attribute acquisition device. The device includes a first acquisition unit, a second acquisition unit, a third acquisition unit, a fourth acquisition unit, and a fifth acquisition unit. The first acquisition unit is configured to obtain the text and images in the user's microblog posts. The second acquisition unit is configured to obtain the text input matrix corresponding to the text. The third acquisition unit is configured to obtain the image input matrix corresponding to the images. The fourth acquisition unit is configured to obtain the total input matrix based on the text input matrix and the image input matrix. The fifth acquisition unit is configured to obtain, based on the total input matrix, the topic distribution in the text and the images and, based on the topic distribution, the attributes of the user.
The user attribute acquisition method and device provided by the embodiments of the present invention obtain the text and images in a user's microblog posts; obtain a text input matrix corresponding to the text; obtain an image input matrix corresponding to the images; obtain a total input matrix based on the text input matrix and the image input matrix; and then, based on the total input matrix, obtain the topic distribution in the text and the images and, based on the topic distribution, obtain the attributes of the user. The method and device are efficient, accurate, and practical.
Other features and advantages of the present invention will be set forth in the description that follows and will in part become apparent from the description or be understood by practicing the embodiments of the present invention. The objects and other advantages of the present invention can be realized and obtained by the structures particularly pointed out in the written description, the claims, and the accompanying drawings.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed for the embodiments are briefly introduced below. It should be understood that the following drawings show only certain embodiments of the present invention and should therefore not be regarded as limiting its scope; those of ordinary skill in the art can derive other related drawings from these drawings without creative effort.
Fig. 1 is a structural block diagram of an electronic device provided by an embodiment of the present invention;
Fig. 2 is a flowchart of a user attribute acquisition method provided by an embodiment of the present invention;
Fig. 3 is a structural block diagram of a user attribute acquisition device provided by an embodiment of the present invention;
Fig. 4 is a structural block diagram of another user attribute acquisition device provided by an embodiment of the present invention.
Detailed description
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, not all, of the embodiments of the present invention. The components of the embodiments generally described and illustrated in the drawings herein can be arranged and designed in a variety of configurations. Therefore, the following detailed description of the embodiments provided in the drawings is not intended to limit the scope of the claimed invention but merely represents selected embodiments. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it need not be further defined or explained in subsequent drawings. In addition, in the description of the present invention, the terms "first", "second", and so on are used only to distinguish between descriptions and should not be understood as indicating or implying relative importance.
Fig. 1 shows a structural block diagram of an electronic device 100 that can be used in the embodiments of the present invention. As shown in Fig. 1, the electronic device 100 may include a memory 102, a memory controller 104, one or more processors 106 (only one is shown in Fig. 1), a peripheral interface 108, an input/output module 110, an audio module 112, a display module 114, a radio-frequency module 116, and a user attribute acquisition device.
The memory 102, memory controller 104, processor 106, peripheral interface 108, input/output module 110, audio module 112, display module 114, and radio-frequency module 116 are electrically connected to one another, directly or indirectly, to enable the transmission or exchange of data. For example, these elements may be electrically connected through one or more communication buses or signal buses. The user attribute acquisition method includes at least one software function module that can be stored in the memory 102 in the form of software or firmware, such as the software function module or computer program included in the user attribute acquisition device.
The memory 102 can store various software programs and modules, such as the program instructions/modules corresponding to the user attribute acquisition method and device provided by the embodiments of the present application. The processor 106 runs the software programs and modules stored in the memory 102 to execute various functional applications and data processing, thereby implementing the user attribute acquisition method of the embodiments of the present application.
The memory 102 may include, but is not limited to, random access memory (Random Access Memory, RAM), read-only memory (Read Only Memory, ROM), programmable read-only memory (Programmable Read-Only Memory, PROM), erasable programmable read-only memory (Erasable Programmable Read-Only Memory, EPROM), electrically erasable programmable read-only memory (Electric Erasable Programmable Read-Only Memory, EEPROM), and the like.
The processor 106 may be an integrated circuit chip with signal processing capability. The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The peripheral interface 108 couples various input/output devices to the processor 106 and the memory 102. In some embodiments, the peripheral interface 108, the processor 106, and the memory controller 104 may be implemented in a single chip. In other examples, they may each be implemented in a separate chip.
The input/output module 110 lets the user input data, enabling interaction between the user and the electronic device 100. The input/output module 110 may be, but is not limited to, a mouse, a keyboard, and the like.
The audio module 112 provides an audio interface to the user and may include one or more microphones, one or more speakers, and audio circuitry.
The display module 114 provides an interactive interface (such as a user interface) between the electronic device 100 and the user, or displays image data for the user's reference. In this embodiment, the display module 114 may be a liquid crystal display or a touch display. A touch display may be a capacitive or resistive touch screen that supports single-point and multi-point touch operations. Supporting single-point and multi-point touch operations means that the touch display can sense touch operations produced simultaneously at one or more positions on the display and hand the sensed touch operations to the processor 106 for calculation and processing.
The radio-frequency module 116 receives and sends electromagnetic waves and converts between electromagnetic waves and electrical signals, so as to communicate with a communication network or other devices.
It can be understood that the structure shown in Fig. 1 is only illustrative; the electronic device 100 may include more or fewer components than shown in Fig. 1, or have a configuration different from that shown in Fig. 1. Each component shown in Fig. 1 may be implemented in hardware, software, or a combination thereof.
In the embodiments of the present invention, the electronic device 100 may serve as a user terminal or as a server. The user terminal may be a terminal device such as a personal computer (PC), tablet computer, mobile phone, notebook computer, smart TV, set-top box, or in-vehicle terminal.
Referring to Fig. 2, an embodiment of the present invention provides a user attribute acquisition method that includes step S200, step S210, step S220, step S230, and step S240.
Step S200: Obtain the text and images in the user's microblog posts.
The text and images in the user's microblog posts can be obtained from Sina Weibo.
Step S210: Obtain the text input matrix corresponding to the text.
Further, in step S210, word segmentation is performed on the text and word frequencies are counted, yielding at least one segmented word and the frequency of each segmented word; the text input matrix corresponding to the text is then obtained from the segmented words and their frequencies.
In this embodiment, Python is used to segment the text and to build a vocabulary of all words. Each row of the vocabulary is one word, and the row number of a word is its index value (index). Word frequencies (count) are counted per microblog post (document) to generate the text input matrix X_u corresponding to the text, where X_u(i, j) is the frequency with which the i-th word occurs in the j-th text.
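The vocabulary and word-frequency construction described for step S210 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_text_input_matrix` is hypothetical, and whitespace splitting stands in for a real Chinese word segmenter, which the source does not name.

```python
from collections import Counter

def build_text_input_matrix(documents):
    """Return (vocabulary, X) where X[i][j] counts word i in document j."""
    # Assumption: whitespace splitting replaces a real word segmenter.
    tokenized = [doc.split() for doc in documents]
    # Each distinct word appears once; its position is the word's index.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    index = {word: i for i, word in enumerate(vocabulary)}
    # One row per word, one column per document (microblog post).
    matrix = [[0] * len(documents) for _ in vocabulary]
    for j, tokens in enumerate(tokenized):
        for word, count in Counter(tokens).items():
            matrix[index[word]][j] = count
    return vocabulary, matrix

vocabulary, X_u = build_text_input_matrix(["cat dog cat", "dog bird"])
```

Sorting the vocabulary is only a convenience that makes the indices reproducible; any fixed word order would serve.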
Step S220: Obtain the image input matrix corresponding to the images.
Further, in step S220, SIFT feature extraction is performed on the images to obtain a first feature vector corresponding to the images, and the image input matrix corresponding to the images is obtained from the first feature vector.
Step S230: Obtain the total input matrix based on the text input matrix and the image input matrix.
Further, in step S230, the text input matrix, the image input matrix, and a preset training set input matrix are spliced to obtain the total input matrix.
Specifically, the text input matrix, the image input matrix, and the preset training set input matrix are spliced along the text/image index dimension to obtain a spliced matrix; the spliced matrix is then summed block by block, per user, along that dimension to obtain the total input matrix. In the total input matrix, each column represents a user, and each user is treated as a single text or image.
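The splicing and per-user summation can be sketched as below for a single modality. The sketch assumes that texts/images are columns, as in the X(i, j) definition, and that the matrices being spliced share the same row (word or feature class) indexing; the helper names and the `owners` list that maps each column to a user are hypothetical.

```python
def splice_columns(*matrices):
    """Concatenate matrices with equal row counts along the document (column) axis."""
    rows = len(matrices[0])
    assert all(len(m) == rows for m in matrices)
    return [sum((m[i] for m in matrices), []) for i in range(rows)]

def sum_by_user(matrix, owners):
    """Sum the columns of each user; owners[j] names the user who wrote column j."""
    users = sorted(set(owners))
    position = {user: n for n, user in enumerate(users)}
    total = [[0] * len(users) for _ in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            total[i][position[owners[j]]] += value
    return users, total

training = [[1, 0], [0, 2]]   # two training posts
new_posts = [[3], [1]]        # one post of the user being profiled
spliced = splice_columns(training, new_posts)
users, X_total = sum_by_user(spliced, ["a", "a", "b"])
```

After the summation each user contributes exactly one column, which is why a user can be treated as a single text or image downstream.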
Further, before step S230, the method may also include: obtaining training text and training images from multiple microblog posts; obtaining a training text input matrix corresponding to the training text; obtaining a training image matrix corresponding to the training images; and obtaining the training set input matrix based on the training text input matrix and the training image matrix.
Further, SIFT feature extraction is performed on each training image to obtain a second feature vector corresponding to each training image;
based on a preset clustering algorithm and the second feature vector corresponding to each training image, the cluster center of each class and the image features that each class contains are obtained;
the number of image features of each class contained in each training image is counted to obtain the training image matrix corresponding to the multiple training images.
Further, after the training set input matrix is obtained based on the training text input matrix and the training image matrix, the method may also include:
setting the maximum number of topics of the bottom layer of the Poisson gamma belief network;
based on the training set input matrix, randomly assigning topics to each training text and each training image to obtain initialization matrices and to generate an initial value for each probability parameter;
based on the initialization matrices and the initial values of the probability parameters, iteratively training the Poisson gamma belief network to obtain the topic distribution in the training text and the training images.
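For orientation, the generative structure of a T-layer Poisson gamma belief network, as it is commonly stated in the PGBN literature, can be written as follows. Here Gamma(a, b) denotes shape a and scale b, and x_j^{(1)} is the count vector of the j-th text/image. The scale variables c_j are not named in this document and are included as an assumption; θ, r, and Φ correspond to the first, second, and third network parameters named in step S240.

```latex
\begin{aligned}
\theta_j^{(T)} &\sim \mathrm{Gamma}\!\left(r,\; 1/c_j^{(T+1)}\right),\\
\theta_j^{(t)} &\sim \mathrm{Gamma}\!\left(\Phi^{(t+1)}\theta_j^{(t+1)},\; 1/c_j^{(t+1)}\right), \qquad t = T-1,\dots,1,\\
x_j^{(1)} &\sim \mathrm{Poisson}\!\left(\Phi^{(1)}\theta_j^{(1)}\right).
\end{aligned}
```

Read from top to bottom, the top-layer weights r generate the top-layer topic proportions, each layer's proportions generate those of the layer below through Φ, and the bottom layer generates the observed word/feature-class counts.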
Specifically, the training text and training images of more than 1000 microblog posts are obtained from Sina Weibo and used as the training set. Python is used to segment the training text and build a vocabulary of all words; each row of the vocabulary is one word, and the row number of a word is its index value (index). Word frequencies (count) are counted per microblog post (document) to generate the text-vocabulary (document-word_index) training text input matrix X_w, where X_w(i, j) is the frequency with which the i-th word occurs in the j-th text. SIFT features are extracted from all training images in turn, yielding multiple second feature vectors. These second feature vectors are concatenated and clustered into K classes with the preset clustering algorithm, yielding the cluster center of each class and the image features that each class contains. The number of image features of each class contained in each training image is then counted to obtain the training image matrix X_v corresponding to the training images, where X_v(i, j) is the frequency with which the i-th feature class occurs in the j-th image. The preset clustering algorithm may be the K-means algorithm.
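The clustering and counting that produce X_v can be sketched as a small bag-of-visual-words routine. The sketch takes descriptors as plain lists of floats (real SIFT descriptors are 128-dimensional and would come from an image library), uses a basic Lloyd-style K-means written out by hand, and all function names are hypothetical.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(centers, point):
    """Index of the cluster center closest to point."""
    return min(range(len(centers)), key=lambda k: squared_distance(centers[k], point))

def kmeans(points, k, iterations=20, seed=0):
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(centers, p)].append(p)
        for c, members in enumerate(groups):
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers

def build_training_image_matrix(descriptors_per_image, k):
    """Return (centers, X) where X[i][j] counts descriptors of class i in image j."""
    pooled = [d for image in descriptors_per_image for d in image]
    centers = kmeans(pooled, k)
    matrix = [[0] * len(descriptors_per_image) for _ in range(k)]
    for j, image in enumerate(descriptors_per_image):
        for d in image:
            matrix[nearest(centers, d)][j] += 1
    return centers, matrix
```

Each cluster plays the role of a visual "word", so X_v has the same word-by-document shape as X_w and can be fed to the same topic model.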
The maximum number of topics of the bottom layer of the Poisson gamma belief network, K0max, is set. Because the topics of the higher layers are derived from those extracted at the first layer, this value bounds how the number of topics changes from the bottom layer to the top layer (the number decreases layer by layer; that is, higher-layer topics are more general). In the training text input matrix X_w and the training image matrix X_v, every occurrence of a word/feature class in each text/image is randomly assigned a topic (K0max topics in total), which yields the initialization matrices. The first holds the frequency with which each word/feature class of each microblog text/image is assigned to each bottom-layer topic; its entries give the number of times the i-th word/feature class in the j-th text/image is assigned to the k-th topic. The second is the bottom-layer topic-word/topic-feature-class matrix; its entries give the proportion of the i-th word/feature class under the k-th topic, taken over all texts/pictures. The third is the matrix of topic proportions for the words/feature classes contained in each microblog text/image; each of its vectors gives the proportion of each topic in the j-th text/picture. Initial values of the probability parameters are also generated (these have no practical meaning and merely take part in the calculation). In the subsequent process, the meaning of each of these matrices stays the same, but their values change.
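The random initialization can be sketched as follows. The sketch assigns every occurrence to a uniformly random topic and then derives the two proportion matrices from the resulting counts; deriving the initial proportions this way is an assumption, since the source does not say how they are computed, and the function names are hypothetical.

```python
import random

def random_topic_init(X, num_topics, seed=0):
    """Assign every occurrence in X to a random topic.

    X[i][j] is the count of word/feature class i in text/image j.
    Returns counts with counts[k][i][j] = occurrences of i in j given to topic k.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    n_items, n_docs = len(X), len(X[0])
    counts = [[[0] * n_docs for _ in range(n_items)] for _ in range(num_topics)]
    for i in range(n_items):
        for j in range(n_docs):
            for _ in range(X[i][j]):
                counts[rng.randrange(num_topics)][i][j] += 1
    return counts

def topic_item_shares(counts):
    """shares[k][i]: share of item i within topic k, taken over all documents."""
    shares = []
    for topic in counts:
        per_item = [sum(row) for row in topic]
        total = sum(per_item)
        shares.append([n / total if total else 0.0 for n in per_item])
    return shares

def topic_shares_per_document(counts):
    """shares[k][j]: share of topic k among the occurrences in document j."""
    n_topics, n_docs = len(counts), len(counts[0][0])
    per_topic = [[sum(row[j] for row in topic) for j in range(n_docs)] for topic in counts]
    totals = [sum(per_topic[k][j] for k in range(n_topics)) for j in range(n_docs)]
    return [[per_topic[k][j] / totals[j] if totals[j] else 0.0 for j in range(n_docs)]
            for k in range(n_topics)]
```

Summing `counts` over the topic axis always recovers X, which is the invariant the later resampling steps must also preserve.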
Let the outer current layer be T. Starting from the bottom layer and proceeding step by step to the top layer, for each T all layers at or below this layer (t ≤ T) are trained in two steps, according to certain rules, for a certain number of iterations (B_T + C_T). Each iteration includes the following steps. From the bottom layer upward to the outer current layer T, the values of one part of the matrices are sampled layer by layer. Let the inner current layer (the current layer between the bottom layer and layer T) be t. When t = 1, Gibbs sampling is used: the topics of the words/feature classes are first resampled and reassigned using the current values of the initialized matrices. After a certain number of samples, merging yields a stable (no longer changing with sampling) topic-word/topic-feature-class matrix Z_w(1)/Z_v(1), where Z_w(1)(k, i)/Z_v(1)(k, i) is the frequency with which the i-th word/feature class is assigned to the k-th topic, and a topic-text/topic-picture matrix, whose entries give the number of words/feature classes contained in the j-th text/picture that are assigned to the k-th topic.
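One pass of the first-layer resampling can be sketched as follows. The sketch assumes the usual PGBN rule that each occurrence of item i in document j is given to topic k with probability proportional to phi[k][i] * theta[k][j]; the source text does not spell out this rule, and the names `phi`, `theta`, and `resample_first_layer` are hypothetical.

```python
import random

def resample_first_layer(X, phi, theta, seed=0):
    """Redistribute the counts in X over topics and merge the result.

    X[i][j]: count of item i in document j.
    phi[k][i]: proportion of item i under topic k.
    theta[k][j]: weight of topic k in document j.
    Returns (topic_item, topic_doc): topic_item[k][i] and topic_doc[k][j] counts.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    n_items, n_docs, n_topics = len(X), len(X[0]), len(theta)
    topic_item = [[0] * n_items for _ in range(n_topics)]
    topic_doc = [[0] * n_docs for _ in range(n_topics)]
    for i in range(n_items):
        for j in range(n_docs):
            if X[i][j] == 0:
                continue
            weights = [phi[k][i] * theta[k][j] for k in range(n_topics)]
            if sum(weights) <= 0:
                weights = [1.0] * n_topics  # fall back to a uniform choice
            for k in rng.choices(range(n_topics), weights=weights, k=X[i][j]):
                topic_item[k][i] += 1   # merged over documents
                topic_doc[k][j] += 1    # merged over items
    return topic_item, topic_doc
```

A full sampler would alternate this step with updates of `phi` and `theta`; only the count reallocation and the merging into the two matrices are shown here.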
When t ≥ 2, the number of topics of this layer, K_t, is first initialized to the number of topics of the previous layer (layer t-1), K_(t-1). Then, by Gibbs sampling, the stable frequencies with which the words/feature classes of each previous-layer topic contained in each microblog text/image are assigned to each topic of the current layer are sampled (here, the topics of the previous layer can be regarded as the words/feature classes under the topics of the current layer). This yields the text/image-to-current-layer-topic matrices, whose entries give the number of words in the j-th text/image that are assigned to the i-th topic of the current layer. Merging these yields the current-layer-topic to previous-layer-topic matrix Z_w(t)/Z_v(t) and the topic-text/topic-image matrix. The matrix Z_w(t)/Z_v(t) is then used to sample, for each topic of the current layer, the proportion of the words/feature classes of each previous-layer topic; the entries of the result give the proportion, under the k-th topic of the current layer, of the words/feature classes of the i-th previous-layer topic, taken over all texts/images.
Probability parameters are then sampled and calculated layer by layer. From the outer current layer T downward to the bottom layer, the values of the other part of the matrices are sampled layer by layer. First, Z_w(T)/Z_v(T) is used to sample the weight vector r(T) of the words/feature classes contained in each topic of the outer current layer T (the larger a topic's weight, the greater its share). Then r(T) (when t = T) or the corresponding matrices of the layer above (when t < T) are used as the probability generation parameters and the basis for sampling, and the matrices of the next lower layer are sampled. It is worth noting that, from a certain layer upward, θ and the related probability parameters of all layers above it are shared by text and images (the common θ is sampled using, as the probability parameter, the matrix obtained by splicing the respective text and image matrices along the topic dimension; the related common probability parameters are sampled from the common θ). Therefore, when these layers are sampled, the matrix obtained by multiplying the shared θ of the higher layer by the respective text and image matrices and splicing the products is used as the common probability parameter for sampling. When the number of iterations reaches a certain threshold B_T, inactive topics in the current layer (that is, topics that contain no words/feature classes of any topic of the next lower layer) are removed, reducing the number of topics K_t of the current layer. After iterative sampling has been carried out for all layers, training is complete, and the distribution of all microblog words and picture feature classes of the training text and training images under each topic of each layer is obtained.
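The pruning of inactive topics once the iteration count reaches the threshold can be sketched as below. The sketch assumes that the merged count matrices are indexed by topic first and treats a topic as inactive when its row of counts is all zero; the function name is hypothetical.

```python
def prune_inactive_topics(topic_item, topic_doc):
    """Drop topics that received no items; return the pruned matrices and kept indices.

    topic_item[k][i]: count of lower-layer item i assigned to topic k.
    topic_doc[k][j]: count of items in document j assigned to topic k.
    """
    active = [k for k, row in enumerate(topic_item) if sum(row) > 0]
    pruned_item = [topic_item[k] for k in active]
    pruned_doc = [topic_doc[k] for k in active]
    return pruned_item, pruned_doc, active

topic_item = [[2, 0], [0, 0], [0, 4]]
topic_doc = [[2, 0], [0, 0], [1, 3]]
pruned_item, pruned_doc, kept = prune_inactive_topics(topic_item, topic_doc)
```

The new number of topics for the layer is simply `len(kept)`, and any parameter indexed by topic has to be reduced with the same index list.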
Step S240: Based on the total input matrix, obtain the topic distribution in the text and the images, and, based on the topic distribution, obtain the attributes of the user.
In step S240, a first network parameter, a second network parameter, and a preset third network parameter of the Poisson gamma belief network are initialized. With the total input matrix as the input of the Poisson gamma belief network, sampling proceeds layer by layer from the bottom layer to the top layer of the network, and the first, second, and third network parameters are updated iteratively until a preset number of iterations is reached, yielding the topic distribution in the text and the images.
Specifically, the first network parameter θ, the second network parameter r, and the preset third network parameter Φ of the Poisson gamma belief network are initialized, where the preset third network parameter Φ is the Φ obtained by training on the above training text and training images. All layers are trained in two steps for a certain number of iterations (B_T + C_T), and each iteration performs the following process. From the bottom layer to the top layer, the total input matrix X_w/X_v, the matrices generated from the training set, and the matrices (or the common θ(t)) formed by splicing the training set and the test set along the user dimension are used, layer by layer, to sample the topic-user matrix of the total data set. Related probability parameters are sampled and generated from the second layer to the top layer. At the top layer, the top-layer r obtained by training is used as the probability generation parameter to sample and generate θ(T) (shared by text and pictures). The tail of θ(T) (that is, all columns after a certain column) is the distribution of each topic in the user's microblog posts. After the iterations are complete, the distribution of each topic in the user's microblog posts is obtained.
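Reading the user's topic distribution off the tail of the top-layer θ can be sketched as follows. The sketch assumes that θ is indexed as theta[k][j] with the training columns first, and it normalizes each user's column so that the topic weights sum to one; the normalization and the function name are assumptions, since the source only states that the tail columns hold the distribution.

```python
def user_topic_distribution(theta, num_training_columns):
    """Return one normalized topic distribution per column after the training columns.

    theta[k][j]: weight of topic k for column j (training users first, new users last).
    """
    tail = [row[num_training_columns:] for row in theta]
    n_users = len(tail[0]) if tail else 0
    distributions = []
    for j in range(n_users):
        column = [row[j] for row in tail]
        total = sum(column)
        distributions.append([v / total if total else 0.0 for v in column])
    return distributions

theta_top = [[1.0, 2.0, 3.0],
             [1.0, 2.0, 1.0]]
new_user = user_topic_distribution(theta_top, num_training_columns=2)
```

How an attribute is then derived from this distribution is not specified in the text, so the sketch stops at the distribution itself.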
The user attribute acquisition method and device provided by the embodiments of the present invention obtain the text and images in a user's microblog posts; obtain a text input matrix corresponding to the text; obtain an image input matrix corresponding to the images; obtain a total input matrix based on the text input matrix and the image input matrix; and then, based on the total input matrix, obtain the topic distribution in the text and the images and, based on the topic distribution, obtain the attributes of the user. The method and device are efficient, accurate, and practical.
Referring to Fig. 3, the embodiments of the invention provide a kind of user property acquisition device 300, described device can include: First acquisition unit 320, second acquisition unit 330, the 3rd acquiring unit 340, the 4th acquiring unit 350 and the 5th acquiring unit 360。
First acquisition unit 320, the text and image in microblogging for obtaining user.
Second acquisition unit 330, for obtaining text input matrix corresponding to the text.
Second acquisition unit 330 can include second and obtain subelement 331.
Second obtains subelement 331, for carrying out word segmentation processing to the text and counting word frequency, obtains at least one point The word frequency each segmented in word, and at least one participle;Based at least one participle and the word frequency each segmented, obtain Obtain text input matrix corresponding to the text.
3rd acquiring unit 340, for obtaining image input matrix corresponding to described image.
3rd acquiring unit 340 can include the 3rd and obtain subelement 341.
3rd obtains subelement 341, for carrying out sift feature extractions to described image, obtains the corresponding to described image One characteristic vector is simultaneously based on the first eigenvector, obtains image input matrix corresponding to described image.
The fourth acquisition unit 350 is configured to obtain the total input matrix based on the text input matrix and the image input matrix.
The fourth acquisition unit 350 may include a fourth acquisition subunit 351.
The fourth acquisition subunit 351 is configured to splice the text input matrix, the image input matrix, and the preset training set input matrix to obtain the total input matrix.
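The splicing step can be illustrated with a short sketch. The layout is an assumption: text and image features are stacked along the feature axis, and the training set and the new user's data are joined along the user axis. The name `build_total_input` is hypothetical, and the method may instead keep the two modalities as separate matrices.

```python
import numpy as np

def build_total_input(X_text, X_image, X_train):
    """Splice test-user text/image counts onto the training set matrix.

    X_text:  (V_w, n_test) word counts for the test users.
    X_image: (V_v, n_test) visual-word counts for the test users.
    X_train: (V_w + V_v, n_train) training set input matrix.
    """
    X_test = np.vstack([X_text, X_image])        # stack features per user
    if X_test.shape[0] != X_train.shape[0]:
        raise ValueError("feature dimensions of training and test data differ")
    return np.hstack([X_train, X_test])          # append test users as new columns

X_text = np.ones((3, 2), dtype=np.int64)
X_image = np.full((2, 2), 2, dtype=np.int64)
X_train = np.zeros((5, 4), dtype=np.int64)
X_total = build_total_input(X_text, X_image, X_train)
```

Placing the test users in the trailing columns matches the description that the user's topic distribution is read from the tail of the topic-user matrix.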
The fifth acquisition unit 360 is configured to obtain, based on the total input matrix, the topic distribution in the text and the image, and to obtain the attribute of the user based on the topic distribution.
The fifth acquisition unit 360 may include a fifth acquisition subunit 361.
The fifth acquisition subunit 361 is configured to initialize the first network parameter, the second network parameter, and the preset third network parameter of the Poisson gamma belief network; and, taking the total input matrix as the input of the Poisson gamma belief network, to sample layer by layer from the bottom layer of the network to the top layer, iteratively updating the first network parameter, the second network parameter, and the third network parameter until a preset number of iterations is reached, thereby obtaining the topic distribution in the text and the image.
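For orientation, the generative side of a Poisson gamma belief network, as it is commonly formulated, can be sketched as a top-down sampler. This is not the inference procedure the subunit performs (which samples upward and updates the parameters iteratively). It only shows how θ, r, and Φ relate to the count data. The function name `sample_pgbn` and the single shared scale `c` are simplifying assumptions.

```python
import numpy as np

def sample_pgbn(phis, r, n_users, c=1.0, seed=0):
    """Draw counts from a T-layer gamma-Poisson hierarchy.

    phis: list [Phi_1, ..., Phi_T]; Phi_t has shape (K_{t-1}, K_t) with
          non-negative columns summing to 1 (K_0 is the vocabulary size).
    r:    (K_T,) top-layer gamma shape parameters.
    """
    rng = np.random.default_rng(seed)
    T = len(phis)
    thetas = [None] * T
    # top layer: theta^(T) ~ Gamma(r, 1/c), one column per user
    thetas[T - 1] = rng.gamma(shape=np.tile(r[:, None], (1, n_users)), scale=1.0 / c)
    # lower layers: theta^(t) ~ Gamma(Phi^(t+1) theta^(t+1), 1/c)
    for t in range(T - 2, -1, -1):
        thetas[t] = rng.gamma(shape=phis[t + 1] @ thetas[t + 1], scale=1.0 / c)
    # observed counts: X ~ Poisson(Phi^(1) theta^(1))
    X = rng.poisson(phis[0] @ thetas[0])
    return X, thetas

rng0 = np.random.default_rng(1)
phi1 = rng0.dirichlet(np.ones(6), size=4).T      # 6 words, 4 bottom-layer topics
phi2 = rng0.dirichlet(np.ones(4), size=2).T      # 4 topics, 2 top-layer topics
r = np.array([0.5, 0.5])
X, thetas = sample_pgbn([phi1, phi2], r, n_users=3)
```

Inference reverses this direction: given `X`, it samples the hidden θ values layer by layer and updates the parameters until the preset iteration count is reached.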
Referring to Fig. 4, the device 300 may further include a training unit 310.
The training unit 310 is configured to obtain the training text and training images in a plurality of microblogs; obtain the training text input matrix corresponding to the training text; obtain the training image matrix corresponding to the training images; and obtain the training set input matrix based on the training text input matrix and the training image matrix.
The training unit 310 may include a training subunit 311.
The training subunit 311 is further configured to perform SIFT feature extraction on each training image to obtain the second feature vector corresponding to each training image; to obtain, based on a preset clustering algorithm and the second feature vector corresponding to each training image, the cluster center of each class and the image features contained in each class; and to count the number of such image features contained in each training image, obtaining the training image matrix corresponding to the plurality of training images.
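The clustering and counting steps can be sketched as follows. The text only says "a preset clustering algorithm"; k-means is used here as an assumption, implemented directly in NumPy so the block is self-contained. The names `kmeans` and `training_image_matrix` are hypothetical, and toy 2-dimensional points stand in for SIFT descriptors.

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):                     # keep old center if cluster is empty
                centers[j] = members.mean(axis=0)
    # recompute labels against the final centers
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1)

def training_image_matrix(descriptor_sets, k):
    """Cluster all descriptors, then count cluster members per image."""
    all_desc = np.vstack(descriptor_sets)
    centers, labels = kmeans(all_desc, k)
    matrix = np.zeros((k, len(descriptor_sets)), dtype=np.int64)
    start = 0
    for col, desc in enumerate(descriptor_sets):
        stop = start + len(desc)
        matrix[:, col] = np.bincount(labels[start:stop], minlength=k)
        start = stop
    return centers, matrix

image_a = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.1]])
image_b = np.array([[10.0, 10.1], [9.9, 10.0], [0.0, 0.2]])
centers, X_v = training_image_matrix([image_a, image_b], k=2)
```

Each column of `X_v` counts how many of an image's descriptors fall into each cluster, so the column sum equals that image's descriptor count.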
The training unit 310 is further configured to set the maximum number of topics for the bottom layer of the Poisson gamma belief network; to randomly assign a topic to each training text and training image based on the training set input matrix, obtaining an initialization matrix and generating an initial value for each probability parameter; and to iteratively train the Poisson gamma belief network based on the initialization matrix and the initial value of each probability parameter, obtaining the topic distribution in the training text and the training images.
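The random initialization can be sketched as below. This is one possible reading, stated as an assumption: every count in the training set input matrix is split uniformly at random across the topics, which yields a word-topic count matrix and a topic-user count matrix as starting points. The function name `random_topic_init` is hypothetical.

```python
import numpy as np

def random_topic_init(X, n_topics, seed=0):
    """Randomly split each count in X (features x users) across topics.

    Returns word-topic counts (V, K) and topic-user counts (K, N).
    """
    rng = np.random.default_rng(seed)
    V, N = X.shape
    word_topic = np.zeros((V, n_topics), dtype=np.int64)
    topic_user = np.zeros((n_topics, N), dtype=np.int64)
    uniform = np.full(n_topics, 1.0 / n_topics)  # equal chance for every topic
    for v, j in zip(*np.nonzero(X)):
        split = rng.multinomial(X[v, j], uniform)
        word_topic[v] += split
        topic_user[:, j] += split
    return word_topic, topic_user

X = np.array([[2, 0, 1],
              [0, 3, 1]])
word_topic, topic_user = random_topic_init(X, n_topics=4)
```

The split preserves totals: the topic-user columns sum to each user's total count, so later sampling iterations start from a consistent state.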
Each of the above units may be implemented in software code, in which case the units may be stored in the memory 102. Each of the above units may equally be implemented in hardware, such as an integrated circuit chip.
The implementation principle and technical effects of the user attribute acquisition device 300 provided by the embodiment of the present invention are the same as those of the foregoing method embodiment. For brevity, where the device embodiment does not mention a point, refer to the corresponding content in the foregoing method embodiment.
In the several embodiments provided in this application, it should be understood that the disclosed device and method may also be implemented in other ways. The device embodiment described above is merely illustrative. For example, the flowcharts and block diagrams in the accompanying drawings show the possible architecture, functions, and operations of devices, methods, and computer program products according to multiple embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should further be noted that each block in the block diagrams and/or flowcharts, and each combination of such blocks, may be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
In addition, the functional modules in the embodiments of the present invention may be integrated to form a single independent part, each module may exist separately, or two or more modules may be integrated to form a single independent part.
If the functions are implemented as software functional modules and sold or used as an independent product, they may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present invention in essence, or the part that contributes over the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc. It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include" and "comprise" and any variants of them are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article, or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection. It should be noted that similar reference numerals and letters denote similar items in the accompanying drawings; therefore, once an item is defined in one drawing, it does not need to be further defined or explained in subsequent drawings.
The foregoing describes only specific embodiments of the present invention, but the scope of protection of the present invention is not limited to them. Any change or substitution that a person familiar with the technical field could readily conceive within the technical scope disclosed by the present invention shall be covered by the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be determined by the scope of the claims.

Claims (10)

  1. A user attribute acquisition method, applied to an electronic device, characterized in that the method comprises:
    obtaining the text and image in a user's microblog;
    obtaining the text input matrix corresponding to the text;
    obtaining the image input matrix corresponding to the image;
    obtaining a total input matrix based on the text input matrix and the image input matrix;
    obtaining, based on the total input matrix, the topic distribution in the text and the image, and obtaining the attribute of the user based on the topic distribution.
  2. The method according to claim 1, characterized in that obtaining the text input matrix corresponding to the text comprises:
    performing word segmentation on the text and counting word frequencies, to obtain at least one segmented word and the word frequency of each segmented word;
    obtaining the text input matrix corresponding to the text based on the at least one segmented word and the word frequency of each segmented word.
  3. The method according to claim 1, characterized in that obtaining the image input matrix corresponding to the image comprises:
    performing SIFT feature extraction on the image, obtaining the first feature vector corresponding to the image, and obtaining the image input matrix corresponding to the image based on the first feature vector.
  4. The method according to claim 1, characterized in that obtaining the total input matrix based on the text input matrix and the image input matrix comprises:
    splicing the text input matrix, the image input matrix, and a preset training set input matrix to obtain the total input matrix.
  5. The method according to claim 4, characterized in that, before splicing the text input matrix, the image input matrix, and the preset training set input matrix to obtain the total input matrix, the method further comprises:
    obtaining the training text and training images in a plurality of microblogs;
    obtaining the training text input matrix corresponding to the training text;
    obtaining the training image matrix corresponding to the training images;
    obtaining the training set input matrix based on the training text input matrix and the training image matrix.
  6. The method according to claim 5, characterized in that obtaining the training image matrix corresponding to the training images comprises:
    performing SIFT feature extraction on each training image to obtain the second feature vector corresponding to each training image;
    obtaining, based on a preset clustering algorithm and the second feature vector corresponding to each training image, the cluster center of each class and the image features contained in each class;
    counting the number of the image features contained in each training image, to obtain the training image matrix corresponding to the plurality of training images.
  7. The method according to claim 5, characterized in that, after obtaining the training set input matrix based on the training text input matrix and the training image matrix, the method further comprises:
    setting the maximum number of topics for the bottom layer of a Poisson gamma belief network;
    randomly assigning a topic to each training text and training image based on the training set input matrix, obtaining an initialization matrix and generating an initial value for each probability parameter;
    iteratively training the Poisson gamma belief network based on the initialization matrix and the initial value of each probability parameter, to obtain the topic distribution in the training text and the training images.
  8. The method according to claim 1, characterized in that obtaining, based on the total input matrix, the topic distribution in the text and the image comprises:
    initializing the first network parameter, the second network parameter, and the preset third network parameter of a Poisson gamma belief network;
    taking the total input matrix as the input of the Poisson gamma belief network, sampling layer by layer from the bottom layer of the Poisson gamma belief network to the top layer, and iteratively updating the first network parameter, the second network parameter, and the third network parameter until a preset number of iterations is reached, to obtain the topic distribution in the text and the image.
  9. A user attribute acquisition device, characterized in that the device comprises:
    a first acquisition unit, configured to obtain the text and image in a user's microblog;
    a second acquisition unit, configured to obtain the text input matrix corresponding to the text;
    a third acquisition unit, configured to obtain the image input matrix corresponding to the image;
    a fourth acquisition unit, configured to obtain a total input matrix based on the text input matrix and the image input matrix;
    a fifth acquisition unit, configured to obtain, based on the total input matrix, the topic distribution in the text and the image, and to obtain the attribute of the user based on the topic distribution.
  10. The device according to claim 9, characterized in that the second acquisition unit comprises:
    a second acquisition subunit, configured to perform word segmentation on the text and count word frequencies, obtaining at least one segmented word and the word frequency of each segmented word; and to obtain the text input matrix corresponding to the text based on the at least one segmented word and the word frequency of each segmented word.
CN201710738930.0A 2017-08-24 2017-08-24 User attribute acquisition method and device Active CN107480289B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710738930.0A CN107480289B (en) 2017-08-24 2017-08-24 User attribute acquisition method and device

Publications (2)

Publication Number Publication Date
CN107480289A true CN107480289A (en) 2017-12-15
CN107480289B CN107480289B (en) 2020-06-30

Family

ID=60602525

Country Status (1)

Country Link
CN (1) CN107480289B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103838836A (en) * 2014-02-25 2014-06-04 中国科学院自动化研究所 Multi-modal data fusion method and system based on discriminant multi-modal deep confidence network
CN104361059A (en) * 2014-11-03 2015-02-18 中国科学院自动化研究所 Harmful information identification and web page classification method based on multi-instance learning
CN105426356A (en) * 2015-10-29 2016-03-23 杭州九言科技股份有限公司 Target information identification method and apparatus
CN105760507A (en) * 2016-02-23 2016-07-13 复旦大学 Cross-modal subject correlation modeling method based on deep learning
CN106446117A (en) * 2016-09-18 2017-02-22 西安电子科技大学 Text analysis method based on poisson-gamma belief network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MINGYUAN ZHOU: ""The Poisson Gamma Belief Network"", 《ARXIV:1511.02199V1[STAT.ML]》 *
XIU HUANG 等: ""A Deep Approach for Multi-modal User Attribute Modeling"", 《ADC 2017: DATABASES THEORY AND APPLICATIONS》 *
黄秀: ""基于多模态社交媒体数据源的用户画像构建的研究"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108984689A (en) * 2018-07-02 2018-12-11 广东睿江云计算股份有限公司 More copies synchronized method and devices in a kind of union file system
CN108984689B (en) * 2018-07-02 2021-08-03 广东睿江云计算股份有限公司 Multi-copy synchronization method and device in combined file system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant