CN109635207A - A social network user personality prediction method based on Chinese text analysis - Google Patents
A social network user personality prediction method based on Chinese text analysis
- Publication number
- CN109635207A (CN201811553414.1A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
A social network user personality prediction method based on Chinese text analysis. The method processes the Chinese text data that a user has posted on a social network over a recent period and divides it into three categories: basic user status information, user interaction information, and user text information. The text data is preprocessed to obtain a data set composed of words of various types. Part-of-speech tagging is performed on the text data based on a sentiment dictionary, and the frequency with which each part of speech occurs in the text is calculated. The three categories of user information are combined and optimized, and the results of testing the user with an expert scale are used as ground-truth data to construct a numerical data set. Feature engineering is applied to this data set to obtain the feature set used for personality prediction, a personality prediction model is trained with a BP neural network, and the model is used to predict the personality of social network users. The invention offers convenient data acquisition, does not depend on the experience of psychology professionals, requires little manpower or material resources, and achieves high accuracy.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to a social network user personality prediction method based on Chinese text analysis.
Background
With the rapid development of Internet technology and the continuous expansion of its applications, social networks such as microblogs and friend circles (Moments) are changing human life. People can publish information and interact on them, forming a huge network environment in which information spreads quickly and widely, with good timeliness and convenience.
The emotion expressed in information on social network platforms is mainly influenced by the user's personality; conversely, a user's personality is outwardly manifested mainly through the user's emotional expression. Analyzing the personality of social network users therefore makes it possible to understand users' emotional information effectively, which is of great significance for promoting Internet products, targeting advertisements precisely, pushing personalized content, and providing after-sales service.
The mainstream model of human personality is the Big Five (five-factor) personality model, which regards a person's personality as a combination of the following five traits:
Neuroticism: High scorers: easily offended, easily discouraged, uneasy, strongly self-conscious, impulsive, and vulnerable. Low scorers: secure, calm, less self-conscious, and relatively self-satisfied. Neuroticism mainly reflects an individual's tendency to experience unpleasant emotions and the degree to which the individual's mood fluctuates; such individuals control their impulses poorly and easily become weary of their surroundings.
Extraversion: High scorers: outgoing, enthusiastic, energetic, fond of stimulation and of making friends, and prone to positive emotions. Low scorers: poor at communicating and at expressing their own feelings, and serious. Extraversion mainly reflects an individual's self-confidence, talkativeness, sociability, and love of life, together with an active pursuit of positive emotions.
Openness: High scorers: rich imagination, willingness to try new things, aesthetic sensitivity, rich feelings, a thirst for knowledge, and open values. Low scorers: more pragmatic, compliant, relatively satisfied with the status quo, and obedient to convention. Openness mainly reflects a person's curiosity about the outside world, love of life, and fondness for novelty.
Compliance (agreeableness in the standard Big Five terminology): High scorers: trusting of others, willing to follow others' opinions and to help others, sincere, compliant, and sympathetic. Low scorers: ruthless, suspicious, mocking of others, and rebellious. Compliance mainly reflects trust between people, as opposed to suspicion and harsh treatment of others, and serves as a measure of whether an individual is willing to help others.
Conscientiousness: High scorers: rigorous in handling matters, orderly, confident, responsible, self-disciplined, achievement-oriented, and careful. Low scorers: disorganized, weak-willed, and careless. Conscientiousness mainly reflects how an individual approaches tasks: such individuals show self-restraint, work carefully, and are confident in their own abilities.
At present, personality prediction relies mainly on psychological scales: the subject answers a series of questions, the answers are scored according to fixed rules, and the subject's personality is analyzed from the scores. This approach depends on the experience of psychology professionals and consumes considerable manpower and time. With the development of natural language processing and machine learning theory, psychological sentiment analysis based on text data has become an important research topic, but related research and inventions for Chinese social network text are still scarce.
Summary of the invention
The purpose of the present invention is to provide a social network user personality prediction method based on Chinese text analysis, which processes and mines the text information posted by social network users to analyze the composition of their personality, and which offers advantages such as high accuracy, fast analysis, and automation.
To achieve this purpose, the invention is realized through the following technical solution:
A social network user personality prediction method based on Chinese text analysis, characterized by comprising the following steps:
S1. Perform preliminary processing on Chinese social network text, dividing it into three categories: basic user status information, user interaction information, and user text information;
S2. Preprocess the user text information to obtain a data set D_word composed of words of various types;
S3. Extract the text features of the user text information: perform part-of-speech tagging on D_word based on a sentiment dictionary and calculate the frequency of each part of speech in the text; combine and optimize the three categories of information above; and, using the results of testing users with an expert scale as ground-truth data, construct a numerical data set D_comp;
S4. Perform feature engineering on the numerical data set D_comp, i.e. screen the features, to obtain the feature set D_pre used for personality prediction;
S5. Train a BP neural network model for personality prediction: with the feature vectors in the feature set D_pre obtained in step S4 as model input and the proportions of the five personality traits (neuroticism, extraversion, openness, compliance, conscientiousness) as model output, construct the neural network, train the prediction model, and perform personality prediction.
In the above social network user personality prediction method based on Chinese text analysis, in step S1:
Basic user status information includes the number of posts published, the number of accounts followed, the number of fans, the length of time the user has used the social network, and the average posting frequency, reflecting the user's basic usage of the social network;
User interaction information includes the number of emoticons, the number of topics, the number of mentions, and the number of reposts on the social network, reflecting the user's interaction with public topics and friends;
User text information is the pure language content of the text, reflecting the user's language habits, manner of expression, and emotional tendencies.
In the above social network user personality prediction method based on Chinese text analysis, the preprocessing in step S2 consists of cleaning the text data, segmenting it into words, and removing stop words:
Cleaning the text data means using regular-expression matching to filter out non-text content in the social network text, such as pictures, emoticons, location information, fields enclosed by a pair of "#" characters, URLs, emoji, and forwarding symbols;
Word segmentation means segmenting the user text so that the full text is converted into a set of words;
Stop-word removal means using regular expressions to remove text noise, eliminating stop words that occur frequently in the text but carry no real meaning; stop words include pronouns, auxiliary words, and punctuation marks.
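The cleaning and stop-word steps can be illustrated with a minimal Python sketch. The regular expressions, the bracketed emoticon format, the `//@name:` forwarding marker, and the tiny stop-word set are all assumptions made for illustration; the source does not specify them. Word segmentation is assumed to be done by an external segmenter and is not shown, and stop words are removed here by set lookup rather than by regular expression.

```python
import re

# Hypothetical patterns; the markup of a real platform may differ.
URL_RE = re.compile(r"https?://\S+")             # URLs
TOPIC_RE = re.compile(r"#[^#]+#")                # fields enclosed by a pair of '#'
EMOTICON_RE = re.compile(r"\[[^\[\]]+\]")        # bracketed emoticon codes, e.g. [doge]
FORWARD_RE = re.compile(r"//@[^:：\s]+[:：]?")    # forwarding markers, e.g. //@name:


def clean_text(raw: str) -> str:
    """Strip non-text content from one post and collapse whitespace."""
    text = raw
    for pattern in (URL_RE, TOPIC_RE, EMOTICON_RE, FORWARD_RE):
        text = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


# Illustrative subset only; a real stop-word list would be much longer.
STOP_WORDS = {"的", "了", "我", "，", "。"}


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word set (tokens come from a segmenter)."""
    return [token for token in tokens if token not in STOP_WORDS]
```

Applied in sequence (clean, segment, remove stop words), these helpers would yield the word list that the method calls D_word.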
In the above social network user personality prediction method based on Chinese text analysis, in step S3:
Combining and optimizing the three categories of information means: computing the basic status information (number of posts published, number of accounts followed, number of fans, duration of social network use, and average posting frequency) and computing the interaction information in the user's text (number of emoticons, number of topics, number of mentions, and number of reposts);
Constructing the numerical data set D_comp with the expert-scale test results as ground-truth data means: testing the user with a Big Five personality expert scale, scoring according to the expert rules, and calculating the proportion of each of the five personality traits to form five label data items; these are combined with the part-of-speech frequencies from the tagging results and with the basic status information and interaction information from the combination and optimization step to form the numerical data set D_comp.
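As a rough illustration of the part-of-speech frequency calculation in step S3, the sketch below counts how often tokens fall into each dictionary category. The miniature dictionary, the category names, and the choice to define frequency as "category hits divided by total token count" are assumptions; the source does not name the sentiment dictionary or give the exact frequency definition.

```python
from collections import Counter

# Hypothetical miniature sentiment dictionary: word -> list of category labels.
SENTIMENT_DICT = {
    "朋友": ["friend"],
    "焦虑": ["anxiety"],
    "工作": ["work"],
    "钱": ["money"],
}


def category_frequencies(tokens, dictionary=SENTIMENT_DICT):
    """Return, for each category, the share of tokens tagged with it."""
    counts = Counter()
    for token in tokens:
        # Words missing from the dictionary contribute to no category.
        for category in dictionary.get(token, ()):
            counts[category] += 1
    total = len(tokens)
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}
```

One such frequency vector per user, concatenated with the status and interaction counts and the five label proportions, would form one row of D_comp.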
In the above social network user personality prediction method based on Chinese text analysis, D_comp has 111 features in total, corresponding to 102 parts of speech, 5 kinds of user status information, and 4 kinds of interaction information. Step S4 specifically comprises:
S41. Calculate the correlation between each of the five personality traits and each part of speech:
Correlation is measured with the Pearson correlation coefficient, calculated as:
$$\rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X\,\sigma_Y} = \frac{\sum_{k}(X_k-\bar{X})(Y_k-\bar{Y})}{\sqrt{\sum_{k}(X_k-\bar{X})^2}\,\sqrt{\sum_{k}(Y_k-\bar{Y})^2}} \qquad (1)$$
where Cov(X, Y) is the covariance of variables X and Y, σ_X and σ_Y are the standard deviations of X and Y, and X̄ and Ȳ are the means of X and Y. Here X is the data for one part-of-speech feature W_i (i = 1…102) among the 102 part-of-speech features W, and Y is the data for one personality trait Ch_j (j = 1…5) among the five traits Ch. For each personality trait, the 13 parts of speech with the highest Pearson coefficients are taken, forming 5 part-of-speech sets:
Set_j = { W_1…13 | the top 13 part-of-speech features for Ch_j },  j = 1…5
The union of the sets of 13 part-of-speech features for each personality trait gives the part-of-speech set Set_W:
Set_W = Set_1 ∪ Set_2 ∪ … ∪ Set_5
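The selection in S41 can be sketched in plain Python as follows. The function names and the dictionary-of-lists data layout are hypothetical. The sketch ranks features by the signed coefficient, taking "highest" literally; ranking by absolute value would be an alternative reading that the source does not settle.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        return 0.0  # assumption: treat a constant column as uncorrelated
    return cov / (dev_x * dev_y)


def top_k_features(features, label_values, k):
    """Names of the k features with the highest coefficient against one trait."""
    ranked = sorted(
        features,
        key=lambda name: pearson(features[name], label_values),
        reverse=True,
    )
    return set(ranked[:k])


def select_union(features, labels, k):
    """Union over all traits of each trait's top-k feature set."""
    selected = set()
    for label_values in labels.values():
        selected |= top_k_features(features, label_values, k)
    return selected
```

With the 102 part-of-speech columns as `features`, the five trait proportions as `labels`, and `k=13`, `select_union` would play the role of Set_W; with the 9 status and interaction columns and `k=3` it would play the role of Set_F.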
S42. Calculate the correlation between each of the five personality traits and the 5 user status information features and 4 interaction information features:
Correlation is measured with the Pearson correlation coefficient, calculated as in formula (1). Here X is the data for one feature F_i (i = 1…9) among the 9 features F, and Y is the data for one personality trait Ch_j (j = 1…5) among the five traits Ch. For each personality trait, the three features with the highest Pearson coefficients are taken, forming 5 feature sets:
Set_j = { F_1…3 | the top 3 features for Ch_j },  j = 1…5
The union of the sets of 3 features for each personality trait gives the set Set_F:
Set_F = Set_1 ∪ Set_2 ∪ … ∪ Set_5
S43. Calculate the pairwise correlation between the parts of speech selected in step S41:
Recalculate the frequency of each part of speech within the text composed of the words in Set_W. Correlation is again measured with the Pearson correlation coefficient, calculated as described in S41; here X is the data for one part-of-speech feature in Set_W and Y is the data for another part-of-speech feature in Set_W other than X:
$$\rho_{W_i,W_j} = \frac{\mathrm{Cov}(W_i,W_j)}{\sigma_{W_i}\,\sigma_{W_j}}$$
where ρ_{W_i,W_j} is the Pearson coefficient of parts of speech W_i and W_j. If the Pearson coefficient between a pair of parts of speech is greater than 0.6, one of the pair is removed, giving the set Set_Wn. Taking the union of Set_Wn and Set_F gives Set_pre = Set_Wn ∪ Set_F; the corresponding data is then merged with the five personality label data to obtain the multi-label, multi-target feature set D_pre.
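A minimal sketch of the redundancy filter in S43 is given below. The source only says that one feature of a highly correlated pair is removed; which one is dropped is not specified, so the greedy rule used here (keep the feature encountered first, drop the later one) is an assumption, as are the function names. The threshold is applied to the signed coefficient, following the wording "greater than 0.6".

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        return 0.0  # assumption: a constant column correlates with nothing
    return cov / (dev_x * dev_y)


def drop_redundant(features, threshold=0.6):
    """Greedily keep features whose coefficient with every kept one is <= threshold.

    `features` maps a feature name to its list of values; iteration order
    decides which member of a correlated pair survives (the earlier one).
    """
    kept = []
    for name, values in features.items():
        if all(pearson(values, features[other]) <= threshold for other in kept):
            kept.append(name)
    return kept


def build_feature_set(word_features, other_feature_names, threshold=0.6):
    """Set_pre = Set_Wn united with Set_F, returned as a sorted list of names."""
    set_wn = set(drop_redundant(word_features, threshold))
    return sorted(set_wn | set(other_feature_names))
```

Because the outcome depends on iteration order, a different ordering of the columns can yield a different, equally valid Set_Wn.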
In the above social network user personality prediction method based on Chinese text analysis, step S5 specifically comprises:
S51. Before training the model on D_pre, normalize the data using logarithmic normalization, with the formula x' = lg(x) / lg(max), where x is the feature value and max is the maximum value in the data for that feature;
S52. Select the tanh function as the activation function of the neurons, which keeps a nonlinear monotonic relationship between output and input, is suitable for gradient-based solving, and has good fault tolerance;
S53. Prevent overfitting with L2 regularization, i.e. weight decay;
S54. Train and predict using ten-fold cross-validation, and use grid search to tune the learning rate, dropout rate, number of epochs, and number of neurons.
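The data-handling parts of S51 and S54 can be sketched without any machine-learning library, as below. The network itself (tanh activation, L2 weight decay) is not shown. The function names and the candidate hyperparameter values are hypothetical, since the source does not list the grid it searched. The normalization assumes every value is positive and the column maximum is not 1; how the original method treats zero-valued features is not stated.

```python
import itertools
import math


def log_normalize(column):
    """Apply x' = lg(x) / lg(max) to one feature column.

    Assumption: every value is positive and the maximum is not 1, so both
    logarithms exist and the denominator is non-zero.
    """
    largest = max(column)
    if min(column) <= 0 or largest == 1:
        raise ValueError("log normalization needs positive values and max != 1")
    denominator = math.log10(largest)
    return [math.log10(x) / denominator for x in column]


def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists for k contiguous, unshuffled folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    start = 0
    for fold in range(k):
        # The first n_samples % k folds receive one extra sample.
        size = n_samples // k + (1 if fold < n_samples % k else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def parameter_grid(options):
    """Yield every combination of the candidate hyperparameter values."""
    names = list(options)
    for combo in itertools.product(*(options[name] for name in names)):
        yield dict(zip(names, combo))


# Hypothetical candidate values, for illustration only.
CANDIDATES = {
    "learning_rate": [0.001, 0.01],
    "dropout_rate": [0.2, 0.5],
    "epochs": [100, 200],
    "neurons": [8, 16, 32],
}
```

A grid search would loop over `parameter_grid(CANDIDATES)`, train one model per fold from `k_fold_indices`, and keep the combination with the best average validation error.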
Compared with the prior art, the present invention has the following advantages:
1. Data acquisition is highly convenient: with the consent of the tested user, social network text is collected automatically by a program;
2. Unlike personality analysis with traditional expert scale tests, the invention predicts personality from social network text by calling a trained model; it does not depend on the experience of psychology professionals, requires no manpower or material resources, and offers high accuracy and low time cost.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the present invention;
Fig. 2 is a flow chart of the feature engineering in an embodiment of the present invention;
Fig. 3 is a structural diagram of the neural network in an embodiment of the present invention.
Specific embodiment
The present invention is further explained below through a detailed description of a preferred specific embodiment with reference to the drawings.
As shown in Figs. 1 and 2, a social network user personality prediction method based on Chinese text analysis comprises the following steps:
S1. Perform preliminary processing on Chinese social network text, dividing it into three categories: basic user status information, user interaction information, and user text information. The social network text may be the text data the user has posted over the past year. Social network text is short text and generally contains a lot of noise, so preliminary processing is required;
S2. Preprocess the user text information to obtain a data set D_word composed of words of various types. Preprocessing consists of cleaning the text data, segmenting it into words, and removing stop words;
S3. Extract the text features of the user text information: perform part-of-speech tagging on D_word based on a sentiment dictionary and calculate the frequency of each part of speech in the text; combine and optimize the three categories of information above; and, using the results of testing users with an expert scale as ground-truth data, construct a numerical data set D_comp;
S4. Because the feature dimensionality is too high, perform feature engineering on the numerical data set D_comp, i.e. screen the features, to obtain the feature set D_pre used for personality prediction;
S5. Train a BP neural network model for personality prediction: with the feature vectors in the feature set D_pre obtained in step S4 as model input and the proportions of the five personality traits (neuroticism, extraversion, openness, compliance, conscientiousness) as model output, construct the neural network, train the prediction model, and perform personality prediction.
In step S1: basic user status information includes the number of posts published, the number of accounts followed, the number of fans, the length of time the user has used the social network, and the average posting frequency, reflecting the user's basic usage of the social network; user interaction information includes the number of emoticons, the number of topics, the number of mentions, and the number of reposts on the social network, reflecting the user's interaction with public topics and friends; user text information is the pure language content of the text, reflecting the user's language habits, manner of expression, and emotional tendencies.
In step S2: cleaning the text data means using regular-expression matching to filter out non-text content in the social network text, such as pictures, emoticons, location information, fields enclosed by a pair of "#" characters, URLs, emoji, and forwarding symbols; word segmentation means segmenting the user text so that the full text is converted into a set of words; stop-word removal means using regular expressions to remove text noise, eliminating stop words that occur frequently in the text but carry no real meaning, where stop words include pronouns, auxiliary words, and punctuation marks. For example, the Chinese social network text "On the fifth day of the short holiday I invited my best friend out, and it was yet another round of eating and drinking to excess. Finding a quiet spot on the crowded Nanjing Road is not easy, but the coffee house tucked inside the Guang Hai publishing house brought unexpected peace and quiet" becomes, after the processing of step S2, "short holiday fifth day invited best friend eating drinking to excess crowded Nanjing Road quiet spot not easy Guang Hai publishing house coffee house unexpected peace quiet".
In step S3: combining and optimizing the three categories of information means computing the basic status information (number of posts published, number of accounts followed, number of fans, duration of social network use, and average posting frequency) and computing the interaction information in the user's text (number of emoticons, number of topics, number of mentions, and number of reposts). Constructing the numerical data set D_comp with the expert-scale test results as ground-truth data means: testing the user with a Big Five personality expert scale, scoring according to the expert rules, and calculating the proportion of each of the five personality traits to form five label data items; these are combined with the part-of-speech frequencies from the tagging results and with the basic status information and interaction information from the combination and optimization step to form the numerical data set D_comp. In the present embodiment, the Big Five personality expert scale, which is based on expert knowledge, tests users as follows: there are 60 questions in total, 12 for each personality trait, of which 1/3 are negatively correlated with the trait and 2/3 are positively correlated. Each user's five personality traits are scored according to the scoring rules of the scale, and the proportion of each trait's score is then calculated to form five label data items. These are combined with the part-of-speech frequencies obtained from the earlier sentiment-dictionary-based tagging of D_word and with the basic status information and interaction information obtained by combining and optimizing the three categories of information, to form the numerical data set D_comp.
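The label construction described above can be sketched as follows. The source fixes only the counts (60 questions, 12 per trait, one third negatively keyed); the 1-to-5 answer scale, the reverse-scoring rule `scale_max + 1 - answer`, the item ordering, and the function names are all assumptions made for illustration, not the scoring rules of the actual expert scale.

```python
TRAITS = ("neuroticism", "extraversion", "openness", "compliance", "conscientiousness")
ITEMS_PER_TRAIT = 12
REVERSED_PER_TRAIT = 4  # one third of each trait's 12 items are negatively keyed


def build_item_keys():
    """Hypothetical item layout: per trait, 4 reverse-keyed items then 8 forward items."""
    keys = []
    for trait in TRAITS:
        for position in range(ITEMS_PER_TRAIT):
            keys.append((trait, position < REVERSED_PER_TRAIT))
    return keys


def trait_proportions(answers, keys, scale_max=5):
    """Score each trait and return its share of the total score.

    `answers` are integers from 1 to scale_max, one per item; `keys` gives
    (trait, is_reversed) for each item in the same order.
    """
    if len(answers) != len(keys):
        raise ValueError("one answer is required per item")
    totals = dict.fromkeys(TRAITS, 0)
    for answer, (trait, is_reversed) in zip(answers, keys):
        if not 1 <= answer <= scale_max:
            raise ValueError("answer outside the assumed scale")
        # Reverse-keyed items are flipped so that a high score always means
        # a stronger expression of the trait.
        totals[trait] += (scale_max + 1 - answer) if is_reversed else answer
    grand_total = sum(totals.values())
    return {trait: totals[trait] / grand_total for trait in TRAITS}
```

The five resulting proportions sum to 1 and would serve as the five label values for one user in D_comp.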
In the present embodiment, D_comp is set to have 111 features in total, corresponding to 102 parts of speech, 5 kinds of user status information, and 4 kinds of interaction information. As shown in Fig. 2, step S4 specifically comprises:
S41. Calculate the correlation between each of the five personality traits and each part of speech:
Correlation is measured with the Pearson correlation coefficient, calculated as:
$$\rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X\,\sigma_Y} = \frac{\sum_{k}(X_k-\bar{X})(Y_k-\bar{Y})}{\sqrt{\sum_{k}(X_k-\bar{X})^2}\,\sqrt{\sum_{k}(Y_k-\bar{Y})^2}} \qquad (1)$$
where Cov(X, Y) is the covariance of variables X and Y, σ_X and σ_Y are the standard deviations of X and Y, and X̄ and Ȳ are the means of X and Y. Here X is the data for one part-of-speech feature W_i (i = 1…102) among the 102 part-of-speech features W, and Y is the data for one personality trait Ch_j (j = 1…5) among the five traits Ch. For each personality trait, the 13 parts of speech with the highest Pearson coefficients are taken, forming 5 part-of-speech sets:
Set_j = { W_1…13 | the top 13 part-of-speech features for Ch_j },  j = 1…5
The union of the sets of 13 part-of-speech features for each personality trait gives the part-of-speech set Set_W:
Set_W = Set_1 ∪ Set_2 ∪ … ∪ Set_5
For example, Set_1, the set of 13 parts of speech for the neuroticism trait, is { "present", "anxiety words", "motion words", "money words", "religion words", "death words", "human words", "insight words", "causation words", "swear words", "perceptual-process words", "assent words", "body words" }.
For example, Set_2, the set of 13 parts of speech for the extraversion trait, is { "friend words", "health words", "sexual words", "space words", "leisure words", "cognitive-process words", "visual words", "feeling words", "work words", "love words", "swear words", "assent words", "numeral ratio" }.
For example, Set_3, the set of 13 parts of speech for the openness trait, is { "present", "friend words", "health words", "sexual words", "motion words", "space words", "human words", "social-process words", "cognitive-process words", "insight words", "visual words", "physiological-process words", "body words" }.
For example, Set_4, the set of 13 parts of speech for the compliance trait, is { "money words", "work words", "love words", "past", "causation words", "inclusive words", "exclusive words", "hearing words", "multipurpose words", "ingestion words", "perceptual-process words", "relativity words", "assent words" }.
For example, Set_5, the set of 13 parts of speech for the conscientiousness trait, is { "anger words", "achievement words", "money words", "specific-determiner words", "social-process words", "cognitive-process words", "insight words", "work words", "exclusive words", "relativity words", "numeral ratio", "physiological-process words", "body words" }.
Taking the union of Set_1, Set_2, Set_3, Set_4, and Set_5 gives the part-of-speech set Set_W, for example { "present", "friend words", "anxiety words", "anger words", "health words", "sexual words", "motion words", "space words", "achievement words", "leisure words", "money words", "religion words", "death words", "specific-determiner words", "human words", "social-process words", "cognitive-process words", "insight words", "visual words", "feeling words", "work words", "love words", "past", "causation words", "inclusive words", "exclusive words", "hearing words", "swear words", "multipurpose words", "ingestion words", "perceptual-process words", "relativity words", "assent words", "numeral ratio", "physiological-process words", "body words" };
S42. Calculate the correlation between each of the five personality traits and the 5 user status information features and 4 interaction information features:
Correlation is measured with the Pearson correlation coefficient, calculated as in formula (1). Here X is the data for one feature F_i (i = 1…9) among the 9 features F, and Y is the data for one personality trait Ch_j (j = 1…5) among the five traits Ch. For each personality trait, the three features with the highest Pearson coefficients are taken, forming 5 feature sets:
Set_j = { F_1…3 | the top 3 features for Ch_j },  j = 1…5
The union of the sets of 3 features for each personality trait gives the set Set_F:
Set_F = Set_1 ∪ Set_2 ∪ … ∪ Set_5
For example, the set { "number of posts", "number of fans", "number of topics", "number of mentions" };
S43, calculate the pairwise correlation between the parts of speech selected in step S41:
The frequency of each part of speech is recalculated over the text made up of Set_W. Correlation is measured with the Pearson correlation coefficient, calculated as in S41. Here X is the data of one part-of-speech feature in Set_W and Y is the data of another part-of-speech feature in Set_W other than X, i.e.:
$$\rho_{W_i,W_j} = \frac{\mathrm{Cov}(W_i, W_j)}{\sigma_{W_i}\,\sigma_{W_j}}$$
where $\rho_{W_i,W_j}$ is the Pearson coefficient of parts of speech W_i and W_j. If the Pearson coefficient of a pair of parts of speech is greater than 0.6, one of the pair is removed, giving the set Set_Wn;
for example the set {"present", "friend word", "anxiety word", "anger word", "health word", "sexual word", "motion word", "space word", "achievement word", "leisure word", "money word", "religion word", "death word", "specific reference word", "human word", "cognitive process word", "insight word", "visual word", "feeling word", "work word", "love word", "past", "causation word", "inclusion word", "exclusion word", "hearing word", "swear word", "multipurpose word", "ingestion word", "assent word", "numerical ratio", "body word"};
The union of Set_Wn and Set_F is taken to obtain Set_pre = Set_Wn ∪ Set_F, and the corresponding data are then merged with the five personality label data to obtain the multi-label, multi-target feature element set D_pre.
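The redundancy filter of S43 can be illustrated with a short sketch. The text does not say which member of a highly correlated pair is removed, so dropping the later-listed one is an assumption here; `prune_correlated`, the column names and the toy frequencies are likewise hypothetical.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant column is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)

def prune_correlated(columns, threshold=0.6):
    """Drop one feature from every pair whose Pearson coefficient
    exceeds the threshold; return the surviving names in order."""
    kept = list(columns)
    for a, b in combinations(list(columns), 2):
        if a in kept and b in kept and pearson(columns[a], columns[b]) > threshold:
            kept.remove(b)  # assumption: the later-listed feature is the one dropped
    return kept

# Hypothetical part-of-speech frequencies, one value per user.
columns = {
    "anxiety": [1, 2, 3, 4, 5],
    "anger": [2, 4, 6, 8, 11],
    "leisure": [5, 1, 4, 2, 3],
}
set_wn = prune_correlated(columns)

# Union with the status/interaction features selected in S42 (hypothetical name).
set_pre = set(set_wn) | {"fan_count"}
```

In this toy data "anxiety" and "anger" are almost perfectly correlated, so only one of them survives, while "leisure" is kept because its coefficient with "anxiety" is below the 0.6 threshold.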
Step S5 specifically comprises the following, with the corresponding neural network structure shown in Figure 3:
S51, before the model is trained on D_pre, the data are first normalized using logarithmic normalization, calculated as x' = lg(x) / lg(max), where x is a feature value and max is the maximum value in the data corresponding to that feature;
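A minimal sketch of the logarithmic normalization in S51 is given below. The formula is undefined for x = 0 and for max = 1, and the text does not say how such values are handled, so this sketch simply assumes every value is at least 1 and the maximum is greater than 1; the function name `log_normalize` is hypothetical.

```python
import math

def log_normalize(values):
    """Scale values into [0, 1] with x' = lg(x) / lg(max)."""
    peak = max(values)
    if min(values) < 1 or peak <= 1:
        # Assumption: inputs are counts >= 1; other cases are not covered by the text.
        raise ValueError("log normalization here assumes values >= 1 and max > 1")
    return [math.log10(v) / math.log10(peak) for v in values]

scaled = log_normalize([1, 10, 100, 1000])
```

Compared with min-max scaling, the logarithm compresses long-tailed counts such as fan numbers, so a few very large accounts do not push every other user toward zero.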
S52, the tanh function is selected as the activation function of the neurons, so that the output keeps a nonlinear, monotonically increasing or decreasing relationship with the input, which suits gradient-based solving and gives good fault tolerance;
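The properties claimed for tanh in S52 can be checked numerically. The sketch below uses only the standard identity tanh'(z) = 1 - tanh(z)^2; the sample inputs are arbitrary and the helper name `tanh_derivative` is hypothetical.

```python
import math

def tanh_derivative(z):
    """Derivative of tanh, used by back-propagation: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2

inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]
outputs = [math.tanh(z) for z in inputs]       # bounded in (-1, 1), increasing
slopes = [tanh_derivative(z) for z in inputs]  # always positive, largest at z = 0
```

Because the slope is positive everywhere, the output is monotonic in the input and a gradient always exists for back-propagation, although it becomes small for inputs of large magnitude.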
S53, over-fitting is prevented using L2 regularization, i.e. weight decay;
L2 regularization adds a regularization term after the cost function, where the index j of the regularized coefficients starts from 1. Taking the partial derivative of the regularized cost function gives the gradient descent update. When j is 0, the value of λ is taken to be 0. Without regularization, the coefficient θ_j is multiplied by 1 in each update; with regularization it is multiplied by a factor smaller than 1, so the weight decays. According to Occam's razor, smaller weights indicate a less complex network, which also fits the data better.
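The formulas for S53 are not reproduced in the text above. One common formulation that is consistent with the description (penalty index j starting from 1, λ treated as 0 for j = 0, and the multiplier of θ_j changing from 1 to a smaller factor) is sketched below. The squared-error cost, the hypothesis h_θ with ∂h_θ/∂θ_j = x_j, the learning rate α, the sample count m and the parameter count n are assumptions, not taken from the source.

```latex
% Assumed squared-error cost with an L2 penalty; j starts from 1, so \theta_0 is not penalized
J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2
          + \lambda\sum_{j=1}^{n}\theta_j^2\right]

% Partial derivative with respect to \theta_j for j \ge 1
\frac{\partial J}{\partial\theta_j}
  = \frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)x_j^{(i)}
  + \frac{\lambda}{m}\theta_j

% Gradient descent update with learning rate \alpha
\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)x_j^{(i)}
  + \frac{\lambda}{m}\theta_j\right]

% Regrouped: the multiplier of \theta_j is 1 - \alpha\lambda/m instead of 1
\theta_j := \theta_j\left(1 - \frac{\alpha\lambda}{m}\right)
  - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)x_j^{(i)}
```

Since α, λ and m are positive, the factor 1 - αλ/m is below 1, which is the shrinking of each weight that the name "weight decay" refers to.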
S54, ten-fold cross-validation is used for training and prediction. The data set D is divided into 10 mutually exclusive subsets of similar size, i.e. D = D_1 ∪ D_2 ∪ … ∪ D_10 with D_i ∩ D_j = ∅ (i ≠ j). Each subset D_i is obtained from D by stratified sampling, to keep the data distribution consistent. Each time, the union of k-1 subsets (here k = 10) is used as the training set and the remaining subset as the test set. Grid search is used to tune the learning rate, dropout rate, number of epochs and number of neurons.
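The fold construction in S54 can be sketched with the standard library alone. The text does not say which variable the stratified sampling is based on, so the discrete `strata` labels below (for example a binned trait score) are an assumption, as are the function names; grid search over hyperparameters is not shown.

```python
import random
from collections import defaultdict

def stratified_folds(strata, k=10, seed=0):
    """Split sample indices into k mutually exclusive folds, dealing the
    members of each stratum round-robin so every fold sees a similar mix."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for index, label in enumerate(strata):
        by_stratum[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_stratum):
        members = by_stratum[label]
        rng.shuffle(members)
        for index in members:
            folds[position % k].append(index)
            position += 1
    return folds

def train_test_splits(folds):
    """Yield (train, test) index lists: k-1 folds for training, one held out."""
    for held_out in range(len(folds)):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test

# Hypothetical strata: 100 users, half labelled "low" and half "high".
strata = ["low"] * 50 + ["high"] * 50
folds = stratified_folds(strata, k=10)
splits = list(train_test_splits(folds))
```

Each of the 10 rounds trains on 90 users and tests on the remaining 10, and every user appears in exactly one test set.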
Although the content of the present invention has been described in detail through the above preferred embodiments, it should be understood that the above description is not to be regarded as limiting the invention. Various modifications and substitutions will be apparent to those skilled in the art after reading the above. The scope of protection of the invention is therefore defined by the appended claims.
Claims (6)
1. A social network user personality prediction method based on Chinese text analysis, characterized by comprising the following steps:
S1, performing preliminary processing on Chinese social network text, dividing the text into three classes: user basic status information, user interaction information and user text information;
S2, preprocessing the user text information to obtain a data set D_word composed of words of all kinds;
S3, extracting the text features of the user text information: performing part-of-speech tagging on the data set D_word based on a sentiment dictionary, calculating the frequency with which each part of speech occurs in the text, combining and optimizing the above three classes of information, and, taking the results of testing the users with an expert scale as ground-truth data, constructing a numerical data set D_comp;
S4, performing feature engineering on the numerical data set D_comp, i.e. screening the features, to obtain the feature element set D_pre used for personality prediction;
S5, training a model based on a BP neural network to perform personality prediction: taking the feature vectors in the feature element set D_pre obtained in step S4 as model input and the proportions of the five personality traits (neuroticism, extraversion, openness, agreeableness and conscientiousness) as model output, constructing the neural network, training the prediction model and performing personality prediction.
2. The social network user personality prediction method based on Chinese text analysis according to claim 1, characterized in that, in step S1:
the user basic status information comprises the number of posted statuses, the follow count, the fan count, the duration of social network use and the average posting frequency, reflecting the user's basic usage of the social network;
the user interaction information comprises the number of emoticons, the number of topics, the number of @ mentions and the number of reposts in the social network, reflecting the user's interaction with public topics and friends;
the user text information is the pure language content of the text, reflecting the user's language habits, manner of expression and sentiment orientation.
3. The social network user personality prediction method based on Chinese text analysis according to claim 2, characterized in that the preprocessing in step S2 refers to cleaning the text data, segmenting it into words and removing stop words:
the cleaning of the text data comprises using regular-expression matching to filter out non-text content in the social network text, such as pictures, emoticons, location information, fields enclosed in a pair of "#", URLs, emoticon symbols and repost markers;
the word segmentation refers to segmenting the user text, converting the full text into a set of words;
the stop-word removal refers to using regular expressions to remove text noise, removing stop words that occur with high frequency in the text but carry no practical meaning, the stop words comprising pronouns, auxiliary words and punctuation marks.
4. The social network user personality prediction method based on Chinese text analysis according to claim 3, characterized in that, in step S3:
combining and optimizing the three classes of information refers to: counting the user's basic status information, namely the number of posted statuses, the follow count, the fan count, the duration of social network use and the average posting frequency, and counting the interaction information in the user text, namely the number of emoticons, the number of topics, the number of @ mentions and the number of reposts;
constructing the numerical data set D_comp with the results of testing the users with an expert scale as ground-truth data refers to: testing the users with a Big Five personality expert scale, scoring according to the expert rules and calculating the proportion of each of the five traits to form five label data; then combining the part-of-speech tagging results with the combined three-class information results to obtain the part-of-speech frequencies, basic status information and interaction information, which together constitute the numerical data set D_comp.
5. The social network user personality prediction method based on Chinese text analysis according to claim 4, characterized in that D_comp has 111 features in total, corresponding to 102 parts of speech, 5 kinds of user status information and 4 kinds of interaction information, and step S4 specifically comprises:
S41, separately calculate the correlation between the five personality traits and each part of speech:
Correlation is measured with the Pearson correlation coefficient, calculated as:
$$\rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X\,\sigma_Y} = \frac{E\bigl[(X-\bar{X})(Y-\bar{Y})\bigr]}{\sigma_X\,\sigma_Y} \qquad (1)$$
where Cov(X, Y) is the covariance of variables X and Y, σ_X and σ_Y are the standard deviations of X and Y respectively, and $\bar{X}$ and $\bar{Y}$ are the mean values of X and Y. Here X is the data corresponding to a part-of-speech feature W_i (i = 1…102) among the 102 part-of-speech features W, and Y is the data corresponding to a trait Ch_j (j = 1…5) among the 5 personality traits Ch. For each trait, the 13 parts of speech with the highest Pearson coefficient are taken, giving 5 part-of-speech sets:
Set_j = { W_{1…13} | top 13 part-of-speech features corresponding to Ch_j }, j = 1…5
The union of the 13-feature sets of all traits gives the part-of-speech set Set_W:
Set_W = Set_1 ∪ Set_2 ∪ … ∪ Set_5
S42, separately calculate the correlation between the five personality traits and the 5 kinds of user status information features and 4 kinds of interaction information features:
Correlation is measured with the Pearson correlation coefficient, calculated as in formula (1). Here X is the data corresponding to a feature F_i (i = 1…9) among the 9 features F, and Y is the data corresponding to a trait Ch_j (j = 1…5) among the 5 personality traits Ch. For each trait, the three features with the highest Pearson coefficient are taken, giving 5 feature sets:
Set_j = { F_{1…3} | top 3 features corresponding to Ch_j }, j = 1…5
The union of the 3-feature sets of all traits gives the set Set_F:
Set_F = Set_1 ∪ Set_2 ∪ … ∪ Set_5
S43, calculate the pairwise correlation between the parts of speech selected in step S41:
The frequency of each part of speech is recalculated over the text made up of Set_W. Correlation is measured with the Pearson correlation coefficient, calculated as in S41. Here X is the data of one part-of-speech feature in Set_W and Y is the data of another part-of-speech feature in Set_W other than X, i.e.:
$$\rho_{W_i,W_j} = \frac{\mathrm{Cov}(W_i, W_j)}{\sigma_{W_i}\,\sigma_{W_j}}$$
where $\rho_{W_i,W_j}$ is the Pearson coefficient of parts of speech W_i and W_j. If the Pearson coefficient of a pair of parts of speech is greater than 0.6, one of the pair is removed, giving the set Set_Wn. The union of Set_Wn and Set_F is taken to obtain Set_pre = Set_Wn ∪ Set_F, and the corresponding data are then merged with the five personality label data to obtain the multi-label, multi-target feature element set D_pre.
6. The social network user personality prediction method based on Chinese text analysis according to claim 5, characterized in that step S5 specifically comprises:
S51, before the model is trained on D_pre, first normalizing the data using logarithmic normalization, calculated as x' = lg(x) / lg(max), where x is a feature value and max is the maximum value in the data corresponding to that feature;
S52, selecting the tanh function as the activation function of the neurons, so that the output keeps a nonlinear, monotonically increasing or decreasing relationship with the input, which suits gradient-based solving and gives good fault tolerance;
S53, preventing over-fitting using L2 regularization, i.e. weight decay;
S54, training and predicting using ten-fold cross-validation, and using grid search to tune the learning rate, dropout rate, number of epochs and number of neurons.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811553414.1A CN109635207A (en) | 2018-12-18 | 2018-12-18 | A kind of social network user personality prediction technique based on Chinese text analysis |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109635207A true CN109635207A (en) | 2019-04-16 |
Family ID=66075217
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811553414.1A Withdrawn CN109635207A (en) | 2018-12-18 | 2018-12-18 | A kind of social network user personality prediction technique based on Chinese text analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109635207A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110391013A (en) * | 2019-07-17 | 2019-10-29 | 北京智能工场科技有限公司 | A kind of system and device based on semantic vector building neural network prediction mental health |
CN111352972A (en) * | 2020-02-28 | 2020-06-30 | 厦门医学院 | Statistical personality calculation method based on behavior big data |
CN112580329A (en) * | 2019-09-30 | 2021-03-30 | 北京国双科技有限公司 | Text noise data identification method and device, computer equipment and storage medium |
CN112818662A (en) * | 2021-01-29 | 2021-05-18 | 清华大学 | Psychological stress prediction system and method based on social network media |
CN113222772A (en) * | 2021-04-08 | 2021-08-06 | 合肥工业大学 | Native personality dictionary construction method, system, storage medium and electronic device |
CN113345590A (en) * | 2021-06-29 | 2021-09-03 | 安徽大学 | User mental health monitoring method and system based on heterogeneous graph |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110391013A (en) * | 2019-07-17 | 2019-10-29 | 北京智能工场科技有限公司 | A kind of system and device based on semantic vector building neural network prediction mental health |
CN110391013B (en) * | 2019-07-17 | 2020-08-14 | 北京智能工场科技有限公司 | System and device for predicting mental health by building neural network based on semantic vector |
CN112580329A (en) * | 2019-09-30 | 2021-03-30 | 北京国双科技有限公司 | Text noise data identification method and device, computer equipment and storage medium |
CN112580329B (en) * | 2019-09-30 | 2024-02-20 | 北京国双科技有限公司 | Text noise data identification method, device, computer equipment and storage medium |
CN111352972A (en) * | 2020-02-28 | 2020-06-30 | 厦门医学院 | Statistical personality calculation method based on behavior big data |
CN112818662A (en) * | 2021-01-29 | 2021-05-18 | 清华大学 | Psychological stress prediction system and method based on social network media |
CN113222772A (en) * | 2021-04-08 | 2021-08-06 | 合肥工业大学 | Native personality dictionary construction method, system, storage medium and electronic device |
CN113222772B (en) * | 2021-04-08 | 2023-10-31 | 合肥工业大学 | Native personality dictionary construction method, native personality dictionary construction system, storage medium and electronic equipment |
CN113345590A (en) * | 2021-06-29 | 2021-09-03 | 安徽大学 | User mental health monitoring method and system based on heterogeneous graph |
CN113345590B (en) * | 2021-06-29 | 2022-12-16 | 安徽大学 | User mental health monitoring method and system based on heterogeneous graph |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | WW01 | Invention patent application withdrawn after publication | Application publication date: 20190416 |