CN109635207A - Method for predicting social network user personality based on Chinese text analysis - Google Patents

Info

Publication number
CN109635207A
Authority
CN
China
Prior art keywords
user
speech
personality
text
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201811553414.1A
Other languages
Chinese (zh)
Inventor
李岩锋
高俊波
孙伟
李铁锋
白静静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Maritime University
Original Assignee
Shanghai Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Maritime University
Priority to CN201811553414.1A
Publication of CN109635207A
Current legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Tourism & Hospitality (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method for predicting the personality of social network users based on Chinese text analysis. The Chinese text data that a user has recently posted on a social network is processed and divided into three classes: basic user status information, user interaction information and user text information. The text data is preprocessed to obtain a data set composed of words of various kinds. Part-of-speech tagging is performed on the text data based on a sentiment dictionary, the frequency of occurrence of each part of speech in the text is calculated, the three classes of user information are combined and optimized, and the results of testing the users with an expert scale are used as ground-truth data to construct a numerical data set. Feature engineering is performed on this data set to obtain the feature set used for personality prediction, a personality prediction model is obtained by training a BP neural network, and the personality of social network users is predicted with this model. The invention has the advantages of convenient data acquisition, no dependence on psychological expertise, no need to expend manpower and material resources, and high accuracy.

Description

Method for predicting social network user personality based on Chinese text analysis
Technical field
The present invention relates to the field of Internet technology, and in particular to a method for predicting the personality of social network users based on Chinese text analysis.
Background technique
With the rapid development of Internet technology and the continuous expansion of its fields of application, social networks such as microblogs and friend circles are changing human life. People can publish information and interact on them, which forms a huge network environment in which information spreads quickly and widely, with good timeliness and convenience.
The emotions expressed in information on social networks are mainly influenced by the user's personality; conversely, a user's personality manifests itself mainly through the user's emotional expression. Analyzing the personality of social network users therefore makes it possible to understand their emotional information effectively, which is of great significance for promoting Internet products, targeting advertisements precisely, pushing personalized content and providing after-sales service.
The mainstream model of human personality is the Big Five personality model, which regards a person's personality as a combination of the following five personality traits:
Neuroticism: High expression: easily offended, easily discouraged, uneasy, strongly self-conscious, impulsive, vulnerable. Low expression: secure, calm, less self-conscious, relatively self-satisfied. Neuroticism mainly reflects an individual's tendency to experience unpleasant emotions in response to events. It also reflects emotional ups and downs, relatively poor control over impulses and a tendency to become weary of one's surroundings.
Extraversion: High expression: outgoing, enthusiastic, energetic, fond of stimulation, fond of making friends, prone to positive emotions. Low expression: poor at communicating, poor at expressing one's own emotions, serious. Extraversion is mainly reflected in an individual's self-confidence, talkativeness, sociability and love of life, and in actively seeking positive emotions.
Openness: High expression: rich imagination, willingness to try new things, rich aesthetic and emotional experience, thirst for knowledge, open values. Low expression: pragmatic, compliant, content with the status quo, bound by convention. Openness mainly reflects a person's curiosity about the outside world, love of life and fondness for new things.
Compliance: High expression: trusting of others, deferring to others' opinions, willing to help others, sincere, compliant, sympathetic. Low expression: ruthless, suspicious, mocking of others, rebellious. Compliance mainly reflects trust between people, as opposed to suspicion of others and harsh treatment of them, and it also serves as a measure of whether an individual is willing to help others.
Conscientiousness: High expression: rigorous in handling matters, orderly, self-confident, responsible, self-disciplined, achievement-oriented, careful. Low expression: disorganized, weak-willed, careless. Conscientiousness mainly reflects how an individual approaches tasks: exercising self-restraint, working carefully and being confident in one's own abilities.
At present, personality is mainly assessed with psychological scales: the subject answers a series of questions, the answers are scored according to certain rules, and the subject's personality is analyzed from the scores. This approach depends on psychological expertise and consumes too much labor and time. With the development of natural language processing and machine learning theory, psychological sentiment analysis based on text data has become an important research topic, but related research and inventions for Chinese social network text are still few.
Summary of the invention
The purpose of the present invention is to provide a method for predicting the personality of social network users based on Chinese text analysis, which processes and mines the text information posted by social network users and then analyzes their personality composition, with the advantages of high accuracy, fast analysis and automation.
To achieve the above purpose, the invention is realized by the following technical solution:
A method for predicting the personality of social network users based on Chinese text analysis, characterized in that it comprises the following steps:
S1. Performing preliminary processing on Chinese social network text, dividing the text into three classes: basic user status information, user interaction information and user text information;
S2. Preprocessing the user text information to obtain a data set D_word composed of words of various kinds;
S3. Extracting text features from the user text information: performing part-of-speech tagging on the data set D_word based on a sentiment dictionary, calculating the frequency of occurrence of each part of speech in the text, combining and optimizing the above three classes of information, and using the results of testing the users with an expert scale as ground-truth data to construct a numerical data set D_comp;
S4. Performing feature engineering on the numerical data set D_comp, i.e. screening the features, to obtain the feature set D_pre used for personality prediction;
S5. Training a personality prediction model based on a BP neural network: the feature vectors in the feature set D_pre obtained in step S4 are used as the model input, and the proportions of the five personality traits (neuroticism, extraversion, openness, compliance and conscientiousness) are used as the model output; a neural network is constructed, the prediction model is trained, and personality prediction is performed.
In the above method for predicting social network user personality based on Chinese text analysis, in step S1:
The basic user status information includes the number of posted statuses, the number of accounts followed, the number of followers (fans), the duration of social network use and the average posting frequency, reflecting the user's basic usage of the social network;
The user interaction information includes the number of emoticons, the number of topics, the number of @-mentions and the number of reposts in the social network, reflecting the user's interaction with public topics and friends;
The user text information is the pure language content of the text, reflecting the user's language habits, manner of expression and emotional tendency.
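The interaction counts of step S1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the regular expressions assume Weibo-style markup (bracketed emoticon codes, topics wrapped in a pair of "#", "@name" mentions and "//@" repost markers), and the function name `interaction_features` is hypothetical.

```python
import re

# Assumed Weibo-style conventions (not specified in the source):
#   [哈哈]     bracketed emoticon code
#   #话题#     topic wrapped in a pair of "#"
#   @用户名    mention of another user
#   //@用户名  marker showing that the post is a repost
EMOTICON = re.compile(r"\[[^\[\]\s]{1,8}\]")
TOPIC = re.compile(r"#[^#\s][^#]*#")
REPOST = re.compile(r"//@")
MENTION = re.compile(r"(?<!//)@[\w\-]+")


def interaction_features(posts):
    """Count emoticons, topics, mentions and reposts over a user's posts."""
    counts = {"emoticons": 0, "topics": 0, "mentions": 0, "reposts": 0}
    for post in posts:
        counts["emoticons"] += len(EMOTICON.findall(post))
        counts["topics"] += len(TOPIC.findall(post))
        counts["mentions"] += len(MENTION.findall(post))
        # A post counts as one repost however long its repost chain is.
        counts["reposts"] += 1 if REPOST.search(post) else 0
    return counts
```

Names inside a repost chain ("//@name") are deliberately not counted as mentions here; whether the original method separates the two in this way is not stated in the source.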
In the above method for predicting social network user personality based on Chinese text analysis, the preprocessing in step S2 means cleaning the text data, segmenting it into words and removing stop words:
Cleaning the text data comprises using regular-expression matching to filter out non-text content in the social network text, such as pictures, emoticons, location information, fields enclosed in a pair of "#", URLs, emoji and repost markers;
Word segmentation means segmenting the user text so that the full text is converted into a set of words;
Stop-word removal means using regular expressions to remove text noise, that is, stop words that occur frequently in the text but carry no practical meaning; stop words include pronouns, auxiliary words and punctuation marks.
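The three preprocessing operations of step S2 can be sketched as below. The patterns are illustrative assumptions about Weibo-style markup, the function names are hypothetical, and the word segmenter is passed in as a callable because the source does not name one (a Chinese segmenter such as jieba's `lcut` could be supplied).

```python
import re

# Illustrative patterns; the source only says "regular-expression matching".
URL = re.compile(r"https?://[A-Za-z0-9./?=&_%#\-]+")
TOPIC = re.compile(r"#[^#]+#")                  # field enclosed in a pair of "#"
EMOTICON = re.compile(r"\[[^\[\]\s]{1,8}\]")    # bracketed emoticon code
REPOST_OR_MENTION = re.compile(r"(//)?@[\w\-]+:?")
NON_WORD = re.compile(r"[^\w\s]|_")             # punctuation and symbols


def clean_text(post):
    """Replace non-text content with spaces so neighbouring words stay apart."""
    for pattern in (URL, TOPIC, EMOTICON, REPOST_OR_MENTION):
        post = pattern.sub(" ", post)
    return post


def remove_stop_words(tokens, stop_words):
    """Drop punctuation-only tokens and tokens listed as stop words."""
    kept = []
    for token in tokens:
        token = NON_WORD.sub("", token).strip()
        if token and token not in stop_words:
            kept.append(token)
    return kept


def preprocess(post, segment, stop_words):
    """Clean, segment and filter one post; `segment` maps a string to tokens."""
    return remove_stop_words(segment(clean_text(post)), stop_words)
```

For a quick check without a Chinese segmenter, whitespace-separated text can be passed with `str.split` as the `segment` argument.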
In the above method for predicting social network user personality based on Chinese text analysis, in step S3:
Combining and optimizing the three classes of information means: compiling the basic status information, namely the user's number of posted statuses, number of accounts followed, number of followers, duration of social network use and average posting frequency, and compiling the interaction information in the user text, namely the number of emoticons, number of topics, number of @-mentions and number of reposts;
Using the results of testing the users with an expert scale as ground-truth data to construct the numerical data set D_comp means: the users are tested with a Big Five personality expert scale and scored according to the expert rules, and the proportion of each of the five personality traits is calculated to form five label data; these are combined with the part-of-speech frequencies, basic status information and interaction information obtained from the part-of-speech tagging and from combining and optimizing the three classes of information, to form the numerical data set D_comp.
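The frequency calculation of step S3 can be sketched as follows, under the assumption that the sentiment dictionary maps each word to one or more category labels and that a frequency is the category count divided by the total number of tokens. The toy lexicon and the name `category_frequencies` are hypothetical; the source does not give the dictionary format.

```python
from collections import Counter


def category_frequencies(tokens, lexicon):
    """Return the share of tokens falling into each dictionary category.

    `lexicon` maps a word to a list of category names; words that are not
    in the dictionary contribute to the total but to no category.
    """
    counts = Counter()
    for token in tokens:
        for category in lexicon.get(token, ()):
            counts[category] += 1
    total = len(tokens)
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


# Toy dictionary for illustration only.
toy_lexicon = {"朋友": ["friend"], "咖啡": ["ingestion"], "工作": ["work"]}
frequencies = category_frequencies(["朋友", "咖啡", "朋友", "安静"], toy_lexicon)
```

Each user then contributes one row to D_comp: the category frequencies, the basic status counts and the interaction counts, followed by the five trait proportions as labels.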
In the above method for predicting social network user personality based on Chinese text analysis, D_comp has 111 features in total, corresponding to 102 parts of speech, 5 kinds of user status information and 4 kinds of interaction information, and step S4 specifically comprises:
S41. Calculating the correlation between each of the five personality traits and each part of speech:
The correlation is measured with the Pearson correlation coefficient, calculated as:
$$\rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\bar{X})(Y-\bar{Y})]}{\sigma_X \sigma_Y} \qquad (1)$$
where Cov(X, Y) is the covariance of variables X and Y, σ_X and σ_Y are the standard deviations of X and Y respectively, and X̄ and Ȳ are the mean values of X and Y. Here X is the data corresponding to a part-of-speech feature W_i (i = 1…102) among the 102 part-of-speech features W, and Y is the data corresponding to a personality trait Ch_j (j = 1…5) among the 5 personality traits Ch. For each trait, the 13 parts of speech with the highest Pearson coefficients are taken, forming 5 part-of-speech sets:
Set_j = { W_1…13 | the top 13 part-of-speech features for Ch_j },  j = 1…5
The union of the sets of 13 part-of-speech features for the five traits gives the part-of-speech set Set_W:
Set_W = Set_1 ∪ Set_2 ∪ … ∪ Set_5
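The selection in step S41 can be sketched in plain Python as follows. The function names and the toy data are hypothetical, and ranking by the signed coefficient (rather than its absolute value) is one reading of "highest Pearson coefficient" in the source.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treated as uncorrelated (assumption)
    return cov / math.sqrt(var_x * var_y)


def top_k_features(features, trait, k):
    """Names of the k features with the highest coefficient against one trait."""
    ranked = sorted(features, key=lambda name: pearson(features[name], trait),
                    reverse=True)
    return set(ranked[:k])


def select_union(features, traits, k):
    """Union of the per-trait top-k sets (Set_W when k = 13)."""
    selected = set()
    for values in traits.values():
        selected |= top_k_features(features, values, k)
    return selected


# Toy data: three features and two traits observed for four users.
toy_features = {"a": [1, 2, 3, 4], "b": [4, 3, 2, 1], "c": [1, 3, 2, 4]}
toy_traits = {"t1": [1, 2, 3, 4], "t2": [4, 3, 2, 1]}
chosen = select_union(toy_features, toy_traits, k=1)
```

Because the per-trait sets overlap, the union is usually much smaller than 5 × 13 features. The same routine with k = 3 applied to the 9 status and interaction features would correspond to step S42.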
S42. Calculating the correlation between each of the five personality traits and the 5 user status information features and 4 interaction information features:
The correlation is measured with the Pearson correlation coefficient, calculated as in formula (1); here X is the data corresponding to a feature F_i (i = 1…9) among the 9 features F, and Y is the data corresponding to a personality trait Ch_j (j = 1…5) among the 5 personality traits Ch. For each trait, the three features with the highest Pearson coefficients are taken, forming 5 feature sets:
Set_j = { F_1…3 | the top 3 features for Ch_j },  j = 1…5
The union of the sets of 3 features for the five traits gives the set Set_F:
Set_F = Set_1 ∪ Set_2 ∪ … ∪ Set_5
S43. Calculating the pairwise correlation between the parts of speech selected in step S41:
The frequency of occurrence of each part of speech in the text formed by Set_W is recalculated. The correlation is measured with the Pearson correlation coefficient, calculated as described in S41; here X is the data corresponding to one part-of-speech feature in Set_W and Y is the data corresponding to another part-of-speech feature in Set_W other than X:
$$\rho_{W_i,W_j} = \frac{\mathrm{Cov}(W_i,W_j)}{\sigma_{W_i}\,\sigma_{W_j}}$$
where ρ_{W_i,W_j} is the Pearson coefficient of part of speech W_i and part of speech W_j. If the Pearson coefficient between a pair of parts of speech is greater than 0.6, one of the pair is removed, giving the set Set_Wn. The union of Set_Wn and Set_F is taken, Set_pre = Set_Wn ∪ Set_F, and the corresponding data are merged with the five personality label data to obtain the multi-label, multi-target feature set D_pre.
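The redundancy filter of step S43 can be sketched as follows. The source does not say which member of a highly correlated pair is removed; this sketch keeps the feature that appears first, which is an assumption, and the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treated as uncorrelated (assumption)
    return cov / math.sqrt(var_x * var_y)


def drop_redundant(features, threshold=0.6):
    """Keep a feature only if its coefficient with every kept one is <= threshold."""
    kept = []
    for name in features:  # dicts preserve insertion order
        if all(pearson(features[name], features[other]) <= threshold
               for other in kept):
            kept.append(name)
    return kept


# Toy data: "b" duplicates "a" up to scale, "c" is weakly related to "a".
toy = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 1, 3, 2]}
survivors = drop_redundant(toy)
```

The result depends on the iteration order, so a deterministic ordering of Set_W (for example by correlation with the traits) would be needed for reproducible output.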
In the above method for predicting social network user personality based on Chinese text analysis, step S5 specifically comprises:
S51. Before training the model on D_pre, the data are normalized using logarithmic normalization, calculated as x = lg(x) / lg(max), where x is a feature value and max is the maximum value in the data for that feature;
S52. The tanh function is selected as the activation function of the neurons; it keeps a nonlinear, monotonically increasing and decreasing relationship between output and input, is suited to gradient-based solving, and has good fault tolerance;
S53. Over-fitting is prevented using L2 regularization, i.e. weight decay;
S54. Training and prediction are performed using ten-fold cross-validation, and grid search is used to tune the learning rate, dropout rate, number of epochs and number of neurons.
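The logarithmic normalization of step S51 can be sketched as below. The source gives only the formula x = lg(x) / lg(max); the restriction to values of at least 1 is an assumption added here so that the result stays within [0, 1], and the name `log_normalize` is hypothetical.

```python
import math


def log_normalize(values):
    """Scale one feature column with x -> lg(x) / lg(max)."""
    peak = max(values)
    # Assumption: inputs are counts of at least 1 and the column is not all 1s,
    # otherwise lg(x) is negative or undefined, or lg(max) is zero.
    if min(values) < 1 or peak <= 1:
        raise ValueError("values must be >= 1 with a maximum above 1")
    return [math.log10(v) / math.log10(peak) for v in values]


scaled = log_normalize([1, 10, 100])
```

Features that can be zero, such as a count of topics, would need an offset or a separate rule before this transform; the source does not say how such values are handled.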
Compared with the prior art, the present invention has the following advantages:
1. Data acquisition is extremely convenient: provided that the tested user agrees, the social network text is collected automatically by a program;
2. Unlike traditional personality analysis by expert scale testing, the invention predicts personality from social network text by calling a trained model; it does not depend on psychological expertise, requires no expenditure of manpower or material resources, and has advantages such as high accuracy and low time cost.
Description of the drawings
Fig. 1 is a flow chart of the method of the present invention;
Fig. 2 is a flow chart of the feature engineering in an embodiment of the present invention;
Fig. 3 is a structure diagram of the neural network in an embodiment of the present invention.
Detailed description of the embodiments
The present invention is further explained below by describing a preferred embodiment in detail with reference to the drawings.
As shown in Figs. 1 and 2, a method for predicting the personality of social network users based on Chinese text analysis is characterized in that it comprises the following steps:
S1. Performing preliminary processing on Chinese social network text, dividing the text into three classes: basic user status information, user interaction information and user text information. The social network text may be the text data posted by the user over the past year. Social network text is short text and generally contains a lot of noise, so preliminary processing is required;
S2. Preprocessing the user text information to obtain a data set D_word composed of words of various kinds; preprocessing means cleaning the text data, segmenting it into words and removing stop words;
S3. Extracting text features from the user text information: performing part-of-speech tagging on the data set D_word based on a sentiment dictionary, calculating the frequency of occurrence of each part of speech in the text, combining and optimizing the above three classes of information, and using the results of testing the users with an expert scale as ground-truth data to construct a numerical data set D_comp;
S4. Because the feature dimensionality is too high, performing feature engineering on the numerical data set D_comp, i.e. screening the features, to obtain the feature set D_pre used for personality prediction;
S5. Training a personality prediction model based on a BP neural network: the feature vectors in the feature set D_pre obtained in step S4 are used as the model input, and the proportions of the five personality traits (neuroticism, extraversion, openness, compliance and conscientiousness) are used as the model output; a neural network is constructed, the prediction model is trained, and personality prediction is performed.
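The input and output shape of the model in step S5 can be sketched with a forward pass through a network that has one hidden layer. This is only an illustration under several assumptions not stated in the source: a single hidden layer, a linear output layer, uniform random initial weights and the hypothetical function names shown. Training by back-propagation is omitted.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(inputs, weights, biases):
    """Weighted sum of the inputs plus bias for every neuron of a layer."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(features, hidden, output):
    """tanh hidden layer followed by a linear layer with five outputs."""
    activations = [math.tanh(z) for z in dense(features, *hidden)]
    return dense(activations, *output)


rng = random.Random(0)
hidden_layer = init_layer(4, 3, rng)   # 4 input features, 3 hidden neurons
output_layer = init_layer(3, 5, rng)   # 5 outputs, one per personality trait
prediction = forward([0.1, 0.2, 0.3, 0.4], hidden_layer, output_layer)
```

In the method itself the input width would equal the number of features in D_pre, and the five outputs would be fitted to the trait proportions obtained from the expert scale.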
In step S1: the basic user status information includes the number of posted statuses, the number of accounts followed, the number of followers (fans), the duration of social network use and the average posting frequency, reflecting the user's basic usage of the social network; the user interaction information includes the number of emoticons, the number of topics, the number of @-mentions and the number of reposts in the social network, reflecting the user's interaction with public topics and friends; the user text information is the pure language content of the text, reflecting the user's language habits, manner of expression and emotional tendency.
In step S2: cleaning the text data comprises using regular-expression matching to filter out non-text content in the social network text, such as pictures, emoticons, location information, fields enclosed in a pair of "#", URLs, emoji and repost markers; word segmentation means segmenting the user text so that the full text is converted into a set of words; stop-word removal means using regular expressions to remove text noise, that is, stop words that occur frequently in the text but carry no practical meaning, where stop words include pronouns, auxiliary words and punctuation marks. For example, the Chinese social network text "On the fifth day of the short holiday I invited my best friend out and once again we ate and drank without restraint. Finding a quiet place on the crowded Nanjing Road is not easy, but the café hidden in the Guang Hai publishing house brought unexpected peace and quiet" becomes, after the processing of step S2, "short holiday fifth day invited best friend ate drank without restraint crowded Nanjing Road quiet place not easy Guang Hai publishing house café unexpected peace and quiet".
In step S3: combining and optimizing the three classes of information means compiling the basic status information, namely the user's number of posted statuses, number of accounts followed, number of followers, duration of social network use and average posting frequency, and compiling the interaction information in the user text, namely the number of emoticons, number of topics, number of @-mentions and number of reposts. Using the results of testing the users with an expert scale as ground-truth data to construct the numerical data set D_comp means: the users are tested with a Big Five personality expert scale and scored according to the expert rules, and the proportion of each of the five personality traits is calculated to form five label data; these are combined with the part-of-speech frequencies, basic status information and interaction information obtained from the part-of-speech tagging and from combining and optimizing the three classes of information, to form the numerical data set D_comp. In this embodiment, the users are tested with a Big Five personality expert scale based on expert experience. The scale has 60 items in total, 12 for each trait, of which one third are negatively keyed and two thirds are positively keyed. Each user's five traits are scored according to the scoring rules of the scale, and the proportion of each trait's score in the total is then calculated to form the five label data. These are combined with the part-of-speech frequencies obtained earlier by part-of-speech tagging of D_word based on the sentiment dictionary, and with the basic status information and interaction information obtained by combining and optimizing the three classes of information, to form the numerical data set D_comp.
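The scoring of the 60-item scale into five label values can be sketched as follows. The source states only that each trait has 12 items, that one third of the items are negatively keyed, and that the label is each trait's share of the total score. The 5-point answer range, the reverse-scoring rule (scale_max + 1 - answer) and the assignment of items to traits below are assumptions made for illustration.

```python
TRAITS = ("neuroticism", "extraversion", "openness",
          "compliance", "conscientiousness")


def trait_proportions(answers, item_trait, reversed_items, scale_max=5):
    """Turn questionnaire answers into the share of each trait's score.

    answers        -- maps item id to the chosen answer (1..scale_max)
    item_trait     -- maps item id to the trait the item measures
    reversed_items -- ids of negatively keyed items, scored in reverse
    """
    if not answers:
        raise ValueError("no answers given")
    totals = dict.fromkeys(TRAITS, 0)
    for item, answer in answers.items():
        score = scale_max + 1 - answer if item in reversed_items else answer
        totals[item_trait[item]] += score
    grand_total = sum(totals.values())
    return {trait: total / grand_total for trait, total in totals.items()}


# Hypothetical layout: items 0..59, 12 consecutive items per trait,
# every third item negatively keyed.
layout = {i: TRAITS[i // 12] for i in range(60)}
negative = {i for i in range(60) if i % 3 == 0}
labels = trait_proportions({i: 3 for i in range(60)}, layout, negative)
```

The five values always sum to 1, which matches their use as proportional model outputs in step S5.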
In the above method for predicting social network user personality based on Chinese text analysis, in this embodiment D_comp is set to have 111 features in total, corresponding to 102 parts of speech, 5 kinds of user status information and 4 kinds of interaction information. As shown in Fig. 2, step S4 specifically comprises:
S41. Calculating the correlation between each of the five personality traits and each part of speech:
The correlation is measured with the Pearson correlation coefficient, calculated as:
$$\rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\bar{X})(Y-\bar{Y})]}{\sigma_X \sigma_Y} \qquad (1)$$
where Cov(X, Y) is the covariance of variables X and Y, σ_X and σ_Y are the standard deviations of X and Y respectively, and X̄ and Ȳ are the mean values of X and Y. Here X is the data corresponding to a part-of-speech feature W_i (i = 1…102) among the 102 part-of-speech features W, and Y is the data corresponding to a personality trait Ch_j (j = 1…5) among the 5 personality traits Ch. For each trait, the 13 parts of speech with the highest Pearson coefficients are taken, forming 5 part-of-speech sets:
Set_j = { W_1…13 | the top 13 part-of-speech features for Ch_j },  j = 1…5
The union of the sets of 13 part-of-speech features for the five traits gives the part-of-speech set Set_W:
Set_W = Set_1 ∪ Set_2 ∪ … ∪ Set_5
For example, the set Set_1 of 13 parts of speech for the neuroticism trait is {"present-tense words", "anxiety words", "motion words", "money words", "religion words", "death words", "human words", "insight words", "causation words", "swear words", "perceptual process words", "assent words", "body words"}.
The set Set_2 of 13 parts of speech for the extraversion trait is {"friend words", "health words", "sexual words", "space words", "leisure words", "cognitive process words", "visual words", "feeling words", "work words", "love words", "swear words", "assent words", "numeral ratio"}.
The set Set_3 of 13 parts of speech for the openness trait is {"present-tense words", "friend words", "health words", "sexual words", "motion words", "space words", "human words", "social process words", "cognitive process words", "insight words", "visual words", "biological process words", "body words"}.
The set Set_4 of 13 parts of speech for the compliance trait is {"money words", "work words", "love words", "past-tense words", "causation words", "inclusive words", "exclusive words", "hearing words", "multi-function words", "ingestion words", "perceptual process words", "relativity words", "assent words"}.
The set Set_5 of 13 parts of speech for the conscientiousness trait is {"anger words", "achievement words", "money words", "specific determiners", "social process words", "cognitive process words", "insight words", "work words", "exclusive words", "relativity words", "numeral ratio", "biological process words", "body words"}.
Taking the union of Set_1, Set_2, Set_3, Set_4 and Set_5 gives the part-of-speech set Set_W, for example the set {"present-tense words", "friend words", "anxiety words", "anger words", "health words", "sexual words", "motion words", "space words", "achievement words", "leisure words", "money words", "religion words", "death words", "specific determiners", "human words", "social process words", "cognitive process words", "insight words", "visual words", "feeling words", "work words", "love words", "past-tense words", "causation words", "inclusive words", "exclusive words", "hearing words", "swear words", "multi-function words", "ingestion words", "perceptual process words", "relativity words", "assent words", "numeral ratio", "biological process words", "body words"};
S42. Calculating the correlation between each of the five personality traits and the 5 user status information features and 4 interaction information features:
The correlation is measured with the Pearson correlation coefficient, calculated as in formula (1); here X is the data corresponding to a feature F_i (i = 1…9) among the 9 features F, and Y is the data corresponding to a personality trait Ch_j (j = 1…5) among the 5 personality traits Ch. For each trait, the three features with the highest Pearson coefficients are taken, forming 5 feature sets:
Set_j = { F_1…3 | the top 3 features for Ch_j },  j = 1…5
The union of the sets of 3 features for the five traits gives the set Set_F:
Set_F = Set_1 ∪ Set_2 ∪ … ∪ Set_5
For example, the set {"number of posted statuses", "number of followers", "number of topics", "number of @-mentions"};
S43. Calculating the pairwise correlation between the parts of speech selected in step S41:
The frequency of occurrence of each part of speech in the text formed by Set_W is recalculated. The correlation is measured with the Pearson correlation coefficient, calculated as described in S41; here X is the data corresponding to one part-of-speech feature in Set_W and Y is the data corresponding to another part-of-speech feature in Set_W other than X:
$$\rho_{W_i,W_j} = \frac{\mathrm{Cov}(W_i,W_j)}{\sigma_{W_i}\,\sigma_{W_j}}$$
where ρ_{W_i,W_j} is the Pearson coefficient of part of speech W_i and part of speech W_j. If the Pearson coefficient between a pair of parts of speech is greater than 0.6, one of the pair is removed, giving the set Set_Wn,
for example the set {"present-tense words", "friend words", "anxiety words", "anger words", "health words", "sexual words", "motion words", "space words", "achievement words", "leisure words", "money words", "religion words", "death words", "specific determiners", "human words", "cognitive process words", "insight words", "visual words", "feeling words", "work words", "love words", "past-tense words", "causation words", "inclusive words", "exclusive words", "hearing words", "swear words", "multi-function words", "ingestion words", "assent words", "numeral ratio", "body words"};
The union of Set_Wn and Set_F is taken to obtain Set_pre = Set_Wn ∪ Set_F, and the corresponding data are merged with the five personality label data to obtain the multi-label, multi-target feature set D_pre.
Step S5 specifically comprises the following, with Fig. 3 showing the corresponding neural network structure:
S51. Before training the model on D_pre, the data are normalized using logarithmic normalization, calculated as x = lg(x) / lg(max), where x is a feature value and max is the maximum value in the data for that feature;
S52. The tanh function is selected as the activation function of the neurons, so that output and input keep a nonlinear, monotonically increasing and decreasing relationship; it is suited to gradient-based solving and has good fault tolerance;
S53. Over-fitting is prevented using L2 regularization, i.e. weight decay;
Preventing over-fitting with L2 regularization, i.e. weight decay, means adding a regularization term to the cost function. In the regularization term the index j starts from 1. Taking the partial derivative of the regularized cost function gives the gradient descent update. For j = 0, the value of λ is taken to be 0. Without regularization, the coefficient of θ_j in the update is 1; with regularization, this coefficient becomes smaller than 1, so the weight decays. According to Occam's razor, smaller weights indicate a lower network complexity, and the fit to the data is also better.
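The formulas referred to in this step are not reproduced in the text above. A common form that is consistent with the surrounding description is sketched here as an assumption, not as the patent's exact notation: a squared-error cost over m samples with a linear hypothesis h_θ, n weights, regularization strength λ and learning rate α (all of these symbols are assumed).

```latex
% Regularized cost: the penalty sums over j = 1..n, so \theta_0 is not penalized
J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2
            + \lambda\sum_{j=1}^{n}\theta_j^2\right]

% Partial derivative for j >= 1
\frac{\partial J}{\partial \theta_j}
  = \frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)x_j^{(i)}
    + \frac{\lambda}{m}\theta_j

% Gradient descent update: the factor (1 - \alpha\lambda/m) < 1 shrinks \theta_j
\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right)
  - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)x_j^{(i)}
```

With λ = 0 the factor multiplying θ_j is exactly 1, which recovers plain gradient descent; for a neural network the data term would use the back-propagated gradient in place of the linear expression.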
S54. Training and prediction are performed using ten-fold cross-validation: the data set D is divided into 10 mutually exclusive subsets of similar size, i.e. D = D_1 ∪ D_2 ∪ … ∪ D_10 with D_i ∩ D_j = ∅ (i ≠ j). Each subset D_i is obtained from D by stratified sampling, to keep the data distribution consistent. Each time, the union of k-1 subsets (here k = 10) is used as the training set and the remaining subset as the test set. Grid search is used to tune the learning rate, dropout rate, number of epochs and number of neurons.
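The data splitting and parameter enumeration of step S54 can be sketched as follows. Note that this sketch uses a seeded random shuffle rather than the stratified sampling described above, because the source does not say which variable is stratified; the function names and grid values are hypothetical.

```python
import itertools
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Split sample indices into k mutually exclusive folds of similar size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random, not stratified (assumption)
    return [indices[i::k] for i in range(k)]


def train_test_splits(folds):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def parameter_grid(grid):
    """Yield every combination of the candidate values as a dict."""
    names = list(grid)
    for combo in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, combo))


folds = k_fold_indices(25, k=10)
candidates = list(parameter_grid({
    "learning_rate": [0.01, 0.1],
    "dropout_rate": [0.2, 0.5],
    "epochs": [50],
    "hidden_neurons": [8, 16],
}))
```

For each candidate, a model would be trained on every training split and scored on the matching test split, and the candidate with the best average score kept.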
Although the content of the present invention has been described in detail through the above preferred embodiments, it should be understood that the above description is not to be regarded as limiting the invention. Various modifications and substitutions will be apparent to those skilled in the art after reading the above. The scope of protection of the present invention should therefore be defined by the appended claims.

Claims (6)

1. A social network user personality prediction method based on Chinese text analysis, characterized by comprising the following steps:
S1: preliminarily process the Chinese social network text, dividing it into three classes: user basic status information, user interaction information and user text information;
S2: preprocess the user text information to obtain a data set D_word composed of words of all kinds;
S3: extract text features from the user text information: perform part-of-speech tagging on the data set D_word based on a sentiment dictionary, calculate the frequency with which each part of speech occurs in the text, combine and optimize the above three classes of information, and, taking the results of testing the users with an expert scale as ground-truth data, construct the numerical data set D_comp;
S4: perform feature engineering on the numerical data set D_comp, i.e. screen the features, to obtain the feature element set D_pre used for personality prediction;
S5: perform personality prediction with a model trained as a BP neural network: the feature vectors in the feature element set D_pre obtained in step S4 serve as model input, the proportion values of the five personality traits (neuroticism, extraversion, openness, agreeableness and conscientiousness) serve as model output, a neural network is constructed, and the prediction model is trained and used for personality prediction.
2. The social network user personality prediction method based on Chinese text analysis according to claim 1, characterized in that, in step S1:
the user basic status information includes the number of posted statuses, the follower count, the fan count, the duration of social network use and the average posting frequency, reflecting the user's basic usage of the social network;
the user interaction information includes the number of emoticons, the number of topics, the number of @ mentions and the number of reposts in the social network, reflecting the user's interaction with public topics and friends;
the user text information is the pure language content of the text, reflecting the user's language habits, manner of expression and sentiment tendency.
3. The social network user personality prediction method based on Chinese text analysis according to claim 2, characterized in that the preprocessing in step S2 refers to cleaning the text data, segmenting it into words and removing stop words:
cleaning the text data comprises using regular-expression matching to filter out non-text content in the social network text, such as pictures, emoticons, location information, fields enclosed in a pair of "#", URLs, emoji symbols and repost markers;
word segmentation refers to segmenting the user text, converting the full text into a set of words;
removing stop words refers to removing text noise using regular expressions and removing words that occur frequently in the text but carry no practical meaning; the stop words include pronouns, auxiliary words and punctuation marks.
4. The social network user personality prediction method based on Chinese text analysis according to claim 3, characterized in that, in step S3:
combining and optimizing the three classes of information refers to: counting the user's basic status information, namely the number of posted statuses, follower count, fan count, duration of social network use and average posting frequency; and counting the interaction information in the user text, namely the number of emoticons, number of topics, number of @ mentions and number of reposts;
constructing the numerical data set D_comp with the results of testing the users on an expert scale as ground-truth data refers to: testing the users with a five-factor personality expert scale, scoring according to the expert rules, and calculating the proportion of each of the five personality traits to form five label data; these are combined with the part-of-speech frequencies obtained from the part-of-speech tagging results and with the basic status information and interaction information obtained from combining and optimizing the three classes of information, to constitute the numerical data set D_comp.
5. The social network user personality prediction method based on Chinese text analysis according to claim 4, characterized in that D_comp has 111 features in total, corresponding to 102 parts of speech, 5 kinds of user status information and 4 kinds of interaction information, and step S4 specifically includes:
S41: calculate the correlation between each of the five personality traits and each part of speech:
The correlation is measured with the Pearson correlation coefficient, calculated as:
$$\rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E\left[(X-\bar{X})(Y-\bar{Y})\right]}{\sigma_X \sigma_Y} \qquad (1)$$
where Cov(X, Y) denotes the covariance of variables X and Y, σ_X and σ_Y are the standard deviations of X and Y respectively, and X̄ and Ȳ are the mean values of X and Y. Here, X is the data of a part-of-speech feature W_i (i = 1, …, 102) among the 102 part-of-speech features W, and Y is the data of a personality trait Ch_j (j = 1, …, 5) among the 5 personality traits Ch. For each personality trait, the 13 parts of speech with the highest Pearson coefficients are taken, forming 5 part-of-speech sets:
Set_j = { W_1…13 | the top 13 part-of-speech features for Ch_j }, j = 1, …, 5
The union of the sets of 13 part-of-speech features for each personality trait gives the part-of-speech set Set_W:
Set_W = Set_1 ∪ Set_2 ∪ … ∪ Set_5
S42: calculate the correlation between each of the five personality traits and the 5 user status information features and 4 interaction information features:
The correlation is measured with the Pearson correlation coefficient, calculated as in formula (1). Here, X is the data of a feature F_i (i = 1, …, 9) among the 9 features F, and Y is the data of a personality trait Ch_j (j = 1, …, 5) among the 5 personality traits Ch. For each personality trait, the three features with the highest Pearson coefficients are taken, forming 5 feature sets:
Set_j = { F_1…3 | the top 3 features for Ch_j }, j = 1, …, 5
The union of the sets of 3 features for each personality trait gives the set Set_F:
Set_F = Set_1 ∪ Set_2 ∪ … ∪ Set_5
S43: calculate the pairwise correlation between the parts of speech screened out in step S41:
The frequency of each part of speech is recalculated within the text composed of Set_W. The correlation is measured with the Pearson correlation coefficient, calculated as described in S41. Here, X is the data of one part-of-speech feature in Set_W and Y is the data of another part-of-speech feature in Set_W other than X, that is, calculate:
$$\rho_{W_i,W_j} = \frac{\mathrm{Cov}(W_i,W_j)}{\sigma_{W_i}\,\sigma_{W_j}}$$
where ρ_{W_i,W_j} is the Pearson coefficient of part of speech W_i and part of speech W_j. If the Pearson coefficient between a pair of parts of speech is greater than 0.6, one of the pair is rejected, giving the set Set_Wn. The union of Set_Wn and Set_F gives Set_pre = Set_Wn ∪ Set_F; the corresponding data are then merged with the five personality label data to obtain the multi-label, multi-target feature element set D_pre.
6. The social network user personality prediction method based on Chinese text analysis according to claim 5, characterized in that step S5 specifically includes:
S51: before the model is trained on D_pre, the data are first normalized using logarithmic normalization, calculated as x = lg(x) / lg(max), where x is a feature value and max is the maximum value in the data for that feature;
S52: the tanh function is selected as the activation function of the neurons, so that output and input keep a nonlinear, monotonically rising and falling relationship, which suits gradient-based solving and gives good fault tolerance;
S53: over-fitting is prevented using L2 regularization, i.e. weight decay;
S54: training and prediction use ten-fold cross-validation, and grid search is used to tune the learning rate, dropout rate, epochs and number of neurons.
CN201811553414.1A 2018-12-18 2018-12-18 A kind of social network user personality prediction technique based on Chinese text analysis Withdrawn CN109635207A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811553414.1A CN109635207A (en) 2018-12-18 2018-12-18 A kind of social network user personality prediction technique based on Chinese text analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811553414.1A CN109635207A (en) 2018-12-18 2018-12-18 A kind of social network user personality prediction technique based on Chinese text analysis

Publications (1)

Publication Number Publication Date
CN109635207A true CN109635207A (en) 2019-04-16

Family

ID=66075217

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811553414.1A Withdrawn CN109635207A (en) 2018-12-18 2018-12-18 A kind of social network user personality prediction technique based on Chinese text analysis

Country Status (1)

Country Link
CN (1) CN109635207A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110391013A (en) * 2019-07-17 2019-10-29 北京智能工场科技有限公司 A kind of system and device based on semantic vector building neural network prediction mental health
CN111352972A (en) * 2020-02-28 2020-06-30 厦门医学院 Statistical personality calculation method based on behavior big data
CN112580329A (en) * 2019-09-30 2021-03-30 北京国双科技有限公司 Text noise data identification method and device, computer equipment and storage medium
CN112818662A (en) * 2021-01-29 2021-05-18 清华大学 Psychological stress prediction system and method based on social network media
CN113222772A (en) * 2021-04-08 2021-08-06 合肥工业大学 Native personality dictionary construction method, system, storage medium and electronic device
CN113345590A (en) * 2021-06-29 2021-09-03 安徽大学 User mental health monitoring method and system based on heterogeneous graph

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110391013A (en) * 2019-07-17 2019-10-29 北京智能工场科技有限公司 A kind of system and device based on semantic vector building neural network prediction mental health
CN110391013B (en) * 2019-07-17 2020-08-14 北京智能工场科技有限公司 System and device for predicting mental health by building neural network based on semantic vector
CN112580329A (en) * 2019-09-30 2021-03-30 北京国双科技有限公司 Text noise data identification method and device, computer equipment and storage medium
CN112580329B (en) * 2019-09-30 2024-02-20 北京国双科技有限公司 Text noise data identification method, device, computer equipment and storage medium
CN111352972A (en) * 2020-02-28 2020-06-30 厦门医学院 Statistical personality calculation method based on behavior big data
CN112818662A (en) * 2021-01-29 2021-05-18 清华大学 Psychological stress prediction system and method based on social network media
CN113222772A (en) * 2021-04-08 2021-08-06 合肥工业大学 Native personality dictionary construction method, system, storage medium and electronic device
CN113222772B (en) * 2021-04-08 2023-10-31 合肥工业大学 Native personality dictionary construction method, native personality dictionary construction system, storage medium and electronic equipment
CN113345590A (en) * 2021-06-29 2021-09-03 安徽大学 User mental health monitoring method and system based on heterogeneous graph
CN113345590B (en) * 2021-06-29 2022-12-16 安徽大学 User mental health monitoring method and system based on heterogeneous graph


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20190416
