CN104573034B - User group's division method and system based on CDR tickets - Google Patents

Info

Publication number
CN104573034B
Authority
CN
China
Prior art keywords
user
focus central
short message
call
storm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510020953.9A
Other languages
Chinese (zh)
Other versions
CN104573034A (en
Inventor
罗云彬
李�浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China United Network Communications Group Co Ltd
Original Assignee
China United Network Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China United Network Communications Group Co Ltd
Priority to CN201510020953.9A
Publication of CN104573034A
Application granted
Publication of CN104573034B
Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/18: File system types
    • G06F16/182: Distributed file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A user group division method and system based on call detail record (CDR) tickets are disclosed. The method includes: periodically obtaining the CDR tickets within a predetermined time period and extracting, from the records in the obtained CDR tickets, the contact data of each user, including the number of calls, the call counterparts and the call durations when the user is the calling or called user, and the number and counterparts of the short messages the user sends and receives; calculating, from the contact data of each user, the hot value between the user and each associated user of the user, an associated user being a user who has a call or short message with the user; determining focus central users according to the hot values; rejecting, from the associated users of each focus central user, the users who have calls or short messages only with that focus central user; and dividing each focus central user together with its remaining associated users into one user group. The invention can divide user groups more accurately.

Description

User group's division method and system based on CDR tickets
Technical field
The present invention relates to the field of communications, and in particular to a user group division method and system based on CDR tickets.
Background
With the application and development of big data and the mobile Internet, analysis of user behavior has moved from purely theoretical research to practical application. In particular, as big data technologies such as Hadoop and MapReduce mature, practical applications that mine user data to derive user behavior are steadily increasing.
Hadoop is a distributed storage system for massive data that has become popular in big data applications in recent years. Its two core components are HDFS (Hadoop Distributed File System) and MapReduce. HDFS is the file management tool of the Hadoop system; unlike a traditional database, it stores data as Blocks on the DataNodes, which HDFS manages in a unified way. MapReduce is the task execution tool of the Hadoop system; its main idea is that the Map process distributes a job to the DataNodes for processing, and after the Map stage completes, the Reduce process merges the intermediate results and outputs the final result.
The more common form of user data mining predicts user behavior from the detailed records of users' Internet access, so as to support targeted commercial activity and security management. For example, a user's Internet records can be used to analyze the user's recent network behavior, including traffic volume and favorite websites. An operator can push targeted content according to the favorite websites and remind the user to update the service package in time according to changes in traffic; a security department can check for pornographic and reactionary websites by obtaining the visits directed at particular websites.
The uses and analyses above all target the behavior of a single user, that is, they mine the relationship between a user and things other than users (such as websites and traffic). Another important aspect of big data application and mining concerns the relationships between users, that is, user group division. Current user group division methods fall mainly into two classes:
One class divides users based on a "label" or a similar category identifier, for example by placing users who follow the same "label" (such as films) in one group;
The other class divides users based on user relationships in a social networking site, for example by taking a user with high attention in the site (such as a large number of friends) as a primary user and treating all users who are friends of the primary user as one group.
Both existing methods divide groups only coarsely. The label-based method divides users by a particular demand or preference; sharing a label does not show that users have social contact with each other, so placing them in one user group is inappropriate. The method based on social networking relationships has no reasonable measure of the strength of the relationship between two users; for example, a friend of a primary user may have no association with any of the primary user's other friends and should not belong to the same user group.
Summary of the invention
The technical problem to be solved by the present invention is how to divide user groups more accurately.
To solve this problem, the present invention provides a user group division method based on call detail record (CDR) tickets, including:
S101: periodically obtaining the CDR tickets within a predetermined time period, and extracting the contact data of each user from the records in the obtained CDR tickets, including: the number of calls, the call counterparts and the call durations when the user is the calling or called user, and the number and counterparts of the short messages the user sends and receives;
S102: calculating, from the contact data of each user, the hot value between the user and each associated user of the user, an associated user being a user who has a call or short message with the user;
S103: determining focus central users according to the hot values;
S104: rejecting, from the associated users of each focus central user, the users who have calls or short messages only with that focus central user; and dividing each focus central user together with its remaining associated users into one user group.
Optionally, the hot value H(m-n) between user m and user n is:
where ps(m-n) is the number of calls in which user m is the calling party and user n the called party; ps(n-m) is the number of calls in which user m is the called party and user n the calling party; ms(m-n) is the number of short messages user m sends to user n; and ms(n-m) is the number of short messages user m receives from user n. pt(m-n)_i is the part, in seconds, of a single call duration exceeding 120 s when user m calls user n, and S1 is the number of such calls whose duration exceeds 120 s; pt(n-m)_j is the part, in seconds, of a single call duration exceeding 120 s when user m is called by user n, and S2 is the number of such calls whose duration exceeds 120 s; the duration terms are rounded up.
Optionally, step S103 includes:
obtaining a hot value total for each user, by accumulating the hot values between the user and each associated user of the user, the accumulated result being the hot value total;
taking the users whose hot value total is higher than a predetermined hot value threshold as candidate users;
performing the following screening operation on every pair of candidate users: comparing the associated users of the two candidate users and counting the number N1 of overlapping associated users; calculating, for each of the two candidate users, the percentage of N1 in the total number of that candidate user's associated users; if the percentage of only one candidate user exceeds a predetermined ratio threshold, rejecting that candidate user; if both percentages exceed the predetermined ratio threshold, rejecting the candidate user with the lower hot value total;
taking the candidate users remaining after the screening operation as the focus central users.
Optionally, rejecting, from the associated users of each focus central user, the users who have calls or short messages only with that focus central user includes:
performing the following operation for each focus central user:
taking the focus central user and each of its associated users as a point in a network structure, and adding a line between two points if a call or short message exists between the two users;
in the network structure, starting from the focus central user, setting out to each associated user in turn and returning to the focus central user without passing any point or line twice; if the focus central user cannot be returned to, rejecting the associated user that was set out to.
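The traversal described above can be sketched as a reachability check. This is a minimal illustration, not the patented implementation; the function names and the `adjacency` dictionary (user to the set of users it has a call or short message with) are assumptions made for the example. Once the line from the focus central user to an associated user has been used on the way out, the associated user is kept only if some other route leads back to the focus central user.

```python
from collections import deque


def can_return(center, user, adjacency):
    """True if a route center -> user -> ... -> center exists that reuses no point or line."""
    seen = {user}
    queue = deque([user])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if node == user and nxt == center:
                continue  # the line center-user was already used on the way out
            if nxt == center:
                return True
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def prune_group(center, adjacency):
    """Associated users of `center` that survive the rejection step."""
    return {u for u in adjacency[center] if can_return(center, u, adjacency)}
```

Because every point other than the focus central user is itself linked to the focus central user, the check succeeds exactly when the associated user also has a line to at least one other associated user.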
Optionally, step S104 includes:
51. loading the focus central users and their associated users onto Storm nodes, in the order of the hot value totals of the focus central users; the first Storm node performs step 52;
52. traversing all associated users on this Storm node and rejecting the associated users who have calls or short messages only with the focus central user; after the rejection, if there is a next Storm node, sending user group information to the next Storm node through the socket interface of the Storm node and then performing step 53; if there is no next node, performing step 54; the user group information includes the identifiers of the users remaining on the current node;
53. the next Storm node rejects, according to the user group information, the users on this node that are repeated, and then performs step 52;
54. the users on each Storm node form one divided user group.
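Steps 51 to 54 can be illustrated with a single-process simulation that uses no Storm API. Everything named here (`divide_groups`, `centers`, `related`, `adjacency`) is hypothetical, and two points are assumptions the text does not settle: the user group information is accumulated across all earlier nodes, and only associated users (not focus central users) are removed as repeats.

```python
def divide_groups(centers, related, adjacency):
    """Simulate the node-by-node division.

    centers: focus central users, already ordered by hot value total.
    related: focus central user -> set of its associated users.
    adjacency: user -> set of users it has a call or short message with.
    """
    assigned = set()  # stands in for the user group information passed between nodes
    groups = []
    for center in centers:  # one loop pass per Storm node
        # Step 53: reject users already kept on an earlier node.
        members = related[center] - assigned
        # Step 52: reject users linked only to the focus central user.
        kept = {u for u in members if adjacency.get(u, set()) & (members - {u})}
        groups.append({center} | kept)
        assigned |= kept | {center}
    return groups  # step 54: one user group per node
```

Running the nodes in hot value order means that a user shared by two groups stays with the focus central user that has the higher hot value total.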
The present invention also provides a user group division system based on call detail record (CDR) tickets, including:
a data extraction module, configured to periodically obtain the CDR tickets within a predetermined time period and to extract the contact data of each user from the records in the obtained CDR tickets, including: the number of calls, the call counterparts and the call durations when the user is the calling or called user, and the number and counterparts of the short messages the user sends and receives;
a hot value calculation module, configured to calculate, from the contact data of each user, the hot value between the user and each associated user of the user, an associated user being a user who has a call or short message with the user;
a focus central user determining module, configured to determine focus central users according to the hot values;
a division module, configured to reject, from the associated users of each focus central user, the associated users who have calls or short messages only with that focus central user, and to divide each focus central user together with its remaining associated users into one user group.
Optionally, the hot value H(m-n) between user m and user n is:
where ps(m-n) is the number of calls in which user m is the calling party and user n the called party; ps(n-m) is the number of calls in which user m is the called party and user n the calling party; ms(m-n) is the number of short messages user m sends to user n; and ms(n-m) is the number of short messages user m receives from user n. pt(m-n)_i is the part, in seconds, of a single call duration exceeding 120 s when user m calls user n, and S1 is the number of such calls whose duration exceeds 120 s; pt(n-m)_j is the part, in seconds, of a single call duration exceeding 120 s when user m is called by user n, and S2 is the number of such calls whose duration exceeds 120 s; the duration terms are rounded up.
Optionally, the focus central user determining module includes:
a summation unit, configured to obtain a hot value total for each user, by accumulating the hot values between the user and each associated user of the user, the accumulated result being the hot value total;
a candidate user primary selection unit, configured to take the users whose hot value total is higher than a predetermined hot value threshold as candidate users;
a candidate user screening unit, configured to perform the following screening operation on every pair of candidate users: comparing the associated users of the two candidate users and counting the number N1 of overlapping associated users; calculating, for each of the two candidate users, the percentage of N1 in the total number of that candidate user's associated users; if the percentage of only one candidate user exceeds a predetermined ratio threshold, rejecting that candidate user; if both percentages exceed the predetermined ratio threshold, rejecting the candidate user with the lower hot value total;
a determining unit, configured to take the candidate users remaining after the screening operation as the focus central users.
Optionally, the division module rejecting, from the associated users of each focus central user, the users who have calls or short messages only with that focus central user means that:
the division module performs the following operation for each focus central user:
taking the focus central user and each of its associated users as a point in a network structure, and adding a line between two points if a call or short message exists between the two users; in the network structure, starting from the focus central user, setting out to each associated user in turn and returning to the focus central user without passing any point or line twice; if the focus central user cannot be returned to, rejecting the associated user that was set out to.
Optionally, the division module includes:
a loading unit, a plurality of Storm nodes and an output unit;
the loading unit is configured to load the focus central users and their associated users onto the Storm nodes, in the order of the hot value totals of the focus central users, and to start the rejection operation of the first Storm node;
each Storm node is configured, after its rejection operation is started, to traverse all associated users on this Storm node and reject the associated users who have calls or short messages only with the focus central user; after the rejection, if there is a next Storm node, to send user group information to the next Storm node through the socket interface of the Storm node, and if there is no next node, to start the output unit; the user group information includes the identifiers of the users remaining on the current node; each Storm node is further configured, after receiving user group information, to reject the users on this node that are repeated according to the user group information and then start the rejection operation of this Storm node;
the output unit is configured, after being started, to output the users on each Storm node as one divided user group.
The present invention divides user groups based on the big data in CDR (call detail record) tickets. The results are more accurate, isolated users can be found, dynamic changes in user groups are supported, and a basis is provided for other services based on user groups. For an operator, analyzing the data in CDR tickets has a natural advantage, since no data needs to be obtained from other systems for the division. The user groups divided by the invention can be used for targeted pushes to specific user groups, for security screening, and for further data mining.
Brief description of the drawings
Fig. 1 is a flowchart of the user group division method of embodiment one;
Fig. 2 is a user relationship diagram;
Fig. 3 is a diagram of the relationships between a focus central user and its associated users;
Fig. 4 is a working diagram of the stream computing system;
Fig. 5 is a schematic diagram of an example of embodiment one.
Embodiments
The technical solution of the present invention is described in detail below with reference to the drawings and embodiments.
It should be noted that, provided they do not conflict, the embodiments of the present invention and the features in the embodiments may be combined with each other, and all such combinations fall within the protection scope of the present invention. In addition, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.
Embodiment one: a user group division method based on CDR tickets, as shown in Fig. 1, including:
S101: periodically obtaining the CDR tickets within a predetermined time period, and extracting the contact data of each user from the records in the obtained CDR tickets, including: the number of calls, the call counterparts and the call durations when the user is the calling or called user, and the number and counterparts of the short messages the user sends and receives;
S102: calculating, from the contact data of each user, the hot value between the user and each associated user of the user, an associated user being a user who has a call or short message with the user;
S103: determining focus central users according to the hot values;
S104: rejecting, from the associated users of each focus central user, the users who have calls or short messages only with that focus central user (that is, users who have no call or short message with any other associated user of that focus central user); and dividing each focus central user together with its remaining associated users into one user group.
In this embodiment, calculating the hot value between every two users allows a more accurate division; the user group division can be updated with each update cycle of the CDR ticket data and so reflects changes in user relationships.
In this embodiment, the CDR ticket files can be obtained from the BSS system and stored in a Hadoop distributed system, and the records in the original CDR tickets are converted to hot values by MapReduce.
The data in CDR tickets is introduced first.
A CDR is a detailed record of a call or short message. Each call or short message of a user produces one CDR ticket, whether the user is the called or the calling party, and the ticket records the data of that call or short message in detail. Taking the CDR tickets of China Unicom as an example (the CDR tickets of other operators are similar), they are divided into voice call CDR tickets and short message CDR tickets. The specific formats are as follows:
Table 1: fields of the voice call CDR ticket and their meanings
Table 2: fields of the short message CDR ticket and their meanings
CDR tickets contain many fields; only the fields needed by this embodiment are listed here.
To divide user groups, the original CDR tickets must be processed into a quantified value, which this embodiment calls the "hot value". Several key factors are chosen here: the number of calls, the number of short messages, and the call duration.
This embodiment includes the call duration as a parameter in the hot value calculation because statistics on BSS system data show that among users with close relationships, such as spouses, parents and children, many have few calls but long call durations, and these users are relatively important. To keep the calculation accurate, the call duration is converted into a number of calls and added to the hot value. Extensive statistical experiments found that the average call duration of users is 2 minutes, so the part (in seconds) of a single calling or called call duration exceeding 120 seconds is divided by 120 seconds, rounded up, and counted as calls. The hot value H(m-n) between user m and user n is calculated as follows:
where ps(m-n) is the number of calls in which user m is the calling party and user n the called party; ps(n-m) is the number of calls in which user m is the called party and user n the calling party; ms(m-n) is the number of short messages user m sends to user n; and ms(n-m) is the number of short messages user m receives from user n. pt(m-n)_i is the part, in seconds, of a single call duration exceeding 120 s when user m calls user n, and S1 is the number of such calls whose duration exceeds 120 s; pt(n-m)_j is the part, in seconds, of a single call duration exceeding 120 s when user m is called by user n, and S2 is the number of such calls whose duration exceeds 120 s; the duration terms are rounded up.
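The formula itself does not survive in this text. One reconstruction that is consistent with the definitions above, and with the per-record conversion used in the Map stage of this embodiment, is sketched here; applying the rounding to each call separately rather than to the sum is an assumption.

```latex
H(m\text{-}n) = ps(m\text{-}n) + ps(n\text{-}m) + ms(m\text{-}n) + ms(n\text{-}m)
  + \sum_{i=1}^{S1} \left\lceil \frac{pt(m\text{-}n)_i}{120} \right\rceil
  + \sum_{j=1}^{S2} \left\lceil \frac{pt(n\text{-}m)_j}{120} \right\rceil
```

Under this reading, a single 300 s call contributes 1 to the call count and, with 180 s in excess of 120 s, a further ⌈180/120⌉ = 2, for a total of 3, which equals ⌈300/120⌉.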
In one implementation of this embodiment, the above formula is broken down for MapReduce computation. Two issues need attention when designing the MapReduce job.
1) As shown in Table 1 and Table 2, each CDR ticket record in the original CDR file covers only one behavior of a user (a call or a short message, as calling or called party), so the hot value cannot be calculated directly;
2) The call record CDR ticket in Table 1 contains a "call start time" and a "call end time", so the two fields must first be extracted and subtracted to obtain the call duration, which is then converted to a hot value.
The specific MapReduce design is as follows:
The Map stage (i.e. step S102 above): the original CDR file is read line by line, each line being one user behavior record. MapReduce stages output data as key-value pairs <Key, Value>; in the Map output, the Key has the form "calling user|called user" and the Value is the hot value converted from the record, i.e. <calling user|called user, hot value>. If the record is a short message record, the hot value is 1 and the output is <calling user|called user, 1>. If the record is a voice call record, the "call start time" and "call end time" fields are extracted according to the separator, the call duration is calculated, divided by 120 seconds and rounded up, and output as the hot value, so the output is <calling user|called user, call duration/120>.
The Reduce stage (i.e. step S103 above): the Values of identical Keys are gathered into an array of the form <calling user|called user, [hot value 1, hot value 2, ..., hot value N]>. The Reduce stage accumulates the hot values in the array, giving the hot value between each user and each of its associated users (hereinafter the user-user hot value), in the form <calling user|called user, hot value>.
Note: the Reduce output here distinguishes calling from called. For example, the hot value between user A and user B actually equals <user A|user B, hot value> plus <user B|user A, hot value>. The reason is that a single MapReduce job cannot treat the same pair of users without distinguishing calling from called; not distinguishing them would cause confusion.
This does not affect the later calculation of focus centers and associated users, so this output is acceptable. If the merged user-user hot value must be output, one more MapReduce job is needed, and only its Reduce stage: the Reduce stage checks whether a Key, after the parts before and after the separator are exchanged, is identical to another Key (user A|user B exchanged becomes user B|user A); if so, the two items are merged and output after accumulation. Details are omitted here.
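As a rough illustration of this Map step in plain Python (not Hadoop code), the sketch below converts one record into a key-value pair. The record layout is hypothetical, since the field tables are not reproduced in this text: the keys `type`, `caller`, `callee`, `start` and `end` and the timestamp format are assumptions.

```python
import math
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout


def map_record(record: dict) -> tuple[str, int]:
    """Turn one CDR record into a ("calling user|called user", hot value) pair."""
    key = f"{record['caller']}|{record['callee']}"
    if record["type"] == "sms":
        return key, 1  # each short message counts as hot value 1
    start = datetime.strptime(record["start"], TIME_FORMAT)
    end = datetime.strptime(record["end"], TIME_FORMAT)
    seconds = (end - start).total_seconds()
    return key, math.ceil(seconds / 120)  # call duration / 120 s, rounded up
```

A five-minute call from A to B would be emitted as `("A|B", 3)`, and a short message from A to B as `("A|B", 1)`.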
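The Reduce step and the optional direction-merging pass can be sketched in plain Python as follows. The function names are hypothetical, and sorting the two user identifiers to build a direction-free key is one possible way to match a Key with its exchanged form, not necessarily the one the authors used.

```python
from collections import defaultdict


def reduce_sum(pairs):
    """Sum the hot values emitted for each "calling user|called user" key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def merge_directions(directed):
    """Merge "A|B" and "B|A" into one entry by ordering the two user identifiers."""
    merged = defaultdict(int)
    for key, value in directed.items():
        a, b = key.split("|")
        merged["|".join(sorted((a, b)))] += value
    return dict(merged)
```

For example, the Map outputs `("A|B", 1)`, `("A|B", 3)` and `("B|A", 2)` reduce to 4 for `A|B` and 2 for `B|A`, and merging the two directions gives a user-user hot value of 6.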
The hot value data obtained after step S103 has two forms: <user-user, hot value> and <calling user|called user, hot value>; the relation between them is explained above. After this preprocessing, focus central users are chosen. They are chosen according to each user's hot value total, that is, the sum of the hot values between the user and each associated user, and are then screened by the method below. The process has two stages:
(1) Calculating the user hot value total
Because the user-user hot value data obtained by preprocessing is held in Hadoop distributed storage, the user hot value total is also computed with MapReduce. As described above, the hot value data has two forms that are essentially the same, namely the hot value between one user and another; one form is merely split in two by calling/called order. The goal is the hot value total between a user and all its associated users. The MapReduce design is as follows:
The Map stage: the Key is split at the separator. For example, an original <user A|user B, hot value v> is split into two Keys, each with the original hot value as its Value, so the output is <user A, hot value v> and <user B, hot value v>. This amounts to counting for each user separately.
The Reduce stage: from the Map output, the Values of identical Keys form an array, that is, the hot values between one user and its different associated users are gathered into a form such as <user N, [hot value 1, hot value 2, hot value 3, ..., hot value m]>. Accumulating these hot values gives the user's hot value total. Users are sorted by hot value total to obtain the candidates for focus central user.
Note: the top-ranked users at this point are not yet focus central users; they are only the basis for choosing focus central users with the "secondary hot spots are irrelevant" method in the next step.
(2) Choosing focus central users with "secondary hot spots are irrelevant"
After the user hot value totals are calculated, users with high hot value totals may still be associated with each other, as in the situation shown in Fig. 2.
As Fig. 2 shows, user B has the largest hot value total, so B is first defined as a focus central user. User A and user B are associated users, so under the method so far A may appear as an associated user of B. However, Fig. 2 shows that user A is not associated with the other associated users of user B and should itself be the focus central user of a new user group. A method is therefore needed to remove user A from user B's user group.
Correspondingly, another situation may exist: user A, like user B, ranks high in hot value total, but there are also many contacts between user A and user B's associated users. This shows that user A actually belongs to user B's group and need not form a separate user group.
To solve these problems, one implementation of this embodiment proposes two matching methods. First, focus central users are selected using "secondary hot spots are irrelevant"; second, the associated users of each focus central user are chosen with the "traversal is reachable" method to form the user group.
When focus central users are selected using "secondary hot spots are irrelevant", step S103 can specifically include:
obtaining a hot value total for each user, by accumulating the hot values between the user and each associated user of the user, the accumulated result being the hot value total;
taking the users whose hot value total is higher than a predetermined hot value threshold as candidate users;
performing the following screening operation on every pair of candidate users: comparing the associated users of the two candidate users and counting the number N1 of overlapping associated users; calculating, for each of the two candidate users, the percentage of N1 in the total number of that candidate user's associated users; if the percentage of only one candidate user exceeds a predetermined ratio threshold, rejecting that candidate user; if both percentages exceed the predetermined ratio threshold, rejecting the candidate user with the lower hot value total;
taking the candidate users remaining after the screening operation as the focus central users.
The principle of "secondary hot spots are irrelevant" is to count the associated users of each user and compare, for two candidate users, the number of overlapping associated users as a percentage of each candidate's total number of associated users. If a percentage exceeds the predetermined ratio threshold (for example 50%), the two candidate users belong to the same user group. Suppose the associated users of user A and user B overlap in 50 users, user A has 150 associated users in total and user B has 90. The overlap is more than 50% of B's total (50/90, about 56%) but not of A's (50/150, about 33%), so user B belongs to user A's group and cannot be treated as a focus central user. If the percentages of both candidate users exceed 50% (for example, if user A has 92 associated users in total), only the candidate user with the higher hot value total is taken as a focus central user.
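The screening rule and the worked numbers above can be checked with a short Python sketch. The names `screen_candidates`, `related` and `totals` are hypothetical. The sketch compares every pair of candidates, including candidates already rejected in an earlier comparison, which is an assumption since the text does not say whether rejected candidates keep taking part; it also assumes every candidate has at least one associated user.

```python
def screen_candidates(related, totals, ratio=0.5):
    """Return the candidates that remain as focus central users.

    related: candidate -> set of its associated users.
    totals: candidate -> hot value total.
    """
    rejected = set()
    users = sorted(related)
    for i, a in enumerate(users):
        for b in users[i + 1:]:
            overlap = len(related[a] & related[b])  # N1
            over_a = overlap / len(related[a]) > ratio
            over_b = overlap / len(related[b]) > ratio
            if over_a and over_b:
                rejected.add(a if totals[a] < totals[b] else b)
            elif over_a:
                rejected.add(a)
            elif over_b:
                rejected.add(b)
    return [u for u in users if u not in rejected]
```

With an overlap of 50, 150 associated users for A and 90 for B, only B exceeds the 50% threshold, so B is rejected and A remains a focus central user.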
The concrete implementation can also be realized in MapReduce fashion, divided into a Map stage and a Reduce stage.
The Map stage: each record <User A-User B, hot value> is split. The Key is divided into two Keys at the separator, forming two Key-Value pairs: <User A, User B> and <User B, User A>. The Value is now a user rather than a hot value.
The Reduce stage: the input data now takes the form <User, [associated user 1, associated user 2, ... associated user m]>. The associated users of different candidate users are compared pairwise in a loop, and the 50% comparison described above decides whether each candidate user can serve as a focus central user. Only the candidate users that can serve as focus central users are output, still in the form <User, [associated user 1, associated user 2, ... associated user m]>.
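The key splitting of the Map stage and the grouping that precedes the Reduce stage can be simulated in plain Python. This sketch does not use the Hadoop API; the function names, the "-" separator, and the sample records are assumptions for illustration.

```python
from collections import defaultdict

def map_record(key, hot_value, sep="-"):
    """Map stage: split 'A-B' into two pairs whose value is a user, not the hot value."""
    user_a, user_b = key.split(sep, 1)
    yield user_a, user_b
    yield user_b, user_a

def shuffle(pairs):
    """Group values by key, as the framework would do before the Reduce stage."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

# Hypothetical <User-User, hot value> records.
records = [("A-B", 12), ("A-C", 7), ("B-C", 3)]
pairs = [kv for key, hv in records for kv in map_record(key, hv)]
reduce_input = shuffle(pairs)
```

Each entry of `reduce_input` has the form <User, [associated users]> expected by the Reduce stage, where the pairwise 50% comparison would then be applied.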
Note: the associated users of the different focus central users obtained at this point will inevitably contain many users in common, and some users may not actually belong to the user group; they must be screened out by the "traversal reachable" method below.
After step S103 has determined the focus central users, the associated users of each focus central user must be screened: not all associated users belong to the user group, and some may be related only to the focus central user. Therefore, a "traversal reachable" algorithm, combined with streaming computation, is used to reject the users that do not belong to the group and obtain the final user groups.
To capture the relations within a user group and reject users that do not fit the group, the focus central users and their associated users are not enough; the relations among the associated users are also needed. A new MapReduce job is therefore used here to obtain the relations among associated users.
The specific method is as follows:
The Map stage: the <User-User, hot value> data obtained in step S103 is taken as input; the focus central users and their associated users obtained in step S104 are also taken as input, but only as a restricting condition. If the two users in a Key are both associated users of some focus central user, the record is output with its Key changed to that focus central user. For example, for the record <User B-User C, hot value>, if user B and user C are both associated users of focus central user A, the record reflects a relation inside that user group, and the output takes the form <Focus central user A, User B-User C>.
The Reduce stage: the Reduce input takes the form <Focus central user A, [User B-User C, User C-User E, ...]>, so no further processing is needed; the direct output is the set of internal relations of the user group to which focus central user A belongs.
Note that only the relations are output here, not their hot values. To improve precision, the hot values can be output as well, with a threshold set to exclude relations whose hot value is too low (the hot value between two users who are not associated users of each other can be set to 0).
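The filtering performed by this second MapReduce job can be sketched in plain Python as follows. The sketch does not use the Hadoop API, and the function names, the "-" separator, and the sample data are assumptions for illustration.

```python
from collections import defaultdict

def map_internal(record_key, hot_value, centers, sep="-"):
    """Emit <focus central user, 'B-C'> when both users of the record
    are associated users of that focus central user."""
    user_b, user_c = record_key.split(sep, 1)
    for center, related in centers.items():
        if user_b in related and user_c in related:
            yield center, record_key

def reduce_internal(pairs):
    """Collect the internal relations of each group; no further processing."""
    relations = defaultdict(list)
    for center, relation in pairs:
        relations[center].append(relation)
    return dict(relations)

# Hypothetical data: A is a focus central user with associated users B, C, E.
centers = {"A": {"B", "C", "E"}}
records = [("B-C", 5), ("C-E", 2), ("B-X", 9)]
pairs = [kv for key, hv in records for kv in map_internal(key, hv, centers)]
internal = reduce_internal(pairs)
```

The record B-X is dropped because X is not an associated user of A, so only relations inside the group remain.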
The basic principle of "traversal reachable" is based on the idea of a graph structure and is described in two steps:
1) All relations between the focus central user and its associated users are marked, including the relations among the associated users, with the users limited to the focus central user and its associated users. This forms a network structure.
2) Starting from the focus central user and setting out through each associated user in turn, a walk that repeats no point and no line should finally be able to return to the focus central user. If it can, all users on that track are members of the user group; if it cannot return to the focus central user, the associated user is related only to the focus central user, is unrelated to the other members, and cannot be counted as a member of the user group. As shown in Fig. 3, B is the focus central user and A, E, F and G need to be judged. Starting with user G, the path is B->G->F->B, which returns to the focus central user without repeating any point or line, so G and F belong to the user group. The same method shows that user E also satisfies the condition. User A, however, cannot satisfy it because no path returns to B, so A does not belong to the user group.
When users are rejected by "traversal reachable", the step of rejecting, among the associated users of each focus central user, the users that only have calls or short messages with that focus central user can specifically include:
The following operations are carried out for each focus central user:
The focus central user and each of its associated users are each taken as a point in a network structure; if a call or short message exists between any two users, a line is added between the two corresponding points;
In the network structure, with the focus central user as the starting point, the walk sets out through each associated user in turn and returns to the focus central user without passing any point or line twice; if the focus central user cannot be reached again, the associated user from which the walk set out is rejected.
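The "traversal reachable" rejection can be sketched in Python as a search for another way back to the focus central user once the line from the center to the associated user has been used. The function name is hypothetical, and the edge list below is an assumed reading of Fig. 3 (the line between E and F in particular is a guess), so the sketch is illustrative only.

```python
from collections import deque

def prune_group(center, related, edges):
    """Keep the associated users that lie on a cycle through the center."""
    members = set(related) | {center}
    adj = {u: set() for u in members}
    for u, v in edges:
        if u in members and v in members and u != v:
            adj[u].add(v)
            adj[v].add(u)

    def returns_to_center(start):
        # The line center-start is used on the way out and may not be reused,
        # so look for a different route from start back to the center.
        if start not in adj[center]:
            return False
        seen, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if node == start and nxt == center:
                    continue  # do not walk back over the outgoing line
                if nxt == center:
                    return True
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return False

    return {u for u in related if returns_to_center(u)}

# Assumed edges for the Fig. 3 example: B is the focus central user.
edges = [("B", "A"), ("B", "E"), ("B", "F"), ("B", "G"), ("G", "F"), ("E", "F")]
group = prune_group("B", {"A", "E", "F", "G"}, edges)
```

Because a breadth-first search never revisits a point, any route it finds back to the center repeats neither points nor lines; user A, whose only line leads to B, is rejected.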
This embodiment can, but need not, implement step S104 on the streaming computing system Storm. A streaming computing system is used because streaming computation processes data sequentially through Bolts and can be deployed across distributed nodes.
The streaming computing system Storm addresses the real-time processing of large-scale data. Its basic idea is to decompose the steps needed to process the data and distribute them over nodes (nodes in the logical sense, divided into two types, Spout and Bolt, where a Spout is a data source node and a Bolt is a data processing node). Logical nodes run on different Ports of each Supervisor (working node) of the system. By assigning different processing tasks to the ports on which they run, large volumes of data are processed in parallel in real time. Its structure is shown in Figure 1:
Fig. 4 illustrates the Storm processing flow. The data to be processed is emitted from a Spout node and sent, according to a given distribution rule, to Bolt_Type1 nodes for processing. Bolts of the same Bolt_Type perform the same processing, so multiple data items can be processed in parallel. The processing of each data item is referred to as a stream (Tuple); in Fig. 4 every data item passes through Spout_Type1->Bolt_Type1->Bolt_Type2.
When implemented on the streaming computing system, step S104 specifically includes:
51. Storm nodes are allocated according to the focus central users; that is, each Storm node is loaded with a focus central user and its associated users, and may also hold the internal network structure described above. The nodes are loaded in order of the hot value sum of the focus central users. For example, if the hot value sum of focus central user A is greater than that of focus central user C, the Storm node deployed for A comes before the node for C. The first Storm node carries out step 52;
52. All associated users on this Storm node are traversed, and the associated users that only have calls or short messages with the focus central user are rejected. After rejection, if there is a next Storm node, the user group information is sent to the next Storm node through the socket interface of the Storm node and step 53 is carried out; if there is no next node, step 54 is carried out. The user group information includes the identifiers of the users remaining on the current node;
53. The next Storm node rejects, according to the user group information, the users on this node that are repeated, and then carries out step 52;
54. The users on each Storm node are taken as one divided user group.
After all Storm nodes have been processed in sequence, each node holds one of the finally divided user groups, consisting of a focus central user and the associated users remaining after screening. The later Storm nodes necessarily hold user groups with fewer users, and some may contain only a single user, that is, an "isolated user".
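The ordering and hand-off of steps 51 to 54 can be simulated in plain Python without the Storm API. In the sketch below, each loop iteration stands in for one Storm node; the function names, the sample data, and the assumption that the user group information accumulates along the chain of nodes are illustrative and are not stated in the text.

```python
def has_cycle_back(center, start, adj):
    """True if start can return to center without reusing the line center-start."""
    if center not in adj.get(start, ()):
        return False
    stack, seen = [start], {start}
    while stack:
        node = stack.pop()
        for nxt in adj.get(node, ()):
            if node == start and nxt == center:
                continue  # the outgoing line may not be reused
            if nxt == center:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

def run_chain(nodes, edges, heat_sum):
    """nodes maps each focus central user to its set of associated users."""
    order = sorted(nodes, key=lambda c: heat_sum[c], reverse=True)  # step 51
    assigned = set()  # user group information handed to later nodes (assumed cumulative)
    groups = []
    for center in order:
        related = nodes[center] - assigned - {center}  # step 53: drop repeated users
        members = related | {center}
        adj = {}
        for u, v in edges:
            if u in members and v in members:
                adj.setdefault(u, set()).add(v)
                adj.setdefault(v, set()).add(u)
        kept = {u for u in related if has_cycle_back(center, u, adj)}  # step 52
        groups.append((center, kept))
        assigned |= kept | {center}
    return groups  # step 54: one user group per node

# Hypothetical data: two focus central users, B and K, sharing users F and G.
nodes = {"B": {"A", "E", "F", "G"}, "K": {"F", "G", "M", "N"}}
edges = [("B", "A"), ("B", "E"), ("B", "F"), ("B", "G"), ("G", "F"), ("E", "F"),
         ("K", "F"), ("K", "G"), ("K", "M"), ("K", "N"), ("M", "N")]
groups = run_chain(nodes, edges, heat_sum={"B": 100, "K": 50})
```

In this example the node for B, which has the higher hot value sum, is processed first and keeps E, F and G; the node for K then drops F and G as repeated users and keeps M and N.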
Fig. 5 is an application schematic diagram of an example of this embodiment, in which:
The BSS system is the system that generates the original user CDR tickets. This embodiment obtains the original user CDR tickets from the BSS system by means such as FTP; the specific format of the CDR tickets is introduced in a later section;
The Hadoop cluster is the physical cluster that stores the original CDR tickets in this embodiment and the platform on which all MapReduce programs are executed.
User CDR ticket original files: mainly the detailed records of calls and short messages between a user and other users; every call and every short message produces one CDR ticket record.
CDR data (user-user, hot value): the records formed by preprocessing the original ticket files with MapReduce on the Hadoop cluster.
Focus central users: the central users of the user groups; there are as many user groups as there are focus central users. In this embodiment the focus central users are screened by MapReduce, using the selection method designed in this embodiment.
Streaming computing system: the previous stage yields N focus central users and their associated users. The associated users do not necessarily belong to the group, and the associated users of different focus central users overlap. The user groups therefore need secondary processing on the streaming computing system, which finally outputs N user groups.
User groups: the final output of this embodiment, comprising N user groups and some individual users that do not belong to any group.
Embodiment 2: a user group division system based on call detail record (CDR) tickets, including:
A data extraction module, for periodically obtaining the CDR tickets within a predetermined time period and extracting the contact data of each user from the records in the acquired CDR tickets, including: the number of calls, the call objects and the call durations when the user is the calling or called party, and the number and objects of the short messages the user sends and receives;
A hot value computing module, for calculating, from the contact data of each user, the hot value between the user and each associated user of the user, an associated user being a user who has a call or short message with the user;
A focus central user determining module, for determining focus central users according to the hot values;
A division module, for rejecting, among the associated users of each focus central user, the associated users that only have calls or short messages with that focus central user, and for dividing each focus central user and its remaining associated users into one user group.
In one implementation of this embodiment, the hot value H(m-n) between user m and user n is:
Where ps(m-n) is the number of calls in which user m, as the calling party, calls user n; ps(n-m) is the number of calls between user m, as the called party, and user n; ms(m-n) is the number of short messages that user m, as the active initiator, sends to user n; ms(n-m) is the number of short messages between user m, as the passive party, and user n; pt(m-n)_i is the part of a single call duration exceeding 120 s when user m is the calling party with user n, in seconds, and S1 is the number of times the single call duration between user m as calling party and user n exceeds 120 seconds; pt(n-m)_j is the part of a single call duration exceeding 120 seconds when user m is the called party with user n, in seconds, and S2 is the number of times the single call duration between user m as called party and user n exceeds 120 s; the rounding in the formula is rounding up.
In one implementation of this embodiment, the focus central user determining module can specifically include:
A summing unit, for obtaining a hot value sum for each user: the hot values between the user and each of the user's associated users are added together, and the accumulated result is the hot value sum;
A candidate user preliminary selection unit, for taking users whose hot value sum is higher than a predetermined hot value threshold as candidate users;
A candidate user screening unit, for carrying out the following screening operation on every pair of candidate users: comparing the associated users of the two candidate users and counting the number N1 of overlapping associated users; for each of the two candidate users, calculating the percentage of N1 relative to the total number of that candidate user's associated users; if the percentage of one of the candidate users exceeds a predetermined ratio threshold, rejecting that candidate user; if both percentages exceed the predetermined ratio threshold, rejecting the candidate user with the lower hot value sum;
A determining unit, for taking the candidate users remaining after the screening operation as the focus central users.
In one implementation of this embodiment, the division module rejecting, among the associated users of each focus central user, the users that only have calls or short messages with that focus central user can specifically mean:
The division module carries out the following operations for each focus central user:
The focus central user and each of its associated users are each taken as a point in a network structure; if a call or short message exists between any two users, a line is added between the two corresponding points. In the network structure, with the focus central user as the starting point, the walk sets out through each associated user in turn and returns to the focus central user without passing any point or line twice; if the focus central user cannot be reached again, the associated user from which the walk set out is rejected.
In one implementation of this embodiment, the division module can specifically include:
A loading unit, multiple Storm nodes and an output unit;
The loading unit is used to load a focus central user and its associated users onto each Storm node, loading in order of the hot value sum of the focus central users, and to start the rejecting operation of the first Storm node;
Each Storm node is used, after its rejecting operation has been started, to traverse all associated users on this Storm node and reject the associated users that only have calls or short messages with the focus central user; after rejection, if there is a next Storm node, to send the user group information to the next Storm node through the socket interface of the Storm node, and if there is no next node, to start the output unit. The user group information includes the identifiers of the users remaining on the current node. Each Storm node is further used, after receiving user group information, to reject the users repeated on this node according to the user group information and then start the rejecting operation of this Storm node;
The output unit is used, once started, to output the users on each Storm node as one divided user group.
One of ordinary skill in the art will appreciate that all or part of the steps of the above method can be completed by a program instructing related hardware, and the program can be stored in a computer-readable storage medium such as a read-only memory, magnetic disk or optical disc. Alternatively, all or part of the steps of the above embodiments can be implemented using one or more integrated circuits. Accordingly, each module/unit in the above embodiments can be implemented in the form of hardware or in the form of a software function module. The present invention is not restricted to any particular combination of hardware and software.
Of course, the present invention can have various other embodiments. Without departing from the spirit and essence of the present invention, those skilled in the art can make various corresponding changes and modifications according to the present invention, but all such changes and modifications shall fall within the scope of the claims of the present invention.

Claims (8)

1. A user group division method based on call detail record (CDR) tickets, including:
S101: periodically obtaining the CDR tickets within a predetermined time period, and extracting the contact data of each user from the records in the acquired CDR tickets, including: the number of calls, the call objects and the call durations when the user is the calling or called party, and the number and objects of the short messages the user sends and receives;
S102: calculating, from the contact data of each user, the hot value between the user and each associated user of the user, an associated user being a user who has a call or short message with the user;
S103: determining focus central users according to the hot values;
S104: rejecting, among the associated users of each focus central user, the users that only have calls or short messages with that focus central user; and dividing each focus central user and its remaining associated users into one user group;
wherein rejecting, among the associated users of each focus central user, the users that only have calls or short messages with that focus central user includes:
carrying out the following operations for each focus central user:
taking the focus central user and each of its associated users as a point in a network structure, and, if a call or short message exists between any two users, adding a line between the two corresponding points;
in the network structure, setting out from the focus central user as the starting point to each associated user, then setting out from each associated user in turn and returning to the focus central user without passing any point or line twice; and, if the focus central user cannot be reached again, rejecting the associated user from which the walk set out.
2. The method as claimed in claim 1, characterized in that the hot value H(m-n) between user m and user n is:
Where ps(m-n) is the number of calls in which user m, as the calling party, calls user n; ps(n-m) is the number of calls between user m, as the called party, and user n; ms(m-n) is the number of short messages that user m, as the active initiator, sends to user n; ms(n-m) is the number of short messages between user m, as the passive party, and user n; pt(m-n)_i is the part of a single call duration exceeding 120 s when user m is the calling party with user n, in seconds, and S1 is the number of times the single call duration between user m as calling party and user n exceeds 120 seconds; pt(n-m)_j is the part of a single call duration exceeding 120 seconds when user m is the called party with user n, in seconds, and S2 is the number of times the single call duration between user m as called party and user n exceeds 120 s; the rounding in the formula is rounding up.
3. The method as claimed in claim 1, characterized in that step S103 includes:
obtaining a hot value sum for each user: the hot values between the user and each of the user's associated users are added together, and the accumulated result is the hot value sum;
taking users whose hot value sum is higher than a predetermined hot value threshold as candidate users;
carrying out the following screening operation on every pair of candidate users: comparing the associated users of the two candidate users and counting the number N1 of overlapping associated users; for each of the two candidate users, calculating the percentage of N1 relative to the total number of that candidate user's associated users; if the percentage of one of the candidate users exceeds a predetermined ratio threshold, rejecting that candidate user; if both percentages exceed the predetermined ratio threshold, rejecting the candidate user with the lower hot value sum;
taking the candidate users remaining after the screening operation as the focus central users.
4. The method as claimed in claim 1, characterized in that step S104 includes:
51. loading a focus central user and its associated users onto each Storm node, in order of the hot value sum of the focus central users; the first Storm node carries out step 52;
52. traversing all associated users on this Storm node and rejecting the associated users that only have calls or short messages with the focus central user; after rejection, if there is a next Storm node, sending the user group information to the next Storm node through the socket interface of the Storm node and carrying out step 53, and if there is no next node, carrying out step 54; the user group information includes the identifiers of the users remaining on the current node;
53. the next Storm node rejecting, according to the user group information, the users on this node that are repeated, and then carrying out step 52;
54. taking the users on each Storm node as one divided user group.
5. A user group division system based on call detail record (CDR) tickets, characterized by including:
a data extraction module, for periodically obtaining the CDR tickets within a predetermined time period and extracting the contact data of each user from the records in the acquired CDR tickets, including: the number of calls, the call objects and the call durations when the user is the calling or called party, and the number and objects of the short messages the user sends and receives;
a hot value computing module, for calculating, from the contact data of each user, the hot value between the user and each associated user of the user, an associated user being a user who has a call or short message with the user;
a focus central user determining module, for determining focus central users according to the hot values;
a division module, for rejecting, among the associated users of each focus central user, the associated users that only have calls or short messages with that focus central user, and for dividing each focus central user and its remaining associated users into one user group;
wherein the division module rejecting, among the associated users of each focus central user, the users that only have calls or short messages with that focus central user means:
the division module carries out the following operations for each focus central user:
taking the focus central user and each of its associated users as a point in a network structure, and, if a call or short message exists between any two users, adding a line between the two corresponding points; in the network structure, setting out from the focus central user as the starting point to each associated user, then setting out from each associated user in turn and returning to the focus central user without passing any point or line twice; and, if the focus central user cannot be reached again, rejecting the associated user from which the walk set out.
6. The system as claimed in claim 5, characterized in that the hot value H(m-n) between user m and user n is:
Where ps(m-n) is the number of calls in which user m, as the calling party, calls user n; ps(n-m) is the number of calls between user m, as the called party, and user n; ms(m-n) is the number of short messages that user m, as the active initiator, sends to user n; ms(n-m) is the number of short messages between user m, as the passive party, and user n; pt(m-n)_i is the part of a single call duration exceeding 120 s when user m is the calling party with user n, in seconds, and S1 is the number of times the single call duration between user m as calling party and user n exceeds 120 seconds; pt(n-m)_j is the part of a single call duration exceeding 120 seconds when user m is the called party with user n, in seconds, and S2 is the number of times the single call duration between user m as called party and user n exceeds 120 s; the rounding in the formula is rounding up.
7. The system as claimed in claim 5, characterized in that the focus central user determining module includes:
a summing unit, for obtaining a hot value sum for each user: the hot values between the user and each of the user's associated users are added together, and the accumulated result is the hot value sum;
a candidate user preliminary selection unit, for taking users whose hot value sum is higher than a predetermined hot value threshold as candidate users;
a candidate user screening unit, for carrying out the following screening operation on every pair of candidate users: comparing the associated users of the two candidate users and counting the number N1 of overlapping associated users; for each of the two candidate users, calculating the percentage of N1 relative to the total number of that candidate user's associated users; if the percentage of one of the candidate users exceeds a predetermined ratio threshold, rejecting that candidate user; if both percentages exceed the predetermined ratio threshold, rejecting the candidate user with the lower hot value sum;
a determining unit, for taking the candidate users remaining after the screening operation as the focus central users.
8. The system as claimed in claim 5, characterized in that the division module includes:
a loading unit, multiple Storm nodes and an output unit;
the loading unit is used to load a focus central user and its associated users onto each Storm node, loading in order of the hot value sum of the focus central users, and to start the rejecting operation of the first Storm node;
each Storm node is used, after its rejecting operation has been started, to traverse all associated users on this Storm node and reject the associated users that only have calls or short messages with the focus central user; after rejection, if there is a next Storm node, to send the user group information to the next Storm node through the socket interface of the Storm node, and if there is no next node, to start the output unit; the user group information includes the identifiers of the users remaining on the current node; each Storm node is further used, after receiving user group information, to reject the users repeated on this node according to the user group information and then start the rejecting operation of this Storm node;
the output unit is used, once started, to output the users on each Storm node as one divided user group.
CN201510020953.9A 2015-01-15 2015-01-15 User group's division method and system based on CDR tickets Active CN104573034B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510020953.9A CN104573034B (en) 2015-01-15 2015-01-15 User group's division method and system based on CDR tickets

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510020953.9A CN104573034B (en) 2015-01-15 2015-01-15 User group's division method and system based on CDR tickets

Publications (2)

Publication Number Publication Date
CN104573034A CN104573034A (en) 2015-04-29
CN104573034B true CN104573034B (en) 2018-03-23

Family

ID=53089096

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510020953.9A Active CN104573034B (en) 2015-01-15 2015-01-15 User group's division method and system based on CDR tickets

Country Status (1)

Country Link
CN (1) CN104573034B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105208214B (en) * 2015-10-22 2019-02-26 海信集团有限公司 A kind of incoming call processing method and device
CN106557984B (en) * 2016-11-18 2020-09-11 中国联合网络通信集团有限公司 Social group determination method and device
CN106899492B (en) * 2017-02-17 2020-04-14 上海新炬网络技术有限公司 Method for mining relationship chain of colleague users

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102083010A (en) * 2009-11-26 2011-06-01 ***通信集团公司 Method and equipment for screening user information
CN102135983A (en) * 2011-01-17 2011-07-27 北京邮电大学 Group dividing method and device based on network user behavior
CN103001993A (en) * 2011-09-19 2013-03-27 中兴通讯股份有限公司 Server, network data providing method and device thereof
CN103605791A (en) * 2013-12-04 2014-02-26 深圳中兴网信科技有限公司 Information pushing system and information pushing method
CN103761246A (en) * 2013-12-19 2014-04-30 国家计算机网络与信息安全管理中心 Link network based user domain identifying method and device

Also Published As

Publication number Publication date
CN104573034A (en) 2015-04-29

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant