CN112861128B - Method and system for identifying machine account numbers in batches - Google Patents

Publication number
CN112861128B
Authority
CN
China
Prior art keywords
account
key
behavior
user
linear regression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110083543.4A
Other languages
Chinese (zh)
Other versions
CN112861128A (en)
Inventor
王嘉伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Weimeng Chuangke Network Technology China Co Ltd
Original Assignee
Weimeng Chuangke Network Technology China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Weimeng Chuangke Network Technology China Co Ltd filed Critical Weimeng Chuangke Network Technology China Co Ltd
Priority to CN202110083543.4A priority Critical patent/CN112861128B/en
Publication of CN112861128A publication Critical patent/CN112861128A/en
Application granted granted Critical
Publication of CN112861128B publication Critical patent/CN112861128B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/566 Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/21 Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/2141 Access rights, e.g. capability lists, access control lists, access tables, access matrices

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

An embodiment of the invention provides a method and a system for identifying machine accounts in batches. The computing engine Spark periodically acquires, from a database, the user behavior logs of logged-in accounts for the previous period and extracts all accounts whose number of behaviors in that period exceeds a preset quantity threshold; acquires the occurrence times of all key behaviors of each account in the previous period and forms an elastic data set (a Spark resilient distributed dataset, RDD) for each account; counts, in sequence, the number of key behaviors occurring in each time period of a preset length; fits the change over time of the number of key behaviors per time period of each account with a linear regression equation to obtain a linear regression fit curve for the account; calculates the goodness of fit of each account's key behavior data from its linear regression fit curve; and judges in batches, from the goodness of fit of each account's key behavior data, whether each account is a machine account. The key behaviors of the accounts are searched based on Spark, and the false-positive rate for non-machine accounts is reduced.

Description

Method and system for identifying machine account numbers in batches
Technical Field
The invention relates to the field of computers, in particular to a method and a system for identifying machine accounts in batches.
Background
On the internet social platforms of modern social media, a large number of bad actors use scripts to log in to accounts in batches and perform illegitimate operations such as artificially inflating metrics ("brushing"). These accounts generally have no substantial content, so they negatively affect normal users and, to some extent, undermine the fairness of the platform. The machine accounts logged in by scripts in batches therefore need to be found.
In carrying out the present invention, the applicant has found that at least the following problem exists in the prior art: the prior art typically counts the daily accesses of each user, sorts users by access count from high to low, and treats the top 5% of users as machine accounts. Although some machine accounts can be found this way, the false-positive rate is relatively high, especially for top ("head") accounts; this harms normal users and is not acceptable.
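The prior-art ranking described above can be sketched as follows. This is a minimal illustration only; the function name, the integer percentage parameter, and the sample counts are hypothetical and do not come from the source.

```python
def flag_top_percent(daily_access_counts, top_percent=5):
    """Return the accounts in the top `top_percent` percent by daily access count."""
    # Sort account ids from the highest access count to the lowest.
    ranked = sorted(daily_access_counts, key=daily_access_counts.get, reverse=True)
    # Integer arithmetic avoids floating-point rounding; flag at least one account.
    cutoff = max(1, len(ranked) * top_percent // 100)
    return set(ranked[:cutoff])


# Hypothetical data: 40 accounts whose access count equals their index.
counts = {f"user{i}": i for i in range(1, 41)}
flagged = flag_top_percent(counts)
```

Because the rule looks only at volume, any heavily active legitimate account lands in the flagged set, which is the false-positive problem the text describes.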
Disclosure of Invention
The embodiments of the invention provide a method and a system for batch identification of machine accounts, which search the key behaviors of accounts based on Spark, fit the change in the key behavior data with linear regression to obtain a fit curve, and calculate the goodness of fit of that curve, thereby reducing the false-positive rate for non-machine accounts.
To achieve the above objective, in one aspect, an embodiment of the present invention provides a method for identifying machine account numbers in batches, including:
The computing engine Spark periodically acquires user behavior logs of login accounts in a previous period from a database, extracts all accounts with the behavior quantity exceeding a preset quantity threshold value in the previous period, and forms a user account set;
Acquiring the occurrence time of all key behaviors of each account in a previous period in a user behavior log of a database, and forming an elastic data set of each account according to the occurrence time of all key behaviors of each account;
for any account in a user account set, sequentially counting the number of key behaviors occurring in each length time period according to the elastic data set of the account and the preset length time period; the key behavior refers to a behavior which is performed by a user in an account authority range and reaches a preset importance level;
For any account in the user account set, fitting the change relation of the key behavior quantity in each length time period of the account with time by adopting a linear regression equation to obtain a linear regression fit curve of key behavior data of the account; calculating the fitting goodness of the corresponding key behavior data of each account according to the linear regression fitting curve of each account in the user account set;
judging whether each account is a machine account in batches according to the fitting goodness of the key behavior data of each account in the user account set; the machine account number refers to an account number which is logged in batches by using a script to perform illegal operation.
In another aspect, an embodiment of the present invention provides a system for batch identification of machine account numbers, including a database and a computing engine Spark, where the computing engine Spark includes: the system comprises a key behavior data integration unit, a linear regression unit and a judgment unit, wherein:
The database is used for storing user behavior logs of the login account;
The key behavior data integration unit is used for periodically acquiring user behavior logs of login accounts in a previous period from the database, extracting all accounts with the behavior number exceeding a preset number threshold value in the previous period, and forming a user account set; acquiring the occurrence time of all key behaviors of each account in a previous period in a user behavior log of a database, and forming an elastic data set of each account according to the occurrence time of all key behaviors of each account;
for any account in a user account set, sequentially counting the number of key behaviors occurring in each length time period according to the elastic data set of the account and the preset length time period; the key behavior refers to a behavior which is performed by a user in an account authority range and reaches a preset importance level;
The linear regression unit is used for fitting the change relation of the key behavior quantity in each length time period of the account with respect to any account in the user account set by adopting a linear regression equation to obtain a linear regression fitting curve of the key behavior data of the account; calculating the fitting goodness of the corresponding key behavior data of each account according to the linear regression fitting curve of each account in the user account set;
The judging unit is used for judging whether each account is a machine account in batches according to the fitting goodness of the key behavior data of each account in the user account set; the machine account number refers to an account number which is logged in batches by using a script to perform illegal operation.
The technical scheme has the following beneficial effects: the key behaviors of accounts are searched based on Spark, the change in the key behavior data is fitted with linear regression to obtain a fit curve, and the goodness of fit of the curve is calculated, so that machine accounts can be identified in batches. Screening on key behaviors also finds low-frequency machine accounts, which raises the discovery rate for low-frequency machine accounts and lowers the false-positive rate for non-machine accounts, and Spark allows the batch identification of machine accounts to run automatically.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for identifying machine accounts in batches according to an embodiment of the present invention;
FIG. 2 is a system architecture diagram for identifying machine accounts in batches according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1, in combination with an embodiment of the present invention, there is provided a method for identifying machine account numbers in batches, including:
S101: the computing engine Spark periodically acquires user behavior logs of login accounts in a previous period from a database, extracts all accounts with the behavior quantity exceeding a preset quantity threshold value in the previous period, and forms a user account set;
Acquiring the occurrence time of all key behaviors of each account in a previous period in a user behavior log of a database, and forming an elastic data set of each account according to the occurrence time of all key behaviors of each account;
for any account in a user account set, sequentially counting the number of key behaviors occurring in each length time period according to the elastic data set of the account and the preset length time period; the key behavior refers to a behavior which is performed by a user in an account authority range and reaches a preset importance level;
S102: for any account in the user account set, fitting the change relation of the key behavior quantity in each length time period of the account with time by adopting a linear regression equation to obtain a linear regression fit curve of key behavior data of the account; calculating the fitting goodness of the corresponding key behavior data of each account according to the linear regression fitting curve of each account in the user account set;
S103: judging whether each account is a machine account in batches according to the goodness of fit of the key behavior data of each account in the user account set; the machine account refers to an account which is logged in in batches by using a script to perform illegitimate operations.
Preferably, in step S101, the obtaining, in the user behavior log of the database, of the occurrence times of all the key behaviors of each account in the user account set in the previous period, and the forming of the occurrence times of all the key behaviors of each account into an elastic data set of each account, specifically includes:
S1011: the computing engine Spark obtains, from the user behavior log of the database, the occurrence times of all key behaviors of each account in the user account set in the previous period, and forms, for each key behavior, an intermediate elastic data set comprising the account in which the key behavior occurred and the occurrence time of the key behavior;
S1012: all intermediate elastic data sets of the same account are obtained through the groupByKey function of the computing engine Spark, the occurrence times of all key behaviors in the intermediate elastic data sets of the account are formed into an array, and the account together with the array of occurrence times of all its key behaviors forms the elastic data set of the account.
Preferably, the method further comprises:
S1013: after the intermediate elastic data sets are formed, for each intermediate elastic data set, the starting time of the current period is subtracted from the occurrence time of the key behavior through the mapToPair function of the computing engine Spark to obtain the relative time of each key behavior, and the unit of each relative time is converted to obtain the conversion time of each key behavior, giving an optimized intermediate elastic data set; the optimized intermediate elastic data set is then taken as the input of the groupByKey function to form the elastic data set of each account. For the number of key behaviors occurring in each time period of the same account, the general trend is that the number of key behaviors per unit conversion time in the period is larger than the number of key behaviors per unit relative time.
Preferably, step 102 specifically includes:
S1021: aiming at any account, taking a preset length time period as an independent variable of a linear regression equation, and taking the key behavior quantity of the account as a dependent variable of the linear regression equation to obtain a linear regression fit curve of key behavior data of the account;
S1022: calculating the mean square error of the dependent variable valuation in the linear regression fit curve of the account key behavior data, and calculating the actual variance of the account key behavior data according to the key behavior quantity of the account in each preset length time period; and taking the ratio of the mean square error and the actual variance of the key behavior data of each account as the goodness of fit of the key behavior data of each account.
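Steps S1021 and S1022 can be illustrated with a small pure-Python sketch. It assumes the independent variable is simply the index 0, 1, 2, … of each time period and that at least two periods are present; the function names are hypothetical. The function returns the ratio of mean square error to actual variance named in S1022; the worked example later in the description computes one minus this ratio, which is the usual coefficient of determination.

```python
def fit_line(counts):
    """Least-squares fit of counts[i] against the index i; returns (slope, intercept)."""
    n = len(counts)
    x_mean = (n - 1) / 2
    y_mean = sum(counts) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(counts))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def error_to_variance_ratio(counts):
    """Mean square error of the fitted line divided by the variance of the data."""
    slope, intercept = fit_line(counts)
    n = len(counts)
    y_mean = sum(counts) / n
    mse = sum((y - (slope * x + intercept)) ** 2 for x, y in enumerate(counts)) / n
    variance = sum((y - y_mean) ** 2 for y in counts) / n
    if variance == 0:
        # Assumption: a perfectly constant series has no variance, so the
        # ratio is undefined; the source does not say how to treat this case.
        return None
    return mse / variance


slope, intercept = fit_line([1, 3, 5, 7])
ratio = error_to_variance_ratio([1, 3, 5, 7])
```

For data lying exactly on a line the ratio is zero, and it grows toward one as the line explains less of the variation in the counts.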
Preferably, step 103 specifically includes:
S1031: comparing the fitting goodness of the key behavior data of each account in the user account set with a set goodness threshold in batches;
S1032: when the goodness-of-fit of the key behavior data of a certain account is greater than or equal to a goodness-of-fit threshold, judging the account as a machine account; and when the goodness-of-fit of the key behavior data of a certain account is smaller than the goodness-of-fit threshold, judging that the account is a non-machine account.
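The batch comparison in S1031 and S1032 amounts to one threshold test per account. A minimal sketch follows, with a hypothetical function name and hypothetical goodness-of-fit values; only the greater-than-or-equal rule is taken from the text.

```python
def classify_accounts(goodness_by_account, threshold):
    """Map each account to True (machine account) or False (non-machine account)."""
    # S1032: a goodness of fit greater than or equal to the threshold marks a machine account.
    return {
        account: goodness >= threshold
        for account, goodness in goodness_by_account.items()
    }


# Hypothetical goodness-of-fit values for three accounts.
scores = {"account_a": 0.99, "account_b": 0.20, "account_c": 0.98}
labels = classify_accounts(scores, threshold=0.98)
```

An account whose value equals the threshold exactly, such as account_c here, is classified as a machine account under this rule.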
As shown in fig. 2, in combination with an embodiment of the present invention, there is further provided a system for identifying machine account numbers in batches, including a database and a computing engine Spark, where the computing engine Spark includes: a key behavior data integrating unit 21, a linear regression unit 22 and a judging unit 23, wherein:
The database is used for storing user behavior logs of the login account;
The key behavior data integrating unit 21 is configured to periodically obtain, from the database, a user behavior log of the login account in a previous period, extract all accounts whose behavior number exceeds a preset number threshold in the previous period, and form a user account set; acquiring the occurrence time of all key behaviors of each account in a previous period in a user behavior log of a database, and forming an elastic data set of each account according to the occurrence time of all key behaviors of each account; for any account in a user account set, sequentially counting the number of key behaviors occurring in each length time period according to the elastic data set of the account and the preset length time period; the key behavior refers to a behavior which is performed by a user in an account authority range and reaches a preset importance level;
The linear regression unit 22 is configured to fit, for any account in the user account set, a relationship of the key behavior number in each length period of the account over time by using a linear regression equation, so as to obtain a linear regression fit curve of key behavior data of the account; calculating the fitting goodness of the corresponding key behavior data of each account according to the linear regression fitting curve of each account in the user account set;
the judging unit 23 is configured to judge whether each account is a machine account in batch according to the goodness of fit of the key behavior data of each account in the user account set; the machine account number refers to an account number which is logged in batches by using a script to perform illegal operation.
Preferably, the key behavior data integrating unit 21 includes:
The intermediate elastic data set subunit 211 is configured to obtain, in a user behavior log of the database, occurrence times of all key behaviors of each account in a user account set in a previous period, and form, for each key behavior, an intermediate elastic data set including an account in which the key behavior occurs and the occurrence time of the key behavior;
The key behavior data integration subunit 212 is configured to obtain all intermediate elastic data sets of the same account through the groupByKey function of the computing engine Spark, form the occurrence times of all key behaviors in the intermediate elastic data sets of the account into an array, and form the elastic data set of the account from the account and the array of occurrence times of all its key behaviors.
Preferably, the key behavior data integrating unit 21 further includes:
An intermediate elastic data set optimizing subunit 213, configured to obtain, for each intermediate elastic data set, a relative time when each key action occurs by subtracting a start time of a current period from an occurrence time of the key action by using a mapToPair function of a computing engine Spark after the intermediate elastic data set is formed, and convert units of each relative time to obtain a conversion time when each key action occurs, so as to obtain an optimized intermediate elastic data set;
The key behavior data integration subunit 212 is specifically configured to take the optimized intermediate elastic data set as the input of the groupByKey function to form the elastic data set of each account.
Preferably, the linear regression unit 22 includes:
A linear fitting subunit 221, configured to obtain, for any account, a linear regression fit curve of the account's key behavior data by using the preset length time period as the independent variable of a linear regression equation and the number of key behaviors of the account as the dependent variable of the linear regression equation;
A goodness-of-fit calculation subunit 222, configured to calculate a mean square error of the dependent variable valuations in the linear regression fit curve of the account key behavior data, and calculate an actual variance of the account key behavior data according to the number of key behaviors in each preset length time period of the account; and taking the ratio of the mean square error and the actual variance of the key behavior data of each account as the goodness of fit of the key behavior data of each account. Wherein the predetermined length of time period may be a unit conversion time.
Preferably, the judging unit 23 includes:
A comparing subunit 231, configured to compare, in batch, the goodness of fit of the key behavior data of each account in the user account set with a set goodness threshold;
A determining subunit 232, configured to determine that an account is a machine account when the goodness-of-fit of the key behavior data of the account is greater than or equal to the goodness-of-fit threshold; and when the goodness-of-fit of the key behavior data of a certain account is smaller than the goodness-of-fit threshold, judging that the account is a non-machine account.
The embodiment of the invention has the beneficial effects that:
The key behaviors of accounts are searched based on Spark, the change in the key behavior data is fitted with linear regression to obtain a fit curve, and the goodness of fit of the curve is calculated, so that machine accounts can be identified in batches. Screening on key behaviors also finds low-frequency machine accounts, which raises the discovery rate for low-frequency machine accounts and lowers the false-positive rate for non-machine accounts, and Spark allows the batch identification of machine accounts to run automatically.
The foregoing technical solutions of the embodiments of the present invention will be described in detail with reference to specific application examples, and reference may be made to the foregoing related description for details of the implementation process that are not described.
Technical terms related to the invention are explained as follows:
Machine account: on the internet social platforms of modern social media, a large number of bad actors use scripts to log in to accounts in batches and perform illegitimate operations such as artificially inflating metrics ("brushing"). These accounts generally have no substantial content, so they negatively affect normal users and, to some extent, undermine the fairness of the platform.
Behavior log: the log recorded when an internet account performs an upstream operation, such as liking, commenting, or following. The recorded information includes the operation number, account, time, target, and so on.
The invention discloses a system and method for batch identification of machine accounts based on Spark and linear regression, which can automatically find, in batches, the machine accounts logged in by scripts through data mining and analysis. The method can find not only machine accounts with high-frequency access but also, with a very high discovery rate, machine accounts with low-frequency access, and it reduces the false-positive rate of the whole system.
The invention discloses a machine account batch identification system and a machine account batch identification method based on Spark and linear regression, which have the following complete technical scheme:
1. Obtain the set U of all users (i.e., the user account set) whose number of behaviors (likes, comments, reposts) on the previous day exceeds a threshold C.
2. Using Spark's Hive query, query the times of yesterday's key behaviors for every uid in U, and form the timestamps of the key behaviors into an intermediate elastic data set RDD1. Spark is the computing engine, deployed as a distributed cluster, and Hive is the database.
3. Using Spark's mapToPair function, subtract the timestamp of 00:00 yesterday from each timestamp t and divide the result by t0 (preferably 3600 s) to form the optimized intermediate elastic data set RDD2 in the format [uid, h]. That is, for the number of key behaviors occurring in each time period of the same account, the general trend is that the number of key behaviors per unit conversion time in the period is greater than the number of key behaviors per unit relative time.
4. Using Spark's groupByKey function, group the h values of the same uid together to form the elastic data set RDD3 for that account in the format [uid, [h0, h1, …]].
5. For any account in the user account set, count in sequence, from the account's elastic data set, the number of key behaviors occurring in each time period of the preset length. That is, RDD3 is fetched from Spark with Spark's collect function to form an array L, and for each element of L the total number of behaviors in each interval of length t0 is counted, giving the user's behavior counts T0, T1, T2, … for the intervals 0 to t0, t0 to 2t0, 2t0 to 3t0, and so on, which form a list T.
6. For any account in the user account set, fit the change over time of the number of key behaviors per time period of the account with a linear regression equation to obtain a linear regression fit curve of the account's key behavior data, and calculate the goodness of fit of each account's key behavior data from the linear regression fit curve of each account in the user account set.
A goodness-of-fit test is then performed: if the values T0, T1, T2, … are nearly constant and change little, the goodness of fit R^2 of the linear regression will be high.
7. A threshold R0 is defined; if R^2 > R0, the account is considered to be a machine account.
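Steps 3 to 5 above can be sketched in plain Python without a Spark cluster. The helper names are hypothetical, and the loops only mirror what mapToPair, groupByKey, and the per-interval counting are described as doing; integer epoch-second timestamps and the sample records are assumptions made for illustration.

```python
from collections import defaultdict


def hour_index(timestamp, day_start, t0=3600):
    """Step 3: relative time since the start of the day, in units of t0 seconds."""
    return (timestamp - day_start) // t0


def group_by_account(records, day_start, t0=3600):
    """Step 4: collect the interval indices h of each uid, like groupByKey."""
    grouped = defaultdict(list)
    for uid, timestamp in records:
        grouped[uid].append(hour_index(timestamp, day_start, t0))
    return dict(grouped)


def counts_per_interval(indices, n_intervals=24):
    """Step 5: number of key behaviors in each interval, forming the list T."""
    counts = [0] * n_intervals
    for h in indices:
        if 0 <= h < n_intervals:
            counts[h] += 1
    return counts


# Hypothetical (uid, epoch-seconds) records and a hypothetical start of the day.
day_start = 1_602_288_000
records = [
    (1, day_start + 10),
    (1, day_start + 3700),
    (2, day_start + 7300),
    (1, day_start + 3800),
]
grouped = group_by_account(records, day_start)
T_by_uid = {uid: counts_per_interval(hs) for uid, hs in grouped.items()}
```

Each list in T_by_uid is the per-account sequence T0, T1, T2, … that step 6 fits with a straight line.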
Specific examples are as follows:
For all users whose number of behaviors on the previous day is greater than 1000, the key behavior records are queried in Hive, for example [1:20201010080810, 1:20201010080910, …], indicating that user 1 initiated a key behavior at 20201010080810 and at 20201010080910.
Steps 2 and 3 then in effect convert each timestamp to the hour in which it falls, i.e., [1:8, 1:8, …].
Step 4 then aggregates the data of each uid together, giving data of the form [uid: list of the hours in which the key behaviors fall], namely [1: [8, 8, 9, 9, 10, 10, 11, 11, …], 2: [9, 10, 18, 18, 18, …], …].
For one of the users, assume the behavior list is [0, 0, 1, 1, 2, 2, 3, 3, …]; counting the number of behaviors in each interval (the interval length is one hour, so T has length 24) gives T:
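The conversion from the timestamp strings shown above to per-user hour lists can be sketched as follows. The function name is hypothetical, the yyyymmddHHMMSS layout is inferred from the two sample values, and the third record is invented for illustration.

```python
from collections import defaultdict
from datetime import datetime


def hours_by_uid(records):
    """Group the hour of each key behavior under the uid that performed it."""
    grouped = defaultdict(list)
    for uid, stamp in records:
        # Assumed layout: four-digit year, then month, day, hour, minute, second.
        moment = datetime.strptime(stamp, "%Y%m%d%H%M%S")
        grouped[uid].append(moment.hour)
    return dict(grouped)


records = [
    (1, "20201010080810"),
    (1, "20201010080910"),
    (2, "20201010090000"),  # hypothetical extra record
]
hour_lists = hours_by_uid(records)
```

Taking the hour field directly gives the same value as subtracting midnight and dividing by 3600 s when all records fall within a single day.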
[2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1]
It can be seen that this account behaves very uniformly at every hour of the day, much like a machine account, because the line graph of T for a machine account is very smooth and resembles a straight line. Normal users, by contrast, generally do not visit during the night, and there is always some period of the day in which they rest, so their T fluctuates noticeably. In other words, if T is fitted with linear regression, the fit for a machine account is very good; the better the goodness of fit of the linear regression on a T sequence, the more likely the account is abnormal.
The following is the calculation of R^2.
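The list T shown above can be reproduced from the hour list with a simple count per hour. The sketch assumes the elided tail of the behavior list continues the same two-per-hour pattern and ends with a single entry for hour 23; the function name is hypothetical.

```python
from collections import Counter


def hourly_counts(hours, n_hours=24):
    """Count how many key behaviors fall in each hour 0 .. n_hours - 1."""
    tally = Counter(hours)
    return [tally.get(h, 0) for h in range(n_hours)]


# Assumed full behavior list: two entries per hour, with the final entry dropped.
hours = [h for h in range(24) for _ in range(2)][:-1]
T = hourly_counts(hours)
```

Under that assumption T holds the value 2 for hours 0 to 22 and the value 1 for hour 23.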
The best fit can be computed with a wide variety of software; here the curve_fit function from Python's scipy.optimize module is used.
Define f(x) as the straight line y = ax + b, and define x = [0, 1, 2, 3, …] with the same length as T:
import numpy
from scipy.optimize import curve_fit
def f(x, a, b):
    return a * x + b
x = numpy.arange(len(T))
popt, pcov = curve_fit(f, x, T)
After executing this statement, popt holds the optimized a and b.
Calculating the goodness of fit R^2:
yvals = f(x, *popt)  # fitted values on the regression line
sum0 = 0  # residual sum of squares
sum1 = 0  # total sum of squares around the mean of T
average = numpy.average(T)
for i in range(len(yvals)):
    sum0 += (T[i] - yvals[i]) ** 2
    sum1 += (T[i] - average) ** 2
R2 = 1 - (sum0 / sum1)
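Written as a formula, the quantity computed by the code above is the coefficient of determination. Here sum0 corresponds to the numerator, sum1 to the denominator, and the symbols \hat{y}_i and \bar{T} are notation introduced for this restatement only.

```latex
R^2 = 1 - \frac{\sum_{i} \left(T_i - \hat{y}_i\right)^2}{\sum_{i} \left(T_i - \bar{T}\right)^2},
\qquad \hat{y}_i = a\,x_i + b,
\qquad \bar{T} = \frac{1}{n}\sum_{i} T_i
```

The denominator is zero when every T_i is identical, so the expression is undefined for a perfectly constant sequence.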
For this user's T, the calculation gives an R^2 of about 0.9995; with R0 = 0.98, R^2 > R0, so this user is determined to be a machine user.
Now consider the T of a normal user:
[1,0,0,0,0,0,0,0,1,0,2,1,10,10,0,0,0,4,0,19,20,40,40,20];
R^2 is about 0.2, so R^2 < R0;
this user is determined to be a normal user.
The embodiment of the invention has the following beneficial effects:
The key behaviors of accounts are searched based on Spark, the change in the key behavior data is fitted with linear regression to obtain a fit curve, and the goodness of fit of the curve is calculated, so that machine accounts can be identified in batches. Screening on key behaviors also finds low-frequency machine accounts, which raises the discovery rate for low-frequency machine accounts and lowers the false-positive rate for non-machine accounts, and Spark allows the batch identification of machine accounts to run automatically.
It should be understood that the specific order or hierarchy of steps in the processes disclosed are examples of exemplary approaches. Based on design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not meant to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus the following claims are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate preferred embodiment of this invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in the specification or claims, the term is intended to be inclusive in a manner similar to the term "comprising," as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification or the claims is intended to mean a "non-exclusive or".
Those of skill in the art will further appreciate that the various illustrative logical blocks, units, and steps described in connection with the embodiments of the invention may be implemented by electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, the various illustrative components, elements, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design requirements of the overall system. Those skilled in the art may implement the described functionality in varying ways for each particular application, but such implementation is not to be understood as beyond the scope of the embodiments of the present invention.
The various illustrative logical blocks or units described in the embodiments of the invention may be implemented or performed with a general purpose processor, a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described. A general purpose processor may be a microprocessor, but in the alternative, the general purpose processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a digital signal processor and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a digital signal processor core, or any other similar configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. In an example, a storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, which may reside in a user terminal. In the alternative, the processor and the storage medium may reside as distinct components in a user terminal.
In one or more exemplary designs, the above-described functions of embodiments of the present invention may be implemented in hardware, software, firmware, or any combination of the three. If implemented in software, the functions may be stored on a computer-readable medium or transmitted as one or more instructions or code on the computer-readable medium. Computer-readable media include both computer storage media and communication media that facilitate transfer of computer programs from one place to another. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer. For example, such computer-readable media may include, but are not limited to, RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that may be used to carry or store program code in the form of instructions or data structures that may be read by a general or special purpose computer, or a general or special purpose processor. Further, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source via a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then that connection is also included in the definition of computer-readable medium. Disk and disc, as used herein, include compact discs, laser discs, optical discs, DVDs, floppy disks, and Blu-ray discs, where disks usually reproduce data magnetically, while discs usually reproduce data optically with lasers. Combinations of the above may also be included within the computer-readable media.
The foregoing description of the embodiments has been provided for the purpose of illustrating the general principles of the invention, and is not meant to limit the scope of the invention or to limit the invention to the particular embodiments; any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (8)

1. A method for mass identification of machine account numbers, comprising:
The computing engine Spark periodically acquires user behavior logs of login accounts in a previous period from a database, extracts all accounts with the behavior quantity exceeding a preset quantity threshold value in the previous period, and forms a user account set;
Acquiring the occurrence time of all key behaviors of each account in a previous period in a user behavior log of a database, and forming an elastic data set of each account according to the occurrence time of all key behaviors of each account;
For any account in a user account set, sequentially counting the number of key behaviors occurring in each length time period according to the elastic data set of the account and the preset length time period; the key behavior refers to a behavior which is performed by a user in an account authority range and reaches a preset importance level;
For any account in the user account set, fitting the change relation of the key behavior quantity in each length time period of the account with time by adopting a linear regression equation to obtain a linear regression fit curve of key behavior data of the account; calculating the fitting goodness of the corresponding key behavior data of each account according to the linear regression fitting curve of each account in the user account set;
Judging whether each account is a machine account in batches according to the fitting goodness of the key behavior data of each account in the user account set; the machine account number is an account number which logs in batch by utilizing a script to perform illegal operation;
The batch judgment of whether each account is a machine account or not according to the fitting goodness of the key behavior data of each account in the user account set specifically comprises the following steps:
comparing the fitting goodness of the key behavior data of each account in the user account set with a set goodness threshold in batches;
When the goodness-of-fit of the key behavior data of a certain account is greater than or equal to a goodness-of-fit threshold, judging the account as a machine account;
And when the goodness-of-fit of the key behavior data of a certain account is smaller than the goodness-of-fit threshold, judging that the account is a non-machine account.
2. The method for batch identification of machine accounts according to claim 1, wherein the obtaining, in the user behavior log of the database, occurrence times of all key behaviors of each account in a previous period in the user account set, and forming the occurrence times of all key behaviors of each account into an elastic data set of each account specifically includes:
The calculation engine Spark obtains the occurrence time of all key behaviors of each account in a user account set in the previous period in a user behavior log of a database, and forms a middle elastic data set comprising the account generated by the key behaviors and the occurrence time of the key behaviors aiming at each key behavior;
and acquiring all intermediate elastic data sets of the same account through the groupByKey function of the calculation engine Spark, forming an array of the occurrence times of all key behaviors in the intermediate elastic data sets of the account, and forming the elastic data set of the account from the account and the array of the occurrence times of all key behaviors of the account.
3. The method of batch identification of machine account numbers of claim 2 further comprising:
After the middle elastic data sets are formed, subtracting the starting point time of the current period from the occurrence time of the key behaviors through mapToPair functions of the computing engine Spark for each middle elastic data set to obtain the relative time of each key behavior, and converting units of each relative time to obtain the conversion time of each key behavior to obtain an optimized middle elastic data set; and taking the optimized intermediate elastic data set as an object obtained by the groupByKey function to form an elastic data set of each account.
4. The method for batch identification of machine accounts according to claim 2, wherein for any account in the user account set, a linear regression equation is adopted to fit the time-dependent relationship of the number of key behaviors in each length period of the account, so as to obtain a linear regression fit curve of the key behavior data of the account, and the method specifically comprises:
Aiming at any account, taking a preset length time period as an independent variable of a linear regression equation, and taking the key behavior quantity of the account as a dependent variable of the linear regression equation to obtain a linear regression fit curve of key behavior data of the account;
Calculating the goodness-of-fit of the corresponding key behavior data of each account according to the linear regression fit curve of each account in the user account set, wherein the method specifically comprises the following steps:
Calculating the mean square error of the dependent variable valuation in the linear regression fit curve of the account key behavior data, and calculating the actual variance of the account key behavior data according to the key behavior quantity of the account in each preset length time period; and taking the ratio of the mean square error and the actual variance of the key behavior data of each account as the goodness of fit of the key behavior data of each account.
5. A system for mass identification of machine account numbers, comprising a database and a computing engine Spark, the computing engine Spark comprising: the system comprises a key behavior data integration unit, a linear regression unit and a judgment unit, wherein:
The database is used for storing user behavior logs of the login account;
The key behavior data integration unit is used for periodically acquiring user behavior logs of login accounts in a previous period from the database, extracting all accounts with the behavior number exceeding a preset number threshold value in the previous period, and forming a user account set; acquiring the occurrence time of all key behaviors of each account in a previous period in a user behavior log of a database, and forming an elastic data set of each account according to the occurrence time of all key behaviors of each account; for any account in a user account set, sequentially counting the number of key behaviors occurring in each length time period according to the elastic data set of the account and the preset length time period; the key behavior refers to a behavior which is performed by a user in an account authority range and reaches a preset importance level;
The linear regression unit is used for fitting the change relation of the key behavior quantity in each length time period of the account with respect to any account in the user account set by adopting a linear regression equation to obtain a linear regression fitting curve of the key behavior data of the account; calculating the fitting goodness of the corresponding key behavior data of each account according to the linear regression fitting curve of each account in the user account set;
The judging unit is used for judging whether each account is a machine account in batches according to the fitting goodness of the key behavior data of each account in the user account set; the machine account number is an account number which logs in batch by utilizing a script to perform illegal operation;
the judging unit includes:
The comparison subunit is used for comparing the fitting goodness of the key behavior data of each account in the user account set with a set goodness threshold in batches;
The judging subunit is used for judging that the account is a machine account when the fitting goodness of the key behavior data of the account is greater than or equal to a goodness threshold value; and when the goodness-of-fit of the key behavior data of a certain account is smaller than the goodness-of-fit threshold, judging that the account is a non-machine account.
6. The system for batch identification of machine accounts according to claim 5, wherein the key behavior data integration unit comprises:
The middle elastic data set subunit is used for acquiring the occurrence time of all key behaviors of each account in the user account set in the previous period from the user behavior log of the database, and forming a middle elastic data set comprising the account generated by the key behaviors and the occurrence time of the key behaviors aiming at each key behavior;
The key behavior data integration subunit is configured to obtain all intermediate elastic data sets of the same account through the groupByKey function of the calculation engine Spark, form an array of the occurrence times of all key behaviors in the intermediate elastic data sets of the account, and form an elastic data set of the account from the account and the array of the occurrence times of all key behaviors of the account.
7. The system for batch identification of machine accounts according to claim 6, wherein the key behavior data integration unit further comprises:
the intermediate elastic data set optimizing subunit is used for obtaining the relative time of each key behavior occurrence by subtracting the starting point time of the current period from the occurrence time of the key behavior through mapToPair function of the computing engine Spark for each intermediate elastic data set after the intermediate elastic data sets are formed, and converting units of each relative time to obtain the conversion time of each key behavior occurrence to obtain an optimized intermediate elastic data set;
The key behavior data integration subunit is specifically configured to take the optimized intermediate elastic data set as an object obtained by the groupByKey function to form an elastic data set of each account.
8. The system for batch identification of machine account numbers of claim 6 wherein the linear regression unit comprises:
The linear fitting subunit is used for aiming at any account, taking a preset length time period as an independent variable of a linear regression equation and taking the key behavior quantity of the account as the dependent variable of the linear regression equation to obtain a linear regression fitting curve of the key behavior data of the account;
The fitting goodness calculating subunit is used for calculating the mean square error of the dependent variable estimation in the linear regression fitting curve of the account key behavior data and calculating the actual variance of the account key behavior data according to the key behavior quantity of the account in each preset length time period; and taking the ratio of the mean square error and the actual variance of the key behavior data of each account as the goodness of fit of the key behavior data of each account.
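Read as a formula, the goodness of fit in claims 4 and 8 appears to correspond to the coefficient of determination of the fitted line. The following is one interpretation and not the claims' own notation: x_i is the i-th time period, y_i is the key behavior count in that period, \hat{y}_i is the value given by the fitted line with assumed coefficients a and b, \bar{y} is the mean count, n is the number of periods, and R_0 is the goodness threshold.

```latex
\hat{y}_i = a + b\,x_i, \qquad
R^2 = \frac{\sum_{i=1}^{n} \left(\hat{y}_i - \bar{y}\right)^2}
           {\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2}, \qquad
\text{machine account} \iff R^2 \ge R_0 .
```

Under this reading the numerator measures the variation of the fitted estimates about the mean and the denominator measures the actual variation of the counts, so a value close to 1 indicates that the account's key behavior counts change almost linearly with time.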
CN202110083543.4A 2021-01-21 2021-01-21 Method and system for identifying machine account numbers in batches Active CN112861128B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110083543.4A CN112861128B (en) 2021-01-21 2021-01-21 Method and system for identifying machine account numbers in batches

Publications (2)

Publication Number Publication Date
CN112861128A CN112861128A (en) 2021-05-28
CN112861128B true CN112861128B (en) 2024-06-18

Family

ID=76008938

Country Status (1)

Country Link
CN (1) CN112861128B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant