Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
Before introducing the method for identifying an abnormal object provided in one or more embodiments of the specification, the inventive concept of the method will be described.
First, identifying and controlling telephone fraud remains both a priority and a persistent difficulty in the field of risk control. The scheme therefore identifies abnormal telephone numbers. In addition, website addresses and apps, which share similar characteristics with telephone numbers and can likewise be used to carry out fraud, can also be identified.
Second, conventional techniques usually recognize abnormal objects based on static data. Such static data may characterize certain behavioral preferences, or the statistical significance of behaviors, in the history of an abnormal object, and thereby reveal potential behavioral risk. However, static data are typically statistics of individual behaviors. Because users have different motivations in different environments or scenarios, a feature that merely counts occurrences of a single behavior is subject to considerable noise. Thus, identifying abnormal objects based solely on static data is often inaccurate.
To improve the accuracy of abnormal object identification, the scheme combines behaviors so as to identify the motivation and risk corresponding to a combination of behaviors. A behavior sequence captures context better because it starts from the series of behaviors in which the user is engaged. Therefore, the scheme uses the behavior sequence of an object as a relevant feature when identifying abnormal objects.
A behavior sequence is a sequence in which the user's historical operation behaviors are arranged in chronological order. It contains the behavior events themselves and the order of those events within a certain time window. For example, the behavior sequence over the past 1 hour can be expressed as "A -> B -> C -> D", where A through D can represent the remark names stored by different users for a certain object. It should be noted that although "B -> C -> A -> D" and "A -> B -> C -> D" contain the same behavior events, they are two completely different behavior patterns because the events occur in different orders.
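The point that order matters can be illustrated with a minimal sketch. The helper names `same_events`, `same_pattern`, and `render` are hypothetical and exist only for this illustration.

```python
# Two behavior sequences that contain the same events in different orders.
seq_1 = ["A", "B", "C", "D"]
seq_2 = ["B", "C", "A", "D"]


def same_events(x, y):
    """True if both sequences contain the same events, ignoring order."""
    return sorted(x) == sorted(y)


def same_pattern(x, y):
    """True only if the events and their order both match."""
    return list(x) == list(y)


def render(seq):
    """Render a sequence in the arrow notation used in the text."""
    return " -> ".join(seq)


print(render(seq_1), "|", render(seq_2))
print("same events:", same_events(seq_1, seq_2))
print("same pattern:", same_pattern(seq_1, seq_2))
```

A purely count-based feature would treat the two sequences as identical, whereas a sequence feature distinguishes them.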
Finally, because the scheme uses the behavior sequence as an input feature, a Recurrent Neural Network (RNN) model, which is suited to input features with sequential characteristics, may be adopted to classify abnormal objects. Moreover, the behavior sequence may contain the remark names of objects, and remark names are text that cannot be input into the model directly. Therefore, the behavior sequence may be encoded or vectorized. Specifically, the behavior sequence may be encoded or vectorized by using a word vectorization algorithm (e.g., word2vec or cw2vec) or a text classification algorithm (fastText).
The above is the inventive concept of the solution provided in the present specification, and the solution provided in the present specification can be obtained based on the inventive concept. The solutions provided in this specification are further elaborated below:
Fig. 1 is a schematic diagram of a method for identifying an abnormal object provided in this specification. In Fig. 1, storage information generated when N users perform storage behaviors for the objects in their respective object sets is first collected. The object may include, but is not limited to, a phone number, a website address, an app, and the like. Taking a phone number as the object, the corresponding object set may be an address book. The content of the address book is a very important supplement to existing risk control data. The storage information may include, but is not limited to, the remark name and storage time of each object. Then, based on the collected storage information, static statistical data and a dynamic storage behavior sequence of each object are obtained. The storage behavior sequence is encoded according to a preset encoding algorithm. Finally, the encoded storage behavior sequence is spliced with the statistical data, and the spliced data are input into a classifier to identify whether each object is an abnormal object.
Fig. 2 is a flowchart of a method for identifying an abnormal object according to an embodiment of the present specification. The method may be executed by a device with processing capabilities. As shown in Fig. 2, the method may specifically include:
Step 202, obtaining storage information generated when a plurality of users perform storage behaviors for the objects in their respective object sets.
The object may include, but is not limited to, a phone number, a website address, an app, and the like. Taking the object as a phone number as an example, the corresponding object set may be an address book. The storage information may include, but is not limited to, a remark name and a storage time of each object. It is understood that, when the object is a telephone number, the stored information can be obtained from the address book. That is, the address lists of a plurality of users are obtained first, and then the storage information of each telephone number is obtained from the respective address lists of the plurality of users. Taking the object as a website address as an example, the corresponding storage information can be obtained from the browser.
It should be noted that, in addition to the above storage information, the present solution may also obtain a device identifier (e.g., an International Mobile Equipment Identity (IMEI) number) corresponding to the storage information. Furthermore, the correspondence among users, objects, and storage information may also be recorded.
Taking the object as a phone number as an example, the correspondence relationship can be shown in table 1.
TABLE 1
It should be understood that the content of table 1 is only illustrative, and the correspondence provided by the embodiments of the present specification is not limited to the above. For example, table 1 may further include a device identifier, which is not limited in this specification.
It should be noted that step 202 may be performed periodically, so that, by comparing the storage information of the same user in two consecutive periods, it can be determined whether the user has deleted a certain object, and so on.
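The period-to-period comparison can be sketched as a set difference between two snapshots of a user's object set. This is only an illustrative sketch: the snapshot layout (a dict mapping each object to its remark name), the function names, and the sample entries "139 xxx" and "colleague" are assumptions, not part of the specification.

```python
from collections import Counter


def deleted_objects(previous, current):
    """Objects present in the earlier snapshot but missing from the later one.

    Each snapshot is assumed to be a dict mapping object -> remark name.
    """
    return sorted(set(previous) - set(current))


def count_deletions(snapshots_by_user):
    """Count, per object, how many users deleted it between two periods.

    snapshots_by_user maps user -> (previous_snapshot, current_snapshot).
    """
    counts = Counter()
    for previous, current in snapshots_by_user.values():
        counts.update(deleted_objects(previous, current))
    return counts


# Hypothetical snapshots for two users over two consecutive periods.
snapshots_by_user = {
    "user_a": (
        {"186 xxx": "cheater", "139 xxx": "colleague"},
        {"139 xxx": "colleague"},
    ),
    "user_b": ({"186 xxx": "fraudster"}, {}),
}
deletions = count_deletions(snapshots_by_user)
print(dict(deletions))
```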
Step 204, acquiring static statistical data and a dynamic storage behavior sequence of each object according to the storage information.
The static statistical data are obtained through statistics and data mining, and may include one or more of the following: the number of users who stored the object over the past several days, the number of days on which the object was stored over the past several days, whether the object was stored under a target name (e.g., "cheater") over the past several days, the number of times the object was deleted over the past several days, and so on.
Taking a phone number as the object, the static statistical data can be obtained based on the contents of Table 1. The number of times the telephone number was deleted over the past several days can be obtained by comparing each user's address books from two consecutive periods.
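As a rough sketch of how such statistics might be computed, the following assumes each stored entry is a dict with the hypothetical keys `user`, `object`, `remark`, and `stored_on`. The window length, the set of target names, and the reference date are illustrative assumptions.

```python
from datetime import date, timedelta


def static_statistics(records, obj, today, window_days, target_names):
    """Compute simple per-object statistics over the last `window_days` days."""
    start = today - timedelta(days=window_days)
    recent = [
        r for r in records
        if r["object"] == obj and start <= r["stored_on"] <= today
    ]
    return {
        # Number of distinct users who stored the object in the window.
        "num_users": len({r["user"] for r in recent}),
        # Number of distinct days on which the object was stored.
        "num_days": len({r["stored_on"] for r in recent}),
        # 1 if any user stored the object under a target name, else 0.
        "stored_as_target": int(any(r["remark"] in target_names for r in recent)),
    }


# Hypothetical records for one phone number.
records = [
    {"user": "A", "object": "186 xxx", "remark": "cheater", "stored_on": date(2017, 9, 1)},
    {"user": "B", "object": "186 xxx", "remark": "fraudster", "stored_on": date(2017, 9, 10)},
    {"user": "C", "object": "186 xxx", "remark": "fraudster", "stored_on": date(2017, 9, 22)},
]
stats = static_statistics(
    records, "186 xxx", today=date(2017, 9, 30), window_days=30,
    target_names={"cheater", "fraudster"},
)
print(stats)
```

The deletion count is omitted here because it depends on comparing snapshots from two periods rather than on the stored entries themselves.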
Because static statistical data alone cannot fully describe the characteristics of abnormal objects, the scheme further obtains the dynamic storage behavior sequence of each object so as to improve the accuracy and coverage of abnormal object identification. The storage behavior sequence depicts the pattern formed by a combination of behaviors of an abnormal object, which is more accurate than using the statistical features of individual behaviors.
In an implementation manner, the obtaining a dynamic storage behavior sequence of each object according to the storage information may include:
For each of the objects, the remark names and storage times of that object are selected from the remark names and storage times of all the objects. The remark names of the object are sorted according to their storage times. A storage behavior sequence of the object is generated from the sorted remark names.
For example, assume that the phone number 186 xxx is stored in the respective address books of three different users: user A stored the phone number on September 1, 2017 with the remark name "cheater"; user B stored it on September 10, 2017 with the remark name "fraudster"; and user C stored it on September 22, 2017 with the remark name "fraudster". According to these storage times, the corresponding remark names are sorted as: cheater, fraudster, fraudster. From the sorting result, the storage behavior sequence of the phone number can be generated: cheater -> fraudster -> fraudster.
Therefore, the storage behavior sequence in the scheme reflects both the order in which users stored the object and the content they stored, so that the characteristics of the object can be described more accurately.
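The select, sort, and generate steps can be sketched as follows, using the three stored entries from the example. The record layout and the function name `storage_behavior_sequence` are assumptions made for illustration; the entries are deliberately listed out of order, with one unrelated hypothetical entry, to show the selection and sorting.

```python
from datetime import date


def storage_behavior_sequence(records, obj):
    """Select the entries for `obj`, sort by storage time, return remark names."""
    rows = [r for r in records if r["object"] == obj]
    rows.sort(key=lambda r: r["stored_on"])
    return [r["remark"] for r in rows]


records = [
    {"user": "C", "object": "186 xxx", "remark": "fraudster", "stored_on": date(2017, 9, 22)},
    {"user": "A", "object": "186 xxx", "remark": "cheater", "stored_on": date(2017, 9, 1)},
    {"user": "A", "object": "139 xxx", "remark": "colleague", "stored_on": date(2017, 9, 5)},
    {"user": "B", "object": "186 xxx", "remark": "fraudster", "stored_on": date(2017, 9, 10)},
]
sequence = storage_behavior_sequence(records, "186 xxx")
print(" -> ".join(sequence))
```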
It will be appreciated that the above example is the generation of a sequence of storage behaviors of an object from a dimension of a user. Of course, in practical applications, the storage behavior sequence may also be generated from other dimensions (e.g., dimensions of the device), which is not limited in this specification. When the storage behavior sequence of the object is generated from the dimension of the device, the device identifier may be obtained while the storage information is obtained, and a generation manner of the storage behavior sequence of the dimension may be as described above, which is not repeated herein.
Step 206, encoding the storage behavior sequence according to a preset encoding algorithm.
The preset encoding algorithm may include, but is not limited to, a word vectorization algorithm (e.g., word2vec or cw2vec), a text classification algorithm (fastText), and the like.
Taking the storage behavior sequence generated above as an example, the remark names in the sequence are essentially character strings, so each character string can be converted into a corresponding vector through the preset encoding algorithm. The vectors are then spliced to obtain the encoded storage behavior sequence.
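The encode-then-splice idea can be sketched with a toy encoder. Note that this is not word2vec, cw2vec, or fastText: a trained model would supply the vectors in practice, and the hashed character-bigram encoder below is only a deterministic stand-in. The vector dimension, the fixed sequence length with zero padding, and all function names are assumptions.

```python
import hashlib


def embed_remark(name, dim=4):
    """Toy stand-in for a trained word-vector model.

    Hashes each character bigram of the remark name into one of `dim` buckets
    and counts the hits, so equal strings always map to equal vectors.
    """
    vec = [0.0] * dim
    padded = f"<{name}>"
    for i in range(len(padded) - 1):
        gram = padded[i:i + 2]
        bucket = int(hashlib.md5(gram.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    return vec


def encode_sequence(sequence, dim=4, max_len=3):
    """Embed each remark name and concatenate the vectors.

    Sequences are truncated or zero-padded to `max_len` entries (an assumption)
    so that every object yields a vector of the same length.
    """
    vectors = [embed_remark(n, dim) for n in sequence[:max_len]]
    vectors += [[0.0] * dim for _ in range(max_len - len(vectors))]
    return [x for v in vectors for x in v]


encoded = encode_sequence(["cheater", "fraudster", "fraudster"])
print(len(encoded))
```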
Step 208, splicing the encoded storage behavior sequence with the statistical data to obtain spliced data of each object.
Since the static statistical data are usually numeric, such as the number of times an object was stored and the number of times it was deleted, they can be input into the classifier directly. Therefore, the encoded storage behavior sequence can be spliced directly with the statistical data to obtain the spliced data. It is understood that the spliced data form a multidimensional vector.
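Splicing here amounts to concatenating two numeric vectors. A minimal sketch, assuming the statistics are held in a dict and that a fixed key order (`stat_order`, a hypothetical parameter) is used so that every object yields the same feature layout:

```python
def splice(encoded_sequence, statistics, stat_order):
    """Concatenate the encoded sequence with the statistics in a fixed order."""
    stat_values = [float(statistics[key]) for key in stat_order]
    return list(encoded_sequence) + stat_values


# Hypothetical encoded sequence and statistics for one object.
spliced = splice(
    [0.1, 0.2, 0.3],
    {"num_users": 3, "num_deleted": 1},
    stat_order=["num_users", "num_deleted"],
)
print(spliced)
```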
Step 210, inputting the spliced data into a classifier to identify whether each object is an abnormal object.
In one implementation, the classifier may be an RNN model, a Long Short-Term Memory (LSTM) network model, or the like. Specifically, after the spliced data are input into the classifier, the classifier may output the probability that each object is an abnormal object and the probability that it is not. Based on these two probabilities, it can be identified whether each object is an abnormal object.
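The classification step can be sketched as a forward pass of a small Elman-style RNN followed by a softmax over two classes. This is a sketch under several assumptions: the weights below are hand-picked rather than trained, the text does not specify how the spliced vector is divided into time steps (here each step is simply one input vector), and the 0.5 decision threshold is hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rnn_forward(steps, w_in, w_rec, w_out):
    """h_t = tanh(W_in x_t + W_rec h_{t-1}); output = softmax(W_out h_T)."""
    hidden = [0.0] * len(w_rec)
    for x in steps:
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[j], x))
                + sum(w * hi for w, hi in zip(w_rec[j], hidden))
            )
            for j in range(len(hidden))
        ]
    logits = [sum(w * hi for w, hi in zip(row, hidden)) for row in w_out]
    return softmax(logits)


def is_abnormal(probabilities, threshold=0.5):
    """Decide from [p_abnormal, p_normal]; the threshold is an assumption."""
    p_abnormal, _p_normal = probabilities
    return p_abnormal >= threshold


# Hand-picked (untrained) weights: 2 inputs, 2 hidden units, 2 output classes.
w_in = [[0.5, -0.5], [0.3, 0.8]]
w_rec = [[0.1, 0.0], [0.0, 0.1]]
w_out = [[1.0, -1.0], [-1.0, 1.0]]
probs = rnn_forward([[1.0, 0.0], [0.0, 1.0]], w_in, w_rec, w_out)
print(probs, is_abnormal(probs))
```

In practice the weights would be learned from labeled objects, and an LSTM would replace the plain recurrent update.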
In summary, the present solution provides a method for identifying an abnormal object by incorporating the storage behavior sequence of the object. The storage behavior sequence intuitively reflects the fraudster's modus operandi and helps strategy analysts analyze fraud techniques conveniently, which improves working efficiency. In addition, the scheme takes the entire storage behavior sequence (including the order of the behaviors) as the object of study and uses it to characterize the behavior of abnormal objects. This enriches the fraud behavior features available to the risk control system and provides more effective information for feature description. In particular, introducing the storage behavior sequence of telephone numbers as a feature in telephone fraud scenarios can significantly improve the accuracy and coverage of fraudulent telephone number identification.
In the method for identifying an abnormal object provided in one or more embodiments of the present specification, two kinds of data are obtained for an object: static data and dynamic data. The dynamic data and static data are fused and processed into features, so that the pattern formed by a combination of behaviors of an abnormal object is depicted accurately, which is more accurate than using the statistical features of individual behaviors.
The following describes the identification process taking a telephone number as the object. It should be noted that, since the address book usually includes various kinds of information, the process of identifying fraudulent telephone numbers is described below based on the content of the address book.
Fig. 3 is a flow chart of a method for identifying fraudulent calls provided by the present specification. As shown in fig. 3, the method may include the steps of:
Step 302, obtaining the address books of a plurality of users.
The address book may include information such as a telephone number, a remark name, and storage time of the contact. The remark name and the storage time may be collectively referred to as storage information corresponding to the telephone number.
After the address lists of a plurality of users are acquired, the correspondence shown in table 1 may be established. In addition, table 1 may also include device identifiers and the like, which are not limited in this specification.
It should be noted that step 302 may be performed periodically, so that, by comparing the address books of the same user in two consecutive periods, it can be determined whether the user has deleted a certain telephone number.
Step 304, acquiring static statistical data and a dynamic storage behavior sequence of each telephone number according to the content of the address books.
The static statistical data are obtained through statistics and data mining, and may include one or more of the following: the number of users who stored the phone number over the past several days, the number of days on which the phone number was stored over the past several days, the number of times the phone number was deleted over the past several days, whether the phone number was stored under a fraud-related remark name (e.g., "cheater") over the past several days, and so on. Specifically, the above static statistical data may be acquired based on the contents of Table 1. The number of times the telephone number was deleted over the past several days can be obtained by comparing each user's address books from two consecutive periods.
This embodiment also obtains a dynamic storage behavior sequence of each telephone number, because static statistical data alone do not make full use of the content of the address book. The storage behavior sequence captures context better: starting from the series of behaviors in which the user is engaged, it comprehensively considers the motivation and risk corresponding to the combination of behaviors. Like the statistical data, the storage behavior sequence can characterize certain historical behavioral preferences of a fraudulent telephone number, or the statistical significance of its behaviors, and reveal potential behavioral risk. The difference is that the storage behavior sequence depicts the pattern formed by a combination of behaviors of a fraudulent telephone number, which is more accurate than using the statistical features of individual behaviors.
In one implementation, obtaining the dynamic storage behavior sequence of a phone number according to the content of the address books may include:
The remark names and storage times of the telephone number are selected from the address books of the plurality of users. The remark names of the telephone number are sorted according to their storage times. A storage behavior sequence of the telephone number is generated from the sorted remark names.
It can be understood that, with reference to the above obtaining method, a dynamic storage behavior sequence of each phone number in the address list of multiple users can be obtained.
For example, assume that the phone number 186 xxx is stored in the respective address books of three different users: user A stored the phone number on September 1, 2017 with the remark name "cheater"; user B stored it on September 10, 2017 with the remark name "fraudster"; and user C stored it on September 22, 2017 with the remark name "fraudster". According to these storage times, the corresponding remark names are sorted as: cheater, fraudster, fraudster. From the sorting result, the storage behavior sequence of the phone number can be generated: cheater -> fraudster -> fraudster.
It will be appreciated that in the present scenario a sequence of stored actions for a telephone number is generated from the user's dimensions. Of course, in practical applications, the storage behavior sequence may also be generated from other dimensions (e.g., dimensions of the device), which is not limited in this specification. When the storage behavior sequence of the phone number is generated from the dimension of the device, the address book may be acquired and the device identifier may be acquired at the same time, and the generation manner of the storage behavior sequence of the dimension may be as described above, which is not repeated herein.
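Generating the sequence from a different dimension can be sketched by making the grouping key a parameter. The text does not say how multiple entries from the same device are handled; keeping only the earliest entry per key is an assumption of this sketch, as are the record layout, the device identifiers `d1` and `d2`, and the function name.

```python
from datetime import date


def sequence_by_dimension(records, phone, dimension):
    """Order remark names by storage time, one entry per value of `dimension`.

    `dimension` is a record key such as "user" or "device_id". When several
    entries share a key, only the earliest is kept (an assumption).
    """
    earliest = {}
    for r in records:
        if r["phone"] != phone:
            continue
        key = r[dimension]
        if key not in earliest or r["stored_on"] < earliest[key]["stored_on"]:
            earliest[key] = r
    ordered = sorted(earliest.values(), key=lambda r: r["stored_on"])
    return [r["remark"] for r in ordered]


# Hypothetical records: users A and B share device d1.
records = [
    {"user": "A", "device_id": "d1", "phone": "186 xxx", "remark": "cheater", "stored_on": date(2017, 9, 1)},
    {"user": "B", "device_id": "d1", "phone": "186 xxx", "remark": "fraudster", "stored_on": date(2017, 9, 10)},
    {"user": "C", "device_id": "d2", "phone": "186 xxx", "remark": "fraudster", "stored_on": date(2017, 9, 22)},
]
by_user = sequence_by_dimension(records, "186 xxx", "user")
by_device = sequence_by_dimension(records, "186 xxx", "device_id")
print(by_user, by_device)
```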
Step 306, encoding the storage behavior sequence according to a preset encoding algorithm.
The preset encoding algorithm may include, but is not limited to, a word vectorization algorithm (e.g., word2vec or cw2vec), a text classification algorithm (fastText), and the like.
Taking the storage behavior sequence generated above as an example, the remark names in the sequence are essentially character strings, so each character string can be converted into a corresponding vector through the preset encoding algorithm. The vectors are then spliced to obtain the encoded storage behavior sequence.
Step 308, splicing the encoded storage behavior sequence with the statistical data to obtain spliced data of each telephone number.
Since the static statistical data are usually numeric, such as the number of times a telephone number was stored and the number of times it was deleted, they can be input into the classifier directly. Therefore, the encoded storage behavior sequence can be spliced directly with the statistical data to obtain the spliced data. It is understood that the spliced data form a multidimensional vector.
Step 310, the spliced data is input into a classifier to identify whether each phone number is a fraud phone.
In one implementation, the classifier may be an RNN model or a Long Short-Term Memory network (LSTM) model or the like. Specifically, after the spliced data is input into the classifier, the probability that each telephone number is a fraudulent telephone and the probability that it is not a fraudulent telephone can be output. Based on the two probabilities, it can be identified whether the respective telephone numbers are fraudulent calls.
Embodiments of the present specification provide a method for identifying fraudulent telephone numbers that uses, as a feature, the sequence of behaviors in which users store telephone numbers. The method is mainly characterized in that the entire sequence of storage behaviors for a telephone number (including the order of the behaviors) is taken as the object of study to characterize the behavior of a fraudster. With this method, the accuracy and coverage of fraudulent telephone number identification can be significantly improved.
In correspondence to the above method for identifying an abnormal object, an embodiment of the present specification further provides an apparatus for identifying an abnormal object, as shown in fig. 4, the apparatus may include:
an obtaining unit 402, configured to obtain storage information generated when a plurality of users perform a storage action on each object in the respective object sets.
The object herein may include any one of: phone number, website address, and app, etc.
The obtaining unit 402 is further configured to obtain static statistical data and a dynamic storage behavior sequence of each object according to the storage information.
The static statistical data may include one or more of the following: the number of users who stored the object over the past several days, the number of days on which the object was stored over the past several days, whether the object was stored under a target name over the past several days, and the number of times the object was deleted over the past several days.
The storage information may include the remark names and storage times of the respective objects.
The obtaining unit 402 may specifically be configured to:
For each of the objects, select the remark names and storage times of that object from the remark names and storage times of all the objects.
And sorting the remark names of the objects according to the storage time of the objects.
And generating a storage behavior sequence of the object according to the sequenced remark names.
The encoding unit 404 is configured to encode the sequence of storage behaviors acquired by the acquiring unit 402 according to a preset encoding algorithm.
The preset encoding algorithm may include any one of: word vectorization algorithms, text classification algorithms, and the like.
And a splicing unit 406, configured to splice the storage behavior sequence encoded by the encoding unit 404 with the statistical data to obtain spliced data of each object.
And the identifying unit 408 is configured to input the spliced data obtained by the splicing unit 406 into a classifier to identify whether each object is an abnormal object.
The functions of each functional module of the device in the above embodiments of the present description may be implemented through each step of the above method embodiments, and therefore, a specific working process of the device provided in one embodiment of the present description is not repeated herein.
In the device for identifying an abnormal object provided in one embodiment of the present specification, the obtaining unit 402 obtains storage information generated when a plurality of users perform storage behaviors on respective objects in respective object sets. The obtaining unit 402 obtains static statistical data and dynamic storage behavior sequences of each object according to the storage information. The encoding unit 404 encodes the sequence of storage behaviors according to a preset encoding algorithm. The splicing unit 406 splices the encoded storage behavior sequence with the statistical data to obtain spliced data of each object. The identification unit 408 inputs the stitched data to a classifier to identify whether each object is an abnormal object. This can improve the accuracy of identifying an abnormal object.
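The division of the device into obtaining, encoding, splicing, and identifying units can be sketched as a small class that chains four callables. The class name, method name, and the stub callables in the usage example are hypothetical placeholders; real units would implement the corresponding method steps.

```python
class AbnormalObjectIdentifier:
    """Chains the four units: obtain -> encode -> splice -> identify."""

    def __init__(self, obtain, encode, splice, classify):
        self.obtain = obtain      # raw data -> (statistics, sequences) per object
        self.encode = encode      # remark-name sequence -> numeric vector
        self.splice = splice      # (encoded vector, statistics) -> spliced vector
        self.classify = classify  # spliced vector -> True if abnormal

    def identify(self, raw):
        statistics, sequences = self.obtain(raw)
        results = {}
        for obj, sequence in sequences.items():
            encoded = self.encode(sequence)
            spliced = self.splice(encoded, statistics[obj])
            results[obj] = self.classify(spliced)
        return results


# Stub units, for illustration only.
identifier = AbnormalObjectIdentifier(
    obtain=lambda raw: ({"186 xxx": [3.0]}, {"186 xxx": ["cheater"]}),
    encode=lambda seq: [float(len(name)) for name in seq],
    splice=lambda encoded, stats: encoded + stats,
    classify=lambda vector: sum(vector) > 5,
)
result = identifier.identify(raw=None)
print(result)
```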
Corresponding to the above method for identifying an abnormal object, an embodiment of the present specification further provides an apparatus for identifying an abnormal object, which may include, as shown in fig. 5: memory 502, one or more processors 504, and one or more programs. Wherein the one or more programs are stored in the memory 502 and configured to be executed by the one or more processors 504, the programs when executed by the processors 504 implement the steps of:
and acquiring storage information generated when a plurality of users execute storage behaviors aiming at each object in the respective object set.
And acquiring static statistical data and dynamic storage behavior sequences of each object according to the storage information.
And coding the storage behavior sequence according to a preset coding algorithm.
And splicing the coded storage behavior sequence with the statistical data to obtain spliced data of each object.
And inputting the spliced data into a classifier to identify whether each object is an abnormal object.
The identification device for the abnormal object provided by one embodiment of the specification can improve the accuracy of identification of the abnormal object.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The steps of a method or algorithm described in connection with the disclosure herein may be embodied in hardware or in software instructions executed by a processor. The software instructions may consist of corresponding software modules that may be stored in RAM, flash memory, ROM, EPROM, EEPROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. Additionally, the ASIC may reside in a server. Of course, the processor and the storage medium may also reside as discrete components in a server.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this specification may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on, or transmitted as one or more instructions or code over, a computer-readable medium. Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The specific embodiments described above further explain the objects, technical solutions, and advantages of the present specification in detail. It should be understood that the above are only specific embodiments of the present specification and are not intended to limit its scope; any modification, equivalent substitution, improvement, or the like made on the basis of the technical solutions of the present specification shall fall within the scope of the present specification.