CN109587357B

CN109587357B - Crank call identification method

Info

Publication number: CN109587357B
Application number: CN201811357638.5A
Authority: CN
Inventors: 李鑫
Original assignee: Shanghai Mt Networks Co ltd
Current assignee: Shanghai Mt Networks Co ltd
Priority date: 2018-11-14
Filing date: 2018-11-14
Publication date: 2021-04-06
Anticipated expiration: 2038-11-14
Also published as: CN109587357A

Abstract

The invention relates to the technical field of electronic communication, and in particular to a method for identifying crank calls, comprising the following steps: reading call data and classifying it according to a set time interval to form a plurality of record entries, which form a data set A; cleaning the classified call data by deleting from data set A the record entries in which specified fields are empty, to obtain a data set B; generating the features of each calling number in data set B by performing statistical calculation on that calling number's data within each set time interval, and recording the features as a set C; and judging, according to the generated features, whether the calling number is a harassing number within the set time interval. The invention applies a multi-level rule-based judgment in which the thresholds are determined by cluster analysis and information entropy, and thereby obtains the judgment result for each number. The method is therefore widely applicable and flexible.

Description

Crank call identification method

Technical Field

The invention relates to the technical field of electronic communication, in particular to a method for identifying crank calls.

Background

With the continuous development of communication technology, mobile communication services have grown richer while the cost of building mobile communication networks and the cost of mobile phone terminals have steadily fallen, so people depend on mobile communication more and use it more often. This rapid development brings convenience, but it also allows some parties to use mobile communication to spread harassing information for commercial purposes. The resulting flood of harassing calls causes great trouble in people's lives and affects the normal development of society. Harassing calls are mainly characterized as follows: illegitimate users place calls to mobile customers on a large scale and hang up after one ring; when the customer calls back, the call is connected to a recorded message, which constitutes harassment and fraud. Subjectively, this goes against the will of mobile phone users; objectively, it infringes on the users' freedom of communication and peace of life, or blocks the users' normal calls.

The Chinese patent application with the application number of 201410249964.X discloses a method and a device for identifying crank calls. The Chinese patent application with the application number of 201710552232.1 discloses a crank call identification and interception method, which comprises the steps of processing original data by collecting signaling information of a communication network, selecting an identification factor according to characteristics, classifying all calls by using a weighted naive Bayes classification algorithm so as to identify crank calls, and finally intercepting the calls. The Chinese patent application with the application number of 201610312825.6 discloses a method, a device and a terminal for identifying a crank call, which utilize voiceprint information to judge, match the voiceprint information with prestored voiceprint information by acquiring the voiceprint information of a voice sample of a calling party call voice after an incoming call is connected, and mark the voiceprint information as a crank call if the matching is successful and the prestored voiceprint information has a crank call mark.

However, the existing methods, which identify crank calls by means of a weighted naive Bayes classification algorithm, voiceprint recognition, or condition-based rules, have the following defects. Thresholds that are set manually for the rules have low reliability. Classifying calls with a classification algorithm relies on identification factors selected from known features, but the form of crank calls, the calling numbers used, and so on change every day, so the features of crank calls keep shifting and such methods are hard to adjust. In addition, identifying crank calls from voiceprint information depends on a pre-labeled voiceprint library, which limits the range of application: the person placing the crank calls may change from day to day, or a voice conversion system may be used to alter the voiceprint. Therefore, although the existing methods can identify crank calls, their range of application is relatively limited and their adjustability is poor.

Disclosure of Invention

Aiming at the defects of the prior art, the invention aims to provide a crank call identification method which is high in applicability and flexible.

The method for identifying the crank call provided by the embodiment of the invention comprises the following steps:

reading call data, classifying the call data according to a set time interval to form a plurality of record items, and forming a data set A by the record items;

cleaning the classified call data by deleting from data set A the record entries in which specified fields are empty, to obtain a data set B;

generating the features of each calling number in data set B by performing statistical calculation on that calling number's data within each set time interval, and recording the features as a set C;

and judging, according to the generated features of the calling number in data set B, whether the calling number is a harassing number within the set time interval.

Further, in the above method, each record entry includes, but is not limited to, one or more of the following: called number, calling number, start time, duration, call type, originating or terminating, enterprise number, ringing duration, end code, and called city.

Further, in the above method, the generated features of the calling number in data set B include: the number of dialing times, the dialing object non-repetition rate, the dialing non-call completion rate, the call duration, whether serial number dialing occurs, the number of called places, and the internal line called rate.

Further, in the above method, the manner of determining whether the calling number is a harassing call in the set time interval according to the characteristics of the generated calling number in the data set B is as follows:

if the serial number dialing behavior = 1, the calling number is a harassing calling number; calling numbers not yet judged proceed to the next judgment;

if the internal line called rate > threshold a, the calling number is a normal calling number; calling numbers not yet judged proceed to the next judgment;

if the call duration > threshold b, the calling number is a normal calling number; calling numbers not yet judged proceed to the next judgment;

if the number of dialing times > threshold c and the dialing object non-repetition rate >= threshold d, the calling number is a harassing calling number; calling numbers not yet judged proceed to the next judgment;

if the number of dialing times > threshold c and the dialing non-call completion rate >= threshold e, the calling number is a harassing calling number; calling numbers not yet judged proceed to the next judgment;

if the number of called places >= threshold f, the calling number is a harassing calling number; any calling number still not judged is a normal calling number.
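The rules above can be read as a first-match decision list. The following Python lines are a minimal sketch of that reading, not the patented implementation: the `Features` and `Thresholds` field names and the demo threshold values are hypothetical, and the `>=` comparisons for thresholds d, e, and f are an assumption based on the wording of claim 1.

```python
from dataclasses import dataclass


@dataclass
class Features:
    """Features of one calling number in one time slice (field names are hypothetical)."""
    dial_count: int
    non_repeat_rate: float
    non_completion_rate: float
    call_duration: float
    serial_dialing: int  # 1 if serial number dialing was detected, else 0
    called_places: int
    internal_called_rate: float


@dataclass
class Thresholds:
    """Thresholds a to f; in the method these come from the entropy search."""
    a: float
    b: float
    c: float
    d: float
    e: float
    f: float


def classify(x: Features, t: Thresholds) -> str:
    """Apply the rules in order; the first rule that fires decides the label."""
    if x.serial_dialing == 1:
        return "harassing"
    if x.internal_called_rate > t.a:
        return "normal"
    if x.call_duration > t.b:
        return "normal"
    if x.dial_count > t.c and x.non_repeat_rate >= t.d:
        return "harassing"
    if x.dial_count > t.c and x.non_completion_rate >= t.e:
        return "harassing"
    if x.called_places >= t.f:
        return "harassing"
    return "normal"  # no rule fired


# Placeholder thresholds, chosen only to exercise the rules (not from the patent).
demo_thresholds = Thresholds(a=0.3, b=60, c=20, d=0.9, e=0.8, f=10)
```

Because the two "normal" rules come before the volume-based rules, a number with a high internal line called rate is never flagged by dial count alone in this sketch.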

Further, in the above method, the threshold values are determined by:

combining the calling number and the time stamp into a record label to form a data set D, and performing cluster analysis on data set D with the K-means algorithm;

after the cluster analysis, automatically dividing all calling numbers into ten categories, and representing each category by the mean feature values of its calling numbers;

adding the classification result to data set D to describe the category of each record entry, and recording the updated data set as E;

judging whether each record entry is a harassment entry according to whether its category is a harassment category, and adding a parameter holding the harassment entry value or the normal entry value to set E to form a set F;

performing an information entropy calculation on whether an entry is harassing: Ent(X) = P0log2(P0) + P1log2(P1), wherein P0 represents the proportion of normal entries and P1 represents the proportion of harassment entries, and then calculating each threshold.

Further, in the above method, the method of calculating each threshold value is as follows:

setting the minimum value and the maximum value of the threshold value and the step length of each calculation;

setting the threshold to the minimum value, placing all entries in set E that are larger than the threshold into a first group, and placing all entries that are smaller than the threshold into a second group;

calculating, for each of the two groups, the information entropy of whether its entries are harassing, and summing and recording the results;

increasing the threshold by the step length each time until the maximum value is reached;

and selecting the threshold corresponding to the minimum information entropy sum as the final calculation result.

Further, in the above method, the minimum value of the threshold a of the internal line called rate is 0, the maximum value is 1, and the step length of each calculation is 0.01.

Further, in the above method, the minimum value of the threshold b of the call duration is 0, the maximum value is 200, and the step length is 1 for each increment.

Further, in the above method, the minimum value of the threshold c of the dialing number is 0, the maximum value is 100, and the step length is 1 for each increment.

Further, in the above method, the minimum value of the threshold d of the dialing target non-repetition rate is 0, the maximum value is 1, and the step length is 0.01 for each increment.

Further, in the above method, the minimum value of the threshold e of the dialing non-call completion rate is 0, the maximum value is 1, and the step length is increased by 0.01 each time.

Further, in the above method, the minimum value of the threshold f of the number of called places is 0, the maximum value is 50, and the step length is 1 for each increase.
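The six search ranges listed above can be collected into one table. The lines below are an illustrative Python sketch only; the names `SEARCH_RANGES` and `candidates` are hypothetical. Candidate values are generated by index rather than by repeated addition so that the 0.01 steps do not accumulate floating-point drift.

```python
# Search grids for thresholds a to f as (minimum, maximum, step), taken from the text above.
SEARCH_RANGES = {
    "a": (0, 1, 0.01),  # internal line called rate
    "b": (0, 200, 1),   # call duration
    "c": (0, 100, 1),   # number of dialing times
    "d": (0, 1, 0.01),  # dialing object non-repetition rate
    "e": (0, 1, 0.01),  # dialing non-call completion rate
    "f": (0, 50, 1),    # number of called places
}


def candidates(minimum, maximum, step):
    """List every candidate threshold from minimum to maximum inclusive."""
    count = round((maximum - minimum) / step)
    # Index-based stepping plus rounding keeps values such as 0.3 exact.
    return [round(minimum + i * step, 10) for i in range(count + 1)]
```

For example, `candidates(*SEARCH_RANGES["a"])` yields the 101 values 0, 0.01, …, 1 that are tried for threshold a.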

Compared with the prior art, the method for identifying the crank calls provided by the embodiment of the invention comprises the following steps: reading call data, classifying the call data according to a set time interval to form a plurality of record items, and forming a data set A by the record items; cleaning the classified call data, and deleting the record entries with the set elements being empty in the data set A to obtain a data set B; generating the characteristics of the calling number in the data set B by performing statistical calculation on each calling number data in a set time interval in the data set B, and marking as a set C; and judging whether the calling number is a harassing call in a set time interval according to the characteristics of the generated calling number in the data set B. The invention carries out multi-level multi-layer rule judgment by formulating a judgment rule, wherein the threshold definition of the judgment is determined by cluster analysis and information entropy, and finally, the result of telephone judgment is obtained. The threshold value of the invention is not manually set, but can be judged and adjusted according to the information entropy, so the invention has high applicability and is more flexible.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a schematic flow chart of a method for identifying a crank call according to the present invention;

FIG. 2 is a flow chart of a method for determining a threshold value according to the present invention;

FIG. 3 is a flowchart of a method for calculating a threshold according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The embodiments of the present invention will be described in further detail with reference to the drawings attached hereto.

As shown in fig. 1, an embodiment of the present invention discloses a method for identifying a crank call, including:

S101, reading call data, classifying the call data according to a set time interval to form a plurality of record entries, and forming a data set A from the record entries;

S102, cleaning the classified call data by deleting from data set A the record entries in which specified fields are empty, to obtain a data set B;

S103, generating the features of each calling number in data set B by performing statistical calculation on that calling number's data within each set time interval, and recording the features as a set C;

and S104, judging whether the calling number is a harassing call in a set time interval according to the characteristics of the generated calling number in the data set B.

In step S101 in the embodiment of the present invention, the call data is divided and sorted in five-minute time slices.

Further, in the above method, each record entry includes, but is not limited to, one or more of the following: called number, calling number, start time, duration, call type (originating or terminating), enterprise number, ringing duration, end code, and called city. For example, a record entry is [15802811404, 02095056015, 20171227090031, 27, 0, 1, 2004902310, 5, 0, 1, Chengdu/Sichuan].

Specifically, each item in the record entry is represented as:

According to the embodiment of the invention, after all call data are read, the data are grouped by start time into five-minute intervals. The first interval begins at the earliest call start time, and grouping continues until all call data have been assigned. For example, if the earliest call start time is 00:00:00 on December 30, 2017, the data are grouped into the intervals "00:00:00-00:04:59, 00:05:00-00:09:59, …". The result can be denoted A (A1, A2 …), where An denotes one group of data and A denotes the set of all groups. The grouped data then undergo the operation of step S102.
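As an illustration of this grouping, the following Python sketch assigns records to five-minute slices counted from the earliest start time. It is a sketch under stated assumptions, not the patented implementation: the record layout (a dict with a `start_time` key in YYYYMMDDHHMMSS form) and the function name `slice_records` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SLICE = timedelta(minutes=5)


def slice_records(records):
    """Group records into five-minute slices starting at the earliest start time.

    Each record is a dict with a "start_time" string such as "20171227090031"
    (the key name is a hypothetical choice for this sketch).
    Returns {slice_start: [records...]}, corresponding to A1, A2, ...
    """
    if not records:
        return {}
    starts = [datetime.strptime(r["start_time"], "%Y%m%d%H%M%S") for r in records]
    origin = min(starts)
    groups = defaultdict(list)
    for record, start in zip(records, starts):
        index = (start - origin) // SLICE  # whole five-minute steps since origin
        groups[origin + index * SLICE].append(record)
    return dict(groups)
```

With an earliest start of 00:00:00, a call at 00:04:59 lands in the first slice and a call at 00:05:00 in the second, matching the interval boundaries in the example above.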

In step S102, the embodiment of the present invention cleans the data of each five-minute time slice. Specifically, entries in An that have a missing value in any field other than the called enterprise number are deleted; for example, a record entry whose calling number or called number is empty must be deleted (if only the called enterprise number is empty, the entry is kept). Then the originating call records, i.e., the record entries whose "call type (originating or terminating)" value is 1, are extracted. This processing is performed for each An, and the resulting data Bn are denoted B (B1, B2 …). The number of groups in B (B1, B2 …) should equal the number of groups in A (A1, A2 …). The data set B thus obtained proceeds to step S103.
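The cleaning step can be sketched as a simple filter. The Python lines below are illustrative only: the dict keys, including `call_direction` for the "call type (originating or terminating)" field, are hypothetical names, and the list of required fields is an assumption drawn from the record fields named in the text.

```python
# Fields that must be present; the enterprise number is deliberately not listed
# because an empty enterprise number does not cause deletion.
REQUIRED_FIELDS = (
    "called_number", "calling_number", "start_time", "duration",
    "call_direction", "ringing_duration", "end_code", "called_city",
)


def clean_slice(records):
    """Turn one slice An into Bn.

    Drops records with any empty required field, then keeps only records
    whose call_direction equals 1 (originating call records).
    """
    cleaned = []
    for r in records:
        if any(r.get(field) in (None, "") for field in REQUIRED_FIELDS):
            continue  # missing value in a required field
        if r["call_direction"] != 1:
            continue  # not an originating call record
        cleaned.append(r)
    return cleaned
```

Note that a value of 0 (for example a zero duration) is treated as present; only `None` and the empty string count as missing in this sketch.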

In step S103 of the embodiment of the present invention, feature calculation is performed on the data of each calling number in each five-minute time slice to generate features for the subsequent judgment. Preferably, the generated features comprise: the number of dialing times, the dialing object non-repetition rate, the dialing non-call completion rate, the call duration, whether serial number dialing occurs, the number of called places, and the internal line called rate.

Specifically, the number of dialing times is the total number of calls placed by the same calling number in Bn.

For the dialing object non-repetition rate, all called numbers dialed by the same calling number are collected, duplicates are removed, and the number of distinct called numbers is counted. The dialing object non-repetition rate is the number of distinct called numbers / the number of dialing times of the calling number.

For the dialing non-call completion rate, the record entries of the same calling number whose call type is 1 are counted, i.e., the calls that were dialed but not connected; the ratio of this count to the number of dialing times is the dialing non-call completion rate.

The call duration is the average of (duration - ringing duration) over the records of a given calling number in Bn, in seconds.

For the number of called places, all called cities of a given calling number in Bn are collected and duplicates are removed; the resulting number of distinct cities is the number of called places of the calling number.

Serial number dialing behavior is defined as follows: for the same calling number, if the called numbers of two consecutive records differ only in the last three digits and are not the same number, this is marked as one suspected serial number dialing; if suspected serial number dialing appears 5 times for a calling number in one Bn, serial number dialing behavior is recorded as present and the value is 1, otherwise the value is 0.

For the internal line called rate, among the calls placed by the same calling number, the records whose calling enterprise number and called enterprise number are the same are counted; the internal line called rate is this count divided by the number of dialing times of the calling number.
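Several of these features can be sketched in a few lines of Python. This is an illustrative sketch, not the patented implementation: the record keys are hypothetical, only five of the seven features are shown (the two remaining rates are analogous ratios over the dial count), and reading "appears 5 times" as "at least 5 times" is an assumption.

```python
def is_serial_pair(prev_called: str, next_called: str) -> bool:
    """True when two called numbers differ only within their last three digits."""
    return (
        len(prev_called) == len(next_called)
        and prev_called != next_called
        and prev_called[:-3] == next_called[:-3]
    )


def caller_features(records, min_serial_hits=5):
    """Compute features for one calling number from its records in one slice Bn.

    Record keys ("called_number", "start_time", "duration",
    "ringing_duration", "called_city") are hypothetical names.
    """
    if not records:
        raise ValueError("at least one record is required")
    ordered = sorted(records, key=lambda r: r["start_time"])
    dial_count = len(ordered)
    called = [r["called_number"] for r in ordered]
    # Count suspected serial dials over consecutive record pairs.
    hits = sum(is_serial_pair(a, b) for a, b in zip(called, called[1:]))
    talk_times = [r["duration"] - r["ringing_duration"] for r in ordered]
    return {
        "dial_count": dial_count,
        "non_repeat_rate": len(set(called)) / dial_count,
        "call_duration": sum(talk_times) / dial_count,  # seconds
        "called_places": len({r["called_city"] for r in ordered}),
        "serial_dialing": 1 if hits >= min_serial_hits else 0,
    }
```

Six calls to six consecutive numbers produce five consecutive pairs, which is the smallest pattern that sets `serial_dialing` to 1 under the assumption above.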

The embodiment of the invention can obtain the characteristics of the calling number in the Bn by counting all the calling numbers in the Bn. As shown in the following table:

In the above table, the time value 201712291710 denotes the time slice 17:10:00-17:14:59 on December 29, 2017.

In the embodiment of the present invention, all Bn are calculated to form the information as shown in the above table, which is denoted as Cn, and the set thereof is denoted as C.

After the above determination, the calling numbers in a given time slice Cn are divided into two categories: normal calling numbers and harassing calling numbers. The invention thereby obtains the list of harassing calling numbers and achieves the goal of identifying harassing calls.

It should be noted that the above thresholds of the embodiment of the present invention are not set manually but are obtained by calculation. That is, calculating over records from different environments yields different judgment parameters, so the invention adapts well to different environments.

Further, as shown in fig. 2, the threshold values are determined by:

S201, combining the calling number and the time stamp into a record label to form a data set D, and performing cluster analysis on data set D with the K-means algorithm;

S202, after the cluster analysis, automatically dividing all calling numbers into ten categories, and representing each category by the mean feature values of its calling numbers;

S203, adding the classification result to data set D to describe the category of each record entry, and recording the updated data set as E;

S204, judging whether each record entry is a harassment entry according to whether its category is a harassment category, and adding a parameter holding the harassment entry value or the normal entry value to set E to form a set F;

S205, performing an information entropy calculation on whether an entry is harassing: Ent(X) = P0log2(P0) + P1log2(P1), wherein P0 represents the proportion of normal entries and P1 represents the proportion of harassment entries, and then calculating each threshold.

In implementation, C1 … Cn are merged, and the calling number and the time stamp are combined into one parameter (calling number-time stamp), for example (0111615274-201712291710). This data set is denoted D. The (calling number-time stamp) parameter is the label of the record, and the other values serve as the features of the record for the subsequent cluster analysis.

The embodiment of the invention performs cluster analysis on data set D with the K-means algorithm. In order to fully explore the possible categories, the invention sets the number of clusters to 10. After clustering, all calling numbers are automatically divided into ten categories, and each category is characterized by the mean feature values of its calling numbers, as shown in the following table:

Every (calling number-time slice) record in the embodiment of the invention belongs to one of the ten categories. The classification result is added to D as a column of parameters (classification category) taking one of the values 0 to 9, which describes the category to which the record entry belongs. The updated data set is denoted E.
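To make the clustering step concrete, the following is a minimal pure-Python sketch of K-means (Lloyd's algorithm). It is an illustration under stated assumptions rather than the patent's implementation: the function name, the random initialization, and the fixed seed are hypothetical choices, and in practice a library implementation with feature scaling would be the usual route. The method sets the number of clusters to 10; the function takes `k` as a parameter.

```python
import random


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm on a list of equal-length feature tuples.

    Returns (labels, centroids): labels[i] is the category of points[i],
    and centroids[c] holds the mean feature values of category c.
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

Appending `labels[i]` to the i-th record of D gives the classification-category column of E, and `centroids` corresponds to the per-category mean values used to characterize each category.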

In step S204 of the embodiment of the present invention, each category is labeled as a harassment category or not, and each record entry is then labeled as a harassment entry or not. In the category table, whether a category is harassing is judged according to common knowledge. Specifically, the invention treats a category whose number of dialing times is higher than 20 as a suspected harassment category, treats a category in which serial number dialing occurs as a suspected harassment category, and treats a category whose internal line called rate equals 1 as a normal category. All remaining categories are treated as normal categories. That is, categories [2, 3, 4, 5, 7] in the above table are harassment categories, and [0, 1, 6, 8, 9] are normal categories.

In implementation, all record entries in data set E are divided into two types according to their category: if the category to which an entry belongs is a harassment category, the entry is a harassment entry; if it is a normal category, the entry is a normal entry. A parameter "whether it is harassing" is appended to data set E, with value 1 for a harassment entry and 0 for a normal entry. The updated data set is denoted F.
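The labeling of categories and entries can be sketched as follows in Python. This is illustrative only: the key names are hypothetical, checking the internal line rule before the other two is an assumption (the text does not say which rule wins when several apply), and treating a mean serial-dialing value above 0 as "serial number dialing occurs" is likewise an assumption.

```python
def harassment_categories(centroids):
    """Return the set of category ids treated as harassment categories.

    `centroids` maps category id -> dict of mean feature values
    (hypothetical keys: dial_count, serial_dialing, internal_called_rate).
    """
    flagged = set()
    for category, mean in centroids.items():
        if mean["internal_called_rate"] == 1:
            continue  # normal category
        if mean["dial_count"] > 20 or mean["serial_dialing"] > 0:
            flagged.add(category)  # suspected harassment category
    return flagged


def label_entries(entries, flagged):
    """Build set F from set E by appending is_harassment (1 or 0) to each entry."""
    return [
        dict(entry, is_harassment=1 if entry["category"] in flagged else 0)
        for entry in entries
    ]
```

Each entry inherits the label of its category, so the quality of the labels in F depends directly on how the categories were judged.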

In the embodiment of the invention, because the information entropy is calculated only for the "whether it is harassing" parameter, there are only two classes, 0 and 1, and the formula is: Ent(X) = P0log2(P0) + P1log2(P1), where P0 represents the proportion of normal entries, equal to the number of normal entries / total number of entries, and P1 represents the proportion of harassment entries, equal to the number of harassment entries / total number of entries. The smaller the information entropy, the greater the difference between the numbers of 0s and 1s among the entries; the larger the information entropy, the smaller that difference.
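A small Python sketch of the entropy calculation follows. One assumption should be stated plainly: the sketch uses the conventional Shannon entropy, which carries a leading minus sign that the formula as printed in this text does not show. With that sign the value is non-negative, equals 0 when all labels agree, and peaks at 1 bit when 0s and 1s are equally common, which matches the description of small and large entropy above. The function name is hypothetical.

```python
import math


def binary_entropy(labels):
    """Shannon entropy, in bits, of a list of 0/1 harassment labels."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)  # proportion of harassment entries
    p0 = 1.0 - p1                   # proportion of normal entries
    # A zero proportion contributes nothing (0 * log2(0) is taken as 0).
    return sum(-p * math.log2(p) for p in (p0, p1) if p > 0)
```

For instance, a group with three normal entries and one harassment entry gives roughly 0.811 bits, between the pure case (0) and the evenly mixed case (1).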

Further, as shown in fig. 3, the method of calculating each threshold is as follows:

S301, setting the minimum value and the maximum value of the threshold and the step length of each calculation;

S302, setting the threshold to the minimum value, placing all entries in set E that are larger than the threshold into a first group, and placing the entries that are smaller than the threshold into a second group;

S303, calculating, for each of the two groups, the information entropy of whether its entries are harassing, and summing and recording the results;

S304, increasing the threshold by the step length each time until the maximum value is reached;

S305, selecting the threshold corresponding to the minimum information entropy sum as the final calculation result.

In the implementation, the threshold value of the internal line called rate is calculated as an example:

Step 1, determine the possible minimum value 0 and maximum value 1 of the threshold, with a step length of 0.01 for each calculation.

Step 2, set the threshold of the internal line called rate to the minimum value 0, place all entries in E whose internal line called rate is larger than the threshold into a first group, and place all entries whose internal line called rate is smaller than the threshold into a second group.

Step 3, calculate the information entropy of "whether it is harassing" for each of the two groups, then sum and record the results.

Step 4, increase the threshold by the step length each time until the maximum value is reached, i.e., 0.01, 0.02 … 0.99, 1, repeating steps 2 and 3 each time.

Step 5, after the calculation is completed, the threshold with the minimum information entropy sum is the one that best distinguishes normal entries from harassment entries. Therefore, the threshold corresponding to the minimum information entropy sum is selected as the final calculation result. For example, if the sum of the information entropies of the two groups is smallest when the threshold is set to 0.3, then the threshold a of the internal line called rate used in the rules should be 0.3.
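The five steps can be sketched as a simple sweep in Python. This is a sketch under stated assumptions, not the patented implementation: entries are reduced to hypothetical `(feature_value, is_harassment)` pairs, values equal to the threshold are placed in the second group (the text only speaks of larger and smaller), the entropy uses the conventional leading minus sign, and ties are resolved in favor of the first (smallest) threshold.

```python
import math


def entropy(labels):
    """Shannon entropy, in bits, of a list of 0/1 labels (0.0 for an empty list)."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return sum(-p * math.log2(p) for p in (1.0 - p1, p1) if p > 0)


def best_threshold(entries, minimum, maximum, step):
    """Sweep candidate thresholds and return (threshold, entropy_sum) with the smallest sum.

    `entries` is a list of (feature_value, is_harassment) pairs.
    """
    best = None
    steps = round((maximum - minimum) / step)
    for i in range(steps + 1):
        threshold = round(minimum + i * step, 10)  # index-based to avoid float drift
        first = [label for value, label in entries if value > threshold]
        second = [label for value, label in entries if value <= threshold]
        total = entropy(first) + entropy(second)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best
```

A call such as `best_threshold(pairs, 0, 1, 0.01)` mirrors the internal line called rate example; the same function applies to the other thresholds with their own ranges and step lengths.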

The threshold obtained by the calculation is used as the threshold in the identification process of the crank call. Once the threshold is determined, it can be used for a longer period of time, or it can be recalculated periodically as needed, or it can be recalculated for different regions.

In summary, the present invention makes a multi-level multi-layer rule judgment by formulating a judgment rule, wherein the threshold definition of the judgment is determined by cluster analysis and information entropy, and finally, a result of the telephone judgment is obtained. The threshold value of the method is not manually set, but can be judged and adjusted according to the information entropy, so that the method is high in applicability and flexible.

It will be apparent to those skilled in the art that embodiments of the present application may be provided as a method, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A method for identifying crank calls is characterized by comprising the following steps:

judging whether the calling number is a harassing call in a set time interval according to the characteristics of the generated calling number in the data set B;

the method for judging whether the calling number is a harassing call in the set time interval according to the characteristics of the generated calling number in the data set B is as follows:

if the serial number dialing behavior = 1, the calling number is a harassing calling number, and calling numbers not yet judged proceed to the next judgment;

if the number of dialing times > threshold c and the dialing object non-repetition rate >= threshold d, the calling number is a harassing calling number, and calling numbers not yet judged proceed to the next judgment;

if the number of dialing times > threshold c and the dialing non-call completion rate >= threshold e, the calling number is a harassing calling number, and calling numbers not yet judged proceed to the next judgment;

if the number of called places >= threshold f, the calling number is a harassing calling number, and any calling number still not judged is a normal calling number;

each threshold is determined by:

performing an information entropy calculation on whether an entry is harassing: Ent(X) = P0log2(P0) + P1log2(P1), wherein P0 represents the proportion of normal entries and P1 represents the proportion of harassing entries, and then calculating each threshold;

the method of calculating each threshold is as follows:

respectively calculating the information entropy of whether the first group and the second group are harassment, and merging and recording the results;

2. The method of claim 1, wherein the record entry includes, but is not limited to, one or more of the following: called number, calling number, start time, duration, call type, originating or terminating, enterprise number, ringing duration, end code, and called city.

3. The method of claim 1, wherein the generated features of the calling number in data set B comprise: the number of dialing times, the dialing object non-repetition rate, the dialing non-call completion rate, the call duration, whether serial number dialing occurs, the number of called places, and the internal line called rate.

4. The method according to claim 1, wherein the threshold a of the internal line called rate has a minimum value of 0 and a maximum value of 1, and the step size of each calculation is 0.01.

5. The method of claim 1, wherein the threshold b for the duration of the call is a minimum value of 0 and a maximum value of 200, and each increment is 1.

6. The method according to claim 1, wherein the threshold value c of the number of dialing is 0 at the minimum value and 100 at the maximum value, and each increment is 1.

7. The method according to claim 1, wherein the threshold d for the dialing target non-repetition rate has a minimum value of 0 and a maximum value of 1, and each increment step size is 0.01.

8. The method according to claim 1, wherein the threshold e of the dialing miss rate has a minimum value of 0 and a maximum value of 1, and each increment is 0.01.

9. The method according to claim 1, wherein the threshold f for the number of called places is 0 at the minimum and 50 at the maximum, and each increment is 1.