CN111669757A

CN111669757A - Terminal fraud call identification method based on conversation text word vector

Info

Publication number: CN111669757A
Application number: CN202010542362.9A
Authority: CN
Inventors: 孙晓晨; 宁珊; 林格平; 张之含; 侯炜; 洪永婷; 倪善金; 周书敏; 万辛; 沈亮
Original assignee: EB INFORMATION TECHNOLOGY Ltd; National Computer Network and Information Security Management Center
Current assignee: Xinxun Digital Technology Hangzhou Co ltd; National Computer Network and Information Security Management Center
Priority date: 2020-06-15
Filing date: 2020-06-15
Publication date: 2020-09-15
Anticipated expiration: 2040-06-15
Also published as: CN111669757B

Abstract

A terminal fraud call identification method based on conversation text word vectors comprises the following steps. The user marks an incoming call in the terminal App; when the call is marked as a fraud category, it is converted into text with the user's authorization, the text is reviewed and desensitized by the user, and, with the user's authorization, it is uploaded to the server and stored as a text sample. Word segmentation and part-of-speech tagging are performed on each text sample to obtain the syntactic dependency label and word combination vector of each segmented word; the word combination vector, part-of-speech tag and syntactic dependency label are concatenated to form the content vector of the segmented word, and the scene element label to which the segmented word belongs is calculated, yielding the semantic vector of the text sample. A fraud classification recognition model is constructed and trained with the text samples in the server, and the trained model is pushed from the server to the App. After receiving a new call to be identified, the App obtains the fraud category to which the call belongs according to the model and prompts the user. The invention belongs to the technical field of information and can accurately identify fraudulent calls based on conversation text.

Description

Terminal fraud call identification method based on conversation text word vector

Technical Field

The invention relates to a terminal fraud call identification method based on a conversation text word vector, and belongs to the technical field of information.

Background

Telecommunication fraud cases launched from overseas are increasing day by day, and mobile phone users' demand for filtering fraudulent calls is growing accordingly. However, fraudulent communication is becoming more concealed and its behavioral characteristics weaker, so the accuracy and recall with which a mobile phone system identifies bad calls can be further improved only by analyzing and identifying the call text.

At present, the fraud call filtering methods offered by mobile phone terminal systems on the market are rather primitive. Mainstream manufacturers generally rely on user marking: the user actively marks the category of a phone number and uploads it to the server, forming a library of marked fraud numbers that is then used to filter fraud numbers. The drawback of this approach is that fraudulent calls cannot be detected in real time; they are often discovered only after the victim has already been deceived.

Therefore, how to accurately identify fraudulent calls based on call text has become a technical problem of general concern to mobile phone manufacturers and mobile phone system developers.

Disclosure of Invention

In view of the above, the present invention provides a method for identifying a terminal fraud phone based on a conversation text word vector, which can accurately identify a fraud phone based on a conversation text.

In order to achieve the above object, the present invention provides a method for identifying terminal fraud calls based on conversation text word vectors, comprising:

step one, a user marks an incoming call in a mobile phone terminal App; for an incoming call that the user marks as a fraud category, the call content is extracted and converted into text once the user has authorized this, the converted text is then presented to the user for review and desensitization, and finally, with the user's authorization, the reviewed and desensitized text is uploaded to a server and stored as a text sample;

step two, performing word segmentation and part-of-speech tagging on each text sample in the server and obtaining a syntactic dependency label for each segmented word; then calculating the word vector, character vector, pinyin vector and stroke vector of each segmented word in the text sample to form its word combination vector; concatenating the word combination vector, part-of-speech tag and syntactic dependency label of each segmented word to form its content vector; calculating, from the content vector, the scene element label to which each segmented word belongs; and finally averaging the content vectors and scene element labels of all segmented words in the text sample to obtain the semantic vector corresponding to the text sample;

step three, constructing a fraud classification recognition model whose input is the semantic vector corresponding to a text and whose output is the fraud category to which the text belongs, training the model with the text samples uploaded by users to the server as training samples, and then pushing the trained model from the server to the users' mobile phone terminal App as a model update;

and step four, after receiving a new call to be identified, the user's mobile phone terminal App extracts the content text of the call, performs word segmentation on it, generates the part-of-speech tag, syntactic dependency label and word combination vector of every segmented word in the text, then obtains the fraud category to which the call number belongs according to the fraud classification recognition model in the App, and prompts the user through an App message.

Compared with the prior art, the invention has the following beneficial effects: it provides a method for identifying conversation text that quickly converts the text into numerical vectors and fuses the word vector, pinyin vector and stroke vector; on the basis of part-of-speech identification it constructs the event elements of various fraud scenes, enabling targeted analysis of each fraud scene from several angles such as event description, subsequent actions and the attitudes of both parties; it protects user privacy, resolves the semantic deviation caused by homophones written with different characters and by polyphonic characters, and improves the accuracy and recall with which users and manufacturers identify bad calls.

Drawings

FIG. 1 is a flow chart of a terminal fraud call identification method based on conversation text word vectors of the present invention.

FIG. 2 is a flow chart of the specific steps, in step two of FIG. 1, of performing word segmentation and part-of-speech tagging on each text sample to obtain the syntactic dependency label of each segmented word.

FIG. 3 is a flow chart of the specific steps, in step two of FIG. 1, of concatenating the word combination vector, part-of-speech tag and syntactic dependency label of each segmented word in the text sample to form the content vector of each segmented word, calculating the scene element label to which each segmented word belongs from its content vector, and finally averaging the content vectors and scene element labels of all segmented words in the text sample to obtain the semantic vector corresponding to the text sample.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the accompanying drawings.

As shown in fig. 1, the present invention provides a method for identifying terminal fraud calls based on conversation text word vectors, which comprises:

step one, a user marks an incoming call in a mobile phone terminal App; for an incoming call that the user marks as a fraud category, the call content is extracted and converted into text once the user has authorized this, the converted text is then presented to the user for review and desensitization, and finally, with the user's authorization, the reviewed and desensitized text is uploaded to a server and stored as a text sample, wherein desensitization means removing sensitive information related to personal identity, such as identity card numbers, names and mobile phone numbers;
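The rule-based part of desensitization can be illustrated with a short sketch. It is an example under stated assumptions rather than the rule set of the invention: the source only says that identity card numbers, names and mobile phone numbers are removed, so the two regular expressions, the `[ID]`, `[PHONE]` and `[NAME]` placeholders and the `desensitize` helper are all hypothetical.

```python
import re
from typing import Iterable

# Hypothetical patterns: the source does not specify the actual rules.
ID_CARD = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")   # 18-character identity card number
MOBILE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")     # 11-digit mobile phone number


def desensitize(text: str, names: Iterable[str] = ()) -> str:
    """Mask identity card numbers, mobile numbers and caller-supplied names."""
    text = ID_CARD.sub("[ID]", text)
    text = MOBILE.sub("[PHONE]", text)
    for name in names:
        if name:  # skip empty strings, which would match everywhere
            text = text.replace(name, "[NAME]")
    return text
```

Names cannot be recognized by a fixed pattern, so this sketch takes them as an explicit list; in the method described here the user also reviews and edits the text before upload, which covers whatever such rules miss.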

The first step may further comprise:

step 11, after installing the mobile phone terminal App, the user gains access to an incoming-call marking function; when the user uses this function to mark the current incoming call as a fraud category, the App extracts the content of the first 60 seconds of the call using an HMM algorithm to generate a content text, then removes personal-identity information from the content text based on general rules, and finally presents the desensitized text in the App for the user to review;

step 12, the user reviews the text, may edit it to further complete the desensitization, and then chooses whether to upload the desensitized text that the user marked as a fraud category to the server; if so, the text and its fraud category mark are uploaded to the server under the user's authorization;

step 13, performing text cleaning on the text received by the server, which comprises removing abnormal characters other than Chinese characters, English letters and digits, uniformly replacing line feed characters and placeholders with spaces, and collapsing runs of multiple spaces into a single space;

and step 14, cleaning the text again by truncating it to its first 180 characters and discarding any text shorter than 15 characters.
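Steps 13 and 14 can be sketched as a single cleaning function. This is a minimal illustration, not the server's actual code: the name `clean_text` is hypothetical, the tab character is assumed to be the "placeholder" mentioned in step 13, and line feeds are replaced before other characters are stripped so that they survive as separators.

```python
import re
from typing import Optional

MAX_CHARS = 180  # step 14: keep only the first 180 characters
MIN_CHARS = 15   # step 14: discard texts shorter than 15 characters


def clean_text(text: str) -> Optional[str]:
    """Return the cleaned text, or None if it is too short to keep."""
    # Step 13: line feeds and tabs become spaces (tab as "placeholder" is an assumption).
    text = re.sub(r"[\r\n\t]", " ", text)
    # Step 13: drop everything except Chinese characters, English letters, digits and spaces.
    text = re.sub(r"[^\u4e00-\u9fffA-Za-z0-9 ]", "", text)
    # Step 13: collapse runs of spaces into a single space.
    text = re.sub(r" {2,}", " ", text).strip()
    # Step 14: truncate, then filter by length.
    text = text[:MAX_CHARS]
    return text if len(text) >= MIN_CHARS else None
```

Note that punctuation is removed as well, since step 13 keeps only Chinese characters, English letters and digits.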

As shown in fig. 2, in step two, performing word segmentation and part-of-speech tagging on each text sample to obtain a syntactic dependency label of each segmented word may further include:

step 21, generating a stop word dictionary based on Chinese grammar;

step 22, manually adding common words as a user-defined dictionary based on the fraud scene;

step 23, performing word segmentation and part-of-speech tagging on the text sample by using an HMM (hidden Markov model) algorithm based on a DAG (directed acyclic graph) word graph, while supplying the custom dictionary to optimize the word segmentation result;

step 24, performing syntactic dependency analysis on each participle by using a fast Offset-based algorithm, and outputting a syntactic dependency label of each participle, as shown in the following table:

and step 25, filtering stop words in the text sample by using the stop word dictionary.
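The dictionary-driven part of steps 22, 23 and 25 can be sketched as follows. This is a simplified illustration under stated assumptions: it builds the DAG word graph from a frequency dictionary and picks the most probable path by dynamic programming, but it leaves out the HMM stage for out-of-dictionary words and the part-of-speech tagging. The function names and the toy frequencies are hypothetical.

```python
import math
from typing import Dict, Iterable, List


def build_dag(sentence: str, freq: Dict[str, int]) -> Dict[int, List[int]]:
    """Map each start index to the end indices of candidate words starting there."""
    dag = {}
    n = len(sentence)
    for i in range(n):
        ends = [j for j in range(i, n) if sentence[i:j + 1] in freq]
        if i not in ends:
            ends.insert(0, i)  # a single character is always a fallback candidate
        dag[i] = ends
    return dag


def segment(sentence: str, freq: Dict[str, int]) -> List[str]:
    """Choose the segmentation with the highest total log-probability."""
    total = sum(freq.values())
    dag = build_dag(sentence, freq)
    n = len(sentence)
    best = {n: (0.0, 0)}  # best[i] = (score of the best path from i, end index of its first word)
    for i in range(n - 1, -1, -1):
        best[i] = max(
            (math.log(freq.get(sentence[i:j + 1], 1) / total) + best[j + 1][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words


def filter_stop_words(words: Iterable[str], stop_words: Iterable[str]) -> List[str]:
    """Step 25: drop every word that appears in the stop word dictionary."""
    stop = set(stop_words)
    return [w for w in words if w not in stop]
```

A custom dictionary in the sense of step 22 can be modeled by merging extra entries into `freq` before calling `segment`, for example `{**freq, "安全账户": 40}`, so that a fraud-specific term is kept as one word instead of being split into single characters.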

In step two, calculating the word vector, character vector, pinyin vector and stroke vector of each segmented word to form the word combination vector of each segmented word in the text sample further comprises:

outputting the word vector $C_{w0}$, character vector $C_c$, pinyin vector $C_p$ and stroke vector $C_b$ of each segmented word by using the skip-Gram method, and then constructing the word combination vector of each segmented word:

[formula not reproduced in the source text]

wherein the word combination vector is one of a plurality of word combination vectors obtained by different combination modes, and sum represents the summation operation.
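Because the combination formula itself is missing from this text, the following sketch only illustrates the general idea of producing several candidate word combination vectors by summation. It assumes that the four vectors share one dimension, that candidates are element-wise sums over subsets of the four vectors, and that every candidate keeps the word vector; none of these assumptions is confirmed by the source, and the function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Sequence


def vec_sum(*vectors: Sequence[float]) -> List[float]:
    """Element-wise sum of equal-length vectors."""
    return [sum(values) for values in zip(*vectors)]


def combination_vectors(c_w0, c_c, c_p, c_b) -> Dict[str, List[float]]:
    """Return candidate combination vectors keyed by the parts that were summed."""
    parts = {"w0": c_w0, "c": c_c, "p": c_p, "b": c_b}
    candidates = {}
    for r in range(1, len(parts) + 1):
        for names in combinations(parts, r):
            if "w0" in names:  # assumption: every candidate includes the word vector
                candidates["+".join(names)] = vec_sum(*(parts[n] for n in names))
    return candidates
```

With four input vectors this yields eight candidates, from the word vector alone (`"w0"`) up to the sum of all four (`"w0+c+p+b"`).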

The invention uses a skip-Gram model to convert words into numerical vectors. The core of skip-Gram is a Huffman tree: each word reaches a leaf node from the root of the tree, so that a word in its context can be predicted. Each word is iterated N-1 times, yielding a prediction for all words in its context. That is, assume that a text sample S is composed of n words $w_1, \ldots, w_n$; then, for a word $w_t$ and a context window of size k, the probabilities of the 2k surrounding words can be predicted.
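The 2k context words per center word can be made concrete with a small sketch that enumerates skip-gram training pairs. It shows only how (center, context) pairs are formed for a window of size k; the Huffman-tree output layer and the training itself are not shown, and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def skipgram_pairs(words: Sequence[str], k: int) -> List[Tuple[str, str]]:
    """List every (center word, context word) pair within a window of k words on each side."""
    pairs = []
    for t, center in enumerate(words):
        lo = max(0, t - k)
        hi = min(len(words), t + k + 1)
        pairs.extend((center, words[j]) for j in range(lo, hi) if j != t)
    return pairs
```

A word in the interior of the sample contributes exactly 2k pairs, while words near either end contribute fewer because the window is clipped at the boundaries.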

As shown in fig. 3, in step two, concatenating the word combination vector, part-of-speech tag and syntactic dependency label of each segmented word in the text sample to form the content vector of each segmented word, calculating the scene element label to which each segmented word belongs from its content vector, and finally averaging the content vectors and scene element labels of all segmented words in the text sample to obtain the semantic vector corresponding to the text may further include:

step A1, setting a plurality of scene elements and labeling the scene element corresponding to each segmented word in combination with specific event scenes; 12 types of scene elements are labeled, as shown in the following table, and anything not belonging to these 12 types is classified as other;

step A2, inputting the word combination vector, part-of-speech tag and syntactic dependency label of each segmented word in the text sample into an LSTM model for encoding, and obtaining the content vector corresponding to each segmented word;

step A3, calculating, by using Self-Attention, a weighted influence factor of each segmented word relative to the other segmented words according to the word combination vector of each segmented word;

step A4, combining the content vector of each segmented word obtained in step A2 and the weighted influence factor of each segmented word obtained in step A3 into a new content vector of each segmented word, and then inputting the new content vector of each segmented word into a CNN model, wherein the output of the CNN model is the scene element corresponding to each segmented word;

step A5, inputting the new content vector and scene element of each segmented word in the text sample into the LSTM model for encoding, combining the LSTM outputs corresponding to all segmented words in the text sample into a vector matrix, and taking the average over the second dimension of the attention vector matrix as the semantic vector of the text sample.
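Two of the numerical operations in steps A3 and A5 can be sketched without a deep learning framework: the pairwise weighting between segmented words and the final averaging into one vector. This is a simplified illustration and not the model of the invention: it uses scaled dot-product attention without learned projections, it omits the LSTM and CNN stages entirely, and it assumes that the averaging in step A5 runs over the token axis. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def self_attention_weights(vectors: Sequence[Sequence[float]]) -> List[List[float]]:
    """Row i holds the weights of token i over all tokens (scaled dot product)."""
    d = len(vectors[0])
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors])
        for q in vectors
    ]


def attend(vectors, weights) -> List[List[float]]:
    """Replace each token vector by the weighted mix of all token vectors."""
    dim = len(vectors[0])
    return [
        [sum(w * v[j] for w, v in zip(row, vectors)) for j in range(dim)]
        for row in weights
    ]


def mean_pool(matrix: Sequence[Sequence[float]]) -> List[float]:
    """Average over tokens to obtain one fixed-length vector for the whole text."""
    n = len(matrix)
    return [sum(col) / n for col in zip(*matrix)]
```

Calling `mean_pool(attend(vectors, self_attention_weights(vectors)))` gives a single vector whose length equals the token vector dimension, regardless of how many segmented words the text contains, which is what allows texts of different lengths to feed one classifier.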

In step three, a fraud classification identification model may be constructed based on the CNN model.

In the fourth step, the working process of the fraud classification identification model in the mobile phone terminal App is as follows:

the word combination vector, part-of-speech tag and syntactic dependency label of each segmented word in the call text are concatenated to form the content vector of each segmented word; the scene element label to which each segmented word belongs is calculated from its content vector; the content vectors and scene element labels of all segmented words in the call text are averaged to obtain the semantic vector corresponding to the call text; the semantic vector is input into the fraud classification recognition model in the mobile phone terminal App to obtain the fraud category to which the call number to be identified belongs; and the recognized label is pushed through an App message to alert the user. The user may choose whether to correct the label and to edit and desensitize the text again, and, if the user agrees to authorize it, the text and the label are uploaded to the server for secondary training.

It is worth mentioning that in step two, several word combination vectors may be calculated for each segmented word in the training samples. In step three, the semantic vectors obtained from the different word combination vectors are then each input into a fraud classification recognition model for training, and the model with the highest recognition accuracy, together with its corresponding word combination vector, is selected according to the recognition accuracy of the models trained on the different word combination vectors. In step four, the selected fraud classification recognition model and word combination vector are used to calculate the fraud category to which the call number to be identified belongs.
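The selection among word combination vectors described above amounts to comparing recognition accuracy across candidate models. A minimal sketch follows, assuming that each candidate's predictions on a common labeled evaluation set are already available; the source does not say how accuracy is measured, and the function names and category labels are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def select_best(
    candidates: Dict[str, Sequence[str]], labels: Sequence[str]
) -> Tuple[str, Dict[str, float]]:
    """candidates maps a word-combination name to that model's predictions."""
    scores = {name: accuracy(preds, labels) for name, preds in candidates.items()}
    best = max(scores, key=scores.get)  # the first name wins on a tie
    return best, scores
```

The returned name identifies both the model to push to the App and the word combination mode the App must use when vectorizing new calls, since the two have to match.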

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A terminal fraud call identification method based on conversation text word vectors is characterized by comprising the following steps:

2. The method of claim 1, wherein step one further comprises:

step 12, the user views the text, edits the text to further improve desensitization, and then selects whether to upload the desensitization text marked as a fraud category by the user to the server, and if so, uploads the text and the mark of the fraud category to the server under the authorization of the user;

3. The method according to claim 1, wherein in step two, performing word segmentation and part-of-speech tagging on each text sample to obtain a syntactic dependency label of each segmented word further comprises:

step 21, generating a stop word dictionary based on Chinese grammar;

step 24, performing syntactic dependency analysis on each participle by using a fast Offset-based algorithm, and outputting a syntactic dependency label of each participle;

4. The method of claim 1, wherein in step two, calculating the word vector, character vector, pinyin vector and stroke vector of each segmented word to form the word combination vector of each segmented word in the text sample further comprises:

5. The method according to claim 1, wherein in step two, concatenating the word combination vector, part-of-speech tag and syntactic dependency label of each segmented word in the text sample to form the content vector of each segmented word, calculating the scene element label to which each segmented word belongs from its content vector, and finally averaging the content vectors and scene element labels of all segmented words in the text sample to obtain the semantic vector corresponding to the text further comprises:

step A1, setting a plurality of scene elements;

6. The method as recited in claim 1, wherein, in step three, a fraud classification identification model is constructed based on a CNN model.

7. The method as claimed in claim 1, wherein in step four, the fraud classification recognition model works in the cell phone terminal App as follows:

8. The method as claimed in claim 1, wherein a plurality of word combination vectors of each segmented word in the training samples are calculated in step two, then the semantic vectors corresponding to different word combination vectors are respectively inputted into the fraud classification recognition models in step three for training, and the fraud classification recognition model with the highest recognition accuracy and the corresponding word combination vector thereof are selected according to the recognition accuracy of the fraud classification recognition models corresponding to different word combination vectors, so that the fraud category to which the call number to be recognized belongs is calculated and obtained in step four by using the selected fraud classification recognition model and the word combination vector.