CN111666412A - Fraud log text analysis method and system based on SVM text analysis - Google Patents


Publication number
CN111666412A
Authority
CN
China
Prior art keywords
text
weight
list
log text
log
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010490624.1A
Other languages
Chinese (zh)
Inventor
王中华
郝振江
刘志会
许高尚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Guorui Digital Safety System Co ltd
National Computer Network and Information Security Management Center
Original Assignee
Tianjin Guorui Digital Safety System Co ltd
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Guorui Digital Safety System Co ltd, National Computer Network and Information Security Management Center
Priority to CN202010490624.1A
Publication of CN111666412A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the field of text analysis, and particularly relates to a fraud log text analysis method based on SVM text analysis. The method comprises the following steps: analyzing the log text number of an incoming log text against the mobile phone number black, white and grey lists to generate a list weight; analyzing the log text using keywords to generate a keyword weight; analyzing the log text using an SVM model to generate an SVM model weight; and performing a comprehensive analysis using the list weight, the keyword weight and the SVM model weight to generate a fraud log text weight, and judging the log text using the fraud log text weight. The invention judges the log text comprehensively, from its source to its content, which reduces the misjudgment rate, improves the identification accuracy of log texts and saves time.

Description

Fraud log text analysis method and system based on SVM text analysis
Technical Field
The invention belongs to the field of text analysis, and particularly relates to a fraud log text analysis method and system based on SVM text analysis.
Background
Fraud text analysis currently relies mainly on keyword filtering. However, the language and content of such texts change constantly, and even when a phrase such as "notary office notice" appears, automatic analysis and recognition cannot be achieved. Many techniques for automatic text classification exist, and the support vector machine (SVM) is among the most popular and best-performing of them. In many scenarios, however, normal text and fraud text are very similar, for example a university student asking their parents for living expenses. Text classification alone therefore cannot judge fraud text well.
Disclosure of Invention
In view of the above problems, the invention designs and implements a fraud log text analysis method based on SVM text analysis, which comprises the following steps:
analyzing the log text number of an incoming log text against the mobile phone number black, white and grey lists to generate a list weight;
analyzing the log text using keywords to generate a keyword weight;
analyzing the log text using an SVM model to generate an SVM model weight;
and performing a comprehensive analysis using the list weight, the keyword weight and the SVM model weight to generate a fraud log text weight, and judging the log text using the fraud log text weight.
Further, the mobile phone number black, white and grey lists comprise a white list, a grey list and a black list;
analyzing the log text number against the mobile phone number black, white and grey lists comprises:
classifying the log text number using the white list, the grey list and the black list;
and generating the list weight according to the classification.
Further, the analyzing the log text by using the keyword to generate a keyword weight includes:
judging the format of the log text;
and generating the keyword weight using the keywords according to the log text format.
Further, analyzing the log text using the SVM model to generate the SVM model weight includes:
establishing an SVM model, and analyzing the log text using the SVM model to generate the SVM model weight.
Further, the establishing the SVM model includes:
collecting a training log text, and performing feature extraction on the training log text to generate a feature extraction text;
performing feature identification on the feature extraction text by using TF-IDF to generate a feature identification text;
normalizing the feature identification text to generate normalized data;
and classifying the normalized data by using an SVM (support vector machine), and establishing an SVM model.
The invention also provides a fraud log text analysis system based on SVM text analysis, comprising:
the list analysis module is used for analyzing the log text number of an incoming log text against the mobile phone number black, white and grey lists to generate a list weight;
the keyword analysis module is used for analyzing the log text by using the keywords and generating keyword weights;
the SVM analysis module is used for analyzing the log text by using an SVM model and generating SVM model weight;
the comprehensive analysis module is used for comprehensively analyzing by using the list weight, the keyword weight and the SVM model weight to generate a fraud log text weight;
and the judging module is used for judging the log text by utilizing the fraud log text weight.
Further, the mobile phone number black, white and grey lists comprise a white list, a grey list and a black list;
the list analysis module comprises:
a classification component for classifying the log text number using the white list, the grey list and the black list;
and the list weight component is used for generating the list weight according to the classification.
Further, the keyword analysis module includes:
the judging component is used for judging the format of the log text;
and the keyword weight component is used for generating keyword weight by using keywords according to the log text format.
Further, the SVM analysis module includes:
a building component for building an SVM model;
an SVM analysis component for analyzing the log text using the SVM model;
and an SVM model weight generation component for generating the SVM model weight.
Further, the establishing component comprises:
the collecting unit is used for collecting training log texts;
the feature extraction unit is used for performing feature extraction on the training log texts;
the feature extraction text generation unit is used for generating a feature extraction text;
the feature identification unit is used for performing feature identification on the feature extraction text using TF-IDF;
a feature identification text generation unit for generating a feature identification text;
the normalization unit is used for normalizing the feature identification text to generate normalized data;
the classification unit is used for classifying the normalized data by using an SVM;
and the establishing unit is used for establishing the SVM model.
The fraud log text analysis method and system based on SVM text analysis judge the log text through a comprehensive analysis of the list weight, the keyword weight and the SVM model weight. The SVM automates log text classification, and combining it with the other two weights addresses the problem that some fraud texts resemble normal texts and cannot be classified correctly. The invention judges the log text comprehensively, from its source to its content, which reduces the misjudgment rate, improves the identification accuracy of log texts and saves time.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 shows a flow chart of a fraud log text analysis method based on SVM text analysis according to an embodiment of the present invention;
FIG. 2 shows a schematic structural diagram of a fraud log text analysis system based on SVM text analysis according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The method and system are used to analyze a log text and judge whether it is a fraud log text. The invention discloses a fraud log text analysis method based on SVM text analysis, which can adopt, but is not limited to, the following procedure. Illustratively, as shown in FIG. 1, the method comprises:
Step one: analyze the log text number of the incoming log text against the mobile phone number black, white and grey lists to generate a list weight.
Specifically, the log text number is the number from which the log text was sent. The mobile phone number black, white and grey lists are preset. Mobile phone numbers here include ordinary mobile phone numbers, such as numbers beginning with 139 or 133; official customer service numbers, such as the telecom customer service number 10001; service numbers beginning with 106; international mobile phone numbers; and mailbox addresses, Apple IDs and any other identifiers that can send text messages or iMessages.
Specifically, the mobile phone number black, white and grey lists comprise a white list, a grey list and a black list, and every mobile phone number falls into one of them.
The white list is the trusted list, and the black list is the untrusted list.
A mobile phone number that is in neither the white list nor the black list is in the grey list. The white list and the black list are obtained in advance by collecting big data.
Specifically, the log text number is classified using the white list, the grey list and the black list.
Illustratively, the list weight corresponding to the white list is defined as the first list weight, the list weight corresponding to the grey list as the second list weight, and the list weight corresponding to the black list as the third list weight, where 0 ≤ first list weight < second list weight < third list weight ≤ 1.
The log text number is first checked against the white list. If it is in the white list, its list weight is the first list weight. If it is in the grey list or the black list, its list weight is the second list weight or the third list weight, respectively.
Further, the log text number can instead be analyzed using only a white list and a grey list of mobile phone numbers to generate the list weight. In that case the mobile phone number lists comprise a white list and a grey list, and every mobile phone number not in the white list is in the grey list. The white list is collected in advance from big data.
The list weight corresponding to the white list is defined in advance as the first list weight, and the list weight corresponding to the grey list as the second list weight, where 0 ≤ first list weight < second list weight ≤ 1.
The log text number is first checked against the white list. If it is in the white list, its list weight is the first list weight; otherwise, its list weight is the second list weight.
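The list lookup of step one can be sketched as follows. This is a minimal illustration, not the patented implementation: the three weight values, the function name `list_weight` and the sample numbers are assumptions chosen only to satisfy the ordering 0 ≤ first < second < third ≤ 1 stated above.

```python
# Hypothetical weight values; the text only requires
# 0 <= first < second < third <= 1.
WHITE_WEIGHT = 0.0   # first list weight (trusted senders)
GREY_WEIGHT = 0.5    # second list weight (senders on neither list)
BLACK_WEIGHT = 1.0   # third list weight (untrusted senders)


def list_weight(number, whitelist, blacklist=frozenset()):
    """Return the list weight for the number a log text came from.

    A number in neither the white list nor the black list is treated
    as grey. Leaving `blacklist` empty gives the white/grey variant.
    """
    if number in whitelist:
        return WHITE_WEIGHT
    if number in blacklist:
        return BLACK_WEIGHT
    return GREY_WEIGHT


whitelist = {"10001", "95588"}
blacklist = {"13900000000"}  # made-up number for illustration

white = list_weight("10001", whitelist, blacklist)
black = list_weight("13900000000", whitelist, blacklist)
grey = list_weight("13300000000", whitelist, blacklist)
```

Storing the lists as sets keeps each lookup at constant time, which matters when the lists are built from large collected data.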
Step two: analyze the log text using keywords to generate a keyword weight.
Specifically, a keyword is a word or phrase that summarizes, to the greatest extent, the content being searched for when judging whether a log text is a fraud log text. A keyword indicates what a log text is mainly about, and one keyword (more often, several) can assign a log text to one topic. For example, fraud log texts involving a "notary office" mostly contain the same word combination "notary office notice", and that combination is the keyword. Keywords are divided into single keywords and combined keywords: a keyword consisting of one word is a single keyword, and a keyword combining several words is a combined keyword. Illustratively, single keywords include "public security bureau", "police officer" and the like; combined keywords include "Agricultural Bank of China", "Ping An Insurance", "notary office notice" and the like. "Public security bureau" is a single word from which the content of the log text can be judged to relate to public security. "Agricultural Bank of China" consists of the three words "China", "agriculture" and "bank". If the three words were used separately to analyze the log text, the word "China" would reveal nothing about the content; the word "agriculture" would misjudge the content as relating to agriculture; and the word "bank" would show that the content relates to a bank but would not allow any further analysis. The combined keyword "Agricultural Bank of China", by contrast, shows that the content relates to the Agricultural Bank of China.
The keywords are obtained through manual work and big data analysis. Each keyword has a category (such as customer service fraud, public security impersonation and the like), and matching the keywords against a text preliminarily identifies fraud log texts and their categories. The categories are set in advance through manual work and big data analysis.
Illustratively, the log text category may be preliminarily set in, but not limited to, the following manner.
For example, customer service fraud texts contain keywords such as "company" and "bank", and public security impersonation texts necessarily contain keywords such as "public security bureau" and "police officer".
For example, in the following two fraud log texts, part of the content is replaced with "*"; the replaced content does not affect keyword determination. "* messenger: if in doubt, the customer may dial * or consult the * Bank credit card issuing center." "* Public Security Bureau reminder: the account with your tail number * has recently been linked to drug transactions, and all your accounts have been frozen. Contact Officer *, phone: ***-********".
The two log texts above contain the keywords "* Bank" and "* Public Security Bureau" respectively, so they are preliminarily assigned to the customer service fraud category and the public security impersonation category respectively.
A log text that belongs to a given category necessarily contains several keywords. Customer service texts, for example, include bank-related, telecom-related and insurance-related log texts. A bank-related log text contains the keyword "* Bank". If it also contains a telephone number, that number should be the bank's normal customer service number, for example a keyword such as "9*". Normal customer service numbers can be collected from the banks' official websites and similar sources.
A normal bank-related log text contains the keyword "* Bank". If a telephone number appears, it is the bank's unified service number; for example, the number of the Industrial and Commercial Bank of China is 95588. If website information appears, the website is the bank's genuine website and necessarily contains the bank's domain name; for China Merchants Bank, for example, it must be "cmbchina.com" or "cmbt.cn". The telephone number and domain name corresponding to a bank are keywords.
The log text may be analyzed using the keywords in, but not limited to, the following manner:
Step 2.1: judge the format of the log text.
The log text formats are defined in advance and are generated through manual work and big data analysis.
Illustratively, consider the following three log texts, in which part of the content is replaced with "*"; the replaced content does not affect keyword determination.
Normal log text one: "Your account * on * month * day: spent RMB * yuan. [* Bank]". Normal log text two: "[* Bank] Your card with tail number * repaid RMB * yuan on * month * day; the current RMB bill is cleared once the payment is posted. For installment consumption, dial 95* and press 3#, then 1". Fraud log text three: "* messenger: if in doubt, the customer may dial * or consult the * Bank credit card issuing center."
Of the three log texts, normal log text one has the format "unit information": it contains "* Bank" and no telephone information. Normal log text two and fraud log text three have the format "unit information + telephone information": they contain "* Bank" together with telephone information.
Step 2.2: generate the keyword weight using the keywords according to the log text format.
A log text format corresponds to a plurality of keywords. If a log text belongs to a given log text format and contains all the keywords corresponding to that format, it generates the normal keyword weight, denoted keyword weight one, and its category is reset to the normal category. If a log text belongs to a given log text format but contains only some of the corresponding keywords, it generates the suspected keyword weight, denoted keyword weight two, and its category is recorded as the preliminarily set category. The values of keyword weight one and keyword weight two are set in advance.
For example, the preliminarily set category of all three log texts above is "customer service fraud". Normal log text one has the format "unit information", and its content contains the keyword "* Bank". Normal log text two has the format "unit information + telephone information", and its content contains the keyword "* Bank" and the keyword "95*". Fraud log text three has the format "unit information + telephone information", and its content contains the keyword "* Bank" but no keyword related to telephone information.
The keywords of normal log text one and normal log text two match their respective formats, so each generates the normal keyword weight, denoted keyword weight one, and its category is reset to the normal category. The keywords of fraud log text three do not match its format, so it generates the suspected keyword weight, denoted keyword weight two, and its category is recorded as customer service fraud.
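The decision in step 2.2 can be sketched as below. The sketch assumes the format of each log text has already been judged in step 2.1 and is passed in as `fmt`; the two weight values, the `FORMAT_KEYWORDS` table, the function name and the sample texts are all hypothetical stand-ins, since the text only says the formats, keywords and weights are set in advance.

```python
# Hypothetical values; the text only says both weights are preset.
KEYWORD_WEIGHT_ONE = 0.1  # normal keyword weight
KEYWORD_WEIGHT_TWO = 0.8  # suspected keyword weight

# Hypothetical format table. Each format maps to groups of keywords;
# a group is satisfied when any one of its keywords occurs in the text.
FORMAT_KEYWORDS = {
    "unit": [["Bank"]],
    "unit+phone": [["Bank"], ["95588"]],
}


def keyword_weight(text, fmt, preliminary_category):
    """Return (keyword weight, category) for a text of known format."""
    groups = FORMAT_KEYWORDS[fmt]
    complete = all(any(kw in text for kw in group) for group in groups)
    if complete:
        # All expected keywords present: reset category to normal.
        return KEYWORD_WEIGHT_ONE, "normal"
    # Only some keywords present: keep the preliminary category.
    return KEYWORD_WEIGHT_TWO, preliminary_category


normal_one = "Your account spent RMB 10 yuan. [Example Bank]"
normal_two = "[Example Bank] Repayment posted. Installments: dial 95588"
fraud_three = "Messenger: if in doubt, dial 13900000000 or ask the Example Bank card center"

r1 = keyword_weight(normal_one, "unit", "customer service fraud")
r2 = keyword_weight(normal_two, "unit+phone", "customer service fraud")
r3 = keyword_weight(fraud_three, "unit+phone", "customer service fraud")
```

The third text carries telephone information but not the bank's known service number, so it keeps its preliminary category and receives the suspected weight.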
Step three: analyze the log text using an SVM model to generate an SVM model weight.
Specifically, an SVM model is established, and a new log text is analyzed using the SVM model to generate the SVM model weight.
The SVM model establishing method comprises the following steps:
collecting a training log text, and performing feature extraction on the training log text to generate a feature extraction text;
performing feature identification on the feature extraction text by using TF-IDF to generate a feature identification text;
normalizing the feature identification text to generate normalized data;
and classifying the normalized data by using an SVM (support vector machine), and establishing an SVM model.
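The last two steps of the list above (normalization, then SVM classification) can be illustrated with a small pure-Python sketch. It is only one possible reading: L2 normalization, a linear kernel and batch subgradient descent on the hinge loss are assumptions made here for brevity, and the function names, hyperparameters and toy vectors are hypothetical. How the decision value is turned into the SVM model weight is not specified at this point in the text, so the sketch stops at the decision value.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Batch subgradient descent on the L2-regularised hinge loss.

    X: list of feature vectors; y: labels in {+1, -1}.
    """
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # sample lies inside or beyond the margin
                for j, xj in enumerate(xi):
                    grad_w[j] -= yi * xj / len(X)
                grad_b -= yi / len(X)
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def decision(w, b, x):
    """Signed score of x; positive means the +1 class."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


# Made-up toy data: +1 = fraud, -1 = normal.
X = [l2_normalize(v) for v in ([3, 0], [2, 1], [0, 3], [1, 2])]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
```

In practice an existing SVM library would normally replace the hand-written trainer; the sketch only shows where normalization and classification sit in the pipeline.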
Illustratively, the SVM model may be established using, but not limited to, the following methods.
Step 3.1: collect training log texts, and perform feature extraction on the training log texts to generate a feature extraction text.
Specifically, training log texts of different types are collected; all training log texts are segmented into words, and the training log texts are represented with these words as the dimensions of a vector;
all words appearing in the training log texts of each type, together with their frequencies, are counted, and stop words and single words are then eliminated;
the total word frequency of the words appearing in each type is counted, and several words with the highest frequencies are taken as the feature word set of that type;
words that appear in every type are removed, and the feature word sets of all types are combined into a total feature word set; the total feature word set is the feature set, which is used to screen the features in the training log texts.
Illustratively, the feature extraction text may be generated using, but is not limited to, the following methods.
Different types of training log texts, such as customer service fraud, public security impersonation and normal log texts, are collected through manual work and big data analysis.
Collecting the training log texts includes normalizing them.
Normalization is needed because some log texts separate their content with special symbols, use traditional characters in place of simplified characters, use special characters in place of normal content, and so on. For example, some log texts contain "contact Q-Q is" or "China Customs" written in traditional characters, or "+v letter". The training log texts therefore need to be normalized in advance.
The normalization can be performed in, but is not limited to, the following manner:
A) Replace traditional characters with simplified characters.
The text is processed with a traditional-to-simplified Chinese conversion tool. After processing, the log texts above read "contact Q-Q is", "China Customs" and "+v letter" in simplified characters.
B) Replace special characters.
A special character dictionary is generated in advance through manual work and big data collection. The special character dictionary is used to convert special characters in the log text into normal content.
For example, in the special character dictionary, "+v letter" means "add WeChat". After processing, the log text "+v letter" becomes "add WeChat".
C) Delete special symbols.
A special symbol table is generated in advance; it is defined manually and includes "@" and the like. After processing, the log text "contact Q-Q is" becomes "contact QQ is".
Normalizing the training log texts generates the normalized log texts.
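The three normalization passes A) to C) can be sketched as follows. All three tables are tiny hypothetical stand-ins (the text says the real ones come from a conversion tool, manual work and big data collection), and the Chinese strings are assumed originals of the translated examples: "聯係Q-Q" for "contact Q-Q" in traditional characters and "+v信" for "+v letter".

```python
# Hypothetical miniature tables for illustration only.
TRADITIONAL_TO_SIMPLIFIED = {"聯": "联", "係": "系", "國": "国", "關": "关"}
SPECIAL_PHRASES = {"+v信": "加微信"}  # special character dictionary
SPECIAL_SYMBOLS = set("-@*#~ ")       # special symbol table


def normalize(text):
    """Apply passes A, B and C in order and return the cleaned text."""
    # A) replace traditional characters with simplified ones
    text = "".join(TRADITIONAL_TO_SIMPLIFIED.get(ch, ch) for ch in text)
    # B) replace special character sequences with normal content
    for special, normal in SPECIAL_PHRASES.items():
        text = text.replace(special, normal)
    # C) delete special symbols
    return "".join(ch for ch in text if ch not in SPECIAL_SYMBOLS)


contact = normalize("聯係Q-Q")   # "contact Q-Q" in traditional characters
customs = normalize("中國海關")  # "China Customs" in traditional characters
wechat = normalize("+v信")       # "+v letter"
```

Pass B runs before pass C so that a special sequence is still intact when the dictionary lookup happens; deleting symbols first could break entries that contain them.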
Specifically, all normalized log texts are segmented into words, and the normalized log texts are represented with these words as the dimensions of a vector.
In particular, word segmentation means cutting a sequence of Chinese characters into individual words. Word segmentation is the foundation of text mining: once an input Chinese passage has been segmented successfully, a computer can go on to identify the meaning of the sentence automatically. The word segmentation method based on character string matching, also called the mechanical word segmentation method, matches the Chinese character string to be analyzed against the entries of a sufficiently large machine dictionary according to a certain strategy; if a character string is found in the dictionary, the match succeeds (a word is recognized).
By way of example, the following methods may be used for word segmentation, but are not limited to:
a word segmentation method based on character string matching, a word segmentation method based on understanding, and a word segmentation method based on statistics.
By way of example, the following tools may be used for word segmentation, but are not limited to:
SCWS, ICTCLAS, HTTPCWS, CC-CEDICT.
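Of the three families of methods listed, string-matching segmentation is the simplest to illustrate. The sketch below uses forward maximum matching, which is one common matching strategy and an assumption here, since the text does not say which strategy is used. The function name, the `max_len` parameter and the three-word dictionary are hypothetical.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment `text` by greedy longest match against `dictionary`."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary
        # entry matches; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


dictionary = {"最新", "代理", "模式"}  # "latest", "agent", "mode"
segmented = forward_max_match("最新代理模式", dictionary)
unknown = forward_max_match("新代理", dictionary)
```

Characters that match no dictionary entry come out as single-character words, which is why single words are filtered out later in the feature extraction.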
Illustratively, the log text "Macau new Pujing betting group recruits agents, agents of new betting" and the log text "latest agent mode" are segmented into words. The word "agent" appears three times in the two texts in total, but after segmentation each repeated word is listed only once. Although "new" and "latest" are close in meaning, both are kept as separate words in the dictionary after segmentation.
Based on the words appearing in the above texts, the following space is constructed: {"Macau": 1, "new": 2, "Pujing": 3, "betting": 4, "group": 5, "recruit": 6, "agent": 7, "of" (the particle 的): 8, "latest": 9, "mode": 10}.
The space contains 10 words, and each log text is represented with these words as the dimensions of a vector.
Specifically, all words appearing in the log texts of each type, together with their frequencies, are counted, and stop words and single words are eliminated.
Further, a space is constructed from the plurality of log texts, and the space includes all words that occur in those log texts. For each type of log text, all occurring words and their frequencies over all log texts of that type are counted.
Illustratively, the log text "Macau new Pujing betting group recruits agents, agents of new betting" and the log text "latest agent mode" contain the words "Macau", "new", "Pujing", "betting", "group", "recruit", "agent", "of", "latest" and "mode", with frequencies 1, 2, 1, 2, 1, 1, 3, 2, 1 and 1 respectively. The two log texts can each be represented by a 10-dimensional vector, and together they can be represented by the single 10-dimensional vector [1, 2, 1, 2, 1, 1, 3, 2, 1, 1]. The vector is independent of the order in which the words occur in the original log texts; each entry is the number of occurrences of the corresponding word of the space.
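The word space and the count vector can be reproduced with a short sketch. The token lists are English glosses of the two example texts and are an assumption about how the originals segment; in particular the particle glossed as "of" is assumed to occur once in each text. The function names are hypothetical, and dimensions are numbered from 0 rather than 1.

```python
from collections import Counter


def build_space(token_lists):
    """Map each distinct word to a dimension, in order of first appearance."""
    space = {}
    for tokens in token_lists:
        for word in tokens:
            space.setdefault(word, len(space))
    return space


def to_vector(tokens, space):
    """Count how often each word of the space occurs in `tokens`."""
    counts = Counter(tokens)
    return [counts[word] for word in space]


# Assumed segmentation of the two example log texts (English glosses).
text_one = ["Macau", "new", "Pujing", "betting", "group", "recruit",
            "agent", "agent", "new", "of", "betting"]
text_two = ["latest", "of", "agent", "mode"]

space = build_space([text_one, text_two])
combined = to_vector(text_one + text_two, space)
second_only = to_vector(text_two, space)
```

Because the vector stores counts per dimension, shuffling the tokens of a text leaves its vector unchanged.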
Specifically, stop words and single words are removed.
Specifically, stop words are characters or words that are automatically filtered out before or after natural language data (or text) is processed in information retrieval, in order to save storage space and improve search efficiency. Stop words are entered manually rather than generated automatically, and together they form a stop word list. For example, the word "in" has no definite meaning of its own and only plays a role once it is placed in a complete sentence. Likewise, the word "is" appears in almost every text, so searching for it cannot guarantee truly relevant results, hardly helps to narrow the search range, and reduces search efficiency. Such words are stop words.
By way of example, the following stop word lists may be used, but are not limited to: the Harbin Institute of Technology stop word list and the Baidu stop word list. A new stop word list may also be generated from an existing stop word list.
There are also single words. These words have specific meanings but cannot be used in judging the log text. By way of example, such words include, but are not limited to: I, you, he, it, lift, walk, sit, stand, recruit, etc.
For example, for the log texts "Macau new Pujing betting group recruits agents, agents of new betting" and "latest agent mode", after stop words and single words are removed the new 10-dimensional vector is [1, 2, 1, 2, 1, 0, 3, 0, 1, 1].
Specifically, the total word frequency of the words appearing in each type is counted, and several words with the highest frequencies are taken as the feature word set of that type.
Illustratively, the total frequency of all words in the log texts "Macau new Pujing betting group recruits agents, agents of new betting" and "latest agent mode" is 1+2+1+2+1+0+3+0+1+1 = 12. The words with the highest frequencies are "new" (2 times), "betting" (2 times) and "agent" (3 times). These three words are taken as the feature word set of the type.
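The filtering and the selection of high-frequency words for one type can be sketched as below. The token lists are assumed English glosses of the two example texts, the stop word list and single word list each hold one hypothetical entry, and `top_n=3` mirrors the three words chosen in the example rather than any value fixed by the text.

```python
from collections import Counter

STOP_WORDS = {"of"}         # hypothetical stop word list
SINGLE_WORDS = {"recruit"}  # hypothetical single word list


def class_feature_words(token_lists, top_n=3):
    """Return (total word frequency, top_n frequent words) for one type."""
    counts = Counter(
        word
        for tokens in token_lists
        for word in tokens
        if word not in STOP_WORDS and word not in SINGLE_WORDS
    )
    total = sum(counts.values())
    top = [word for word, _ in counts.most_common(top_n)]
    return total, top


# Assumed segmentation of the two example log texts (English glosses).
text_one = ["Macau", "new", "Pujing", "betting", "group", "recruit",
            "agent", "agent", "new", "of", "betting"]
text_two = ["latest", "of", "agent", "mode"]

total, top_words = class_feature_words([text_one, text_two])
```

With these inputs the total is 12 and the selected words are "agent", "new" and "betting", matching the worked example.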
Specifically, words that appear in every type are removed, and the feature word sets of all types are combined to form a total feature word set. The total feature word set is the feature set, which is used to screen features in the log text.
The statistics above are computed for each type. If a word occurs in all types, it is usually unrelated to any subject, and a text cannot be judged by it. Such words are deleted, and the feature word sets of all types are then combined to generate the total feature word set.
For example, assuming "new" appears in all types of log text, the word "new" is deleted. The total feature word set then includes "betting", "agent", and the feature word sets of the other types of log text.
A space is constructed from the total feature word set, and each log text generates a corresponding vector in this space. That is, feature extraction is performed on the training log text to generate a feature extraction text.
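A minimal sketch of building the total feature word set and mapping a log text into the resulting space is given below. The type names ("gambling", "finance"), the second type's words, and the helper names are hypothetical; only the handling of "new", "betting" and "agent" follows the example in the text.

```python
def build_total_feature_set(features_by_type, vocab_by_type):
    """Merge the per-type feature word sets, dropping any word that
    occurs in the vocabulary of every type."""
    common = set.intersection(*(set(v) for v in vocab_by_type.values()))
    total = set().union(*(set(f) for f in features_by_type.values()))
    return sorted(total - common)  # sorted so vector positions are stable


def vectorize(tokens, features):
    """Map a token list to a count vector over the total feature word set."""
    return [tokens.count(f) for f in features]


# Hypothetical per-type data.
features_by_type = {
    "gambling": ["new", "betting", "agent"],
    "finance": ["new", "stock", "fund"],
}
vocab_by_type = {
    "gambling": ["macao", "new", "betting", "group", "recruit", "agent",
                 "latest", "mode"],
    "finance": ["new", "stock", "fund", "bank", "manager"],
}
features = build_total_feature_set(features_by_type, vocab_by_type)

log_tokens = ["macao", "new", "betting", "group", "recruit", "agent",
              "agent", "new", "betting", "latest", "agent", "mode"]
vector = vectorize(log_tokens, features)
print(features, vector)
```

Because "new" occurs in both types it is removed, so it contributes no dimension to the space even though it is frequent in the example text.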
Step 3.2, performing feature identification on the feature extraction text by using TF-IDF to generate a feature identification text.
Specifically, TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining. It is a statistical method for evaluating how important a word is to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases as the frequency with which it appears across the corpus rises. Various forms of TF-IDF weighting are often applied by search engines as a measure or rating of the relevance between a document and a user query. The main idea of TF-IDF is: if a word or phrase appears with a high frequency (TF) in one article and rarely appears in other articles, the word or phrase is considered to have good discriminating power and is suitable for classification. A high word frequency within a particular document, together with a low document frequency for that word across the document collection, produces a high TF-IDF weight. TF-IDF therefore tends to filter out common words and retain important ones.
The TF-IDF is calculated using the following formulas, where lg denotes the base-10 logarithm:
TF = (number of times the word appears in the document) / (total number of words in the document);
IDF = lg(total number of documents / number of documents containing the word);
TF-IDF = TF * IDF.
After the total feature word set is constructed, TF-IDF is used to correct the weights of the features, forming a plurality of feature models.
The TF-IDF word weights can be constructed using, but not limited to, the following tool: the TfidfTransformer class of scikit-learn. Note that TfidfTransformer by default uses a smoothed natural-logarithm IDF, so its values differ numerically from the formulas above.
Illustratively, assume that the space generated from the total feature word set is as follows: { "France": 1, "old": 2, "apple": 3, "betting": 4, "manager": 5, "bank": 6, "agent": 7, "oldest": 8, "stock": 9, "fund": 10 }.
The log text "Macao Xinpujing betting group recruits agents, agents for new betting, latest agent mode" contains 15 words in total. The new vector of the text is [0,0,0,2,0,0,3,0,0,0]. That is, "betting" and "agent" have the values 2 and 3, and their word frequencies are 2/15 and 3/15, respectively. The sum of these two numbers, 5/15, is a simple measure of query relevance in the whole space, namely the TF value. In the original text, the measures of "betting" and "agent" are relatively close. However, "agent" is a fairly general word that appears in many log texts; seeing it says little about the subject of the text. "Betting", by contrast, is a fairly specialized word; seeing it gives a reasonable idea of the subject of the corresponding log text. In judging the log text, "betting" is therefore more important in the relevance ranking, so each word in the text needs to be given a weight.
Assuming there are 100 million log texts in total, of which 500,000 contain "betting", the IDF of "betting" is lg(100 million / 500,000) = 2.30; if 20 million texts contain "agent", the IDF of "agent" is lg(100 million / 20 million) = 0.70.
Multiplying the TF values of "betting" and "agent" by their IDF values gives their TF-IDF values, which are 0.31 and 0.14, respectively.
Using TF-IDF, the relevance of a set of words in the text can be calculated, namely:
relevance = TF1*IDF1 + TF2*IDF2 + ... + TFn*IDFn
The log text "Macao Xinpujing betting group recruits agents, agents for new betting, latest agent mode" then has a relevance of about 0.45 (0.31 + 0.14); that is, feature identification has been performed on the log text.
In this way, feature identification is performed on the feature extraction text by using TF-IDF to generate the feature identification text.
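The worked TF-IDF example can be checked with a short calculation. The sketch below uses only the figures given in the text (15 words in the log text, 100 million log texts, 500,000 containing "betting" and 20 million containing "agent") and assumes that lg is the base-10 logarithm, which matches the stated IDF values of 2.30 and 0.70.

```python
import math


def tf(count, total_words):
    """Term frequency: occurrences of the word / words in the document."""
    return count / total_words


def idf(total_docs, docs_with_word):
    """Inverse document frequency using a base-10 logarithm."""
    return math.log10(total_docs / docs_with_word)


TOTAL_WORDS = 15
TOTAL_DOCS = 100_000_000
# word -> (count in the log text, number of log texts containing the word)
stats = {"betting": (2, 500_000), "agent": (3, 20_000_000)}

tfidf = {
    word: tf(count, TOTAL_WORDS) * idf(TOTAL_DOCS, doc_freq)
    for word, (count, doc_freq) in stats.items()
}
relevance = sum(tfidf.values())
print({w: round(v, 2) for w, v in tfidf.items()}, round(relevance, 4))
```

Although "agent" occurs more often in the text than "betting", its low IDF leaves it with less than half the weight, which is the effect the weighting is meant to achieve. Without intermediate rounding the two weights sum to roughly 0.447.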
Step 3.3, normalizing the feature identification text to generate normalized data.
Specifically, normalization is a dimensionless processing technique that converts absolute values into relative values. Its purpose is to confine the preprocessed data to a certain range (e.g., [0,1] or [-1,1]).
The normalization can be performed using, but is not limited to, the following method:
using a linear function transformation, with the following expressions:
when normalizing to the range 0-1:
$$y = \frac{x - \mathrm{MinValue}}{\mathrm{MaxValue} - \mathrm{MinValue}}$$
when normalizing to the range 0.1-0.9:
$$y = 0.1 + 0.8 \cdot \frac{x - \mathrm{MinValue}}{\mathrm{MaxValue} - \mathrm{MinValue}}$$
wherein: x is the value before conversion; y is the converted value; MaxValue is the maximum value of the sample; MinValue is the minimum value of the sample.
To confine the values of the feature identification text to a certain range, one of the above formulas is selected according to the model training effect, and the feature identification text is normalized to generate the normalized data.
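A minimal sketch of the linear normalization follows, assuming the usual min-max form in which the sample minimum maps to the lower bound of the target range and the sample maximum to the upper bound. The function name, the sample values and the handling of a constant feature are assumptions made for illustration.

```python
def min_max_scale(values, low=0.0, high=1.0):
    """Linearly rescale values so that min(values) -> low and max(values) -> high."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # Assumption: a constant feature is mapped to the lower bound.
        return [low for _ in values]
    span = vmax - vmin
    return [low + (high - low) * (v - vmin) / span for v in values]


# Hypothetical TF-IDF weights of one feature across three log texts.
weights = [0.0, 0.14, 0.31]
scaled_01 = min_max_scale(weights)            # range 0-1
scaled_19 = min_max_scale(weights, 0.1, 0.9)  # range 0.1-0.9
print(scaled_01, scaled_19)
```

Which of the two ranges to use is left to the model training effect, as stated above; the 0.1-0.9 variant simply keeps every value away from the exact endpoints.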
Step 3.4, classifying the normalized data by using an SVM (support vector machine) and establishing an SVM model.
Specifically, a support vector machine (SVM) is a generalized linear classifier that performs binary classification of data by supervised learning, i.e., a two-class classification model. Its basic model is defined as the linear classifier with the maximum margin in the feature space, and its learning strategy is margin maximization. The learning goal of a linear classifier is to find a hyperplane in the multidimensional data space that effectively separates the data points. In practice, most sample data is not linearly separable, in which case no hyperplane satisfying this condition exists. For the nonlinear case, the SVM selects a kernel function k and maps the data to a high-dimensional space, which solves the problem of linear inseparability in the original space.
Among the kernel functions of the support vector machine, the Gaussian kernel (RBF) is the most commonly used. The classification hyperplane function equation is:
$$f(x) = w^{T}x + b$$
wherein w is the parameter vector to be learned; x is a vector of the input sample set; b is the offset of the hyperplane from the origin.
Through Lagrange dual conversion, the parameter vector is obtained as:
$$w = \sum_{i} \alpha_i y_i x_i$$
wherein $\alpha_i$ is a Lagrange factor; $y_i$ is the sample target variable; $x_i$ is the sample input variable. Substituting into the hyperplane function equation gives:
$$f(x) = \sum_{i} \alpha_i y_i \langle x_i, x \rangle + b$$
wherein $\alpha_i$ is a Lagrange factor; $y_i$ is the sample target variable; $x_i$ is the sample input variable; x is a vector of the input sample set; b is the offset of the hyperplane from the origin; $\langle x_i, x \rangle$ is the inner product of training samples in the multi-dimensional space.
A Gaussian kernel function is selected to realize implicit mapping of the feature space. The Gaussian kernel function is:
$$K(x, x_i) = \exp\left(-\frac{\lVert x - x_i \rVert^{2}}{2\sigma^{2}}\right)$$
wherein x is a vector of the input sample set; $x_i$ is the center of the kernel function, i.e., the sample input variable; $\sigma$ is the width parameter of the function.
Substituting into the hyperplane function equation gives the final nonlinear hyperplane function equation:
$$f(x) = \sum_{i} \alpha_i y_i \exp\left(-\frac{\lVert x - x_i \rVert^{2}}{2\sigma^{2}}\right) + b$$
wherein $\alpha_i$ is a Lagrange factor; $\sigma$ is the width parameter of the function; $y_i$ is the sample target variable; $x_i$ is the center of the kernel function, i.e., the sample input variable; x is a vector of the input sample set; b is the offset of the hyperplane from the origin.
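To make the final decision function concrete, the sketch below evaluates a Gaussian-kernel decision function for a given set of support vectors. It assumes the common form of the kernel with 2σ² in the denominator; the support vectors, Lagrange factors, offset and width are hand-picked hypothetical values, not the result of training.

```python
import math


def rbf_kernel(x, xi, sigma):
    """Gaussian kernel exp(-||x - xi||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def decision_function(x, support_vectors, alphas, labels, b, sigma):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(
        alpha * y * rbf_kernel(x, sv, sigma)
        for alpha, y, sv in zip(alphas, labels, support_vectors)
    ) + b


def classify(x, **model):
    """Return +1 or -1 according to the sign of the decision function."""
    return 1 if decision_function(x, **model) >= 0 else -1


# Hypothetical model: one support vector per class.
model = dict(
    support_vectors=[[1.0, 1.0], [0.0, 0.0]],
    alphas=[1.0, 1.0],
    labels=[1, -1],
    b=0.0,
    sigma=1.0,
)
print(classify([0.9, 0.8], **model), classify([0.1, 0.0], **model))
```

A point close to the positive support vector receives a positive score and a point close to the negative one a negative score, since each kernel term decays with distance from its center.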
By way of example, but not limitation, the SVM model may be built with scikit-learn. The two most important parameters in scikit-learn are C and gamma. C is the penalty parameter for the error term in the training of the SVM model. The smaller the value of C, the stronger the generalization ability, but the more likely underfitting becomes; the larger the value of C, the more fully the differences among training samples are exploited, but the more likely overfitting becomes and the worse the generalization ability. Gamma determines the distribution of the sample set after it is mapped to the feature space. In scikit-learn's convention the RBF kernel is exp(-gamma * ||x - x_i||^2), so the larger the gamma, the narrower the kernel and the higher the training accuracy, but the generalization ability may be weakened and the number of support vectors tends to grow; the smaller the gamma, the smoother the decision boundary.
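How gamma shapes the kernel can be seen from a small calculation. scikit-learn writes the RBF kernel as exp(-gamma * ||x - x_i||^2), so under the width-parameter form of the Gaussian kernel gamma corresponds to 1 / (2σ²). The gamma values below are arbitrary illustrations.

```python
import math


def rbf(sq_dist, gamma):
    """RBF kernel value for a given squared distance, scikit-learn style."""
    return math.exp(-gamma * sq_dist)


def sigma_from_gamma(gamma):
    """Width parameter equivalent to gamma, from gamma = 1 / (2 * sigma^2)."""
    return math.sqrt(1.0 / (2.0 * gamma))


# Similarity of two samples at squared distance 1 for several gamma values.
for gamma in (0.1, 1.0, 10.0):
    print(gamma, round(sigma_from_gamma(gamma), 3), round(rbf(1.0, gamma), 5))
```

With a large gamma the similarity between two samples falls off very quickly with distance, so each training sample influences only its immediate neighbourhood; with a small gamma distant samples still count as similar.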
Specifically, after the SVM model is established, it is used to analyze new log texts and generate the SVM model weight.
Step four, performing a comprehensive analysis using the list weight, the keyword weight and the SVM model weight to generate a fraud log text weight, and judging the log text using the fraud log text weight.
Specifically, the fraud log text weight is generated from the list weight, the keyword weight and the SVM model weight.
The following formula may be used, but is not limiting:
fraud log text weight = list weight * (keyword weight + SVM model weight)
For any log text, a respective fraud log text weight is generated. A suspected threshold and a fraud log text threshold are defined in advance, wherein the suspected threshold < the fraud log text threshold.
When the fraud log text weight is smaller than the suspected threshold value, the log text is judged to be normal log text. When the fraud log text weight is greater than or equal to the suspected threshold and smaller than the fraud log text threshold, judging that the log text is suspected log text; the log text is manually calibrated, that is, the log text is judged to be normal log text or fraud log text manually. When the fraud log text weight is greater than or equal to the fraud log text threshold, determining that the log text is fraud log text.
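The combination and the three-way judgment can be sketched as follows. The sketch reads the combination as list weight multiplied by the sum of the keyword weight and the SVM model weight, which is consistent with the statement elsewhere in the description that a list weight of 0 yields a fraud log text weight of 0. The threshold values and sample weights are hypothetical.

```python
def fraud_weight(list_weight, keyword_weight, svm_weight):
    """Fraud log text weight = list weight * (keyword weight + SVM model weight)."""
    return list_weight * (keyword_weight + svm_weight)


def judge(weight, suspected_threshold, fraud_threshold):
    """Map a fraud log text weight to 'normal', 'suspected' or 'fraud'."""
    if not suspected_threshold < fraud_threshold:
        raise ValueError("suspected threshold must be below the fraud threshold")
    if weight < suspected_threshold:
        return "normal"
    if weight < fraud_threshold:
        return "suspected"  # handed over to manual calibration
    return "fraud"


# Hypothetical thresholds.
SUSPECTED, FRAUD = 0.5, 1.0
print(judge(fraud_weight(1.0, 0.1, 0.2), SUSPECTED, FRAUD))
print(judge(fraud_weight(1.0, 0.3, 0.4), SUSPECTED, FRAUD))
print(judge(fraud_weight(2.0, 0.5, 0.5), SUSPECTED, FRAUD))
```

Only the middle band requires a human decision, so the choice of the two thresholds directly controls the manual workload.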
After a new log text is collected and analyzed, the black-white-grey list is automatically updated with the new log text to generate a new black-white-grey list; the keywords are automatically updated to generate new keywords; and the new log text is added to the training log text, and a new SVM model is automatically generated. Subsequent log texts are analyzed using the new black-white-grey list, the new keywords and the new SVM model, so that the method improves itself automatically.
Further, step two and step three may be performed in either order.
The following procedure may also be used:
Step one, analyzing the phone number of the incoming log text by using the mobile phone number black-white-grey list to generate the list weight.
Step two, analyzing the log text by using the keywords to generate the keyword weight; and analyzing the log text by using the SVM model to generate the SVM model weight.
Step three, performing a comprehensive analysis using the list weight, the keyword weight and the SVM model weight to generate the fraud log text weight, and judging the log text using the fraud log text weight.
Specifically, if the fraud log text weight of a log text is 0, the log text is considered not to be a fraud log text.
If the list weight is 0, the keyword weight and the SVM model weight need not be generated, and the fraud log text weight is also 0.
Further, if the first list weight is defined as 0 in advance, that is, the list weight corresponding to the white list is 0, the final fraud log text weight is also 0, and the keyword weight and the SVM model weight need not be generated in that case. If the log text number is found to be in the white list, the fraud log text weight of the log text is 0, and the log text is directly judged not to be a fraud log text; if the log text number is found not to be in the white list, the keyword weight and the SVM model weight are then generated.
The following procedure may be used in this case:
Step one, analyzing the phone number of the incoming log text by using the mobile phone number black-white-grey list to generate the list weight; if the list weight is 0, executing step two; if the list weight is not 0, executing step three.
Step two, judging that the log text is not a fraud log text. End.
Step three, analyzing the log text by using the keywords to generate the keyword weight; analyzing the log text by using the SVM model to generate the SVM model weight; and executing step four.
Step four, performing a comprehensive analysis using the list weight, the keyword weight and the SVM model weight to generate the fraud log text weight, and judging the log text using the fraud log text weight. End.
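The four steps above amount to a short-circuit: text analysis is skipped entirely when the list weight is 0. A minimal sketch follows. The white-list weight of 0 comes from the description; the grey and black weights, the treatment of numbers found on no list, the phone numbers and the two scorer functions are hypothetical stand-ins.

```python
# White-list weight of 0 follows the description; the grey and black
# values are hypothetical placeholders.
LIST_WEIGHTS = {"white": 0.0, "grey": 1.0, "black": 2.0}


def list_weight(number, white_list, black_list):
    """Step one: classify the number and return its list weight."""
    if number in white_list:
        return LIST_WEIGHTS["white"]
    if number in black_list:
        return LIST_WEIGHTS["black"]
    # Assumption: a number on neither list is treated as grey.
    return LIST_WEIGHTS["grey"]


def analyze(number, text, white_list, black_list, keyword_scorer, svm_scorer):
    """Return the fraud log text weight, skipping text analysis when the list weight is 0."""
    lw = list_weight(number, white_list, black_list)
    if lw == 0:
        return 0.0  # step two: not a fraud log text, end
    kw = keyword_scorer(text)  # step three
    sw = svm_scorer(text)
    return lw * (kw + sw)  # step four


calls = []  # records which scorers were actually invoked


def keyword_scorer(text):
    calls.append("keyword")
    return 0.4  # hypothetical keyword weight


def svm_scorer(text):
    calls.append("svm")
    return 0.5  # hypothetical SVM model weight


white, black = {"13800000000"}, {"13900000000"}
skipped = analyze("13800000000", "latest agent mode", white, black,
                  keyword_scorer, svm_scorer)
calls_after_white = list(calls)
scored = analyze("13900000000", "latest agent mode", white, black,
                 keyword_scorer, svm_scorer)
print(skipped, calls_after_white, scored, calls)
```

For the white-listed number neither scorer is called, which is the saving the short-circuit procedure is meant to provide.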
If the first list weight is not 0, every log text needs to be judged, and the list weight, the keyword weight and the SVM model weight all need to be generated.
The following procedure may be used in this case:
Step one, analyzing the phone number of the incoming log text by using the mobile phone number black-white-grey list to generate the list weight; analyzing the log text by using the keywords to generate the keyword weight; and analyzing the log text by using the SVM model to generate the SVM model weight.
Step two, performing a comprehensive analysis using the list weight, the keyword weight and the SVM model weight to generate the fraud log text weight, and judging the log text using the fraud log text weight.
The present invention further includes a fraud log text analysis system based on SVM text analysis, which is shown by way of example in Fig. 2 and comprises:
a list analysis module for analyzing the phone number of the incoming log text by using the mobile phone number black-white-grey list to generate the list weight;
a keyword analysis module for analyzing the log text by using the keywords to generate the keyword weight;
an SVM analysis module for analyzing the log text by using the SVM model to generate the SVM model weight;
a comprehensive analysis module for receiving the list weight transmitted by the list analysis module, the keyword weight transmitted by the keyword analysis module and the SVM model weight transmitted by the SVM analysis module, and generating the fraud log text weight through a comprehensive analysis of the list weight, the keyword weight and the SVM model weight;
and a judging module for receiving the fraud log text weight data transmitted by the comprehensive analysis module and judging the log text by using the fraud log text weight, wherein the judgment comprises machine judgment and manual judgment.
Specifically, the mobile phone number black-white-grey list comprises a white list, a grey list and a black list;
the list analysis module comprises:
a classification component for classifying the log text number by using the white list, the grey list and the black list;
and a list weight component, which receives the classification data of the incoming log text number transmitted by the classification component and generates the list weight according to the classification.
Specifically, the keyword analysis module includes:
a judging component for judging the format of the log text;
and a keyword weight component, which receives the log text format data transmitted by the judging component and generates the keyword weight by using the keywords according to the log text format.
Specifically, the SVM analysis module includes:
an establishing component for establishing the SVM model;
an SVM analysis component, which receives the SVM model transmitted by the establishing component and analyzes the log text by using the established SVM model;
and an SVM model weight generation component, which receives the log text analysis data transmitted by the SVM analysis component and generates the SVM model weight.
Specifically, the establishing component includes:
a collecting unit for collecting the log text;
a feature extraction unit for receiving the log text transmitted by the collecting unit and extracting features of the log text;
a feature extraction text generation unit for receiving the feature extraction data transmitted by the feature extraction unit and generating a feature extraction text;
a feature identification unit for receiving the feature extraction text transmitted by the feature extraction text generation unit and performing feature identification on the feature extraction text by using TF-IDF;
a feature identification text generation unit for receiving the feature identification data transmitted by the feature identification unit and generating a feature identification text;
a normalization unit for receiving the feature identification text transmitted by the feature identification text generation unit, normalizing the feature identification text and generating normalized data;
a classification unit for receiving the normalized data transmitted by the normalization unit and classifying the normalized data by using an SVM (support vector machine);
and an establishing unit for receiving the classification data transmitted by the classification unit and establishing the SVM model.
Illustratively, the system further comprises:
a data query management module for full-text data query, harmful topic text query, model parameter management and self-learning management.
The full-text data query is used for querying all log texts;
the harmful topic text query is used for querying all fraud log texts;
the model parameter management is used for setting system model parameters;
the self-learning management is used for automatic updating of the system.
The system model parameters are internal parameters of the system, for example:
-v: the number of cross-validation folds, e.g., set to 10;
-t: the kernel function type, e.g., linear;
-h: whether to use heuristics, e.g., heuristics are used;
-c: the penalty factor, e.g., set to 100.
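The four flags resemble the command-line options of LIBSVM's `svm-train` tool (`-v` for n-fold cross validation, `-t` for the kernel type, `-h` for the shrinking heuristics, `-c` for the cost). The description does not name the tool, so this correspondence is an assumption. Under that assumption, the example settings could be turned into an argument list as sketched below; the dictionary keys and the helper name are hypothetical.

```python
# Kernel type codes as used by LIBSVM's svm-train (assumed to be the tool
# behind the flags; the description does not say so).
KERNEL_CODES = {"linear": 0, "polynomial": 1, "rbf": 2, "sigmoid": 3}


def build_args(params):
    """Translate model parameter settings into a flag/value argument list."""
    return [
        "-v", str(params["cross_validation_folds"]),
        "-t", str(KERNEL_CODES[params["kernel"]]),
        "-h", "1" if params["use_heuristics"] else "0",
        "-c", str(params["penalty"]),
    ]


# The example settings from the description.
params = {
    "cross_validation_folds": 10,
    "kernel": "linear",
    "use_heuristics": True,
    "penalty": 100,
}
args = build_args(params)
print(" ".join(args))
```

Keeping the settings in one dictionary makes it straightforward for the model parameter management function to store and edit them as a unit.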
The system is updated automatically as follows: after a new log text is collected and analyzed, the black-white-grey list is automatically updated with the new log text to generate a new black-white-grey list; the keywords are automatically updated to generate new keywords; the new log text is added to the training log text and a new SVM model is automatically generated; and the list analysis module, the keyword analysis module and the SVM analysis module are updated with the new black-white-grey list, the new keywords and the new SVM model. In this way the system model is continuously improved and the system updates itself automatically.
According to the method, the log text is judged through a comprehensive analysis of the list weight, the keyword weight and the SVM model weight. Classification of log texts is thus automated by using the SVM, and the problem that some fraud texts resemble normal texts and cannot be classified correctly is addressed. The method judges a log text comprehensively, from its source through to its content, which reduces the misjudgment rate, improves the accuracy of log text identification and saves time.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A fraud log text analysis method based on SVM text analysis is characterized in that,
the method comprises the following steps:
analyzing the incoming log text number by using a mobile phone number black-white-grey list to generate a list weight;
analyzing the log text by using the keywords to generate keyword weights;
analyzing the log text by using an SVM model to generate an SVM model weight;
and comprehensively analyzing by using the list weight, the keyword weight and the SVM model weight to generate a fraud log text weight, and judging the log text by using the fraud log text weight.
2. The analytical method according to claim 1,
the mobile phone number black-white-grey list comprises a white list, a grey list and a black list;
the analyzing of the incoming log text number by using the mobile phone number black-white-grey list comprises:
classifying the log text number by using the white list, the grey list and the black list;
and generating list weight according to the classification.
3. The analytical method according to claim 1,
the analyzing the log text by using the keywords and generating the keyword weight comprises the following steps:
judging the format of the log text;
and generating the weight of the keyword by using the keyword according to the log text format.
4. The analytical method according to claim 1,
the analyzing of the log text by using an SVM model to generate an SVM model weight comprises:
and establishing an SVM model, and analyzing the log text by using the SVM model to generate SVM model weight.
5. The analytical method according to claim 4,
the establishing of the SVM model comprises the following steps:
collecting a training log text, and performing feature extraction on the training log text to generate a feature extraction text;
performing feature identification on the feature extraction text by using TF-IDF to generate a feature identification text;
normalizing the feature identification text to generate normalized data;
and classifying the normalized data by using an SVM (support vector machine), and establishing an SVM model.
6. A fraud log text analysis system based on SVM text analysis is characterized in that,
the system comprises:
the list analysis module is used for analyzing the incoming log text number by using a mobile phone number black, white and grey list to generate a list weight;
the keyword analysis module is used for analyzing the log text by using the keywords and generating keyword weights;
the SVM analysis module is used for analyzing the log text by using an SVM model and generating SVM model weight;
the comprehensive analysis module is used for comprehensively analyzing by using the list weight, the keyword weight and the SVM model weight to generate a fraud log text weight;
and the judging module is used for judging the log text by utilizing the fraud log text weight.
7. The analytical system of claim 6,
the mobile phone number black, white and grey lists comprise a white list, a grey list and a black list;
the list analysis module comprises:
a classification component for classifying the log text number by using the white list, the grey list and the black list;
and the list weight component is used for generating the list weight according to the classification.
8. The analytical system of claim 6,
the keyword analysis module comprises:
the judging component is used for judging the format of the log text;
and the keyword weight component is used for generating keyword weight by using keywords according to the log text format.
9. The analytical system of claim 6,
the SVM analysis module comprises:
a building component for building an SVM model;
an SVM analysis component for analyzing the log text using the SVM model;
and an SVM model weight generation component for generating the SVM model weight.
10. The analytical system of claim 9,
the set-up component comprises:
the collecting unit is used for collecting training log texts;
the feature extraction unit is used for extracting features of the training log text;
the feature extraction text generation unit is used for generating a feature extraction text;
the feature identification unit is used for performing feature identification on the feature extraction text by using TF-IDF;
a feature identification text generation unit for generating a feature identification text;
the normalization unit is used for normalizing the feature identification text to generate normalized data;
the classification unit is used for classifying the normalized data by using an SVM;
and the establishing unit is used for establishing the SVM model.
CN202010490624.1A 2020-06-02 2020-06-02 Fraud log text analysis method and system based on SVM text analysis Pending CN111666412A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010490624.1A CN111666412A (en) 2020-06-02 2020-06-02 Fraud log text analysis method and system based on SVM text analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010490624.1A CN111666412A (en) 2020-06-02 2020-06-02 Fraud log text analysis method and system based on SVM text analysis

Publications (1)

Publication Number Publication Date
CN111666412A true CN111666412A (en) 2020-09-15

Family

ID=72385539

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010490624.1A Pending CN111666412A (en) 2020-06-02 2020-06-02 Fraud log text analysis method and system based on SVM text analysis

Country Status (1)

Country Link
CN (1) CN111666412A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113704405A (en) * 2021-08-30 2021-11-26 平安银行股份有限公司 Quality control scoring method, device, equipment and storage medium based on recording content

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8023974B1 (en) * 2007-02-15 2011-09-20 Trend Micro Incorporated Lightweight SVM-based content filtering system for mobile phones
CN104301896A (en) * 2014-10-15 2015-01-21 上海欣方智能***有限公司 Intelligent fraud short message monitor and alarm system and method
CN109446423A (en) * 2018-10-26 2019-03-08 北京捷报数据技术有限公司 A kind of Judgment by emotion system and method for news and text

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王文霞: "基于贝叶斯文本分类算法的垃圾短信过滤***", 《山西大同大学学报(自然科学版)》 *

Similar Documents

Publication Publication Date Title
CN110826320B (en) Sensitive data discovery method and system based on text recognition
CN111767716B (en) Method and device for determining enterprise multi-level industry information and computer equipment
CN103886108B (en) The feature selecting and weighing computation method of a kind of unbalanced text set
US20210286835A1 (en) Method and device for matching semantic text data with a tag, and computer-readable storage medium having stored instructions
CN112632989B (en) Method, device and equipment for prompting risk information in contract text
CN101763431A (en) PL clustering method based on massive network public sentiment information
CN110390084A (en) Text duplicate checking method, apparatus, equipment and storage medium
CN111144106A (en) Two-stage text feature selection method under unbalanced data set
CN117556225B (en) Pedestrian credit data risk management system
Sharma et al. Novel use of logistic regression and likelihood ratios for the estimation of gender of the writer from a database of handwriting features
CN111666412A (en) Fraud log text analysis method and system based on SVM text analysis
CN112464670A (en) Recognition method, recognition model training method, device, equipment and storage medium
CN107483420B (en) Information auditing device and method
CN111598691A (en) Method, system and device for evaluating default risk of credit/debt main body
CN110135509A (en) A kind of intelligent finance credit-graded approach neural network based
CN113407734B (en) Method for constructing knowledge graph system based on real-time big data
CN112699949B (en) Potential user identification method and device based on social platform data
CN112069392B (en) Method and device for preventing and controlling network-related crime, computer equipment and storage medium
Babu et al. Identifying fake news using machine learning
CN114741501A (en) Public opinion early warning method and device, readable storage medium and electronic equipment
CN111666765A (en) Fraud topic analysis method and system based on k-means text clustering
Küster et al. The Informational Content of Key Audit Matters: Evidence from Using Artificial Intelligence in Textual Analysis
CN111951105A (en) Intelligent credit wind control system based on multidimensional big data analysis
Zhou et al. Keyword extraction based on random forest and XGBoost-an example of fraud judgment document
McElwee et al. Social Media, Money, and Politics: Campaign Finance in the 2016 US Congressional Cycle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200915