CN112380323A - Junk information removing system and method based on Chinese word segmentation recognition technology

Info

Publication number: CN112380323A
Application number: CN202011391134.2A
Authority: CN
Inventors: 杨奚诚; 王诚; 熊瑛; 卢倩; 夏洋阳
Original assignee: Hefei D2s Soft Information Technology Co ltd
Current assignee: Hefei D2s Soft Information Technology Co ltd
Priority date: 2020-12-01
Filing date: 2020-12-01
Publication date: 2021-02-19

Abstract

The invention discloses a junk information removing system and method based on a Chinese word segmentation recognition technology, relates to the technical field of junk information recognition, and solves the technical problems of the prior art that the spam short message recognition rate is low and the working efficiency is poor. Short messages received by an intelligent terminal are first preliminarily screened according to their sending numbers to obtain preliminarily screened short messages; verification keywords are then extracted from these short messages by a word segmentation technology and matched against a sensitive word bank; finally, the short messages are judged by an intelligent model. This triple detection improves both the accuracy of junk information identification and the working efficiency of the invention. The short message preprocessing module helps improve the efficiency of removing spam short messages, and the short message analysis module helps improve the spam short message recognition rate and ensures that the intelligent terminal is not disturbed by spam short messages.

Description

Junk information removing system and method based on Chinese word segmentation recognition technology

Technical Field

The invention belongs to the field of junk information identification, relates to Chinese word segmentation recognition technology, and in particular relates to a junk information removing system and method based on that technology.

Background

The short message service, as a basic service of the mobile communication network, provides users with convenient message communication. At the same time, it has become a channel for sending illegal short messages and causes considerable harm, for example crimes such as fraud committed by short message and the spreading of false information and rumors by short message.

The invention patent with publication number CN103874033A discloses a method for recognizing irregularly arranged spam short messages based on Chinese word segmentation. According to the content of a short message, the method performs Chinese word segmentation on the message as read in the normal horizontal direction and calculates a weight from the number of words in the segmentation result. Because an irregularly arranged short message must control the number of characters on each line, the method uses this characteristic to determine the range of the irregularly arranged content, converts the characters in that range from vertical to horizontal arrangement, performs Chinese word segmentation again, and calculates a weight from the number of words in the overall segmentation result. The two weights are then compared to judge whether the short message is arranged normally or irregularly.

That scheme analyzes the content according to its arrangement type, matches keywords, and identifies whether the message is a spam short message, which avoids missed detections and improves the recall and precision of spam short message recognition. However, the scheme mainly targets irregularly arranged spam short messages, so the form of short message it can identify is limited and its practicability is reduced. The scheme therefore needs further improvement.
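The vertical-to-horizontal conversion used by the cited scheme can be pictured with a small sketch. It assumes that an irregularly arranged message hides its text in columns read top to bottom and left to right; the function name `read_vertically` and the sample message are hypothetical, and the weighting step of CN103874033A is not reproduced here.

```python
from itertools import zip_longest


def read_vertically(message: str) -> str:
    """Re-read a multi-line message column by column, top to bottom."""
    lines = message.splitlines()
    # Short lines are padded with empty strings so every column is complete.
    columns = zip_longest(*lines, fillvalue="")
    return "".join("".join(column) for column in columns)


# Hypothetical message: each line is meaningless when read horizontally.
vertical_message = "AD\nBE\nCF"
print(read_vertically(vertical_message))  # prints ABCDEF
```

The re-read text can then be passed to the same word segmenter as the horizontally read text, so that the two segmentation results can be compared.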

Disclosure of Invention

In order to solve the problems existing in the scheme, the invention provides a junk information removing system and method based on a Chinese word segmentation recognition technology.

The purpose of the invention can be realized by the following technical scheme: a junk information removing system based on Chinese word segmentation recognition technology comprises a processor, an IP analysis module, an information publishing module, a data storage module, a short message preprocessing module, an intelligent model module and a short message analysis module;

the short message preprocessing module is used for preprocessing the short message received by the intelligent terminal to obtain a preliminarily screened short message, and sending the preliminarily screened short message to the short message analysis module through the processor;

the short message analysis module analyzes the preliminarily screened short messages through the keyword analysis technology and the intelligent model in sequence, screens out spam short messages according to the analysis result, and sends an IP analysis signal to the IP analysis module through the processor;

the intelligent model module is used for acquiring an intelligent model;

the IP analysis module is used for analyzing the IP address of the junk short message.

Preferably, the short message preprocessing module is configured to perform preliminary screening on the short messages, and includes:

the intelligent terminal receives the short message and then sends the short message to the short message preprocessing module; the intelligent terminal comprises a smartphone and a tablet computer;

the short message preprocessing module acquires the sending number of the short message after receiving the short message, and acquires the short message mark database stored in the data storage module through the processor;

the sending number is matched against the numbers in the short message mark database; when a match is found, the short message corresponding to the sending number is intercepted and automatically removed from the intelligent terminal; when no match is found, the short message is marked as a preliminarily screened short message, and the preliminarily screened short message and a short message analysis signal are sent to the short message analysis module through the processor;

and the sending record of the short message analysis signal is sent to the data storage module for storage through the processor.
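The number-based screening described above amounts to a set lookup. The following is a minimal sketch under stated assumptions: the `ShortMessage` type, the `prescreen` function, and all numbers and texts are hypothetical, and the mark database is modeled as an in-memory set rather than a stored database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShortMessage:
    sender: str  # sending number
    text: str


def prescreen(messages, mark_database):
    """Split messages into (intercepted, preliminarily_screened) by sending number."""
    intercepted, screened = [], []
    for message in messages:
        if message.sender in mark_database:
            intercepted.append(message)  # match found: remove from the terminal
        else:
            screened.append(message)     # no match: pass on for analysis
    return intercepted, screened


# Hypothetical numbers and texts, for illustration only.
mark_database = {"10690000123"}
inbox = [
    ShortMessage("10690000123", "limited offer"),
    ShortMessage("13800000000", "see you at dinner"),
]
intercepted, screened = prescreen(inbox, mark_database)
print([m.sender for m in intercepted], [m.sender for m in screened])
```

Because this check needs only the sending number, it can discard messages from marked numbers before any text analysis is run.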

Preferably, the short message analysis module is configured to analyze the preliminarily screened short message, and includes:

acquiring the sensitive word bank in the data storage module through the processor; the sensitive word bank comprises keywords of at least one sensitive word type, and the sensitive word types comprise drug-related and pornography-related;

extracting verification keywords from the preliminarily screened short message by a Chinese word segmentation technology and matching them against the keywords in the sensitive word bank; when a verification keyword matches a keyword in the sensitive word bank, judging the preliminarily screened short message to be a spam short message and automatically removing it from the intelligent terminal; when no verification keyword matches, acquiring the intelligent model in the data storage module;

converting the preliminarily screened short message into an input array, marking the input array as a verification input array, and inputting the verification input array into the intelligent model to judge the preliminarily screened short message;

and when the preliminarily screened short message is judged to be a spam short message, automatically removing it from the intelligent terminal.
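The keyword stage can be sketched as follows. The source does not name a particular segmentation algorithm, so this example swaps in forward maximum matching, a simple dictionary-based technique, purely as an assumption. The functions `segment` and `analyse`, the vocabulary, the sample sensitive word and the stand-in model are all hypothetical.

```python
def segment(text, vocabulary, max_len=4):
    """Forward maximum matching: take the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted, so the loop always advances.
            if size == 1 or piece in vocabulary:
                words.append(piece)
                i += size
                break
    return words


def analyse(text, vocabulary, sensitive_words, model):
    """Keyword check first; fall back to the model only when no keyword matches."""
    keywords = segment(text, vocabulary)
    if any(word in sensitive_words for word in keywords):
        return "spam: keyword match"
    return "spam: model" if model(keywords) else "not spam"


# Hypothetical data: 贷款 ("loan") stands in for an entry of the sensitive word bank.
vocabulary = {"免费", "贷款", "联系", "今晚", "吃饭"}
sensitive_words = {"贷款"}
never_spam = lambda keywords: False  # stand-in for the trained intelligent model

print(segment("免费贷款请联系", vocabulary))
print(analyse("免费贷款请联系", vocabulary, sensitive_words, never_spam))
print(analyse("今晚吃饭", vocabulary, sensitive_words, never_spam))
```

Running the keyword check before the model means the model is consulted only for messages that contain no entry of the sensitive word bank.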

Preferably, the intelligent model module is configured to train the neural network model to obtain an intelligent model, and includes:

acquiring a spam short message database through the Internet, and numbering the spam short messages; each number has 5 digits in the form 1+01+00, where the first digit (1 in the example) represents the arrangement rule of the spam short message, the arrangement rules comprising horizontal and vertical arrangement; the next two digits (01 in the example) represent the sensitive word type, with 00 representing the mixed type, 01 drug-related and 02 pornography-related; and the last two digits (00 in the example) represent the number of sensitive words;

preprocessing the spam short messages, converting the preprocessed spam short messages into input arrays of a neural network model, and taking the numbers corresponding to the spam short messages as output arrays of the neural network model to train it; the neural network model comprises an error feedforward neural network and an RBF neural network;

and marking the trained neural network model as an intelligent model, and sending the intelligent model to a data storage module for storage through a processor.
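The 5-digit numbering used as the training target can be illustrated with a short encoder and decoder. This is a sketch only: the text gives the type codes 00, 01 and 02 but does not state which digit denotes which arrangement, so the mapping 1 = horizontal, 2 = vertical is an assumption, as are the function names.

```python
ARRANGEMENT_DIGIT = {"horizontal": 1, "vertical": 2}   # digit values are assumed
TYPE_CODE = {"mixed": 0, "drug": 1, "pornography": 2}  # 00, 01, 02 per the text


def encode_number(arrangement, word_type, sensitive_count):
    """Build the 5-digit number: arrangement (1) + type (2) + word count (2)."""
    if not 0 <= sensitive_count <= 99:
        raise ValueError("sensitive word count must fit in two digits")
    return f"{ARRANGEMENT_DIGIT[arrangement]}{TYPE_CODE[word_type]:02d}{sensitive_count:02d}"


def decode_number(number):
    """Split a 5-digit number back into its three fields."""
    if len(number) != 5 or not number.isdigit():
        raise ValueError("expected a 5-digit number")
    arrangement = {v: k for k, v in ARRANGEMENT_DIGIT.items()}[int(number[0])]
    word_type = {v: k for k, v in TYPE_CODE.items()}[int(number[1:3])]
    return arrangement, word_type, int(number[3:])


print(encode_number("horizontal", "drug", 0))  # prints 10100
print(decode_number("20203"))
```

Under these assumptions the example number 1+01+00 from the text corresponds to the string `10100`.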

Preferably, the IP analysis module analyzes the IP address of the spam short message after receiving the IP analysis signal, and adds the IP address to an IP blacklist when the number of spam short messages sent from that IP address exceeds a preset threshold.
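The blacklist rule is a per-address counter with a threshold. A minimal sketch, assuming an in-memory counter; the class name `IPAnalyzer`, its method name and the threshold value are hypothetical, and the address comes from a range reserved for documentation.

```python
from collections import Counter


class IPAnalyzer:
    def __init__(self, max_spam_count):
        self.max_spam_count = max_spam_count  # preset threshold
        self.spam_counts = Counter()
        self.blacklist = set()

    def on_ip_analysis_signal(self, ip_address):
        """Record one spam short message from ip_address; return True if blacklisted."""
        self.spam_counts[ip_address] += 1
        if self.spam_counts[ip_address] > self.max_spam_count:
            self.blacklist.add(ip_address)
        return ip_address in self.blacklist


analyzer = IPAnalyzer(max_spam_count=2)
results = [analyzer.on_ip_analysis_signal("203.0.113.7") for _ in range(3)]
print(results)  # prints [False, False, True]
```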

Preferably, the information publishing module is used for publishing the removal result of spam short messages and periodically publishing the removal record of spam short messages to the intelligent terminal.

Preferably, the short message mark database is generated through a third-party platform, which includes:

generating an empty short message mark database through the processor;

acquiring a harassment number statistical table through the third-party platform; the third-party platform comprises China Mobile, China Unicom and China Telecom, and the numbers in the harassment number statistical table are numbers marked as harassing calls by users of the third-party platform;

acquiring the number of times each number in the harassment number statistical table has been marked, and denoting it as the marking count BC;

when the marking count BC is greater than L1, storing the corresponding number in the short message mark database, where L1 is a preset marking count threshold and L1 > 0;

and sending the short message mark database to the data storage module for storage through the processor.
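The construction of the short message mark database reduces to filtering the statistical table by the condition BC > L1. A minimal sketch, assuming the table is available as a mapping from number to marking count; the function name and all sample values are hypothetical.

```python
def build_mark_database(harassment_table, l1):
    """Keep every number whose marking count BC is greater than the threshold L1."""
    if l1 <= 0:
        raise ValueError("L1 must be greater than 0")
    mark_database = set()  # start from an empty short message mark database
    for number, bc in harassment_table.items():
        if bc > l1:
            mark_database.add(number)
    return mark_database


# Hypothetical statistical table: number -> times marked by platform users.
harassment_table = {"10690000123": 57, "13800000000": 1, "13900000000": 10}
print(sorted(build_mark_database(harassment_table, l1=10)))
```

Note that the comparison is strict, so a number marked exactly L1 times is not stored.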

Preferably, the processor is in communication connection with each of the IP analysis module, the information publishing module, the data storage module, the short message preprocessing module, the intelligent model module and the short message analysis module, and the data storage module is in communication connection with the information publishing module.

A junk information removing method based on a Chinese word segmentation recognition technology comprises the following steps:

Step one: the intelligent terminal receives the short message and then sends the short message to the short message preprocessing module; the short message preprocessing module acquires the sending number of the short message after receiving the short message, and acquires the short message mark database stored in the data storage module through the processor; the sending number is matched against the numbers in the short message mark database; when a match is found, the short message corresponding to the sending number is intercepted and automatically removed from the intelligent terminal; when no match is found, the short message is marked as a preliminarily screened short message, and the preliminarily screened short message and a short message analysis signal are sent to the short message analysis module through the processor;

Step two: the sensitive word bank in the data storage module is acquired through the processor; verification keywords are extracted from the preliminarily screened short message by a Chinese word segmentation technology and matched against the keywords in the sensitive word bank; when a verification keyword matches a keyword in the sensitive word bank, the preliminarily screened short message is judged to be a spam short message and automatically removed from the intelligent terminal; when no verification keyword matches, the intelligent model in the data storage module is acquired; the preliminarily screened short message is converted into an input array, the input array is marked as a verification input array, and the verification input array is input into the intelligent model to judge the preliminarily screened short message; and when the preliminarily screened short message is judged to be a spam short message, it is automatically removed from the intelligent terminal.
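The two steps together form a three-stage check, which can be sketched end to end as follows. Everything here is an illustrative assumption: the segmenter is passed in as a callable (the demo uses `str.split` on space-separated text as a stand-in for a Chinese word segmenter), the input array is a simple word-count vector, and the model is a threshold function standing in for the trained neural network.

```python
def to_input_array(words, feature_words):
    """Count how often each feature word occurs (a simple bag-of-words array)."""
    return [words.count(feature) for feature in feature_words]


def remove_junk(sender, text, mark_database, sensitive_words,
                segmenter, feature_words, model):
    # Check 1: sending number against the short message mark database.
    if sender in mark_database:
        return "intercepted: marked number"
    # Check 2: verification keywords against the sensitive word bank.
    words = segmenter(text)
    if sensitive_words.intersection(words):
        return "removed: sensitive keyword"
    # Check 3: verification input array judged by the intelligent model.
    if model(to_input_array(words, feature_words)):
        return "removed: model"
    return "delivered"


# Hypothetical configuration, for illustration only.
config = dict(
    mark_database={"10690000123"},
    sensitive_words={"bannedword"},
    segmenter=str.split,                       # stand-in for a Chinese word segmenter
    feature_words=["free", "prize", "click"],
    model=lambda array: sum(array) >= 2,       # stand-in for the trained neural network
)
samples = [
    ("10690000123", "hello"),
    ("13800000000", "buy bannedword now"),
    ("13800000000", "free prize click here"),
    ("13800000000", "see you at dinner"),
]
for sender, text in samples:
    print(remove_junk(sender, text, **config))
```

The ordering follows the text: the cheapest check runs first, and each later check sees only the messages that the earlier ones let through.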

Compared with the prior art, the invention has the beneficial effects that:

1. Short messages received by the intelligent terminal are first preliminarily screened according to their sending numbers to obtain preliminarily screened short messages; verification keywords are then extracted from the short messages by a word segmentation technology and matched against the sensitive word bank; finally, the short messages are judged by the intelligent model. This triple detection improves the accuracy of junk information identification and the working efficiency of the invention;

2. The invention is provided with a short message preprocessing module for preliminarily screening short messages. The intelligent terminal receives a short message and sends it to the short message preprocessing module; the module acquires the sending number of the short message and, through the processor, the short message mark database stored in the data storage module; the sending number is matched against the numbers in the short message mark database; when a match is found, the short message is intercepted and automatically removed from the intelligent terminal; when no match is found, the short message is marked as a preliminarily screened short message and sent, together with a short message analysis signal, to the short message analysis module through the processor. By screening on the sending number, the short message preprocessing module achieves preliminary screening of short messages and helps improve the efficiency of removing spam short messages;

3. The invention is provided with a short message analysis module for analyzing the preliminarily screened short messages. The sensitive word bank in the data storage module is acquired through the processor; verification keywords are extracted from the preliminarily screened short message by a Chinese word segmentation technology and matched against the keywords in the sensitive word bank; when a verification keyword matches, the preliminarily screened short message is judged to be a spam short message and automatically removed from the intelligent terminal; when no verification keyword matches, the intelligent model in the data storage module is acquired, the preliminarily screened short message is converted into an input array marked as the verification input array, and the verification input array is input into the intelligent model to judge the short message; when the short message is judged to be a spam short message, it is automatically removed from the intelligent terminal. The short message analysis module thus further screens the preliminarily screened short messages through the Chinese word segmentation technology and the intelligent model in sequence and automatically removes the spam short messages it finds, which helps improve the spam short message recognition rate and ensures that the intelligent terminal is not disturbed by spam short messages.

Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of the principle of the present invention.

Detailed Description

The technical solutions of the present invention will be described clearly and completely with reference to the following embodiments, and it should be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to Fig. 1, a junk information removing system based on Chinese word segmentation recognition technology includes a processor, an IP analysis module, an information publishing module, a data storage module, a short message preprocessing module, an intelligent model module, and a short message analysis module;

the intelligent model module is used for acquiring an intelligent model;

Further, the short message preprocessing module is used for preliminarily screening the short messages, and comprises:

the intelligent terminal receives the short message and then sends the short message to the short message preprocessing module; the intelligent terminal comprises a smartphone and a tablet computer;

Further, the short message analysis module is used for analyzing the preliminarily screened short message, and comprises:

acquiring the sensitive word bank in the data storage module through the processor; the sensitive word bank comprises keywords of at least one sensitive word type, and the sensitive word types comprise drug-related and pornography-related;

Further, the intelligent model module is used for training the neural network model to obtain the intelligent model, and comprises:

acquiring a spam short message database through the Internet, and numbering the spam short messages; each number has 5 digits in the form 1+01+00, where the first digit (1 in the example) represents the arrangement rule of the spam short message, the arrangement rules comprising horizontal and vertical arrangement; the next two digits (01 in the example) represent the sensitive word type, with 00 representing the mixed type, 01 drug-related and 02 pornography-related; and the last two digits (00 in the example) represent the number of sensitive words;

Further, the IP analysis module analyzes the IP address of the spam short message after receiving the IP analysis signal, and adds the IP address to an IP blacklist when the number of spam short messages sent from that IP address exceeds a preset threshold.

Further, the information publishing module is used for publishing the removal result of spam short messages and periodically publishing the removal record of spam short messages to the intelligent terminal.

Further, the short message mark database is generated through a third-party platform, which comprises:

generating an empty short message mark database through the processor;

acquiring a harassment number statistical table through the third-party platform; the third-party platform comprises China Mobile, China Unicom and China Telecom, and the numbers in the harassment number statistical table are numbers marked as harassing calls by users of the third-party platform;

when the marking count BC of a number is greater than L1, storing that number in the short message mark database, where L1 is a preset marking count threshold;

Further, the processor is in communication connection with each of the IP analysis module, the information publishing module, the data storage module, the short message preprocessing module, the intelligent model module and the short message analysis module, and the data storage module is in communication connection with the information publishing module.

A junk information removing method based on Chinese word segmentation recognition technology comprises the following steps:

The above formulas are all computed on dimensionless numerical values; each formula is the one closest to the real situation, obtained by collecting a large amount of data and performing software simulation, and the preset parameters in the formulas are set by those skilled in the art according to the actual situation.

The working principle of the invention is as follows:

the intelligent terminal receives the short message and then sends the short message to the short message preprocessing module; the short message preprocessing module acquires the sending number of the short message after receiving the short message, and acquires the short message mark database stored in the data storage module through the processor; the sending number is matched against the numbers in the short message mark database; when a match is found, the short message corresponding to the sending number is intercepted and automatically removed from the intelligent terminal; when no match is found, the short message is marked as a preliminarily screened short message, and the preliminarily screened short message and a short message analysis signal are sent to the short message analysis module through the processor;

the sensitive word bank in the data storage module is acquired through the processor; verification keywords are extracted from the preliminarily screened short message by a Chinese word segmentation technology and matched against the keywords in the sensitive word bank; when a verification keyword matches a keyword in the sensitive word bank, the preliminarily screened short message is judged to be a spam short message and automatically removed from the intelligent terminal; when no verification keyword matches, the intelligent model in the data storage module is acquired; the preliminarily screened short message is converted into an input array, the input array is marked as a verification input array, and the verification input array is input into the intelligent model to judge the preliminarily screened short message; and when the preliminarily screened short message is judged to be a spam short message, it is automatically removed from the intelligent terminal.

In the description herein, references to the description of "one embodiment," "an example," "a specific example" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

The foregoing is merely exemplary and illustrative of the present invention and various modifications, additions and substitutions may be made by those skilled in the art to the specific embodiments described without departing from the scope of the invention as defined in the following claims.

Claims

1. A junk information removing system based on Chinese word segmentation recognition technology is characterized by comprising a processor, an IP analysis module, an information publishing module, a data storage module, a short message preprocessing module, an intelligent model module and a short message analysis module;

the intelligent model module is used for acquiring an intelligent model;

2. The system of claim 1, wherein the short message preprocessing module is configured to perform preliminary screening on short messages, and comprises:

3. The system of claim 1, wherein the short message analysis module is configured to analyze the preliminarily screened short message, and comprises:

4. The system of claim 1, wherein the intelligent model module is configured to train a neural network model to obtain the intelligent model, and comprises:

acquiring a spam message database through the Internet, and numbering spam messages;

5. The junk information removing system according to claim 1, wherein the IP analysis module analyzes the IP address of the spam short message after receiving the IP analysis signal, and adds the IP address to an IP blacklist when the number of spam short messages sent from that IP address exceeds a preset threshold.

6. The system of claim 1, wherein the information publishing module is configured to publish the removal result of spam short messages and periodically publish the removal record of spam short messages to the intelligent terminal.

7. A junk information removing method based on a Chinese word segmentation recognition technology is characterized by comprising the following steps: