CN103605690A - Device and method for recognizing advertising messages in instant messaging - Google Patents

Info

Publication number
CN103605690A
Authority
CN
China
Prior art keywords
instant message
text
characteristic
feature
proper vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310537715.6A
Other languages
Chinese (zh)
Inventor
孙林
陈培军
秦吉胜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd filed Critical Beijing Qihoo Technology Co Ltd
Priority to CN201310537715.6A priority Critical patent/CN103605690A/en
Publication of CN103605690A publication Critical patent/CN103605690A/en
Priority to PCT/CN2014/087175 priority patent/WO2015062377A1/en
Priority to US15/034,307 priority patent/US20160283582A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/04 Real-time or near real-time messaging, e.g. instant messaging [IM]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a device and a method for recognizing advertising messages in instant messaging. The method comprises detecting a text field in an instant message sent by an instant messaging client, extracting one or more feature vectors contained in the text field, and recognizing, according to the feature vectors, instant messages that match an advertising message. In this way, advertisements in instant messaging can be recognized effectively, and the corresponding shielding or send-prohibition (muting) management can be applied.

Description

Device and method for recognizing advertising messages in instant messaging
Technical Field
The present invention relates to the field of computer networks, and in particular to a device and a method for recognizing advertising messages in instant messaging.
Background
With the development of the Internet, network applications of all kinds, and instant messaging software in particular, have become an important channel through which people obtain and exchange information. However, a considerable share of the instant messages exchanged consists of advertising content, which inconveniences users and lowers the quality of instant messaging.
Summary of the Invention
In view of the above problems, the present invention is proposed to provide a device for recognizing advertising messages in instant messaging, and a corresponding method, that overcome or at least partially solve those problems.
According to one aspect of the present invention, a device for recognizing advertising messages in instant messaging is provided, comprising: a text acquisition unit, adapted to detect the text field in an instant message sent by an instant messaging client; a feature vector extraction unit, adapted to extract one or more feature vectors contained in the text field; and a recognition unit, adapted to recognize, according to the feature vectors, instant messages that match an advertising message.
Optionally, the device further comprises a shielding unit, adapted to shield an instant message when the recognition unit recognizes that it matches an advertising message.
Optionally, the device further comprises a management unit, adapted, when the recognition unit recognizes an instant message that matches an advertising message, to mark that instant message and the client that sent it, and not to forward instant messages sent by that client for a predetermined period of time.
Optionally, the recognition unit is adapted to judge, according to the feature vector, whether the instant message matches a record in an advertisement feature database.
Optionally, the recognition unit is adapted to detect, for each feature in the feature vector, whether the feature occurs repeatedly in the advertisement feature database; the recognition unit is further adapted to judge whether the proportion of features in the feature vector that occur repeatedly in the advertisement feature database, relative to all features of the feature vector, reaches a first threshold. If so, the instant message is determined to match a record in the advertisement feature database; otherwise it does not match.
Optionally, the recognition unit is adapted, for each feature in the feature vector, to look up whether the feature exists in the advertisement feature database and, if it exists, to check the weight of the feature; if the weight is greater than or equal to a second threshold, the feature is deemed to occur repeatedly in the advertisement feature database.
Optionally, the device further comprises an advertisement feature database update unit, adapted, when the instant message is determined to match a record in the advertisement feature database, to add 1 to the weight of each feature of the feature vector that is found in the advertisement feature database.
Optionally, the recognition unit is adapted, before detecting whether each feature in the feature vector exists in the advertisement feature database, to judge whether the number of features in the feature vector is less than a third threshold. If so, the instant message does not match any record in the advertisement feature database and the judgment ends; otherwise the recognition unit detects, for each feature in the feature vector, whether the feature occurs repeatedly in the advertisement feature database.
Optionally, the feature vector extraction unit comprises: a Chinese text acquisition subunit, adapted to perform text processing on the text field to obtain Chinese text; a pinyin text acquisition subunit, adapted to convert the Chinese characters in the obtained Chinese text into pinyin to obtain pinyin text; and a fingerprint acquisition subunit, adapted to extract features of the pinyin text and to build the feature vector of the pinyin text from the extracted features.
Optionally, the Chinese text acquisition subunit is adapted to perform a data cleansing operation on the text field, converting the content of the text field into regular characters; to convert pinyin into Chinese characters; and to retain only commonly used Chinese characters.
Optionally, the Chinese text acquisition subunit is adapted to identify and discard HTML tags, convert traditional Chinese characters to simplified Chinese characters, convert full-width characters to half-width characters, convert uppercase English letters to lowercase, and identify and discard URLs and punctuation marks, so that the content of the text field is converted into regular characters; the Chinese text acquisition subunit is adapted to convert the pinyin in the text into Chinese characters using a bidirectional maximum matching algorithm and, if a pinyin syllable corresponds to several Chinese characters, to select any one of them; the Chinese text acquisition subunit is adapted to filter the text field using the commonly used Chinese characters in the GBK code table, discarding all characters that are not commonly used Chinese characters.
Optionally, the pinyin text acquisition subunit is adapted to use a Chinese character to pinyin lookup table to convert each Chinese character into the corresponding pinyin string, so as to obtain the pinyin text.
Optionally, the fingerprint acquisition subunit is adapted to extract the features of the pinyin text using a single Chinese character as the segmentation granularity, and to use a vector space model to build the feature vector of the pinyin text from the extracted features.
According to another aspect of the present invention, a method for recognizing advertising messages in instant messaging is provided, comprising: detecting the text field in an instant message sent by an instant messaging client; extracting one or more feature vectors contained in the text field; and recognizing, according to the feature vectors, instant messages that match an advertising message.
Optionally, the method further comprises: when an instant message matching an advertising message is recognized, shielding that instant message.
Optionally, when an instant message matching an advertising message is recognized, that instant message and the client that sent it are marked, and instant messages sent by that client are not forwarded for a predetermined period of time.
Optionally, recognizing, according to the feature vectors, instant messages that match an advertising message specifically comprises: judging, according to the feature vector, whether the instant message matches a record in an advertisement feature database.
Optionally, judging according to the feature vector whether the instant message matches a record in the advertisement feature database specifically comprises: for each feature in the feature vector, detecting whether the feature occurs repeatedly in the advertisement feature database; and judging whether the proportion of features in the feature vector that occur repeatedly in the advertisement feature database, relative to all features of the feature vector, reaches a first threshold. If so, the instant message is determined to match a record in the advertisement feature database; otherwise it does not match.
Optionally, detecting whether the feature occurs repeatedly in the advertisement feature database comprises: looking up whether the feature exists in the advertisement feature database and, if it exists, checking the weight of the feature; if the weight is greater than or equal to a second threshold, the feature is deemed to occur repeatedly in the advertisement feature database.
Optionally, when the instant message is determined to match a record in the advertisement feature database, the method further comprises: for each feature in the feature vector, if the feature is found in the advertisement feature database, adding 1 to the weight of that feature in the advertisement feature database.
Optionally, before detecting whether each feature in the feature vector exists in the advertisement feature database, judging whether the instant message matches a record in the advertisement feature database further comprises: judging whether the number of features in the feature vector is less than a third threshold. If so, the instant message does not match any record in the advertisement feature database and the judgment ends; otherwise, for each feature in the feature vector, it is detected whether the feature occurs repeatedly in the advertisement feature database.
Optionally, extracting the one or more feature vectors contained in the text field specifically comprises: performing text processing on the text field to obtain Chinese text; converting the Chinese characters in the obtained Chinese text into pinyin to obtain pinyin text; and extracting features of the pinyin text and building the feature vector of the pinyin text from the extracted features.
Optionally, performing text processing on the text field to obtain Chinese text specifically comprises: performing a data cleansing operation on the text field, converting the content of the text field into regular characters; converting pinyin into Chinese characters; and retaining only commonly used Chinese characters.
Optionally, performing the data cleansing operation on the text field specifically comprises: identifying and discarding HTML tags, converting traditional Chinese characters to simplified Chinese characters, converting full-width characters to half-width characters, converting uppercase English letters to lowercase, and identifying and discarding URLs and punctuation marks. Converting the pinyin in the text into Chinese characters specifically comprises: using a bidirectional maximum matching algorithm to convert the pinyin in the text into Chinese characters and, if a pinyin syllable corresponds to several Chinese characters, selecting any one of them. Retaining commonly used Chinese characters specifically comprises: filtering the text field using the commonly used Chinese characters in the GBK code table, discarding all characters that are not commonly used Chinese characters.
Optionally, converting the Chinese characters in the obtained Chinese text into pinyin to obtain pinyin text specifically comprises: using a Chinese character to pinyin lookup table to convert each Chinese character into the corresponding pinyin string, thereby obtaining the pinyin text.
Optionally, extracting the features of the pinyin text and building the feature vector of the pinyin text from the extracted features specifically comprises: extracting the features of the pinyin text using a single Chinese character as the segmentation granularity, and using a vector space model to build the feature vector of the pinyin text from the extracted features.
With the device and method for recognizing advertising messages in instant messaging according to the present invention, the text field in an instant message sent by an instant messaging client is detected, one or more feature vectors contained in the text field are extracted, and instant messages matching an advertising message are recognized according to the feature vectors. Advertisements in instant messaging can thus be recognized effectively, and the corresponding shielding or muting management can be applied.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the invention become more apparent, specific embodiments of the invention are set out below.
Brief Description of the Drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art on reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 shows a flowchart of the method for recognizing advertising messages in instant messaging according to the first embodiment of the invention;
Fig. 2 shows a detailed flowchart of extracting the one or more feature vectors contained in the text field;
Fig. 3 shows a detailed flowchart of steps S210, S220 and S230 shown in Fig. 2;
Fig. 4 shows a detailed flowchart of step S300 shown in Fig. 1;
Fig. 5 shows a flowchart of the method for recognizing advertising messages in instant messaging according to the second embodiment of the invention;
Fig. 6 shows a block diagram of the device for recognizing advertising messages in instant messaging according to the first embodiment of the invention;
Fig. 7 shows a detailed block diagram of the device for recognizing advertising messages in instant messaging according to the first embodiment of the invention; and
Fig. 8 shows a detailed block diagram of the device for recognizing advertising messages in instant messaging according to the second embodiment of the invention.
Detailed Description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope conveyed completely to those skilled in the art.
Fig. 1 shows a flowchart of the method for recognizing advertising messages in instant messaging according to the first embodiment of the invention. The method comprises the following steps S100, S200 and S300.
S100: detect the text field in the instant message sent by the instant messaging client.
In this embodiment, non-text content (such as pictures and videos) can be filtered out of the instant message, leaving the text field.
S200: extract the one or more feature vectors contained in the text field. In this embodiment, the text field can be split into several text segments by detecting punctuation marks, yielding several feature vectors; alternatively, the text field can be left unsplit, yielding a single feature vector.
S300: recognize, according to the feature vectors, instant messages that match an advertising message.
In this embodiment, for each feature in the feature vector, it can be detected whether the feature occurs repeatedly in a preset advertisement feature database. After all features in the feature vector have been checked, the proportion of features that occur repeatedly in the advertisement feature database, relative to all features of the feature vector, is computed in order to judge whether the instant message matches a record in the advertisement feature database. In this embodiment the preset advertisement feature database is a Redis database. It can be built by analyzing a massive corpus of web advertisement text (for example, crawled web advertisements and other spam) to obtain a massive set of features, and counting the occurrences of each feature to obtain its weight, so that features (Shingle) and weights (Value) together form the advertisement feature database.
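The construction of the feature/weight store described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: a plain in-memory `Counter` stands in for the Redis database, the corpus is assumed to be already converted to space-separated pinyin, and the names `shingles` and `build_ad_feature_db` are hypothetical.

```python
from collections import Counter


def shingles(syllables, n=6):
    """Return every run of n consecutive syllables, joined by spaces."""
    return [" ".join(syllables[i:i + n]) for i in range(len(syllables) - n + 1)]


def build_ad_feature_db(ad_pinyin_texts, n=6):
    """Count how often each shingle occurs across an advertisement corpus.

    The returned mapping plays the role of the feature (Shingle) to
    weight (Value) store; a Counter is used here instead of Redis.
    """
    db = Counter()
    for text in ad_pinyin_texts:
        db.update(shingles(text.split(), n))
    return db


# Hypothetical two-document corpus with 2-syllable shingles.
db = build_ad_feature_db(["a b c", "a b d"], n=2)
```

In a deployment backed by Redis, each shingle would presumably be a key whose value is incremented as the corpus is scanned; the counting logic stays the same.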
Steps S200 and S300 of the invention recognize advertisements in instant messages by performing similar-text detection against the records in the advertisement feature database. A similar-text detection method that differs from steps S200 and S300 works as follows. First, the features of the text are extracted (for example, the text is segmented into words and entity words are extracted) and expanded using various techniques (for example, vocabulary expansion with knowledge bases such as a synonym thesaurus or a near-synonym dictionary), and the text is described with a VSM model (for example, each text is represented as a vector). Then the texts are clustered (for example, after two texts have been vectorized, the cosine of the angle between the two vectors is computed to characterize their similarity, and the texts are considered similar if the similarity exceeds a certain threshold). Texts that fall into the same cluster are regarded as similar.
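For comparison, the cosine-similarity test used by this alternative method can be sketched as below. It is a minimal illustration that assumes each text has already been segmented into tokens and uses raw term counts as vector components; the function name is hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two term-count vectors."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both texts contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Because the comparison works on word tokens, a homophone substitution or a pinyin rewrite changes the tokens and lowers the score, which is one reason the embodiment normalizes everything to pinyin before extracting features.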
However, network applications contain a large number of variants of similar texts, for example texts that use traditional Chinese characters, replace words with pinyin, replace the original characters with homophones, or add a large number of meaningless interference characters. The technique described above has the following shortcomings: (1) the word segmentation result contains errors; (2) texts that sound the same but use different characters cannot be judged similar; (3) two texts cannot be recognized as similar when one of them has been rewritten in pinyin; (4) the computational complexity is too high (for example, representing a text as a vector requires a large amount of computation). Such a method therefore cannot meet the real-time requirements that apply at current data volumes.
Fig. 2 shows a detailed flowchart of extracting the one or more feature vectors contained in the text field. The procedure comprises the following steps S210, S220 and S230.
S210: perform text processing on the text field to obtain Chinese text.
Deriving Chinese text from the text field removes the effect on this embodiment of similar-text variants that contain meaningless interference characters, traditional Chinese characters and the like.
S220: convert the Chinese characters in the obtained Chinese text into pinyin to obtain pinyin text.
Uniformly converting the Chinese characters in the Chinese text into pinyin removes the effect on the recognition results of this embodiment of similar-text variants that replace words with pinyin or replace the original characters with homophones.
S230: extract the features of the pinyin text and build the feature vector of the pinyin text from the extracted features.
In this embodiment, an N-gram language model can be used to extract the feature vector of the pinyin text: based on the Chinese character granularity of the Chinese text obtained in step S210, N-gram features SHINGLE_1, SHINGLE_2, ..., SHINGLE_m are extracted from the pinyin text obtained in step S220. For example, if the Chinese text obtained in step S210 is "我爱北京天安门" ("I love Tiananmen, Beijing"), the Chinese character granularity is "我", "爱", "北", "京", "天", "安", "门", and the pinyin text obtained in step S220 is "wo ai bei jing tian an men", so the pinyin string is split into "wo", "ai", "bei", "jing", "tian", "an", "men". With N=6, the N-gram features obtained in step S230 are SHINGLE_1 = "wo ai bei jing tian an", SHINGLE_2 = "ai bei jing tian an men", and so on. A vector space model (VSM) is then used to form the feature vector D = <SHINGLE_1, SHINGLE_2, ..., SHINGLE_m>.
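The N-gram extraction of step S230 can be sketched as follows, assuming the pinyin text is a string of space-separated syllables, one per Chinese character. The function name `feature_vector` is hypothetical, and the feature vector is represented simply as an ordered list of shingles.

```python
def feature_vector(pinyin_text, n=6):
    """Slide a window of n syllables over the pinyin text.

    Each window position yields one shingle; the list of shingles is the
    feature vector D = <SHINGLE_1, ..., SHINGLE_m>, with m = len - n + 1.
    Texts shorter than n syllables yield an empty vector.
    """
    syllables = pinyin_text.split()
    return [" ".join(syllables[i:i + n]) for i in range(len(syllables) - n + 1)]


D = feature_vector("wo ai bei jing tian an men", n=6)
# D == ["wo ai bei jing tian an", "ai bei jing tian an men"]
```

A seven-syllable text with N=6 therefore yields two features, matching the example in the text.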
Fig. 3 shows a detailed flowchart of steps S210, S220 and S230 shown in Fig. 2. Step S210 specifically comprises:
S211: perform a data cleansing operation on the text field, converting the content of the text field into regular characters.
Performing the data cleansing operation on the text field specifically comprises: identifying and discarding HTML tags, converting traditional Chinese characters to simplified Chinese characters, converting full-width characters to half-width characters, converting uppercase English letters to lowercase, and identifying and discarding URLs and punctuation marks.
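The cleansing operations listed for step S211 might be sketched as below. Several points are assumptions rather than details from the source: the order of the operations, the regular expressions for HTML tags and URLs, the use of Unicode general categories to detect punctuation, and above all the four-entry `TRAD_TO_SIMP` table, which is a toy stand-in for a complete traditional-to-simplified mapping.

```python
import re
import unicodedata

# Full-width ASCII variants (U+FF01..U+FF5E) sit exactly 0xFEE0 above their
# half-width counterparts; the ideographic space U+3000 maps to a plain space.
FULLWIDTH_TO_HALFWIDTH = {cp: cp - 0xFEE0 for cp in range(0xFF01, 0xFF5F)}
FULLWIDTH_TO_HALFWIDTH[0x3000] = 0x20

# Hypothetical toy table; a real system needs a full conversion table.
TRAD_TO_SIMP = str.maketrans({"貓": "猫", "頁": "页", "瀏": "浏", "覽": "览"})


def clean_text(text):
    text = re.sub(r"<[^>]+>", "", text)              # drop HTML tags
    text = text.translate(FULLWIDTH_TO_HALFWIDTH)    # full-width -> half-width
    text = text.lower()                              # uppercase -> lowercase
    # Drop URLs; only printable ASCII is consumed, so adjacent Chinese
    # characters are not swallowed by the match.
    text = re.sub(r"(?:https?://|www\.)[\x21-\x7e]+", "", text)
    text = text.translate(TRAD_TO_SIMP)              # traditional -> simplified
    # Drop every character whose Unicode category is punctuation (P*).
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
```

URLs are removed before punctuation so that the "://" and "." inside a URL are still available to the URL pattern.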
S212: convert pinyin into Chinese characters.
Converting the pinyin in the text processed by step S211 into Chinese characters specifically comprises: using a bidirectional maximum matching algorithm to convert the pinyin in the text into Chinese characters and, if a pinyin syllable corresponds to several Chinese characters, selecting any one of them.
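One way to read step S212 is sketched below. The source does not specify how the forward and backward passes are reconciled, so this sketch makes assumptions: each run of lowercase letters is segmented both left-to-right and right-to-left by longest match, the segmentation with fewer syllables is preferred, and a run is left unchanged if neither pass can cover it completely. The lexicon is a hypothetical toy, and the first candidate character is chosen so that the result is reproducible (the source allows any candidate).

```python
import re

# Hypothetical toy lexicon: pinyin syllable -> candidate Chinese characters.
PINYIN_LEXICON = {
    "mao": ["猫", "毛"],
    "tian": ["天", "田"],
    "xi": ["西"],
    "an": ["安"],
    "xian": ["先"],
}
MAX_LEN = max(len(s) for s in PINYIN_LEXICON)


def forward_mm(run):
    """Longest-match segmentation from the left; None if a position fails."""
    out, i = [], 0
    while i < len(run):
        for size in range(min(MAX_LEN, len(run) - i), 0, -1):
            if run[i:i + size] in PINYIN_LEXICON:
                out.append(run[i:i + size])
                i += size
                break
        else:
            return None
    return out


def backward_mm(run):
    """Longest-match segmentation from the right; None if a position fails."""
    out, j = [], len(run)
    while j > 0:
        for size in range(min(MAX_LEN, j), 0, -1):
            if run[j - size:j] in PINYIN_LEXICON:
                out.insert(0, run[j - size:j])
                j -= size
                break
        else:
            return None
    return out


def pinyin_to_hanzi(text):
    """Replace letter runs that segment fully into known pinyin syllables."""
    def repl(match):
        run = match.group(0)
        candidates = [s for s in (forward_mm(run), backward_mm(run)) if s is not None]
        if not candidates:
            return run  # not pinyin according to the lexicon: leave as-is
        best = min(candidates, key=len)  # prefer fewer, longer syllables
        return "".join(PINYIN_LEXICON[s][0] for s in best)
    return re.sub(r"[a-z]+", repl, text)
```

With this lexicon, `pinyin_to_hanzi("tfa天mao超市sdjh")` leaves "tfa" and "sdjh" untouched and replaces "mao" with a character pronounced "mao", mirroring the behavior described for variant 3 later in the text.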
S213: retain commonly used Chinese characters.
Retaining commonly used Chinese characters specifically comprises: filtering the text using the commonly used Chinese characters in the GBK code table and discarding all characters that are not commonly used Chinese characters, so that only Chinese characters whose GBK code lies in the range 0xB0A0 to 0xF7FE are retained.
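The GBK range filter of step S213 can be sketched as follows. The sketch applies the range exactly as stated, as a numeric comparison on the two-byte GBK code; whether the original system additionally restricts the trail byte is not stated in the source. The function names are hypothetical.

```python
def is_common_hanzi(ch):
    """True if ch encodes to a two-byte GBK code within 0xB0A0..0xF7FE."""
    try:
        raw = ch.encode("gbk")
    except UnicodeEncodeError:
        return False  # not representable in GBK at all (e.g. emoji)
    if len(raw) != 2:
        return False  # single-byte codes are ASCII letters, digits, etc.
    return 0xB0A0 <= int.from_bytes(raw, "big") <= 0xF7FE


def keep_common_hanzi(text):
    """Drop every character that is not a commonly used Chinese character."""
    return "".join(ch for ch in text if is_common_hanzi(ch))
```

For instance, `keep_common_hanzi("1x3f天猫首页tfa！")` returns `"天猫首页"`: the ASCII characters encode to a single byte, and the full-width exclamation mark has a GBK code below 0xB0A0.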
Step S220 specifically comprises: using a Chinese character to pinyin lookup table to convert each Chinese character into the corresponding pinyin string, thereby obtaining the pinyin text.
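The table lookup of step S220 can be sketched as below. The seven-entry `HANZI_TO_PINYIN` table is a hypothetical excerpt covering only the running example; a real lookup table covers all commonly used characters. Skipping characters that are missing from the table is an assumption made for this sketch.

```python
# Hypothetical excerpt of a Chinese character -> pinyin lookup table.
HANZI_TO_PINYIN = {
    "我": "wo", "爱": "ai", "北": "bei", "京": "jing",
    "天": "tian", "安": "an", "门": "men",
}


def to_pinyin_text(chinese_text):
    """Convert each character to its pinyin string, separated by spaces."""
    return " ".join(
        HANZI_TO_PINYIN[ch] for ch in chinese_text if ch in HANZI_TO_PINYIN
    )


pinyin_text = to_pinyin_text("我爱北京天安门")
# pinyin_text == "wo ai bei jing tian an men"
```

Keeping one space-separated syllable per character preserves the single-character granularity that the N-gram extraction of step S230 relies on.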
Because step S210 derives Chinese text from the text field and step S220 converts the Chinese characters in that text into pinyin, different variants of a similar text are reduced to the same pinyin text. For example, the text field shown in Table 1 and its three variants yield the same pinyin text after steps S210 and S220.
Table 1: The text field and three variants (provided in the original publication as images BDA0000407804490000081 and BDA0000407804490000091)
Processing the above original text and the three variants with steps S210 and S220 of the invention yields the same pinyin text in each case: "tian mao shou ye zhan tie dao liu lan qi fang wen tian mao chao shi zhan tie dao liu lan qi fang wen". Take variant 3 as an example. After the data cleansing of step S211, the text is: "1x3f days Mao homepages paste Liu pull tfa days mao supermarkets of device access paste Liu and pull device access sdjh". After step S212 converts pinyin into Chinese characters, the result is: "1x3f days Mao homepages paste Liu pull tfa days cat supermarkets of device access paste Liu and pull device access sdjh". Here "1x3f", "tfa" and "sdjh" are not in the pinyin lexicon and are therefore left unprocessed, whereas "mao" is in the pinyin lexicon, so a randomly selected Chinese character with that pronunciation ("cat") replaces it. After step S213 retains the commonly used Chinese characters, the result is: "a day Mao homepage paste Liu pull a device access day cat supermarket paste Liu and pull device access". Finally, the Chinese character to pinyin lookup table is used to convert each Chinese character into the corresponding pinyin, giving the pinyin text above. The original text, variant 1 and variant 2 yield the same pinyin text.
When N=6, the feature vector obtained through step S230 is <tian mao shou ye zhan tie, mao shou ye zhan tie dao, shou ye zhan tie dao liu, ye zhan tie dao liu lan, zhan tie dao liu lan qi, tie dao liu lan qi fang, dao liu lan qi fang wen, liu lan qi fang wen tian, lan qi fang wen tian mao, qi fang wen tian mao chao, fang wen tian mao chao shi, wen tian mao chao shi zhan, tian mao chao shi zhan tie, mao chao shi zhan tie dao, chao shi zhan tie dao liu, shi zhan tie dao liu lan, zhan tie dao liu lan qi, tie dao liu lan qi fang, dao liu lan qi fang wen>.
Fig. 4 shows a detailed flowchart of step S300 in Fig. 1. For a feature vector obtained in step S200 above, step S300 specifically comprises the following steps:
S310: judge whether the number K of features in the feature vector is less than a third threshold T3; if so, go to step S390, otherwise go to step S320. This step has at least two advantages. First, in real instant messaging an advertising message is rarely very short, whereas a considerable share of text fields are very short texts (for example, no more than five to seven Chinese characters). With this check, feature vectors extracted from short texts (those yielding fewer features than the preset threshold) skip the judgments of steps S320 to S370, which reduces the amount of computation of the method. Second, a short text yields few features. As the later step S370 shows, an instant message that is not an advertisement could be misjudged as matching a record in the advertisement feature database merely because a few of the features extracted from its text field happen to occur in the database; step S310 avoids this misjudgment.
S320: select from the feature vector one feature (Shingle) that has not yet been compared with the records in the advertisement feature database.
S330: judge whether the feature obtained in step S320 exists in the advertisement feature database; if so, go to step S340, otherwise go to step S360.
S340: judge whether the weight of this feature in the advertisement feature database is greater than or equal to a second threshold T2; if so, go to step S350, otherwise go to step S360.
S350: determine that the feature occurs repeatedly in the advertisement feature database, and go to step S360. Since step S340 has established that the weight is greater than or equal to the second threshold T2, step S350 concludes that the feature occurs repeatedly in the advertisement feature database.
S360: judge whether all features in the feature vector have been compared with the records in the advertisement feature database; if so, go to step S370, otherwise return to step S320 and read a feature that has not yet been compared. Step S330 is thus performed for every feature of the feature vector.
S370, judge whether the ratio that the feature repeatedly occurring in described proper vector accounts for whole features of this proper vector reaches first threshold T1, is to perform step S380 in characteristic of advertisement database, otherwise execution step S390.In the present embodiment, by judge that the feature repeatedly occurring in a proper vector accounts for the ratio of whole features of this proper vector in characteristic of advertisement database, whether reflection instant message mates with the record in characteristic of advertisement database.As from the foregoing, the operational method that the present embodiment adopts all belongs to simple text transform operation and simple data compare operation, and the relation between operand and text size is roughly once linear relationship, and computing expense is little.
S380, determine the record matching in instant message and characteristic of advertisement database and finish decision operation.
S390, determine that instant message does not mate with record in characteristic of advertisement database and finishes decision operation.
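The judgment flow of steps S310 to S390 can be sketched in Python roughly as follows. This is a minimal illustration, not the patented implementation: the function name `matches_ad_database` and the use of an in-memory mapping in place of the Redis database are assumptions made for the example.

```python
from typing import Mapping, Sequence


def matches_ad_database(
    features: Sequence[str],
    ad_weights: Mapping[str, int],  # hypothetical stand-in for the Redis store
    t1: float,
    t2: int,
    t3: int,
) -> bool:
    """Return True if the feature vector matches the advertisement feature database."""
    k = len(features)
    # S310: a vector with fewer than T3 features is treated as non-matching
    # without any database lookups (an empty vector is also rejected).
    if k == 0 or k < t3:
        return False  # S390

    repeated = 0
    for shingle in features:              # S320 / S360: visit every feature once
        weight = ad_weights.get(shingle)  # S330: is the feature in the database?
        if weight is not None and weight >= t2:  # S340: weight >= T2?
            repeated += 1                 # S350: feature "occurs repeatedly"

    # S370: the share of repeatedly occurring features must reach T1.
    return repeated / k >= t1             # S380 if True, S390 if False
```

Each feature costs one lookup and one comparison, which is consistent with the roughly linear cost the description claims.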
Preferably, when step S380 determines that the instant message matches a record in the advertisement feature database, the method of this embodiment further comprises: for each feature in the feature vector, if the feature is found in the advertisement feature database, incrementing the weight of that feature in the database by 1. In other words, a match causes the Redis advertisement feature database to be updated, so the database is kept up to date as the method of the invention is used.
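The weight update might look like the following sketch. The name `reinforce_weights` and the plain dictionary are assumptions for illustration; with a Redis-backed store, the same step would presumably map to an atomic increment such as the `INCR` command.

```python
from typing import Iterable, MutableMapping


def reinforce_weights(
    features: Iterable[str],
    ad_weights: MutableMapping[str, int],  # hypothetical stand-in for the Redis store
) -> None:
    """After a match, add 1 to the weight of every feature already in the database."""
    for shingle in features:
        # Only features that already exist are incremented; unknown features
        # are not inserted, following the wording of the description.
        if shingle in ad_weights:
            ad_weights[shingle] += 1
```

Note that a feature appearing twice in the vector is incremented twice here, which is one literal reading of "for each feature in the feature vector".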
Continuing with the feature vector obtained from the text field in Table 1 as an example: when N=6, the feature vector obtained through step S200 is <tian mao shou ye zhan tie, mao shou ye zhan tie dao, shou ye zhan tie dao liu, ye zhan tie dao liu lan, zhan tie dao liu lan qi, tie dao liu lan qi fang, dao liu lan qi fang wen, liu lan qi fang wen tan, lan qi fang wen tan mao, qi fang wen tan mao chao, fang wen tan mao chao shi, wen tan mao chao shi zhan, tan mao chao shi zhan tie, mao chao shi zhan tie dao, chao shi zhan tie dao liu, shi zhan tie dao liu lan, zhan tie dao liu lan qi, tie dao liu lan qi fang, dao liu lan qi fang wen>. First, step S310 determines whether the number of features K=24 in the feature vector is less than the third threshold T3. Suppose T3=10; then K > T3, so step S320 selects a feature that has not yet been compared with the records in the advertisement feature database, for example "tian mao shou ye zhan tie". Step S330 determines whether this feature exists in the advertisement feature database. If not, step S360 returns to step S320 to select another feature. If so, step S340 determines whether the weight Value of this feature in the advertisement feature database is greater than or equal to the second threshold T2. Suppose Value=6 and T2=2; step S350 then concludes that the feature occurs repeatedly in the advertisement feature database. Preferably, the result of this step can be recorded in various ways, for example by marking the feature or by recording the feature in a table. Once all 24 features have been judged (each having passed at least steps S320 and S330), step S370 determines whether the proportion of these 24 features that occur repeatedly in the advertisement feature database reaches the first threshold T1. Suppose 12 features occur repeatedly, a proportion of 50%, and T1 is 30%; the instant message is then determined to match a record in the advertisement feature database, and the judgment ends.
Fig. 5 shows a flowchart of a method for identifying advertising messages in instant messaging according to a second embodiment of the invention. The second embodiment is largely the same as the first embodiment shown in Fig. 1; the difference is that it further includes step S400: when an instant message matching an advertising message is identified, shield that instant message, and/or mark that instant message and the client that sent it, and do not forward instant messages sent by that client for a predetermined period. A specific instant message is thereby shielded, and/or the client that sent the advertising message is muted.
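One way step S400 could be realized is sketched below. The class `MuteManager`, the function `handle_message`, and the injectable clock are hypothetical names introduced for illustration; the patent does not specify how the predetermined period is tracked.

```python
import time
from typing import Callable, Dict


class MuteManager:
    """Tracks clients whose messages are not forwarded for a fixed period."""

    def __init__(self, mute_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.mute_seconds = mute_seconds
        self.clock = clock
        self._muted_until: Dict[str, float] = {}

    def report_ad(self, client_id: str) -> None:
        # Mark the sending client and start (or restart) its mute period.
        self._muted_until[client_id] = self.clock() + self.mute_seconds

    def may_forward(self, client_id: str) -> bool:
        until = self._muted_until.get(client_id)
        return until is None or self.clock() >= until


def handle_message(client_id: str, is_ad: bool, manager: MuteManager) -> bool:
    """Return True if the message should be forwarded, False if it is withheld."""
    if not manager.may_forward(client_id):
        return False                 # client is still muted
    if is_ad:
        manager.report_ad(client_id)
        return False                 # shield the advertising message itself
    return True
```

Passing the clock as a parameter keeps the sketch testable without waiting for real time to elapse.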
Fig. 6 shows a block diagram of a device for identifying advertising messages in instant messaging according to a first embodiment of the invention. The device comprises a text acquisition unit 100, a feature vector extraction unit 200, a recognition unit 300, a shielding unit 400, and an advertisement feature database 500.
The text acquisition unit 100 is adapted to detect the text field in an instant message sent by an instant messaging client. In this embodiment, the feature vector extraction unit 200 can filter non-text content such as pictures and video out of the published content to obtain the text field.
The feature vector extraction unit 200 is adapted to extract one or more feature vectors contained in the text field. In this embodiment, the feature vector extraction unit 200 can detect punctuation marks and split the text field into multiple text segments, thereby obtaining multiple feature vectors; alternatively, it can leave the text field unsplit and obtain a single feature vector.
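The punctuation-based splitting described above could be sketched as follows. The function name `split_text_field` and the particular set of punctuation marks are assumptions; the patent does not list which marks are detected.

```python
import re
from typing import List

# Assumed set of Chinese and ASCII sentence punctuation used as split points.
_PUNCT = re.compile(r"[，。！？；：、,.!?;:]+")


def split_text_field(text: str, split: bool = True) -> List[str]:
    """Return the text segments from which feature vectors would be extracted."""
    if not split:
        # Unsplit mode: the whole text field yields a single feature vector.
        return [text] if text.strip() else []
    # Split mode: each non-empty segment yields its own feature vector.
    return [seg.strip() for seg in _PUNCT.split(text) if seg.strip()]
```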
The recognition unit 300 is adapted to identify, according to the feature vector, an instant message that matches an advertising message. In this embodiment, the recognition unit 300 is adapted to determine, according to the feature vector, whether the instant message matches a record in the advertisement feature database 500.
The advertisement feature database 500 in this embodiment is a Redis database. It can be built by analyzing a massive amount of network text (for example, crawled and collected spam such as web advertisements) to obtain a large number of features, and counting the occurrences of each feature to obtain its weight, so that features (Shingle) and weights (Value) together form the advertisement feature database.
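Building the Shingle-to-Value mapping from a corpus amounts to counting feature occurrences, which might be sketched as below. The function name `build_ad_feature_db` is hypothetical, and an in-memory dictionary stands in for the Redis database mentioned in the description.

```python
from collections import Counter
from typing import Dict, Iterable, List


def build_ad_feature_db(corpus_features: Iterable[List[str]]) -> Dict[str, int]:
    """Count how often each feature occurs across the corpus.

    `corpus_features` yields one list of features (shingles) per collected
    advertising text; the resulting count is used as the feature's weight.
    """
    counts: Counter = Counter()
    for features in corpus_features:
        counts.update(features)
    return dict(counts)
```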
Specifically, the recognition unit 300 is adapted to detect, for each feature in the feature vector, whether the feature occurs repeatedly in the advertisement feature database 500. To do so, the recognition unit 300 looks up each feature of the feature vector in the advertisement feature database 500; if the feature exists, it checks the weight of the feature, and if the weight is greater than or equal to the preset second threshold T2, it concludes that the feature occurs repeatedly in the advertisement feature database 500.
The recognition unit 300 is further adapted to determine whether the proportion of features in the feature vector that occur repeatedly in the advertisement feature database 500, relative to all features of that vector, reaches the first threshold T1; if so, the instant message is determined to match a record in the advertisement feature database 500; otherwise, it does not match.
Further, the recognition unit 300 is adapted, before detecting for each feature in the feature vector whether the feature exists in the advertisement feature database 500, to determine whether the number of features in the feature vector is less than the third threshold T3. If so, the instant message does not match the records in the advertisement feature database 500 and the judgment ends; otherwise, the recognition unit 300 goes on to detect, for each feature in the feature vector, whether the feature occurs repeatedly in the advertisement feature database 500.
Fig. 7 shows a detailed block diagram of the device for identifying advertising messages in instant messaging according to the first embodiment of the invention. The device of this embodiment further comprises the shielding unit 400, adapted to shield the instant message matching an advertising message when the recognition unit 300 identifies such a match. The device further comprises a management unit 600, adapted, when the recognition unit 300 identifies an instant message matching an advertising message, to mark that instant message and the client that sent it, and not to forward instant messages sent by that client for a predetermined period, thereby muting the client that sent the advertisement.
Specifically, the feature vector extraction unit 200 of this embodiment comprises a Chinese text acquisition subunit 210, a pinyin text acquisition subunit 220, and a fingerprint acquisition subunit 230.
The Chinese text acquisition subunit 210 is adapted to process the text field to obtain Chinese text.
More specifically, the Chinese text acquisition subunit 210 is adapted to perform a data cleaning operation on the text field so that the content of the text is converted into regular characters. The data cleaning operation comprises identifying and discarding HTML tags, converting traditional Chinese characters to simplified Chinese characters, converting full-width characters to half-width characters, converting uppercase English letters to lowercase, and identifying and discarding URLs and punctuation marks. The Chinese text acquisition subunit 210 is further adapted to convert pinyin into Chinese characters: it uses a bidirectional maximum matching algorithm to convert pinyin in the text into Chinese characters, and if a pinyin string corresponds to multiple Chinese characters, it chooses any one of them. The Chinese text acquisition subunit 210 is further adapted to retain only common Chinese characters: it filters the text against the common Chinese characters in the GBK code table, discards every character that is not a common Chinese character, and keeps only Chinese characters whose GBK code lies in the range 0xB0A0~0xF7FE.
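A partial sketch of the cleaning pipeline follows. It covers HTML tag removal, full-width to half-width conversion, lowercasing, URL removal, and the GBK range filter. The traditional-to-simplified conversion and the pinyin-to-Chinese conversion are omitted because they need dictionaries the description does not provide. All function names and regular expressions are assumptions, and the range check reads "0xB0A0~0xF7FE" literally as a numeric interval over the two-byte GBK code.

```python
import re

_HTML_TAG = re.compile(r"<[^>]+>")
_URL = re.compile(r"https?://[\x21-\x7e]+")  # ASCII-only, so adjacent Chinese text survives


def to_half_width(text: str) -> str:
    """Map full-width ASCII variants (U+FF01..U+FF5E) and U+3000 to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


def is_common_hanzi(ch: str) -> bool:
    """True if the character's GBK code lies in 0xB0A0..0xF7FE."""
    try:
        raw = ch.encode("gbk")
    except UnicodeEncodeError:
        return False
    return len(raw) == 2 and 0xB0A0 <= int.from_bytes(raw, "big") <= 0xF7FE


def clean_text_field(text: str) -> str:
    text = _HTML_TAG.sub("", text)   # discard HTML tags
    text = to_half_width(text)       # full-width -> half-width
    text = text.lower()              # uppercase -> lowercase
    text = _URL.sub("", text)        # discard URLs
    # Keeping only common Chinese characters also drops punctuation and,
    # because pinyin-to-Chinese conversion is omitted here, any Latin letters.
    return "".join(ch for ch in text if is_common_hanzi(ch))
```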
The pinyin text acquisition subunit 220 is adapted to convert the Chinese characters in the obtained Chinese text into pinyin to obtain pinyin text; it uses a Chinese character-to-pinyin lookup table to convert each Chinese character into its corresponding pinyin string.
Because the Chinese text acquisition subunit 210 obtains Chinese text from the text field and the pinyin text acquisition subunit 220 converts the Chinese characters in that text into pinyin, different variants of similar texts can be recognized as the same pinyin text.
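The table lookup, and the way homophone variants collapse to the same pinyin text, can be illustrated with a toy example. The five-entry `PINYIN_TABLE` and the function `to_pinyin` are hypothetical; a real table would cover all common Chinese characters.

```python
from typing import Dict, List

# Tiny illustrative table; tone marks are ignored in this sketch.
PINYIN_TABLE: Dict[str, str] = {
    "天": "tian",
    "猫": "mao",
    "毛": "mao",
    "首": "shou",
    "页": "ye",
}


def to_pinyin(chinese_text: str, table: Dict[str, str] = PINYIN_TABLE) -> List[str]:
    """Convert each Chinese character to its pinyin string; skip unknown characters."""
    return [table[ch] for ch in chinese_text if ch in table]
```

Here "天猫首页" and the variant "天毛首页" both map to the syllables tian, mao, shou, ye, so a sender who swaps in a homophone does not change the resulting pinyin text.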
The fingerprint acquisition subunit 230 is adapted to extract features of the pinyin text and to generate the feature vector of the pinyin text from the extracted features. Specifically, the fingerprint acquisition subunit 230 extracts features of the pinyin text at the granularity of a single Chinese character and uses a vector space model to generate the feature vector from the extracted features. Preferably, the fingerprint acquisition subunit 230 uses an N-gram language model to extract the feature vector of the pinyin text: based on the Chinese character granularity of the Chinese text obtained by the Chinese text acquisition subunit 210, it extracts the N-gram features SHINGLE_1, SHINGLE_2, ..., SHINGLE_m from the pinyin text obtained by the pinyin text acquisition subunit 220, and uses a vector space model to form the feature vector D = <SHINGLE_1, SHINGLE_2, ..., SHINGLE_m>.
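N-gram extraction at single-character granularity can be sketched as a sliding window over the pinyin syllables, one syllable per Chinese character. The function name `extract_shingles` is an assumption made for this example.

```python
from typing import List


def extract_shingles(syllables: List[str], n: int) -> List[str]:
    """Return the N-gram features (shingles) of a pinyin text.

    Each shingle joins n consecutive syllables with spaces. A text with
    fewer than n syllables yields no shingles.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return [
        " ".join(syllables[i:i + n])
        for i in range(len(syllables) - n + 1)
    ]
```

For example, with N=6 the syllables "tian mao shou ye zhan tie dao" give the two shingles "tian mao shou ye zhan tie" and "mao shou ye zhan tie dao", in line with the worked example in the description.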
Fig. 8 shows a detailed block diagram of the device for identifying advertising messages in instant messaging according to a second embodiment of the invention. The second embodiment of the device is largely the same as the first embodiment shown in Fig. 7; the difference is that the device further comprises an advertisement feature database update unit 700.
The advertisement feature database update unit 700 is adapted, when the instant message is determined to match a record in the advertisement feature database 500, to increment by 1 the weight in the advertisement feature database 500 of each feature of the feature vector that is found in the database. In other words, a match between the instant message and a record in the advertisement feature database causes the advertisement feature database 500 to be updated.
It should be noted that:
The algorithms and displays presented here are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described here can be implemented in a variety of programming languages, and that the above description of a specific language is given in order to disclose the best mode of the invention.
In the specification provided here, numerous specific details are described. It will be understood, however, that embodiments of the invention can be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. This method of disclosure, however, should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. The claims following the detailed description are therefore expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will understand that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules or units or components of an embodiment can be combined into one module or unit or component, and can furthermore be divided into a plurality of sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed can be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) can be replaced by an alternative feature serving the same, equivalent, or similar purpose.
Furthermore, those skilled in the art will understand that although some embodiments described here include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
The component embodiments of the invention can be implemented in hardware, in software modules running on one or more processors, or in a combination of these. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) can be used in practice to implement some or all of the functions of some or all of the components of the device for identifying advertising messages in instant messaging according to embodiments of the invention. The invention can also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described here. Such a program implementing the invention can be stored on a computer-readable medium or can take the form of one or more signals. Such signals can be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words can be interpreted as names.

Claims (10)

1. A device for identifying advertising messages in instant messaging, comprising:
a text acquisition unit, adapted to detect the text field in an instant message sent by an instant messaging client;
a feature vector extraction unit, adapted to extract one or more feature vectors contained in the text field;
a recognition unit, adapted to identify, according to the feature vector, an instant message that matches an advertising message.
2. The device according to claim 1, wherein the device further comprises:
a shielding unit, adapted to shield the instant message matching an advertising message when the recognition unit identifies such an instant message.
3. The device according to claim 1 or 2, wherein the device further comprises:
a management unit, adapted, when the recognition unit identifies an instant message matching an advertising message, to mark that instant message and the client that sent it, and not to forward instant messages sent by that client for a predetermined period.
4. The device according to any one of claims 1-3, wherein
the recognition unit is adapted to determine, according to the feature vector, whether the instant message matches a record in an advertisement feature database.
5. The device according to any one of claims 1-4, wherein
the recognition unit is adapted to detect, for each feature in the feature vector, whether the feature occurs repeatedly in the advertisement feature database;
the recognition unit is adapted to determine whether the proportion of features in the feature vector that occur repeatedly in the advertisement feature database, relative to all features of the feature vector, reaches a first threshold; if so, the instant message is determined to match a record in the advertisement feature database; otherwise, it does not match.
6. A method for identifying advertising messages in instant messaging, comprising:
detecting the text field in an instant message sent by an instant messaging client;
extracting one or more feature vectors contained in the text field;
identifying, according to the feature vector, an instant message that matches an advertising message.
7. The method according to claim 6, wherein the method further comprises:
when an instant message matching an advertising message is identified, shielding that instant message.
8. The method according to claim 6 or 7, wherein
when an instant message matching an advertising message is identified, that instant message and the client that sent it are marked, and instant messages sent by that client are not forwarded for a predetermined period.
9. The method according to any one of claims 6-8, wherein identifying, according to the feature vector, an instant message that matches an advertising message further comprises:
determining, according to the feature vector, whether the instant message matches a record in an advertisement feature database.
10. The method according to any one of claims 6-9, wherein determining, according to the feature vector, whether the instant message matches a record in the advertisement feature database further comprises:
detecting, for each feature in the feature vector, whether the feature occurs repeatedly in the advertisement feature database;
determining whether the proportion of features in the feature vector that occur repeatedly in the advertisement feature database, relative to all features of the feature vector, reaches a first threshold; if so, the instant message is determined to match a record in the advertisement feature database; otherwise, it does not match.
Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201310537715.6A CN103605690A (en) 2013-11-04 2013-11-04 Device and method for recognizing advertising messages in instant messaging
PCT/CN2014/087175 WO2015062377A1 (en) 2013-11-04 2014-09-23 Device and method for detecting similar text, and application
US15/034,307 US20160283582A1 (en) 2013-11-04 2014-09-23 Device and method for detecting similar text, and application

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310537715.6A CN103605690A (en) 2013-11-04 2013-11-04 Device and method for recognizing advertising messages in instant messaging

Publications (1)

Publication Number Publication Date
CN103605690A true CN103605690A (en) 2014-02-26

Family

ID=50123913

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310537715.6A Pending CN103605690A (en) 2013-11-04 2013-11-04 Device and method for recognizing advertising messages in instant messaging

Country Status (1)

Country Link
CN (1) CN103605690A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1681335A (en) * 2004-04-10 2005-10-12 乐金电子(中国)研究开发中心有限公司 Method for filtering advertisements from multimedia short message service
CN102591854A (en) * 2012-01-10 2012-07-18 凤凰在线(北京)信息技术有限公司 Advertisement filtering system and advertisement filtering method specific to text characteristics
CN102761848A (en) * 2012-08-01 2012-10-31 成都四方信息技术有限公司 Method for determining short message intercepting key words
CN103366019A (en) * 2013-08-06 2013-10-23 飞天诚信科技股份有限公司 Webpage intercepting method and device based on iOS device

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015062377A1 (en) * 2013-11-04 2015-05-07 北京奇虎科技有限公司 Device and method for detecting similar text, and application
CN105490913A (en) * 2014-09-16 2016-04-13 腾讯科技(深圳)有限公司 Instant message processing method and device
CN104346480A (en) * 2014-11-27 2015-02-11 百度在线网络技术(北京)有限公司 Information mining method and device
WO2016082575A1 (en) * 2014-11-27 2016-06-02 百度在线网络技术(北京)有限公司 Information mining method and apparatus, and storage medium
CN105515830A (en) * 2015-11-26 2016-04-20 广州酷狗计算机科技有限公司 User management method and device
CN107018062A (en) * 2016-06-24 2017-08-04 卡巴斯基实验室股份公司 System and method for recognizing rubbish message using subject information
CN108628822A (en) * 2017-03-24 2018-10-09 阿里巴巴集团控股有限公司 Recognition methods without semantic text and device
CN108628822B (en) * 2017-03-24 2021-12-07 创新先进技术有限公司 Semantic-free text recognition method and device
CN108768824A (en) * 2018-05-15 2018-11-06 腾讯科技(深圳)有限公司 Information processing method and device

Similar Documents

Publication Publication Date Title
CN111897970B (en) Text comparison method, device, equipment and storage medium based on knowledge graph
CN103605691A (en) Device and method used for processing issued contents in social network
CN110020422B (en) Feature word determining method and device and server
CN103605690A (en) Device and method for recognizing advertising messages in instant messaging
CN105426356B (en) A kind of target information recognition methods and device
CN103605694A (en) Device and method for detecting similar texts
CN107423278B (en) Evaluation element identification method, device and system
CN107341143B (en) Sentence continuity judgment method and device and electronic equipment
CN102227724A (en) Machine learning for transliteration
CN110609998A (en) Data extraction method of electronic document information, electronic equipment and storage medium
CN101197793B (en) Garbage information detection method and device
CN108536868B (en) Data processing method and device for short text data on social network
CN112364628B (en) New word recognition method and device, electronic equipment and storage medium
CN111767725A (en) Data processing method and device based on emotion polarity analysis model
CN111866004B (en) Security assessment method, apparatus, computer system, and medium
CN112989208B (en) Information recommendation method and device, electronic equipment and storage medium
CN107526721B (en) Ambiguity elimination method and device for comment vocabularies of e-commerce products
CN111666766A (en) Data processing method, device and equipment
CN111782793A (en) Intelligent customer service processing method, system and equipment
CN111078979A (en) Method and system for identifying network credit website based on OCR and text processing technology
CN110941702A (en) Retrieval method and device for laws and regulations and laws and readable storage medium
US20160283582A1 (en) Device and method for detecting similar text, and application
CN114707517B (en) Target tracking method based on open source data event extraction
Wong et al. iSentenizer‐μ: Multilingual Sentence Boundary Detection Model
CN114117299A (en) Website intrusion tampering detection method, device, equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140226

RJ01 Rejection of invention patent application after publication