CN107239447A - Junk information recognition method, device, and system - Google Patents

Junk information recognition method, device, and system

Info

Publication number
CN107239447A
Authority
CN
China
Prior art keywords
text
content
data
information
original
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710417747.0A
Other languages
Chinese (zh)
Other versions
CN107239447B (en)
Inventor
陈方毅
陈伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Mei You Information Technology Co Ltd
Original Assignee
Xiamen Mei You Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Mei You Information Technology Co Ltd
Priority to CN201710417747.0A
Publication of CN107239447A
Application granted
Publication of CN107239447B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention discloses a junk information recognition method, device, and system, and belongs to the technical field of Internet applications. The method includes: extracting the text content of user original information; performing semantic restoration on the text content to obtain restored text; performing a matching operation on the restored text in a preset sample model library by means of a gradient descent algorithm to obtain the junk probability that the user original information is junk information; and comparing the junk probability with a preset junk probability threshold to identify whether the user original information is junk information. A junk information recognition device and a system are also provided. The above method, device, and system can recognize junk information in user original information that has undergone semantic conversion.

Description

Junk information recognition method, device, and system
Technical field
The present invention relates to the technical field of Internet applications, and in particular to a junk information recognition method, device, and system.
Background
With the development of Internet technology, network information is increasingly abundant, but the user original information on websites is a mixed bag, and junk information such as useless advertisements and pornography keeps increasing. User original information on a website should therefore be filtered for junk words in advance. That is, junk information recognition should be performed on the user original information beforehand, and any user original information identified as junk information should be screened out to keep the site's information clean.
However, a publisher may semantically convert user original information before posting it in order to avoid having it identified as junk information. For example, when posting an advertisement, Arabic numerals such as a QQ number are converted into Chinese numerals so that the message escapes recognition.
At present, existing junk information recognition generally works by full or partial matching against reference junk words and cannot recognize junk information in user original information that has undergone semantic conversion. This greatly reduces recognition accuracy and leads to a high misjudgment rate.
Summary of the invention
To solve the technical problem in the related art that junk information cannot be recognized in user original information that has undergone semantic conversion, the present invention provides a junk information recognition method, device, and system.
An embodiment of the present invention provides a junk information recognition method, including:
Extracting the text content of user original information;
Performing semantic restoration on the text content to obtain restored text;
Performing a matching operation on the restored text in a preset sample model library by means of a gradient descent algorithm to obtain the junk probability that the user original information is junk information;
Comparing the junk probability with a preset junk probability threshold to identify whether the user original information is junk information.
In addition, an embodiment of the present invention provides a junk information recognition device, including:
A text content extraction module, for extracting the text content of user original information;
A semantic restoration module, for performing semantic restoration on the text content to obtain restored text;
A matching operation module, for performing a matching operation on the restored text in a preset sample model library by means of a gradient descent algorithm to obtain the junk probability that the user original information is junk information;
A junk information identification module, for comparing the junk probability with a preset junk probability threshold to identify whether the user original information is junk information.
In addition, an embodiment of the present invention provides a system, including:
A processor;
A memory for storing processor-executable instructions;
Wherein the processor is configured to perform:
Extracting the text content of user original information;
Performing semantic restoration on the text content to obtain restored text;
Performing a matching operation on the restored text in a preset sample model library by means of a gradient descent algorithm to obtain the junk probability that the user original information is junk information;
Comparing the junk probability with a preset junk probability threshold to identify whether the user original information is junk information.
The technical solutions provided by the embodiments of the present invention can have the following beneficial effects:
When junk information recognition is performed on user original information, semantic restoration is applied to its text content, so junk information can be recognized even in user original information that has undergone semantic conversion. This greatly improves the accuracy of junk information recognition and reduces the misjudgment rate.
It should be understood that the above general description and the following detailed description are only exemplary and do not limit the present invention.
Brief description of the drawings
The accompanying drawings are incorporated into and form part of this specification. They show embodiments consistent with the present invention and, together with the specification, serve to explain its principles.
Fig. 1 is a flowchart of a junk information recognition method according to an exemplary embodiment.
Fig. 2 is a flowchart of a junk information recognition method according to an exemplary embodiment.
Fig. 3 is a flowchart of a junk information recognition method according to an exemplary embodiment.
Fig. 4 is a flowchart of one implementation of step S220 in the junk information recognition method of the embodiment corresponding to Fig. 3.
Fig. 5 is a flowchart of one implementation of step S130 in the junk information recognition method of the embodiment corresponding to Fig. 1.
Fig. 6 is a block diagram of a junk information recognition device according to an exemplary embodiment.
Fig. 7 is a block diagram of the semantic restoration module 120 in the junk information recognition device of the embodiment corresponding to Fig. 6.
Fig. 8 is a block diagram of another junk information recognition device according to the embodiment corresponding to Fig. 6.
Fig. 9 is a block diagram of the feature extraction module 220 in the junk information recognition device of the embodiment corresponding to Fig. 8.
Fig. 10 is a block diagram of the matching operation module 130 in the junk information recognition device of the embodiment corresponding to Fig. 6.
Fig. 11 is a block diagram of a system according to an exemplary embodiment.
Detailed description
Exemplary embodiments are described in detail here, and examples of them are illustrated in the accompanying drawings. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention. Rather, they are merely examples of devices and methods consistent with some aspects of the invention as detailed in the appended claims.
Fig. 1 is a flowchart of a junk information recognition method according to an exemplary embodiment. As shown in Fig. 1, the junk information recognition method may include the following steps.
In step S110, the text content of user original information is extracted.
User original information is information that a user enters on the network, for example a user's comment or message on a topic in a forum.
It will be understood that user original information may include emoticons, text, and other content.
User original information is of uneven quality, and junk information is usually found in its text, so junk information recognition must be performed on the text content of user original information in advance. The text content is therefore extracted from the user original information.
The text content may be extracted from the user original information by any of various text extraction methods, which are not limited here.
In step S120, semantic restoration is performed on the text content to obtain restored text.
Semantic restoration is text processing applied to the text content according to its semantics. After semantic analysis of the text content, the corresponding restoration is applied to obtain the restored text.
It will be understood that, to keep posted junk information from being screened out, a user may apply semantic conversion so that the posted user original information is not identified as junk information.
For example, the QQ number "1234567" is converted from Arabic numerals into the Chinese numerals "one two three four five six seven" to avoid being identified as junk information.
As another example, through homophone/compound character conversion, "plus my wechat gives you" is converted into "I prestige gives you family" (a literal rendering of the same phrase written with similar-sounding characters) to avoid being identified as junk information.
Semantic restoration therefore needs to be performed on the text content of user original information.
Semantic restoration performs semantic analysis on the text content and extracts the meaning that the text content represents.
There are many methods for semantic analysis of text content. With latent semantic indexing, the text content is expressed as a feature-document matrix based on the vector space model, the matrix is reduced by singular value decomposition, and the text content and the feature words are mapped into the same low-dimensional semantic space. Semantic analysis can also rely on external semantic knowledge, for example a homophone/compound character lexicon used to extract the meaning of the text content. Other approaches can also be used; the method of semantic analysis is not limited here.
In step S130, a matching operation is performed on the restored text in the preset sample model library by means of a gradient descent algorithm to obtain the junk probability that the user original information is junk information.
Gradient descent is an optimization algorithm used in machine learning.
The sample model library is prepared in advance and contains, for each sample model, the probability that it is junk information.
The junk probability is the probability that the user original information is junk information.
In the gradient descent algorithm, the text content of the user original information is matched against the sample models in the sample model library using a progressively descending gradient. Once the computation converges, the junk probability that the user original information is junk information is obtained.
In step S140, the junk probability is compared with the preset junk probability threshold to identify whether the user original information is junk information.
The junk probability threshold is a preset critical value of the junk probability.
When the junk probability of a piece of user original information reaches the junk probability threshold, that user original information is identified as junk information.
For example, if the preset junk probability threshold is 70%, user original information whose junk probability reaches 70% is identified as junk information.
With the method described above, semantic restoration is performed on the text content of the user original information, the resulting restored text is matched in the preset sample model library to obtain the junk probability that the user original information is junk information, and the user original information is then identified as junk information or not according to the preset junk probability threshold. Junk information can thus be recognized in user original information that has undergone semantic conversion, which greatly improves recognition accuracy.
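The flow of steps S120 to S140 can be sketched in Python as follows. This is only a minimal illustration: the function name `is_junk`, the callables `restore` and `score`, and the two toy stand-ins are hypothetical, and the patent does not prescribe any particular interface. The default threshold of 0.7 mirrors the 70% example above.

```python
from typing import Callable


def is_junk(text: str,
            restore: Callable[[str], str],
            score: Callable[[str], float],
            threshold: float = 0.7) -> bool:
    """Restore the text (S120), score it (S130), compare with the threshold (S140)."""
    restored = restore(text)
    probability = score(restored)
    # "Reaches" the threshold is read here as greater than or equal to it.
    return probability >= threshold


# Toy stand-ins (assumptions), used only to exercise the control flow.
def toy_restore(text: str) -> str:
    return text.replace("one two three", "123")


def toy_score(text: str) -> float:
    return 0.9 if "123" in text else 0.1
```

In a real deployment `restore` would be the semantic restoration step and `score` the matching operation against the sample model library.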
Fig. 2 is a flowchart of a junk information recognition method according to an exemplary embodiment. As shown in Fig. 2, step S120 of the embodiment corresponding to Fig. 1 may include the following steps.
In step S121, Chinese numerals in the text content are identified.
Chinese numerals are numbers written in Chinese characters. They include formal (capital) Chinese numerals and ordinary (small) Chinese numerals, for example "壹" and "一".
In a specific exemplary embodiment, Chinese numerals in the text content are identified by comparing the text content with a preset numeral lexicon.
In step S122, the Chinese numerals are converted to Arabic numerals to obtain the restored text corresponding to the text content.
With the method described above, Chinese numerals in the user original information are identified and converted to Arabic numerals, and the Arabic numerals are then matched in the preset sample model library to obtain the junk probability that the user original information is junk information. Junk information written with Chinese numerals can thus be recognized, which greatly improves recognition accuracy.
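Steps S121 and S122 can be illustrated with a digit-by-digit translation table. The table `CHINESE_DIGITS` and the function `restore_digits` are hypothetical names chosen for this sketch; the patent only states that a preset numeral lexicon is used and does not list its entries.

```python
# Hypothetical numeral lexicon: ordinary and formal Chinese digits -> Arabic digits.
CHINESE_DIGITS = {
    "零": "0", "〇": "0", "一": "1", "二": "2", "三": "3", "四": "4",
    "五": "5", "六": "6", "七": "7", "八": "8", "九": "9",
    "壹": "1", "贰": "2", "叁": "3", "肆": "4", "伍": "5",
    "陆": "6", "柒": "7", "捌": "8", "玖": "9",
}
_TABLE = str.maketrans(CHINESE_DIGITS)


def restore_digits(text: str) -> str:
    """Replace every Chinese digit character with its Arabic digit."""
    return text.translate(_TABLE)
```

This handles digit strings such as disguised account numbers. Positional numerals that use characters like 十 or 百 would need extra parsing, which is outside this sketch.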
Optionally, step S120 of the embodiment corresponding to Fig. 1 may also include the following step:
The semantics of the text content are restored according to a preset homophone/compound character lexicon to obtain the corresponding restored text.
The homophone/compound character lexicon is a dictionary that contains each text item together with its corresponding homophone and/or compound characters.
It will be understood that the text content of user original information may contain words produced by homophone/compound character conversion. Semantic analysis is therefore performed on the text content, and semantic restoration is applied to it according to the preset homophone/compound character lexicon.
For example, the user original information originally means "rogue goes extremely", but to avoid junk information recognition the text content is posted as "pomegranate awns goes extremely". The semantics of the user original information are recognized with the preset homophone/compound character lexicon, the homophone/compound characters are converted back, and "pomegranate awns goes extremely" is restored to "rogue goes extremely".
With the method described above, the semantics of the user original information are recognized with the preset homophone/compound character lexicon and the homophone/compound characters are converted back. This prevents junk information that was disguised through homophone/compound character conversion from going unrecognized and greatly improves recognition accuracy.
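A lexicon-based restoration of this kind can be sketched as a longest-first substitution. The entries in `HOMOPHONE_LEXICON` are illustrative assumptions (similar-sounding spellings of "WeChat" and "add me"), not entries taken from the patent, and `restore_homophones` is a hypothetical helper name.

```python
from typing import Mapping

# Hypothetical lexicon entries: disguised form -> intended form.
HOMOPHONE_LEXICON = {
    "威信": "微信",  # "prestige" sounds like "WeChat"
    "家我": "加我",  # "home me" sounds like "add me"
}


def restore_homophones(text: str, lexicon: Mapping[str, str]) -> str:
    """Replace each disguised form found in the text with its intended form."""
    # Longer disguised forms go first so shorter entries cannot break them up.
    for disguised in sorted(lexicon, key=len, reverse=True):
        text = text.replace(disguised, lexicon[disguised])
    return text
```

Plain substitution ignores context, so a production lexicon would likely need the semantic analysis described above to avoid rewriting legitimate uses of the same characters.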
Fig. 3 is a flowchart of a junk information recognition method according to an exemplary embodiment. As shown in Fig. 3, before step S130 of the embodiment corresponding to Fig. 1, the junk information recognition method may further include the following steps.
In step S210, content data is extracted from a predetermined database.
The database is a data warehouse that stores and manages online community information according to a data structure.
For example, the various information data of the Mei You community are stored in the predetermined database according to the data structure.
Content data is the text information stored in the database according to the data structure.
In step S220, feature extraction of text vectors is performed on the content data by a random forest algorithm.
In machine learning, a random forest is a classifier comprising multiple decision trees.
A text vector is the data form that characterizes the content data after feature extraction by the decision tree classifiers.
A random forest consists of multiple decision trees. Each node in a decision tree is a condition on some feature. The content data is classified according to the different conditions and then converted into a text vector according to the classification.
In step S230, the data category corresponding to the content data is obtained from the text vector and the corresponding weight vector.
The weight vector corresponds to the text vector: each weight component in the weight vector corresponds one-to-one with a text component in the text vector.
When content data is classified according to different conditions, each condition has a corresponding weight. Therefore, in the text vector obtained by feature extraction on the content data, each text component also has a corresponding weight component.
In a specific exemplary embodiment, the data category is the degree to which the content data is junk information, and content data is classified according to different degrees.
In a specific exemplary embodiment, the product of the text vector and the corresponding weight vector is calculated, and the corresponding data category is looked up from that product.
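The lookup in step S230 can be sketched as a dot product followed by a range lookup. The boundaries in `BOUNDARIES`, the labels in `CATEGORIES`, and the function name `data_category` are assumptions made for illustration; the patent does not specify how the product maps to a category.

```python
from bisect import bisect_right
from typing import Sequence

# Hypothetical score boundaries and category labels (degrees of junk).
BOUNDARIES = [0.3, 0.7]
CATEGORIES = ["normal", "suspicious", "junk"]


def data_category(text_vector: Sequence[float],
                  weight_vector: Sequence[float]) -> str:
    """Multiply text and weight components pairwise, sum, and look up the category."""
    if len(text_vector) != len(weight_vector):
        raise ValueError("text vector and weight vector must have equal length")
    score = sum(x * w for x, w in zip(text_vector, weight_vector))
    # bisect_right returns how many boundaries the score has reached or passed.
    return CATEGORIES[bisect_right(BOUNDARIES, score)]
```

The length check reflects the one-to-one correspondence between text components and weight components described above.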
In step S240, a rule engine is configured according to the content data and the corresponding data categories to form the sample model library.
A rule engine is a business rule decision component.
In a rule engine, rule conditions correspond to rule actions. The engine receives data input, interprets the business rules, and makes business decisions according to them. When the rule condition of a business rule is met, the corresponding rule action is triggered.
In a specific exemplary embodiment, a similarity rate between input text content and the content data is configured. When the similarity between the input text content and the content data reaches that similarity rate, the input text content is identified as belonging to the data category of that content data.
For example, the data category of content data B is junk information, and the rule condition configured in the rule engine is a similarity rate of 80% with content data B. If calculation and analysis show that the similarity rate between input text content A and content data B is 90%, text content A is identified as junk information.
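The rule condition and rule action in this example can be sketched as below. The patent does not say how the similarity rate is computed, so `difflib.SequenceMatcher.ratio()` from the Python standard library is used purely as a stand-in measure; the `Rule` class and `apply_rules` function are hypothetical names.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass
class Rule:
    content: str           # stored content data (for example, content data B)
    category: str          # data category assigned when the rule fires
    min_similarity: float  # rule condition, for example 0.8 for 80%


def similarity(a: str, b: str) -> float:
    """Stand-in similarity measure in the range 0.0 to 1.0 (an assumption)."""
    return SequenceMatcher(None, a, b).ratio()


def apply_rules(text: str, rules: List[Rule]) -> Optional[str]:
    """Return the category of the first rule whose condition is met, else None."""
    for rule in rules:
        if similarity(text, rule.content) >= rule.min_similarity:
            return rule.category
    return None
```

Any other similarity measure could be substituted for `similarity` without changing the rule structure.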
With the method described above, feature extraction is performed in advance on the content data in the database and the rule engine is configured to form the sample model library. When junk information is judged later, the junk probability of the text content is calculated in the sample model library, which greatly improves recognition accuracy.
Fig. 4 describes step S220 in further detail according to an exemplary embodiment. As shown in Fig. 4, the sample model library is divided into multiple sample model classes, and step S220 may include the following steps.
In step S221, semantic restoration is performed on the content data.
It will be understood that, to keep posted junk information from being screened out, a user may perform operations such as homophone substitution or character splitting before posting the user original information.
Semantic restoration therefore needs to be performed on the content data.
Semantic restoration is text processing applied to the content data according to its semantics. For example, a string of Chinese numerals is first converted to Arabic numerals, which are in turn converted to a QQ or WeChat token.
In a specific exemplary embodiment, the content data is "I prestige gives you family". Through homophone/compound character restoration, "I prestige gives you family" is converted into "plus my wechat gives you". The homophone/compound characters are restored using the preset homophone/compound character lexicon so that the junk information can be screened out.
In a specific exemplary embodiment, the content data is "No, and Vyuting1028103172 much teaches you." Semantic restoration maps every QQ or WeChat account to the same word: Vyuting1028103172 is extracted and converted into one generic dimension, so the content data after semantic restoration is "no, and wechat much teaches you". Junk information commonly asks readers to add a WeChat or QQ account. Treating all WeChat and QQ numbers as a single dimension keeps the text vector from growing too large and also avoids failing to recognize a WeChat or QQ number that has never appeared before.
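Collapsing account identifiers into one dimension can be sketched with a regular expression. The pattern `CONTACT_ID`, the placeholder string, and the five-digit minimum are all assumptions for illustration; the patent does not describe how account identifiers are detected.

```python
import re

# Hypothetical pattern: a token of letters, digits, "_" or "-" that contains
# a run of at least five consecutive digits.
CONTACT_ID = re.compile(r"[A-Za-z0-9_-]*\d{5,}[A-Za-z0-9_-]*")
PLACEHOLDER = "<CONTACT>"


def normalize_contacts(text: str) -> str:
    """Replace every account-like token with one generic placeholder."""
    return CONTACT_ID.sub(PLACEHOLDER, text)
```

Because every account maps to the same placeholder, a previously unseen account number produces the same feature as a known one.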
In step S222, word segmentation is performed on the content data after semantic restoration to obtain the segmented words corresponding to the content data.
It will be understood that content data may consist of multiple words, such as "plus my wechat gives you".
If feature extraction were performed directly on the content data after semantic restoration, the similarity between texts would be greatly affected. Therefore, before feature extraction, word segmentation is performed on the content data, and text vector feature extraction is then performed separately on each segmented word.
Word segmentation cuts a word sequence into individual words.
As stated above, content data is the text information stored in the database according to the data structure. That text information may be a single word, multiple words, or some other form.
Word segmentation therefore cuts the content data into individual segmented words.
Word segmentation can be performed on content data in several ways. The content data can be cut mechanically into segmented words based on character strings. Alternatively, semantic analysis can be performed on the content data, which is then cut into segmented words based on semantics. Other approaches can also be used and are not limited here.
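The mechanical, string-based option can be illustrated with forward maximum matching, one common dictionary-based segmentation technique. The patent does not name a specific algorithm, so this choice, the function name `segment`, and the `max_len` parameter are assumptions.

```python
from typing import List, Set


def segment(text: str, vocabulary: Set[str], max_len: int = 4) -> List[str]:
    """Cut text into words, preferring the longest vocabulary match at each position."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens
```

For instance, `segment("加我微信给你", {"微信", "给你"})` keeps "微信" and "给你" as whole words and emits the remaining characters one by one.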
In step S223, text vector feature extraction is performed by the random forest algorithm on each segmented word corresponding to the content data.
With the method described above, when the sample model library is built, semantic restoration and word segmentation are performed on the content data before text vector feature extraction. The text vectors extracted from the content data are therefore more accurate, which improves the accuracy of the sample model library.
Fig. 5 describes step S130 in further detail according to an exemplary embodiment. As shown in Fig. 5, the sample model library is divided into multiple sample model classes, and step S130 may include the following steps.
In step S131, a corresponding sample model class is chosen from the sample model library according to the user original information.
In the sample model library, the sample models are divided into multiple sample model classes, and each sample model class contains a predetermined number of sample models.
In step S132, a matching operation is performed on the user original information and the sample model class by the gradient descent algorithm to obtain the junk probability that the user original information is junk information.
In the matching operation, each stochastic gradient computation uses the sample models of one sample model class. That is:
X(t+1) = X(t) + ΔX(t)
ΔX(t) = -η g(t)
where η is the learning rate and g(t) is the gradient of X at time t.
Because the sample model library is divided into sample model classes, one sample model class can be chosen for the matching operation when the library contains many sample models. This reduces the resources consumed by the matching operation and allows it to converge quickly.
For example, if the first half and the second half of the sample models in the sample model library have the same gradient, the first half can be taken as one sample model class and the second half as another. In one traversal of the sample model library, the sample model class method then advances two steps toward the optimal solution, whereas matching against the whole library advances only one step.
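The update rule above, applied one sample model class at a time, can be sketched as follows. The patent does not state which model X parameterizes, so a logistic regression scorer is assumed here purely for illustration; `Sample`, `predict`, `sgd_step`, and `train` are hypothetical names.

```python
import math
from typing import List, Sequence, Tuple

# (feature vector, label); label 1 means junk, 0 means not junk.
Sample = Tuple[Sequence[float], int]


def predict(weights: Sequence[float], features: Sequence[float]) -> float:
    """Junk probability under the assumed logistic model."""
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def sgd_step(weights: List[float], batch: List[Sample], eta: float) -> List[float]:
    """One update X(t+1) = X(t) - eta * g(t), with g(t) averaged over one class."""
    grad = [0.0] * len(weights)
    for features, label in batch:
        error = predict(weights, features) - label
        for j, x in enumerate(features):
            grad[j] += error * x / len(batch)
    return [w - eta * g for w, g in zip(weights, grad)]


def train(classes: List[List[Sample]], dim: int,
          eta: float = 0.5, epochs: int = 200) -> List[float]:
    """Visit each sample model class in turn, taking one step per class."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for batch in classes:
            weights = sgd_step(weights, batch, eta)
    return weights
```

With two classes that yield the same gradient, each pass over the library takes two steps instead of one, which matches the two-halves example above.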
Optionally, when the sample model library contains repeated sample models, dividing it into sample model classes makes the matching operation converge even faster.
Optionally, after each matching operation, content data identified as junk information may be stored in the sample model library as a sample model.
With the method described above, the sample models in the sample model library are divided into multiple sample model classes, and each stochastic gradient matching operation is performed within one sample model class. This greatly reduces the consumption of computing resources, reaches convergence faster, and improves the efficiency of junk information recognition.
The following are device embodiments of the present invention, which can be used to carry out the junk information recognition method embodiments above. For details not disclosed in the device embodiments, refer to the junk information recognition method embodiments of the present invention.
Fig. 6 is a block diagram of a junk information recognition device according to an exemplary embodiment. The device includes but is not limited to: a text content extraction module 110, a semantic restoration module 120, a matching operation module 130, and a junk information identification module 140.
The text content extraction module 110, for extracting the text content of user original information;
The semantic restoration module 120, for performing semantic restoration on the text content to obtain restored text;
The matching operation module 130, for performing a matching operation on the restored text in a preset sample model library by means of a gradient descent algorithm to obtain the junk probability that the user original information is junk information;
The junk information identification module 140, for comparing the junk probability with a preset junk probability threshold to identify whether the user original information is junk information.
For the functions of the modules in the above device and how they are implemented, see the corresponding steps of the junk information recognition method above; they are not repeated here.
Optionally, as shown in Fig. 7, in the junk information identification device shown in the embodiment corresponding to Fig. 6, the semantic restoration module 120 further includes, but is not limited to: a Chinese numeral recognition unit 121 and a numeral conversion unit 122.
The Chinese numeral recognition unit 121 is configured to recognize Chinese numerals in the text content;
The numeral conversion unit 122 is configured to convert the Chinese numerals into Arabic numerals to obtain the restored text corresponding to the text content.
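A minimal sketch of units 121 and 122 is given below for illustration. It assumes a digit-by-digit mapping (as used when a phone or account number is spelled out in Chinese numerals) and, as a further assumption, only treats runs of two or more numeral characters as numerals so that ordinary words such as 一起 are left alone. Positional numerals such as 十 or 百 are not handled. The function names are hypothetical.

```python
import re

# Digit-wise mapping from Chinese numerals to Arabic numerals (assumed scope).
_DIGITS = {
    "零": "0", "〇": "0", "一": "1", "二": "2", "三": "3", "四": "4",
    "五": "5", "六": "6", "七": "7", "八": "8", "九": "9",
}

# A run of at least two numeral characters is treated as a numeral sequence.
_RUN = re.compile("[" + "".join(_DIGITS) + "]{2,}")


def find_chinese_numerals(text: str) -> list:
    """Unit 121 (sketch): locate Chinese numeral runs in the text content."""
    return [m.group(0) for m in _RUN.finditer(text)]


def restore_chinese_numerals(text: str) -> str:
    """Unit 122 (sketch): replace each run with its Arabic-digit form."""
    return _RUN.sub(lambda m: "".join(_DIGITS[c] for c in m.group(0)), text)
```

For example, under these assumptions the text 一起加微信一三八零零 keeps 一起 unchanged while the trailing numeral run is rewritten as digits.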
Optionally, in the junk information identification device shown in the embodiment corresponding to Fig. 6, the semantic restoration module 120 further includes, but is not limited to: a homophone/combined-character restoration unit.
The homophone/combined-character restoration unit is configured to restore the semantics of the text content according to a preset homophone/combined-character library, to obtain the corresponding restored text.
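One possible reading of this unit is a table lookup, sketched below. The table entries are hypothetical examples invented for illustration (a homophone substitution and a character split into its components); the actual preset library is not disclosed here. Longer entries are replaced first so that a short entry cannot break up a longer one.

```python
# Hypothetical homophone/combined-character table: variant -> intended text.
VARIANT_TABLE = {
    "薇信": "微信",  # homophone substitution (assumed example)
    "力口": "加",    # one character written as two components (assumed example)
}


def restore_variants(text: str, table: dict = VARIANT_TABLE) -> str:
    """Replace every known variant with the text it stands for."""
    for variant in sorted(table, key=len, reverse=True):
        text = text.replace(variant, table[variant])
    return text
```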
Fig. 8 is a block diagram of another junk information identification device shown according to the embodiment corresponding to Fig. 6. The device further includes, but is not limited to: a content data extraction module 210, a feature extraction module 220, a data category determination module 230, and a sample model library generation module 240.
The content data extraction module 210 is configured to extract content data from a predetermined database;
The feature extraction module 220 is configured to extract text vector features from the content data by a random forest algorithm;
The data category determination module 230 is configured to determine the data category corresponding to the content data according to the text vector and the corresponding weight vector;
The sample model library generation module 240 is configured to configure a rule engine according to the content data and the corresponding data category, to form the sample model library.
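The text does not specify how the text vector and the weight vector are combined in module 230. One plausible sketch, offered purely as an assumption, scores each candidate category by the dot product of the text vector with that category's weight vector and picks the highest score. The category names and weights below are invented for illustration.

```python
from typing import Dict, Sequence


def determine_category(
    text_vector: Sequence[float],
    weight_vectors: Dict[str, Sequence[float]],
) -> str:
    """Return the category whose weight vector gives the largest dot product."""
    scores = {
        category: sum(x * w for x, w in zip(text_vector, weights))
        for category, weights in weight_vectors.items()
    }
    return max(scores, key=scores.get)


# Hypothetical weights for two categories over three features.
example_weights = {
    "junk": [0.5, 0.0, 1.0],
    "normal": [0.1, 0.9, 0.0],
}
example_category = determine_category([1, 0, 2], example_weights)
```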
Optionally, as shown in Fig. 9, the feature extraction module 220 shown in the embodiment corresponding to Fig. 8 includes, but is not limited to: a semantic restoration unit 221, a word segmentation unit 222, and a segment feature extraction unit 223.
The semantic restoration unit 221 is configured to perform semantic restoration on the content data;
The word segmentation unit 222 is configured to perform word segmentation on the semantically restored content data, to obtain the text segments corresponding to the content data;
The segment feature extraction unit 223 is configured to extract, by a random forest algorithm, text vector features from each of the text segments corresponding to the content data.
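The sketch below illustrates only the data flow from segmentation to a text vector. It is not the random forest feature extraction named above: character bigrams stand in for a real Chinese word segmenter, and plain term counts stand in for the extracted features. Both substitutions, and all function names, are assumptions made to keep the example self-contained.

```python
from collections import Counter
from typing import Dict, List


def segment(text: str) -> List[str]:
    """Stand-in for unit 222: split text into overlapping character bigrams."""
    chars = [c for c in text if not c.isspace()]
    bigrams = ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]
    return bigrams or chars  # a one-character text yields that character


def build_vocabulary(segmented_docs: List[List[str]]) -> Dict[str, int]:
    """Assign each distinct segment an index in order of first appearance."""
    vocab: Dict[str, int] = {}
    for tokens in segmented_docs:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_vector(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Stand-in for unit 223: count how often each vocabulary segment occurs."""
    counts = Counter(tokens)
    vector = [0] * len(vocab)
    for token, n in counts.items():
        if token in vocab:
            vector[vocab[token]] = n
    return vector
```

In a fuller implementation, such count vectors could be the input from which a random forest selects or weights features.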
Optionally, as shown in Fig. 10, the sample model library is divided into multiple sample model classes, and the matching operation module 130 shown in the embodiment corresponding to Fig. 6 includes, but is not limited to: a sample model class selection unit 131 and a matching operation unit 132.
The sample model class selection unit 131 is configured to select one corresponding sample model class from the sample model library according to the user-generated information;
The matching operation unit 132 is configured to perform, by a gradient descent algorithm, a matching operation on the user-generated information and the sample model class, to obtain the junk probability that the user-generated information is junk information.
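To illustrate why restricting the computation to one sample model class saves work, the sketch below fits a logistic regression by stochastic gradient descent on the samples of a single class only and then scores one feature vector. The use of logistic regression, the class key, the hyperparameters, and the layout of the library (a mapping from class name to labeled feature vectors) are all assumptions; the text states only that a gradient descent algorithm is applied within the selected class.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (feature vector, label: 1 = junk)


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sgd_fit(samples: List[Sample], dim: int, epochs: int = 200,
            lr: float = 0.5, seed: int = 0) -> Tuple[List[float], float]:
    """Fit logistic regression weights with per-sample gradient updates."""
    rng = random.Random(seed)
    weights, bias = [0.0] * dim, 0.0
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            p = _sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            grad = p - y  # gradient of the log loss w.r.t. the linear score
            weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
            bias -= lr * grad
    return weights, bias


def match(features: Sequence[float], class_key: str,
          library: Dict[str, List[Sample]]) -> float:
    """Units 131 and 132 (sketch): use one class only, return junk probability."""
    samples = library[class_key]             # unit 131: a single class is chosen
    weights, bias = sgd_fit(samples, dim=len(features))  # unit 132
    return _sigmoid(sum(w * xi for w, xi in zip(weights, features)) + bias)
```

Because `match` touches only `library[class_key]`, the cost of each matching operation grows with the size of that class rather than with the size of the whole library.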
Fig. 11 is a block diagram of a system 100 according to an exemplary embodiment. Referring to Fig. 11, the system 100 may include one or more of the following components: a processing component 101, a memory 102, a power supply component 103, a multimedia component 104, an audio component 105, a sensor component 107, and a communication component 108. Not all of these components are required; the system 100 may add other components or omit some components according to its own functional requirements, which is not limited in this embodiment.
The processing component 101 typically controls the overall operation of the system 100, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component 101 may include one or more processors 109 to execute instructions, so as to complete all or part of the steps of the above operations. In addition, the processing component 101 may include one or more modules to facilitate interaction between the processing component 101 and other components. For example, the processing component 101 may include a multimedia module to facilitate interaction between the multimedia component 104 and the processing component 101.
The memory 102 is configured to store various types of data to support operation of the system 100. Examples of such data include instructions for any application or method operating on the system 100. The memory 102 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as SRAM (Static Random Access Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), EPROM (Erasable Programmable Read-Only Memory), PROM (Programmable Read-Only Memory), ROM (Read-Only Memory), magnetic memory, flash memory, a magnetic disk, or an optical disc. The memory 102 also stores one or more modules, which are configured to be executed by the one or more processors 109 to complete all or part of the steps of the method shown in any of Fig. 1, Fig. 2, Fig. 3, Fig. 4, and Fig. 5.
The power supply component 103 provides power to the various components of the system 100. The power supply component 103 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the system 100.
The multimedia component 104 includes a screen that provides an output interface between the system 100 and the user. In some embodiments, the screen may include an LCD (Liquid Crystal Display) and a TP (Touch Panel). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation.
The audio component 105 is configured to output and/or input audio signals. For example, the audio component 105 includes a microphone; when the system 100 is in an operating mode, such as a call mode, a recording mode, or a speech recognition mode, the microphone is configured to receive external audio signals. The received audio signals may be further stored in the memory 102 or sent via the communication component 108. In some embodiments, the audio component 105 also includes a speaker for outputting audio signals.
The sensor component 107 includes one or more sensors for providing status assessments of various aspects of the system 100. For example, the sensor component 107 can detect the open/closed state of the system 100 and the relative positioning of components; the sensor component 107 can also detect a change in position of the system 100 or of one of its components, and a change in temperature of the system 100. In some embodiments, the sensor component 107 may also include a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 108 is configured to facilitate wired or wireless communication between the system 100 and other devices. The system 100 can access a wireless network based on a communication standard, such as WiFi (Wireless-Fidelity), 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 108 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 108 also includes an NFC (Near Field Communication) module to facilitate short-range communication. For example, the NFC module may be implemented based on RFID (Radio Frequency Identification) technology, IrDA (Infrared Data Association) technology, UWB (Ultra-Wideband) technology, BT (Bluetooth) technology, and other technologies.
In an exemplary embodiment, the system 100 may be implemented by one or more ASICs (Application Specific Integrated Circuits), DSPs (Digital Signal Processors), PLDs (Programmable Logic Devices), FPGAs (Field-Programmable Gate Arrays), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above method.
The specific manner in which the processor of the system in this embodiment performs operations has been described in detail in the embodiments of the related data transmission control method, and will not be elaborated here.
Optionally, the present invention also provides a system that performs all or part of the steps of the junk information identification method shown in any of Fig. 1, Fig. 2, Fig. 3, Fig. 4, and Fig. 5. The system includes:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform:
extracting the text content of user-generated information;
performing semantic restoration on the text content to obtain restored text;
performing, by a gradient descent algorithm, a matching operation on the restored text in a preset sample model library, to obtain the junk probability that the user-generated information is junk information;
identifying the user-generated information as junk information by comparing the junk probability with a preset junk probability threshold.
The specific manner in which the processor of the system in this embodiment performs operations has been described in detail in the embodiments of the junk information identification method, and will not be elaborated here.
In an exemplary embodiment, a storage medium is also provided. The storage medium is a computer-readable storage medium, which may be, for example, a transitory or non-transitory computer-readable storage medium including instructions. The storage medium is, for example, the memory 102 including instructions, and the instructions can be executed by the processor 109 of the system 100 to complete the above junk information identification method.
It should be understood that the present invention is not limited to the precise structure described above and shown in the drawings, and that those skilled in the art may make various modifications and changes without departing from its scope. The scope of the present invention is limited only by the appended claims.

Claims (13)

1. A junk information identification method, characterized in that the method comprises:
extracting the text content of user-generated information;
performing semantic restoration on the text content to obtain restored text;
performing, by a gradient descent algorithm, a matching operation on the restored text in a preset sample model library, to obtain the junk probability that the user-generated information is junk information;
identifying the user-generated information as junk information by comparing the junk probability with a preset junk probability threshold.
2. The method according to claim 1, characterized in that the step of performing semantic restoration on the text content to obtain restored text comprises:
recognizing Chinese numerals in the text content;
converting the Chinese numerals into Arabic numerals to obtain the restored text corresponding to the text content.
3. The method according to claim 1, characterized in that the step of performing semantic restoration on the text content to obtain restored text comprises:
restoring the semantics of the text content according to a preset homophone/combined-character library, to obtain the corresponding restored text.
4. The method according to claim 1, characterized in that, before the step of performing, by a gradient descent algorithm, a matching operation on the restored text in a preset sample model library to obtain the junk probability that the user-generated information is junk information, the method further comprises:
extracting content data from a predetermined database;
extracting text vector features from the content data by a random forest algorithm;
determining the data category corresponding to the content data according to the text vector and the corresponding weight vector;
configuring a rule engine according to the content data and the corresponding data category, to form the sample model library.
5. The method according to claim 4, characterized in that the step of extracting text vector features from the content data by a random forest algorithm comprises:
performing semantic restoration on the content data;
performing word segmentation on the semantically restored content data, to obtain the text segments corresponding to the content data;
extracting, by a random forest algorithm, text vector features from each of the text segments corresponding to the content data.
6. The method according to claim 1, characterized in that the sample model library is divided into multiple sample model classes, and the step of performing, by a gradient descent algorithm, a matching operation on the restored text in a preset sample model library to obtain the junk probability that the user-generated information is junk information comprises:
selecting one corresponding sample model class from the sample model library according to the user-generated information;
performing, by a gradient descent algorithm, a matching operation on the user-generated information and the sample model class, to obtain the junk probability that the user-generated information is junk information.
7. A junk information identification device, characterized in that the device comprises:
a text content extraction module, configured to extract the text content of user-generated information;
a semantic restoration module, configured to perform semantic restoration on the text content to obtain restored text;
a matching operation module, configured to perform, by a gradient descent algorithm, a matching operation on the restored text in a preset sample model library, to obtain the junk probability that the user-generated information is junk information;
a junk information identification module, configured to identify the user-generated information as junk information by comparing the junk probability with a preset junk probability threshold.
8. The device according to claim 7, characterized in that the semantic restoration module comprises:
a Chinese numeral recognition unit, configured to recognize Chinese numerals in the text content;
a numeral conversion unit, configured to convert the Chinese numerals into Arabic numerals to obtain the restored text corresponding to the text content.
9. The device according to claim 7, characterized in that the semantic restoration module comprises:
a homophone/combined-character restoration unit, configured to restore the semantics of the text content according to a preset homophone/combined-character library, to obtain the corresponding restored text.
10. The device according to claim 7, characterized in that the device further comprises:
a content data extraction module, configured to extract content data from a predetermined database;
a feature extraction module, configured to extract text vector features from the content data by a random forest algorithm;
a data category determination module, configured to determine the data category corresponding to the content data according to the text vector and the corresponding weight vector;
a sample model library generation module, configured to configure a rule engine according to the content data and the corresponding data category, to form the sample model library.
11. The device according to claim 10, characterized in that the feature extraction module comprises:
a semantic restoration unit, configured to perform semantic restoration on the content data;
a word segmentation unit, configured to perform word segmentation on the semantically restored content data, to obtain the text segments corresponding to the content data;
a segment feature extraction unit, configured to extract, by a random forest algorithm, text vector features from each of the text segments corresponding to the content data.
12. The device according to claim 7, characterized in that the sample model library is divided into multiple sample model classes, and the matching operation module comprises:
a sample model class selection unit, configured to select one corresponding sample model class from the sample model library according to the user-generated information;
a matching operation unit, configured to perform, by a gradient descent algorithm, a matching operation on the user-generated information and the sample model class, to obtain the junk probability that the user-generated information is junk information.
13. A system, characterized in that the system comprises:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform:
extracting the text content of user-generated information;
performing semantic restoration on the text content to obtain restored text;
performing, by a gradient descent algorithm, a matching operation on the restored text in a preset sample model library, to obtain the junk probability that the user-generated information is junk information;
identifying the user-generated information as junk information by comparing the junk probability with a preset junk probability threshold.
CN201710417747.0A 2017-06-05 2017-06-05 Junk information identification method, device and system Active CN107239447B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710417747.0A CN107239447B (en) 2017-06-05 2017-06-05 Junk information identification method, device and system

Publications (2)

Publication Number Publication Date
CN107239447A (en) 2017-10-10
CN107239447B CN107239447B (en) 2020-12-18

Family

ID=59984879

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710417747.0A Active CN107239447B (en) 2017-06-05 2017-06-05 Junk information identification method, device and system

Country Status (1)

Country Link
CN (1) CN107239447B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101699432A (en) * 2009-11-13 2010-04-28 黑龙江工程学院 Ordering strategy-based information filtering system
CN101908055A (en) * 2010-03-05 2010-12-08 黑龙江工程学院 Method for setting information classification threshold for optimizing lam percentage and information filtering system using same
US20110258201A1 (en) * 2008-05-28 2011-10-20 Barracuda Inc. Multilevel intent analysis apparatus & method for email filtration
CN102567304A (en) * 2010-12-24 2012-07-11 北大方正集团有限公司 Filtering method and device for network malicious information
CN103166830A (en) * 2011-12-14 2013-06-19 中国电信股份有限公司 Spam email filtering system and method capable of intelligently selecting training samples
CN104038391A (en) * 2014-07-02 2014-09-10 网易(杭州)网络有限公司 Method and device for detecting junk email
CN104050556A (en) * 2014-05-27 2014-09-17 哈尔滨理工大学 Feature selection method and detection method of junk mails
CN104702492A (en) * 2015-03-19 2015-06-10 百度在线网络技术(北京)有限公司 Garbage message model training method, garbage message identifying method and device thereof
KR20160067473A (en) * 2014-12-04 2016-06-14 숭실대학교산학협력단 Method for spam classfication, recording medium and device for performing the method
CN106202330A (en) * 2016-07-01 2016-12-07 北京小米移动软件有限公司 The determination methods of junk information and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HASSAN NAJADAT: "Mobile SMS Spam Filtering based on Mixing Classifiers", 《INTERNATIONAL JOURNAL OF ADVANCED COMPUTING RESEARCH》 *
吴世竞: "垃圾短信过滤***的设计与实现", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
周彦: "中文文本分类方法的研究与实现", 《中国学位论文全文数据库》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109255069A (en) * 2018-07-31 2019-01-22 阿里巴巴集团控股有限公司 A kind of discrete text content risks recognition methods and system
CN109344388A (en) * 2018-08-02 2019-02-15 中央电视台 A kind of comment spam recognition methods, device and computer readable storage medium
CN109344388B (en) * 2018-08-02 2023-06-09 中央电视台 Method and device for identifying spam comments and computer-readable storage medium
CN109597987A (en) * 2018-10-25 2019-04-09 阿里巴巴集团控股有限公司 A kind of text restoring method, device and electronic equipment
CN109829102A (en) * 2018-12-27 2019-05-31 浙江工业大学 A kind of web advertisement recognition methods based on random forest
CN111581959A (en) * 2019-01-30 2020-08-25 北京京东尚科信息技术有限公司 Information analysis method, terminal and storage medium

Also Published As

Publication number Publication date
CN107239447B (en) 2020-12-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 361000 Area 1F-D1, Huaxun Building A, Software Park, Xiamen Torch High-tech Zone, Xiamen City, Fujian Province

Applicant after: Xiamen Meishao Co., Ltd.

Address before: Unit G03, Room 102, 22 Guanri Road, Phase II, Xiamen Software Park, Fujian Province

Applicant before: XIAMEN MEIYOU INFORMATION SCIENCE & TECHNOLOGY Co.,Ltd.

GR01 Patent grant