CN109657134A - Data filtering method and device

Publication number
CN109657134A
Authority
CN
China
Prior art keywords
data
sensitive keys
tested
caption information
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811313297.1A
Other languages
Chinese (zh)
Inventor
罗玄
黄君实
陈强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd filed Critical Beijing Qihoo Technology Co Ltd
Priority to CN201811313297.1A priority Critical patent/CN109657134A/en
Publication of CN109657134A publication Critical patent/CN109657134A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention provides a data filtering method and device. The method comprises: obtaining the text title information of data to be detected, and judging whether the text title information includes a sensitive keyword in a preset keyword library; if the text title information includes a sensitive keyword in the preset keyword library, obtaining the number of sensitive keywords; obtaining the network click volume of the data to be detected; and filtering the data to be detected based on its network click volume and the number of sensitive keywords included in the text title information. The scheme provided by the invention can not only quickly filter out junk data such as violent or vulgar content, but can also identify in a timely manner data that is more deeply hidden and needs filtering, improving the network environment while raising filtering efficiency.

Description

Data filtering method and device
Technical field
The present invention relates to the technical field of data processing, and in particular to a data filtering method and device.
Background
With the continuous development of network technology, more and more people publish, transmit and obtain all kinds of information and data over the network. Because the network's coverage is very wide, the data propagated on it takes many types and forms, such as text, images, sound and video. Besides news, entertainment and encyclopedia data, the data circulating on the network also contains a considerable amount of bad data such as vulgar and violent content. Suppressing and filtering such data is therefore particularly important.
Summary of the invention
In view of the above problems, the present invention provides a data filtering method and device that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a data filtering method is provided, comprising:
Obtaining the text title information of data to be detected, and judging whether the text title information includes a sensitive keyword in a preset keyword library;
If the text title information includes a sensitive keyword in the preset keyword library, obtaining the number of sensitive keywords;
Obtaining the network click volume of the data to be detected;
Filtering the data to be detected based on the network click volume of the data to be detected and the number of sensitive keywords included in the text title information.
Optionally, filtering the data to be detected based on the network click volume of the data to be detected and the number of sensitive keywords included in the text title information comprises:
If the network click volume of the data to be detected exceeds a first preset click volume, and the number of sensitive keywords included in the text title information exceeds a first preset value, filtering the data to be detected; and/or
If the network click volume of the data to be detected is lower than a second preset click volume, and the number of sensitive keywords included in the text title information exceeds a second preset value, filtering the data to be detected.
Optionally, obtaining the text title information of the data to be detected and judging whether the text title information includes a sensitive keyword in the preset keyword library comprises:
Obtaining the network click volume of each piece of data in a preset database, sorting the data by network click volume, and generating a hot data library from the data whose network click volume falls within a preset range;
Selecting any piece of data in the hot data library as the data to be detected, and obtaining the text title information of the data to be detected;
Judging whether the text title information includes a sensitive keyword in the preset keyword library.
Optionally, before judging whether the text title information includes a sensitive keyword in the preset keyword library, the method further comprises:
Obtaining manually reviewed sensitive keywords and/or sensitive keywords extracted from the text title information of data that has already been filtered;
Constructing the preset keyword library based on the sensitive keywords.
Optionally, judging whether the text title information includes a sensitive keyword in the preset keyword library comprises:
Segmenting the text title information into words to obtain at least one word included in the text title information;
Matching the word against the sensitive keywords in the preset keyword library;
If the word matches a sensitive keyword in the preset keyword library, judging that the text title information includes a sensitive keyword in the preset keyword library;
If the word matches no sensitive keyword in the preset keyword library, judging that the text title information does not include a sensitive keyword in the preset keyword library.
Optionally, the data to be detected includes network video data, and obtaining the text title information of the data to be detected and judging whether the text title information includes a sensitive keyword in the preset keyword library comprises:
Obtaining the text title information of video data stored on a video server and/or of video data being live-streamed by a streamer, and judging whether the text title information includes a sensitive keyword in the preset keyword library.
Optionally, obtaining the text title information of the data to be detected and judging whether the text title information includes a sensitive keyword in the preset keyword library further comprises:
Obtaining the text title information of the video data a user is currently watching, and judging whether the text title information includes a sensitive keyword in the preset keyword library.
According to another aspect of the present invention, a data filtering device is also provided, comprising:
A judgment module, configured to obtain the text title information of data to be detected and judge whether the text title information includes a sensitive keyword in a preset keyword library;
A first obtaining module, configured to obtain the number of sensitive keywords if the text title information includes a sensitive keyword in the preset keyword library;
A second obtaining module, configured to obtain the network click volume of the data to be detected;
A filtering module, configured to filter the data to be detected based on the network click volume of the data to be detected and the number of sensitive keywords included in the text title information.
Optionally, the filtering module includes:
A first filtering unit, configured to filter the data to be detected when the network click volume of the data to be detected exceeds a first preset click volume and the number of sensitive keywords included in the text title information exceeds a first preset value; and/or
A second filtering unit, configured to filter the data to be detected when the network click volume of the data to be detected is lower than a second preset click volume and the number of sensitive keywords included in the text title information exceeds a second preset value.
Optionally, the judgment module is further configured to:
Obtain the network click volume of each piece of data in a preset database, sort the data by network click volume, and generate a hot data library from the data whose network click volume falls within a preset range;
Select any piece of data in the hot data library as the data to be detected, and obtain the text title information of the data to be detected;
Judge whether the text title information includes a sensitive keyword in the preset keyword library.
Optionally, the judgment module is further configured to:
Before judging whether the text title information includes a sensitive keyword in the preset keyword library, obtain manually reviewed sensitive keywords and/or sensitive keywords extracted from the text title information of data that has already been filtered;
Construct the preset keyword library based on the sensitive keywords.
Optionally, the judgment module is further configured to:
Segment the text title information into words to obtain at least one word included in the text title information;
Match the word against the sensitive keywords in the preset keyword library;
When the word matches a sensitive keyword in the preset keyword library, judge that the text title information includes a sensitive keyword in the preset keyword library;
When the word matches no sensitive keyword in the preset keyword library, judge that the text title information does not include a sensitive keyword in the preset keyword library.
Optionally, the data to be detected includes network video data;
The judgment module is further configured to obtain the text title information of video data stored on a video server and/or of video data being live-streamed by a streamer, and judge whether the text title information includes a sensitive keyword in the preset keyword library.
Optionally, the judgment module is further configured to obtain the text title information of the video data a user is currently watching, and judge whether the text title information includes a sensitive keyword in the preset keyword library.
According to another aspect of the present invention, a computer storage medium is also provided. The computer storage medium stores computer program code which, when run on a computing device, causes the computing device to execute the data filtering method described in any of the above.
According to another aspect of the present invention, a computing device is also provided, comprising:
A processor;
A memory storing computer program code;
When the computer program code is run by the processor, the computing device is caused to execute the data filtering method described in any of the above.
The present invention provides a more efficient data filtering method and device. In the data filtering method provided by the invention, it is judged whether the text title information of the data to be detected includes a sensitive keyword, and if so, the number of sensitive keywords is obtained. The network click volume of the data to be detected is also obtained in order to judge how widely the data has spread, and the data is then filtered by combining the number of sensitive keywords in its text title information with its network click volume. By detecting and filtering the data to be detected through a combination of sensitive keywords and popularity, the method can not only directly filter out junk data such as violent or vulgar content, but can also identify in a timely manner data that is more deeply hidden and needs filtering, thereby raising the filtering efficiency for bad data and junk data and improving the network environment.
The above is only an overview of the technical scheme of the present invention. So that the technical means of the present invention can be understood more clearly and implemented in accordance with the contents of the specification, and so that the above and other objects, features and advantages of the present invention are easier to understand, specific embodiments of the present invention are set out below.
From the following detailed description of specific embodiments of the present invention, read in conjunction with the accompanying drawings, those skilled in the art will understand more clearly the above and other objects, advantages and features of the present invention.
Brief description of the drawings
By reading the following detailed description of the preferred embodiments, various other advantages and benefits will become clear to those of ordinary skill in the art. The drawings serve only to illustrate the preferred embodiments and are not to be considered a limitation of the present invention. Throughout the drawings, the same reference numbers refer to the same parts. In the drawings:
Fig. 1 is a flow diagram of a data filtering method according to an embodiment of the present invention;
Fig. 2 is a flow diagram of a data filtering method according to a preferred embodiment of the present invention;
Fig. 3 is a structural diagram of a data filtering device according to an embodiment of the present invention;
Fig. 4 is a structural diagram of a data filtering device according to a preferred embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be realized in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope conveyed fully to those skilled in the art.
Fig. 1 is a flow diagram of a data filtering method according to an embodiment of the present invention. As shown in Fig. 1, the data filtering method according to the embodiment of the present invention may include:
Step S102: obtain the text title information of the data to be detected, and judge whether the text title information includes a sensitive keyword in a preset keyword library;
Step S104: if the text title information includes a sensitive keyword in the preset keyword library, obtain the number of sensitive keywords;
Step S106: obtain the network click volume of the data to be detected;
Step S108: filter the data to be detected based on the network click volume of the data to be detected and the number of sensitive keywords included in the text title information.
The embodiment of the present invention provides an efficient data filtering method. It judges whether the text title information of the data to be detected includes a sensitive keyword and, if so, obtains the number of sensitive keywords. It also obtains the network click volume of the data to be detected in order to judge how widely the data has spread, and then filters the data by combining the number of sensitive keywords in its text title information with its network click volume. Because some bad data, such as vulgar or violent content, may spread quickly and become very popular, detecting and filtering the data to be detected through a combination of sensitive keywords and popularity can not only directly filter out junk data such as violent or vulgar content, but can also identify in a timely manner data that is more deeply hidden and needs filtering, thereby raising the filtering efficiency for bad data and junk data and improving the network environment.
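The flow of steps S102 to S108 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the names `Item`, `count_sensitive`, `should_filter` and `example_rule` are hypothetical, whitespace splitting stands in for real word segmentation, and the decision rule is passed in as a callback because the specification leaves the concrete rule to later embodiments.

```python
from dataclasses import dataclass
from typing import Callable, Set


@dataclass
class Item:
    title: str   # text title information of the data to be detected
    clicks: int  # network click volume


def count_sensitive(title: str, keyword_library: Set[str]) -> int:
    """S102/S104: count title words that appear in the keyword library.

    Assumption: a lowercase whitespace split stands in for word
    segmentation, and every occurrence is counted (not distinct keywords).
    """
    words = title.lower().split()
    return sum(1 for w in words if w in keyword_library)


def should_filter(item: Item, keyword_library: Set[str],
                  decide: Callable[[int, int], bool]) -> bool:
    n = count_sensitive(item.title, keyword_library)
    if n == 0:
        # S102: no sensitive keyword in the title, nothing further to do
        return False
    # S106 + S108: combine click volume with the keyword count
    return decide(item.clicks, n)


def example_rule(clicks: int, keyword_count: int) -> bool:
    # Hypothetical rule used only for this illustration
    return clicks > 1000 and keyword_count >= 2
```

For instance, `should_filter(Item("gore brawl compilation", 5000), {"gore", "brawl"}, example_rule)` evaluates the rule with a click volume of 5000 and a keyword count of 2.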
Optionally, the data to be detected in the embodiment of the present invention may be network picture data, video data or similar data. When the data to be detected is network video data, it may be video data stored on a video server and/or video data being live-streamed by a streamer. That is, step S102 may also include: obtaining the text title information of video data stored on a video server and/or of video data being live-streamed by a streamer, and judging whether the text title information includes a sensitive keyword in the preset keyword library. With the rise of short video, more and more users share casually shot videos with other network users over the network. The video data uploaded by each user is stored on the video server, and is released to other network users only after it has been detected and filtered, which protects the online safety of other network users and prevents the spread of bad data.
In addition, live streaming is currently a popular way of spreading data, but its data volume is very large and its propagation speed is hard to control. Therefore, besides detecting the video data stored on the video server, the embodiment of the present invention can also detect the video data being live-streamed by a streamer, so that bad data is suppressed and filtered in time and the network environment is purified.
In an alternative embodiment of the invention, step S102 may also include obtaining the text title information of the video data a user is currently watching, and judging whether the text title information includes a sensitive keyword in the preset keyword library. For some video data, other users may already request it just after it has been uploaded. The scheme provided by the embodiment of the present invention can therefore detect the video data a user is currently watching, that is, perform data detection while providing the video service to the user. Alternatively, the video data can be detected before the video service is provided to the user, which further improves data processing efficiency.
As mentioned in step S108 above, the data to be detected can be filtered based on its network click volume and the number of sensitive keywords included in its text title information. In a preferred embodiment of the invention, a filtering rule can be preset, and the data to be detected is filtered when its network click volume and the number of sensitive keywords in its text title information satisfy the filtering rule. Take video data as an example. Some video data may be vulgar even though sensitive keywords appear in its text title information only occasionally, as a way of attracting interest; note, however, that the click volume of such video data may be very high. Therefore, when a video contains some sensitive keywords and its click volume is especially high, it is likely to be a vulgar video and should be filtered. Data whose text title information itself includes a large number of sensitive keywords can be filtered directly even if its click volume is not high. That is, when filtering the data to be detected, step S108 may proceed in the following ways:
Step S108-1: if the network click volume of the data to be detected exceeds the first preset click volume, and the number of sensitive keywords included in the text title information exceeds the first preset value, filter the data to be detected; and/or
Step S108-2: if the network click volume of the data to be detected is lower than the second preset click volume, and the number of sensitive keywords included in the text title information exceeds the second preset value, filter the data to be detected.
That is, the embodiment of the present invention can preset parameters related to data popularity such as network click volume, namely the first preset click volume and the second preset click volume, as well as parameters related to the sensitive keywords in the text title information, namely the first preset value and the second preset value. Suppose that in a practical application the first preset click volume is set to 1000, the second preset click volume is also 1000, the first preset value is 2, and the second preset value is 5. For a piece of data to be detected, its network click volume is obtained first, and then the number of sensitive keywords in its text title information. If the network click volume is greater than 1000 and the text title information includes 2 or more sensitive keywords, the data to be detected is suppressed and filtered. If the network click volume is less than 1000 but the text title information includes 6 sensitive keywords, the data to be detected is likewise filtered. In practice, the number of sensitive keywords in the text title information may also be obtained first and the network click volume afterwards. The first preset click volume, the second preset click volume, the first preset value and the second preset value can all be configured for different scenarios and filtering needs; the present invention does not limit them.
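The two-branch rule of steps S108-1 and S108-2 can be sketched as below, using the example figures from the text (1000, 1000, 2, 5) as defaults. The names `Thresholds` and `meets_filter_rule` are hypothetical. The comparisons are strict, following the wording "exceeds" and "lower than" in S108-1 and S108-2; this is an assumption, since the worked example speaks of "2 or more" keywords, which would correspond to `>=` on the first count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    first_clicks: int = 1000   # first preset click volume
    second_clicks: int = 1000  # second preset click volume
    first_count: int = 2       # first preset value
    second_count: int = 5      # second preset value


def meets_filter_rule(clicks: int, keyword_count: int,
                      t: Thresholds = Thresholds()) -> bool:
    # S108-1: popular data with more than a few sensitive keywords
    hot_rule = clicks > t.first_clicks and keyword_count > t.first_count
    # S108-2: less popular data with many sensitive keywords
    cold_rule = clicks < t.second_clicks and keyword_count > t.second_count
    return hot_rule or cold_rule
```

With these strict comparisons, a click volume of exactly 1000 satisfies neither branch. A deployment would need to decide which side of each threshold is inclusive.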
Because the volume of network data is very large, detecting every piece of data one by one during filtering may involve a heavy workload and low efficiency. Therefore, in a preferred embodiment, when choosing the data to be detected, the network click volume of each piece of data in a preset database can first be obtained and the data sorted by network click volume, and a hot data library generated from the data whose network click volume falls within a preset range. Any piece of data in the hot data library is then selected as the data to be detected, its text title information is obtained, and it is judged whether the text title information includes a sensitive keyword in the preset keyword library.
As mentioned above, the filtering of the data to be detected is based on judging its network click volume and the number of sensitive keywords included in its text title information. Therefore, in the preferred embodiment of the present invention, the data in the preset database can first be sorted by network click volume, the data with higher popularity selected to build a hot data library, and the data in the hot data library then taken as data to be detected and filtered after detection. Because network click volume is itself one of the filtering criteria, and the click volume of each piece of data is already known when the hot data library is built, filtering a piece of data selected from the library only requires judging the number of sensitive keywords in its text title information, which further improves filtering efficiency and saves detection time.
For example, the data in the preset database can be divided into tiers: the data whose network click volume ranks in the Top 10 is used to build the hot data library. Within the hot data library, data whose click volume that day exceeds 1000 is high-heat data; if its text title information also includes several sensitive keywords, the data to be detected may be judged likely to be vulgar and is filtered. Data whose network click volume is below 1000 is low-heat data; if it nevertheless includes many sensitive keywords, it may likewise be judged likely to be vulgar and also needs to be filtered.
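Building the hot data library by ranking can be sketched as follows. The function name `build_hot_library` and the representation of the preset database as a mapping from a data identifier to its click volume are assumptions made for illustration. The "preset range" is interpreted here as the Top N by click volume, matching the Top 10 example.

```python
from typing import Dict, List, Tuple


def build_hot_library(click_counts: Dict[str, int],
                      top_n: int = 10) -> List[Tuple[str, int]]:
    """Rank data by network click volume and keep the top_n entries.

    Each entry keeps its click volume, so later filtering only needs to
    count sensitive keywords in the title.
    """
    ranked = sorted(click_counts.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]
```

Because each entry carries its click volume, the high-heat and low-heat tiers from the example can be separated afterwards by comparing that value with 1000.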
As described above, after the text title information of the data to be detected is obtained, it is judged whether the text title information includes a sensitive keyword in the preset keyword library. The preset keyword library can be a keyword library constructed in advance. In the preferred embodiment of the present invention, as shown in Fig. 2, the following may also be included before step S102:
Step S110: obtain manually reviewed sensitive keywords and/or sensitive keywords extracted from the text title information of data that has already been filtered;
Step S112: construct the preset keyword library based on the sensitive keywords.
Many types of junk need to be filtered from network data. In the embodiment of the present invention, junk data of different types, such as vulgar and violent data, can be obtained separately, and different types of sensitive keywords extracted from the manually reviewed text title information of that data; words commonly used in junk information can also be taken as sensitive keywords. After the sensitive keywords are obtained, the preset keyword library can be constructed. The sensitive keywords in the preset keyword library may be stored by type or by frequency of use; the present invention does not limit this.
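Steps S110 and S112 can be sketched as below. The specification does not say how keywords are extracted from the titles of already-filtered data, so the frequency threshold used here is purely an assumption, as are the names `build_keyword_library`, `min_count` and `ignore`. Whitespace splitting again stands in for word segmentation.

```python
from collections import Counter
from typing import FrozenSet, Iterable, Set


def build_keyword_library(reviewed_keywords: Iterable[str],
                          filtered_titles: Iterable[str],
                          min_count: int = 2,
                          ignore: FrozenSet[str] = frozenset()) -> Set[str]:
    """Merge reviewed keywords with words recurring in filtered titles.

    Assumption: a word is extracted when it appears in at least
    min_count filtered titles and is not in the ignore set.
    """
    counts = Counter()
    for title in filtered_titles:
        # set() so a word repeated within one title counts once
        counts.update(set(title.lower().split()))
    extracted = {w for w, c in counts.items()
                 if c >= min_count and w not in ignore}
    return set(reviewed_keywords) | extracted
```

A pure frequency criterion also picks up harmless frequent words, which is why the sketch accepts an `ignore` set; in the scheme described above, manual review plays that role.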
Optionally, matching the text title information of the data to be detected against the sensitive keywords in the preset keyword library may include the following steps:
S1: segment the text title information into words to obtain at least one word included in the text title information;
S2: match the word against the sensitive keywords in the preset keyword library. If the word matches a sensitive keyword in the preset keyword library, judge that the text title information includes a sensitive keyword in the preset keyword library; if the word matches no sensitive keyword in the preset keyword library, judge that the text title information does not include a sensitive keyword in the preset keyword library.
Word segmentation cuts a sequence of Chinese characters into individual words. In the embodiment of the present invention, after the text title information of the data to be detected is segmented, stop words can first be removed so that only words with actual meaning remain. The remaining words are then each matched against the sensitive keywords in the preset keyword library. As soon as one word matches a sensitive keyword in the preset keyword library, the text title information of the data to be detected is judged to include a sensitive keyword; if no word matches, the text title information of the data to be detected is judged not to include a sensitive keyword.
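The segment, clean and match sequence of S1 and S2 can be sketched as follows. The function names and the `segment` parameter are hypothetical. The default segmenter is a plain whitespace split so the sketch runs without dependencies; for Chinese titles a real segmenter would be needed, for example passing `jieba.lcut` from the third-party jieba package as `segment` (an assumption, the specification names no tool).

```python
from typing import Callable, FrozenSet, List, Set


def matched_keywords(title: str,
                     keyword_library: Set[str],
                     stop_words: FrozenSet[str] = frozenset(),
                     segment: Callable[[str], List[str]] = str.split) -> List[str]:
    # S1: segment the title, then drop stop words
    words = [w for w in segment(title) if w not in stop_words]
    # S2: keep the words that match a sensitive keyword
    return [w for w in words if w in keyword_library]


def title_is_sensitive(title: str, keyword_library: Set[str], **kwargs) -> bool:
    # One matching word is enough to judge the title as sensitive
    return bool(matched_keywords(title, keyword_library, **kwargs))
```

The length of the list returned by `matched_keywords` gives the keyword count used in step S104, so the match and the count can come from a single pass over the title.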
Based on the same inventive concept, the embodiment of the present invention also provides a data filtering device. As shown in Fig. 3, the data filtering device provided by the embodiment of the present invention may include:
A judgment module 310, configured to obtain the text title information of data to be detected and judge whether the text title information includes a sensitive keyword in a preset keyword library;
A first obtaining module 320, configured to obtain the number of sensitive keywords if the text title information includes a sensitive keyword in the preset keyword library;
A second obtaining module 330, configured to obtain the network click volume of the data to be detected;
A filtering module 340, configured to filter the data to be detected based on the network click volume of the data to be detected and the number of sensitive keywords included in the text title information.
In a preferred embodiment, as shown in Fig. 4, the filtering module 340 may include:
A first filtering unit 341, configured to filter the data to be detected when the network click volume of the data to be detected exceeds the first preset click volume and the number of sensitive keywords included in the text title information exceeds the first preset value; and/or
A second filtering unit 342, configured to filter the data to be detected when the network click volume of the data to be detected is lower than the second preset click volume and the number of sensitive keywords included in the text title information exceeds the second preset value.
In a preferred embodiment, the judgment module 310 may also be configured to:
Obtain the network click volume of each piece of data in a preset database, sort the data by network click volume, and generate a hot data library from the data whose network click volume falls within a preset range;
Select any piece of data in the hot data library as the data to be detected, and obtain the text title information of the data to be detected;
Judge whether the text title information includes a sensitive keyword in the preset keyword library.
In a preferred embodiment, the judgment module 310 may also be configured to:
Before judging whether the text title information includes a sensitive keyword in the preset keyword library, obtain manually reviewed sensitive keywords and/or sensitive keywords extracted from the text title information of data that has already been filtered;
Construct the preset keyword library based on the sensitive keywords.
In a preferred embodiment, the judgment module 310 may also be configured to:
Segment the text title information into words to obtain at least one word included in the text title information;
Match the word against the sensitive keywords in the preset keyword library;
When the word matches a sensitive keyword in the preset keyword library, judge that the text title information includes a sensitive keyword in the preset keyword library;
When the word matches no sensitive keyword in the preset keyword library, judge that the text title information does not include a sensitive keyword in the preset keyword library.
In a preferred embodiment, the data to be detected includes network video data;
The judgment module 310 may also be configured to obtain the text title information of video data stored on a video server and/or of video data being live-streamed by a streamer, and judge whether the text title information includes a sensitive keyword in the preset keyword library.
In a preferred embodiment, the judgment module 310 may also be configured to obtain the text title information of the video data a user is currently watching, and judge whether the text title information includes a sensitive keyword in the preset keyword library.
Based on the same inventive concept, the embodiment of the present invention also provides a computer storage medium. The computer storage medium stores computer program code which, when run on a computing device, causes the computing device to execute the data filtering method described in any of the above embodiments.
Based on the same inventive concept, the embodiment of the present invention also provides a computing device, comprising:
A processor;
A memory storing computer program code;
When the computer program code is run by the processor, the computing device is caused to execute the data filtering method described in any of the above embodiments.
The embodiments of the present invention provide a more efficient data filtering method and device that detect and filter the data to be tested by combining sensitive keywords with popularity. This not only directly filters out junk data such as violent or vulgar content, but also promptly identifies data that is more deeply hidden yet still needs to be filtered, thereby improving the filtering efficiency for bad data and junk data and purifying the Internet environment. In addition, in the scheme provided by the embodiments, hot data can be obtained based on the network click volume of the data in a preset database, and the data to be tested is then selected from that hot data; the predetermined keyword library is also built in advance from data or common terms that have already passed manual review. This further saves detection time while improving the efficiency of data filtering.
It is apparent to those skilled in the art that, for the specific working processes of the systems, devices and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; for brevity, they are not repeated here.
In addition, the functional units in the embodiments of the present invention may be physically independent of one another, two or more functional units may be integrated together, or all functional units may be integrated into a single processing unit. The integrated functional units may be implemented in hardware, or in software or firmware.
Those of ordinary skill in the art will appreciate that if the integrated functional unit is implemented in software and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in whole or in part, may be embodied as a software product stored in a storage medium and comprising instructions that, when run, cause a computing device (such as a personal computer, a server, or a network device) to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Alternatively, all or part of the steps of the foregoing method embodiments may be carried out by hardware under the control of program instructions (for example, a computing device such as a personal computer, a server, or a network device). The program instructions may be stored in a computer-readable storage medium, and when they are executed by the processor of the computing device, the computing device executes all or part of the steps of the methods of the embodiments of the present invention.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that, within the spirit and principles of the present invention, the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced with equivalents, and that such modifications or replacements do not cause the corresponding technical solutions to depart from the protection scope of the present invention.
According to one aspect of the embodiments of the present invention, A1. a data filtering method is provided, comprising:
Obtaining the caption information of the data to be tested, and judging whether the caption information includes a sensitive keyword in a predetermined keyword library;
If the caption information includes a sensitive keyword in the predetermined keyword library, obtaining the quantity of the sensitive keywords;
Obtaining the network click volume of the data to be tested;
Filtering the data to be tested based on the network click volume of the data to be tested and the quantity of sensitive keywords included in the caption information.
A2. The method according to A1, wherein filtering the data to be tested based on the network click volume of the data to be tested and the quantity of sensitive keywords included in the caption information comprises:
If the network click volume of the data to be tested exceeds a first preset click volume, and the quantity of sensitive keywords included in the caption information exceeds a first preset value, filtering the data to be tested; and/or
If the network click volume of the data to be tested is lower than a second preset click volume, and the quantity of sensitive keywords included in the caption information exceeds a second preset value, filtering the data to be tested.
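The two threshold conditions above can be sketched as a single predicate. The source gives no concrete threshold values and does not state how the first and second thresholds relate to each other, so every number below is an illustrative assumption, as are the class and function names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilterThresholds:
    first_click_volume: int   # "first preset click volume"
    first_count: int          # "first preset value" for the keyword count
    second_click_volume: int  # "second preset click volume"
    second_count: int         # "second preset value" for the keyword count


def should_filter(clicks: int, keyword_count: int, t: FilterThresholds) -> bool:
    """Apply both conditions; the data is filtered when either one holds."""
    high_click_rule = clicks > t.first_click_volume and keyword_count > t.first_count
    low_click_rule = clicks < t.second_click_volume and keyword_count > t.second_count
    return high_click_rule or low_click_rule


# Assumed values, chosen only to exercise both branches.
thresholds = FilterThresholds(
    first_click_volume=100_000, first_count=1,
    second_click_volume=1_000, second_count=3,
)
```

With these assumed values, an item with 250,000 clicks and 2 keywords is filtered by the first condition, an item with 500 clicks needs more than 3 keywords to be filtered by the second, and an item with 50,000 clicks satisfies neither click condition.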
A3. The method according to A1, wherein obtaining the caption information of the data to be tested and judging whether the caption information includes a sensitive keyword in the predetermined keyword library comprises:
Obtaining the network click volume of each item of data in a preset database, sorting the data by network click volume, and generating a hot data library from the data whose network click volume falls within a preset range;
Selecting any item of data in the hot data library as the data to be tested, and obtaining the caption information of the data to be tested;
Judging whether the caption information includes a sensitive keyword in the predetermined keyword library.
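A minimal sketch of the hot data library step follows. It assumes the "preset range" is a range of click volumes with a lower bound and an optional upper bound; the source does not define the range, and a rank-based cutoff would be an equally plausible reading. The identifiers and figures are made up for illustration.

```python
from typing import Dict, List, Optional, Tuple


def build_hot_data_library(
    click_volumes: Dict[str, int],
    min_clicks: int,
    max_clicks: Optional[int] = None,
) -> List[Tuple[str, int]]:
    """Sort items by click volume (descending) and keep those inside the range."""
    ranked = sorted(click_volumes.items(), key=lambda item: item[1], reverse=True)
    return [
        (data_id, clicks)
        for data_id, clicks in ranked
        if clicks >= min_clicks and (max_clicks is None or clicks <= max_clicks)
    ]


hot = build_hot_data_library(
    {"a": 50, "b": 5000, "c": 1200, "d": 999},
    min_clicks=1000,
)
```

Each entry of `hot` can then be taken in turn as the data to be tested.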
A4. The method according to A3, wherein, before judging whether the caption information includes a sensitive keyword in the predetermined keyword library, the method further comprises:
Obtaining sensitive keywords that have passed manual review and/or sensitive keywords extracted from the title information of data that has already been filtered;
Constructing the predetermined keyword library based on the sensitive keywords.
A5. The method according to A3, wherein judging whether the caption information includes a sensitive keyword in the predetermined keyword library comprises:
Segmenting the caption information to obtain at least one word that the caption information contains;
Matching the word against the sensitive keywords in the predetermined keyword library;
If the word matches a sensitive keyword in the predetermined keyword library, judging that the caption information includes a sensitive keyword in the predetermined keyword library;
If the word does not match any sensitive keyword in the predetermined keyword library, judging that the caption information does not include a sensitive keyword in the predetermined keyword library.
A6. The method according to any one of A1-A5, wherein the data to be tested includes internet video data, and obtaining the caption information of the data to be tested and judging whether the caption information includes a sensitive keyword in the predetermined keyword library comprises:
Obtaining the caption information of video data stored on a video server and/or of video data being broadcast live by a streamer, and judging whether the caption information includes a sensitive keyword in the predetermined keyword library.
A7. The method according to A6, wherein obtaining the caption information of the data to be tested and judging whether the caption information includes a sensitive keyword in the predetermined keyword library further comprises:
Obtaining the caption information of the video data that a user is currently watching, and judging whether the caption information includes a sensitive keyword in the predetermined keyword library.
According to another aspect of the embodiments of the present invention, B8. a data filtering device is further provided, comprising:
A judgment module, configured to obtain the caption information of the data to be tested and to judge whether the caption information includes a sensitive keyword in a predetermined keyword library;
A first obtaining module, configured to obtain the quantity of the sensitive keywords if the caption information includes a sensitive keyword in the predetermined keyword library;
A second obtaining module, configured to obtain the network click volume of the data to be tested;
A filtering module, configured to filter the data to be tested based on the network click volume of the data to be tested and the quantity of sensitive keywords included in the caption information.
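One way to picture how the four modules cooperate is the sketch below, in which each method stands in for one module. The class name, the injected `click_source` and `rule` callables, the regex word split, and the choice to pass data whose title contains no sensitive keyword are all assumptions made for illustration, not details given in the source.

```python
import re
from typing import Callable, Set


class DataFilterDevice:
    """Illustrative composition of the judgment, obtaining, and filtering modules."""

    def __init__(
        self,
        library: Set[str],
        click_source: Callable[[str], int],  # hypothetical click-volume lookup
        rule: Callable[[int, int], bool],    # hypothetical (clicks, count) -> filter?
    ) -> None:
        self.library = library
        self.click_source = click_source
        self.rule = rule

    def keyword_count(self, title: str) -> int:
        """First obtaining module: number of title words found in the library."""
        words = re.findall(r"\w+", title.lower())
        return sum(1 for word in words if word in self.library)

    def judge(self, title: str) -> bool:
        """Judgment module: does the title contain any sensitive keyword?"""
        return self.keyword_count(title) > 0

    def click_volume(self, data_id: str) -> int:
        """Second obtaining module: network click volume of the item."""
        return self.click_source(data_id)

    def should_filter(self, data_id: str, title: str) -> bool:
        """Filtering module: combine click volume and keyword count."""
        if not self.judge(title):
            return False  # assumption: titles without sensitive keywords pass
        return self.rule(self.click_volume(data_id), self.keyword_count(title))


clicks = {"v1": 200_000, "v2": 10}
device = DataFilterDevice(
    library={"brawl", "gore"},
    click_source=lambda data_id: clicks[data_id],
    rule=lambda n_clicks, n_keywords: n_clicks > 100_000 and n_keywords > 0,
)
```

Injecting the click lookup and the rule keeps the sketch independent of any particular data store or threshold choice.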
B9. The device according to B8, wherein the filtering module comprises:
A first filtering unit, configured to filter the data to be tested when the network click volume of the data to be tested exceeds a first preset click volume and the quantity of sensitive keywords included in the caption information exceeds a first preset value; and/or
A second filtering unit, configured to filter the data to be tested when the network click volume of the data to be tested is lower than a second preset click volume and the quantity of sensitive keywords included in the caption information exceeds a second preset value.
B10. The device according to B8, wherein the judgment module is further configured to:
Obtain the network click volume of each item of data in a preset database, sort the data by network click volume, and generate a hot data library from the data whose network click volume falls within a preset range;
Select any item of data in the hot data library as the data to be tested, and obtain the caption information of the data to be tested;
Judge whether the caption information includes a sensitive keyword in the predetermined keyword library.
B11. The device according to B10, wherein the judgment module is further configured to:
Before judging whether the caption information includes a sensitive keyword in the predetermined keyword library, obtain sensitive keywords that have passed manual review and/or sensitive keywords extracted from the title information of data that has already been filtered;
Construct the predetermined keyword library based on the sensitive keywords.
B12. The device according to B10, wherein the judgment module is further configured to:
Segment the caption information to obtain at least one word that the caption information contains;
Match the word against the sensitive keywords in the predetermined keyword library;
When the word matches a sensitive keyword in the predetermined keyword library, judge that the caption information includes a sensitive keyword in the predetermined keyword library;
When the word does not match any sensitive keyword in the predetermined keyword library, judge that the caption information does not include a sensitive keyword in the predetermined keyword library.
B13. The device according to any one of B8-B12, wherein the data to be tested includes internet video data;
The judgment module is further configured to obtain the caption information of video data stored on a video server and/or of video data being broadcast live by a streamer, and to judge whether the caption information includes a sensitive keyword in the predetermined keyword library.
B14. The device according to any one of B8-B12, wherein the judgment module is further configured to obtain the caption information of the video data that a user is currently watching, and to judge whether the caption information includes a sensitive keyword in the predetermined keyword library.
According to another aspect of the embodiments of the present invention, C15. a computer storage medium is further provided, the computer storage medium storing computer program code which, when run on a computing device, causes the computing device to execute the data filtering method of any one of A1-A7.
According to another aspect of the embodiments of the present invention, D16. a computing device is further provided, comprising:
A processor;
A memory storing computer program code;
When the computer program code is run by the processor, the computing device is caused to execute the data filtering method of any one of A1-A7.

Claims (10)

1. A data filtering method, comprising:
Obtaining the caption information of the data to be tested, and judging whether the caption information includes a sensitive keyword in a predetermined keyword library;
If the caption information includes a sensitive keyword in the predetermined keyword library, obtaining the quantity of the sensitive keywords;
Obtaining the network click volume of the data to be tested;
Filtering the data to be tested based on the network click volume of the data to be tested and the quantity of sensitive keywords included in the caption information.
2. The method according to claim 1, wherein filtering the data to be tested based on the network click volume of the data to be tested and the quantity of sensitive keywords included in the caption information comprises:
If the network click volume of the data to be tested exceeds a first preset click volume, and the quantity of sensitive keywords included in the caption information exceeds a first preset value, filtering the data to be tested; and/or
If the network click volume of the data to be tested is lower than a second preset click volume, and the quantity of sensitive keywords included in the caption information exceeds a second preset value, filtering the data to be tested.
3. The method according to claim 1, wherein obtaining the caption information of the data to be tested and judging whether the caption information includes a sensitive keyword in the predetermined keyword library comprises:
Obtaining the network click volume of each item of data in a preset database, sorting the data by network click volume, and generating a hot data library from the data whose network click volume falls within a preset range;
Selecting any item of data in the hot data library as the data to be tested, and obtaining the caption information of the data to be tested;
Judging whether the caption information includes a sensitive keyword in the predetermined keyword library.
4. The method according to claim 3, wherein, before judging whether the caption information includes a sensitive keyword in the predetermined keyword library, the method further comprises:
Obtaining sensitive keywords that have passed manual review and/or sensitive keywords extracted from the title information of data that has already been filtered;
Constructing the predetermined keyword library based on the sensitive keywords.
5. The method according to claim 3, wherein judging whether the caption information includes a sensitive keyword in the predetermined keyword library comprises:
Segmenting the caption information to obtain at least one word that the caption information contains;
Matching the word against the sensitive keywords in the predetermined keyword library;
If the word matches a sensitive keyword in the predetermined keyword library, judging that the caption information includes a sensitive keyword in the predetermined keyword library;
If the word does not match any sensitive keyword in the predetermined keyword library, judging that the caption information does not include a sensitive keyword in the predetermined keyword library.
6. The method according to any one of claims 1-5, wherein the data to be tested includes internet video data, and obtaining the caption information of the data to be tested and judging whether the caption information includes a sensitive keyword in the predetermined keyword library comprises:
Obtaining the caption information of video data stored on a video server and/or of video data being broadcast live by a streamer, and judging whether the caption information includes a sensitive keyword in the predetermined keyword library.
7. The method according to claim 6, wherein obtaining the caption information of the data to be tested and judging whether the caption information includes a sensitive keyword in the predetermined keyword library further comprises:
Obtaining the caption information of the video data that a user is currently watching, and judging whether the caption information includes a sensitive keyword in the predetermined keyword library.
8. A data filtering device, comprising:
A judgment module, configured to obtain the caption information of the data to be tested and to judge whether the caption information includes a sensitive keyword in a predetermined keyword library;
A first obtaining module, configured to obtain the quantity of the sensitive keywords if the caption information includes a sensitive keyword in the predetermined keyword library;
A second obtaining module, configured to obtain the network click volume of the data to be tested;
A filtering module, configured to filter the data to be tested based on the network click volume of the data to be tested and the quantity of sensitive keywords included in the caption information.
9. A computer storage medium, the computer storage medium storing computer program code which, when run on a computing device, causes the computing device to execute the data filtering method of any one of claims 1-7.
10. A computing device, comprising:
A processor;
A memory storing computer program code;
When the computer program code is run by the processor, the computing device is caused to execute the data filtering method of any one of claims 1-7.
CN201811313297.1A 2018-11-06 2018-11-06 A kind of data filtering method and device Pending CN109657134A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811313297.1A CN109657134A (en) 2018-11-06 2018-11-06 A kind of data filtering method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811313297.1A CN109657134A (en) 2018-11-06 2018-11-06 A kind of data filtering method and device

Publications (1)

Publication Number Publication Date
CN109657134A true CN109657134A (en) 2019-04-19

Family

ID=66110134

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811313297.1A Pending CN109657134A (en) 2018-11-06 2018-11-06 A kind of data filtering method and device

Country Status (1)

Country Link
CN (1) CN109657134A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110767211A (en) * 2019-09-23 2020-02-07 浙江从泰网络科技有限公司 Voice synthesis broadcasting system based on text content data cleaning
CN110971619A (en) * 2020-01-02 2020-04-07 惠州学院 Network technology security system and method with bad information filtering processing
CN111586421A (en) * 2020-01-20 2020-08-25 全息空间(深圳)智能科技有限公司 Method, system and storage medium for auditing live broadcast platform information
CN113891120A (en) * 2021-09-29 2022-01-04 广东省高峰科技有限公司 IPTV service terminal access method and system thereof
CN114840776A (en) * 2022-07-04 2022-08-02 北京拓普丰联信息科技股份有限公司 Method, device, electronic equipment and storage medium for recording data publishing source

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160350282A1 (en) * 2014-02-25 2016-12-01 Tencent Technology (Shenzhen) Company Limited Sensitive text detecting method and apparatus
CN106445998A (en) * 2016-05-26 2017-02-22 达而观信息科技(上海)有限公司 Text content auditing method and system based on sensitive word
CN106776946A (en) * 2016-12-02 2017-05-31 重庆大学 A kind of detection method of fraudulent website
CN108153723A (en) * 2017-12-27 2018-06-12 北京百度网讯科技有限公司 Hot spot information comment generation method, device and terminal device
CN108304379A (en) * 2018-01-15 2018-07-20 腾讯科技(深圳)有限公司 A kind of article recognition methods, device and storage medium

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110767211A (en) * 2019-09-23 2020-02-07 浙江从泰网络科技有限公司 Voice synthesis broadcasting system based on text content data cleaning
CN110767211B (en) * 2019-09-23 2022-02-18 浙江斑智科技有限公司 Voice synthesis broadcasting system based on text content data cleaning
CN110971619A (en) * 2020-01-02 2020-04-07 惠州学院 Network technology security system and method with bad information filtering processing
CN111586421A (en) * 2020-01-20 2020-08-25 全息空间(深圳)智能科技有限公司 Method, system and storage medium for auditing live broadcast platform information
CN113891120A (en) * 2021-09-29 2022-01-04 广东省高峰科技有限公司 IPTV service terminal access method and system thereof
CN114840776A (en) * 2022-07-04 2022-08-02 北京拓普丰联信息科技股份有限公司 Method, device, electronic equipment and storage medium for recording data publishing source
CN114840776B (en) * 2022-07-04 2022-09-20 北京拓普丰联信息科技股份有限公司 Method, device, electronic equipment and storage medium for recording data publishing source

Similar Documents

Publication Publication Date Title
CN109657134A (en) A kind of data filtering method and device
Myers et al. Information diffusion and external influence in networks
CN110233849B (en) Method and system for analyzing network security situation
US9300755B2 (en) System and method for determining information reliability
US10032081B2 (en) Content-based video representation
CN105072089B (en) A kind of WEB malice scanning behavior method for detecting abnormality and system
US9600530B2 (en) Updating a search index used to facilitate application searches
CN102279875B (en) Method and device for identifying fishing website
TWI498752B (en) Extracting information from unstructured data and mapping the information to a structured schema using the naive bayesian probability model
CN108334758B (en) Method, device and equipment for detecting user unauthorized behavior
CN105183781B (en) Information recommendation method and device
CN104408102B (en) For network hot word and the data processing method and device of the degree of association of object
CN105488023B (en) A kind of text similarity appraisal procedure and device
Hristakieva et al. The spread of propaganda by coordinated communities on social media
Middleton et al. Geoparsing and geosemantics for social media: Spatiotemporal grounding of content propagating rumors to support trust and veracity analysis during breaking news
CN112464036B (en) Method and device for auditing violation data
GB2456916A (en) Method for presenting promotional information on a web page, e.g. an on-line targeted advertising method.
CN105574030A (en) Information search method and device
CN109376231A (en) A kind of media hotspot tracking and system
CN109190014A (en) A kind of regular expression generation method, device and electronic equipment
CN102946391B (en) The method of prompting malice network address and a kind of browser in a kind of browser
US8171020B1 (en) Spam detection for user-generated multimedia items based on appearance in popular queries
CN110245297B (en) Book keyword search-oriented user subject privacy protection method and system
EP2680210A1 (en) Method and system for cross-platform content recommendation
Fletcher et al. Practical web traffic analysis: standards, privacy, techniques, and results

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190419
