CN109376229A - A clickbait detection method based on convolutional neural networks
- Publication number
- CN109376229A (application number CN201811476642.3A)
- Authority
- CN
- China
- Prior art keywords
- word
- feature
- convolutional neural
- neural networks
- detection method
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention discloses a clickbait detection method based on convolutional neural networks, implemented in the following steps. Step 1: word segmentation, in which a sentence is split into individual words. Step 2: vector representation of words; because discrete words cannot be used directly as model input, each word is represented as a continuous, dense word vector that a computer can process, and all word vectors are concatenated to form a lookup matrix. Step 3: automatic feature extraction, in which a convolutional neural network structure automatically generates useful features. Step 4: the learned features are fed into a classifier, which produces the model's prediction. The method learns its features automatically without relying on external features and is applicable to all languages. It outperforms earlier methods based on hand-crafted features and is robust and effective on cross-language tasks.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to a clickbait detection method based on convolutional neural networks.
Background
With the emergence of Web 2.0, people increasingly prefer to read news online. As a substitute for traditional print media, online news covers a wider range of topics, offers rich and varied content, and gives readers more choice. On the other hand, online news sites are also flooded with low-quality content. Most such outlets do not charge readers subscription fees; their main income comes from the advertisements shown on their pages, and advertising revenue depends mainly on user views and clicks. To attract more users and compete with similar outlets, some online media resort to eye-catching headlines, which are known as clickbait. Clickbait headlines are usually misleading: they exaggerate or conceal part of the truth. In an era of information explosion, clickbait clearly hinders readers from obtaining information efficiently and leaves them disappointed. In the long run it also greatly reduces the credibility of the media, so detecting and preventing clickbait has become necessary. Previous work relied on hand-crafted lexical and syntactic features and achieved good results. However, such feature extraction depends heavily on expert knowledge and cannot be applied to languages that lack these characteristics. In English, for example, character-level information and capitalization play an important role in clickbait detection, whereas languages such as Chinese and Japanese have no such features. Moreover, although clickbait occurs widely on the Internet in all languages, clickbait detection in languages other than English has hardly been studied.
Summary of the invention
The object of the invention is to provide a clickbait detection method based on convolutional neural networks that learns its features automatically without relying on external features and is applicable to all languages. Experiments on Chinese and English corpora show that the method achieves consistent results and outperforms earlier methods based on hand-crafted features. The results also demonstrate the robustness and effectiveness of the method on cross-language tasks, thereby solving the problems raised in the background above.
To achieve the above object, the invention provides the following technical solution:
A clickbait detection method based on convolutional neural networks, covering the process from the input news headline to the system's decision on whether it is clickbait, is implemented in the following steps:
Step 1: word segmentation. The sentence is split into individual words; the sentence semantics carried by the words allow the model to analyze and process the sentence more effectively;
Step 2: vector representation of words. Because discrete words cannot be used directly as model input, each word must be represented as a vector that a computer can process. Words are represented as continuous, dense word vectors so as to preserve the syntactic and semantic information learned from context. All word vectors are concatenated to form a lookup matrix $L \in \mathbb{R}^{d \times |V|}$; this embedding matrix can be randomly initialized from a uniform distribution or pre-trained on a large text corpus;
Step 3: automatic feature extraction. A convolutional neural network structure automatically generates useful features: multiple convolution filters with different window sizes produce features that capture local patterns at different granularities, and a max-pooling operation is applied to each feature map;
Step 4: the learned features are fed into a classifier, which produces the model's prediction, namely whether the input text is clickbait, together with a confidence probability.
Further, words are represented as continuous, dense word vectors according to the formula $e_i = L b_k \in \mathbb{R}^d$.
Further, in the matrix $L \in \mathbb{R}^{d \times |V|}$, $d$ is the dimension of the word vectors and $|V|$ is the vocabulary size.
Further, each new feature $c_i$ generated by the convolutional neural network is given by the equation $c_i = f(w \cdot e_{i:i+h-1} + b)$.
Further, logistic regression is used to classify a headline as clickbait or non-clickbait, according to the formula:
Further, cross-entropy loss is used as the optimization objective, given by the formula:
Compared with the prior art, the beneficial effects of the invention are as follows. The proposed clickbait detection method uses a convolutional neural network to represent a headline as a fixed-dimensional, continuous, real-valued vector. Convolution filters first extract local n-gram features, and max pooling then selects the most important features over the whole headline. The method learns its features automatically without relying on external features and is applicable to all languages. Experiments on Chinese and English corpora show that the method achieves consistent results and outperforms earlier methods based on hand-crafted features. The results also demonstrate the robustness and effectiveness of the method on cross-language tasks.
Brief description of the drawings
Fig. 1 is a flow chart of the invention;
Fig. 2 shows the program code for step 1 of the invention;
Fig. 3 shows the program code for step 2 of the invention;
Fig. 4 shows the program code for step 3 of the invention;
Fig. 5 shows the program code for step 4 of the invention;
Fig. 6 is a block diagram of the model structure of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the invention.
A clickbait detection method based on convolutional neural networks, whose flow is shown in Fig. 1, covers the process from the input news headline to the system's decision on whether it is clickbait, and is implemented in the following steps:
Step 1: word segmentation. The sentence is split into individual words; the sentence semantics carried by the words allow the model to analyze and process the sentence more effectively (program code shown in Fig. 2);
Step 2: vector representation of words. Because discrete words cannot be used directly as model input, each word must be represented as a vector that a computer can process. Words are represented as continuous, dense word vectors so as to preserve the syntactic and semantic information learned from context. All word vectors are concatenated to form a lookup matrix $L \in \mathbb{R}^{d \times |V|}$; this embedding matrix can be randomly initialized from a uniform distribution or pre-trained on a large text corpus. Although pre-trained word vectors contain syntactic and semantic information learned from context, random initialization is used so that the model is fully end-to-end (program code shown in Fig. 3);
Step 3: automatic feature extraction. A convolutional neural network (CNN) structure automatically generates useful features without manual effort or expert knowledge. Multiple convolution filters with different window sizes produce features that capture local patterns at different granularities, and a max-pooling operation is applied to each feature map. Intuitively, this operation selects the most informative features over the whole headline. Headlines are padded to a fixed length, and the output of the CNN is a fixed-size real-valued feature vector that serves as the representation of the headline (program code shown in Fig. 4);
Step 4: the learned features are fed into a classifier, which produces the model's prediction, namely whether the input text is clickbait, together with a confidence probability. Logistic regression is used to classify a headline as clickbait or non-clickbait (program code shown in Fig. 5).
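The word segmentation of Step 1 and the fixed-length padding mentioned in Step 3 can be sketched as follows. This is only an illustrative sketch: it assumes whitespace tokenization, which suits English but not Chinese (where a word segmenter would be needed), and the helper names and the `<pad>` and `<unk>` tokens are hypothetical. The patent's own program code appears only in Figs. 2 to 5.

```python
# Hypothetical helpers; names and special tokens are assumptions, not from the patent.
PAD, UNK = "<pad>", "<unk>"

def tokenize(title):
    """Split a headline into words. Whitespace splitting is an assumption
    that works for English; Chinese text would need a word segmenter."""
    return title.split()

def build_vocab(titles):
    """Assign an integer index to every word seen in the training headlines."""
    vocab = {PAD: 0, UNK: 1}
    for title in titles:
        for word in tokenize(title):
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(title, vocab, max_len):
    """Turn a headline into exactly max_len word indices (truncate, then pad)."""
    ids = [vocab.get(word, vocab[UNK]) for word in tokenize(title)][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))

titles = ["You will not believe what happened next", "Central bank raises rates"]
vocab = build_vocab(titles)
print(encode("Central bank stuns everyone", vocab, max_len=6))  # [9, 10, 1, 1, 0, 0]
```

Unknown words map to a shared `<unk>` index and short headlines are padded with `<pad>`, so every headline reaches the later stages with the same length.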
The model structure is shown in Fig. 6. The model first maps words to word vectors, then feeds the vectors into the convolutional neural network for automatic feature extraction, and finally uses a logistic regression classifier for classification.
1. Word vectors
Words are represented as continuous, dense word vectors. Compared with the traditional one-hot representation, this representation achieves better results on a variety of natural language processing tasks, such as sentiment analysis and machine translation:
$e_i = L b_k \in \mathbb{R}^d$ (1)
All word vectors form the lookup matrix $L \in \mathbb{R}^{d \times |V|}$, where $d$ is the dimension of the word vectors and $|V|$ is the vocabulary size.
The embedding matrix can be randomly initialized from a uniform distribution or pre-trained on a large text corpus. Although pre-trained word vectors contain syntactic and semantic information learned from context, random initialization is used to keep the model fully end-to-end. Experimental results show that even randomly initialized vectors outperform the previous state-of-the-art methods. During training, the word vectors are updated by backpropagating the training error; during validation and testing, the learned word vectors are kept fixed.
Formally, given a word $w_i$, its word vector $e_i$ is retrieved from the matrix $L$ by a projection operation. $b_k$ is a binary vector of vocabulary size that is 1 at the position $k$ of the corresponding word and 0 elsewhere. Word vectors can be randomly initialized or pre-trained with a neural language model.
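The projection in formula (1) amounts to selecting one column of the lookup matrix. The sketch below checks this in plain Python; the uniform range `[-0.25, 0.25]`, the sizes, and the function names are assumptions chosen for illustration and are not stated in the patent.

```python
import random

def make_lookup_matrix(d, vocab_size, seed=0, scale=0.25):
    """Build L with d rows and |V| columns from a uniform distribution.
    The range [-scale, scale] is an assumed value."""
    rng = random.Random(seed)
    return [[rng.uniform(-scale, scale) for _ in range(vocab_size)] for _ in range(d)]

def one_hot(k, vocab_size):
    """Binary vector b_k: 1 at position k, 0 elsewhere."""
    return [1.0 if j == k else 0.0 for j in range(vocab_size)]

def project(L, b):
    """Compute e = L b by explicit matrix-vector multiplication."""
    return [sum(l_rj * b_j for l_rj, b_j in zip(row, b)) for row in L]

def lookup(L, k):
    """Equivalent shortcut: read column k of L directly."""
    return [row[k] for row in L]

L = make_lookup_matrix(d=4, vocab_size=10)
k = 7  # index of some word in the vocabulary
assert project(L, one_hot(k, 10)) == lookup(L, k)
```

In practice the column lookup is what an embedding layer does; the one-hot product is mainly a convenient way to write it down.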
2. Convolutional neural network
A convolutional neural network (CNN) automatically generates useful features. Its output is a fixed-size real-valued vector that serves as the representation of the headline.
Given a headline $H = \{w_1, w_2, \ldots, w_n\}$, the vectors obtained for its words from the lookup table $L$ are $\{e_1, e_2, \ldots, e_n\}$, which are concatenated as $[e_1; e_2; \ldots; e_n]$. For simplicity, $e_{i:i+n-1}$ denotes the $n$ word vectors $\{e_i, e_{i+1}, \ldots, e_{i+n-1}\}$. For a filter $w \in \mathbb{R}^{h \times k}$, where $h$ is the window size and $k$ is the word-vector dimension, a new feature $c_i$ is obtained from the equation
$c_i = f(w \cdot e_{i:i+h-1} + b)$ (2)
where $b$ is the bias term and $f$ is the activation function. The filter slides along the concatenated vectors and produces a feature map $c = \{c_1, c_2, \ldots, c_{n-h+1}\}$. A max-pooling operation is then applied to this feature map; intuitively, it selects the most informative feature over the whole headline. Multiple convolution filters with different window sizes are used to generate features that capture local patterns at different granularities. Headlines are also padded to a fixed length.
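A minimal plain-Python sketch of formula (2) followed by max pooling is given below. The choice of `tanh` as the activation, the window sizes 2, 3 and 4, and all function names are assumptions made for illustration; the patent does not name them in the text.

```python
import math
import random

def conv_feature_map(E, w, b, f=math.tanh):
    """E: n word vectors of dimension k; w: filter with h rows and k columns.
    Returns [c_1, ..., c_{n-h+1}] with c_i = f(w . e_{i:i+h-1} + b).
    tanh is an assumed activation."""
    n, h = len(E), len(w)
    feature_map = []
    for i in range(n - h + 1):
        # Dot product between the filter and the window of h consecutive word vectors.
        s = sum(w[r][j] * E[i + r][j] for r in range(h) for j in range(len(w[r])))
        feature_map.append(f(s + b))
    return feature_map

def title_representation(E, filters):
    """Max-pool each filter's feature map and collect the maxima in one vector.
    Assumes the padded headline is at least as long as the widest window."""
    return [max(conv_feature_map(E, w, b)) for w, b in filters]

rng = random.Random(0)
n, k = 8, 4  # headline length after padding, word-vector dimension (assumed)
E = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(n)]
filters = [([[rng.uniform(-1, 1) for _ in range(k)] for _ in range(h)], 0.0)
           for h in (2, 3, 4)]
rep = title_representation(E, filters)
print(len(rep))  # one pooled feature per filter
```

Because each filter contributes exactly one pooled value, the length of the representation depends only on the number of filters, not on the headline length.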
3. Classification model
The features produced by the individual filters form the final feature vector $c$. Logistic regression is used to classify a headline as clickbait or non-clickbait (formula 3). Cross-entropy loss is used as the optimization objective, given formally in formula 4, where $y$ is the ground-truth label and the prediction is the probability that logistic regression assigns to each label.
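Formulas 3 and 4 are not reproduced in this text, so the sketch below uses the textbook forms of binary logistic regression and cross-entropy loss as an assumption; it should not be read as the patent's exact formulas. The weight vector, bias, and function names are hypothetical.

```python
import math

def predict_proba(features, weights, bias):
    """Assumed standard form: p = sigmoid(weights . features + bias),
    read as the probability that the headline is clickbait."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    # Numerically stable sigmoid for large positive or negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def cross_entropy(y, p, eps=1e-12):
    """Assumed standard binary cross-entropy for label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

c = [0.8, -0.1, 0.4]        # pooled feature vector from the CNN (made-up values)
weights = [1.5, -0.5, 2.0]  # hypothetical classifier parameters
p = predict_proba(c, weights, bias=-1.0)
print(p > 0.5, cross_entropy(1, p))
```

The returned probability doubles as the confidence value mentioned in Step 4, and the loss is what backpropagation would minimize during training.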
In summary, the proposed clickbait detection method based on convolutional neural networks uses a convolutional neural network to represent a headline as a fixed-dimensional, continuous, real-valued vector. Convolution filters first extract local n-gram features, and max pooling then selects the most important features over the whole headline. The method learns its features automatically without relying on external features and is applicable to all languages. Experiments on Chinese and English corpora show that the method achieves consistent results and outperforms earlier methods based on hand-crafted features. The results also demonstrate the robustness and effectiveness of the method on cross-language tasks.
The above is only a preferred embodiment of the invention, but the protection scope of the invention is not limited to it. Any equivalent substitution or change made by a person skilled in the art within the technical scope disclosed by the invention, according to its technical solution and inventive concept, shall be covered by the protection scope of the invention.
Claims (6)
1. A clickbait detection method based on convolutional neural networks, characterized in that it covers the process from the input news headline to the system's decision on whether it is clickbait, and is implemented in the following steps:
Step 1: word segmentation. The sentence is split into individual words; the sentence semantics carried by the words allow the model to analyze and process the sentence more effectively;
Step 2: vector representation of words. Because discrete words cannot be used directly as model input, each word must be represented as a vector that a computer can process. Words are represented as continuous, dense word vectors so as to preserve the syntactic and semantic information learned from context. All word vectors are concatenated to form a lookup matrix $L \in \mathbb{R}^{d \times |V|}$; this embedding matrix can be randomly initialized from a uniform distribution or pre-trained on a large text corpus;
Step 3: automatic feature extraction. A convolutional neural network structure automatically generates useful features: multiple convolution filters with different window sizes produce features that capture local patterns at different granularities, and a max-pooling operation is applied to each feature map;
Step 4: the learned features are fed into a classifier, which produces the model's prediction, namely whether the input text is clickbait, together with a confidence probability.
2. The clickbait detection method based on convolutional neural networks according to claim 1, characterized in that words are represented as continuous, dense word vectors according to the formula $e_i = L b_k \in \mathbb{R}^d$.
3. The clickbait detection method based on convolutional neural networks according to claim 1, characterized in that, in the matrix $L \in \mathbb{R}^{d \times |V|}$, $d$ is the dimension of the word vectors and $|V|$ is the vocabulary size.
4. The clickbait detection method based on convolutional neural networks according to claim 1, characterized in that each new feature $c_i$ generated by the convolutional neural network is given by the equation $c_i = f(w \cdot e_{i:i+h-1} + b)$.
5. The clickbait detection method based on convolutional neural networks according to claim 1, characterized in that logistic regression is used to classify a headline as clickbait or non-clickbait, according to the formula:
6. The clickbait detection method based on convolutional neural networks according to claim 1, characterized in that cross-entropy loss is used as the optimization objective, given by the formula:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811476642.3A CN109376229A (en) | 2018-12-04 | 2018-12-04 | A kind of click bait detection method based on convolutional neural networks |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811476642.3A CN109376229A (en) | 2018-12-04 | 2018-12-04 | A kind of click bait detection method based on convolutional neural networks |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109376229A true CN109376229A (en) | 2019-02-22 |
Family
ID=65375875
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811476642.3A Pending CN109376229A (en) | 2018-12-04 | 2018-12-04 | A kind of click bait detection method based on convolutional neural networks |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109376229A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107491436A (en) * | 2017-08-21 | 2017-12-19 | 北京百度网讯科技有限公司 | A kind of recognition methods of title party and device, server, storage medium |
CN108460134A (en) * | 2018-03-06 | 2018-08-28 | 云南大学 | The text subject disaggregated model and sorting technique of transfer learning are integrated based on multi-source domain |
CN108491389A (en) * | 2018-03-23 | 2018-09-04 | 杭州朗和科技有限公司 | Click bait title language material identification model training method and device |
CN108596470A (en) * | 2018-04-19 | 2018-09-28 | 浙江大学 | A kind of power equipments defect text handling method based on TensorFlow frames |
Non-Patent Citations (1)
Title |
---|
JUNFENG FU 等: "A Convolutional Neural Network for Clickbait Detection", 《2017 4TH INTERNATIONAL CONFERENCE ON INFORMATION SCIENCE AND CONTROL ENGINEERING》 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190222 |