CN109376229A

CN109376229A - A clickbait detection method based on convolutional neural networks

Info

Publication number: CN109376229A
Application number: CN201811476642.3A
Authority: CN
Inventors: 付俊峰; 梁良; 郑锦坤; 周欣
Original assignee: Information And Communication Branch Of Jiangxi Electric Power Co Ltd; State Grid Corp of China SGCC
Current assignee: Information And Communication Branch Of Jiangxi Electric Power Co Ltd; State Grid Corp of China SGCC
Priority date: 2018-12-04
Filing date: 2018-12-04
Publication date: 2019-02-22

Abstract

The invention discloses a clickbait detection method based on convolutional neural networks, comprising the following steps. Step 1: word segmentation, in which a sentence is split into individual words. Step 2: vectorized word representation; because discrete words cannot serve directly as model input, each word is represented as a continuous, dense word vector that a computer can process, and all word vectors are concatenated to form a lookup matrix. Step 3: automatic feature extraction, in which a convolutional neural network structure automatically generates useful features. Step 4: the learned features are fed into a classifier, which produces the model's prediction. The method learns its features automatically, without relying on external features, and is applicable to all languages. It outperforms earlier methods based on manually constructed features and is robust and effective in cross-language tasks.

Description

A clickbait detection method based on convolutional neural networks

Technical field

The present invention relates to the field of Internet technology, and in particular to a clickbait detection method based on convolutional neural networks.

Background art

With the emergence of Web 2.0, people increasingly prefer to read news online. As a substitute for traditional print media, online news covers a wider range of topics, offers rich content in varied forms, and gives readers more choice. On the other hand, online news sites are flooded with low-quality content. Most such outlets do not charge readers a subscription fee; their main income comes from the advertisements shown on their pages, and advertising revenue depends mainly on user views and clicks. To attract more attention and compete with similar outlets, some online media resort to eye-catching headlines, known as clickbait. Clickbait headlines are usually misleading: they exaggerate or conceal part of the truth. In an age of information explosion, clickbait clearly hinders readers from obtaining information efficiently and leaves them disappointed. In the long run it also greatly reduces the credibility of the media, so detecting and preventing clickbait has become necessary. Earlier work relied on manually constructed lexical and syntactic features and achieved good results with these methods. However, such work depends heavily on expert knowledge for feature extraction and cannot be applied to languages that lack the features in question. In English, for example, character-level information and capitalization play an important role in clickbait detection, whereas languages such as Chinese and Japanese have no such features. Moreover, although clickbait is common in every language on the Internet, clickbait detection in languages other than English has received almost no study.

Summary of the invention

The object of the invention is to provide a clickbait detection method based on convolutional neural networks. The method learns its features automatically, without relying on external features, and is applicable to all languages. Experiments on Chinese and English corpora show that the method achieves consistent results and outperforms earlier methods based on manually constructed features. The experimental results also demonstrate the robustness and effectiveness of the method in cross-language tasks, thereby solving the problems raised in the background art above.

To achieve the above object, the invention provides the following technical solution:

A clickbait detection method based on convolutional neural networks, which proceeds from the representation of the input news item to the system's decision on whether the item is clickbait, comprises the following steps in order:

Step 1: word segmentation. The sentence is split into individual words; because words carry the semantics of the sentence, the model can then analyze and process the sentence more effectively;

Step 2: vectorized word representation. Because discrete words cannot serve directly as model input, each word must be represented as a vector that a computer can process. Words are represented as continuous, dense word vectors that preserve the syntactic and semantic information learned from context, and all word vectors are concatenated to form a lookup matrix $L \in \mathbb{R}^{d \times |V|}$. The embedding matrix can either be randomly initialized from a uniform distribution or pre-trained on a large text corpus;

Step 3: automatic feature extraction. A convolutional neural network structure automatically generates useful features: multiple convolution filters with different window sizes generate features that capture local patterns at different granularities, and a max-pooling operation is applied to each resulting feature map;

Step 4: the learned features are fed into a classifier, which produces the model's prediction, namely whether the input text is clickbait, together with a confidence probability.

Further, each word is represented as a continuous, dense word vector according to the formula $e_i = L b_k \in \mathbb{R}^d$.

Further, in the matrix $L \in \mathbb{R}^{d \times |V|}$, $d$ is the dimension of the word vectors and $|V|$ is the vocabulary size.

Further, each new feature $c_i$ generated by the convolutional neural network is obtained from the equation $c_i = f(w \cdot e_{i:i+h-1} + b)$.

Further, logistic regression is used to classify a headline as clickbait or not clickbait, according to the formula:

Further, cross-entropy loss is used as the optimization objective, with the formula:
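The two formulas referred to above are not reproduced in this text. As a hedged illustration only, a standard binary logistic regression and its cross-entropy loss can be written as follows. The symbols $\theta$ (classifier weights), $\beta$ (classifier bias), $\hat{y}$ (predicted probability), $N$ (number of training headlines) and $J$ (loss) are assumptions introduced here, and $c$ stands for the pooled feature vector of a headline; the patent's own notation may differ.

```latex
% Assumed standard form of logistic regression over the feature vector c
\hat{y} = \sigma\left(\theta^{\top} c + \beta\right)
        = \frac{1}{1 + \exp\left(-(\theta^{\top} c + \beta)\right)}

% Assumed standard form of the binary cross-entropy loss over N headlines
J = -\frac{1}{N} \sum_{j=1}^{N}
    \left[ y_j \log \hat{y}_j + (1 - y_j) \log\left(1 - \hat{y}_j\right) \right]
```

Under this reading, $\hat{y}$ doubles as the confidence probability mentioned in step 4, and $y_j \in \{0, 1\}$ is the gold label of the $j$-th headline.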

Compared with the prior art, the beneficial effects of the invention are as follows. The proposed clickbait detection method uses a convolutional neural network to represent a headline as a fixed-dimension, continuous, real-valued vector. Convolution filters first extract local n-gram features, and max pooling then selects the most important features over the whole headline. The method learns its features automatically, without relying on external features, and is applicable to all languages. Experiments on Chinese and English corpora show that the method achieves consistent results and outperforms earlier methods based on manually constructed features. The experimental results also demonstrate the robustness and effectiveness of the method in cross-language tasks.

Brief description of the drawings

Fig. 1 is a flow chart of the invention;

Fig. 2 is the program code listing for step 1 of the invention;

Fig. 3 is the program code listing for step 2 of the invention;

Fig. 4 is the program code listing for step 3 of the invention;

Fig. 5 is the program code listing for step 4 of the invention;

Fig. 6 is a structural block diagram of the model of the invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.

A clickbait detection method based on convolutional neural networks, whose flow is shown in Fig. 1, proceeds from the representation of the input news item to the system's decision on whether the item is clickbait, and comprises the following steps in order:

Step 1: word segmentation. The sentence is split into individual words; because words carry the semantics of the sentence, the model can then analyze and process the sentence more effectively (program code as in Fig. 2);

Step 2: vectorized word representation. Because discrete words cannot serve directly as model input, each word must be represented as a vector that a computer can process. Words are represented as continuous, dense word vectors that preserve the syntactic and semantic information learned from context, and all word vectors are concatenated to form a lookup matrix $L \in \mathbb{R}^{d \times |V|}$. The embedding matrix can either be randomly initialized from a uniform distribution or pre-trained on a large text corpus. Although pre-trained word vectors contain syntactic and semantic information learned from context, random initialization is used here so that the model is fully end-to-end (program code as in Fig. 3);

Step 3: automatic feature extraction. A convolutional neural network (CNN) structure automatically generates useful features without manual effort or expert knowledge. Multiple convolution filters with different window sizes generate features that capture local patterns at different granularities, and a max-pooling operation is applied to each resulting feature map. Intuitively, this operation selects the most informative features over the whole headline. Headlines are padded to a fixed length, and the output of the convolutional neural network is a fixed-size real-valued feature vector that serves as the representation of the headline (program code as in Fig. 4);

Step 4: the learned features are fed into a classifier, which produces the model's prediction, namely whether the input text is clickbait, together with a confidence probability. Logistic regression is used to classify a headline as clickbait or not clickbait (program code as in Fig. 5).
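The program code of Fig. 2 is not reproduced in this text. As a minimal sketch of what step 1 might look like, the following standard-library Python splits a headline into tokens and builds a vocabulary. The helper names `tokenize` and `build_vocab`, the `<pad>` entry and the character-level treatment of Chinese are all assumptions made for illustration; a real system would more likely call a dedicated Chinese word segmenter.

```python
import re

# Hypothetical pattern, not taken from the patent: Latin-script words and
# numbers stay whole, each CJK character is one token, and any other
# non-space symbol (punctuation, for instance) is kept as a single token.
_TOKEN = re.compile(
    r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[\u4e00-\u9fff]|[^\sA-Za-z0-9\u4e00-\u9fff]"
)


def tokenize(title):
    """Split a headline into a list of tokens.

    Splitting Chinese text into single characters is a crude stand-in for a
    proper word segmenter and is used here only to keep the sketch
    self-contained.
    """
    return _TOKEN.findall(title)


def build_vocab(titles):
    """Map each distinct token to an integer index, reserving 0 for padding."""
    vocab = {"<pad>": 0}
    for title in titles:
        for token in tokenize(title):
            # setdefault leaves existing entries untouched
            vocab.setdefault(token, len(vocab))
    return vocab
```

The vocabulary size `len(vocab)` plays the role of $|V|$ in step 2, and the integer index of a token is the position $k$ at which its one-hot vector is 1.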

The structure of the model is shown in Fig. 6. The model first embeds words as word vectors, then feeds the vectors into a convolutional neural network for automatic feature extraction, and finally uses a logistic regression classifier for classification.

1. Word vectors

Words are represented as continuous, dense word vectors. Compared with the traditional one-hot word representation, this representation achieves better results in a variety of natural language processing tasks, such as sentiment analysis and machine translation.

$$e_i = L b_k \in \mathbb{R}^d \tag{1}$$

All word vectors together form the lookup matrix $L \in \mathbb{R}^{d \times |V|}$, where $d$ is the dimension of the word vectors and $|V|$ is the vocabulary size.

The embedding matrix can either be randomly initialized from a uniform distribution or pre-trained on a large text corpus. Although pre-trained word vectors contain syntactic and semantic information learned from context, random initialization is used so that the model is fully end-to-end. Experimental results show that even randomly initialized vectors outperform the previous state-of-the-art methods. During training, the word vectors are updated by backpropagating the training error; during validation and testing, the learned word vectors are held fixed.

Formally, given a word $w_i$, we look it up in the matrix $L$, and the projection operation retrieves its word vector $e_i$. Here $b_k$ is a binary vector of vocabulary size that is 1 at the position corresponding to the word and 0 everywhere else. Word vectors can be randomly initialized or pre-trained with a neural language model.
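To make equation (1) concrete, the sketch below builds a randomly initialized lookup matrix and shows that multiplying it by a one-hot vector simply selects one column. The function names, the uniform range `scale=0.25` and the fixed seed are assumptions chosen for illustration, not values from the patent.

```python
import random


def init_lookup_matrix(d, vocab_size, scale=0.25, seed=0):
    """Return L as d rows of vocab_size values drawn uniformly from
    [-scale, scale]. The range and the seed are illustrative assumptions."""
    rng = random.Random(seed)
    return [[rng.uniform(-scale, scale) for _ in range(vocab_size)]
            for _ in range(d)]


def one_hot(k, vocab_size):
    """b_k: a vocabulary-sized vector that is 1 at position k and 0 elsewhere."""
    return [1.0 if j == k else 0.0 for j in range(vocab_size)]


def project(L, b):
    """Matrix-vector product L b, which yields a d-dimensional vector."""
    return [sum(l_rj * b_j for l_rj, b_j in zip(row, b)) for row in L]


def lookup(L, k):
    """Equivalent shortcut: read column k of L directly."""
    return [row[k] for row in L]
```

In practice the shortcut `lookup` is what an embedding layer does; the explicit product `project(L, one_hot(k, vocab_size))` is shown only because it mirrors the notation $e_i = L b_k$.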

2. Convolutional neural network

A convolutional neural network (CNN) automatically generates useful features. Its output is a fixed-size real-valued vector that serves as the representation of the headline.

Given a headline $H = \{w_1, w_2, \ldots, w_n\}$, its words are mapped through the lookup table $L$ to the vectors $\{e_1, e_2, \ldots, e_n\}$, which are concatenated as $[e_1; e_2; \ldots; e_n]$. For brevity, $e_{i:i+n-1}$ denotes the $n$ word vectors $\{e_i, e_{i+1}, \ldots, e_{i+n-1}\}$. For a filter $w \in \mathbb{R}^{h \times k}$, where $h$ is the window size and $k$ is the word vector dimension, a new feature $c_i$ is obtained from the equation

$$c_i = f(w \cdot e_{i:i+h-1} + b) \tag{2}$$

Here $b$ is the bias and $f$ is a linear activation function. The filter slides along the concatenated vectors and produces a feature map $c = \{c_1, c_2, \ldots, c_{n-h+1}\}$. A max-pooling operation is then applied to this feature map. Intuitively, this operation selects the most informative features over the whole headline. Multiple convolution filters with different window sizes are used to generate features that capture local patterns at different granularities. Headlines are also padded to a fixed length.
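The convolution and pooling just described can be sketched in plain Python as follows. This is an illustrative sketch, not the code of Fig. 4: the function names are hypothetical, the bias is treated as a scalar per filter, and the activation `f` is passed in as a parameter (defaulting to `math.tanh` purely as an assumption).

```python
import math


def pad(vectors, length, k):
    """Pad with zero vectors (or truncate) so the headline has a fixed length."""
    padded = vectors[:length]
    return padded + [[0.0] * k for _ in range(length - len(padded))]


def conv_feature_map(vectors, w, b, f=math.tanh):
    """Slide one filter w (h rows of k weights) over n word vectors.

    Returns [c_1, ..., c_{n-h+1}] with c_i = f(w . e_{i:i+h-1} + b).
    """
    n, h = len(vectors), len(w)
    feature_map = []
    for i in range(n - h + 1):
        window = vectors[i:i + h]
        # Element-wise product of filter and window, summed to a scalar
        total = sum(w_rj * e_rj
                    for w_row, e_row in zip(w, window)
                    for w_rj, e_rj in zip(w_row, e_row))
        feature_map.append(f(total + b))
    return feature_map


def max_pool(feature_map):
    """Keep only the largest activation of a feature map."""
    return max(feature_map)


def title_representation(vectors, filters, f=math.tanh):
    """One max-pooled feature per (w, b) filter; window sizes may differ.

    Assumes the headline was padded to at least the largest window size.
    """
    return [max_pool(conv_feature_map(vectors, w, b, f)) for w, b in filters]
```

Because every filter contributes exactly one pooled value, the length of the resulting representation depends only on the number of filters, not on the headline length, which is what makes the output a fixed-size vector.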

3. Classification model

The final feature vector $c$ is assembled from the features generated by each filter. Logistic regression is used to classify a headline as clickbait or not clickbait (formula 3). Cross-entropy loss is used as the optimization objective, given formally in formula 4, where $y$ is the gold-standard label and the predicted value is the logistic regression probability of each label.
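Formulas 3 and 4 are not reproduced in this text, so the following is only a sketch of a standard binary logistic regression with mean cross-entropy loss, written under that assumption. The names `theta` and `beta` for the classifier weights and bias, and the clipping constant `eps`, are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(c, theta, beta):
    """Probability that a headline with feature vector c is clickbait."""
    return sigmoid(sum(t * x for t, x in zip(theta, c)) + beta)


def cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy; eps guards against log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)
```

A headline would then be labelled clickbait when `predict_proba` exceeds a chosen threshold such as 0.5, and the probability itself can be reported as the confidence mentioned in step 4.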

In conclusion the click bait detection method proposed by the present invention based on convolutional neural networks, uses convolutional Neural Title is expressed as fixed dimension by network, continuously, the vector of real value.Firstly, being mentioned in global scope using convolution filter Local feature and maximum pond n- gram language model is taken to select most important feature.This method can automatically derive independent of External feature, and it is suitable for all language.Method of the invention, which achieves consistent knot, to be shown to the experiment of Chinese and English corpus Fruit, and it is better than the method for previous manual construction feature, experimental result shows method of the invention simultaneously and appoints across language Robustness and validity in business.

The above is only a preferred embodiment of the invention, but the scope of protection of the invention is not limited to it. Any equivalent substitution or change made by a person skilled in the art, within the technical scope disclosed by the invention and according to its technical solution and inventive concept, shall fall within the scope of protection of the invention.

Claims

1. A clickbait detection method based on convolutional neural networks, characterized in that it proceeds from the representation of the input news item to the system's decision on whether the item is clickbait, and comprises the following steps in order:

Step 1: word segmentation. The sentence is split into individual words; because words carry the semantics of the sentence, the model can then analyze and process the sentence more effectively;

Step 2: vectorized word representation. Because discrete words cannot serve directly as model input, each word must be represented as a vector that a computer can process. Words are represented as continuous, dense word vectors that preserve the syntactic and semantic information learned from context, and all word vectors are concatenated to form a lookup matrix $L \in \mathbb{R}^{d \times |V|}$. The embedding matrix can either be randomly initialized from a uniform distribution or pre-trained on a large text corpus;

Step 3: automatic feature extraction. A convolutional neural network structure automatically generates useful features: multiple convolution filters with different window sizes generate features that capture local patterns at different granularities, and a max-pooling operation is applied to each resulting feature map;

Step 4: the learned features are fed into a classifier, which produces the model's prediction, namely whether the input text is clickbait, together with a confidence probability.

2. The clickbait detection method based on convolutional neural networks according to claim 1, characterized in that each word is represented as a continuous, dense word vector according to the formula $e_i = L b_k \in \mathbb{R}^d$.

3. The clickbait detection method based on convolutional neural networks according to claim 1, characterized in that, in the matrix $L \in \mathbb{R}^{d \times |V|}$, $d$ is the dimension of the word vectors and $|V|$ is the vocabulary size.

4. The clickbait detection method based on convolutional neural networks according to claim 1, characterized in that each new feature $c_i$ generated by the convolutional neural network is obtained from the equation $c_i = f(w \cdot e_{i:i+h-1} + b)$.

5. The clickbait detection method based on convolutional neural networks according to claim 1, characterized in that logistic regression is used to classify a headline as clickbait or not clickbait, according to the formula:

6. The clickbait detection method based on convolutional neural networks according to claim 1, characterized in that cross-entropy loss is used as the optimization objective, with the formula: