CN109376229A - A clickbait detection method based on a convolutional neural network - Google Patents

Info

Publication number
CN109376229A
Authority
CN
China
Prior art keywords
word
feature
convolutional neural
neural networks
detection method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811476642.3A
Other languages
Chinese (zh)
Inventor
付俊峰
梁良
郑锦坤
周欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information And Communication Branch Of Jiangxi Electric Power Co Ltd
State Grid Corp of China SGCC
Original Assignee
Information And Communication Branch Of Jiangxi Electric Power Co Ltd
State Grid Corp of China SGCC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a clickbait detection method based on a convolutional neural network. The method comprises the following steps in order. Step 1: word segmentation, in which a sentence is split into individual words. Step 2: vectorized word representation; because words are discrete symbols that cannot be used directly as model input, each word is represented as a continuous dense word vector that a computer can process, and all word vectors are concatenated to form a lookup matrix. Step 3: automatic feature extraction, in which a convolutional neural network structure automatically generates useful features. Step 4: the learned features are input into a classifier, which produces the model's prediction. The method learns its features automatically, does not depend on external features, and is applicable to all languages. It outperforms previous methods based on manually constructed features and shows robustness and effectiveness in cross-language tasks.

Description

A clickbait detection method based on a convolutional neural network
Technical field
The present invention relates to the field of Internet technology, and in particular to a clickbait detection method based on a convolutional neural network.
Background art
With the emergence of Web 2.0, people increasingly prefer to read news online. As a substitute for traditional print media, online news covers a wider range of topics, offers richer content in more varied forms, and gives readers more choice. On the other hand, online news sites are also flooded with low-quality content. Most such outlets do not charge readers a subscription fee; their main income comes from the advertisements shown on their pages, and advertising revenue depends mainly on user views and clicks. To attract more attention and to compete with similar outlets, some online media resort to eye-catching headlines, which are known as clickbait. Clickbait headlines are usually misleading: they exaggerate the facts or conceal part of them. In an era of information explosion, clickbait clearly hinders readers from obtaining information efficiently and leaves them disappointed. In the long run it also greatly damages the credibility of the media, so detecting and preventing clickbait has become necessary.
Previous work relies on manually constructed lexical and syntactic features and has achieved good results with them. However, such work depends heavily on expert knowledge for feature extraction and cannot be applied to languages that lack these features. In English, for example, character-level information and capitalization play an important role in clickbait detection, whereas languages such as Chinese and Japanese have no such features. Moreover, although clickbait is common in every language on the Internet, clickbait detection for languages other than English has hardly been studied.
Summary of the invention
The object of the invention is to provide a clickbait detection method based on a convolutional neural network that learns its features automatically, does not depend on external features, and is applicable to all languages. Experiments on Chinese and English corpora show that the method achieves consistent results and outperforms previous methods based on manually constructed features. The results also demonstrate the robustness and effectiveness of the method in cross-language tasks, thereby solving the problems described in the Background art above.
To achieve the above object, the invention provides the following technical solution:
A clickbait detection method based on a convolutional neural network covers the pipeline from the representation of the input news to the system's decision on whether the news is clickbait. It comprises the following steps in order:
Step 1: word segmentation. The sentence is split into individual words; the sentence semantics carried by the words allow the model to analyze and process the sentence further;
Step 2: vectorized word representation. Because words are discrete symbols, they cannot be used directly as model input and must be represented as vectors that a computer can process. Each word is represented as a continuous dense word vector, which preserves the syntactic and semantic information learned from context. All word vectors are concatenated to form the lookup matrix $L \in \mathbb{R}^{d \times |V|}$. The embedding matrix can be initialized randomly from a uniform distribution or pre-trained on a large text corpus;
Step 3: automatic feature extraction. A convolutional neural network structure automatically generates useful features. Multiple convolution filters with different window sizes generate features and capture local features at different granularities, and a max-pooling operation is applied to each resulting feature map;
Step 4: the learned features are input into a classifier, which outputs the model's prediction, namely whether the input text is clickbait, together with a confidence probability.
Further, each word is represented as a continuous dense word vector according to the formula $e_i = L b_k \in \mathbb{R}^d$.
Further, in the matrix $L \in \mathbb{R}^{d \times |V|}$, $d$ is the dimension of the word vectors and $|V|$ is the vocabulary size.
Further, each new feature $c_i$ generated by the convolutional neural network is obtained from the equation $c_i = f(w \cdot e_{i:i+h-1} + b)$.
Further, logistic regression is used to classify a headline as clickbait or non-clickbait, according to the formula: [formula not reproduced in the source text]
Further, cross-entropy loss is used as the optimization objective; the formula is: [formula not reproduced in the source text]
Compared with the prior art, the beneficial effects of the invention are as follows. The proposed clickbait detection method uses a convolutional neural network to represent a headline as a fixed-dimension, continuous, real-valued vector. Convolution filters first extract local n-gram features, and max pooling then selects the most important features over the global range. The method learns its features automatically, does not depend on external features, and is applicable to all languages. Experiments on Chinese and English corpora show that the method achieves consistent results and outperforms previous methods based on manually constructed features. The results also demonstrate the robustness and effectiveness of the method in cross-language tasks.
Brief description of the drawings
Fig. 1 is a flow chart of the invention;
Fig. 2 shows the program code for step 1 of the invention;
Fig. 3 shows the program code for step 2 of the invention;
Fig. 4 shows the program code for step 3 of the invention;
Fig. 5 shows the program code for step 4 of the invention;
Fig. 6 is a block diagram of the model structure of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
A clickbait detection method based on a convolutional neural network, whose flow is shown in Fig. 1, covers the pipeline from the representation of the input news to the system's decision on whether the news is clickbait. It comprises the following steps in order:
Step 1: word segmentation. The sentence is split into individual words; the sentence semantics carried by the words allow the model to analyze and process the sentence further (program code shown in Fig. 2);
Step 2: vectorized word representation. Because words are discrete symbols, they cannot be used directly as model input and must be represented as vectors that a computer can process. Each word is represented as a continuous dense word vector, which preserves the syntactic and semantic information learned from context. All word vectors are concatenated to form the lookup matrix $L \in \mathbb{R}^{d \times |V|}$. The embedding matrix can be initialized randomly from a uniform distribution or pre-trained on a large text corpus. Although pre-trained word vectors contain syntactic and semantic information learned from context, random initialization is used so that the model is fully end-to-end (program code shown in Fig. 3);
Step 3: automatic feature extraction. A convolutional neural network (CNN) structure automatically generates useful features without manual effort or expert knowledge. Multiple convolution filters with different window sizes generate features and capture local features at different granularities, and a max-pooling operation is applied to each resulting feature map. Intuitively, this operation selects the most informative feature over the global range. Headlines are padded to a fixed length. The output of the convolutional neural network is a fixed-size, real-valued feature vector that serves as the representation of the headline (program code shown in Fig. 4);
Step 4: the learned features are input into a classifier, which outputs the model's prediction, namely whether the input text is clickbait, together with a confidence probability. Logistic regression is used to classify a headline as clickbait or non-clickbait (program code shown in Fig. 5).
The structure of the model is shown in Fig. 6. The model first embeds the words as word vectors and then feeds the vectors into the convolutional neural network for automatic feature extraction. Finally, a logistic regression classifier performs the classification.
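The program code of Figs. 2 to 5 is not reproduced in this text. As a rough illustration of step 1 and of the padding to a fixed length mentioned in step 3, the following Python sketch splits a headline into words, maps each word to an integer index, and pads the index sequence. The whitespace tokenizer, the `<pad>` and `<unk>` symbols, and all function names are assumptions made for this example; a language without word delimiters, such as Chinese, would need a dedicated word segmenter in place of `tokenize`.

```python
def tokenize(headline):
    """Split a headline into lowercase words (assumes whitespace-delimited text)."""
    return headline.lower().split()


def build_vocab(headlines, pad="<pad>", unk="<unk>"):
    """Assign an integer index to every word seen in the training headlines."""
    vocab = {pad: 0, unk: 1}  # reserved indices (hypothetical convention)
    for headline in headlines:
        for token in tokenize(headline):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(headline, vocab, max_len, pad="<pad>", unk="<unk>"):
    """Convert a headline to word indices, truncated or padded to max_len."""
    ids = [vocab.get(token, vocab[unk]) for token in tokenize(headline)]
    ids = ids[:max_len]
    return ids + [vocab[pad]] * (max_len - len(ids))


headlines = [
    "You will not believe what happened next",
    "Stock markets close higher",
]
vocab = build_vocab(headlines)
ids = encode("You will not believe this", vocab, max_len=8)
```

Each index in `ids` selects one column of the lookup matrix in step 2, so every headline enters the network as a sequence of the same length.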
1. Word vectors
Each word is represented as a continuous dense word vector. Compared with the traditional one-hot word representation, this representation achieves better results in a variety of natural language processing tasks, such as sentiment analysis and machine translation:
$e_i = L b_k \in \mathbb{R}^d$ (1)
All word vectors form the lookup matrix $L \in \mathbb{R}^{d \times |V|}$, where $d$ is the dimension of the word vectors and $|V|$ is the vocabulary size.
The embedding matrix can be initialized randomly from a uniform distribution or pre-trained on a large text corpus. Although pre-trained word vectors contain syntactic and semantic information learned from context, random initialization is used so that the model is fully end-to-end. The experimental results show that even randomly initialized vectors outperform the previous state-of-the-art methods. During training, the word vectors are updated by back-propagating the training error. During validation and testing, the learned word vectors are kept fixed.
Formally, given a word $w_i$, its word-vector representation $e_i$ is retrieved from the lookup matrix $L$ by a projection operation. $b_k$ is a binary vector of vocabulary size whose entry is 1 at the position of the corresponding word and 0 at all other positions. The word vectors can be randomly initialized or pre-trained with a neural language model.
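Equation (1) can be checked with a small Python sketch: multiplying the lookup matrix by a one-hot vector returns exactly the column of the word in question. The sampling range `scale=0.25`, the seed, and the function names are assumptions chosen for illustration; the text only states that the initialization is uniform.

```python
import random


def init_lookup_matrix(d, vocab_size, scale=0.25, seed=0):
    """Return a d x |V| matrix with entries drawn uniformly from [-scale, scale].

    The range is an assumption; the text specifies only a uniform distribution.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-scale, scale) for _ in range(vocab_size)]
            for _ in range(d)]


def one_hot(k, vocab_size):
    """Binary vector b_k: 1 at position k, 0 everywhere else."""
    return [1.0 if j == k else 0.0 for j in range(vocab_size)]


def lookup(L, k):
    """Compute e = L b_k, the matrix-vector product of equation (1)."""
    b_k = one_hot(k, len(L[0]))
    return [sum(row[j] * b_k[j] for j in range(len(b_k))) for row in L]


L = init_lookup_matrix(d=4, vocab_size=6)
e = lookup(L, 3)  # word vector of the word with index 3
```

In practice the product is not computed explicitly; selecting column `k` of `L` gives the same vector at much lower cost, which is why the operation is described as a lookup.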
2. Convolutional neural network
A convolutional neural network (CNN) is used to generate useful features automatically. The output of the convolutional neural network is a fixed-size, real-valued vector that serves as the representation of the headline.
Given a news headline $H = \{w_1, w_2, \dots, w_n\}$, its words are mapped through the lookup table $L$ to the vectors $\{e_1, e_2, \dots, e_n\}$, which are concatenated as $[e_1; e_2; \dots; e_n]$. For simplicity, $e_{i:i+n-1}$ denotes the $n$ word vectors $\{e_i, e_{i+1}, \dots, e_{i+n-1}\}$. For a filter $w \in \mathbb{R}^{h \times k}$, where $h$ is the window size and $k$ is the word-vector dimension (denoted $d$ above), a new feature $c_i$ is obtained from the equation
$c_i = f(w \cdot e_{i:i+h-1} + b)$ (2)
where $b$ is the bias and $f$ is a linear activation function. The filter is slid along the concatenated vectors to produce a feature map $c = \{c_1, c_2, \dots, c_{n-h+1}\}$. A max-pooling operation is applied to this feature map. Intuitively, this operation selects the most informative feature over the global range. Multiple convolution filters with different window sizes generate features and capture local features at different granularities. Headlines are also padded to a fixed length.
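The sliding-window computation of equation (2) followed by max pooling can be sketched in plain Python as below. This is an illustration under stated assumptions, not the code of Fig. 4: the function names, the toy word vectors, and the filter values are invented for the example, and the activation `f` is passed in as a parameter, with the identity used here so that the numbers can be checked by hand.

```python
def conv_feature_map(E, w, b, f=lambda x: x):
    """Slide an h x k filter over n word vectors and return n - h + 1 features.

    E: list of n word vectors, each of length k
    w: filter, a list of h rows of length k
    b: scalar bias
    f: activation function (identity here, an assumption for readability)
    """
    h, n = len(w), len(E)
    feature_map = []
    for i in range(n - h + 1):
        total = b
        for r in range(h):  # dot product of the filter with window e_{i:i+h-1}
            total += sum(wv * ev for wv, ev in zip(w[r], E[i + r]))
        feature_map.append(f(total))
    return feature_map


def extract_features(E, filters):
    """Max-pool each filter's feature map to get one feature per filter."""
    return [max(conv_feature_map(E, w, b)) for w, b in filters]


# Toy headline of n = 4 words with k = 2 dimensional word vectors.
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
filters = [
    ([[1.0, 0.0], [0.0, 1.0]], 0.0),               # window size h = 2
    ([[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]], -1.0),  # window size h = 3
]
fmap = conv_feature_map(E, *filters[0])
features = extract_features(E, filters)
```

Because max pooling keeps a single value per filter, the length of `features` equals the number of filters regardless of the headline length, which gives the fixed-size representation described above.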
3. Classification model
Each filter generates one feature, and together these form the final feature vector $c$. Logistic regression is used to classify a headline as clickbait or non-clickbait (formula 3). Cross-entropy loss is used as the optimization objective and is given formally in formula 4, where $y$ is the gold label and the predicted term is the logistic regression probability of each label.
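Formulas 3 and 4 are not reproduced in this text. The following Python sketch therefore assumes the standard binary forms, $p = \sigma(w \cdot c + b)$ for logistic regression and $-[y \log p + (1 - y) \log(1 - p)]$ for cross-entropy, which may differ in notation from the original formulas. The weights, bias, and function names are invented for the example.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(c, weights, bias):
    """Probability that the headline with feature vector c is clickbait."""
    z = sum(w * x for w, x in zip(weights, c)) + bias
    return sigmoid(z)


def cross_entropy(y, p, eps=1e-12):
    """Binary cross-entropy for gold label y (0 or 1) and predicted probability p."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


c = [2.0, 4.0]          # pooled feature vector from the convolution step
weights = [0.5, -0.25]  # hypothetical classifier weights
bias = 0.0
p = predict_proba(c, weights, bias)  # score is 1.0 - 1.0 = 0, so p = 0.5
loss = cross_entropy(1, p)           # equals log 2 for p = 0.5
```

The value `p` plays the role of the confidence probability output in step 4; thresholding it at 0.5 is one common way to obtain the clickbait or non-clickbait label.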
In conclusion, the clickbait detection method based on a convolutional neural network proposed by the invention uses a convolutional neural network to represent a headline as a fixed-dimension, continuous, real-valued vector. Convolution filters first extract local n-gram features, and max pooling then selects the most important features over the global range. The method learns its features automatically, does not depend on external features, and is applicable to all languages. Experiments on Chinese and English corpora show that the method achieves consistent results and outperforms previous methods based on manually constructed features. The results also demonstrate the robustness and effectiveness of the method in cross-language tasks.
The above is only a preferred embodiment of the invention, but the scope of protection of the invention is not limited to it. Any equivalent substitution or change made by a person skilled in the art, within the technical scope disclosed by the invention and according to its technical solution and inventive concept, shall fall within the scope of protection of the invention.

Claims (6)

1. A clickbait detection method based on a convolutional neural network, characterized in that it covers the pipeline from the representation of the input news to the system's decision on whether the news is clickbait, and comprises the following steps in order:
Step 1: word segmentation. The sentence is split into individual words; the sentence semantics carried by the words allow the model to analyze and process the sentence further;
Step 2: vectorized word representation. Because words are discrete symbols, they cannot be used directly as model input and must be represented as vectors that a computer can process. Each word is represented as a continuous dense word vector, which preserves the syntactic and semantic information learned from context. All word vectors are concatenated to form the lookup matrix $L \in \mathbb{R}^{d \times |V|}$. The embedding matrix can be initialized randomly from a uniform distribution or pre-trained on a large text corpus;
Step 3: automatic feature extraction. A convolutional neural network structure automatically generates useful features. Multiple convolution filters with different window sizes generate features and capture local features at different granularities, and a max-pooling operation is applied to each resulting feature map;
Step 4: the learned features are input into a classifier, which outputs the model's prediction, namely whether the input text is clickbait, together with a confidence probability.
2. The clickbait detection method based on a convolutional neural network according to claim 1, characterized in that each word is represented as a continuous dense word vector according to the formula $e_i = L b_k \in \mathbb{R}^d$.
3. The clickbait detection method based on a convolutional neural network according to claim 1, characterized in that in the matrix $L \in \mathbb{R}^{d \times |V|}$, $d$ is the dimension of the word vectors and $|V|$ is the vocabulary size.
4. The clickbait detection method based on a convolutional neural network according to claim 1, characterized in that each new feature $c_i$ generated by the convolutional neural network is obtained from the equation $c_i = f(w \cdot e_{i:i+h-1} + b)$.
5. The clickbait detection method based on a convolutional neural network according to claim 1, characterized in that logistic regression is used to classify a headline as clickbait or non-clickbait, according to the formula: [formula not reproduced in the source text]
6. The clickbait detection method based on a convolutional neural network according to claim 1, characterized in that cross-entropy loss is used as the optimization objective; the formula is: [formula not reproduced in the source text]
CN201811476642.3A 2018-12-04 2018-12-04 A kind of click bait detection method based on convolutional neural networks Pending CN109376229A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811476642.3A CN109376229A (en) 2018-12-04 2018-12-04 A kind of click bait detection method based on convolutional neural networks

Publications (1)

Publication Number Publication Date
CN109376229A true CN109376229A (en) 2019-02-22

Family

ID=65375875

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811476642.3A Pending CN109376229A (en) 2018-12-04 2018-12-04 A kind of click bait detection method based on convolutional neural networks

Country Status (1)

Country Link
CN (1) CN109376229A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107491436A (en) * 2017-08-21 2017-12-19 北京百度网讯科技有限公司 A kind of recognition methods of title party and device, server, storage medium
CN108460134A (en) * 2018-03-06 2018-08-28 云南大学 The text subject disaggregated model and sorting technique of transfer learning are integrated based on multi-source domain
CN108491389A (en) * 2018-03-23 2018-09-04 杭州朗和科技有限公司 Click bait title language material identification model training method and device
CN108596470A (en) * 2018-04-19 2018-09-28 浙江大学 A kind of power equipments defect text handling method based on TensorFlow frames

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JUNFENG FU et al.: "A Convolutional Neural Network for Clickbait Detection", 《2017 4TH INTERNATIONAL CONFERENCE ON INFORMATION SCIENCE AND CONTROL ENGINEERING》 *

Similar Documents

Publication Publication Date Title
US20200301954A1 (en) Reply information obtaining method and apparatus
CN106682192B (en) Method and device for training answer intention classification model based on search keywords
CN103514299B (en) Information search method and device
CN104050160B (en) Interpreter's method and apparatus that a kind of machine is blended with human translation
CN109299480A (en) Terminology Translation method and device based on context of co-text
CN110888990A (en) Text recommendation method, device, equipment and medium
CN106682170B (en) Application search method and device
CN110377695B (en) Public opinion theme data clustering method and device and storage medium
CN110263154A (en) A kind of network public-opinion emotion situation quantization method, system and storage medium
Fu et al. A convolutional neural network for clickbait detection
CN110008473B (en) Medical text named entity identification and labeling method based on iteration method
CN110674378A (en) Chinese semantic recognition method based on cosine similarity and minimum editing distance
CN113407842B (en) Model training method, theme recommendation reason acquisition method and system and electronic equipment
CN110929007A (en) Electric power marketing knowledge system platform and application method
CN108460150A (en) The processing method and processing device of headline
CN112445894A (en) Business intelligent system based on artificial intelligence and analysis method thereof
CN108363700A (en) The method for evaluating quality and device of headline
CN113806547A (en) Deep learning multi-label text classification method based on graph model
CN108399265A (en) Real-time hot news providing method based on search and device
CN112069312A (en) Text classification method based on entity recognition and electronic device
CN111488429A (en) Short text clustering system based on search engine and short text clustering method thereof
CN106649294A (en) Training of classification models and method and device for recognizing subordinate clauses of classification models
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
CN113392195A (en) Public opinion monitoring method and device, electronic equipment and storage medium
CN115017271B (en) Method and system for intelligently generating RPA flow component block

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190222