KR102149917B1

KR102149917B1 - An apparatus for detecting spam news with spam phrases, a method thereof and computer recordable medium storing program to perform the method

Info

Publication number: KR102149917B1
Application number: KR1020180160630A
Authority: KR
Inventors: 김창기; 박경수
Original assignee: 줌인터넷 주식회사
Priority date: 2018-12-13
Filing date: 2018-12-13
Publication date: 2020-08-31
Also published as: KR20200072724A

Abstract

The present invention relates to an apparatus for detecting spam news containing spam phrases, a method therefor, and a computer-readable recording medium on which a program for performing the method is recorded. The invention provides an apparatus for detecting spam news, a method therefor, and a computer-readable recording medium on which a program for performing the method is recorded, the apparatus comprising: a category classification unit that derives the probability that each of a plurality of paragraphs in a news article belongs to each of a plurality of categories; a correlation calculation unit that, based on those category probabilities, calculates through correlation analysis the correlation between each paragraph and the other paragraphs; and a spam news determination unit that identifies the paragraph with the lowest correlation value among the plurality of paragraphs as the suspected spam paragraph and determines, according to the correlation of the suspected spam paragraph, whether the news is real or fake.

Description

Apparatus for detecting spam news containing spam phrases, method therefor, and computer-readable recording medium storing a program for performing the method {An apparatus for detecting spam news with spam phrases, a method thereof and computer recordable medium storing program to perform the method}

The present invention relates to spam news detection technology and, more particularly, to an apparatus for detecting spam news containing spam phrases, a method therefor, and a computer-readable recording medium on which a program for performing the method is recorded.

Many news articles on the Internet insert sentences or paragraphs unrelated to the body of the article in order to make readers read spam with specific content.

[Prior art documents]

[Patent document] Korean Patent No. 1864439, registered June 1, 2018 (Title: Fake news discrimination system having a post graphic user interface window capable of discriminating fake news)

An object of the present invention is to provide an apparatus for detecting spam news in which spam phrases unrelated to the body of the article have been inserted, a method therefor, and a computer-readable recording medium on which a program for performing the method is recorded.

To achieve the above object, an apparatus for detecting spam news according to a preferred embodiment of the present invention includes: a category classification unit that derives the probability that each of a plurality of paragraphs in a news article belongs to each of a plurality of categories; a correlation calculation unit that, based on those category probabilities, calculates through correlation analysis the correlation between each paragraph and the other paragraphs; and a spam news determination unit that identifies the paragraph with the lowest correlation value among the plurality of paragraphs as the suspected spam paragraph and determines, according to the correlation of the suspected spam paragraph, whether the news is real or fake.

The spam news determination unit derives, based on the correlation distributions for fake news and real news, a fake news probability model that yields the probability of fake news and a real news probability model that yields the probability of real news; substitutes the correlation of the suspected spam paragraph into the fake news probability model and the real news probability model to calculate the probability that the news is real and the probability that the news is fake; and determines whether the news is real or fake according to those two probabilities.

The spam news determination unit determines the news to be fake if the correlation of the suspected spam paragraph is below a predetermined value.

The correlation calculation unit calculates, for each paragraph, the correlation between that paragraph's per-category probabilities and the average of the per-category probabilities of the other paragraphs.

The category classification unit is trained so that, when a paragraph is input, it outputs as numerical values the probability that the input paragraph corresponds to each of the plurality of categories.

To achieve the above object, a method for detecting spam news according to a preferred embodiment of the present invention includes: a step in which a category classification unit derives the probability that each of a plurality of paragraphs in a news article belongs to each of a plurality of categories; a step in which a correlation calculation unit, based on those category probabilities, calculates through correlation analysis the correlation between each paragraph and the other paragraphs; and a step in which a spam news determination unit identifies the paragraph with the lowest correlation value among the plurality of paragraphs as the suspected spam paragraph and determines, according to the correlation of the suspected spam paragraph, whether the news is real or fake.

The step of determining whether the news is real or fake includes: a step in which the spam news determination unit derives, based on the correlation distributions for fake news and real news, a fake news probability model that yields the probability of fake news and a real news probability model that yields the probability of real news; a step in which the spam news determination unit substitutes the correlation of the suspected spam paragraph into the fake news probability model and the real news probability model to calculate the probability that the news is real and the probability that the news is fake; and a step in which the spam news determination unit determines whether the news is real or fake according to those two probabilities.

In the step of determining whether the news is real or fake, the spam news determination unit determines the news to be fake if the correlation of the suspected spam paragraph is below a predetermined value.

The step of calculating the correlation calculates, for each paragraph, the correlation between that paragraph's per-category probabilities and the average of the per-category probabilities of the other paragraphs.

The method further includes, before the step of deriving the probability, a step of training the category classification unit so that, when a paragraph is input, it outputs as numerical values the probability that the input paragraph corresponds to each of the plurality of categories.

According to another aspect of the present invention, a computer-readable recording medium may be provided on which a program for performing the method for detecting spam news according to the preferred embodiment of the present invention is recorded.

According to the present invention, spam news can be detected in advance, saving the time that would otherwise be wasted reading it. This can provide users with a new user experience (UX).

FIG. 1 is a block diagram illustrating the configuration of an apparatus for detecting spam news according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating the configuration of the control unit of the spam detection device according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating a method for detecting spam news according to an embodiment of the present invention.
FIGS. 4 to 6 are diagrams illustrating a method for detecting spam news according to an embodiment of the present invention.

Before the detailed description of the present invention, note that the terms and words used in this specification and the claims below should not be construed as limited to their ordinary or dictionary meanings; they should be interpreted with meanings and concepts consistent with the technical idea of the present invention, on the principle that an inventor may appropriately define the concepts of terms in order to describe the invention in the best way. Accordingly, the embodiments described in this specification and the configurations shown in the drawings are merely the most preferred embodiments of the present invention and do not represent all of its technical ideas, so it should be understood that various equivalents and modifications capable of replacing them may exist at the time of filing.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. Note that identical components are denoted by the same reference numerals wherever possible. Detailed descriptions of well-known functions and configurations that could obscure the gist of the present invention are omitted. For the same reason, some components in the accompanying drawings are exaggerated, omitted, or shown schematically, and the size of each component does not fully reflect its actual size.

First, the configuration of an apparatus for detecting spam news according to an embodiment of the present invention is described. FIG. 1 is a block diagram illustrating the configuration of an apparatus for detecting spam news according to an embodiment of the present invention. FIG. 2 is a block diagram illustrating the configuration of the control unit of the spam detection device according to an embodiment of the present invention.

Referring to FIG. 1, the apparatus for detecting spam news (100; hereinafter abbreviated as the "spam detection device") includes a communication unit 110, an input unit 120, a display unit 130, a storage unit 140, and a control unit 200.

The communication unit 110 communicates with other devices over a network. The communication unit 110 may include a radio frequency (RF) transmitter (Tx) that up-converts and amplifies the frequency of a transmitted signal, and an RF receiver (Rx) that low-noise amplifies a received signal and down-converts its frequency. The communication unit 110 may also include a modem that modulates transmitted signals and demodulates received signals. For example, under the control of the control unit 200, the communication unit 110 may connect to a server (not shown) that provides Internet news and download news.

The input unit 120 receives the user's key operations for controlling the spam detection device 100, generates an input signal, and passes it to the control unit 200. The input unit 120 may include various keys for controlling the spam detection device 100. When the display unit 130 is a touch screen, the functions of these keys may be performed on the display unit 130, and when all functions can be performed through the touch screen alone, the input unit 120 may be omitted.

The display unit 130 visually presents to the user the menus of the spam detection device 100, input data, function setting information, and various other information. The display unit 130 outputs screens of the spam detection device 100 such as the boot screen, standby screen, and menu screen. The display unit 130 may be formed of a liquid crystal display (LCD), organic light emitting diodes (OLED), active matrix organic light emitting diodes (AMOLED), or the like. The display unit 130 may also be implemented as a touch screen, in which case it includes a touch sensor that detects the user's touch input. The touch sensor may be a touch-sensing sensor of the capacitive overlay, pressure, resistive overlay, or infrared beam type, or it may be a pressure sensor. Besides these, any kind of sensor device capable of detecting the contact or pressure of an object may be used as the touch sensor of the present invention. The touch sensor detects the user's touch input, generates a detection signal, and transmits it to the control unit 200. In particular, when the display unit 130 is a touch screen, some or all of the functions of the input unit 120 may be performed through the display unit 130.

The storage unit 140 stores the programs and data required for the operation of the spam detection device 100. In particular, the storage unit 140 stores a dictionary of synonyms and antonyms, a news database containing a plurality of Internet news articles, and the like. The data stored in the storage unit 140 may be deleted, changed, or added to according to the user's operations.

The control unit 200 controls the overall operation of the spam detection device 100 and the signal flow between its internal blocks 110 to 140, and may perform a data processing function. The control unit 200 may be a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), or the like.

Referring to FIG. 2, the control unit 200 includes a category classification unit 210, a correlation calculation unit 220, and a spam news determination unit 230.

The category classification unit 210 derives the probability that each of the paragraphs in a news article belongs to each of a plurality of categories. The category classification unit 210 is an artificial neural network trained so that, when a paragraph is input, it outputs as numerical values the probability that the input paragraph corresponds to each of the categories. For example, when a paragraph such as "The KOSPI index fell sharply again after just one day, giving up the 2,400 level" is input, it is trained to derive the probabilities {0.42, 0.12, 0.97, 0.24, 0.01, 0.01} that the input paragraph belongs to each of the categories {society, politics, economy, international, entertainment, sports}.
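The numerical output described above can be sketched as follows. This is only a minimal illustration, assuming the network ends in one independent sigmoid per category; the example probabilities do not sum to 1, which is consistent with that reading, but the specification does not state the output layer. The function name `category_probabilities` and the logit values are hypothetical.

```python
import math

# Category names taken from the example in the description.
CATEGORIES = ["society", "politics", "economy", "international", "entertainment", "sports"]


def category_probabilities(logits):
    """Map one raw score per category to an independent probability in (0, 1).

    Assumption: a per-category sigmoid; the patent does not specify this.
    """
    if len(logits) != len(CATEGORIES):
        raise ValueError("expected one logit per category")
    return {name: 1.0 / (1.0 + math.exp(-z)) for name, z in zip(CATEGORIES, logits)}


# Hypothetical logits standing in for the output layer of a trained network.
probs = category_probabilities([-0.3, -2.0, 3.5, -1.2, -4.6, -4.6])
```

Each paragraph of an article would be passed through the classifier separately, giving one probability vector per paragraph for the later correlation step.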

The correlation calculation unit 220 calculates, through correlation analysis, the correlation between each paragraph and the other paragraphs, based on the category probabilities derived by the category classification unit 210.

The spam news determination unit 230 identifies the paragraph with the lowest correlation value among the plurality of paragraphs as the suspected spam paragraph, and determines whether the news is real or fake according to the correlation of the suspected spam paragraph.

The operation of the control unit 200, which includes the category classification unit 210, the correlation calculation unit 220, and the spam news determination unit 230, is described in more detail below.

Next, a method for detecting spam news according to an embodiment of the present invention is described. FIG. 3 is a flowchart illustrating a method for detecting spam news according to an embodiment of the present invention. FIGS. 4 to 6 are diagrams illustrating a method for detecting spam news according to an embodiment of the present invention.

Before the spam news detection process begins, it is assumed that the category classification unit 210 has already been trained, using a deep learning technique, so that when a paragraph is input it outputs as numerical values the probability that the input paragraph corresponds to each of a plurality of categories. For example, when a paragraph such as "The KOSPI index fell sharply again after just one day, giving up the 2,400 level" is input, the category classification unit 210 is trained to derive the probabilities {0.42, 0.12, 0.97, 0.24, 0.01, 0.01} that the input paragraph belongs to each of the categories {society, politics, economy, international, entertainment, sports}.

Referring to FIG. 3, in step S110 the control unit 200 of the spam detection device 100 may connect, through the communication unit 110, to a web server that operates a web page providing news articles, and download the news contained in that web page.

In step S120, the category classification unit 210 of the control unit 200 derives, as numerical values, the probability that each of the paragraphs in the news belongs to each of a plurality of categories. For example, assuming that the news contains four paragraphs (41) as shown in FIG. 4, the category classification unit 210 derives the probabilities (42) that the four paragraphs belong to categories such as society, politics, economy, international, entertainment, and sports, as shown in Table 1 below.

[Table 1]

Next, in step S130, the correlation calculation unit 220 of the control unit 200 calculates, through correlation analysis, the correlation between each paragraph and the other paragraphs, based on the probabilities that the paragraphs belong to the categories. Specifically, for each paragraph, the correlation calculation unit 220 calculates the correlation between that paragraph's per-category probabilities and the average of the per-category probabilities of the other paragraphs.

For example, based on the probability values in Table 1, the correlation calculation unit 220 can calculate through correlation analysis the correlation between paragraph 1 and the other paragraphs, that is, paragraphs 2, 3, and 4. Specifically, the correlation calculation unit 220 first averages the per-category probabilities of paragraphs 2, 3, and 4; for example, this average is {0.30, 0.41, 0.04, 0.65, 0.32, 0.01}. The correlation calculation unit 220 then calculates the correlation between the per-category probabilities of paragraph 1, {0.56, 0.64, 0.21, 0.98, 0.02, 0.01}, and the average per-category probabilities of paragraphs 2, 3, and 4, {0.30, 0.41, 0.04, 0.65, 0.32, 0.01}.
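The leave-one-out calculation in this example can be sketched as follows. This is a minimal illustration that assumes the correlation is the Pearson correlation coefficient; the specification says only "correlation analysis" and does not name a coefficient. The function names `pearson` and `leave_one_out_correlations` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Note: raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def leave_one_out_correlations(paragraph_probs):
    """For each paragraph, correlate its category probabilities with the
    mean category probabilities of all the other paragraphs."""
    results = []
    for i, own in enumerate(paragraph_probs):
        others = [p for j, p in enumerate(paragraph_probs) if j != i]
        mean_others = [sum(col) / len(others) for col in zip(*others)]
        results.append(pearson(own, mean_others))
    return results


# Values quoted in the description: paragraph 1 versus the mean of paragraphs 2-4.
paragraph_1 = [0.56, 0.64, 0.21, 0.98, 0.02, 0.01]
mean_of_2_3_4 = [0.30, 0.41, 0.04, 0.65, 0.32, 0.01]
r1 = pearson(paragraph_1, mean_of_2_3_4)  # approximately 0.83
```

Under the Pearson assumption, paragraph 1 correlates strongly with the rest of the article, which is what one would expect of a paragraph that belongs to the body.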

Accordingly, the correlation calculation unit 220 can calculate the correlations (43) as shown in Table 2 below.

[Table 2]

In step S140, the spam news determination unit 230 identifies the paragraph with the lowest correlation (43) among the plurality of paragraphs as the suspected spam paragraph. For example, according to Table 2, paragraph 3 has the lowest correlation, so the spam news determination unit 230 identifies paragraph 3 as the suspected spam paragraph.
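Step S140 amounts to picking the minimum of the per-paragraph correlations. A minimal sketch follows; the function name `find_suspected_spam_paragraph` is hypothetical, and the correlation values are invented for illustration because the contents of Table 2 are not reproduced in this text.

```python
def find_suspected_spam_paragraph(correlations):
    """Return (1-based paragraph number, correlation) for the paragraph
    whose correlation with the rest of the article is lowest."""
    if not correlations:
        raise ValueError("at least one paragraph is required")
    index = min(range(len(correlations)), key=correlations.__getitem__)
    return index + 1, correlations[index]


# Hypothetical correlations for paragraphs 1 to 4 (not the values of Table 2).
paragraph_no, score = find_suspected_spam_paragraph([0.83, 0.71, -0.20, 0.64])
```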

Then, in step S150, the spam news determination unit 230 determines whether the news is real or fake according to the correlation of the suspected spam paragraph.

According to one embodiment, as shown in FIGS. 5 and 6, the spam news determination unit 230 derives, from a pre-stored correlation distribution (A1) for fake news and a correlation distribution (B1) for real news, a fake news probability model (A2) that yields the probability of fake news and a real news probability model (B2) that yields the probability of real news. The spam news determination unit 230 then substitutes the correlation of the suspected spam paragraph into the fake news probability model (A2) and the real news probability model (B2) to calculate the probability that the news is real and the probability that the news is fake. Finally, the spam news determination unit 230 determines whether the news is real or fake according to whichever of the two probabilities is larger.
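One way to picture this embodiment is sketched below. It assumes that each probability model is a normal distribution fitted to stored correlation samples and that both classes have equal prior odds; the specification does not state the form of models A2 and B2, so both are assumptions. The function names and the sample values are hypothetical.

```python
from statistics import NormalDist


def fit_probability_model(correlations):
    """Fit a normal distribution to stored suspected-paragraph correlations.

    Assumption: a Gaussian model; the patent does not specify the model form.
    """
    return NormalDist.from_samples(correlations)


def classify_news(suspect_correlation, fake_model, real_model):
    """Return (label, p_fake, p_real), assuming equal priors for both classes."""
    fake_density = fake_model.pdf(suspect_correlation)
    real_density = real_model.pdf(suspect_correlation)
    total = fake_density + real_density
    if total == 0.0:
        raise ValueError("correlation lies too far from both distributions")
    p_fake = fake_density / total
    p_real = real_density / total
    return ("fake" if p_fake > p_real else "real"), p_fake, p_real


# Hypothetical stored distributions: A1 (fake news) and B1 (real news).
fake_model = fit_probability_model([-0.45, -0.30, -0.25, -0.10, 0.05])
real_model = fit_probability_model([0.35, 0.50, 0.62, 0.70, 0.88])

label, p_fake, p_real = classify_news(-0.20, fake_model, real_model)
```

With these invented samples, a suspected spam paragraph with correlation -0.20 falls well inside the fake news distribution, so the article is labeled fake.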

According to another embodiment, the spam news determination unit 230 determines the news to be fake if the correlation of the suspected spam paragraph is below a predetermined value.

For example, assume that the news contains paragraphs 1 to 4 and that the correlation calculation unit 220 has calculated the correlations with paragraphs 2, 3, and 4 as shown in Table 3 below. Assume also that the predetermined correlation value used as the criterion for judging the suspected spam paragraph is "-0.17". For example, as shown in FIG. 6, "-0.17" can be set as the reference value for judging the suspected spam paragraph on the basis of the fake news probability model (A2) and the real news probability model (B2).

[Table 3]

In this case, because the correlation of paragraph 3 is "-0.2", which is less than "-0.17", the spam news determination unit 230 determines the news containing paragraph 3 to be fake news.
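The threshold rule of this embodiment reduces to a single comparison. A minimal sketch, using the reference value -0.17 and the paragraph 3 correlation -0.2 from the example; the names `SPAM_CORRELATION_THRESHOLD` and `is_fake_news` are hypothetical.

```python
# Reference value from the worked example; in practice it would be chosen
# from the fake and real news probability models.
SPAM_CORRELATION_THRESHOLD = -0.17


def is_fake_news(suspect_correlation, threshold=SPAM_CORRELATION_THRESHOLD):
    """Return True when the suspected spam paragraph's correlation is
    strictly below the threshold."""
    return suspect_correlation < threshold


verdict = is_fake_news(-0.2)  # paragraph 3 in the example
```

The comparison is strict, matching the wording "below a predetermined value": a correlation exactly equal to the threshold is not treated as fake.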

The method according to the embodiment of the present invention described above may be implemented as a program readable by various computer means and recorded on a computer-readable recording medium. The recording medium may contain program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the recording medium may be specially designed and configured for the present invention, or may be known to and usable by those skilled in computer software. Examples of recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

The present invention has been described above using several preferred embodiments, but these embodiments are illustrative and not limiting. Those of ordinary skill in the art to which the present invention pertains will understand that various changes and modifications can be made under the doctrine of equivalents without departing from the spirit of the present invention and the scope of rights set forth in the appended claims.

100: spam detection device
110: communication unit
120: input unit
130: display unit
140: storage unit
200: control unit
210: category classification unit
220: correlation calculation unit
230: spam news determination unit

Claims

An apparatus for detecting spam news, comprising:
a category classification unit that derives, for each of a plurality of paragraphs included in a news item, probabilities of belonging to a plurality of categories;
a correlation calculation unit that calculates, through correlation analysis based on the probabilities of belonging to the plurality of categories, a correlation between each paragraph and the other paragraphs; and
a spam news determination unit that identifies, among the plurality of paragraphs, the sentence having the lowest correlation value as a suspected spam paragraph, and determines whether the news is real or fake according to the correlation of the suspected spam paragraph,
wherein the spam news determination unit derives, based on the distributions of correlation for fake news and real news, a fake news probability model that yields a probability of being fake news and a real news probability model that yields a probability of being real news; substitutes the correlation of the suspected spam paragraph into the fake news probability model and the real news probability model to calculate a probability that the news is real news and a probability that the news is fake news; and determines whether the news is real or fake according to the probability of being real news and the probability of being fake news.
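The determination described in this claim can be illustrated with a minimal sketch. The claim does not state the form of the two probability models, so the sketch assumes each one is a normal distribution fitted to correlations observed in labelled news, and it assumes equal prior probability for fake and real news. The names `fit_models`, `news_probabilities`, and `judge`, and all sample values, are hypothetical.

```python
from statistics import NormalDist

# Hypothetical training data: correlations of the suspected spam paragraph
# observed in labelled fake and real news (illustrative values only).
FAKE_CORRS = [-0.4, -0.2, 0.0, 0.1, -0.1]
REAL_CORRS = [0.7, 0.8, 0.9, 0.6, 0.85]


def fit_models(fake_corrs, real_corrs):
    """Fit one normal distribution per class (the Gaussian form is an assumption)."""
    fake_model = NormalDist.from_samples(fake_corrs)
    real_model = NormalDist.from_samples(real_corrs)
    return fake_model, real_model


def news_probabilities(corr, fake_model, real_model):
    """Return (p_fake, p_real) for one suspected-paragraph correlation.

    Assumes equal priors, so each probability is the class likelihood
    divided by the sum of both likelihoods.
    """
    like_fake = fake_model.pdf(corr)
    like_real = real_model.pdf(corr)
    total = like_fake + like_real
    if total == 0.0:
        # Both densities underflowed; the models give no evidence either way.
        return 0.5, 0.5
    return like_fake / total, like_real / total


def judge(corr, fake_model, real_model):
    """Label the news 'fake' or 'real' by comparing the two probabilities."""
    p_fake, p_real = news_probabilities(corr, fake_model, real_model)
    return "fake" if p_fake > p_real else "real"


fake_model, real_model = fit_models(FAKE_CORRS, REAL_CORRS)
low_label = judge(-0.16, fake_model, real_model)
high_label = judge(0.87, fake_model, real_model)
```

With these illustrative samples, a suspected paragraph that correlates poorly with the rest of the article falls under the fake news model, and one that correlates strongly falls under the real news model.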

(Deleted)

The apparatus of claim 1,
wherein the spam news determination unit
determines the news to be fake if the correlation of the suspected spam paragraph is less than a reference value set based on the fake news probability model and the real news probability model.
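The claim says only that the reference value is set based on the two probability models. One plausible reading, sketched below as an assumption, takes the reference value to be the correlation at which the two model densities are equal, located by bisection between the two model means. The Gaussian models, their parameters, and the names `reference_value` and `is_fake` are hypothetical.

```python
from statistics import NormalDist

# Hypothetical models; the parameters are illustrative, not from the patent.
fake_model = NormalDist(mu=-0.1, sigma=0.2)
real_model = NormalDist(mu=0.8, sigma=0.1)


def reference_value(fake_model, real_model, iterations=60):
    """Find the correlation where both densities are equal, by bisection.

    Assumes the fake model dominates at its own mean and the real model
    dominates at its own mean, so exactly one crossing lies between them.
    """
    lo, hi = fake_model.mean, real_model.mean
    if lo >= hi:
        raise ValueError("expected the fake mean to lie below the real mean")
    if fake_model.pdf(lo) <= real_model.pdf(lo) or real_model.pdf(hi) <= fake_model.pdf(hi):
        raise ValueError("models overlap too much to bracket a single crossing")
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if fake_model.pdf(mid) > real_model.pdf(mid):
            lo = mid  # still on the fake side of the crossing
        else:
            hi = mid  # on the real side of the crossing
    return (lo + hi) / 2


def is_fake(corr, threshold):
    """Apply the rule of this claim: below the reference value means fake."""
    return corr < threshold


threshold = reference_value(fake_model, real_model)
```

Under equal priors this threshold gives the same decision as comparing the two probabilities directly, but it reduces the check at detection time to a single comparison.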

The apparatus of claim 1,
wherein the correlation calculation unit
calculates, for each paragraph, the correlation between that paragraph's per-category probabilities and the average of the per-category probabilities of the paragraphs other than that paragraph.
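The calculation in this claim can be sketched as a leave-one-out correlation. The claim does not name the correlation coefficient, so the sketch assumes the Pearson coefficient. The function names and the probability table are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: a constant vector is treated as uncorrelated
    return cov / (norm_x * norm_y)


def leave_one_out_correlations(probs):
    """probs[i][k] is the probability that paragraph i belongs to category k.

    For each paragraph, correlate its category probabilities with the
    average category probabilities of all the other paragraphs.
    """
    n = len(probs)
    if n < 2:
        raise ValueError("at least two paragraphs are required")
    n_categories = len(probs[0])
    result = []
    for i, row in enumerate(probs):
        others_mean = [
            sum(probs[j][k] for j in range(n) if j != i) / (n - 1)
            for k in range(n_categories)
        ]
        result.append(pearson(row, others_mean))
    return result


# Hypothetical classifier output for a four-paragraph article, three categories.
probs = [
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.75, 0.15, 0.10],
    [0.10, 0.10, 0.80],  # off-topic paragraph
]
corrs = leave_one_out_correlations(probs)
suspect = min(range(len(corrs)), key=corrs.__getitem__)  # index of lowest correlation
```

Here the last paragraph has a category profile unlike the others, so it receives the lowest correlation and would be the one passed on as the suspected spam paragraph.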

The apparatus of claim 1,
wherein the category classification unit
is trained so that, when a paragraph is input, it outputs as numerical values the probability that the input paragraph corresponds to each of the plurality of categories.
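The claim does not specify the learning algorithm. As a stand-in, the sketch below uses a multinomial naive Bayes classifier with add-one smoothing and whitespace tokenization, both of which are assumptions (Korean text would normally need a morphological tokenizer). The class name, category labels, and training sentences are hypothetical.

```python
import math
from collections import Counter


class NaiveBayesCategoryClassifier:
    """Multinomial naive Bayes stand-in for the category classification unit."""

    def fit(self, paragraphs, labels):
        self.categories = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.categories}
        for text, label in zip(paragraphs, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict_proba(self, paragraph):
        """Return {category: probability}; the probabilities sum to 1."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        log_scores = {}
        for c in self.categories:
            total_words = sum(self.word_counts[c].values())
            score = math.log(self.doc_counts[c] / total_docs)  # class prior
            for word in paragraph.lower().split():
                if word in self.vocab:  # words never seen in training are skipped
                    count = self.word_counts[c][word]
                    score += math.log((count + 1) / (total_words + vocab_size))
            log_scores[c] = score
        # Normalise in log space to avoid underflow on long paragraphs.
        top = max(log_scores.values())
        weights = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(weights.values())
        return {c: w / norm for c, w in weights.items()}


# Hypothetical labelled paragraphs.
train_texts = [
    "the team won the match",
    "the striker scored a goal",
    "stocks fell as markets closed",
    "the bank raised interest rates",
]
train_labels = ["sports", "sports", "economy", "economy"]

clf = NaiveBayesCategoryClassifier().fit(train_texts, train_labels)
probs = clf.predict_proba("the striker scored in the match")
```

The resulting dictionary has one probability per category, which is the per-paragraph input that the correlation calculation unit needs.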

A method for detecting spam news, comprising:
deriving, by a category classification unit, for each of a plurality of paragraphs included in a news item, probabilities of belonging to a plurality of categories;
calculating, by a correlation calculation unit, through correlation analysis based on the probabilities of belonging to the plurality of categories, a correlation between each paragraph and the other paragraphs; and
identifying, by a spam news determination unit, among the plurality of paragraphs, the sentence having the lowest correlation value as a suspected spam paragraph, and determining whether the news is real or fake according to the correlation of the suspected spam paragraph,
wherein the step of determining whether the news is real or fake comprises:
deriving, by the spam news determination unit, based on the distributions of correlation for fake news and real news, a fake news probability model that yields a probability of being fake news and a real news probability model that yields a probability of being real news;
calculating, by the spam news determination unit, a probability that the news is real news and a probability that the news is fake news by substituting the correlation of the suspected spam paragraph into the fake news probability model and the real news probability model; and
determining, by the spam news determination unit, whether the news is real or fake according to the probability of being real news and the probability of being fake news.

(Deleted)

The method of claim 6,
wherein the step of determining whether the news is real or fake
determines the news to be fake if the correlation of the suspected spam paragraph is less than a reference value set based on the fake news probability model and the real news probability model.

The method of claim 6,
wherein the step of calculating the correlation
calculates, for each paragraph, the correlation between that paragraph's per-category probabilities and the average of the per-category probabilities of the paragraphs other than that paragraph.

The method of claim 6,
further comprising, before the step of deriving the probabilities,
training the category classification unit so that, when a paragraph is input, it outputs as numerical values the probability that the input paragraph corresponds to each of the plurality of categories.

A computer-readable recording medium on which is recorded a program for causing a computer to execute the method for detecting spam news according to any one of claim 6 or claims 8 to 10.