KR101379935B1

KR101379935B1 - System and method for extracting information from SNS messages

Info

Publication number: KR101379935B1
Application number: KR1020130012577A
Authority: KR
Inventors: 유성준; 박인숙
Original assignee: (주)레드테이블
Priority date: 2013-02-04
Filing date: 2013-02-04
Publication date: 2014-04-01
Also published as: KR20130036024A

Abstract

Disclosed are a message information extraction system and method that extract and classify specific messages containing desired information from SNS messages.

Description

Message information extraction system and method {SYSTEM AND METHOD FOR EXTRACTING INFORMATION FROM SNS MESSAGES}

The present invention relates to a message information extraction system and method, and more particularly to a message information extraction system and method that extract and classify specific messages containing desired information from SNS messages.

Users have long passed on recommendations of 'good restaurants' (맛집), judged by the service, menu, and taste of the restaurants they visited, to the people around them by word of mouth. With the development of the Internet, they began to share information about good restaurants directly through web blogs, online cafes, and similar channels rather than by word of mouth.

Today, however, with social network services in wide use, users introduce the restaurants they visit directly on SNS in real time, posting the restaurant name, the menu, and other information about the visit in SNS messages, or they ask on SNS for restaurant recommendations before a meal with family, friends, or acquaintances.

In particular, now that smartphones are widespread, users no longer sit in front of a computer to look up a restaurant before going. They search for restaurants anytime and anywhere, or photograph the restaurants they visit and upload the pictures to social network services.

Accordingly, there has been demand for technology that extracts 'good restaurant' information from messages uploaded through SNS and makes it usable, so that users can share or exchange such information with an unspecified number of other users through SNS rather than only in their own private space. Until now, however, no technology has been known for searching restaurant information in SNS messages, for example on Twitter (Twitter.com).

The present invention was conceived in response to this demand, and its object is to provide a message information extraction system and method that extract and classify specific messages containing desired information from SNS messages.

To achieve the above object, the message information extraction system according to the present invention comprises: a web crawler unit that collects at least one message posted on at least one social network site; a preprocessing unit that performs preprocessing to remove unnecessary characters from the message; a POS tagging unit that performs part-of-speech tagging on at least one word contained in the preprocessed message; a noun extraction unit that extracts nouns from the at least one tagged word; and a keyword extraction unit that extracts keywords from the at least one extracted noun.

Preferably, the keyword extraction unit further includes a keyword lookup unit, an unextracted document processing unit, a looked-up keyword extraction unit, and a dictionary database.

Preferably, the unextracted document processing unit further includes a probability and frequency determination unit, a classification model database, a tweet database, and a restaurant name determination unit.

The keyword extraction unit may further include an output unit that outputs the extracted keywords.

Meanwhile, the message information extraction method according to the present invention is performed in a message information extraction system that includes a web crawler unit, a preprocessing unit, a POS tagging unit, a noun extraction unit, and a keyword extraction unit. The method comprises: a document collection step of collecting at least one message posted on at least one social network site connected through a network; a preprocessing step of removing unnecessary characters from the at least one collected message; a part-of-speech tagging step of performing part-of-speech tagging on at least one word contained in the message after the preprocessing step; a specific-word extraction step of extracting nouns from the at least one tagged word; and a keyword lookup step of determining whether each noun is a keyword belonging to the category to be searched.

The keyword lookup step is performed by determining whether a word matching the noun exists in at least one dictionary database that stores a plurality of keywords belonging to the category.

A message information extraction method according to another aspect of the present invention is performed in a message information extraction system that includes a web crawler unit, a preprocessing unit, a POS tagging unit, a noun extraction unit, a keyword extraction unit, and a classification model database. The method comprises: a document collection step of collecting at least one message posted on at least one social network site connected through a network; a preprocessing step of removing unnecessary characters from the at least one collected message; a part-of-speech tagging step of performing part-of-speech tagging on at least one word contained in the message after the preprocessing step; a specific-word extraction step of extracting nouns from the at least one tagged word; and a statistical keyword extraction step of calculating the frequency and probability of the noun's feature values and comparing the calculated feature values with the probability values corresponding to the noun in the classification model database, thereby determining whether the message containing the noun is a positive tweet or a negative tweet.

The present invention makes it possible to implement a message information extraction system and method that extract and classify specific messages containing desired information from SNS messages.

As a result, a variety of useful information contained in SNS messages, such as information about good restaurants, can be easily extracted and put to use.

FIG. 1 is a block diagram showing an example of a message information extraction system;
FIG. 2 is a diagram showing the keyword extraction unit of the message information extraction system of FIG. 1 in more detail;
FIG. 3 is a diagram showing the unextracted document processing unit of FIG. 2 in more detail;
FIG. 4 is a flowchart showing an example of a message information extraction method; and
FIG. 5 is a flowchart showing another example of a message information extraction method.

The present invention is described using Twitter (Twitter.com), a representative short-message social network service, as an example of the messages to be searched, and words about restaurants as an example of the search topic.

Within Twitter, tweets written by users themselves are searched for restaurant-related keywords such as '맛집' (good restaurant), '레스토랑' (restaurant), and '식당' (eatery). Restaurant names and place names are extracted from the matching tweets, each restaurant is checked to confirm that it actually exists, and the results are stored in a database.

When a Twitter user asks for a restaurant recommendation, the database is queried and the result is returned to the user.

The invention differs from existing patents in that it analyzes short content of 140 characters or fewer to provide the information the user wants.

The invention must automatically collect natural-language documents from Twitter, where data is posted in real time. The natural language must then be put into a structured form, and meaningful restaurant-related words must be extracted and stored in a database. Using the structured data, the system should work as an automated Q&A system: when a user enters a restaurant-related keyword (for example, a restaurant name), it automatically shows the extracted questions and answers.

In the present invention, natural-language documents on Twitter that contain restaurant-related words were collected with a web crawler. That is, out of the large volume of data posted in real time, tweets containing restaurant-related words such as '식당', '레스토랑', and '맛집' were collected by keyword search. The following describes how restaurant names and place names are extracted from the collected SNS message documents by applying rule-based and statistical methods.
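
The keyword-based collection described above can be sketched as a simple filter. This is a minimal illustration in Python, not the crawler itself: the function name `filter_restaurant_tweets` and the in-memory list of tweets are assumptions made for the example, and fetching tweets from Twitter is out of scope here.

```python
# Restaurant-related search keywords named in the description.
RESTAURANT_KEYWORDS = ("식당", "레스토랑", "맛집")


def filter_restaurant_tweets(tweets, keywords=RESTAURANT_KEYWORDS):
    """Keep only tweets that contain at least one search keyword.

    `tweets` stands in for whatever the crawler returns (hypothetical input).
    """
    return [t for t in tweets if any(k in t for k in keywords)]


sample = [
    "오호! 내방역 근처 맛집. (@만다린)",
    "서울 시청이랑 정동 쪽에 맛집 추천해 주세요.",
    "오늘 날씨가 정말 좋네요.",  # made-up tweet with no keyword
]
collected = filter_restaurant_tweets(sample)  # keeps the first two tweets
```

A plain substring test is used because the keywords often appear fused with other words, as in '대학로맛집'.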

FIG. 1 is a block diagram showing an example of a message information extraction system.

As shown in FIG. 1, the message information extraction system (10) comprises a web crawler unit (100), a preprocessing unit (110), a POS tagging unit (120), a noun extraction unit (130), and a keyword extraction unit (140). It may further include an output unit (150) that outputs the extracted keywords through a predetermined interface.

The web crawler unit (100) collects a plurality of SNS messages posted on at least one social network site connected through a network.

The preprocessing unit (110) performs preprocessing that removes unnecessary characters from each of the collected messages.

The POS tagging unit (120) performs part-of-speech tagging on the words contained in each preprocessed message. Nouns are tagged "N" (noun), and verbs and adjectives are tagged "V" (verb).

The noun extraction unit (130) extracts the nouns from the tagged words.

The keyword extraction unit (140) extracts, from the nouns extracted by the noun extraction unit (130), the keywords that belong to the category to be searched. For example, when the category to be searched is "restaurant", it determines whether each noun is a keyword belonging to the "restaurant" category.

FIG. 2 is a diagram showing the keyword extraction unit of the message information extraction system of FIG. 1 in more detail.

As shown in FIG. 2, the keyword extraction unit (140) further comprises a keyword lookup unit (142), a dictionary database (144), an unextracted document processing unit (146), and a looked-up keyword extraction unit (148).

The dictionary database (144) further comprises sub-dictionary databases, each a collection of keywords belonging to a different category, such as a restaurant name dictionary (144a), a commercial district dictionary (144b), a subway station dictionary (144c), and a place dictionary (144d).

If the keyword lookup unit (142) finds that none of the nouns extracted from a message is listed in the dictionary database (144) (negative), the unextracted document processing unit (146) carries out further processing. In this case, the search can be continued using the statistical method described later.

If the keyword lookup unit (142) finds that a noun extracted from the message is listed in the dictionary database (144) (positive), the looked-up keyword extraction unit (148) treats that noun as a keyword.
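
The positive/negative branching of the keyword lookup can be sketched as follows. This is only an illustration under stated assumptions: the dictionary contents, the category labels, and the function name `lookup_keywords` are hypothetical, and the statistical follow-up for the negative case is not shown.

```python
# Hypothetical contents for the sub-dictionaries 144a to 144d.
DICTIONARY_DB = {
    "restaurant": {"만다린"},   # restaurant name dictionary (144a)
    "commercial": {"대학로"},   # commercial district dictionary (144b)
    "subway": {"내방역"},       # subway station dictionary (144c)
    "place": {"정동"},          # place dictionary (144d)
}


def lookup_keywords(nouns, db=DICTIONARY_DB):
    """Return (status, hits).

    status is "positive" when at least one noun is listed in a dictionary,
    and "negative" when none is, in which case the message would be handed
    to the unextracted document processing step.
    """
    hits = {}
    for noun in nouns:
        categories = [name for name, words in db.items() if noun in words]
        if categories:
            hits[noun] = categories
    status = "positive" if hits else "negative"
    return status, hits


status, hits = lookup_keywords(["내방역", "근처", "맛집", "만다린"])
```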

FIG. 3 is a diagram showing the unextracted document processing unit of FIG. 2 in more detail.

To continue the search using a statistical method, the unextracted document processing unit (146) further comprises a probability and frequency determination unit (146a), a classification model database (146b), a tweet database (146c), and a restaurant name determination unit (146d).

1. Rule-based extraction of restaurant names and place names from restaurant-related SNS messages

FIG. 4 is a flowchart showing an example of a message information extraction method.

The message information extraction method comprises a document collection step (S100), a preprocessing step (S110), a part-of-speech tagging step (S120), a specific-word extraction step (S130), and a keyword lookup step (S140). It may further include an output step (S150) of outputting the looked-up keywords.

The tweet documents $T_i$ (i = 1, ..., n) collected from Twitter consist of mutually independent sentences, as shown in Table 1 below.

[Table 1]
T1 = "오호! 내방역 근처 맛집. (@만다린)" ("Oh! A good restaurant near Naebang Station. (@Mandarin)")
T2 = "[대학로맛집] 분위기 좋은 이탈리안 레스토랑, 나무!!!" ("[Daehangno restaurant] An Italian restaurant with a nice atmosphere, Namu!!!")
T3 = "서울 시청이랑 정동 쪽에 맛집 추천해 주세요." ("Please recommend a good restaurant around Seoul City Hall and Jeong-dong.")
T4 = "추운 날씨에는 역시 부대찌개 이태원 회사 근처 맛집 고암식당^^" ("In cold weather it has to be budae-jjigae. Goam Sikdang, a good restaurant near my office in Itaewon ^^")

The tweet documents $T_i$ above are processed using regular, structured dictionaries.

For example, a restaurant name dictionary (Restaurant_Lexicon), a commercial district dictionary (MarketArea_Lexicon), a subway station dictionary (Subway_Lexicon), and a place dictionary (Place_Lexicon) can be used.

To extract restaurant names and place names with the dictionary-based rule method, nouns are first extracted through preprocessing and POS tagging (morphological analysis).

Preprocessing means handling the characters in a tweet document $T_i$ that are unnecessary for extracting restaurant names and place names.

Unnecessary characters are stopwords (particles, articles, prepositions, conjunctions, interjections, etc.), special symbols ([], "", '', #, @, ?, !, ...), English letters, digits, and isolated Hangul consonants and vowels.

The present invention does not use these unnecessary characters, and in morphological analysis (POS tagging) it assigns tags only to nouns.

This is because most of the restaurant names and place names to be extracted are nouns.

The description below uses T1 = "오호! 내방역 근처 맛집. (@만다린)" from the examples in Table 1.

(1) In the preprocessing step, unnecessary characters are removed.

Preprocessing is performed with the function Preprocessing($T_i$), where $T_i$ (i = 1, ..., m) are the tweet documents to be preprocessed.

The result of preprocessing is stored in $S_i$ (i = 1, ..., n).

This is expressed as Equation 1.

[Equation 1]

$$\mathrm{Preprocessing}(T_i) = \{W_1, W_2, \ldots, W_m\}$$

(Example)

$S_1$ = "내방역($W_1$) 근처($W_2$) 맛집($W_3$) 만다린($W_4$)" (Naebang Station, near, good restaurant, Mandarin)
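
A minimal Python sketch of this preprocessing step is given below. It is an illustration only: the regular expression (keep Hangul syllables, drop everything else) and the one-entry stopword list are assumptions, and removing particles attached to a word (as in '시청이랑') would need a morphological analyzer that is not shown.

```python
import re

# Hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"오호"}


def preprocessing(tweet, stopwords=STOPWORDS):
    """Preprocessing(T_i): return the words W_1..W_m left after cleaning.

    Everything except Hangul syllables and whitespace is replaced by a space,
    which drops special symbols, Latin letters, digits, and isolated jamo.
    Standalone stopwords are then removed.
    """
    hangul_only = re.sub(r"[^가-힣\s]", " ", tweet)
    return [w for w in hangul_only.split() if w not in stopwords]


s1 = preprocessing("오호! 내방역 근처 맛집. (@만다린)")
# s1 == ["내방역", "근처", "맛집", "만다린"]
```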

POS tagging is then applied to the preprocessing result $S_1$. The tagged form of the preprocessed $S_1$ is obtained with POSTag($S_1$) and stored in $PT_1$.

[Equation 2]

$$\mathrm{POSTag}(S_i) = \{W_1{:}P_1,\; W_2{:}P_2,\; \ldots,\; W_k{:}P_k\}$$

(Example)

$PT_1$ = {내방역:N, 근처:N, 맛집:N, 만다린:N}

From the data structured by POS tagging, only the nouns (N) are extracted and looked up in each dictionary.

Features are used for this lookup. Only nouns (N: noun) are used as features because restaurant names and place names are tagged as nouns.

Only the nouns are extracted from the result of POSTag($S_i$) and denoted Feature($PT_i$).

Because every word in $PT_1$ is tagged as a noun, the result $F_1$ is identical to $PT_1$.

The result of Feature($PT_i$) is written $F_i$ (i = 1, ..., n) and consists of the following set.

[Equation 3]

$$\mathrm{Feature}(PT_i) = \{W_1{:}N,\; W_2{:}N,\; \ldots,\; W_k{:}N\}$$

(Example)

$F_1$ = {내방역:N, 근처:N, 맛집:N, 만다린:N}
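
The relationship between POSTag and Feature can be illustrated with a toy example. The tag table below is a hypothetical stand-in for a real Korean morphological analyzer, and the function names `pos_tag` and `feature` are chosen for this sketch only.

```python
# Toy tag table standing in for a real morphological analyzer (assumption).
TOY_TAGS = {"내방역": "N", "근처": "N", "맛집": "N", "만다린": "N", "먹었어요": "V"}


def pos_tag(words):
    """POSTag(S_i): pair each word with its tag; unknown words get None."""
    return [(w, TOY_TAGS.get(w)) for w in words]


def feature(tagged):
    """Feature(PT_i): keep only the pairs tagged as nouns ("N")."""
    return [(w, tag) for w, tag in tagged if tag == "N"]


s1 = ["내방역", "근처", "맛집", "만다린"]
pt1 = pos_tag(s1)
f1 = feature(pt1)  # identical to pt1 because every word in S_1 is a noun
```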

After the 'N' is removed from '만다린:N', the restaurant name dictionary is searched for '만다린'. Calling the restaurant name dictionary RestaurantName_Lexicon, if the word is already in the dictionary it is not stored again, since that would create a duplicate entry; if it is not, the Internet dictionary Web_Lexicon (for example, paran.com) is queried.

If the queried name is confirmed to be an actual restaurant name, it is stored in RestaurantName_Lexicon. If it is not confirmed, it is not stored, because it may be a user's spelling error or a restaurant that has closed.
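
The dictionary update policy described above can be sketched as follows. This is a hedged illustration: `check_restaurant`, the set-based lexicon, and the `web_lookup` callable are assumptions standing in for RestaurantName_Lexicon and the Web_Lexicon query, whose real interfaces are not given in the text.

```python
def check_restaurant(word, lexicon, web_lookup):
    """Return True if `word` is a restaurant name, updating `lexicon` if new.

    `lexicon` is a set of known names; `web_lookup` is a callable that stands
    in for the Web_Lexicon query (hypothetical interface).
    """
    if word in lexicon:
        return True        # already known: do not store a duplicate
    if web_lookup(word):
        lexicon.add(word)  # confirmed on the web: store the new name
        return True
    return False           # possibly a spelling error or a closed restaurant


def fake_web(word):
    """Stub web dictionary that only knows one made-up entry."""
    return word in {"고암식당"}


restaurant_lexicon = {"만다린"}
known = check_restaurant("만다린", restaurant_lexicon, fake_web)
added = check_restaurant("고암식당", restaurant_lexicon, fake_web)
rejected = check_restaurant("근처", restaurant_lexicon, fake_web)
```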

For example, suppose that '내방역:N', '근처:N', and '맛집:N' were looked up in RestaurantName_Lexicon after removing the 'N' but were not found, so that only '만다린' was confirmed as a restaurant name and the remaining data were confirmed not to be restaurant names. The result of the RestaurantName_Lexicon lookup, resultRestaurant, can then be written as the following set.

[Equation 4]

$$\mathrm{resultRestaurant} = \{W_1{:}P_1,\; W_2{:}P_2,\; \ldots,\; W_k{:}P_k\}$$

(Example)

resultRestaurant = {만다린:N}

The nouns extracted above, other than '만다린', which was confirmed as a restaurant name, can then be searched in the commercial district dictionary (MarketArea_Lexicon), the subway station dictionary (Subway_Lexicon), and the place dictionary (Place_Lexicon).

For example, if '내방역:N' is looked up in Subway_Lexicon as '내방역' after removing the 'N' and is confirmed as a subway station name, the lookup result resultSubway can be written as the following set.

[Equation 5]

$$\mathrm{resultSubway} = \{W_1{:}P_1,\; W_2{:}P_2,\; \ldots,\; W_k{:}P_k\}$$

(Example)

resultSubway = {내방역:N}

As a result, for T1 = "오호! 내방역 근처 맛집. (@만다린)", it was confirmed that '만다린' is contained in RestaurantName_Lexicon and '내방역' in Subway_Lexicon.

The entries retrieved from RestaurantName_Lexicon and Subway_Lexicon are as follows.

[Equation 6]

RestaurantName_Lexicon = {'02-596-6767', '만다린', '서울 서초구 방배동 875-15'}

Subway_Lexicon = {'내방역'}

Pseudocode can be defined for extracting restaurant names and place names.

1) Pseudocode of the restaurant name extraction algorithm

The example sentence $T_1$ = "오호! 내방역 근처 맛집. (@만다린)" is used for illustration.

[Table 2]
① Preprocessing : 내방역 근처 맛집 만다린
② POS-Tagging : 내방역:N 근처:N 맛집:N 곳:N 만다린:N
③ Feature : "내방역:N 근처:N 맛집:N 곳:N 만다린:N"
④ tokenArr[] = getToken(string);
⑤ for (i = 0; i < length(tokenArr); i++) {
       // find the restaurant name
⑥     word = tokenArr[i];
⑦     if (find_restaurant(word)) {
⑨         restaurant_found[] = word;
           tokenArr = tokenArr - word;   // remove the matched word from the array
           break;
       }
   }
⑧ find_restaurant(word) {
       res = null;
       if (RestaurantName_Lexicon(word))
           res = word;
       else
           res = Web_Lexicon(word);
       return res;
   }

For the example sentence ($T_1$) in Table 2, the explanation of the restaurant name extraction code starts from step ④.

In step ④, the value of the string is read and stored as an array in tokenArr[].

In step ⑤, the length of the token array (tokenArr[]) is computed.

In step ⑥, the i-th value of the token array (tokenArr[i]) is assigned to word.

In step ⑦, the function that looks up the restaurant name assigned to word is called. This function is find_restaurant(word).

In step ⑧, the restaurant name assigned to word (here, "만다린:N") is looked up in the restaurant name dictionary (RestaurantName_Lexicon).

In step ⑨, if the word is found in the restaurant name dictionary, the result is stored in restaurant_found[]; if it is not found there, the web dictionary (Web_Lexicon(word)) is searched. The matched "만다린" is then removed from the elements of the token array (tokenArr[]), and the remaining data is searched again.
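
Steps ④ to ⑨ can also be written as a short runnable Python sketch. It follows the structure of the pseudocode, but the names `get_token` and `extract_restaurant` and the `is_restaurant` callable (which plays the role of find_restaurant, dictionary lookup plus web fallback) are assumptions made for this illustration.

```python
def get_token(feature_string):
    """Step 4: split "word:N word:N ..." and drop the ":N" tag of each token."""
    return [tok.split(":")[0] for tok in feature_string.split()]


def extract_restaurant(feature_string, is_restaurant):
    """Return (restaurant_found, remaining_tokens).

    `is_restaurant` is a callable with an assumed interface: it takes a word
    and returns a truthy value when the word is a restaurant name.
    """
    tokens = get_token(feature_string)
    found = []
    for word in tokens:                                # steps 5 and 6
        if is_restaurant(word):                        # steps 7 and 8
            found.append(word)                         # step 9
            tokens = [t for t in tokens if t != word]  # tokenArr - word
            break
    return found, tokens


features = "내방역:N 근처:N 맛집:N 곳:N 만다린:N"
found, rest = extract_restaurant(features, lambda w: w in {"만다린"})
```

The remaining tokens in `rest` are what the place name search would then work on.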

2) Pseudocode of the place name extraction algorithm

The example sentence $T_1$ = "오호! 내방역 근처 맛집. (@만다린)" is used for illustration.

[Table 3]
① Preprocessing : 내방역 근처 맛집 만다린
② POS-Tagging : 내방역:N 근처:N 맛집:N 곳:N 만다린:N
③ Feature : "내방역:N 근처:N 맛집:N 곳:N 만다린:N"
④ tokenArr[] = getToken(string);
⑤ for (i = 0; i < length(tokenArr); i++) {
⑥     word = tokenArr[i];
⑦     if (find_location(word)) {
⑨         region_found[] = word;
           tokenArr = tokenArr - word;   // remove the matched word from the array
       }
   }
⑧ find_location(word) {
       region = "";
       if (MarketArea_Lexicon(word))
           region = region + COMMERCIAL + ";";
       if (Subway_Lexicon(word))
           region = region + SUBWAY + ";";
       if (Place_Lexicon(word))
           region = region + PLACE + ";";
       if (region == "")
           return null;                  // not found in any place dictionary
       return region + " " + word;
   }

For Table 3, the explanation of the place name extraction code starts from step ④.

In step ④ of Table 3, the value of the string is read and stored as an array in the token array (tokenArr[]).

In step ⑦, the function that looks up the place name assigned to word is called. This function is find_location(word).

In step ⑧, the place name assigned to word (here, "내방역:N") is looked up in the place-related dictionaries, for example the commercial district dictionary (MarketArea_Lexicon(word)), the subway station dictionary (Subway_Lexicon(word)), and the place dictionary (Place_Lexicon(word)). In this embodiment, the place name is assumed to be found in the subway station dictionary.

In step ⑨, the '내방역' result retrieved from the subway station dictionary is stored in region_found[]. (Had it not been found, the web dictionary (Web_Lexicon(word)) would have been searched.) The matched "내방역" is then removed from the elements of the token array (tokenArr[]), and the remaining data is searched again.
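
The place name lookup in step ⑧ can be sketched in Python as follows. The dictionary contents are hypothetical, the label strings mirror the COMMERCIAL, SUBWAY, and PLACE constants of the pseudocode, and `extract_locations` is a helper name introduced for this example. The web dictionary fallback is left out.

```python
# Hypothetical dictionary contents; real lexicons would be far larger.
MARKET_AREA_LEXICON = {"대학로"}
SUBWAY_LEXICON = {"내방역"}
PLACE_LEXICON = {"정동", "이태원"}


def find_location(word):
    """Return "LABEL;LABEL; word" for every place dictionary containing
    `word`, or None when no dictionary contains it."""
    labels = []
    if word in MARKET_AREA_LEXICON:
        labels.append("COMMERCIAL")
    if word in SUBWAY_LEXICON:
        labels.append("SUBWAY")
    if word in PLACE_LEXICON:
        labels.append("PLACE")
    if not labels:
        return None
    return "".join(label + ";" for label in labels) + " " + word


def extract_locations(tokens):
    """Collect every token found in a place dictionary; return (found, rest)."""
    found = [t for t in tokens if find_location(t)]
    rest = [t for t in tokens if t not in found]
    return found, rest


region_found, remaining = extract_locations(["내방역", "근처", "맛집", "곳"])
```

Unlike the restaurant name loop, this loop does not stop at the first match, so a tweet that names both a station and a district yields both.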

2. Statistical method for finding tweets that contain restaurant names and place names

FIG. 5 shows a message information extraction method according to another aspect of the present invention, which comprises a preprocessing step (S200), a morphological analysis step (S210), a specific-word extraction step (S220), and a specific-word probability calculation step (S230). It may further include a learning model construction step (S240).

When the tweets collected from Twitter are searched with the rule-based method, the following problems arise.

First, a search for '맛집' can return tweets that contain a restaurant-related word such as '맛집', '레스토랑', or '식당' but whose meaning has nothing to do with restaurants.

Second, a search for a specific restaurant name can return a tweet that matches the name but contains no place name, so it cannot be determined whether the tweet refers to a restaurant.

Third, a search for a specific restaurant name can return a tweet that matches the words of the name but means something entirely unrelated to the restaurant.

To solve these problems, a statistical method is applied to the tweets that the dictionary-based rule method failed to extract, in order to determine whether each one is a restaurant-related tweet.

As one example of a statistical method, a naive Bayes classifier can be used.

Because naive Bayes classification is a supervised learning method, positive tweets and negative tweets must be labeled in advance before it can be applied.

Naive Bayes classification is carried out in a "training stage" and a "classification stage".

(1) In the training stage, manually labeled positive tweets and negative tweets are used as training data, and features are extracted from each.

The frequency and probability value of each extracted feature are calculated to build the classification model.

To extract features when building the classification model, the pre-labeled (positive/negative) tweets are preprocessed, POS tagging assigns a part of speech to each word in the tweet, and the frequency and probability value of each feature are calculated.

(2) In the classification stage, the classification model is used to compute the probability that a new tweet is positive and the probability that it is negative.

In this embodiment, "집에 오신 분과 맛있는 저녁식사 먹었어요.(@한식 경복궁)" ("I had a delicious dinner with a guest who came to my home. (@Korean food, Gyeongbokgung)") is used as an example of a positive tweet, and "#분당맛집은 정자역과 서현역에 많이 있어요." ("#Bundang restaurants: there are many around Jeongja Station and Seohyeon Station.") as an example of a negative tweet.

The classification model is built through the following three steps.

Step 1: Preprocessing
Step 2: Part-of-speech tagging
Step 3: Feature frequency measurement

In the first step, preprocessing removes unnecessary symbols, articles, particles, interjections, and isolated consonants/vowels from the tweet document.

The result of preprocessing is stored in PS_i (i = 1, ..., n).

[Equation 7]

$$\mathrm{Preprocessing}(TP_1) = \{W_1, W_2, \ldots, W_n\}$$

The tweet document before and after the first-step preprocessing can be compared as follows.

Before: 집에 오신 분과 맛있는 저녁식사 먹었어요.(@한식 경복궁)

After: PS₁ = "집(W₁) 오신(W₂) 분(W₃) 맛있는(W₄) 저녁식사(W₅) 먹었어요(W₆) 한식(W₇) 경복궁(W₈)", glossed word by word as home (W₁), came (W₂), person (W₃), delicious (W₄), dinner (W₅), ate (W₆), Korean food (W₇), Gyeongbokgung (W₈).
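The symbol-removal part of this preprocessing can be sketched with a regular expression. This is only a rough illustration under stated assumptions: the name strip_noise and the character ranges are chosen here, and stripping particles such as 에 or 과 would need a Korean morphological analyzer, which this sketch deliberately leaves out.

```python
import re

# Keep Hangul syllables (U+AC00..U+D7A3), Latin letters, digits, and whitespace.
# Everything else, including punctuation and isolated jamo such as "ㅋ", is noise.
_NOISE = re.compile(r"[^\uAC00-\uD7A3A-Za-z0-9\s]")


def strip_noise(tweet):
    """Replace noise characters with spaces and split the tweet into words."""
    return _NOISE.sub(" ", tweet).split()


words = strip_noise("집에 오신 분과 맛있는 저녁식사 먹었어요.(@한식 경복궁)")
print(words)
```

The result has eight words, matching W₁ to W₈ above except that the particles are still attached to 집에 and 분과.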

In the second step, parts of speech are tagged using POS tagging. The tags are used to extract only nouns (N) as features.

The tweet document that has passed through the first-step preprocessing can be compared before and after the second-step POS tagging as follows.

Before: 집 오신 분 맛있는 저녁식사 먹었어요 한식 경복궁

After: 집 (home):N, 오(다) (come):V, 분 (person):N, 맛있(다) (delicious):V, 저녁식사 (dinner):N, 먹(다) (eat):V, 한식 (Korean food):N, 경복궁 (Gyeongbokgung):N

Here, N (Noun) denotes a noun in Korean grammar, and V (Verb) denotes a verb or adjective in Korean grammar.

Table 5 shows an example of features extracted using POS tagging.

f1 | f2 | f3 | f4 | ...
맛있(다) (delicious) | 한식 (Korean food) | 식사 (meal) | 먹(다) (eat) |
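Selecting features by part-of-speech tag can be sketched as below. The tagged tokens are copied from the tagging example above rather than produced by a real tagger, and the function name select_features and its keep parameter are assumptions made for illustration.

```python
# (word, tag) pairs as they would come out of a POS tagger; hard-coded here.
tagged = [
    ("집", "N"), ("오(다)", "V"), ("분", "N"), ("맛있(다)", "V"),
    ("저녁식사", "N"), ("먹(다)", "V"), ("한식", "N"), ("경복궁", "N"),
]


def select_features(tagged_tokens, keep=("N",)):
    """Return the words whose tag is in `keep`, preserving their order."""
    return [word for word, tag in tagged_tokens if tag in keep]


nouns = select_features(tagged)                           # nouns only
content_words = select_features(tagged, keep=("N", "V"))  # nouns plus verbs/adjectives
print(nouns)
print(content_words)
```

Making the kept tags a parameter allows the same helper to cover both the noun-only selection and the wider noun, verb, and adjective selection.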

In the third step, the frequency and probability of each feature are calculated. The calculated frequencies and probability values constitute the classification model.

The frequency of an extracted feature can be calculated with Equation 8.

[Equation 8]

$$(\text{frequency of the } j\text{-th feature}) = \sum_{i} M_{ij}$$

Here, M_ij indicates whether the j-th feature is contained in the i-th example sentence: M_ij is 1 if the feature appears in that sentence and 0 otherwise. Summing M_ij over all example sentences i gives the frequency of the j-th feature.

Table 6 illustrates how feature frequencies are calculated.

Example sentence | 맛있(다) (delicious) | 한식 (Korean food) | 식사 (meal) | 먹(다) (eat) | ...
Example 1 | 1 | 1 | 1 | 1 | ...
Example 2 | 0 | 0 | 0 | 1 | ...
Example 3 | 0 | 0 | 0 | 0 | ...
... | ... | ... | ... | ... | ...
Example i | | | | |
Sum | ∑M_i1 | ∑M_i2 | ∑M_i3 | ∑M_i4 | ∑M_ij
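The frequency calculation of Equation 8 and Table 6 amounts to column sums over a 0/1 indicator matrix. The sketch below reproduces the three example rows of Table 6; the function name feature_frequencies and the representation of each example sentence as a set of features are assumptions introduced here.

```python
features = ["맛있(다)", "한식", "식사", "먹(다)"]

# Each example sentence is represented by the set of features it contains.
examples = [
    {"맛있(다)", "한식", "식사", "먹(다)"},  # Example 1
    {"먹(다)"},                              # Example 2
    set(),                                   # Example 3
]


def feature_frequencies(examples, features):
    """Build the indicator matrix M (M[i][j] is 1 or 0) and sum each column."""
    M = [[1 if f in ex else 0 for f in features] for ex in examples]
    totals = [sum(column) for column in zip(*M)]
    return M, dict(zip(features, totals))


M, freq = feature_frequencies(examples, features)
print(M)
print(freq)
```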

In the classification stage, features are extracted from a new tweet document, the frequency and probability of each feature are calculated, and the result is compared with the probability values of the classification model to determine whether the tweet is positive or negative.

Taking the new tweet "쌈밥과 사골 칼국수가 유명한 종로의 맛집!! (@대련집)" ("A famous Jongno restaurant known for ssambap and beef-bone kalguksu!! (@Daeryeonjip)") as the document to classify, the classification stage proceeds as follows.

(1) Step 1:

Unnecessary symbols, articles, particles, interjections, and isolated consonants/vowels are removed from the tweet.

Before: 쌈밥과 사골 칼국수가 유명한 종로의 맛집!! (@대련집)

After: 쌈밥 사골 칼국수 유명한 종로 맛집 대련집

(2) Step 2:

Parts of speech are tagged using POS tagging.

Before: 쌈밥 사골 칼국수 유명한 종로 맛집 대련집

After: 쌈밥:N, 사골:N, 칼국수:N, 유명(하다):V, 종로:N, 맛집:N, 대련집:N

(3) Step 3:

The probability value for each feature extracted in Step 2 is retrieved from the classification model.

Among the parts of speech, only nouns, verbs, and adjectives are selected as features (f), and the frequency (wf) and probability value (p) of each feature are obtained.

The extracted features (f) are expressed as in Equation 9.

[Equation 9]

$$f = \{f_1, f_2, \ldots, f_n\}$$

In this embodiment, the feature set f is given by Equation 10.

[Equation 10]

f = {쌈밥:N, 사골:N, 칼국수:N, 유명(하다):V, 종로:N, 맛집:N, 대련집:N}

The feature frequencies (wf) are given by Equation 11, where the subscript tp refers to counts in the positive-tweet class and fp to counts in the negative-tweet class.

[Equation 11]

wf_tp = {쌈밥:0, 사골:0, 칼국수:1, 유명(하다):0, 종로:1, 맛집:1, 대련집:0}

wf_fp = {쌈밥:0, 사골:0, 칼국수:0, 유명(하다):0, 종로:0, 맛집:1, 대련집:0}

Multiplying together the per-feature probability values obtained in Step 3 gives the likelihoods (L_tp, L_fp) as in Equation 12. If even one feature has a probability of 0, the entire product becomes 0.

[Equation 12]

$$L_{tp} = \frac{0}{22} \times \frac{0}{22} \times \frac{1}{22} \times \frac{0}{22} \times \frac{1}{22} \times \frac{1}{22} \times \frac{0}{22} = 0,$$

$$L_{fp} = 0 \times 0 \times 0 \times 0 \times 0 \times \frac{1}{17} \times 0 = 0.$$

Because a single 0 for any word zeroes the entire product, a very small value (0.001 here) is added to every factor before multiplying, which yields the final likelihoods (FL_tp, FL_fp).

[Equation 13]

$$FL_{tp} = \left(\tfrac{0}{22} + 0.001\right)\left(\tfrac{0}{22} + 0.001\right)\left(\tfrac{1}{22} + 0.001\right)\left(\tfrac{0}{22} + 0.001\right)\left(\tfrac{1}{22} + 0.001\right)\left(\tfrac{1}{22} + 0.001\right)\left(\tfrac{0}{22} + 0.001\right) \approx 1.0025 \times 10^{-16},$$

$$FL_{fp} = (0 + 0.001)(0 + 0.001)(0 + 0.001)(0 + 0.001)(0 + 0.001)\left(\tfrac{1}{17} + 0.001\right)(0 + 0.001) \approx 5.982 \times 10^{-20}.$$

The probabilities (P_tp, P_fp) are then calculated from these likelihoods.

[Equation 14]

$$P_{tp} = \frac{1.0025 \times 10^{-16}}{1.0025 \times 10^{-16} + 5.982 \times 10^{-20}} \approx 0.9994,$$

$$P_{fp} = \frac{5.982 \times 10^{-20}}{1.0025 \times 10^{-16} + 5.982 \times 10^{-20}} \approx 5.96 \times 10^{-4}.$$

Therefore, the new example "쌈밥과 사골 칼국수가 유명한 종로의 맛집!! (@대련집)" is classified as a positive tweet, that is, a tweet related to a restaurant.
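The calculation in Equations 11 to 14 can be traced with a short sketch. It takes the counts from Equation 11, the denominators 22 and 17 from Equation 12, and the added constant 0.001 from Equation 13; the function name smoothed_likelihood and the variable names are illustrative assumptions, not part of the described system.

```python
from math import prod

EPSILON = 0.001  # small value added to every factor so no factor is zero


def smoothed_likelihood(counts, total, eps=EPSILON):
    """Multiply (count / total + eps) over all features."""
    return prod(c / total + eps for c in counts.values())


# Feature counts per class, as listed in Equation 11.
wf_tp = {"쌈밥": 0, "사골": 0, "칼국수": 1, "유명(하다)": 0, "종로": 1, "맛집": 1, "대련집": 0}
wf_fp = {"쌈밥": 0, "사골": 0, "칼국수": 0, "유명(하다)": 0, "종로": 0, "맛집": 1, "대련집": 0}

fl_tp = smoothed_likelihood(wf_tp, 22)  # positive class
fl_fp = smoothed_likelihood(wf_fp, 17)  # negative class

# Normalize the two likelihoods so that they sum to 1.
p_tp = fl_tp / (fl_tp + fl_fp)
p_fp = fl_fp / (fl_tp + fl_fp)

label = "positive" if p_tp > p_fp else "negative"
print(fl_tp, fl_fp, p_tp, p_fp, label)
```

With many features, products of such small factors can underflow; summing logarithms instead of multiplying is a common alternative, though the text above does not describe it.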

Generalizing this calculation, the classification function Classify(f_n) can be summarized as in Equation 15.

[Equation 15]
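A form consistent with Equations 12 to 14 would be the following. It is offered as an assumption based on the worked example, not as the patent's own wording; the symbols c, N_c, wf_c, and ε are notation introduced here.

$$FL_c = \prod_{k=1}^{n} \left( \frac{wf_c(f_k)}{N_c} + \varepsilon \right), \qquad c \in \{tp, fp\}$$

$$P_c = \frac{FL_c}{FL_{tp} + FL_{fp}}, \qquad \mathrm{Classify}(f_n) = \arg\max_{c \in \{tp, fp\}} P_c$$

In the worked example, ε = 0.001, N_tp = 22, and N_fp = 17.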

Because the classification model database is built through this process, even a keyword that is not listed in the dictionary database can be retrieved through the preprocessing step (S200), the morphological analysis step (S210), the specific-word extraction step (S220), and the specific-word probability calculation step (S230).
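How dictionary lookup and statistical classification fit together can be sketched as a simple routing function. The names route_message, restaurant_dict, and toy_classify are hypothetical, and the classifier is a stub standing in for the naive Bayes model described above.

```python
def route_message(nouns, dictionary, classify):
    """Return dictionary hits as keywords; classify the message if any noun missed."""
    keywords = [n for n in nouns if n in dictionary]
    misses = [n for n in nouns if n not in dictionary]
    # Only nouns absent from the dictionary are passed to the statistical path.
    label = classify(misses) if misses else None
    return keywords, label


restaurant_dict = {"맛집", "칼국수"}  # toy dictionary database


def toy_classify(features):
    """Stub classifier used only to make the sketch runnable."""
    return "positive" if features else "negative"


keywords, label = route_message(["칼국수", "종로", "대련집"], restaurant_dict, toy_classify)
print(keywords, label)
```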

Claims

A message information extraction system comprising:
a web crawler unit that collects at least one message posted on at least one social network site;
a preprocessing unit that performs preprocessing to remove unnecessary characters from the message;
a POS tagging unit that performs part-of-speech tagging on at least one word contained in the message after the preprocessing has been performed;
a noun extraction unit that extracts nouns from the at least one word on which the part-of-speech tagging has been performed; and
a keyword extraction unit that, when at least one noun has been extracted by the noun extraction unit, extracts a keyword from the extracted noun,
wherein the keyword extraction unit comprises:
a dictionary database containing a collection of keywords belonging to at least one category;
a keyword lookup unit that checks whether the extracted noun is a word listed in the dictionary database;
a looked-up keyword extraction unit that, when the extracted noun is a word listed in the dictionary database, extracts the extracted noun as a looked-up keyword; and
an unextracted-document processing unit that, when the extracted noun is not a word listed in the dictionary database, calculates the frequencies of a plurality of feature values for the extracted noun and compares the probability value corresponding to the extracted noun and the frequencies with the probability values of a classification model according to a pre-built classification model database, thereby determining whether the message containing the extracted noun is positive or negative,
wherein the pre-built classification model database of the unextracted-document processing unit holds pre-calculated frequencies and probability values of a plurality of feature values respectively corresponding to a plurality of words that distinguish positive messages from negative messages, and
wherein the keyword extraction unit extracts the keyword by reading the value of a string and storing it as an array in a token array, calculating the length of the token array, assigning the i-th noun of the token array to a word, calling a function that searches for the i-th noun, looking up the i-th noun in a restaurant-name dictionary, storing the search result if the noun is found in the restaurant-name dictionary and searching a web dictionary if it is not, and then removing the found i-th noun from the elements of the token array and searching the remaining data again.

The message information extraction system of claim 1,
wherein the keyword extraction unit further comprises an output unit that outputs the extracted keyword.

A message information extraction method of a message information extraction system that includes a web crawler unit, a preprocessing unit, a POS tagging unit, a noun extraction unit, and a keyword extraction unit, the method comprising:
a document collection step of collecting at least one message posted on at least one social network site connected through a network;
a preprocessing step of removing unnecessary characters from the at least one collected message;
a part-of-speech tagging step of performing part-of-speech tagging on at least one word contained in the message after the preprocessing step has been performed;
a specific-word extraction step of extracting nouns from the at least one word on which the part-of-speech tagging has been performed; and
a keyword lookup step of, when a noun has been extracted in the specific-word extraction step, checking whether the extracted noun is a word listed in a dictionary database containing a collection of keywords belonging to at least one category,
wherein the keyword lookup step further comprises:
a looked-up keyword extraction step of, when the extracted noun is a word listed in the dictionary database, extracting the extracted noun as a looked-up keyword; and
an unextracted-document processing step of, when the extracted noun is not a word listed in the dictionary database, calculating the frequencies of a plurality of feature values for the extracted noun and comparing the probability value corresponding to the extracted noun and the frequencies with the probability values of a classification model according to a pre-built classification model database, thereby determining whether the message containing the extracted noun is positive or negative,
wherein the keyword extraction step further comprises:
reading the value of a string and storing it as an array in a token array;
calculating the length of the token array and assigning the i-th noun of the token array to a word;
calling a function that searches for the i-th noun and looking up the i-th noun in a restaurant-name dictionary;
storing the search result if the noun is found in the restaurant-name dictionary, and searching a web dictionary if it is not; and
removing the found i-th noun from the elements of the token array and searching the remaining data again,
and wherein the pre-built classification model database holds pre-calculated frequencies and probability values of a plurality of feature values respectively corresponding to a plurality of words that distinguish positive messages from negative messages.