KR102007437B1

KR102007437B1 - Apparatus for classifying contents and method for using the same

Info

Publication number: KR102007437B1
Application number: KR1020160158289A
Authority: KR
Inventors: 윤여찬
Original assignee: 한국전자통신연구원
Priority date: 2016-11-25
Filing date: 2016-11-25
Publication date: 2019-08-05
Also published as: KR20180059112A

Abstract

Disclosed are a content classification apparatus and method. A content classification apparatus according to an embodiment of the present invention includes: a content storage unit for receiving and storing content from a content database; a keyword extraction unit for extracting a keyword from the stored content in consideration of the appearance position of the keyword; a content classifier for connecting the content to the extracted keyword and classifying the content by keyword; and a content analyzer for analyzing the content classified by keyword based on statistical information of a statistical database.

Description

Content classification device and method {APPARATUS FOR CLASSIFYING CONTENTS AND METHOD FOR USING THE SAME}

The present invention relates to data analysis techniques, and more particularly, to big data text analysis techniques.

E-book content is a digitized version of a document that was published on paper. Therefore, it is possible to derive meaningful data by applying techniques such as text analysis and classification described in the technical field.

Text analysis technology can derive meaningful data by analyzing documents written as digital text, such as blogs, SNS posts, and news. For example, blog text can be analyzed to extract key keywords and make them searchable. Big data text technology can extract meaningful information by analyzing a large amount of text. For example, SNS posts can be analyzed to identify trends such as shifts in public opinion. Text classification techniques can group large amounts of text into similar classifications. For example, news documents can be automatically classified into economy, politics, culture, and so on.

The prior art is limited to simply searching for content or classifying content according to a predetermined classification (e.g., self-development, political management, novel, essay, poetry, etc.), which makes it difficult to analyze the content in detail.

Meanwhile, Korean Patent Publication No. 10-2013-0104573, "Morpheme-based content classification method and apparatus in an online service," discloses a technique that extracts nouns by morphological analysis of the titles of content such as online blogs or news, and maps the extracted nouns to predetermined categories through searching and weighting of synonyms, compound words, and the like, in order to classify content efficiently in real time.

However, Korean Patent Publication No. 10-2013-0104573 has a limitation in that a lot of time and money are consumed because the content is analyzed based on only the extracted nouns.

An object of the present invention is to extract the keywords automatically by analyzing the content, and to provide content by subdividing the content according to various classifications.

In addition, the present invention aims to automatically determine the sales volume, preference, etc. for each of the various classifications through this analysis.

In addition, the present invention is to improve the quality of the content providing service and the efficiency of the analysis by classifying the content in various ways.

A content classification apparatus according to an embodiment of the present invention for achieving the above object includes: a content storage unit for receiving and storing content from a content database; a keyword extraction unit for extracting a keyword from the stored content in consideration of the appearance position of the keyword; a content classifier for connecting the content to the extracted keyword and classifying the content by keyword; and a content analyzer for analyzing the content classified by keyword based on statistical information of a statistical database.

The present invention can automatically extract keywords by analyzing content and provide content by subdividing the content according to various classifications.

In addition, the present invention can automatically grasp the sales volume, preferences, etc. by various classifications through this analysis.

In addition, the present invention can classify the content in various ways to increase the quality of the content providing service and increase the efficiency of analysis.

FIG. 1 is a block diagram showing a content classification apparatus according to an embodiment of the present invention.
FIG. 2 is a view showing a keyword extraction method according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating a table of contents of e-book contents according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating a binary content classification algorithm according to an embodiment of the present invention.
FIG. 5 is a graph showing the sales amount by age group of the first content according to an embodiment of the present invention.
FIG. 6 is a graph showing the sales amount by age group of the second content according to an embodiment of the present invention.
FIG. 7 is a table showing preferred keywords for each age group according to an embodiment of the present invention.
FIG. 8 is a table showing preferred keywords for each region according to an embodiment of the present invention.
FIG. 9 is a flowchart illustrating a content classification method according to an embodiment of the present invention.
FIG. 10 is a flowchart illustrating in detail an example of the content classification step illustrated in FIG. 9.
FIG. 11 is a block diagram illustrating a computer system according to an embodiment of the present invention.

The present invention will now be described in detail with reference to the accompanying drawings. Here, the repeated description, well-known functions and configurations that may unnecessarily obscure the subject matter of the present invention, and detailed description of the configuration will be omitted. Embodiments of the present invention are provided to more completely describe the present invention to those skilled in the art. Accordingly, the shape and size of elements in the drawings may be exaggerated for clarity.

Throughout the specification, when a part is said to "include" a certain component, this means that it may further include other components, rather than excluding them, unless specifically stated otherwise.

Hereinafter, exemplary embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram showing a content classification apparatus according to an embodiment of the present invention. FIG. 2 is a view showing a keyword extraction method according to an embodiment of the present invention. FIG. 3 is a diagram illustrating a table of contents of e-book contents according to an embodiment of the present invention. FIG. 4 is a diagram illustrating a dividing content classification algorithm according to an embodiment of the present invention.

Referring to FIG. 1, the content classification device 100 according to an exemplary embodiment of the present invention may include a content storage unit 110, a keyword extraction unit 120, a content classification unit 130, a content analysis unit 140, a content database 10, and a statistics database 20.

In the present invention, e-book content is described merely as one example of content; the invention may also be applied to any content that includes words, characters, text, and the like.

At this time, the content database 10 and the e-book statistics database 20 may be external.

The content storage unit 110 may receive and store content from the content database 10.

The keyword extractor 120 may extract the keyword from the stored content in consideration of the appearance position of the keyword.

In this case, the keyword extractor 120 may extract keywords using various extraction methods.

In this case, the keyword extractor 120 may use a method that computes, for each word, the product of its occurrence frequency in the document (TF) and the inverse of its occurrence frequency across the entire document set (IDF), uses that value as the weight of the word, and selects as keywords, in descending order of weight, the words whose weight is at or above a specific value.
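The TF-IDF selection described above might be sketched as follows. This is only an illustration under stated assumptions: the function name `tfidf_keywords`, the `min_weight` cutoff, the sample documents, and the logarithmic form of IDF are not taken from the source, which says only that the weight is the product of TF and the inverse document frequency.

```python
import math
from collections import Counter


def tfidf_keywords(docs, doc_index, min_weight):
    """Return (word, weight) pairs for one document, heaviest first.

    docs is a list of token lists; min_weight is a hypothetical cutoff.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    # Term frequency within the chosen document.
    tf = Counter(docs[doc_index])
    # Assumption: IDF is log(N / df), one common variant.
    weights = {
        word: count * math.log(n_docs / df[word])
        for word, count in tf.items()
    }
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return [(word, weight) for word, weight in ranked if weight >= min_weight]


# Illustrative, made-up token lists.
docs = [
    ["family", "travel", "family", "recipe"],
    ["travel", "budget", "map"],
    ["recipe", "budget", "kitchen"],
]
keywords = tfidf_keywords(docs, 0, min_weight=1.0)
```

With these sample documents only "family" passes the cutoff, because it is frequent in the first document and absent from the others.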

In this case, the keyword extractor 120 may consider the position where the keyword appears in order to determine the weight of the keyword.

In content that includes text, words may appear at various locations, such as the title, table of contents, text, and picture labels of the content.

In this case, the keyword extractor 120 may treat words from the title and the table of contents as more important keywords. Titles and tables of contents tend to use words chosen to represent the substance of the content. Thus, words appearing in the title and table of contents may indicate the nature of the content and may be more important than words used in the body text.

In this case, the keyword extractor 120 may determine the weight of a word by multiplying the word's weight at each appearance position by a weight corresponding to that position and linearly combining the results.

[Equation 1]

In Equation 1, W_t may correspond to the weight of the word W in the title, W_c may correspond to the weight of the word W in the table of contents, and W_d may correspond to the weight of the word in the text.
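A minimal sketch of such a linear combination is given below. The coefficients `alpha`, `beta`, and `gamma` and their values are hypothetical, since the text does not name or specify the position weights. Only the symbols W_t, W_c, and W_d come from the description.

```python
def position_weighted_score(w_t, w_c, w_d, alpha=3.0, beta=2.0, gamma=1.0):
    """Linearly combine the per-position weights of one word.

    w_t, w_c, w_d follow the symbols in the description (title, table of
    contents, body text). alpha, beta, gamma are hypothetical coefficients
    chosen so that title and table-of-contents occurrences count for more.
    """
    return alpha * w_t + beta * w_c + gamma * w_d


# A word seen in the title and table of contents versus a body-only word.
score_title_word = position_weighted_score(w_t=1.0, w_c=0.5, w_d=0.2)
score_body_word = position_weighted_score(w_t=0.0, w_c=0.0, w_d=1.0)
```

Under these assumed coefficients, a word that appears in the title outranks a word that appears only in the body, even when the body-only word has the larger text weight.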

In this case, the keyword extractor 120 may merge words having similar meanings from the extracted keywords.

For example, the extracted "family" and "home" keywords are semantically very similar and thus merging them together may yield more accurate analysis results.

To this end, the keyword extractor 120 may utilize a lexical dictionary or use a synonym search technology.
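As a rough illustration of dictionary-based merging, the sketch below maps each keyword to a canonical form and sums the weights of merged keywords. The `SYNONYMS` table is a hypothetical stand-in for a lexical dictionary, and summing weights is an assumption; the source does not say how the weights of merged keywords are combined.

```python
from collections import defaultdict

# Hypothetical synonym dictionary: variant -> canonical keyword.
SYNONYMS = {"home": "family"}


def merge_similar_keywords(weights, synonyms=SYNONYMS):
    """Sum the weights of keywords that map to the same canonical form."""
    merged = defaultdict(float)
    for word, weight in weights.items():
        # Words without a dictionary entry keep their own form.
        merged[synonyms.get(word, word)] += weight
    return dict(merged)


merged = merge_similar_keywords({"family": 2.0, "home": 1.5, "travel": 0.5})
```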

Referring to FIG. 2, it can be seen that keywords with similar meanings are merged after the keywords are extracted from the content.

The content classification unit 130 may classify the content by keyword by connecting the content to the extracted keyword.

Referring to FIG. 3, it can be seen that the content classifier 130 classifies content in three steps.

In this case, in the first step, the content classification unit 130 may classify the content by connecting it to a keyword only when the keyword appears in the table of contents or the title, so that classification accuracy is high. Classifying in this way selects and links only content that is highly relevant to each extracted keyword.
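The first step might look like the following sketch. The record layout (a dict with "title" and "toc" token lists) and the function name are assumptions made for illustration only.

```python
from collections import defaultdict


def classify_by_heading_keywords(contents, keywords):
    """Link each content item to keywords found in its title or table of contents.

    contents maps a content id to a dict with "title" and "toc" token lists
    (a hypothetical layout). Body-text matches are deliberately ignored here.
    """
    classes = defaultdict(set)
    for content_id, item in contents.items():
        heading_words = set(item["title"]) | set(item["toc"])
        for keyword in keywords:
            if keyword in heading_words:
                classes[keyword].add(content_id)
    return dict(classes)


# Illustrative, made-up records.
contents = {
    "book1": {"title": ["family", "cooking"], "toc": ["breakfast", "dinner"]},
    "book2": {"title": ["world", "atlas"], "toc": ["travel", "maps"]},
    "book3": {"title": ["essays"], "toc": ["memory"]},
}
classes = classify_by_heading_keywords(contents, ["family", "travel"])
```

Content such as "book3", whose title and table of contents contain none of the keywords, stays unclassified at this stage.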

In addition, in step 2, for each content classification produced in the first step, the content classification unit 130 may select documents that are highly similar to the content belonging to that classification and add them to it. The similarity between a classification and an individual piece of content may be computed with various techniques, such as K-means using cosine similarity. Including content whose similarity to the classification exceeds a certain threshold makes it possible to bring a larger amount of content into the classifications that were selected for accuracy in the first step.
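The threshold test in step 2 can be sketched with plain cosine similarity, as below. This is a simplification, not K-means: each classification is represented by the summed term counts of its members, and the function names, sample vectors, and threshold value are all assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def expand_class(members, candidates, vectors, threshold):
    """Add candidates whose similarity to the class vector exceeds threshold."""
    class_vector = Counter()
    for content_id in members:
        class_vector.update(vectors[content_id])
    added = {c for c in candidates if cosine(class_vector, vectors[c]) > threshold}
    return set(members) | added


# Illustrative, made-up term counts.
vectors = {
    "book1": Counter({"family": 3, "dinner": 2}),
    "book4": Counter({"family": 2, "dinner": 1, "garden": 1}),
    "book5": Counter({"atlas": 4, "maps": 2}),
}
expanded = expand_class({"book1"}, ["book4", "book5"], vectors, threshold=0.5)
```

Here "book4" shares most of its vocabulary with the class and is added, while "book5" shares none and is left out.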

In addition, in step 3, the content classification unit 130 may analyze the similarity between the classified content bundles and merge each pair of bundles whose similarity is at or above a threshold, thereby combining two similar classifications into one.

At this time, by performing step 3, the content classifier 130 may merge keyword pairs that the keyword extractor 120 failed to merge.
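One possible way to sketch step 3 is a greedy pass that folds each classification into the first existing group that is similar enough. The greedy strategy, the names, and the sample vectors are assumptions. The source specifies only that bundle pairs at or above a similarity threshold are combined.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def merge_similar_classes(class_vectors, threshold):
    """Greedily merge classifications whose vectors are similar enough.

    class_vectors maps a keyword to a Counter of term counts summed over
    its content bundle. Returns a list of keyword groups.
    """
    groups = []  # each entry: (list of keywords, combined vector)
    for keyword, vector in class_vectors.items():
        for names, combined in groups:
            if cosine(combined, vector) >= threshold:
                names.append(keyword)
                combined.update(vector)
                break
        else:
            # No sufficiently similar group was found; start a new one.
            groups.append(([keyword], Counter(vector)))
    return [names for names, _ in groups]


# Illustrative, made-up bundle vectors.
class_vectors = {
    "family": Counter({"parent": 3, "child": 2}),
    "household": Counter({"parent": 2, "child": 2, "budget": 1}),
    "travel": Counter({"atlas": 3, "maps": 1}),
}
groups = merge_similar_classes(class_vectors, threshold=0.8)
```

Because the comparison uses the content of the bundles rather than the keyword strings, it can join keywords that a synonym dictionary does not list as related.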

The content analyzer 140 may analyze the content classified by keywords based on statistical information of the statistical database 20.

The statistical information may correspond to data generated during the sale and collection of content, such as sales amount and sales volume, number of publications, and sales volume by age, gender, and region.

At this time, the content analysis unit 140 may analyze the sales amount based on information such as age / gender / region for a specific keyword classification.

At this time, the content analyzer 140 may use the analyzed sales volume to analyze the characteristics or preference trends of the main user group of the content for the keyword.

FIG. 5 is a graph showing the sales amount by age group of the first content according to an embodiment of the present invention. FIG. 6 is a graph showing the sales amount by age group of the second content according to an embodiment of the present invention.

Referring to FIGS. 5 and 6, the content classification device 100 according to an embodiment of the present invention may analyze sales based on information such as age group, gender, and region for a specific keyword classification, and may thereby analyze the characteristics and preference trends of the main user group of the content for the corresponding keyword.

FIG. 7 is a table showing preferred keywords for each age group according to an embodiment of the present invention. FIG. 8 is a table showing preferred keywords for each region according to an embodiment of the present invention.

Referring to FIG. 7 and FIG. 8, FIG. 7 is a table showing the content keywords preferred by each age group, ranked up to third place, and FIG. 8 is a table showing the content keywords preferred in each region, ranked up to third place.
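A ranking of this kind could be produced by aggregating sales per keyword within each group, as in the sketch below. The row layout of the statistics data, the function name, and all sample figures are invented for illustration and do not come from the source.

```python
from collections import defaultdict


def top_keywords_by_group(sales_records, content_keywords, top_n=3):
    """Rank keywords within each group by total sales volume.

    sales_records: iterable of (content_id, group, units_sold) rows, a
    hypothetical layout for the statistics database. A group can be an age
    band or a region.
    content_keywords: content id -> list of keywords it is classified under.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for content_id, group, units in sales_records:
        for keyword in content_keywords.get(content_id, []):
            totals[group][keyword] += units
    return {
        group: [kw for kw, _ in sorted(kws.items(), key=lambda kv: -kv[1])[:top_n]]
        for group, kws in totals.items()
    }


# Illustrative, made-up sales rows.
sales = [
    ("book1", "20s", 120),
    ("book2", "20s", 300),
    ("book1", "40s", 500),
    ("book3", "40s", 80),
]
content_keywords = {"book1": ["family"], "book2": ["travel"], "book3": ["essay"]}
ranking = top_keywords_by_group(sales, content_keywords)
```

Passing region labels instead of age bands as the group field would give a per-region table of the same shape.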

FIG. 9 is a flowchart illustrating a content classification method according to an embodiment of the present invention.

Referring to FIG. 9, in the content classification method according to an embodiment of the present invention, first, keywords may be extracted (S210).

That is, step S210 may extract a keyword from the stored content.

At this time, step S210 may first receive and store the content from the content database 10.

At this time, step S210 may extract a keyword in consideration of the appearance position of the keyword from the stored content.

At this time, in step S210, a keyword may be extracted using various extraction methods.

In this case, step S210 may use a method that computes, for each word, the product of its occurrence frequency in the document (TF) and the inverse of its occurrence frequency across the entire document set (IDF), uses that value as the weight of the word, and selects as keywords, in descending order of weight, the words whose weight is at or above a specific value.

At this time, step S210 may consider the position where the keyword appeared to determine the weight of the keyword.

In this case, step S210 may treat words from the title and the table of contents as more important keywords. Titles and tables of contents tend to use words chosen to represent the substance of the content. Thus, words appearing in the title and table of contents may indicate the nature of the content and may be more important than words used in the body text.

In this case, step S210 may determine the weight of a word by multiplying the word's weight at each appearance position by a weight corresponding to that position and linearly combining the results.

In this case, step S210 may merge words having similar meanings in the extracted keywords.

For example, "family" and "home" keywords are semantically very similar, so merging and classifying them may yield more accurate analysis results.

To this end, step S210 may utilize a lexical dictionary and may utilize a synonym search technique.

In addition, the content classification method according to an embodiment of the present invention may classify the content (S220).

That is, in step S220, the content may be connected to the extracted keyword to classify the content by keyword.

At this time, in step S220, the content may be categorized by linking the content with the keyword only when the keyword appears in the table of contents or the title so that the classification accuracy is high in the first step (S221). By classifying in this way, highly relevant content can be selected and linked by extracted keywords.

In addition, in step S220, for each content classification produced in the first step, documents that are highly similar to the content belonging to that classification may be selected and added to it (S222). The similarity between a classification and an individual piece of content may be computed with various techniques, such as K-means using cosine similarity. Including content whose similarity to the classification exceeds a certain threshold makes it possible to bring a larger amount of content into the classifications that were selected for accuracy in the first step.

In addition, in step S220, the similarity between the classified content bundles may be analyzed, and each pair of bundles whose similarity is at or above a threshold may be merged, thereby combining two similar classifications into one (S223).

In this case, by performing the third-step operation, step S223 may merge keyword pairs that the keyword extraction unit 120 failed to merge.

In addition, the content classification method according to an embodiment of the present invention may analyze the content (S230).

That is, step S230 may analyze the content classified by keyword based on the statistical information of the statistical database 20.

In this case, step S230 may analyze the sales volume based on information such as age / gender / region for a specific keyword classification.

In this case, step S230 may use the analyzed sales volume to analyze the characteristics or preference trends of the main user group of the content for the keyword.

FIG. 10 is a flowchart illustrating in detail an example of the content classification step illustrated in FIG. 9.

Referring to FIG. 10, in step S220, content may be categorized by linking the content with the keyword only when the keyword appears in the table of contents or the title so that the classification accuracy is high in the first step (S221). By classifying in this way, highly relevant content can be selected and linked by extracted keywords.

FIG. 11 is a block diagram illustrating a computer system according to an embodiment of the present invention.

Referring to FIG. 11, an embodiment of the present invention may be implemented in a computer system 1100 such as a computer-readable recording medium. As shown in FIG. 11, the computer system 1100 may include one or more processors 1110, a memory 1130, a user interface input device 1140, a user interface output device 1150, and a storage 1160, which communicate with each other via a bus 1120. In addition, the computer system 1100 may further include a network interface 1170 connected to the network 1180. The processor 1110 may be a central processing unit or a semiconductor device that executes processing instructions stored in the memory 1130 or the storage 1160. The memory 1130 and the storage 1160 may be various types of volatile or nonvolatile storage media. For example, the memory may include a ROM 1131 or a RAM 1132.

As described above, the apparatus and method for classifying contents according to the present invention are not limited to the configurations and methods of the embodiments described above; rather, all or some of the embodiments may be selectively combined so that various modifications can be made.

10: content database 20: statistics database
100: content classification device 110: content storage unit
120: keyword extraction unit 130: content classification unit
140: content analysis unit
1100: computer system 1110: processor
1120: bus 1130: memory
1131: ROM 1132: RAM
1140: user interface input device 1150: user interface output device
1160: storage 1170: network interface
1180: network

Claims

A content classification apparatus comprising:
a content storage unit for receiving and storing content from a content database;
a keyword extraction unit for extracting a keyword from the stored content in consideration of the appearance position of the keyword;
a content classification unit for connecting the content to the extracted keyword and classifying the content by keyword; and
a content analysis unit for analyzing characteristics of the content classified by keyword based on statistical information of a statistical database.

The apparatus according to claim 1,
wherein the keyword extraction unit
calculates an appearance frequency weight of a word using the appearance frequency of the word included in the stored content.

The apparatus according to claim 2,
wherein the keyword extraction unit
extracts the word as the keyword if the appearance frequency weight of the word is equal to or greater than a predetermined extraction weight.

The apparatus according to claim 3,
wherein the keyword extraction unit
sets an appearance position weight according to the appearance position of the word in the content, and determines a final weight of the word using the appearance frequency weight and the appearance position weight.

The apparatus according to claim 4,
wherein the keyword extraction unit
sets the appearance position weight such that an additional weight is assigned to words located in at least one of a title and a table of contents.

The apparatus according to claim 5,
wherein the keyword extraction unit
extracts the word as the keyword if the final weight of the word is equal to or greater than a predetermined extraction weight.

The apparatus according to claim 6,
wherein the content classification unit
selects, from among the extracted keywords, the keywords extracted from at least one of the title and the table of contents, and classifies the content using a similarity analysis technique on the selected keywords and the content classifications.

The apparatus according to claim 7,
wherein the content analysis unit
analyzes the sales amount of the content for at least one of age, gender, and region information, based on statistical information of a statistical database, using the keywords extracted from the content.

The apparatus according to claim 8,
wherein the content analysis unit
analyzes at least one of a preference rank and a preference trend for the keywords extracted from the content by using the sales amount of the content.

A content classification method using a content classification device, the method comprising:
extracting a keyword in consideration of the appearance position of the keyword in content received from a content database;
connecting the content to the extracted keyword to classify the content by keyword; and
analyzing characteristics of the content classified by keyword based on statistical information of a statistical database.