CN105975507B - A multimedia question answering method based on multi-source network news data - Google Patents
- Publication number
- CN105975507B (application number CN201610273211.1A)
- Authority
- CN
- China
- Prior art keywords
- news
- picture
- theme
- data
- query word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a multimedia question answering method based on multi-source network news data, comprising the following steps: Step 1, obtain news data from several news websites on the Internet using a web crawler; Step 2, parse the news data to obtain news headlines, news text, and news pictures, and build an index from them; Step 3, input a query request and retrieve the news documents corresponding to the request; Step 4, analyze the topics of the retrieved news documents with a Latent Dirichlet Allocation model and divide the results into different topics; Step 5, for each topic, cluster the pictures in all news documents it contains by similarity, and select one picture from the subclass containing the most pictures as the representative picture of the topic; Step 6, display the topics and the representative picture of each topic; clicking a topic shows the news corresponding to that topic.
Description
Technical field
The present invention relates to data mining and image processing technology, and in particular to a multimedia question answering method based on multi-source network news data.
Background art
The rapid development of information technology and Internet technology has diversified the ways people obtain news, while the volume of news data they face keeps growing. How to browse to the news we need within such a large volume of data is a current research hotspot and a subject of data mining research. In web browsing, shortcomings in the parsing and indexing of news text data, the topic analysis of news content, and the selection of topic images leave users browsing news data blindly. A well-organized multimedia question answering system, built by applying data mining and image processing to multi-source network news data, is therefore needed.
Summary of the invention
The object of the present invention is to provide a multimedia question answering method based on multi-source network news data, comprising the following steps:
Step 1, obtain news data from several news websites on the Internet using a web crawler;
Step 2, parse the news data to obtain news headlines, news text, and news pictures, and build an index from them;
Step 3, input a query request and retrieve the news documents corresponding to the request;
Step 4, analyze the topics of the retrieved news documents with a Latent Dirichlet Allocation model and divide the results into different topics;
Step 5, for each topic, cluster the pictures in all news documents it contains by similarity, and select one picture from the subclass containing the most pictures as the representative picture of the topic;
Step 6, display the topics and the representative picture of each topic; clicking a topic shows the news corresponding to that topic.
Compared with the prior art, the present invention has the following advantages:
The present invention draws on news media data from multiple sources on the network, so it can cover as much as possible of the news data on the network about a given query. When presenting query results to the user, it uses topic analysis and image processing to classify and display the large volume of retrieved news data, so that users can quickly reach the news they want to browse, which considerably improves the browsing experience.
The present invention is further described below with reference to the accompanying drawings.
Description of the drawings
Fig. 1 is a flow chart of the multimedia question answering method based on multi-source network news data according to the present invention.
Fig. 2 shows the demonstration interface of the multimedia question answering system based on multi-source network news data.
Detailed description
With reference to Fig. 1, a multimedia question answering method based on multi-source network news data comprises the following steps:
Step 1, obtain news data from several news websites on the Internet using a web crawler;
Step 2, parse the news data to obtain news headlines, news text, and news pictures, and build an index from them;
Step 3, input a query request and retrieve the news documents corresponding to the request;
Step 4, analyze the topics of the retrieved news documents with a Latent Dirichlet Allocation model and divide the results into different topics;
Step 5, for each topic, cluster the pictures in all news documents it contains by similarity, and select one picture from the subclass containing the most pictures as the representative picture of the topic;
Step 6, display the topics and the representative picture of each topic; clicking a topic shows the news corresponding to that topic.
The news websites in Step 1 include ABC News (http://abcnews.go.com/), BBC News (http://www.bbc.com/), CNN News (http://edition.cnn.com/), and others.
In Step 2, after the data is downloaded, the downloaded news web pages are parsed to obtain the required data, such as news headlines, news text, and news pictures. The unique terms (non-repeated words) in all news text are counted, stop words are filtered out, and the news data is then indexed by these unique terms in the form of an inverted list and saved to the database.
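The indexing in Step 2 can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the stop-word list, the tokenizer, and the in-memory dictionary standing in for the database are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; the patent does not specify one.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "to", "is"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each unique non-stop-word term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            if term in STOP_WORDS:
                continue  # stop words are filtered out before indexing
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


# Two made-up news texts keyed by document id.
docs = {
    1: "The election results in the capital",
    2: "Election day: the capital votes",
}
index = build_inverted_index(docs)
```

Looking up a term in `index` returns the documents that contain it together with the term frequency, which is the information a BM25-style retrieval step needs.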
When the user submits a query in Step 3, the present invention first uses query expansion to obtain a higher retrieval recall: it derives a set of semantically similar query words $Q_{ca} = \{q_{c1}, q_{c2}, q_{c3}, \ldots, q_{cnm}\}$ and adds them to the query submitted by the user. Here $q_{cnm}$ is an expanded query word, n is the number of query words in the original query, and m is the number of query words expanded from each original query word. Relevant news documents are then retrieved with the established retrieval method Okapi BM25, extended with a weight $\lambda_i$ for each query word. The similarity between a retrieved document and the query is given by
[equation not reproduced in this text]
where N is the total number of query words (the query words in the original query plus the expanded query words), D is a news document, Q is the query input, $q_i$ is a query word, $k_1$ and b are the Okapi BM25 parameters, and avgdl is the average number of words per news document; tf() and IDF() are the Okapi BM25 statistics, and f(*) is the number of times * occurs.
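As a rough illustration of weighted retrieval, the sketch below scores a document with the standard Okapi BM25 formula and multiplies each query word's contribution by a weight. The patent's own weight definition is not available in this text, so the weight values, the smoothed IDF variant, and the defaults `k1=1.2` and `b=0.75` are assumptions chosen for the example.

```python
import math


def bm25_weighted(query_terms, weights, doc_tokens, corpus, k1=1.2, b=0.75):
    """Score one document against weighted query terms with Okapi BM25.

    query_terms: original plus expanded query words.
    weights: term -> weight (hypothetical stand-ins for lambda_i).
    corpus: list of token lists, used for IDF and avgdl.
    """
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for q in query_terms:
        tf = doc_tokens.count(q)
        if tf == 0:
            continue
        df = sum(1 for d in corpus if q in d)
        # A common smoothed IDF variant that stays non-negative.
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        norm = tf + k1 * (1 - b + b * len(doc_tokens) / avgdl)
        score += weights.get(q, 1.0) * idf * tf * (k1 + 1) / norm
    return score


# Made-up corpus: "election" is the original query word, "vote" an
# expanded word that is given a lower (assumed) weight.
corpus = [
    ["election", "capital", "votes"],
    ["storm", "hits", "coast"],
    ["capital", "budget", "vote"],
]
query = ["election", "vote"]
weights = {"election": 1.0, "vote": 0.5}
scores = [bm25_weighted(query, weights, d, corpus) for d in corpus]
```

In this toy corpus the document matching the original query word outranks the one matching only the expanded word, and the unrelated document scores zero. Down-weighting expanded words is one plausible use of the weight, not necessarily the patent's.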
The Latent Dirichlet Allocation (LDA) model used in Step 4 is a bag-of-words model for selecting topic words from documents. Suppose a data set of M documents is given, where {w1, w2, w3, ..., wm} is a vocabulary containing N terms. LDA assumes that these documents are generated from K topics. In each document, every term $w_i$ is assigned a hidden variable $z_i \in \{1, 2, 3, \ldots, K\}$, the label of the topic that generated the word. The probability of generating a word in a document is $p(w_i) = \sum_{j=1}^{K} p(w_i \mid z_i = j)\, p(z_i = j)$, where $p(w_i \mid z_i = j)$ is the probability of term $w_i$ under topic j and $p(z_i = j)$ is the probability of topic j, which follows a Dirichlet distribution.
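The mixture over topics can be made concrete with a small numeric sketch. It only evaluates the sum over topics of p(w | z = j) p(z = j) for given distributions; it does not fit an LDA model, and the two topics and all probability values are made up for illustration.

```python
def word_probability(word, topic_word_probs, topic_probs):
    """Return p(w) = sum over topics j of p(w | z=j) * p(z=j)."""
    return sum(
        topic_word_probs[j].get(word, 0.0) * topic_probs[j]
        for j in range(len(topic_probs))
    )


# Hypothetical model with K = 2 topics over a four-term vocabulary.
topic_word_probs = [
    {"election": 0.5, "vote": 0.5},   # p(w | z = 0)
    {"storm": 0.75, "coast": 0.25},   # p(w | z = 1)
]
topic_probs = [0.5, 0.5]              # p(z = j) for one document

p_election = word_probability("election", topic_word_probs, topic_probs)
p_storm = word_probability("storm", topic_word_probs, topic_probs)
vocabulary = ["election", "vote", "storm", "coast"]
total = sum(word_probability(w, topic_word_probs, topic_probs) for w in vocabulary)
```

Because each topic's word distribution and the topic distribution each sum to one, the word probabilities over the whole vocabulary also sum to one.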
In Step 5, for each topic obtained in Step 4, the pictures in all news documents it contains are clustered by similarity, and one picture from the subclass containing the most pictures is then selected as the representative picture of the topic. The present invention computes picture similarity with the near-duplicate picture detection methods common in image processing, and divides the pictures into similarity subclasses. Two assumptions are made: (1) only one image from a set of near-duplicate images is stored in the database as an index entry; (2) the subclass with the most images indicates that these images appear many times in the topic, so they are very likely to be representative of it. Based on these two assumptions, the present invention selects one picture from the subclass containing the most pictures as the topic picture. A score is computed for each picture according to the formula
[equation not reproduced in this text]
and the picture with the highest score is taken as the topic picture. Here $|C_k|$ is the number of pictures in the largest subclass, and $rel_j$ is Score(Q, D) from Step 3, that is, the similarity between the query Q and the document containing picture j.
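A hedged sketch of the selection in Step 5 follows. The scoring formula is not available in this text, so the example assumes a plausible form: for picture i in the largest subclass, average v_ij * rel_j over the pictures j of that subclass. The similarity matrix, the threshold of 0.8, the relevance values, and the use of connected components as the clustering rule are likewise assumptions, not the patent's near-duplicate detector.

```python
def cluster_by_similarity(sim, threshold=0.8):
    """Group picture indices into connected components of the graph whose
    edges join pictures with similarity >= threshold (union-find)."""
    n = len(sim)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def representative_picture(sim, rel, threshold=0.8):
    """Pick the highest-scoring picture in the largest subclass.

    Assumed score for picture i: mean over j in the subclass of
    sim[i][j] * rel[j], where rel[j] is the retrieval score of the
    document containing picture j.
    """
    largest = max(cluster_by_similarity(sim, threshold), key=len)

    def score(i):
        return sum(sim[i][j] * rel[j] for j in largest) / len(largest)

    return max(largest, key=score), largest


# Made-up pairwise similarities v_ij for four pictures and made-up
# document relevance values rel_j.
sim = [
    [1.0, 0.9, 0.85, 0.1],
    [0.9, 1.0, 0.95, 0.2],
    [0.85, 0.95, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
]
rel = [0.5, 2.0, 1.0, 3.0]
best, largest = representative_picture(sim, rel)
```

Here pictures 0, 1, and 2 form the largest subclass. Picture 3 is excluded even though its document has the highest relevance, which reflects assumption (2): frequency within the topic comes first, and relevance only ranks pictures inside the chosen subclass.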
The presentation of the user's query results in Step 6 is shown in Fig. 2. The results returned for the user's query are presented in a clear and concise way, improving the user's browsing experience.
Claims (4)
- 1. A multimedia question answering method based on multi-source network news data, characterized by comprising the following steps: Step 1, obtain news data from several news websites on the Internet using a web crawler; Step 2, parse the news data to obtain news headlines, news text, and news pictures, and build an index from them; Step 3, input a query request and retrieve the news documents corresponding to the request; Step 4, analyze the topics of the retrieved news documents with a Latent Dirichlet Allocation model and divide the results into different topics; Step 5, for each topic, cluster the pictures in all news documents it contains by similarity, and select one picture from the subclass containing the most pictures as the representative picture of the topic; Step 6, display the topics and the representative picture of each topic, clicking a topic showing the news corresponding to that topic; wherein, for the retrieval in Step 3: several semantically similar query words $Q_{ca} = \{q_{c1}, q_{c2}, q_{c3}, \ldots, q_{cnm}\}$ are derived by query expansion and added to the query request, where $q_{cnm}$ is an expanded query word, n is the number of query words in the original query, and m is the number of query words expanded from each original query word; news documents are retrieved by adding a query word weight $\lambda_i$ to the retrieval method Okapi BM25; the similarity between a retrieved document and the query is given by [equation not reproduced in this text], where N is the total number of query words, D is a news document, Q is the query input, $q_i$ is a query word, $k_1$ and b are the Okapi BM25 parameters, avgdl is the average number of words per news document, tf() and IDF() are the Okapi BM25 statistics, the weight is defined by [equation not reproduced in this text], and f(*) is the number of times * occurs.
- 2. The method according to claim 1, characterized in that, in Step 2, the unique terms (non-repeated words) in the news text data are counted, stop words are filtered out, and the news data is indexed by the unique terms in the form of an inverted list and saved.
- 3. The method according to claim 1, characterized in that, in Step 5, picture similarity is computed using a near-duplicate picture detection method.
- 4. The method according to claim 1, characterized in that, in Step 5, the representative picture of the topic is obtained within the subclass containing the most pictures according to the following formula: [equation not reproduced in this text], where $|C_k|$ is the number of pictures in the largest subclass, $v_{ij}$ is the similarity between image i and image j, and $rel_j = Score(Q, D)$.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610273211.1A CN105975507B (en) | 2016-04-28 | 2016-04-28 | A multimedia question answering method based on multi-source network news data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610273211.1A CN105975507B (en) | 2016-04-28 | 2016-04-28 | A multimedia question answering method based on multi-source network news data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105975507A CN105975507A (en) | 2016-09-28 |
CN105975507B true CN105975507B (en) | 2018-07-03 |
Family
ID=56993611
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610273211.1A Active CN105975507B (en) | 2016-04-28 | 2016-04-28 | A multimedia question answering method based on multi-source network news data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105975507B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108345700B (en) * | 2018-03-29 | 2023-01-31 | 百度在线网络技术(北京)有限公司 | Article representative picture selection method and device and computer equipment |
CN108897778B (en) * | 2018-06-04 | 2021-12-31 | 创意信息技术股份有限公司 | Image annotation method based on multi-source big data analysis |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7747629B2 (en) * | 2006-08-23 | 2010-06-29 | International Business Machines Corporation | System and method for positional representation of content for efficient indexing, search, retrieval, and compression |
CN102436442A (en) * | 2011-11-03 | 2012-05-02 | 中国科学技术信息研究所 | Word semantic relativity measurement method based on context |
CN102411638B (en) * | 2011-12-30 | 2013-06-19 | 中国科学院自动化研究所 | Method for generating multimedia summary of news search result |
CN103049470B (en) * | 2012-09-12 | 2016-09-21 | 北京航空航天大学 | Viewpoint searching method based on emotion degree of association |
CN103020261A (en) * | 2012-12-24 | 2013-04-03 | 南京邮电大学 | Image automatic marking method |
CN104765769B (en) * | 2015-03-06 | 2018-04-27 | 大连理工大学 | A short-text query expansion and retrieval method based on word vectors |
Also Published As
Publication number | Publication date |
---|---|
CN105975507A (en) | 2016-09-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11086883B2 (en) | Systems and methods for suggesting content to a writer based on contents of a document | |
Borth et al. | Sentibank: large-scale ontology and classifiers for detecting sentiment and emotions in visual content | |
US10216851B1 (en) | Selecting content using entity properties | |
US8875038B2 (en) | Anchoring for content synchronization | |
CN103455487B (en) | The extracting method and device of a kind of search term | |
Khabiri et al. | Summarizing user-contributed comments | |
US20160357872A1 (en) | Event networks and event view construction and display | |
CN104111941B (en) | The method and apparatus that information is shown | |
US11055312B1 (en) | Selecting content using entity properties | |
Scharl et al. | Analyzing the public discourse on works of fiction–Detection and visualization of emotion in online coverage about HBO’s Game of Thrones | |
Ho et al. | Mining future spatiotemporal events and their sentiment from online news articles for location-aware recommendation system | |
WO2016057984A1 (en) | Methods and systems for base map and inference mapping | |
US20180225379A1 (en) | Recommendation Based On Thematic Structure Of Content Items In Digital Magazine | |
Zhang et al. | Mining and clustering service goals for restful service discovery | |
WO2021111400A1 (en) | System and method for enabling a search platform to users | |
Chen et al. | Tag recommendation by machine learning with textual and social features | |
CN105975507B (en) | A multimedia question answering method based on multi-source network news data | |
Liu et al. | Event-based cross media question answering | |
Hu et al. | Embracing information explosion without choking: Clustering and labeling in microblogging | |
Spitz et al. | Topexnet: entity-centric network topic exploration in news streams | |
Redondo-García et al. | Augmenting TV newscasts via entity expansion | |
Sivaramakrishnan et al. | Validating effective resume based on employer’s interest with recommendation system | |
JP2012256268A (en) | Advertisement distribution device and advertisement distribution program | |
JP2019175212A (en) | Information display device, article page generation device, information processing device, information display system, and program | |
Bellini et al. | Optimization of information retrieval for cross media contents in a best practice network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||