CN105975507B - A multimedia question answering method based on multi-source network news data - Google Patents

A multimedia question answering method based on multi-source network news data

Info

Publication number
CN105975507B
Authority
CN
China
Prior art keywords
news
picture
theme
data
query word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610273211.1A
Other languages
Chinese (zh)
Other versions
CN105975507A (en)
Inventor
唐金辉
李泽超
王学明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology
Priority to CN201610273211.1A
Publication of CN105975507A
Application granted
Publication of CN105975507B
Current legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a multimedia question answering method based on multi-source network news data, comprising the following steps: Step 1, obtain news data from several news websites on the internet using a web crawler; Step 2, parse the news data to obtain news headlines, news text, and news pictures, and build an index from them; Step 3, input a query request and retrieve the news documents corresponding to the request; Step 4, analyze the topics of the retrieved news documents with the Latent Dirichlet Allocation model and divide the results into different topics; Step 5, for each topic, cluster the pictures in all news documents belonging to that topic by similarity, and select one picture from the subclass containing the most pictures as the representative picture of the topic; Step 6, display the topics and the representative picture of each topic; clicking a topic shows the news corresponding to that topic.

Description

A multimedia question answering method based on multi-source network news data
Technical field
The present invention relates to data mining and image processing techniques, and in particular to a multimedia question answering method based on multi-source network news data.
Background art
With the rapid development of information technology and the internet, people obtain news in increasingly varied ways and face an ever-growing volume of news data. How to find the news one needs within such a large volume of data is a current research hotspot and a research topic in data mining. In web browsing, the lack of methods for parsing and indexing news text data, analyzing the topics of news content, and selecting topic images leaves users browsing news data blindly. A systematic multimedia question answering system is therefore needed, built on data mining and image processing of multi-source network news data.
Summary of the invention
The purpose of the present invention is to provide a multimedia question answering method based on multi-source network news data, the method comprising the following steps:
Step 1, obtain news data from several news websites on the internet using a web crawler;
Step 2, parse the news data to obtain news headlines, news text, and news pictures, and build an index from them;
Step 3, input a query request and retrieve the news documents corresponding to the request;
Step 4, analyze the topics of the retrieved news documents with the Latent Dirichlet Allocation model and divide the results into different topics;
Step 5, for each topic, cluster the pictures in all news documents belonging to that topic by similarity, and select one picture from the subclass containing the most pictures as the representative picture of the topic;
Step 6, display the topics and the representative picture of each topic; clicking a topic shows the news corresponding to that topic.
Compared with the prior art, the present invention has the following advantages:
The present invention uses news media data from multiple sources on the network and can therefore cover, as far as possible, all news data on the network related to a given query. When presenting query results to the user, the present invention uses topic analysis and image processing to display the large volume of retrieved news data by category, so that users can quickly find the news they want to browse, which considerably improves the browsing experience.
The present invention is further described below with reference to the accompanying drawings.
Brief description of the drawings
Fig. 1 is a flow chart of the multimedia question answering method based on multi-source network news data according to the present invention.
Fig. 2 shows a demonstration of the multimedia question answering system based on multi-source network news data.
Detailed description of the embodiments
With reference to Fig. 1, a multimedia question answering method based on multi-source network news data comprises the following steps:
Step 1, obtain news data from several news websites on the internet using a web crawler;
Step 2, parse the news data to obtain news headlines, news text, and news pictures, and build an index from them;
Step 3, input a query request and retrieve the news documents corresponding to the request;
Step 4, analyze the topics of the retrieved news documents with the Latent Dirichlet Allocation model and divide the results into different topics;
Step 5, for each topic, cluster the pictures in all news documents belonging to that topic by similarity, and select one picture from the subclass containing the most pictures as the representative picture of the topic;
Step 6, display the topics and the representative picture of each topic; clicking a topic shows the news corresponding to that topic.
The news websites in Step 1 include ABC News (http://abcnews.go.com/), BBC News (http://www.bbc.com/), CNN News (http://edition.cnn.com/), and others.
In Step 2, once the download has finished, the downloaded news web pages are parsed to obtain the required data, such as news headlines, news text, and news pictures. The unique terms (non-repeated words) across all news text are then counted and, after stop words are filtered out, the news data is indexed by these unique terms in the form of an inverted list and saved to a database.
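The indexing part of Step 2 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the tokenizer, the tiny `STOP_WORDS` set, the function names, and the in-memory dictionary that stands in for the database are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each unique non-stop-word term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            if term in STOP_WORDS:
                continue  # stop words are filtered out before indexing
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


# Two made-up news texts keyed by document id.
docs = {
    1: "Flood warning issued in the north of the country",
    2: "The flood damaged roads and the rail network",
}
index = build_inverted_index(docs)
```

Looking up `index["flood"]` returns the posting list for that term, here both documents with a term frequency of 1 each, which is the information a retrieval step such as BM25 needs.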
In Step 3, when the user submits a query, the present invention first uses query expansion to obtain a higher retrieval recall: semantically similar query words $Q_{ca} = \{q_{c1}, q_{c2}, q_{c3}, \ldots, q_{cnm}\}$ are added to the query submitted by the user, where $q_{cnm}$ is an expanded query word, n is the number of query words in the original query, and m is the number of query words expanded from each query word in the original query. Relevant news documents are then retrieved and returned by adding query-word weights to the established retrieval method Okapi BM25. The similarity between a retrieved document and the query is computed by a formula Score(Q, D), in which N is the total number of query words (the query words in the original query plus the query words added by expansion), D is a news document, Q is the query input, $q_i$ is a query word, $k_1$ and b are the parameters of Okapi BM25, and avgdl is the average number of words across all news documents; tf(·) and IDF(·) are the statistics used in Okapi BM25, and f(*) is the number of occurrences of *.
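The scoring formula itself is not reproduced in this text, so the following Python sketch is only one plausible reading: the standard Okapi BM25 sum with each query word's contribution multiplied by a weight. The function name `bm25_score`, the IDF variant, the default values of `k1` and `b`, and the choice of 1.0 for original query words and 0.5 for expanded ones are assumptions, not values taken from the patent.

```python
import math
from collections import Counter


def bm25_score(query_weights, doc_tokens, corpus, k1=1.2, b=0.75):
    """Weighted BM25 (assumed form):
    sum_i weight_i * IDF(q_i) * tf * (k1 + 1) / (tf + k1 * (1 - b + b * |D| / avgdl))
    """
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs  # average document length
    tf = Counter(doc_tokens)
    score = 0.0
    for term, weight in query_weights.items():
        df = sum(1 for d in corpus if term in d)  # documents containing term
        # Assumed IDF variant (always non-negative); the patent's is not shown.
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        f = tf[term]
        norm = f + k1 * (1.0 - b + b * len(doc_tokens) / avgdl)
        score += weight * idf * f * (k1 + 1.0) / norm
    return score


# Made-up tokenized corpus.
corpus = [
    ["flood", "warning", "north"],
    ["flood", "damaged", "roads", "rail"],
    ["election", "results", "announced"],
]
# Original query word "flood" plus two hypothetical expanded words.
query_weights = {"flood": 1.0, "inundation": 0.5, "warning": 0.5}
scores = [bm25_score(query_weights, d, corpus) for d in corpus]
```

In this toy corpus the first document matches both an original and an expanded query word and therefore ranks first, while the third document matches nothing and scores zero. Giving expanded words a lower weight than original ones is a common way to raise recall without letting the expansion dominate the ranking.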
The Latent Dirichlet Allocation (LDA) model used in Step 4 is a bag-of-words model used to extract topic words from documents. Suppose a data set of M documents is given, where {w1, w2, w3, ..., wm} is a vocabulary containing N terms. LDA assumes that these documents are generated from K topics. In each document, each term $w_i$ is assigned a latent variable $z_i \in \{1, 2, 3, \ldots, K\}$ that labels the topic that generated the word. The probability of generating a word in a document is $p(w_i) = \sum_{j=1}^{K} p(w_i \mid z_i = j)\, p(z_i = j)$, where $p(w_i \mid z_i = j)$ is the probability of term $w_i$ under topic j and $p(z_i = j)$ is the probability that topic j occurs, which follows a Dirichlet distribution.
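The word probability described above is a mixture over topics, which the following Python sketch evaluates for given distributions. It does not fit an LDA model (inference is out of scope here); the two toy topics, their probabilities, and the function names are hypothetical values chosen for illustration.

```python
def word_probability(word, topic_word_probs, topic_probs):
    """p(w) = sum over topics j of p(w | z = j) * p(z = j)."""
    return sum(
        topic_word_probs[j].get(word, 0.0) * topic_probs[j]
        for j in range(len(topic_probs))
    )


def most_likely_topic(word, topic_word_probs, topic_probs):
    """Return the topic index j that maximizes p(w | z = j) * p(z = j)."""
    return max(
        range(len(topic_probs)),
        key=lambda j: topic_word_probs[j].get(word, 0.0) * topic_probs[j],
    )


# Hypothetical toy model with K = 2 topics.
topic_word_probs = [
    {"flood": 0.5, "rain": 0.4, "vote": 0.1},  # p(w | z = 0)
    {"flood": 0.1, "rain": 0.1, "vote": 0.8},  # p(w | z = 1)
]
topic_probs = [0.6, 0.4]  # p(z = 0), p(z = 1)

p_flood = word_probability("flood", topic_word_probs, topic_probs)
vote_topic = most_likely_topic("vote", topic_word_probs, topic_probs)
```

Here `p_flood` is 0.5 × 0.6 + 0.1 × 0.4 = 0.34, and "vote" is attributed to the second topic. Grouping retrieved documents by their dominant topic in the same way is one simple route to the per-topic division that Step 4 describes.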
In Step 5, for each topic obtained in Step 4, the pictures in all news documents belonging to that topic are clustered by similarity, and one picture is then selected from the subclass containing the most pictures as the representative picture of the topic. The present invention computes picture similarity with near-duplicate picture detection, a method common in image processing, and divides the pictures into similarity subclasses. Two assumptions are made: (1) within a set of near-duplicate images, only one image is stored in the database as the index; (2) the subclass containing the most images indicates that these images occur many times within the topic, so they can to a large extent serve as representative images of the topic. Based on these two assumptions, the present invention selects one picture from the subclass containing the most pictures as the topic picture: a score is computed for each picture in that subclass, and the picture with the highest score is taken as the topic picture, where $|C_k|$ is the number of pictures in the largest subclass and $rel_j$ is Score(Q, D) from Step 3, that is, the similarity between the document containing picture j and the query Q.
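The selection in Step 5 can be sketched in Python under explicit assumptions. The patent's scoring formula is not reproduced in this text, so the score used below, the average of similarity times document relevance over the largest subclass, is a guess built only from the symbols the text names. The threshold-based grouping stands in for a real near-duplicate detector, and the similarity matrix and relevance values are made up.

```python
def cluster_near_duplicates(similarity, threshold=0.8):
    """Group picture indices into connected components of the graph whose
    edges link pictures with similarity >= threshold (a stand-in for a real
    near-duplicate detection method)."""
    n = len(similarity)
    seen = set()
    clusters = []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, cluster = [start], []
        while stack:
            i = stack.pop()
            cluster.append(i)
            for j in range(n):
                if j not in seen and similarity[i][j] >= threshold:
                    seen.add(j)
                    stack.append(j)
        clusters.append(sorted(cluster))
    return clusters


def pick_representative(similarity, relevance, threshold=0.8):
    """Return (best picture index, largest subclass)."""
    clusters = cluster_near_duplicates(similarity, threshold)
    largest = max(clusters, key=len)

    def score(i):
        # ASSUMED score: (1 / |C_k|) * sum_j v_ij * rel_j over the subclass.
        return sum(similarity[i][j] * relevance[j] for j in largest) / len(largest)

    return max(largest, key=score), largest


# Made-up pairwise similarities: pictures 0, 1, 2 are near-duplicates.
similarity = [
    [1.00, 0.90, 0.85, 0.10],
    [0.90, 1.00, 0.90, 0.20],
    [0.85, 0.90, 1.00, 0.10],
    [0.10, 0.20, 0.10, 1.00],
]
# Made-up Score(Q, D) of the document containing each picture.
relevance = [2.0, 1.0, 1.5, 3.0]
best, largest = pick_representative(similarity, relevance)
```

With these values the largest subclass is pictures 0, 1, and 2, and picture 0 is selected. Picture 3 comes from the most relevant document but is ignored because it does not belong to the largest subclass, which mirrors assumption (2) above.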
In Step 6, the user's query results are presented as shown in Fig. 2. The results returned for the user's query are presented in a clear and concise way, which improves the user's browsing experience.

Claims (4)

  1. A multimedia question answering method based on multi-source network news data, characterized by comprising the following steps:
    Step 1, obtain news data from several news websites on the internet using a web crawler;
    Step 2, parse the news data to obtain news headlines, news text, and news pictures, and build an index from them;
    Step 3, input a query request and retrieve the news documents corresponding to the request;
    Step 4, analyze the topics of the retrieved news documents with the Latent Dirichlet Allocation model and divide the results into different topics;
    Step 5, for each topic, cluster the pictures in all news documents belonging to that topic by similarity, and select one picture from the subclass containing the most pictures as the representative picture of the topic;
    Step 6, display the topics and the representative picture of each topic; clicking a topic shows the news corresponding to that topic;
    wherein, for the retrieval in Step 3:
    several semantically similar query words $Q_{ca} = \{q_{c1}, q_{c2}, q_{c3}, \ldots, q_{cnm}\}$ are obtained by query expansion and added to the query request, where $q_{cnm}$ is an expanded query word, n is the number of query words in the original query, and m is the number of query words expanded from each query word in the original query;
    news documents are retrieved using the retrieval method Okapi BM25 with query-word weights $\lambda_i$ added;
    the similarity between a retrieved document and the query is computed by a formula Score(Q, D), in which
    N is the total number of query words, D is a news document, Q is the query input, $q_i$ is a query word, $k_1$ and b are the parameters of Okapi BM25, avgdl is the average number of words across all news documents, and tf(·) and IDF(·) are the statistics used in Okapi BM25,
    $\lambda_i$ is the weight of query word $q_i$, and
    f(*) is the number of occurrences of *.
  2. The method according to claim 1, characterized in that in Step 2 the unique terms (non-repeated words) in the news text data are counted, stop words are filtered out, and the news data is indexed by the unique terms in the form of an inverted list and saved.
  3. The method according to claim 1, characterized in that in Step 5 picture similarity is computed with a near-duplicate picture detection method.
  4. The method according to claim 1, characterized in that in Step 5 the representative picture of the topic is selected from the subclass containing the most pictures according to a scoring formula, in which
    $|C_k|$ is the number of pictures in the largest subclass, $v_{ij}$ is the similarity between image i and image j, and $rel_j$ = Score(Q, D).
CN201610273211.1A 2016-04-28 2016-04-28 A multimedia question answering method based on multi-source network news data Active CN105975507B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610273211.1A CN105975507B (en) 2016-04-28 2016-04-28 A multimedia question answering method based on multi-source network news data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610273211.1A CN105975507B (en) 2016-04-28 2016-04-28 A multimedia question answering method based on multi-source network news data

Publications (2)

Publication Number Publication Date
CN105975507A (en) 2016-09-28
CN105975507B (en) 2018-07-03

Family

ID=56993611

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610273211.1A Active CN105975507B (en) 2016-04-28 2016-04-28 A multimedia question answering method based on multi-source network news data

Country Status (1)

Country Link
CN (1) CN105975507B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108345700B (en) * 2018-03-29 2023-01-31 百度在线网络技术(北京)有限公司 Article representative picture selection method and device and computer equipment
CN108897778B (en) * 2018-06-04 2021-12-31 创意信息技术股份有限公司 Image annotation method based on multi-source big data analysis

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7747629B2 (en) * 2006-08-23 2010-06-29 International Business Machines Corporation System and method for positional representation of content for efficient indexing, search, retrieval, and compression
CN102436442A (en) * 2011-11-03 2012-05-02 中国科学技术信息研究所 Word semantic relativity measurement method based on context
CN102411638B (en) * 2011-12-30 2013-06-19 中国科学院自动化研究所 Method for generating multimedia summary of news search result
CN103049470B (en) * 2012-09-12 2016-09-21 北京航空航天大学 Viewpoint searching method based on emotion degree of association
CN103020261A (en) * 2012-12-24 2013-04-03 南京邮电大学 Image automatic marking method
CN104765769B (en) * 2015-03-06 2018-04-27 大连理工大学 The short text query expansion and search method of a kind of word-based vector

Also Published As

Publication number Publication date
CN105975507A (en) 2016-09-28

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant