CN103699661A - Method and system for acquiring data of video resources - Google Patents

Publication number
CN103699661A
Authority
CN
China
Prior art keywords
video data
information
video
page
original list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310741187.6A
Other languages
Chinese (zh)
Inventor
曹坤波
郑磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
LeTV Cloud Computing Co Ltd
Original Assignee
LeTV Information Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by LeTV Information Technology Beijing Co Ltd
Priority to CN201310741187.6A
Publication of CN103699661A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention discloses a method for acquiring video resource data. The method comprises: acquiring a list page of video data according to a provided crawl entry; acquiring an information-bearing page of the video data according to the list page; and crawling the video data carried by the information-bearing page. The method improves the efficiency of crawling video data.

Description

Method and system for acquiring video resource data
Technical field
The present invention relates to information retrieval technology, and in particular to a method and system for acquiring video resource data.
Background art
With the development of science and technology, more and more users search for and watch video programs over the Internet. The video information available on the Internet is abundant and convenient for users to search, and Internet video is characterized by constant change and rapid updating.
Generally, the video resources of a video website come mainly from: video data for which the website itself holds the copyright, video data actively pushed by partners, and video data uploaded by users (UGC). Besides these sources, video data obtained by web crawling is also an important source.
However, as the volume of data across the whole network keeps growing, how to crawl video data efficiently, and how to obtain tidy, clean video data, are technical problems in urgent need of a solution. It is therefore necessary to propose an improved technical solution to these problems.
Summary of the invention
The main purpose of the present invention is to provide a method and system for acquiring video resource data, so as to solve the problem of inefficient crawling of video data in the prior art.
To solve the above problem, according to one aspect of the present invention, a method for acquiring video resource data is provided, comprising: acquiring a list page of video data according to a provided crawl entry; acquiring an information-bearing page of the video data according to the list page; and crawling the video data carried by the information-bearing page.
The method further comprises: matching the video data carried by the information-bearing page against a preset template, and parsing out multi-dimensional information of the video data, including: title information, synopsis information, episode-count information, and video time information.
After the step of crawling the video data carried by the information-bearing page, the method further comprises: deleting interference information from the crawled video data, the interference information including: advertising information, behind-the-scenes featurette information, external links, and ranking-list information.
The method further comprises: storing the crawled video data in a database in Document Object Model structure.
The information-bearing page includes: a video playback page and a video information display page.
According to another aspect of the present invention, a system for acquiring video resource data is also provided, comprising: a first acquisition module, for acquiring a list page of video data according to a provided crawl entry; a second acquisition module, for acquiring an information-bearing page of the video data according to the list page; and a crawl module, for crawling the video data carried by the information-bearing page.
The system further comprises: a parsing module, for matching the video data carried by the information-bearing page against a preset template, and parsing out multi-dimensional information of the video data, including: title information, synopsis information, episode-count information, and video time information.
The system further comprises: a deletion module, for deleting interference information from the crawled video data, the interference information including: advertising information, behind-the-scenes featurette information, external links, and ranking-list information.
The system further comprises: a storage module, for storing the crawled video data in a database in Document Object Model structure.
The information-bearing page includes: a video playback page and a video information display page.
According to the technical solution of the present invention, the list page of video data is acquired through the crawl entry, the information-bearing page of the video data is acquired according to the list page, and the video data carried by the information-bearing page is crawled. Video data across the whole network is thereby crawled effectively, and the efficiency of crawling video data is improved.
Brief description of the drawings
The accompanying drawings described here are provided for further understanding of the present invention and form part of this application. The illustrative embodiments of the invention and their description serve to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of a method for acquiring video resource data according to an embodiment of the present invention;
Fig. 2 is a flowchart of a method for acquiring video resource data according to another embodiment of the present invention;
Fig. 3 is a structural block diagram of a system for acquiring video resource data according to an embodiment of the present invention;
Fig. 4 is a structural block diagram of a system for acquiring video resource data according to another embodiment of the present invention.
Detailed description
To make the purpose, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and specific embodiments.
According to an embodiment of the present invention, a method for acquiring video resource data is provided.
Fig. 1 is a flowchart of the method for acquiring video resource data according to an embodiment of the present invention. As shown in Fig. 1, the method comprises:
Step S102: acquire a list page of video data according to a provided crawl entry.
The list page is a web page that displays a list of numerous video resources, and it may span multiple pages. The list page of video data can be obtained through the provided crawl entry (for example, a network interface of a video website, through which the data of each page of the website can be crawled).
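As an illustration of step S102, the following minimal sketch walks a multi-page list starting from a crawl entry. The URL template with a {page} placeholder, the injected fetch callable, and the name iter_list_pages are assumptions made for this example; the description does not specify how the crawl entry is structured.

```python
from typing import Callable, Iterator, Optional


def iter_list_pages(
    entry_url: str,
    fetch: Callable[[str], Optional[str]],
    max_pages: int = 100,
) -> Iterator[str]:
    """Yield the body of each list page reachable from the crawl entry.

    entry_url is a hypothetical URL template containing a {page} placeholder.
    fetch returns the page body, or None when the page does not exist.
    """
    for page in range(1, max_pages + 1):
        body = fetch(entry_url.format(page=page))
        if not body:  # stop at the first missing or empty page
            break
        yield body
```

Passing the fetch function in as a parameter keeps the sketch independent of any particular HTTP client and makes it easy to exercise against canned pages.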
Step S104: acquire an information-bearing page of the video data according to the list page of the video data.
The information-bearing page is the next-level page below the list page; clicking the link of a video in the list page leads to the information page of that specific video. In the list page, a large number of video resources are arranged in a certain order, for example by title or upload time. First, keywords for the video data to be crawled are set according to the content to be crawled. The keywords are then matched against the list page to determine the uniform resource locator (URL) of the video data to be crawled, and the information-bearing page of that video data is acquired from the determined URL.
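The keyword matching of step S104 can be sketched as follows, using only the Python standard library. Matching keywords against the anchor text of each link is an assumption of this example (the description only says that keywords are matched against the list page), and the names _LinkCollector and match_video_urls are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from a list page."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def match_video_urls(list_html, base_url, keywords):
    """Return absolute URLs of links whose anchor text contains any keyword."""
    collector = _LinkCollector()
    collector.feed(list_html)
    wanted = [k.lower() for k in keywords]
    urls = []
    for href, text in collector.links:
        if any(k in text.lower() for k in wanted):
            url = urljoin(base_url, href)
            if url not in urls:  # keep the first occurrence, preserve order
                urls.append(url)
    return urls
```

Each returned URL would then be fetched to obtain the information-bearing page of the matched video.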
Generally, information-bearing pages can be divided into video playback pages and video information display pages. The player on a video playback page can play the specific video directly online, whereas a video information display page presents the details of the video data in one place, including: title information, synopsis information, episode-count information, video time information, and so on.
Step S106: crawl the video data carried by the information-bearing page.
In practice, a web crawling tool can be used to crawl the video data carried by the information-bearing page, thereby obtaining the complete video data of that web page. The crawl targets include long-form videos and user-uploaded videos (UGC).
In one embodiment of the invention, the crawled video data is stored in a database in Document Object Model structure (DOM tree), which yields the data source for the video resource data. However, video data obtained by direct crawling may contain interference information such as advertising information, behind-the-scenes featurette information, external links, and ranking-list information. This interference information is unwanted and should be deleted. Specifically, the interference information in the video data is determined according to the descriptive information of the video data (for example, its title and synopsis) and then deleted, so that "clean" and "tidy" video data is obtained.
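One possible reading of this clean-up step is sketched below. The heuristic is an assumption of this example, not the method the description prescribes: a top-level block of the DOM tree is kept only if its text mentions the video title, and links to other hosts are treated as external links. The function name remove_interference, the site_host parameter, and the use of xml.etree.ElementTree (which requires well-formed markup) are likewise illustrative choices.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def remove_interference(root, title, site_host):
    """Prune, in place, parts of a DOM tree that look unrelated to the video."""
    # Drop top-level blocks whose text never mentions the title
    # (for example advertisement or ranking-list blocks).
    for block in list(root):
        text = "".join(block.itertext()).lower()
        if title.lower() not in text:
            root.remove(block)
    # Drop links that point to a different host (external links).
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "a":
                host = urlparse(child.get("href", "")).netloc
                if host and host != site_host:
                    parent.remove(child)
    return root
```

A production crawler would likely combine several signals (synopsis text, tag classes, link targets) rather than rely on the title alone.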
Further, crawling the video data carried by the information-bearing page yields only the video data itself, but the information-bearing page also carries multi-dimensional information about the video data, including: title information, synopsis information, episode-count information, video time information, and so on. This multi-dimensional information is useful information related to the video data and has to be obtained by parsing. Specifically, a template is set up for each web page format; the template defines which dimension of the video data each class of tag carries. For web pages of a given type, the dimension of information carried by each class of tag is fixed, so a template for that page type can be set in advance. The video data carried by the information-bearing page is matched against this template and parsed to obtain the multi-dimensional information of the video data, which is stored in the database together with the video data.
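The template idea can be illustrated with a small mapping from dimension name to a tag selector. The tag names, class values, and field names below are hypothetical; a real template would be written once per page layout. The sketch uses the limited XPath support of xml.etree.ElementTree and assumes well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical template for one page layout: dimension -> ElementTree path.
INFO_PAGE_TEMPLATE = {
    "title": ".//h1[@class='title']",
    "synopsis": ".//p[@class='synopsis']",
    "episode_count": ".//span[@class='episodes']",
    "video_time": ".//span[@class='time']",
}


def parse_dimensions(root, template):
    """Apply a template to a parsed page; dimensions not found map to None."""
    result = {}
    for field, path in template.items():
        node = root.find(path)
        if node is None:
            result[field] = None
        else:
            result[field] = "".join(node.itertext()).strip()
    return result
```

Because the template is plain data, supporting another site layout means adding another mapping rather than changing the parsing code.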
Referring now to Fig. 2, Fig. 2 is a flowchart of a method for acquiring video resource data according to a preferred embodiment of the present invention. As shown in Fig. 2, the method comprises:
Step S202: acquire a list page of video data according to a provided crawl entry.
Step S204: acquire an information-bearing page of the video data according to the list page of the video data.
Step S206: crawl the video data carried by the information-bearing page.
Step S208: delete the interference information from the crawled video data, the interference information including but not limited to: advertising information, behind-the-scenes featurette information, external links, and ranking-list information.
Step S210: match the video data carried by the information-bearing page against a preset template, and parse out the multi-dimensional information of the video data, including but not limited to: title information, synopsis information, episode-count information, and video time information.
The above embodiment effectively improves the efficiency of crawling video data across the whole network.
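The order of steps S202 to S210 can be summarised as a small driver function. This is only a sketch of the control flow: each step is passed in as a callable, and the parameter names, the dictionary-based record format, and the function name acquire_video_data are assumptions made for this example.

```python
from typing import Callable, Iterable, List


def acquire_video_data(
    entry: str,
    get_list_pages: Callable[[str], Iterable[str]],      # S202
    get_info_page_urls: Callable[[str], Iterable[str]],  # S204
    crawl: Callable[[str], dict],                        # S206
    remove_interference: Callable[[dict], dict],         # S208
    parse_dimensions: Callable[[dict], dict],            # S210
) -> List[dict]:
    """Run the five steps in order and return one record per video."""
    records = []
    for list_page in get_list_pages(entry):
        for url in get_info_page_urls(list_page):
            raw = crawl(url)
            clean = remove_interference(raw)
            records.append(
                {"url": url, "data": clean, "info": parse_dimensions(clean)}
            )
    return records
```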
According to an embodiment of the present invention, a system for acquiring video resource data is also provided.
Fig. 3 is a structural block diagram of the system for acquiring video resource data according to an embodiment of the present invention. As shown in Fig. 3, the system comprises a first acquisition module 10, a second acquisition module 20, and a crawl module 30. The structure and connections of each module are described in detail below.
The first acquisition module 10 is used for acquiring a list page of video data according to a provided crawl entry. For example, the list page, which displays a list of numerous video resources, can be obtained through the provided crawl entry (for example, a network interface of a video website).
The second acquisition module 20 is coupled to the first acquisition module 10 and is used for acquiring an information-bearing page of the video data according to the list page of the video data. Specifically, the second acquisition module 20 matches preset keywords against the list page of the video data, determines the uniform resource locator (URL) of the video data to be crawled, and acquires the information-bearing page of that video data according to the determined URL.
The information-bearing page is the next-level page below the list page; clicking the link of a video in the list page leads to the information page of that specific video. Generally, information-bearing pages can be divided into video playback pages and video information display pages. The player on a video playback page can play the specific video directly online, whereas a video information display page presents the details of the video data in one place, including: title information, synopsis information, episode-count information, video time information, and so on.
The crawl module 30 is coupled to the second acquisition module 20 and is used for crawling the video data carried by the information-bearing page. In practice, a web crawling tool can be used to crawl the video data carried by the information-bearing page, thereby obtaining the complete video data of that web page. The crawl targets include long-form videos and user-uploaded videos (UGC).
Referring to Fig. 4, on the basis of Fig. 3, the system further comprises:
A parsing module 40, coupled to the crawl module 30, for matching the video data carried by the information-bearing page against a preset template, and parsing out the multi-dimensional information of the video data, including: title information, synopsis information, episode-count information, and video time information.
Still referring to Fig. 4, the system further comprises:
A deletion module 50, coupled to the parsing module 40, for determining the interference information in the video data according to the descriptive information of the video data and deleting that interference information, the interference information including: advertising information, behind-the-scenes featurette information, external links, and ranking-list information.
In addition, the system further comprises:
A storage module 60, for storing the video data, after the interference information has been removed, in a database in Document Object Model (DOM tree) structure.
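What the storage module does can be sketched with an in-process database. The description does not name a database or a schema, so the use of SQLite, the table name video_data, and the choice to store the DOM tree as serialized markup keyed by page URL are all assumptions of this example.

```python
import sqlite3
import xml.etree.ElementTree as ET


def store_video_dom(conn, url, root):
    """Serialize the DOM tree and insert or replace it under the page URL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS video_data "
        "(url TEXT PRIMARY KEY, dom TEXT NOT NULL)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO video_data (url, dom) VALUES (?, ?)",
        (url, ET.tostring(root, encoding="unicode")),
    )
    conn.commit()


def load_video_dom(conn, url):
    """Return the stored DOM tree for url, or None if nothing is stored."""
    row = conn.execute(
        "SELECT dom FROM video_data WHERE url = ?", (url,)
    ).fetchone()
    return ET.fromstring(row[0]) if row else None
```

Storing the tree in serialized form keeps the structure intact, so the parsing step can be rerun later against the stored data with an updated template.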
The operating steps of the method of the present invention correspond to the structural features of the system and can be cross-referenced, so they are not repeated one by one.
According to the technical solution of the present invention, the list page of video data is acquired through the crawl entry, the information-bearing page of the video data is acquired according to the list page, and the video data carried by the information-bearing page is crawled. Video data across the whole network is thereby crawled effectively, and the efficiency of crawling video data is improved.
The foregoing describes only embodiments of the present invention and is not intended to limit it. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of the claims of the present invention.

Claims (10)

1. A method for acquiring video resource data, characterized by comprising:
acquiring a list page of video data according to a provided crawl entry;
acquiring an information-bearing page of the video data according to the list page of the video data;
crawling the video data carried by the information-bearing page.
2. The method according to claim 1, characterized in that the information-bearing page includes: a video playback page and a video information display page.
3. The method according to claim 1, characterized in that acquiring the information-bearing page of the video data according to the list page of the video data comprises:
matching preset keywords against the list page of the video data to determine the uniform resource locator (URL) of the video data to be crawled;
acquiring the information-bearing page of the video data according to the determined URL of the video data.
4. The method according to claim 1, characterized by further comprising:
matching the video data carried by the information-bearing page against a preset template, and parsing out multi-dimensional information of the video data, including: title information, synopsis information, episode-count information, and video time information.
5. The method according to claim 1, characterized by further comprising:
determining interference information in the video data according to descriptive information of the video data, and deleting the interference information from the video data, the interference information including: advertising information, behind-the-scenes featurette information, external links, and ranking-list information;
storing the video data, after the interference information has been removed, in a database in Document Object Model structure.
6. A system for acquiring video resource data, characterized by comprising:
a first acquisition module, for acquiring a list page of video data according to a provided crawl entry;
a second acquisition module, for acquiring an information-bearing page of the video data according to the list page of the video data;
a crawl module, for crawling the video data carried by the information-bearing page.
7. The system according to claim 6, characterized in that the information-bearing page includes: a video playback page and a video information display page.
8. The system according to claim 6, characterized in that the second acquisition module is further used for matching preset keywords against the list page of the video data, determining the uniform resource locator (URL) of the video data to be crawled, and acquiring the information-bearing page of the video data according to the determined URL of the video data.
9. The system according to claim 6, characterized by further comprising:
a parsing module, for matching the video data carried by the information-bearing page against a preset template, and parsing out multi-dimensional information of the video data, including: title information, synopsis information, episode-count information, and video time information.
10. The system according to claim 6, characterized by further comprising:
a deletion module, for determining interference information in the video data according to descriptive information of the video data, and deleting the interference information from the video data, the interference information including: advertising information, behind-the-scenes featurette information, external links, and ranking-list information;
a storage module, for storing the video data, after the interference information has been removed, in a database in Document Object Model structure.
CN201310741187.6A 2013-12-26 2013-12-26 Method and system for acquiring data of video resources Pending CN103699661A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310741187.6A CN103699661A (en) 2013-12-26 2013-12-26 Method and system for acquiring data of video resources

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310741187.6A CN103699661A (en) 2013-12-26 2013-12-26 Method and system for acquiring data of video resources

Publications (1)

Publication Number Publication Date
CN103699661A (en) 2014-04-02

Family

ID=50361189

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310741187.6A Pending CN103699661A (en) 2013-12-26 2013-12-26 Method and system for acquiring data of video resources

Country Status (1)

Country Link
CN (1) CN103699661A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104572996A (en) * 2015-01-06 2015-04-29 百度在线网络技术(北京)有限公司 Processing method and device for video webpage
CN104980485A (en) * 2015-03-16 2015-10-14 腾讯科技(深圳)有限公司 Sniffing method, device and system for network resource
CN105138674A (en) * 2015-09-08 2015-12-09 成都博元科技有限公司 Database access method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101241553A (en) * 2008-01-24 2008-08-13 北京六维世纪网络技术有限公司 Method and device for recognizing customizing messages jumping-off point and terminal
CN101833587A (en) * 2010-05-28 2010-09-15 上海交通大学 Network video searching system
CN101944111A (en) * 2010-09-09 2011-01-12 中国科学技术大学 Method and device for searching news video
US20120155831A1 (en) * 2010-12-16 2012-06-21 Canon Kabushiki Kaisha Information processing apparatus and method therefor
CN102761623A (en) * 2012-07-26 2012-10-31 北京奇虎科技有限公司 Resource self-adaptive downloading method, system, data storage server and communication system

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104572996A (en) * 2015-01-06 2015-04-29 百度在线网络技术(北京)有限公司 Processing method and device for video webpage
CN104572996B (en) * 2015-01-06 2018-09-07 百度在线网络技术(北京)有限公司 The treating method and apparatus of video web-pages
CN104980485A (en) * 2015-03-16 2015-10-14 腾讯科技(深圳)有限公司 Sniffing method, device and system for network resource
CN104980485B (en) * 2015-03-16 2019-01-15 腾讯科技(深圳)有限公司 A kind of sniff methods, devices and systems of Internet resources
CN105138674A (en) * 2015-09-08 2015-12-09 成都博元科技有限公司 Database access method
CN105138674B (en) * 2015-09-08 2018-11-02 成都博元科技有限公司 A kind of data bank access method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20151225

Address after: Room six, building 19, building 68, No. 100089 South Road, Haidian District, Beijing

Applicant after: LETV CLOUD COMPUTING CO., LTD.

Address before: Room six, building 19, building 68, No. 100089 South Road, Haidian District, Beijing

Applicant before: LeTV Information Technology (Beijing) Co., Ltd.

WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140402