CN103699661A - Method and system for acquiring data of video resources
- Publication number
- CN103699661A (application CN201310741187.6A)
- Authority
- CN
- China
- Prior art keywords
- video data
- information
- video
- page
- original list
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
Abstract
The invention discloses a method for acquiring video resource data. The method includes: obtaining the list pages of the video data through a provided crawl entry point; obtaining the information-carrying pages of the video data according to the list pages; and crawling the video data carried by the information-carrying pages. The method improves the efficiency of crawling video data.
Description
Technical field
The present invention relates to information retrieval technology, and in particular to a method and a system for acquiring video resource data.
Background
With the development of science and technology, more and more users search for and watch video programs over the Internet. The video information available on the Internet is abundant and easy to search, and Internet video is characterized by constant change and rapid updates.
Generally, the video resources of a video website come mainly from: video data for which the website itself holds the copyright, video data pushed by partners, and video data uploaded by users (UGC). Besides these sources, video data obtained by web crawling is also an important source.
However, as the volume of data across the whole network keeps growing, how to crawl video data effectively, and how to obtain clean, tidy video data, is a technical problem in urgent need of a solution. An improved technical solution is therefore needed.
Summary of the invention
The main purpose of the present invention is to provide a method and a system for acquiring video resource data, so as to solve the prior-art problem of inefficient video data crawling.
To solve the above problem, according to one aspect of the present invention, a method for acquiring video resource data is provided, comprising: obtaining the original list of video data according to a provided crawl entry point; obtaining the information-carrying page of the video data according to the original list; and crawling the video data carried by the information-carrying page.
The method further comprises: matching the video data carried by the information-carrying page against a preset template, and parsing it to obtain multi-dimensional information about the video data, including: title information, synopsis information, episode count information, and video time information.
After the step of crawling the video data carried by the information-carrying page, the method further comprises: deleting the interference information from the crawled video data, the interference information including: advertising information, featurette (behind-the-scenes) information, external links, and ranking list information.
The method further comprises: storing the crawled video data in a database in a Document Object Model (DOM) structure.
The information-carrying page includes: a video playback page and a video information display page.
According to another aspect of the present invention, a system for acquiring video resource data is also provided, comprising: a first acquisition module, configured to obtain the original list of video data according to a provided crawl entry point; a second acquisition module, configured to obtain the information-carrying page of the video data according to the original list; and a crawling module, configured to crawl the video data carried by the information-carrying page.
The system further comprises: a parsing module, configured to match the video data carried by the information-carrying page against a preset template and parse it to obtain multi-dimensional information about the video data, including: title information, synopsis information, episode count information, and video time information.
The system further comprises: a deletion module, configured to delete the interference information from the crawled video data, the interference information including: advertising information, featurette (behind-the-scenes) information, external links, and ranking list information.
The system further comprises: a storage module, configured to store the crawled video data in a database in a Document Object Model (DOM) structure.
The information-carrying page includes: a video playback page and a video information display page.
According to the technical solution of the present invention, the original list of video data is obtained through the crawl entry point, the information-carrying page of the video data is obtained according to the original list, and the video data carried by the information-carrying page is crawled. Video data across the whole network is thereby crawled effectively, and the efficiency of crawling video data is improved.
Brief description of the drawings
The accompanying drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the invention and their descriptions explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of a method for acquiring video resource data according to an embodiment of the present invention;
Fig. 2 is a flowchart of a method for acquiring video resource data according to another embodiment of the present invention;
Fig. 3 is a structural block diagram of a system for acquiring video resource data according to an embodiment of the present invention;
Fig. 4 is a structural block diagram of a system for acquiring video resource data according to another embodiment of the present invention.
Detailed description
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and specific embodiments.
According to an embodiment of the present invention, a method for acquiring video resource data is provided.
Fig. 1 is a flowchart of the method for acquiring video resource data according to an embodiment of the present invention. As shown in Fig. 1, the method comprises:
Step S102: obtain the original list of video data according to the provided crawl entry point.
The original list is a web page that displays a list of many video resources, and it may span multiple pages. The original list of video data can be obtained through the provided crawl entry point (for example, a network interface of the video website, through which the data of each page of the website can be crawled).
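Because the original list may span several pages, one crawl entry point has to be expanded into one URL per page. The following is a minimal sketch of that expansion, not the patent's implementation: the host `video.example.com`, the function name `list_page_urls`, and the `page` query parameter are all assumptions made for illustration.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def list_page_urls(entry_url: str, pages: int, page_param: str = "page") -> list[str]:
    """Build the URL of each page of the original list from one crawl entry point."""
    scheme, netloc, path, query, fragment = urlsplit(entry_url)
    urls = []
    for number in range(1, pages + 1):
        # Append the (assumed) page-number parameter to any existing query string.
        extra = urlencode({page_param: number})
        new_query = f"{query}&{extra}" if query else extra
        urls.append(urlunsplit((scheme, netloc, path, new_query, fragment)))
    return urls


# Hypothetical crawl entry point of a video website.
for url in list_page_urls("https://video.example.com/list?type=tv", 3):
    print(url)
```

A real site may paginate differently (path segments, offsets, or a JSON interface), so the parameter name would have to be adapted to the site being crawled.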
Step S104: obtain the information-carrying page of the video data according to the original list of video data.
The information-carrying page is the page one level below the original list: clicking the link of a video in the original list opens the information page of that specific video. In the original list, a large number of video resources are arranged in a certain order, for example by title or by upload time. First, keywords for the video data to be crawled are set according to the content to be crawled. The keywords are then matched against the original list to determine the uniform resource locator (URL) of the video data to be crawled. Finally, the information-carrying page of the video data is obtained from the determined URL.
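The keyword matching described above can be sketched with Python's standard `html.parser` module. This is an illustration under stated assumptions rather than the patented matching procedure: it assumes each video in the original list appears as an `<a>` element whose link text contains the title, and the names `LinkCollector` and `match_video_urls` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from the HTML of an original list."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only text inside an <a href=...> element is collected.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def match_video_urls(list_html, base_url, keywords):
    """Return absolute URLs of the links whose text contains any preset keyword."""
    collector = LinkCollector()
    collector.feed(list_html)
    return [
        urljoin(base_url, href)
        for href, text in collector.links
        if any(keyword in text for keyword in keywords)
    ]


sample = (
    '<ul><li><a href="/v/1.html">Space Documentary</a></li>'
    '<li><a href="/v/2.html">Cooking Show</a></li></ul>'
)
print(match_video_urls(sample, "https://video.example.com/list", ["Documentary"]))
```

Substring matching on link text is the simplest possible choice here; the source does not say how the matching is performed.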
Generally, information-carrying pages fall into two kinds: video playback pages and video information display pages. The player on a video playback page can play the specific video data directly online, whereas a video information display page presents the details of the video data in one place, including title information, synopsis information, episode count information, video time information, and so on.
Step S106: crawl the video data carried by the information-carrying page.
In practice, a web crawling tool can be used to crawl the video data carried by the information-carrying page, thereby obtaining the complete video data of that web page. The objects to be crawled include long-form videos and user-uploaded videos (UGC).
In one embodiment of the present invention, the crawled video data is stored in a database in a Document Object Model structure (DOM tree), which yields the data source of the video resource data. However, video data obtained by direct crawling may contain interference information such as advertising information, featurette (behind-the-scenes) information, external links, and ranking list information. This interference information is not needed and should be deleted. Specifically, the interference information in the video data is determined according to the descriptive information of the video data (for example, its title and synopsis) and is then deleted, leaving "clean" and "tidy" video data.
Further, crawling the video data carried by the information-carrying page yields only the bare video data, yet the page also carries multi-dimensional information about that video data, including title information, synopsis information, episode count information, video time information, and so on. This multi-dimensional information is useful and related to the video data, and it has to be obtained by parsing. Specifically, a template is set for each web page format, and the template defines which dimension of the video data each class of tag carries. For web pages of a given type, the dimension carried by each class of tag is fixed, so the template for that page type is set in advance. The video data carried by the information-carrying page is matched against this template and parsed to obtain the multi-dimensional information, which is stored in the database together with the video data.
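Template matching of this kind can be sketched as a mapping from a tag's class to the dimension it carries. The sketch below assumes that "class of tag" corresponds to the HTML `class` attribute and that each field sits in a flat element without nested tags; the class names (`v-title` and so on), `TEMPLATE`, `TemplateParser`, and `parse_dimensions` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser

# Hypothetical template for one page type: tag class -> dimension it carries.
TEMPLATE = {
    "v-title": "title",
    "v-intro": "synopsis",
    "v-episodes": "episode_count",
    "v-time": "video_time",
}


class TemplateParser(HTMLParser):
    """Collect the text of every element whose class appears in the template."""

    def __init__(self, template):
        super().__init__()
        self.template = template
        self.fields = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        for name in (dict(attrs).get("class") or "").split():
            if name in self.template:
                self._current = self.template[name]
                self.fields[self._current] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current] += data

    def handle_endtag(self, tag):
        # Fields are assumed to be flat, so any closing tag ends the field.
        self._current = None


def parse_dimensions(page_html, template=TEMPLATE):
    """Return the multi-dimensional information found in one page."""
    parser = TemplateParser(template)
    parser.feed(page_html)
    return {name: value.strip() for name, value in parser.fields.items()}


sample = (
    '<h1 class="v-title">Ocean Life</h1><p class="v-intro">A documentary.</p>'
    '<span class="v-episodes">12</span><span class="v-time">45:00</span>'
)
print(parse_dimensions(sample))
```

Keeping the template as plain data means that supporting another page type only requires another mapping, which matches the description of one preset template per web page format.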
Referring now to Fig. 2, which is a flowchart of the method for acquiring video resource data according to a preferred embodiment of the present invention. As shown in Fig. 2, the method comprises:
Step S202: obtain the original list of video data according to the provided crawl entry point.
Step S204: obtain the information-carrying page of the video data according to the original list of video data.
Step S206: crawl the video data carried by the information-carrying page.
Step S208: delete the interference information from the crawled video data, the interference information including but not limited to: advertising information, featurette (behind-the-scenes) information, external links, and ranking list information.
Step S210: match the video data carried by the information-carrying page against a preset template, and parse it to obtain multi-dimensional information about the video data, including but not limited to: title information, synopsis information, episode count information, and video time information.
The above embodiment effectively improves the efficiency of crawling video data across the whole network.
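Steps S202 to S210 can be chained into a single function, as in the following sketch. Everything in it is an assumption made to keep the example self-contained: pages come from an in-memory dictionary instead of the network, regular expressions stand in for a real HTML parser, interference blocks are assumed to be marked `class="ad"`, and only the title dimension is parsed.

```python
import re

# Hypothetical site content, keyed by URL, standing in for real HTTP responses.
SITE = {
    "https://video.example.com/list": (
        '<a href="https://video.example.com/v/1">Ocean Life</a>'
        '<a href="https://video.example.com/v/2">Cooking Show</a>'
    ),
    "https://video.example.com/v/1": (
        '<h1>Ocean Life</h1><div class="ad">Buy now</div><p>A documentary.</p>'
    ),
}


def fake_fetch(url):
    """Return the stored page for a URL (stand-in for a real download)."""
    return SITE[url]


def acquire(entry_url, fetch, keyword):
    # S202: obtain the original list through the crawl entry point.
    list_html = fetch(entry_url)
    records = []
    # S204: match the keyword against link text to find information-carrying pages.
    for href, text in re.findall(r'<a href="([^"]+)">([^<]*)</a>', list_html):
        if keyword not in text:
            continue
        # S206: crawl the information-carrying page.
        page = fetch(href)
        # S208: delete interference blocks (assumed to be marked class="ad").
        page = re.sub(r'<div class="ad">.*?</div>', "", page, flags=re.S)
        # S210: parse one dimension, the title, with a fixed pattern.
        title = re.search(r"<h1>(.*?)</h1>", page)
        records.append(
            {"url": href, "title": title.group(1) if title else None, "html": page}
        )
    return records


print(acquire("https://video.example.com/list", fake_fetch, "Ocean"))
```

Passing the fetch function as a parameter keeps the pipeline independent of any particular crawling tool, which the source leaves unspecified.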
According to an embodiment of the present invention, a system for acquiring video resource data is also provided.
Fig. 3 is a structural block diagram of the system for acquiring video resource data according to an embodiment of the present invention. As shown in Fig. 3, the system comprises a first acquisition module 10, a second acquisition module 20, and a crawling module 30. The structure and connections of each module are described in detail below.
The first acquisition module 10 is configured to obtain the original list of video data according to the provided crawl entry point. For example, the original list, which displays a list of many video resources, can be obtained through the provided crawl entry point (a network interface of the video website).
The second acquisition module 20 is coupled to the first acquisition module 10 and is configured to obtain the information-carrying page of the video data according to the original list of video data. Specifically, the second acquisition module 20 matches preset keywords against the original list of video data, determines the uniform resource locator (URL) of the video data to be crawled, and obtains the information-carrying page of the video data from the determined URL.
The information-carrying page is the page one level below the original list: clicking the link of a video in the original list opens the information page of that specific video. Generally, information-carrying pages fall into two kinds: video playback pages and video information display pages. The player on a video playback page can play the specific video data directly online, whereas a video information display page presents the details of the video data in one place, including title information, synopsis information, episode count information, video time information, and so on.
The crawling module 30 is coupled to the second acquisition module 20 and is configured to crawl the video data carried by the information-carrying page. In practice, a web crawling tool can be used to crawl the video data carried by the information-carrying page, thereby obtaining the complete video data of that web page. The objects to be crawled include long-form videos and user-uploaded videos (UGC).
Referring to Fig. 4, on the basis of Fig. 3, the system further comprises:
A parsing module 40, coupled to the crawling module 30 and configured to match the video data carried by the information-carrying page against a preset template and parse it to obtain multi-dimensional information about the video data, including: title information, synopsis information, episode count information, and video time information.
Still referring to Fig. 4, the system further comprises:
A deletion module 50, coupled to the parsing module 40 and configured to determine the interference information in the video data according to the descriptive information of the video data and to delete that interference information, the interference information including: advertising information, featurette (behind-the-scenes) information, external links, and ranking list information.
In addition, the system further comprises:
A storage module 60, configured to store the video data, after the interference information has been removed, in a database in a Document Object Model (DOM tree) structure.
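The source does not name a database or a serialization format for the DOM tree. As one hedged possibility, the tree can be serialized to text and stored in SQLite, keyed by the page URL. The table name `video_page` and the functions `store_dom` and `load_dom` are assumptions introduced for this sketch.

```python
import sqlite3
import xml.etree.ElementTree as ET


def store_dom(conn, url, root):
    """Serialize a DOM tree and store it under the URL of its page."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS video_page (url TEXT PRIMARY KEY, dom TEXT NOT NULL)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO video_page (url, dom) VALUES (?, ?)",
        (url, ET.tostring(root, encoding="unicode")),
    )
    conn.commit()


def load_dom(conn, url):
    """Rebuild the DOM tree stored for a URL, or return None if it is absent."""
    row = conn.execute("SELECT dom FROM video_page WHERE url = ?", (url,)).fetchone()
    return ET.fromstring(row[0]) if row else None


conn = sqlite3.connect(":memory:")
cleaned = ET.fromstring("<body><h1>Ocean Life</h1><p>A documentary.</p></body>")
store_dom(conn, "https://video.example.com/v/1", cleaned)
print(load_dom(conn, "https://video.example.com/v/1").find("h1").text)
```

Storing the whole tree keeps the page structure available for later template matching; a document-oriented database would serve the same purpose.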
The operation steps of the method of the present invention correspond to the structural features of the system; each can be read with reference to the other, so they are not repeated one by one.
According to the technical solution of the present invention, the original list of video data is obtained through the crawl entry point, the information-carrying page of the video data is obtained according to the original list, and the video data carried by the information-carrying page is crawled. Video data across the whole network is thereby crawled effectively, and the efficiency of crawling video data is improved.
The above are merely embodiments of the present invention and are not intended to limit it. For those skilled in the art, the invention may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the scope of its claims.
Claims (10)
1. A method for acquiring video resource data, characterized by comprising:
obtaining the original list of video data according to a provided crawl entry point;
obtaining the information-carrying page of the video data according to the original list of video data;
crawling the video data carried by the information-carrying page.
2. The method according to claim 1, characterized in that the information-carrying page includes: a video playback page and a video information display page.
3. The method according to claim 1, characterized in that obtaining the information-carrying page of the video data according to the original list of video data comprises:
matching preset keywords against the original list of video data to determine the uniform resource locator (URL) of the video data to be crawled;
obtaining the information-carrying page of the video data from the determined URL.
4. The method according to claim 1, characterized by further comprising:
matching the video data carried by the information-carrying page against a preset template, and parsing it to obtain multi-dimensional information about the video data, including: title information, synopsis information, episode count information, and video time information.
5. The method according to claim 1, characterized by further comprising:
determining the interference information in the video data according to the descriptive information of the video data, and deleting the interference information from the video data, the interference information including: advertising information, featurette (behind-the-scenes) information, external links, and ranking list information;
storing the video data, after the interference information has been removed, in a database in a Document Object Model (DOM) structure.
6. A system for acquiring video resource data, characterized by comprising:
a first acquisition module, configured to obtain the original list of video data according to a provided crawl entry point;
a second acquisition module, configured to obtain the information-carrying page of the video data according to the original list of video data;
a crawling module, configured to crawl the video data carried by the information-carrying page.
7. The system according to claim 6, characterized in that the information-carrying page includes: a video playback page and a video information display page.
8. The system according to claim 6, characterized in that the second acquisition module is further configured to match preset keywords against the original list of video data, determine the uniform resource locator (URL) of the video data to be crawled, and obtain the information-carrying page of the video data from the determined URL.
9. The system according to claim 6, characterized by further comprising:
a parsing module, configured to match the video data carried by the information-carrying page against a preset template and parse it to obtain multi-dimensional information about the video data, including: title information, synopsis information, episode count information, and video time information.
10. The system according to claim 6, characterized by further comprising:
a deletion module, configured to determine the interference information in the video data according to the descriptive information of the video data and to delete the interference information from the video data, the interference information including: advertising information, featurette (behind-the-scenes) information, external links, and ranking list information;
a storage module, configured to store the video data, after the interference information has been removed, in a database in a Document Object Model (DOM) structure.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310741187.6A CN103699661A (en) | 2013-12-26 | 2013-12-26 | Method and system for acquiring data of video resources |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310741187.6A CN103699661A (en) | 2013-12-26 | 2013-12-26 | Method and system for acquiring data of video resources |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103699661A (en) | 2014-04-02 |
Family
ID=50361189
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310741187.6A Pending CN103699661A (en) | 2013-12-26 | 2013-12-26 | Method and system for acquiring data of video resources |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103699661A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104572996A (en) * | 2015-01-06 | 2015-04-29 | 百度在线网络技术(北京)有限公司 | Processing method and device for video webpage |
CN104980485A (en) * | 2015-03-16 | 2015-10-14 | 腾讯科技(深圳)有限公司 | Sniffing method, device and system for network resource |
CN105138674A (en) * | 2015-09-08 | 2015-12-09 | 成都博元科技有限公司 | Database access method |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101241553A (en) * | 2008-01-24 | 2008-08-13 | 北京六维世纪网络技术有限公司 | Method and device for recognizing customizing messages jumping-off point and terminal |
CN101833587A (en) * | 2010-05-28 | 2010-09-15 | 上海交通大学 | Network video searching system |
CN101944111A (en) * | 2010-09-09 | 2011-01-12 | 中国科学技术大学 | Method and device for searching news video |
US20120155831A1 (en) * | 2010-12-16 | 2012-06-21 | Canon Kabushiki Kaisha | Information processing apparatus and method therefor |
CN102761623A (en) * | 2012-07-26 | 2012-10-31 | 北京奇虎科技有限公司 | Resource self-adaptive downloading method, system, data storage server and communication system |
2013
- 2013-12-26 CN CN201310741187.6A patent/CN103699661A/en active Pending
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104572996A (en) * | 2015-01-06 | 2015-04-29 | 百度在线网络技术(北京)有限公司 | Processing method and device for video webpage |
CN104572996B (en) * | 2015-01-06 | 2018-09-07 | 百度在线网络技术(北京)有限公司 | The treating method and apparatus of video web-pages |
CN104980485A (en) * | 2015-03-16 | 2015-10-14 | 腾讯科技(深圳)有限公司 | Sniffing method, device and system for network resource |
CN104980485B (en) * | 2015-03-16 | 2019-01-15 | 腾讯科技(深圳)有限公司 | A kind of sniff methods, devices and systems of Internet resources |
CN105138674A (en) * | 2015-09-08 | 2015-12-09 | 成都博元科技有限公司 | Database access method |
CN105138674B (en) * | 2015-09-08 | 2018-11-02 | 成都博元科技有限公司 | A kind of data bank access method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102436513B (en) | Distributed search method and system | |
Mahto et al. | A dive into Web Scraper world | |
CN106982150B (en) | Hadoop-based mobile internet user behavior analysis method | |
CN101676907A (en) | Method and system of directionally acquiring Internet resources | |
CN102486799B (en) | World wide web (WWW) page processing method and device | |
CN105045838A (en) | Network crawler system based on distributed storage system | |
CN104778208A (en) | Method and system for optimally grasping search engine SEO (search engine optimization) website data | |
CN104615627B (en) | A kind of event public feelings information extracting method and system based on microblog | |
CN104899261A (en) | Device and method for constructing structured video image information | |
CN103902664A (en) | Page image rendering method and information providing method and device | |
CN104391978A (en) | Method and device for storing and processing web pages of browsers | |
CN103823907A (en) | Method, device and engine for integrating on-line video resource addresses | |
CN103699661A (en) | Method and system for acquiring data of video resources | |
CN105550179A (en) | Webpage collection method and browser plug-in | |
CN101008946A (en) | Search method of Chinese mobile communication information and device thereof | |
CN106066875B (en) | A kind of high efficient data capture method and system based on deep net crawler | |
CN101261645B (en) | Method and apparatus for obtaining multiple layer information | |
CN106354846A (en) | Intelligent news manuscript selection method and system based on big data | |
CN101257501B (en) | Data leading-in method, system as well as Web server | |
CN105094787A (en) | Method and device for processing enterprise Internet application | |
CN108205548A (en) | A kind of Web Spider structure and its method of work based on agriculture webpage information acquisition | |
CN105630983A (en) | Resource obtaining and optimizing device and method | |
CN104008190A (en) | Crawler system and method thereof | |
CN101977251A (en) | Server-side website resource optimization device and optimization method thereof | |
EP3282372A1 (en) | Method and apparatus for storing data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C41 | Transfer of patent application or patent right or utility model | ||
TA01 | Transfer of patent application right | Effective date of registration: 20151225. Address after: Room six, building 19, building 68, No. 100089 South Road, Haidian District, Beijing. Applicant after: LETV CLOUD COMPUTING CO., LTD. Address before: Room six, building 19, building 68, No. 100089 South Road, Haidian District, Beijing. Applicant before: LeTV Information Technology (Beijing) Co., Ltd. |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20140402 |