Embodiment
In the embodiments of the present invention, ontology knowledge for searching for news video websites is constructed from semantic association information, and this ontology knowledge is used to find news video websites on the Internet. The timeliness of each news video website is evaluated, and the timeliness evaluation result is used to set the crawl interval of that website. Then, according to the crawl interval of each news video website, the content of the website is crawled in real time by a set search method, and the news videos in that content are obtained.
For ease of understanding, the embodiments of the invention are further explained below with reference to the accompanying drawings, taking several specific embodiments as examples. None of these embodiments limits the embodiments of the invention.
Embodiment one
The principle of the news video search method provided by this embodiment is shown schematically in Figure 1. The specific processing flow of the method is shown in Figure 2 and comprises the following steps:
Step 21: Construct ontology knowledge for searching for news video websites from semantic association information; use this ontology knowledge, a meta-search technique, and a website theme identification method to find news video websites on the Internet; and store the news video websites found in the news video site database.
First, a news video database is built in advance from the news video data of a small number of seed websites. This database stores each news video together with its description information. The seed websites include websites such as "www.xinhuanet.com's news" and "rising fast net news".
In the embodiments of the present invention, a news video site database is also built in advance. It stores each news video website together with information such as the evaluation information and the crawl interval of that website.
The ontology knowledge for searching for news video websites is constructed from semantic association information. The construction principle of this ontology knowledge is shown schematically in Figure 3. The semantic association information mainly comprises: the search keywords provided by the search engine itself, the content keywords of news video websites already found, the content organization structure keywords of news video websites already found, and the content description keywords of news video websites already found. The content keywords of a news video website include the keywords in the titles of its content, and the content description keywords of a news video website include hot video titles. The ontology knowledge therefore mainly contains four kinds of keywords: search keywords, content keywords, content organization structure keywords, and content description keywords.
For each keyword in the ontology knowledge, the meta-search technique is used to construct a search request to the search engines on the Internet, a set number of the search results returned by the search engines are taken, and the URLs (Uniform Resource Locators) contained in the returned results are extracted. The URLs of news video websites among these URLs are then identified by the website theme identification method.
The processing flow of one such website theme identification method provided by this embodiment is shown in Figure 4. The processing mainly comprises:
First, the pattern information of each URL contained in the returned results, such as the length, depth, and format of the URL, is used with techniques such as a decision tree or a rule set to identify whether the URL is a website URL or a web page URL.
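The rule-set variant of this URL classification can be illustrated with a minimal sketch. The text names length, depth, and format as the features but gives no concrete values, so the limits and the suffix list below, as well as the function name `is_site_url`, are assumptions made only for illustration.

```python
from urllib.parse import urlparse

# Hypothetical limits: the source gives no concrete values for these rules.
MAX_SITE_URL_LENGTH = 40
MAX_SITE_PATH_DEPTH = 1
PAGE_SUFFIXES = (".html", ".htm", ".shtml", ".php", ".jsp", ".asp")


def is_site_url(url: str) -> bool:
    """Return True if the URL looks like a website entry point, not a single page."""
    parsed = urlparse(url)
    # Depth = number of non-empty path segments.
    segments = [s for s in parsed.path.split("/") if s]
    if parsed.query or len(url) > MAX_SITE_URL_LENGTH:
        return False  # long URLs or URLs with a query string are treated as pages
    if len(segments) > MAX_SITE_PATH_DEPTH:
        return False  # deep paths are treated as pages
    # Format rule: a path ending in a page-like suffix is a web page URL.
    return not parsed.path.lower().endswith(PAGE_SUFFIXES)
```

A decision tree trained on labeled URLs could replace these hand-written rules while using the same three features.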
For each website URL identified, all web pages in the first layer of the website are fetched, and a playback page recognition technique is used to calculate the proportion of video playback pages among those web pages. If this proportion is less than a predefined video playback page threshold, the website URL is considered irrelevant to the news video website theme and is excluded; otherwise, the website URL is considered relevant to the news video website theme.
For a website relevant to the news video website theme, the link text (anchor text) corresponding to each video playback page in the website is used to run a fuzzy query against the news video database built in advance, and the total number of similar results is counted. The average number of similar results per link text is then calculated. If this average is less than a predefined similar result count threshold, the website is considered irrelevant to the news video website theme; otherwise, the website URL is considered relevant to the news video website theme, that is, the website is identified as a news video website.
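The two-stage test described above (playback page proportion, then average number of similar results per anchor text) can be sketched as follows. The playback page recognizer and the fuzzy query are passed in as caller-supplied functions, because the text does not specify them; both hooks, the function name, and the default threshold values are hypothetical.

```python
def is_news_video_site(pages, is_playback_page, count_similar,
                       playback_threshold=0.2, similar_threshold=1.0):
    """Decide whether a website matches the news video website theme.

    pages: list of (page_url, anchor_text) for the first layer of the website.
    is_playback_page(url) -> bool and count_similar(anchor_text) -> int are
    hypothetical hooks; the thresholds are illustrative, not from the source.
    """
    if not pages:
        return False
    playback = [(url, text) for url, text in pages if is_playback_page(url)]
    # Stage 1: proportion of video playback pages in the first layer.
    if not playback or len(playback) / len(pages) < playback_threshold:
        return False
    # Stage 2: average number of similar results per anchor text.
    total_similar = sum(count_similar(text) for _, text in playback)
    return total_similar / len(playback) >= similar_threshold
```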
Then, the news video websites identified are stored in the news video site database built in advance.
In the embodiments of the present invention, the news video websites identified by the website theme identification method can also be used to evaluate the constructed ontology knowledge in two respects: new URL generation capability and topic relevance. The processing flow of this evaluation is shown in Figure 5 and mainly comprises the following:
For each keyword in the ontology knowledge, the meta-search technique is used to construct a search request to the search engines on the Internet, a set number of the search results returned by the search engines are taken, and the URLs contained in the returned results are extracted.
The URLs of news video websites among these URLs are obtained by the website theme identification method, and the ratio of the number of news video website URLs to the total number of URLs contained in the returned results is calculated. If this ratio is less than a predetermined topic relevance threshold, the keyword is considered irrelevant to the news video website theme and is removed from the ontology knowledge; otherwise, the keyword is considered relevant to the news video website theme, and the keyword is further evaluated for new URL generation capability.
All the news video website URLs identified are looked up in the news video site database. From the lookup results, the ratio of the number of news video website URLs not yet contained in the news video site database to the total number of news video website URLs is calculated. If this ratio is less than a predefined new URL generation capability threshold, the keyword is considered to have no new URL generation capability and is removed from the ontology knowledge; otherwise, the keyword is considered to have both topic relevance and new URL generation capability.
In general, setting both the topic relevance threshold and the new URL generation capability threshold to 0.1 gives good results.
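The two ratio tests applied to each keyword can be sketched as below. The function name and argument layout are assumptions for illustration; the default thresholds of 0.1 are the values suggested in the text.

```python
def keep_keyword(result_urls, news_site_urls, known_site_urls,
                 relevance_threshold=0.1, new_url_threshold=0.1):
    """Return True if the keyword should stay in the ontology knowledge.

    result_urls: all URLs extracted from the search results for the keyword.
    news_site_urls: the subset identified as news video websites.
    known_site_urls: URLs already stored in the news video site database.
    """
    if not result_urls or not news_site_urls:
        return False
    # Topic relevance: share of returned URLs that are news video websites.
    if len(news_site_urls) / len(result_urls) < relevance_threshold:
        return False
    # New URL generation capability: share of those websites not yet stored.
    new_urls = [u for u in news_site_urls if u not in known_site_urls]
    return len(new_urls) / len(news_site_urls) >= new_url_threshold
```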
Step 22: Evaluate the timeliness, novelty, and originality of the news video websites stored in the news video database, and use the timeliness evaluation result of each news video website to set its crawl interval.
The news video websites stored in the news video database are evaluated in three respects: timeliness, novelty, and originality.
The processing flow provided by this embodiment for evaluating the timeliness of the news video websites stored in the news video database is shown in Figure 6. The processing comprises:
A certain number of the current day's news videos are obtained from the seed websites, and a fuzzy query is run against the news video database for each of these videos. For each news video website in the news video database, the number of news videos similar to the current day's news videos is counted. When one news video retrieves several similar news videos that belong to the same news video website, they are counted only once.
All news video websites are sorted in descending order of the number of news videos similar to the current day's news videos. Websites ranked in the top 10% are given 5 points, those ranked from 10% to 30% are given 4 points, those ranked from 30% to 70% are given 3 points, those ranked from 70% to 90% are given 2 points, and those in the last 10% are given 1 point. In addition, a news video website whose number of similar news videos is 0 is directly given 0 points.
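The rank-band scoring above can be sketched as follows. The text does not say whether websites with a count of 0 take part in the ranking; this sketch assumes they are scored 0 first and excluded from the ranking. The names `BANDS` and `timeliness_scores` are illustrative.

```python
# Rank bands from the text: (band edge in tenths of the ranking, points).
BANDS = [(1, 5), (3, 4), (7, 3), (9, 2)]


def timeliness_scores(similar_counts):
    """Map {site: number of similar same-day videos} to {site: score 0..5}."""
    # Assumption: zero-count sites get 0 points and are left out of the ranking.
    scores = {site: 0 for site, count in similar_counts.items() if count == 0}
    ranked = sorted((s for s in similar_counts if s not in scores),
                    key=lambda s: similar_counts[s], reverse=True)
    n = len(ranked)
    for i, site in enumerate(ranked):
        # i sites are ranked ahead of this one; compare i/n with each band
        # edge using integers to avoid floating-point edge cases.
        scores[site] = next((pts for edge, pts in BANDS if 10 * i < edge * n), 1)
    return scores
```

With ten ranked websites this yields one website with 5 points, two with 4, four with 3, two with 2, and one with 1. Ties keep their input order, which is a choice of this sketch rather than something the text specifies.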
Finally, the timeliness evaluation result of each news video website is stored in the news video site database as the measure of the content timeliness of that website.
The timeliness evaluation result of each news video website is used to set its crawl interval. The crawl interval of each news video website is set according to the number of news videos it contains that are similar to the current day's news videos: the more such videos a website contains, the shorter its crawl interval.
One feasible way to set the crawl interval is as follows: 5 minutes for a news video website with a timeliness score of 5, 10 minutes for a score of 4, 20 minutes for a score of 3, 40 minutes for a score of 2, 80 minutes for a score of 1, and 1 day for a score of 0.
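This mapping from timeliness score to crawl interval is a simple lookup table. The sketch below uses the values given in the text; the names `CRAWL_INTERVALS` and `crawl_interval` are illustrative.

```python
from datetime import timedelta

# Timeliness score -> crawl interval, using the values given in the text.
CRAWL_INTERVALS = {
    5: timedelta(minutes=5),
    4: timedelta(minutes=10),
    3: timedelta(minutes=20),
    2: timedelta(minutes=40),
    1: timedelta(minutes=80),
    0: timedelta(days=1),
}


def crawl_interval(timeliness_score: int) -> timedelta:
    """Return the crawl interval for a website with the given score (0..5)."""
    return CRAWL_INTERVALS[timeliness_score]
```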
The processing flow provided by this embodiment for evaluating the novelty of the news video websites stored in the news video database is shown in Figure 7. The processing comprises:
A content-based duplicate detection technique is used to cluster the news videos newly obtained from each news video website, and from each cluster a certain number of news videos with the earliest discovery times are retained. Then, for each news video website, the total number of clicks on all of its retained news videos is counted, and the average number of clicks per news video is calculated.
The news video websites are sorted in descending order of the average number of clicks per news video. Websites ranked in the top 10% are given 5 points, those ranked from 10% to 30% are given 4 points, those ranked from 30% to 70% are given 3 points, those ranked from 70% to 90% are given 2 points, and those in the last 10% are given 1 point. In addition, a news video website whose average number of clicks per video is 0 is directly given 0 points.
Finally, the novelty evaluation result of each news video website is stored in the news video site database as the measure of the novelty of that website.
The processing flow provided by this embodiment for evaluating the originality of the news video websites stored in the news video database is shown in Figure 8. The processing comprises:
A content-based duplicate detection technique is used to cluster the news videos newly obtained from each news video website. From each cluster, a certain number of news videos with the earliest discovery times are retained, and the remaining news videos are treated as follow-up news videos. The total number of videos and the number of repeated videos of each news video website are counted, and the repetition ratio of each news video website is calculated. All news video websites are sorted in ascending order of repetition ratio. Websites ranked in the top 10% are given 5 points, those ranked from 10% to 30% are given 4 points, those ranked from 30% to 70% are given 3 points, those ranked from 70% to 90% are given 2 points, and those in the last 10% are given 1 point. In addition, a news video website whose repetition ratio is 100% is directly given 0 points.
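The repetition ratio per website can be computed from the clustering output as sketched below. This assumes that the videos not retained in a cluster (the follow-up videos) are the ones counted as repeated, and that each cluster is given as a list of (site, discovery time) pairs; both the data layout and the name `repetition_ratios` are assumptions for illustration.

```python
from collections import Counter


def repetition_ratios(clusters, keep=1):
    """Return {site: repeated videos / total videos}.

    clusters: list of clusters, each a list of (site, discovery_time) pairs
    for videos judged to be duplicates of one another.
    keep: how many of the earliest videos in each cluster count as retained.
    """
    total, repeated = Counter(), Counter()
    for cluster in clusters:
        ordered = sorted(cluster, key=lambda item: item[1])  # earliest first
        for rank, (site, _) in enumerate(ordered):
            total[site] += 1
            if rank >= keep:  # a follow-up copy of an earlier video
                repeated[site] += 1
    return {site: repeated[site] / total[site] for site in total}
```

Sorting the resulting ratios in ascending order and applying the rank bands then gives the originality score.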
Finally, the originality evaluation result of each news video website is stored in the news video site database as the measure of the originality of that website.
The processing flow of one such content-based duplicate detection technique provided by this embodiment is shown in Figure 9. The processing is as follows:
First, a certain number of video key frames are extracted from each news video, and corner points are detected in each key frame with the Harris operator. SIFT (scale-invariant feature transform) features are used to construct feature vectors for the corner point sub-regions of each key frame, and PCA (principal component analysis) is used to reduce the dimensionality of these feature vectors. For each pair of key frames taken from the two news videos, the KNN (K-nearest neighbors) algorithm is used to compute the K closest feature vector pairs. The BIC (Bayesian information criterion) algorithm is then applied to the feature value sequence X = {x1, x2, ..., xN} (N = 2K) formed by these K feature vector pairs. If a jump point exists in the sequence X, the two key frames are judged not to be duplicates; otherwise, the two key frames are judged to be duplicates.
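The text does not give the exact form of the BIC test or say how the sequence X is assembled from the K feature vector pairs. As an illustration only, the sketch below uses one common formulation for a one-dimensional sequence: a single Gaussian over the whole sequence is compared with two Gaussians split at each candidate position, and a jump point is reported when the split improves the log-likelihood by more than the BIC penalty. The function names, `penalty_weight`, `min_segment`, and the variance floor `eps` are all assumptions.

```python
import math


def _log_variance(values, eps=1e-6):
    """Natural log of the (biased) sample variance, floored by eps."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return math.log(variance + eps)


def has_jump_point(sequence, penalty_weight=1.0, min_segment=2):
    """Return True if some split of the sequence is preferred under BIC."""
    n = len(sequence)
    if n < 2 * min_segment:
        return False
    whole = 0.5 * n * _log_variance(sequence)
    # The two-segment model has two extra parameters: (2 / 2) * log(n).
    penalty = penalty_weight * math.log(n)
    for i in range(min_segment, n - min_segment + 1):
        left, right = sequence[:i], sequence[i:]
        gain = (whole
                - 0.5 * len(left) * _log_variance(left)
                - 0.5 * len(right) * _log_variance(right))
        if gain - penalty > 0:
            return True
    return False
```

Under the rule in the text, two key frames would be judged duplicates when `has_jump_point` returns False for their sequence X.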
The number of duplicate key frames between the two news videos is counted, and the proportion of duplicate key frames among all key frames is calculated. If this proportion is greater than the set key frame threshold, the two news videos are judged to be duplicates; otherwise, the two news videos are judged not to be duplicates.
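The video-level decision can be sketched as below. The text does not specify how "all key frames" is counted for a pair of videos; this sketch assumes the key frames of both videos are pooled and that a key frame is duplicated if it matches any key frame of the other video. The key frame comparison is a caller-supplied function, and the threshold value is hypothetical.

```python
def videos_are_duplicates(frames_a, frames_b, frames_match, ratio_threshold=0.5):
    """Judge two videos as duplicates from their key frame matches.

    frames_match(frame_a, frame_b) -> bool is a hypothetical hook that wraps
    the key-frame-level duplicate test; ratio_threshold is illustrative.
    """
    total = len(frames_a) + len(frames_b)
    if total == 0:
        return False
    # Key frames of A that have a duplicate in B, and the reverse.
    dup_a = sum(any(frames_match(a, b) for b in frames_b) for a in frames_a)
    dup_b = sum(any(frames_match(a, b) for a in frames_a) for b in frames_b)
    return (dup_a + dup_b) / total > ratio_threshold
```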
Step 23: According to the crawl interval of each news video website, crawl the news videos in the news video website in real time by a set search method, and store the crawled news videos in the news video database.
The processing flow provided by this embodiment for crawling, in real time, the content of the news video websites stored in the news video database is shown in Figure 10. The processing is as follows:
First, the URL and the timeliness evaluation result of each news video website are obtained from the news video site database, and the timeliness evaluation result is used to set the crawl interval of the website. One feasible way to set the crawl interval is as follows: 5 minutes for a news video website with a timeliness score of 5, 10 minutes for a score of 4, 20 minutes for a score of 3, 40 minutes for a score of 2, 80 minutes for a score of 1, and 1 day for a score of 0.
Each news video website in the news video site database is checked in turn, in a certain order, to determine whether the time elapsed since its last crawl finished exceeds its crawl interval. If so, a new round of content crawling is started for that website; otherwise, the next website is checked in the same way.
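This check can be sketched as a single pass over the site records. The record layout (a dictionary with `url`, `last_finished`, and `interval` keys) and the treatment of a website that has never been crawled as due are assumptions for illustration.

```python
from datetime import datetime, timedelta


def sites_due(sites, now):
    """Return the URLs of the websites whose crawl interval has elapsed.

    sites: iterable of dicts with 'url', 'last_finished' (datetime or None)
    and 'interval' (timedelta). A site never crawled before is treated as due.
    """
    due = []
    for site in sites:  # the iteration order is the "certain order" of the text
        last = site["last_finished"]
        if last is None or now - last > site["interval"]:
            due.append(site["url"])
    return due
```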
For each news video website to be crawled, its content is crawled by a set search method. The set search method includes methods such as depth-limited breadth-first search.
The news video website is traversed with depth-limited breadth-first search. The depth limit can be a global constant, or it can vary from one news video website to another. For each web page encountered during the traversal, the playback page recognition technique is first used to judge whether it is a video playback page. For a video playback page, a web page noise removal technique is used to remove the noise information it contains; the noise here includes background noise, random noise, and residual noise. The information remaining in the video playback page is taken as the news video.
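A depth-limited breadth-first traversal can be sketched as follows. Page fetching, link extraction, and playback page recognition are outside the scope of the sketch and are passed in as caller-supplied functions; `get_links`, `is_playback_page`, and the default depth limit are hypothetical.

```python
from collections import deque


def crawl_site(start_url, get_links, is_playback_page, max_depth=2):
    """Return the video playback pages reachable within max_depth links.

    get_links(url) -> iterable of URLs and is_playback_page(url) -> bool are
    hypothetical hooks supplied by the caller.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    playback_pages = []
    while queue:
        url, depth = queue.popleft()  # FIFO order gives breadth-first traversal
        if is_playback_page(url):
            playback_pages.append(url)
        if depth == max_depth:
            continue  # depth limit reached: do not expand this page
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return playback_pages
```

Passing a different `max_depth` per website corresponds to the case in which the depth limit varies with the news video website.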
The content-based duplicate detection technique described above is used to perform duplicate detection on this news video. For a news video that passes duplicate detection, an iterative back-projection method in the video compressed domain is used to improve its picture quality. After the news video is transcoded with an existing tool, a news video in the MP4 or FLV (a streaming media format) container format is obtained. Then, the news video and its description information are stored in the news video database. When the crawl of a news video website finishes, the finish time is stored in the news video site database.
The news videos in the above news video site database can be used by a video-on-demand system oriented to a TV news portal. The descriptions and associated information of the news videos can be pushed to the Portal website. After a user's STB (set-top box) accesses the Portal website, the latest news video list can be seen, and the user can browse, order, and play on demand the news videos in the list.
Embodiment two
The structure of a news video search apparatus provided by this embodiment is shown schematically in Figure 11. The apparatus comprises the following modules:
News video site search module 11, configured to construct ontology knowledge for searching for news video websites from semantic association information, and to use the ontology knowledge to find news video websites on the Internet;
News video website evaluation module 12, configured to evaluate the timeliness of the news video websites found by the news video site search module, and to use the timeliness evaluation result to set the crawl interval of each news video website;
News video acquisition module 13, configured to crawl, according to the crawl interval set by the news video website evaluation module and by a set search method, the content of each news video website in real time, and to obtain the news videos in that content.
The news video search apparatus can further comprise:
Ontology knowledge evaluation module 14, configured to, for each keyword in the ontology knowledge, use the meta-search technique to construct a search request to the search engines on the Internet, take a set number of the search results returned by the search engines, and extract the URLs contained in the returned results.
The URLs of news video websites among these URLs are obtained by the website theme identification method, and the ratio of the number of news video website URLs to the total number of URLs contained in the returned results is calculated. If this ratio is less than a predetermined topic relevance threshold, the keyword is considered irrelevant to the news video website theme and is removed from the ontology knowledge; otherwise, the keyword is considered relevant to the news video website theme, and the keyword is further evaluated for new URL generation capability.
All the news video website URLs identified are looked up in the news video site database. From the lookup results, the ratio of the number of news video website URLs not yet contained in the news video site database to the total number of news video website URLs is calculated. If this ratio is less than a predefined new URL generation capability threshold, the keyword is considered to have no new URL generation capability and is removed from the ontology knowledge; otherwise, the keyword is considered to have both topic relevance and new URL generation capability.
The news video site search module 11 can specifically comprise:
Search module 111, configured to, for each keyword in the ontology knowledge, use the meta-search technique to construct a search request to the search engines on the Internet, take a set number of the search results returned by the search engines, and extract the uniform resource locators (URLs) contained in the returned results;
Identification module 112, configured to identify, by the website theme identification method, the URLs of news video websites among the URLs extracted by the search module, and to store the news video websites identified in the news video site database built in advance.
The news video website evaluation module 12 can specifically comprise:
Statistics module 121, configured to obtain a certain number of the current day's news videos from the seed websites, run a fuzzy query against the news video database for these videos, count for each news video website in the news video database the number of news videos similar to the current day's news videos, and store this number in the news video site database as the timeliness evaluation result of the news video website;
Setting module 122, configured to set the crawl interval of each news video website according to the number of news videos it contains that are similar to the current day's news videos, such that a news video website with a larger number has a shorter crawl interval.
The news video acquisition module 13 can specifically comprise:
Crawl module 131, configured to crawl the content of a news video website in the news video site database by a set search method once the time elapsed since the last crawl of that website finished exceeds its crawl interval;
Identification module 132, configured to use the playback page recognition technique to judge whether each web page crawled from the news video website is a video playback page, and, for each page judged to be a video playback page, to remove the noise information it contains and take the remaining information as the news video;
Detection and enhancement module 133, configured to perform duplicate detection on the news video with the content-based duplicate detection technique, to enhance the quality of a news video that passes duplicate detection with an iterative back-projection method in the video compressed domain, and then to store the news video and its description information in the news video database.
The news video website evaluation module 12 can further comprise:
Novelty evaluation module 123, configured to cluster the news videos newly obtained from each news video website with the content-based duplicate detection technique, and to retain from each cluster a certain number of news videos with the earliest discovery times. Then, for each news video website, the total number of clicks on all of its retained news videos is counted, and the average number of clicks per news video is calculated.
The crawl interval of each news video website is set according to the number of news videos it contains that are similar to the current day's news videos, such that a news video website with a larger number has a shorter crawl interval.
The novelty of each news video website is evaluated according to the average number of clicks per news video, and the novelty evaluation result of each news video website is stored in the news video site database as the measure of the novelty of that website.
Originality evaluation module 124, configured to cluster the news videos newly obtained from each news video website with the content-based duplicate detection technique, to retain from each cluster a certain number of news videos with the earliest discovery times, and to treat the remaining news videos as follow-up news videos. The total number of videos and the number of repeated videos of each news video website are counted, and the repetition ratio of each news video website is calculated.
The originality of each news video website is evaluated according to its repetition ratio, and the originality evaluation result of each news video website is stored in the news video site database as the measure of the originality of that website.
Those of ordinary skill in the art will understand that all or part of the processes of the above method embodiments can be carried out by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, can include the processes of the above method embodiments. The storage medium can be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
In summary, the embodiments of the invention effectively solve the problem of searching for and integrating Internet news videos automatically, accurately, and in a timely manner: news video websites can be identified quickly and accurately, and news videos can be discovered and integrated automatically and promptly.
The embodiments of the invention propose a system and method for searching for and integrating Internet news videos for a TV news portal. They can supply rich, high-quality Internet news video resources to a video-on-demand system oriented to the TV news portal, and can supply the TV news portal with the necessary news video material and description information.
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to it. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. The protection scope of the present invention shall therefore be determined by the protection scope of the claims.