TW200951831A - Methods and systems for representation and matching of video content - Google Patents

Methods and systems for representation and matching of video content

Info

Publication number
TW200951831A
Authority
TW
Taiwan
Prior art keywords
video
visual
correspondence
time
video data
Prior art date
Application number
TW98112574A
Other languages
Chinese (zh)
Inventor
Alexander Bronstein
Michael Bronstein
Shlomo Selim Rakib
Original Assignee
Novafora Inc
Priority date
Filing date
Publication date
Priority claimed from US 12/349,469 (published as US 8,358,840 B2)
Application filed by Novafora Inc
Publication of TW200951831A

Landscapes

  • Image Analysis (AREA)

Abstract

The described methods and systems provide for the representation and matching of video content, including spatio-temporal matching of different video sequences. A particular method of determining temporal correspondence between different sets of video data inputs the sets of video data and represents each set as an ordered sequence of visual nucleotides. Temporally corresponding subsets of the video data are then determined by aligning the sequences of visual nucleotides.

Description

VI. Description of the Invention:

[Technical Field of the Invention]

A system and method for video analysis, and in particular a system and method for comparing video sequences and searching for correspondences and similarities between them.

[Prior Art]

In the fields of video analysis and computer vision, the matching of video sequences is an important problem, and a basic building block of many commercial applications such as video search, content-based retrieval, video authentication, and copyright (unauthorized copy) detection.

A representative example of the prior art in video sequence matching is "Video Google: a text retrieval approach to object matching in videos", by J. Sivic and A. Zisserman, Ninth IEEE International Conference on Computer Vision (ICCV'03), Volume 2, 2003. That work addresses object matching and scene retrieval in video search, allowing a user to localize (outline) all occurrences of an object of interest.

The methods proposed in the prior art share a common problem: they approach video analysis by treating the video as a collection of still images. Such methods therefore require substantial computing power and suffer comparatively high error rates. In particular, the simpler "single frame" image analysis methods have only limited discriminative power; for example, they cannot distinguish an image of an apple (the fruit) appearing in video content about fruit from an image of the same apple appearing in video content about computers (as a logo image).

In view of the above, there is a need for a video analysis method that requires less computing power and offers higher reliability, one that does better than analyzing video frames in isolation from the overall video content.

[Summary of the Invention]

One key idea of the invention is to approach the problem of video analysis from a biological point of view, with insight and inspiration drawn from bioinformatics. In particular, it is useful to think of the distinctive features of video data as generalized "atoms", and of abstracted representations of the different frames of a video as generalized "nucleotides" built from those atoms. The video itself then resembles an ordered nucleotide sequence, such as a DNA or RNA macromolecule, and the video analysis problem becomes a generalized bioinformatics sequence matching problem.

The invention provides an improved way of matching video sequences at different temporal and spatial granularities: matching entire video sequences (for example, determining that two video sequences are basically similar despite various kinds of distortion or editing); identifying temporally similar portions within individual videos (for example, finding the portion of a video library that best matches a video of interest); and determining what "thing" in a database best corresponds to a "thing" of interest appearing in a particular portion of video.

According to the invention, a video is represented by a hierarchy of feature descriptors. The feature descriptors are chosen to be robust (largely invariant to image distortions such as rotation, different lighting conditions, different resolutions, and so on), and they convey the visual information in data units that computers can process. Because of the similarity between this approach and the techniques of bioinformatics, the approach is called "video genomics" here.

We have found it useful to divide the large problem of spatio-temporal video matching into two stages. In the first stage, the video media of interest are matched at the temporal level. In the second stage, the "things" of interest within the temporally matched video frames are analyzed at the spatial level, and the corresponding "things" across different temporally matched frames are detected.

Spatial and temporal distortions, and spatial editing of the video content (for example, changes of resolution or frame rate, or overlaid subtitles), may change the representation; temporal editing may insert or delete parts of it. Pursuing the analogy with biology and bioinformatics once more, it is useful to regard such changes to a video as the biological concept of "mutations". Bioinformatics has mature methods for expressing the similarity between DNA sequences regardless of the type of mutation, and generalized versions of those techniques are helpful for analyzing different video sequences. With this approach, video media that differ by deliberate changes ("mutations") such as different resolutions, frame rates, subtitles, insertions, or deletions can still be matched accurately.

The invention discloses an apparatus, and methods, for the representation and matching of video content, as follows.
The disclosed apparatus for the representation and matching of video content comprises at least: a video data source; a video segmenter, coupled to the video data source, for dividing the video data into a plurality of temporal intervals; a video processor, coupled to the video data source, for detecting the locations of feature points in the video data, generating feature point descriptors corresponding to the feature point locations, and pruning the detected feature point locations to produce a subset of feature point locations; and a video aggregator, coupled to the video segmenter and the video processor, for generating a video DNA covering the video data, wherein the video DNA comprises an ordered sequence of visual nucleotides.
The disclosed method for the representation and matching of video content comprises at least the steps of: inputting a plurality of sets of video data; representing the video data as ordered sequences of visual nucleotides; aligning the sequences of visual nucleotides to determine temporally corresponding subsets of the video data; computing the spatial correspondence of the video data between the temporally corresponding subsets (the spatio-temporal correspondence); and outputting the spatio-temporal correspondence between the subsets of the video data.

A further disclosed method for the representation and matching of video content comprises at least the steps of: inputting a plurality of sets of video data; representing the video data as ordered sequences of visual nucleotides, wherein the video data are divided into a plurality of temporal intervals, at least one visual nucleotide is computed for each temporal interval, each visual nucleotide is an aggregate of the set of visual atoms from a given temporal interval of the video data, and each visual atom describes the visual content of a local spatio-temporal region of the video data; constructing the visual atoms by the steps of detecting invariant feature points within the temporal interval, computing a set of descriptors of the local spatio-temporal region of the video around each invariant feature point, removing the invariant feature points whose descriptors do not satisfy a selection criterion, and building the set of visual atoms from the locations and descriptors of the remaining invariant feature points; aligning the sequences of visual nucleotides to determine a plurality of temporally similar subsets of the video data; computing the spatial correspondence of the video data between the temporally similar subsets (the spatio-temporal correspondence); and outputting the spatio-temporal correspondence between the subsets of the video data.

A further disclosed method for the representation and matching of video content comprises at least the steps of: building a plurality of visual nucleotide sequences by the steps of tracking a plurality of feature points over a series of temporally consecutive video frames, so as to delete feature points detected in only one of the video frames; averaging the tracked feature points and discarding, on the basis of the average, anomalous feature points; assigning the remaining feature points, by nearest match, to a standard array of suitable feature point types; and counting the number of feature points of each type assigned over the series of temporally consecutive video frames, thereby building a histogram over the feature point types, wherein each visual nucleotide comprises the coefficients of such a standard histogram and the video data are represented by an ordered sequence of visual nucleotides; aligning the visual nucleotide sequences to determine a plurality of temporally similar subsets of the video data; computing the spatial correspondence of the video data between the temporally similar subsets (the spatio-temporal correspondence); and outputting the spatio-temporal correspondence between the subsets of the video data.

The apparatus and methods disclosed above differ from the prior art in that the invention, after describing the input sets of video data as ordered sequences of visual nucleotides, determines the temporal correspondences between the different sets of video data from the aligned visual nucleotide sequences. This resolves the problems of the prior art and achieves the technical effect of a video analysis method that requires less computing power and offers higher reliability.
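As a rough illustration only, the following Python sketch shows the flow shared by the disclosed methods. The four callables are placeholders for the stages described in the embodiments below; the patent itself prescribes no particular implementation.

    def spatiotemporal_correspondence(video_a, video_b, *,
                                      segment, nucleotide, align, match_spatially):
        # segment: video -> list of temporal intervals (hypothetical helper)
        # nucleotide: interval -> visual nucleotide (hypothetical helper)
        # Represent each video as an ordered sequence of visual nucleotides.
        dna_a = [nucleotide(iv) for iv in segment(video_a)]
        dna_b = [nucleotide(iv) for iv in segment(video_b)]
        # Stage 1: temporal matching by aligning the two nucleotide sequences.
        pairs = align(dna_a, dna_b)
        # Stage 2: spatial matching, run only on temporally matched intervals,
        # which turns one large 3D problem into small 2D ones.
        return [(a, b, match_spatially(a, b)) for a, b in pairs]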
[Embodiments]

Before discussing the video DNA (or video genomics) approach of the invention in more detail, let us first consider the general video analysis problem in more theoretical, mathematical terms.

At the mathematical level, the generic matching problem can be viewed as two related parts: one is the degree of similarity (the similarity), the other is the corresponding portions (the correspondence). Given two video sequences, the goal of the former is to compute a value representing the degree to which the two are similar; the goal of the latter is to find the mutually corresponding portions of the two sequences.

Since a video can be regarded as spatio-temporal data containing two spatial dimensions and one temporal dimension (that is, a number of 2D video frames together with the temporal sequence of those frames), temporal correspondence and spatial correspondence can be treated separately. Temporal correspondence is processed at the granularity of the time between different video frames: the video sequence is treated as a one-dimensional, ordered array of frames, and matching places the frames of the two video sequences in correspondence. Spatial correspondence is processed at the sub-frame granularity, finding the corresponding pixels, or blocks of pixels representing "things", within two frames taken from the two sequences.

Similarity and correspondence are in fact closely related, and once one of them has been computed, the other usually follows readily. For example, similarity can be defined as the number of mutually corresponding portions of the video content. Conversely, if a similarity criterion can be set for the different portions of the video sequences, the correspondence can be defined as the portions with maximal similarity.

Two notions of "similarity" should be distinguished at the outset: semantic similarity and visual similarity. "Visual" similarity means that two objects look alike, that is, their pixel representations are similar; "semantic" similarity means that the concepts the two objects represent are alike. Semantic similarity covers a far wider range than visual similarity. For example, a truck and a Ferrari sports car are visually very different yet semantically close (both convey the concept of a vehicle). Visual similarity is therefore the easier to quantify and measure, whereas semantic similarity is more subjective and depends on the problem under consideration.

Video signals often contain noise or distortion, arising for example from different viewpoints, lighting conditions, editing, or resolutions. An ideal similarity criterion must remain unchanged under these and other variable factors. By convention, if a similarity criterion judges two objects similar no matter how differently they are illuminated, the criterion is said to be invariant to lighting conditions.

The systems and methods described below achieve invariance to editing and to distortion in the matching of video sequences. More specifically, the provided systems and methods for spatio-temporal matching are built on the visual similarity described above in such a way that they are invariant to temporal distortion (changes of frame rate), temporal editing (deletion or insertion of frames), spatial distortion (operations on the pixels), and spatial editing (deletion of objects from, or insertion of objects into, the frame).

From a mathematical point of view, the spatio-temporal matching problem can be expressed as follows: given two video sequences, the first represented in spatio-temporal coordinates as (x, y, t) and the second as (x', y', t'), the correspondence between (x, y, t) and (x', y', t') is what is sought.

If the video data are thought of as a three-dimensional array of pixels, the matching problem amounts to finding corresponding elements of two three-dimensional arrays. In general, such a matching problem is computationally far too hard to be practical (it is NP-complete), because, absent further simplification, a computing system would have to try every possible matching subset of pixels across the two sequences, requiring an enormous number of operations.

As mentioned above, however, dividing the matching problem into temporal matching and spatial matching greatly reduces the difficulty. Of the two, spatial matching is the more complex, since it requires a large amount of two-dimensional computation. The one-dimensional temporal matching problem is still complicated, but with the video DNA (video genomics) dynamic programming procedures of the invention, one-dimensional (temporal) signals can be matched effectively by comparatively simple means.

According to the invention, the spatio-temporal matching process between video sequences can be discussed in terms of two main stages, as shown in FIG. 1.
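Restating the two-stage decomposition in symbols (the notation below is supplied for clarity and is not taken from the original text):

    % Stage 1 (temporal): an alignment   \tau : t \mapsto t'
    % Stage 2 (spatial), on each aligned frame pair (t, \tau(t)):
    %                    \sigma_t : (x, y) \mapsto (x', y')
    % so that the full correspondence factors as
    %     (x, y, t) \longleftrightarrow (\sigma_t(x, y), \tau(t)).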
Temporal matching is the first stage 100 of FIG. 1 (this step is discussed in more detail below). Temporally matching the time coordinate t within a subset of the first video sequence and the time coordinate t' within a subset of the second video sequence produces a correspondence. By matching temporally first, we avoid having to try, in two-dimensional space, every possible matching subset of pixels of the two video sequences (which is in essence the three-dimensional matching problem). Furthermore, because the complexity of the matching problem falls as the number of pixels involved falls, spatial matching need only be performed between the comparatively small, temporally corresponding subsets of the two video sequences. In other words, for the purposes of spatial matching, one large 3D matching problem is converted into 2D matching problems over a relatively small collection of 2D video frames. For example, in attempting to match the object "apple" in the upper video sequence of FIG. 1 with the object "apple" in the lower video sequence, only the small number of most relevant frames of "sequence A" and "sequence B" are examined.

If one of the video sequences is a short query, the temporally corresponding portions of the two video sequences are few, which greatly reduces the spatial matching problem.

Spatial matching between the temporally corresponding video data is shown as the second stage 102 of FIG. 1. Spatial matching produces correspondences between the spatial coordinates (x, y) and (x', y') of the temporally matched portions (such as frames) of the first and second video sequences.

Although algorithms exist in the prior art for matching one-dimensional signals and for matching two-dimensional signals, and such algorithms could form the first and second stages above, they are not robust enough to operate effectively when the video content has been edited or distorted. The invention improves on the prior art in that the matching can be made far stronger, remaining unchanged under distortion and editing of the video content. In particular, the temporal matching is invariant to temporal editing of the video sequences (and, in the example of FIG. 1, to the different aspect ratios of the apple, the different lighting, and the backgrounds of different fruit). The prior art thus has difficulty performing invariant spatio-temporal matching between two video sequences; by comparison, the invention demonstrates a way of handling this problem with high efficiency.

It should be understood that the methods described here normally run on a computer system comprising at least one processor (usually a computer with two or more processors) and memory (typically megabytes, or gigabytes or more, of storage). Existing processors suitable for executing the methods include general-purpose processors such as x86, MIPS, Power, ARM, and similar devices, as well as processors adapted for image processing, such as image processors, digital signal processors, and graphics processors, or the like. The methods can be implemented in higher-level programming languages such as "C", "C++", "Java", "Matlab", and the like, or can operate on lower-level components or be embedded directly in hardware. The results can be stored on storage media such as memory or flash drives, hard disks, CDs, DVDs, Blu-ray discs, and similar storage media.

In the prior art, image information can be represented by a small set of "points of interest", also called "feature points". Useful video feature points are points that can be found in a relatively stable manner across multiple modified versions of the image. A feature point is characterized together with its surroundings: feature descriptors typically describe the local image region, or neighborhood, around the feature point, and good feature points and descriptors are usually able to survive rotation of the image, later re-encoding, and presentation under different light sources.

Feature points are commonly accompanied by descriptor information, often in vector form, for example the three-dimensional direction of a spatio-temporal edge or a distribution of directions. Such a description of an object can supply enough information to distinguish, say, an apple-shaped object appearing in a computer advertisement from an apple-shaped object appearing in other video content. Examples of suitable feature points and descriptors include:

  • The corner detector (Harris corner detector), and its variants, described by C. Harris and M. Stephens in "A combined corner and edge detector", Fourth Alvey Vision Conference, 1988.
  • The Scale Invariant Feature Transform (SIFT), described by D. G. Lowe in "Distinctive image features from scale-invariant keypoints", International Journal of Computer Vision, 2004.
  • Motion vectors obtained by decoding the video stream.
  • Directions of spatio-temporal edges.
  • Distributions of color.
  • Descriptions of texture or other intrinsic properties.
  • Coefficient decompositions of the pixels over known dictionaries, for example wavelets, curvelets, and the like.
  • Particular objects known from prior art techniques.

Extending this idea to video data, a video can be abstracted as a three-dimensional structure (the two spatial dimensions formed by the many 2D pictures, and the one temporal dimension formed by the many video frames). This 3D structure can be used as the basic building block of a representation of the video sequence.
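As a concrete, non-normative illustration of per-frame detection and description, the following sketch uses OpenCV's SIFT implementation; any of the detectors and descriptors listed above could be substituted, and none is mandated by the text:

    import cv2

    def frame_features(frame_bgr, max_points=200):
        # Detect feature points in one frame and compute their descriptors.
        # SIFT is used purely as an example from the list above; each
        # keypoint carries (x, y) coordinates, and the frame index gives t.
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        sift = cv2.SIFT_create(nfeatures=max_points)
        keypoints, descriptors = sift.detectAndCompute(gray, None)
        return keypoints, descriptors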
As discussed above, a key insight behind the invention is that it is helpful to think of the video analysis problem in biological terms, drawing inspiration from bioinformatics. Here it is useful to treat feature points as atoms: the sets of feature points condensed from several video frames are like nucleotides, and the video itself is like an ordered, decoded sequence of nucleotides, such as a large DNA or RNA molecule.

A video sequence has spatial and temporal dimensions, the temporal dimension giving the ordering of the video data, so that one feature point can be said to come after another. If the video sequence is divided into temporal intervals, it can be treated as an ordered sequence of video elements, each containing a set of feature points. As stated above, the video data are regarded here as a sequence of small nucleotide-like units, and the video signal as a whole, composed of these nucleotide-like units, is called the video DNA.

Using the DNA-sequence analogy above, the systems and methods of the invention can represent a video as a three-dimensional, a two-dimensional, and a one-dimensional signal. Considering the set of all feature points gives a three-dimensional (spatio-temporal) structure. Considering the sequence of temporal intervals gives a one-dimensional representation. Considering one interval of the sequence gives a two-dimensional representation. The corresponding representations are used in the temporal and spatial matching stages of the following two-stage matching method.

In the first stage, a temporal representation of the video sequences is built. Each video sequence is divided into temporal intervals. An interval here is usually not a single video frame but a run of several frames (for example, 3 frames). Temporal intervals are discussed in more detail below.

For each temporal interval, the actual video content is abstracted into a representation (also referred to here as a visual nucleotide) containing the feature points of that interval. In one embodiment, the spatio-temporal coordinates of the feature points are discarded and the representation is compressed by keeping only the different types of points; in other words, only the feature point descriptors, and the number of feature points carrying each descriptor, are recorded.

Each temporally segmented piece of the video signal (called a "nucleotide" here, by analogy with a biological nucleotide) is thus represented as an unordered set, or "bag", of feature points (a bag of feature point descriptors). If each feature point is thought of as a "visual atom", then the "bag of features" representing a particular time segment of video can be called a "nucleotide". The representations of the successive time segments of a video are then arranged into an ordered "sequence", or map (the video DNA). In this description the term "nucleotide" is generally used rather than "bag of features", because it helps steer thinking about the video analysis process toward the bioinformatics approach.

Two video sequences represented by video maps (video DNAs) can be aligned in much the same way that DNA sequences are compared and aligned. An important problem in DNA sequence analysis is finding correspondences: given subsets of two similar DNA sequences, an alignment is sought that maximizes the similarity, and minimizes the discrepancy, between corresponding nucleotides. The algorithms of the systems and methods described here align video signals the way bioinformatics algorithms align DNA sequences, and those algorithms can be used to align two different video signals.

After the first stage has matched corresponding parts of the two video media, additional image analysis can be carried out. For example, in the second stage, spatial correspondences can be found between the temporally corresponding subsets of the video sequences: the "things" (pixel groups) seen in the first video can be matched in the second video. More precisely, spatial correspondences can now be sought between the frame contents of two temporally corresponding video frames.

In this second stage, the spatio-temporal coordinates of the feature points are not discarded. In the second stage each frame is represented as a two-dimensional structure of feature points, and the coordinates of the feature points are retained. For the purposes of the second stage, standard techniques from the computer vision literature can be used: spatial matching of frames, comparison of video frame contents, and standard feature point algorithms.

For object recognition and other applications in which object-based analysis is necessary, the "video genomics" approach provides meaningful advantages over the prior art, as follows.

First, the systems and methods described here provide higher discriminative power than individual object descriptors alone. This discriminative power derives not only from the descriptive power of the descriptors themselves but also from temporal support, that is, the temporal sequence in which the descriptors appear. Although some existing approaches hold that the best discrimination is obtained through a larger number of precisely optimized feature points, we found this not to be the case. Unexpectedly, when the systems and methods described here were compared with the prior art case by case, it emerged that temporal support (the appearance of groups of feature points in temporal order) contributes more discriminative power than the use of a very large number of distinct descriptors.

For example, it is often necessary to increase the precision of an object description. Increasing precision in the "brute force" manner of the prior art readily yields more feature points and feature descriptors, but when each point and descriptor is produced by intensive computation, the brute-force approach quickly reaches the point of diminishing returns because of its high computational load.

We have instead found a way to increase the accuracy of object description that, unlike other techniques, does not require growing the visual vocabulary on the order of the square of the desired strength (with the computational load growing accordingly). The systems and methods of this disclosure improve precision while avoiding any increase in the number of feature descriptors, gaining accuracy through analysis at a finer temporal resolution. This is accomplished simply by adding more "nucleotides" to the video DNA (that is, using smaller time segments in the video analysis). Because it avoids a drastic increase in the number of feature points, the approach achieves high accuracy and is more efficient from a computational standpoint.

Prior art methods, such as the "Video Google: a text retrieval approach to object matching in videos" method of J. Sivic and A. Zisserman, treat the video as a collection of images and, to obtain high discriminative power, must use very large vocabularies of feature points (over a million elements). By contrast, a description that uses temporal support can achieve the same or better results with a much smaller feature vocabulary (hundreds or thousands of elements), with a corresponding gain in computational efficiency.

The second advantage is context. The systems and methods described here capture both an object of interest and the content in which it appears; the temporal sequence can be thought of as additional information describing the object, over and above the description of the object itself.

FIG. 2 shows the same object (an apple 200) appearing in two different contexts: fruits 202 and computers 204. In the first case, the "apple" object appears in a sequence together with other fruits, such as strawberries, which endows the object with the meaning of a fruit. In the second case, the "apple" object appears in a sequence with laptop computers and iPhone handsets, which endows it with the meaning of a computer brand. Here, the systems and methods are sophisticated enough to discern the difference between these contexts.

The video map/video DNA representation will therefore differ between the two cases, despite the fact that the object itself is the same. By contrast, prior art methods such as that of Sivic and Zisserman take no account of the context of the video content and thus cannot distinguish the two different "apple" objects in the example above.

The third advantage is that the "video genomics" approach allows partial comparison and matching of video sequences to be performed in many different ways. Just as the methods of bioinformatics allow different DNA sequences to be compared, two different video DNA sequences can be matched despite differing video frames (nucleotides), such as inserted advertisements or gaps. This is particularly important when invariance to video alterations such as temporal editing is required: for example, the video DNA of a movie and that of a version with inserted advertisements need to be matched correctly.

FIG. 3 shows a conceptual scheme for creating the video map/video DNA representation of a video sequence. The process comprises the following stages.

In the first stage 302, a local feature point detector is used to detect points of interest in the video sequence. Suitable feature detectors include the corner detector (Harris corner detector) published by C. Harris and M. Stephens in "A combined corner and edge detector", Fourth Alvey Vision Conference, 1988; the Kanade-Lucas algorithm, proposed by B. D. Lucas and T. Kanade in "An iterative image registration technique with an application to stereo vision", 1981; or the feature detector based on the SIFT scale-space, disclosed by D. G. Lowe in "Distinctive image features from scale-invariant keypoints", IJCV, 2004.

The points of interest can be tracked over several video frames to prune insignificant or temporally inconsistent points (for example, points appearing for too short a time). This pruning is described in more detail later.
The remaining points are then described using local feature descriptors, for example SIFT, which is based on local histograms of gradient directions, or the Speeded-Up Robust Features (SURF) algorithm disclosed by H. Bay, T. Tuytelaars, and L. Van Gool in "Speeded up robust features", 2006. The descriptors are usually represented as vectors of values.

The feature detection and description algorithms should be designed to be robust, or invariant, to spatial distortions of the video sequence (for example, changes of resolution, compression noise, and the like). The spatio-temporal feature locations and the corresponding feature descriptors constitute the most basic representation level of the video sequence.

In the second stage 304, the video sequence is divided into temporal intervals 306, which often span multiple video frames (typically 3 to 30 frames). Such a division can be obtained, for example, from the feature points tracked in the previous stage.

In the third stage 308, the feature points in each temporal interval are aggregated. The spatio-temporal locations (feature point coordinates) mentioned above are not used at this stage. Rather, the information in the time interval is described using a "bag of features" approach. Here, similarly to the method proposed by Sivic and Zisserman, all the feature descriptors are represented in a visual vocabulary (a collection obtained, for example, by vector quantization of representative descriptors), and each feature descriptor is replaced by the nearest corresponding element of the visual vocabulary; the feature points discussed above are thus encoded against the vocabulary. A visual vocabulary of this kind can be thought of as the visual analogue of a "periodic table" of visual elements.
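A minimal sketch of this aggregation step, assuming a precomputed visual vocabulary given as a K x F array of representative descriptors (for example, from k-means clustering), and anticipating the histogram representation described next:

    import numpy as np

    def quantize(descriptors, vocabulary):
        # Map each descriptor (row of an (N, F) array) to the index of its
        # nearest element in the (K, F) visual vocabulary.
        dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
        return np.argmin(dists, axis=1)

    def visual_nucleotide(descriptors, vocabulary):
        # "Bag of features" for one temporal interval: a histogram counting
        # how often each visual atom (vocabulary element) occurs.
        atoms = quantize(descriptors, vocabulary)
        return np.bincount(atoms, minlength=len(vocabulary))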

However, unlike the prior art method of Sivic and Zisserman, here we discard the spatial coordinates of the feature points of what are variously called the "representations", "visual nucleotides", "nucleotides", or occasionally "bags of features" 310, and instead use the frequencies of occurrence of the different visual atoms within the time interval (as a histogram, grouping, or vector). The representation of a given number of video frames, the "visual nucleotide" 312, is essentially a "bag of features" created by discarding the spatial coordinates and simply counting frequencies of occurrence (this operation is regarded as the "bag function" or "grouping function"). If a standard set of visual elements is used to describe the content of each "bag", a visual nucleotide can be expressed mathematically as a histogram or sparse vector. For example, if a "bag of features" describing several video frames contains three instances of a first feature point, two instances of a second feature point, and no instances of a third, the visual nucleotide or "bag of features" describing those frames can be written as the histogram or vector (3, 2, 0). In the example, the visual nucleotide 321 is represented by the histogram or vector (0, 0, 0, 4, 0, 0, 0, 0, 0, 5, 0).

The "bag of features" representation allows invariance to spatial editing: if the video sequence is modified, for example by overlaying pixels onto the original frames, the new sequence will contain a mixture of feature points (some of the old feature points belonging to the original video, and some new ones corresponding to the overlay). If the overlay is not very large (in other words, most of the information in the frame belongs to the original video), it remains possible to match two visual nucleotides correctly, simply because the respective "bags of features" (sparse vectors) still share a sufficient proportion of their feature point elements.

Finally, all the visual nucleotides (or bags of features) are aggregated into an ordered sequence, the video map or video DNA 314. Each nucleotide (or visual "bag", histogram, or sparse vector) can be regarded as a generalized letter over a potentially infinite alphabet, so that the video DNA is a generalized text sequence.

The temporal matching of two video sequences can be performed by matching the corresponding video DNAs using any of a number of different algorithms. These range from very simple "match/no match" algorithms, through the "dot matrix" algorithms used with biological sequences, up to sophisticated algorithms of the kind used in bioinformatics. Among the more complex bioinformatics algorithms that can be used are the Needleman-Wunsch algorithm, described by S. B. Needleman and C. D. Wunsch in "A general method applicable to the search for similarities in the amino acid sequence of two proteins", 1970; the Smith-Waterman algorithm, described by T. F. Smith and M. S. Waterman in "Identification of common molecular subsequences", 1981; and heuristic algorithms such as the Basic Local Alignment Search Tool (BLAST), described by S. F. Altschul et al. in "Basic Local Alignment Search Tool", 1990.

Typically, a suitable sequence matching algorithm operates by defining a score (or distance) for the quality of a match between two video sequences. The matching score comprises two main components: a similarity (or distance) between corresponding nucleotides, and a gap penalty, an algorithmic criterion governing the introduction of gaps rather than forced insertions into a sequence.

To compute such a score, the distance between a nucleotide in the first video and the corresponding nucleotide in the second video must be determined by some mathematical procedure: how similar is the "bag of features" from one sequence of video frames to the "bag of features" from the other? The similarity value can be expressed as a measure of how alike, or how different, the two nucleotides are. In a simple case, it can be the Euclidean distance, or a correlation, between the vectors (bags of features) representing the nucleotides. If partial similarity is to be tolerated (which happens frequently, particularly where the visual nucleotides may contain different feature points as a result of spatial editing), more complicated measures with weighting or rejection of outliers should be used. More complex distances may also take into account the probability of one nucleotide mutating into another: two different nucleotides are more plausibly similar if one is a likely mutation of the other. Consider, for example, a first video frame sequence, a second video frame sequence identical to the first, and a video overlay applied to the second. Many of the video feature points described in the bags of the first sequence will be similar to the video feature points described in the bags of the second, and, because of the video overlay, the "mutations" here are exactly the feature points that differ.

The gap penalty is a function accounting for the introduction of gaps between the nucleotides of a sequence. If a linear penalty is used, it is simply the number of gaps multiplied by some pre-set constant. A more sophisticated gap penalty may take into account the probability of a gap appearing; for example, it can be estimated from the statistical distribution of advertisement locations and durations in the content.
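To make the alignment concrete, here is a minimal sketch of Needleman-Wunsch-style global alignment adapted to visual nucleotides. The nucleotide score is derived from a plain Euclidean histogram distance and the gap penalty is linear, both simplifications of the weighted, mutation-aware variants discussed above:

    import numpy as np

    def nucleotide_score(h1, h2):
        # Similarity as the negated Euclidean distance between normalized
        # histograms; weighted or outlier-robust distances could be used.
        h1 = h1 / max(h1.sum(), 1)
        h2 = h2 / max(h2.sum(), 1)
        return -np.linalg.norm(h1 - h2)

    def align_score(dna_a, dna_b, gap_penalty=-0.5):
        # Global alignment score of two visual-nucleotide sequences; a
        # traceback over D would recover the temporal correspondence.
        n, m = len(dna_a), len(dna_b)
        D = np.zeros((n + 1, m + 1))
        D[:, 0] = gap_penalty * np.arange(n + 1)
        D[0, :] = gap_penalty * np.arange(m + 1)
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                D[i, j] = max(
                    D[i - 1, j - 1] + nucleotide_score(dna_a[i - 1], dna_b[j - 1]),
                    D[i - 1, j] + gap_penalty,  # gap in sequence B
                    D[i, j - 1] + gap_penalty,  # gap in sequence A
                )
        return D[n, m]

A local-alignment variant in the style of Smith-Waterman would additionally clamp each cell at zero, allowing subsequences to be matched rather than whole sequences.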
The following discussion identifies the similarities, and the differences, between biological DNA and video DNA. Because the systems and methods discussed here essentially transform the problem of matching corresponding portions of different video media into a problem resembling that of matching biological DNA sequences, some deeper insight can be gained by examining this analogy more closely. And because DNA sequence matching is in a comparatively advanced state of development relative to video matching technology, the systems and methods have yielded unexpected results: a number of later-stage DNA bioinformatics research techniques turn out, quite unexpectedly, to apply to various aspects of matching video signals.

As discussed earlier, at the conceptual level there is a strong resemblance between the structure of biological DNA and the video DNA described here. A biological DNA is a sequence composed of nucleotides, and in the same way video DNA is composed of visual nucleotides (bags of features over multiple video frames). A nucleotide in biology is a molecule composed of atoms from the periodic table; in the same way, a visual nucleotide is a bag of features composed of visual atoms (that is, feature points) from a visual vocabulary (typically a standardized collection of different feature points).

FIG. 4 illustrates in schematic form the analogy between the "video DNA" and a biological DNA molecule, with its spiral structure 400 and atomic structure 402, and explains the origin of the name "video DNA". Despite the conceptual resemblance, there are several specific differences between biological and video DNA. First, the periodic table of atoms appearing in biological molecules is small, usually including only a few elements (for example carbon, hydrogen, oxygen, phosphorus, nitrogen, and so on), whereas in video DNA the visual vocabulary of feature points (atoms) typically contains at least several thousand to several million visual elements (feature points).

Second, the number of atoms in a biological nucleotide molecule is also relatively small (tens or hundreds), whereas the number of "visual atoms" (feature points) in a visual nucleotide (bag of features) is generally several hundreds or thousands. Moreover, in a biological nucleotide the spatial relationships and interactions between atoms matter, whereas in a visual nucleotide the relationships between feature points (that is, their mutual arrangement) are usually de-emphasized or ignored.

Third, the number of distinct nucleotides in biological sequences is small: there are typically four nucleotides ("A", "T", "G", "C") in a DNA sequence and twenty in a protein sequence. In video DNA, by contrast, each visual nucleotide is a "bag of features" usually containing at least hundreds or thousands of different feature points, representable as a histogram or vector. Thus if a standard collection of, say, 500 or 1000 standard feature points is adopted for the video analysis, each "bag of features" is a histogram or vector composed of multiples of the counts of each of those 500 or 1000 standard feature points occurring in the series of video frames described by the "nucleotide" or "bag", and the number of possible distinct visual nucleotides is accordingly enormous.

These practical differences mean that matching video DNA is only conceptually similar to matching biological sequences. In some respects the video matching problem is harder, and in other respects it is simpler. More specifically, the matching computations differ in the following respects.

First, because the number of distinct nucleotides in a biological sequence is small, the score for matching two nucleotides can be expressed by a simple "match" or "no match" result. That is, a biological nucleotide is an "A", "T", "G", or "C", and what must be determined is whether an "A" pairs with an "A", or does not. By contrast, each nucleotide in video DNA is itself a histogram, vector, or "bag of features", usually with hundreds or thousands of coefficients, so the matching operation is more involved. For video DNA, therefore, a more general notion of a "score", or "distance", between nucleotides is needed. Such a score can be regarded as some measure of distance between histograms or vectors: in other words, how far apart are any two different "bags of features"?

On the other hand, many different concepts, such as homology scores, insertions, deletions, point mutations, and other similar notions, show marked similarity between the two different domains.
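The contrast in scoring can be written compactly (the notation is supplied here for illustration and is not from the original):

    % Biological scoring: a binary match over a four-letter alphabet,
    %   s_{bio}(a, b) = [a = b],      a, b \in \{A, T, G, C\}
    % Video DNA scoring: a generalized similarity between histograms,
    %   s_{video}(h, h') = -d(h, h')
    % where [.] is 1 for a match and 0 otherwise, and d is a chosen distance
    % (Euclidean, weighted, or mutation-aware) between the bags h and h'.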
FIG. 5 shows, in one embodiment, the computation of the video DNA of an input video sequence. The video DNA computation receives video data 990 and comprises the following stages: feature detection 1000, feature description 2000, feature pruning 3000, feature representation 4000, segmentation into temporal intervals 5000, and visual atom aggregation 6000. The output of the process is a video DNA 6010. Some stages may be performed differently in different embodiments, or not at all. The following description details different embodiments of the above stages of the video DNA computation.
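Schematically, the stages of FIG. 5 compose into a single pipeline. In the following sketch the helper names mirror the stage numbers; they are hypothetical and not taken from the original:

    def compute_video_dna(video_data):
        # video data 990 -> video DNA 6010, via the stages of FIG. 5.
        # All helpers below are hypothetical stand-ins for the numbered stages.
        points = detect_features(video_data)                    # stage 1000
        descs = describe_features(video_data, points)           # stage 2000
        points, descs = prune_features(points, descs)           # stage 3000
        atoms = represent_as_atoms(points, descs)               # stage 4000
        intervals = segment_into_intervals(video_data, points)  # stage 5000
        return [aggregate_atoms(atoms, interval)                # stage 6000
                for interval in intervals]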
As shown in FIG. 6, the video sequence is divided into a set of temporal intervals (stage 5000). FIG. 6 shows that, in one embodiment, the video time intervals 600 are of fixed length (for example, 1 second) and do not overlap. In another embodiment, the time intervals 602 partly overlap. In these cases, each visual nucleotide may be composed of as many video frames as occur in 1 second (or some sub-multiple thereof); depending on the frame rate, this may be 10, 16, 24, 30, or 60 frames, or some sub-multiple combination.

In another embodiment, the intervals are set according to the locations of shots (scenes), or of abrupt transitions between two consecutive frames (time intervals 604). The results of trajectory tracking may be used to determine such shot cuts. At each frame, the number of new trajectories is computed, together with the number of trajectories they replace. If the number of disappearing trajectories exceeds a certain threshold, and the number of new trajectories replacing them exceeds another threshold, the frame is declared a shot cut. Where shots are used, a single visual nucleotide may be composed of hundreds or thousands of video frames if the shot is very long. In yet another embodiment, the intervals are of fixed duration and are resynchronized at each shot (see FIG. 6).
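A minimal sketch of the shot-adaptive variant, using the track-turnover heuristic just described (the threshold fractions are illustrative assumptions, not values from the original):

    def shot_cuts(tracks_per_frame, died_frac=0.5, born_frac=0.5):
        # tracks_per_frame: list of sets of track ids alive in each frame.
        # A frame is declared a shot cut when most previous tracks disappear
        # and most current tracks are newly born.
        cuts = []
        for i in range(1, len(tracks_per_frame)):
            prev, cur = tracks_per_frame[i - 1], tracks_per_frame[i]
            died = len(prev - cur) / max(len(prev), 1)
            born = len(cur - prev) / max(len(cur), 1)
            if died > died_frac and born > born_frac:
                cuts.append(i)
        return cuts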
More specifically, the matching calculus differs in the following respects. First, in biological sequences, since the number of distinct nucleotides is small, the comparison of two nucleotides can easily be scored as a "match" or "no match": a biological nucleotide is an "A", "T", "G", or "C", and one need only decide whether an "A" is paired with an "A" or not. By contrast, each nucleotide in video DNA is itself a histogram, vector, or "bag of features" that usually has hundreds or thousands of coefficients, so the comparison operation is more involved. For video DNA, a more general notion of a "score" or "distance" between nucleotides is therefore needed; this score can be regarded as a distance between histograms or vectors: in other words, how far apart are two different "bags of features"? On the other hand, many concepts, such as homology scores, insertions, deletions, and point mutations, carry over with substantial similarity between the two domains.

Figure 5 presents, in one embodiment, the computation of video DNA from an input video sequence. The video DNA computation receives the video data 990 and includes the following stages: feature detection 1000, feature description 2000, feature pruning 3000, feature representation 4000, segmentation into temporal intervals 5000, and visual atom aggregation 6000. The output of the process is a video DNA 6010. In different embodiments some stages may be performed differently or omitted entirely. The following description details different embodiments of the above video DNA computation stages.

As shown in Figure 6, the video sequence is divided into a set of temporal intervals (step 5000). Figure 6 shows that in one embodiment the temporal intervals 600 are of fixed length (for example, one second) and do not overlap. In another embodiment, the temporal intervals 602 have some overlap. Here, each visual nucleotide may be composed of the video frames occurring within one second (or some other subgroup), depending on the frame rate, which may be 10, 16, 24, 30, 60 frames per second or some other combination. In another embodiment, the intervals are delimited by the shot (scene) cuts detected between two consecutive frames (temporal intervals 604). The results of the feature tracking described later can be used to detect shot cuts: in each frame, the number of tracks that terminate and the number of new tracks replacing them are counted, and if the number of terminating tracks exceeds one limit while the number of newly created tracks exceeds another, the frame is declared a shot cut. If shot boundaries are used and a shot is very long, a single visual nucleotide may be composed of hundreds or thousands of video frames; in another embodiment, the intervals are of fixed, contiguous length and are resynchronized at the start of each shot.

Feature detection 1000 (Figure 5): a feature detector operates on the video data 990, producing a set of invariant feature point locations, denoted {(x_i, y_i, t_i)}, i = 1, ..., N (1010 in Figure 5), where x, y, t are the spatio-temporal coordinates of each feature point. The steps of feature detection are shown in more detail in Figure 7, together with an example. In one approach, feature detection is performed on a frame basis: at time t, the N_t strongest feature points within the frame are located.
Typical feature points are two-dimensional edges or corners. Standard invariant feature point detection algorithms from the computer vision literature, such as the Harris corner detector, can be used for this purpose. In different embodiments, the value of N_t may be preset to, for example, 100, 200, up to about 1000 feature points per frame; in another embodiment, N_t is not preset but is a result of the feature detection algorithm itself. In yet another embodiment, the feature points exist in both time and space, forming a set of spatio-temporal feature points; standard feature detection algorithms extended to three dimensions can be used in this case.

Feature description 2000 (Figure 5): for each feature point detected at stage 1000, a feature descriptor is computed, yielding the set of feature point descriptors 2010 that accompanies the feature point locations. A feature descriptor represents the local video information in the neighborhood of a feature point; many descriptors from the computer vision literature, such as the SIFT or SURF descriptors, which describe the neighborhood in terms of local histograms of gradient directions, can be used here. A feature descriptor is typically a vector in an F-dimensional space; for example, the SIFT descriptor has F = 128 and the SURF descriptor has F = 64. In one embodiment, the descriptors are computed on a frame basis, meaning that they represent the pixels in the spatial neighborhood of a feature point within a single frame. In another embodiment, the descriptors are spatio-temporal, meaning that they represent the pixels in a spatio-temporal neighborhood existing in both time and space; three-dimensional generalizations of the standard descriptors can be used in this case.

Feature pruning 3000 (Figure 5): at this stage, a subset 3010 of consistent feature points is selected from among all detected feature points. In different embodiments, consistency may mean spatial consistency (the feature points do not move abruptly, and their locations in temporally adjacent frames are similar), temporal consistency (the feature points do not appear or disappear suddenly), or spatio-temporal consistency (a combination of the two).

In one embodiment, shown in Figure 8, tracking is performed to find consistent feature points. A feature point tracker tries to find sets of feature points that appear consistently in a sufficiently long sequence of consecutive frames, which makes it possible to remove the spurious feature points detected in only a single frame. Such spurious points are typically caused by noise, and removing them improves both the accuracy and the discriminative power of the resulting video content description. In one embodiment, the tracker operates on a frame basis, attempting to find, for each feature point in frame t, the corresponding feature point in frame t+1; in other embodiments, multiple frames are tracked simultaneously.

The tracker 3100 outputs tracks, each track representing the trajectory of a feature point through time and space as the set of indices of the feature points belonging to it. A track can be written as {(x_t, y_t, t)}, t = t_0, ..., t_1, where t_0 and t_1 are the start and end times of the track, which may span a considerable duration. Consistency criteria such as the similarity of the descriptors along the track, the smoothness of the motion along the track, or the requirement that the point location not change abruptly along the track can be used; standard feature tracking algorithms from the computer vision literature are applicable here.

The consistency of the resulting tracks is then checked and track pruning 3200 is performed. In one embodiment, tracks of insufficient duration are removed by thresholding. In another embodiment, tracks exhibiting high variance of spatial location (abrupt motion) are removed. In a further embodiment, tracks whose feature descriptors exhibit high variance along the track are removed as well. The result of the removal is a subset T' of the tracks.

In one embodiment, a set of feature points {(x_i, y_i)} and the corresponding descriptors {f_i} are computed at the start of a shot at time t, and the tracker is initialized with them. A Kalman filter predicts the position of each feature point in the next frame; the feature set and the corresponding descriptors are then computed in frame t+1, and each feature point is matched against the subset of feature points lying within a circle of a given radius around its predicted position, the match being made to the closest descriptor. When no suitable match is found in the next frame, the track is terminated. Only the points of tracks spanning a sufficient temporal interval are retained. In one implementation, the Kalman filter uses a velocity state model, and the covariance of the predicted feature position determines the search radius in the next frame.

An embodiment of tracking-based feature pruning (Figure 8, step 3200) is described in more detail in Figure 9. Given the feature point locations 1010, the feature point descriptors 2010, and the feature tracks 3110, the duration "d", the motion variance "mv", and the descriptor variance "dv" of each track are computed. These values are passed through a set of thresholds and decision rules that remove tracks that are too short or too variable; the result is the subset 3010 of feature points belonging to the surviving tracks. One possible decision rule for retaining a track is

(d > th_d) AND (mv < th_mv) AND (dv < th_dv)

where th_d is a threshold on duration, th_mv a threshold on motion variance, and th_dv a threshold on descriptor variance.

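As an illustration only (not part of the original disclosure), a minimal Python sketch of the track pruning rule above follows; the track data layout and the threshold defaults are assumptions of the sketch.

    import numpy as np

    def prune_tracks(tracks, th_d=10, th_mv=4.0, th_dv=0.5):
        """Keep only tracks with (d > th_d) AND (mv < th_mv) AND (dv < th_dv).

        Each track is a dict with 'xy' (T x 2 positions over time) and
        'desc' (T x F descriptors sampled along the track)."""
        kept = []
        for trk in tracks:
            xy = np.asarray(trk['xy'], dtype=float)
            desc = np.asarray(trk['desc'], dtype=float)
            d = len(xy)                                          # duration in frames
            mv = np.var(np.diff(xy, axis=0)) if d > 1 else 0.0   # motion variance
            dv = np.var(desc, axis=0).mean() if d > 1 else 0.0   # descriptor variance
            if d > th_d and mv < th_mv and dv < th_dv:
                kept.append(trk)
        return kept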
Feature representation 4000 (Figure 5): the feature representation step expresses the feature points lying on the pruned tracks in terms of a visual vocabulary, producing a set of visual atoms 4010. The visual vocabulary is an ordered collection of K representative feature descriptors (visual elements), denoted here by {e_l}; it can be precomputed, for example by collecting a large number of feature points from a representative collection of video sequences and vector-quantizing their descriptors. In different embodiments, the value of K may be 1000, 2000, 3000, up to about 1000000. Each feature point i is replaced by the index l of the visual vocabulary element whose descriptor is nearest to the descriptor f_i of that feature point. In one embodiment, a nearest neighbor algorithm is used to find the representation of feature point i,

l = argmin_{l = 1, ..., K} || f_i - e_l ||

where ||·|| is a norm in the descriptor space. In another embodiment, an approximate nearest neighbor algorithm is used instead. The feature point i, now represented as (x_i, y_i, l_i), is called a visual atom.

In one embodiment, before the feature points are quantized against the visual vocabulary, a representative descriptor is found for each track; it can be computed as the average or the median of the feature descriptors along the track, or by majority vote. In one embodiment, feature points without discriminative power are pruned: a non-discriminative feature point is one that lies almost equidistant from multiple visual vocabulary elements, which can be judged from the ratio between its distances to the first and the second nearest vocabulary elements.

Visual atom aggregation 6000: for each temporal interval computed at step 5000, the visual atoms within the interval are aggregated into a visual nucleotide, and the resulting sequence of visual nucleotides (the video DNA 6010) is output. A visual nucleotide is built as a histogram with K bins (K being the size of the visual vocabulary), the n-th bin counting the visual atoms of type n appearing within the interval.

In one embodiment, the histogram over an interval weights the visual atoms by their temporal position within the interval according to

h_n = Σ_{i : l_i = n} w(t_i)

where w(t) is a weighting function and h_n is the value of the n-th bin of the histogram. In one embodiment, the weight is maximal at the center of the interval and decreases toward its edges, for example following a Gaussian law of the form w(t) = exp( -(t - t_c)^2 / (2 σ^2) ), where t_c is the interval center. In another embodiment, shot cuts within the interval are detected, and w(t) is set to zero between the shot boundary and the edge of the interval, so that only content belonging to the same shot as the interval center contributes.

In particular embodiments, the bins of the histogram are additionally weighted to reduce the influence of unreliable bins. For example, the weight of the n-th bin can be made inversely proportional to the typical frequency of visual atoms of type n; this kind of weight is analogous to the inverse document frequency weighting used in text search engines. In another embodiment, the weight of the n-th bin is inversely proportional to the variance of the n-th bin computed over representative typical mutations of the same content, and proportional to the variance of the n-th bin over distinct content.

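As an illustration only (not part of the original disclosure), the following minimal Python sketch builds a visual nucleotide as a K-bin histogram and derives inverse-frequency bin weights of the kind just described; the function names and the +1 smoothing are assumptions of the sketch.

    import numpy as np

    def build_nucleotide(atom_labels, K):
        """Visual nucleotide of one temporal interval: a K-bin histogram
        counting how many visual atoms of each type occur in the interval."""
        h = np.zeros(K)
        for label in atom_labels:
            h[label] += 1.0
        return h

    def idf_bin_weights(nucleotides):
        """Down-weight unreliable bins: the weight of bin n is inversely
        proportional to the typical frequency of atoms of type n, in the
        spirit of inverse document frequency in text retrieval."""
        H = np.asarray(nucleotides, dtype=float)   # rows: nucleotides, cols: bins
        freq = H.mean(axis=0)
        return 1.0 / (1.0 + freq)                  # +1 avoids division by zero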
Once video DNA has been computed for two video sequences, the sequences can be temporally matched (aligned) as follows. In one embodiment, a correspondence between the sequence of visual nucleotides {q_i} representing a query video DNA and the sequence {s_j} representing a video DNA in the database is computed in the following way. In a matching of the two sequences, a nucleotide q_i corresponds either to a nucleotide s_j or to a gap between nucleotides s_j and s_{j+1}; likewise, a nucleotide s_j corresponds either to a nucleotide q_i or to a gap between q_i and q_{i+1}. A matching between {q_i} and {s_j} can thus be described by a sequence of K correspondences {(i_k, j_k)}, k = 1, ..., K, a sequence of G gaps {(i_m, j_m, n_m)}, m = 1, ..., G, where (i_m, j_m, n_m) denotes a gap of length n_m between nucleotides s_{j_m} and s_{j_m + 1} opposite the subsequence {q_{i_m}, ..., q_{i_m + n_m}}, and a sequence of G' gaps {(i'_m, j'_m, n'_m)}, m = 1, ..., G', where (i'_m, j'_m, n'_m) denotes a gap of length n'_m between nucleotides q_{i'_m} and q_{i'_m + 1} opposite the subsequence {s_{j'_m}, ..., s_{j'_m + n'_m}}. A matching is assigned the score

S = Σ_{k=1}^{K} σ(q_{i_k}, s_{j_k}) + Σ_{m=1}^{G} γ(i_m, j_m, n_m) + Σ_{m=1}^{G'} γ(i'_m, j'_m, n'_m)

where σ(q_i, s_j) is a scoring function measuring the similarity between nucleotides q_i and s_j, and γ is the gap penalty.

As described above, many alternative algorithms can be used for the matching computation, ranging from simple to very complex. In one embodiment of the invention, the Needleman-Wunsch algorithm is used to find the matching maximizing the total score S; in another embodiment, the Smith-Waterman algorithm is used; and in a further embodiment, BLAST is used. In an alternative embodiment, the matching maximizing the score S is found by the following hierarchical method. In the first stage, good matchings of a small fixed length W between the query and the database sequences are located; these short matchings are called seeds. In the second stage, an attempt is made to extend each seed in both directions: an un-gapped alignment process tries to increase the alignment score while extending the initial seed of length W in each direction; insertions and deletions are not considered at this stage. If an un-gapped alignment of sufficiently high score is found, the database sequence passes to the third stage, in which a gapped alignment between the query sequence and the database sequence is computed, for example with the Smith-Waterman algorithm.

In one embodiment, the scoring function σ(q_i, s_j) expresses the similarity between the histogram h_q representing nucleotide q_i and the histogram h_s representing nucleotide s_j. In another embodiment, the similarity is computed as the inner product <h_q, h_s>; in a further embodiment, the inner product is weighted by a vector of weights obtained from training data so as to maximize the discriminative power of the scoring function. Alternatively, the scoring function σ(q_i, s_j) may be inversely proportional to a distance between the histograms h_q and h_s. In some embodiments, the distance is computed as an L_p norm,

|| h_q - h_s ||_p = ( Σ_n | h_{q,n} - h_{s,n} |^p )^{1/p}

In a particular embodiment, the distance is the Kullback-Leibler divergence between the histograms; in other embodiments, it is the Earth Mover's distance between them.

In a particular implementation, the scoring function σ(h, h') is proportional to the probability that a nucleotide h mutates into a nucleotide h' as a result of distortion or editing applied to the underlying video sequence; that is, to the probability that the histogram h representing one nucleotide turns into the histogram h' representing the other. In this case, the probability can be estimated as

P(h' | h) = Π_n P(h'_n | h_n)

where P(h'_n | h_n) is the probability that the value of the n-th bin of the histogram changes from h_n to h'_n; these probabilities can be estimated empirically, for each bin independently, from training data. In another example, Bayes' theorem is used to express the scoring function through the probability

P(h | h') = P(h' | h) P(h) / P(h')

where P(h' | h) is computed as above, and P(h) and P(h') measure the prior probabilities of observing the histograms h and h', again estimated empirically for each bin from training data.

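As an illustration only (not part of the original disclosure), a minimal Python sketch of a Needleman-Wunsch-style global alignment with a pluggable scoring function and a linear gap penalty follows; traceback is omitted for brevity, and the gap default is an assumption of the sketch.

    import numpy as np

    def align_nucleotide_sequences(q, s, score, gap=3.0):
        """Global alignment of two video-DNA sequences q and s.

        'score' is any scoring function sigma(q_i, s_j) on nucleotide
        histograms; 'gap' is a linear per-element gap penalty. Returns the
        best total alignment score."""
        n, m = len(q), len(s)
        D = np.zeros((n + 1, m + 1))
        D[1:, 0] = -gap * np.arange(1, n + 1)
        D[0, 1:] = -gap * np.arange(1, m + 1)
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                D[i, j] = max(D[i - 1, j - 1] + score(q[i - 1], s[j - 1]),
                              D[i - 1, j] - gap,     # gap in s
                              D[i, j - 1] - gap)     # gap in q
        return D[n, m]

A Smith-Waterman variant would additionally clamp every cell at zero and take the matrix maximum, yielding the best local alignment instead.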
It is often useful not only to find the overall temporal correspondence between the frames or time sequences of two different videos, but also to find the spatial correspondence between a "thing" (an object) shown in the first video and the corresponding "thing" shown in the temporally aligned portion of the second video. This is also needed, for example, when comparing videos of different spatial resolution: in both cases one may want to determine precisely how a frame of one video maps onto a frame of the other, so a spatial (coordinate) alignment is required in addition to the temporal (frame number) alignment of the two videos.

In an embodiment of the invention, the spatial correspondence between the video data of a temporal interval in the first sequence, represented by a visual nucleotide q_i, and the video data of the temporally corresponding interval in the second sequence, represented by the best-matching visual nucleotide s_j, is found from the feature points belonging to the two intervals. A set of feature points {f_i} is taken from the first interval and a set {f'_j} from the second, and a correspondence between the two sets is sought such that each f_i is matched to the nearest f'_j, with insufficiently close pairs rejected. The corresponding points can be represented as pairs (f_i, f'_{c(i)}). Once this correspondence is found, a transformation T is sought by minimizing Σ_i || T(f_i) - f'_{c(i)} ||. In one embodiment, the minimization is performed with the RANdom SAmple Consensus (RANSAC) algorithm; in another embodiment, with an iteratively-reweighted least squares fitting algorithm. It is often useful to restrict T to a rotation, a scaling, or a warping transformation.

In one embodiment, the transformation T has the form

    [  cos θ   sin θ   u ]
T = [ -sin θ   cos θ   v ]
    [  0       0       1 ]

In another embodiment, the transformation T has the form

    [  a cos θ   a sin θ   u ]
T = [ -a sin θ   a cos θ   v ]
    [  0         0         1 ]

In a further embodiment, the transformation T has the form

    [ a   b   u ]
T = [ c   d   v ]
    [ 0   0   1 ]

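As an illustration only (not part of the original disclosure), the following Python sketch estimates the second (similarity) form of T above from matched feature point locations with a RANSAC loop; the iteration count and inlier tolerance are assumptions of the sketch.

    import numpy as np

    def fit_similarity(src, dst):
        """Least-squares rotation+scale+translation mapping src -> dst,
        solving x' = a x + b y + u, y' = -b x + a y + v."""
        A, b = [], []
        for (x, y), (xp, yp) in zip(src, dst):
            A.append([x,  y, 1.0, 0.0]); b.append(xp)
            A.append([y, -x, 0.0, 1.0]); b.append(yp)
        a_, b_, u, v = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)[0]
        return np.array([[a_,  b_, u],
                         [-b_, a_, v],
                         [0.0, 0.0, 1.0]])

    def ransac_transform(src, dst, iters=500, tol=3.0, seed=0):
        """RANSAC: fit T on minimal samples of 2 correspondences and keep
        the candidate agreeing with the most correspondences."""
        rng = np.random.default_rng(seed)
        src = np.asarray(src, float); dst = np.asarray(dst, float)
        src_h = np.hstack([src, np.ones((len(src), 1))])
        best_T, best_count = None, -1
        for _ in range(iters):
            idx = rng.choice(len(src), size=2, replace=False)
            T = fit_similarity(src[idx], dst[idx])
            proj = (src_h @ T.T)[:, :2]
            count = int(np.sum(np.linalg.norm(proj - dst, axis=1) < tol))
            if count > best_count:
                best_T, best_count = T, count
        return best_T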
In other embodiments, the transformation T is a projective transformation. The search for the spatio-temporal correspondence between two sequences is summarized in Figure 10; the procedure comprises the following stages.

1. Video DNA computation: two sets of video data 990 and 991 are input to a video DNA computation stage 510. Stage 510 is detailed in steps 1000, 2000, 3000 and 4000 of Figure 5 and also appears in Figures 11 through 14. This stage can be performed on-line, or the video DNA can be computed in advance and stored.

2. Temporal matching: the resulting video DNAs 6010 and 6011 are input to a temporal alignment stage 520, which computes a temporal correspondence 525. The temporal correspondence is essentially a transformation from the temporal coordinate system of video data 990 to that of video data 991. Stage 520 is described in more detail in Figure 15.

3. Spatial matching: the temporal correspondence 525 is used at stage 530 to select temporally corresponding subsets of video data 990 and 991. The selected subsets 535 and 536 of videos 990 and 991 are input to a spatial alignment stage 540, which computes a spatial correspondence 545. The spatial correspondence is essentially a transformation from the spatial coordinate system of video data 990 to that of video data 991.

One example embodiment is discussed below, in which the video DNA of an input video sequence is computed as shown in Figure 5: the computation receives the video data 990, comprises the stages of feature detection 1000, feature description 2000, feature pruning 3000, feature representation 4000, temporal interval segmentation 5000, and visual atom aggregation 6000, and outputs a video DNA 6010.

Feature detection 1000: a SURF feature detector (described in "Speeded Up Robust Features", Proceedings of the 9th European Conference on Computer Vision, May 2006) operates independently on each frame of the video sequence 990, producing in every frame t the locations 1010 of the N_t = 150 strongest invariant feature points (see Figure 5).

Feature description 2000: for each feature point found at the detection stage, a 64-dimensional SURF feature descriptor is computed, as described in the same "Speeded Up Robust Features" reference.

Feature pruning 3000: this is an optional step and is not performed in this embodiment.

Feature representation 4000: the feature points are represented using a visual vocabulary of K = 1000 entries; the representative element for each feature point is computed with the approximate nearest neighbor algorithm of S. Arya and D. M. Mount ("Approximate Nearest Neighbor Searching", Proc. 4th Ann. ACM-SIAM Symposium on Discrete Algorithms (SODA), 1993, pp. 271-280).

Only the feature points whose distance to the nearest visual vocabulary element is less than 90% of the distance to the second nearest element are retained. The result of this stage is a set of visual atoms 4010.

The visual vocabulary used at the feature representation stage is precomputed from training data consisting of a set of approximately 750,000 feature point descriptors, obtained by applying the previously described stages to a collection of assorted video content. The K-means algorithm is used to quantize the training set into 1000 clusters. To lighten the computational burden, the nearest-neighbor search inside the K-means algorithm is replaced by its approximate variant, again following S. Arya and D. M. Mount, "Approximate Nearest Neighbor Searching", Proc. 4th Ann. ACM-SIAM Symposium on Discrete Algorithms (SODA), 1993, pp. 271-280.

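As an illustration only (not part of the original disclosure), a minimal Python sketch of the 90% nearest-neighbor ratio criterion above follows; the brute-force search stands in for the approximate algorithm of the cited reference.

    import numpy as np

    def ratio_test(descriptors, vocabulary, max_ratio=0.9):
        """Keep a feature only if its nearest vocabulary element is closer
        than max_ratio times the distance to the second-nearest element."""
        vocab = np.asarray(vocabulary, float)
        kept = []
        for i, f in enumerate(np.asarray(descriptors, float)):
            dists = np.linalg.norm(vocab - f, axis=1)
            nearest, second = np.sort(dists)[:2]
            if nearest < max_ratio * second:
                kept.append((i, int(np.argmin(dists))))   # (feature index, atom label)
        return kept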
Temporal interval segmentation 5000: the video sequence is divided into a set of fixed-length, one-second temporal intervals 600 (Figure 6).

Visual atom aggregation 6000: for each temporal interval computed at step 5000, the visual atoms inside the interval are aggregated into a visual nucleotide. The sequence of visual nucleotides so produced (the video DNA 6010) is the output of the process. A visual nucleotide is created as a histogram with K = 1000 bins, the n-th bin counting the occurrences of visual atoms of type n within the time interval.

After the video DNAs of two different videos have been produced, their correspondence and matching can be examined as follows.

Temporal matching 520 (Figure 10) can be performed with the SWAT (Smith-Waterman) algorithm using a linear gap penalty with parameters α = 5 and β = 3, together with the weighted scoring function

σ(h, h') = ( Σ_{n=1}^{1000} w_n h_n h'_n ) / ( sqrt( Σ_{n=1}^{1000} w_n h_n^2 ) · sqrt( Σ_{n=1}^{1000} w_n h'_n^2 ) )

The weights w_n can be computed empirically. For this purpose, training video sequences are transformed by a set of random spatial and temporal distortions, including blurring and changes of resolution, aspect ratio, and frame rate, and the video DNA of each transformed sequence is computed. The variance of each bin of the visual nucleotides under these distortions is estimated, as is the variance of the corresponding bins across different content, and for each bin n the weight w_n is set according to the difference between the two.

Spatial matching 540 (Figure 10): spatial alignment is performed on the feature points of two corresponding one-second intervals of the two sets of video data 990 and 991, the correspondence between the intervals coming from the preceding temporal alignment stage 520. For each feature point in one interval, the corresponding feature point in the other interval is the one minimizing the Euclidean distance between their respective descriptors. The resulting set of corresponding feature points is then used to fit a transformation of the form

    [  a   b   u ]
T = [ -b   a   v ]
    [  0   0   1 ]

In another aspect, the invention is a method for spatio-temporal matching of digital video data comprising a plurality of temporally matched video frames. In this aspect, the method includes performing temporal matching of the digital video data, comprising the plurality of video frames, to obtain a similarity matrix, in which each video frame is given a representation and the matching operates with a matching score comprising a similarity component and a gap penalty component, using a suitable local alignment algorithm (for instance an algorithm based on genome sequence alignment, or another suitable algorithm); and performing spatial matching of the digital video data, comprising the plurality of temporally matched video frames, using the obtained similarity matrix. The step of performing spatial matching is carried out independently of the step of performing temporal matching. The method may also employ the Needleman-Wunsch algorithm, SWAT, or other algorithms of a similar type, and may be carried out with genome matching algorithms such as the Basic Local Alignment Search Tool (BLAST) family used to compare biological sequences, such as protein or nucleotide (DNA) sequences. The method may further include performing local feature detection on the digital video data, comprising the plurality of temporally matched video frames, to detect points of interest, and using the points of interest to segment the digital video data into a plurality of temporal intervals, the spatial matching and the temporal matching then operating over the plurality of temporal intervals.

In another aspect, the method determines the spatio-temporal correspondence between video data, and includes the steps of: inputting the video data; representing the video data as ordered sequences of visual nucleotides; determining temporally corresponding subsets of the video data by aligning the sequences of visual nucleotides; computing the spatial correspondence between the temporally corresponding subsets; and outputting the spatio-temporal correspondence between the subsets of the video data.

Types of input data: in this respect, the video data may be a collection of video sequences, or a query of video data together with a corpus of video data; it may be a subset of a single video sequence, or a modified subset of a video sequence from the collection. Further, the spatio-temporal correspondence is established between at least one subset of a video sequence in the query video data and at least one subset of a video sequence in the video data corpus; in a particular embodiment, it can be established between a subset of a video sequence of the query video data and a subset of a video sequence of the video corpus. Regarding the query video data, the query may contain a modified subset of the video corpus, the modification consisting of one or more of the following:

• a change of frame rate;
• a change of spatial resolution;
• a non-uniform spatial scaling;
• a modification of the histogram;
• an overlay of new video content;
• a temporal insertion of new video content.

Nucleotide segmentation: in another variation, the systems and methods include dividing the video data into temporal intervals and computing a visual nucleotide for each interval. Temporal intervals: in a further variation, the video data can be divided into intervals of fixed or of variable duration; the start and end times of the intervals can be computed from shot transitions in the video data; and the intervals may or may not partially overlap.

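As an illustration only (not part of the original disclosure), the method steps enumerated above can be collected into the following Python skeleton; the callables and the interval containers are placeholders assumed by the sketch, standing in for the stages of Figure 10.

    def match_videos(dna_a, dna_b, intervals_a, intervals_b, align, spatial_align):
        """Two-stage spatio-temporal correspondence.

        dna_a, dna_b: ordered visual-nucleotide sequences of the two videos;
        intervals_a, intervals_b: the corresponding video-data subsets;
        align, spatial_align: placeholder callables for the temporal and
        spatial alignment stages."""
        time_pairs = align(dna_a, dna_b)          # [(i, j), ...] matched intervals
        results = []
        for i, j in time_pairs:
            T = spatial_align(intervals_a[i], intervals_b[j])   # e.g. RANSAC fit
            results.append((i, j, T))             # temporal pair + spatial transform
        return results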
Visual nucleotide computation: in another variation, a visual nucleotide (as before, the visual data describing a time interval of the video) can be computed by the following steps: representing the visual data within the time interval as a set of visual atoms; and constructing the nucleotide as a function of at least one of the visual atoms.

For this computation, the function may be a histogram of the frequencies of occurrence of the visual atoms within the time interval, or a weighted histogram of those frequencies. In the weighted case, the weight attributed to a visual atom may be a function of any combination of the following: the temporal location of the visual atom within the time interval; the spatial location of the visual atom within the frame; and the saliency of the visual atom.

The relative weighting of the different feature points, or visual atoms of the "bag of features" of a nucleotide, varies between implementations: in some implementations the weights are constant over the interval (all feature points are treated alike), while in others not all feature points receive equal treatment. For example, in one alternative weighting scheme, the weight is a Gaussian function attaining its maximum inside the interval. The weight may also be set to a large value for visual content belonging to the same shot as the center of the interval and a small value for content belonging to different shots; or to a large value for visual atoms located near the center of the frame and a small value for atoms near the frame boundary. A sketch of one weighting scheme of this kind is given below.
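As an illustration only (not part of the original disclosure), a minimal Python sketch combining a temporal Gaussian weight with a frame-center spatial weight follows; the sigma defaults are assumptions of the sketch.

    import numpy as np

    def atom_weight(t, xy, t_center, frame_center, sigma_t=0.5, sigma_s=200.0):
        """Weight of one visual atom: a Gaussian in the temporal distance
        from the interval center, times a factor favoring atoms near the
        frame center."""
        w_time = np.exp(-((t - t_center) ** 2) / (2 * sigma_t ** 2))
        d = np.linalg.norm(np.asarray(xy, float) - np.asarray(frame_center, float))
        w_space = np.exp(-(d ** 2) / (2 * sigma_s ** 2))
        return float(w_time * w_space)

    def weighted_nucleotide(atoms, K, t_center, frame_center):
        """Weighted histogram over atoms given as (t, (x, y), label) triples."""
        h = np.zeros(K)
        for t, xy, label in atoms:
            h[label] += atom_weight(t, xy, t_center, frame_center)
        return h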
Visual atom methods: as discussed above, a visual atom describes the visual content of a local spatio-temporal region of the video. In one implementation, representing the visual data of a time interval as a set of visual atoms includes the following steps: detecting a set of invariant feature points within the time interval; computing a descriptor of the local spatio-temporal region of the video around each invariant feature point; removing a subset of the invariant feature points and their descriptors; and constructing the set of visual atoms from the locations and descriptors of the remaining invariant feature points.

Feature detection methods: besides the feature detection methods already described, the set of invariant feature points within a time interval may be computed with the Harris-Laplace corner detector, an affine-invariant version of it, a spatio-temporal corner detector, or the MSER algorithm. If the MSER algorithm is used, it can be applied to individual frames of the video data or to spatio-temporal subsets of the video data. The descriptors of the invariant feature points may be SIFT descriptors, spatio-temporal SIFT descriptors, or SURF descriptors.

Tracking methods: in some embodiments, computing the descriptors includes tracking corresponding invariant feature points within the time interval and computing a single descriptor for the invariant feature points belonging to a track, the single descriptor then being assigned to all feature points of the track. The single descriptor may be computed as the average, or as the median, of the descriptors of the invariant feature points along the track.

Feature pruning methods: in some embodiments, removing a subset of the invariant feature points includes tracking corresponding invariant feature points within the time interval, assigning a quality metric to each track, and removing the invariant feature points belonging to tracks whose quality metric falls below a predefined threshold. The quality metric assigned to a track may be a function of the descriptor values of the invariant feature points along the track and of their locations; the function may be proportional to the variance of the descriptor values, or of the locations, along the track.

Visual atom construction: in some embodiments, constructing the set of visual atoms includes constructing a single visual atom for each remaining invariant feature point as a function of its descriptor. Computing this function may include: receiving the descriptor of an invariant feature point as input; finding, within an ordered collection of representative descriptors, the representative descriptor best matching the received descriptor; and outputting the index of the representative descriptor found. The search can be performed by vector quantization or with an approximate nearest neighbor algorithm.

Visual vocabulary methods: the ordered collection of representative descriptors (the visual vocabulary) may be fixed and computed off-line from training data, or adapted and updated on-line from the input visual data. In some cases, it is advantageous to construct a standardized visual vocabulary that operates universally over all video, or at least over a large video domain, to facilitate standardized analysis and matching across large video databases and large arrays of video sources.

Visual atom pruning methods: in some embodiments, the construction of the set of visual atoms is followed by the removal of a subset of the visual atoms, which may include assigning a quality metric to each visual atom in the set and removing the visual atoms whose quality metric falls below a predefined threshold. The threshold may be fixed, or adjusted to maintain a minimum number of visual atoms in the set, or adjusted to limit the maximum number of visual atoms in the set. Further, assigning the quality metric may include: receiving a visual atom as input; computing a vector of similarities between the visual atom and the elements of a representative set of visual atoms; and outputting the quality metric as a function of the similarity vector. The function may be proportional to the maximum value of the similarity vector, or to the ratio between the largest and the second-largest values of the similarity vector.

Sequence alignment methods: in some embodiments, computing a correspondence between sequences of visual nucleotides includes: receiving two sequences of visual nucleotides, {q_i} and {s_j}, as input; receiving a scoring function σ(q_i, s_j) and a gap penalty function γ(i, j, n) as parameters; finding the partial correspondence C = {(i_k, j_k)} and the collection of gaps G = {(l_m, m_m, n_m)} maximizing the functional

Σ_k σ(q_{i_k}, s_{j_k}) + Σ_m γ(l_m, m_m, n_m)

and outputting the partial correspondence found together with the maximum value of the functional. As discussed previously, the maximization can be carried out with the Smith-Waterman algorithm, the Needleman-Wunsch algorithm, or BLAST, or it can be performed hierarchically.

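As an illustration only (not part of the original disclosure), a minimal Python sketch of the visual atom quality metric described above follows; the inner-product similarity and the threshold default are assumptions of the sketch.

    import numpy as np

    def atom_quality(descriptor, vocabulary):
        """Ratio between the largest and second-largest similarity of the
        atom to the vocabulary elements; a large ratio marks an unambiguous
        atom."""
        sims = np.asarray(vocabulary, float) @ np.asarray(descriptor, float)
        second, best = np.sort(sims)[-2:]
        return float(best / second) if second > 0 else float('inf')

    def prune_atoms(atoms, vocabulary, threshold=1.05):
        """Drop atoms whose quality falls below a (tunable) threshold."""
        return [a for a in atoms
                if atom_quality(a['descriptor'], vocabulary) >= threshold]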
Scoring methods: the scoring function may be given by a bilinear form, σ(s_i, q_j) = h_{s_i}^T A h_{q_j}, where h_{s_i} and h_{q_j} are the vectors representing the nucleotides s_i and q_j, and A may be an identity matrix or a diagonal matrix. The score may be proportional to the conditional probability that the nucleotide q_j is a mutation of the nucleotide s_i, the mutation probability being estimated empirically from training data; the score may also be proportional to a ratio of probabilities of the form P(s_i, q_j) / ( P(s_i) P(q_j) ), the probabilities again being estimated empirically from training data.

Distance-based scoring: alternatively, the scoring function may be inversely proportional to a distance function; possible distance functions include at least the following:

• the L1 distance;
• the Mahalanobis distance;
• the Kullback-Leibler divergence;
• the Earth Mover's distance.

Weighting schemes: in addition to the previously described weighting schemes, the elements A_nn of the diagonal matrix A may be proportional to 1/E_n, where E_n measures how many times a visual atom of type n appears in a visual nucleotide; E_n may be estimated from training video data or from the input video data. The diagonal elements may also be set as a function of v_n and V_n, where v_n is the variance of visual atom n across mutated versions of the same visual nucleotide and V_n is the variance of visual atom n across arbitrary visual nucleotides; v_n and V_n may likewise be estimated from training video data.

Gap penalty methods: in some embodiments, the gap penalty is a parametric function γ(i, j, n; θ), where i and j are the positions at which the gap starts in the two sequences, n is the gap length, and θ are parameters. The parameters can be estimated from training data, and the training data may include examples of video sequences with inserted advertisement content. In addition, the gap penalty may have the form γ(n) = a + b·n, where n is the gap length and a and b are constants; further, γ may be a convex function of n, or inversely proportional to the probability of finding a gap of length n starting at positions i and j of the two sequences.

Spatial correspondence methods: computing the spatial correspondence includes: inputting temporally corresponding subsets of the video data; finding feature points in each subset; finding a correspondence between the feature points; and finding a correspondence between the spatial coordinate systems from the corresponding feature points. The temporally corresponding subsets of the video data may be at least one pair of temporally corresponding frames. Finding the correspondence between the feature points may further include: inputting two sets of feature points; providing descriptors of the feature points; and matching the descriptors. The feature points may be those already computed for the visual nucleotides, and the descriptors may likewise be the ones used in the visual nucleotide computation. Moreover, finding the correspondence between the coordinate systems may include finding the parameters of a parametric model of the form u' = T(u), for example with the RANSAC algorithm or one of its variants, such that the model maps one set of feature points onto the other. The correspondence between the spatial coordinate systems can be expressed as a mapping between the spatial coordinates (x, y) of one subset of the video data and the spatial coordinates (x', y') of the other.

Output methods: the output spatio-temporal correspondence between the subsets of the video data can be represented as a mapping between the spatio-temporal coordinates (x, y, t) of one subset and the spatio-temporal coordinates (x', y', t') of the other.

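As an illustration only (not part of the original disclosure), a minimal Python sketch of the bilinear scoring form described above follows; with no weight vector supplied, A is the identity and the score reduces to a plain inner product.

    import numpy as np

    def bilinear_score(h_s, h_q, diag_a=None):
        """Scoring function sigma(s, q) = h_s^T A h_q with a diagonal A.

        'diag_a' plays the role of the per-bin weights discussed above;
        a trained weight vector can be supplied in its place."""
        h_s = np.asarray(h_s, float)
        h_q = np.asarray(h_q, float)
        a = np.ones_like(h_s) if diag_a is None else np.asarray(diag_a, float)
        return float(np.sum(h_s * a * h_q))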
To assist with the discussion that follows, a set of simplified illustrations is provided to help convey the preceding concepts and techniques. These schematic figures are less precise than the formal description above, but they make the foregoing steps easier to understand and implement. So that the illustrations can be drawn in the three dimensions we perceive, the multi-dimensional arrays that would normally be computed in a high-dimensional space (arrays that may have 1000 or more features, i.e., dimensions) are reduced to three-dimensional arrays representable in three-dimensional space; the complex high-dimensional computations can thereby be replaced by simple diagrams, and the complex feature descriptors can likewise be depicted simply in black and white.

Figure 11: feature detection 1000 and feature description 2000. Figure 11 combines blocks 1000 and 2000 of Figure 5. A simplified single-frame video nucleotide, or signature, is produced by analyzing the video frames with a series of feature detection algorithms. In this example, there are three different feature detectors: a four-corner (square) edge detector 1102, a three-corner (triangle) edge detector 1104, and a dark smooth region detector 1106. A simple single-frame video nucleotide or signature 1110 represents the totals of the features reported by the various detection algorithms, with the resulting counts assembled into a histogram or vector, the "nucleotide"; a simple multi-frame video nucleotide or signature is then the total of the feature counts over a whole series of frames. Because of noise, however, the counts obtained from a single frame may deviate; for this reason, this kind of analysis normally takes a series of frames as its unit of computation and combines the signals from the individual frames into a more robust video nucleotide or signature, using the techniques illustrated in Figure 12.

雜訊,通常會分析-_的影音晝面,若—轉徵_出現^ 個旦面中便將之保留’若為只出現於單—畫面的瞬間特徵㈣則 將之視為雜訊而去除。在這個例子中,虛線㈣指时面與晝面 間持續的特徵、麟方塊(d〇ttedbGx)則鮮只在—個晝面中 的瞬間特徵1202。 — 「第13圖」’特徵表示4_。「第13圖」表現了簡化版的「第 5圖」步驟4000。由於雜訊的存在,視頻分析對持續之特徵的計 數了微不同於晝面對晝面的基礎13()2。為了要減少雜訊,通常會 ,縱晝面和晝面間相對應的—些晝素,利用之前的系統_來計 鼻母個面中匕們代表的特徵,然後捨棄掉其中奇特的特徵(藉 由平均數、中位數或眾數分析),再用合理的特徵來描述這一群晝, 素在這個例子中,床色邊緣修正侦測演算法(血设sm〇〇th e(jge detection algorithm)在三個晝面中回報了兩個正確結果,但在中間 的旦面13G2中卻回報成不正確的擁有兩邊相鄰的頂點。所以依據 上述的流程’其錯誤的回報將被一致特徵點13〇4所取代。 50 200951831 μ「第14圖」,特徵表示Κ (取最相近者)(步驟4_)。這圖 f顯示出4 5圖」的後期現象。即使使用-個相對而言很大數 、(數百絲千)的特徵偵嶋、統去分析—個影音檔案,其中仍 然會有些許倾無法1_ (不屬於任何―種或是·地屬於多 種)。在此圖所談到的技巧中,那些不能被清楚分類的特徵將會被 歸類到與最相近之特徵演算法_之計算容lit。「第Η圖」的 例子中,在-部分的晝面裡面,那方塊的其中一面上多了一條對 角線使得其中兩個頂點變成有四邊相鄰(特徵_。因為在這個 例子中,最接近四角邊特徵演算法的演算法是三角邊特徵演算法 簡,所以那最鄰近區域便會由㈣邊特徵被歸類三角邊特徵演 异法到1104的技術容器中。 「第15圖」,時序校準。這圖片演示了「第1〇圖」(區塊哪 的簡化版過程。為了要觸—個未知的視頻核賊或識別標糾通 常會是-個長約-秒的影音片段),我們會拿它來跟已存在的資料 庫(由取自相同或不同影片的-系列小片段)去做配對,然後將 其歸在最為相_那-個。在此例巾’待分_片段是由一個影 音節目1502 (方塊)和中間***的廣告15〇4 (由三角形和圓型組 成)所組成,而資料庫中則存有不含廣告的節目15〇6。然後利用 「第15圖」的圖表(列出未知和已知間所有配對的可能)去尋找 1502、15〇4對於1506之間的對應性(1508即是相對應的部份)。 結果顯示對應的部份是在廣告的前後’但廣告本身和資料庫間並 沒有對應而造成了一段間隔1510。 51 200951831 「第16圖」,配對步驟。此圖片詳盡的說明了一些「第1〇圖」 中的配對步驟。就像之前討論的,影像上的核苷酸配對比生物上 的核普酸配對來得複雜得多。這是因為生物上的核苷酸只有「A」、' 「T」、「C」及「G」四種’而影像上的則由大量的影像特徵組成。— 因此常常會發生給定一個未知的影音片段卻無法在資料庫中找到 其完美對應的情況。對此我們會放寬”相對應,,的標準,使其相當接 近但不一定要完美地對應。通常這種「相當接近」的概念可以由 各種距離函式來定義,如L1距離函式、Mahalanobis距離函式、❹ Kullback-Leibler divergence 距離函式、EarthM〇ver,s 距離函式或 其他的函式。 在「第16圖」中採用的距離函式是三維空間座標中的 Euclidean距離函式 d = Mxi ~χ2Ϋ +(y,-y2f +(z, -z2f] 其中,x、y以及z為字庫中之特徵描述符的係數,其被使用在簡 單的二係數「特徵描述符袋」或「核苷酸」上,如(x,y,z)。 〇 此例中,由於影音分析的系統只有11〇2、11〇4、11〇6三種(先 刖「第11圖」至「第15圖」中有提到),其配對過程便可在三維. 空間中被視覺化呈現。在此從景多音15〇6中分析出的四個景多音片段, 被四個不同的三維向量所表示。我們可以發現,在原本影音中越 緊密相關的在三維空間腫中距離也越近。故可以藉由調整足以 被稱為相似的雜長絲決定_料需要多她才能被稱為互 相配對。 52 200951831 藉著調整谓測系統對相似度的嚴謹度,便可以依照使用者的 而要來決定兩個影像片段的相關程度(之前在「第15圖」中表示)。 這將在「第17圖」中被表示。我們在「第16圖」中對於不 同影音片段的相似性比較結果可以用Venn diagram來表示。最相 近的兩個視頻糾酸或識別標$是核雜(2,3,1:)和核苷酸(1,31), 其有1的Euchdean距離。次一等靠近視頻核苷酸或識別標誌的是 核苷酸(3,4,1),其與妍邮,切距離223的如咖哪距離,且 與核皆酸(2,3,1)有Hi w Euclidean距離。最後離其他片段最遠的 疋核苷酸(G,G,1) ’他與核賊(3,4,1)、核碰(ι,3,ι)、核賊(2,3,1) 分別為5、3.16、3.60Euclidean距離。因此,藉著調整可接受的距 離範圍,視頻核微狀否可稱作「相對應」的結果也會有所變 動。舉例來过’若相似的標準距離定義為i以内的脇也尬距離, 會造成只有核苷酸(2,3,1)與核苷酸(印)相配對;若定義Eudidean 距離為2.5以内則可使得㈣酸(如)、核苷酸(⑶)、以及核苷 酸(3,4,1)皆可互相配對,如此繼續擴大範圍下去。而我們可以憑藉 著經驗和侧纽使其最触正柄輯且賴誤的配對降 至最低。 序列配對在料影像侧料巾都是相當關鍵的—部分。在 此討論的线及方法,並未事先得知序_相_息,就於這些 序列間執行S靖;制的是’該系統及方法可以精準配對出來自 同一修飾過程的序列。 超衫像,亦稱超連結衫像,是影像的其中一種類別,内建使 53 200951831 用者點擊下私令的触’有助於影像及其他超媒體元素間的導 覽因此’超影像與超文本頗為相似,同樣被全球資訊網路上廣 為使用,讓使用者點擊—字詞即可檢索其他文件的資訊。一般來 »兑’仃動與物件結合在一起的元數據以一種特殊的格式嵌入影像 中,忍味兀數據與影像都是經由内容供應者所發布的。 在2007年11月21日申請之編號11/944,290的專利「互動式 影像内容產生、發布與演示的方法與儀器(Meth〇d — appa咖 for generation,distribution and display of imeractive video c搶 ❹ 中’在參考來源清楚提到超影像的客戶端_伺服器計晝。該計畫的 特色在於影像内容與元數據是各自獨立的,客戶端持有影像内 谷’在伺服器端則為元數據;兩者藉由客戶端特殊的物件描述, 稱為「數位簽署(computing Signatures)」連結在一起,而相關的元 數據則經由比對伺服器檔案庫内的簽署後被搜尋出來。 上述汁晝中有一基礎要素為陳述和比對影像物件的方法,即 該系統與方法論及影像物件的陳述與比對的兩個面向。 〇 至於先洳技術’「互動式影像内容產生、發布與演示的方法與 儀器」之專利中描述的影像映像概括了在此提及的影像DNA ^ - 影像映像最低組成中的局部特性與在此討論的視覺原子互相, 對應,·在此討論的片段或場景,與具體化時間區間也互相對應; 另外,在先有技術中作為合成局部特性為單一載體之用的識別標 誌,也與現今發明的視覺核苷酸相對應。二階段的時間定位及空 間定位計畫是用來比對兩套簽署的運算法則。 54 200951831 超衫像分布系統如「第18圖」所示。根據該系統,只有 =供應者端被申流到超影像客戶端。因此,影像内容供= 對超衫像是無所知的,且就算使㈣留下_容,裡面也沒有任 何元數據或額外的資訊。 影像只能在倒帶影像過程中,由超影像客戶端產生的 DNA辨硪出來。當用戶點擊 ❹ ❹ DNA的-部分被傳逆到^垃/固4時間位置,該影像 ί刀被傳送_服^接讀來_影像内容、特定 :置:特定空間•繼中的物件。元數據舰器使用相同 .自先§削影像驗給元數據伽齡製且被註解的 關崎观鑛蝴爾每—個物件相 一旦影像、_地點顿定縣經聽__仰似辨識 那麼註魄娜行動,接著發送給超視頻客戶端,讓他 =點擊吨行絲。由其財體,修廣告客戶來將行動與 起’進㈣触貞行動是可能 的0 可能的動作包含: •,像内的導航’如跳轉_似的場景。 個知像間的導航’如指出與最近播過的影像相似的另一 個影帶位置。 ’搜尋,如點擊物件來執行檢索。 仏索物件更多的訊息以獲得内容的豐富度。 55 200951831 •到媒體文件或網頁的超連結,如電子商務交易(蹲 貝被點擊的物件)。 、啤 
在超影像應用t,客戶端會將點擊紀錄 trr錢棘。料,嫩是物赠糊心程度的 …人’點擊的時間位置可以界定出較受用戶歡迎的部分; 再者,被點擊的物件本身即是語義上用戶感興趣的内容。 ❹ 超影剝的界面可用錢行職鱗,·朗财法,點 物件即被視為查詢檢索。 以下查詢物件的可紐會被用在·· •料庫中的圖像搜尋:搜尋的對像為圖像,由比對資 的圖像’進而得到最相似的結果。此法是以相似度 為基礎’而非語義。 通用貝=庫中的圖像搜尋:搜尋的對象為圖像,再賦予一 語義標籤;相反地,標籤合扁 的關鍵字查詢。才丁^在文子貝枓庫中用來執行傳統 ❹ 缺乏資料背景可能是上述方法的缺點,同一物件可能會有不 同的語義解釋’端賴其不_制背景。例如,點擊蘋果會產生 對水果的檢索,不過若轉的是蘋果電職告上的蘋果,則得到. 的就會是跑電腦的搜尋結果了。影像基因體可配對物件到, 影像序壯,如果序顺註_,那麼背景:祕的侧資訊就可 以被搜尋蜂·,舉例絲,轉細廣告上輔果,會輯到包 含^廣告的影像序壯,射的_字就與既有背景相關了。 56 200951831 「第19圖」是用描述方法來搜尋對象的祕例子。當使用者 點擊物件時’細_觀錢了觀舰,接細象舰 -會被傳賴元數翻服H,與m⑽觀相珊;來自最佳配 •對序列的註記則產生包含物件描述(如頻果)、背景(如電腦)的關鍵 字來搜尋。 這裡描述的方法可_在_受賴與轉保護的内 容及防止它的非法銷售與查看。 内容保護可以採用下列方式·· 内谷檢索器.存取標案與核查身分的仲介。 •覺察播放内容的播放器。 「第20圖」為範例系統,可追蹤點對點網路中非法内容與提 供者。此系統中社要仲介是内容檢索器,能存取非法上傳至内 容伺服器(如YouTube)或非法在點對點網路中分享的影片檔。 内谷檢索器使用影片基因組學的方法,產生視頻DNA給任何被檢 ❹ 索娜疑為絲_容,並將它絲作軸料資料庫比對。如 果驗明正身,檢索器便通知内容所有者。 , 「第21圖」是根據影片基因組學能覺察内容的播放器。内容 -播放時,用戶端產生識別内容的視頻DNA。視頻DNA被送回元 資料伺服器,那裡儲存了著作權内容標題的資料庫。元資料伺服 器辨別特定内容是否已列入黑名單,並發送訊息到用戶端,禁止 進步播放。或者’元資料祠服器可以***來自廣告飼服器的強 力廣告。因此,使用者可以觀看非法内容,但也被強迫收看廣告。 57 200951831 影片基因組學在加強與豐富既存傳統内容上也很有用。假設 使用者收藏-套舊DVD,且想要用日文字幕、波蘭發音看電影— 一在原本DVD裡未提供的功能。「第22圖」為系統可能的架構, 根據影片基因組學,提供加強傳統内容的可能。此系統包含用戶 端,元資料伺服器,及加強内容伺服器。在用戶端,產生了辨識 播放内谷的獨特視頻(例如,在real_time中,在播放倒退或 前進時)。視頻DNA是獨特的内容辨識器,不可編輯或扭曲。因 此’同一内容的不同版本會有相似或相同的視頻DNA。舉例來說,❹ 同一部電影在不同解析度的不同版本、不同的剪接、從帶有廣告 的播放頻道錄下的同一部電影…等,會有相似或相同的視頻 DNA視頻DNA與要求的加強内容會一同被送到元資料飼服器, 匕將視頻DNA與特定内容資料庫比對,並找到相配的内容。一旦 辨識成功,元資料伺服器會向加強内容伺服器(額外内容儲存處) 提出要求。内容加強資訊就會送到用戶端(例如,即時的串流), 適當地播放。 Ο 内容加強資訊包含: •不同聲音格式與語言的聲道。 · 不同语5的字幕。 · •未包含在原内容的場景。 •可點擊對象的超連結與訊息。 •影片加強層(如,HD解析度’更高的晝面更新率…等), 例如在H.264SVC (可伸縮的影片編碼解碼器)格式。 58 200951831 在内容評價中,就像影片搜索,語意轉與物件出現的情境 陳重要。例如,女性胸部出現在防癌診療情境下,與色情内容 情境下不同。翻基耻學可用絲射__評價,方式如 下:首先,產雜找DNA。觀細dna配對 至指定_,__件。反财,這秘供了麟物件說 明與情境柯祕。齡_物件鱗定魏及__率與期 間’決定内容的評價。Noise, usually analyzes the video of the -_, if - the _ appears _ appears in the face, it will be retained 'if it appears only in the single-picture instant feature (four), then it is treated as noise . In this example, the dotted line (4) refers to the continuous feature between the time surface and the surface, and the ridge block (d〇ttedbGx) is only in the instant feature 1202 in the face. — The “Fig. 13” feature represents 4_. Figure 13 shows a simplified version of Figure 5 of Figure 5. Due to the presence of noise, the analysis of the features of the video analysis is slightly different from the basis of the face 13()2. In order to reduce the noise, usually, the corresponding elements between the vertical and the facet, use the previous system to calculate the characteristics of the nose in the face, and then discard the strange features ( By means of average, median or mode analysis, and then using reasonable features to describe this group, in this case, bed color edge correction detection algorithm (blood set sm〇〇th e (jge detection Algorithm) returns two correct results in three faces, but in the middle of the face 13G2, it returns an incorrect vertice with two adjacent sides. Therefore, according to the above process, its wrong return will be consistent. Replaced by point 13〇4. 50 200951831 μ "14th picture", the characteristic indicates Κ (take the closest one) (step 4_). This figure f shows the late phenomenon of the 4 5 figure. Even if used - relatively A large number, (hundreds of thousands) of features detective, unified analysis - a video file, which will still be a little tilted 1_ (not belonging to any kind or land belongs to a variety). 
FIG. 16, the matching step. This figure details part of the matching of FIG. 10. As discussed earlier, matching video nucleotides is far more complex than matching biological nucleotides, because biological nucleotides take only the four values "A", "T", "C", and "G", whereas video nucleotides are built from a large number of video features. It therefore frequently happens that a given unknown video segment has no perfect counterpart in the database, so the criterion for "correspondence" is relaxed to require close, but not necessarily perfect, agreement. This notion of closeness can be defined by various distance functions, such as the L1 distance, the Mahalanobis distance, the Kullback-Leibler divergence, the Earth Mover's distance, or other functions.

The distance function used in FIG. 16 is the Euclidean distance in three-dimensional coordinates,

d = sqrt((x1 - x2)^2 + (y1 - y2)^2 + (z1 - z2)^2),

where x, y, and z are the coefficients over the feature-descriptor vocabulary, used here as a simple three-coefficient "bag of feature descriptors" or nucleotide (x, y, z).

In this example, because the video analysis system uses only the three detectors 1102, 1104, and 1106 (introduced in FIGS. 11 through 15), the matching process can be visualized in three-dimensional space. The four video segments analyzed from video 1506 are represented by four different three-dimensional vectors. Segments that are more closely related in the original video lie closer together in this space, so the amount of similarity required before two segments are declared a match can be controlled by adjusting the distance below which vectors are considered similar.

By adjusting how strictly the detection system judges similarity, the degree of relatedness of two video segments (shown earlier in FIG. 15) can be tuned to the user's needs. This is illustrated in FIG. 17: the similarity comparisons among the video segments of FIG. 16 can be drawn as a Venn diagram. The two closest video nucleotides or signatures are nucleotide (2,3,1) and nucleotide (1,3,1), at a Euclidean distance of 1. The next closest is nucleotide (3,4,1), at a Euclidean distance of 2.23 from (1,3,1) and 1.41 from (2,3,1). Farthest from the others is nucleotide (0,0,1), at Euclidean distances of 5, 3.16, and 3.60 from nucleotides (3,4,1), (1,3,1), and (2,3,1), respectively. By adjusting the acceptable distance range, the decision as to which video nucleotides are called "corresponding" changes accordingly: if similarity is defined as a Euclidean distance within 1, only nucleotides (2,3,1) and (1,3,1) are paired; if it is defined as a distance within 2.5, then nucleotides (1,3,1), (2,3,1), and (3,4,1) can all be paired with one another, and so on as the range is widened. Experience and testing are used to tune this threshold so that correct pairings are maximized and erroneous pairings are minimized.

Sequence matching is a critical part of both video and biological sequence analysis. The systems and methods discussed here perform alignment between sequences without prior knowledge of their mutual correspondence; notably, they can accurately match sequences derived from the same source through different editing processes.
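The numerical example of FIGS. 16 and 17 can be reproduced directly. The sketch below computes the pairwise Euclidean distances of the four example nucleotides and groups them under two assumed thresholds.

```python
import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

nucleotides = {"A": (2, 3, 1), "B": (1, 3, 1), "C": (3, 4, 1), "D": (0, 0, 1)}

def matches(threshold):
    """Return all nucleotide pairs whose distance is within the threshold."""
    names = sorted(nucleotides)
    return [(p, q, round(euclid(nucleotides[p], nucleotides[q]), 2))
            for i, p in enumerate(names) for q in names[i + 1:]
            if euclid(nucleotides[p], nucleotides[q]) <= threshold]

print(matches(1.0))   # only (A, B) pair at distance 1.0
print(matches(2.5))   # (A, C) and (B, C) now pair as well
```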
Hypervideo, also called hyperlinked video, is a class of video that embeds user-clickable anchors, supporting navigation between the video and other hypermedia elements. Hypervideo is thus quite similar to hypertext, which is likewise widely used on the World Wide Web and lets a user click on a word to retrieve information from other documents. Conventionally, the metadata binding actions to objects is embedded in the video in a special format, which means that both the metadata and the video are published by the content provider.

A client-server hypervideo scheme is described in U.S. patent application Ser. No. 11/944,290, "Method and apparatus for generation, distribution and display of interactive video content", filed November 21, 2007, which is expressly referenced here. A feature of that scheme is that the video content and the metadata are independent: the client holds the video content, while the metadata resides on the server. The two are linked by special object descriptions computed on the client, called signatures ("computing signatures"), and the relevant metadata is retrieved by matching those signatures against the server's archive.

A basic element of that scheme is a method of representing and matching video objects; that is, the system and methodology address both the representation and the matching of video objects. The video map described in that prior-art application generalizes the video DNA discussed here: the local features at the lowest level of the video map correspond to the visual atoms discussed here; the segments or scenes discussed there correspond to the temporal intervals embodied here; and the signature used in the prior art as a single carrier aggregating local features corresponds to the visual nucleotide of the present invention. A two-stage temporal and spatial localization scheme is the algorithm used to match two sets of signatures.
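A minimal skeleton of the two-stage scheme follows; `align` and `spatial_verify` are placeholders standing in for the temporal alignment and the descriptor-based spatial verification described in this document, not a definitive implementation.

```python
def match_videos(query_dna, db_dna, align, spatial_verify):
    """Two-stage spatio-temporal matching skeleton.

    Stage 1 aligns the two visual-nucleotide sequences in time; stage 2
    verifies each temporally matched interval pair by matching feature
    descriptors between the corresponding frames.
    """
    temporal_pairs = align(query_dna, db_dna)        # [(i, j), ...]
    results = []
    for i, j in temporal_pairs:
        transform = spatial_verify(query_dna[i], db_dna[j])
        if transform is not None:                    # geometric check passed
            results.append((i, j, transform))
    return results

# Example wiring with trivial stand-ins:
pairs = match_videos(
    list("ABCA"), list("XABCAY"),
    align=lambda q, s: [(i, s.index(c)) for i, c in enumerate(q) if c in s],
    spatial_verify=lambda a, b: "identity" if a == b else None)
print(pairs)
```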
A hypervideo distribution system is shown in FIG. 18. In this system, only the video content is streamed from the provider side to the hypervideo client. The video content provider is therefore entirely agnostic of the hypervideo function, and even if the content is retained it carries no metadata or additional information.

The video is identified only through the video DNA generated by the hypervideo client during playback. When the user clicks at some temporal position in the video, the portion of the video DNA around that position is sent to the server, which from it identifies the video content, the specific temporal position, and the objects at specific spatial positions within it. The metadata server identifies the content using video DNA that was generated in advance, by the same process, from annotated video content. Once the video, the position, and the object have been identified in this way, the annotated actions are retrieved and sent to the hypervideo client, where the user can invoke them by clicking. Third parties, such as advertisers, can thereby associate actions with objects, making click-driven behavior possible; a client-side sketch of this exchange follows the list of actions below.

Possible actions include:
• Navigation within the video, such as jumping to a similar scene.
• Navigation between videos, such as pointing to a position in another video similar to the one most recently played.
• Search, such as clicking an object to run a retrieval.
• Requesting further information about an object, enriching the content.
• Hyperlinks to media documents or web pages, such as e-commerce transactions (purchasing the clicked object).
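On the client side, the click handling of FIG. 18 amounts to bundling a short stretch of video DNA with the click position. The helper below is only a sketch: the field names and the windowing parameters are illustrative assumptions, not part of the patent.

```python
def click_payload(video_dna, t_click, xy_click, window=1.0, fps=1.0):
    """Hypothetical hypervideo-client helper: select the visual nucleotides
    around the click time and bundle them with the click position.  This is
    what FIG. 18 sends to the metadata server, which matches the DNA,
    resolves the clicked object, and returns its annotated actions.
    """
    i = int(t_click * fps)                   # nucleotide index at the click
    lo = max(0, i - int(window * fps))
    hi = i + int(window * fps) + 1
    return {
        "dna": video_dna[lo:hi],             # short nucleotide subsequence
        "t": t_click,                        # temporal position of the click
        "xy": xy_click,                      # spatial position of the click
    }

print(click_payload([[1, 0, 2], [0, 3, 1], [2, 2, 0]],
                    t_click=1.0, xy_click=(120, 80)))
```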
In hypervideo applications, the client reports click records back to the server for statistical analysis. Such statistics indicate the degree of user interest in objects: the temporal positions of clicks delineate the portions most popular with users, and the clicked objects themselves are, semantically, the content the user is interested in.

The hypervideo interface can also be used to perform searches: with this approach, a clicked object is treated as a retrieval query. Query objects can be used for:
• Image search in a media repository: the query is an image, and the most similar results are obtained by comparison against the images in the repository. This approach is based on similarity rather than semantics.
• Image search in a generic database: the query is an image, to which a semantic label is assigned; the label in turn serves as a keyword for a conventional query against a text database.

A drawback of the above methods can be the lack of context: the same object may receive different semantic interpretations depending on the context in which it appears. For example, clicking on an apple produces a retrieval about the fruit, but if what was clicked is the apple in an advertisement for Apple computers, the desired results are computer search results. The video genomics approach can match the object to the video sequence containing it; if that sequence is annotated, the contextual side information becomes searchable. For example, a click on the apple in the computer advertisement is matched to the video sequence containing that advertisement, and the retrieved keywords are then related to the proper context.

FIG. 19 is an example of searching for an object with the described methods. When the user clicks on an object, the client generates video DNA identifying the video, which is sent to the metadata server and matched there; the annotations of the best-matching sequence then yield keywords covering both the object description (such as apple) and the context (such as computer) for the search.
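Aggregating the reported click records into popularity statistics is straightforward; a minimal sketch follows, assuming clicks are logged as time positions in seconds and bucketed into fixed bins whose size is an assumed parameter.

```python
from collections import Counter

def popular_segments(click_times, bin_seconds=10):
    """Bucket clicked time positions into fixed bins; hot bins delineate
    the parts of the video most popular with users."""
    bins = Counter(int(t // bin_seconds) for t in click_times)
    return bins.most_common()

print(popular_segments([3.2, 7.9, 12.5, 14.0, 15.5, 61.0]))
# [(1, 3), (0, 2), (6, 1)] -> seconds 10-20 drew the most clicks
```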
The methods described here can also be used to identify copyrighted and protected content and to prevent its illegal sale and viewing. Content protection can take the following forms:
• A content retriever: an agent that accesses files and verifies their identity.
• A player that is aware of the content it is playing.

FIG. 20 shows an example system for tracking illegal content and its providers in a peer-to-peer network. The principal agent in this system is the content retriever, which can access video files illegally uploaded to a content server (such as YouTube) or illegally shared in a peer-to-peer network. The content retriever uses the video genomics methods to generate video DNA for any content suspected of being protected and matches it against a reference database. If the identity is confirmed, the retriever notifies the content owner.

FIG. 21 shows a content-aware player based on video genomics. During playback, the client generates video DNA identifying the content. The video DNA is sent back to the metadata server, which stores a database of copyrighted content titles. The metadata server determines whether the particular content has been blacklisted and sends a message to the client prohibiting further playback. Alternatively, the metadata server can insert forced advertisements from an advertisement server; the user can then watch the illegal content but is compelled to watch the advertisements as well. A sketch of this decision logic is given at the end of this section.

Video genomics is also useful for enhancing and enriching existing legacy content. Suppose a user owns a collection of old DVDs and wants to watch a movie with Japanese subtitles and Polish audio, features not provided on the original DVD. FIG. 22 shows a possible architecture of a system that provides such enhancement of legacy content based on video genomics. The system comprises a client, a metadata server, and an enhanced-content server. At the client, unique video DNA identifying the content being played is generated (for example, in real time, including during rewind or fast-forward). Video DNA is a unique content identifier that cannot be edited out or distorted away, so different versions of the same content have similar or identical video DNA: for example, versions of the same movie at different resolutions, with different cuts, or recorded from a broadcast channel carrying advertisements all yield similar or identical video DNA. The video DNA and the requested enhancement are sent together to the metadata server, which matches the video DNA against a content database and finds the corresponding content. Once identification succeeds, the metadata server issues a request to the enhanced-content server, the repository of the additional content. The enhancement information is then delivered to the client (for example, as a real-time stream) and rendered appropriately.

Content enhancement information includes:
• Audio tracks in different formats and languages.
• Subtitles in different languages.
• Scenes not included in the original content.
• Hyperlinks and messages for clickable objects.
• Video enhancement layers (for example, HD resolution, higher frame rate, and the like), for instance in the H.264 SVC (scalable video codec) format.

In content rating, as in video search, the context in which an object appears is semantically important. For example, a woman's breast appearing in a cancer-screening clinical context is different from the same object in a pornographic context. Video genomics can be used for content rating as follows. First, video DNA is generated for the content and matched against an annotated database, identifying specific objects; this supplies both the object descriptions and the contextual information. The rating of the content is then decided from the identified objects and contexts, together with the frequency and duration of their appearance.
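The decision logic of the content-aware player of FIG. 21 can be sketched as follows. `align` stands in for the sequence alignment of FIG. 10 (for instance, the local-alignment sketch shown earlier), and the threshold is an assumed tuning parameter.

```python
def check_playback(content_dna, blacklist_dna, align, threshold):
    """Sketch of the content-aware player of FIG. 21: match the DNA of the
    content being played against a database of copyrighted titles; on a
    sufficiently strong match, block playback or force ad insertion."""
    for title, ref_dna in blacklist_dna.items():
        score, _ = align(content_dna, ref_dna)
        if score >= threshold:
            return {"title": title, "action": "block_or_insert_ads"}
    return {"action": "allow"}
```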

The systems and methods discussed here can also be applied to content-based video summarization, in which the most important or most representative portions of a video sequence are extracted into a digest. As noted above, example applications of video summarization are described in "Method and apparatus for video digest generation", U.S. patent application Ser. No. 11/778,633, filed July 16, 2007, which is expressly incorporated herein by reference.

The following criteria can be used to decide the importance of a portion of the video:
• Self-similarity: if a portion recurs within the same sequence, it needs to be played only once, and the other occurrences can be deleted.
• Similarity, positive example: portions similar to content marked as important (by preset values or by analysis of user behavior) are important.
• Similarity, negative example: portions similar to content marked as unimportant are unimportant.

Video genomics can be used to establish the similarities mentioned above; the similarity criterion can be based on the similarity of the objects in a sequence, of the context, or of both. The sketch after this list illustrates the self-similarity criterion.
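A minimal sketch of the self-similarity criterion, assuming nucleotides are compared with one of the distance functions discussed earlier and an assumed similarity cut-off:

```python
def self_similar_digest(nucleotides, dist, threshold):
    """Summarization by self-similarity: a portion that repeats within the
    same sequence is kept only once (first occurrence)."""
    kept = []
    for n in nucleotides:
        if all(dist(n, k) > threshold for k in kept):
            kept.append(n)                 # novel content: keep it
    return kept                            # repeated content was dropped

euclid = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
digest = self_similar_digest(
    [(2, 3, 1), (1, 3, 1), (2, 3, 1), (0, 0, 1)], dist=euclid, threshold=0.5)
print(digest)   # [(2, 3, 1), (1, 3, 1), (0, 0, 1)]
```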
Referring now to FIGS. 23 through 27, a series of diagrams illustrates a process configuration according to the described systems and methods. FIG. 23 shows an example of the video signature feature-point detection process. In this example, the input video (A) consists of a series of video frames 2300 containing video images 2304, defined over x, y, and a temporal extent, that are processed by multi-scale feature-point detection 2306. The video signals s1, s2, s3 are convolutions of the video with filters of different spatial extent (B), producing a series of images with feature scales of different resolutions. These spatial images at different scales are then analyzed (for example, for corners) at the different scales 1, 2, 3 (C). The locations of the feature points can then be determined from the series of multi-scale peaks (D) and confirmed in the frames (E).

FIG. 24 shows an example of the video signature feature-point tracking and pruning process. This is an optional stage; if it is used, feature points are tracked across multiple frames, a feature point is retained when it persists over a sufficient number of frames (for example, satisfying a preset criterion), and short-lived, transient feature points that do not persist long enough to satisfy the preset criterion are discarded.
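A compact approximation of the multi-scale detection of FIG. 23 follows, using a difference-of-Gaussians stack and reporting local extrema across space and scale; the scales and the response threshold are illustrative assumptions rather than the values used by the described system.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def multiscale_extrema(image, sigmas=(1.0, 2.0, 4.0), thresh=0.05):
    """Convolve the frame with Gaussians of increasing spatial extent,
    build a difference-of-Gaussians stack, and report candidate feature
    points as local extrema across space and scale (cf. FIG. 23, B-D)."""
    blurred = [gaussian_filter(image.astype(float), s) for s in sigmas]
    dog = np.stack([blurred[i + 1] - blurred[i]
                    for i in range(len(blurred) - 1)])
    is_ext = ((dog == maximum_filter(dog, size=3)) |
              (dog == minimum_filter(dog, size=3))) & (np.abs(dog) > thresh)
    scale_idx, ys, xs = np.nonzero(is_ext)
    return list(zip(xs, ys, scale_idx))    # (x, y, scale) candidates

img = np.zeros((32, 32))
img[16, 16] = 1.0                          # a single bright blob
print(multiscale_extrema(img))             # a response at (16, 16)
```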
FIG. 25 shows an example of video signature feature-point description. The example of FIG. 25 illustrates how, once the feature points have been detected as described above, the video is processed again, this time analyzing the neighborhood (x, y, r) around each previously detected feature point (G). This feature description step can be accomplished by several methods. In this example, the SIFT gradients of the image in the neighborhood of a feature point are computed (H), and from these gradients a histogram of gradient directions over a fixed number of bins in the local region is generated (I). The histogram is then parsed into a vector of elements (J), called the feature descriptor.

FIG. 26 shows an example of the vector quantization process, which maps the series of feature-point descriptors of an input image into quantized form. In this example, the video image, previously described by feature-descriptor vectors over an arbitrary feature-descriptor vocabulary (K), is mapped onto a standardized d-dimensional feature-descriptor vocabulary (L). The standardized descriptor vocabulary enables a standardized scheme (M) able to identify video uniquely regardless of its source.

FIG. 27 shows an example of the creation of video DNA. In contrast with standard video analysis, which analyzes video on a frame-by-frame basis, video DNA combines or averages the feature bags of multiple video frames over a time interval, producing one aggregate "visual nucleotide" per interval. As discussed previously, the analyzed video data and the feature bags of the individual frames are summarized as k-dimensional histograms or vectors (N). The feature bags of neighboring video frames (for example, frame 1, frame 2, frame 3) are then averaged (O), producing a representation of a multi-frame video time interval, often referred to as a "visual nucleotide". A minimal sketch of this descriptor, quantization, and averaging pipeline follows.
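The following sketch strings the three steps together: a gradient-orientation histogram as a stand-in for the SIFT descriptor of FIG. 25, nearest-neighbor vector quantization as in FIG. 26, and per-interval averaging of feature bags into a visual nucleotide as in FIG. 27. The toy vocabulary and patch are assumptions for the demonstration.

```python
import numpy as np

def orientation_descriptor(patch, bins=8):
    """FIG. 25, minimal form: normalized histogram of gradient orientations
    over a local patch; a stand-in for a full SIFT descriptor."""
    gy, gx = np.gradient(patch.astype(float))
    ang = np.arctan2(gy, gx)                       # [-pi, pi]
    mag = np.hypot(gx, gy)
    hist, _ = np.histogram(ang, bins=bins, range=(-np.pi, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-9)

def quantize(desc, vocabulary):
    """FIG. 26: map a descriptor to the index of its nearest visual atom
    in a standardized vocabulary (nearest-neighbor vector quantization)."""
    return int(np.argmin(np.linalg.norm(vocabulary - desc, axis=1)))

def visual_nucleotide(frame_descriptors, vocabulary):
    """FIG. 27: quantize every frame's descriptors into a bag of visual
    atoms, then average the bags over the interval into one histogram."""
    bags = []
    for descs in frame_descriptors:                # one list per frame
        bag = np.zeros(len(vocabulary))
        for d in descs:
            bag[quantize(d, vocabulary)] += 1
        bags.append(bag)
    return np.mean(bags, axis=0)

vocab = np.eye(8)                                  # 8 toy visual atoms
patch = np.outer(np.arange(8), np.ones(8))         # vertical-gradient patch
descs = [[orientation_descriptor(patch)]] * 3      # 3 identical frames
print(visual_nucleotide(descs, vocab))
```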
FIG. 28 shows a system 2800 for processing video data. A video data source 2802 stores and/or generates video data. A video segmenter 2804 receives video data from the video data source 2802 and segments the video data into temporal intervals. A video processor 2806 receives video data from the video data source 2802 and performs various operations on it. In this example, the video processor 2806 detects the locations of feature points in the video data, generates feature descriptors associated with those locations, and prunes the detected locations to produce a subset of feature locations. A video aggregator 2810 is coupled to the video segmenter 2804 and the video processor 2806 and generates the video DNA associated with the video data; as discussed herein, the video DNA represents the video data as an ordered sequence of visual nucleotides.

A storage device 2808, coupled to the video segmenter 2804, the video processor 2806, and the video aggregator 2810, stores the various data used by these components, including video data, frame data, feature data, feature descriptors, visual atoms, video DNA, algorithms, settings, thresholds, and so on. The components described in FIG. 28 may be connected directly or through another device, system, component, network, communication link, and the like.

The systems and methods described here allow a variety of video-content identification devices to link identification and correlation to particular video content. In addition, some embodiments can be used together with one or more conventional video processing or imaging systems and methods; for example, an embodiment can be used to improve an existing image processing system.

Although the components and modules described here are combined in particular configurations, the arrangement of any component can be adjusted, giving content identification equipment different identification and correlation behavior. In other embodiments, one or more components can be removed from the systems described above. Alternative embodiments can also merge two or more of the described components into a single element.

[Brief Description of the Drawings]
FIG. 1 is a schematic diagram of the spatial alignment and the temporal alignment of video data according to the present invention.
FIG. 2 is a schematic diagram of representing context with the video genomics approach according to the present invention.
FIG. 3 is a schematic diagram of video DNA according to the present invention.
FIG. 4 is a schematic diagram comparing biological DNA with video DNA according to the present invention.
FIG. 5 is a flowchart of constructing video DNA according to the present invention.
FIG. 6 is a schematic diagram of dividing a video sequence into temporal intervals according to the present invention.
FIG. 7 is a flowchart of frame-based feature point detection according to the present invention.
FIG. 8 is a flowchart of finding invariant feature points according to the present invention.
FIG. 9 is a flowchart of pruning feature point tracks according to the present invention.
FIG. 10 is a flowchart of finding the spatio-temporal correspondence between two video DNA sequences according to the present invention.
FIG. 11 is an example of video feature detection according to the present invention.
FIG. 12 is an example of video feature pruning according to the present invention.
FIG. 13 is an example of the time-averaged description of features according to the present invention.
FIG. 14 is an example of nearest-neighbor feature assignment according to the present invention.
FIG. 15 is an example of the temporal alignment of slightly differing video features according to the present invention.
FIG. 16 is an example of the matching procedure according to the present invention.
FIG. 17 is an example of matching results according to the present invention.
FIG. 18 is a schematic diagram of a hypervideo distribution system based on video genomics according to the present invention.
FIG. 19 is an example of an object-based search system according to the present invention.
FIG. 20 is an example of an illegal-content tracking system based on video genomics according to the present invention.
FIG. 21 is an example of a system providing player content awareness based on video genomics according to the present invention.
FIG. 22 is an example of a content enhancement system based on video genomics according to the present invention.
FIG. 23 is a schematic diagram of the process of detecting video signature feature points according to the present invention.
FIG. 24 is a schematic diagram of tracking and pruning video signature feature points according to the present invention.
FIG. 25 is a schematic diagram of describing video signature feature points according to the present invention.
FIG. 26 is a schematic diagram of vector quantization according to the present invention.
FIG. 27 is a schematic diagram of constructing video DNA according to the present invention.
FIG. 28 is a schematic diagram of a system for describing video DNA according to the present invention.

[Description of Reference Numerals]
100 first stage
102 second stage
200 apple
202 fruit
204 computer
302 first stage
304 second stage

306 temporal interval
308 third stage
310 feature bag
312 visual nucleotide
314 video DNA
400 video signal
402 nucleotides and atoms
510 video DNA computation
520 temporal alignment
525 temporal correspondence
530 selection of the temporal units corresponding to the video
535 subset
536 subset
540 spatial alignment
545 spatial correspondence
600 temporal interval
602 temporal interval
604 temporal interval
606 temporal interval
990 video data
991 video data
1000 feature detection
1010 feature locations

1102 two-edge corner detection algorithm
1104 three-edge corner detection algorithm
1106 dark smooth region detection algorithm
1108 video nucleotide
1110 signature
1202 transient feature
1204 dashed line
1302 frame
1304 consistent feature point
1402 feature
1502 video program
1504 advertisement
1506 program
1508 corresponding portion
1510 gap
1602 three-dimensional space
2000 feature description
2010 feature descriptor
2300 video frames
2304 video image
2306 multi-scale feature detector

2500 video
2800 system
2802 video data source
2804 video segmenter
2806 video processor
2808 storage device
2810 video aggregator
3000 feature pruning
3010 feature subset
3100 feature tracking
3110 track
3200 track pruning
4000 feature representation
4010 visual atom
5000 temporal interval segmentation
6000 visual atom aggregation
6010 video DNA
6011 video DNA

Claims (1)

200951831 七、申請專利範圍: 1. 一種由視頻資料之不随合間判斷時空對應處之方法,該方 法至少包含下列步驟: 輸入複數視頻資料集合; 描述該些_資料為已排序之複數視覺鮮酸序列; 校準該些視覺核_序列,藉以判斷該視頻資料 時間對應處子集; 由該些時間對應處子集之間計算該視頻資料之一空間對❹ 應處(時空對應處);及 輸出在β亥些視頻資料之子集間之該時空對應處。 2. 如申請專利範圍第i項所述之由視頻資料之不同集合間判斷 時空對應處之方法,其中該視頻㈣係包含—查詢視頻資料 及-主體視頻資料之視頻序列的集合,或該主體視頻資料中 單一視頻序列之子集或經過修改之視頻序列之子集。 3. 如申料利翻第2項·之由視师料^合間蘭 〇 時空對應處之方法,其中該時空對應處係被建立於該查詢視 頻資料中之視頻序列的子集及該主體視頻資料中之視頻序列· 的子集之間。 4.如申請專利範圍第2項所述之由視頻資料之不同集合間判斷 時空對應處之方法,其中該查詢視頻資料包含經過修改之該 主體視頻資料之子集,該修改係選擇自改變畫面速率(frame rate change )、改變空間分辨率(spatial resoluti〇n也肪辟)、不 68 200951831 均勻空間比例(non-uniform Spatiai scaiing)、修改直方圖裁 切、覆盍新的視頻内容、及***新的視頻内容之一個或多個 ^ 修改的組合。 • 5.如申請專利範圍第1項所述之由視頻資料之不同集合間判斷 時空對應處之方法’其中該視頻資料係被分割為複數時間區 間,該些時間區間分別計算出對應之視覺核苷酸。 6. 如申請專利範圍第5項所述之由涵資料之*哪合間判斷 ❿ 時空對應處之方法,其中該些時間區間包含在時間上連續不 斷的複數視頻晝面。 Λ 7. 如申清專利範圍第5項所述之由視頻資料之不同集合間判斷 時空對應處之方法,其巾該些_區職跨的時間在1/ 到1秒之間。 / 8. 如申請專利細第5項所述之由視頻資料之不_合間判斷200951831 VII. Patent application scope: 1. A method for judging space-time correspondence by video data, the method includes at least the following steps: inputting a plurality of video data sets; describing the _ data as a sorted complex visual acid sequence Calibrating the visual core_sequences to determine a subset of the video data corresponding to the time; calculating a spatial pair of the video data between the subsets corresponding to the time (the space-time correspondence); and outputting in the The space-time correspondence between subsets of the video material. 2. The method for judging space-time correspondence between different sets of video data as described in item i of the patent application, wherein the video (4) comprises a set of video sequences of the query video material and the body video data, or the subject A subset of a single video sequence or a subset of the modified video sequence in the video material. 3. If the claim is for the second item, the method corresponding to the space-time correspondence of the room, the space-time correspondence is a subset of the video sequence and the subject video that are established in the query video material. Between the subset of video sequences in the data. 4. The method for determining a spatiotemporal correspondence between different sets of video data as described in claim 2, wherein the query video material comprises a modified subset of the subject video material, the modification being selected to change the picture rate (frame rate change), change spatial resolution (spatial resoluti〇n also), not 68 200951831 uniform space ratio (non-uniform Spatiai scaiing), modify histogram cropping, overwrite new video content, and insert new A combination of one or more of the video content modifications. • 5. The method for judging space-time correspondence between different sets of video data as described in item 1 of the patent application scope, wherein the video data is divided into complex time intervals, and the corresponding time intervals respectively calculate corresponding visual cores Glycosylate. 6. The method of judging the space-time correspondence of the data according to item 5 of the patent application scope, wherein the time intervals include a plurality of consecutive video frames that are continuous in time. Λ 7. As for the method of judging the space-time correspondence between different sets of video data as described in item 5 of the patent scope, the time of the _ zone is between 1/1 and 1 second. / 8. Judging from the non-combination of video material as described in the fifth paragraph of the patent application =空對應處之方法’其中該視頻賴係被分割為持續時間固 疋的複數時間區間或持續時間不同的複數時間區間。 9. 
^請專利_ 5項所述之由視頻資料之不同集合間判斷 時^對應處之方法’其中該些__始與終止時間係 依據視頻資料之鏡頭轉變( — transition)計算。 10.如申請專利範圍第5 時空對應處之方法, 分重疊。 項所述之由顧資料之不贿合間判斷 其中該些時__部分重疊或沒有部 11 ·如申清專利範圍第 1項所述之由視頻資料之不同集合間觸 69 200951831 時空對應處之方法,其中該視覺核苦酸被計算之步驟更包人 下列步驟: 描述該視頻資料之一時間區間為一視覺原子(1 atom)集;及 依據該些視覺原子至少其中之一之一函式集建立該視覺 核皆酸。 12·如申請專利範圍第丨丨項所述之由視頻資料之不同集合間判斷 時空對應處之方法,其中被用來建立該視覺核苷酸之該函式 〇 集係為該些視覺原子在該時間區間中之顯露頻率之—直方 圖’或為該些視覺原子在該時間區間中之顯露頻率之一加權 函式直方圖。 13.如申請專利範圍第12項所述之由視頻資料之不同集合間判斷 時空對應處之方法,其中該函式集係為該加權函式直方圖, 其中’被指派為該視覺核苷酸中之一視覺原子之加權函式係 包含該視覺原子在該時間區間中之一時間點、該視覺原子在 ❹ 該時間區間中之一空間位置、及該視覺原子之顯著性。 14·如申請專利範圍第12項所述之由視頻資料之不同集合間判斷 寺二對應處之方法,其中該函式集係為該加權函式長條圖, 其中’被指派為該視覺核苷酸中之一視覺原子之加權函式係 為下列其中之一: 該時間區間上之常數; 在該時間區間内的最大高斯權重; 200951831 為屬於相同鏡頭之一視覺内容設定較大的數值,其中 該相同鏡頭係該時間區間之中心,及為屬於不同鏡頭之一視 . 覺内容設定較小的數值;及 • 為位置接近—晝面之巾心之視覺原子設定較大的數值, 及為位置接近該晝面之邊界之視覺原子設定較小的數值。 is.如申μ專概圍第u項所述之由視頻資料之不同集合間判斷 時空對應處之方法,其中該描述該視頻資料之該時間區間為 ❹ s彡視覺原?集之步驟更包含下列步驟: 偵測該時間區間中不變化的特徵點之集合; »十算”亥視頻資料巾各該不變化的特徵點周圍之局部時空 範圍之描述符之集合; 移除該些不變化的概關子集及該些不變化的特徵點 之描述符;及 冑視魏子的集合剩下不變化的特徵點之位置及描述符 Ο 建立為函式。 16·如申請專利細第丨5項所述之由視頻資料之不畴合間判斷 , _空對應處之方法’其中在該時間區間中之該些不變化的特 • 德^.點係自 H__LaPlace c〇mei· deteetQfS、affme_invariant Harris-Laplace comer detectors > spatio-temporal comer detectors、或MSER演算法的組合中選擇進行計算。 17.如申請專利範圍第15項所述之由視頻資料之不同集合間判斷 時空對應處之方法,其中該MESR演算法係獨立的被應用於 71 200951831 該視頻資料中之晝面之子集或該視頻資料之一時空子集。 18.如申請專利範圍第15項所述之由視頻資料之不同集合 時空對應處之方法,其中該些不變的特徵點之描述符係咖 描述符、時空SIFT描述符、SURF描述符。 19.如申請專利細第15項所述之由視頻資料之不_合間 時空對應處之方法’其巾該計算贿符以合之步 包含下列步驟:= method of null correspondence where the video is divided into complex time intervals of duration solids or complex time intervals of different durations. 9. ^Please refer to the method of determining the correspondence between different sets of video data as described in the patent _ 5, wherein the __start and end times are calculated according to the lens transition of the video material. 10. If the method of applying for the 5th space-time correspondence of the patent scope is overlapped. According to the non-bribery of the data, the __ partial overlap or no part 11 is as described in the first paragraph of the patent scope, and the different sets of video data are in contact with each other. The method, wherein the step of calculating the visual nucleotide acid is further encrusted by the following steps: describing a time interval of the video material as a set of 1 atom; and according to at least one of the visual atoms The set establishes the visual nucleus. 12. The method of determining a spatiotemporal correspondence between different sets of video data as described in the scope of the patent application, wherein the set of functions used to establish the visual nucleotide is the visual atom The histogram of the revealed frequency in the time interval is either a weighted function histogram of the apparent frequencies of the visual atoms in the time interval. 13. The method of determining a spatiotemporal correspondence between different sets of video data as described in claim 12, wherein the set of functions is a histogram of the weighting function, wherein 'as assigned to the visual nucleotide The weighting function of one of the visual atoms includes a time point of the visual atom in the time interval, a spatial position of the visual atom in the time interval, and the significance of the visual atom. 14. 
The method for determining the correspondence between the two sets of different sets of video data as described in claim 12, wherein the set of functions is the weighted function bar graph, wherein 'as assigned to the visual core The weighting function of one of the glycosidic acids is one of the following: a constant over the time interval; a maximum Gaussian weight in the time interval; 200951831 setting a larger value for visual content belonging to one of the same shots, Wherein the same lens is at the center of the time interval, and a smaller value is set for the visual content of one of the different lenses; and • a larger value is set for the visual atom of the towel that is close to the surface of the face, and A visual atom positioned near the boundary of the facets sets a smaller value. Is the method for judging the spatiotemporal correspondence between different sets of video data as described in the item u of the application, wherein the time interval describing the video material is ❹ s 彡 visual original? The step of collecting further comprises the steps of: detecting a set of feature points that do not change in the time interval; » collecting a set of descriptors of local spatio-temporal ranges around the feature points that do not change; The non-changing subset of the changes and the descriptors of the feature points that do not change; and the positions and descriptors of the feature points that are left unchanging in the set of Weizi are established as functions. The method described in item 5 of the fifth item is judged by the non-coincidence of the video data, and the method of the space corresponding to the space is in which the non-changes in the time interval are from the H__LaPlace c〇mei·deteetQfS , affme_invariant Harris-Laplace comer detectors > spatio-temporal comer detectors, or combinations of MSER algorithms are selected for calculation. 17. As judged in the fifteenth item of the patent application, judging the space-time correspondence between different sets of video data The method, wherein the MESR algorithm is independently applied to a subset of the video in the data of 71 200951831 or a spatio-temporal subset of the video material. The method for the spatio-temporal correspondence of different sets of video data according to item 15 of the benefit range, wherein the descriptors of the invariant feature points are descriptors, spatio-temporal SIFT descriptors, and SURF descriptors. The method described in item 15 of the video data is not the case of the space-time correspondence. The method of calculating the bribe is to include the following steps: 追縱在該視頻資料之該時間區間中之該些不變特徵點; 計算屬於追蹤之-軌跡中該些不變的特徵點之該些描述 符之一函式為一描述符;及 為屬於該軌跡之所有特徵點配置該描述符。 2〇.如申請專利範圍帛I9項所述之由視頻資料之不同集合間判斷 時空對應處之方法,其中屬於該軌跡之該些不變的特徵點之Tracking the invariant feature points in the time interval of the video data; calculating one of the descriptors belonging to the invariant feature points in the track-track is a descriptor; The descriptor is configured for all feature points of the track. 2〇. The method for judging space-time correspondence between different sets of video data as described in the scope of patent application 帛I9, wherein the invariant feature points belonging to the trajectory 該些描述符之該函式係該些不變的特徵點之該些描述符的平 均或中位數。 21.如申請專利範圍第15項所述之由視頻資料之不同集合間判斷 時空對應處之方法,其中該移除該些不變的特徵點之子隽之 步驟更包含下列步驟: 追蹤在該視頻資料之該時間區間中之該些不變的特徵 點; 為追蹤的每一執跡配置一執跡品質矩陣(track职也以 metric);及 72 200951831 移除該些軌跡中軌跡品質矩陣值低於預定軌跡品質門摄 之該不變的特徵點。 -22.如申請專利範圍第21項所述之由視頻資料之不同集合間判斷 . 
時空對應處之方法,其中該執跡品質矩陣所配置之一軌跡係 一相符函式,該相符函式係為屬於一軌跡之不變的特徵點之 複數描述符值及複數位置的組合。 23. 如申請專利範圍第22項所述之由視頻資料之不同集合間判斷 ❹ 时對應處之綠,其巾該婦函式係與該些描述符值之差 異成比例或與該些不變的特徵點之位置之總差異成比例。 24. 如申凊專利細第IS項所述之由視頻資料之不同集合間判斷 時空對應處之方法,其巾該建立該些視覺原子之集合之步驟 更包含為每-剩餘之不變的特徵點建立—視覺原子,藉以作 為該不變的特徵點插述符之函式之步驟。 e 25.如中請專利範圍第15項所述之由視頻資料之不同集合間判斷 時空對應處之方法,其中該不變的特徵點之函式包含下列步 驟: 接收-不變的特徵點之描述符為該輸入; 符 ,由=已排序之代表性描述符之集合中搜尋—代表性描述 =料物糾版财細徵點描 之不同集合間判斷 輸出被搜尋出之該代表性描述符之索弓| •如申請專利範圍第25項所述之由視頻資料 73 200951831 時空對應處之方法,其中該搜尋該代表性描述符之步驟係使 用向量量化(Vector Quantization,VQ)演算法或近似最鄰近 (Approximate Nearest Neighbor, ANN)演算法。 27. 如申請專利範圍第1項所述之由視頻資料之不同集合間判斷 時空對應處之方法’其中該已排序之代表性描述符之集合可 能由訓練資料中固定或離線計算,或可能由該輸入之視頻資 料調整或線上更新。 28. 如申請專利範圍第15項所述之由視頻資料之不同集合間判斷❹ 時空對應處之方法,其中該建立該些視覺原子之集合之步驟 更包含下列之移除該些視覺原子之子集之步驟: 為該集合中之各該視覺原子配置一原子品質矩陣;及 移除該原子品質矩陣之值低於預定之一原子品質門檻之 該些視覺原子。 29. 如申請專利範圍第28項所述之由視頻資料之不同集合間判斷 時空對應處之方法,其中該原子品質門檻係維持在該集合中❹ 之最小固疋數量之該些視覺原子或最大數量之該些視覺原 子。 ' 30·如申請專利範圍第i項所述之由視頻資料之不同集合間判斷. 時空對應處之方法,其巾細置該原子品#矩陣之步驟更包 含下列步驟: 接收一視覺原子做為該輸入; 計算該視覺原子在該些代表性視覺原子之集合中之相似 74 200951831 度向量;及 以該相似度向量之函式作為該原子品質矩陣並輸出。 31.如申請專利範圍第30項所述之由視頻資料之不同集合間判斷 時空對應處之方法,其中該相似度向量之函式係執行下列步 驟: 計算在該相似度向量中之一最大值的比例;The function of the descriptors is the average or median of the descriptors of the invariant feature points. 21. The method for determining a spatiotemporal correspondence between different sets of video data as described in claim 15 wherein the step of removing the invariant feature points further comprises the following steps: tracking the video The invariant feature points in the time interval of the data; configuring a performance quality matrix for each trace of the track (the track job is also metric); and 72 200951831 removing the track quality matrix values in the tracks The constant feature point is taken at the predetermined track quality gate. -22. A method for judging a space-time correspondence between different sets of video data as described in claim 21, wherein one of the tracks of the performance quality matrix is a coincidence function, the coincidence function A combination of complex descriptor values and complex positions belonging to a feature point of a track that is invariant. 23. If the corresponding green is determined by the different sets of video data as described in item 22 of the patent application, the function of the towel is proportional to or different from the difference of the descriptor values. The total difference in the position of the feature points is proportional. 24. The method for judging the spatiotemporal correspondence between different sets of video data as described in claim IS, the step of establishing the set of visual atoms further includes each of the remaining invariant features. The point establishes a visual atom, which is the step of the function of the invariant feature point interpolator. e 25. The method for judging the spatiotemporal correspondence between different sets of video data as described in item 15 of the patent scope, wherein the function of the invariant feature point comprises the following steps: receiving-invariant feature points The descriptor is the input; the character is searched by the set of representative descriptors that have been sorted - representative description = the object is corrected, and the representative descriptor is searched out between different sets of judgments. The method of the space-time correspondence of the video material 73 200951831 as described in claim 25, wherein the step of searching for the representative descriptor uses a Vector Quantization (VQ) algorithm or approximation. 
Approximate Nearest Neighbor (AN) algorithm. 27. A method for judging space-time correspondence between different sets of video material as described in claim 1 of the patent application, wherein the set of representative descriptors that have been sorted may be fixed or offline calculated from the training data, or may be The input video data is adjusted or updated online. 28. The method of determining a space-time correspondence between different sets of video data as described in claim 15 wherein the step of establishing the set of visual atoms further comprises removing the subset of the visual atoms. The step of: arranging an atomic quality matrix for each of the visual atoms in the set; and removing the visual atoms whose value of the atomic quality matrix is lower than a predetermined one atomic mass threshold. 29. A method of determining a spatiotemporal correspondence between different sets of video material as recited in claim 28, wherein the atomic quality threshold is the minimum number of solid atoms or the maximum number of solids remaining in the set The number of these visual atoms. 30. As described in item i of the patent application scope, the method of judging the space-time correspondence between the different sets of video data, the step of arranging the atomic product #matrix further comprises the following steps: receiving a visual atom as The input; calculating a similarity of the visual atom in the set of representative visual atoms 74 200951831 degree vector; and using the function of the similarity vector as the atomic quality matrix and outputting. 31. A method for determining a spatiotemporal correspondence between different sets of video data as described in claim 30, wherein the function of the similarity vector performs the following steps: calculating a maximum value in the similarity vector proportion; §十鼻在該相似度向重中之該最大值及一次最大值間之比 率;及 運行在該相似度向量中之該最大值及在該相似度向量中 之該最大值及該次最大值間之比率。 32.如申請專利範圍第1項所述之由視頻資料之不同集合間判斷 時空對應處之方法,其中該校準該些視覺核苷酸序列之步驟 更包含下列步驟: 接收兩該些視覺核苷酸序列π&,...,〜}及πk,...,知}做為 該輸入; 接收一評分函式七為)及一空隙罰分(gappenaity)函式 作為參數; 大化之間隙之集合 _ *=, 农 (?=你鳥《1))...,(/£,气,〜)};及 輸出被搜尋到之該部分之對應處C及該函式之最大值。 75 200951831 33. 如申請專利範圍第32項所述之由視頻資料之不同集合間判斷 時空對應處之方法’其中該函式F(C, G)最大化之步驟係利用 Smith-Waterman 演算法、Needleman-Wunsch 演算法、BLAST . 演算法或階層式(hierarchical)演算法。 34. 如申請專利範圍第32項所述之由視頻資料之不同集合間判斷 時空對應處之方法,其中該評分函式係反轉一距離函式 4七),該距離函式選擇自歐基里德(Euclidean)距離、L1 距離、馬氏(Mahalanobis )距離、Kullback-Leibler divergence ❹ 距離、及Earth Mover’s距離之組合。 35. 如申請專利範圍第32項所述之由視頻資料之不同集合間判斷 時空對應處之方法’其中該評分函式係為/却及 之一個或多個的組合,其中a可能為單位矩陣(identity matrix) 或對角矩陣(diagonal matrix )。 36. 如申請專利範圍第32項所述之由視頻資料之不同集合間判斷 0 時空對應處之方法’其中該評分函式係該視覺核苷酸t為一產 生變化之核苷酸〜之機率,該產生變化之機率可能由訓 練資料估計。 ' 37. 如申請專利範圍第36項所述之由視頻資料之不同集合間判斷 時空對應處之方法’其中該評分函式係該機率之比率 P^g U , W5 ) — ’該產生變化之機率可能由訓練資料估計。 38. 如申請專利範圍第35項所述之由視頻資料之不同集合間判斷 76 200951831 時空對應處之方法’其中該矩陣3之對角線元素係正比於 < ’其中,&表示在-視覺核苷酸中之—視覺原子z•之預期 數量。 39.如申明專利範圍第38項所述之由視頻資料之不同集合間判斷 時空對應處之方法,財該w紋自姆之視頻資料或輸 入之視頻資料’且該矩陣乂之對角線元素係正比於L,其中, V,係在同一視覺核苷酸之變化版本中之視覺原子z_之變量 (variance) ° 4〇.如申凊專利範圍第%項所述之由視頻資料之不同集合間判斷 時空對應處之方法,其巾該空隙罰分係為參數函式如,咕), 其中’/及y係在該兩視頻序列之間隙之起始位置,w係間隙 長度,夕係參數。 41. 如申請專利範圍第40項所述之由視頻資料之不同集合間判斷 時空對應處之方法,其中該參數夕可能由該訓練資料估計, 该訓練資料包含在該些視頻序列中***或刪除内容之例子。 42. 如申請專利範圍第32項所述之由視頻資料之不同集合間判斷 時空對應處之方法,其中該空隙罰分係為函式r(„)=fl+如,其 中’ w係間隙長度,0與6係為參數。 43. 
如申請專利範圍第32項所述之由視頻資料之不同集合間判斷 時空對應處之方法,其中該空隙割分係為凸面參數函式或反 轉在起始位置為/及y之該兩視頻序列中搜尋到該間隙之長度 77 200951831 «之機率。 44·如申請專利範圍第1 時空對應處之方法, 下列步驟: 項所述之由視崎料之*同集合間判斷 其中該計算該空間對應處之步驟更包含 輸入該視頻資料之複數時間對應子集 提供在該些子集中之複數特徵點; 搜尋該些特徵點間之對應處;及§ the ratio between the maximum value and the maximum value of the tenth nose in the similarity weight; and the maximum value of the running in the similarity vector and the maximum value and the maximum value in the similarity vector The ratio between the two. 32. The method of determining a spatiotemporal correspondence between different sets of video data as recited in claim 1, wherein the step of calibrating the visual nucleotide sequences further comprises the steps of: receiving two of the visual nucleosides The acid sequences π&,...,~} and πk,..., know as the input; receive a scoring function of seven) and a gap penalty (gappenaity) function as parameters; The set _ *=, Nong (?=Your bird "1))..., (/£, qi, ~)}; and the corresponding point C of the portion to which the output is searched and the maximum value of the function. 75 200951831 33. The method for judging space-time correspondence between different sets of video data as described in item 32 of the patent application scope, wherein the step of maximizing the function F(C, G) is performed by using the Smith-Waterman algorithm, Needleman-Wunsch algorithm, BLAST. Algorithm or hierarchical algorithm. 34. A method for judging spatiotemporal correspondence between different sets of video material as described in claim 32, wherein the scoring function is inversed by a distance function 4 (7), the distance function is selected from The combination of Euclidean distance, L1 distance, Mahalanobis distance, Kullback-Leibler divergence ❹ distance, and Earth Mover's distance. 35. A method for determining a spatiotemporal correspondence between different sets of video material as described in claim 32, wherein the scoring function is a combination of one or more, where a may be an identity matrix (identity matrix) or diagonal matrix. 36. A method for judging 0 space-time correspondence between different sets of video data as described in claim 32, wherein the scoring function is the probability that the visual nucleotide t is a nucleotide that produces a change~ The probability of this change may be estimated from the training data. ' 37. The method for judging the spatiotemporal correspondence between different sets of video data as described in item 36 of the patent application' wherein the scoring function is the probability ratio P^g U , W5 ) — 'this changes The probability may be estimated from the training data. 38. The method of judging 76 200951831 space-time correspondence between different sets of video data as described in claim 35, wherein the diagonal element of the matrix 3 is proportional to < 'where, & The expected number of visual atoms in the visual nucleotide z. 39. The method for judging the spatiotemporal correspondence between different sets of video data as described in claim 38 of the patent scope, the video material of the m or the input video material and the diagonal element of the matrix Is proportional to L, where V is the variation of the visual atom z_ in the modified version of the same visual nucleotide ° 4〇. 
The difference in video data as described in item % of the patent scope of the application The method for judging the space-time correspondence between sets, the gap penalty of the towel is a parameter function such as 咕), where '/ and y are at the beginning of the gap between the two video sequences, w is the gap length, and the system is parameter. 41. A method for determining a spatiotemporal correspondence between different sets of video material as described in claim 40, wherein the parameter may be estimated from the training data, the training material being included in the video sequence being inserted or deleted An example of content. 42. A method for determining a spatiotemporal correspondence between different sets of video material as described in claim 32, wherein the gap penalty is a function r(„)=fl+如, where 'w is the gap length, 0 and 6 are parameters. 43. A method for judging spatiotemporal correspondence between different sets of video data as described in claim 32, wherein the gap segmentation is a convex parameter function or an inversion at the beginning The length of the gap is found in the two video sequences with position / and y. 200951831 «The probability. 44. If the patent application scope is the first time and space correspondence method, the following steps: Determining, between the same set, the step of calculating the corresponding position of the space further comprises: inputting a plurality of feature points of the plurality of time corresponding subsets of the video data to provide the plurality of feature points in the subset; searching for a correspondence between the feature points; and 搜尋複數空間座標間之對應處。 45.如申請專利綱第44項所述之由視頻資料之不同集合間判斷 時空對應處之方法,其中該視㈣料之時騎應子集更包含 至少一對時間對應之畫面。 46.如申請專利範圍第44項所述之由視頻資料之不同集合間判斷 時空對應處之方法’其巾該搜尋該鋪徵_之對應處之步 驟更包含下列步驟: 輸入兩特徵點之集合; ❹ 提供該些特徵點之描述符,其中,該些特徵點及該些描 述符係與被用來計算視覺核苷酸之複數特徵點及複數描述符 相同;及 配對描述符。 47.如申請專利範圍第44項所述之由視頻資料之不同集合間判斷 時空對應處之方法,其中該搜尋該些特徵點間之對應處之步 驟係使用 RANdomSAmple Consensus (RANSAC)演算法。 78 200951831 48.如申請專利範圍第44項所述之由視頻資料之不同集合間判斷 時空對應處之方法,其中該搜尋該些特徵點間之對應處之步 驟係搜尋一模型之複數參數,該模型係描述該兩特徵點之集 合的轉換,其中,搜尋該模組之複數參數係解決 f=arg产之優化問題,{(^)}及^义)丨係該= 特徵點之集合,T係取決於參數夕之特徵點之集合間之轉換參 數。 ❹ 49.如申請專利範圍第44項所述之由視頻資料之不同集合間判斷 時空對應處之方法,其中該空間座標間之對應處係該視頻資 料之一子集中之該空間座標(x,y)與視頻資料之另一子集中之 空間座標(x’,y,)間之映射(map)。 5〇.如申請專纖圍第1項所述之由視頻資料之獨集合間列斷 時空對應處之方法,其中該計算該時空對應處之步驟更包含 下列步驟: 輸入該視頻資料之複數時間對應子集; ® 提供在該些子集中之複數特徵點; 搜尋該些特徵點間之對應處;及 ’ 搜尋複數空間座標間之對應處。 51.如申請專利範圍第1項所述之由視頻資料之不同集合間列斷 時空對應處之方法,其中該計算該時空對應處之步驟更包含 下列步驟: 輸入該視頻資料之複數時間對應子集; 79 200951831 提供在該些子針之複數特徵點; 搜尋該些特徵點間之對應處; 搜尋複數空間座標間之對應處;及 搜尋時間座標間之對應處。 52. -種由不同視頻#料之集合間判斷時空對應處之方法,該方 法至少包含下列步驟: 輸入複數視頻資料之集合; 描述該些視頻資料為已排序之複數視覺核苷酸序列,其❹ 中,該視頻資料被分為複數時間區間; 為各該時間區間計算至少一視覺核苷酸,其中,各該視 覺核皆酸係為來自該視頻資料之不同時間區間之複數視覺原 子之集合之群組,且各該視覺原子係描述該視頻資料之局部 時空範圍之視頻内容; 依據下列步驟建立視覺原子: 在該時間區間中偵測不變化之複數特徵點; ❹ 在各該不變化之特徵點周圍計算該視頻資料之局部 時空範圍之描述符之集合; 移除該些不變特徵點之子集及該些不變化之特徵點 之播述符;及 ’ 以剩下之該些不變化之特徵點之位置及該些描述符 建立視覺原子之集合; 权準该些視覺核苷酸序列’藉以判斷該視頻資料之複數 80 200951831 時間相似子集; 由该些時間相似子集之間計算該視頻資料之一空間對應 處(時空對應處);及 • 輸出在該些視頻資料之子集間之該時空對應處。 53. 如申μ專利範圍第52項所述之由視頻資料之不同集合間判斷 時空對應叙綠,其巾該被用來建立視覺核賊之群組係 該些視覺原子在該時間區間中之出現頻率之一長條圖,或為 ❹ 該些視覺軒在該時㈣财ii{麵率之-加權函式長條 圖。 54. 如申請專利範圍第52項所述之由視頻資料之不同集合間判斷 時空對應處之方法,其中在該時間區間中之該些不變化之特 徵點係以 Harris-Laplace comer detectors、affine-invariant Harris-Laplace comer detectors ' spatio-temporal corner detectors、或MSER演算法的組合所計算。 ® 55.如申請專利範圍第52項所述之由視頻資料之不同集合間判斷 時空對應處之方法,其中該不變化之特徵點之描述符係Sift - 描述符、sPati〇-temp〇ral SIFT描述符、或SURF描述符,且計 . 
算該些描述符之集合之步驟更包含下列步驟: 追蹤在該視頻資料之該時間區間中之該些不變特徵點之 對應處; 計算屬於一軌跡之該些不變特徵點之該些描述符之一函 式為一描述符;及 200951831 為屬於該軌跡之所有特徵點配置該描述符。 56.如申請專利範圍第52項所述之由視頻資料之不同集合間判斷 時空對應處之方法,其中該移除該些不變化之特徵點之子集 之步驟更包含下列步驟: 追蹤在該視頻資料之該時間區間中之該些不變特徵點之 對應處; 為各該軌跡配置一軌跡品質矩陣,其中該軌跡品質矩陣 係被配置給-執跡,該軌跡係與屬於該軌跡之該些不變化之❹ 特徵點之描述符之值及/或屬於該軌跡之該些不變化之特徵點 之位置之組合一致之函式;及 由軌跡品質矩陣值低於預定軌跡品質門檻之該些軌跡中 移除該不變特徵點。 A如申請專利制第52撕述之由視㈣料U哪合間判斷 時空對應處之方法,其巾建立視覺原子之集合之步驟係為各 該剩下之不變化之特徵點建立一視覺原子,藉以做為該不變❹ 化之特徵點之描述符,其中,該不變化之特徵點之描述符以 下列步驟被執行: 接收一不變化之特徵點之描述料雜人; . #由:已排序之代表性描述符之集合中搜尋一代表性描述 ^以代表佳描述符係與該輸入令最佳之該不變特徵點描述 付相匹配;及 輸出被搜尋出之該代表性插述符之索引; 82 200951831 其中,該已排序之代表性描述符之集合可能由訓練資料 中固定或離線計算,或可能由該輸人之視頻資料適應或線上 更新。 58. 如申請專利細第52項所述之由視㈣料之獨集合間判斷 時空對應處之方法,財該校準該些視覺核賊序列之 更包含下列步驟 接收兩視覺核苷酸序列,=^, ❹ 入 及? = {&,·..,做為該輸 接收-評分函式〇及一空隙罰分(gappe崎)函式 心)作^參數,其中該評分函式係反轉-距離函式“); 搜尋口P刀之對應處c=n(u丨及函式 F(C,G)=|U+|;成最大化之間隙之集合Search for the correspondence between the complex space coordinates. 45. A method for determining a spatio-temporal correspondence between different sets of video material as described in claim 44, wherein the sub-set of the visual (4) material further comprises at least one pair of time-corresponding pictures. 46. The method for determining the space-time correspondence between different sets of video data as described in claim 44 of the patent application, the step of searching for the corresponding position of the shop__ further includes the following steps: inputting a set of two feature points ❹ providing descriptors of the feature points, wherein the feature points and the descriptors are the same as the complex feature points and the complex descriptors used to calculate the visual nucleotides; and the pairing descriptors. 47. A method for determining a spatiotemporal correspondence between different sets of video material as described in claim 44, wherein the step of searching for correspondence between the feature points uses a RANdomSAmple Consensus (RANSAC) algorithm. 78 200951831 48. The method for determining a spatiotemporal correspondence between different sets of video data as described in claim 44, wherein the step of searching for a correspondence between the feature points is to search for a plural parameter of a model, The model system describes the transformation of the set of two feature points, wherein the search for the complex parameters of the module solves the optimization problem of f=arg production, {(^)} and ^yi) 该 is the set of feature points, T It depends on the conversion parameters between the sets of feature points of the parameter eve. ❹ 49. The method for judging space-time correspondence between different sets of video data as described in claim 44, wherein the space coordinates correspond to the space coordinates in a subset of the video material (x, y) A map between the spatial coordinates (x', y,) of another subset of the video material. 5〇. The method for dividing the space-time correspondence between the individual collections of the video data as described in the first item of the special fiber circumference, wherein the step of calculating the space-time correspondence further comprises the following steps: inputting the plural time of the video data Corresponding subsets; ® provide complex feature points in those subsets; search for correspondences between the feature points; and 'search for correspondences between complex space coordinates. 51. 
The method for breaking a space-time correspondence between different sets of video data as described in claim 1 wherein the step of calculating the space-time correspondence further comprises the following steps: inputting a complex time corresponding to the video data. 79 200951831 provides the complex feature points of the sub-pins; searches for the correspondence between the feature points; searches for the correspondence between the complex space coordinates; and searches for the correspondence between the time coordinates. 52. A method for judging space-time correspondence between sets of different video materials, the method comprising at least the following steps: inputting a set of complex video data; describing the video data as a sorted complex visual nucleotide sequence, In ❹, the video data is divided into a plurality of time intervals; at least one visual nucleotide is calculated for each time interval, wherein each of the visual nuclei is a set of complex visual atoms from different time intervals of the video data. a group, and each of the visual atomic systems describes a video content of a local spatiotemporal range of the video material; establishing a visual atom according to the following steps: detecting a complex feature point that does not change in the time interval; ❹ A set of descriptors for calculating a local spatiotemporal range of the video material around the feature point; removing a subset of the invariant feature points and the descriptors of the unchanging feature points; and 'the remaining ones do not change The location of the feature points and the descriptors establish a set of visual atoms; the visual nucleotide sequence is used to determine the video a plurality of 80 200951831 time-like subsets; calculating a spatial correspondence (spatial-temporal correspondence) of the video data between the temporally similar subsets; and • outputting the space-time correspondence between the subsets of the video materials . 53. If the space-time correspondence between the different sets of video data is as described in item 52 of the patent scope of claim μ, the towel is used to establish a group of visual nuclear thieves in which the visual atoms are located. A bar graph of the frequency of occurrence, or a bar chart of the visual weights at that time (4). 54. A method for determining a spatio-temporal correspondence between different sets of video material as described in claim 52, wherein the unchanging feature points in the time interval are Harris-Laplace comer detectors, affine- Invariant Harris-Laplace comer detectors 'spatio-temporal corner detectors, or a combination of MSER algorithms. ® 55. A method for judging spatiotemporal correspondence between different sets of video data as described in claim 52, wherein the descriptor of the unchanging feature point is Sift - descriptor, sPati〇-temp〇ral SIFT Descriptor, or SURF descriptor, and calculating the set of descriptors further comprises the steps of: tracking the correspondence of the invariant feature points in the time interval of the video data; calculating the belonging to a track One of the descriptors of the invariant feature points is a descriptor; and 200951831 configures the descriptor for all feature points belonging to the track. 56. 
56. The method of determining a spatiotemporal correspondence between different sets of video data as described in claim 52, wherein the step of removing a subset of the invariant feature points further comprises the steps of: tracking correspondences of the invariant feature points over the time interval of the video data; assigning a trajectory quality metric to each trajectory, wherein the trajectory quality metric assigned to a trajectory is a function consistent with the values of the descriptors of the invariant feature points belonging to the trajectory and/or a combination of the locations of the invariant feature points belonging to the trajectory; and removing the invariant feature points belonging to trajectories whose trajectory quality metric falls below a predetermined trajectory quality threshold.

57. The method of determining a spatiotemporal correspondence between different sets of video data as described in claim 52, wherein the step of building a set of visual atoms builds, for each remaining invariant feature point, a visual atom as a function of the descriptor of that feature point, the function of the descriptor being carried out by: receiving a descriptor of an invariant feature point as input; searching an ordered set of representative descriptors for the representative descriptor that best matches the input invariant feature point descriptor; and outputting the index of the representative descriptor found; wherein the ordered set of representative descriptors may be fixed or computed offline from training data, or may be adapted or updated online from the input video data.

58. The method of determining a spatiotemporal correspondence between different sets of video data as described in claim 52, wherein the step of aligning the sequences of visual nucleotides further comprises the steps of: receiving two sequences of visual nucleotides s = {s_1, ..., s_M} and q = {q_1, ..., q_N} as input; receiving a score function σ and a gap penalty function γ as parameters, wherein the score function is an inverse of a distance function; searching for a partial correspondence C = {(i_k, j_k)} and a set of gaps G maximizing the function F(C, G) = Σ_k σ(s_{i_k}, q_{j_k}) − Σ_{g ∈ G} γ(g); and outputting the partial correspondence C found and the maximum value of the function.
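A minimal sketch of the gapped alignment of claim 58, assuming a linear gap penalty γ(g) = gap_cost · |g| and a score function built as the negative (an "inverse") of a Euclidean distance; the dynamic program below is the classic Needleman-Wunsch recursion, which the claim's functional F(C, G) admits but does not require:

```python
import numpy as np

def align(s, q, score, gap_cost):
    """Maximize F(C, G) = sum_k score(s[i_k], q[j_k]) - gap_cost * |gaps|
    over partial correspondences C; returns (max value, C)."""
    M, N = len(s), len(q)
    F = np.zeros((M + 1, N + 1))
    F[1:, 0] = -gap_cost * np.arange(1, M + 1)    # leading gaps in q
    F[0, 1:] = -gap_cost * np.arange(1, N + 1)    # leading gaps in s
    back = np.zeros((M + 1, N + 1), dtype=int)    # 0 = match, 1 = up, 2 = left
    back[1:, 0], back[0, 1:] = 1, 2
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            moves = (F[i - 1, j - 1] + score(s[i - 1], q[j - 1]),
                     F[i - 1, j] - gap_cost,
                     F[i, j - 1] - gap_cost)
            back[i, j] = int(np.argmax(moves))
            F[i, j] = moves[back[i, j]]
    C, i, j = [], M, N                            # trace back C = {(i_k, j_k)}
    while i > 0 or j > 0:
        if back[i, j] == 0:
            C.append((i - 1, j - 1)); i -= 1; j -= 1
        elif back[i, j] == 1:
            i -= 1
        else:
            j -= 1
    return F[M, N], C[::-1]

# Example score function: inverse of the distance between two nucleotides.
score = lambda a, b: -np.linalg.norm(np.asarray(a) - np.asarray(b))
```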
59. The method of determining a spatiotemporal correspondence between different sets of video data as described in claim 52, wherein the step of computing the spatial correspondence further comprises the steps of: inputting temporally corresponding subsets of the video data; providing feature points in those subsets; searching for a correspondence between the feature points by: inputting two sets of feature points; providing descriptors of the feature points; and matching the descriptors; and searching for a correspondence between spatial coordinates.

60. A method of determining a spatiotemporal correspondence between different sets of video data, the method comprising at least the steps of: building sequences of visual nucleotides by: detecting feature points in a series of temporally contiguous video frames of the video data; pruning the feature points to discard feature points that appear in only one of the video frames; time-averaging the remaining feature points and discarding, on the basis of the average, anomalous feature points; assigning to each remaining feature point, by closest match, an entry of a standard array of different feature descriptors; and counting the number of occurrences of each assigned entry over the series of temporally contiguous video frames, thereby building coefficients over the standard array of feature descriptors, wherein each visual nucleotide comprises the coefficients over the standard array, and the sequences of visual nucleotides comprise the temporally contiguous visual nucleotides; aligning the sequences of visual nucleotides, thereby determining temporally similar subsets of the video data; computing a spatial correspondence of the video data between the temporally similar subsets (a spatiotemporal correspondence); and outputting the spatiotemporal correspondence between the subsets of the video data.

61. An apparatus, comprising at least: a video data source; a video segmenter, connected to the video data source, for dividing video data into a plurality of time intervals; a video processor, connected to the video data source, for detecting the locations of feature points in the video data, generating descriptors corresponding to the locations of the feature points, and pruning the detected feature-point locations to produce a subset of feature-point locations; and a video aggregator, connected to the video segmenter and the video processor, for generating video DNA corresponding to the video data, wherein the video DNA comprises a plurality of ordered sequences of visual nucleotides.
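A minimal sketch of the descriptor lookup of claim 57 and the counting step of claim 60, assuming Euclidean nearest-neighbour matching against the "standard array" (here a visual vocabulary held as a numpy matrix); the helper names are hypothetical:

```python
import numpy as np

def nearest_index(descriptor, vocabulary):
    """Claim 57: index of the representative descriptor that best
    matches the input (here, smallest Euclidean distance)."""
    return int(np.argmin(np.linalg.norm(vocabulary - descriptor, axis=1)))

def nucleotide_from_frames(frame_descriptors, vocabulary):
    """Claim 60: count how many descriptors from a run of temporally
    contiguous frames fall on each vocabulary entry; the count vector
    is the nucleotide's coefficient vector over the standard array."""
    counts = np.zeros(len(vocabulary))
    for descs in frame_descriptors:       # one (k, d) array per frame
        for d in descs:
            counts[nearest_index(d, vocabulary)] += 1
    return counts
```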
TW98112574A 2008-04-15 2009-04-15 Methods and systems for representation and matching of video content TW200951831A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US4527808P 2008-04-15 2008-04-15
US12/349,469 US8358840B2 (en) 2007-07-16 2009-01-06 Methods and systems for representation and matching of video content

Publications (1)

Publication Number Publication Date
TW200951831A true TW200951831A (en) 2009-12-16

Family

ID=44871867

Family Applications (1)

Application Number Title Priority Date Filing Date
TW98112574A TW200951831A (en) 2008-04-15 2009-04-15 Methods and systems for representation and matching of video content

Country Status (1)

Country Link
TW (1) TW200951831A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8719442B2 (en) 2011-12-28 2014-05-06 Industrial Technology Research Institute System and method for providing and transmitting condensed streaming content

Similar Documents

Publication Publication Date Title
TW200951833A (en) Methods and systems for representation and matching of video content
US8358840B2 (en) Methods and systems for representation and matching of video content
US8676030B2 (en) Methods and systems for interacting with viewers of video content
US8719288B2 (en) Universal lookup of video-related data
Wei et al. Frame fusion for video copy detection
US8200648B2 (en) Data similarity and importance using local and global evidence scores
US9171578B2 (en) Video skimming methods and systems
US8503770B2 (en) Information processing apparatus and method, and program
US20120057775A1 (en) Information processing device, information processing method, and program
Sreeja et al. Towards genre-specific frameworks for video summarisation: A survey
Shroff et al. Video précis: Highlighting diverse aspects of videos
Küçüktunç et al. Video copy detection using multiple visual cues and MPEG-7 descriptors
CN103631932A (en) Method for detecting repeated video
Zhou et al. Feature extraction and clustering for dynamic video summarisation
Gornale et al. Analysis and detection of content based video retrieval
Papadopoulos et al. Automatic summarization and annotation of videos with lack of metadata information
Fan et al. Robust spatiotemporal matching of electronic slides to presentation videos
TW200951832A (en) Universal lookup of video-related data
Reyes et al. Where is my phone? Personal object retrieval from egocentric images
Jiang et al. Video searching and fingerprint detection by using the image query and PlaceNet-based shot boundary detection method
TW200951831A (en) Methods and systems for representation and matching of video content
Dong et al. Advanced news video parsing via visual characteristics of anchorperson scenes
Ewerth et al. Robust video content analysis via transductive learning
Aggarwal et al. Event summarization in videos
Lin et al. Video retrieval for shot cluster and classification based on key feature set