1220483
0213-8530TWF(N); STLC-01-D-9114; franklin.ptd

V. Description of the Invention

[Field of the Invention]
The present invention relates to a method for building an audio/video retrieval database and to a song retrieval system that uses this database. Using the time point at which text appears in each scene, the position of a specific character string in the audio-visual file and in its corresponding scene is located precisely, so that searching for a specific string plays back the specific scene.

[Background of the Invention]
In recent years KTV, MTV, and karaoke have become popular forms of entertainment. Current song-selection systems search by song title or singer. Their drawback is that a user who has forgotten the title faces too many songs to find the right one easily, while a search keyed on the title or singer returns an unwieldy amount of information. Search time is wasted, or the song actually wanted is never found.

[Summary of the Invention]
In view of this, the present invention provides a method for building an audio/video retrieval database. Using the time point at which text appears in each scene, the method precisely locates a specific character string of that text in the audio-visual file and in its corresponding scene, so that searching for a specific string plays back the specific scene. The invention also provides a song retrieval system that uses this database and lets users retrieve a song, or a segment of a song, by text or by audio features.

As a result, when practicing a new song, where a particular passage often has to be rehearsed repeatedly, the retrieved song can be replayed directly from that passage. Text serves as the index data for searching the audio/video database: the position of the scene that corresponds to a specific character string is found in the audio-visual file, so that searching for the string plays back that scene.

To make the above and other objects, features, and advantages of the present invention easier to understand, a preferred embodiment is described in detail below with reference to the accompanying drawings.

[Embodiment]
Refer to Fig. 1, which is a flowchart of the operations for building the audio/video retrieval database.

First, in step S10, the video and the audio are extracted separately from an audio-visual file. The video undergoes video feature analysis in step S14, and the audio undergoes audio feature analysis in step S16. Based on the results of the video and audio feature analyses, the audio-visual file is cut into scenes in step S18, and key frames are extracted in step S20 from all the scenes obtained by the cutting.

Next, in step S22, video optical character recognition extracts the text in each key frame and recognizes the text that appears in the audio-visual file; the result serves as the text data. In step S24 an index table is generated from the text data, the audio feature analysis results, and the corresponding key frames. In step S26 the index table and the audio-visual file are stored, producing an audio/video retrieval database that can be searched by text or by audio features.

In these steps, demultiplexing separates the video and the audio from the audio-visual file, and the video and audio data and signals are then analyzed for features. Video feature analysis S14 analyzes the color distribution of the film; audio feature analysis S16 analyzes the digital sound signal of the audio.

Scene cutting S18 uses the results of video feature analysis S14 and audio feature analysis S16 to cut the file into individual segments that are easy to manage.
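The scene-cutting idea of step S18 can be illustrated with a small sketch. The patent states only that color distribution is analyzed in step S14 and that scenes are cut from the analysis results; it does not give the comparison rule. The sketch below therefore assumes one common approach: a scene boundary is declared wherever the L1 distance between the normalized color histograms of two consecutive frames exceeds a threshold. The function names, the threshold value, and the choice of the first frame of each scene as its key frame are all hypothetical.

```python
def histogram_distance(h1, h2):
    """L1 distance between two normalized color histograms of equal length."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def cut_scenes(histograms, fps, threshold=0.5):
    """Split a frame sequence into scenes (assumed form of step S18).

    histograms: one normalized color histogram per frame, i.e. the output
                assumed here for video feature analysis (step S14).
    fps:        frames per second, used to convert frame indices to seconds.
    Returns a list of (start_seconds, end_seconds) pairs, one per scene.
    """
    if not histograms:
        return []
    boundaries = [0]
    for i in range(1, len(histograms)):
        # A large jump in color distribution is treated as a scene change.
        if histogram_distance(histograms[i - 1], histograms[i]) > threshold:
            boundaries.append(i)
    boundaries.append(len(histograms))
    return [(start / fps, end / fps)
            for start, end in zip(boundaries, boundaries[1:])]


def key_frame_indices(scenes, fps):
    """Pick the first frame of each scene as its key frame (assumption for S20)."""
    return [round(start * fps) for start, _ in scenes]


if __name__ == "__main__":
    # Three frames dominated by one color, then two dominated by another.
    frames = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 2
    scenes = cut_scenes(frames, fps=1)
    print(scenes)
    print(key_frame_indices(scenes, fps=1))
```

A real system would also fold in the audio features of step S16, which this sketch leaves out.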
Scene cutting S18 also marks the start and end points of each scene, the scene in which each piece of text appears, and the time point of each line of text within its scene. For example, the text string "明媚的月光更照亮了我的心" ("the bright moonlight lights up my heart all the more") starts to be displayed 2 minutes 19 seconds into the song. During playback, every time point at which a change must occur has to be a key frame.

Refer next to Fig. 2, which is a detailed flowchart of the operations for video optical character recognition. Steps S40 to S46 are defined as the detection procedure, and steps S48 to S49 as the recognition procedure.

A key frame is obtained first. Step S40 performs texture analysis, which extracts and analyzes the spatial distribution pattern of the image gray levels (shading). Step S42 performs motion analysis on the key frame. Step S44 integrates the results of the texture analysis of step S40 and the motion analysis of step S42 across multiple frames. Step S46 then performs color region segmentation: the image is thresholded, and edge detection is followed by region growing to eliminate over-segmentation, giving a better segmentation result and completing the image segmentation. Step S48 segments the text blocks into characters, and step S49 performs character recognition: the feature extraction algorithm of the character recognition module computes character feature values, which are analyzed and classified to yield the text data.

Refer next to Fig. 3, which shows the system architecture of an embodiment that retrieves KTV songs using the audio/video retrieval database. The system architecture of the embodiment is described below. Song retrieval is taken as the example, but the scope of the invention is not limited to it.

A KTV song retrieval system according to an embodiment of the present invention includes an audio/video retrieval database 55, a first input device 50, 70, a second input device 60, an audio feature analysis unit 52, an audio feature comparison unit 53, and a search engine 65.

The audio/video retrieval database 55 stores a plurality of tracks, audio features, and an index table; through the index table, the corresponding track and segment can be obtained from text or from audio features. The first input device 50, 70 is used to input voice data. The second input device 60 is used to input text data. The audio feature analysis unit 52 receives the voice data input through the first input device and performs audio feature analysis on it. The audio feature comparison unit 53 compares the audio features of the voice data with the audio features stored in the audio/video retrieval database and, if any match, outputs the corresponding track or segment. The search engine 65 searches the audio/video retrieval database for tracks that match the text data input through the second input device and outputs the matching tracks or segments.

Referring to Fig. 3, a user who wants to retrieve a song hums a segment of it into the first input device 50, for example a microphone. The audio feature analysis unit 52 performs audio feature analysis on the hummed segment, and the audio feature comparison and analysis unit 53 compares the audio features of the segment with the index data in the audio/video retrieval database 55. If a melody matches, the corresponding track or segment is output.

Still referring to Fig. 3, an index value such as a song title, a singer name, or a lyric fragment can also be entered through the second input device 60, for example a keyboard, and the search engine 65 then searches the audio/video retrieval database 55.
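The hummed-query path through units 52 and 53 can be sketched as follows. The patent does not say which audio features are compared or how, so everything here is an assumption: each melody is represented as a list of note pitches (MIDI-style integers), pitches are converted to intervals so that a query hummed in a different key still matches, and the query is slid along each stored melody to find the alignment with the smallest total interval difference. The names `intervals`, `best_match`, `match_hummed_query`, and the `max_cost` tolerance are hypothetical.

```python
def intervals(pitches):
    """Differences between consecutive pitches; unchanged by transposition."""
    return [b - a for a, b in zip(pitches, pitches[1:])]


def best_match(query_pitches, track_pitches):
    """Best alignment of the query inside one track.

    Returns (note_offset, cost), or None when the query has fewer than two
    notes or is longer than the track.
    """
    q = intervals(query_pitches)
    t = intervals(track_pitches)
    if not q or len(q) > len(t):
        return None
    best = None
    for offset in range(len(t) - len(q) + 1):
        window = t[offset:offset + len(q)]
        cost = sum(abs(a - b) for a, b in zip(q, window))
        if best is None or cost < best[1]:
            best = (offset, cost)
    return best


def match_hummed_query(query_pitches, database, max_cost=2):
    """Compare a hummed query against every stored melody.

    database: dict mapping a track title to its pitch list (a stand-in for
              the audio features stored in retrieval database 55).
    Returns (title, note_offset, cost) tuples, best match first.
    """
    hits = []
    for title, pitches in database.items():
        match = best_match(query_pitches, pitches)
        if match is not None and match[1] <= max_cost:
            hits.append((title, match[0], match[1]))
    return sorted(hits, key=lambda hit: hit[2])


if __name__ == "__main__":
    # Made-up melodies; the query is the opening of song_a hummed 5 semitones higher.
    stored = {
        "song_a": [60, 62, 64, 65, 67, 65, 64],
        "song_b": [60, 60, 67, 67, 69, 69, 67],
    }
    print(match_hummed_query([65, 67, 69, 70], stored))
```

The returned note offset is what would let the system output a segment rather than only the whole track, provided the index table maps note positions to time points.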
Tracks that match the index value are then output.

Alternatively, an index value such as a song title, a singer name, or a lyric fragment can be spoken into the first input device, for example a microphone. Voice replaces keyboard entry of the index value, which gives natural two-way human-machine interaction. The spoken index value is recognized by speech recognition and passed to the search engine 65, which searches the audio/video retrieval database 55 and outputs the tracks that match the index value.

In addition, during scene cutting S18 the present invention cuts the audio-visual file into individual scenes that are easy to manage and marks the start and end points of each scene, the scene in which each piece of text appears, and the time point of each line of text within each scene, for example 2 minutes 29 seconds. These data are stored in the audio/video retrieval database (S26). When a song has been found from a song segment input through the microphone, or from a lyric fragment input through the keyboard or microphone as the index value, playback can start directly at the audio and video of that lyric fragment within the song.

For example, suppose the song segment or lyric fragment "明媚的月光更照亮了我的心" ("the bright moonlight lights up my heart all the more") is entered as the index value, and the search of the audio/video retrieval database 55 finds that a lyric fragment of the song 綠島小夜曲 (Green Island Serenade) matches it. When that song is selected for playback, the audio and video can be played directly from the lyric fragment "明媚的月光更照亮了我的心", without starting the song from the beginning. Singing practice in particular often requires rehearsing one passage of a song repeatedly; rewinding again and again is inconvenient, and being unable to jump straight to the audio and video of that passage wastes time.
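The lookup from a lyric fragment to a playback position can be sketched as below. The patent says only that the index table records, for each line of text, its scene and time point; the row layout, the field names, and the substring search are assumptions made for illustration, and the time values in the sample data are made up rather than taken from any real recording.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """One hypothetical row of the index table produced in step S24."""
    track: str            # song title
    text: str             # one line of on-screen lyrics recognized by OCR
    start_seconds: float  # time point at which the line appears in the song
    scene_start: float    # start of the scene that contains the line
    scene_end: float      # end of the scene that contains the line


def find_playback_positions(index_table, fragment):
    """Return (track, start_seconds) for every line containing the fragment."""
    return [(entry.track, entry.start_seconds)
            for entry in index_table
            if fragment in entry.text]


def format_position(seconds):
    """Render a position in seconds as m:ss for display."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"


if __name__ == "__main__":
    # Illustrative rows only; the times are not measured from a real file.
    table = [
        IndexEntry("綠島小夜曲", "明媚的月光更照亮了我的心", 139.0, 130.0, 150.0),
        IndexEntry("another song", "placeholder lyric line", 12.0, 10.0, 20.0),
    ]
    for track, start in find_playback_positions(table, "月光更照亮"):
        print(track, "-> start playback at", format_position(start))
```

A player would seek both the video and the audio stream to the returned position, which is how the passage can be replayed without rewinding.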
By building a video and audio retrieval database, the present invention provides the function of playing the video and the audio of such a passage together.

Although the present invention has been disclosed above by way of a preferred embodiment, the embodiment is not intended to limit the invention. Anyone skilled in the art may make various changes and modifications without departing from the spirit and scope of the invention, so the scope of protection of the invention is defined by the appended claims.
[Brief Description of the Drawings]
Fig. 1 is a flowchart of the operations for building the audio/video retrieval database;
Fig. 2 is a flowchart of the operations for video optical character recognition;
Fig. 3 shows the system architecture of an embodiment that retrieves KTV songs using the audio/video retrieval database.

[Description of Reference Symbols]
S10 ~ extract video and audio from the audio-visual file;
S14 ~ perform video feature analysis;
S16 ~ perform audio feature analysis;
S18 ~ perform scene cutting;
S20 ~ extract key frames;
S22 ~ perform video optical character recognition;
S24 ~ generate an index table;
S26 ~ generate an audio/video retrieval database;
S40 ~ perform texture analysis;
S42 ~ perform motion analysis;
S44 ~ perform multi-frame integration;
S46 ~ perform color region segmentation;
S48 ~ perform character segmentation;
S49 ~ perform character recognition;
50, 70 ~ first input device;
52 ~ audio feature analysis unit;
53 ~ audio feature comparison and analysis unit;