TWM591655U - Spokesperson audio and video tracking system - Google Patents
- Publication number: TWM591655U (application number TW108212189U)
- Authority: TW (Taiwan)
Landscapes
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
A speaker audio and image tracking system is mainly installed in an open space such as a conference room. It comprises a control host, a panoramic image capturing device and a microphone array device. A database of the control host is preloaded with multiple pieces of facial motion feature information.

During a meeting, the panoramic image capturing device uses the facial motion feature information in the database to identify the participant who is currently speaking. It then determines that speaker's three-dimensional spatial position information, which is used to drive the microphone array device to pick up the speaker's voice precisely and reject noise.

A close-up of the speaker's face is then projected onto a display screen in the conference room. Other participants can therefore immediately see who is speaking and hear the speaker clearly.
Description
A speaker audio and image tracking system, and in particular a speaker audio and image tracking system that can clearly identify the voice and image of the person speaking in a meeting.
A conventional video conferencing system may use three or more cameras to film the participants and a microphone array to locate the speaker, then enlarge the located speaker in the video conference image. However, the conventional approach performs only sound source localization. It assumes that the position of the audio source is the speaker's position and enlarges the image at that position. Ambient noise therefore degrades its accuracy, and it cannot reliably determine where the speaker is.

In addition, a typical conventional single-microphone pickup system has the following drawbacks:
(1) its pickup is directional, so a person who is not speaking toward the microphone is captured poorly;
(2) in a meeting, the microphone must be handed to the next person whenever the speaker changes;
(3) in home smart-appliance applications, pickup efficiency is extremely low.
Conventional microphone-array conference systems use omnidirectional microphone arrays, which effectively improves pickup quality for every speaker in the room. However, they cannot tell whether a sound source is signal or noise, so they perform poorly when background noise sources are present.
In view of the above problems, the creator drew on many years of experience in the conference video equipment industry to study and analyze how to locate a speaker's audio source and image in video. Accordingly, the main purpose of this creation is to provide a speaker audio and image tracking system that can clearly identify the voice and image of the person speaking in a meeting.
To achieve the above purpose, the speaker audio and image tracking system of this creation mainly includes a control host, a panoramic image capturing device and a microphone array device. A database of the control host is preloaded with multiple pieces of facial motion feature information.

During a meeting in the conference room, the panoramic image capturing device uses the facial motion feature information in the database to identify the participant who is speaking. It then determines the speaker's three-dimensional spatial position information, which drives the microphone array device to pick up sound precisely and reject noise.

A close-up of the speaker's face is then projected onto a display screen in the conference room. Other participants can therefore clearly see who is currently speaking and follow what is being said.
To enable the examiners to clearly understand the purpose, technical features and effects of this creation, the following description is provided with reference to the drawings.
10: Speaker audio and image tracking system
101: Control host
102: Panoramic image capturing device
1011: Central processing module
1021: Image analysis module
1012: Database
1022: Face recognition unit
1013: Information transceiver module
1014: Projection module
1015: Labeling unit
1016: Image embedding unit
103: Microphone array device
1031: Sound source filtering module
11: Display screen
12: Conference room
13: Open space
A: Speaker
B: Identification information
C: Video frame
F: Facial motion feature information
F1: Face image information
F2: Three-dimensional spatial position information
N: Ambient audio
N1: Human voice source information
N2: Ambient noise information
Figure 1 is a schematic diagram of the system configuration of this creation.
Figure 2 is a schematic diagram of the implementation of this creation (1).
Figure 3 is a schematic diagram of the implementation of this creation (2).
Figure 4 is a schematic diagram of the implementation of this creation (3).
Figure 5 shows another embodiment (1) of this creation.
Figure 6 is a schematic diagram of the implementation of embodiment (1).
Figure 7 shows another embodiment (2) of this creation.
Figure 8 is a schematic diagram of the implementation of embodiment (2).
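Please refer to Figure 1, a schematic diagram of the system configuration of this creation. The speaker audio and image tracking system 10 mainly includes a control host 101, a panoramic image capturing device 102 and a microphone array device 103.

The control host 101 may be, for example, a physical server or a cloud host. It has a central processing module 1011, which drives the operation of each module and is in information connection with a database 1012, an information transceiver module 1013 and a projection module 1014. The database 1012 stores in advance multiple pieces of facial motion feature information F. This information may describe, for example, how the facial muscles move when the mouth opens to speak. The information transceiver module 1013 receives or transmits electronic information. The projection module 1014 projects image information onto a display screen 11 (not shown in the figure).

The panoramic image capturing device 102 is mainly installed in an open space such as a conference room. It has an image analysis module 1021 containing a face recognition unit 1022. The device may be, for example, a panoramic camera or a depth camera (also called a stereo camera). It captures image information in different directions and can stitch these images into a panoramic image covering the entire meeting environment.

Using the facial motion feature information F in the database 1012, the face recognition unit 1022 identifies the speaker who is currently speaking in the open space. It then captures and analyzes that speaker's face image information F1 and three-dimensional spatial position information F2 (for example, 3D coordinates). The face image information F1 is mainly a close-up image of the speaker's face.

The facial motion recognition may be performed by image comparison using machine learning or deep learning. For example, face recognition may be trained on a convolutional neural network (CNN), more specifically a Faster R-CNN (Faster Region-based Convolutional Neural Network), with iterative training by stochastic gradient descent (SGD).

The three-dimensional spatial position information F2 is the speaker's position in the open space and is used to locate the speaker. To make it easier to capture images of the scene, a rotating base (for example, a universal rotating base, not shown) may be mounted under the panoramic image capturing device 102 so that it can frame a full 360 degrees.

The microphone array device 103 has a sound source filtering module 1031 and can be installed in an open space such as a conference room. It may be an array microphone with multiple microphone pickup units that capture ambient audio N from several different directions. The ambient audio N mainly consists of human voice source information N1 and ambient noise information N2. The sound source filtering module 1031 can be preset with filter parameters that remove the ambient noise information N2 and keep only the human voice source information N1.

In addition, the panoramic image capturing device 102 and the microphone array device 103 may be integrated into the control host 101, so that the panoramic image and the sound signals are captured synchronously.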
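The face recognition unit 1022 is described as picking out the current speaker from facial motion features such as mouth movement. The trained CNN or Faster R-CNN model itself is not disclosed.

The sketch below is a deliberately simple stand-in, not the patented method. It assumes an upstream face-landmark detector already yields a per-frame mouth-opening value for each tracked face. It then treats the face whose mouth opening fluctuates most over a short window as the active speaker. The names `mouth_openness`, `find_active_speaker` and `TALK_THRESHOLD`, and the threshold value, are hypothetical.

```python
from statistics import pstdev
from typing import Dict, List, Optional

# Hypothetical threshold: minimum variation in normalized mouth opening
# over the window that we treat as "talking" rather than a still face.
TALK_THRESHOLD = 0.05


def mouth_openness(upper_lip_y: float, lower_lip_y: float, face_height: float) -> float:
    """Lip gap normalized by face height, so near and far faces are comparable."""
    return abs(lower_lip_y - upper_lip_y) / face_height


def find_active_speaker(tracks: Dict[str, List[float]],
                        threshold: float = TALK_THRESHOLD) -> Optional[str]:
    """Return the face ID whose mouth openness varies most over the window,
    or None if no face varies by more than `threshold`."""
    best_id: Optional[str] = None
    best_score = threshold
    for face_id, openness in tracks.items():
        if len(openness) < 2:
            continue  # at least two frames are needed to measure motion
        score = pstdev(openness)  # population standard deviation over the window
        if score > best_score:
            best_id, best_score = face_id, score
    return best_id
```

For example, with `{"seat1": [0.02, 0.02, 0.03, 0.02], "seat2": [0.02, 0.15, 0.04, 0.18]}` the function returns `"seat2"`, because only that mouth opens and closes noticeably. A learned detector, as the source suggests, would replace this heuristic in practice.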
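The source gives the three-dimensional spatial position information F2 only as "for example, 3D coordinates" and does not say how it is computed. As one illustration, assume two things:

- the panoramic image is stored in equirectangular form;
- a depth camera supplies the distance to the face.

A pixel position can then be converted to Cartesian coordinates centred on the camera. The function name and axis conventions are assumptions.

```python
import math
from typing import Tuple


def pixel_to_3d(u: float, v: float, depth_m: float,
                width: int, height: int) -> Tuple[float, float, float]:
    """Map pixel (u, v) of an equirectangular panorama plus its depth to
    (x, y, z) in metres, camera at the origin, z pointing up.

    Assumed conventions: column u spans azimuth 0..2*pi from the left edge,
    row v spans elevation +pi/2 (top row) to -pi/2 (bottom row).
    """
    azimuth = 2.0 * math.pi * u / width
    elevation = math.pi / 2.0 - math.pi * v / height
    horizontal = depth_m * math.cos(elevation)  # distance projected onto the floor plane
    return (horizontal * math.cos(azimuth),
            horizontal * math.sin(azimuth),
            depth_m * math.sin(elevation))
```

For a 3840 x 1080 panorama, a face at column 960 on the middle row, 2 m away, maps to roughly (0, 2, 0): straight along the +y axis at camera height.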
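Please refer to Figure 2, a schematic diagram of the implementation of this creation (1), together with Figure 1. In use, the panoramic image capturing device 102 and the microphone array device 103 are first installed at a suitable location, for example in an open space 13 of a conference room 12. Normally, the panoramic image capturing device 102 keeps the facial expressions of all participants in the conference room 12 under focused monitoring.

When someone speaks, such as the speaker A shown in the figure, the panoramic image capturing device 102 compares speaker A's facial expression with the facial motion feature information F in the database 1012. This determines whether that person is actually speaking. If so, the device captures and analyzes the speaker's face image information F1 and three-dimensional spatial position information F2 and sends them to the database 1012 of the control host 101 for storage.

Please also refer to Figure 3, a schematic diagram of the implementation of this creation (2). Continuing from Figure 2, the central processing module 1011 sends the three-dimensional spatial position information F2 to the microphone array device 103 through the information transceiver module 1013. Based on F2, the microphone array device 103 mutes or turns off the microphone pickup units facing other directions and enables only the units facing the speaker's position. It thereby focuses on the ambient audio N from that direction. The sound source filtering module 1031 then filters the ambient audio N to extract the human voice source information N1, which is sent on to the control host 101.

Please also refer to Figure 4, a schematic diagram of the implementation of this creation (3). The control host 101 projects speaker A's face image information F1 onto the display screen 11 of the conference room 12 through the projection module 1014, so that participants can see the current speaker's face. The human voice source information N1 is then sent through the information transceiver module 1013 to external audio equipment such as loudspeakers.

In this way, the voice and image of speaker A can be clearly identified, and other participants can clearly see the current speaker and follow what is being said.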
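In the Figure 3 step, the microphone array device 103 mutes the pickup units facing other directions and enables only those facing the speaker's position F2. A minimal sketch of that selection rests on these assumptions:

- a hypothetical circular array whose `num_units` units face evenly spaced horizontal directions;
- the array centre coincides with the origin of F2.

The unit layout, the beam half-width and the function names are all assumptions.

```python
import math
from typing import List, Tuple


def speaker_azimuth_deg(position: Tuple[float, float, float]) -> float:
    """Horizontal bearing of the speaker from the array centre, in [0, 360)."""
    x, y, _ = position
    return math.degrees(math.atan2(y, x)) % 360.0


def select_mic_units(position: Tuple[float, float, float],
                     num_units: int = 8,
                     beam_half_width_deg: float = 30.0) -> List[bool]:
    """Return an on/off mask: True for units facing the speaker, False (muted)
    for all others. Unit i is assumed to face 360 * i / num_units degrees."""
    target = speaker_azimuth_deg(position)
    mask = []
    for i in range(num_units):
        facing = 360.0 * i / num_units
        # Smallest angular difference, handling wrap-around at 0/360 degrees.
        diff = abs((target - facing + 180.0) % 360.0 - 180.0)
        mask.append(diff <= beam_half_width_deg)
    return mask
```

For example, `select_mic_units((0.0, 2.0, 0.0))` enables only unit 2, which faces 90 degrees. A production array would more likely steer a beam with delay-and-sum or adaptive beamforming across all units. The simple on/off mask here mirrors the muting the source describes.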
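The sound source filtering module 1031 removes the ambient noise N2 using preset filter parameters, but the source does not say what those parameters are. As a hedged example only, assume the preset parameter is the conventional telephone speech band of roughly 300 to 3400 Hz. A brick-wall FFT mask then discards all energy outside that band. This removes out-of-band noise such as mains hum, but it cannot separate noise that overlaps the voice band.

```python
import numpy as np

# Hypothetical preset filter parameters: the classic telephone speech band.
VOICE_BAND_HZ = (300.0, 3400.0)


def filter_voice_band(signal: np.ndarray, sample_rate: int,
                      band=VOICE_BAND_HZ) -> np.ndarray:
    """Zero every FFT bin outside `band` (in Hz) and return the
    time-domain result with the same length as `signal`."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    keep = (freqs >= band[0]) & (freqs <= band[1])
    spectrum[~keep] = 0.0  # discard out-of-band components
    return np.fft.irfft(spectrum, n=len(signal))
```

With a 1 kHz tone plus 50 Hz hum sampled at 16 kHz, the output keeps the tone and drops the hum. Real conference systems typically use adaptive noise suppression rather than a fixed mask.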
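Please refer to Figure 5, which shows another embodiment (1) of this creation. In this embodiment, multiple pieces of identification information B are also stored in advance in the database 1012. The identification information B may be identity information such as facial feature information and names. The projection module 1014 has a labeling unit 1015 that can annotate people in the image with the identification information B.

Please also refer to Figure 6, a schematic diagram of the implementation of embodiment (1), together with Figures 1 and 5. When the panoramic image capturing device 102 captures speaker A's face image information F1 for display, it can also compare F1 with the identification information B stored in the database 1012. This yields the identification information B that corresponds to speaker A.

After the comparison, the control host 101 uses the labeling unit 1015 of the projection module 1014 to place the matching identification information B on speaker A's head image on the display screen 11. Participants can then see who speaker A is.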
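For embodiment (1), the source states only that the face image information F1 is compared with the stored identification information B. One common way to do such a comparison, assumed here rather than taken from the source, is to compare fixed-length face embeddings by cosine similarity and accept the best match above a threshold. How the embeddings are produced, the threshold value and the function names are hypothetical.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify_speaker(face_embedding: Sequence[float],
                     identity_db: Dict[str, Sequence[float]],
                     threshold: float = 0.8) -> Optional[str]:
    """Return the stored name whose reference embedding best matches the face,
    or None when no match reaches `threshold` (the label is then omitted)."""
    best_name: Optional[str] = None
    best_sim = -1.0
    for name, reference in identity_db.items():
        sim = cosine_similarity(face_embedding, reference)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name if best_sim >= threshold else None
```

Returning `None` below the threshold lets the labeling step skip unknown faces instead of attaching a wrong name.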
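Please refer to Figure 7, which shows another embodiment (2) of this creation. This creation can also work with video conferencing equipment so that what appears on screen is linked. In the control host 101 shown in this figure, the projection module 1014 has an image embedding unit 1016.

Please also refer to Figure 8, a schematic diagram of the implementation of embodiment (2). The image embedding unit 1016 embeds speaker A's face image information F1 into a video frame C as a sub-picture (picture-in-picture). Remote participants in the video conference can then clearly tell who in video frame C is speaking.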
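The image embedding unit 1016 in embodiment (2) inserts the face close-up F1 into the video frame C as a sub-picture. A minimal picture-in-picture sketch with NumPy follows. The following details are illustrative assumptions, not details from the source:

- bottom-right placement;
- 25% scale;
- margin;
- nearest-neighbour resizing.

```python
import numpy as np


def embed_picture_in_picture(frame: np.ndarray, face: np.ndarray,
                             scale: float = 0.25, margin: int = 16) -> np.ndarray:
    """Return a copy of `frame` with `face` resized (nearest neighbour) to
    `scale` of the frame width and pasted into the bottom-right corner."""
    out = frame.copy()  # leave the original video frame untouched
    frame_h, frame_w = frame.shape[:2]
    target_w = max(1, int(frame_w * scale))
    target_h = max(1, int(face.shape[0] * target_w / face.shape[1]))  # keep aspect ratio
    y0 = frame_h - margin - target_h
    x0 = frame_w - margin - target_w
    if y0 < 0 or x0 < 0:
        raise ValueError("sub-picture does not fit inside the frame")
    # Nearest-neighbour resampling: map each target pixel to a source pixel.
    rows = np.arange(target_h) * face.shape[0] // target_h
    cols = np.arange(target_w) * face.shape[1] // target_w
    small = face[rows[:, None], cols]
    out[y0:y0 + target_h, x0:x0 + target_w] = small
    return out
```

A 200 x 200 close-up embedded in a 1280 x 720 frame becomes a 320 x 320 inset whose top-left corner sits at (x=944, y=384). In practice a proper image-resizing routine would give smoother results than nearest-neighbour sampling.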
As can be seen from the above, the speaker audio and image tracking system of this creation mainly includes a control host, a panoramic image capturing device and a microphone array device. A database of the control host is preloaded with multiple pieces of facial motion feature information.

During a meeting in the conference room, the panoramic image capturing device uses the facial motion feature information in the database to identify the participant who is speaking. It then determines the speaker's three-dimensional spatial position information, which drives the microphone array device to pick up sound precisely and reject noise. A close-up of the speaker's face is then projected onto a display screen in the conference room.

This creation thus enables other participants to clearly see who is currently speaking and follow what is being said.
However, the above describes only preferred embodiments of this creation and is not intended to limit the scope of its implementation. Any equivalent changes and modifications made by those skilled in the art without departing from the spirit and scope of this creation shall fall within the patent scope of this creation.
In summary, this creation meets the patent requirements of industrial applicability, novelty and inventive step. The applicant therefore files this application for a utility model patent with the Office in accordance with the Patent Act.
Claims (13)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW108212189U TWM591655U (en) | 2019-09-12 | 2019-09-12 | Spokesperson audio and video tracking system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW108212189U TWM591655U (en) | 2019-09-12 | 2019-09-12 | Spokesperson audio and video tracking system |
Publications (1)
Publication Number | Publication Date |
---|---|
TWM591655U true TWM591655U (en) | 2020-03-01 |
Family ID: 70768065
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW108212189U TWM591655U (en) | 2019-09-12 | 2019-09-12 | Spokesperson audio and video tracking system |
Country Status (1)
Country | Link |
---|---|
TW (1) | TWM591655U (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI751866B (en) * | 2020-12-29 | 2022-01-01 | 仁寶電腦工業股份有限公司 | Audiovisual communication system and control method thereof |
TWI797740B (en) * | 2020-09-03 | 2023-04-01 | 日商索尼互動娛樂股份有限公司 | Apparatus, method and assembly for multimodal game video summarization with metadata field |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI797740B (en) * | 2020-09-03 | 2023-04-01 | 日商索尼互動娛樂股份有限公司 | Apparatus, method and assembly for multimodal game video summarization with metadata field |
TWI751866B (en) * | 2020-12-29 | 2022-01-01 | 仁寶電腦工業股份有限公司 | Audiovisual communication system and control method thereof |
US11501790B2 (en) | 2020-12-29 | 2022-11-15 | Compal Electronics, Inc. | Audiovisual communication system and control method thereof |
Similar Documents
Publication | Title |
---|---|
TWM594202U (en) | Spokesman audio tracking system | |
CN106657865B (en) | Conference summary generation method and device and video conference system | |
CN105657329B (en) | Video conferencing system, processing unit and video-meeting method | |
CN210469530U (en) | Audio and image tracking system for speaking person | |
WO2020119032A1 (en) | Biometric feature-based sound source tracking method, apparatus, device, and storage medium | |
CN111432115B (en) | Face tracking method based on voice auxiliary positioning, terminal and storage device | |
US11128793B2 (en) | Speaker tracking in auditoriums | |
WO2019206186A1 (en) | Lip motion recognition method and device therefor, and augmented reality device and storage medium | |
US20150146078A1 (en) | Shift camera focus based on speaker position | |
JP7347597B2 (en) | Video editing device, video editing method and program | |
CN111260313A (en) | Speaker identification method, conference summary generation method, device and electronic equipment | |
CN109982054A (en) | A kind of projecting method based on location tracking, device, projector and optical projection system | |
TWM591655U (en) | Spokesperson audio and video tracking system | |
WO2021120190A1 (en) | Data processing method and apparatus, electronic device, and storage medium | |
US11775834B2 (en) | Joint upper-body and face detection using multi-task cascaded convolutional networks | |
CN113486690A (en) | User identity identification method, electronic equipment and medium | |
US9756421B2 (en) | Audio refocusing methods and electronic devices utilizing the same | |
CN205378084U (en) | With no paper video conferencing system of accuse in intelligence | |
US10979666B2 (en) | Asymmetric video conferencing system and method | |
TW200411627A (en) | Robottic vision-audition system | |
CN205912235U (en) | Intelligent sound box | |
US7986336B2 (en) | Image capture apparatus with indicator | |
KR101686833B1 (en) | System Providing Conference Image Among Several User | |
JP7400364B2 (en) | Speech recognition system and information processing method | |
Duffner et al. | The TA2 database–a multi-modal database from home entertainment |