TWM591655U

TWM591655U - Spokesperson audio and video tracking system

Info

Publication number: TWM591655U
Application number: TW108212189U
Authority: TW
Inventors: 薛樂山
Original assignee: 大陸商南京深視光點科技有限公司
Priority date: 2019-09-12
Filing date: 2019-09-12
Publication date: 2020-03-01

Abstract

A speaker audio and image tracking system is installed in an open space such as a conference room and comprises a control host, a panoramic image capture device, and a microphone array device. A database of the control host is preloaded with a plurality of pieces of facial motion feature information. During a meeting, the panoramic image capture device uses this facial motion feature information to identify the person who is currently speaking and determines that speaker's three-dimensional spatial position information. The three-dimensional spatial position information then drives the microphone array device to pick up the speaker's voice precisely and reject noise, and a close-up of the speaker's face is projected onto a display screen in the conference room, so that the other attendees immediately know who is speaking and can hear the speaker clearly.

Description

Speaker audio and image tracking system

A speaker audio and image tracking system, in particular one that can clearly identify the voice and image of the speaker in a meeting.

A conventional video conference system may use three or more cameras to film the participants while using a microphone array to locate the speaker, and then enlarges the located speaker in the video conference image. However, this conventional approach relies on sound localization alone: it determines the position of the sound source, assumes that this position is the speaker's position, and enlarges the image at that position. Ambient noise therefore degrades its accuracy, and it cannot determine the speaker's position precisely. In addition, a conventional single-microphone system has the following drawbacks: (1) pickup is directional, so sound is captured poorly when the person talking is not facing the microphone; (2) in a meeting, the microphone must be handed to the next speaker whenever the speaker changes; (3) when used with smart home appliances, pickup efficiency is very low.

A conventional microphone array conference system uses an omnidirectional microphone array and thus effectively improves pickup quality for all speakers in the environment, but it cannot tell whether a sound source is signal or noise, and therefore handles background noise sources poorly.

In view of the above problems, the creator, drawing on many years of experience in the conference video equipment industry, has studied and analyzed the localization of a speaker's audio source and image in video conferencing. The main purpose of the present utility model is therefore to provide a speaker audio and image tracking system that can clearly identify the voice and image of the speaker in a meeting.

To achieve this purpose, the speaker audio and image tracking system of the present utility model mainly comprises a control host, a panoramic image capture device, and a microphone array device. A database of the control host is preloaded with a plurality of pieces of facial motion feature information. During a meeting, the panoramic image capture device uses this facial motion feature information to identify the person who is currently speaking and determines that speaker's three-dimensional spatial position information. The three-dimensional spatial position information then drives the microphone array device to pick up sound precisely and reject noise, and a close-up of the speaker's face is projected onto a display screen in the conference room, so that the other attendees can clearly see the current speaker and follow what is being said.

To give the examiners a clear understanding of the purpose, technical features, and effects of the present utility model, it is described below with reference to the drawings.

10: Speaker audio and image tracking system

101: Control host

1011: Central processing module

1012: Database

1013: Information receiving and transmitting module

1014: Projection module

1015: Labeling unit

1016: Image embedding unit

102: Panoramic image capture device

1021: Image analysis module

1022: Face recognition unit

103: Microphone array device

1031: Sound source filtering module

11: Display screen

12: Conference room

13: Open space

A: Speaker

B: Identification information

C: Video picture

F: Facial motion feature information

F1: Facial image information

F2: Three-dimensional spatial position information

N: Ambient audio

N1: Human voice source information

N2: Ambient noise information

Figure 1 is a schematic diagram of the system composition of the present utility model.

Figure 2 is implementation diagram (1) of the present utility model.

Figure 3 is implementation diagram (2) of the present utility model.

Figure 4 is implementation diagram (3) of the present utility model.

Figure 5 shows another embodiment (1) of the present utility model.

Figure 6 is an implementation diagram of embodiment (1).

Figure 7 shows another embodiment (2) of the present utility model.

Figure 8 is an implementation diagram of embodiment (2).

Please refer to Figure 1, a schematic diagram of the system composition of the present utility model. The speaker audio and image tracking system 10 mainly comprises a control host 101, a panoramic image capture device 102, and a microphone array device 103.

The control host 101 may be, for example, a physical server or a cloud host. It has a central processing module 1011, which drives the other modules and is communicatively linked to a database 1012, an information receiving and transmitting module 1013, and a projection module 1014. The database 1012 stores in advance a plurality of pieces of facial motion feature information F, for example the motion of the facial muscles when the mouth opens to speak. The information receiving and transmitting module 1013 receives or transmits electronic information, and the projection module 1014 projects image information onto a display screen 11 (not shown in Figure 1).

The panoramic image capture device 102 is installed in an open space such as a conference room. It has an image analysis module 1021, which contains a face recognition unit 1022. The device may be, for example, a panoramic camera or a depth camera (also called a stereo camera). It captures image information in different directions and can stitch these images into a panoramic image whose field of view covers the whole conference environment. Using the facial motion feature information F in the database 1012, the face recognition unit 1022 identifies a speaker who is speaking in the open space, and captures and analyzes that speaker's facial image information F1 and three-dimensional spatial position information F2 (for example, three-dimensional coordinates). The facial image information F1 is mainly a close-up image of the speaker's face. The facial motion recognition can be performed by image comparison using machine learning or deep learning: for example, face recognition training based on a convolutional neural network (CNN), more specifically a Faster R-CNN (Faster Region-based Convolutional Neural Network), with iterative training by stochastic gradient descent (SGD). The three-dimensional spatial position information F2 gives the speaker's position in the open space and thus locates the speaker. To make on-site image capture easier, a rotating base (for example, a universal rotating base, not shown) may be added to the bottom of the panoramic image capture device 102 so that it can capture a 360-degree view.

The microphone array device 103 has a sound source filtering module 1031 and can likewise be installed in an open space such as a conference room. It may be an array microphone with a plurality of microphone pickup units that capture ambient audio N from several different directions. The ambient audio N consists mainly of human voice source information N1 and ambient noise information N2. Filter parameters can be preset in the sound source filtering module 1031 so that the ambient noise information N2 is filtered out and only the human voice source information N1 remains. The panoramic image capture device 102 and the microphone array device 103 may also be built into the control host 101, so that they capture the panoramic image and the audio signal synchronously.
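The description leaves the facial motion recognition to a trained network and does not give a concrete decision rule. As a purely illustrative stand-in for the face recognition unit 1022, the sketch below decides whether a face is speaking from how much the mouth opening varies over a short window of frames. The landmark inputs, the function names `mouth_open_ratio` and `is_speaking`, and the threshold `min_variation` are assumptions made for this example and do not come from the source.

```python
from statistics import pstdev


def mouth_open_ratio(upper_lip_y, lower_lip_y, left_corner_x, right_corner_x):
    """Vertical lip gap divided by mouth width, so the value is scale-invariant."""
    width = abs(right_corner_x - left_corner_x)
    if width == 0:
        raise ValueError("mouth width must be non-zero")
    return abs(lower_lip_y - upper_lip_y) / width


def is_speaking(ratios, min_variation=0.05):
    """Hypothetical rule: a face is speaking when its mouth opening
    varies by at least min_variation (population std. dev.) over the window."""
    if len(ratios) < 2:
        return False
    return pstdev(ratios) >= min_variation


# Mouth-opening ratios over five consecutive frames (made-up values).
silent = [0.10, 0.11, 0.10, 0.10, 0.11]
talking = [0.10, 0.35, 0.15, 0.40, 0.12]
print(is_speaking(silent), is_speaking(talking))
```

A learned classifier such as the CNN mentioned in the description would replace the fixed threshold; the sketch only shows the kind of per-face, per-window decision that the downstream modules rely on.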

Please refer to Figure 2, implementation diagram (1), together with Figure 1. In use, the panoramic image capture device 102 and the microphone array device 103 are first set up at a suitable location, for example in an open space 13 of a conference room 12. Under normal conditions the panoramic image capture device 102 monitors the facial expressions of everyone in the conference room 12. When someone speaks, for example the speaker A shown in the figure, the device compares speaker A's facial expression against the facial motion feature information F in the database 1012 to determine whether that person is speaking. If so, it captures and analyzes the speaker's facial image information F1 and three-dimensional spatial position information F2 and transmits them to the database 1012 of the control host 101 for storage.

Please refer next to Figure 3, implementation diagram (2). Continuing from Figure 2, the central processing module 1011 sends the three-dimensional spatial position information F2 to the microphone array device 103 through the information receiving and transmitting module 1013. Based on this information, the microphone array device 103 masks or turns off the microphone pickup units facing other directions and enables only the units facing the speaker's position, so that it focuses on the ambient audio N from that direction. The sound source filtering module 1031 then extracts the human voice source information N1 from the ambient audio N and sends it to the control host 101.

Please refer next to Figure 4, implementation diagram (3). The control host 101 projects speaker A's facial image information F1 through the projection module 1014 onto the display screen 11 of the conference room 12, so that attendees can see the current speaker's face on the display screen 11, and sends the human voice source information N1 through the information receiving and transmitting module 1013 to external audio equipment such as loudspeakers. In this way the voice and image of speaker A can be clearly identified during the meeting, so that the other attendees clearly know who is speaking and what is being said.
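The step in which the microphone array device 103 enables only the pickup units facing the speaker can be sketched as follows. This is a minimal illustration under stated assumptions: each pickup unit is described by a fixed horizontal facing angle, the array position is known in the same coordinate system as F2, and a unit is enabled when it points within a hypothetical `half_width` of the speaker's direction. None of these details, nor the function names, are specified in the source.

```python
import math


def azimuth_deg(array_pos, speaker_pos):
    """Horizontal bearing from the array to the speaker, in [0, 360) degrees."""
    dx = speaker_pos[0] - array_pos[0]
    dy = speaker_pos[1] - array_pos[1]
    return math.degrees(math.atan2(dy, dx)) % 360.0


def angular_gap(a, b):
    """Smallest absolute difference between two bearings, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def select_units(unit_azimuths, array_pos, speaker_pos, half_width=30.0):
    """Return one flag per pickup unit: True = enabled, False = masked."""
    target = azimuth_deg(array_pos, speaker_pos)
    return [angular_gap(az, target) <= half_width for az in unit_azimuths]


# Six units spaced 60 degrees apart; speaker at (1, 1, 1.2), array at (0, 0, 1).
units = [0, 60, 120, 180, 240, 300]
mask = select_units(units, (0.0, 0.0, 1.0), (1.0, 1.0, 1.2))
print(mask)
```

Only the horizontal bearing is used here; a real array could also weight units by elevation or apply beamforming delays instead of a hard on/off mask.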

Please refer to Figure 5, which shows another embodiment (1). The database 1012 may additionally store in advance a plurality of pieces of identification information B, which may be identity information such as facial feature information and names. The projection module 1014 has a labeling unit 1015, which can label persons in the image with the identification information B. Please refer next to Figure 6, the implementation diagram of embodiment (1), together with Figure 1. When the panoramic image capture device 102 captures speaker A's facial image information F1 for projection, it can also compare F1 against the identification information B stored in the database 1012 to obtain the identification information B that matches speaker A. Once the comparison is complete, the control host 101 uses the labeling unit 1015 of the projection module 1014 to label the image of speaker A's head on the display screen 11 with that identification information, so that attendees know who speaker A is.
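The comparison of the facial image information F1 against the stored identification information B is not detailed in the source. One common way to do such a lookup, shown here only as an assumption, is to represent each face as a numeric feature vector and pick the stored record with the highest cosine similarity, rejecting matches below a threshold. The vectors, names, and the `min_similarity` value below are invented for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify(face_vec, records, min_similarity=0.9):
    """Return the best-matching name, or None if nothing is similar enough."""
    best_name, best_score = None, min_similarity
    for name, reference in records.items():
        score = cosine_similarity(face_vec, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical stored identification records (name -> facial feature vector).
records = {"Alice": [0.9, 0.1, 0.0], "Bob": [0.1, 0.9, 0.2]}
print(identify([0.88, 0.15, 0.02], records))  # close to Alice's record
print(identify([0.5, 0.5, 0.7], records))     # no confident match
```

Returning `None` for a weak match lets the labeling unit 1015 leave an unknown face unlabeled rather than attach a wrong name.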

Please refer to Figure 7, which shows another embodiment (2). The system can also be linked with video conferencing equipment. In the control host 101 shown in this figure, the projection module 1014 has an image embedding unit 1016. Please refer next to Figure 8, the implementation diagram of embodiment (2). The image embedding unit 1016 embeds speaker A's facial image information F1 into a video picture C as a sub-picture, so that remote participants in the video conference can clearly see who is speaking in the video picture C.
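Embedding the close-up as a sub-picture amounts to a picture-in-picture paste. The sketch below illustrates the idea with frames represented as plain 2D lists of pixel values; the representation, the function name `embed_subpicture`, and the placement arguments are assumptions for this example, not the image embedding unit 1016 itself.

```python
def embed_subpicture(frame, inset, top, left):
    """Return a copy of frame with inset pasted so its top-left corner
    lands at row `top`, column `left`. Both images are rectangular 2D lists."""
    rows, cols = len(frame), len(frame[0])
    in_rows, in_cols = len(inset), len(inset[0])
    if top < 0 or left < 0 or top + in_rows > rows or left + in_cols > cols:
        raise ValueError("inset does not fit inside the frame")
    out = [row[:] for row in frame]  # copy so the original frame is untouched
    for r in range(in_rows):
        out[top + r][left:left + in_cols] = inset[r]
    return out


# A 3x4 video picture of zeros with a 2x2 close-up pasted in the lower right.
video_picture = [[0] * 4 for _ in range(3)]
close_up = [[1, 1], [1, 1]]
combined = embed_subpicture(video_picture, close_up, top=1, left=2)
for row in combined:
    print(row)
```

In practice the close-up would first be scaled to the sub-picture size; that resizing step is omitted here.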

As described above, the speaker audio and image tracking system of the present utility model mainly comprises a control host, a panoramic image capture device, and a microphone array device. A database of the control host is preloaded with a plurality of pieces of facial motion feature information. During a meeting, the panoramic image capture device uses this facial motion feature information to identify the person who is currently speaking and determines that speaker's three-dimensional spatial position information. The three-dimensional spatial position information then drives the microphone array device to pick up sound precisely and reject noise, and a close-up of the speaker's face is projected onto a display screen in the conference room. The present utility model thereby achieves its purpose of letting the other attendees clearly see the current speaker and follow what is being said.
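The overall data flow summarized above (find the speaking face, pass its position to the microphone array, keep the filtered voice, and hand the close-up to the display) can be sketched as one tracking step. Everything named here (`Face`, `TrackingResult`, `track_once`, and the stub capture and filter functions) is hypothetical scaffolding for illustration; the amplitude gate in particular is a crude stand-in for the sound source filtering module, whose actual filtering method the source does not describe.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Position = Tuple[float, float, float]


@dataclass
class Face:
    close_up: str        # stand-in for facial image information F1
    position: Position   # three-dimensional spatial position information F2
    speaking: bool       # outcome of facial motion recognition


@dataclass
class TrackingResult:
    close_up: str
    position: Position
    voice: List[float]   # stand-in for human voice source information N1


def track_once(
    faces: Sequence[Face],
    capture_audio: Callable[[Position], List[float]],
    filter_voice: Callable[[List[float]], List[float]],
) -> Optional[TrackingResult]:
    """Run one tracking step; return None when nobody is speaking."""
    speaker = next((f for f in faces if f.speaking), None)
    if speaker is None:
        return None
    ambient = capture_audio(speaker.position)  # array focused on the speaker
    return TrackingResult(speaker.close_up, speaker.position, filter_voice(ambient))


# Stubs: fixed audio samples and a simple amplitude gate as the "filter".
faces = [Face("face-1.png", (2.0, 0.5, 1.1), False),
         Face("face-2.png", (1.0, 1.0, 1.2), True)]
result = track_once(
    faces,
    capture_audio=lambda pos: [0.01, 0.5, -0.4, 0.02],
    filter_voice=lambda samples: [s for s in samples if abs(s) >= 0.1],
)
print(result)
```

The returned close-up would go to the projection module and the voice samples to the external audio equipment.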

The above, however, describes only preferred embodiments of the present utility model and is not intended to limit its scope. Any equivalent changes and modifications made by a person skilled in the art without departing from the spirit and scope of the present utility model shall fall within its patent scope.

In summary, the present utility model satisfies the patentability requirements of industrial applicability, novelty, and inventive step. The applicant therefore files this utility model patent application with the Office in accordance with the Patent Act.

Claims

1. A speaker audio and image tracking system, installable in an open space, comprising: a control host having a central processing module, the central processing module being communicatively linked to a database, an information receiving and transmitting module, and a projection module, wherein the database stores in advance a plurality of pieces of facial motion feature information; a panoramic image capture device communicatively linked to the control host, the panoramic image capture device being able to identify, according to the pieces of facial motion feature information in the database, a speaker who is speaking in the open space, and to capture and analyze facial image information and three-dimensional spatial position information of the speaker, the facial image information and the three-dimensional spatial position information each being transmittable through the information receiving and transmitting module to the database for storage; and a microphone array device communicatively linked to the control host and able to receive the three-dimensional spatial position information, so that the microphone array device can capture ambient audio according to the three-dimensional spatial position information and filter and analyze human voice source information out of the ambient audio; wherein the human voice source information can be further transmitted to the control host and sent out through the information receiving and transmitting module, and the facial image information can be projected onto a display screen through the projection module.

2. The speaker audio and image tracking system of claim 1, wherein the panoramic image capture device is provided with a rotating base.

3. The speaker audio and image tracking system of claim 1, wherein the panoramic image capture device is provided with an image analysis module.

4. The speaker audio and image tracking system of claim 3, wherein the image analysis module is provided with a face recognition unit.

5. The speaker audio and image tracking system of claim 1, wherein the microphone array device is provided with a sound source filtering module.

6. The speaker audio and image tracking system of claim 1, wherein the database stores in advance a plurality of pieces of identification information.

7. The speaker audio and image tracking system of claim 6, wherein the pieces of identification information are facial feature information.

8. The speaker audio and image tracking system of claim 6, wherein the projection module has a labeling unit.

9. The speaker audio and image tracking system of claim 1, wherein the information receiving and transmitting module can transmit the facial image information into a video picture.

10. The speaker audio and image tracking system of claim 1, wherein the ambient audio comprises the human voice source information and ambient noise information.

11. The speaker audio and image tracking system of claim 1, wherein the facial image information is close-up image information of the speaker's face.

12. The speaker audio and image tracking system of claim 1, wherein the panoramic image capture device and the microphone array device are provided within the control host.

13. The speaker audio and image tracking system of claim 1, wherein the projection module has an image embedding unit.