CN113327286B - 360-degree omnibearing speaker vision space positioning method - Google Patents

360-degree omnibearing speaker vision space positioning method Download PDF

Info

Publication number
CN113327286B
CN113327286B CN202110504362.4A CN202110504362A CN113327286B CN 113327286 B CN113327286 B CN 113327286B CN 202110504362 A CN202110504362 A CN 202110504362A CN 113327286 B CN113327286 B CN 113327286B
Authority
CN
China
Prior art keywords
face
image
speaker
camera
space positioning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110504362.4A
Other languages
Chinese (zh)
Other versions
CN113327286A (en
Inventor
刘振焘
龙映佐
吴敏
熊永华
周莉
金浩然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Geosciences
Original Assignee
China University of Geosciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Geosciences filed Critical China University of Geosciences
Priority to CN202110504362.4A priority Critical patent/CN113327286B/en
Publication of CN113327286A publication Critical patent/CN113327286A/en
Application granted granted Critical
Publication of CN113327286B publication Critical patent/CN113327286B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4038Image mosaicing, e.g. composing plane images from plane sub-images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • G06V40/166Detection; Localisation; Normalisation using acquisition arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2200/00Indexing scheme for image data processing or generation, in general
    • G06T2200/32Indexing scheme for image data processing or generation, in general involving image mosaicing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • G06T2207/30201Face

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Image Analysis (AREA)
  • Studio Devices (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a 360-degree omnibearing speaker vision space positioning method, which comprises the following steps: starting 360-degree panoramic camera groups distributed in an annular rule, carrying out face detection, judging that the target user has interaction intention, and if not, continuing to track the face and the lips; judging whether to splice images of faces of target users with interaction intention, wherein the spliced images are used for visual space positioning; otherwise, directly selecting a corresponding image; positioning the face image according to the face image picture; according to the image positioning result and the positions of the 360-degree panoramic camera groups which are distributed in an annular regular manner and correspond to the camera groups, coordinate system conversion is carried out, and the speaker can be positioned accurately and in real time in 360-degree all directions.

Description

360-degree omnibearing speaker vision space positioning method
Technical Field
The invention relates to the technical field of speaker positioning, in particular to a 360-degree omnibearing speaker vision space positioning method.
Background
With the rapid development of the Internet, mobile intelligent terminals and intelligent robots, the interaction between people and machines is more and more frequent, and people-oriented, natural and efficient main targets for developing a new generation of man-machine interaction modes. In practical man-machine interaction systems, the object positioning function is the first important problem to be solved in the interaction system. After the target user position is obtained, the machine can perform subsequent operations such as directional voice recognition, emotion recognition, directional service providing for the user and the like, and the interactive system can pick up more accurate target information in the expected direction, so that accurate service and feedback are provided.
Existing targeted speaker positioning methods often rely on depth cameras or binocular cameras, as well as other sensors, which are often limited by the limited positioning azimuth of the positioning device, and cannot position speakers in other locations. The disclosed improvement focuses on using microphones or other sensors for auxiliary positioning, and then using a rotating platform or the like to drive a camera for visual space positioning. However, these methods have a certain positioning delay, and if the target speaker shifts, the positioning efficiency and the positioning accuracy have uncertainty.
Disclosure of Invention
In view of the above, the invention provides a 360-degree omnibearing speaker vision space positioning method, which comprises the following steps:
s1, starting a 360-degree panoramic camera group distributed in an annular rule, carrying out face detection, carrying out face and lip tracking after detecting the face, judging that the target user has interaction intention when the target person speaks towards the camera, otherwise, continuing to carry out face and lip tracking;
s2, image stitching decision: judging whether to splice the images of the faces of the target users with the interaction intention in the S1, wherein the spliced images are used for positioning a visual space; otherwise, directly selecting a corresponding image;
s3, carrying out face visual space positioning according to the face-containing image obtained by image stitching decision;
s4, converting the result of the visual space positioning into a world coordinate system to complete the omnibearing visual space positioning.
Further, the image stitching decision in S2 is to determine whether the speaker is located in the image capturing frame junction area of the two cameras with the nearest orientation of the speaker, wake up the two adjacent cameras with the orientation of the speaker when the speaker is located in the image capturing frame junction area, and stitch the images of the two cameras.
Further, the visual space positioning method described in S3 is as follows:
s31: using a face detection algorithm, calling a face detection classifier, capturing a target face and drawing the target face by using a rectangular frame;
s32: recording the position coordinates (x 1 ,y 1 ),(x 1 ,y 2 ), (x 2 ,y 1 ),(x 2 ,y 2 ) The center of the coordinate system is the center point of the current shooting picture;
s33: calculating the center position of the face
Figure RE-GDA0003154533390000021
S34: calculating a face azimuth angle:
Figure RE-GDA0003154533390000031
wherein alpha is the range angle occupied by the current image pickup picture, and X is the total length of transverse pixels of the current image pickup picture;
s35: calculating a face pitch angle:
Figure RE-GDA0003154533390000032
wherein beta is the pitch angle of the camera, and Y is the total length of longitudinal pixels of the current image.
Further, the coordinate system conversion method described in S4 is as follows:
the 360-degree panoramic camera group with annular regular distribution is provided with N cameras, the cameras are numbered from 1 to N-1 in the clockwise direction, the shooting center of the camera with the direction 1 is the world coordinate system center, and when the world coordinate of the face shot by the kth camera is converted from the image coordinate, the method comprises the following steps of
Figure RE-GDA0003154533390000033
Wherein θ image Is the azimuth angle, k of the image m Refers to the camera number with smaller number obtained during image stitching, the pitch angle is kept unchanged after coordinate transformation, and theta o Is the face azimuth.
The implementation of the technical scheme of the invention has the beneficial effects that: (1) The 360-degree panoramic camera group distributed in a ring-shaped manner is adopted, so that 360-degree panoramic images can be received, time is saved when the face is positioned relative to the rotation of the rotating platform due to the fact that the 360-degree panoramic images are adopted, and quick and real-time positioning can be realized;
(2) Compared with 360-degree image stitching, the method for deciding whether to stitch the images saves time, ensures that the face is always relatively close to the central area of the image, saves stitching time and improves the accuracy of face positioning.
Drawings
FIG. 1 is a flow chart of a 360-degree omni-directional speaker localization method of the audio-visual dual mode according to the present invention;
fig. 2 is a schematic view of the cross-over area of the image capturing screen.
Detailed Description
The invention provides a 360-degree omnibearing speaker vision space positioning method, which aims to solve the problems that the existing single-mode speaker positioning method is low in reliability, and the existing multi-mode speaker positioning method is limited by a limited positioning azimuth angle and can complete positioning only by relying on a rotating platform.
Referring to fig. 1, a 360-degree omnidirectional speaker vision space positioning method includes the following steps:
s1, starting a 360-degree panoramic camera group distributed in an annular rule, carrying out face detection, carrying out face and lip tracking after detecting the face, judging that the target user has interaction intention when the target person speaks towards the camera, otherwise, continuing to carry out face and lip tracking;
s2, image stitching decision: judging whether to carry out image stitching on the face of the target user with the interaction intention in the S1, namely judging whether the speaker is positioned in the image pickup picture connecting area of the two cameras with the nearest azimuth of the speaker, referring to FIG. 2, when the speaker is positioned in the image pickup picture connecting area, waking up the two adjacent cameras with the azimuth of the speaker, and carrying out image stitching on the image pickup pictures of the two cameras; otherwise, directly waking up the azimuth camera of the speaker, and not performing image stitching; the spliced images are used for visual space positioning; otherwise, directly selecting a corresponding image;
s3, carrying out face visual space positioning according to the face-containing image obtained by image stitching decision; the method comprises the following specific steps:
s31: using a face detection algorithm, calling a face detection classifier, capturing a target face and drawing the target face by using a rectangular frame;
s32: recording the position coordinates (x 1 ,y 1 ),(x 1 ,y 2 ), (x 2 ,y 1 ),(x 2 ,y 2 ) The center of the coordinate system is the center point of the current shooting picture;
s33: calculating the center position of the face
Figure RE-GDA0003154533390000051
S34: calculating a face azimuth angle:
Figure RE-GDA0003154533390000052
wherein alpha is the range angle occupied by the current image pickup picture, and X is the total length of transverse pixels of the current image pickup picture;
s35: calculating a face pitch angle:
Figure RE-GDA0003154533390000053
wherein beta is the pitch angle of the camera, and Y is the total length of longitudinal pixels of the current image.
S4, converting the result of the visual space positioning into a world coordinate system to finish the omnibearing visual space positioning; the coordinate system conversion method is as follows:
the 360-degree panoramic camera group with annular regular distribution is provided with N cameras, the cameras are numbered from 1 to N-1 in the clockwise direction, the shooting center of the camera with the direction 1 is the world coordinate system center, and when the world coordinate of the face shot by the kth camera is converted from the image coordinate, the method comprises the following steps of
Figure RE-GDA0003154533390000054
Wherein θ image Is the azimuth angle, k of the image m Refers to the camera number with smaller number obtained during image stitching, the pitch angle is kept unchanged after coordinate transformation, and theta o Is the face azimuth.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims (1)

1. A360-degree omnibearing speaker vision space positioning method is characterized by comprising the following steps:
s1, starting a 360-degree panoramic camera group distributed in an annular rule, carrying out face detection, carrying out face and lip tracking after detecting the face, judging that the target user has interaction intention when the target person speaks towards the camera, otherwise, continuing to carry out face and lip tracking;
s2, image stitching decision: judging whether to splice the images of the faces of the target users with the interaction intention in the S1, wherein the spliced images are used for positioning a visual space; otherwise, directly selecting a corresponding image;
s2, judging whether the speaker is located in a camera picture connecting area of two cameras with the nearest azimuth of the speaker, waking up two adjacent cameras with the azimuth of the speaker when the speaker is located in the camera picture connecting area, and performing image splicing on the camera pictures of the two cameras;
s3, carrying out face visual space positioning according to the face-containing image obtained by image stitching decision;
the visual space positioning method described in S3 is as follows:
s31: using a face detection algorithm, calling a face detection classifier, capturing a target face and drawing the target face by using a rectangular frame;
s32: recording the position coordinates (x 1 ,y 1 ),(x 1 ,y 2 ),(x 2 ,y 1 ),(x 2 ,y 2 ) The center of the coordinate system is the center point of the current shooting picture;
s33: calculating the center position of the face
Figure QLYQS_1
S34: calculating a face azimuth angle:
Figure QLYQS_2
wherein alpha is the range angle occupied by the current image pickup picture, and X is the total length of transverse pixels of the current image pickup picture;
s35: calculating a face pitch angle:
Figure QLYQS_3
wherein beta is the pitch angle of the camera, Y is the total length of longitudinal pixels of the current camera picture;
s4, converting the result of the visual space positioning into a world coordinate system to finish the omnibearing visual space positioning;
the coordinate system conversion method described in S4 is as follows:
the 360-degree panoramic camera group with annular regular distribution is provided with N cameras, the cameras are numbered from 1 to N in the clockwise direction, the shooting center of the camera with the direction 1 is the center of the world coordinate system, and when the face shot by the kth camera converts the world coordinate from the image coordinate, the method comprises the following steps of
Figure QLYQS_4
Wherein θ image Is the azimuth angle, k of the image m When the finger images are splicedThe camera number with smaller number is obtained, the pitch angle is kept unchanged after coordinate transformation, and theta o Is the face azimuth.
CN202110504362.4A 2021-05-10 2021-05-10 360-degree omnibearing speaker vision space positioning method Active CN113327286B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110504362.4A CN113327286B (en) 2021-05-10 2021-05-10 360-degree omnibearing speaker vision space positioning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110504362.4A CN113327286B (en) 2021-05-10 2021-05-10 360-degree omnibearing speaker vision space positioning method

Publications (2)

Publication Number Publication Date
CN113327286A CN113327286A (en) 2021-08-31
CN113327286B true CN113327286B (en) 2023-05-19

Family

ID=77415109

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110504362.4A Active CN113327286B (en) 2021-05-10 2021-05-10 360-degree omnibearing speaker vision space positioning method

Country Status (1)

Country Link
CN (1) CN113327286B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101236599A (en) * 2007-12-29 2008-08-06 浙江工业大学 Human face recognition detection device based on multi- video camera information integration
CN110223690A (en) * 2019-06-10 2019-09-10 深圳永顺智信息科技有限公司 The man-machine interaction method and device merged based on image with voice

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007147762A (en) * 2005-11-24 2007-06-14 Fuji Xerox Co Ltd Speaker predicting device and speaker predicting method
US8229134B2 (en) * 2007-05-24 2012-07-24 University Of Maryland Audio camera using microphone arrays for real time capture of audio images and method for jointly processing the audio images with video images
CN106503615B (en) * 2016-09-20 2019-10-08 北京工业大学 Indoor human body detecting and tracking and identification system based on multisensor
CN108734733B (en) * 2018-05-17 2022-04-26 东南大学 Microphone array and binocular camera-based speaker positioning and identifying method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101236599A (en) * 2007-12-29 2008-08-06 浙江工业大学 Human face recognition detection device based on multi- video camera information integration
CN110223690A (en) * 2019-06-10 2019-09-10 深圳永顺智信息科技有限公司 The man-machine interaction method and device merged based on image with voice

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Detection of a Speaker in Video by Combined Analysis of Speech Sound and Mouth Movement;Osamu Ikeda;《ISCV 2007: Adavace in Visual Computing》;第602-610页 *
Voice Activity Detection based on Facial Movement;Bart Joosten et al.;《Journal on Multimodal User Interfaces》;第183-193页 *
基于视觉显著度的说话检测;王瑾 等;《武汉大学学报(理学版)》;第61卷;第363-367页 *

Also Published As

Publication number Publication date
CN113327286A (en) 2021-08-31

Similar Documents

Publication Publication Date Title
CN108734733B (en) Microphone array and binocular camera-based speaker positioning and identifying method
US6005610A (en) Audio-visual object localization and tracking system and method therefor
CN103718125B (en) Finding a called party
CN111432115B (en) Face tracking method based on voice auxiliary positioning, terminal and storage device
CN111641794B (en) Sound signal acquisition method and electronic equipment
CN111263106B (en) Picture tracking method and device for video conference
CN103716595A (en) Linkage control method and device for panoramic mosaic camera and dome camera
CN106934351B (en) Gesture recognition method and device and electronic equipment
CA3190886A1 (en) Merging webcam signals from multiple cameras
US20230090916A1 (en) Display apparatus and processing method for display apparatus with camera
CN104349040A (en) Camera base for video conference system, and method
JP4451892B2 (en) Video playback device, video playback method, and video playback program
WO2021066392A2 (en) Method, device, and non-transitory computer-readable recording medium for estimating information about golf swing
CN108076304A (en) A kind of built-in projection and the method for processing video frequency and conference system of camera array
KR101718081B1 (en) Super Wide Angle Camera System for recognizing hand gesture and Transport Video Interface Apparatus used in it
CN107507133B (en) Real-time image splicing method based on circular tube working robot
CN113312985B (en) Audio-visual double-mode 360-degree omnibearing speaker positioning method
JP2016066187A (en) Image processor
CN113327286B (en) 360-degree omnibearing speaker vision space positioning method
CN112839165B (en) Method and device for realizing face tracking camera shooting, computer equipment and storage medium
CN116684647B (en) Equipment control method, system and equipment in video real-time transmission scene
JP3272584B2 (en) Region extraction device and direction detection device using the same
Pingali et al. Audio-visual tracking for natural interactivity
JP4373645B2 (en) Video distribution system, program, and recording medium
US11665391B2 (en) Signal processing device and signal processing system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant