CN114339113A - Video call method, related device, equipment and storage medium

Video call method, related device, equipment and storage medium

Info

Publication number
CN114339113A
Authority
CN
China
Prior art keywords
orientation
positioning
video call
speaker
current moment
Prior art date
Legal status
Pending
Application number
CN202111456189.1A
Other languages
Chinese (zh)
Inventor
Zhang Ziyang (张子洋)
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202111456189.1A
Publication of CN114339113A

Landscapes

  • Circuit For Audible Band Transducer (AREA)

Abstract

The application discloses a video call method, a related apparatus, a video call device and a storage medium. The video call method includes: performing first positioning on audio data of the video call device at the current moment to obtain a first orientation of a speaker at the current moment, and performing second positioning on an image to be detected of the video call device at the current moment to obtain a second orientation of the speaker at the current moment; at least one of the first positioning and the second positioning refers, during execution, to positioning results at a plurality of historical moments before the current moment, the positioning results including the first orientation and the second orientation of the speaker at those historical moments; combining the first orientation and the second orientation to obtain a final orientation of the speaker at the current moment; and conducting the video call through the video call device based on the final orientation. With this scheme, the positioning accuracy of the speaker during a video call can be improved.

Description

Video call method, related device, equipment and storage medium
Technical Field
The present application relates to the field of video communication technologies, and in particular, to a video call method, and a related apparatus, device, and storage medium.
Background
With the continuous development of electronic information technology, video calls have become widely used in many scenarios such as daily life, business and office work. For example, in daily life, people may chat with strangers through video calls; in business and office settings, remote connections can be established through video calls.
During a video call, localizing the speaker is one of the key factors affecting call quality. For example, noise suppression can be performed based on accurate localization to improve the speaker's voice quality, thereby greatly improving the video call experience. In view of this, how to improve speaker localization accuracy is an urgent problem to be solved.
Disclosure of Invention
The technical problem mainly solved by the application is to provide a video call method, a related device, equipment and a storage medium, which can improve the positioning accuracy of speakers.
In order to solve the above technical problem, a first aspect of the present application provides a video call method, including: performing first positioning on audio data of the video call device at the current moment to obtain a first orientation of a speaker at the current moment, and performing second positioning on an image to be detected of the video call device at the current moment to obtain a second orientation of the speaker at the current moment, where at least one of the first positioning and the second positioning refers, during execution, to positioning results at several historical moments before the current moment, the positioning results including the first orientation and the second orientation of the speaker at those historical moments; combining the first orientation and the second orientation to obtain a final orientation of the speaker at the current moment; and conducting a video call through the video call device based on the final orientation.
In order to solve the above technical problem, a second aspect of the present application provides a video call apparatus, including: a first positioning module configured to perform first positioning on audio data of the video call device at the current moment to obtain a first orientation of a speaker at the current moment; a second positioning module configured to perform second positioning on an image to be detected of the video call device at the current moment to obtain a second orientation of the speaker at the current moment, where at least one of the first positioning and the second positioning refers, during execution, to positioning results at several historical moments before the current moment, the positioning results including the first orientation and the second orientation of the speaker at those historical moments; a combining module configured to combine the first orientation and the second orientation to obtain a final orientation of the speaker at the current moment; and a call module configured to conduct a video call through the video call device based on the final orientation.
In order to solve the above technical problem, a third aspect of the present application provides a video call device, which includes a screen, a microphone, a camera, a communication circuit, a memory, and a processor, where the screen, the microphone, the camera, the communication circuit, and the memory are respectively coupled to the processor, the memory stores program instructions, and the processor is configured to execute the program instructions to implement the video call method in the first aspect.
In order to solve the above technical problem, a fourth aspect of the present application provides a computer-readable storage medium storing program instructions executable by a processor, the program instructions being for implementing the video call method in the first aspect.
According to the above scheme, first positioning is performed on the audio data of the video call device at the current moment to obtain the first orientation of the speaker at the current moment, and second positioning is performed on the image to be detected of the video call device at the current moment to obtain the second orientation of the speaker at the current moment, where at least one of the first positioning and the second positioning refers, during execution, to the positioning results at several historical moments before the current moment, the positioning results including the first orientation and the second orientation of the speaker at those historical moments. On this basis, the first orientation and the second orientation are combined to obtain the final orientation of the speaker at the current moment, and the video call is conducted through the video call device based on the final orientation. On the one hand, during the video call, the final orientation is obtained by combining the first orientation and the second orientation obtained by sound source localization and image localization respectively, which is conducive to improving positioning accuracy compared with a single positioning mode; on the other hand, during the video call, at least one of sound source localization and image localization refers to the positioning results at several previous historical moments, so stable positioning can be maintained over the temporally continuous video call. Therefore, the positioning accuracy of the speaker can be improved.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating an embodiment of a video call method of the present application;
FIG. 2 is a schematic diagram of one embodiment of a three-dimensional coordinate system;
FIG. 3 is a process diagram of an embodiment of a video call method of the present application;
FIG. 4 is a schematic flow chart diagram illustrating a video call method according to another embodiment of the present application;
FIG. 5 is a block diagram of an embodiment of a video call apparatus of the present application;
FIG. 6 is a block diagram of an embodiment of a video call device of the present application;
FIG. 7 is a block diagram of an embodiment of a computer-readable storage medium of the present application.
Detailed Description
The following describes in detail the embodiments of the present application with reference to the drawings attached hereto.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, interfaces, techniques, etc. in order to provide a thorough understanding of the present application.
The terms "system" and "network" are often used interchangeably herein. The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship. Further, the term "plurality" herein means two or more than two.
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating a video call method according to an embodiment of the present application.
Specifically, the method may include the steps of:
step S11: the method comprises the steps of carrying out first positioning on audio data of the video call equipment at the current moment to obtain a first position of a speaker at the current moment, and carrying out second positioning on an image to be detected of the video call equipment at the current moment to obtain a second position of the speaker at the current moment.
In the embodiments of the present disclosure, at least one of the first positioning and the second positioning refers, during execution, to the positioning results at several historical moments before the current moment, and the positioning results include the first orientation and the second orientation of the speaker at those historical moments. It should be noted that, during a video call, the speaker may move; for example, in a business office scenario, a speaker may habitually walk around while speaking, and examples are not enumerated here. Therefore, the steps in the embodiments disclosed in the present application may be executed at the capture moment of each video frame, taking the capture moment of each video frame in turn as the current moment, to locate the speaker. For convenience of description, the current moment may be denoted as t, and the several historical moments before the current moment may include, without limitation: moment t-1, moment t-2, moment t-3, ..., moment t-T. Further, the positioning result at each historical moment may itself be obtained through the steps in the embodiments disclosed in this application. For example, at moment t-i, the first orientation and second orientation at moment t-i can be obtained, with reference to the positioning result at moment t-i-1, through the steps in the embodiments disclosed in the present application; at moment t-i+1, the first orientation and second orientation at moment t-i+1 can be obtained with reference to the positioning result at moment t-i; and so on, without enumerating further examples here.
In an implementation scenario, the video call device in the embodiments of the present disclosure may integrate a screen, a microphone and a camera, so as to collect audio data through the microphone, capture image data through the camera, and display image data through the screen; details can be found in the following embodiments of the video call device and are not repeated here. It should be noted that, in order to express the speaker orientation, a three-dimensional coordinate system may be constructed based on the video call device, and the angle between a coordinate axis of the three-dimensional coordinate system and the line connecting the speaker and the origin of the three-dimensional coordinate system may be regarded as the speaker orientation. Illustratively, please refer to FIG. 2, which is a schematic diagram of an embodiment of a three-dimensional coordinate system. As shown in FIG. 2, the origin of the three-dimensional coordinate system is the center of the lower front edge of the video call device, and the video call device further integrates a dual-microphone array near that center. The first orientation can be represented as the angle between the line connecting the sound source and the microphones and the array line (i.e., the dotted line indicated by the arrow of the dual-microphone array in FIG. 2). Since in a real scene the distance between the sound source and either microphone is far greater than the distance between the microphones, the model can be regarded as a far-field model, and the incident angle θ of the sound source (i.e., the aforementioned angle) satisfies

$$\theta = \arccos\left(\frac{c\,\tau}{d}\right)$$

where τ denotes the time delay between the sound reaching the two microphones, d denotes the distance between the two microphones, and c denotes the speed of sound; details can be found in the related art on far-field models and are not repeated here. Further, the second orientation can be represented as the angle between the x-axis and the line connecting, in the image, the speaker and the origin. In the scenario shown in FIG. 2, the closer the speaker appears to the edge of the image, the smaller the first and second orientations are, indicating that the speaker is farther from the middle of the video call device; the closer the speaker appears to the center of the image, the larger the first and second orientations are, indicating that the speaker is closer to the middle of the video call device. That is, provided that both the first and second orientations are accurately located, no matter how the speaker moves, the deviation between the first orientation and the second orientation should theoretically stay within a specific range, whose exact value can be adjusted according to the constructed three-dimensional coordinate system and is not detailed here.
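To make the far-field relation above concrete, the following is a minimal sketch in Python (the patent itself provides no code); the arccos form follows directly from τ = d·cosθ/c, and the 343 m/s speed of sound and the clamping are illustrative additions, not values from the patent.

```python
# Minimal sketch of the far-field model: recover the incident angle theta
# from the delay tau between the two microphones, assuming tau = d*cos(theta)/c.
import math

def incident_angle_deg(tau: float, d: float, c: float = 343.0) -> float:
    ratio = max(-1.0, min(1.0, c * tau / d))  # clamp against numerical noise
    return math.degrees(math.acos(ratio))

# Example: microphones 6 cm apart, sound arrives 0.1 ms earlier at one mic.
print(incident_angle_deg(tau=1e-4, d=0.06))  # ~55.1 degrees
```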
In an implementation scenario, the first positioning may refer, during execution, to the positioning results at several historical moments before the current moment. For each historical moment, the fusion weight corresponding to that moment may be obtained based on the orientation deviation between the first orientation and the second orientation at that moment and the human voice detection result of the call audio collected at that moment, and the cross-correlation results of the call audio at the historical moments may then be fused using the fusion weights of those moments to obtain the weighted cross-correlation result at the current moment. In this way, the weighted cross-correlation result at the current moment is obtained by weighting and fusing the cross-correlation results of the call audio at the historical moments, so that the first orientation at the current moment is obtained based on the weighted cross-correlation result; this greatly alleviates the instability of instantaneous cross-correlation during a temporally continuous video call and is conducive to improving the accuracy of sound source localization.
In a specific implementation scenario, for each historical moment, a first weight coefficient may be obtained based on the orientation deviation between the first orientation and the second orientation at that moment, and a second weight coefficient may be obtained based on the human voice detection result of the call audio collected at that moment, where the orientation deviation is negatively correlated with the first weight coefficient; that is, the larger the orientation deviation, the smaller the first weight coefficient, and the smaller the orientation deviation, the larger the first weight coefficient. On this basis, the fusion weight corresponding to the historical moment may be obtained from the first weight coefficient and the second weight coefficient. In this way, the fusion weight is set from both the orientation deviation and the human voice detection, so that call audio unfavorable to sound source localization is filtered out, which is conducive to improving sound source localization precision.
For example, for the first weight coefficient: when the first orientation is within a first deviation range of the second orientation, the first weight coefficient may be directly set to a first value, and when the first orientation is outside the first deviation range of the second orientation, the first weight coefficient may be directly set to a second value, where the first value is larger than the second value. For example, the first deviation range may be set to 5 degrees, 10 degrees, 15 degrees, etc., without limitation; the first value may be set to 1 and the second value to 0, also without limitation. Taking a first deviation range of 5 degrees as an example, the first weight coefficient may be set to 1 when the first orientation is within 5 degrees of the second orientation, and to 0 when it is outside that range. Other cases can be deduced by analogy and are not enumerated here. In this way, during sound source localization, more reference is given to the call audio of those historical moments at which the deviation between image localization and sound localization is small, which is conducive to improving sound source localization precision.
For example, for the second weight coefficient: when the human voice detection result indicates that human voice is detected in the call audio, the second weight coefficient may be set to a third value, and when the human voice detection result indicates that no human voice is detected in the call audio, the second weight coefficient may be set to a fourth value, where the third value is greater than the fourth value. For example, the third value may be set to 1 and the fourth value to 0: when human voice is detected in the call audio, the second weight coefficient is set directly to 1, and when no human voice is detected, it is set directly to 0. Other cases can be deduced by analogy and are not enumerated here. In this way, during sound source localization, more reference is given to the call audio that contains human voice among the call audio corresponding to the historical moments, which is conducive to improving sound source localization precision.
Illustratively, after obtaining the first weight coefficients and second weight coefficients respectively corresponding to the historical moments, for each historical moment the corresponding first weight coefficient and second weight coefficient may be multiplied to obtain the fusion weight of that moment. For convenience of description, denoting the first weight coefficient of the i-th historical moment as $w_{t-i}^{(1)}$ and the second weight coefficient as $w_{t-i}^{(2)}$, the fusion weight corresponding to that historical moment can be expressed as $w_{t-i} = w_{t-i}^{(1)} \cdot w_{t-i}^{(2)}$.
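As an illustration of the weighting rule just derived, the sketch below combines the two coefficients into a fusion weight. The 5-degree deviation range and the binary 0/1 values are the example figures quoted in the preceding paragraphs; the function name and signature are ours.

```python
# Sketch of the fusion weight for one historical moment: the first coefficient
# gates on the orientation deviation, the second on human voice detection, and
# the fusion weight is their product (here with the example values 0 and 1).
def fusion_weight(first_orient_deg: float, second_orient_deg: float,
                  voice_detected: bool, dev_range_deg: float = 5.0) -> float:
    w1 = 1.0 if abs(first_orient_deg - second_orient_deg) <= dev_range_deg else 0.0
    w2 = 1.0 if voice_detected else 0.0
    return w1 * w2

print(fusion_weight(42.0, 45.0, True))   # 1.0: small deviation, voice present
print(fusion_weight(42.0, 60.0, True))   # 0.0: deviation outside the range
```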
In a specific implementation scenario, taking a video call device that includes 2 microphones as an example, the call audio at each historical moment may include the sub-audio collected by each of the two microphones, which for convenience of description can be written as $x_1(t) = a_1 s(t-\tau_1) + n_1(t)$ and $x_2(t) = a_2 s(t-\tau_2) + n_2(t)$, where $n_1$ and $n_2$ both denote white noise and are uncorrelated with each other, $s$ denotes the sound source, $\tau_1$ and $\tau_2$ denote the travel times from the sound source to the two microphones respectively, and $s$ is uncorrelated with both $n_1$ and $n_2$. At historical moment $t$, the cross-correlation of the call audio can be expressed as:

$$R_{x_1 x_2}(\tau) = E\big(x_1(t)\, x_2(t-\tau)\big) = a_1 a_2\, E\big(s(t-\tau_1)\, s(t-\tau_2-\tau)\big) = a_1 a_2\, R_{ss}\big(\tau - (\tau_1 - \tau_2)\big) \quad \ldots (1)$$

As shown in formula (1), the cross-correlation is maximal when $\tau = \tau_1 - \tau_2$; that is, the cross-correlation peak corresponds to the delay difference of the sound source reaching the two microphones. On this basis, the cross-correlation results may be weighted by the fusion weights of the respective historical moments and fused to obtain the weighted cross-correlation result; taking the cross-correlation peak then yields the delay difference at the current moment, and from this delay difference, combined with the delay-difference formula of the far-field model, the sound source incident angle θ is calculated and taken as the first orientation.
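For illustration, the weighted cross-correlation step might be sketched as follows, assuming equal-length frames from the two microphones and using NumPy's generic correlation in place of whatever estimator an implementation would actually choose.

```python
# Sketch: fuse per-moment cross-correlations with the fusion weights, then read
# the delay difference off the peak of the weighted result, as in formula (1).
import numpy as np

def weighted_delay(history, fs: float) -> float:
    """history: list of (x1_frame, x2_frame, fusion_weight), equal-length frames."""
    fused = None
    for x1, x2, w in history:
        r = w * np.correlate(x1, x2, mode="full")  # weighted cross-correlation
        fused = r if fused is None else fused + r
    n = len(history[0][0])
    lags = np.arange(-(n - 1), n)                  # lag axis in samples
    return lags[int(np.argmax(fused))] / fs        # peak lag -> tau in seconds
```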
In one implementation scenario, the second positioning refers to earlier positioning results during execution, and the second positioning process mainly adopts TLD (Tracking-Learning-Detection). TLD is a long-term target tracking algorithm composed of a detector (Detection), a tracker (Tracking) and a learner (Learning). The tracker tracks motion across consecutive frames: it estimates the target's position in the current frame from its known position in the previous frame, thereby generating an activity trajectory of the target, from which positive samples can be provided to the learner. The detector estimates the tracker's error: it scans each frame comprehensively, finds locations similar to the target, and generates positive and negative samples that are handed to the learner; the algorithm then selects the most reliable location among all positive samples as the output and uses this output to update the tracker's initial position. In addition, the learner iteratively trains the classifier contained in the detector with the positive and negative samples generated by the tracker and the detector, to improve detection accuracy; the specific principles can be found in the technical details of TLD and are not repeated here. Unlike conventional TLD, the embodiments disclosed in this application further combine the sound source localization result to assist the second positioning performed by TLD. Specifically, a number of candidate positive samples can be extracted based on the activity trajectory determined by the tracker of the TLD, the candidate positive samples are screened using the first orientation of the speaker at the historical moments to obtain target positive samples, and the detector of the TLD is trained to convergence with the target positive samples; on this basis, the TLD is used to perform the second positioning on the image to be detected, obtaining the second orientation at the current moment.
In a specific implementation scenario, the specific process of determining the active trajectory by the tracker of the TLD may refer to the details of the TLD, which are not described herein again.
In a specific implementation scenario, please refer to fig. 3, which is a schematic process diagram of an embodiment of a video call method of the present application. As shown in fig. 3, a classification criterion may be provided to the learner through sound source localization. Still taking the current moment t as an example, for a historical moment t-i (where i = 1, 2, 3, etc.), the candidate positive samples extracted at moment t-i may be further classified using the first orientation at moment t-i: if the image orientation of a candidate positive sample in the image to be detected captured at moment t-i is the same as or close to the first orientation at that moment, the candidate positive sample may be selected as a target positive sample; otherwise, it may be treated directly as a target negative sample. That is, the sound source direction can be defined as positive and non-sound-source directions as negative; on this basis, the detector can be trained to convergence so that it learns updated, more accurate image features of the speaker, improving subsequent detection precision.
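The screening rule of this paragraph can be sketched as follows; the angular tolerance deciding "same or close" and the data layout are assumptions for illustration.

```python
# Sketch: screen the tracker's candidate positive samples with the sound-source
# (first) orientation of the same historical moment -- samples in the sound
# source direction become target positives, the rest target negatives.
def screen_candidates(candidates, first_orients, tol_deg: float = 5.0):
    """candidates: list of (moment, image_orientation_deg, patch);
    first_orients: dict moment -> first orientation in degrees."""
    positives, negatives = [], []
    for moment, img_orient, patch in candidates:
        if abs(img_orient - first_orients[moment]) <= tol_deg:
            positives.append(patch)   # sound-source direction: positive sample
        else:
            negatives.append(patch)   # non-sound-source direction: negative
    return positives, negatives
```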
In a specific implementation scenario, before training the detector of the TLD, in response to the orientation deviation between the first orientation and the second orientation at the historical moment immediately before the current moment satisfying a first condition, the learner of the TLD may assign a learning weight (which may be set greater than 1, such as 1.5, 2 or 2.5) to each target positive sample; on this basis, the detector is trained to convergence based on the target positive samples assigned the learning weights. Specifically, the first condition may be set to include that the orientation deviation between the first orientation and the second orientation at the previous historical moment is below a preset threshold (e.g., 5 degrees, 10 degrees, etc.); this can be regarded as positive feedback from sound source localization, and having the learner assign the learning weight accordingly can increase the confidence of the target positive samples.
In a specific implementation scenario, as shown in fig. 3, after training converges, the detector may be used to detect the image to be detected at the current moment. In this process, the first orientation at the current moment may be used to delimit a detection region in the image to be detected, so that in the subsequent TLD-based positioning of the image to be detected, the detector performs target detection only within that region, improving detection efficiency. Specifically, during the second positioning of the image to be detected at the current moment, the tracker of the TLD may provide a positive sample and the detector of the TLD may detect several positive samples in the detection region; finally, the positive sample most similar to the speaker may be selected, and the second orientation at the current moment determined based on it. It should be noted that sample-related terms in the embodiments of the present disclosure, such as "positive sample", "negative sample" and "candidate positive sample", all refer to sub-images cropped from an image captured by the camera; see the related technical details of TLD, which are not repeated here.
In a specific implementation scenario, as described above, after the second position of the speaker at the current time is obtained, the tracker of the TLD may be updated accordingly, which may specifically refer to relevant technical details of the TLD, and is not described herein again.
Step S12: and combining the first orientation and the second orientation to obtain the final orientation of the speaker at the current moment.
In one implementation scenario, the first orientation and the second orientation may be directly weighted to obtain the final orientation of the speaker at the current moment. The weights of the first and second orientations may be set according to actual circumstances: for example, where sound source localization is more accurate than image localization, the weight of the first orientation may be set higher than that of the second orientation; conversely, where sound source localization is less accurate than image localization, the weight of the first orientation may be set lower than that of the second orientation, without limitation here.
In an implementation scenario, reliability detection may be performed on the first orientation to obtain a first detection result, and on the second orientation to obtain a second detection result, and the final orientation may then be obtained based on the first detection result and the second detection result. In this way, reliability detection is performed on the first orientation and the second orientation respectively, and the final orientation is derived from the reliability detection results, which is conducive to improving the reliability of the final orientation.
In a particular implementation scenario, it may be determined that the first detection result includes that the first orientation is reliable in a case where the first orientation is within a second deviation range of the second orientation, and that the first detection result includes that the first orientation is unreliable in a case where the first orientation is beyond the second deviation range of the second orientation. Illustratively, the second deviation range may be set to 5 degrees, 15 degrees, 20 degrees, etc., without limitation. Taking the second deviation range as 15 degrees as an example, if the first orientation is within the deviation range of 15 degrees of the second orientation, the first orientation may be considered reliable, and conversely, the first orientation may be considered unreliable. In the above manner, whether the first orientation is reliable or not is determined by comparing the degree of deviation of the first orientation from the second orientation, so that the reliability detection is performed from the viewpoint of using the second orientation as a comparison reference.
In a specific implementation scenario, as mentioned above, the second orientation may be detected by TLD, and it may be checked whether the following two conditions hold simultaneously: the similarity between the face image of the speaker and the sample image satisfies a second condition, and the orientation deviation between the second orientation and the first orientation satisfies a third condition, where the sample image is a positive sample used for training the detector of the TLD; for the specific meaning, refer to the foregoing description, which is not repeated here. On this basis, if both hold simultaneously, it may be determined that the second detection result includes that the second orientation is reliable; otherwise, that the second orientation is unreliable. For example, the face image of the speaker may be the face image recognized from the image captured when the speaker is first identified during the video call, for which reference may be made to the following disclosed embodiments, not repeated here. Further, the second condition may be set as the similarity being higher than a preset similarity (e.g., 70%, 75%, etc.), and the third condition as the orientation deviation being lower than a preset angle (e.g., 30 degrees, 35 degrees, etc.). In this way, by comparing the degree to which the second orientation deviates from the first orientation and the similarity between the face image and the sample image, the reliability of the second orientation can be determined across multiple dimensions.
In a specific implementation scenario, when the first detection result includes that the first orientation is reliable and the second detection result includes that the second orientation is reliable, the first orientation and the second orientation may be fused to obtain the final orientation, e.g., the final orientation may be the average of the first orientation and the second orientation, or a weighted combination of the two, without limitation here. In addition, if either orientation is unreliable, the other may be taken as the final orientation; if both are unreliable, the positioning at the current moment may be considered to have failed. In this way, when both the first orientation and the second orientation are detected as reliable, the final orientation is obtained by fusing them, which is conducive to improving the accuracy of the final orientation.
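Pulling the reliability checks of the preceding paragraphs together, the combination step might be sketched as follows; the 15-degree, 70% and 30-degree thresholds are the example values quoted above, and simple averaging stands in for whichever fusion rule an implementation chooses.

```python
# Sketch of step S12 with reliability detection: fuse when both orientations
# pass their checks, fall back to the reliable one otherwise, and report a
# localization failure when neither is reliable.
from typing import Optional

def combine_orientations(first: float, second: float,
                         face_similarity: float) -> Optional[float]:
    first_reliable = abs(first - second) <= 15.0   # second deviation range
    second_reliable = face_similarity >= 0.70 and abs(second - first) <= 30.0
    if first_reliable and second_reliable:
        return 0.5 * (first + second)   # fuse, here by simple averaging
    if first_reliable:
        return first
    if second_reliable:
        return second
    return None                         # both unreliable: positioning fails
```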
Step S13: and carrying out video call through the video call equipment based on the final orientation.
In one implementation scenario, beamforming may be performed based on the final orientation to suppress noise around the speaker, so that the speech quality in the direction of the final orientation can be improved. The specific process of beamforming can be found in the technical details of beamforming and is not repeated here.
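The patent names beamforming without fixing a method; as one hedged illustration, a two-microphone delay-and-sum beamformer steered toward the final orientation could look like the sketch below, where the array spacing and sample rate are assumed values.

```python
# Sketch: delay-and-sum beamforming toward the final orientation theta so that
# speech from that direction adds coherently while off-axis noise does not.
import numpy as np

def delay_and_sum(x1: np.ndarray, x2: np.ndarray, theta_deg: float,
                  d: float = 0.06, fs: float = 16000.0, c: float = 343.0) -> np.ndarray:
    tau = d * np.cos(np.radians(theta_deg)) / c   # steering delay between mics
    shift = int(round(tau * fs))                  # delay in whole samples
    x2_aligned = np.roll(x2, shift)               # align mic 2 onto mic 1
    return 0.5 * (x1 + x2_aligned)                # coherent sum favors theta
```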
In one implementation scenario, the video call device may be controlled to turn toward the speaker when the final orientation satisfies a fourth condition. Illustratively, the fourth condition may be set as: the final orientation is about to move beyond the video shooting range. By controlling the video call device to turn toward the speaker, the speaker can be re-centered, reducing the risk of the speaker leaving the frame during the video call. It should be noted that the video call device may further include a rotation mechanism (e.g., a pan-tilt unit), through which the video call device can be turned toward the speaker.
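The fourth condition might be sketched as follows; the camera field of view and safety margin are assumed figures, not values from the patent.

```python
# Sketch: pan the device when the speaker is about to leave the frame.
def update_heading(final_orient_deg: float, cam_heading_deg: float,
                   fov_deg: float = 90.0, margin_deg: float = 10.0) -> float:
    offset = final_orient_deg - cam_heading_deg
    if abs(offset) > fov_deg / 2.0 - margin_deg:  # fourth condition: near the edge
        cam_heading_deg += offset                 # turn so the speaker is centered
    return cam_heading_deg
```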
In an implementation scenario where multiple venues are connected by video and at least one venue has multiple speakers, while one of the speakers is talking, that speaker can be continuously tracked and the video call maintained through the steps in the embodiments of the present disclosure; that is, the steps in the embodiments of the present disclosure can be executed at each moment while the speaker talks, so as to realize the video call.
In one implementation scenario, as described above, when at least one venue has multiple speakers during the video connection, after one speaker finishes speaking, another speaker can continue the video connection through the video call device without manually operating it. Specifically, when another speaker needs to speak, they only need to say a preset wake-up word (e.g., "magic fly"). At this point, in response to detecting a target audio containing the preset wake-up word, sound source localization is performed based on the target audio to obtain the target orientation of the speaker; then, based on the target orientation, the video call device is controlled to turn toward the speaker, and face recognition is performed on a target image captured by the video call device to obtain a face image of the speaker, which serves as the tracking target for performing the second positioning. In this way, no manual operation is required: target tracking and the video call can be triggered by the preset wake-up word alone, improving convenience of operation.
In a specific implementation scenario, the specific process of sound source localization may refer to the foregoing technical details of audio cross-correlation, which are not described herein again.
In a specific implementation scenario, face recognition may be implemented by a neural network; see the technical details of related networks such as FaceNet and DeepID, which are not repeated here.
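The wake-up-word hand-over described above can be summarized in the following sketch; every name here is a placeholder for the components discussed in this section (cross-correlation-based localization, the rotation mechanism, a FaceNet-style recognizer), not an actual API.

```python
# Sketch of the wake-word flow (expanded later as steps S41-S43): localize the
# new speaker by sound, turn the device, then register the recognized face as
# the tracking target for the second positioning.
def on_wake_word(target_audio, device, localize, recognize_face):
    target_orient = localize(target_audio)       # sound source localization
    device.turn_to(target_orient)                # rotation mechanism pans camera
    face = recognize_face(device.capture())      # face recognition on target image
    device.set_tracking_target(face)             # used by TLD's second positioning
```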
According to the above scheme, first positioning is performed on the audio data of the video call device at the current moment to obtain the first orientation of the speaker at the current moment, and second positioning is performed on the image to be detected of the video call device at the current moment to obtain the second orientation of the speaker at the current moment, where at least one of the first positioning and the second positioning refers, during execution, to the positioning results at several historical moments before the current moment, the positioning results including the first orientation and the second orientation of the speaker at those historical moments. On this basis, the first orientation and the second orientation are combined to obtain the final orientation of the speaker at the current moment, and the video call is conducted through the video call device based on the final orientation. On the one hand, during the video call, the final orientation is obtained by combining the first orientation and the second orientation obtained by sound source localization and image localization respectively, which is conducive to improving positioning accuracy compared with a single positioning mode; on the other hand, during the video call, at least one of sound source localization and image localization refers to the positioning results at several previous historical moments, so stable positioning can be maintained over the temporally continuous video call. Therefore, the positioning accuracy of the speaker can be improved.
Referring to fig. 4, fig. 4 is a schematic flowchart illustrating a video call method according to another embodiment of the present application.
Specifically, the method may include the steps of:
step S41: and responding to the detected target audio frequency containing the preset awakening words, and positioning a sound source based on the target audio frequency to obtain the target azimuth of the speaker.
Reference may be made to the related description in the foregoing embodiments, which are not repeated herein.
Step S42: and controlling the video call device to turn to the speaker based on the target orientation.
Reference may be made to the related description in the foregoing embodiments, which are not repeated herein.
Step S43: and carrying out face recognition based on the target image shot by the video call equipment to obtain a face image of the speaker.
In the embodiment of the present disclosure, the face image may be used as a tracking target for performing the second positioning subsequently, and specific reference may be made to relevant description in the foregoing embodiment, which is not described herein again.
Step S44: the method comprises the steps of carrying out first positioning on audio data of the video call equipment at the current moment to obtain a first position of a speaker at the current moment, and carrying out second positioning on an image to be detected of the video call equipment at the current moment to obtain a second position of the speaker at the current moment.
In the embodiments of the present disclosure, at least one of the first positioning and the second positioning refers, during execution, to the positioning results at several historical moments before the current moment, and the positioning results include the first orientation and the second orientation of the speaker at those historical moments.
Step S45: and combining the first orientation and the second orientation to obtain the final orientation of the speaker at the current moment.
Reference may be made to the related description in the foregoing embodiments, which are not repeated herein.
Step S46: and carrying out video call through the video call equipment based on the final orientation.
Reference may be made to the related description in the foregoing embodiments, which are not repeated herein.
According to the above scheme, before the formal video call, the speaker makes the video call device automatically turn toward the speaker's position by saying the preset wake-up word, and the formal video call is then conducted on this basis, which can improve the convenience of the video call.
Referring to fig. 5, fig. 5 is a schematic diagram of a video call apparatus 50 according to an embodiment of the present application. The video call apparatus 50 includes a first positioning module 51, a second positioning module 52, a combining module 53 and a call module 54. The first positioning module 51 is configured to perform first positioning on audio data of the video call device at the current moment to obtain a first orientation of the speaker at the current moment; the second positioning module 52 is configured to perform second positioning on an image to be detected of the video call device at the current moment to obtain a second orientation of the speaker at the current moment, where at least one of the first positioning and the second positioning refers, during execution, to positioning results at several historical moments before the current moment, the positioning results including the first orientation and the second orientation of the speaker at those historical moments; the combining module 53 is configured to combine the first orientation and the second orientation to obtain the final orientation of the speaker at the current moment; and the call module 54 is configured to conduct a video call through the video call device based on the final orientation.
According to this scheme, on the one hand, during the video call, the final orientation is obtained by combining the first orientation and the second orientation obtained by sound source localization and image localization respectively, which is conducive to improving positioning precision compared with a single positioning mode; on the other hand, during the video call, at least one of sound source localization and image localization refers to the positioning results at several previous historical moments, which helps maintain stable positioning over the temporally continuous video call. Therefore, the positioning accuracy of the speaker can be improved.
In some disclosed embodiments, the first positioning refers to earlier positioning results during execution. The first positioning module 51 includes a fusion weight submodule configured to obtain, for each historical moment, the fusion weight corresponding to that moment based on the orientation deviation between the first orientation and the second orientation at that moment and the human voice detection result of the call audio collected at that moment; a result weighting submodule configured to fuse the cross-correlation results of the call audio at the historical moments using the fusion weights of those moments to obtain the weighted cross-correlation result at the current moment; and a first determining submodule configured to obtain the first orientation at the current moment based on the weighted cross-correlation result.
Therefore, the weighted cross-correlation result at the current moment is obtained by weighting and fusing the cross-correlation results of the call audios at the historical moments, so that the first position at the current moment is obtained based on the weighted cross-correlation result, the instability of instantaneous cross-correlation in the continuous time sequence video call process can be greatly relieved, and the accuracy of sound source positioning is favorably improved.
In some disclosed embodiments, the fusion weight submodule includes a weight coefficient obtaining unit configured to obtain a first weight coefficient based on the orientation deviation between the first orientation and the second orientation at a historical moment, and a second weight coefficient based on the human voice detection result of the call audio collected at that moment, where the orientation deviation is negatively correlated with the first weight coefficient; and a fusion weight obtaining unit configured to obtain the fusion weight corresponding to the historical moment based on the first weight coefficient and the second weight coefficient.
Therefore, the first weight coefficient is obtained based on the azimuth deviation between the first azimuth and the second azimuth, the second weight coefficient is obtained based on the voice detection result of the call audio, the azimuth deviation is in negative correlation with the first weight coefficient, and the fusion weight is obtained based on the first weight coefficient and the second weight coefficient, so that the fusion weight can be set from the two aspects of azimuth deviation and voice detection, the call audio which is not beneficial to sound source positioning is filtered, and the sound source positioning precision is favorably improved.
In some disclosed embodiments, the first weight coefficient is a first value if the first orientation is within a first deviation range of the second orientation, the first weight coefficient is a second value if the first orientation is outside the first deviation range of the second orientation, and the first value is greater than the second value.
Therefore, under the condition that the first azimuth is in the first deviation range of the second azimuth, the first weight coefficient is a first numerical value, under the condition that the first azimuth exceeds the first deviation range of the second azimuth, the first weight coefficient is a second numerical value, and the first numerical value is larger than the second numerical value, so that the call audio at the historical moment with smaller deviation between image positioning and sound positioning at a plurality of reference historical moments can be obtained in the sound source positioning process, and the improvement of the sound source positioning precision is facilitated.
In some disclosed embodiments, the second weight coefficient is a third value when the human voice detection result includes that human voice is detected in the call audio, and a fourth value when the human voice detection result includes that no human voice is detected in the call audio, where the third value is greater than the fourth value.
Therefore, under the condition that the voice is detected in the voice detection including the call audio, the second weight coefficient is a third numerical value, under the condition that the voice is not detected in the voice detection result including the call audio, the second weight coefficient is a fourth numerical value, and the third numerical value is greater than the fourth numerical value, so that the call audio containing the voice in the call audio respectively corresponding to a plurality of historical moments can be referred more in the sound source positioning process, and the improvement of the sound source positioning precision is facilitated.
In some disclosed embodiments, the second positioning refers to the positioning results during execution; the second positioning module 52 includes a sample extraction sub-module for extracting a number of candidate positive samples based on the activity trajectory determined by the tracker of the TLD; the second positioning module 52 includes a sample screening sub-module for screening a plurality of candidate positive samples by using the first orientation of the speaker at the historical time to obtain a target positive sample; the second positioning module 52 includes a model training sub-module for training the detector of the TLD to converge with the target positive sample; the second positioning module 52 includes a second determining sub-module, configured to perform second positioning on the image to be measured by using TLD, so as to obtain a second orientation at the current time.
Therefore, a candidate positive sample is extracted through the activity track extracted by the tracker, a target positive sample is further screened out from the first direction of the speaker at the historical moment, the target positive sample is used for training the detector to be converged, on the basis, the TLD is used for carrying out second positioning on the image to be detected, and the second direction at the current moment is obtained.
In some disclosed embodiments, the second location module 52 includes a learning weight sub-module for assigning a learning weight to each target positive sample in response to an orientation offset between the first orientation and the second orientation at a historical time prior to the current time satisfying a first condition; the model training submodule is specifically configured to train the detector to converge based on the target positive samples assigned the learning weights.
Therefore, after positive feedback of sound source localization is obtained, the learning weight is given thereto by the learner, and the confidence level of the target positive sample can be increased.
In some disclosed embodiments, the combining module 53 includes a reliability detection sub-module, configured to perform reliability detection on the first orientation to obtain a first detection result of the first orientation, and perform reliability detection on the second orientation to obtain a second detection result of the second orientation; the combining module 53 comprises a final orientation sub-module for deriving a final orientation based on the first detection result and the second detection result.
Therefore, the reliability detection is carried out on the first direction and the second direction respectively, and then the final direction is obtained based on the reliability detection result, so that the reliability of the final direction is improved.
In some disclosed embodiments, the reliability detection sub-module may include a first detection unit, specifically configured to determine that the first detection result includes that the first orientation is reliable if the first orientation is within a second deviation range of the second orientation, and that the first detection result includes that the first orientation is unreliable if the first orientation is outside the second deviation range of the second orientation.
Therefore, whether the first orientation is reliable or not is determined by comparing the degree to which the first orientation is deviated from the second orientation, and reliability detection is performed from the viewpoint of taking the second orientation as a comparison reference.
In some disclosed embodiments, the reliability detection sub-module may include a second detection unit, specifically configured to detect whether the following two conditions hold simultaneously: the similarity between the face image of the speaker and the sample image satisfies a second condition, and the orientation deviation between the second orientation and the first orientation satisfies a third condition, where the sample image is a positive sample used for training the detector of the TLD; if so, it is determined that the second detection result includes that the second orientation is reliable; if not, that the second orientation is unreliable.
Therefore, the reliability of the second orientation can be determined in multiple dimensions by comparing the degree of deviation of the second orientation from the first orientation and the similarity between the face image and the sample image.
In some disclosed embodiments, the final orientation submodule is specifically configured to fuse the first orientation and the second orientation to obtain the final orientation in response to the first detection result including that the first orientation is reliable and the second detection result including that the second orientation is reliable.
Therefore, under the condition that both the first direction and the second direction are reliably detected, the final direction is obtained by fusing the first direction and the second direction, and the accuracy of the final direction is improved.
In some disclosed embodiments, the video call apparatus 50 further includes a sound source localization module configured to, in response to detecting a target audio containing a preset wake-up word, perform sound source localization based on the target audio to obtain the target orientation of the speaker; a rotation module configured to control the video call device to turn toward the speaker based on the target orientation; and a face recognition module configured to perform face recognition based on a target image captured by the video call device to obtain a face image of the speaker, where the face image serves as the tracking target for performing the second positioning.
Therefore, the speaker makes the video call device automatically turn toward the speaker's position by saying the preset wake-up word, and the formal video call is then conducted on this basis, which can improve the convenience of the video call.
Referring to fig. 6, fig. 6 is a schematic diagram of a video call device 60 according to an embodiment of the present application. The video call device 60 includes a screen 61, a microphone 62, a camera 63, a communication circuit 64, a memory 65 and a processor 66, where the screen 61, the microphone 62, the camera 63, the communication circuit 64 and the memory 65 are respectively coupled to the processor 66; the memory 65 stores program instructions, and the processor 66 is configured to execute the program instructions to implement the steps in any of the above video call method embodiments. Specifically, the video call device 60 may include, but is not limited to: a smart speaker with a screen, a monitoring camera, etc., without limitation here.
Specifically, the processor 66 is configured to control itself, as well as the screen 61, the microphone 62, the camera 63, the communication circuit 64, and the memory 65, to implement the steps in any of the video call method embodiments described above. The processor 66 may also be referred to as a CPU (Central Processing Unit) and may be an integrated circuit chip with signal processing capabilities. The processor 66 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor, or any conventional processor. In addition, the processor 66 may be implemented jointly by multiple integrated circuit chips.
In one implementation scenario, as shown in fig. 6, the video call device 60 further includes a rotating mechanism 67 that carries at least the camera 63 and is configured to rotate the camera 63 so that the speaker can be photographed as the speaker moves about. In addition, the rotating mechanism 67 may also carry and rotate at least one of the screen 61 and the microphone 62. It should be noted that the rotating mechanism 67 may include, but is not limited to, a pan-tilt head, which is not limited herein.
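A minimal sketch of driving such a rotating mechanism toward a target azimuth is shown below; the per-tick step limit is an assumed parameter, since the patent does not specify a control law.

```python
def step_toward(current_deg: float, target_deg: float,
                max_step_deg: float = 3.0) -> float:
    """Rotate at most max_step_deg per control tick toward the target,
    taking the short way around the circle."""
    diff = (target_deg - current_deg + 180.0) % 360.0 - 180.0
    step = max(-max_step_deg, min(max_step_deg, diff))
    return (current_deg + step) % 360.0
```

Calling this once per control tick turns the camera smoothly toward the speaker without overshooting.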
According to the above scheme, on the one hand, during the video call, the final orientation is obtained by combining the first orientation and the second orientation, which are obtained by sound source localization and image localization respectively; compared with a single localization mode, this helps improve localization precision. On the other hand, at least one of the sound source localization and the image localization refers, during its execution, to the localization results of several preceding historical moments, which helps maintain stable localization over the temporally continuous video call. The speaker positioning accuracy can therefore be improved.
Referring to fig. 7, fig. 7 is a block diagram of an embodiment of a computer-readable storage medium 70 according to the present application. The computer-readable storage medium 70 stores program instructions 71 executable by a processor, and the program instructions 71 are used to implement the steps in any of the video call method embodiments described above.
According to the above scheme, on the one hand, during the video call, the final orientation is obtained by combining the first orientation and the second orientation, which are obtained by sound source localization and image localization respectively; compared with a single localization mode, this helps improve localization precision. On the other hand, at least one of the sound source localization and the image localization refers, during its execution, to the localization results of several preceding historical moments, which helps maintain stable localization over the temporally continuous video call. The speaker positioning accuracy can therefore be improved.
In some embodiments, the functions possessed by, or the modules included in, the apparatus provided in the embodiments of the present disclosure may be used to execute the methods described in the foregoing method embodiments; for specific implementations, reference may be made to the descriptions of those method embodiments, which are not repeated here for brevity.
The foregoing descriptions of the various embodiments emphasize the differences between them; for parts that are the same or similar, the embodiments may be cross-referenced, and details are not repeated here for brevity.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a module or a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

Claims (17)

1. A video call method, comprising:
performing first positioning on audio data of a video call device at the current moment to obtain a first orientation of a speaker at the current moment, and performing second positioning on an image to be detected of the video call device at the current moment to obtain a second orientation of the speaker at the current moment; wherein at least one of the first positioning and the second positioning refers, during its execution, to positioning results of a plurality of historical moments before the current moment, and the positioning results comprise the first orientation and the second orientation of the speaker at the historical moments;
combining the first orientation and the second orientation to obtain a final orientation of the speaker at the current moment;
and conducting a video call through the video call device based on the final orientation.
2. The method of claim 1, wherein the first positioning refers to the positioning results during its execution, and the performing first positioning on the audio data of the video call device at the current moment to obtain the first orientation of the speaker at the current moment comprises:
for each historical moment, obtaining a fusion weight corresponding to the historical moment based on the orientation deviation between the first orientation and the second orientation at the historical moment and the human voice detection result of the call audio collected at the historical moment;
fusing the cross-correlation results of the call audio at the historical moments by using the fusion weight of each historical moment to obtain a weighted cross-correlation result at the current moment;
and obtaining the first orientation at the current moment based on the weighted cross-correlation result.
3. The method of claim 2, wherein the obtaining a fusion weight corresponding to the historical moment based on the orientation deviation between the first orientation and the second orientation at the historical moment and the human voice detection result of the call audio collected at the historical moment comprises:
obtaining a first weight coefficient based on the orientation deviation between the first orientation and the second orientation at the historical moment, and obtaining a second weight coefficient based on the human voice detection result of the call audio collected at the historical moment, wherein the orientation deviation is inversely related to the first weight coefficient;
and obtaining the fusion weight corresponding to the historical moment based on the first weight coefficient and the second weight coefficient.
4. The method of claim 3, wherein the first weight coefficient is a first value if the first orientation is within a first deviation range of the second orientation, the first weight coefficient is a second value if the first orientation is beyond the first deviation range of the second orientation, and the first value is greater than the second value.
5. The method of claim 3, wherein the second weight coefficient is a third value in a case where the human voice detection result includes that a human voice is detected in the call audio, the second weight coefficient is a fourth value in a case where the human voice detection result includes that no human voice is detected in the call audio, and the third value is greater than the fourth value.
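By way of illustration only (this is not part of the claims), the sketch below shows one plausible realization of the fusion weighting of claims 2 to 5: the two weight coefficients are combined by a product, the thresholds and values are assumed, and the per-moment cross-correlation vectors (e.g., from GCC-PHAT) are taken as given.

```python
import numpy as np

def fusion_weight(orientation_dev_deg: float, voice_detected: bool,
                  first_range_deg: float = 15.0) -> float:
    # First coefficient: large when the first orientation stays within the
    # first deviation range of the second orientation (claims 3-4).
    w1 = 1.0 if orientation_dev_deg <= first_range_deg else 0.3
    # Second coefficient: large when a human voice was detected (claim 5).
    w2 = 1.0 if voice_detected else 0.1
    return w1 * w2  # the product is an assumed way to combine the coefficients

def weighted_cross_correlation(history) -> np.ndarray:
    """history: iterable of (cc_vector, orientation_dev_deg, voice_detected)
    tuples, one per historical moment; returns the fusion-weighted sum of
    the per-moment cross-correlation results (claim 2)."""
    acc = None
    for cc, dev, voiced in history:
        cc = np.asarray(cc, dtype=float)
        acc = np.zeros_like(cc) if acc is None else acc
        acc += fusion_weight(dev, voiced) * cc
    return acc

# The first orientation at the current moment is then the azimuth whose
# steering index maximizes this weighted cross-correlation result.
```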
6. The method of claim 1, wherein the second positioning refers to the positioning results during its execution, and the performing second positioning on the image to be detected of the video call device at the current moment to obtain the second orientation of the speaker at the current moment comprises:
extracting a plurality of candidate positive samples based on the activity trajectory determined by the tracker of the TLD;
screening the candidate positive samples by using the first orientation of the speaker at the historical moment to obtain a target positive sample;
training a detector of the TLD with the target positive sample until the detector converges;
and performing the second positioning on the image to be detected by using the TLD to obtain the second orientation at the current moment.
7. The method of claim 6, wherein before the training a detector of the TLD with the target positive sample until the detector converges, the method further comprises:
in response to an orientation deviation between the first orientation and the second orientation at the historical moment before the current moment satisfying a first condition, assigning a learning weight to each of the target positive samples based on a learner of the TLD;
the training a detector of the TLD with the target positive sample until the detector converges comprises:
training the detector with the target positive samples assigned the learning weights until the detector converges.
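For illustration only (not part of the claims), the screening step of claim 6 might look like the sketch below: candidate positive samples whose image-derived azimuth disagrees with the sound-source orientation at the corresponding historical moment are discarded. The patch-to-azimuth mapping and tolerance are assumptions.

```python
def screen_candidates(candidates, first_az_deg: float,
                      tolerance_deg: float = 25.0) -> list:
    """candidates: list of (patch, azimuth_deg) pairs proposed along the
    activity trajectory by the TLD tracker; keep only patches whose
    azimuth agrees with the first (sound-source) orientation."""
    kept = []
    for patch, az_deg in candidates:
        diff = (az_deg - first_az_deg + 180.0) % 360.0 - 180.0
        if abs(diff) <= tolerance_deg:
            kept.append(patch)   # a target positive sample for training
    return kept
```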
8. The method of claim 1, wherein the combining the first orientation and the second orientation to obtain a final orientation of the speaker at the current moment comprises:
performing reliability detection on the first orientation to obtain a first detection result of the first orientation, and performing reliability detection on the second orientation to obtain a second detection result of the second orientation;
and obtaining the final position based on the first detection result and the second detection result.
9. The method of claim 8, wherein the performing the reliability detection on the first orientation to obtain a first detection result of the first orientation comprises:
determining that the first detection result includes that the first orientation is reliable if the first orientation is within a second deviation range of the second orientation;
determining that the first detection result includes that the first orientation is unreliable if the first orientation is beyond the second deviation range of the second orientation.
10. The method of claim 8, wherein the second orientation is obtained based on TLD detection, and the performing reliability detection on the second orientation to obtain a second detection result of the second orientation comprises:
detecting whether the following two conditions are satisfied simultaneously: the similarity between the face image of the speaker and the sample image meets a second condition, and the orientation deviation between the second orientation and the first orientation meets a third condition; wherein the sample image is a positive sample used to train a detector of the TLD;
if so, determining that the second detection result includes that the second orientation is reliable;
if not, determining that the second detection result includes that the second orientation is unreliable.
11. The method of claim 8, wherein the deriving the final orientation based on the first detection result and the second detection result comprises:
and fusing the first orientation and the second orientation to obtain the final orientation in response to the first detection result including that the first orientation is reliable and the second detection result including that the second orientation is reliable.
12. The method of claim 1, wherein before the performing first positioning on the audio data of the video call device at the current moment to obtain the first orientation of the speaker at the current moment, or after the conducting a video call through the video call device based on the final orientation, the method further comprises:
in response to detecting target audio containing a preset wake-up word, performing sound source localization based on the target audio to obtain a target orientation of the speaker;
controlling the video call device to turn to the speaker based on the target orientation;
performing face recognition based on a target image captured by the video call device to obtain a face image of the speaker;
wherein the face image is a tracking target for performing the second positioning.
13. The method of claim 1, wherein the conducting a video call through the video call device based on the final orientation comprises at least one of:
performing beamforming based on the final orientation to suppress noise around the speaker;
and in response to the final orientation satisfying a fourth condition, controlling the video call device to turn toward the speaker.
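For illustration only (not part of the claims), a delay-and-sum beamformer is one standard way to realize the beamforming of claim 13; the linear-array geometry, sample rate, and speed of sound below are assumptions.

```python
import numpy as np

def delay_and_sum(frames: np.ndarray, mic_x_m: np.ndarray,
                  azimuth_deg: float, fs: int = 16000,
                  c: float = 343.0) -> np.ndarray:
    """frames: (num_mics, num_samples) signals from a linear microphone
    array with element positions mic_x_m (meters). Steering toward the
    final orientation makes the speaker's speech add coherently while
    off-axis noise is attenuated."""
    delays_s = mic_x_m * np.cos(np.radians(azimuth_deg)) / c
    n = frames.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spectra = np.fft.rfft(frames, axis=1)
    # Compensate each channel's propagation delay in the frequency domain.
    spectra *= np.exp(2j * np.pi * freqs[None, :] * delays_s[:, None])
    return np.fft.irfft(spectra, n=n, axis=1).mean(axis=0)
```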
14. A video call apparatus, comprising:
the first positioning module, configured to perform first positioning on audio data of a video call device at the current moment to obtain a first orientation of a speaker at the current moment;
the second positioning module, configured to perform second positioning on an image to be detected of the video call device at the current moment to obtain a second orientation of the speaker at the current moment; wherein at least one of the first positioning and the second positioning refers, during its execution, to positioning results of a plurality of historical moments before the current moment, and the positioning results comprise the first orientation and the second orientation of the speaker at the historical moments;
the combining module, configured to combine the first orientation and the second orientation to obtain a final orientation of the speaker at the current moment;
and the call module, configured to conduct a video call through the video call device based on the final orientation.
15. A video call device comprising a screen, a microphone, a camera, communication circuitry, a memory, and a processor, the screen, the microphone, the camera, the communication circuitry, and the memory being respectively coupled to the processor, the memory having stored therein program instructions for execution by the processor to implement the video call method of any of claims 1-13.
16. The device of claim 15, further comprising a rotating mechanism, the rotating mechanism carrying at least the camera and being configured to rotate the camera to photograph the speaker as the speaker moves.
17. A computer-readable storage medium, in which program instructions executable by a processor are stored, the program instructions being for implementing the video call method of any one of claims 1 to 13.
CN202111456189.1A 2021-12-01 2021-12-01 Video call method, related device, equipment and storage medium Pending CN114339113A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111456189.1A CN114339113A (en) 2021-12-01 2021-12-01 Video call method, related device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114339113A true CN114339113A (en) 2022-04-12

Family

ID=81048585

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111456189.1A Pending CN114339113A (en) 2021-12-01 2021-12-01 Video call method, related device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114339113A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7035418B1 (en) * 1999-06-11 2006-04-25 Japan Science And Technology Agency Method and apparatus for determining sound source
CN107820037A (en) * 2016-09-14 2018-03-20 南京中兴新软件有限责任公司 The methods, devices and systems of audio signal, image procossing
CN110716180A (en) * 2019-10-17 2020-01-21 北京华捷艾米科技有限公司 Audio positioning method and device based on face detection
CN112363112A (en) * 2020-10-13 2021-02-12 厦门亿联网络技术股份有限公司 Sound source positioning method and device based on linear microphone array
CN112396887A (en) * 2020-10-26 2021-02-23 华中科技大学 PBL classroom recording method and system based on sound source positioning and face detection
CN112634911A (en) * 2020-12-21 2021-04-09 苏州思必驰信息科技有限公司 Man-machine conversation method, electronic device and computer readable storage medium

Similar Documents

Publication Publication Date Title
Yoshioka et al. Advances in online audio-visual meeting transcription
CN106328156B (en) Audio and video information fusion microphone array voice enhancement system and method
JP7233035B2 (en) SOUND COLLECTION DEVICE, SOUND COLLECTION METHOD, AND PROGRAM
CN111370014B (en) System and method for multi-stream target-voice detection and channel fusion
CN102843540B (en) Automatic camera for video conference is selected
JP6464449B2 (en) Sound source separation apparatus and sound source separation method
US9633270B1 (en) Using speaker clustering to switch between different camera views in a video conference system
KR102230667B1 (en) Method and apparatus for speaker diarisation based on audio-visual data
WO2015172630A1 (en) Camera shooting device and focusing method therefor
US10582117B1 (en) Automatic camera control in a video conference system
CN112088315A (en) Multi-mode speech positioning
CN111551921A (en) Sound source orientation system and method based on sound image linkage
Berghi et al. Visually supervised speaker detection and localization via microphone array
US11460927B2 (en) Auto-framing through speech and video localizations
CN114513622A (en) Speaker detection method, speaker detection apparatus, storage medium, and program product
Izhar et al. Tracking sound sources for object-based spatial audio in 3D audio-visual production
CN114339113A (en) Video call method, related device, equipment and storage medium
US12033654B2 (en) Sound pickup device and sound pickup method
WO2023164814A1 (en) Media apparatus and control method and device therefor, and target tracking method and device
US11875800B2 (en) Talker prediction method, talker prediction device, and communication system
Li et al. Multiple active speaker localization based on audio-visual fusion in two stages
CN110730378A (en) Information processing method and system
CN112073639A (en) Shooting control method and device, computer readable medium and electronic equipment
Martinson et al. Learning speaker recognition models through human-robot interaction
US20230105785A1 (en) Video content providing method and video content providing device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination