CN112578909A - Equipment interaction method and device

Info

Publication number
CN112578909A
Authority
CN
China
Prior art keywords
target object
volume
interaction
image frame
determining
Prior art date
Legal status
Granted
Application number
CN202011471453.4A
Other languages
Chinese (zh)
Other versions
CN112578909B (en)
Inventor
钟鹏飞
张宁
廖加威
任晓华
车炜春
黄晓琳
董粤强
赵慧斌
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011471453.4A
Publication of CN112578909A
Application granted
Publication of CN112578909B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 Sound input; Sound output
    • G06F3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection


Abstract

The application discloses a device interaction method and apparatus, and relates to artificial intelligence technology. The specific implementation scheme is as follows: a first device listens for sound signals and performs a first interaction according to the heard sound signal. If no sound signal is heard, the first device determines whether an object exists within a preset field of view according to the captured image frames, and if so, performs a second interaction according to the captured image frames. By combining the monitoring of sound with the capture of image frames, the user is monitored cooperatively at both the auditory and the visual level, which effectively improves the accuracy of user monitoring; and because the shooting range of the camera is limited, performing sound monitoring with the sound collection devices first further improves the efficiency and success rate of user monitoring.

Description

Equipment interaction method and device
Technical Field
The present application relates to artificial intelligence technology in computer technology, and in particular, to a method and an apparatus for device interaction.
Background
With the continuous development of computer technology, intelligent devices such as robots play more and more important roles in daily life.
At present, when an intelligent device implements interaction, a camera is usually used to monitor whether a user is present within the range of the intelligent device, and if a user is present, the intelligent device actively interacts with that user.
However, because the shooting range of the camera is limited, monitoring the user with the camera alone results in a low success rate of user monitoring.
Disclosure of Invention
The application provides a method, a device, equipment and a storage medium for equipment interaction.
According to an aspect of the present application, there is provided a device interaction method, including:
the first device listens for sound signals;
performing a first interaction according to the heard sound signal;
if no sound signal is heard, the first device determines whether an object exists within a preset field of view according to the captured image frame;
and if so, performing a second interaction according to the captured image frame.
According to another aspect of the present application, there is provided an apparatus for device interaction, including:
a listening module, configured for the first device to listen for sound signals;
a processing module, configured to perform a first interaction according to the heard sound signal;
the first device is configured to determine, if no sound signal is heard, whether an object exists within a preset field of view according to the captured image frame;
and the processing module is further configured to perform a second interaction according to the captured image frame if an object exists.
According to another aspect of the present application, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to another aspect of the present application, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method as described above.
According to another aspect of the application, there is provided a computer program product comprising: a computer program, stored in a readable storage medium, from which at least one processor of an electronic device can read the computer program, execution of the computer program by the at least one processor causing the electronic device to perform the method of the first aspect.
The technology of the present application solves the problem of a low success rate of user monitoring.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
fig. 1 is a schematic structural diagram of a robot provided in an embodiment of the present application;
fig. 2 is a flowchart of a method for device interaction provided in an embodiment of the present application;
fig. 3 is a second flowchart of a device interaction method provided in the embodiment of the present application;
fig. 4 is a schematic diagram illustrating an implementation of a listening position according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an implementation of determining a target position according to a first position according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram illustrating an implementation of determining a target position from a plurality of first positions according to an embodiment of the present application;
fig. 7 is a third flowchart of a method for device interaction provided in the embodiment of the present application;
fig. 8 is a first schematic view illustrating a preset field of view according to an embodiment of the present disclosure;
FIG. 9 is a second schematic diagram of a preset field of view according to an embodiment of the present application;
FIG. 10 is a schematic diagram illustrating an implementation of determining a target object according to an embodiment of the present application;
fig. 11 is a schematic diagram of an implementation of determining a target object among multiple objects according to an embodiment of the present application;
Fig. 12 is a schematic diagram illustrating an implementation of determining a predicted walking path according to an embodiment of the present application;
FIG. 13 is a schematic diagram of the interaction scope provided by the embodiments of the present application;
FIG. 14 is a schematic diagram of interaction directions provided by embodiments of the present application;
fig. 15 is a flowchart illustrating a method for device interaction according to an embodiment of the present application;
FIG. 16 is a schematic structural diagram of an apparatus for device interaction according to an embodiment of the present application;
FIG. 17 is a block diagram of an electronic device used to implement a method of device interaction of an embodiment of the present application.
Detailed Description
The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
For better understanding of the technical solution of the present application, the following further detailed description is provided for the related background art of the present application:
with the continuous development of science and technology, smart devices play an increasingly important role in daily life, wherein smart devices refer to devices having a sensing function, thinking judgment and calculation processing capabilities, and can execute related functions.
In a possible implementation manner, the intelligent device may be, for example, a robot, where a robot is an intelligent device capable of semi-autonomous or fully autonomous operation; it can be understood that robots can be divided into a plurality of different categories according to appearance, usage, and the like.
At present, when an intelligent device implements interaction, a camera is usually used to monitor whether a user is present within the range of the intelligent device, and if a user is present, the intelligent device actively interacts with that user.
However, because the shooting range of the camera is limited, monitoring the user with the camera alone results in a low success rate of user monitoring.
Aiming at the problems in the prior art, the present application proposes the following technical concept: the sound collection devices and the camera of the intelligent device are used cooperatively to monitor the user, so that the success rate of user monitoring can be improved; and because the shooting range of the camera is limited, sound monitoring with the sound collection devices can be performed first, thereby further improving the efficiency and success rate of user monitoring.
Taking a robot as an example, the structure of the robot is described below with reference to fig. 1, and fig. 1 is a schematic structural diagram of the robot according to the embodiment of the present application.
As shown in fig. 1, the robot includes:
the device comprises an image acquisition device, a sound acquisition device, a motion chassis and an expression display system device.
In an actual implementation process, the number of image acquisition devices installed on the robot may be one or more, which is not limited in this embodiment, and the specific positions where the cameras are installed may also be selected according to actual requirements; the arrangement in fig. 1 is only an exemplary introduction.
The sound collection devices can collect the sound around the robot, and the user can be monitored according to the collected sound; therefore, in this embodiment, the sound collection devices and the image acquisition devices can be combined to monitor the user at both the auditory and the visual level.
For example, a plurality of sound collection devices can be installed in the robot at different positions, so as to collect sound from each direction. In an actual implementation process, the specific number and installation positions of the sound collection devices can be selected according to actual requirements, and this embodiment places no special restriction on this.
And the motion chassis of the robot can realize the motion of the robot, such as the displacement of the robot, or the rotation of the robot direction, so as to realize the interaction with the user.
The expression display system device of the robot can display different expressions; for example, the expression corresponding to the current interactive content can be displayed on the expression display system. Compared with the prior art, in which the interaction with the user is performed only through limb actions, combining limb actions with expression feedback to interact with the user improves the diversity and flexibility of human-computer interaction and improves the user experience.
Based on the structure of the robot introduced above, in the implementation scheme of the present application, taking the image capture device as a camera as an example, the pictures taken by the camera of the robot can be preprocessed within the effective recognition range of the camera so as to determine the user position area, and the distance between the user and the robot is then measured. A user walking direction path is predicted by optimizing the image frame sequence features, and it is determined whether the user will enter the robot interaction area. The robot then makes corresponding feedback through limb actions and expressions within the effective recognition range.
Within the effective recognition range, sound collection devices in different directions acquire sound signals of the target area in real time, and the sound decibel values in different directions are monitored in real time. The sounding direction of the user is determined by identifying the difference between the recognition volume and the ambient volume: the direction with the largest sound-source difference is the direction of the user, and the position of the user relative to the robot is thereby determined. When multi-person interaction is recognized, the direction of a user can be determined through sound-source localization, and the robot can make corresponding feedback through limb actions and expressions within the effective range. In this way, user monitoring and interaction can be realized based on visual and auditory coordination.
The method for device interaction provided by the present application is described below with reference to specific embodiments, and fig. 2 is a flowchart of the method for device interaction provided by the present application.
As shown in fig. 2, the method includes:
s201, the first equipment monitors sound signals.
In this embodiment, the first device may be, for example, the robot described above, or the first device may also be any possible intelligent device, as long as the first device can monitor sound, collect images, and interact with a user, and the specific implementation manner of the first device is not particularly limited in this embodiment.
In a possible implementation manner, the first device may include, for example, sound collection devices disposed in a plurality of different directions. When listening for sound signals, the first device listens from these different directions; if the sound collection device in any one direction hears a sound signal, the first device may determine that a sound signal is heard.
Or, if the sound collecting devices in all directions do not monitor the sound signal, it may be determined that the first device does not monitor the sound signal.
And S202, executing a first interaction according to the monitored sound signal.
In a possible implementation manner, when the first device monitors the sound signal, the first device may determine that there is a user around the first device, that is, the first device currently monitors the user from an auditory level, and at this time, the first device may actively interact with the user.
In a possible implementation manner, the first interaction of the first device may include, for example, a limb action of the first device, and the first interaction of the first device may further include an expression action of the first device, for example, the first device may analyze the monitored sound signal, so as to perform the corresponding limb action and the expression action.
In another possible implementation manner, because the first device may monitor sound signals of different directions, the first device may further determine a target direction according to the monitored sound signal of at least one direction, where the target direction may be, for example, a direction in which the sound signal is largest or may also be a direction in which the sound signal is clearest, which is not limited in this embodiment. After determining the target orientation, the first device may steer the interactive surface towards the target orientation, thereby performing the first interactive action described above for the user of the target orientation.
In an actual implementation process, a specific implementation of the first interaction may be selected according to an actual requirement, which is not limited in this embodiment as long as the first interaction is an interaction performed by the first device and the user.
And S203, if the sound signal is not monitored, the first equipment judges whether an object exists in a preset visual field range according to the shot image frame.
In another possible implementation manner, the first device may not monitor the sound signal, where the first device may monitor the sound signal within a preset time period, and if the sound signal is not monitored within the preset time period, it is determined that the sound signal is not monitored, and a specific setting of the preset time period may be selected according to an actual requirement.
Or, the first device may also monitor the sound signal in real time without setting the preset time duration, and immediately determine that the sound signal is not monitored if the sound signal is not monitored at the current time.
However, this does not mean that there is no user around the first device; it may simply be that a nearby user has not made a sound. The first device may therefore continue to capture image frames with the camera and then monitor the user according to the captured image frames.
The first device may capture an image frame within a preset visual field range, and then analyze the captured image frame, so as to determine whether an object exists within the preset visual field range.
The image frames may be analyzed, for example, to determine whether an object is present in the image frames, where the object may be, for example, a user. If at least one object exists in the image frame, it may be determined that a user exists in the preset view range, and the number of the objects included in the image frame is not limited in this embodiment, and may be determined according to actual requirements. If the object does not exist in the image frame, it may be determined that the user does not exist within the preset visual field range.
And S204, if an object exists, performing a second interaction according to the captured image frame.
In a possible implementation manner, if the first device determines that an object exists in the preset view range according to the captured image frame, the first device may determine that a user exists around the first device, that is, the first device monitors the user from a visual aspect at this time.
Then, the first device may perform a second interaction according to the captured image frames, where the second interaction is similar to the first interaction described above, and may be, for example, a limb action or an expression action performed by the first device after analyzing the captured image frames, and a specific implementation of the second interaction is not limited in this embodiment.
It should be noted that the first interaction described above is performed according to the heard sound signal, and the second interaction is performed according to the captured image frame. In another possible implementation manner, the user may additionally be monitored according to image frames after a sound signal is heard, or a sound signal may be listened for after the user is monitored according to image frames, and the final interaction may then be determined according to both the sound signal and the image frame.
The device interaction method provided by the embodiment of the application includes: the first device listens for sound signals and performs a first interaction according to the heard sound signal. If no sound signal is heard, the first device determines whether an object exists within a preset field of view according to the captured image frame, and if so, performs a second interaction according to the captured image frame. By combining the monitoring of sound with the capture of image frames, the user is monitored cooperatively at both the auditory and the visual level, which effectively improves the accuracy of user monitoring; and because the shooting range of the camera is limited, performing sound monitoring with the sound collection devices first further improves the efficiency and success rate of user monitoring.
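Purely as a rough illustration of the decision flow of S201-S204, the following Python sketch is provided; the device methods it uses (listen_for_sound, perform_first_interaction, capture_frame, detect_objects, perform_second_interaction) are hypothetical placeholder names, not interfaces defined by this application.

    # Illustrative sketch of the S201-S204 flow; all device methods below are
    # hypothetical placeholders for the behaviour described in this embodiment.
    def interaction_loop(device) -> None:
        while True:
            sound = device.listen_for_sound()             # S201: listen for sound signals
            if sound is not None:
                device.perform_first_interaction(sound)   # S202: interact based on the sound
                continue
            frame = device.capture_frame()                # S203: fall back to the camera
            if device.detect_objects(frame):              # object within the preset field of view?
                device.perform_second_interaction(frame)  # S204: interact based on the frame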
On the basis of the above embodiments, two parts of the application, that is, the first device realizes interaction through auditory monitoring and the first device realizes interaction through visual monitoring, are respectively described below with reference to specific embodiments.
First, an implementation manner of implementing interaction for auditory monitoring is introduced with reference to fig. 3 to 6, fig. 3 is a second flowchart of a method for device interaction provided in the embodiment of the present application, fig. 4 is a schematic diagram of implementing a listening position provided in the embodiment of the present application, fig. 5 is a schematic diagram of implementing determining a target position by a first position provided in the embodiment of the present application, and fig. 6 is a schematic diagram of implementing determining a target position by a plurality of first positions provided in the embodiment of the present application.
As shown in fig. 3, the method includes:
s301, the first device listens for sound signals from different directions of the first device.
In this embodiment, the first device may include, for example, sound collection devices disposed in different directions, where the sound collection device of each direction collects the sound signals of the corresponding direction, so that sound signals from different directions of the first device can be monitored.
In a possible implementation manner, the number of sound collection devices on the first device may be, for example, 6, and the 6 sound collection devices may be arranged uniformly on the first device so as to collect sound from all 360 degrees around the robot. It can be understood that the first device may rotate 360 degrees within its movement area; since the monitoring of sound signals covers 360 degrees, sound signals can be monitored comprehensively no matter which direction the first device faces.
For example, as can be understood with reference to fig. 4, it is assumed that 6 sound collection devices are currently installed on the first device to monitor sound signals of six directions, namely, direction 301, direction 302, direction 303, direction 304, direction 305, and direction 306.
Or, in an actual implementation process, the number of the sound collection devices arranged on the first device may also be 4 or 8, which is not limited in this embodiment, and the sound collection devices may be selected according to actual requirements as long as the collection of sound signals in different directions can be implemented.
By listening for sound signals in different directions of the first device, the user can be effectively detected no matter at which angle the user stands relative to the first device, which effectively improves the accuracy and success rate of user monitoring.
S302, the first device determines the respective recognition volumes of the sound signals in different directions.
The first equipment collects the sound signals in different directions and can monitor the decibel values of the sound signals in different directions in real time, and accordingly the respective recognition volumes of the sound signals in different directions are determined, wherein the recognition volumes are the decibel values of the sound signals.
And S303, the first device judges whether the difference value of the identification volume between any two directions is smaller than or equal to a first threshold value or not according to the sound signals of different directions, if so, S304 is executed, and if not, S305 is executed.
In this embodiment, the first device needs to determine the specific sounding direction of the user based on the monitored sound signals of the respective directions, so as to interact with the user in a targeted manner.
In one possible implementation, if the first device listens in different directions, but only listens to the sound signal in one direction, it can be determined that this direction is the direction in which the user is.
In another possible implementation manner, if the first device monitors different directions and monitors sound signals in different directions, a target direction in which a user is located needs to be determined in different directions, so as to perform targeted interaction with the user.
In one possible implementation of determining the target direction among the different directions, the sounding direction of the user may be determined from the difference between the recognition volume of the sound signal in each direction and the current ambient volume; for example, the direction whose recognition volume differs most from the ambient volume is taken as the direction of the user relative to the robot. The current ambient volume therefore needs to be determined first.
The ambient volume can be understood as the average of the currently collected sound signals of all directions. However, to ensure the accuracy of the determined ambient volume, when the volume of the sound signal in a certain direction is too large or too small, that sound signal is removed before the ambient volume is determined.
Therefore, in this embodiment, it may be determined whether a difference between the recognition volumes of any two orientations is less than or equal to a first threshold, where a specific implementation of the first threshold may be selected according to an actual requirement, and this embodiment is not particularly limited thereto.
In one possible implementation of determining the first threshold, an initial ambient volume reference value may be set; for example, a quiet indoor ambient volume of 45 db may be used as the initial reference value, i.e. as the first threshold described above. A self-check is performed after the first device is started, and by comparison with the first threshold, a numerical reference for the ambient volume at the current location can be determined automatically, so that the ambient volume can be calculated in real time.
And S304, determining the average value of the identification volume of each sound signal in each direction as the environment volume.
In a possible implementation manner, if the difference between the recognition volumes of any two directions is smaller than or equal to the first threshold, which indicates that the difference between the recognition volumes of the sound signals of the respective directions is not large, an average value of the recognition volumes of the sound signals of the respective directions may be determined, and the average value may be determined as the ambient volume.
Here, for example, it is assumed that the respective recognition volumes of the current sound signals of 6 directions are: 67 db, 95 db, 101 db, 88 db, 66 db, 71 db, and assuming that the first threshold is set to 45, based on the example described herein, it can be determined that the difference in the recognized volume between any two orientations is less than 45, and then the average of the recognized volumes of the sound signals of the above 6 orientations can be determined to be 81.3, and then the ambient volume can be determined to be 81.3.
In an actual implementation process, the identification volume of the sound signal in each direction may be selected according to actual requirements, which is not particularly limited in this embodiment.
S305, the maximum recognized volume and the minimum recognized volume are removed from the recognized volumes of the respective sound signals of the respective directions, and the average value of the remaining recognized volumes is determined as the ambient volume.
In another possible implementation manner, if the condition that the difference between the recognition volumes of any two directions is less than or equal to the first threshold is not satisfied, that is, the difference between the recognition volumes of some two directions exceeds the first threshold, this indicates that the recognition volume of the sound signal in some direction differs greatly from the recognition volumes in the other directions.
In order to ensure the accuracy of the finally determined ambient volume, the maximum recognition volume and the minimum recognition volume may be removed from the recognition volumes of the sound signals in the respective directions, and then the average value of the remaining recognition volumes may be determined, and the average value determined at this time may be determined as the ambient volume.
Here, for example, it is assumed that the respective recognition volumes of the current sound signals of 6 directions are: 67 db, 25 db, 145 db, 88 db, 66 db and 71 db, and that the first threshold is set to 45. Based on this example, the condition that the difference between any two directions is at most 45 is not satisfied; for instance, the difference between 25 db and 145 db reaches 120 db, which is greater than the first threshold of 45. The maximum recognition volume 145 and the minimum recognition volume 25 therefore need to be removed, and the average of the remaining recognition volumes 67, 88, 66 and 71 is determined to be 73, so the ambient volume can be determined to be 73 at this time.
In an actual implementation process, the identification volume of the sound signal in each direction may be selected according to actual requirements, which is not particularly limited in this embodiment.
It can be understood that, in this embodiment, the difference between the recognition volumes of the sound signals in any two directions is determined first. When every such difference is within the first threshold, the average of the recognition volumes of all directions is determined as the ambient volume, so that the ambient volume can be determined simply and efficiently; when some difference exceeds the first threshold, the maximum and the minimum recognition volume are first removed and the average of the remaining recognition volumes is determined as the ambient volume, which prevents an excessively loud or quiet direction from distorting the ambient volume and thus ensures the accuracy of the determined ambient volume.
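A minimal Python sketch of the ambient-volume rule in S303-S305 is given below; the function name and the default threshold of 45 db are taken from the examples above, and the implementation is only an illustrative reading of this embodiment, not a prescribed algorithm.

    def ambient_volume(recognition_volumes, first_threshold=45.0):
        """Estimate the ambient volume (db) from per-direction recognition volumes.

        If every pairwise difference is within the first threshold (equivalently,
        max - min <= threshold), the plain average is used (S304); otherwise one
        maximum and one minimum value are removed before averaging (S305).
        """
        if max(recognition_volumes) - min(recognition_volumes) <= first_threshold:
            values = list(recognition_volumes)
        else:
            values = sorted(recognition_volumes)[1:-1]  # drop the extremes
        return sum(values) / len(values)

    # The two examples from the description:
    print(round(ambient_volume([67, 95, 101, 88, 66, 71]), 1))  # 81.3
    print(round(ambient_volume([67, 25, 145, 88, 66, 71]), 1))  # 73.0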
S306, determining at least one first direction with the recognition volume being larger than or equal to a second threshold value.
Based on the above description, in a possible implementation manner, when a sound signal is heard in only one direction, that direction can be determined as the direction of the user; it is therefore first necessary to determine the number of directions in which a sound signal is heard.
In order to ensure the accuracy of the monitored sound signal, the identification volume of the monitored sound signal may be compared with a second threshold, and the direction in which the identification volume is greater than or equal to the second threshold is determined as a first direction, where the first direction is the direction in which the sound signal is determined to be monitored.
For example, the second threshold may be set to 0, which indicates that the direction is determined as the first direction as long as the sound signal is currently heard in the direction.
For another example, the second threshold may be set to 3, which means that a heard sound signal is processed only when its recognition volume is greater than or equal to 3 db; sound signals below 3 db may be regarded as sound signals that do not need to be processed.
In an actual implementation process, the specific setting of the second threshold may be selected according to actual requirements, which is not particularly limited in this embodiment, and the first direction in which the sound is monitored is determined by comparing the recognition volume with the second threshold, so that the accuracy of the determined direction in which the sound is monitored can be effectively ensured.
S307, determine whether the number of the first orientations is one, if yes, execute S308, and if no, execute S309.
After at least one first direction is determined, if the number of the first directions is one, the direction of the user can be directly determined at this time, and if the number of the first directions is more than one, further determination is needed, so that whether the number of the first directions is one or not can be determined.
S308, determining the first direction as a target direction.
In a possible implementation manner, if the number of first directions is one, that first direction may be determined as the target direction. For example, as can be understood with reference to fig. 5, there are currently 6 directions 501, 502, 503, 504, 505 and 506. Taking a second threshold of 0 as an example, assuming that only the recognition volume of the sound signal of direction 504 is greater than the second threshold, that is, a sound signal is heard only at direction 504, direction 504 may be determined as the target direction.
S309, respectively determining the difference value between the identification volume and the environment volume of each first direction, and determining the first direction with the largest difference value as the target direction.
In another possible way, if the number of the first orientations is more than one, which indicates that the sound signal is currently heard in different orientations, then it is necessary to determine a user orientation for interacting with the user in different orientations.
In one possible implementation of determining the target direction, once the ambient volume has been determined, the difference between the recognition volume of each first direction and the ambient volume may be determined, and the first direction with the largest difference may then be determined as the target direction.
For example, it can be understood with reference to fig. 6, as shown in fig. 6, assuming that there are currently 6 orientations 601, 602, 603, 604, 605, and 606, taking the second threshold value as 0 as an example, assuming that the recognition volumes of the sound signals of the three orientations, i.e., the current orientation 602, the orientation 604, and the orientation 606, are greater than the second threshold value, that is, the sound signals are heard in all of the three orientations, it can be determined that there are three first orientations, i.e., the orientation 602, the orientation 604, and the orientation 606, which coexist at this time.
Then, the difference between the recognized volume and the ambient volume of the three first orientations may be determined, and the first orientation having the largest difference may be determined as the target orientation, referring to fig. 6, assuming that the recognized volume of the orientation 602 is 130 db, the recognized volume of the orientation 604 is 75 db, the orientation 606 is 90 db, and assuming that the ambient volume at this time is 73 db, it may be determined that the first orientation having the largest difference among the three first orientations is the orientation 602, and the orientation 602 may be determined as the target orientation.
In another possible implementation manner, when sound signals are heard in a plurality of different directions, the direction in which a sound signal was heard first may, for example, be determined as the target direction.
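The selection of the target direction in S306-S309 can be illustrated with the following sketch; the function name and the direction labels are assumptions made here for illustration only.

    def target_direction(recognition_volumes, ambient, second_threshold=0.0):
        """Return the target direction (S306-S309), or None if no sound is heard.

        recognition_volumes maps a direction label to its recognition volume (db).
        Directions below the second threshold are ignored; with one candidate it is
        returned directly, otherwise the one exceeding the ambient volume the most wins.
        """
        candidates = {d: v for d, v in recognition_volumes.items() if v >= second_threshold}
        if not candidates:
            return None
        if len(candidates) == 1:
            return next(iter(candidates))                                  # S308
        return max(candidates, key=lambda d: candidates[d] - ambient)      # S309

    # Example matching fig. 6: directions 602/604/606 hear 130/75/90 db, ambient 73 db.
    print(target_direction({"602": 130, "604": 75, "606": 90}, ambient=73))  # 602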
S310, the first device adjusts the interactive interface to face the target position and executes a first interactive action.
The target position in this embodiment refers to the currently determined position of the user relative to the first device. In order to interact with the user in a targeted manner, the first device may turn its interactive interface to face the target position and perform the first interaction.
For example, the first interaction may be performed by turning the expression display system device of the first device toward the target position. Taking the first device as a robot as an example, the robot may perform the first interaction within an effective range, where the effective range refers to the range in which the robot performs limb motions such as body rotation and limb movement; for example, the robot may perform body and limb motions within a circular range with a radius of 1.5 meters centred on the robot. After the target position of the user is determined and the robot can perform body and limb movement, the expression feedback of the robot can be controlled through the expression display system device.
The specific body motion and expression feedback can be selected and set according to actual requirements, which is not particularly limited in this embodiment.
The device interaction method provided by the embodiment of the application comprises the following steps: the first device listens for sound signals from different directions of the first device. The first device determines respective identified volumes of the sound signals at the different orientations. The first device judges whether the difference value of the identification volume between any two directions is smaller than or equal to a first threshold value according to the sound signals of different directions, and if so, the average value of the identification volume of each sound signal of each direction is determined as the environment volume. If not, the maximum identification volume and the minimum identification volume are removed from the identification volumes of the sound signals in all directions, and the average value of the remaining identification volumes is determined as the environment volume. At least one first bearing is determined that identifies a volume greater than or equal to a second threshold. And judging whether the number of the first directions is one, and if so, determining the first directions as the target directions. If not, respectively determining the difference value between the identification volume and the environment volume of each first direction, and determining the first direction with the largest difference value as the target direction. The first device adjusts the interactive interface to a target-oriented orientation and performs a first interactive action.
By listening for sound in different directions, when sound is heard in only one direction that direction is determined as the target direction, and when sound is heard in a plurality of directions the target direction is determined among them; the first device then interacts with the user in the target direction in a targeted manner, which effectively improves the pertinence and effectiveness of device interaction. When the target direction is determined among a plurality of first directions, the first direction with the largest difference between its recognition volume and the ambient volume is determined as the target direction, which ensures the reasonableness of the determined target direction; and when the ambient volume is determined, averaging the recognition volumes of the directions ensures the accuracy of the determined ambient volume.
Fig. 7 is a third flowchart of a method for realizing interaction of equipment provided by the embodiment of the present application, fig. 8 is a first schematic diagram of a preset field of view provided by the embodiment of the present application, fig. 9 is a second schematic diagram of a preset field of view provided by the embodiment of the present application, fig. 10 is a schematic diagram of an implementation of determining a target object by one object provided by the embodiment of the present application, fig. 11 is a schematic diagram of an implementation of determining target objects by a plurality of objects provided by the embodiment of the present application, fig. 12 is a schematic diagram of an implementation of determining a predicted walking path provided by the embodiment of the present application, fig. 13 is a schematic diagram of an interaction range provided by the embodiment of the present application, and fig. 14 is a schematic diagram of an interaction direction provided by the embodiment of the present application.
As shown in fig. 7, the method includes:
s701, the first equipment judges whether an object exists in a preset visual field range according to the shot image frame, if so, S702 is executed, and if not, S701 is executed.
In this embodiment, the camera of the first device may capture an image frame within a preset visual field range, and then the first device may determine whether an object exists within a preset recognition range according to the captured image frame.
In a possible implementation manner, the preset field of view may be as shown in fig. 8 and fig. 9. Referring first to fig. 8, the distance of the preset field of view may be, for example, in the range of 0-6 meters from the first device; and referring to fig. 9, the up-down viewing angle of the camera is, for example, 70°, and the left-right viewing angle of the camera is, for example, 124°.
With reference to fig. 8 and fig. 9, in one possible implementation the preset field of view may be an up-down viewing angle of 70°, a left-right viewing angle of 124°, and a distance of 6 meters. In an actual implementation process, the preset field of view may be selected and set according to actual requirements, which is not particularly limited in this embodiment.
In one possible implementation, before the determination is made according to the image frame, the image frame may first be preprocessed, where the preprocessing may include at least one of the following: anti-aliasing processing and image information equalization processing, so that the image frame meets the requirements of a database format.
If it is determined whether an object exists according to the image frame, for example, the image frame may be subjected to image recognition processing to determine whether the object exists in the image frame.
If it is determined that the object does not exist in the current image frame, it may be determined that the object does not exist within the preset view range, and then the image frame may be continuously captured and the object may be recognized according to the captured image frame, so step S701 may be repeatedly performed.
S702, the first device determines the number of objects included in the image frame according to the shot image frame.
In another possible implementation manner, if the first device determines that an object exists in the preset view range according to the captured image frames, the number of the objects may be further determined, so as to perform targeted interaction with the user.
Wherein the first device may determine the number of objects included in the image frame from the photographed image frame.
In a possible implementation manner, taking the object as a pedestrian as an example, a Haar-like feature classifier file suitable for pedestrian recognition in the specific scene may be trained in advance according to image information for judging the user position in that scene, where Haar-like features are a kind of digital image feature used for object recognition and were first applied to real-time face detection.
When locating a pedestrian, the pedestrian position area in the image frame can be located with the pre-trained Haar-like feature classifier; if one pedestrian position area is determined, it can be determined that one pedestrian is currently present, and if a plurality of pedestrian areas are determined, it can be determined that a plurality of pedestrians are currently present.
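Purely as an illustration of how such a pre-trained classifier could be applied to an image frame, the sketch below uses OpenCV's Haar-cascade interface; the cascade file name is an assumption standing in for the classifier file described above, and the detection parameters are illustrative defaults.

    import cv2  # OpenCV, used here only to illustrate Haar-cascade detection

    # Assumed path to a cascade trained in advance for pedestrian recognition
    # in the specific scene; the file name is illustrative only.
    CASCADE_PATH = "pedestrian_haar_cascade.xml"

    def locate_pedestrians(frame):
        """Return the pedestrian position areas (x, y, w, h) found in one image frame."""
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        gray = cv2.equalizeHist(gray)  # simple image-information equalization pre-processing
        classifier = cv2.CascadeClassifier(CASCADE_PATH)
        return classifier.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=3)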
And S703, if the image frame comprises an object, determining the object in the image frame as a target object.
In a possible implementation, if it is determined that an object is included in the image frame, it may be determined that the object in the current image frame is the target object, and it may be understood that the target object is the object to be interacted with.
For example, referring to fig. 10, if an object 1001 is currently determined to be included in the image frame, it may be determined that the object 1001 is the target object.
And S704, if the image frame includes a plurality of objects, determining the object closest to the first device as the target object.
In another possible implementation, if it is determined that the image frame includes a plurality of objects, a target object may be determined among the plurality of objects, and in a possible implementation, an object closest to the first device may be determined as the target object.
For example, referring to fig. 11, 3 objects, namely, an object 1101, an object 1102, and an object 1103, are included in the currently determined image frame, and the object 1103 closest to the first device among the three objects may be determined as the target object.
The implementation manner of determining the distance between each object and the first device may be, for example, determining the distance between each object and the first device by a distance sensor of the first device, or determining the distance between each object and the first device by the distance in the image frame, which is not particularly limited in this embodiment and may be selected according to actual requirements.
Alternatively, one of the multiple objects may be randomly selected as the target object, and the implementation manner of determining the target object is not limited in this embodiment as long as the target object is one of the multiple objects.
S705, the first device determines a position area of the target object in a plurality of image frames according to the plurality of image frames corresponding to the target object.
After the target object is determined in the single-frame image, the first device may use the target object as an interactive target to predict a walking path of the target object, where the first device may obtain, through the camera, a plurality of image frames corresponding to the target object, so as to predict the walking path of the target object.
The first device may determine, according to the position areas of the target object in the plurality of image frames, the distance between the target object in the plurality of image frames and the first device, so as to implement prediction of the walking path, and then the first device needs to determine, according to the plurality of image frames corresponding to the target object, the position areas of the target object in the plurality of image frames.
For example, referring to fig. 12, it is understood that, taking a single frame image as an example, a position region 1201 of a target object may be determined in the single frame image, and this operation is performed for each of a plurality of image frames corresponding to the target object, so that the position regions of the target object in the plurality of image frames may be determined, where the determination of the position regions may be performed by a feature classifier as described above, for example.
And S706, the first device determines the distance between the target object in the image frames and the first device according to each position area, the pixel width of the image frames and the focal length of the camera respectively.
After determining the location areas of the target object in the plurality of image frames, the distance of the target object from the first device in each image frame may be determined from each location area.
The first device may calculate the distance between the target object and the first device according to the monocular ranging principle in camera calibration. Taking any one frame as an example, the width W of the position area may be determined from the position area of the target object in that frame image, the distance between the target object and the first device may be denoted D, the pixel width of the image frame may be determined to be P, and the focal length of the camera may be F, where W, D, P and F satisfy the following Formula One:
F = (P × D) / W    (Formula One)
It can be understood that the camera focal length F, the pixel width P and the width W of the position area can all be determined, and thus the distance D between the target object and the first device in that frame image can be determined, i.e. D = (F × W) / P.
For example, referring to FIG. 12, currently in a single frame image, a distance 1202 of a target object and a first device may be determined.
In the above description taking any one frame as an example, the above operation is performed on a plurality of image frames corresponding to the target object, so that the distance between the target object and the first device in the plurality of image frames can be determined.
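The monocular ranging relation of Formula One can be captured in two small helpers; this is only a restatement of the formula above, with the variable roles as described in this embodiment.

    def calibrate_focal_length(pixel_width, distance, region_width):
        """Calibration: F = (P x D) / W, i.e. Formula One."""
        return (pixel_width * distance) / region_width

    def estimate_distance(focal_length, region_width, pixel_width):
        """Ranging: with F, W and P known, D = (F x W) / P."""
        return (focal_length * region_width) / pixel_width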
And S707, the first device determines a predicted walking path of the target object according to the distance between the target object in the plurality of image frames and the first device.
After determining the distance of the target object from the first device in the plurality of image frames of the target object, a predicted walking path of the target object may be determined according to the plurality of distances.
In one possible implementation, for example, the walking direction path of the pedestrian is predicted by fitting a statistical model with a Bayesian probability function to the information in a sequence of continuously captured image key frames, namely the distances between the currently determined target object and the first device.
For example, referring to fig. 12, the current target object corresponds to a plurality of image frames, namely frame t-2, frame t-1, frame t and frame t+1. It can currently be determined that the position area of the target object in frame t-2 is 1203, in frame t-1 is 1204, in frame t is 1205 and in frame t+1 is 1206; based on these position areas, the distance between the target object and the first device in each frame can be determined, and the predicted walking path 1207 can then be obtained.
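The application leaves the concrete Bayesian statistical fitting open; purely to make the idea of path prediction concrete, the sketch below substitutes a plain least-squares straight line over the recent key-frame distances. It is an illustrative stand-in, not the fitting method of this embodiment.

    def predict_next_distance(distances):
        """Extrapolate the target-object distance at the next key frame.

        A least-squares line is fitted to the recent distances (at least two are
        needed); a negative slope means the target object is approaching the device.
        """
        n = len(distances)
        xs = list(range(n))
        mean_x = sum(xs) / n
        mean_d = sum(distances) / n
        slope = sum((x - mean_x) * (d - mean_d) for x, d in zip(xs, distances)) \
                / sum((x - mean_x) ** 2 for x in xs)
        intercept = mean_d - slope * mean_x
        return slope * n + intercept

    # e.g. distances (meters) measured in frames t-2, t-1, t, t+1:
    print(round(predict_next_distance([4.0, 3.4, 2.9, 2.3]), 2))  # about 1.75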
And S708, the first device determines whether the predicted walking path of the target object overlaps the interaction range of the first device; if so, S709 is executed, and if not, S710 is executed.
After the first device has predicted the walking path of the target object, it may determine whether the predicted walking path overlaps the interaction range of the first device, that is, whether the target object will enter the interaction range of the first device according to the predicted walking path.
As shown in fig. 13, the interaction range of the first device may be, for example, a sector within a radius of 1.5 meters around the center of the robot and within 60 degrees on either side of the front of the robot; if the predicted walking path of the target object falls within the 1.5-meter radius around the center of the robot, the target object is determined to enter the robot interaction range.
In another possible implementation manner, whether the target object will move into the interaction range of the robot may further be determined from the interaction direction of the target object relative to the first device, that is, the angle between the predicted walking path and the first device. This may be understood with reference to fig. 14, where the angle between the predicted walking path of the target object and the first device may be classified into 5 classes, namely 180°, 225°, 135°, 90° and 270° as shown in fig. 14.
For example, when the angle between the predicted walking path and the first device is 180°, it can be determined that the target object will walk into the interaction range of the robot; when the angle is 270° or 90°, it can be determined that the target object will not enter the interaction range of the robot; and when the angle is 225° or 135°, it may be determined that the target object might walk into the interaction range of the robot, so further image frame shooting and prediction are performed for continued observation.
In an actual implementation process, the specific interaction directions may be set, and the specific processing modes of the interaction directions may be selected according to actual requirements, which is not particularly limited in this embodiment.
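A hypothetical mapping of these five interaction directions to a decision is sketched below; the discretization into exactly these angles and the returned labels are illustrative only.

```python
# Illustrative decision table for the five interaction directions above.

def decide_by_angle(angle_deg: int) -> str:
    """Map the angle between the predicted walking path and the first device
    to a follow / discard / keep-observing decision (illustrative labels)."""
    if angle_deg == 180:
        return "will_enter_range"        # head-on approach: follow and interact (S709)
    if angle_deg in (90, 270):
        return "will_not_enter_range"    # walking across: discard the path (S710)
    if angle_deg in (135, 225):
        return "keep_observing"          # oblique approach: capture more frames
    raise ValueError("unsupported interaction direction")
```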
And S709, the first equipment moves along with the target object and executes a second interactive action.
In a possible implementation, if it is determined that the predicted walking path will enter the interaction range of the first device, the robot can move along with the target object and give corresponding feedback according to the determined interaction direction, for example by performing the corresponding limb actions and expressions of the first device.
Taking the camera as an example, when the target object enters the field of view of the robot's camera, the body and the face of the robot can move to follow the target object. The movement instruction is obtained through recognition by the camera; the rotation range of the robot may be, for example, -60° to 60°, with 0° directly in front of the robot, and the range in front of the robot is the sector area with a radius of 1.5 m spanning -60° to 60°, i.e. the range described above with reference to fig. 13.
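As a small illustration of the follow behavior, the rotation command toward the target's bearing can be clamped to the stated -60° to 60° range; the function below is a hypothetical sketch, not an interface of the embodiment.

```python
# Clamp the body/face rotation toward the target to the assumed range.

def follow_rotation(target_bearing_deg: float,
                    min_deg: float = -60.0, max_deg: float = 60.0) -> float:
    """Return the rotation command (degrees, 0 = straight ahead) that keeps
    the target object in view without exceeding the robot's rotation range."""
    return max(min_deg, min(max_deg, target_bearing_deg))

print(follow_rotation(25.0))    # 25.0: rotate toward the target
print(follow_rotation(80.0))    # 60.0: clamped to the maximum rotation
```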
And S710, discarding the predicted walking path.
In another possible implementation manner, if it is determined that the predicted walking path does not enter the interaction range of the first device, the first device may discard the predicted walking path, and then perform the above process again to detect the object.
The device interaction method provided by the embodiment of the application includes the following steps. The first device judges whether an object exists in a preset field of view according to the captured image frame; if so, the first device determines the number of objects in the image frame according to the captured image frame. If one object is included in the image frame, that object is determined as the target object. If a plurality of objects are included in the image frame, the object closest to the first device is determined as the target object. The first device determines the location area of the target object in each of a plurality of image frames corresponding to the target object. The first device then determines the distance between the target object and the first device in each of the image frames according to the respective location areas, the pixel width of the image frames and the focal length of the camera. The first device determines a predicted walking path of the target object according to the distances between the target object and the first device in the plurality of image frames. The first device judges whether the predicted walking path of the target object overlaps the interaction range of the first device; if so, the first device moves along with the target object and performs a second interaction action, and if not, the predicted walking path is discarded.
Because the target object is determined according to the image frames, interaction can be carried out with the target object in a targeted manner, which improves interaction accuracy and efficiency. Meanwhile, the distance between the target object and the first device in each image frame is determined according to the location areas of the target object in the plurality of image frames, prediction of the walking path can then be realized based on the determined distances, and interaction with the target object is performed according to the predicted walking path. Active interaction is initiated only when it is determined that the target object is approaching the first device, which effectively guarantees interaction accuracy and avoids disturbing the user and thus degrading the user experience. In summary, monitoring of and interaction with the user can be effectively realized at the visual level.
Based on the above embodiments, the method for device interaction provided by the present application is described below as a whole with reference to fig. 15, which is a flowchart of the method for device interaction provided by the present application.
As shown in fig. 15, the method includes:
the first device first judges whether a sound signal is monitored, and if a sound signal is monitored, the first device performs the process of monitoring and interacting with the user based on hearing.
Specifically, the first device may determine whether the currently monitored sound signal comes from a single person or from multiple persons. If it comes from a single person, the position of that sound signal is determined as the target position; if it comes from multiple persons, a target position is determined among the different positions corresponding to the multiple persons. For the specific implementation of determining the target position among the different positions, reference may be made to the description of the above embodiments, which is not repeated here.
After the target position is determined, the interactive interface can be adjusted to face the target position, and the corresponding limb actions and expression feedback can be performed.
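The volume-based rules used in this hearing branch (an ambient volume computed from the per-direction recognition volumes, then the target position chosen as the direction standing out most from that ambient level) are sketched below; the data layout, threshold value and function names are assumptions for illustration.

```python
# Hedged sketch of the hearing branch: one recognition volume per direction.

VOLUME_DIFF_THRESHOLD = 10.0    # assumed threshold on inter-direction differences

def ambient_volume(volumes: dict[str, float]) -> float:
    """Average all directions if they are close; otherwise drop the loudest
    and quietest directions before averaging, as described above."""
    values = sorted(volumes.values())
    if values[-1] - values[0] <= VOLUME_DIFF_THRESHOLD:
        return sum(values) / len(values)
    trimmed = values[1:-1]
    return sum(trimmed) / len(trimmed)

def target_position(volumes: dict[str, float]) -> str | None:
    """Pick the direction whose recognition volume exceeds the ambient volume
    by the largest margin; return None if nothing is heard."""
    ambient = ambient_volume(volumes)
    heard = {d: v for d, v in volumes.items() if v > 0}
    if not heard:
        return None                     # fall back to the vision branch
    return max(heard, key=lambda d: heard[d] - ambient)

print(target_position({"front": 48.0, "left": 20.0, "right": 18.0, "back": 19.0}))  # front
```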
Or, if the first device determines at first that no sound signal is monitored, the process of monitoring and interacting with the user based on vision is performed.
Specifically, the first device may judge whether an object exists in the preset field of view according to the captured image frame. If not, the judgment is made again from the beginning; if so, the first device judges whether there is a single person or multiple persons. If there is a single person, that person is determined as the target object; if there are multiple persons, the object closest to the first device is determined as the target object.
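A hypothetical illustration of this selection step, assuming the detector returns one bounding box per person and reusing the pinhole intuition that a wider box corresponds to a closer object:

```python
# Illustrative target selection among detected objects; boxes are assumed to
# be (x, y, w, h) in pixels. A wider box is treated as a closer object.

def select_target(bboxes: list[tuple[float, float, float, float]]):
    """Return the single detection, or the one closest to the first device."""
    if not bboxes:
        return None                     # no object: start the flow again
    if len(bboxes) == 1:
        return bboxes[0]
    return max(bboxes, key=lambda box: box[2])
```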
Then, the predicted walking path of the target object is determined; for the implementation of determining the predicted walking path, reference may be made to the description of the above embodiments, which is not repeated here. According to the predicted walking path, it is then judged whether the target object will enter the interaction range of the first device. If not, the path is discarded; if so, the first device moves according to the target object, performs limb actions and gives expression feedback.
In summary, compared with prior-art device interaction methods in which the user is monitored only through a camera before an interactive action is performed, the device interaction method provided by the embodiments of the application can monitor the user through the cooperation of the sound collection apparatus and the camera. Because the shooting range of the camera is limited, sound monitoring is performed first by the sound collection apparatus, which further improves the efficiency and success rate of user monitoring. During the interaction, the interaction can be realized through limb interaction and expression feedback, which improves the vividness and flexibility of the interaction.
The application provides a device interaction method and device, which are applied to an artificial intelligence technology in a computer technology to achieve the purpose of improving the efficiency and the success rate of user monitoring.
Fig. 16 is a schematic structural diagram of an apparatus for device interaction according to an embodiment of the present application. As shown in fig. 16, the apparatus 1600 for device interaction of the present embodiment may include: monitoring module 1601, processing module 1602, and determining module 1603.
A monitoring module 1601, configured to monitor a sound signal by a first device;
a processing module 1602, configured to perform a first interaction according to the monitored sound signal;
a determining module 1603, configured to determine, if the sound signal is not monitored, whether an object exists in a preset view range according to the captured image frame by the first device;
the processing module 1602 is further configured to, if an object exists, execute a second interaction action according to the captured image frame.
In a possible implementation manner, the monitoring module 1601 is specifically configured to:
the first device listens for sound signals from different directions of the first device.
In a possible implementation manner, the monitoring module 1601 specifically includes:
the first determining submodule is used for determining the environment volume by the first equipment according to the sound signals of different directions;
the first determining submodule is further used for determining the respective identification volumes of the sound signals of the different directions by the first equipment;
the first determining submodule is further configured to determine, by the first device, a target position where sound is monitored according to each of the identified volume and/or the ambient volume;
and the adjusting submodule is used for adjusting the interactive interface to face the target position by the first equipment and executing the first interactive action.
In a possible implementation manner, the first determining submodule is specifically configured to:
if the difference value of the identification volume between any two directions is less than or equal to a threshold value, determining the average value of the identification volume of each sound signal of each direction as the environment volume; or,
if the difference between the recognition volumes of any two directions is greater than the threshold, the maximum recognition volume and the minimum recognition volume are removed from the recognition volumes of the sound signals of the directions, and the average value of the recognition volumes of the sound signals of the remaining directions is determined as the ambient volume.
In a possible implementation manner, the first determining submodule is specifically configured to:
determining at least one first orientation in which the identified volume is not 0;
if the number of the first orientations is one, determining the first orientation as the target orientation; or,
if the number of the first orientations is more than one, the difference value between the identification volume and the environment volume of each first orientation is determined respectively, and the first orientation with the largest difference value is determined as the target orientation.
In a possible implementation manner, the processing module 1602 specifically includes:
a second determination submodule for the first device to determine a target object in an image frame according to the captured image frame;
the second determining submodule is further used for the first device to determine a position area of the target object in a plurality of image frames according to the plurality of image frames corresponding to the target object;
the second determining submodule is further used for the first device to respectively determine the distance between a target object in the image frames and the first device according to each position area, the pixel width of the image frames and the focal length of the camera;
the second determining sub-module is further used for the first device to determine a predicted walking path of a target object according to the distance between the target object in the plurality of image frames and the first device;
and the execution sub-module is used for executing a second interactive action by the first equipment according to the predicted walking path of the target object.
In a possible implementation manner, the execution submodule is specifically configured to:
the first equipment judges that the predicted walking path of the target object and the interaction range of the target object are overlapped;
if so, the first equipment moves along with the target object and executes the second interaction action.
In a possible implementation manner, the second determining submodule is specifically configured to:
the first device determines the number of objects included in a photographed image frame according to the image frame;
if the image frame comprises an object, determining the object in the image frame as a target object;
if the image frame comprises a plurality of objects, determining the object closest to the first device as the target object.
In a possible implementation manner, the processing module 1602 is further configured to:
pre-processing the image frame, wherein the pre-processing comprises at least one of: and performing sawtooth removing processing and image information equalization processing.
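As an illustration of the two pre-processing steps named above, the sketch below uses OpenCV as an assumed implementation choice: a light Gaussian blur as a simple de-jagging (anti-aliasing) step and histogram equalization of the luminance channel as the image information equalization step.

```python
# Hedged pre-processing sketch (OpenCV is an assumed choice, not mandated
# by the embodiment): smooth jagged edges, then equalize the luminance.

import cv2

def preprocess(frame):
    smoothed = cv2.GaussianBlur(frame, (3, 3), 0)            # de-jagging
    ycrcb = cv2.cvtColor(smoothed, cv2.COLOR_BGR2YCrCb)
    y, cr, cb = cv2.split(ycrcb)
    y = cv2.equalizeHist(y)                                   # equalize image information
    return cv2.cvtColor(cv2.merge((y, cr, cb)), cv2.COLOR_YCrCb2BGR)
```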
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
There is also provided, in accordance with an embodiment of the present application, a computer program product, including: a computer program, stored in a readable storage medium, from which at least one processor of the electronic device can read the computer program, the at least one processor executing the computer program causing the electronic device to perform the solution provided by any of the embodiments described above.
FIG. 17 shows a schematic block diagram of an example electronic device 1700 that may be used to implement embodiments of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 17, the electronic apparatus 1700 includes a computing unit 1701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM)1702 or a computer program loaded from a storage unit 1708 into a Random Access Memory (RAM) 1703. In the RAM 1703, various programs and data required for the operation of the device 1700 can also be stored. The computing unit 1701, the ROM 1702, and the RAM 1703 are connected to each other through a bus 1704. An input/output (I/O) interface 1705 is also connected to bus 1704.
Various components in the device 1700 are connected to the I/O interface 1705, including: an input unit 1706 such as a keyboard, a mouse, and the like; an output unit 1707 such as various types of displays, speakers, and the like; a storage unit 1708 such as a magnetic disk, optical disk, or the like; and a communication unit 1709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 1709 allows the device 1700 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 1701 may be a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 1701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 1701 executes the various methods and processes described above, such as the method of device interaction. For example, in some embodiments, the method of device interaction may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 1708. In some embodiments, part or all of a computer program may be loaded and/or installed onto device 1700 via ROM 1702 and/or communications unit 1709. When the computer program is loaded into RAM 1703 and executed by the computing unit 1701, one or more steps of the method of device interaction described above may be performed. Alternatively, in other embodiments, the computing unit 1701 may be configured in any other suitable manner (e.g., by way of firmware) to perform a method of device interaction.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in the cloud computing service system and overcomes the defects of high management difficulty and weak service expansibility in traditional physical host and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present application can be achieved; this is not limited herein.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (21)

1. A method of device interaction, comprising:
the first equipment monitors sound signals;
performing a first interaction according to the monitored sound signal;
if the sound signal is not monitored, the first equipment judges whether an object exists in a preset visual field range according to the shot image frame;
and if so, executing a second interaction according to the shot image frame.
2. The method of claim 1, wherein the first device listens for sound signals, comprising:
the first device listens for sound signals from different directions of the first device.
3. The method of claim 1 or 2, wherein said performing a first interaction in accordance with said heard sound signal comprises:
the first equipment determines the environmental volume according to the sound signals of different directions;
the first device determining respective identified volumes of the sound signals at the different orientations;
the first equipment determines the target position of the monitored sound according to each identified volume and/or the environment volume;
and the first equipment adjusts the interactive interface to face the target position and executes the first interactive action.
4. The method of claim 3, wherein the first device determining an ambient volume from the differently oriented sound signals comprises:
if the difference value of the identification volume between any two directions is less than or equal to a threshold value, determining the average value of the identification volume of each sound signal of each direction as the environment volume; or,
if the difference between the recognition volumes of any two directions is greater than the threshold, the maximum recognition volume and the minimum recognition volume are removed from the recognition volumes of the sound signals of the directions, and the average value of the recognition volumes of the sound signals of the remaining directions is determined as the ambient volume.
5. The method of claim 4, wherein determining, by the first device, a target bearing to listen for sound based on each of the identified volumes and/or the ambient volume comprises:
determining at least one first orientation in which the identified volume is not 0;
if the number of the first orientations is one, determining the first orientation as the target orientation; or,
if the number of the first orientations is more than one, the difference value between the identification volume and the environment volume of each first orientation is determined respectively, and the first orientation with the largest difference value is determined as the target orientation.
6. The method according to any one of claims 1-5, wherein said performing a second interaction from said captured image frames comprises:
the first device determines a target object in an image frame according to the shot image frame;
the first device determines a position area of the target object in a plurality of image frames corresponding to the target object according to the plurality of image frames;
the first device respectively determines the distance between a target object in the image frames and the first device according to each position area, the pixel width of the image frames and the focal length of a camera;
the first device determining a predicted walking path of a target object in the plurality of image frames according to a distance between the target object and the first device;
and the first equipment executes a second interactive action according to the predicted walking path of the target object.
7. The method of claim 6, wherein the first device performs a second interaction based on the predicted path of travel of the target object, comprising:
the first equipment judges that the predicted walking path of the target object and the interaction range of the target object are overlapped;
if so, the first equipment moves along with the target object and executes the second interaction action.
8. The method of claim 6, wherein the first device determines a target object in image frames from the captured image frames, comprising:
the first device determines the number of objects included in a photographed image frame according to the image frame;
if the image frame comprises an object, determining the object in the image frame as a target object;
if the image frame comprises a plurality of objects, determining that the object closest to the first device is determined as the target object.
9. The method according to any one of claims 6-8, further comprising:
pre-processing the image frame, wherein the pre-processing comprises at least one of: and performing sawtooth removing processing and image information equalization processing.
10. An apparatus for device interaction, comprising:
the monitoring module is used for monitoring the sound signal by the first equipment;
the processing module is used for executing a first interaction action according to the monitored sound signal;
the determining module is used for judging, by the first device, whether an object exists in a preset visual field range according to the shot image frame if the sound signal is not monitored;
and the processing module is also used for executing a second interaction action according to the shot image frame if the object exists.
11. The apparatus of claim 10, wherein the listening module is specifically configured to:
the first device listens for sound signals from different directions of the first device.
12. The apparatus according to claim 10 or 11, wherein the listening module specifically includes:
the first determining submodule is used for determining the environment volume by the first equipment according to the sound signals of different directions;
the first determining submodule is further used for determining the respective identification volumes of the sound signals of the different directions by the first equipment;
the first determining submodule is further configured to determine, by the first device, a target position where sound is monitored according to each of the identified volume and/or the ambient volume;
and the adjusting submodule is used for adjusting the interactive interface to face the target position by the first equipment and executing the first interactive action.
13. The apparatus of claim 12, wherein the first determination submodule is specifically configured to:
if the difference value of the identification volume between any two directions is less than or equal to a threshold value, determining the average value of the identification volume of each sound signal of each direction as the environment volume; or,
if the difference between the recognition volumes of any two directions is greater than the threshold, the maximum recognition volume and the minimum recognition volume are removed from the recognition volumes of the sound signals of the directions, and the average value of the recognition volumes of the sound signals of the remaining directions is determined as the ambient volume.
14. The apparatus of claim 13, wherein the first determination submodule is specifically configured to:
determining at least one first orientation in which the identified volume is not 0;
if the number of the first orientations is one, determining the first orientation as the target orientation; or,
if the number of the first orientations is more than one, the difference value between the identification volume and the environment volume of each first orientation is determined respectively, and the first orientation with the largest difference value is determined as the target orientation.
15. The apparatus according to any one of claims 10-14, wherein the processing module is specifically configured to:
a second determination submodule for the first device to determine a target object in an image frame according to the captured image frame;
the second determining submodule is further used for the first device to determine a position area of the target object in a plurality of image frames according to the plurality of image frames corresponding to the target object;
the second determining submodule is further used for the first device to respectively determine the distance between a target object in the image frames and the first device according to each position area, the pixel width of the image frames and the focal length of a camera;
the second determining sub-module is further used for the first device to determine a predicted walking path of a target object according to the distance between the target object in the plurality of image frames and the first device;
and the execution sub-module is used for executing a second interactive action by the first equipment according to the predicted walking path of the target object.
16. The apparatus of claim 15, wherein the execution submodule is specifically configured to:
the first equipment judges that the predicted walking path of the target object and the interaction range of the target object are overlapped;
if so, the first equipment moves along with the target object and executes the second interaction action.
17. The apparatus of claim 15, wherein the second determining submodule is specifically configured to:
the first device determines the number of objects included in a photographed image frame according to the image frame;
if the image frame comprises an object, determining the object in the image frame as a target object;
if the image frame comprises a plurality of objects, determining that the object closest to the first device is determined as the target object.
18. The apparatus of any of claims 15-17, the processing module further to:
pre-processing the image frame, wherein the pre-processing comprises at least one of: and performing sawtooth removing processing and image information equalization processing.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
20. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-9.
21. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-9.
CN202011471453.4A 2020-12-15 2020-12-15 Method and device for equipment interaction Active CN112578909B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011471453.4A CN112578909B (en) 2020-12-15 2020-12-15 Method and device for equipment interaction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011471453.4A CN112578909B (en) 2020-12-15 2020-12-15 Method and device for equipment interaction

Publications (2)

Publication Number Publication Date
CN112578909A true CN112578909A (en) 2021-03-30
CN112578909B CN112578909B (en) 2024-05-31

Family

ID=75135119

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011471453.4A Active CN112578909B (en) 2020-12-15 2020-12-15 Method and device for equipment interaction

Country Status (1)

Country Link
CN (1) CN112578909B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5732632B2 (en) * 2011-02-03 2015-06-10 株式会社国際電気通信基礎技術研究所 Robot system and space formation recognition device used therefor
CN106105184A (en) * 2014-03-10 2016-11-09 微软技术许可有限责任公司 Time delay in camera optical projection system reduces
WO2017133453A1 (en) * 2016-02-02 2017-08-10 北京进化者机器人科技有限公司 Method and system for tracking moving body
US20180018965A1 (en) * 2016-07-12 2018-01-18 Bose Corporation Combining Gesture and Voice User Interfaces
CN111052025A (en) * 2017-09-13 2020-04-21 日本电产株式会社 Mobile robot system
CN109982054A (en) * 2017-12-27 2019-07-05 广景视睿科技(深圳)有限公司 A kind of projecting method based on location tracking, device, projector and optical projection system
CN108406848A (en) * 2018-03-14 2018-08-17 安徽果力智能科技有限公司 A kind of intelligent robot and its motion control method based on scene analysis
WO2020014766A1 (en) * 2018-07-18 2020-01-23 Laganiere Robert System and method for tracking customer movements in a customer service environment
CN110916576A (en) * 2018-12-13 2020-03-27 成都家有为力机器人技术有限公司 Cleaning method based on voice and image recognition instruction and cleaning robot

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
宋国荣; 刘兴奇; 王洋; 吴斌; 何存富: "Design and implementation of a visualized robot remote monitoring system" (视觉化机器人远程监控***的设计与实现), Computer Measurement & Control (计算机测量与控制), no. 10, 25 October 2013 (2013-10-25) *

Also Published As

Publication number Publication date
CN112578909B (en) 2024-05-31

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant