CN111428672A - Interactive object driving method, device, equipment and storage medium - Google Patents

Interactive object driving method, device, equipment and storage medium

Info

Publication number
CN111428672A
CN111428672A (application number CN202010247255.3A)
Authority
CN
China
Prior art keywords
image
target object
key point
distance
images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010247255.3A
Other languages
Chinese (zh)
Inventor
陈智辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN202010247255.3A priority Critical patent/CN111428672A/en
Publication of CN111428672A publication Critical patent/CN111428672A/en
Priority to PCT/CN2020/129855 priority patent/WO2021196648A1/en
Priority to JP2021549762A priority patent/JP2022531055A/en
Priority to KR1020217027719A priority patent/KR20210124313A/en
Priority to SG11202109202VA priority patent/SG11202109202VA/en
Priority to TW109145611A priority patent/TW202139064A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • G06V40/176Dynamic expression

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A method, an apparatus, a device, and a storage medium for driving an interactive object are disclosed. The method includes: acquiring a first image; identifying, in the first image, a face region image containing at least the mouth of a target object, and determining mouth key point information in the face region image; determining, according to the mouth key point information, whether the target object is in a speaking state; and driving the interactive object to respond in response to the target object being in the speaking state.

Description

Interactive object driving method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for driving an interactive object.
Background
Human-computer interaction is mostly based on key presses, touch, and voice input, with responses presented as images, text, or virtual characters on a display screen. At present, virtual characters are largely improvements on voice assistants, and the interaction between the user and the virtual character remains superficial.
Disclosure of Invention
The embodiment of the disclosure provides a driving scheme for an interactive object.
According to an aspect of the present disclosure, a method for driving an interactive object is provided, the method including: acquiring a first image; identifying, in the first image, a face region image containing at least the mouth of a target object, and determining mouth key point information in the face region image; determining, according to the mouth key point information, whether the target object is in a speaking state; and driving the interactive object to respond in response to the target object being in the speaking state.
In combination with any one of the embodiments provided by the present disclosure, the mouth key point information includes position information of a plurality of key points of the mouth; the plurality of key points include at least one key point pair, and a key point pair includes two key points located at the upper lip and the lower lip, respectively; the determining whether the target object is in a speaking state according to the mouth key point information includes: determining, according to the position information of the at least one key point pair, a first distance between the two key points of the key point pair located at the upper lip and the lower lip, respectively; and determining whether the target object is in a speaking state according to the first distance.
In combination with any embodiment provided by the present disclosure, the first image is a frame in an image sequence; the determining whether the target object is in a speaking state according to the first distance includes: in the image sequence, acquiring a set number of images to be processed, wherein the images to be processed comprise the first image and at least one frame of second image except the first image; acquiring a first distance of a key point pair in the second image; and determining whether the target object is in a speaking state according to the first distance of the key point pair in the first image and the first distance of the key point pair in the second image.
In combination with any embodiment provided by the present disclosure, the acquiring, in the image sequence, a set number of images to be processed includes: and performing window sliding in the image sequence by using a window with a set length and a set step length, and acquiring a set number of images to be processed by sliding each time, wherein the first image is the last frame image in the window.
In combination with any one of the embodiments provided in this disclosure, the determining whether the target object is in a speaking state according to the first distance of the key point pair in the first image and the first distance of the key point pair in the second image includes: determining an image in which the average of the Euclidean distances of the key point pairs is greater than a first set threshold as a target image, or determining an image in which the weighted average of the Euclidean distances of the key point pairs is greater than a second set threshold as a target image; determining the number of target images included in the images to be processed; and determining that the target object is in a speaking state in response to the ratio of the number of target images to the set number of images to be processed being greater than a set ratio.
In combination with any one of the embodiments provided by the present disclosure, the first set threshold and the second set threshold are determined according to a resolution of the image to be processed.
In combination with any one of the embodiments provided by the present disclosure, the driving the interactive object to respond in response to the target object being in a speaking state includes: in response to first determining, while the interactive object is in a standby state, that the target object in the first image is in a speaking state, driving the interactive object to enter a state of talking with the target object.
According to an aspect of the present disclosure, an apparatus for driving an interactive object is provided, the apparatus including: an acquisition unit configured to acquire a first image; an identifying unit configured to identify, in the first image, a face region image containing at least the mouth of a target object, and determine mouth key point information in the face region image; a determining unit configured to determine whether the target object is in a speaking state according to the mouth key point information; and a driving unit configured to drive the interactive object to respond in response to the target object being in a speaking state.
In combination with any one of the embodiments provided by the present disclosure, the mouth key point information includes position information of a plurality of key points of the mouth; the plurality of key points include at least one key point pair, and a key point pair includes two key points located at the upper lip and the lower lip, respectively; the determining unit is specifically configured to: determine, according to the position information of the at least one key point pair, a first distance between the two key points of the key point pair located at the upper lip and the lower lip, respectively; and determine whether the target object is in a speaking state according to the first distance.
In combination with any embodiment provided by the present disclosure, the first image is a frame in an image sequence; the determining unit, when configured to determine whether the target object is in a speaking state according to the first distance, is specifically configured to: in the image sequence, acquiring a set number of images to be processed, wherein the images to be processed comprise the first image and at least one frame of second image except the first image; acquiring a first distance of a key point pair in the second image; and determining whether the target object is in a speaking state according to the first distance of the key point pair in the first image and the first distance of the key point pair in the second image.
In combination with any embodiment provided by the present disclosure, when the determining unit is configured to obtain a set number of images to be processed in the image sequence, the determining unit is specifically configured to: and performing window sliding in the image sequence by using a window with a set length and a set step length, and acquiring a set number of images to be processed by sliding each time, wherein the first image is the last frame image in the window.
In combination with any one of the embodiments provided in the present disclosure, the first distance of the key point pair includes a Euclidean distance between the two key points, and the determining unit, when determining whether the target object is in a speaking state according to the first distance of the key point pair in the first image and the first distance of the key point pair in the second image, is specifically configured to: determine an image in which the average of the Euclidean distances of the key point pairs is greater than a first set threshold as a target image, or determine an image in which the weighted average of the Euclidean distances of the key point pairs is greater than a second set threshold as a target image; determine the number of target images included in the images to be processed; and determine that the target object is in a speaking state in response to the ratio of the number of target images to the set number of images to be processed being greater than a set ratio.
In combination with any one of the embodiments provided by the present disclosure, the first set threshold and the second set threshold are determined according to a resolution of the image to be processed.
In combination with any one of the embodiments provided by the present disclosure, the driving unit is specifically configured to: in response to first determining, while the interactive object is in a standby state, that the target object in the first image is in a speaking state, drive the interactive object to enter a state of talking with the target object.
The method, apparatus, device, and computer-readable storage medium for driving an interactive object in one or more embodiments of the present disclosure identify a first image to obtain a face region image containing at least the mouth of a target object in the first image, determine mouth key point information in the face region image, and determine, according to the mouth key point information, whether the target object is in a speaking state so as to drive the interactive object to respond. In this way, whether the target object is speaking is determined in real time from the first image, and even when the target object does not perform touch interaction with the terminal device displaying the interactive object, the interactive object can respond to the target object's speech in time and enter an interactive state, thereby improving the interactive experience of the target object.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present specification and together with the description, serve to explain the principles of the specification.
Fig. 1 is a schematic diagram of a display device in a driving method of an interactive object according to at least one embodiment of the present disclosure;
fig. 2 is a flowchart of a driving method of an interactive object according to at least one embodiment of the present disclosure;
fig. 3 is a schematic diagram of key points of a mouth in a driving method of an interactive object according to at least one embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of a driving apparatus for an interactive object according to at least one embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of an electronic device according to at least one embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may mean: A exists alone, both A and B exist, or B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality; for example, "including at least one of A, B, and C" may mean including any one or more elements selected from the set consisting of A, B, and C.
At least one embodiment of the present disclosure provides a method for driving an interactive object. The method may be performed by an electronic device such as a terminal device or a server, where the terminal device may be a fixed terminal or a mobile terminal, such as a mobile phone, a tablet computer, a game console, a desktop computer, an advertisement machine, a kiosk, or a vehicle-mounted terminal, and the server includes a local server or a cloud server. The method may also be implemented by a processor invoking computer-readable instructions stored in a memory.
In the embodiments of the present disclosure, the interactive object may be any virtual image capable of interacting with a target object, such as a virtual character, a virtual animal, a virtual article, or a cartoon figure; the presentation form of the virtual image may be 2D or 3D, which is not limited in the present disclosure. The target object may be a user, a robot, or another intelligent device. The interaction between the interactive object and the target object may be active or passive. In one example, the target object may express a demand by making a gesture or a body movement, triggering the interactive object to interact with it in an active interaction manner. In another example, the interactive object may actively greet the target object or prompt the target object to make an action, so that the target object interacts with the interactive object in a passive manner.
The interactive object may be displayed through a terminal device, and the terminal device may be a television, an all-in-one machine with a display function, a projector, a Virtual Reality (VR) device, an Augmented Reality (AR) device, or the like.
Fig. 1 illustrates a display device proposed by at least one embodiment of the present disclosure. As shown in Fig. 1, the display device has a transparent display screen, on which a stereoscopic picture can be displayed to present a virtual scene with a stereoscopic effect and an interactive object. For example, the interactive object displayed on the transparent display screen in Fig. 1 is a virtual cartoon character. In some embodiments, the terminal device described in the present disclosure may also be the above display device with a transparent display screen, where the display device is configured with a memory and a processor, the memory is used to store computer instructions executable on the processor, and the processor is used to implement the method for driving an interactive object provided in the present disclosure when executing the computer instructions, so as to drive the interactive object displayed on the transparent display screen to respond to the target object.
In some embodiments, in response to the terminal device receiving sound driving data for driving the interactive object to output speech, the interactive object may emit a specified voice to the target object. The sound driving data may be generated according to the actions, expressions, identity, preferences, and the like of the target object around the terminal device, so that the interactive object is driven to respond by emitting the specified voice, thereby providing an anthropomorphic service for the target object. During the interaction between the interactive object and the target object, driving the interactive object to emit the specified voice according to the sound driving data may fail to drive the interactive object to make facial actions synchronized with that voice, so that the interactive object appears stiff and unnatural when speaking, which affects the target object's interactive experience. Based on this, at least one embodiment of the present disclosure provides a method for driving an interactive object, so as to improve the experience of interaction between the target object and the interactive object.
Fig. 2 shows a flowchart of a driving method of an interactive object according to at least one embodiment of the present disclosure, and as shown in fig. 2, the method includes steps 201 to 204.
In step 201, a first image is acquired.
The first image may be an image of the surroundings of the terminal device (e.g., a display device) presenting the interactive object. The surrounding space of the terminal device includes any direction within a certain range of the terminal device, for example one or more of the forward, lateral, rear, and upward directions of the terminal device. Illustratively, the range is determined according to the range within which a sound detection module for detecting audio signals can receive an audio signal of a set intensity. The sound detection module may be arranged in the terminal device as a built-in module, or may be an external device independent of the terminal device. The first image may also be an image captured by an image acquisition device and obtained through a network. The image acquisition device may be a camera built into the terminal device or a camera independent of the terminal device, and there may be one or more image acquisition devices. For example, the target object (e.g., the user) may use the terminal device to perform a certain operation, for example using a client on the terminal device to perform a service related to interaction with the interactive object. The first image may be collected by a camera of the terminal device or an external camera and uploaded to the server via a network, and the server may parse the image and determine, based on the parsing result, whether to control the interactive object to respond.
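As a purely illustrative sketch of this step (not part of the disclosure), the first image could be grabbed from a camera near the terminal device; the snippet below assumes OpenCV and a local webcam, though a network camera or an uploaded frame would be handled the same way once decoded into an image array.

```python
# Minimal sketch of step 201 (assumption: OpenCV and a local webcam stand in
# for the "image acquisition equipment" of this disclosure).
import cv2

def acquire_first_image(camera_index: int = 0):
    """Grab one frame from the surroundings of the terminal device."""
    cap = cv2.VideoCapture(camera_index)
    try:
        ok, frame = cap.read()   # frame is a BGR array, or None on failure
        if not ok:
            raise RuntimeError("camera did not return a frame")
        return frame
    finally:
        cap.release()
```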
In step 202, a face region image containing at least the mouth of a target object is identified in the first image, and mouth key point information in the face region image is determined.
In one example, the face region image containing the mouth of the target object in the first image may be cropped out as an independent image, face key point detection may be performed on the face region image to determine the mouth key points in the face region image, and the mouth key point information, such as position information, may thus be obtained.
In one example, face key point detection may be performed directly on the face region image containing the mouth of the target object in the first image, and the mouth key point information contained in the first image may be determined.
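For illustration only, mouth key point detection could be sketched as follows. The disclosure does not name a landmark model; Fig. 3 suggests a 106-point scheme (key points 98 and 102), whereas the sketch below assumes the widely available dlib 68-point predictor, in which indices 48-67 cover the mouth.

```python
# Hedged sketch of step 202: detect the face region and extract mouth key points.
# Assumption: dlib's 68-point landmark model ("shape_predictor_68_face_landmarks.dat"),
# not the 106-point scheme implied by Fig. 3 of this disclosure.
import dlib

FACE_DETECTOR = dlib.get_frontal_face_detector()
LANDMARKS = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")
MOUTH_IDX = range(48, 68)   # mouth key points in the 68-point scheme

def mouth_keypoints(gray_image):
    """Return {index: (x, y)} of mouth key points for the first detected face, or None.

    gray_image: a grayscale numpy array, e.g. cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).
    """
    faces = FACE_DETECTOR(gray_image, 1)
    if not faces:
        return None          # no target object found in the first image
    shape = LANDMARKS(gray_image, faces[0])
    return {i: (shape.part(i).x, shape.part(i).y) for i in MOUTH_IDX}
```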
In step 203, it is determined whether the target object is in a speaking state according to the mouth key point information.
The detected position information of the mouth key points differs between the open state and the closed state of the target object's mouth. For example, when the mouth is open, the distance between a key point on the upper lip and a key point on the lower lip is usually larger than a certain value, whereas when the mouth is closed, that distance is generally small. The distance threshold used to judge whether the mouth is open or closed is related to where on the mouth the selected upper-lip and lower-lip key points are located. For example, the threshold for the distance between a key point at the center of the upper lip and a key point at the center of the lower lip is typically greater than the threshold for the distance between a key point at the edge of the upper lip and a key point at the edge of the lower lip.
When the mouth of the target object is detected to be in an open state in all the first images within the set time, the target object can be determined to be in a speaking state. On the contrary, if the mouth of the target object is always in a closed state within a set time, it can be determined that the target object does not speak.
In step 204, in response to the target object being in the speaking state, the interactive object is driven to respond.
Because the target object may not perform touch interaction with the terminal device displaying the interactive object, when there are many target objects around the terminal device or the image acquisition device, or many audio signals are received, it may not be possible to judge in time that a target object that starts speaking or issues a voice instruction intends to interact with the interactive object. By detecting whether a target object around the terminal device or the image acquisition device is in a speaking state, the interactive object can be driven to respond to the target object once the target object is determined to be speaking, for example by adopting a listening posture toward the target object, or by making a specific response; for example, if the target object is a woman, the interactive object may be driven to say "Madam, how may I help you?".
In the embodiments of the present disclosure, the first image is identified to obtain a face region image containing at least the mouth of the target object in the first image, mouth key point information in the face region image is determined, and whether the target object is in a speaking state is determined according to the mouth key point information so as to drive the interactive object to respond. In this way, whether the target object is speaking is determined in real time from the first image, and even when the target object does not perform touch interaction with the terminal device displaying the interactive object, the interactive object can respond to the target object's speech in time and enter an interaction state, thereby improving the interactive experience of the target object.
In the embodiment of the present disclosure, the mouth keypoint information includes position information of a plurality of keypoints of the mouth; the plurality of keypoints includes at least one keypoint pair including at least two keypoints located at the upper lip and the lower lip, respectively.
Fig. 3 illustrates a schematic diagram of a key point of a mouth in a driving method of an interactive object according to at least one embodiment of the present disclosure. Among the key points of the mouth shown in fig. 3, at least one key point pair, such as key point pair (98, 102), may be obtained, where key point 98 is located at the middle of the upper lip and key point 102 is located at the middle of the lower lip.
From the position information of at least one keypoint pair of the mouth, a first distance of two keypoints of said keypoint pair, located at the upper lip and at the lower lip, respectively, may be determined. For example, for a keypoint pair (98, 102), with knowledge of the location information of keypoint 98 and keypoint 102, then a first distance of keypoint 98 and keypoint 102 may be determined.
Whether the target object is in a speaking state can be determined according to the first distance.
The first distance between key point 98 and key point 102 differs between the open and closed states of the mouth. When the first distance between key point 98 and key point 102 is greater than a set threshold, it may be determined that the mouth of the target object is in an open state in the first image; conversely, if the first distance between key point 98 and key point 102 is less than or equal to the set threshold, it may be determined that the mouth of the target object is in a closed state. Based on the closed or open state of the mouth, it can then be determined whether the target object is in a speaking state, that is, whether the target object is currently speaking.
Those skilled in the art will appreciate that the key point pair is not limited to (98, 102); any pair with one key point in the upper lip region and the other in the lower lip region may be used. When a plurality of key point pairs are selected, the average distance between the upper-lip and lower-lip key points in the first image may be determined as the average or weighted average of the first distances corresponding to the plurality of key point pairs. The set threshold for judging whether the mouth is closed or open is determined according to the positions of the selected key point pairs.
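A minimal sketch of this first-distance test follows. It assumes the dlib indices from the earlier sketch, with pairs (51, 57) and (62, 66) standing in for the pair (98, 102) of Fig. 3, and a tunable threshold whose value depends on the chosen pairs and the image resolution; none of these values are prescribed by the disclosure.

```python
# Sketch: mean Euclidean "first distance" over the selected key point pairs,
# compared against a set threshold to decide whether the mouth is open or closed.
# KEYPOINT_PAIRS and OPEN_THRESHOLD are illustrative assumptions only.
import numpy as np

KEYPOINT_PAIRS = [(51, 57), (62, 66)]   # (upper-lip key point, lower-lip key point)
OPEN_THRESHOLD = 9.0                    # pixels; depends on pair choice and resolution

def mouth_is_open(keypoints, pairs=KEYPOINT_PAIRS, threshold=OPEN_THRESHOLD):
    """True if the mean first distance of the key point pairs exceeds the threshold."""
    distances = [np.linalg.norm(np.subtract(keypoints[up], keypoints[low]))
                 for up, low in pairs]
    return float(np.mean(distances)) > threshold
```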
In an embodiment of the present disclosure, the first image is a frame in an image sequence. The image sequence may be a video stream acquired by an image acquisition device, or a plurality of frames of images captured at a set frequency. In the case that the first image is one frame in an image sequence, it may be determined whether the target object is in a speaking state according to a first distance of the keypoint pair in each image to be processed by acquiring a set number of images to be processed in the image sequence. Wherein the image to be processed comprises the first image and a second image except the first image. And for a second image, acquiring a first distance of a key point pair in the second image, and determining whether the target object is in a speaking state according to the first distance of the key point pair in the first image and the first distance of the key point pair in the second image.
The second images in the images to be processed may be, for example, two consecutive frames adjacent to the first image, or two frames at a set interval from the first image. For example, assuming that the first image is the Nth frame in the image sequence, the two second images may be the (N-1)th and (N-2)th frames, or the (N-2)th and (N-4)th frames, and so on.
In this embodiment, whether the mouth of the target object is in an open state or a closed state in each of the set number of images to be processed can be determined according to the first distances of the key point pairs in the first image and the second images, so as to determine whether the target object is in a speaking state.
In some embodiments, a window of a set length may be slid over the image sequence with a set step size, a set number of images to be processed is obtained at each slide, and the first image is the last frame in the window.
The length of the window is related to the number of images to be processed it contains: the longer the window, the more images to be processed it contains. The step size of the sliding window is related to the time interval (frequency) at which the images to be processed are acquired, that is, the interval at which the speaking state of the target object is judged. The length and step size of the window can be set according to the actual interaction scenario. For example, with a window of length 10 and a step size of 2, the window contains 10 images to be processed and moves forward by 2 images at a time over the image sequence.
In the embodiments of the present disclosure, whether the target object is in a speaking state in the first image can be determined from the mouth state of the target object in the first image and in the second images preceding it. Moreover, by means of the sliding window, a corresponding set number of images to be processed is obtained at each slide, so that the speaking state of the target object can be determined for each first image in the image sequence.
In an embodiment of the present disclosure, the first distance comprises a euclidean distance between two keypoints. For a three-dimensional face image, the Euclidean distance between two key points can more accurately measure the distance and the position relation between the two key points.
In some embodiments, it may be determined whether the target object is in a speaking state according to the first distance of the keypoint pair in the first image and the first distance of the keypoint pair in the second image.
First, an image in which the mean of the Euclidean distances of the key point pairs is greater than a first set threshold is determined as a target image, or an image in which the weighted mean of the Euclidean distances of the key point pairs is greater than a second set threshold is determined as a target image. That is, an image to be processed in which the mouth of the target object is in an open state is determined as a target image.
Thereafter, the number of target images included in the images to be processed is determined, that is, the number of images to be processed in which the mouth is in an open state.
Then, whether the target object is in a speaking state is determined according to the ratio of the number of target images to the set number of images to be processed.
In response to the ratio being greater than a set ratio, it is determined that the target object is in a speaking state; conversely, in response to the ratio being less than or equal to the set ratio, it is determined that the target object is not currently speaking.
In some embodiments, different Euclidean distance thresholds may be set for images to be processed of different resolutions. That is, the first set threshold and the second set threshold may be determined according to the resolution of the image to be processed.
In one example, where the resolution of the images to be processed is 720 × 1080, the Euclidean distance threshold may be set to 9. The window length may be set to 10, i.e., the window contains 10 images to be processed, and the window moves with a step size of 1. With a set ratio of 0.4, when the window slides to the current image frame, if more than 4 of the 10 images to be processed show the mouth in an open state, the target object is determined to be in a speaking state.
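Putting the pieces together with the example parameters from this paragraph (window length 10, set ratio 0.4), the decision could be sketched as below; mouth_keypoints and mouth_is_open are the illustrative helpers from the earlier sketches, not components defined by the disclosure.

```python
# Sketch: a frame is a "target image" when the mouth is open; the target object
# is judged to be speaking when the share of target images in the window
# exceeds the set ratio (e.g. more than 4 of 10 frames for a ratio of 0.4).
def is_speaking(window_frames, keypoints_fn, open_fn, set_ratio=0.4):
    target_count = 0
    for frame in window_frames:
        kps = keypoints_fn(frame)        # e.g. mouth_keypoints on a grayscale frame
        if kps is not None and open_fn(kps):
            target_count += 1            # this frame counts as a target image
    return target_count / len(window_frames) > set_ratio
```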
When the interactive object is in a standby state, that is, when the interactive object is not interacting with the target object, the interactive object may be driven to enter a state of talking with the target object in response to first determining that the target object in the first image is in a speaking state.
In this way, even when the target object does not perform touch interaction with the terminal device displaying the interactive object, the interactive object can respond in time to the target object being in a speaking state and enter the interactive state, thereby improving the interactive experience of the target object.
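Finally, the standby-to-conversation transition could be glued together as follows; the avatar interface used here (state, enter_dialog_state) is purely hypothetical and only illustrates the driving step.

```python
# Illustrative glue only: drive the interactive object into a talking state the
# first time the target object is detected speaking while the object is in standby.
# The InteractiveObject API (state, enter_dialog_state) is a hypothetical stand-in.
def drive_interactive_object(interactive_object, target_is_speaking: bool) -> None:
    if interactive_object.state == "standby" and target_is_speaking:
        interactive_object.enter_dialog_state()   # e.g. face the user and greet them
```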
Fig. 4 illustrates a schematic structural diagram of an apparatus for driving an interactive object according to at least one embodiment of the present disclosure. As shown in Fig. 4, the apparatus may include: an acquisition unit 401 configured to acquire a first image; an identifying unit 402 configured to identify, in the first image, a face region image containing at least the mouth of a target object, and determine mouth key point information in the face region image; a determining unit 403 configured to determine whether the target object is in a speaking state according to the mouth key point information; and a driving unit 404 configured to drive the interactive object to respond in response to the target object being in a speaking state.
In some embodiments, the mouth key point information includes position information of a plurality of key points of the mouth; the plurality of key points include at least one key point pair, and a key point pair includes two key points located at the upper lip and the lower lip, respectively; the determining unit is specifically configured to: determine, according to the position information of the at least one key point pair, a first distance between the two key points of the key point pair located at the upper lip and the lower lip, respectively; and determine whether the target object is in a speaking state according to the first distance.
In some embodiments, the first image is a frame in a sequence of images; the determining unit, when configured to determine whether the target object is in a speaking state according to the first distance, is specifically configured to: in the image sequence, acquiring a set number of images to be processed, wherein the images to be processed comprise the first image and at least one frame of second image except the first image; acquiring a first distance of a key point pair in the second image; and determining whether the target object is in a speaking state according to the first distance of the key point pair in the first image and the first distance of the key point pair in the second image.
In some embodiments, when the determining unit is configured to obtain a set number of images to be processed in the image sequence, the determining unit is specifically configured to: and performing window sliding in the image sequence by using a window with a set length and a set step length, and acquiring a set number of images to be processed by sliding each time, wherein the first image is the last frame image in the window.
In some embodiments, the first distance of the key point pair includes a Euclidean distance between the two key points, and the determining unit, when determining whether the target object is in a speaking state according to the first distance of the key point pair in the first image and the first distance of the key point pair in the second image, is specifically configured to: determine an image in which the average of the Euclidean distances of the key point pairs is greater than a first set threshold as a target image, or determine an image in which the weighted average of the Euclidean distances of the key point pairs is greater than a second set threshold as a target image; determine the number of target images included in the images to be processed; and determine that the target object is in a speaking state in response to the ratio of the number of target images to the set number of images to be processed being greater than a set ratio.
In some embodiments, the first set threshold and the second set threshold are determined according to a resolution of the image to be processed.
In some embodiments, the driving unit is specifically configured to: in response to first determining, while the interactive object is in a standby state, that the target object in the first image is in a speaking state, drive the interactive object to enter a state of talking with the target object.
At least one embodiment of the present specification further provides an electronic device, as shown in fig. 5, where the device includes a memory for storing computer instructions executable on a processor, and the processor is configured to implement the driving method of the interactive object according to any embodiment of the present disclosure when executing the computer instructions.
In some embodiments, the device is, for example, a server or a terminal device, which determines the speaking state of the target object according to the mouth key point information in the first image, so as to control the interactive object presented by the display device. In the case where the terminal device is a display device, the display device further includes a display screen or a transparent display screen for displaying an animation of the interactive object.
At least one embodiment of the present specification also provides a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the driving method of the interactive object according to any one of the embodiments of the present disclosure.
As will be appreciated by one skilled in the art, one or more embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the data processing apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to part of the description of the method embodiment.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the acts or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this specification and their structural equivalents, or a combination of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by the data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Computers suitable for executing computer programs include, for example, general and/or special purpose microprocessors, or any other type of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory and/or a random access memory. The basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer does not necessarily have such a device. Moreover, a computer may be embedded in another device, e.g., a mobile telephone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., an internal hard disk or a removable disk), magneto-optical disks, and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. In other instances, features described in connection with one embodiment may be implemented as discrete components or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Further, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The above description is only for the purpose of illustrating the preferred embodiments of the one or more embodiments of the present disclosure, and is not intended to limit the scope of the one or more embodiments of the present disclosure, and any modifications, equivalent substitutions, improvements, etc. made within the spirit and principle of the one or more embodiments of the present disclosure should be included in the scope of the one or more embodiments of the present disclosure.

Claims (16)

1. A method of driving an interactive object, the method comprising:
acquiring a first image;
identifying, in the first image, a face region image containing at least the mouth of a target object, and determining mouth key point information in the face region image;
determining whether the target object is in a speaking state according to the mouth key point information;
and driving the interactive object to respond in response to the target object being in the speaking state.
2. The method according to claim 1, wherein the mouth keypoint information comprises position information of a plurality of keypoints of the mouth; the plurality of key points comprise at least one key point pair, and the key point pair comprises at least two key points which are respectively positioned at the upper lip and the lower lip;
the determining whether the target object is in a speaking state according to the mouth key point information includes:
determining first distances of two key points of the key point pairs, which are respectively positioned at the upper lip and the lower lip, according to the position information of the at least one key point pair;
and determining whether the target object is in a speaking state according to the first distance.
3. The method of claim 2, wherein the first image is a frame in a sequence of images;
the determining whether the target object is in a speaking state according to the first distance includes:
in the image sequence, acquiring a set number of images to be processed, wherein the images to be processed comprise the first image and at least one frame of second image except the first image;
acquiring a first distance of a key point pair in the second image;
and determining whether the target object is in a speaking state according to the first distance of the key point pair in the first image and the first distance of the key point pair in the second image.
4. The method according to claim 3, wherein the obtaining a set number of images to be processed in the image sequence comprises:
and performing window sliding in the image sequence by using a window with a set length and a set step length, and acquiring a set number of images to be processed by sliding each time, wherein the first image is the last frame image in the window.
5. The method according to claim 3 or 4, wherein the first distance of the keypoint pair comprises a Euclidean distance between two keypoints, and the determining whether the target object is in a speaking state according to the first distance of the keypoint pair in the first image and the first distance of the keypoint pair in the second image comprises:
determining an image of which the average value of the Euclidean distances of each key point pair is greater than a first set threshold value as a target image, or determining an image of which the weighted average value of the Euclidean distances of each key point pair is greater than a second set threshold value as a target image;
determining the number of target images included in the image to be processed;
and determining that the target object is in a speaking state in response to the ratio of the number of target images to the set number of images to be processed being greater than a set ratio.
6. The method according to claim 5, wherein the first set threshold and the second set threshold are determined according to a resolution of the image to be processed.
7. The method of any one of claims 1 to 6, wherein the driving the interactive object to respond in response to the target object being in a speaking state comprises:
and in response to the first determination that the target object in the first image is in a speaking state when the interactive object is in a standby state, driving the interactive object into a state of talking with the target object.
8. An apparatus for driving an interactive object, the apparatus comprising:
an acquisition unit configured to acquire a first image;
an identifying unit configured to identify, in the first image, a face region image containing at least the mouth of a target object, and determine mouth key point information in the face region image;
the determining unit is used for determining whether the target object is in a speaking state or not according to the key point information of the mouth;
and the driving unit is used for responding to the fact that the target object is in a speaking state and driving the interactive object to respond.
9. The apparatus according to claim 8, wherein the mouth keypoint information comprises position information of a plurality of keypoints of the mouth; the plurality of key points comprise at least one key point pair, and the key point pair comprises at least two key points which are respectively positioned at the upper lip and the lower lip;
the determining unit is specifically configured to:
determining first distances of two key points of the key point pairs, which are respectively positioned at the upper lip and the lower lip, according to the position information of the at least one key point pair;
and determining whether the target object is in a speaking state according to the first distance.
10. The apparatus of claim 9, wherein the first image is a frame in a sequence of images;
the determining unit, when configured to determine whether the target object is in a speaking state according to the first distance, is specifically configured to:
in the image sequence, acquiring a set number of images to be processed, wherein the images to be processed comprise the first image and at least one frame of second image except the first image;
acquiring a first distance of a key point pair in the second image;
and determining whether the target object is in a speaking state according to the first distance of the key point pair in the first image and the first distance of the key point pair in the second image.
11. The apparatus according to claim 10, wherein the determining unit, when configured to obtain a set number of images to be processed in the sequence of images, is specifically configured to:
and performing window sliding in the image sequence by using a window with a set length and a set step length, and acquiring a set number of images to be processed by sliding each time, wherein the first image is the last frame image in the window.
12. The apparatus according to claim 10 or 11, wherein the first distance of the keypoint pair comprises a euclidean distance between two keypoints, and the determining unit, when determining whether the target object is in the speaking state according to the first distance of the keypoint pair in the first image and the first distance of the keypoint pair in the second image, is specifically configured to:
determining an image of which the average value of the Euclidean distances of each key point pair is greater than a first set threshold value as a target image, or determining an image of which the weighted average value of the Euclidean distances of each key point pair is greater than a second set threshold value as a target image;
determining the number of target images included in the image to be processed;
and determining that the target object is in a speaking state in response to the ratio of the number of target images to the set number of images to be processed being greater than a set ratio.
13. The apparatus according to claim 12, wherein the first set threshold and the second set threshold are determined according to a resolution of the image to be processed.
14. The device according to any one of claims 8 to 13, characterized in that the drive unit is particularly adapted to:
and in response to the first determination that the target object in the first image is in a speaking state when the interactive object is in a standby state, driving the interactive object into a state of talking with the target object.
15. An electronic device, comprising a memory for storing computer instructions executable on a processor, the processor being configured to implement the method of any one of claims 1 to 7 when executing the computer instructions.
16. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1 to 7.
CN202010247255.3A 2020-03-31 2020-03-31 Interactive object driving method, device, equipment and storage medium Pending CN111428672A (en)

Priority Applications (6)

Application Number Priority Date Filing Date Title
CN202010247255.3A CN111428672A (en) 2020-03-31 2020-03-31 Interactive object driving method, device, equipment and storage medium
PCT/CN2020/129855 WO2021196648A1 (en) 2020-03-31 2020-11-18 Method and apparatus for driving interactive object, device and storage medium
JP2021549762A JP2022531055A (en) 2020-03-31 2020-11-18 Interactive target drive methods, devices, devices, and recording media
KR1020217027719A KR20210124313A (en) 2020-03-31 2020-11-18 Interactive object driving method, apparatus, device and recording medium
SG11202109202VA SG11202109202VA (en) 2020-03-31 2020-11-18 Methods, apparatuses, electronic devices and storage media for driving an interactive object
TW109145611A TW202139064A (en) 2020-03-31 2020-12-23 Methods and apparatuses for driving interaction object, devices and storage media

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010247255.3A CN111428672A (en) 2020-03-31 2020-03-31 Interactive object driving method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111428672A true CN111428672A (en) 2020-07-17

Family

ID=71550226

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010247255.3A Pending CN111428672A (en) 2020-03-31 2020-03-31 Interactive object driving method, device, equipment and storage medium

Country Status (6)

Country Link
JP (1) JP2022531055A (en)
KR (1) KR20210124313A (en)
CN (1) CN111428672A (en)
SG (1) SG11202109202VA (en)
TW (1) TW202139064A (en)
WO (1) WO2021196648A1 (en)

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918975B (en) * 2017-12-13 2022-10-21 腾讯科技(深圳)有限公司 Augmented reality processing method, object identification method and terminal
CN108492350A (en) * 2018-04-02 2018-09-04 吉林动画学院 Role's mouth shape cartoon production method based on lip-reading
CN108646920A (en) * 2018-05-16 2018-10-12 Oppo广东移动通信有限公司 Identify exchange method, device, storage medium and terminal device
CN109241907A (en) * 2018-09-03 2019-01-18 北京旷视科技有限公司 Mask method, device and electronic equipment
CN109377539B (en) * 2018-11-06 2023-04-11 北京百度网讯科技有限公司 Method and apparatus for generating animation
CN109977811A (en) * 2019-03-12 2019-07-05 四川长虹电器股份有限公司 The system and method for exempting from voice wake-up is realized based on the detection of mouth key position feature
CN110620884B (en) * 2019-09-19 2022-04-22 平安科技(深圳)有限公司 Expression-driven-based virtual video synthesis method and device and storage medium
CN110647865B (en) * 2019-09-30 2023-08-08 腾讯科技(深圳)有限公司 Face gesture recognition method, device, equipment and storage medium
CN110826441B (en) * 2019-10-25 2022-10-28 深圳追一科技有限公司 Interaction method, interaction device, terminal equipment and storage medium
CN111428672A (en) * 2020-03-31 2020-07-17 北京市商汤科技开发有限公司 Interactive object driving method, device, equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106709400A (en) * 2015-11-12 2017-05-24 阿里巴巴集团控股有限公司 Sense organ opening and closing state recognition method, sense organ opening and closing state recognition device and client
US20170244891A1 (en) * 2016-02-24 2017-08-24 Beijing Xiaomi Mobile Software Co., Ltd. Method for automatically capturing photograph, electronic device and medium
CN110309799A (en) * 2019-07-05 2019-10-08 四川长虹电器股份有限公司 Judgment method of speaking based on camera
CN110750152A (en) * 2019-09-11 2020-02-04 云知声智能科技股份有限公司 Human-computer interaction method and system based on lip action

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021196648A1 (en) * 2020-03-31 2021-10-07 北京市商汤科技开发有限公司 Method and apparatus for driving interactive object, device and storage medium
CN113018858A (en) * 2021-04-12 2021-06-25 腾讯科技(深圳)有限公司 Virtual role detection method, computer equipment and readable storage medium
CN113139491A (en) * 2021-04-30 2021-07-20 厦门盈趣科技股份有限公司 Video conference control method, system, mobile terminal and storage medium
CN113822205A (en) * 2021-09-26 2021-12-21 北京市商汤科技开发有限公司 Conference record generation method and device, electronic equipment and storage medium
WO2024001539A1 (en) * 2022-06-30 2024-01-04 上海商汤智能科技有限公司 Speaking state recognition method and apparatus, model training method and apparatus, vehicle, medium, computer program and computer program product

Also Published As

Publication number Publication date
SG11202109202VA (en) 2021-11-29
TW202139064A (en) 2021-10-16
WO2021196648A1 (en) 2021-10-07
KR20210124313A (en) 2021-10-14
JP2022531055A (en) 2022-07-06

Similar Documents

Publication Publication Date Title
CN111428672A (en) Interactive object driving method, device, equipment and storage medium
CN106651955B (en) Method and device for positioning target object in picture
WO2019216419A1 (en) Program, recording medium, augmented reality presentation device, and augmented reality presentation method
TW202105331A (en) Human body key point detection method and device, electronic device and storage medium
EP3195601B1 (en) Method of providing visual sound image and electronic device implementing the same
US20190222806A1 (en) Communication system and method
CN111429517A (en) Relocation method, relocation device, storage medium and electronic device
KR20170035608A (en) Videotelephony System, Image Display Apparatus, Driving Method of Image Display Apparatus, Method for Generation Realistic Image and Computer Readable Recording Medium
CN111045511B (en) Gesture-based control method and terminal equipment
CN106791535B (en) Video recording method and device
US10701301B2 (en) Video playing method and device
CN111815666B (en) Image processing method and device, computer readable storage medium and electronic equipment
CN105654039A (en) Image processing method and device
CN104284240A (en) Video browsing method and device
CN105354560A (en) Fingerprint identification method and device
US20220222831A1 (en) Method for processing images and electronic device therefor
CN110505406A (en) Background-blurring method, device, storage medium and terminal
WO2022151686A1 (en) Scene image display method and apparatus, device, storage medium, program and product
CN109344703B (en) Object detection method and device, electronic equipment and storage medium
CN108986117B (en) Video image segmentation method and device
US9756421B2 (en) Audio refocusing methods and electronic devices utilizing the same
CN106954093B (en) Panoramic video processing method, device and system
CN104850592A (en) Method and device for generating model file
CN105608469A (en) Image resolution determination method and device
US20220237916A1 (en) Method for detecting collisions in video and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40026469
Country of ref document: HK