CN115661941A - Gesture recognition method and electronic equipment

Gesture recognition method and electronic equipment

Info

Publication number: CN115661941A
Authority: CN (China)
Prior art keywords: event, stream, camera, video stream, electronic device
Legal status: Granted
Application number: CN202211577422.6A
Other languages: Chinese (zh)
Other versions: CN115661941B (en)
Inventor: 赵渊 (Zhao Yuan)
Current assignee: Honor Device Co Ltd
Original assignee: Honor Device Co Ltd
Application filed by Honor Device Co Ltd
Priority to CN202211577422.6A
Publication of CN115661941A
Application granted
Publication of CN115661941B
Legal status: Active

Landscapes

  • User Interface Of Digital Computer (AREA)

Abstract

An embodiment of the present application provides a gesture recognition method and an electronic device. The method is executed by the electronic device, which includes a camera and an event camera, and comprises: acquiring a video stream collected by the camera and an event stream collected by the event camera; and performing fusion processing on a first video stream and a first event stream having the same acquisition time to determine a gesture recognition result when the user's hand faces the camera, where the first video stream is a part of the video stream collected by the camera and the first event stream is a part of the event stream collected by the event camera. The method can improve the accuracy of gesture recognition results in a variety of usage scenarios.

Description

Gesture recognition method and electronic equipment
Technical Field
The application relates to the technical field of electronics, in particular to a gesture recognition method and electronic equipment.
Background
In a conventional human-computer interaction process, a user implements human-computer interaction through an interaction device (e.g., a keyboard, mouse, tablet, touch screen, or game controller) connected to the electronic device. This approach requires the user to input interaction instructions through the interaction device, for example, entering text instructions through the keyboard or game control instructions through the game controller.
To improve the diversity and convenience of human-computer interaction, interaction through gesture recognition has been proposed: when a user uses an electronic device, the device may collect an image of the user's hand, determine an interaction instruction by recognizing the gesture in the hand image, and then perform the corresponding operation.
However, in some scenarios (e.g., motion scenarios), the gesture recognition results obtained with the related art have low accuracy, so the electronic device cannot respond to the user's interaction needs in time.
Disclosure of Invention
The present application provides a gesture recognition method and an electronic device that can improve the accuracy of gesture recognition results in various usage scenarios.
In a first aspect, the present application provides a gesture recognition method executed by an electronic device that includes a camera and an event camera. The method includes: acquiring a video stream collected by the camera and an event stream collected by the event camera; and performing fusion processing on a first video stream and a first event stream having the same acquisition time to determine a gesture recognition result when the user's hand faces the camera, where the first video stream is a part of the video stream collected by the camera and the first event stream is a part of the event stream collected by the event camera.
The video stream collected by the camera may include multiple frames of images and the acquisition time of each frame, and the camera may collect the frames at a preset frequency; the event stream collected by the event camera may include the coordinates of the pixel points at which events occur and the occurrence times of those events, and may be represented by a three-dimensional data structure.
After acquiring the video stream and the event stream, the electronic device can determine the user's gesture recognition result by combining the two data streams. To ensure that the video-stream data and event-stream data being processed were collected at the same time point (or time period), the electronic device may select a partial video stream (i.e., the first video stream) from the video stream and a partial event stream (i.e., the first event stream) from the event stream, where the two have the same acquisition time, and then perform fusion processing on them to determine the gesture recognition result.
Optionally, the electronic device may process the first video stream based on an image recognition technology, process the first event stream based on a point cloud recognition technology, and then fuse the processing results to obtain a gesture recognition result.
Because the video stream contains rich semantic information, while the events in the event stream occur asynchronously and offer a high frame rate and low latency, the electronic device combines the two streams and integrates their respective advantages for gesture recognition, which can improve the accuracy of gesture recognition results in various usage scenarios.
With reference to the first aspect, in some implementation manners of the first aspect, the performing fusion processing on the first video stream and the first event stream with the same acquisition time to determine a gesture recognition result when the hand of the user faces the camera includes: and fusing the first video stream and the first event stream with the same acquisition time by adopting a neural network, and determining a gesture recognition result.
In order to improve the efficiency of the gesture recognition process and further improve the accuracy of the gesture recognition result, the first video stream and the first event stream may be fused by using a neural network in the implementation manner, so as to obtain the gesture recognition result. The neural network may be any network capable of processing video streams and event streams, and the specific network structure is not limited.
With reference to the first aspect, in some implementation manners of the first aspect, the determining a gesture recognition result by performing fusion processing on a first video stream and a first event stream with the same acquisition time by using a neural network includes: processing the first video stream by adopting a first network to obtain a first characteristic, and processing the first event stream by adopting a second network to obtain a second characteristic; fusing the first feature and the second feature to obtain a fused feature; and determining a gesture recognition result according to the fusion characteristics.
The video stream and the event stream are two different types of data, so the electronic device can process them with different networks. Optionally, the first network may be a 3D convolutional network, for example a SlowFast network; the second network may be a point-cloud-based network, such as a PointNet network.
After the electronic device processes the first video stream with the first network, a first feature, i.e., a feature of the video stream, is obtained; after it processes the first event stream with the second network, a second feature, i.e., a feature of the event stream, is obtained. The first feature and the second feature are then fused to obtain a fused feature, i.e., a feature combining the video stream and the event stream. A gesture recognition result obtained from the fused feature comprehensively considers the features of both streams and combines their respective advantages for gesture recognition, which can improve the accuracy of the gesture recognition result in each usage scenario.
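As an illustration of this two-branch design, the following is a minimal sketch in PyTorch-style Python (an assumption; the patent does not specify an implementation): a small 3D-convolution branch stands in for the SlowFast-style first network and a PointNet-style shared MLP stands in for the second network, each producing a fixed-length feature vector. The real networks would be considerably deeper.

    import torch
    import torch.nn as nn

    class VideoBranch(nn.Module):
        """Toy stand-in for the first network: extracts a clip-level feature
        from a stack of frames shaped (batch, 3, T, H, W)."""
        def __init__(self, feat_dim=128):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool3d(1),          # pool over time and space
            )
            self.fc = nn.Linear(64, feat_dim)

        def forward(self, frames):
            return self.fc(self.conv(frames).flatten(1))      # -> (batch, feat_dim)

    class EventBranch(nn.Module):
        """Toy stand-in for the second network: a PointNet-style shared MLP over
        event points (x, y, t), followed by max pooling over the points."""
        def __init__(self, feat_dim=128):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(3, 64), nn.ReLU(),
                nn.Linear(64, feat_dim), nn.ReLU(),
            )

        def forward(self, events):                            # events: (batch, N, 3)
            return self.mlp(events).max(dim=1).values         # -> (batch, feat_dim)

The first feature and the second feature produced by these two branches can then be fused; one possible attention-style fusion is sketched below.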
With reference to the first aspect, in some implementations of the first aspect, the fusing the first feature and the second feature to obtain a fused feature includes: and fusing the first feature and the second feature based on an attention mechanism to obtain a fused feature.
The attention mechanism can automatically focus on the features that matter in different scenes: for example, in a fast-motion scene more attention is paid to the event stream, while in a near-static scene, or a scene where local details matter, more attention is paid to the video stream.
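One possible way to realize such modality weighting is sketched below: a small network predicts a per-sample weight for each modality from the concatenated features and blends the first and second features accordingly. This is only an assumed form of the attention mechanism, not the patent's exact fusion structure, and the number of gesture classes is illustrative.

    import torch
    import torch.nn as nn

    class AttentionFusion(nn.Module):
        def __init__(self, feat_dim=128, num_classes=3):
            super().__init__()
            self.attn = nn.Linear(2 * feat_dim, 2)            # one logit per modality
            self.classifier = nn.Linear(feat_dim, num_classes)

        def forward(self, video_feat, event_feat):            # each: (batch, feat_dim)
            logits = self.attn(torch.cat([video_feat, event_feat], dim=1))
            w = torch.softmax(logits, dim=1)                  # modality weights sum to 1
            fused = w[:, 0:1] * video_feat + w[:, 1:2] * event_feat
            return torch.softmax(self.classifier(fused), dim=1)   # class probabilities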
With reference to the first aspect, in some implementations of the first aspect, before the merging the first video stream and the first event stream with the same acquisition time, the method further includes: and aligning the video stream collected by the camera with the event stream collected by the event camera, and determining a first video stream and a first event stream with the same collection time.
As can be seen from the above description, since the electronic device performs fusion processing on a first video stream and a first event stream having the same acquisition time, it first needs to select such a first video stream and first event stream from the acquired video stream and event stream.
Optionally, the electronic device may select an event stream having the same occurrence time as the acquisition time from the event stream based on the acquisition time of the frame image in the video stream, that is, implement data alignment between the video stream and the event stream.
In one implementation, performing data alignment on the video stream collected by the camera and the event stream collected by the event camera to determine the first video stream and the first event stream with the same acquisition time includes: acquiring, through a sliding window, a part of the video stream collected by the camera as the first video stream; and, according to a first time at which the first video stream was acquired, acquiring from the event stream collected by the event camera a partial event stream whose occurrence time is the first time, as the first event stream.
The electronic device may set a sliding window, sequentially take the multiple frames of images inside the sliding window as the first video stream, and, according to the acquisition time of those frames (i.e., the first time), select from the entire event stream the events whose occurrence time is the first time, i.e., the first event stream; the electronic device can then perform fusion processing on the first video stream and the first event stream. Processing the first video stream and the first event stream sequentially through a sliding-window mechanism reduces the amount of data the electronic device processes at one time, allows data overlap between adjacent sliding windows, and takes full account of the data before and after each frame image and event, thereby further improving the accuracy of the subsequent gesture recognition result.
Alternatively, the sliding window may be a window with a variable length, or may be a window with a constant length, which is described in the following embodiments.
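A minimal sketch of this selection step, assuming frame capture timestamps and an event array of (x, y, t) rows (the data layout is an assumption): the first event stream is simply the set of event points whose timestamps fall within the time span covered by the frames in the current window.

    import numpy as np

    def align_window(frame_times, events, start_idx, end_idx):
        """frame_times: (F,) capture times of the frames; events: (N, 3) rows of (x, y, t).
        Returns the event points whose occurrence time lies within frames
        start_idx..end_idx, i.e. the first event stream for this window."""
        t_begin, t_end = frame_times[start_idx], frame_times[end_idx]
        mask = (events[:, 2] >= t_begin) & (events[:, 2] <= t_end)
        return events[mask]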
With reference to the first aspect, in some implementations of the first aspect, after determining the first event stream, the method further includes: and performing up-sampling operation or down-sampling operation on the event points in the first event stream to enable the number of the event points in the first event stream to reach a preset number.
The event stream collected by the event camera consists of individual event points; events are reported asynchronously and the pixels at which events occur are not fixed, so the number of event points collected within a fixed duration is not fixed. If the first video stream and the first event stream are fed into the neural network according to the sliding-window mechanism, the sizes of the input event streams will therefore differ, and the neural network cannot accept inputs of different sizes. The electronic device may thus sample the event points in the first event stream so that the number of event points within the sliding window reaches a preset number. The sampling may be up-sampling or down-sampling: when the number of event points in the sliding window is greater than the preset number, down-sampling selects the preset number of event points from them; when the number of event points in the sliding window is less than the preset number, up-sampling pads the event points up to the preset number. In this way, the event streams input to the neural network have the same size and meet the input requirements of the neural network.
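A minimal sketch of this normalization step (the random sampling strategy and the target count of 1024 are assumptions; the patent only requires that the count reach a preset number): down-sample when the window contains too many event points, up-sample by repeating points when it contains too few.

    import numpy as np

    def resample_events(events, target=1024, rng=None):
        """events: (N, 3) array of (x, y, t) with N >= 1; returns a (target, 3) array."""
        rng = rng or np.random.default_rng()
        n = events.shape[0]
        if n >= target:
            idx = rng.choice(n, size=target, replace=False)   # down-sampling
        else:
            idx = rng.choice(n, size=target, replace=True)    # up-sampling by repetition
        return events[idx]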
With reference to the first aspect, in some implementation manners of the first aspect, the fusing the first video stream and the first event stream with the same acquisition time to determine a gesture recognition result when the hand of the user faces the camera includes: fusing the first video stream and the first event stream, and determining the probability of each preset gesture category; and if the probability of the first gesture category in the probabilities of all the gesture categories is greater than or equal to a preset probability threshold, taking the first gesture category as a gesture recognition result.
When the electronic device performs fusion processing on the first video stream and the first event stream, particularly when the neural network is used to perform fusion processing on the first video stream and the first event stream, the output result may be a probability of each gesture class, for example, the probability of class 1 is P1, the probability of class 2 is P2, and the probability of class 3 is P3; if the probability of a certain gesture category is greater than or equal to the preset probability threshold, it may be determined as the current gesture category, that is, the gesture recognition result.
In one implementation, the method further includes: and if the frame number of the video stream collected by the camera reaches a preset frame number threshold value, but the gesture recognition result is still not determined, outputting the information of the gesture recognition failure.
If the number of frames of the video stream collected by the camera has reached the frame-count threshold (for example, 150 frames within 5 seconds) but the electronic device still has not determined a gesture recognition result, the current gesture of the user is difficult to recognize. To reduce the waiting time, the electronic device may stop collecting the video stream and the event stream and output information indicating that gesture recognition has failed, so as to prompt the user that the gesture currently cannot be recognized or prompt the user to make the gesture again.
With reference to the first aspect, in some implementations of the first aspect, the acquiring a video stream captured by a camera and an event stream captured by an event camera includes: and under the condition that a preset condition is met, acquiring a video stream acquired by the camera and an event stream acquired by the event camera, wherein the preset condition represents that the electronic equipment starts a gesture recognition function.
To reduce the processing power consumption of the electronic device, the electronic device may present prompt information for making a gesture when the user triggers the gesture recognition function or the electronic device starts it automatically; only after presenting the prompt does the electronic device begin acquiring the video stream and the event stream for gesture recognition, i.e., trigger the camera to collect the video stream and the event camera to collect the event stream. The preset condition may be that the electronic device has displayed the prompt information, or that the electronic device has started the gesture recognition function.
In a second aspect, the present application provides an apparatus, which is included in an electronic device, and which has a function of implementing the behavior of the electronic device in the first aspect and the possible implementation manners of the first aspect. The functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules or units corresponding to the above-described functions. Such as a receiving module or unit, a processing module or unit, etc.
In a third aspect, the present application provides an electronic device, comprising: a processor, a memory, and an interface; the processor, the memory and the interface cooperate with each other to enable the electronic device to perform any one of the methods of the first aspect.
In a fourth aspect, the present application provides a chip comprising a processor. The processor is adapted to read and execute the computer program stored in the memory to perform the method of the first aspect and any possible implementation thereof.
Optionally, the chip further comprises a memory, the memory being connected to the processor by a circuit or a wire.
Further optionally, the chip further comprises a communication interface.
In a fifth aspect, the present application provides a computer-readable storage medium, in which a computer program is stored, and when the computer program is executed by a processor, the processor is caused to execute any one of the methods in the first aspect.
In a sixth aspect, the present application provides a computer program product comprising: computer program code which, when run on an electronic device, causes the electronic device to perform any of the methods of the first aspect.
Drawings
Fig. 1 is a diagram of an application scenario of an example gesture recognition method according to an embodiment of the present application;
FIG. 2 is a diagram illustrating an example of an image detection process provided in the related art;
fig. 3 is a schematic structural diagram of an example of an electronic device according to an embodiment of the present disclosure;
FIG. 4 is a block diagram of a software structure of an example of an electronic device according to an embodiment of the present disclosure;
FIG. 5 is a flowchart illustrating an example of a gesture recognition method according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram illustrating an example of a distribution of event points in an event stream according to an embodiment of the present application;
FIG. 7 is a schematic diagram illustrating an example of data alignment between a video stream and an event stream according to an embodiment of the present application;
FIG. 8 is a schematic diagram of another example of aligning a video stream and an event stream according to an embodiment of the present application;
FIG. 9 is a schematic flowchart of another gesture recognition method provided in an embodiment of the present application;
FIG. 10 is a schematic diagram illustrating an example of sampling an event stream according to an embodiment of the present disclosure;
fig. 11 is a schematic diagram of a data processing process of an example slowfast network according to an embodiment of the present application;
fig. 12 is a schematic diagram of an example of a data processing process of a PointNet network according to the embodiment of the present application;
FIG. 13 is a schematic diagram of an example of data processing based on an attention mechanism according to an embodiment of the present disclosure;
fig. 14 is a schematic flowchart of another gesture recognition method according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. In the description of the embodiments of the present application, "/" indicates an "or" relationship; for example, A/B may indicate A or B. "And/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, in the description of the embodiments of the present application, "a plurality" means two or more.
In the following, the terms "first", "second" and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, features defined as "first", "second", and "third" may explicitly or implicitly include one or more of the features.
At present, human-computer interaction modes are increasing, and electronic devices can not only realize a human-computer interaction function through some interaction devices (such as a keyboard, a mouse, a handwriting pad, a touch screen, a game controller and the like), but also realize the human-computer interaction function through recognizing gestures of a user.
The gesture recognition technology can be adapted to various scenes, for example, in a shooting scene, the electronic device may set a gesture for starting shooting, and when the electronic device detects the gesture of the user, a shooting function may be started; for example, as shown in fig. 1, the gesture for starting shooting may be a gesture for a user to lift a palm, and when the electronic device recognizes the palm gesture in the picture through the camera, the shooting function may be started, so that the user does not need to perform operations such as clicking a shooting control, and the human-computer interaction is improved. For another example, in a game scene, the electronic device may set a gesture for starting a game, and when the electronic device detects the gesture of the user, the game may be started; illustratively, the gesture for starting the game may be an "OK" gesture of the user, and when the electronic device recognizes the gesture through the camera, the game may be started, so as to improve the game experience of the user. Of course, the gesture recognition technology can also be applied to other scenarios, which are not described herein again.
In the related art, commonly used gesture recognition technologies may include a video stream-based recognition technology and an event stream-based recognition technology, and the processes of the two gesture recognition technologies are briefly described below.
For the video-stream-based recognition technology, as shown in fig. 2, the electronic device may collect a video stream, such as a red-green-blue (RGB) video stream, through a camera, detect the frame images in the video stream, determine the hand position in each frame image (for example, mark a hand bounding box), perform key-point detection on the hand-region image (for example, detect and mark the positions of 21 skeleton points), and perform gesture classification according to the key-point detection result to determine the gesture recognition result. The detection process in the related art may be implemented with a conventional image recognition technology or with a recognition technology based on deep learning. However, with this kind of gesture recognition technology, the recognition accuracy is high when the user's hand is stationary or moving slowly, but low when the hand moves quickly (because the position of the hand changes rapidly between frame images); if the electronic device cannot accurately recognize the gesture, it cannot execute the corresponding function, for example, it cannot start the shooting function in a shooting scene.
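The related-art video-only pipeline described above can be summarized by the following sketch; detect_hand_box, detect_21_keypoints and classify_keypoints are hypothetical placeholders for a hand detector, a skeleton key-point model and a gesture classifier, not real APIs.

    def recognize_gesture_from_frame(frame):
        box = detect_hand_box(frame)                  # hypothetical hand bounding-box detector
        if box is None:
            return None                               # no hand visible in this frame
        keypoints = detect_21_keypoints(frame, box)   # hypothetical 21-skeleton-point detector
        return classify_keypoints(keypoints)          # hypothetical gesture classifier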
For the event-stream-based recognition technology, the electronic device may collect an event stream, such as a spatio-temporal event stream, through an event camera, and then process and analyze the event stream to obtain a gesture recognition result. The event camera is a new type of asynchronous sensor: unlike a conventional camera that captures images or video, the event camera captures "events", which can be understood as "changes in pixel brightness"; that is, the event camera outputs changes in pixel brightness. Its working mechanism is that when the brightness at a certain pixel changes and the change reaches a set threshold, the camera reports an event; all events occur asynchronously, which gives the event camera the advantages of a high frame rate and low latency. Each event may have four attribute values: the first two are the pixel coordinates of the event (usually the x and y coordinates), the third is the timestamp of the event, and the last is either a polarity state 0/1 (or -1/1, etc.) indicating whether the brightness changed from low to high or from high to low, or an activation/deactivation state (0/1) indicating whether the object corresponding to the pixel moved. Alternatively, each event may include only three attribute values, the pixel coordinates and the timestamp. In the related art, the electronic device may input the acquired event stream into a deep-learning-based network and output a gesture recognition result. However, as can be seen from the above description, the attribute values of an event include only the pixel coordinates, the timestamp and a state value, and the state value has only two states; that is, the event attributes lack semantic information (such as color information) about the camera's field of view, so in a scene with a complex background the recognition accuracy of this technique may be low.
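Following the attribute list above, a single event and an event stream might be represented as in the sketch below; the field names are illustrative and not taken from the patent.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Event:
        x: int              # pixel column where the brightness change occurred
        y: int              # pixel row
        t: float            # timestamp of the event
        polarity: int = 1   # e.g. 1: brightness went up, 0 (or -1): brightness went down

    EventStream = List[Event]   # events arrive asynchronously, ordered by timestamp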
In view of this, an embodiment of the present application provides a gesture recognition method, which may combine an event stream of an event camera and a video stream of a camera to perform processing and analysis, fuse features of the event stream and features of the video stream, and then perform gesture classification using the fused features, so as to improve accuracy of a gesture recognition result in each usage scenario (especially, a motion scenario). It should be noted that the gesture recognition method provided in the embodiment of the present application may be applied to an electronic device having a camera function, such as a mobile phone, a tablet computer, a wearable device, an in-vehicle device, an Augmented Reality (AR)/Virtual Reality (VR) device, a notebook computer, a super-mobile personal computer (UMPC), a netbook, and a Personal Digital Assistant (PDA), and the specific type of the electronic device is not limited in any way in the embodiment of the present application.
For example, fig. 3 is a schematic structural diagram of an example of the electronic device 100 according to the embodiment of the present application. Taking the electronic device 100 as a mobile phone as an example, the electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a Universal Serial Bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a Subscriber Identity Module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, an event camera 180N, and the like.
Processor 110 may include one or more processing units, such as: the processor 110 may include an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a memory, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processing Unit (NPU), etc. The different processing units may be separate devices or may be integrated into one or more processors.
The controller may be a neural center and a command center of the electronic device 100. The controller can generate an operation control signal according to the instruction operation code and the time sequence signal to finish the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that have just been used or recycled by the processor 110. If the processor 110 needs to use the instruction or data again, it can be called directly from memory. Avoiding repeated accesses reduces the latency of the processor 110, thereby increasing the efficiency of the system.
In some embodiments, processor 110 may include one or more interfaces. The interface may include an integrated circuit (I2C) interface, an integrated circuit built-in audio (I2S) interface, a Pulse Code Modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a Mobile Industry Processor Interface (MIPI), a general-purpose input/output (GPIO) interface, a Subscriber Identity Module (SIM) interface, and/or a Universal Serial Bus (USB) interface, etc.
It should be understood that the connection relationship between the modules illustrated in the embodiment of the present application is only an exemplary illustration, and does not limit the structure of the electronic device 100. In other embodiments of the present application, the electronic device 100 may also adopt different interface connection manners or a combination of multiple interface connection manners in the above embodiments.
The charging management module 140 is configured to receive charging input from a charger. The charger may be a wireless charger or a wired charger. In some wired charging embodiments, the charging management module 140 may receive charging input from a wired charger via the USB interface 130. In some wireless charging embodiments, the charging management module 140 may receive a wireless charging input through a wireless charging coil of the electronic device 100. The charging management module 140 may also supply power to the electronic device through the power management module 141 while charging the battery 142.
The power management module 141 is used to connect the battery 142, the charging management module 140 and the processor 110. The power management module 141 receives input from the battery 142 and/or the charging management module 140, and provides power to the processor 110, the internal memory 121, the external memory, the display 194, the camera 193, the wireless communication module 160, and the like. The power management module 141 may also be used to monitor parameters such as battery capacity, battery cycle count, battery state of health (leakage, impedance), etc. In some other embodiments, the power management module 141 may also be disposed in the processor 110. In other embodiments, the power management module 141 and the charging management module 140 may be disposed in the same device.
The wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like.
The wireless communication module 160 may provide a solution for wireless communication applied to the electronic device 100, including Wireless Local Area Networks (WLANs) (e.g., wireless fidelity (Wi-Fi) networks), bluetooth (bluetooth, BT), global Navigation Satellite System (GNSS), frequency Modulation (FM), near Field Communication (NFC), infrared (IR), and the like. The wireless communication module 160 may be one or more devices integrating at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, performs frequency modulation and filtering on electromagnetic wave signals, and transmits the processed signals to the processor 110. The wireless communication module 160 may also receive a signal to be transmitted from the processor 110, perform frequency modulation and amplification on the signal, and convert the signal into electromagnetic waves through the antenna 2 to radiate the electromagnetic waves.
In some embodiments, antenna 1 of electronic device 100 is coupled to mobile communication module 150 and antenna 2 is coupled to wireless communication module 160 so that electronic device 100 can communicate with networks and other devices through wireless communication techniques. The wireless communication technology may include global system for mobile communications (GSM), general Packet Radio Service (GPRS), code Division Multiple Access (CDMA), wideband Code Division Multiple Access (WCDMA), time division code division multiple access (time-division multiple access, TD-SCDMA), long term evolution (long term evolution, LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technologies, among others. GNSS may include Global Positioning System (GPS), global navigation satellite system (GLONASS), beidou satellite navigation system (BDS), quasi-zenith satellite system (QZSS), and/or Satellite Based Augmentation System (SBAS).
The electronic device 100 implements display functions via the GPU, the display screen 194, and the application processor. The GPU is a microprocessor for image processing, and is connected to the display screen 194 and an application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
The display screen 194 is used to display images, video, and the like. The display screen 194 includes a display panel. The display panel may adopt a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), and the like. In some embodiments, the electronic device 100 may include 1 or N display screens 194, N being a positive integer greater than 1.
The electronic device 100 may implement a photographing function through the ISP, the camera 193, the video codec, the GPU, the display screen 194, and the application processor, etc.
The ISP is used to process the data fed back by the camera 193. For example, when a user takes a picture, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, an optical signal is converted into an electric signal, and the camera photosensitive element transmits the electric signal to the ISP for processing and converting the electric signal into an image visible to the naked eye. The ISP can also carry out algorithm optimization on noise, brightness and skin color of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be provided in camera 193.
The camera 193 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image to the photosensitive element. The photosensitive element may be a Charge Coupled Device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The light sensing element converts the optical signal into an electrical signal, which is then passed to the ISP where it is converted into a digital image signal. And the ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into image signal in standard RGB, YUV and other formats. In some embodiments, electronic device 100 may include 1 or N cameras 193, N being a positive integer greater than 1. In some embodiments, camera 193 may include a front camera and a rear camera.
The digital signal processor is used for processing digital signals, and can process digital image signals and other digital signals. For example, when the electronic device 100 selects a frequency bin, the digital signal processor is used to perform fourier transform or the like on the frequency bin energy.
Video codecs are used to compress or decompress digital video. The electronic device 100 may support one or more video codecs. In this way, the electronic device 100 may play or record video in a variety of encoding formats, such as: moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, MPEG4, and the like.
The NPU is a neural-network (NN) computing processor, which processes input information quickly by referring to a biological neural network structure, for example, by referring to a transfer mode between neurons of a human brain, and can also learn by itself continuously. Applications such as intelligent recognition of the electronic device 100 can be realized through the NPU, for example: image recognition, face recognition, speech recognition, text understanding, and the like.
The internal memory 121 may be used to store computer-executable program code, which includes instructions. The processor 110 executes various functional applications of the electronic device 100 and data processing by executing instructions stored in the internal memory 121. The internal memory 121 may include a program storage area and a data storage area. The storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required by at least one function, and the like. The storage data area may store data (such as audio data, phone book, etc.) created during use of the electronic device 100, and the like. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (UFS), and the like.
The electronic device 100 may implement audio functions via the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone interface 170D, and the application processor. Such as music playing, recording, etc. The electronic device 100 may also collect an event stream or the like through the event camera 180N.
It is to be understood that the illustrated structure of the embodiment of the present application does not specifically limit the electronic device 100. In other embodiments of the present application, electronic device 100 may include more or fewer components than shown, or some components may be combined, some components may be split, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The software system of the electronic device 100 may employ a layered architecture, an event-driven architecture, a micro-core architecture, a micro-service architecture, or a cloud architecture. The embodiment of the present application takes an Android system with a layered architecture as an example, and exemplarily illustrates a software structure of the electronic device 100.
Fig. 4 is a block diagram of a software structure of the electronic device 100 according to the embodiment of the present application. The layered architecture divides the software into several layers, each layer having a clear role and division of labor. The layers communicate with each other through a software interface. In some embodiments, the Android system is divided from top to bottom into an application layer, an application framework layer, an Android runtime (Android runtime) and system library, a kernel layer, and a hardware layer. The application layer may include a series of application packages.
As shown in fig. 4, the application package may include applications such as camera, gallery, calendar, phone call, map, navigation, WLAN, bluetooth, music, video, short message, etc.
The application framework layer provides an Application Programming Interface (API) and a programming framework for the application program of the application layer. The application framework layer includes a number of predefined functions.
As shown in FIG. 4, the application framework layers may include a window manager, content provider, view system, phone manager, resource manager, notification manager, and the like.
The window manager is used for managing window programs. The window manager can obtain the size of the display screen, determine whether there is a status bar, lock the screen, capture the screen, and the like.
The content provider is used to store and retrieve data and make it accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, phone books, etc.
The view system includes visual controls, such as controls for displaying text and controls for displaying pictures. The view system may be used to build applications. A display interface may be composed of one or more views; for example, a display interface including a short-message notification icon may include a view for displaying text and a view for displaying pictures.
The phone manager is used to provide the communication functions of the electronic device 100, such as management of call status (including connected, hung up, etc.).
The resource manager provides various resources for applications, such as localized strings, icons, pictures, layout files, and video files.
The notification manager enables an application to display notification information in the status bar; it can be used to convey notification-type messages, which can disappear automatically after a short stay without user interaction. For example, the notification manager is used to notify download completion, message alerts, etc. The notification manager may also present notifications in the form of a chart or scroll-bar text in the status bar at the top of the system, such as notifications of applications running in the background, or notifications that appear on the screen in the form of a dialog window, for example, text prompted in the status bar, a prompt tone, vibration of the electronic device, or a flashing indicator light.
The Android runtime comprises a core library and a virtual machine. The Android runtime is responsible for scheduling and managing an Android system.
The core library comprises two parts: one part is the functions that the Java language needs to call, and the other part is the core library of Android.
The application layer and the application framework layer run in the virtual machine. The virtual machine executes the Java files of the application layer and the application framework layer as binary files. The virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, security and exception management, and garbage collection.
The system library may include a plurality of functional modules. For example: surface managers (surface managers), media libraries (media libraries), three-dimensional graphics processing libraries (e.g., openGL ES), 2D graphics engines (e.g., SGL), and the like.
The surface manager is used to manage the display subsystem and provide fusion of 2D and 3D layers for multiple applications. The media library supports a variety of commonly used audio, video format playback and recording, and still image files, among others. The media library may support a variety of audio-video encoding formats, such as MPEG4, h.264, MP3, AAC, AMR, JPG, PNG, and the like. The three-dimensional graphic processing library is used for realizing three-dimensional graphic drawing, image rendering, synthesis, layer processing and the like. The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer is a layer between hardware and software. The kernel layer includes at least a display driver, a camera driver, an audio driver, and a sensor driver.
The hardware layer may include a camera that may capture video streams or frame images and an event camera that may capture event streams.
In addition, the application layer may further include a processing module for processing the video stream collected by the camera and the event stream collected by the event camera to determine a gesture recognition result.
For convenience of understanding, the following embodiments of the present application will specifically describe a gesture recognition method provided by the embodiments of the present application by taking an electronic device having a structure shown in fig. 3 and fig. 4 as an example, and combining the drawings and an application scenario.
The gesture recognition method provided by the embodiment of the application can be applied to the scene shown in fig. 1, when a user holds the electronic device, if the user wants to use the gesture recognition function of the electronic device, the user can orient the hand to the front camera or the rear camera of the electronic device according to the prompt information of the electronic device. After the electronic device recognizes the gesture of the user, the corresponding function can be executed.
For example, as shown in fig. 1, in a photographing scene, when the user raises a palm towards the front camera of the electronic device, the electronic device detects the hand-raising action and can collect the video stream of the camera and the event stream of the event camera for gesture recognition. If the palm gesture is recognized, the photographing function can be started. Optionally, in a photographing scene, after the palm gesture is recognized, the electronic device may also activate and display a photographing countdown to prompt the user to prepare for photographing, and start the photographing function when the countdown ends.
Optionally, in some scenarios, when the user triggers starting of the gesture recognition function of the electronic device or the electronic device automatically starts the gesture recognition function, the electronic device may further give a gesture making prompt information to the user to prompt the user to make a corresponding gesture. After the electronic equipment gives the prompt information, the electronic equipment starts to collect the video stream of the camera and the event stream of the event camera again for gesture recognition, so that the processing power consumption of the electronic equipment is reduced.
Fig. 5 is a schematic flowchart of an example of a gesture recognition method provided in an embodiment of the present application, where the method may be executed by an electronic device, and specifically includes:
s101, acquiring a video stream acquired by a camera and an event stream acquired by an event camera.
The video stream acquired by the camera may include multiple frames of images and the acquisition time of each frame of image, the camera may acquire the frames of images according to a preset frequency, for example, 30 frames of images are acquired per second, and the multiple frames of images may form a segment of video stream. It is understood that rich semantic information, such as environmental information, color information, etc., may be included in the video stream.
The event stream collected by the event camera may include coordinates of pixel points corresponding to the event and occurrence time of the event, which may be represented by a three-dimensional data structure, for example, as shown in fig. 6, x and y in a three-dimensional coordinate system represent coordinates of pixel points corresponding to the event, and t represents occurrence time of the event. In fig. 6, an event stream can be understood as an event cloud, which includes a plurality of event points, each of which has a coordinate (i.e., a pixel point coordinate) and an occurrence time. It can be understood that the events collected by the event cameras all occur asynchronously, which has the advantages of high frame rate and low latency.
And S102, processing the video stream and the event stream by adopting a neural network, and determining a gesture recognition result.
After the electronic device acquires the video stream and the event stream, the video stream and the event stream may be input to a preset neural network, and by combining respective advantages of the video stream and the event stream, through operations such as feature extraction, feature fusion, classification, and the like, gesture categories (that is, gesture recognition results) may be output, for example, category 1 represents a palm gesture, category 2 represents a fist-making gesture, category 3 represents an "OK" gesture, and the like.
To ensure that the neural network processes video-stream data and event-stream data collected at the same time point (or time period), the electronic device may first align the video stream and the event stream, and then process the aligned video stream and event stream with the neural network to obtain the gesture recognition result.
In one implementation, the process of the electronic device aligning the video stream and the event stream may be: the electronic device may select an event stream with an occurrence time of an event as a first time from the entire event stream based on an acquisition time or a time period (denoted as the first time) of a frame image in the video stream, that is, the video stream and the event stream corresponding to the first time may be input to the neural network for processing.
Exemplarily, assuming that the acquisition time period of the frame image in the video stream is t1-t2, the electronic device may select an event stream with an occurrence time period of t1-t2 from the entire event stream, i.e., data alignment between the video stream and the event stream is achieved.
In another implementation, the process of aligning the video stream and the event stream may be as follows. Because the frame images in the video stream are acquired frame by frame, processing every frame image as soon as it is acquired would increase the data processing load of the electronic device. Therefore, this implementation may adopt a sliding-window mechanism: the multiple frame images within the sliding window are selected in sequence, and according to the acquisition time or time period of those frames (denoted the first time), the events whose occurrence time is the first time are selected from the entire event stream; the video stream and the event stream corresponding to the first time can then be input into the neural network for processing. Optionally, the sliding window may be a window of variable length or a window of constant length. It can be understood that, as the sliding window moves, the data in successive windows overlap, as described below.
When the sliding window is a window of variable length, assume the initial length of the sliding window is N. The electronic device starts to acquire the first frame image (the event stream is also being collected synchronously) and buffers the acquired frames; after N frames have been acquired (i.e., the length of the sliding window is reached), the electronic device selects the events with the corresponding occurrence times according to the acquisition times of those N frames, so the video stream composed of the N frames and the event stream with the corresponding occurrence times can be input into the neural network for processing. The electronic device then continues to acquire frames, and the length of the sliding window becomes N + h (h ≥ 1); that is, the start of the window stays fixed while its end moves backward by h steps (one step corresponds to one frame). When the electronic device has acquired N + h frames in total, i.e., h more frames, it selects the events with the corresponding occurrence times according to the acquisition times of the N + h frames for processing. The electronic device then continues to acquire frames, the window length becomes N + 2h, and the above steps continue.
For example, assume the initial window length is N = 15 and h = 1; as shown in fig. 7, the process is as follows. The electronic device acquires frame images in sequence (the event stream is also being collected synchronously). When 15 frames have been acquired, it selects the events corresponding to the acquisition time of those 15 frames (for example, the period t3-t4); that is, the video stream and the event stream corresponding to the period t3-t4 lie in the same sliding window, so the electronic device can input them into the neural network for processing. The electronic device then continues to acquire frames, and the window length becomes N + h = 16. When 1 more frame has been acquired, there are 16 frames in the sliding window; the electronic device selects the events corresponding to the acquisition time of those 16 frames (for example, the period t3-t5) and processes the video stream and the event stream corresponding to the period t3-t5 with the neural network, and so on.
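A small sketch of which frame spans the growing window covers (N = 15, h = 1, as in the example above): the window always starts at frame 0 and its end advances by h frames per pass. This illustrates the indexing only, not the full processing pipeline.

    def growing_window_spans(total_frames, n=15, h=1):
        """Returns inclusive (first_frame, last_frame) index pairs for each pass."""
        spans = []
        end = n
        while end <= total_frames:
            spans.append((0, end - 1))
            end += h
        return spans

    # e.g. with 17 captured frames: [(0, 14), (0, 15), (0, 16)]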
Under the condition that the sliding window is a window with a constant length, assuming that the initial length of the sliding window is M, the electronic device starts to acquire a first frame image (at this time, the event stream is also synchronously acquired), the acquired frame image is firstly cached, and after the M frame image is acquired (namely, the length of the sliding window is reached), the electronic device selects the event stream with corresponding occurrence time according to the acquisition time corresponding to the M frame image, so that the video stream formed by the M frame image and the event stream with corresponding occurrence time can be input into the neural network for processing. Then, the electronic device continues to acquire frame images, the length of the sliding window is still M, but the sliding window moves backwards by M steps (i.e. the step length is M, M is greater than or equal to 1, and one frame can be taken here), and when the electronic device acquires M frame images again, an event stream corresponding to the occurrence time can be selected for processing according to the acquisition time corresponding to the 1+ M frame images to the M + M frame images (which are M frame images in total). And then, the electronic equipment continues to acquire the frame image, the sliding window moves backwards for M steps, and the steps are continuously executed according to the acquisition time corresponding to the frame image from 1+ M + M to the frame image from M + M + M (total M frame images).
Exemplarily, assuming that the length of the sliding window is M = 15 and m = 1, as shown in fig. 8, the above process is specifically as follows: the electronic device sequentially acquires frame images (at this time, the event stream is also acquired synchronously), and when 15 frame images have been acquired, the event stream corresponding to the time period t6-t7 is selected according to the acquisition time of the 15 frame images (for example, the time period t6-t7); that is, the video stream and the event stream corresponding to the time period t6-t7 are located in the same sliding window, so the electronic device can input them into the neural network for processing. Next, the electronic device continues to acquire frame images and the sliding window moves backwards by 1 step. When the electronic device has newly acquired 1 frame image, the 2nd to 16th frame images (15 frames in total) are in the sliding window; the electronic device selects the event stream corresponding to the time period t8-t9 according to the acquisition time of these 15 frame images (for example, the time period t8-t9), and processes the video stream and the event stream corresponding to this time period by using the neural network, and so on.
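For clarity, the following is a minimal Python sketch of the two sliding-window alignment schemes described above. The frame and event representations (dictionaries carrying a timestamp "t"), the function names, and the process_window hook are illustrative assumptions, not the patent's implementation.

```python
# Minimal sketch of the two sliding-window alignment schemes described above.
# Frames and events are assumed to carry a timestamp "t"; process_window stands in
# for the neural-network processing step.

def events_in_range(events, t_start, t_end):
    """Select the event points whose occurrence time falls inside the frame window."""
    return [e for e in events if t_start <= e["t"] <= t_end]

def variable_length_windows(frames, events, n=15, h=1, process_window=print):
    """Variable-length window: the start is fixed, the length grows N, N+h, N+2h, ..."""
    length = n
    while length <= len(frames):
        window = frames[:length]                       # start end never moves
        t_start, t_end = window[0]["t"], window[-1]["t"]
        process_window(window, events_in_range(events, t_start, t_end))
        length += h

def fixed_length_windows(frames, events, m_len=15, step=1, process_window=print):
    """Fixed-length window: the length stays M, the whole window slides by m frames."""
    start = 0
    while start + m_len <= len(frames):
        window = frames[start:start + m_len]
        t_start, t_end = window[0]["t"], window[-1]["t"]
        process_window(window, events_in_range(events, t_start, t_end))
        start += step
```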
It should be noted that the electronic device may also align the acquired video stream and the event stream in other manners, which is not limited in this embodiment of the application. In addition, the length, the moving step length, the changing mode, and the like of the sliding window may also be performed in other modes, which are not specifically limited in the embodiment of the present application.
As can be seen from the above description, after the electronic device aligns the video stream and the event stream by using any of the above implementations, the video stream and the event stream need to be input into the neural network for processing, and a processing procedure of the neural network is described below.
The neural network in the embodiment of the present application may be any network having the capability of processing a video stream and an event stream, and before the video stream and the event stream are processed by using the neural network, the neural network needs to be trained to converge. After the electronic device inputs the video stream and the event stream at the same time point (or time period) into the neural network, the neural network can respectively extract features of the video stream and the event stream, then fuse the extracted features, and classify according to the fused features. Optionally, the result output by the neural network may be a probability of each gesture category, for example, the probability of category 1 is P1, the probability of category 2 is P2, the probability of category 3 is P3, and the like; if the probability of a certain gesture category is greater than or equal to a preset probability threshold, the gesture category can be determined as the current gesture category.
For example, assuming that the preset probability threshold is 0.7, the neural network outputs the following results: the probability P1=0.8 of the category 1, the probability P2=0.1 of the category 2, and the probability P3=0.1 of the category 3, that is, it can be determined that the probability of the category 1 is greater than the probability threshold, then the category 1 is the determined gesture category, and the corresponding gesture (for example, palm gesture) is the gesture recognition result.
Based on the processing procedure of the neural network, in combination with the sliding window mechanism adopted by the electronic device, after the electronic device inputs the video stream and the event stream corresponding to any sliding window into the neural network, corresponding output results can be obtained. Therefore, in an implementation manner, when the gesture category can be determined according to the output result corresponding to a certain sliding window, the gesture recognition result can be determined; and when the gesture type cannot be determined according to the output result corresponding to the current sliding window, processing the video stream and the event stream corresponding to the next sliding window until the gesture recognition result can be determined. Alternatively, when the electronic device determines the gesture recognition result, the processing of the video stream and the event stream may be stopped, or the capturing of the video stream and the event stream may be stopped. In another implementation manner, if the number of frames of the frame image in the video stream acquired by the electronic device has reached the preset frame number threshold, but the gesture recognition result is still not determined, the electronic device may also stop the processing of the video stream and the event stream, or stop acquiring the video stream and the event stream, so as to reduce the data processing power consumption of the electronic device.
Illustratively, the electronic device starts to acquire the first frame image; when the number of acquired frame images reaches the length of the sliding window, the video stream and the event stream corresponding to the current sliding window are input into the neural network to obtain the probability of each gesture category. If every probability is smaller than the probability threshold, the electronic device continues to acquire frame images and either increases the length of the sliding window by h or moves the sliding window backwards by m steps. When the number of frame images acquired by the electronic device reaches the length of the sliding window again, the video stream and the event stream corresponding to the current sliding window are input into the neural network to obtain the probability of each gesture category; if the probability of any gesture category is greater than or equal to the probability threshold, the corresponding gesture is taken as the gesture recognition result; if all the probabilities are still smaller than the probability threshold, the electronic device continues to acquire frame images and again increases the length of the sliding window by h or moves the sliding window backwards by m steps, and so on. If the total number of frame images acquired by the electronic device reaches a frame number threshold (for example, 150 frames, that is, the number of frames in 5 seconds) and a gesture recognition result has still not been determined, which indicates that the current user's gesture is not easily recognized, the electronic device may stop acquiring the video stream and the event stream. Optionally, at this time, the electronic device may further give prompt information to the user to indicate that the gesture cannot currently be recognized or to prompt the user to make the gesture again.
For example, after the neural network processes the video stream and the event stream corresponding to the sliding window for the first time, the probability of each gesture category is obtained as follows: the probability of the category 1 is 0.3, the probability of the category 2 is 0.3, the probability of the category 3 is 0.4, and each probability is smaller than a probability threshold (e.g., 0.7), so that the gesture recognition result is not determined currently. The neural network processes the video stream and the event stream corresponding to the second time, and the probability of each gesture category is obtained as follows: and if the probability of the category 1 is 0.1, the probability of the category 2 is 0.7 and the probability of the category 3 is 0.2, determining that the gesture recognition result is the gesture corresponding to the category 2.
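A compact sketch of this stopping logic, under the assumption of the fixed-length window variant, is given below; capture_frame(), capture_events() and recognize() are hypothetical hooks standing in for the two cameras and the neural network.

```python
# Sketch of the stop conditions described above: keep sliding the window until one
# gesture-class probability reaches the threshold or the frame budget is exhausted.

def run_recognition(capture_frame, capture_events, recognize,
                    window_len=15, step=1, prob_threshold=0.7, max_frames=150):
    frames, events = [], []
    start = 0
    while len(frames) < max_frames:
        frames.append(capture_frame())          # frame image (events arrive in parallel)
        events.extend(capture_events())
        if len(frames) - start < window_len:
            continue                            # current window not full yet
        # selection of the event segment matching the window is omitted for brevity
        probs = recognize(frames[start:start + window_len], events)
        best_class, best_prob = max(probs.items(), key=lambda kv: kv[1])
        if best_prob >= prob_threshold:
            return best_class                   # gesture recognized: stop capturing
        start += step                           # slide the window and keep acquiring
    return None                                 # frame budget reached: prompt the user
```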
According to the gesture recognition method, the electronic equipment can be combined with the event stream of the event camera and the video stream of the camera to perform processing and analysis, the characteristics of the event stream and the characteristics of the video stream are fused, and then gesture classification is performed by using the fused characteristics, so that the accuracy of gesture recognition results in various use scenes is improved.
In one embodiment, to further reduce the processing power consumption of the electronic device, when the user triggers the gesture recognition function of the electronic device or the electronic device starts the gesture recognition function automatically, the electronic device may give the user prompt information for making a gesture; after the prompt information is given, the electronic device starts to acquire the video stream and the event stream to perform gesture recognition. Then, as shown in fig. 9, the above S101 may include:
S1011, under the condition that the preset conditions are met, acquiring the video stream acquired by the camera and the event stream acquired by the event camera.
The preset condition may be that the electronic device displays prompt information, or the electronic device starts a gesture recognition function. When this condition is satisfied, the electronic device can acquire the video stream and the event stream. In one implementation, the electronic device may trigger the camera to capture the video stream and trigger the event camera to capture the event stream if a preset condition is met.
In this embodiment, as can be seen from the above description, the video stream acquired by the camera is composed of frame images, and because the frequency of acquiring images is fixed, the number of frames acquired within a fixed duration is also fixed; for example, 30 frames are always acquired per second. The event stream collected by the event camera consists of a number of event points; because events are reported asynchronously and the pixel locations at which events occur are not fixed, the number of event points collected within a fixed duration is not fixed; for example, 100 event points are collected in the first second, 150 event points in the next second, and so on. In this case, if the collected video stream and event stream were directly input into the neural network for processing according to the sliding window mechanism, the sizes of the event streams processed by the neural network would not be uniform, and a neural network is generally not compatible with input data of different sizes.
In view of this, in the embodiment of the application, after the electronic device aligns the video stream and the event stream by using the above process, the event stream may be preprocessed, that is, as shown in fig. 9, the step S102 may specifically include:
S1021, aligning the video stream and the event stream by adopting a sliding window mechanism, and preprocessing the event stream in the sliding window.
For the process of aligning the video stream and the event stream by using the sliding window mechanism, reference may be made to the description of the above embodiments, which is not repeated here. The electronic device preprocesses the event stream in the sliding window as follows: a fixed number of points (also called a preset number of points) is set, and the event points in the event stream in the sliding window are sampled so that the number of event points in the sliding window reaches the fixed number. The sampling may be up-sampling or down-sampling: when the number of event points in the sliding window is larger than the fixed number, a down-sampling method is used to select the fixed number of event points from them; when the number of event points in the sliding window is smaller than the fixed number, an up-sampling method may be used to fill the event points up to the fixed number.
Taking the sliding window as a window with a constant length as an example, assuming that the length of the sliding window is M, that is, M frames of images are provided in one sliding window, the acquisition time of the M frames of images is t6-t7, and there are F event points in the event stream corresponding to the time period, that is, the electronic device can preprocess the F event points.
In the case that F is greater than the fixed number f of event points, the electronic device needs to down-sample the F event points. In an implementable manner, as shown in fig. 10, the down-sampling process may be as follows: the event points are plotted in a three-dimensional coordinate system (x, y, t), and the F event points corresponding to the time period t6-t7 are located in the rectangular block marked in the figure. The rectangular block can be evenly divided into f small blocks, and one event point is randomly selected from each small block, so that f event points are obtained in total; that is, the fixed number f of points is obtained by down-sampling.
In the case that F is less than the fixed number f of event points, the electronic device needs to up-sample the F event points. In an implementable manner, the up-sampling process may be as follows: new event points are inserted between the existing event points by interpolation until the total number of event points reaches f, that is, the event points are up-sampled to the fixed number f.
It should be noted that the above sampling method is only an example, and the method of upsampling and downsampling is not particularly limited in the embodiment of the present application.
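The following sketch illustrates one possible way to force each window's event stream to a fixed number of points f, in the spirit of the block-wise down-sampling and interpolation up-sampling described above; dividing the window along the time axis only, and the exact interpolation scheme, are simplifying assumptions.

```python
import random

def downsample_events(points, f):
    """Divide the window into f equal slices (here along time only) and keep one
    randomly chosen event from each slice."""
    points = sorted(points, key=lambda p: p["t"])
    slice_size = len(points) / f                 # > 1 because len(points) > f
    return [random.choice(points[int(i * slice_size):int((i + 1) * slice_size)])
            for i in range(f)]

def upsample_events(points, f):
    """Insert interpolated event points between neighbours until f points are reached."""
    points = sorted(points, key=lambda p: p["t"])
    if len(points) < 2:
        return (points * f)[:f]                  # degenerate case: duplicate
    i = 0
    while len(points) < f:
        a, b = points[i], points[i + 1]
        mid = {k: (a[k] + b[k]) / 2 for k in ("x", "y", "t")}
        points.insert(i + 1, mid)
        i = (i + 2) % (len(points) - 1)
    return points

def fix_point_count(points, f):
    """Bring the event points of one sliding window to exactly f points."""
    if len(points) > f:
        return downsample_events(points, f)
    if len(points) < f:
        return upsample_events(points, f)
    return points
```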
After the above processing procedure, the electronic device may process the video stream and the event stream within the sliding window by using a neural network. Because the video stream and the event stream are two different types of data, different networks can be used to process the video stream and the event stream respectively in the embodiment of the present application, and the specific process may include:
S1022, processing the video stream in the sliding window by adopting a first network to obtain a first feature, and processing the event stream in the sliding window by adopting a second network to obtain a second feature.
S1023, fusing the first feature and the second feature to obtain a fused feature.
S1024, determining a gesture recognition result according to the fused feature.
In step S1022, the first network may be a 3D convolutional network, for example, a slowfast network. Illustratively, taking the slowfast network as an example, as shown in fig. 11, the network is a dual-path network for video recognition. One path is responsible for capturing the semantic information provided by an image or a few sparse frames, and operates at a low frame rate and a slow refresh rate; the other path is responsible for capturing rapidly changing motion by operating at a fast refresh rate and a high temporal resolution. The first path may be referred to herein as the slow path and the second path as the fast path; the two paths are driven at different temporal speeds and are finally merged together through lateral connections.
With continued reference to fig. 11, T and C of the slow path serve as the reference: T represents the number of frames sampled (i.e., the temporal resolution), C represents the number of channels, and H and W represent the height and width of the pictures in the video stream. The sampling rate is 1/τ (τ = 16 is generally taken here), so when a video stream of τ × T frames in total is input, the slow path samples T frames. This path compresses the temporal information of the input and can therefore focus more on extracting spatial semantic information.
For fast paths, the number of channels can be small because it is not necessary to capture features of too fine granularity, and can be set to be β C (here, β =1/8 is taken in general); however, since the fast path requires motion information to be captured, which requires a denser frame image, the number of frames sampled is relatively large compared to the slow path, and can be set to α T (where α =8 is generally taken).
As can also be seen from fig. 11, after each block (block), the slow path and the fast path have a unidirectional link (i.e., a lateral connection) from the fast path to the slow path, so that the features extracted from the fast path and the features extracted from the slow path are fused, where the fusion may be a simple summation or concatenation manner; finally, the last block of the slow path and the last block of the fast path may input the features into a classifier (prediction) to obtain an output result of the slowfast network. Then, based on the implementation principle of the network, the electronic device may output the first feature after inputting the video stream in the sliding window into the first network.
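As a small illustration of the sampling ratios just described (τ = 16, α = 8, β = 1/8 being the typical values quoted above), the sketch below shows how the two paths subsample the same input clip; it is not an implementation of the slowfast network itself.

```python
def sample_paths(clip, tau=16, alpha=8):
    """clip: a list of tau*T frames. Returns (slow_frames, fast_frames):
    the slow path keeps T frames (stride tau), the fast path keeps alpha*T frames."""
    t = len(clip) // tau
    slow = clip[::tau][:t]                    # low frame rate, richer channels (C)
    fast = clip[::tau // alpha][:alpha * t]   # high frame rate, fewer channels (beta*C)
    return slow, fast

# Example: a 64-frame clip (T = 4) yields 4 slow frames and 32 fast frames.
slow, fast = sample_paths(list(range(64)))
assert len(slow) == 4 and len(fast) == 32
```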
The second network is a network for processing the event stream, and since the event stream (also referred to as an event cloud) may be composed of a plurality of event points, the second network may be a network based on a point cloud, for example, a PointNet network, or the like. Exemplarily, taking a PointNet network as an example, as shown in fig. 12, a processing procedure of the PointNet network can be roughly divided into three steps, namely, a dimensionality increasing operation, a feature extraction operation, and a global feature fusion operation.
With continued reference to fig. 12, the dimension-raising operation may be performed by a fully connected neural network (MLP network), also called a multi-layer perceptron, which is composed of an input layer, hidden layers and an output layer, where there may be multiple hidden layers, e.g., 5 layers whose neuron sizes are 64, 64, 64, 128 and 1024, respectively. Suppose n event points (denoted n_1, n_2, n_3, ..., n_n) are input into the PointNet network, where each event point has a three-dimensional feature representing the coordinates of the corresponding pixel and the occurrence time of the event. After the dimension-raising processing of the first hidden layer, the feature of each event point can be raised to M_1 dimensions; after the dimension-raising processing of the second hidden layer, the feature of each event point can be raised to M_2 dimensions; and so on, until the feature of each event point is finally raised to M_a dimensions (e.g., 1024 dimensions). The feature obtained for each event point may be referred to as a local feature.
After the dimension-raising operation, n event points of M_a dimensions (for example, n event points of 1024 dimensions) are obtained, and a feature extraction operation may then be performed. The feature extraction operation may be as follows: for the n event points, several features are selected from their M_a-dimensional features to form a global feature of M_b dimensions; for example, one 1024-dimensional feature may be formed. Optionally, the feature extraction operation here may be performed using a max-pooling or mean-pooling method: max-pooling takes the point with the maximum value in the local receptive field, and mean-pooling averages all the values in the local receptive field.
After the feature extraction operation, in order to further optimize the obtained global feature, the global feature may be input into an MLP network. This MLP network is different from the MLP network in the above dimension-raising operation; the MLP network in this step is used to perform fusion processing on the global feature, for example, mapping the global feature to M_c dimensions, to obtain the output result of the PointNet network. Then, based on the implementation principle of this network, the electronic device can output the second feature after inputting the event stream in the sliding window into the second network.
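A much-simplified PointNet-style network for the second branch, following the three steps described above, could look as follows in PyTorch; the per-point layer sizes (64-64-64-128-1024) are taken from the example above, while the output dimension and the use of max-pooling only are assumptions for illustration.

```python
import torch
import torch.nn as nn

class EventPointNet(nn.Module):
    """Toy second network: per-point dimension raising, max-pooling into a global
    feature, and a final MLP that maps the global feature to the output size."""

    def __init__(self, in_dim=3, local_dim=1024, out_dim=256):
        super().__init__()
        # per-point "dimension raising" MLP: 3 -> 64 -> 64 -> 64 -> 128 -> 1024
        self.point_mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, local_dim), nn.ReLU(),
        )
        # MLP that further fuses/maps the pooled global feature
        self.global_mlp = nn.Sequential(
            nn.Linear(local_dim, 512), nn.ReLU(),
            nn.Linear(512, out_dim),
        )

    def forward(self, points):                  # points: (batch, n, 3) = (x, y, t)
        local = self.point_mlp(points)          # (batch, n, 1024) local features
        global_feat, _ = local.max(dim=1)       # max-pooling over the n event points
        return self.global_mlp(global_feat)     # (batch, out_dim) second feature

# Usage: one window pre-processed to a fixed number of 512 event points.
second_feature = EventPointNet()(torch.rand(1, 512, 3))
```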
Alternatively, the resulting first and second features may be represented using a feature matrix (or feature vector).
After obtaining the first feature and the second feature, the electronic device may perform the process of fusing the first feature and the second feature in S1023, and in this embodiment, the electronic device may set weights corresponding to the first feature and the second feature, and perform weighted fusion on the first feature and the second feature according to the respective weights to obtain a fused feature. In one implementation, the attention to the video stream and the event stream is different in different scenes, for example, the attention to the event stream is higher in a scene with fast motion, and the attention to the video stream is higher in a scene close to static or a scene needing attention to local details. Therefore, the embodiment of the application can adopt a feature fusion method based on an attention mechanism to fuse the first feature and the second feature.
Illustratively, as shown in fig. 13, the feature fusion method based on the attention mechanism may include the following process. Assume that the first feature is denoted Z and the second feature is denoted E. First, linear transformations are performed on Z and E to obtain the corresponding Q, K and V matrices, where Q is the Query, K is the Key and V is the Value. Optionally, the Q, K and V matrices corresponding to the first feature Z are respectively Q_Z = Z × W_Q, K_Z = Z × W_K, V_Z = Z × W_V, and the Q, K and V matrices corresponding to the second feature E are respectively Q_E = E × W_Q, K_E = E × W_K, V_E = E × W_V. Here W_Q, W_K and W_V are three trainable parameter matrices; multiplying an input matrix (i.e., the first feature or the second feature) by W_Q, W_K and W_V generates the corresponding Q, K and V matrices.

Then the Q matrices and the K matrices are dot-multiplied to obtain the similarities. Optionally, the similarities corresponding to the first feature Z are R_ZZ = Q_Z × K_Z^T and R_ZE = Q_Z × K_E^T, and the similarities corresponding to the second feature E are R_EE = Q_E × K_E^T and R_EZ = Q_E × K_Z^T. Optionally, the similarities corresponding to the first feature and to the second feature may form a 2 × 2 matrix, for example

[ R_ZZ  R_ZE ]
[ R_EZ  R_EE ]

Next, Softmax is applied to the obtained similarities to obtain the attention weights. Optionally, each similarity may first be divided by √(d_K), where d_K is the dimension of the matrix K, and Softmax is then performed. For example, the attention weights corresponding to the first feature Z are obtained by applying Softmax to R_ZZ/√(d_K) and R_ZE/√(d_K), and the attention weights corresponding to the second feature E are obtained by applying Softmax to R_EZ/√(d_K) and R_EE/√(d_K). The attention weights corresponding to the first feature and to the second feature may form a 2 × 2 matrix, i.e., the matrix obtained by normalizing the similarities, for example

[ S_ZZ  S_ZE ]
[ S_EZ  S_EE ]

where the first row contains the attention weights corresponding to the first feature and the second row contains the attention weights corresponding to the second feature. It can be understood that the sum of S_ZZ and S_ZE is 1, and the sum of S_EE and S_EZ is 1.

After the attention weights are obtained, they can be multiplied by the V matrices to obtain a first fusion feature J_Z and a second fusion feature J_E. Exemplarily, J_Z = S_ZZ × V_Z + S_ZE × V_E and J_E = S_EZ × V_Z + S_EE × V_E. Finally, the first fusion feature J_Z and the second fusion feature J_E are added to obtain the final fused feature G = J_Z + J_E.
For the step of calculating the attention weight, it should be noted that, in the embodiment of the present application, a larger attention weight may be set for a data stream with a higher attention degree in combination with a current scenario. For example, in a scene with fast motion, if the attention degree of the event stream is high, the attention weight corresponding to the second feature is set to be large during normalization; in a scene close to a still or a scene needing to pay attention to local details, the attention degree to the video stream is higher, and the attention weight corresponding to the first feature is set to be larger during normalization. Therefore, the accuracy of the subsequent gesture recognition result can be improved.
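A minimal sketch of this attention-based fusion, following the Q/K/V formulation above, is shown below; the feature dimensions and the random parameter matrices are placeholders, and in practice W_Q, W_K and W_V would be trained together with the rest of the network.

```python
import torch

def attention_fuse(z, e, w_q, w_k, w_v):
    """z, e: first and second features of shape (d,); w_q/w_k/w_v: (d, d_k) matrices."""
    x = torch.stack([z, e])                   # 2 x d, row 0 = Z, row 1 = E
    q, k, v = x @ w_q, x @ w_k, x @ w_v       # Q, K, V for both features
    r = q @ k.T                               # 2 x 2 similarities [[R_ZZ, R_ZE], [R_EZ, R_EE]]
    s = torch.softmax(r / k.shape[-1] ** 0.5, dim=-1)   # row-wise attention weights
    j = s @ v                                 # row 0 = J_Z, row 1 = J_E
    return j[0] + j[1]                        # fused feature G = J_Z + J_E

d, d_k = 256, 64
fused = attention_fuse(torch.rand(d), torch.rand(d),
                       torch.rand(d, d_k), torch.rand(d, d_k), torch.rand(d, d_k))
```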
After obtaining the fusion feature, the electronic device may execute the process of determining the gesture recognition result according to the fusion feature in S1024, and optionally, in this embodiment, the neural network may still be used, for example, the classification network is used to process the fusion feature, so as to obtain the probability of each gesture class, for example, the probability of the class 1 is P1, the probability of the class 2 is P2, and the probability of the class 3 is P3. If the probability of a certain gesture category is greater than or equal to a preset probability threshold, the current gesture category can be determined.
It is understood that the electronic device may perform the above processing procedure on the video stream and the event stream in each sliding window respectively until a gesture recognition result is obtained.
According to the gesture recognition method, the electronic equipment can be combined with the event stream of the event camera and the video stream of the camera for processing and analyzing, the features of the event stream and the features of the video stream are fused for gesture recognition, and particularly the accuracy of gesture recognition in each scene can be improved to a great extent by combining the feature fusion method of the attention mechanism.
Based on the processing procedure of the foregoing embodiment and the software structure of the electronic device shown in fig. 4, the following describes a gesture recognition method provided by the present application with reference to an embodiment, as shown in fig. 14, where the method may include:
S201, under the condition that the preset conditions are met, the camera collects the video stream, and the event camera collects the event stream.
S202, the camera sends video streams to the processing module, and the event camera sends event streams to the processing module.
In one implementation, the processing module may be a module located at an application layer, and in another implementation, the processing module may also be a module located at another software layer or a hardware layer, for example, a software module integrated in a processor, and so on.
S203, the processing module processes the video stream and the event stream to determine a gesture recognition result.
The processing module may process the video stream and the event stream by using the steps of S1022 to S1024, and the specific process is not described herein again.
S204, the processing module sends the gesture recognition result to the service application.
S205, the service application executes a corresponding operation according to the gesture recognition result.
The service application is an application in the application program layer, and may be an application that currently needs to use the gesture recognition function. For example, in the case that the service application is a camera application, if the gesture recognition result is a palm gesture, the camera starts the shooting function; in the case that the service application is a game application, if the gesture recognition result is an "OK" gesture, the game application starts the game.
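Purely as an illustration of this dispatch step, a hypothetical mapping from (service application, gesture recognition result) to an application action might look like the following; the application names, gesture labels and handler methods are invented for the example.

```python
# Hypothetical mapping from a gesture recognition result to a service-application action.
GESTURE_ACTIONS = {
    ("camera", "palm"): lambda app: app.start_shooting(),
    ("game", "ok"): lambda app: app.start_game(),
}

def dispatch(service_name, gesture, app):
    """Invoke the action registered for this application/gesture pair, if any."""
    action = GESTURE_ACTIONS.get((service_name, gesture))
    if action is not None:
        action(app)
```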
According to the gesture recognition method, the electronic equipment can be used for processing and analyzing the event stream of the event camera and the video stream of the camera, and the features of the event stream and the features of the video stream are fused to perform gesture recognition, so that the accuracy of gesture recognition in each scene can be improved.
Examples of the gesture recognition method provided by the embodiments of the present application are described above in detail. It will be appreciated that the electronic device, in order to implement the above-described functions, comprises corresponding hardware and/or software modules for performing the respective functions. Those skilled in the art will readily appreciate that the various illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or as a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends upon the particular application and design constraints of the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiment of the present application, the electronic device may be divided into the functional modules according to the method example, for example, the functional modules may be divided into the functional modules corresponding to the functions, such as the detection unit, the processing unit, the display unit, and the like, or two or more functions may be integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. It should be noted that, in the embodiment of the present application, the division of the module is schematic, and is only one logic function division, and there may be another division manner in actual implementation.
It should be noted that all relevant contents of each step related to the above method embodiment may be referred to the functional description of the corresponding functional module, and are not described herein again.
The electronic device provided by the embodiment is used for executing the gesture recognition method, so that the same effect as the effect of the implementation method can be achieved.
In case of an integrated unit, the electronic device may further comprise a processing module, a storage module and a communication module. The processing module can be used for controlling and managing the action of the electronic equipment. The memory module can be used to support the electronic device in executing stored program codes and data, etc. The communication module can be used for supporting the communication between the electronic equipment and other equipment.
The processing module may be a processor or a controller. It may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. A processor may also be a combination for realizing computing functions, e.g., a combination comprising one or more microprocessors, or a combination of a digital signal processor (DSP) and a microprocessor, or the like. The storage module may be a memory. The communication module may specifically be a radio frequency circuit, a Bluetooth chip, a Wi-Fi chip, or another device that interacts with other electronic devices.
In an embodiment, when the processing module is a processor and the storage module is a memory, the electronic device according to this embodiment may be a device having a structure shown in fig. 3.
The embodiment of the present application further provides a computer-readable storage medium, in which a computer program is stored, and when the computer program is executed by a processor, the processor is caused to execute the gesture recognition method of any one of the above embodiments.
The embodiment of the present application further provides a computer program product, which when running on a computer, causes the computer to execute the above related steps to implement the gesture recognition method in the above embodiment.
In addition, embodiments of the present application also provide an apparatus, which may be specifically a chip, a component or a module, and may include a processor and a memory connected to each other; the memory is used for storing computer execution instructions, and when the device runs, the processor can execute the computer execution instructions stored in the memory, so that the chip can execute the gesture recognition method in the above method embodiments.
The electronic device, the computer-readable storage medium, the computer program product, or the chip provided in this embodiment are all configured to execute the corresponding method provided above, and therefore, the beneficial effects that can be achieved by the electronic device, the computer-readable storage medium, the computer program product, or the chip may refer to the beneficial effects in the corresponding method provided above, and are not described herein again.
Through the description of the above embodiments, those skilled in the art will understand that, for convenience and simplicity of description, only the division of the above functional modules is used as an example, and in practical applications, the above function distribution may be completed by different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the above described functions.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative; for example, the division of modules or units is only a logical function division, and there may be other division manners in actual implementation; for example, a plurality of units or components may be combined or integrated into another apparatus, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
Units described as separate parts may or may not be physically separate, and parts displayed as units may be one physical unit or a plurality of physical units, may be located in one place, or may be distributed to a plurality of different places. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented as a software functional unit and sold or used as a separate product, may be stored in a readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially or partially contributed to by the prior art, or all or part of the technical solutions may be embodied in the form of a software product, where the software product is stored in a storage medium and includes several instructions to enable a device (which may be a single chip, a chip, or the like) or a processor (processor) to execute all or part of the steps of the methods of the embodiments of the present application. And the aforementioned storage medium includes: a variety of media that can store program codes, such as a usb disk, a removable hard disk, a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (13)

1. A gesture recognition method, performed by an electronic device comprising a camera and an event camera, the method comprising:
acquiring a video stream acquired by a camera and an event stream acquired by an event camera;
and fusing a first video stream and a first event stream with the same acquisition time, and determining a gesture recognition result when a hand of a user faces the camera, wherein the first video stream is a part of the video stream acquired by the camera, and the first event stream is a part of the event stream acquired by the event camera.
2. The method according to claim 1, wherein the fusing the first video stream and the first event stream with the same acquisition time to determine the gesture recognition result when the user's hand is facing the camera comprises:
and fusing the first video stream and the first event stream with the same acquisition time by adopting a neural network, and determining the gesture recognition result.
3. The method according to claim 2, wherein the neural network comprises a first network and a second network, and the determining the gesture recognition result by performing a fusion process on a first video stream and a first event stream with the same acquisition time by using the neural network comprises:
processing the first video stream by using the first network to obtain a first characteristic, and processing the first event stream by using the second network to obtain a second characteristic;
fusing the first feature and the second feature to obtain a fused feature;
and determining the gesture recognition result according to the fusion characteristics.
4. The method of claim 3, wherein said fusing the first feature and the second feature to obtain a fused feature comprises:
and fusing the first feature and the second feature based on an attention mechanism to obtain the fused feature.
5. The method according to any one of claims 1 to 4, wherein before the merging the first video stream and the first event stream with the same acquisition time, the method further comprises:
and aligning the video stream acquired by the camera with the event stream acquired by the event camera, and determining a first video stream and a first event stream with the same acquisition time.
6. The method of claim 5, wherein the data aligning the video stream captured by the camera and the event stream captured by the event camera, and determining the first video stream and the first event stream with the same capture time comprises:
acquiring a part of video stream collected by the camera through a sliding window to serve as the first video stream;
and acquiring a partial event stream of which the occurrence time of the event is the first time from the event stream acquired by the event camera as the first event stream according to the first time of acquiring the first video stream.
7. The method of claim 6, wherein after determining the first event stream, the method further comprises:
and performing up-sampling operation or down-sampling operation on the event points in the first event stream to enable the number of the event points in the first event stream to reach a preset number.
8. The method according to any one of claims 1 to 4, wherein the fusing the first video stream and the first event stream with the same acquisition time to determine the gesture recognition result when the user hand faces the camera comprises:
performing fusion processing on the first video stream and the first event stream, and determining the probability of each preset gesture category;
and if the probability of the first gesture category in the probabilities of all the gesture categories is larger than or equal to a preset probability threshold value, taking the first gesture category as the gesture recognition result.
9. The method according to any one of claims 1 to 4, further comprising:
and if the frame number of the video stream acquired by the camera reaches a preset frame number threshold value, but the gesture recognition result is still not determined, outputting gesture recognition failure information.
10. The method of any of claims 1 to 4, wherein the acquiring a video stream captured by a camera and an event stream captured by an event camera comprises:
and under the condition that a preset condition is met, acquiring the video stream acquired by the camera and the event stream acquired by the event camera, wherein the preset condition represents that the electronic equipment starts a gesture recognition function.
11. The method of claim 3 or 4, wherein the first network is a 3D convolutional network and the second network is a point cloud based network.
12. An electronic device, comprising:
one or more processors;
one or more memories;
the memory stores one or more programs that, when executed by the processor, cause the electronic device to perform the method of any of claims 1-11.
13. A computer-readable storage medium, in which a computer program is stored which, when executed by a processor, causes the processor to carry out the method of any one of claims 1 to 11.
CN202211577422.6A 2022-12-09 2022-12-09 Gesture recognition method and electronic equipment Active CN115661941B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211577422.6A CN115661941B (en) 2022-12-09 2022-12-09 Gesture recognition method and electronic equipment

Publications (2)

Publication Number Publication Date
CN115661941A true CN115661941A (en) 2023-01-31
CN115661941B CN115661941B (en) 2023-06-09

Family

ID=85017425

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211577422.6A Active CN115661941B (en) 2022-12-09 2022-12-09 Gesture recognition method and electronic equipment

Country Status (1)

Country Link
CN (1) CN115661941B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113920354A (en) * 2021-09-14 2022-01-11 杭州电子科技大学 Action recognition method based on event camera
CN114332911A (en) * 2021-11-18 2022-04-12 湖北大学 Head posture detection method and device and computer equipment
CN114979465A (en) * 2022-04-19 2022-08-30 荣耀终端有限公司 Video processing method, electronic device and readable medium
CN115147926A (en) * 2022-06-30 2022-10-04 恒玄科技(上海)股份有限公司 Gesture recognition method and head-mounted equipment

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117576784A (en) * 2024-01-15 2024-02-20 吉林大学 Method and system for recognizing diver gesture by fusing event and RGB data
CN117576784B (en) * 2024-01-15 2024-03-26 吉林大学 Method and system for recognizing diver gesture by fusing event and RGB data

Also Published As

Publication number Publication date
CN115661941B (en) 2023-06-09

Similar Documents

Publication Publication Date Title
CN113453040B (en) Short video generation method and device, related equipment and medium
CN111738122B (en) Image processing method and related device
WO2021104485A1 (en) Photographing method and electronic device
US20210358523A1 (en) Image processing method and image processing apparatus
WO2021258797A1 (en) Image information input method, electronic device, and computer readable storage medium
CN113170037B (en) Method for shooting long exposure image and electronic equipment
CN114816610B (en) Page classification method, page classification device and terminal equipment
CN113536866A (en) Character tracking display method and electronic equipment
CN116048933B (en) Fluency detection method
CN113066048A (en) Segmentation map confidence determination method and device
CN115661941B (en) Gesture recognition method and electronic equipment
CN115115679A (en) Image registration method and related equipment
CN113538227A (en) Image processing method based on semantic segmentation and related equipment
CN115964231A (en) Load model-based assessment method and device
CN114911400A (en) Method for sharing pictures and electronic equipment
CN114359335A (en) Target tracking method and electronic equipment
CN114943976B (en) Model generation method and device, electronic equipment and storage medium
CN114077529B (en) Log uploading method and device, electronic equipment and computer readable storage medium
WO2022143314A1 (en) Object registration method and apparatus
EP4191987A1 (en) Contactless operation method and apparatus, server, and electronic device
WO2022078116A1 (en) Brush effect picture generation method, image editing method and device, and storage medium
CN114970576A (en) Identification code identification method, related electronic equipment and computer readable storage medium
CN114793283A (en) Image encoding method, image decoding method, terminal device, and readable storage medium
WO2024082914A1 (en) Video question answering method and electronic device
CN117635466B (en) Image enhancement method, device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant