CN116912950A - Identification method, head-mounted device and storage medium - Google Patents

Identification method, head-mounted device and storage medium

Info

Publication number
CN116912950A
Authority
CN
China
Prior art keywords
image
recognition
gesture recognition
gesture
voice
Prior art date
Legal status
Pending
Application number
CN202311170299.0A
Other languages
Chinese (zh)
Inventor
李林峰
黄海荣
Current Assignee
Hubei Xingji Meizu Technology Co ltd
Original Assignee
Hubei Xingji Meizu Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hubei Xingji Meizu Technology Co ltd
Priority to CN202311170299.0A
Publication of CN116912950A
Legal status: Pending

Classifications

    • G06V 40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G06F 3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G06F 3/16 Sound input; Sound output
    • G06V 10/10 Image acquisition
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application discloses an identification method, a head-mounted device and a storage medium, belonging to the technical field of artificial intelligence. The identification method of the embodiment of the application comprises the following steps: displaying a preset mark point and acquiring an image based on a camera, wherein the preset mark point is used to guide the user to move the camera so that the preset mark point coincides with the recognition target; segmenting a region of interest from the acquired first image based on the preset mark point; and performing target recognition on the region of interest to obtain a recognition result of the recognition target. The identification method, head-mounted device and storage medium provided by the application can accurately capture the recognition target and realize more accurate and reliable target recognition.

Description

Identification method, head-mounted device and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to an identification method, a head-mounted device, and a storage medium.
Background
With the development of artificial intelligence technology, target recognition technology is gradually being applied in daily life. For example, intelligent terminals such as smart glasses and smartphones can be configured with target recognition technology to provide photo-based object recognition functions.
Existing photo-based object recognition relies on the user manually calibrating the captured photo. For example, after shooting is finished, the user can specify the region to be recognized in the photo through the touch screen, thereby assisting the intelligent terminal in determining the target to be recognized; however, this additional operation degrades the user experience, and accidental touches affect the accuracy of object recognition.
Disclosure of Invention
The application provides an identification method, a head-mounted device and a storage medium, which are used to solve the problems that photo-based recognition requires manual calibration by the user, which affects the user experience and leads to low recognition accuracy.
In a first aspect, the present application provides an identification method, including:
displaying preset mark points, and collecting images based on a camera; the preset mark point positions are used for guiding a user to move the camera so as to enable the preset mark point positions to coincide with the identification targets;
based on the preset mark point positions, a region of interest is segmented from the acquired first image;
and carrying out target recognition on the region of interest to obtain a recognition result of the recognition target.
In some embodiments, the segmenting the region of interest from the acquired first image based on the preset marker point location includes:
Performing gesture recognition on the first image to obtain a gesture recognition result;
and based on the preset mark point positions and the gesture recognition result, segmenting a region of interest from the first image.
In some embodiments, the performing gesture recognition on the first image to obtain a gesture recognition result includes:
based on a gesture recognition model, carrying out hand gesture recognition and finger direction recognition on the first image to obtain a gesture recognition result and a pointing recognition result of the first image;
determining the gesture recognition result based on the gesture recognition result and the pointing recognition result;
the gesture recognition model is obtained based on a sample image and training of a gesture label and a pointing label of the sample image.
In some embodiments, the gesture recognition model includes a backbone network, and gesture recognition branches and pointing recognition branches respectively connected to the backbone network;
the step of performing hand gesture recognition and finger direction recognition on the first image based on the gesture recognition model to obtain a gesture recognition result and a pointing recognition result of the first image comprises the following steps:
based on the backbone network, extracting the characteristics of the first image to obtain image characteristics;
based on the gesture recognition branch, carrying out hand gesture recognition on the image characteristics to obtain a gesture recognition result;
and based on the pointing identification branch, carrying out finger direction identification on the image characteristic to obtain the pointing identification result.
In some embodiments, the segmenting the region of interest from the first image based on the preset marker point location and the gesture recognition result includes:
under the condition that the gesture recognition result is a non-pointing gesture, based on the preset mark point position, a region of interest is segmented from the acquired first image;
and under the condition that the gesture recognition result is that a pointing gesture exists, a region of interest is segmented from the first image based on the position and the pointing direction of the pointing gesture and the preset mark point position.
In some embodiments, the displaying the preset mark point location and performing image acquisition based on the camera includes:
acquiring a first voice, and carrying out instruction recognition on the first voice to obtain a voice instruction;
and displaying a preset mark point position under the condition that the voice instruction is a query instruction, and acquiring an image based on a camera.
In some embodiments, the obtaining the first voice and performing instruction recognition on the first voice to obtain a voice instruction includes:
acquiring a second voice, and detecting wake-up words of the second voice;
under the condition that the wake-up word exists in the second voice, acquiring a first voice, and carrying out instruction recognition on the first voice to obtain a voice instruction.
In a second aspect, the application provides a head-mounted device, which comprises a head-mounted device body, and a camera, a display assembly and a processor which are arranged on the head-mounted device body, wherein the camera and the display assembly are respectively connected with the processor;
the processor is configured to perform the method as set forth in any one of the above;
the camera is used for carrying out image acquisition under the control of the processor, and the display component is used for displaying preset marking points under the control of the processor.
In a third aspect, the application provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method as described in any of the above.
In a fourth aspect, the application provides a computer program product comprising a computer program which, when executed by a processor, implements a method as described in any of the above.
According to the identification method, the head-mounted device and the storage medium provided by the application, the user is guided to adjust the acquisition viewing angle of the camera by displaying the preset mark point, so that the recognition target contained in the image acquired by the camera can coincide with the preset mark point. On this basis, the region of interest segmented from the acquired image is ensured to contain the recognition target desired by the user, so the recognition target can be accurately captured without the user participating in calibration, and more accurate and reliable target recognition can be realized.
Drawings
In order to more clearly illustrate the application or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the application, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of an identification method according to an embodiment of the present application;
FIG. 2 is a flow chart of a method for segmenting a first image according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a gesture recognition model according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a voice wake model according to an embodiment of the present application;
FIG. 5 is a second flow chart of an identification method according to an embodiment of the application;
FIG. 6 is a schematic diagram of a command word recognition model according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a headset according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms "first", "second" and the like in the description and in the claims are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances, so that the embodiments of the application are capable of operating in sequences other than those illustrated or described herein, and that "first" and "second" generally distinguish objects of one type without limiting their number; for example, there may be one or more first objects. Furthermore, in the description and claims, "and/or" means at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
Photo-based object recognition relies on the user manually calibrating the captured photo. The manual calibration of the target to be recognized in the photo can be completed during shooting or after shooting is finished.
For example, during shooting, the user can point at the target to be recognized while taking the photo, and the intelligent terminal then locates the target through gesture recognition; however, if it is inconvenient for the user to perform the gesture, calibration will fail. As another example, after shooting is finished, the user can specify the region to be recognized in the photo through the touch screen, thereby assisting the intelligent terminal in determining the target to be recognized, but this additional operation degrades the user experience.
Therefore, the application provides an identification method, a head-mounted device and a storage medium, in which the user is guided to move the camera by a preset mark point, so that the recognition target falls within the region of interest and can be conveniently located for target recognition. The application can realize target recognition without manual calibration by the user.
In some embodiments, the terminals (terminal devices) include various handheld devices, vehicle-mounted devices, wearable devices, computing devices, or other processing devices connected to a wireless modem, such as cell phones, tablets, desktops, notebooks, and smart devices that can run applications, including the central console of a smart car, etc. Specifically, a terminal may refer to a user equipment (UE), an access terminal, a subscriber unit, a subscriber station, a mobile station, a remote terminal, a mobile device, a user terminal, a wireless communication device, a user agent, or user equipment. The terminal device may also be a satellite phone, a cellular phone, a smart phone, a wireless data card, a wireless modem, a machine type communication device, a cordless phone, a session initiation protocol (SIP) phone, a wireless local loop (WLL) station, a personal digital assistant (PDA), a handheld device with wireless communication capabilities, a computing device or other processing device connected to a wireless modem, a vehicle-mounted device or a wearable device, a virtual reality (VR) terminal device, an augmented reality (AR) terminal device, a wireless terminal in industrial control, a wireless terminal in self-driving, a wireless terminal in telemedicine, a wireless terminal in a smart grid, a wireless terminal in transportation safety, a wireless terminal in a smart city, a wireless terminal in a smart home, a terminal in a 5G network or a future evolved communication network, etc. The terminal may be powered by a battery, or may be attached to and powered by the power system of a vehicle or vessel. The power supply system of the vehicle or vessel may also charge the battery of the terminal to extend the communication time of the terminal.
Fig. 1 is a schematic flow chart of an identification method according to an embodiment of the application. As shown in fig. 1, there is provided an identification method applicable to the above terminal, the identification method including the steps of: step 110, step 120, step 130. The method flow steps are only one possible implementation of the application.
Step 110, displaying preset mark points, and collecting images based on a camera; the preset mark point location is used for guiding a user to move the camera so as to enable the preset mark point location to coincide with the identification target.
Specifically, the terminal for executing the identification method provided by the embodiment of the present application may be a terminal configured with a camera, for example, a smart phone, smart glasses, a tablet pc, or the like, or may be a terminal connected to an external camera, which is not particularly limited in the embodiment of the present application.
For example, the smart glasses may be AR (augmented reality) glasses, VR (virtual reality) glasses, MR (mixed reality) glasses, or the like. The smart glasses may be self-contained with computing and application execution capabilities, such as having a processor, memory, and operating system to support the running of applications to perform the identification method of the present application; the intelligent glasses can also be provided with only display capability, and the identification method of the application can be executed by connecting other terminals with operation and application execution capability, such as mobile phones, tablet computers and the like, so as to support the running of the application.
For example, the external camera may be connected to the terminal by wireless communication such as bluetooth or Wi-Fi, or may be connected to the terminal by wired communication such as USB data line.
Under the condition that the camera is determined to collect images, preset mark points can be displayed in the display view of the terminal, so that a user is guided to move the camera to adjust the collection view angle of the camera, and identification targets contained in images collected by the camera can be overlapped with the displayed preset mark points.
For example, in the case of wearing smart glasses with cameras, a preset mark point location may be displayed in a display field of the smart glasses, thereby guiding a user to move the head to adjust an acquisition view angle of the cameras.
For example, in the case of wearing a smart glasses without a camera or using a terminal such as a mobile phone, a preset mark point location may be displayed on a display field of the smart glasses or a screen of the mobile phone, thereby guiding a user to move an externally connected camera to adjust a capture viewing angle of the camera.
For example, the adjustment of the camera may also be automatic: by mounting the camera on a drive motor, gimbal or other mechanism, the camera can be flexibly adjusted according to the preset mark point.
For example, the preset mark points may be dots, boxes, asterisks, or other shaped image identifications.
In order to visually achieve the guiding effect of the preset mark point, the preset mark point may be displayed on the same screen as the image acquired by the camera and displayed at a preset position in the image. The preset position here, that is, the preset display position of the preset marking point position, may be, for example, a center position of the image, or the preset position may be any position set by the user, for example, an upper left area and a lower right area, or the preset position may be a position that can reflect the acquisition preference of the user based on the position statistics of the identification target manually calibrated by the user in the earlier stage, which is not particularly limited in the embodiment of the present application.
For example, the preset mark point may be highlighted, displayed blinking, or otherwise emphasized to prompt the user.
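For illustration only, a minimal sketch of overlaying such a mark point on a camera preview is given below; it assumes OpenCV is used for preview rendering, which the application does not specify:

```python
import cv2

def draw_preset_mark(frame, point=None, color=(0, 0, 255)):
    """Overlay a preset mark point on a camera frame.

    `point` defaults to the frame centre; any other preset position
    (e.g. a user-configured offset) can be passed instead.
    """
    h, w = frame.shape[:2]
    if point is None:
        point = (w // 2, h // 2)          # default: centre of the image
    cv2.drawMarker(frame, point, color,
                   markerType=cv2.MARKER_CROSS,
                   markerSize=24, thickness=2)
    return frame

# Hypothetical usage with a capture loop:
# cap = cv2.VideoCapture(0)
# ok, frame = cap.read()
# cv2.imshow("preview", draw_preset_mark(frame))
```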
When a user observes an image acquired by a camera, the user can observe a preset marking point position displayed on the same screen as the image. Based on the method, a user can automatically judge whether the identification target in the currently acquired image coincides with the preset mark point position or not, and further when the identification target in the currently acquired image coincides with the preset mark point position, the adjustment of the acquisition visual angle of the camera is realized by moving the camera until the identification target in the latest acquired image coincides with the preset mark point position.
For example, suppose there are stationery and a bouquet in front of the user, and both can appear in the image under the camera's current acquisition viewing angle. If the user takes the bouquet as the recognition target and intends to recognize its variety through the terminal, the user can observe the positional relationship between the bouquet and the preset mark point in the image acquired by the camera, and finally, by moving the camera, make the bouquet in the acquired image coincide with the preset mark point.
And 120, segmenting a region of interest from the acquired first image based on the preset mark point positions.
Specifically, after the camera completes the acquisition, the acquired image may be recorded as a first image. The first image here may be an image acquired with the camera by the user after determining that the recognition target coincides with the preset mark point. That is, the terminal assumes by default that the recognition target in the first image coincides with the preset mark point.
In this case, the terminal may segment the region of interest including the recognition target from the first image based on the position of the preset marker point in the first image. Here, the first image may be segmented with a preset marker point as a center and a preset size as a size of the region of interest, and the segmented image region may be referred to as the region of interest. Or, the first image may be divided into a plurality of regions in advance, and a region in which the preset marking point is located is selected from the plurality of regions and is recorded as the region of interest.
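As an informal sketch of the first option above (a fixed-size crop centred on the preset mark point, clamped to the image borders), assuming NumPy image arrays and an illustrative window size:

```python
import numpy as np

def crop_roi(image: np.ndarray, mark_point: tuple, roi_size: tuple = (224, 224)) -> np.ndarray:
    """Cut a region of interest of `roi_size` (h, w) centred on `mark_point` (x, y)."""
    img_h, img_w = image.shape[:2]
    roi_h, roi_w = roi_size
    cx, cy = mark_point
    # Clamp the window so it stays fully inside the image.
    x0 = int(np.clip(cx - roi_w // 2, 0, max(img_w - roi_w, 0)))
    y0 = int(np.clip(cy - roi_h // 2, 0, max(img_h - roi_h, 0)))
    return image[y0:y0 + roi_h, x0:x0 + roi_w]
```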
It can be understood that performing image segmentation based on the preset mark point ensures that the segmented region of interest, that is, the region of the first image in which the user desires to perform target recognition, contains the recognition target that the user has implicitly designated by moving the camera.
For example, the first image shot by the camera includes stationery and bouquet, and the bouquet coincides with the preset mark point, so that the region of interest obtained by segmentation based on the preset mark point includes the bouquet, and the image region including the stationery, which may interfere with target recognition, is filtered.
And 130, carrying out target recognition on the region of interest to obtain a recognition result of the recognition target.
Specifically, after the region of interest (ROI) is obtained, object recognition can be performed on the region of interest, thereby obtaining a recognition result for the recognition target desired by the user within that region.
For example, the target recognition is performed on the region of interest containing the bouquet, the bouquet in the region of interest can be recognized, and the corresponding recognition result can include the variety of the bouquet. In addition, the corresponding recognition result can also contain the related introduction of the bouquet variety. In addition, particularly when the identification result is displayed, the source of the related introduction in the identification result or the purchase link of the variety in the identification result may be displayed, which is not particularly limited in the embodiment of the present application.
According to the embodiment of the application, the user is guided to adjust the acquisition viewing angle of the camera by displaying the preset mark point, so that the recognition target contained in the image acquired by the camera can coincide with the preset mark point. On this basis, the region of interest segmented from the acquired image is ensured to contain the recognition target desired by the user, so the recognition target can be accurately captured without the user participating in calibration, and more accurate and reliable target recognition is realized.
It should be noted that each embodiment of the present application may be freely combined, exchanged in order, or separately executed, and does not need to rely on or rely on a fixed execution sequence.
In some embodiments, fig. 2 is a flowchart of a first image segmentation method according to an embodiment of the present application, as shown in fig. 2, step 120 includes:
and step 121, performing gesture recognition on the first image to obtain a gesture recognition result.
Step 122, based on the preset mark point location and the gesture recognition result, segmenting the region of interest from the first image.
Specifically, considering that recognition targets occupy areas of different sizes in the first image, the user cannot necessarily ensure that the recognition target completely coincides with the preset mark point when shooting with the camera; it is possible that only an edge of the recognition target coincides with the preset mark point. Thus, although the preset mark point can provide guidance for locating the region of interest, a region of interest determined from the preset mark point alone may cut through the recognition target in the first image, that is, the region of interest may not contain the complete recognition target.
Aiming at the problem, the embodiment of the application combines the preset mark point positions and the gesture recognition result to realize more accurate region of interest segmentation.
That is, the user can also point to the recognition target when image acquisition is performed by the camera. The first image thus acquired may contain, in addition to the recognition target itself, a hand area pointed by the user at the recognition target. Thus, after the first image is obtained, gesture recognition can be performed on the first image, so that a gesture recognition result is obtained. The gesture recognition result is used for representing whether a gesture of the pointing gesture exists in the first image, and in the case of the gesture of the pointing gesture, the specific pointing direction of the pointing gesture can be further represented.
Based on this, when the region of interest is segmented, the preset mark point and the gesture recognition result may both be taken into account. Specifically, when the gesture recognition result indicates that no pointing gesture exists, the preset mark point can be applied directly to segment the region of interest; when the gesture recognition result indicates that a pointing gesture exists, the region of interest may be segmented based on the pointing direction of the pointing gesture, the position of the pointing gesture and the position of the preset mark point. For example, the position of one vertex of the region of interest may be determined from the position of the pointing gesture, and the position of the opposite vertex may be determined from the pointing direction, thereby segmenting a region of interest that contains the preset mark point.
In the embodiment of the application, the region of interest is segmented by combining the preset mark point positions and the gesture recognition result, which is beneficial to improving the reliability of the segmentation of the region of interest, thereby improving the probability of the region of interest containing the complete recognition target and further ensuring the reliability of target recognition.
In embodiments of the present application, although not explicitly described, a variety of machine learning techniques may support the segmentation of regions of interest and object recognition, such as OpenCV, the R-CNN series, the YOLO series, SSD, RetinaNet, and the like.
In some embodiments, step 121 comprises:
based on a gesture recognition model, carrying out hand gesture recognition and finger direction recognition on the first image to obtain a gesture recognition result and a pointing recognition result of the first image;
determining the gesture recognition result based on the gesture recognition result and the pointing recognition result;
the gesture recognition model is obtained based on a sample image and training of a gesture label and a pointing label of the sample image.
Specifically, gesture recognition based on the first image may be achieved through a pre-trained gesture recognition model. The gesture recognition model is input with the first image, and output with the gesture recognition result and the pointing recognition result of the first image.
After the input first image is acquired, the gesture recognition model may perform gesture recognition on the first image to acquire a hand gesture in the first image, and further determine whether the hand gesture in the first image is a pointing gesture, thereby obtaining and outputting a gesture recognition result. The gesture recognition result may be a pointing gesture or a non-pointing gesture, or may be a left-finger gesture, a right-finger gesture, or a non-pointing gesture.
In addition, after the gesture recognition model acquires the input first image, finger direction recognition can be performed on the first image to obtain the finger direction in the first image, thereby producing and outputting a pointing recognition result. The pointing recognition result may be one of a plurality of preset directions, or may be "other", indicating no clear direction. The preset directions may be of 8 types, such as up, down, left, right, upper left, upper right, lower left and lower right, or of 4 types, such as upper left, upper right, lower left and lower right, which is not particularly limited in the embodiment of the present application.
The gesture recognition model may be trained prior to execution of step 121, the gesture recognition model may be obtained by supervised training, where the samples to which the supervised training is applied include sample images, and the sample images are pre-labeled with gesture tags and pointing tags. It will be appreciated that the pose tag of the sample image is used to indicate whether a pointing pose, or in particular a left or right finger pose, exists in the sample image; the pointing tag of the sample image is used to indicate the direction of the finger in the sample image. In the training process, a sample image can be input into an initial model to obtain a sample gesture recognition result and a sample pointing recognition result, which are obtained by carrying out hand gesture recognition and finger direction recognition on the initial model by aiming at the sample image, so that the sample gesture recognition result and the gesture label are compared, the sample pointing recognition result and the pointing label are compared, a loss function is determined, parameter iteration is carried out on the initial model based on the loss function, and the initial model with the completed parameter iteration is recorded as a gesture recognition model.
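The training procedure described above could be sketched roughly as follows; this assumes a PyTorch model that returns pose logits and pointing logits for a batch of sample images, and uses an illustrative combined cross-entropy loss (the actual loss function and optimiser are not specified by the application):

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, images, pose_labels, dir_labels, dir_weight=1.0):
    """One parameter-iteration step for the two-headed gesture recognition model.

    `model(images)` is assumed to return (pose_logits, dir_logits); the labels
    correspond to the pose tags and pointing tags of the sample images.
    """
    ce = nn.CrossEntropyLoss()
    pose_logits, dir_logits = model(images)
    loss = ce(pose_logits, pose_labels) + dir_weight * ce(dir_logits, dir_labels)
    optimizer.zero_grad()
    loss.backward()          # back-propagate the combined loss
    optimizer.step()         # update the shared backbone and both branches
    return loss.item()
```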
It will be appreciated that in the gesture recognition model, the hand gesture recognition and finger direction recognition for the first image are performed separately. Thus, after the gesture recognition result and the pointing recognition result are obtained, the gesture recognition result can be determined by comparing the gesture recognition result and the pointing recognition result, and the reliability of gesture recognition is further ensured.
Further, only when the gesture recognition result is a pointing gesture and the pointing recognition result is a specific direction rather than "other" does the terminal determine that a gesture pointing to the recognition target exists in the first image, that is, that the gesture recognition result is that a pointing gesture exists. In the case that the gesture recognition result is a pointing gesture but the pointing recognition result is "other", or the gesture recognition result is a non-pointing gesture while the pointing recognition result is a specific direction, or the gesture recognition result is a non-pointing gesture and the pointing recognition result is "other", that is, as long as at least one of the two results indicates that no gesture is directed at the recognition target, the terminal determines that no gesture pointing to the recognition target exists in the first image, that is, that the gesture recognition result is that no pointing gesture exists.
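This decision rule can be expressed compactly; the label strings below are illustrative assumptions rather than the application's actual outputs:

```python
def has_pointing_gesture(pose_result: str, direction_result: str) -> bool:
    """Combine the two branch outputs into the final gesture recognition result.

    Only a pointing pose together with a concrete direction counts as
    "pointing gesture exists"; every other combination is treated as
    "no pointing gesture".
    """
    pointing_poses = {"left_finger", "right_finger", "pointing"}
    concrete_directions = {"up", "down", "left", "right",
                           "upper_left", "upper_right",
                           "lower_left", "lower_right"}
    return pose_result in pointing_poses and direction_result in concrete_directions
```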
In the embodiment of the application, the gesture recognition result is determined by combining the gesture recognition result and the pointing recognition result, and the gesture recognition is respectively carried out from the two dimensions of the gesture recognition and the pointing recognition, so that the reliability of gesture recognition can be improved, and the reliability of subsequent target recognition based on gesture recognition is further improved.
In some embodiments, fig. 3 is a schematic structural diagram of a gesture recognition model according to an embodiment of the present application, where, as shown in fig. 3, the gesture recognition model includes a backbone network, and a gesture recognition branch and a pointing recognition branch connected to the backbone network, respectively;
accordingly, in step 121, based on the gesture recognition model, performing hand gesture recognition and finger direction recognition on the first image to obtain a gesture recognition result and a pointing recognition result of the first image, including:
based on the backbone network, extracting the characteristics of the first image to obtain image characteristics;
based on the gesture recognition branch, carrying out hand gesture recognition on the image characteristics to obtain a gesture recognition result;
and based on the pointing identification branch, carrying out finger direction identification on the image characteristic to obtain the pointing identification result.
Specifically, the hand gesture recognition and the finger direction recognition share the backbone network in the gesture recognition model, that is, after the first image is input to the gesture recognition model, the first image is first input to the backbone network, and feature extraction is performed on the first image by the backbone network, so as to obtain image features of the first image. The backbone network here may be a neural network with image processing capability such as a residual network Resnet or a lightweight network MobileNet.
In the gesture recognition model, the output end of the backbone network is respectively connected with the input end of the gesture recognition branch and the input end of the pointing recognition branch, that is, the image features of the first image output by the backbone network can be respectively input into the gesture recognition branch and the pointing recognition branch.
Wherein, after inputting the image feature to the gesture recognition branch, the gesture recognition branch can perform hand gesture recognition based on the image feature, thereby obtaining and outputting a gesture recognition result. Here, the gesture recognition branch may be a two-class network, i.e. for distinguishing whether the gesture in the first image is a pointing gesture, or a non-pointing gesture; alternatively, the gesture recognition branch may be a three-classification network, i.e. for distinguishing whether the gesture in the first image is a left-finger gesture, a right-finger gesture or a non-pointing gesture.
After inputting the image feature to the direction recognition branch, the direction recognition branch can perform finger direction recognition based on the image feature, thereby obtaining and outputting a direction recognition result. Here, the directional identification branch may be a multi-classification network, for example, specifically divided into 9 classes, representing up, down, left, right, upper left, upper right, lower left, lower right, and others, respectively.
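A rough PyTorch sketch of the Fig. 3 structure is given below; the choice of MobileNetV2 as backbone, the class counts and the single linear layer per branch are illustrative assumptions (the application only requires a shared backbone with a gesture recognition branch and a pointing recognition branch), and torchvision 0.13 or later is assumed:

```python
import torch
import torch.nn as nn
from torchvision import models

class GestureRecognitionModel(nn.Module):
    """Shared backbone with a pose branch and a pointing branch, as in Fig. 3."""

    def __init__(self, num_pose_classes: int = 3, num_dir_classes: int = 9):
        super().__init__()
        # Lightweight backbone; MobileNet is one of the options named above.
        backbone = models.mobilenet_v2(weights=None)
        self.backbone = backbone.features
        self.pool = nn.AdaptiveAvgPool2d(1)
        feat_dim = backbone.last_channel            # 1280 for MobileNetV2
        # Pose branch: e.g. left-finger / right-finger / no pointing pose.
        self.pose_head = nn.Linear(feat_dim, num_pose_classes)
        # Pointing branch: 8 directions plus "other".
        self.dir_head = nn.Linear(feat_dim, num_dir_classes)

    def forward(self, x: torch.Tensor):
        feats = self.pool(self.backbone(x)).flatten(1)   # shared image features
        return self.pose_head(feats), self.dir_head(feats)
```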
In the embodiment of the application, the gesture recognition model realizes gesture recognition in two dimensions of gesture recognition and direction recognition through a common backbone network, and ensures the reliability of a gesture recognition result through mutual constraint of a gesture recognition branch and a direction recognition branch.
In some embodiments, the implementation of step 122 includes two cases, namely, the case where the gesture recognition result is that no pointing gesture exists and the case where the gesture recognition result is that a pointing gesture exists. The region of interest is segmented in these two cases as follows:
under the condition that the gesture recognition result is a non-pointing gesture, based on the preset mark point position, a region of interest is segmented from the acquired first image;
and under the condition that the gesture recognition result is that a pointing gesture exists, a region of interest is segmented from the first image based on the position and the pointing direction of the pointing gesture and the preset mark point position.
Specifically, in the case that the gesture recognition result is that no pointing gesture exists, that is, the user has not given an effective gesture to guide the terminal in determining the position of the recognition target in the first image, the region of interest may be segmented based only on the position of the preset mark point in the first image. It can be understood that the manner of segmenting the region of interest from the first image based on the preset mark point is consistent with the segmentation manner provided in the foregoing embodiments, which is not repeated here.
When the gesture recognition result is that a pointing gesture exists, that is, a user gives an effective gesture to guide the terminal to determine the position of the recognition target in the first image, the information provided by the pointing gesture and the position of a preset mark point in the first image are combined at the moment, and the region of interest is segmented. Further, the position of one vertex of the region of interest may be determined with the position of the pointing gesture, and the position of the other vertex opposite to the vertex may be determined with the pointing direction, thereby segmenting the region of interest including the preset marker point. Alternatively, the position where the preset mark point is located may be taken as the center point of the region of interest, and the position of one vertex of the region of interest is determined by the position of the pointing gesture to divide the region of interest, which is not particularly limited in the embodiment of the present application.
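One possible way to construct such a rectangle from the pointing gesture and the preset mark point is sketched below; the direction step table, default extent and image size are illustrative assumptions:

```python
def roi_from_pointing(mark_point, finger_tip, direction, extent=224, img_size=(1080, 1920)):
    """Build an ROI rectangle from the pointing gesture and the preset mark point.

    `finger_tip` (x, y) fixes one vertex; the opposite vertex is placed `extent`
    pixels away along `direction`; the box is then expanded, if needed, so that
    it still contains `mark_point`, and finally clipped to the image.
    """
    step = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0),
            "upper_left": (-1, -1), "upper_right": (1, -1),
            "lower_left": (-1, 1), "lower_right": (1, 1)}[direction]
    x1, y1 = finger_tip
    x2, y2 = x1 + step[0] * extent, y1 + step[1] * extent
    # Normalise the corners and make sure the preset mark point is inside the box.
    mx, my = mark_point
    x_lo, x_hi = min(x1, x2, mx), max(x1, x2, mx)
    y_lo, y_hi = min(y1, y2, my), max(y1, y2, my)
    h, w = img_size
    return (max(0, x_lo), max(0, y_lo), min(w, x_hi), min(h, y_hi))
```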
In the embodiment of the application, the region of interest is segmented by combining the preset mark point positions and the gesture recognition result, which is beneficial to improving the reliability of the segmentation of the region of interest, thereby improving the probability of the region of interest containing the complete recognition target and further ensuring the reliability of target recognition.
In a photo-based object recognition scenario, the terminal would otherwise need to continuously capture photos and run inference, and the inference process relies on large models covering gesture recognition, image target detection, target segmentation, classification and other functions. Continuous photo inference results in significant power consumption. To address this problem, in some embodiments, step 110 includes:
acquiring a first voice, and carrying out instruction recognition on the first voice to obtain a voice instruction;
and displaying a preset mark point position under the condition that the voice instruction is a query instruction, and acquiring an image based on a camera.
Specifically, in order to reduce the power consumption caused by continuous photo inference, the photo inference function is triggered by voice in the embodiment of the application. That is, when the user needs to start the photo inference function to perform photo-based object recognition, the user can input the first voice to the terminal. The terminal can collect voice through a built-in or external microphone, thereby obtaining the first voice dictated by the user.
After the first voice is obtained, the terminal can perform instruction recognition on the first voice, thereby judging whether the voice instruction issued by the user is a query instruction capable of triggering the photo inference function.
The instruction recognition can be command word recognition or full-function voice recognition, and the specific mode for realizing the instruction recognition can be determined according to the calculation capability of the terminal itself or the current network state of the terminal.
For example, compared with full-function voice recognition, command word recognition uses a smaller model, requires less computing power and consumes less energy, and for the specific command words it supports it is more accurate; therefore, command word recognition can be selected when the terminal has limited computing power, such as smart glasses, or when the battery level of the terminal is low. In addition, full-function voice recognition has higher overall accuracy than command word recognition; when the current network state of the terminal is good and the terminal can communicate with a server, the first voice can be sent to the server for full-function voice recognition, and the voice instruction returned by the server can be obtained.
After completing instruction recognition and obtaining the voice instruction, it can be further judged whether the voice instruction is a query instruction, for example, "what is this" or "what is that". When the voice instruction is a query instruction, the terminal can trigger the photo inference function, display the preset mark point and start the camera, thereby guiding the user to move the camera to adjust its acquisition viewing angle so that the recognition target contained in the image acquired by the camera coincides with the preset mark point.
In the embodiment of the application, the photo inference function is triggered based on a query instruction, so that the large power consumption caused by continuous photo inference is avoided, battery power can be effectively saved, and the battery life of the terminal is extended.
It can be understood that if the voice instruction corresponding to the first voice is not a query instruction, the gesture recognition, target detection, target segmentation, classification and other models of the photo inference function are not run, which reduces power consumption and saves computing resources.
In some embodiments, in step 110, the obtaining the first voice and performing instruction recognition on the first voice to obtain a voice instruction includes:
acquiring a second voice, and detecting wake-up words of the second voice;
under the condition that the wake-up word exists in the second voice, acquiring a first voice, and carrying out instruction recognition on the first voice to obtain a voice instruction.
Specifically, in order to further reduce the terminal energy consumption, a trigger condition may be set for the acquisition of the first voice, that is, a wake-up word detection link is added before the instruction recognition, and the first voice is acquired and the instruction recognition is performed only when the wake-up word is detected.
It will be appreciated that the wake-up word here is predetermined, e.g. "Xiao Ji Xiao Ji". The terminal can collect the second voice and perform wake-up word detection on it, that is, perform voice wake-up; when the wake-up word is detected, the terminal is determined to be awakened, and the awakened terminal can then collect the first voice and apply it for instruction recognition. In this process, wake-up word detection may be achieved by a neural network, and the neural network model used for wake-up word detection is typically simple and small, e.g. tens of kilobytes in size. Deploying such a lightweight wake-up word detection model on the terminal and running it continuously does not bring large power consumption to the terminal. In addition, as a trigger condition for instruction recognition, instruction recognition is not performed before voice wake-up succeeds, which avoids the power consumption caused by continuous instruction recognition.
In the embodiment of the application, the instruction recognition is triggered under the condition that the wake-up word is detected, so that the electric quantity can be effectively saved, and the endurance of the terminal can be prolonged.
It can be understood that if the voice wake-up fails, that is, if no wake-up word exists in the second voice, the terminal can remain in a standby state, and the gesture recognition, target detection, target segmentation, classification and other models of the photo inference function do not run, which reduces power consumption and saves computing resources.
In some embodiments, fig. 4 is a schematic structural diagram of a voice wake model according to an embodiment of the present application. The wake-up word detection for the second voice in step 110 may be implemented by a voice wake model, for example, the voice wake model shown in fig. 4. In fig. 4, for the second voice, voice features are first obtained through feature extraction; the voice features are then input into an acoustic model, which serves as an encoder to extract acoustic features; the extracted acoustic features are then input into a decoder, which decodes them into a text sequence corresponding to the second voice, so that it can be determined whether a wake-up word exists in the text sequence and hence whether the wake-up succeeds or fails. The acoustic model may be implemented by a deep neural network, for example a DFSMN (Deep Feedforward Sequential Memory Network).
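A schematic sketch of this pipeline follows; the feature extractor, acoustic model (e.g. a DFSMN-style encoder) and decoder are stand-ins for pretrained components that the application does not detail, and the wake word string is illustrative:

```python
import numpy as np

WAKE_WORD = "xiao ji xiao ji"   # illustrative wake word, not the product's actual keyword

def detect_wake_word(waveform: np.ndarray, feature_extractor, acoustic_model, decoder) -> bool:
    """Pipeline of Fig. 4: features -> acoustic encoding -> decoded text -> keyword check.

    `feature_extractor`, `acoustic_model` and `decoder` are assumed pretrained
    callables; their internals are not specified here.
    """
    features = feature_extractor(waveform)        # e.g. log-mel filterbank frames
    encoded = acoustic_model(features)            # frame-level acoustic representations
    text = decoder(encoded)                       # text sequence for the second voice
    return WAKE_WORD in text                      # wake up only if the keyword appears
```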
In some embodiments, fig. 5 is a second flowchart of an identification method according to an embodiment of the present application, as shown in fig. 5, the identification method may include the following steps:
step 510, voice wakeup:
the terminal can collect the second voice in real time and detect wake-up words aiming at the second voice. If it is detected that the wake-up word exists in the second speech, it may be determined that the speech wake-up is successful, and step 520 is performed. In addition, if it is detected that the wake-up word does not exist in the second voice, it may be determined that the voice wake-up fails, step 520 is not performed, and the process returns to continue to step 510.
Step 520, instruction identification:
after the voice wake-up is successful, the terminal can collect the first voice and conduct instruction recognition aiming at the first voice so as to obtain a voice instruction corresponding to the first voice. The instruction recognition here may specifically be command word recognition.
Fig. 6 is a schematic structural diagram of a command word recognition model according to an embodiment of the present application; command word recognition may be implemented using the command word recognition model shown in fig. 6. In fig. 6, the command word recognition model includes two parts, an acoustic model and a CTC (Connectionist Temporal Classification) decoding module: the acoustic features of the first voice are encoded by the acoustic model and then decoded by the decoding module, so that a voice instruction can be obtained. It can be appreciated that the command words supported by the terminal are all preset and participate in the training of the acoustic model. Because the command words supported by the terminal are limited and short, at most a few hundred of them, a complex natural language model is not needed to analyse them, and the CTC decoding module is sufficient to complete decoding. The CTC decoding model contains simple word context information and can determine whether the user's pronunciation matches a preset command word.
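A minimal sketch of greedy CTC decoding followed by command-word matching is shown below; the symbol table, frame posteriors and command word set are illustrative assumptions, not the product's actual vocabulary:

```python
from typing import Optional
import numpy as np

COMMAND_WORDS = {"what is this", "what is that", "take a picture"}  # illustrative set

def ctc_greedy_decode(posteriors: np.ndarray, symbols: list, blank: int = 0) -> str:
    """Greedy CTC decoding: best symbol per frame, collapse repeats, drop blanks."""
    best = posteriors.argmax(axis=1)              # posteriors: (frames, num_symbols)
    tokens, prev = [], blank
    for idx in best:
        if idx != blank and idx != prev:
            tokens.append(symbols[idx])
        prev = idx
    return " ".join(tokens)

def recognise_command(posteriors: np.ndarray, symbols: list) -> Optional[str]:
    """Return the matched command word, or None when the utterance is unsupported."""
    text = ctc_greedy_decode(posteriors, symbols)
    return text if text in COMMAND_WORDS else None
```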
After obtaining the voice command, if it is determined that the voice command is a query command, step 530 may be performed; if it is determined that the voice command is not a query command, then the corresponding operation of the voice command is performed without performing step 530. Alternatively, if it is determined that the voice command is not a query command, the terminal may not respond at all, e.g., the voice command is "turn on air conditioner", the terminal does not support this command, and may not respond at all.
Step 530, photographing:
under the condition that the terminal receives the inquiry instruction, the terminal starts the camera to acquire a first image, and displays a preset mark point position to guide a user to move the camera, so that the preset mark point position coincides with the identification target. In the process, after determining that the preset mark point position coincides with the identification target, the user can issue a photographing instruction, and the photographing instruction can be issued in a voice instruction mode.
After the first image is acquired by the camera, step 540 is performed.
Step 540, gesture recognition:
because the first image already ensures that the preset mark point position coincides with the identification target, on the basis, the position of the identification target in the first image can be further confirmed through gesture recognition.
The gesture recognition herein may include hand gesture recognition and finger direction recognition, and in the case where the gesture recognition result is a pointing gesture and the pointing recognition result is a specific direction but not others, the terminal determines that there is a gesture pointing to the recognition target in the first image.
After completing gesture recognition, step 550 is performed.
Step 550, region segmentation:
and dividing the region of interest from the first image by combining both the preset mark point position and the gesture recognition result. The region of interest obtained by segmentation can be a region pointed by a finger in the gesture recognition result, and the region of interest covers the position of a preset mark point.
After segmentation of the region of interest, step 560 is performed.
Step 560, target identification:
The target recognition of the region of interest may specifically be image classification of the region of interest. The image classification may be implemented using a neural network, specifically a pre-trained model of a classical network, for example VGG16 (Visual Geometry Group Network), ResNet50 (Residual Network), etc. Since the neural network runs on a mobile phone or a glasses-type terminal, the model size cannot be too large, and a lightweight MobileNet may preferably be selected.
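A sketch of classifying the region of interest with a pretrained lightweight network follows; it assumes torchvision 0.13 or later, with ImageNet weights standing in for the product's own classifier:

```python
import torch
from torchvision import models
from PIL import Image

# ImageNet-pretrained MobileNetV3-Small as an illustrative lightweight classifier.
weights = models.MobileNet_V3_Small_Weights.DEFAULT
model = models.mobilenet_v3_small(weights=weights).eval()
preprocess = weights.transforms()                 # resize / crop / normalise for this model

def classify_roi(roi_image: Image.Image, topk: int = 3):
    """Classify a region-of-interest crop and return the top-k (label, score) pairs."""
    batch = preprocess(roi_image).unsqueeze(0)
    with torch.no_grad():
        probs = model(batch).softmax(dim=1)[0]
    scores, idx = probs.topk(topk)
    return [(weights.meta["categories"][i], float(s)) for i, s in zip(idx, scores)]
```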
Further, if the above-mentioned recognition method needs to be applied in a specific setting, for example a factory in which the recognition targets all belong to a professional field, then the image classification model needs to be fine-tuned on the basis of the pre-trained model. Here, fine-tuning of the image classification model can be performed by collecting photographs of recognition targets in the relevant professional field, labelling them by class, and then training on the labelled photographs.
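A brief sketch of such fine-tuning, again assuming a torchvision MobileNet as the pre-trained model, replaces the classifier head with the domain-specific classes:

```python
import torch.nn as nn
from torchvision import models

def build_finetune_model(num_domain_classes: int) -> nn.Module:
    """Start from the pretrained MobileNet and swap its classifier head for the
    domain-specific classes (e.g. objects photographed in a factory)."""
    model = models.mobilenet_v3_small(weights=models.MobileNet_V3_Small_Weights.DEFAULT)
    for p in model.features.parameters():
        p.requires_grad = False                      # optionally freeze the backbone
    in_features = model.classifier[-1].in_features
    model.classifier[-1] = nn.Linear(in_features, num_domain_classes)
    return model   # then train on the labelled domain photographs as usual
```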
In some embodiments, fig. 7 is a schematic structural diagram of a headset according to an embodiment of the present application, as shown in fig. 7, a headset includes a headset body 710, and a camera 720, a display component 730, and a processor 740 disposed on the headset body 710, where the camera 720 and the display component 730 are respectively connected to the processor 740;
the processor 740 is configured to perform the identification method as described above;
the camera 720 is used for acquiring images under the control of the processor 740, and the display component 730 is used for displaying preset marking points under the control of the processor 740.
It is understood that the headset body 710 here may be a spectacle frame, a helmet, or the like. When a user wears the head-mounted device, the user can observe through the display assembly whether the recognition target in the camera's acquisition field of view coincides with the preset mark point on the display assembly, adjust the viewing angle so that the preset mark point coincides with the recognition target, capture with the camera a first image in which the preset mark point coincides with the recognition target, and have the processor carry out the identification method for the recognition target in the first image. The identification method is identical to that in the above embodiments and will not be repeated here.
The display assembly 730 may be a conventional display screen such as an LCD, OLED, mini LED, or micro LED screen; it may also be an AR optical display module such as an array optical waveguide, a geometric optical waveguide, a diffractive optical waveguide, a holographic optical waveguide, or a multilayer optical waveguide.
According to the embodiment of the present application, displaying the preset marker point guides the user in adjusting the camera's acquisition viewing angle so that the recognition target contained in the image acquired by the camera coincides with the preset marker point. On this basis, the region of interest segmented from the acquired image is guaranteed to contain the recognition target expected by the user, so the recognition target can be captured accurately without the user participating in calibration, enabling more accurate and reliable target recognition.
In some embodiments, the head-mounted device body is a spectacle frame.
Specifically, the head-mounted device may be a pair of smart glasses, with one or more cameras arranged on the spectacle frame; for example, two cameras may be provided, one on each side of the frame. When the user issues a query instruction such as "take a picture", "start taking a picture", "what is this", or "what is that", the camera is started to collect images. In general, the human eye faces the recognition target at this moment, so the image acquired by the camera has a high probability of containing the recognition target. However, since the distance between the user and the recognition target is not fixed, the image may contain other objects in addition to the recognition target.
Considering that the cameras currently used on smart glasses have no direction-adjusting device, the camera's shooting angle is generally consistent with the direction the face is pointing. To prevent the captured image from not containing the recognition target, or from containing multiple objects so that the processor cannot accurately determine the target, the processor generates a colored dot or box on the display assembly of the glasses as the preset marker point when starting the camera; the marker may be placed at the center of the image or at another preset position. If the user finds that the preset marker point does not coincide with the recognition target, the user can turn the face, thereby adjusting the camera's shooting angle, until the preset marker point coincides with the recognition target, and then issue the shooting instruction, so that the processor can accurately locate the recognition target in the captured image and recognize it.
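To tie the steps together, the sketch below outlines how a processor on the smart glasses might coordinate marker display, image capture, gesture recognition, region segmentation, and classification once a voice query has been recognised. All function and type names here (show_marker, capture_frame, recognize_gesture, segment_roi, classify, Gesture) are hypothetical placeholders for the components discussed above, injected as callables so the sketch stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple
import numpy as np

@dataclass
class Gesture:                                   # hypothetical gesture recognition output
    is_pointing: bool
    direction: Optional[Tuple[float, float]]     # unit vector, or None for non-pointing

def run_identification(show_marker: Callable[[], Tuple[int, int]],
                       capture_frame: Callable[[], np.ndarray],
                       recognize_gesture: Callable[[np.ndarray], Gesture],
                       segment_roi: Callable[..., np.ndarray],
                       classify: Callable[[np.ndarray], str]) -> str:
    """Hypothetical end-to-end flow after a voice query instruction is recognised."""
    marker_xy = show_marker()                    # display the preset marker point (dot/box)
    # ... the user turns their head until the marker coincides with the target,
    #     then issues the shooting instruction ...
    first_image = capture_frame()                # first image: marker coincides with target
    gesture = recognize_gesture(first_image)     # hand posture + pointing direction
    roi = segment_roi(first_image, marker_xy=marker_xy,
                      pointing=gesture.direction if gesture.is_pointing else None)
    return classify(roi)                         # lightweight image classification result
```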
In another aspect, the present application also provides a computer program product. The computer program product includes a computer program, the computer program may be stored on a non-transitory computer-readable storage medium, and when the computer program is executed by a processor, the processor can execute the method provided by the above method embodiments, the method comprising:
displaying a preset marker point, and acquiring images based on a camera, wherein the preset marker point is used for guiding a user to move the camera so that the preset marker point coincides with a recognition target;
segmenting a region of interest from the acquired first image based on the preset marker point; and
performing target recognition on the region of interest to obtain a recognition result of the recognition target.
In yet another aspect, the present application also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method provided by the above method embodiments, the method comprising:
displaying a preset marker point, and acquiring images based on a camera, wherein the preset marker point is used for guiding a user to move the camera so that the preset marker point coincides with a recognition target;
segmenting a region of interest from the acquired first image based on the preset marker point; and
performing target recognition on the region of interest to obtain a recognition result of the recognition target.
The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or of course by means of hardware. Based on this understanding, the above technical solution, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the method described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present application, not to limit it. Although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (8)

1. An identification method, comprising:
displaying a preset marker point, and acquiring images based on a camera, wherein the preset marker point is used for guiding a user to move the camera so that the preset marker point coincides with a recognition target, and the preset marker point and the images acquired by the camera are displayed on the same screen;
segmenting a region of interest from an acquired first image based on the preset marker point, wherein the first image is an image acquired by the camera when the user determines that the recognition target coincides with the preset marker point, and the recognition target is contained in the region of interest; and
performing target recognition on the region of interest to obtain a recognition result of the recognition target;
wherein the segmenting a region of interest from the acquired first image based on the preset marker point comprises:
performing gesture recognition on the first image to obtain a gesture recognition result; and
segmenting the region of interest from the first image based on the preset marker point and the gesture recognition result;
wherein the performing gesture recognition on the first image to obtain a gesture recognition result comprises:
performing hand posture recognition and finger direction recognition on the first image based on a gesture recognition model to obtain a posture recognition result and a pointing recognition result of the first image; and
determining the gesture recognition result based on the posture recognition result and the pointing recognition result;
wherein the gesture recognition model is trained based on a sample image and a posture label and a pointing label of the sample image.
2. The identification method according to claim 1, wherein the gesture recognition model comprises a backbone network, and a posture recognition branch and a pointing recognition branch each connected to the backbone network;
the performing hand posture recognition and finger direction recognition on the first image based on the gesture recognition model to obtain a posture recognition result and a pointing recognition result of the first image comprises:
extracting features of the first image based on the backbone network to obtain image features;
performing hand posture recognition on the image features based on the posture recognition branch to obtain the posture recognition result; and
performing finger direction recognition on the image features based on the pointing recognition branch to obtain the pointing recognition result.
3. The identification method according to claim 1, wherein the segmenting the region of interest from the first image based on the preset marker point and the gesture recognition result comprises:
segmenting the region of interest from the acquired first image based on the preset marker point when the gesture recognition result is a non-pointing gesture; and
segmenting the region of interest from the first image based on the position and direction of the pointing gesture and the preset marker point when the gesture recognition result indicates that a pointing gesture exists.
4. The identification method according to any one of claims 1 to 3, wherein the displaying a preset marker point and acquiring images based on a camera comprises:
acquiring a first voice, and performing instruction recognition on the first voice to obtain a voice instruction; and
displaying the preset marker point and acquiring images based on the camera when the voice instruction is a query instruction.
5. The identification method according to claim 4, wherein the acquiring a first voice and performing instruction recognition on the first voice to obtain a voice instruction comprises:
acquiring a second voice, and performing wake-up word detection on the second voice; and
acquiring the first voice and performing instruction recognition on the first voice to obtain the voice instruction when the wake-up word exists in the second voice.
6. A head-mounted device, comprising a head-mounted device body, and a camera, a display assembly, and a processor arranged on the head-mounted device body, wherein the camera and the display assembly are each connected to the processor;
the processor is configured to perform the identification method of any one of claims 1 to 5; and
the camera is used for acquiring images under the control of the processor, and the display assembly is used for displaying the preset marker point under the control of the processor.
7. The head-mounted device according to claim 6, wherein the head-mounted device body is a spectacle frame.
8. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the identification method of any one of claims 1 to 5.
CN202311170299.0A 2023-09-12 2023-09-12 Identification method, head-mounted device and storage medium Pending CN116912950A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311170299.0A CN116912950A (en) 2023-09-12 2023-09-12 Identification method, head-mounted device and storage medium


Publications (1)

Publication Number Publication Date
CN116912950A 2023-10-20

Family

ID=88360630

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311170299.0A Pending CN116912950A (en) 2023-09-12 2023-09-12 Identification method, head-mounted device and storage medium

Country Status (1)

Country Link
CN (1) CN116912950A (en)

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104680127A (en) * 2014-12-18 2015-06-03 闻泰通讯股份有限公司 Gesture identification method and gesture identification system
CN107272879A (en) * 2016-04-01 2017-10-20 三星电子株式会社 Display device, system and method for controlling external equipment
CN107450714A (en) * 2016-05-31 2017-12-08 大唐电信科技股份有限公司 Man-machine interaction support test system based on augmented reality and image recognition
CN109886225A (en) * 2019-02-27 2019-06-14 浙江理工大学 A kind of image gesture motion on-line checking and recognition methods based on deep learning
CN111539408A (en) * 2020-04-08 2020-08-14 王鹏 Intelligent point reading scheme based on photographing and object recognizing
CN111680177A (en) * 2020-06-01 2020-09-18 广东小天才科技有限公司 Data searching method, electronic device and computer-readable storage medium
CN111819521A (en) * 2018-03-12 2020-10-23 索尼公司 Information processing apparatus, information processing method, and program
US20210343024A1 (en) * 2018-04-11 2021-11-04 Shenzhen Realis Multimedia Technology Co., Ltd. Capture-ball-based on-ball point distribution method and motion-posture recognition method, system, and apparatus
CN113689577A (en) * 2021-09-03 2021-11-23 上海涞秋医疗科技有限责任公司 Method, system, device and medium for matching virtual three-dimensional model and entity model
CN114297425A (en) * 2021-11-26 2022-04-08 珠海格力电器股份有限公司 Recognition interaction method, device, medium and intelligent terminal
US20220137647A1 (en) * 2020-04-28 2022-05-05 SZ DJI Technology Co., Ltd. System and method for operating a movable object based on human body indications
CN114495241A (en) * 2022-02-16 2022-05-13 平安科技(深圳)有限公司 Image identification method and device, electronic equipment and storage medium
US20220277166A1 (en) * 2021-02-26 2022-09-01 Changqing ZOU Methods and systems for rendering virtual objects in user-defined spatial boundary in extended reality environment
WO2022222773A1 (en) * 2021-04-19 2022-10-27 华为技术有限公司 Image capture method, and related apparatus and system
CN115331191A (en) * 2022-10-13 2022-11-11 深圳市镭神智能***有限公司 Vehicle type recognition method, device, system and storage medium
CN115965936A (en) * 2022-12-26 2023-04-14 北京三快在线科技有限公司 Edge position marking method and equipment


Similar Documents

Publication Publication Date Title
CN111062312B (en) Gesture recognition method, gesture control device, medium and terminal equipment
CN107336243B (en) Robot control system and control method based on intelligent mobile terminal
US20210064871A1 (en) Apparatus and method for recognition of text information
US9953216B2 (en) Systems and methods for performing actions in response to user gestures in captured images
CN111368796B (en) Face image processing method and device, electronic equipment and storage medium
CN108197318A (en) Face identification method, device, robot and storage medium
CN111767831B (en) Method, apparatus, device and storage medium for processing image
KR102663375B1 (en) Apparatus and method for automatically focusing the audio and the video
CN112446322B (en) Eyeball characteristic detection method, device, equipment and computer readable storage medium
CN112001872B (en) Information display method, device and storage medium
US20210233529A1 (en) Imaging control method and apparatus, control device, and imaging device
CN112597944B (en) Key point detection method and device, electronic equipment and storage medium
CN108319916A (en) Face identification method, device, robot and storage medium
CN111507246A (en) Method, device, system and storage medium for selecting marked object through gesture
Shah et al. Efficient portable camera based text to speech converter for blind person
CN112489036A (en) Image evaluation method, image evaluation device, storage medium, and electronic apparatus
CN111242273A (en) Neural network model training method and electronic equipment
CN111327888B (en) Camera control method and device, computer equipment and storage medium
CN111783674A (en) Face recognition method and system based on AR glasses
CN108197608A (en) Face identification method, device, robot and storage medium
CN112509005A (en) Image processing method, image processing device, electronic equipment and storage medium
CN114861241A (en) Anti-peeping screen method based on intelligent detection and related equipment thereof
CN112528760B (en) Image processing method, device, computer equipment and medium
CN115410240A (en) Intelligent face pockmark and color spot analysis method and device and storage medium
CN115529837A (en) Face recognition method and device for mask wearing, and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination