CN117097985B - Focusing method, electronic device and computer readable storage medium - Google Patents

Focusing method, electronic device and computer readable storage medium

Info

Publication number
CN117097985B
Authority
CN
China
Prior art keywords
electronic device
instance
preview
focus
voice
Prior art date
Legal status
Active
Application number
CN202311309672.6A
Other languages
Chinese (zh)
Other versions
CN117097985A
Inventor
张恒
Current Assignee
Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Priority date
Filing date
Publication date
Application filed by Honor Device Co Ltd
Priority to CN202311309672.6A
Publication of CN117097985A
Application granted
Publication of CN117097985B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60 Control of cameras or camera modules
    • H04N23/67 Focus control based on electronic image sensor signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60 Control of cameras or camera modules
    • H04N23/63 Control of cameras or camera modules by using electronic viewfinders
    • H04N23/631 Graphical user interfaces [GUI] specially adapted for controlling image capture or setting capture parameters
    • H04N23/632 Graphical user interfaces [GUI] specially adapted for controlling image capture or setting capture parameters for displaying or modifying preview images prior to image capturing, e.g. variety of image resolutions or capturing parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Studio Devices (AREA)

Abstract

The application provides a focusing method, an electronic device, and a computer-readable storage medium. In the method, the electronic device may divide the preview screen into a plurality of areas and identify an instance in each area. The electronic device may then receive a voice instruction from the user to focus on a specified instance and control the camera to focus on that instance. The method enables the user to focus on a designated area of the preview screen without touching the electronic device, improving the user's shooting experience.

Description

Focusing method, electronic device and computer readable storage medium
Technical Field
The present disclosure relates to the field of terminals, and in particular, to a focusing method, an electronic device, and a computer readable storage medium.
Background
When a user takes a picture using the camera of an electronic device, the user often needs to tap the display screen to focus on a target at a specific location. Sometimes the user is far away from the electronic device when shooting and cannot conveniently select a focus position by hand, so the picture finally taken by the electronic device may not match the effect the user expects.
Disclosure of Invention
The application provides a focusing method, an electronic device, and a computer-readable storage medium. The electronic device may identify instances in the preview screen and then receive a voice focus instruction from the user for focusing on a specified instance. The electronic device can recognize the voice focus instruction and focus on the instance specified by the user. The user can therefore change the focus position without tapping the display screen, which improves the user's shooting experience.
In a first aspect, the present application provides a focusing method applied to an electronic device that includes a camera. The method includes: the electronic device displays a first preview screen and labels of K1 instances in the first preview screen, where K1 is a positive integer; the electronic device displays a second preview screen and labels of K2 instances in the second preview screen, where the K2 instances include a first instance, the first preview screen and the second preview screen are generated by the electronic device based on images acquired by the camera at different times, K2 is a positive integer, and the K1 instances are partially or completely different from the K2 instances; the electronic device receives a first voice instruction, where the first voice instruction instructs the electronic device, based on the label of the first instance, to focus on the first instance; and in response to the first voice instruction, the electronic device focuses on the first instance.
The electronic device may obtain preview frames and identify instances in the picture of each preview frame. The electronic device may identify instances in a first preview frame and then display a first preview screen corresponding to the first preview frame together with labels of K1 instances in the first preview screen. The electronic device may then identify instances in a second preview frame and display a second preview screen corresponding to the second preview frame together with labels of K2 instances in the second preview screen. The first preview frame and the second preview frame are preview frames in the preview stream; the electronic device acquires the second preview frame later than the first preview frame and displays the second preview screen later than the first preview screen. The instances in the first preview screen and the second preview screen may be partially or completely different, which may be caused by one or more instances in the first preview screen moving out of the area in which the camera captures the original image (also referred to as the shooting range of the camera), and/or by one or more instances entering that area. The first preview screen and the second preview screen may both include the first instance, that is, the first instance remains within the area in which the camera captures the original image. Referring to the embodiment shown in fig. 2C, the preview screen shown in fig. 2C may be the second preview screen, and the K2 instances may include the grass, person 1, person 2, person 3, and the dog in the preview screen. The first preview screen may contain more or fewer instances than the second preview screen; for example, the first preview screen may lack the "dog" instance present in the second preview screen, or may contain additional people, and so on. The electronic device may receive a first voice instruction that instructs the electronic device, based on the label of the first instance, to focus on the first instance. Taking the embodiment shown in fig. 2D as an example, the instance corresponding to the label "person 1" in the second preview screen shown in fig. 2D may be referred to as the first instance. The first voice instruction may be a voice focus instruction for focusing on the position of person 1 in the second preview screen based on the label "person 1"; for example, the first voice instruction may be "focus on person 1" or "focus on the first person". Focusing on person 1 here may mean that the electronic device focuses based on the center of the bounding box in which person 1 is located, or based on the center of gravity of the area in which person 1 is located. Meanwhile, the electronic device may display a focus frame at person 1. The electronic device may identify the instances in the preview screen by means of an object detection model or a coarse-grained image segmentation model. The object detection model may receive the preview frame image as input and output the bounding boxes in which one or more types of instances are located together with an identification of each identified instance.
Alternatively, the electronic device may feed the preview frame image to a coarse-grained image segmentation model, which outputs the regions in which one or more instances are located and an identification of each instance. The electronic device may then determine the label of each instance based on its identification.
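To make the label mechanism concrete, the following sketch (Python, illustrative only; the Detection type, the class names, and the numbering scheme are assumptions rather than details from this application) shows one way detection results could be turned into the unique per-instance labels described above.

```python
# Hypothetical sketch: turning detections into unique labels such as "person 1".
# For simplicity every instance gets a sequence number, even singletons.
from dataclasses import dataclass


@dataclass
class Detection:
    class_name: str  # e.g. "person", "dog", "grass"
    box: tuple       # (x, y, w, h) bounding box in preview-frame pixels


def assign_labels(detections):
    """Give every detected instance a unique label such as "person 1"."""
    counters = {}
    labeled = []
    for det in detections:
        counters[det.class_name] = counters.get(det.class_name, 0) + 1
        label = f"{det.class_name} {counters[det.class_name]}"
        labeled.append((label, det))
    return labeled


# Example: three people, one dog, and a grass area in the preview frame
dets = [Detection("person", (120, 80, 90, 200)),
        Detection("person", (260, 90, 70, 180)),
        Detection("person", (380, 150, 60, 120)),
        Detection("dog", (300, 300, 80, 60)),
        Detection("grass", (0, 280, 640, 200))]
for label, det in assign_labels(dets):
    print(label, det.box)
```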
That is, the electronic device may identify the instances in the preview screen and then display a label for each instance. Based on the labels displayed by the electronic device, the user can focus on a specified instance through a voice focus instruction, so the user can select a focus position without tapping the display screen. Meanwhile, displaying the instance labels in advance prompts the user and prevents the user from issuing an overly vague voice focus instruction, which would otherwise reduce the accuracy with which the electronic device recognizes the focus position.
Optionally, when displaying the first preview screen and the labels of the K1 instances in it, the electronic device may further display labels of sub-instances of the K1 instances; similarly, when displaying the second preview screen and the labels of the K2 instances in it, the electronic device may also display labels of sub-instances of the K2 instances. That is, the electronic device may identify instances in the preview screen and may also segment those instances, thereby identifying sub-instances within them. For example, the preview screen shown in fig. 2C may display the instances of the grass, person 1, person 2, person 3, and the dog, and each instance may be divided into a plurality of sub-instances. For example, the instance of person 1 may be further divided into sub-instances such as "hair", "face", and "clothes", and the electronic device may display the label "person 1" as well as the labels "hair", "face", "clothes", and so on. In this way, the user can obtain a relatively precise focus position by issuing only a single voice focus instruction.
With reference to the first aspect, in some embodiments, the K2 instances include a second instance, and the method further includes: the electronic device receives a second voice instruction, where the second voice instruction instructs the electronic device to focus on the second instance; in response to the second voice instruction, the electronic device changes the focus target from the first instance to the second instance, and may also move the focus frame in the preview screen from the position of the first instance to the position of the second instance.
Referring to the embodiment shown in fig. 2D, person 1 may be the first instance. After focusing on the first instance, the electronic device may receive a second voice instruction, for example "focus on person 2", where "person 2" is the second instance; the electronic device may then change the focus target from "person 1" (i.e. the man facing the camera in the preview screen) to "person 2" (i.e. the child in the preview screen). While the electronic device is focusing on the first instance, it may continue to display the labels of the K2 instances, or it may no longer display them. In this way, the user can change the focus target several times as needed.
With reference to the first aspect, in some embodiments, after the electronic device focuses on the first instance, the method further includes: the electronic device displays labels of one or more sub-instances of the first instance, where the one or more sub-instances are obtained by the electronic device segmenting the first instance and include a first sub-instance; the electronic device receives a third voice instruction, where the third voice instruction instructs the electronic device, based on the label of the first sub-instance, to focus on the first sub-instance; and in response to the third voice instruction, the electronic device focuses on the first sub-instance.
After focusing on the first instance, the electronic device may segment the area in which the first instance is located to obtain multiple sub-instances. The electronic device may segment the first instance based on a fine-grained image segmentation model, which may receive an image containing the first instance as input and output the plurality of regions into which the first instance is divided together with an identification of the sub-instance corresponding to each region. Referring to the embodiment shown in fig. 2E, after focusing on the instance of person 1, the electronic device may segment the area in which the first instance (i.e. person 1) is located to obtain sub-instances such as "face", "hair", and "clothes", and may then display the labels of these sub-instances. As shown in fig. 2F, the electronic device may receive a third voice instruction for focusing on the first sub-instance, where the first sub-instance may be the face of person 1 in the preview screen shown in fig. 2F, and the third voice instruction may be, for example, "focus on the face". After receiving this voice focus instruction, the electronic device can focus on the face of person 1. The electronic device may display the labels of only some of the segmented sub-instances; for example, it may also segment a "neck" sub-instance without displaying its label in the preview screen.
That is, after focusing on a certain instance, the electronic device may segment the area in which that instance is located into multiple sub-instances. The user can then focus on a particular sub-instance through a voice focus instruction and obtain a more precise focus position, improving the user's shooting experience.
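The following sketch (Python, illustrative only; the segment_sub_instances helper and its output format are assumptions, not part of this application) shows how a voice command naming a sub-instance label might be resolved to a focus point inside the first instance's region.

```python
# Hypothetical sketch of two-stage (instance -> sub-instance) focusing.
# segment_sub_instances() stands in for the fine-grained segmentation model.

def segment_sub_instances(instance_crop):
    # Placeholder: a real model would return sub-regions {label: (x, y, w, h)}
    # relative to the instance crop.
    return {"hair": (10, 0, 60, 30),
            "face": (15, 30, 50, 55),
            "clothes": (0, 85, 80, 115)}


def focus_point_for_sub_instance(instance_box, spoken_label):
    """Return the (x, y) focus point for the sub-instance named in the command."""
    ix, iy, iw, ih = instance_box
    sub_regions = segment_sub_instances(None)  # crop omitted in this sketch
    if spoken_label not in sub_regions:
        return None  # fall back to focusing on the whole instance
    sx, sy, sw, sh = sub_regions[spoken_label]
    # Centre of the sub-region, mapped back to preview-frame coordinates
    return (ix + sx + sw // 2, iy + sy + sh // 2)


print(focus_point_for_sub_instance((200, 100, 80, 200), "face"))  # (240, 157)
```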
With reference to the first aspect, in some embodiments, when the electronic device displays the labels of one or more sub-instances of the first instance, the method further includes: the electronic device stops displaying the labels of the K2 instances.
Referring to the embodiment shown in fig. 2E, when the electronic device displays the labels of the sub-instances of person 1, it may stop displaying the labels of the other instances (e.g. person 2, person 3, the dog, and the grass). This prevents the preview screen from containing too many labels, which would interfere with the user viewing the preview screen.
Optionally, when the electronic device displays the labels of one or more sub-instances of the first instance, it may also continue to display the labels of the K2 instances. This can help the user focus on other instances in the preview screen.
With reference to the first aspect, in some embodiments, the method further includes: the electronic equipment displays a third preview picture, wherein the third preview picture comprises K3 examples, the K3 examples comprise a first example, and K3 is a positive integer; the electronic equipment receives a fourth voice instruction, wherein the fourth voice instruction is used for instructing the electronic equipment to identify an instance in the preview picture; in response to the fourth voice instruction, the electronic device displays labels of the K3 instances.
The third preview screen may be another preview screen that includes the first instance. Referring to the embodiment shown in fig. 2E-2F, after the electronic device segments the instance of person 1, the labels of the other instances in the preview screen may be hidden. The third preview screen may be the preview screen shown in fig. 2G. The electronic device may receive a fourth voice instruction, for example "refocus" or "display labels", that instructs it to re-identify the instances in the preview screen and display their labels; in response, the electronic device may redisplay the instance labels in the preview screen, helping the user issue a correct voice focus instruction.
With reference to the first aspect, in some embodiments, the method further includes: in response to the first voice instruction, the electronic device displays a focus frame at a location of the first instance.
Referring to the embodiment shown in fig. 2D-2E, upon receiving the user's voice focus instruction "focus on person 1", the electronic device may display a focus frame at person 1. Here, "focus on person 1" is the first voice instruction. The user can thus determine from the position of the focus frame whether the electronic device is focusing on the correct position.
With reference to the first aspect, in some embodiments, after the electronic device displays the focus frame at the location of the first instance, the method further includes: the electronic device receives a fifth voice instruction, where the fifth voice instruction instructs the electronic device to move the focus frame in a first direction; and in response to the fifth voice instruction, the electronic device moves the focus frame a first distance in the first direction and focuses on the position of the moved focus frame.
The first direction may be, for example, up, down, left, right, upper left, lower right, and so on. The first distance may be preset by the electronic device, e.g. 10 pixels or 20 pixels; alternatively, the first distance may be determined from the fifth voice instruction. For example, the fifth voice instruction may instruct the electronic device to move 1 cm in the first direction, and the electronic device may convert 1 cm on the display screen into a number of pixels in the preview screen, which may also be referred to as the first distance. Referring to the embodiment shown in fig. 2I, the fifth voice instruction received by the electronic device may be, for example, "move the focus frame to the left" as shown in fig. 2I, or "focus to the left", etc. In response to the fifth voice instruction, the electronic device may move the focus frame a first distance to the left from the focus frame position shown in fig. 2H, the moved focus frame being at the position shown in fig. 2I, and the electronic device can then focus on the focus frame shown in fig. 2I. In this way, the electronic device provides the user with more ways to control the focus position.
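A minimal sketch of this focus-frame movement is given below (Python; the 20-pixel step and the direction vocabulary are assumptions, since the application only states that the first distance may be preset or derived from the voice instruction).

```python
# Sketch of moving the focus frame by an azimuth word; the step size (in pixels)
# and the supported directions are illustrative assumptions.

STEP_PX = 20  # the preset "first distance"

DIRECTIONS = {
    "left": (-1, 0), "right": (1, 0), "up": (0, -1), "down": (0, 1),
    "upper left": (-1, -1), "lower right": (1, 1),
}


def move_focus_frame(center, direction, frame_size, step=STEP_PX):
    """Shift the focus-frame centre and clamp it to the preview frame."""
    dx, dy = DIRECTIONS[direction]
    w, h = frame_size
    x = min(max(center[0] + dx * step, 0), w - 1)
    y = min(max(center[1] + dy * step, 0), h - 1)
    return (x, y)


# "Move the focus frame to the left" on a 1280x960 preview
print(move_focus_frame((640, 480), "left", (1280, 960)))  # (620, 480)
```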
With reference to the first aspect, in some embodiments, the K2 instances include a third instance, and the method further includes: the electronic equipment receives a sixth voice instruction, wherein the sixth voice instruction is used for instructing the electronic equipment to focus on a first area, and the first area is any one of the following: left, right, upper, lower, or middle inside the third instance; in response to the sixth voice command, the electronic device focuses on the first area.
Referring to the embodiment shown in fig. 2H, the second preview screen may be the preview screen shown in fig. 2H, and the third instance may be, for example, the grass in that preview screen. The electronic device may receive an instruction to focus on a first area within the region occupied by the third instance; the first area may be, for example, the left side of the grass instance. The electronic device may receive the user's voice focus instruction "focus on the left of the grass", which may be referred to as the sixth voice instruction. In response to the sixth voice instruction, the electronic device may focus on the left side of the area occupied by the grass. Optionally, the electronic device may divide the rectangular area occupied by the grass into N equal parts along its length and focus on the center of the leftmost of the N parts, where N is a positive integer. In this way, the electronic device provides the user with more ways to control the focus position, and the user can focus on a particular area inside an instance or sub-instance as desired.
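The optional N-equal-parts strategy can be illustrated as follows (Python sketch; N = 3 and the azimuth vocabulary are assumptions for illustration, as the application only requires N to be a positive integer).

```python
# Sketch: split the instance's bounding rectangle along its length and focus on
# the centre of the part named by the azimuth word.

def region_focus_point(box, azimuth, n=3):
    """box = (x, y, w, h); azimuth in {"left", "middle", "right"} for a wide box."""
    x, y, w, h = box
    part_w = w / n
    index = {"left": 0, "middle": n // 2, "right": n - 1}[azimuth]
    cx = x + part_w * index + part_w / 2
    cy = y + h / 2
    return (int(cx), int(cy))


# "Focus on the left of the grass": the grass occupies a wide strip at the bottom
print(region_focus_point((0, 280, 640, 200), "left"))  # (106, 380)
```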
With reference to the first aspect, in some embodiments, the second preview screen includes a fourth instance that is not included in the K2 instances, and the method further includes: the electronic device receives a seventh voice instruction, where the seventh voice instruction instructs the electronic device to focus on the fourth instance; in response to the seventh voice instruction, the electronic device identifies the fourth instance in a fourth preview screen, the fourth preview screen being displayed by the electronic device after the second preview screen; and the electronic device focuses on the fourth instance.
With reference to the first aspect, in some embodiments, after the electronic device identifies the fourth instance in the fourth preview screen, the method further includes: the electronic device displays the label of the fourth instance.
That is, the fourth instance appears in the second preview screen, but the electronic device has not recognized it. The electronic device can identify the fourth instance in response to the seventh voice instruction and then focus on it. Referring to the embodiment shown in fig. 2J-2K, the electronic device 100 has not recognized the flowers in the preview screen; after receiving the user's voice focus instruction "focus on the flowers", the electronic device may recognize the flowers in the preview screen, display the labels "flower 1" and "flower 2", and focus on a flower (e.g. flower 1). In this way, the electronic device can identify instances in the preview screen based on the user's voice focus instruction and help the user select the focus position.
With reference to the first aspect, in some embodiments, the labels of the K1 instances are all different from one another, as are the labels of the K2 instances. That is, the label corresponding to each instance is unique, and the electronic device can determine a unique instance based on a label, which improves the accuracy of obtaining the focus target from the voice focus instruction.
With reference to the first aspect, in some embodiments, before the electronic device displays the first preview screen and the labels of K1 instances in the first preview screen, the method further includes: the electronic equipment displays a voice focusing button; the electronic device detects an operation acting on the voice focus button.
Referring to the embodiment shown in fig. 2B-2C, the voice focus button may be the voice focus button 214 shown in fig. 2B, and the electronic device may detect an operation of the user acting on the voice focus button 214, and further identify an instance in the preview screen in response to the operation.
In a second aspect, the present application provides an electronic device comprising a display screen, a camera, a memory, and a processor coupled to the memory; wherein the display screen is used for displaying an interface, the camera is used for shooting an image, the memory stores a computer program, and the processor executes the computer program to enable the electronic device to realize the method of any one of the first aspect.
In a third aspect, the present application provides a computer readable storage medium storing a computer program or computer instructions for execution by a processor to implement the method of any one of the first aspects.
In a fourth aspect, embodiments of the present application provide a computer program product which, when executed by a processor, implements a method according to any one of the first aspects.
In a fifth aspect, embodiments of the present application provide a chip, the chip including a processor and a memory, where the memory is configured to store a computer program or computer instructions, and the processor is configured to execute the computer program or computer instructions stored in the memory, so that the chip performs the method according to any one of the first aspect.
The solutions provided in the second aspect to the fifth aspect are used to implement or cooperate to implement the methods correspondingly provided in the first aspect, so that the same or corresponding beneficial effects as those of the corresponding methods in the first aspect can be achieved, and no further description is given here.
Drawings
Fig. 1 is a schematic architecture diagram of an electronic device 100 according to an embodiment of the present application;
Fig. 2A to fig. 2K are schematic diagrams of user interactions involved in a focusing method according to an embodiment of the present application;
Fig. 3 is a flowchart of a focusing method according to an embodiment of the present application;
Fig. 4 is a schematic diagram of an electronic device processing a preview frame using a coarse-grained image segmentation model according to an embodiment of the present application;
Fig. 5 is a schematic diagram of an electronic device processing a preview frame using a target detection model according to an embodiment of the present application;
Fig. 6 is a schematic diagram of an electronic device further processing a preview frame using a fine-grained image segmentation model according to an embodiment of the present application;
Fig. 7 is a schematic architecture diagram of a fine-grained image segmentation model according to an embodiment of the present application;
Fig. 8 is a schematic diagram of a training method of a fine-grained image segmentation model according to an embodiment of the present application.
Detailed Description
The terminology used in the following embodiments of the application is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in the specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in this application refers to and encompasses any and all possible combinations of one or more of the listed items.
The terms "first," "second," and the like, are used below for descriptive purposes only and are not to be construed as implying or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature, and in the description of embodiments of the present application, unless otherwise indicated, the meaning of "a plurality" is two or more.
When a user takes a picture using the camera of the electronic device, the display screen of the electronic device can display a preview screen of the images acquired by the camera. The preview screen typically contains a plurality of targets (in the embodiments of the present application, the targets in the preview screen may also be referred to as instances); for example, the preview screen may contain images of several people, and each person may be referred to as a target (or an instance). The electronic device may receive an operation in which the user clicks a particular location of the preview screen displayed on the display screen, and in response to the operation the electronic device may focus on one or more targets displayed at that location. For example, the electronic device may determine the location on the display screen that the user clicked and then drive the motor to change the position of the lens in the camera so that the image of the target at that location captured by the camera becomes sharper. The embodiments of the present application do not limit the way in which the electronic device focuses on a particular position in the preview screen.
That is, focusing is used to sharpen the content in the focus area when capturing an image. The user can manually select and adjust the focus area; for example, the user can set the focus area by performing a click operation on the preview screen, in which case the clicked area becomes the focus area. The electronic device may also display a focus frame in the focus area so that the user can see where the focus area is. Optionally, the electronic device may also adjust the focus area automatically.
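As a rough illustration of this conventional tap-to-focus behavior (not the voice-based method of this application), the sketch below maps a tapped display coordinate to a small focus region; the sizes and the coordinate mapping are assumptions for illustration.

```python
# Sketch of tap-to-focus: a tap on the display becomes a square focus region
# (ROI) in preview-frame pixels.

def tap_to_focus_roi(tap_xy, display_size, preview_size, roi=120):
    """Map a tap on the display to a square focus ROI in preview-frame pixels."""
    tx, ty = tap_xy
    dw, dh = display_size
    pw, ph = preview_size
    # Scale the tap position from display coordinates to preview coordinates
    px, py = tx * pw / dw, ty * ph / dh
    half = roi // 2
    x0, y0 = max(int(px - half), 0), max(int(py - half), 0)
    x1, y1 = min(int(px + half), pw), min(int(py + half), ph)
    return (x0, y0, x1, y1)


print(tap_to_focus_roi((540, 1200), (1080, 2400), (1280, 960)))  # (580, 420, 700, 540)
```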
In some scenes, the user is far from the electronic device when photographing; for example, the user may press the shutter key on a selfie stick when photographing with a selfie stick, or instruct the electronic device to take a photo through a voice command such as "photograph" or "eggplant" when using voice-controlled shooting. In such scenarios the user may not be able to conveniently click the display screen to select the focus position, which may cause the electronic device to focus on the wrong target, so the final photo does not match the effect the user expects.
To solve this problem, the present application provides a focusing method, an electronic device, and a computer-readable storage medium. The electronic device can segment one or more instances from the image acquired by the camera, then receive the user's voice instruction to focus on a specific instance, and focus on that instance based on the voice instruction. The user can therefore control the focus position without clicking the display screen of the electronic device, improving the user's photographing experience.
The electronic device 100 provided in the embodiment of the present application is first described below.
Fig. 1 is a schematic architecture diagram of an electronic device 100 according to an embodiment of the present application.
The electronic device 100 may be a portable terminal device running an iOS, Android, Microsoft, or other operating system, such as a mobile phone, tablet, desktop computer, laptop, handheld computer, notebook, ultra-mobile personal computer (ultra-mobile personal computer, UMPC), or netbook, as well as a cellular phone, personal digital assistant (personal digital assistant, PDA), augmented reality (augmented reality, AR) device, virtual reality (VR) device, artificial intelligence (artificial intelligence, AI) device, wearable device, vehicle-mounted device, smart home device, and/or smart city device, etc.
As shown in fig. 1, electronic device 100 may include a camera 110, a buffer memory 120, an image signal processor (image signal processor, ISP) 130, an application processor (application processor, AP) 140, a neural-Network Processor (NPU) 141, a microphone 150, an audio processor 160, a display 170, an encoder 180, and an external memory 190, coupled by one or more sets of buses. The buses may be an integrated circuit (inter-integrated circuit, I2C) bus, an integrated circuit built-in audio (inter-integrated circuit sound, I2S) bus, a pulse code modulation (pulse code modulation, PCM) bus, a mobile industry processor interface (mobile industry processor interface, MIPI), and the like.
The camera 110 may include: the lens 111, the photosensitive sensor 112, the motor 113, and a flexible printed circuit board (flexible printed circuit board, FPCB) portion (not shown in fig. 1). The FPCB connects other components of the camera 110, such as the photosensitive sensor 112, with the image signal processor (ISP) 130, for example transmitting the raw data output by the photosensitive sensor 112 to the ISP 130. The motor 113 can move the lens 111 to a designated position to facilitate focusing on objects at different distances. When shooting, the shutter of the camera 110 opens, and light passes through the lens 111 onto the photosensitive sensor 112. The photosensitive sensor 112 converts the optical signal into an electrical signal, which is further converted into a digital signal by analog-to-digital conversion (analog digital convert, ADC) and passed to the ISP for processing. The data of the digital signal, i.e. the raw image data collected by the camera, may be in a Bayer arrangement, for example. The raw image data may also be referred to as a RAW image.
The buffer memory 120 may be used to buffer the RAW images output by the camera 110. When the electronic device 100 receives an instruction from the user to generate a photo from a specified preview frame, the image signal processor 130 may fetch the RAW image corresponding to that preview frame from the buffer memory 120 to generate the photo.
The ISP 130 may be configured to perform a series of image processing operations on the RAW image to obtain YUV frames or RGB frames. The series of image processing operations may include: automatic exposure control (auto exposure control, AEC), automatic gain control (auto gain control, AGC), automatic white balance (auto white balance, AWB), color correction, dead pixel removal, and the like. The ISP may also be integrated within the camera 110.
An application processor (AP) 140 may be coupled to one or more random access memories (random access memory, RAM) and one or more non-volatile memories (NVM). The random access memory can be read and written directly by the application processor and may be used to store executable programs (e.g. machine instructions) of the operating system or other running programs, as well as data of users and applications. The non-volatile memory may also store executable programs and data of users and applications, which may be loaded into the random access memory in advance for the application processor to read and write directly. A storage unit, which may be a cache, may also be provided in the application processor to store instructions or data that the application processor has just used or recycled. The implementation code of the focusing method provided by the embodiments of the present application may be stored in the NVM. After the camera application is started, the code may be loaded into the RAM, so that the application processor can read the program code directly from the RAM and implement the focusing method provided by the embodiments of the present application.
The NPU 141 is a neural-network (NN) computing processor. By drawing on the structure of biological neural networks, for example the transfer mode between neurons in the human brain, it can rapidly process input information and can also learn continuously. Applications such as intelligent cognition of the electronic device 100, for example image recognition, face recognition, speech recognition, and text understanding, can be implemented through the NPU.
In an embodiment of the present application, the NPU141 may receive the preview stream sent by the application processor 140 and then split the pictures of one or more preview frames in the preview stream into different instances.
In this embodiment of the present application, the NPU141 may further receive a voice focusing command input by the user through the microphone 150, identify an instance selected by the user from the voice focusing command, and further focus the location of the instance.
The microphone 150, also referred to as a "mike" or "mic", is used to convert sound signals into electrical signals. When making a call or sending voice information, the user can speak close to the microphone 150 to input a sound signal. The electronic device 100 may be provided with at least one microphone 150. In other embodiments, the electronic device 100 may be provided with two microphones 150, which can implement a noise reduction function in addition to collecting sound signals. In still other embodiments, the electronic device 100 may be provided with three, four, or more microphones 150 to collect sound signals, reduce noise, identify sound sources, implement directional recording, and so on.
The audio processor 160 may be used to convert a voice focus instruction input by a user through the microphone 150 from an analog audio signal to a digital audio signal. In some embodiments the audio processor 160 may be integrated in the application processor 140.
The display 170 includes a display panel. The display panel may employ a liquid crystal display (liquid crystal display, LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a MiniLED, a MicroLED, a Micro-OLED, a quantum dot light-emitting diode (quantum dot light emitting diodes, QLED), or the like.
The display 170 may be used to display images captured by the camera, for example preview images (preview frames). A preview image may be a YUV frame or RGB frame output by the ISP that is further processed by a down-sampling algorithm; the resolution of a preview image is usually lower than that of a photo, which avoids the display latency that would result from preview frames of excessively high resolution. A series of preview images (preview frames) arranged in time order forms a preview stream; based on the preview stream, the display screen can present the pictures acquired by the camera in real time. The preview stream must be sent for display before it can appear on the display screen. Sending for display refers to pushing the preview images collected by the camera to a frame buffer (FB) for storage. The frame buffer is a section of storage space, which may be located in video memory or in main memory, used to store rendering data processed by or to be fetched by the graphics chip. The content of the frame buffer corresponds to the interface shown on the display screen and can simply be understood as the buffer backing the content displayed on the screen; that is, modifying the content of the frame buffer modifies the picture shown on the display screen. A touch sensor 171 may be provided in the display 170 to detect touch operations acting on it, and a detected touch operation may be passed to the application processor (AP) to determine the type of touch event.
The encoder 180 may be configured to encode YUV frames or RGB frames output by the ISP to obtain a photograph. The format of the photograph output by encoder 180 may include, but is not limited to: joint photographic experts group (Joint Photographic Experts Group, JPEG), tagged image file format (Tag Image File Format, TIFF), etc. In some embodiments, encoder 180 may be an encoding unit integrated in application processor 140.
External memory 190 may be one type of NVM that can be used to hold image files such as photographs, videos, etc. The photos and videos can be stored in a path accessible by the gallery application program, so that a user can view the photos and videos in the path by opening the gallery. The gallery is an application program for managing image files such as photos, videos and the like, and can be named as an album.
In the embodiments of the present application, the application processor 140 may receive the preview stream processed by the image signal processor 130 and send it to the NPU 141 for image segmentation. The NPU 141 may segment the areas occupied by different instances in the preview frame picture (which may also be referred to as the preview screen) to obtain a plurality of image areas that do not overlap one another. For example, when the picture of a preview frame contains several people, each person may be referred to as an instance, and the NPU 141 may separate the areas occupied by the different people in the picture. The NPU 141 may also identify the instance in each area of the preview frame picture and mark the instance corresponding to the area with a label. The content of a label may include an instance name and a sequence number, and the label can be used to distinguish different instances in the preview frame picture; for example, the labels of different people in the picture may be "person 1" and "person 2" respectively. Optionally, the NPU 141 may segment each instance from a rectangular area containing that instance, where the rectangular area may be obtained by the NPU 141 performing image segmentation or object detection on the preview frame. The NPU 141 may output the image segmentation result to the application processor 140, where the image segmentation result may include the area to which each pixel of the preview frame picture belongs and the label corresponding to that area (a pixel being the basic element of a digital image picture). The application processor 140 may process the preview frame picture using the image segmentation result and then display the processed picture on the display 170. In the processed preview frame picture, the label of each instance may be displayed inside or around the area occupied by that instance. The user can then issue a voice focus instruction based on an instance's label to instruct the electronic device 100 to control the camera 110 to focus on the position of that instance.
The microphone 150 may receive a voice focus instruction input by the user; the voice focus instruction may contain the label corresponding to an instance in the preview frame picture and instructs the electronic device to focus on the area in which that instance is located. The audio processor 160 may convert the voice focus instruction from the analog audio signal collected by the microphone 150 into a digital audio signal and then transmit it to the application processor 140. The application processor 140 may send the voice focus instruction to the neural network processor 141, which identifies the instance label contained in it. The application processor 140 may receive the label identified by the neural network processor 141 and then determine, based on the image segmentation result of the preview frame, the position of the instance in the preview frame picture, which is the position to be focused. On the one hand, the application processor 140 may send the position to be focused to the display 170 so that the display 170 can mark the position in the preview frame picture. On the other hand, the application processor 140 may also send the position to be focused to the ISP 130, and the ISP 130 may drive the motor 113 to change the position of the lens 111, thereby improving the sharpness of the position to be focused in the preview frame picture.
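The cooperation between the microphone, NPU, application processor, ISP, and display described above can be summarized in the following sketch (Python; all function names such as recognize_label, draw_focus_frame, and isp_focus are hypothetical stand-ins for those roles, not real driver interfaces).

```python
# Illustrative end-to-end sketch of the voice-focus pipeline described above.

def handle_voice_focus(command_text, segmentation):
    """segmentation: {label: (x, y, w, h)} produced from the preview frame."""
    label = recognize_label(command_text, segmentation.keys())  # NPU role
    if label is None:
        return False
    x, y, w, h = segmentation[label]
    focus_point = (x + w // 2, y + h // 2)
    draw_focus_frame(focus_point)  # display role: show the focus frame
    isp_focus(focus_point)         # ISP role: drive the motor to refocus
    return True


def recognize_label(text, labels):
    # Naive matching: return the first label mentioned in the command
    return next((lb for lb in labels if lb in text), None)


def draw_focus_frame(point):
    print("focus frame at", point)


def isp_focus(point):
    print("ISP drives motor to focus at", point)


handle_voice_focus("focus on person 1",
                   {"person 1": (120, 80, 90, 200), "dog": (300, 300, 80, 60)})
```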
For ease of description and understanding, the following embodiments describe the focusing method provided by the embodiments of the present application using the camera 110 as a front camera. A front camera is a camera on the same side as the display panel. The display panel may reserve an aperture at the position of the camera 110 so that the camera 110 can acquire images through the reserved aperture without its lens being blocked by the display panel. Optionally, the camera 110 may also be embedded under the display panel as an under-display camera (under display camera, UDC). When the electronic device 100 shoots with another camera (such as a rear camera), the focusing method is the same as when shooting with the camera 110 and is not described again here.
The structure illustrated in fig. 1 does not constitute a specific limitation on the electronic device 100; the electronic device 100 may include more or fewer components than illustrated, may combine certain components, may split certain components, or may arrange the components differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware. For example, the electronic device may also include a graphics processor (graphics processing unit, GPU) for rendering, or may contain more cameras. As another example, the electronic device 100 may also include a variety of sensors, such as pressure sensors, distance sensors, proximity sensors, touch sensors, and ambient light sensors; the electronic device 100 may achieve fast focus through the distance sensors.
The following describes a scene in which the electronic device provided in the embodiment of the present application focuses.
Fig. 2A to fig. 2K are schematic diagrams of User Interactions (UI) involved in a focusing method according to an embodiment of the present application.
Fig. 2A illustrates a home screen interface 200 on the electronic device 100, which may include desktop icons for one or more applications; as shown in fig. 2A, these may include a desktop icon 201 for a camera application.
The electronic device 100 may detect a user operation, such as a click operation, acting on the desktop icon 201. In response to this operation, the electronic device 100 may launch the camera application and display the photo preview interface 210 as shown in fig. 2B. At the same time, the electronic device 100 may open the lens 111 of the camera 110 to capture an image.
As shown in fig. 2B, the photo preview interface 210 may include a toolbar 211, a preview window 212, and a shutter button 213.
The toolbar 211 may contain buttons corresponding to one or more function options, for example an AI camera button, a voice focus button 214, a flash control button, a filter control button, and the like. The voice focus button 214 is used to enable the voice focus function; when the electronic device enables the voice focus function, it executes the focusing method provided by the embodiments of the present application. In the embodiment shown in fig. 2B, a diagonal line may be displayed on the voice focus button 214 to indicate that the voice focus function is currently off.
The preview window 212 may be used to display the preview stream generated by the electronic device 100 after processing the images captured by the camera 110. As shown in fig. 2B, the preview frame displayed in the preview window 212 includes instances such as a man facing the camera 110, a woman, a person playing with a mobile phone, a dog, and grass.
The shutter button 213 is used to take a photograph. When the electronic device 100 receives an operation of clicking the shutter button 213 by a user, the ISP130 may fetch a RAW image corresponding to the preview frame from the buffer memory 120, and then the ISP130 may process the RAW image to generate a photograph and store it in the external memory 190.
As shown in fig. 2C, the electronic device 100 may receive an operation in which the user clicks the voice focus button 214, and in response the electronic device 100 may enable the voice focus function. At the same time, the electronic device 100 may remove the diagonal line displayed on the voice focus button 214, indicating that the voice focus function has been enabled, and may activate the microphone 150 to collect the user's voice focus instructions. Meanwhile, each instance in the preview frame picture may be accompanied by its corresponding label. As shown in fig. 2C, the labels in the preview frame picture displayed in the preview window 212 may include "grass", "person 1", "person 2", "person 3", "dog", and so on, where the "person 1" label identifies the man facing the camera, the "person 2" label identifies the girl, and the "person 3" label identifies the person playing with a mobile phone. It should be noted that the labels shown in fig. 2C are only examples and do not limit the embodiments of the present application; for example, the electronic device 100 may represent the labels of different instances in another form, or the labels shown in fig. 2C may use other text, e.g. the label of the girl could be "girl". The preview frame picture displayed in the preview window 212 may be obtained by the application processor 140 combining the preview frame picture with the image segmentation result of the preview frame. The image segmentation result may be obtained by the neural network processor 141 receiving the preview frame sent by the application processor 140, segmenting it based on the areas in which its different instances are located, and then identifying the picture content of each segmented area. Here, the neural network processor 141 may segment the preview frame using a coarse-grained image segmentation model or an object detection model; these are described in later embodiments and not expanded here.
In some embodiments, the electronic device 100 may also outline each identified instance in the preview screen with a bounding box, where a bounding box may be a rectangle that contains one instance in the preview screen, so that the user can see more clearly where each instance is located. Optionally, the bounding box may be drawn with a dashed or solid line, which is not limited in the embodiments of the present application.
That is, the electronic device may divide a plurality of instances in the preview screen, and display a label of each instance in the preview screen, so that the user may be prompted to issue a voice focusing instruction based on the label of the instance, so as to avoid inaccurate focusing targets identified by the electronic device due to the user issuing an inappropriate voice focusing instruction.
In some embodiments, the electronic device 100 may further segment the instances in the preview frame picture. As shown in fig. 2D, the microphone 150 of the electronic device 100 receives a voice focus instruction from the user, which may be used to select, from the plurality of labels displayed in the preview frame picture, the area corresponding to the "person 1" label for focusing. The voice focus instruction may be, for example, the user saying "focus on person 1". In response to the voice focus instruction, the electronic device 100 may focus on the man facing the camera that corresponds to the "person 1" label in the preview frame picture. A focus frame is displayed in the area of the preview frame picture occupied by the man facing the camera, and the focus frame may be located at the center of gravity of the area in which "person 1" is located. As shown in fig. 2E, part of the area framed by the focus frame in the preview frame picture is the man's face and the rest is his neck. The electronic device 100 may further segment the instance in the area where "person 1" is located; for example, the electronic device 100 may divide that area into several regions and mark the instance corresponding to each region with a label. As shown in fig. 2E, the area where "person 1" is located is divided into three regions whose labels are "hair", "face", and "clothes" respectively. Taking the region corresponding to the "hair" label as an example, that region is where the hair of the man facing the camera is located in the preview frame picture. Optionally, the electronic device 100 may hide the labels obtained by processing the preview frame with the coarse-grained image segmentation model (or the object detection model), such as "grass", "person 1", "person 2", "person 3", and "dog".
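The placement of the focus frame at the center of gravity of an instance's area, as described above, could be computed roughly as follows (Python sketch over a plain 0/1 mask; a real implementation would use the segmentation output, so the mask here is an illustrative assumption).

```python
# Sketch: focus frame placed at the centre of gravity of an instance's region.

def mask_centroid(mask):
    """mask: 2D list of 0/1 values, 1 marking pixels that belong to the instance."""
    sx = sy = count = 0
    for y, row in enumerate(mask):
        for x, v in enumerate(row):
            if v:
                sx += x
                sy += y
                count += 1
    if count == 0:
        return None
    return (sx // count, sy // count)


mask = [[0, 1, 1, 0],
        [0, 1, 1, 1],
        [0, 0, 1, 1]]
print(mask_centroid(mask))  # (2, 1): roughly the middle of the marked pixels
```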
In some embodiments, after the electronic device 100 focuses on a certain instance and segments it into a plurality of sub-instances, that instance may leave the shooting range of the camera. The electronic device 100 may then re-identify the instances in the preview screen and display their labels. For example, after the electronic device segments "person 1", "person 1" may move out of the shooting range of the camera. When the electronic device no longer detects "person 1", it may re-identify the instances in the preview screen and display their labels without segmenting them into sub-instances.
Here the neural network processor 141 may segment the preview frame using a fine-grained image segmentation model. The fine-grained image segmentation model can recognize more types of instances and may be used to further segment the image regions produced by the coarse-grained image segmentation model (or the image regions initially identified by the object detection model), thereby identifying more instances. For example, the coarse-grained image segmentation model or the object detection model may determine the area of the picture in which a person is located, and the fine-grained model can further divide that area to obtain the areas of the person's face, hair, body, and so on. The "face", "hair", and "body" here are sub-instances contained in the "person" instance.
As shown in fig. 2F, the electronic device 100 receives, through the microphone 150, a voice focusing instruction of the user, where the voice focusing instruction may be used to select an area corresponding to a "face" tag from a plurality of tags displayed in the preview frame screen to focus. The voice focusing instruction may be, for example, voice of "focusing on a face" uttered by the user. In response to the voice focusing instruction, the electronic device 100 may instruct the camera 110 to focus on the face of the man facing the camera, and at the same time, the electronic device 100 may move the position of the focusing frame in the preview window 212, and move the focusing frame to the center of gravity position of the area where the "face" tag is located. Therefore, the electronic equipment can meet the requirement of a user for selecting a finer focusing target, help the user to focus more accurately, and improve the shooting experience of the user.
In some embodiments, the electronic device 100 may display the tags of all the identified instances on the preview picture after processing the preview frame using the coarse-granularity image segmentation model (or the object detection model) and the fine-granularity image segmentation model. In this case, the preview picture includes both the labels obtained by processing the preview frame with the coarse-granularity image segmentation model (or the target detection model) and the labels obtained by processing the preview frame with the fine-granularity image segmentation model. For example, if the coarse-granularity image segmentation model or the target detection model determines a "person" instance, and the fine-granularity image segmentation model further divides the image area corresponding to the "person" instance to obtain sub-instances such as "face", "hair" and "body", the preview picture may include the labels "person", "face", "hair" and "body". Thus, the user can select a finer focusing target by issuing only one voice focusing instruction.
In some embodiments, the electronic device 100 may associate words or terms in a tag with other related words or terms. For example, "person" may be associated with related terms such as "people" or "human", and "1" may also be associated with "first". For example, in the embodiment shown in fig. 2D, when the voice focusing instruction issued by the user is "focus on the first person", the electronic device 100 may also focus on the area where the "person 1" instance is located in response to the voice focusing instruction. That is, the voice focusing instruction received by the electronic device 100 may take various forms.
It should be noted that the electronic device 100 may display, in the preview screen, the labels of only some of the identified instances, without displaying the labels of all the identified instances. This avoids displaying too many labels in the preview screen, which would affect the user's viewing experience.
In some embodiments, when the user wants to change the focus position, the user may also refocus through a voice focusing instruction. As shown in fig. 2G, the electronic device 100 may receive, through the microphone 150, a voice focusing instruction of the user, where the voice focusing instruction may be used to instruct the camera 110 to refocus. The voice focusing instruction may be, for example, a voice of "refocus" input by the user. In response to the voice focusing instruction, the electronic device 100 may re-identify one or more instances in the preview frame picture and display the labels of the instances in the preview frame picture. As shown in fig. 2G, the electronic device 100 may identify one or more instances in the preview frame picture and then display the labels corresponding to the one or more instances on the preview frame picture: "person 1", "person 2", "person 3", "dog" and "grass". In this way, the user can change the focus position multiple times as needed, improving the user's shooting experience.
In some embodiments, in addition to identifying the type of instance in the voice focusing instruction, the electronic device 100 may identify words or phrases in the voice focusing instruction that are used to indicate an orientation. As shown in fig. 2H, the electronic device 100 may receive, through the microphone 150, a voice focusing instruction of the user, where the voice focusing instruction may include both a tag and a direction word. For example, the voice focusing instruction may be a voice of "focus on the left of the grass" input by the user. The electronic device 100 may extract the label "grass" and the direction word "left" contained in the voice focusing instruction, so that the electronic device 100 may focus on the left side of the area where the "grass" is located. In this way, the user can select more focus positions during shooting, improving the user's shooting experience.
In some embodiments, the electronic device may also adjust the focus position based on the position of the focus frame in response to a voice focusing instruction of the user. As shown in fig. 2I, the electronic device 100 may receive a voice focusing instruction of the user of "move the focus frame to the left", and in response to the voice focusing instruction, the electronic device 100 may move the focus frame to the left by a preset distance while controlling the camera 110 to change the focus position accordingly. In this way, the user can control the focus position more precisely, improving the user's shooting experience. The instruction is not limited to "move the focus frame to the left"; the electronic device 100 may associate leftward movement of the focus position with various voice focusing instructions. For example, the electronic device 100 may also move the focus position to the left after receiving voice focusing instructions such as "move to the left" or "focus to the left" from the user, which is not limited in the embodiments of the present application. The method for the user to instruct the electronic device to move the focus position upward, downward or rightward through a voice focusing instruction may refer to the method for instructing the electronic device to move the focus position leftward, and is not repeated herein.
In some embodiments, after receiving a voice focusing instruction of the user, the electronic device may cancel displaying the labels of the instances or sub-instances being displayed in the preview screen. For example, in the embodiment shown in fig. 2E-2F, after receiving the voice focusing instruction of "focus on the face", the electronic device may focus on the face and cancel displaying labels such as "hair", "face" and "clothes". After a preset time has elapsed, the electronic device may display the labels of the instances again. In this way, the user can observe the preview picture more conveniently after selecting the focus position, preventing the labels from blocking the preview picture. And/or, the electronic device 100 may display the labels of instances in the preview screen again after receiving an operation of clicking the shutter button 213 and generating a photograph in response to the operation. In this way, the electronic device can display the labels in time after the user takes a picture, prompting the user to issue a voice focusing instruction according to the labels.
In some embodiments, the electronic device may always display the labels of the instances in the preview screen, so that the labels can give the user a prompt and help the user issue a voice focusing instruction.
In some embodiments, the instance types pre-identified by the electronic device 100 may be limited, and the voice focusing instruction received by the electronic device 100 may further include instance types not pre-identified. The electronic device 100 may detect an instance type in the preview frame that is not pre-identified in response to the voice focus instruction.
As shown in fig. 2J, the electronic device 100 receives, through the microphone 150, a voice focusing instruction of the user, which may be a voice of "focus on flowers" uttered by the user. The electronic device 100 may recognize the instance type "flower" from the voice focusing instruction, where "flower" is an instance type that has not been previously recognized. The electronic device 100 may then detect "flowers" in the preview frame picture.
As shown in fig. 2K, the electronic device 100 may detect "flowers" in the preview frame picture and display the "flower 1" and "flower 2" labels. The electronic device may then receive a voice focusing instruction for focusing on an instance of the "flower" type; for example, the voice focusing instruction may be a voice of "focus on flower 1" uttered by the user, and the electronic device 100 may then focus on the instance corresponding to the "flower 1" tag selected by the user.
It should be noted that the voice focusing instruction received by the electronic device may be ambiguous and may take various forms. Taking the two instance labels "flower 1" and "flower 2" shown in fig. 2K as an example, the electronic device may receive a voice focusing instruction of the user of "focus on the first flower", and then focus on the instance corresponding to "flower 1". Alternatively, the electronic device may receive a voice focusing instruction of the user of "focus on the flower on the left", and then focus on the instance corresponding to "flower 1".
In some embodiments, after the electronic device 100 identifies a portion of the instances in the preview screen, the labels of the instances may not be displayed on the preview screen. The electronic device 100 may display a tag of an instance included in the voice focus instruction in the preview screen after receiving the voice focus instruction input by the user. When the instance contained in the voice focusing instruction is not among the instances previously recognized by the electronic device 100, the electronic device 100 may recognize the instance in the preview screen.
In some embodiments, the electronic device 100 may not pre-identify the instance in the preview screen. After receiving the voice focusing instruction input by the user, the electronic device 100 may identify an instance in the preview screen based on the voice focusing instruction, and further display a label corresponding to the instance in the preview screen.
The following describes a focusing method provided in the embodiments of the present application.
Fig. 3 is a flowchart of a focusing method provided in an embodiment of the present application. As shown in fig. 3, the method may include, but is not limited to, the following steps:
S301, acquiring a preview frame A from a preview stream, wherein a picture of the preview frame A comprises an instance A.
The application processor 140 in the electronic device 100 may receive the preview stream transmitted by the image signal processor 130 and process the preview stream. The preview stream is made up of a plurality of preview frames generated by the electronic device in time order. Preview frame a may be included in the preview stream.
One or more types of instances may be included in the picture of preview frame A, where an instance may refer to an object having a shape in the picture. Types of instances may include, but are not limited to, humans, animals, plants, microorganisms, stones, water currents, fireworks, buildings, and the like.
S302, identifying the instance A in the preview frame A and displaying the label of the instance A.
The electronic device may identify one or more types of instances in preview frame A, and thus identify instance A in preview frame A. The methods by which the electronic device 100 recognizes an instance in preview frame A may include the following two:
Method one: the electronic device 100 uses the coarse-granularity image segmentation model to identify instances in preview frame A.
The neural network processor 141 in the electronic device 100 may use a coarse-grain image segmentation model to segment the region occupied by one or more types of instances in the preview frame a picture such that the preview frame a picture is divided into a plurality of regions.
It should be noted that the preview frame is composed of a plurality of pixels. The electronic device 100 may divide the preview frame a picture into different regions, each region corresponding to an instance, by a coarse-granularity image segmentation model. While each region will also have its own tag that is used to identify the instance in that region.
Specifically, taking the embodiment shown in fig. 4 as an example, after the preview frame a is input to the coarse-granularity image segmentation model, the coarse-granularity image segmentation model divides the pixels of the preview frame a into different pixel sets, where each pixel set corresponds to a region in the preview screen. For example, the electronic device may divide all pixels in the region of the preview frame a picture where the man facing the camera is located into a pixel set a through the coarse-granularity image segmentation model, and identify the pixel set a with a label of "person 1", which indicates that the pixels in the pixel set a may constitute an instance of "person 1" in the preview frame a picture. Finally, the coarse-grain image segmentation model may output a plurality of pixel sets, each with a respective label for identifying instances of the picture that the pixel set constitutes.
After the electronic device 100 divides the area occupied by each instance in the preview frame A picture, preview frame A may be displayed on the display screen based on the segmentation result. That is, the electronic device 100 may display the label corresponding to each pixel set in or near the region where the pixel set is located in preview frame A. For example, the electronic device 100 may display the label corresponding to each instance in or beside the area occupied by the instance in the preview frame A picture. Referring to the embodiment shown in fig. 2C, the electronic device may label the identified instances of multiple types with tags, and when a certain type of instance in preview frame A includes multiple instances, the electronic device may number them, such as "person 1" and "person 2".
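For illustration only, the grouping and numbering described above could be sketched as follows in Python. This is a minimal sketch, not the actual implementation of the electronic device 100; the `coarse_model` callable, which is assumed to return one binary mask and one class name per detected instance, is a hypothetical interface.

```python
import numpy as np

def label_preview_frame(preview_frame, coarse_model):
    # `coarse_model` is a hypothetical callable: [(mask, class_name), ...]
    results = coarse_model(preview_frame)
    counters, labeled_sets = {}, []
    for mask, class_name in results:
        counters[class_name] = counters.get(class_name, 0) + 1
        tag = f"{class_name} {counters[class_name]}"     # e.g. "person 1", "person 2"
        ys, xs = np.nonzero(mask)                        # pixel set occupied by the instance
        centroid = (int(xs.mean()), int(ys.mean()))      # center of gravity, a candidate focus point
        labeled_sets.append({"tag": tag, "pixels": (ys, xs), "centroid": centroid})
    return labeled_sets
```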
Method two: the electronic device 100 uses the object detection model to identify instances in preview frame A.
The electronic device 100 may obtain the preview frame a and then detect instances of one or more types in the preview frame a using the object detection model. Wherein the object detection model can sequentially frame different positions in the preview frame a using rectangular areas (bounding boxes) of different sizes, and then identify whether and what types of instances are contained within the bounding box.
As shown in fig. 5, the object detection model may receive the input of preview frame A and then determine bounding boxes for instances of one or more types. The one or more types of instances identified by the electronic device 100 may include, for example, "people", "dogs" and "grass". The electronic device may detect "people", "dogs" and "grass" in the preview frame A picture through the object detection model, and then output the positions of the bounding boxes of the above instances in the preview frame A picture.
In some embodiments, the bounding box of an instance may be rectangular, the bottom edge of the bounding box may be the height at which the lowest pixel point in the set of pixels comprising the instance is located, the top edge of the bounding box may be the height at which the highest pixel point in the set of pixels comprising the instance is located, the horizontal coordinate of the left edge of the bounding box may be equal to the horizontal coordinate of the leftmost pixel point in the set of pixels comprising the instance, and the horizontal coordinate of the right edge of the bounding box may be equal to the horizontal coordinate of the rightmost pixel point in the set of pixels comprising the instance.
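The geometric rule in the preceding paragraph can be expressed as a small sketch; it assumes the instance is given as a binary mask over the preview frame.

```python
import numpy as np

def bounding_box_from_pixels(mask):
    # Axis-aligned bounding box following the rule above: the edges sit at the
    # leftmost / rightmost / topmost / bottommost pixels of the instance's pixel set.
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())   # left, top, right, bottom
```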
In some embodiments, the target detection model may be a model based on a convolutional neural network (convolutional neural networks, CNN) model, a recurrent neural network (recurrent neural network, RNN) model, a long short-term memory (LSTM) model, a deep neural network (deep neural network, DNN) model, a generative pre-training transformer (GPT) large model, and so on.
Alternatively, the target detection model may be a YOLO (you only look once) series model, such as a YOLO v1 model or a YOLO v2 model. Alternatively, the object detection model may also be an SSD (single shot multibox detector) series model.
It should be noted that the size of the bounding box is related to the object detection model, and different bounding box sizes or different shapes of bounding boxes may be used in different object detection models, which is not limited in the embodiments of the present application. For convenience of description and better understanding, the focusing method is described by taking the YOLO model as an example, and the focusing method when the electronic device uses other object detection models can refer to the focusing method when the YOLO model is used, which is not described herein.
After the electronic device detects the different instances in preview frame A, each instance may be identified by a tag displayed next to it when preview frame A is displayed. The tag may include, but is not limited to, the name, serial number, etc. of the instance, such as "person 1" and "person 2" shown in fig. 2C.
In some embodiments, the electronic device 100 may identify instance A in preview frame A in response to a voice focusing instruction of the user, where the voice focusing instruction may be used to instruct the electronic device to focus on instance A. In this scenario, instance A may be an instance that is not identified by the electronic device 100 using method one or method two described above (for example, a scenario in which the coarse-granularity image segmentation model or the object detection model cannot accurately identify the type corresponding to instance A). After receiving the user's voice focusing instruction for instance A, the electronic device 100 may identify instances of the type corresponding to instance A in preview frame A using an open-set object detection algorithm, so as to determine the position of the bounding box of instance A in preview frame A. When a plurality of instances of the type corresponding to instance A are included in the preview frame A picture, the electronic device 100 may number the plurality of instances of that type and then display their tags in the preview picture. Referring to the embodiment shown in fig. 2J, the electronic device 100 has not recognized "flowers" in the preview picture, so the label of "flower" is not included in the preview picture. The electronic device 100 may receive a voice focusing instruction of the user to "focus on flowers", and in response to the voice focusing instruction, the electronic device 100 may detect "flowers" in the preview frame using the open-set object detection algorithm, and further display the instance labels of the "flowers" on the display screen. The open-set object detection algorithm refers to an object detection algorithm capable of detecting instances of any class. Alternatively, the open-set object detection algorithm may be a GroundingDINO algorithm.
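As a hedged sketch of how a noun extracted from the voice instruction could drive such open-set detection: `open_set_detect` below is a hypothetical wrapper around an open-set detection model (for example a GroundingDINO-style model); its interface is an assumption, not taken from the patent or from any particular library.

```python
def detect_unrecognized_instance(preview_frame, noun, open_set_detect):
    # `open_set_detect` is assumed to return a list of boxes (x0, y0, x1, y1)
    # for regions matching the text prompt, e.g. "flower".
    boxes = open_set_detect(preview_frame, prompt=noun)
    tags = []
    for i, box in enumerate(boxes, start=1):
        tags.append({"tag": f"{noun} {i}", "box": box})   # e.g. "flower 1", "flower 2"
    return tags
```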
S303, receiving a voice focusing instruction A of a user, wherein the voice focusing instruction A is used for indicating the electronic equipment to focus the example A.
S304, focusing the example A based on the voice focusing instruction A.
The electronic device 100 may receive a voice focusing instruction A of the user, where the voice focusing instruction A may be used to select instance A from one or more types of instances in the preview frame A picture for focusing. Taking the embodiment shown in fig. 2D as an example, the voice focusing instruction A may be an instruction of "focus on person 1" input by the user by voice. Then, the electronic device 100 may identify voice content such as the instance name and serial number in the voice focusing instruction A through a voice recognition algorithm, so as to determine the position to be focused. For example, when the user inputs "focus on person 1" by voice, the electronic device 100 may determine, based on words such as "person" and "1" therein, that the "person 1" instance is to be focused. The electronic device 100 may then determine the pixel set corresponding to the "person 1" instance, and may select the center of gravity of the area occupied by the pixel set for focusing.
In some embodiments, the electronic device 100 may also recognize a direction word in the voice focusing instruction input by the user. Taking the embodiment shown in fig. 2H as an example, the voice focusing instruction input by the user is "focus on the left of the grass". The electronic device 100 may identify the instance name "grass" and the direction word "left" therein, so as to focus on the area corresponding to the "grass" instance. Optionally, the electronic device 100 may obtain the width and height of the area in preview frame A, divide the area into N equal parts based on the width of the area, and then select the center of the leftmost part for focusing, where N may be a positive integer such as 2 or 3.
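A minimal sketch of this instruction parsing is shown below, reusing the `labeled_sets` structure from the earlier coarse-segmentation sketch. The English keyword vocabulary, the matching strategy and the leftward sub-region rule are assumptions made for illustration, not the electronic device's actual voice recognition logic.

```python
DIRECTION_WORDS = {"left", "right", "up", "down", "middle"}   # assumed vocabulary

def focus_point_for_instruction(text, labeled_sets, n_splits=3):
    words = text.lower().split()
    direction = next((w for w in words if w in DIRECTION_WORDS), None)
    # prefer an exact tag match ("person 1"); fall back to the class name alone ("grass")
    target = next((s for s in labeled_sets if s["tag"] in text.lower()), None) \
        or next((s for s in labeled_sets if s["tag"].split()[0] in words), None)
    if target is None:
        return None
    ys, xs = target["pixels"]
    if direction == "left":
        # approximate the center of the leftmost of N equal-width sub-regions
        cut = xs.min() + (xs.max() - xs.min()) / n_splits
        keep = xs <= cut
        return int(xs[keep].mean()), int(ys[keep].mean())
    return target["centroid"]          # default: center of gravity of the whole area
```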
Alternatively, the electronic device 100 may input the voice focusing instruction into a neural network model, and the neural network model may output voice content such as the instance name, serial number and direction in the voice focusing instruction. The neural network model may be a model based on a convolutional neural network (convolutional neural networks, CNN) model, a recurrent neural network (recurrent neural network, RNN) model, a long short-term memory (LSTM) model, a deep neural network (deep neural network, DNN) model, a generative pre-training transformer (GPT) large model, or the like. In one possible implementation, the neural network model may be a small model obtained by distillation training of a large model. A large model is a neural network model with a large scale, a large number of parameters, a huge data set and a complex architecture; a small model is a neural network model with a small scale, a small number of parameters, a small data set and a relatively simple architecture. The electronic device 100 may identify the voice focusing instruction input by the user through the small model, so as to identify the information of the position to be focused from the voice focusing instruction in a shorter time.
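To illustrate the idea of distilling a large instruction-parsing model into a small one, a standard knowledge-distillation loss is sketched below in PyTorch. The temperature and weighting values are assumptions for illustration; the patent does not specify a particular distillation scheme.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Soft targets from the large (teacher) model plus the usual hard-label loss
    # for the small (student) model.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```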
The electronic device 100 may continuously acquire new preview frames from the preview stream for processing, and further display the new preview frames on the display screen. After focusing the instance a based on the voice focusing instruction a, the electronic device will focus the instance a in the preview frame all the time as long as the instance a still exists in the subsequent preview frame. The method for focusing the instance a in the subsequent preview frame by the electronic device may refer to the method for focusing the instance a in the preview frame a, which is not described herein.
In some embodiments, when the electronic device 100 identifies the instance a using the coarse-granularity image segmentation model in step S302, the electronic device may focus the center or gravity center of the region in which the instance a is located in the preview frame a after receiving the voice focusing instruction a for focusing the instance a.
In other embodiments, the electronic device 100 identifies instance a using the object detection model in step S302, which may output a bounding box containing instance a. The electronic device 100 may focus on the center of the bounding box, where the center of the bounding box refers to the intersection of the diagonals of the rectangular area of the bounding box.
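The two focusing strategies described in the preceding two paragraphs can be summarized in a short sketch; it assumes the segmentation path provides a binary mask and the detection path provides a (left, top, right, bottom) box.

```python
import numpy as np

def focus_point_from_mask(mask):
    # Segmentation path: center of gravity of the region occupied by the instance.
    ys, xs = np.nonzero(mask)
    return int(xs.mean()), int(ys.mean())

def focus_point_from_box(box):
    # Detection path: center of the bounding box, i.e. the intersection of its diagonals.
    left, top, right, bottom = box
    return (left + right) // 2, (top + bottom) // 2
```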
S305, acquiring a preview frame B from the preview stream, wherein a picture of the preview frame B comprises an instance A.
The electronic device may obtain a preview frame B from the preview stream, where preview frame B is another preview frame whose picture contains instance A, and the time at which the electronic device generates preview frame B may be later than that of preview frame A.
S306, dividing the occupied area of the instance A in the preview frame B to obtain the instance B, and displaying the label of the instance B.
In step S304, the electronic device determines, through the voice focusing instruction A, that instance A is to be focused. Then the electronic device 100 may acquire the pixel set corresponding to instance A in preview frame B, and segment one or more types of instances in the image area where the pixel set is located.
The electronic device may segment the area occupied by instance A in preview frame B using a fine-grained image segmentation model. Illustratively, as shown in fig. 6, the electronic device may obtain a bounding box containing instance A from preview frame B, and the bounding box may be rectangular in shape; alternatively, the bounding box of instance A may be the smallest rectangle that can contain the image of instance A. After the electronic device inputs the image within the bounding box of instance A into the fine-grained image segmentation model, the fine-grained image segmentation model may segment the image area where instance A is located, and finally output pixel sets corresponding to one or more types of instances, which may include pixel sets corresponding to instances such as the face, hair, neck and clothing. Finally, the electronic device may display the image segmentation result on the display screen and label some or all of the instances on the display screen. As shown in fig. 2E, the electronic device may label the hair, face and clothing of instance A (the man whose face is toward the camera) in the preview picture.
In some embodiments, the location of instance a identified in step S302 by the electronic device 100 may be inaccurate, and the bounding box containing instance a determined based on the identified location of instance a may be inaccurate. The electronic device 100 may expand the length of the bounding box to P times the original length and/or expand the width of the bounding box to Q times the original width after determining the bounding box of the instance a based on the location of the instance a identified in step S302, where P and Q may be any real numbers, P and Q may be unequal, e.g., P may be 1, 1.5, 2, etc., Q may be 2, 2.5, 3, etc., which is not limited by the embodiments of the present application. The electronic device 100 may then input the preview screen in the enlarged bounding box into a fine-grained image segmentation model for segmentation.
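A minimal sketch of this bounding-box enlargement is shown below. Which dimension "length" refers to, the scale factors P and Q, and the clamping to the frame are assumptions for illustration.

```python
def expand_bounding_box(box, frame_w, frame_h, p=1.5, q=1.5):
    # Enlarge a possibly inaccurate bounding box around its center: width scaled by P,
    # height scaled by Q, then clamped to the frame. P = Q = 1.5 is an assumed example.
    left, top, right, bottom = box
    cx, cy = (left + right) / 2.0, (top + bottom) / 2.0
    half_w, half_h = (right - left) * p / 2.0, (bottom - top) * q / 2.0
    return (int(max(0, cx - half_w)), int(max(0, cy - half_h)),
            int(min(frame_w - 1, cx + half_w)), int(min(frame_h - 1, cy + half_h)))
```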
In some embodiments, where the electronic device 100 identifies instance a using the coarse-grain image segmentation model in step S302, the electronic device may also input an image of instance a (instead of all images within the bounding box containing instance a) into the fine-grain image segmentation model, which in turn segments one or more types of instances in instance a by the fine-grain image segmentation model.
In some embodiments, when the electronic device 100 segments preview frame B through the fine-grained image segmentation model, all images within the bounding box may be segmented. For example, after the image shown in fig. 6 is input to the fine-grained image segmentation model, the electronic device 100 may segment the man, the girl and the dog through the fine-grained image segmentation model. In this way, after segmenting the man, the girl and the dog, the electronic device 100 may label one or more of these instances in the preview screen.
S307, a voice focusing instruction B of the user is received, wherein the voice focusing instruction B is used for instructing the electronic equipment to focus the instance B in the instance A.
S308, focusing the example B based on the voice focusing instruction B.
The electronic device may receive an instruction from a user to focus on instance B in instance a, thereby focusing on instance B. Wherein instance B may also be referred to as a sub-instance of instance a. Referring to the embodiment shown in fig. 2E-2F, the electronic device may receive a voice focusing command B input by the user, where the voice focusing command B may be "focusing on a face", and in response to the voice focusing command B, the electronic device 100 may drive the motor 113 to move the position of the lens 111, so as to achieve focusing on the face in the preview frame B.
In some embodiments, steps S305-S308 are optional, and the user can complete focusing by inputting a voice instruction only once. Alternatively, the electronic device may treat steps S305-S308 as optional steps when processing the image using the coarse-granularity image segmentation model in step S302. The electronic device can determine the outline of an instance through the coarse-granularity image segmentation model, so that the instance selected by the user can be focused more accurately with only one voice instruction received, without focusing outside the instance.
In some embodiments, after the electronic device focuses on an instance or sub-instance, the location of the instance or sub-instance in the subsequent preview screen or screens may change. The electronic device may focus in a subsequent preview screen based on the location where the instance is located and display a focus frame at the location of the instance in the preview screen. That is, when the position of the instance or sub-instance changes, the focusing position of the electronic device can move along with the position of the instance or sub-instance in the preview screen, so that the user can keep focusing on a certain instance or sub-instance by the electronic device without re-issuing a voice focusing instruction, and the shooting experience of the user is improved.
The order of execution of the steps illustrated in fig. 3 is merely exemplary, and the electronic device may use more or fewer steps than in fig. 3, e.g., the electronic device may add certain steps or subtract certain steps, which the embodiments of the present application do not limit.
The architecture of a fine-grained image segmentation model provided in the embodiments of the application is described below.
As shown in fig. 7, the fine-grained image segmentation model may include an image segmentation module and an image recognition module, wherein the image segmentation module is configured to receive an image, extract image features and segment the image into a plurality of regions; the image recognition module is used for acquiring the image characteristics and the segmented image areas output by the image segmentation module, recognizing the image of each area and identifying the content of the image of the area.
In some embodiments, the image segmentation module may be a neural network model based on a convolutional neural network (convolutional neural networks, CNN) model, a recurrent neural network (recurrent neural network, RNN) model, a long short-term memory (LSTM) model, a deep neural network (deep neural network, DNN) model, a generative pre-training transformer (GPT) large model, and so on.
Alternatively, the image segmentation module may be a SAM (segment anything model) model, and the SAM model may encode an image using an image encoder (image encoder) to obtain an image embedding (image embedding), where the image embedding is a dense vector representation of the image, that is, a feature of the image. The SAM model may divide an image into a plurality of regions based on image features, each region may correspond to an instance. The SAM model may represent the segmented regions using a mask (mask).
The image segmentation module is not limited to the SAM model; other image segmentation models may also be used, which is not limited in the embodiments of the present application.
The image recognition module may include convolution layers, a multi-layer perceptron, and an activation function. Convolution layer A may convolve the mask corresponding to one image area output by the image segmentation module, and the convolution result may be added to the image features output by the image segmentation module to fuse the mask and the image features. The image recognition module inputs the addition result into convolution layer B for convolution, and then into the multi-layer perceptron. The multi-layer perceptron may classify the image area based on the fused features to determine the image content of the area. The final activation function (which may be, for example, a Softmax function) may normalize the results and output the final classification result for the image area (i.e., the label of the image area).
It should be noted that, after the image is segmented by the image segmentation module, the image may be divided into a plurality of image areas, that is, into a plurality of masks. The electronic device may sequentially input each segmented area, together with the features of the image, into the image recognition module for recognition, so as to obtain the label of each image area.
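For illustration, the image recognition module described above could be sketched as the following PyTorch head; channel counts, kernel sizes, pooling and class count are assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

class MaskRecognitionHead(nn.Module):
    def __init__(self, feat_channels=256, num_classes=100):
        super().__init__()
        self.conv_a = nn.Conv2d(1, feat_channels, kernel_size=3, padding=1)   # convolve the mask
        self.conv_b = nn.Conv2d(feat_channels, feat_channels, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.mlp = nn.Sequential(
            nn.Linear(feat_channels, feat_channels), nn.ReLU(),
            nn.Linear(feat_channels, num_classes),
        )

    def forward(self, image_features, mask):
        # image_features: (B, C, H, W) from the image segmentation module;
        # mask: (B, 1, H, W) binary mask of one segmented region, same spatial size.
        fused = image_features + self.conv_a(mask)       # fuse mask with image features
        x = self.pool(self.conv_b(fused)).flatten(1)
        logits = self.mlp(x)
        return torch.softmax(logits, dim=-1)             # per-class probability of the region
```

Each mask output by the image segmentation module would be passed through this head in turn, together with the shared image features, to obtain that region's label.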
The architecture of the fine-grained image segmentation model described above is by way of example only, and in some embodiments, the fine-grained image segmentation model may include more or fewer software modules than shown in FIG. 7, or may incorporate or add certain modules, as embodiments of the application are not limited in this regard.
The architecture of the coarse-grain image segmentation model may refer to the architecture of the fine-grain image segmentation model, and will not be described herein.
The following describes a training method for a fine-grained image segmentation model provided in the embodiments of the application.
As shown in fig. 8, the electronic device may input image A into a target detection model to detect an instance of type C (e.g., a person) therein, and the target detection model may output the bounding box in the image in which the person is located. Taking the head portrait of the man in the image as an example, the electronic device may input the bounding box in which the man is located and the image within the bounding box into an image segmentation model, where the image segmentation model may receive the image input and output the segmented image. Alternatively, the image segmentation model may be a SAM (segment anything model) model, and the SAM model may segment the image of the man into several regions, such as region 1, region 2 and region 3 shown in fig. 8, where each region may correspond to an instance. The SAM model may represent the segmented regions using masks.
Finally, the electronic device may input the segmentation result of the SAM model into a multimodal image recognition model. The multimodal image recognition model may receive an image input and output a textual description of the image content. Alternatively, the multimodal image recognition model may be a BLIP (bootstrapping language-image pre-training) model, and the electronic device may input the images of region 1, region 2 and region 3 into the BLIP model; the BLIP model may then recognize the instance in each region, and the electronic device may thereby obtain the names of the instances.
Thus, the electronic device obtains a set of training data pairs including image A, the segmentation result of image A (region 1, region 2 and region 3), and the instance name corresponding to each region. The electronic device may generate multiple sets of training data pairs from a larger number of images; the method for generating training data pairs based on other images may refer to the method for generating training data pairs based on image A, and is not repeated herein. The image segmentation model is not limited to the SAM model; other image segmentation models may also be used, which is not limited in the embodiments of the present application. Likewise, the multimodal image recognition model is not limited to the BLIP model; other multimodal image recognition models may also be used.
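A hedged sketch of this data-generation pipeline is shown below. `detect_instances`, `segment_regions` (a SAM-style segmenter) and `describe_region` (a BLIP-style captioner) are hypothetical wrappers whose interfaces are assumptions, and `image` is assumed to be an array indexable by pixel coordinates.

```python
def build_training_pairs(image, detect_instances, segment_regions, describe_region):
    pairs = []
    for box in detect_instances(image, target_type="person"):      # bounding boxes of type-C instances
        crop = image[box.top:box.bottom, box.left:box.right]       # image inside the bounding box
        masks = segment_regions(crop)                               # e.g. region 1, region 2, region 3
        labeled = [(m, describe_region(crop, m)) for m in masks]    # instance name per region
        pairs.append({"image": image, "box": box, "regions": labeled})
    return pairs   # training data pairs derived from one image
```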
Finally, the electronic device may train the fine-grained image segmentation model based on the sets of training data. Training a model here refers to adjusting parameters of the model using existing training data to improve the predictive capabilities of the model. That is, training the fine-grained image segmentation model may improve the accuracy of the fine-grained image segmentation model in segmenting instances of the image (or sub-instances of the segmented instances) and identifying the segmented image content.
In some embodiments, the training process of the fine-grained image segmentation model may be completed by a cloud server.
Alternatively, the object detection model in fig. 8 may also be a coarse-grained image segmentation model. The electronic device 100 may receive input of a preview frame through the coarse-granularity image segmentation model and then segment instances of one or more types in the preview frame picture. The electronic device 100 may input the image within the bounding box containing each instance into the SAM model through the coarse-grain image segmentation model, or the electronic device 100 may input the image of the region in which the instance is located into the SAM model.
The training method of the coarse-granularity image segmentation model can refer to the training method of the fine-granularity image segmentation model, and will not be described herein. The SAM model outputs coarse-granularity image segmentation results when the coarse-granularity image segmentation model is trained. It should be noted that coarse granularity is relative to fine granularity, coarse granularity image segmentation may be used to segment an image into multiple regions, while fine granularity image segmentation may be used to further segment the results of coarse granularity image segmentation.
The training method of the fine-grained image segmentation model shown in fig. 8 is merely an example, and the electronic device may further include more or fewer software modules than those in fig. 8 when training the fine-grained image segmentation model, or may combine some modules or add some modules, which is not limited in the embodiments of the present application.
The above embodiments are merely intended to illustrate the technical solutions of the present application, and not to limit them; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that the technical solutions described in the foregoing embodiments may still be modified, or some technical features thereof may be replaced by equivalents; and such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.
As used in the above embodiments, the term "when …" may be interpreted to mean "if …" or "after …" or "in response to determination …" or "in response to detection …" depending on the context. Similarly, the phrase "at the time of determination …" or "if detected (a stated condition or event)" may be interpreted to mean "if determined …" or "in response to determination …" or "at the time of detection (a stated condition or event)" or "in response to detection (a stated condition or event)" depending on the context.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, produces a flow or function in accordance with embodiments of the present application, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by a wired (e.g., coaxial cable, fiber optic, digital subscriber line), or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk), etc.
Those of ordinary skill in the art will appreciate that implementing all or part of the above-described method embodiments may be accomplished by a computer program to instruct related hardware, the program may be stored in a computer readable storage medium, and the program may include the above-described method embodiments when executed. And the aforementioned storage medium includes: ROM or random access memory RAM, magnetic or optical disk, etc.

Claims (13)

1. A focusing method applied to an electronic device including a camera, the method comprising:
the electronic device displays a first preview picture and labels of K1 instances in the first preview picture, wherein K1 is a positive integer, and the types of the K1 instances are partially or completely different;
the electronic device displays a second preview picture and labels of K2 instances in the second preview picture, wherein the K2 instances comprise a first instance, the first preview picture and the second preview picture are generated by the electronic device based on images acquired by the camera at different times, K2 is a positive integer, the K1 instances are partially or completely different from the K2 instances, and the types of the K2 instances are partially or completely different;
the electronic device receives a first voice instruction, wherein the first voice instruction instructs the electronic device to focus the first instance based on the label of the first instance;
in response to the first voice instruction, the electronic device focuses the first instance;
the electronic device displays labels of one or more sub-instances in the first instance, wherein the labels of the one or more sub-instances are obtained by the electronic device segmenting the first instance, and the one or more sub-instances comprise a first sub-instance;
the electronic device receives a third voice instruction, wherein the third voice instruction instructs the electronic device to focus the first sub-instance based on the label of the first sub-instance;
and in response to the third voice instruction, the electronic device focuses the first sub-instance.
2. The method of claim 1, wherein the K2 instances comprise a second instance, the method further comprising:
the electronic device receives a second voice instruction, wherein the second voice instruction is used for instructing the electronic device to focus the second instance;
in response to the second voice instruction, the electronic device changes a focus target from the first instance to the second instance.
3. The method of claim 1, wherein when the electronic device displays the labels of one or more sub-instances in the first instance, the method further comprises: the electronic device cancels displaying the labels of the K2 instances.
4. A method according to claim 3, characterized in that the method further comprises:
the electronic device displays a third preview picture, wherein the third preview picture comprises K3 instances, the K3 instances comprise the first instance, and K3 is a positive integer;
the electronic device receives a fourth voice instruction, wherein the fourth voice instruction is used for instructing the electronic device to identify instances in a preview picture;
and in response to the fourth voice instruction, the electronic device displays the labels of the K3 instances.
5. The method according to claim 1, wherein the method further comprises: in response to the first voice instruction, the electronic device displays a focus frame at a location of the first instance.
6. The method of claim 5, wherein after the electronic device displays a focus frame at the location of the first instance, the method further comprises:
the electronic device receives a fifth voice instruction, wherein the fifth voice instruction is used for instructing the focus frame to move in a first direction;
and in response to the fifth voice instruction, the electronic device moves the focus frame in the first direction by a first distance, and focuses on the position of the focus frame after the movement.
7. The method of any one of claims 1, 5, 6, wherein the K2 instances comprise a third instance, the method further comprising:
the electronic equipment receives a sixth voice instruction, wherein the sixth voice instruction is used for instructing the electronic equipment to focus a first area, and the first area is any one of the following: left, right, upper, lower, or middle inside the third instance;
and responding to the sixth voice instruction, and focusing the first area by the electronic equipment.
8. The method of claim 1, wherein the second preview screen includes a fourth instance, the fourth instance not being included in the K2 instances, the method further comprising:
the electronic device receives a seventh voice instruction, wherein the seventh voice instruction is used for instructing focusing on the fourth instance;
in response to the seventh voice instruction, the electronic device identifies the fourth instance in a fourth preview screen that the electronic device displays after the second preview screen;
the electronic device focuses the fourth instance.
9. The method of claim 8, wherein after the electronic device identifies the fourth instance in a fourth preview screen, the method further comprises:
the electronic device displays the label of the fourth example.
10. The method of any one of claims 1-6, 8, 9, wherein the labels of the K1 instances are different and the labels of the K2 instances are also different.
11. The method of claim 10, wherein before the electronic device displays a first preview screen and the labels of K1 instances in the first preview screen, the method further comprises:
the electronic device displays a voice focus button;
the electronic device detects an operation acting on the voice focus button.
12. An electronic device, the electronic device comprising: the device comprises a display screen, a camera, a memory and a processor coupled to the memory; the display screen is used for displaying an interface, the camera is used for shooting images, the memory stores a computer program, and the processor executes the computer program to enable the electronic device to realize the method as claimed in any one of claims 1 to 11.
13. A computer readable storage medium comprising computer instructions which, when run on an electronic device, cause the electronic device to perform the method of any one of claims 1 to 11.
CN202311309672.6A 2023-10-11 2023-10-11 Focusing method, electronic device and computer readable storage medium Active CN117097985B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311309672.6A CN117097985B (en) 2023-10-11 2023-10-11 Focusing method, electronic device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN117097985A CN117097985A (en) 2023-11-21
CN117097985B true CN117097985B (en) 2024-04-02

Family

ID=88773693

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311309672.6A Active CN117097985B (en) 2023-10-11 2023-10-11 Focusing method, electronic device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN117097985B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103685905A (en) * 2012-09-17 2014-03-26 联想(北京)有限公司 Photographing method and electronic equipment
CN103813098A (en) * 2012-11-12 2014-05-21 三星电子株式会社 Method and apparatus for shooting and storing multi-focused image in electronic device
CN105704389A (en) * 2016-04-12 2016-06-22 上海斐讯数据通信技术有限公司 Intelligent photo taking method and device
CN106375668A (en) * 2016-09-28 2017-02-01 上海斐讯数据通信技术有限公司 Automatic focusing device and method for smart terminal camera
WO2019090717A1 (en) * 2017-11-10 2019-05-16 深圳传音通讯有限公司 Autofocus method and device
CN111328447A (en) * 2017-11-10 2020-06-23 深圳传音通讯有限公司 Automatic focusing method and device
CN108184070A (en) * 2018-03-23 2018-06-19 维沃移动通信有限公司 A kind of image pickup method and terminal
WO2022267464A1 (en) * 2021-06-25 2022-12-29 荣耀终端有限公司 Focusing method and related device
CN114049878A (en) * 2021-11-11 2022-02-15 惠州Tcl移动通信有限公司 Automatic focusing method and system based on voice recognition and mobile terminal
CN114390201A (en) * 2022-01-12 2022-04-22 维沃移动通信有限公司 Focusing method and device thereof
CN116506724A (en) * 2023-05-19 2023-07-28 北京紫光展锐通信技术有限公司 Photographing auxiliary method, photographing auxiliary device, medium and terminal

Also Published As

Publication number Publication date
CN117097985A (en) 2023-11-21

Similar Documents

Publication Publication Date Title
CN113810587B (en) Image processing method and device
US9489564B2 (en) Method and apparatus for prioritizing image quality of a particular subject within an image
WO2022042776A1 (en) Photographing method and terminal
CN110475069B (en) Image shooting method and device
JP5054063B2 (en) Electronic camera, image processing apparatus, and image processing method
US20220417416A1 (en) Photographing method in telephoto scenario and mobile terminal
CN101262561B (en) Imaging apparatus and control method thereof
KR20150005270A (en) Method for previewing images captured by electronic device and the electronic device therefor
KR102383129B1 (en) Method for correcting image based on category and recognition rate of objects included image and electronic device for the same
CN108353129A (en) Capture apparatus and its control method
US20230224574A1 (en) Photographing method and apparatus
CN113194256B (en) Shooting method, shooting device, electronic equipment and storage medium
CN108259767B (en) Image processing method, image processing device, storage medium and electronic equipment
CN117097985B (en) Focusing method, electronic device and computer readable storage medium
WO2022206605A1 (en) Method for determining target object, and photographing method and device
US20220385814A1 (en) Method for generating plurality of content items and electronic device therefor
WO2022143311A1 (en) Photographing method and apparatus for intelligent view-finding recommendation
CN114025100A (en) Shooting method, shooting device, electronic equipment and readable storage medium
CN114445864A (en) Gesture recognition method and device and storage medium
CN115525188A (en) Shooting method and electronic equipment
CN116055867B (en) Shooting method and electronic equipment
CN117692762A (en) Shooting method and electronic equipment
CN116055861B (en) Video editing method and electronic equipment
CN117671473B (en) Underwater target detection model and method based on attention and multi-scale feature fusion
CN113676670B (en) Photographing method, electronic device, chip system and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant