CN113901785A - Marking method and electronic equipment - Google Patents

Marking method and electronic equipment

Info

Publication number
CN113901785A
CN113901785A
Authority
CN
China
Prior art keywords
image
text
determining
target
information
Prior art date
Legal status
Pending
Application number
CN202111153604.6A
Other languages
Chinese (zh)
Inventor
杨奇川
张杨
张柳新
Current Assignee
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date
Filing date
Publication date
Application filed by Lenovo Beijing Ltd
Priority to CN202111153604.6A
Publication of CN113901785A
Status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F40/279 Recognition of textual entities
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Biology (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application discloses a marking method and an electronic device. The method includes: respectively acquiring the audio and the image in a currently played first video; performing speech recognition on the audio to generate corresponding first text information; determining a physical object and/or a text object in the image; matching the first text information with the content of the physical object and/or the text object, and determining a first target in the image at least based on a matching result; and marking the first target. The marking method can automatically mark, in the currently played first video, the content currently being presented, so that the user can accurately know at any time what the first video expresses at the current moment, saving the time needed to understand the content of the first video.

Description

Marking method and electronic equipment
Technical Field
The present disclosure relates to the field of image and audio processing, and more particularly, to a marking method and an electronic device.
Background
In online interaction, multiple parties communicate with each other through video and audio over a network. During such interaction, however, one party often cannot tell exactly what the other party is referring to. For example, while a first party lectures on a document, a second party may be unable to locate in time, within the document shown in the interaction video, the position corresponding to the current speech; this can happen repeatedly, impairing the accuracy of the interaction and reducing its efficiency. The usual remedy is to ask the current speaker, through manual dialogue, which specific position in the document is meant, which is time-consuming and labor-intensive.
Disclosure of Invention
An object of an embodiment of the present application is to provide a marking method, including:
respectively acquiring the audio and the image in a currently played first video;
performing speech recognition on the audio to generate corresponding first text information;
determining a physical object and/or a text object in the image;
matching the first text information with the content of the physical object and/or the text object, and determining a first target in the image at least based on a matching result;
marking the first target.
Optionally, the determining a physical object and/or a text object in the image includes:
performing an image semantic segmentation operation on the image, and determining at least one physical object in the image;
identifying the physical object to form a corresponding physical object identifier; correspondingly,
the matching the first text information with the content of the physical object and/or the text object, and determining a first target in the image at least based on a matching result includes:
comparing the first text information with the physical object identifiers;
and determining, among one or more physical object identifiers, the target indicated by the first text information as the first target.
Optionally, the determining a physical object and/or a text object in the image includes:
performing an image semantic segmentation operation on the image, and determining at least one text object in the image;
recognizing the characters in the text object to form a corresponding text block; correspondingly,
the matching the first text information with the content of the physical object and/or the text object, and determining a first target in the image at least based on a matching result includes:
comparing the first text information with the text blocks;
determining, in one or more of the text blocks, the target indicated by the first text information as the first target.
Optionally, the performing an image semantic segmentation operation on the image includes:
determining the category of each pixel based on how the pixel is displayed in the image;
and determining the physical object and/or the text object in the image based on the classification of the pixel categories.
Optionally, in a case that the physical object includes a person, the method further includes:
acquiring limb information of the person, wherein the limb information includes guiding action information;
determining a guided object to which the guiding action information points in the first video; correspondingly,
the determining a first target in the image at least based on the matching result includes:
determining the first target based on a matching result of the matching operation and the determined guided object.
Optionally, the guiding action information includes gaze guidance information and posture guidance information of the person, and the determining of the guided object to which the guiding action information points in the first video includes:
determining the corresponding guided object based on the gaze guidance information and the posture guidance information.
Optionally, the performing speech recognition on the audio to generate corresponding first text information includes:
determining a speech recognition mode corresponding to the audio based on the frequency characteristics of the audio;
performing speech recognition on the audio based on the determined speech recognition mode.
Optionally, the method further comprises:
outputting the marked first target to a client for display.
Optionally, the marking the first target includes:
delineating the first target with a marking line so that the first target is highlighted in a first file associated with the first video.
An embodiment of the present application further provides an electronic device, including:
an acquisition module configured to respectively acquire the audio and the image in a currently played first video;
a recognition module configured to perform speech recognition on the audio to generate corresponding first text information, and to determine a physical object and/or a text object in the image;
a processing module configured to perform a matching operation between the first text information and the content of the physical object and/or the text object, to determine a first target in the image at least based on a matching result, and to mark the first target.
The marking method can automatically mark, in the currently played first video, the content currently being presented, so that the user can accurately know at any time what the first video expresses at the current moment, saving the time needed to understand the content of the first video.
Drawings
FIG. 1 is a flow chart of a marking method according to an embodiment of the present application;
FIG. 2 is a flowchart of a first embodiment of step S300 of FIG. 1 according to an embodiment of the present application;
FIG. 3 is a flowchart of a first embodiment of step S400 of FIG. 1 according to an embodiment of the present application;
FIG. 4 is a flowchart of a second embodiment of step S300 of FIG. 1 according to an embodiment of the present application;
FIG. 5 is a flowchart of a second embodiment of step S400 of FIG. 1 according to an embodiment of the present application;
FIG. 6 is a flow chart of one embodiment of a marking method of an embodiment of the present application;
FIG. 7 is a flowchart of one embodiment of step S200 in FIG. 1 according to an embodiment of the present application;
FIG. 8 is a flow chart of another embodiment of a marking method of an embodiment of the present application;
fig. 9 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Various aspects and features of the present application are described herein with reference to the drawings.
It will be understood that various modifications may be made to the embodiments of the present application. Accordingly, the foregoing description should not be construed as limiting, but merely as exemplifications of embodiments. Those skilled in the art will envision other modifications within the scope and spirit of the application.
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the application and, together with a general description of the application given above and the detailed description of the embodiments given below, serve to explain the principles of the application.
These and other characteristics of the present application will become apparent from the following description of preferred forms of embodiment, given as non-limiting examples, with reference to the attached drawings.
It is also to be understood that although the present application has been described with reference to some specific examples, those skilled in the art are able to ascertain many other equivalents to the practice of the present application.
The above and other aspects, features and advantages of the present application will become more apparent in view of the following detailed description when taken in conjunction with the accompanying drawings.
Specific embodiments of the present application are described hereinafter with reference to the accompanying drawings; however, it is to be understood that the disclosed embodiments are merely examples of the application, which can be embodied in various forms. Well-known and/or repeated functions and constructions are not described in detail, to avoid obscuring the application with unnecessary detail. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a basis for the claims and as a representative basis for teaching one skilled in the art to variously employ the present application in virtually any appropriately detailed structure.
The specification may use the phrases "in one embodiment," "in another embodiment," "in yet another embodiment," or "in other embodiments," which may each refer to one or more of the same or different embodiments in accordance with the application.
The marking method of the present application can be applied to a server, in particular a server for online real-time interaction, for example in conference systems or online teaching. The marking method marks the current focus in a first video, so that the client can accurately see, through the mark, the current focus in the first video (for example, the content a teacher is currently explaining). The audio and the image in the currently played first video are acquired respectively. The first video may be the video currently played by the server; the client can watch the first video and also view a first file associated with it, where the first file carries the specific content to be expressed in the first video. For example, the first video is a teaching video and the first file is the courseware the teacher explains in it. After the first video is obtained, its audio and image can be acquired separately. Speech recognition is performed on the audio to generate corresponding first text information, for example via ASR. For the image, the physical objects and/or text objects it contains need to be determined: an image contains different kinds of things, including text objects and non-text physical objects such as people or other items. Once the physical object and/or text object in the image is determined, a matching operation is performed between the first text information and the content of the physical object and/or the text object: the first text information is compared with the content of the physical object, with the content of the text object, or with both. A concrete way to compare is to search for the content of the first text information within the content of the physical object and/or the text object, and then determine a first target in the image at least based on the matching result, the first target being the content related to the first text information. After the first target is determined, it is marked, so that the marked first target can be sent to the client; when a user views the first file containing the first target through the client, the user can promptly and accurately learn what the first video is expressing at the current moment, for example the exact position in the first file of the content (the first target) the teacher is currently explaining, which improves the teaching effect. Naturally, corresponding effects can be achieved in other application fields.
To describe the marking method in more detail, it is explained below with reference to the accompanying drawings. FIG. 1 is a flow chart of the marking method according to an embodiment of the present application; as shown in FIG. 1, and with reference to FIG. 7, the marking method includes the following steps:
S100, respectively acquiring the audio and the image in the currently played first video.
The currently played first video may be a video played by the server, such as a teaching video currently played by a server in online teaching or a conference video currently played by a server in online conferencing; it may equally be whatever video is currently played in other fields. The server can process the first video and acquire its audio and image separately, so that each can then be processed on its own.
S200, performing speech recognition on the audio to generate corresponding first text information.
The first text information may be information, represented in text form, related to the currently played audio content.
The audio can be recognized in any of several ways, for example through Automatic Speech Recognition (ASR) or through semantic recognition; the generated first text information is the text of what the current audio expresses, for example information generated from what the teacher is currently explaining in a teaching video. The first text information carries the specific content as data and is therefore convenient to use further.
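As a minimal sketch only (the application does not mandate any particular ASR engine or API), step S200 could look like the following Python fragment, which assumes the third-party SpeechRecognition package and an audio track already extracted to a WAV file:

    import speech_recognition as sr  # third-party "SpeechRecognition" package

    def audio_to_first_text(wav_path: str) -> str:
        """Transcribe the audio of the currently played first video into first text information."""
        recognizer = sr.Recognizer()
        with sr.AudioFile(wav_path) as source:
            audio = recognizer.record(source)  # read the whole audio track
        # Any ASR backend would do; Google's free web recognizer is only an example.
        return recognizer.recognize_google(audio, language="zh-CN")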
S300, determining a physical object and/or a text object in the image.
The different types of objects (things) in the image include text objects and non-text physical objects; the physical objects may include people and other non-text items.
When determining the physical object and/or the text object, a specific distinction can be made according to the form and content of the different objects in the image.
In one embodiment, a database containing relevant information about various objects is built in advance, storing the relevant characteristics of each kind of object: for physical objects, characteristics such as shape, color and size; for text objects, characteristics such as form and content. When determining a physical object, the objects in the image are compared with the physical-object data stored in the database; likewise, a text object is determined by comparing the objects in the image with the stored text-object data.
In another embodiment, the physical objects and/or text objects in the image can be distinguished from the display of the pixels, starting from the specific display content of the pixels that render the image. For example, an image block formed by a set of pixels whose display is the same or similar and that are associated with text may be regarded as a text object.
S400, matching the first text information with the content of the physical object and/or the text object, and determining a first target in the image at least based on the matching result.
Specifically, the matching operation may consist of matching the first text information with the content of the physical object, with the content of the text object, or with both, after which the first target is determined; the first target may be the object currently referred to by the first text information.
For example, the content of the first text information is compared with the content expressed by the text object to determine whether the first text information appears in the content of the text object and, if so, at which position. The content in the text object that is the same as or related to the first text information may be regarded as the first target. In online teaching, for instance, the text object is the courseware and the first text information is the content the teacher is currently explaining in the teaching video; comparing the two determines the first target in the courseware corresponding to the first text information, that is, the teacher's current teaching content.
Naturally, the content of the first text information may also be compared with the physical object, or with the content of both the physical object and the text object, to determine the first target in the physical object and/or the text object.
S500, marking the first target.
Specifically, marking the first target may highlight it, for example in a first file sent to the client for display, thereby distinguishing it from other content appearing at the same time, e.g. by filling or outlining the first target in a distinctive color.
When a user views, through a client, the first file containing the first target, the user can promptly and accurately learn what the first video is expressing at the current moment, and will clearly notice the first target.
Still taking online teaching as an example: the first file is the courseware displayed in the client, and the first target is the content in the first file that the teacher is currently explaining. When students view the first file, they can clearly see the marked first target, which makes attending the lesson easier and improves the learning effect.
The marking method can thus automatically mark, in the currently played first video, the content currently being presented, so that the user can accurately know at any time what the first video expresses at the current moment, saving the time needed to understand the content of the first video.
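Putting S100 through S500 together, a server-side loop might be organized as in the sketch below; the five injected callables stand for the components described above and are not interfaces defined by this application:

    def marking_loop(first_video, first_file, client,
                     demux, speech_to_text, determine_objects, match_first_target, mark):
        """End-to-end sketch of steps S100-S500 for one playback interval.
        The callables are the components described above, supplied by the caller."""
        audio, image = demux(first_video)                                     # S100
        first_text = speech_to_text(audio)                                    # S200, e.g. via ASR
        objects, text_blocks = determine_objects(image)                       # S300
        first_target = match_first_target(first_text, objects, text_blocks)   # S400
        if first_target is not None:
            client.display(mark(first_file, first_target))                    # S500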
In an embodiment of the present application, the determining a physical object and/or a text object in the image, as shown in FIG. 2, includes:
S310, performing an image semantic segmentation operation on the image, and determining at least one physical object in the image.
The image semantic segmentation operation separates the different physical objects represented in the image according to the semantics of the displayed objects; a physical object may be any non-text item displayed in the image, such as a person or another non-text thing.
For example, when several people and several articles appear in an image, each person and each article can be determined by the semantic segmentation operation, and all of them are separated out individually.
Taking online teaching as an example, the image shows a teacher lecturing at a podium; after semantic segmentation of the image, the physical objects in it are determined to include the teacher, a blackboard, a desk, several different books, and so on.
S320, identifying the physical object to form a corresponding physical object identifier.
Specifically, the physical object may be identified by its specific type and/or name; the resulting identifier refers to the physical object, so if the object is labeled by its name, the identifier is that name.
Continuing the online teaching example, when the physical objects in the image are identified, the teacher may be labeled "teacher", the blackboard "blackboard", and the books "book 1", "book 2", "book 3", and so on; the corresponding physical object identifiers are precisely these labels.
Correspondingly, the matching of the first text information with the content of the physical object and/or the text object, and the determining of the first target in the image at least based on the matching result, as shown in FIG. 3, include:
S410, comparing the first text information with the physical object identifiers;
S420, determining, among the one or more physical object identifiers, the target indicated by the first text information as the first target.
Specifically, the content of the first text information is compared with all the physical object identifiers to determine which identifier the first text information indicates. For example, if the content of the first text information is "book 1", the comparison determines that the target indicated by the first text information is book 1, and book 1 is taken as the first target.
Naturally, if the physical objects in the image change, the identifiers change with them. For example, when the teacher walks down from the podium, the content of the image changes: the physical objects, previously the teacher, the blackboard, book 1, book 2 and book 3, become the teacher, a window, a teaching model and an experimental apparatus, labeled accordingly as "teacher", "window", "teaching model" and "experimental apparatus". The first text information is then compared with the changed identifiers and the first target it indicates is re-determined; if the content the teacher is currently explaining (the first text information) is the experimental apparatus, the experimental apparatus is the first target.
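A naive illustration of S410/S420 follows; all names are illustrative, and a real system would likely need fuzzy or semantic matching rather than this plain substring test:

    def match_identifier(first_text: str, identifiers: list[str]) -> str | None:
        """Return the physical object identifier that the recognized speech refers to."""
        for ident in identifiers:
            if ident in first_text:  # plain substring test; fuzzy matching would be more robust
                return ident
        return None

For instance, match_identifier("now please open book 1", ["teacher", "blackboard", "book 1"]) returns "book 1", which then becomes the first target.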
In an embodiment of the present application, the determining a physical object and/or a text object in the image, as shown in FIG. 4, includes:
S330, performing an image semantic segmentation operation on the image, and determining at least one text object in the image.
Here too, the semantic segmentation operation separates out at least one text object represented in the image according to the semantics of the displayed objects; a text object is one or more text-bearing items displayed in the image.
For example, a text object may be the text content of a book or the text displayed in teaching courseware; once such text-containing content appears in the image, the semantic segmentation operation is performed on the image to determine each text object.
Taking online teaching as an example again, the image shows courseware (containing text) and an open textbook (containing text); after the semantic segmentation operation, the courseware and the textbook can be determined as distinct text objects.
S340, recognizing the characters in the text object to form a corresponding text block.
Specifically, recognizing the characters means recognizing text that is originally displayed as part of an image; the resulting text block may contain the specific text content of the text object.
The characters in a text object can be recognized in many different ways, for example by form recognition of graphic symbols or by OCR (Optical Character Recognition). OCR is the process by which an electronic device examines characters, translates their shapes into computer text using a character recognition method, that is, scans the text object and then analyzes and processes it to obtain the characters and the layout information; the characters on the text object are thereby recognized, extracted, and assembled into a text block.
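For example, S340 could be sketched with the pytesseract wrapper around the Tesseract OCR engine (one possible character recognition method among the several the application allows); the "chi_sim" language pack is an assumption for Simplified Chinese courseware:

    from PIL import Image
    import pytesseract  # Python wrapper around the Tesseract OCR engine

    def text_object_to_block(region_path: str) -> str:
        """OCR one segmented text object (e.g. a courseware region) into a text block."""
        region = Image.open(region_path)
        return pytesseract.image_to_string(region, lang="chi_sim")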
Correspondingly, the matching of the first text information with the content of the physical object and/or the text object, and the determining of the first target in the image at least based on the matching result, as shown in FIG. 5, include:
S430, comparing the first text information with the text blocks;
S440, determining, in one or more of the text blocks, the target indicated by the first text information as the first target.
Specifically, the content of the first text information is compared with all the text blocks, and the specific text block, or the part of a text block, referred to by the first text information can be determined. For example, if the content of the first text information is "learning programming", the comparison determines that the target it indicates is the content of the first text block related to learning programming, and that content is determined as the first target.
The content of the first text information also changes over time; for example, as the teacher's lecture progresses, the first text information changes with it. When it is compared with the text block again, the current content may have moved from the earlier part of the block, where it previously appeared, to a later part; the comparison then determines the later part of the block as the content of the current first text information and hence as the first target. The first target thus always stays consistent with the content of the first text information.
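One way to sketch S430/S440 with nothing but the standard library is a longest-common-substring search via difflib; an implementation might equally use edit distance or text embeddings:

    import difflib

    def locate_in_block(first_text: str, block: str) -> tuple[int, int] | None:
        """Find the character span of the text block that best matches the recognized speech."""
        matcher = difflib.SequenceMatcher(a=first_text, b=block)
        m = matcher.find_longest_match(0, len(first_text), 0, len(block))
        if m.size == 0:
            return None
        return (m.b, m.b + m.size)  # offsets of the first target inside the block

As the speech moves on, re-running this on the latest first text information naturally moves the returned span from the earlier part of the block to the later part, matching the behavior described above.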
In an embodiment of the present application, the performing of the image semantic segmentation operation on the image includes the following steps:
determining the category of each pixel based on how the pixel is displayed in the image;
and determining the physical object and/or the text object in the image based on the classification of the pixel categories.
Specifically, semantic segmentation divides the image into blocks that each carry some semantic information, identifies the semantic category of each block, and labels every pixel with the label of its category. This realizes a semantic inference process from the low level to the high level and finally yields a segmented image in which every pixel carries semantic annotation, including the determined physical objects and/or text objects.
For example, when the teacher and the blackboard in an image are separated, the displayed colors of the corresponding pixels and the relationships between pixels differ, so the boundaries of the corresponding pixel sets can be drawn, forming physical objects of different categories, namely the teacher and the blackboard.
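As an illustration of per-pixel classification (the application does not prescribe a model), a pretrained DeepLabV3 from torchvision can stand in for whatever segmentation network an implementation uses; the weight selection string is an assumption about the installed torchvision version:

    import torch
    from torchvision.models.segmentation import deeplabv3_resnet50

    def pixel_categories(frame: torch.Tensor) -> torch.Tensor:
        """Classify every pixel of one normalized 3xHxW video frame."""
        model = deeplabv3_resnet50(weights="DEFAULT").eval()
        with torch.no_grad():
            logits = model(frame.unsqueeze(0))["out"]  # 1 x C x H x W class scores
        return logits.argmax(dim=1)[0]                 # H x W map of pixel categories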
In an embodiment of the application, in the case where the physical object includes a person, as shown in FIG. 6 in combination with FIG. 7, the method further includes the following steps:
S600, acquiring the limb information of the person, the limb information including guiding action information.
Physical objects include people as well as other non-text items. If the physical object includes a person, the image of the person can be analyzed to obtain the limb information of the one or more persons; the limb information is the information expressed by the person's limbs and includes guiding action information that has a guiding effect.
For example, in online teaching, the guiding action information may be a hand gesture by which the teacher in the image points at part of the content in the courseware, or at a teaching model shown in the image.
The guiding action information may also be the teacher's gaze: the eyes looking at the teaching model, for example, likewise have the effect of guiding attention to it.
S700, determining the guided object to which the guiding action information points in the first video.
When determining the guided object, the specific type of the guiding action information may first be identified, such as gaze guidance, gesture guidance or pointer guidance; the guided object in the first video is then determined from the specific motion, form and other characteristics of the guiding action. For gesture guidance, for example, the pointing direction of the finger is analyzed, and the object pointed at is determined to be the guided object.
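A geometric sketch of the gesture case follows: given 2-D keypoints from any pose estimator (the keypoint source and all names are assumptions), the object whose center best aligns with the wrist-to-fingertip ray is taken as the guided object:

    import numpy as np

    def guided_object(wrist: np.ndarray, fingertip: np.ndarray,
                      object_centers: dict[str, np.ndarray]) -> str | None:
        """Pick the object whose center best aligns with the finger's pointing direction."""
        ray = fingertip - wrist
        ray = ray / np.linalg.norm(ray)
        best_name, best_cos = None, -1.0
        for name, center in object_centers.items():
            to_obj = center - fingertip
            cos = float(np.dot(ray, to_obj / np.linalg.norm(to_obj)))
            if cos > best_cos:  # cosine similarity with the pointing ray
                best_name, best_cos = name, cos
        return best_name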
Correspondingly, the determining of the first target in the image at least based on the matching result includes: determining the first target based on the matching result of the matching operation and the determined guided object.
Specifically, the result of matching the first text information against the content of the physical object and/or the text object may be combined with the object guided by the guiding action information, so that the first target in the image, that is, the content related to the first text information in the currently played audio, can be determined more accurately.
In one embodiment, if the target given by the matching result differs from the target given by the guided object, the first target may be re-determined according to the weights assigned to the matching operation and to the guiding action information; for example, when the weight of the matching operation is larger, the target given by the matching result may be taken as the first target.
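The weighting rule itself is left open by the application; the sketch below shows one trivially simple reading of it, with purely illustrative weights:

    def fuse(match_target: str | None, guided_target: str | None,
             w_match: float = 0.6, w_guide: float = 0.4) -> str | None:
        """Resolve the first target from the matching result and the guided object."""
        if match_target == guided_target or guided_target is None:
            return match_target
        if match_target is None:
            return guided_target
        # When the two disagree, the side with the larger weight wins.
        return match_target if w_match >= w_guide else guided_target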
Optionally, the guiding action information includes gaze guidance information and posture guidance information of the person, and the determining of the guided object includes: determining the corresponding guided object based on the gaze guidance information and the posture guidance information.
For example, the gaze guidance information is the direction of the teacher's eyes during the lecture, and the posture guidance information is the teacher's gestures and body posture during the lecture. Determining the guided object from both together makes the determination accurate: if the gaze points at book 1 and the gesture within the posture guidance information points at the first part of the content in book 1, that first part of the content can be determined as the guided object.
In an embodiment of the present application, the performing of speech recognition on the audio to generate the corresponding first text information, as shown in FIG. 8, includes:
S210, determining a speech recognition mode corresponding to the audio based on the frequency characteristics of the audio;
S220, performing speech recognition on the audio based on the determined speech recognition mode.
Specifically, the audio may be produced by several different speakers, and the frequency characteristics of the audio differ from speaker to speaker. For example, a teaching video contains a teacher and several students, and the frequency characteristics of the teacher's voice differ from those of the students' voices, so all of the teacher's audio can be selected on the basis of the frequency characteristics. Different speakers correspond to different speech recognition modes; for the teacher, a recognition mode can be associated in advance with the frequency characteristics of the teacher's voice, so that all audio produced by the teacher is recognized with the associated mode and the first text information corresponding to the teacher's lecturing audio is obtained accurately.
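A rough FFT-based reading of the "frequency characteristics" follows; the 180 Hz threshold and the mode names are assumptions for illustration, not values given by the application:

    import numpy as np

    def dominant_frequency(samples: np.ndarray, sample_rate: int) -> float:
        """Estimate the dominant frequency of an audio chunk."""
        spectrum = np.abs(np.fft.rfft(samples))
        freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
        return float(freqs[np.argmax(spectrum)])

    def pick_recognition_mode(samples: np.ndarray, sample_rate: int) -> str:
        # Route audio whose dominant frequency falls in the range pre-associated
        # with the teacher's voice to the teacher's recognition mode.
        return "teacher_asr" if dominant_frequency(samples, sample_rate) < 180.0 else "generic_asr"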
In an embodiment of the present application, the method further includes the step of outputting the marked first target to a client for display.
Specifically, the server is connected to the client; the server plays the first video, and the client can simultaneously play the first video and/or display the first file associated with it. The first target may be displayed in the first video and/or in the first file.
For example, the server plays a first video of a teacher's lecture, in which the teacher's courseware, such as slides, is presented, and the client is used by students. The first file is that courseware; in this embodiment, when the client displays the courseware, the marked first target it contains is clearly noticed by the students, who can thus quickly determine what the teacher is currently explaining, improving the teaching quality.
In an embodiment of the present application, the marking of the first target includes the step of delineating the first target with a marking line so that the first target is highlighted in the first file associated with the first video.
For example, the first target may be outlined with a specific marking line, such as a bold line in a distinctive color, to make it stand out; since the first target is conspicuous within the first file, the user notices it immediately when viewing the file. Note that the first file is associated with the first video: if the first video is a teaching video, the first file is the courseware associated with it, and a student viewing the courseware through the client can easily see the marked first target, that is, the content the teacher is currently explaining.
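Drawing the marking line itself is straightforward; with OpenCV, for instance (the bounding box, color and thickness are all arbitrary illustrative choices):

    import cv2

    def draw_mark(page, box, color=(0, 0, 255), thickness=3):
        """Outline the first target on a rendered page of the first file with a bold line."""
        x, y, w, h = box  # target region in page pixels
        cv2.rectangle(page, (x, y), (x + w, y + h), color, thickness)
        return page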
An embodiment of the present application further provides an electronic device, as shown in FIG. 9 and in combination with FIG. 7, including:
The acquisition module is configured to respectively acquire the audio and the image in the currently played first video.
The currently played first video may be a video played by the server, such as a teaching video in online teaching or a conference video in online conferencing, or whatever video is currently played in other fields. The acquisition module can process the first video and acquire its audio and image separately, so that the electronic device can process each on its own.
The recognition module is configured to perform speech recognition on the audio to generate corresponding first text information, and to determine a physical object and/or a text object in the image.
The first text information may be information, represented in text form, related to the currently played audio content.
The recognition module may recognize the audio in several ways, for example through Automatic Speech Recognition (ASR) or through semantic recognition; the generated first text information is the text of what the current audio expresses, for example information generated from what the teacher is currently explaining in a teaching video. The first text information carries the specific content as data and is therefore convenient to use further.
The processing module is configured to perform a matching operation between the first text information and the content of the physical object and/or the text object, to determine a first target in the image at least based on the matching result, and to mark the first target.
Specifically, the different types of objects (things) in the image include text objects and non-text physical objects; the physical objects may include people and other non-text items.
When determining the physical object and/or the text object, the processing module can make a specific distinction according to the form and content of the different objects in the image.
In one embodiment, a database containing relevant information about various objects is built in advance, storing the relevant characteristics of each kind of object: for physical objects, characteristics such as shape, color and size; for text objects, characteristics such as form and content. The processing module can compare the objects in the image with the stored physical-object data to determine the physical objects, and likewise with the stored text-object data to determine the text objects.
In another embodiment, the processing module can distinguish the physical objects and/or text objects in the image from the display of the pixels, starting from the specific display content of the pixels that render the image. For example, an image block formed by a set of pixels whose display is the same or similar and that are associated with text may be regarded as a text object.
The matching operation performed by the processing module may consist of matching the first text information with the content of the physical object, with the content of the text object, or with both, after which the first target is determined; the first target may be the object currently referred to by the first text information.
For example, the processing module compares the content of the first text information with the content expressed by the text object, determines whether the first text information appears in the content of the text object and, if so, at which position. The content in the text object that is the same as or related to the first text information may be regarded as the first target. In online teaching, for instance, the text object is the courseware and the first text information is the content the teacher is currently explaining in the teaching video; after comparing the two, the processing module can determine the first target in the courseware corresponding to the first text information, that is, the teacher's current teaching content.
Naturally, the processing module may also compare the content of the first text information with the physical object, or with the content of both the physical object and the text object, to determine the first target in the physical object and/or the text object.
As for marking, the processing module may highlight the first target, for example in the first file sent to the client for display, thereby distinguishing it from other content appearing at the same time, e.g. by filling or outlining the first target in a distinctive color.
When a user views, through a client, the first file containing the first target, the user can promptly and accurately learn what the first video is expressing at the current moment, and will clearly notice the first target.
Still taking online teaching as an example: the first file is the courseware displayed in the client, and the first target is the content in the first file that the teacher is currently explaining. When students view the first file, they can clearly see the marked first target, which makes attending the lesson easier and improves the learning effect.
In an embodiment of the present application, the recognition module is further configured to:
perform an image semantic segmentation operation on the image, and determine at least one physical object in the image;
identify the physical object to form a corresponding physical object identifier; correspondingly,
the processing module is further configured to:
compare the first text information with the physical object identifiers;
and determine, among one or more physical object identifiers, the target indicated by the first text information as the first target.
In an embodiment of the present application, the recognition module is further configured to:
perform an image semantic segmentation operation on the image, and determine at least one text object in the image;
recognize the characters in the text object to form a corresponding text block; correspondingly,
the processing module is further configured to:
compare the first text information with the text blocks;
determine, in one or more of the text blocks, the target indicated by the first text information as the first target.
In an embodiment of the present application, the recognition module is further configured to:
determine the category of each pixel based on how the pixel is displayed in the image;
and determine the physical object and/or the text object in the image based on the classification of the pixel categories.
In an embodiment of the application, in a case that the physical object includes a person, the acquisition module is further configured to:
acquire limb information of the person, wherein the limb information includes guiding action information;
correspondingly, the recognition module is further configured to:
determine the guided object to which the guiding action information points in the first video; correspondingly,
the processing module is further configured to:
determine the first target based on a matching result of the matching operation and the determined guided object.
In an embodiment of the application, the guiding action information includes gaze guidance information and posture guidance information of the person, and the recognition module is further configured to:
determine the corresponding guided object based on the gaze guidance information and the posture guidance information.
In an embodiment of the present application, the recognition module is further configured to:
determine a speech recognition mode corresponding to the audio based on the frequency characteristics of the audio;
perform speech recognition on the audio based on the determined speech recognition mode.
In an embodiment of the present application, the processing module is further configured to:
output the marked first target to a client for display.
In an embodiment of the present application, the processing module is further configured to:
delineate the first target with a marking line so that the first target is highlighted in the first file associated with the first video.
The above embodiments are only exemplary embodiments of the present application, and are not intended to limit the present application, and the protection scope of the present application is defined by the claims. Various modifications and equivalents may be made by those skilled in the art within the spirit and scope of the present application and such modifications and equivalents should also be considered to be within the scope of the present application.

Claims (10)

1. A marking method, comprising:
respectively acquiring the audio and the image in a currently played first video;
performing speech recognition on the audio to generate corresponding first text information;
determining a physical object and/or a text object in the image;
matching the first text information with the content of the physical object and/or the text object, and determining a first target in the image at least based on a matching result;
marking the first target.
2. The method of claim 1, wherein the determining a physical object and/or a text object in the image comprises:
performing an image semantic segmentation operation on the image, and determining at least one physical object in the image;
identifying the physical object to form a corresponding physical object identifier; correspondingly,
the matching the first text information with the content of the physical object and/or the text object, and determining a first target in the image at least based on a matching result comprises:
comparing the first text information with the physical object identifiers;
and determining, among one or more physical object identifiers, the target indicated by the first text information as the first target.
3. The method of claim 2, wherein the determining a physical object and/or a text object in the image comprises:
performing an image semantic segmentation operation on the image, and determining at least one text object in the image;
recognizing characters in the text object to form a corresponding text block; correspondingly,
the matching the first text information with the content of the physical object and/or the text object, and determining a first target in the image at least based on a matching result comprises:
comparing the first text information with the text blocks;
determining, in one or more of the text blocks, the target indicated by the first text information as the first target.
4. The method of claim 2 or 3, wherein the performing an image semantic segmentation operation on the image comprises:
determining the category of each pixel based on how the pixel is displayed in the image;
and determining the physical object and/or the text object in the image based on the classification of the pixel categories.
5. The method of claim 1, wherein, in a case that the physical object comprises a person, the method further comprises:
acquiring limb information of the person, wherein the limb information comprises guiding action information;
determining a guided object to which the guiding action information points in the first video; correspondingly,
the determining a first target in the image at least based on the matching result comprises:
determining the first target based on a matching result of the matching operation and the determined guided object.
6. The method of claim 5, wherein the guiding action information comprises gaze guidance information and posture guidance information of the person, and the determining of the guided object to which the guiding action information points in the first video comprises:
determining the corresponding guided object based on the gaze guidance information and the posture guidance information.
7. The method of claim 1, wherein the performing speech recognition on the audio to generate corresponding first text information comprises:
determining a speech recognition mode corresponding to the audio based on the frequency characteristics of the audio;
performing speech recognition on the audio based on the determined speech recognition mode.
8. The method of claim 1, further comprising:
outputting the marked first target to a client for display.
9. The method of claim 1, wherein the marking the first target comprises:
delineating the first target with a marking line so that the first target is highlighted in a first file associated with the first video.
10. An electronic device, comprising:
an acquisition module configured to respectively acquire the audio and the image in a currently played first video;
a recognition module configured to perform speech recognition on the audio to generate corresponding first text information, and to determine a physical object and/or a text object in the image;
a processing module configured to perform a matching operation between the first text information and the content of the physical object and/or the text object, to determine a first target in the image at least based on a matching result, and to mark the first target.
CN202111153604.6A 2021-09-29 2021-09-29 Marking method and electronic equipment Pending CN113901785A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111153604.6A CN113901785A (en) 2021-09-29 2021-09-29 Marking method and electronic equipment


Publications (1)

Publication Number Publication Date
CN113901785A 2022-01-07

Family

ID=79189412

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111153604.6A Pending CN113901785A (en) 2021-09-29 2021-09-29 Marking method and electronic equipment

Country Status (1)

Country Link
CN (1) CN113901785A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111131902A (en) * 2019-12-13 2020-05-08 华为技术有限公司 Method for determining target object information and video playing equipment
CN111193955A (en) * 2020-01-09 2020-05-22 安徽文香信息技术有限公司 Data playback method, device, equipment and storage medium
CN112261424A (en) * 2020-10-19 2021-01-22 北京字节跳动网络技术有限公司 Image processing method, image processing device, electronic equipment and computer readable storage medium
CN112309389A (en) * 2020-03-02 2021-02-02 北京字节跳动网络技术有限公司 Information interaction method and device
CN112328088A (en) * 2020-11-23 2021-02-05 北京百度网讯科技有限公司 Image presenting method and device
CN112653902A (en) * 2019-10-10 2021-04-13 阿里巴巴集团控股有限公司 Speaker recognition method and device and electronic equipment
CN113297891A (en) * 2020-11-13 2021-08-24 阿里巴巴集团控股有限公司 Video information processing method and device and electronic equipment



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination