CN115240681A - Method and device for generating conference summary - Google Patents

Method and device for generating conference summary

Info

Publication number
CN115240681A
Authority
CN
China
Prior art keywords
content
time
image
conference
text content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110436295.7A
Other languages
Chinese (zh)
Inventor
王卓
贾志华
罗婧华
王其哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202110436295.7A
Publication of CN115240681A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 12/00 Data switching networks
    • H04L 12/02 Details
    • H04L 12/16 Arrangements for providing special services to substations
    • H04L 12/18 Arrangements for providing special services to substations for broadcast or conference, e.g. multicast
    • H04L 12/1813 Arrangements for providing special services to substations for broadcast or conference, e.g. multicast for computer conferences, e.g. chat rooms
    • H04L 12/1831 Tracking arrangements for later retrieval, e.g. recording contents, participants activities or behavior, network status

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The application relates to a method and a device for generating a conference summary. The method comprises: acquiring at least one piece of text content related to the conference and a text content time corresponding to each piece of text content; acquiring at least one image content related to the conference and an image content time corresponding to each image content; and generating a summary of the conference based on the at least one piece of text content and the text content time, and the at least one image content and the image content time. With the method and the device, the text content and the image content acquired during the meeting can be typeset using their time information, so that the meeting summary is generated automatically.

Description

Method and device for generating conference summary
Technical Field
The present application relates to the field of conferences, and in particular, to a method and an apparatus for generating a conference summary.
Background
Today, people have become more and more accustomed to using smart interactive tablets during meetings. For example, during a meeting, a participant can write down content discussed in the meeting (e.g., keywords or diagrams) on the electronic whiteboard interface of the smart interactive tablet and then develop the discussion around that content.
In the related art, a participant may perform a screen capture operation on an interface (e.g., the electronic whiteboard interface mentioned above) displayed on a smart interactive tablet used in a conference, and then take the captured screenshot image as the conference summary. Since the pieces of content in the screenshot image may be scattered across different areas of the image, it is difficult to learn the basic course of the meeting from the screenshot alone, for example, the order in which the pieces of content were discussed.
Disclosure of Invention
In view of the above, a method for generating a conference summary and a device thereof are provided.
In a first aspect, an embodiment of the present application provides a method for generating a conference summary, including: acquiring at least one section of text content related to the conference and text content time corresponding to each section of text content; acquiring at least one image content related to the conference and an image content time corresponding to each image content; generating a summary of the meeting based on the at least one piece of text content and the text content time, and the at least one piece of image content and the image content time.
In the embodiment of the application, the method determines the content of the meeting summary by acquiring the text content and the image content displayed on the screen during the meeting, and then typesets that content according to the time information, so that the meeting summary is generated automatically. This not only saves subsequent manual sorting, but also makes the generated meeting summary, which includes data of multiple dimensions, more comprehensive and richer.
According to a possible implementation manner of the first aspect, generating a summary of the conference based on the at least one piece of text content and the text content time, and the at least one image content and the image content time includes: typesetting the at least one piece of text content and the at least one image content in chronological order based on the text content time and the image content time to generate the conference summary.
In the embodiment of the application, the text content and the image content can be typeset according to the text content time and the image content time in sequence, so that the generated conference summary can contain time information, and the generated conference summary is more in line with the reading habit of a user.
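As an illustration of this chronological typesetting, the following Python sketch merges text content and image content by their time information and lays them out in order. The TextItem/ImageItem classes and the render_summary function are assumptions introduced for illustration only; they are not part of the claimed method.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Union

@dataclass
class TextItem:
    content: str
    start: datetime   # e.g. collection start time of the corresponding audio
    end: datetime     # e.g. collection end time

@dataclass
class ImageItem:
    path: str
    time: datetime    # e.g. time at which the screen capture operation was performed

def render_summary(texts: List[TextItem], images: List[ImageItem]) -> str:
    # Merge all content and sort it by time, so the summary follows the
    # chronological order of the meeting.
    items: List[Union[TextItem, ImageItem]] = sorted(
        [*texts, *images],
        key=lambda it: it.start if isinstance(it, TextItem) else it.time,
    )
    lines = []
    for it in items:
        if isinstance(it, TextItem):
            lines.append(f"[{it.start:%H:%M}-{it.end:%H:%M}] {it.content}")
        else:
            lines.append(f"[{it.time:%H:%M}] (image) {it.path}")
    return "\n".join(lines)
```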
According to a possible implementation manner of the first aspect, before obtaining at least one piece of text content related to the meeting and a text content time corresponding to each piece of text content, the method further includes: audio data associated with the conference is obtained.
In the embodiment of the application, the method can also utilize the audio data to generate the conference summary, so that the generated content of the conference summary is richer.
According to one possible implementation form of the first aspect, the audio data comprises at least one piece of audio data collected sequentially;
the acquiring at least one section of text content related to the conference and the text content time corresponding to each section of text content comprises: and converting each piece of audio data in the at least one piece of audio data into each piece of text content, wherein the text content time corresponding to each piece of text content comprises the acquisition starting time and the acquisition ending time of the audio data corresponding to each piece of text content.
In the embodiment of the application, the method can segment the collected audio data according to the speaking sequence and convert the segmented voice data into text content, so that the generated conference summary can reflect the front-back relationship of the text content, and the text content in the generated conference summary is more comprehensive.
According to a possible implementation manner of the first aspect, the acquiring at least one piece of text content related to the conference and the text content time corresponding to each piece of text content includes: converting the audio data into text data; extracting at least one keyword in the text data and each section of text content corresponding to each keyword, wherein the text content time corresponding to each section of text content comprises the acquisition starting time and the acquisition ending time of the audio data corresponding to each section of text content.
In the embodiment of the application, the method can segment the collected audio data according to the semantic information, generate text content according to the semantic information, and increase analysis processing on the audio data, so that the generated conference summary is more simplified.
According to one possible implementation manner of the first aspect, the at least one image content includes a screen capture image obtained by performing a screen capture operation during the meeting, and the image content time corresponding to each image content includes a time for performing the screen capture operation.
In the embodiment of the application, the method can utilize the screenshot image as the image content in the conference summary, so that the conference summary is richer and more sufficient.
According to one possible implementation manner of the first aspect, the at least one image content includes at least one manual input content displayed on a display screen through a manual input operation, where the manual input operation is an input operation performed by a participant during the conference, and the image content time corresponding to each image content is an input start time and an input end time of the manual input content included in each image content.
In the embodiment of the application, the method can use the manually input content manually input by the conference participants as the image content, so that the conference summary is richer and more sufficient.
According to one possible implementation manner of the first aspect, the at least one image content includes at least one user attention content, determined by image data, at which a participant gazes during the conference, wherein the image data includes captured behavior data of the participant during the conference or image data including a corneal reflection point of the participant, and an image content time corresponding to each image content is determined by the capturing time.
In an embodiment of the application, the method may determine the gazing area and the time information of the conference participants through the acquired image data of the conference participants, so as to determine the image content in the conference summary, so that the content contained in the generated conference summary is richer.
In a second aspect, an embodiment of the present application provides an apparatus for generating a conference summary, including: a text content acquiring unit, configured to acquire at least one piece of text content related to a conference and a text content time corresponding to each piece of text content; an image content acquiring unit, configured to acquire at least one image content related to the conference and an image content time corresponding to each image content; and a generating unit, configured to generate a summary of the conference based on the at least one piece of text content and the text content time, and the at least one image content and the image content time.
According to a possible implementation manner of the second aspect, the generating unit is specifically configured to typeset the at least one text content and the at least one image content according to a time sequence based on the text content time and the image content time, and generate the meeting summary.
According to a possible implementation manner of the second aspect, the apparatus further includes: an audio data acquisition unit for acquiring audio data related to the conference.
According to a possible implementation manner of the second aspect, the audio data includes at least one piece of audio data collected sequentially, and the text content acquiring unit is specifically configured to convert each piece of audio data in the at least one piece of audio data into each piece of text content, where the text content time corresponding to each piece of text content includes the acquisition start time and the acquisition end time of the audio data corresponding to each piece of text content.
According to a possible implementation manner of the second aspect, the text content obtaining unit includes: the conversion module is used for converting the audio data into text data; and the extraction module is used for extracting at least one keyword in the text data and each section of text content corresponding to each keyword, wherein the text content time corresponding to each section of text content comprises the acquisition starting time and the acquisition ending time of the audio data corresponding to each section of text content.
According to one possible implementation manner of the second aspect, the at least one image content includes a screen capture image obtained by performing a screen capture operation during the meeting, and the image content time corresponding to each image content includes a time for performing the screen capture operation.
According to a possible implementation manner of the second aspect, the at least one image content includes at least one manual input content displayed on a display screen through a manual input operation, where the manual input operation is an input operation performed by a participant during the conference, and the image content time corresponding to each image content is an input start time and an input end time of the manual input content included in each image content.
According to one possible implementation of the second aspect, the at least one image content includes at least one user attention content, determined by image data, at which a participant gazes during the conference, wherein the image data includes captured behavior data of the participant during the conference or image data including corneal reflection points of the participant, and an image content time corresponding to each image content is determined by the capture time.
In a third aspect, an embodiment of the present application provides a method for generating a conference summary, where the method includes:
acquiring at least one text content and time information of each text content in the at least one text content, where the at least one text content is obtained based on audio data of the conference; acquiring at least one image and time information of each image in the at least one image, where the at least one image is obtained based on screen content of the conference; and generating a summary of the conference based on the at least one text content and the time information of each of the at least one text content, and the at least one image and the time information of each of the at least one image.
According to a possible implementation manner of the third aspect, the obtaining at least one text content and time information of each text content in the at least one text content includes: acquiring at least one section of audio data of the conference and time information of each section of audio data in the at least one section of audio data; respectively converting each piece of audio data in the at least one piece of audio data to obtain at least one piece of text content corresponding to the number of the at least one piece of audio data; and obtaining the time information of each text content in the at least one text content based on the time information of each audio data in the at least one audio data.
According to a possible implementation manner of the third aspect, the obtaining at least one piece of text content and time information of each piece of text content in the at least one piece of text content includes: acquiring at least one section of audio data of the conference and time information of each section of audio data in the at least one section of audio data; converting the at least one piece of audio data to obtain first text content; determining at least one keyword based on the first text content; determining second text content associated with the first keyword based on the first keyword and the first text content, wherein time information of the second text content is obtained based on time information of audio data corresponding to the second text content; the first keyword is any one of the at least one keyword.
According to a possible implementation manner of the third aspect, the at least one image is obtained based on screen content of the conference captured at predetermined time intervals; the time information of the first image is the capture time of the screen content corresponding to the first image, and the first image is any one of the at least one image.
According to a possible implementation manner of the third aspect, the screen content of the conference is a screen content of an electronic whiteboard; the acquiring at least one image and time information of each image of the at least one image includes: determining information of at least one region of interest of a participant of the conference in a screen of the electronic whiteboard; the information of the first attention area comprises the position and attention time of the first attention area, wherein the position is the position of the first attention area in a screen of the electronic whiteboard, and the attention time is the attention time of the conferee to the first attention area; the first region of interest is any of the at least one region of interest; capturing screen content of the electronic whiteboard to obtain a first image based on the position of the first attention area, wherein time information of the first image is attention time of the first attention area; the first image is any one of the at least one image.
According to a possible implementation manner of the third aspect, the determining information of at least one attention area of a participant of the conference in a screen of the electronic whiteboard includes: acquiring image data during the meeting and shooting time of the image data; the image data includes head information or eye movement information of participants of the conference; determining a location of the first region of interest based on the image data; determining a time of interest of the first region of interest based on a capturing time of the image data.
According to a possible implementation manner of the third aspect, the determining information of at least one attention area of a participant of the conference in a screen of the electronic whiteboard includes: acquiring data of a first touch input on the electronic whiteboard during the meeting; determining an execution time of the first touch input and location information on the electronic whiteboard based on the data of the first touch input; determining a location of the first region of interest based on location information of the first touch input on the electronic whiteboard; determining a time of interest of the first region of interest based on an execution time of the first touch input.
According to a possible implementation manner of the third aspect, the generating a summary of the meeting based on the at least one text content and the time information of each of the at least one text content, and the at least one image and the time information of each of the at least one image includes: determining the sequence between the at least one text content and the at least one image based on the time information of each text content in the at least one text content and the time information of each image in the at least one image; generating a summary of the conference according to a predetermined format based on the precedence order.
In a fourth aspect, an embodiment of the present application provides an apparatus for generating a conference summary, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to implement the method of any one of claims 1-8 when executing the instructions.
In a fifth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium, on which computer program instructions are stored, wherein the computer program instructions, when executed by a processor, implement the method in various possible implementations of the above first aspect or third aspect.
In a sixth aspect, embodiments of the present application provide a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, where when the computer readable code runs in an electronic device, a processor in the electronic device performs the method in the various possible implementations of the first aspect or the third aspect above.
These and other aspects of the present application will be more readily apparent from the following description of the embodiment(s).
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the application and, together with the description, serve to explain the principles of the application.
FIG. 1 illustrates a schematic diagram of an application scenario provided in accordance with the present application;
FIG. 2 illustrates a schematic structural diagram of a smart interactive tablet according to an embodiment of the present application;
FIG. 3 shows a flowchart of the steps for generating a meeting summary according to an embodiment of the present application;
FIG. 4 illustrates a schematic diagram of an interface involved in generating a meeting summary according to an embodiment of the present application;
FIG. 5 shows a schematic diagram of conference data in different dimensions according to an embodiment of the present application;
FIG. 6 is a flowchart illustrating steps for obtaining at least one textual content in a meeting summary and corresponding textual content time according to one embodiment of the present application;
FIG. 7 is a flowchart illustrating steps for obtaining at least one textual content and corresponding textual content time in a meeting summary according to one embodiment of the present application;
FIG. 8 is a flowchart illustrating steps for obtaining at least one image content and corresponding image content time in a conference summary according to one embodiment of the present application;
FIG. 9 is a flowchart illustrating steps for obtaining at least one image content and corresponding image content time in a meeting summary according to one embodiment of the present application;
FIG. 10 illustrates a schematic diagram of an interface involved in generating a meeting summary according to an embodiment of the present application;
FIG. 11 illustrates a flowchart of steps for obtaining at least one image content and corresponding image content time in a meeting summary according to an embodiment of the present application;
FIG. 12 shows a schematic diagram of a conference summary according to an embodiment of the present application;
FIG. 13 shows a flowchart of the steps for generating a meeting summary according to an embodiment of the present application;
FIG. 14 shows a flowchart of the steps for generating a meeting summary according to an embodiment of the present application;
FIG. 15 shows a block diagram of a device that generates a conference summary according to an embodiment of the present application.
Detailed Description
In order to clearly describe the technical solutions of the embodiments of the present application, words such as "first" and "second" are used in the embodiments of the present application to distinguish between identical or similar items having substantially the same functions and effects. For example, the first room and the second room are only used to distinguish different rooms, and their order is not limited. Those skilled in the art will appreciate that the words "first", "second", and the like do not limit the quantity or the execution order, and do not necessarily indicate any difference in importance.
It is noted that, in the present application, words such as "exemplary" or "for example" are used to mean exemplary, illustrative, or descriptive. Any embodiment or design described herein as "exemplary" or "e.g.," is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present concepts related in a concrete fashion.
In the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" describes the association relationship of the associated object, indicating that there may be three relationships, for example, a and/or B, which may indicate: a exists alone, A and B exist simultaneously, and B exists alone, wherein A and B can be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of the singular or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or multiple.
A method for generating a conference summary provided by the embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Fig. 1 shows an application scenario diagram according to an embodiment of the application.
As shown in fig. 1, assume that a plurality of conference participants, including a conference host 100 and other participating users (hereinafter referred to as "participants") 101, hold a meeting in a conference room of an enterprise. During the meeting, the conference host 100 may change; for example, participants may take turns speaking as the conference host, or a participant may volunteer or be invited to speak as the conference host. Further, the number of conference hosts is not limited by the illustration.
During the conference, the conference host 100 casts content containing the conference information to the smart interactive tablet 13 through the terminal device 11, and the smart interactive tablet 13 then displays the cast content. For example, in an application scenario of teaching with the smart interactive tablet 13, a teacher may cast an electronic document such as courseware onto the smart interactive tablet 13, and the smart interactive tablet 13 may present the electronic document to students. For another example, in an application scenario where a business presentation or topic sharing is performed using the smart interactive tablet 13, the presenter (i.e., the conference host 100) may cast the prepared presentation onto the smart interactive tablet 13, and the smart interactive tablet 13 may present the presentation to the audience.
In addition, the conference host 100 or another participant 101 may also perform manual input operations on the smart interactive tablet, and the corresponding manually input content will be displayed on the smart interactive tablet 13. As an example, the conference host 100 may call up a canvas interface (also referred to as an electronic whiteboard interface) of the smart interactive tablet 13, in which case the smart interactive tablet 13 acts as a large handwriting board. The conference host 100 can then manually enter certain content on the interface, such as currently spoken keywords or simple sketches that aid understanding. After manual entry, the conference host 100 may also edit the manually entered content; for example, a manually entered sketch may be deleted or modified. Taking text as an example of manual input content, the editing may be adjusting the font, size, position, etc. of the text. In addition, the conference host 100 may also perform manual input operations through an external input device such as a mouse or a keyboard.
As another example, the conference host 100 may also perform manual input operations based on the content already presented on the smart interaction tablet 13, e.g., the conference host 100 may manually mark up keywords in an electronic document displayed on the smart interaction tablet 13, etc.
In FIG. 1, the terminal device 11 may be any type of electronic device that a user may use, such as a mobile phone, a desktop computer, a tablet device, a notebook computer, a PDA (Personal Digital Assistant), or a wearable device (such as smart glasses or a smart watch), which is not limited by the present application.
The network 12 is used to establish a network connection between the terminal device 11 and the smart interactive tablet 13, and may include various types of wired or wireless networks. In one embodiment, the network 12 may include a near field communication network such as Bluetooth, Wi-Fi, or ZigBee. In another embodiment, the network 12 may include telecommunications networks such as the Public Switched Telephone Network (PSTN) and the Internet. Of course, the network 12 may also include both near field communication networks and telecommunications networks.
The smart interactive tablet 13 may adopt a hardware structure as shown in fig. 2, and the smart interactive tablet 13 will be described in detail with reference to fig. 2.
Fig. 2 shows a diagram of the smart interaction tablet 13 according to an embodiment of the present application. As shown in fig. 2, the smart interactive tablet 13 may include: a processor 201, a communication bus 202, a user interface 203, a network interface 204, and a memory 205.
Wherein a communication bus 202 is used to enable the connection communication between these components.
The user interface 203 may include, among other things, a display screen, a camera, and an audio device. Optionally, the number of the display screen, the camera and the audio device includes one or more, which is not limited in this application. Optionally, the user interface 203 may also include a standard wired, wireless interface.
The display screen comprises a display panel. Optionally, the display screen may further include a touch panel; in this case, the display screen may be referred to as a touch screen or a touch-controlled screen, the display panel may be located at the lowermost layer of the display screen, and the touch panel may be located at the uppermost layer. The touch panel may pass the detected touch operation to the processor 201, and the display panel may provide visual output related to the touch operation. As an example, when a participant touches the display screen with a stylus or a touch pen, the touch panel may pass the detected touch operation to the processor 201.
Wherein the camera is used for capturing still images or video. In general, a camera may include a photosensitive assembly such as a lens group and an image sensor, wherein the lens group includes a plurality of lenses (convex or concave lenses) for collecting an optical signal reflected by an object to be photographed and transferring the collected optical signal to the image sensor. And the image sensor generates an original image of the object to be shot according to the optical signal. As an example, a camera may take images of the participants during the conference.
The audio device may include an audio input device and an audio output device, among other things. The audio input device may convert the collected sound into audio data. As an example, the audio input device may include a microphone, and a user may speak by approaching the microphone, which may then convert the collected sound into audio data. The audio output device may convert audio data into a sound signal. As an example, the audio output device may include a speaker through which the smart interaction tablet 13 may output sound. The audio device in fig. 2 is built in the smart interactive tablet 13, and optionally, the audio device may also be an external device of the smart interactive tablet 13. For example, the user may speak using an external microphone of the smart interaction tablet 13.
The network interface 204 may include a standard wired interface, a wireless interface (e.g., a WIFI interface).
The processor 201 may include one or more Processing units, for example, the processor 201 may include Processing units such as an Application Processor (AP), a Graphics Processing Unit (GPU), and a Central Processing Unit (CPU). The processor 201 can generate a control signal according to the operation instruction and the timing signal, so as to control the intelligent interactive panel 13.
The memory 205 may be used to store executable program code of the smart interactive tablet 13, which includes instructions. The memory 205 may include a program storage area and a data storage area. The storage program area may store an operating system, an application program (e.g., a sound playing function, an image playing function, etc.) required for at least one function, and the like. The storage data area may store data (e.g., audio data, image data) created during use of the smart interactive tablet 13, and the like. Further, the memory 205 may also include a high speed random access memory or a non-volatile memory, etc.
In the conference scenario shown in fig. 1, after the conference ends, the participants can organize the conference summary according to the conference record taken manually during the conference. Briefly, the participants can summarize each topic discussed in the conference according to the record, adjust the order of the topics according to their logical relationship, and finally produce the conference summary. It can be seen that this is a manual way of generating the conference summary, which requires additional manpower and material resources. Furthermore, in the prior art, there are two approaches to generating a conference summary:
In the first approach, the participants may perform a screen capture operation on a screen displayed on the smart interactive tablet used in the conference, and then acquire the screenshot image and output it as the conference summary. In the second approach, the participants may capture audio data throughout the conference using audio devices, then convert the audio data into text data and output the text data as the conference summary.
It can be seen that both approaches output data generated during the meeting (e.g., screenshots or audio data) directly as the meeting summary, without any analysis or sorting of the data. Especially in a scenario where the smart interactive tablet is used for the meeting, the written content may be scattered across various areas of the tablet's screen, and a conference summary generated by these approaches cannot reflect the logical relationship and/or chronological order of the content.
Based on this, the method for generating a conference summary provided by the embodiments of the present application determines the content of the conference summary using the audio data collected during the conference and the image content displayed on the screen, and then typesets that content according to its time information, so that the conference summary is generated automatically. This not only saves subsequent manual sorting, but also makes the generated conference summary, which includes data of multiple dimensions, more comprehensive and rich.
It should be noted that, for reasons of space, this description does not exhaustively list all alternative embodiments. Upon reading this specification, those skilled in the art will understand that any combination of features that are not mutually inconsistent constitutes an alternative embodiment.
FIG. 3 shows a flowchart of the steps for generating a meeting summary according to an embodiment of the present application. As shown in fig. 3, the method of the present embodiment may include:
in step S101, a first interface is displayed, where the first interface includes a first control.
Optionally, the first interface may be the above-mentioned smart interactive tablet interface, or may be an interface displayed by the smart interactive tablet after running a conference-related application. The conference-related application may be an application installed in the smart interactive tablet at the factory, or an application downloaded and installed by a user while using the smart interactive tablet. The application may assist the user in holding a meeting, a lecture, or a presentation using the smart interactive tablet. By way of example, the first interface may be a main interface of the conference-related application or a lower-level interface of the main interface, which is not limited by this application.
Optionally, the first interface displays a plurality of controls, and the plurality of controls includes a first control, where the first control is an operable control for triggering generation of the conference summary. Illustratively, the type of the first control includes at least one of a button, a manipulable entry, and a slider.
Optionally, a second control may be further included on the first interface, where the second control is used to determine the layout for generating the meeting summary. In implementation, a user can directly select, through the second control, a target template for generating the conference summary from candidate conference summary templates, and the smart interactive tablet displays the generated conference summary according to the format of the target template. Additionally, the user may also customize the template for the meeting summary using the second control; for example, the user may determine the font or font size of the meeting summary using the second control, or determine whether to display the meeting time, etc.
In step S102, a trigger signal of the first control is received, where the trigger signal indicates to start generating a conference summary. Optionally, the trigger signal is a user operation signal that triggers generation of the conference summary. Illustratively, the trigger signal includes any one or more of a click operation signal, a slide operation signal, a press operation signal, and a long press operation signal.
In other possible implementations, the trigger signal may also be implemented in audio form. For example, the smart interactive tablet receives an audio signal input by a user and analyzes the audio signal to obtain audio content; when a keyword matching preset information corresponding to the first control (for example, "meeting summary" or "record generation") exists in the audio content, it is determined that the first control is triggered, that is, the smart interactive tablet receives the trigger signal on the first control.
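As a minimal illustration of this audio-based triggering, the sketch below matches recognized audio content against preset phrases associated with the first control. The phrase list and the function name are assumptions for illustration only, not values defined by the application.

```python
# Illustrative sketch: phrases bound to the "generate conference summary" control.
TRIGGER_PHRASES = ("meeting summary", "record generation")

def is_summary_trigger(recognized_text: str) -> bool:
    # The first control is considered triggered when the recognized audio
    # content contains any of the preset phrases.
    text = recognized_text.lower()
    return any(phrase in text for phrase in TRIGGER_PHRASES)
```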
In step S103, a conference summary is generated in response to the trigger signal. That is to say, after receiving the trigger signal, the smart interactive tablet may execute the conference summary generation method according to the embodiment of the present application. It should be noted that, the implementation details of generating the conference summary may refer to the related description in the following embodiments, which are not introduced here.
In one possible implementation, the method may display the generated meeting summary on a second interface, wherein the second interface may be displayed on a screen of the smart interactive tablet and is different from the first interface.
Optionally, a third control may be included on the second interface, where the third control is an operable control for the generated conference summary; for example, a user may modify the generated conference summary using the third control. Alternatively, the user may share the generated conference summary with the conference participants using the third control. Alternatively, the user may store the generated conference summary as a picture or a document in a preset format using the third control. Alternatively, the user may delete a generated conference summary, etc., using the third control.
In one illustrative example, as shown in fig. 4, the smart interactive tablet may display a first interface 41 on the screen, where the first interface 41 includes a first control 42 of "generate a meeting summary". And when the intelligent interactive tablet receives a click operation signal of the user for the first control 42, starting a function of generating a conference summary. The generated conference summary is then displayed on the second interface 43. As shown in fig. 4, other controls, such as "share", "modify", "output", etc. may be included on the second interface 43, and a user (e.g., a conferee) may perform subsequent operations on the generated conference summary by operating these controls.
In the embodiment, the user can start the conference summary generation function through interacting with the intelligent interactive tablet, and after the conference summary generation function is started, the conference summary is generated according to the method for generating the conference summary of the embodiment of the application, so that the subsequent manual sorting process is omitted. In implementation, the user can further perform operations on the generated conference summary according to the needs of the user, including sharing the conference summary, modifying the conference summary, and the like.
Various embodiments of generating a conference summary using multi-dimensional conference data will be described below with reference to fig. 5-12. To more clearly describe the various embodiments, the conference data of different dimensions collected by the smart interactive tablet 13 will be briefly explained below with reference to fig. 5. As shown in fig. 5, the smart interactive tablet 13 may utilize a camera 501, an audio device 502, and a stylus 503 to acquire different types of data for generating a conference summary. The method comprises the following specific steps:
in an implementation, the smart interactive tablet 13 may capture audio data generated by the meeting participants using the audio device 502. As an example, the smart interaction tablet 13 may directly activate the audio device 502 to capture audio data after being activated, and as another example, the audio device 502 may start capturing audio data generated by the conference participants after being triggered to activate.
In an implementation, the smart interactive tablet 13 may obtain screen capture data of the content displayed on its display screen. The screen capture data indicates image data acquired after the smart interactive tablet 13 performs a screen capture operation (also referred to as a screenshot operation in the following description) on the content displayed on the display screen. In an implementation, the screen capture operation may refer to performing a snapshot operation on the content displayed on the display screen, and the image content obtained through the snapshot operation may be the full-screen content displayed on the display screen or partial content displayed on the display screen.
In one possible implementation, the screen capture operation may be that the smart interactive tablet 13 performs the screen capture operation after receiving a user trigger signal, for example, the user trigger signal includes any one or more of a click operation signal, a slide operation signal, a press operation signal, and a long press operation signal. In implementations, the user trigger may indicate a user trigger for a screen capture control.
In one possible implementation, the smart interactive tablet 13 may perform a screen capture operation on the interface being displayed in response to an operation performed by the user to create a new interface. For example, after the participants have written content on the screen or have finished discussing each piece of content displayed on the current interface, they may create a new interface (i.e., a blank interface) to discuss the next topic; at this time, the smart interactive tablet 13 may perform a screen capture operation on the interface that was being displayed.
In one possible implementation, the screen capture operation may be an operation automatically performed by the smart interactive tablet. Optionally, the smart interactive tablet may perform the screen capture operation at preset time intervals, for example, the smart interactive tablet may perform the screen capture operation every five minutes.
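A minimal sketch of such interval-based capturing is given below, assuming a capture_screen() helper supplied by the device; this helper and the default interval are assumptions for illustration, not an API defined by the application.

```python
import threading
from datetime import datetime

def periodic_screenshots(capture_screen, stop_event: threading.Event,
                         interval_s: int = 300):
    """Capture the displayed content every interval_s seconds (e.g. every five
    minutes) and keep the capture time as the image content time."""
    captures = []
    while not stop_event.is_set():
        captures.append((capture_screen(), datetime.now()))  # (image, capture time)
        stop_event.wait(interval_s)
    return captures
```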
In an implementation, in a case where the smart interactive tablet 13 displays the smart interactive tablet interface, the conference participant may also perform a manual input operation on the smart interactive tablet. The smart interactive tablet 13 may display the manually input contents on the display screen in response to the manual input operation.
In one possible implementation, the manual input operation may indicate an operation of performing a touch input on the smart interactive tablet 13 using a stylus, a stylus pen, or a finger of a user. In this case, the smart interaction tablet 13 may display a canvas interface, and the participant selects various elements (e.g., markup, graphics, brush color, etc.) displayed on the canvas interface and then makes touch input on the smart interaction tablet with the various elements. For example, the meeting participant may draw on the smart interactive tablet 13 after selecting the brush color. The smart interactive tablet 13 may display touch input content corresponding to the touch input operation on the display screen in response to receiving the touch input operation of at least one participant.
In implementation, the camera 501 may also photograph the participants during the conference to obtain image data of the participants.
In one possible implementation, the camera 501 may be an eye tracker. An eye tracker is a device capable of tracking and measuring the position and movement of an eyeball. As an example, the camera 501 may photograph the participants at preset intervals, for example, every three minutes.
As can be seen from fig. 5, the smart interactive tablet in embodiments of the present application may collect a variety of different types of data during a meeting. The conference summary generation method can directly or indirectly use the data as the content of the conference summary, and typeset according to the respective time information of the data, so that the conference summary with the sequence can be automatically generated.
Embodiments of performing processing for different types of data to generate content for a conference summary will be described in detail below with reference to fig. 6-11. The content of the conference summary refers to content presented to the user as part of the conference summary, including textual content as well as image content.
Fig. 6 and 7 show flowcharts of steps for acquiring text content and text content time using audio data, respectively. Briefly, the embodiment shown in fig. 6 segments the collected audio data according to the speaking manner of the speaker, and the embodiment shown in fig. 7 segments the collected audio data according to the semantic information. An embodiment of obtaining the text content and the text content time corresponding to the text content by using the audio data will be described in detail with reference to fig. 6. As shown in fig. 6:
in step S201, audio data of the conference and time information corresponding to the audio data are collected. The time information may be represented by a time period consisting of the collection start time and the collection end time of the audio data. For example, if the smart interactive tablet starts the acquisition at 11:00 and ends the acquisition at 11:30, the time information of the acquired audio data is (11:00, 11:30).
In an implementation, the smart interactive tablet may obtain at least one piece of audio data in the speaking order of the participants. That is to say, the intelligent interactive tablet can use a single utterance of each speaker as a segment of audio data according to the difference of the speakers, and finally complete audio acquisition of the conference to acquire the audio data of the conference. The audio data comprises time information corresponding to each piece of audio data.
Specifically, the audio device may determine whether a participant starts speaking, continues speaking, or finishes speaking according to the intensity of the collected audio data. If the intensity of the collected audio data is greater than a preset intensity, the audio device determines that the participant has started speaking and records the collection start time. If the intensity stays within the preset intensity range, the participant is considered to be still speaking, and the audio device continues to collect audio data. When the intensity of the participant's audio data falls below the preset intensity for longer than a preset duration, the audio device determines that the participant has finished speaking and records the collection end time. Finally, the audio device obtains the segment of audio data and the time information corresponding to that segment.
For example, if a participant speaks with a sound exceeding the preset intensity, the audio device may determine that the participant has started speaking, record the collection start time of the current speech, and begin collecting audio data. The participant then continues speaking at or above the preset intensity, and the audio device keeps collecting audio data until the participant pauses for more than 5 seconds, at which point the audio device records the collection end time and stops collecting the participant's audio data. The audio device may then store this piece of the participant's audio data together with its time information (i.e., the collection start time and collection end time).
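The intensity-based segmentation described above might be sketched as follows; the frame representation, the threshold, and the five-second silence limit are illustrative assumptions rather than values fixed by the application.

```python
def segment_by_intensity(frames, start_threshold, silence_s=5.0):
    """frames: time-ordered iterable of (timestamp_seconds, intensity) pairs.
    Returns a list of (collection_start_time, collection_end_time) tuples."""
    segments = []
    seg_start = None
    last_voiced = None
    for ts, intensity in frames:
        if intensity >= start_threshold:
            if seg_start is None:
                seg_start = ts              # the participant starts speaking
            last_voiced = ts                # the participant keeps speaking
        elif seg_start is not None and ts - last_voiced > silence_s:
            segments.append((seg_start, last_voiced))   # speaking has ended
            seg_start, last_voiced = None, None
    if seg_start is not None:               # close a segment still open at the end
        segments.append((seg_start, last_voiced))
    return segments
```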
In an implementation, the participants may speak at the same time or at different times. The intelligent interactive tablet can record each piece of audio data generated by each participant respectively. In one possible implementation, the audio device may use a blind source separation technique to separate the audio data of different participants, and then may use a voiceprint recognition technique to identify each participant, and store/record the audio data of each participant separately. For convenience of search and management, different identity tags may be set for each participant when storing the audio data, and then the audio data is stored/recorded in the manner of table 1 or table 2 for each participant. As shown in table 1:
TABLE 1
[Table 1 is provided as an image in the original publication and is not reproduced here.]
As can be seen from Table 1, at time t2 multiple participants (the first participant and the n-th participant) may speak simultaneously during the same time period. In this case, the audio device may store the generated audio data in a consistent speaking order, that is, it may preferentially store the audio data generated at time t2 by the participant who was already speaking. As an example, at time t2 the first participant and the n-th participant speak simultaneously, but because the first participant was already speaking before t2, the audio data generated by the first participant at time t2 may be stored first, followed by the audio data generated by the n-th participant at time t2.
Furthermore, it is also possible to store in the manner referred to in table 2:
TABLE 2
[Table 2 is provided as an image in the original publication and is not reproduced here.]
In Table 2, each piece of audio data generated by each participant is stored together with that participant. As can be seen from Table 2, multiple participants may speak simultaneously in the same time period and each generate audio data; for example, the audio data generated at time t2 by different participants may each be stored with the corresponding participant. In Table 2, the audio data generated by each participant may be stored in chronological order. It should be noted that the above only shows illustrative embodiments, and the smart interactive tablet may also store the collected audio data in other manners.
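One possible in-memory layout for the per-participant storage described by Table 1 and Table 2 is sketched below; the class and method names are assumptions for illustration, not structures defined by the application.

```python
from collections import defaultdict

class AudioStore:
    """Per-participant storage of audio segments, keyed by an identity tag
    (e.g. obtained through voiceprint recognition)."""

    def __init__(self):
        self._by_participant = defaultdict(list)

    def add(self, participant_id, start_time, end_time, audio_clip):
        # Keep each participant's segments in chronological order (Table 2 style).
        self._by_participant[participant_id].append((start_time, end_time, audio_clip))
        self._by_participant[participant_id].sort(key=lambda rec: rec[0])

    def all_segments(self):
        # Flatten into one list ordered by collection start time (Table 1 style):
        # when two participants overlap, the earlier speaker's segment comes first.
        records = [
            (start, end, pid, clip)
            for pid, recs in self._by_participant.items()
            for start, end, clip in recs
        ]
        return sorted(records, key=lambda rec: rec[0])
```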
In step S202, each piece of audio data in the at least one piece of audio data is converted into each piece of text content, where the text content time of each piece of text content is a time period formed by the acquisition start time and the acquisition end time for acquiring the corresponding piece of audio data.
In one possible implementation, the smart interactive tablet may segment the collected audio data according to the collection start time and collection end time of each utterance of a speaker. For example, the smart interactive tablet may obtain at least one piece of audio data and the time information of each piece of audio data as stored in Table 1. The smart interactive tablet may then take the audio data generated by the first participant from collection start time t1 to collection end time t3 as the first piece of audio data, and take the audio data generated by the n-th participant from collection start time t2 to collection end time t3 as the second piece of audio data. The smart interactive tablet may then convert the first piece of audio data into first text content and the second piece of audio data into second text content.
In an implementation, the method may employ a speech recognition technique to convert each piece of audio data into text content. Speech recognition, also called Automatic Speech Recognition (ASR), means that an input audio signal can be converted into a corresponding text output through recognition and understanding. In an implementation, the method may use a speech recognition model to convert each piece of audio data into text content; the speech recognition model mentioned here may include a neural network model that, after being trained with training data, converts input audio data into text content, which will not be described further.
In implementation, the time information of each piece of text content may be determined using the time information of the corresponding audio data. Still taking the first participant as an example, the time information of the text content can be determined using the time information of the corresponding audio data, i.e., the time period (t1, t3) from the collection start time t1 to the collection end time t3.
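A minimal sketch of this conversion step is shown below, assuming an external transcribe() callable (any speech recognition engine or model); the callable and the tuple layout are assumptions for illustration, not interfaces defined by this application.

```python
def audio_segments_to_text(segments, transcribe):
    """segments: list of (audio_clip, start_time, end_time) tuples.
    transcribe: any speech-to-text callable, e.g. a trained ASR model.
    Returns a list of (text_content, (start_time, end_time)) pairs."""
    results = []
    for clip, start, end in segments:
        text = transcribe(clip)               # speech-to-text for this segment
        results.append((text, (start, end)))  # text content time = audio time period
    return results
```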
In another possible implementation manner, after the voice data is converted into the text data, the intelligent interaction tablet may segment the text data by using semantic information of the text data to determine each text content. This is described in detail below with reference to fig. 7, as shown in fig. 7:
in step S301, the audio data of the conference and the time information corresponding to the audio data are collected, which is the same as step S201 and will not be described herein again.
In step S302, the audio data obtained in step S301 may be converted into text data using the speech recognition technique described above.
In step S303, the keywords in the text data and each text content corresponding to each keyword are extracted, where the text content time of each text content is a time period formed by the acquisition start time and the acquisition end time for acquiring the audio data corresponding to the text content.
In particular, the method may determine at least one keyword in the text data. In an implementation, the smart interactive tablet may perform semantic analysis on the text data by using a Natural Language Processing (NLP) technology to extract the at least one keyword.
In one possible implementation, the semantic analysis technique may first preprocess the text data, where the preprocessing includes filtering noise and the like. The preprocessed text data is then split into text clauses according to punctuation marks. For example, the text data "performed well this year, we believe that the next year will be better" can be divided directly at the punctuation into "performed well this year" and "we believe that the next year will be better".
Subsequently, the semantic analysis technique may determine the part of speech of each word segment in each text clause, that is, each clause is divided into word segments according to part of speech. For clauses longer than a preset threshold, the technique may then extract phrases that fit a window size as candidate phrases, based on the part of speech of each word segment and a sliding-window scheme. The window size may be preset by a technician according to users' reading habits; for example, the window size may be set to 4, with the word segments inside the window being complete in part of speech. As described above, the clause "we believe that the next year will be better" can be segmented by part of speech into "we", "believe", "next year", "will", "better". When the window size is set to 4, the phrases "we believe", "believe next year", "next year will" and "will be better" can be extracted from the clause. These phrases may be taken as candidate phrases; extracting them in this way makes the candidate phrases better match the reading habits of users.
Once the candidate phrases corresponding to the clauses have been determined, the smart interactive tablet can count the occurrence frequency of each candidate phrase and take the candidate phrases whose occurrence frequency exceeds a preset threshold as keywords.
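As an illustrative sketch only (Python), the candidate-phrase and keyword step described above might look as follows; the word segmentation and part-of-speech filtering are assumed to have already produced a list of tokens per clause, and the window size and frequency threshold are assumed configurable values:

import re
from collections import Counter

def split_clauses(text: str) -> list[str]:
    # Split the recognized text into clauses at punctuation marks.
    return [c.strip() for c in re.split(r"[,.;!?]", text) if c.strip()]

def candidate_phrases(tokens: list[str], window: int = 4) -> list[str]:
    # Slide a fixed-size window over a clause's word segments to form phrases.
    if len(tokens) < window:
        return []
    return [" ".join(tokens[i:i + window]) for i in range(len(tokens) - window + 1)]

def extract_keywords(clause_tokens: list[list[str]],
                     window: int = 4, min_count: int = 3) -> list[str]:
    # Candidate phrases whose occurrence frequency exceeds the preset
    # threshold (min_count, an assumed value) are taken as keywords.
    counts = Counter()
    for tokens in clause_tokens:
        counts.update(candidate_phrases(tokens, window))
    return [phrase for phrase, n in counts.items() if n > min_count]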
In practice, the method may determine text clauses corresponding to each keyword and treat these text clauses as text content corresponding to the keyword. Taking the first keyword as an example, the method may determine text clauses corresponding to the first keyword, and then take text paragraphs formed by the text clauses as the first text content corresponding to the first keyword.
In an implementation, the text content time of a piece of text content may be determined from the time information of the audio data corresponding to each text clause included in the text content. For example, in the case where the first text content includes five text clauses, the text content time of the first text content may be represented by the time period from the collection start time of the first clause to the collection end time of the fifth clause. The collection start time of the first clause may be the collection start time of the audio data corresponding to the first clause, and the collection end time of the fifth clause may be the collection end time of the audio data corresponding to the fifth clause. In another possible implementation, the start of the text content time of the first text content may be the collection start time of whichever of the five clauses has the earliest collection start time; similarly, the end of the text content time may be the collection end time of whichever of the five clauses has the latest collection end time.
In one possible implementation, in order to determine the keywords and the corresponding text content more accurately, the text data may be divided by time interval. For example, the text data may be processed in 10-minute intervals as above, i.e. a keyword and the text content corresponding to that keyword are determined within each interval. Likewise, if the time interval between two occurrences of the first keyword exceeds 15 minutes, the clause corresponding to the later occurrence is not taken as part of the first text content. For example, suppose a participant mentions "performance" when summarizing this year's work and mentions "performance" again 15 minutes later when describing expectations. Since the interval between the two occurrences of the keyword "performance" exceeds 15 minutes, the smart interactive tablet closes the segment for "performance" at the point where the participant finished summarizing this year's work, and the clause mentioning "performance" in the description of expectations is not taken as part of the first text content.
In summary, according to the method for generating a conference summary of the embodiment of the present application, in the process of acquiring the text content of the conference summary from audio data, the audio data produced by each participant in the conference can be collected and subjected to segmentation or semantic analysis, so as to determine the discussion focus of the conference (i.e., the keywords) and the corresponding discussion details (i.e., the text content). The text content of the conference summary is formed from the discussion focus and the discussion details, which reduces the time users spend organizing the summary and makes the summary ordered and comprehensive, so that a reader can understand the conference clearly.
Having described the acquisition of the text content and the text content time using the audio data with reference to fig. 6 and 7, an embodiment of acquiring at least one image content and an image content time in the conference summary will be described below with reference to fig. 8, 9, and 11. In short, the image content in the meeting summary may be content displayed on the display screen of the smart interactive tablet. The image content may be a part of the content (i.e., a part of the interface) in the interface displayed on the display screen of the smart interactive tablet, and in the following description, an embodiment of determining the part of the content using the acquired image data of the conference participant may be described with reference to fig. 8, and an embodiment of determining the part of the content using the acquired manual input data may be described with reference to fig. 9. Furthermore, the image content may also be a complete interface displayed on a display screen, and in the following description, this embodiment may be described with reference to fig. 11.
Fig. 8 shows a flowchart of the steps for determining at least one image content in a conference summary using image data captured by a camera. In this embodiment, the smart interactive tablet may determine the image content of the conference summary "indirectly" from the image data; that is, the image data does not itself serve as the image content. Instead, a part of the displayed interface is determined using the image data, that part serves as the image content in the conference summary, and a time period determined from the shooting time is used as the image content time. As shown in fig. 8:
in step S401, image data captured during the conference and the shooting time of the image data are obtained, where the image data is obtained by photographing the participants with a camera. In an alternative implementation, the camera may photograph the participants at a preset interval, for example every three minutes. Optionally, the camera may store the captured image data together with the corresponding shooting time.
At step S402, at least one user gaze content gazed at by the participants during the conference and the corresponding gaze period are determined based on the image data.
In an implementation, the method may use the image data to determine the gazing areas at which the participants gaze. The smart interactive tablet can track the viewpoints of all participants in the image data and determine at least one gazing area gazed at by the participants; any one of these gazing areas can be obtained from the participants' viewpoints. Taking the first gazing area as an example, the smart interactive tablet may use the captured image data to take an area of preset size formed by the participants' viewpoints as the first gazing area, where a viewpoint is the point at which a participant's line of sight meets the display screen. For example, when the participants are all looking at an icon on the screen, a somewhat larger area containing the icon may be taken as the first gazing area. In addition, the method can also screen the viewpoints to remove viewpoints that obviously deviate from the common area.
In one possible implementation, the method may employ an eye tracking method to determine the viewpoint of each participant, where the viewpoint of a participant is the intersection of that participant's line of sight with the display screen of the smart interactive tablet. As shown in fig. 10, the line of sight of participant 1004 is AB, and the intersection of this line of sight with the display screen is point A, so the viewpoint corresponding to participant 1004 is A.
Specifically, the smart interactive tablet may determine the viewpoint of each participant as follows: a near-infrared light source (e.g., an infrared illuminator) and a camera are arranged near the display screen of the smart interactive tablet, for example at the upper or lower edge of the display screen; then, when a participant is near the display screen, the infrared light generated by the near-infrared light source illuminates the participant's eyes, producing a corneal reflection center point on the cornea, and the camera captures an image containing that corneal reflection center point. That is, the above-mentioned image data may be image data containing the corneal reflection center points of the participants. Further, the smart interactive tablet may use an image processing algorithm to identify the two center points (the pupil center point and the corneal reflection center point) in each image, determine the direction of eye movement (i.e., the direction in which the pupil moves), and calculate the participant's viewpoint from this direction in combination with the geometric features of other reflections.
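As an illustrative sketch only (Python), the gazing area could be derived from viewpoints computed as above roughly as follows; the region size and the outlier tolerance are assumed preset values, not values prescribed by this application:

from statistics import median

def gaze_region(viewpoints: list[tuple[float, float]],
                width: float = 400.0, height: float = 300.0,
                max_dev: float = 250.0) -> tuple[float, float, float, float]:
    # Discard viewpoints that obviously deviate from the common area.
    cx = median(x for x, _ in viewpoints)
    cy = median(y for _, y in viewpoints)
    kept = [(x, y) for x, y in viewpoints
            if abs(x - cx) <= max_dev and abs(y - cy) <= max_dev] or viewpoints
    cx = sum(x for x, _ in kept) / len(kept)
    cy = sum(y for _, y in kept) / len(kept)
    # Return a region of preset size centred on the remaining viewpoints,
    # as (left, top, right, bottom) in display coordinates.
    return (cx - width / 2, cy - height / 2, cx + width / 2, cy + height / 2)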
Subsequently, the method may crop the user gaze content from a screenshot image, based on the location of the gazing area, as image content of the conference summary. In an implementation, the smart interactive tablet may obtain a screenshot image of the currently displayed interface. As a possible implementation, the smart interactive tablet may acquire the screenshot image of the currently displayed interface at the same time as it acquires the image data, and store the image data together with the corresponding screenshot image. The method may then use the location (e.g., coordinate data) of the gazing area determined above to crop, from the screenshot image, the portion corresponding to that location as the image content.
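As an illustrative sketch only (Python, using the Pillow imaging library as one possible tool), cropping the user gaze content from the screenshot image might look as follows:

from PIL import Image

def crop_gaze_content(screenshot_path: str,
                      region: tuple[float, float, float, float]) -> Image.Image:
    img = Image.open(screenshot_path)
    left, top, right, bottom = region
    # Clamp the gazing area to the bounds of the screenshot before cropping.
    box = (max(0, int(left)), max(0, int(top)),
           min(img.width, int(right)), min(img.height, int(bottom)))
    return img.crop(box)  # the cropped part serves as the image content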
In implementations, the gaze period of each user gaze content may be determined as the image content time. The gaze period may be a period determined using the shooting times of the captured images of the participants. That is, the smart interactive tablet may determine a gaze period for each gazing region using the shooting time of the image data and take that gaze period as the image content time. For example, suppose the first image data, shot at 14:00, shows the participants gazing at the top of the smart interactive tablet; the second image, shot three minutes later, still shows the top as the gazing region; but the third image, shot another three minutes later, shows the bottom. The gaze period for the top region can then be determined as 14:00 to 14:06.
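As an illustrative sketch only (Python), merging consecutive shots with the same gazing region into a gaze period might look as follows; the three-minute shooting interval is taken from the example above and is an assumption, not a requirement:

from datetime import datetime, timedelta

def gaze_periods(captures: list[tuple[datetime, str]],
                 interval: timedelta = timedelta(minutes=3)):
    # captures: (shooting time, gazing region label), in chronological order.
    periods: list[tuple[str, datetime, datetime]] = []
    for shot_time, region in captures:
        if periods and periods[-1][0] == region:
            # Same region as the previous shot: extend the current period.
            periods[-1] = (region, periods[-1][1], shot_time + interval)
        else:
            periods.append((region, shot_time, shot_time + interval))
    return periods  # each entry: (region, period start, period end)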
In summary, in addition to determining at least one text content of the conference summary from the audio data, the method for generating a conference summary according to the embodiment of the present application may determine the gazing regions of the participants and the corresponding time information from the captured image data of the participants, so as to determine the image content of the conference summary and make the generated conference summary richer.
FIG. 9 shows a flowchart of the steps for determining image content in a meeting summary using manual input operations. In this embodiment, the smart interactive tablet may take manual input contents manually input by the conference participant as image contents, and determine a period of the manual input operation as an image content time. As shown in fig. 9:
in step S501, a manual input operation of at least one participant is received.
In an implementation, in the case where the smart interactive tablet displays a canvas interface, a participant may manually select to display various elements (e.g., markup, graphics, brush color, etc.) on the canvas interface and manually input on the smart interactive tablet with the various elements. For example, a meeting participant may draw on the smart interactive tablet after selecting a brush color. The intelligent interactive tablet can receive manual input operation of at least one participant. As an example, the smart interaction tablet may detect coordinates of a touch point of a user touch input.
In step S502, based on the manual input operation, each of the manual input contents and the manual input time displayed on the display screen is determined.
In an implementation, when a participant performs a series of manual input operations, the smart interactive tablet may group them as the same manual input content. For example, after a participant draws a histogram on the electronic whiteboard interface with a stylus, the smart interactive tablet may recognize the histogram. As a possible implementation, the smart interactive tablet may pre-store a number of common simple sketch styles and, after a manual input operation is performed, determine the sketch style corresponding to that operation. On this basis, the smart interactive tablet can take the entire histogram and the various data entered on it as the same manual input content. The smart interactive tablet then generates and displays the content (e.g., the histogram) corresponding to the touch input operation.
In an implementation, the smart interactive tablet may obtain a screenshot image of the currently displayed interface. Then, the position information (for example, coordinate data) of each touch point in the touch input operation is used to crop the corresponding portion of the screenshot image, which is stored as the image content.
In an implementation, the smart interactive tablet may determine the corresponding image content time from the manual input time of the manual input content. For example, if a participant starts drawing a house at 15:00 and finishes a few minutes later, the period from 15:00 to the time the drawing is finished is taken as the image content time of that manual input content.
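As an illustrative sketch only (Python), deriving the region and the image content time of one manual input content from its touch points might look as follows; the (timestamp, x, y) representation of a touch point is an assumption made for the sketch:

def manual_input_box_and_time(points: list[tuple[float, float, float]]):
    # points: (timestamp, x, y) for every touch point of one manual input.
    times = [t for t, _, _ in points]
    xs = [x for _, x, _ in points]
    ys = [y for _, _, y in points]
    box = (min(xs), min(ys), max(xs), max(ys))   # region to crop as image content
    period = (min(times), max(times))            # image content time
    return box, period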
Optionally, the method may match the manually input content with the user gaze content. Specifically, the smart interactive tablet may determine whether there is user gaze content corresponding to the manual input time. If corresponding user gaze content exists, that user gaze content may be corrected using the manually input content. This is because the user gaze content is limited to the gazing area, and that area may have a size preset by a user or a technician, so the content actually gazed at may not be determined accurately. For example, a participant draws a picture and speaks about it; the gazing area determined from the viewpoints may cover only part of the drawing.
For ease of understanding, fig. 10 gives a relevant illustration. As shown in fig. 10, the user gazing area of participants 1001, 1002, 1003 and 1004 is the area 1010, but the manual input area in the same time period is the area 1020, so the smart interactive tablet can modify the user gazing area from the area 1010 to the area 1020. That is, after determining the matching manually input content, the method may modify the user gaze content using the manually input content and take the modified user gaze content as the image content of the conference summary. In this case, the smart interactive tablet may take only the corrected user gaze content as the image content of the conference summary.
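As an illustrative sketch only (Python), the matching and correction described above might be expressed as follows, assuming every gaze content and manual input content carries a region box and a time period:

def overlaps(a: tuple[float, float], b: tuple[float, float]) -> bool:
    # True when the two time periods (start, end) intersect.
    return a[0] < b[1] and b[0] < a[1]

def correct_gaze_regions(gaze_items, manual_items):
    # Both lists hold (region box, (start, end)) pairs. When a manual input
    # falls inside a gaze period, its region replaces the gazing area.
    corrected = []
    for g_box, g_period in gaze_items:
        match = next((m_box for m_box, m_period in manual_items
                      if overlaps(g_period, m_period)), None)
        corrected.append((match if match is not None else g_box, g_period))
    return corrected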
In summary, the method for generating a conference summary according to the embodiment of the present application may use the manual input content obtained from the participants' manual input operations on the smart interactive tablet to correct the user gazing area determined above, so that the image content of the conference summary can be determined accurately.
Fig. 11 shows an embodiment of using a screen capture image as image content of a conference summary, and the image content time of each image content refers to the time at which a screen capture operation is performed, as shown in fig. 11:
in step S601, a user trigger signal for performing a screen capture operation is received.
In an implementation, the screen capture operation may refer to a snapshot operation performed on the content displayed on the screen, so as to form a picture with the same length, width and content as the screen. In one possible implementation, the smart interactive tablet performs the screen capture operation after receiving a user trigger signal; for example, the user trigger signal includes any one or more of a click operation signal, a slide operation signal, a press operation signal, and a long-press operation signal. In implementations, the user trigger signal may indicate that the user has triggered a screen capture control.
Furthermore, the user trigger signal may also indicate a trigger signal manually input by a user. That is, when the conference participant manually inputs to the smart interactive tablet through an input device (e.g., a mouse, a keyboard, or a stylus pen) or a finger, the smart interactive tablet receives a user trigger signal and performs a screen capture operation on the current display interface.
In another possible implementation, the screen capture operation may be an operation automatically performed by the smart interactive tablet. Optionally, the smart interactive tablet may perform the screen capture operation at preset time intervals, for example, the smart interactive tablet may perform the screen capture operation every five minutes.
In step S602, in response to the user trigger signal, a screen capture image and a corresponding screen capture time are acquired. In an implementation, the smart interaction tablet may determine a screen capture time of a screen capture image as the image content time.
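As an illustrative sketch only (Python), the triggered and periodic screen capture paths might look as follows; capture_screen is a hypothetical platform call, since the actual capture mechanism depends on the tablet's operating system:

import time
from datetime import datetime

def capture_screen() -> bytes:
    # Hypothetical platform call returning the current display as image bytes.
    raise NotImplementedError

def on_user_trigger(store: list) -> None:
    # Called when a click / slide / press / long-press hits the capture control.
    store.append((capture_screen(), datetime.now()))  # screenshot + capture time

def periodic_capture(store: list, interval_s: int = 300, rounds: int = 3) -> None:
    # Automatic variant: capture at a preset interval (here every five minutes).
    for _ in range(rounds):
        store.append((capture_screen(), datetime.now()))
        time.sleep(interval_s)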
In summary, in the case where text content of the conference summary has been generated from the audio data, the method for generating a conference summary according to the embodiment of the present application may further use screen capture images as image content of the conference summary, making the conference summary richer and more complete.
By combining the above embodiments, it can be seen that the conference summary generated by the conference summary generation method of the embodiment of the present application at least includes two parts of content, one part of content is text content, and the other part of content is image content. For ease of illustration, the conference summary generated by the present application will be described below in conjunction with FIG. 12.
As shown in fig. 12, the conference summary includes a plurality of text contents and a plurality of image contents, each of which corresponds to a respective text content time and image content time, for example, the first text content corresponds to a first time period, and the first image content corresponds to a first time period.
Then, the method may typeset the at least one piece of text content and the at least one image content in chronological order, based on the time periods to which they belong, to generate the conference summary.
In one possible implementation, the smart interactive tablet may predetermine the length of a time period; for example, a time period may be half an hour or 45 minutes. On this basis, the method takes the conference start time as the start of the first time period and divides the conference into successive time periods at the preset interval. If the time remaining at the end of the conference is shorter than the preset interval, it may still be treated as the last time period.
Once the time periods are determined, the smart interactive tablet can use each acquired text content and each acquired image content to determine the text content and image content belonging to each time period. As shown in fig. 12, the smart interactive tablet may determine the first text content and the first image content belonging to the first time period.
In another possible implementation manner, the smart interactive tablet may use time information of some type of conference summary content as a reference time to generate a reference conference summary, and then add other types of conference summary content to a corresponding location of the reference conference summary. For example, the intelligent interactive tablet may use the text content time as a reference time, and then typeset the text contents according to the sequence of the text content time to generate the first meeting summary. Then, the intelligent interactive tablet can supplement the acquired image contents to corresponding positions in the first conference summary according to the image content time. In particular, the smart interaction tablet may arrange respective image content in the vicinity of the text content corresponding thereto by image content time, for example, may be arranged above the text content or below the text content.
Taking the conference summary shown in fig. 12 as an example, the smart interactive tablet may generate a first conference summary using the text content time as a reference time, in which a first time period may indicate a text content time of the first text content, and a second time period may indicate a text content time corresponding to the second text content. Then, the smart interactive tablet may determine first image content corresponding to the image content time belonging to the first time period, and add the first image content to a lower side of the first text content. In addition, the smart interactive tablet can also determine second image content corresponding to the image content time belonging to a second time period, and add the second image content to the lower part of the second text content.
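As an illustrative sketch only (Python), the reference-summary layout described above might look as follows; image contents whose time does not fall inside any text content time are simply skipped in this sketch, although an implementation could append them elsewhere:

def build_summary(text_items, image_items):
    # text_items: (text, (start, end)); image_items: (image, (start, end)).
    # Text contents ordered by their text content time form the reference
    # summary; each image content is placed below the text content whose
    # time period contains the start of its image content time.
    summary = []
    for text, t_period in sorted(text_items, key=lambda item: item[1][0]):
        summary.append(("text", text, t_period))
        for image, i_period in image_items:
            if t_period[0] <= i_period[0] < t_period[1]:
                summary.append(("image", image, i_period))
    return summary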
In the generated conference summary, taking the first time period as an example, that time period corresponds to the first text content and the first image content. The first text content is text content acquired through the steps of fig. 6 or fig. 7 from audio data collected by the audio device. The first image content may include an entire screenshot image or a partial screenshot image. In an implementation, the first image content may be an entire screenshot image obtained through the steps described in fig. 11. Alternatively, the first image content may be obtained from the image data of the participants processed through the steps shown in fig. 8; that is, after the image data of the participants is acquired and the position of the first gazing area is determined, a part of the screenshot image is taken as the first image content. The first image content may also be a partial screenshot image obtained through the steps shown in fig. 9; that is, after the smart interactive tablet receives a manual input operation and the position of the manual input content is determined according to the steps shown in fig. 9, the corresponding part of the screenshot image is taken as the first image content.
In implementation, the smart interactive tablet may also use the image data, the touch data, and the audio data together to generate a conference summary, an embodiment of which will be described below in conjunction with fig. 13, as shown in fig. 13:
at step 700, the conference begins.
In step S701, image data is acquired; the step of acquiring image data is the same as step S401 and will not be described herein again.
At step S702, manual input data, which refers to manual input operation by the conference participants, is acquired, which has been described above with reference to fig. 9 and will not be described in detail here.
In step S703, audio data is acquired, and the step of acquiring audio data is the same as step S201 or step S301 above, and will not be described herein again.
In step S711, a user gaze area is determined using the acquired image data, which is the same as step S402 and will not be described herein.
In step S712, the manual input content is determined by using the manual input data, which is the same as step S502 and will not be described herein.
In step S713, the audio data is converted into text data, which is the same as step S302 and will not be described again.
In step S721, the manually input content is matched with the user gazing content, and at least one image content and an image content time are obtained, which are described above with reference to fig. 10 and will not be described again.
At step S723, at least one piece of text content and time information of each piece of text content are obtained, which is the same as step S303 and will not be described herein again.
In step S731, the data acquired in the above respective steps is stored together with time information. In implementation, the above data may be stored in a storage unit (e.g., a temporary storage area) within the smart interactive tablet, stored in an external memory, or may be stored in the cloud, which shall not be limited thereto.
In step S741, a trigger signal for generating a conference summary is received. In an implementation, a participant may generate the trigger signal by triggering a control displayed on the screen; the trigger signal has been described in detail above and will not be described again here. Further, step S741 may be performed at any point in time while the conference is in progress.
At step S751, at least one piece of text content related to the meeting and the text content time corresponding to each piece of text content, and at least one image content related to the meeting and the image content time corresponding to each image content, are acquired. How to acquire the text content and text content time of the conference summary has been described in detail above with reference to fig. 6 and 7, and how to acquire the image content and image content time has been described in detail above with reference to fig. 8, 9, and 11; neither will be described again here.
In step S761, the at least one piece of text content and the at least one piece of image content are typeset by using the text content time and the image content time, and a meeting summary of the meeting is generated. The detailed description of the composition operation has been made above with reference to fig. 12, and will not be described again.
In step S771, the conference summary is output. In practice, according to embodiments of the present application, the conference summary may be exported in various formats for review by users, such as HyperText Markup Language (HTML) format, Portable Document Format (PDF), or word-processing document (Word) format.
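As an illustrative sketch only (Python), exporting the assembled summary to HTML might look as follows; the other formats mentioned above would rely on corresponding libraries and are not shown:

import base64
import html

def export_html(summary, path: str = "meeting_summary.html") -> None:
    # summary: entries of ("text" | "image", payload, (start, end)); image
    # payloads are assumed to be PNG bytes here.
    parts = ["<html><body><h1>Meeting summary</h1>"]
    for kind, payload, (start, end) in summary:
        parts.append(f"<p><em>{start} to {end}</em></p>")
        if kind == "text":
            parts.append(f"<p>{html.escape(payload)}</p>")
        else:
            encoded = base64.b64encode(payload).decode()
            parts.append(f'<img src="data:image/png;base64,{encoded}"/>')
    parts.append("</body></html>")
    with open(path, "w", encoding="utf-8") as f:
        f.write("".join(parts))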
By combining the above embodiments, the embodiments of the present application provide a method for generating a conference summary, and the method can typeset text content and image content acquired in a conference according to time, so as to achieve the purpose of automatically generating the conference summary. This embodiment will be specifically described below with reference to fig. 14.
FIG. 14 shows a flowchart of the steps for generating a meeting summary according to an embodiment of the present application.
In step S801, at least one piece of text content related to the conference and a text content time corresponding to each piece of text content are acquired.
In step S802, at least one image content related to the conference and an image content time corresponding to each image content are acquired.
In step S803, a summary of the conference is generated based on the at least one piece of text content and the text content time, and the at least one image content and the image content time.
Optionally, generating a summary of the meeting based on the at least one piece of text content and the text content time, and the at least one piece of image content and the image content time, comprises:
and typesetting the at least one section of text content and the at least one image content according to the time sequence based on the text content time and the image content time to generate the conference summary.
Optionally, before obtaining at least one piece of text content related to the conference and a text content time corresponding to each piece of text content, the method further includes:
audio data associated with the conference is obtained.
Optionally, the audio data comprises at least one piece of audio data captured sequentially.
Optionally, the obtaining at least one piece of text content related to the conference and a text content time corresponding to each piece of text content includes:
and respectively converting each piece of audio data in the at least one piece of audio data into each piece of text content, wherein the text content time corresponding to each piece of text content comprises the acquisition starting time and the acquisition ending time of the audio data corresponding to each piece of text content.
Optionally, the acquiring at least one piece of text content related to the conference and a text content time corresponding to each piece of text content includes:
converting the audio data into text data;
extracting at least one keyword in the text data and each section of text content corresponding to each keyword, wherein the text content time corresponding to each section of text content comprises the acquisition starting time and the acquisition ending time of the audio data corresponding to each section of text content.
Optionally, the at least one image content includes a screen capture image obtained by performing a screen capture operation during the meeting, and the image content time corresponding to each image content includes a time for performing the screen capture operation.
Optionally, the at least one image content includes at least one manual input content displayed on a display screen through a manual input operation, wherein the manual input operation is an input operation performed by a participant during the conference, and the image content time corresponding to each image content is an input start time and an acquisition end time of the manual input content included in each image content.
Optionally, the at least one image content includes at least one user gaze content gazed at by a participant during the conference and determined from image data, where the image data includes captured behavior data of the participant during the conference or image data including corneal reflection points of the participant, and the image content time corresponding to each image content is a time period determined by the capture time.
It is to be understood that the above-mentioned terminal and the like include hardware structures and/or software modules corresponding to the respective functions for realizing the above-mentioned functions. Those of skill in the art will readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is performed in hardware or computer software drives hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.
In the embodiment of the present application, the terminal and the like may be divided into function modules according to the method example, for example, each function module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. It should be noted that, in the embodiment of the present application, the division of the module is schematic, and is only one logic function division, and there may be another division manner in actual implementation.
In the case of dividing each functional module by corresponding each function, fig. 15 shows a block diagram of an apparatus for generating a conference summary according to an embodiment of the present application.
The apparatus 1500 for generating a conference summary may comprise:
a text content acquiring unit 1510 configured to acquire at least one piece of text content related to the conference and a text content time corresponding to each piece of text content.
An image content acquiring unit 1520, configured to acquire at least one image content related to the meeting and an image content time corresponding to each image content.
A generating unit 1530, configured to generate a summary of the conference based on the at least one text content and the text content time, and the at least one image content and the image content time.
Optionally, the generating unit 1530 is specifically configured to, based on the text content time and the image content time, typeset the at least one text content and the at least one image content according to a time sequence, and generate the meeting summary.
Optionally, the apparatus 1500 for generating a conference summary further comprises:
an audio data acquisition unit for acquiring audio data related to the conference.
Optionally, the audio data comprises at least one piece of audio data collected sequentially.
The text content obtaining unit 1510 is specifically configured to convert each piece of audio data in the at least one piece of audio data into each piece of text content, where a text content time corresponding to each piece of text content includes a collection start time and a collection end time of the audio data corresponding to each piece of text content.
Alternatively, the text content acquiring unit 1510 includes:
the conversion module is used for converting the audio data into text data;
and the extraction module is used for extracting at least one keyword in the text data and each section of text content corresponding to each keyword, wherein the text content time corresponding to each section of text content comprises the acquisition starting time and the acquisition ending time of the audio data corresponding to each section of text content.
Optionally, the at least one image content includes a screen capture image obtained by performing a screen capture operation during the meeting, and the image content time corresponding to each image content includes a time at which the screen capture operation is performed.
Optionally, the at least one image content includes at least one manual input content displayed on a display screen through a manual input operation, wherein the manual input operation is an input operation performed by a participant during the conference, and the image content time corresponding to each image content is an input start time and an acquisition end time of the manual input content included in each image content.
Optionally, the at least one image content includes at least one user gaze content gazed at by a participant during the conference and determined from image data, where the image data includes captured behavior data of the participant during the conference or image data including corneal reflection points of the participant, and the image content time corresponding to each image content is a time period determined by the capture time.
An embodiment of the present application provides a device for generating a conference summary, including: a processor and a memory for storing processor-executable instructions; wherein the processor is configured to implement the above method when executing the instructions.
Embodiments of the present application provide a non-transitory computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
Embodiments of the present application provide a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when run in a processor of an electronic device, the processor in the electronic device performs the above method.
The computer-readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an erasable Programmable Read-Only Memory (EPROM or flash Memory), a Static Random Access Memory (SRAM), a portable Compact Disc Read-Only Memory (CD-ROM), a Digital Versatile Disc (DVD), a Memory stick, a floppy disk, a mechanical coding device, a punch card or an in-groove protrusion structure, for example, having instructions stored thereon, and any suitable combination of the foregoing.
The computer readable program instructions or code described herein may be downloaded from a computer readable storage medium to a respective computing/processing device, or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present application may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk or C++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the internet using an internet service provider). In some embodiments, electronic circuitry can execute computer-readable program instructions to implement aspects of this application by personalizing, with state information of the computer-readable program instructions, an electronic circuit such as a programmable logic circuit, a Field-Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA).
Various aspects of the present application are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
It is also noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by hardware (e.g., a Circuit or an ASIC) for performing the corresponding function or action, or by combinations of hardware and software, such as firmware.
While the invention has been described in connection with various embodiments, other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed invention, from a review of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
Having described embodiments of the present application, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or improvements to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (19)

1. A method of generating a conference summary, comprising:
acquiring at least one section of text content related to the conference and text content time corresponding to each section of text content;
acquiring at least one image content related to the conference and an image content time corresponding to each image content;
generating a summary of the meeting based on the at least one piece of text content and the text content time, and the at least one piece of image content and the image content time.
2. The method of claim 1, wherein generating the summary of the meeting based on the at least one piece of text content and the text content time, and the at least one piece of image content and the image content time comprises:
and typesetting the at least one section of text content and the at least one image content according to a time sequence based on the text content time and the image content time to generate the conference summary.
3. The method of claim 1 or 2, wherein prior to obtaining at least one piece of textual content relating to a meeting and a textual content time corresponding to each piece of textual content, the method further comprises:
audio data associated with the conference is obtained.
4. The method of claim 3, wherein the audio data comprises at least one piece of audio data captured sequentially;
the acquiring at least one text content related to the meeting and the text content time corresponding to each text content includes:
and converting each piece of audio data in the at least one piece of audio data into each piece of text content, wherein the text content time corresponding to each piece of text content comprises the acquisition starting time and the acquisition ending time of the audio data corresponding to each piece of text content.
5. The method of claim 3, wherein the obtaining at least one text content associated with the meeting and a text content time corresponding to each text content comprises:
converting the audio data into text data;
extracting at least one keyword in the text data and each section of text content corresponding to each keyword, wherein the text content time corresponding to each section of text content comprises the acquisition starting time and the acquisition ending time of the audio data corresponding to each section of text content.
6. The method of claim 1 or 2, wherein the at least one image content comprises a screen capture image obtained by performing a screen capture operation during the meeting, and the image content time corresponding to each image content comprises a time at which the screen capture operation was performed.
7. The method according to claim 1 or 2, wherein the at least one image content includes at least one manual input content displayed on a display screen by a manual input operation, wherein the manual input operation is an input operation performed by a participant during the conference, and the image content time corresponding to each image content is an input start time and an acquisition end time of the manual input content included in the each image content.
8. The method of claim 1 or 2, characterized in that the at least one image content comprises at least one user gaze content gazed at by a participant during the conference and determined from image data, wherein the image data comprises captured behavior data of the participant during the conference or image data comprising corneal reflection points of the participant, and the image content time corresponding to each image content is a time period determined by the capture time.
9. An apparatus for generating a conference summary, comprising:
the system comprises a text content acquisition unit, a text content acquisition unit and a text content processing unit, wherein the text content acquisition unit is used for acquiring at least one piece of text content related to a conference and text content time corresponding to each piece of text content;
an image content acquisition unit configured to acquire at least one image content related to the conference and an image content time corresponding to each image content;
a generating unit, configured to generate a summary of the conference based on the at least one piece of text content and the text content time, and the at least one piece of image content and the image content time.
10. The apparatus of claim 9, wherein the generating unit is specifically configured to generate the meeting summary by laying out the at least one piece of text content and the at least one piece of image content in chronological order based on the text content time and the image content time.
11. The apparatus of claim 9 or 10, further comprising:
an audio data acquisition unit for acquiring audio data related to the conference.
12. The device of claim 11, wherein the audio data comprises at least one segment of audio data collected sequentially;
the text content acquiring unit is specifically configured to convert each piece of audio data in the at least one piece of audio data into each piece of text content, where a text content time corresponding to each piece of text content includes an acquisition start time and an acquisition end time of the audio data corresponding to each piece of text content.
13. The apparatus according to claim 11, wherein the text content acquiring unit includes:
the conversion module is used for converting the audio data into text data;
and the extraction module is used for extracting at least one keyword in the text data and each section of text content corresponding to each keyword, wherein the text content time corresponding to each section of text content comprises the acquisition starting time and the acquisition ending time of the audio data corresponding to each section of text content.
14. The apparatus of claim 9 or 10, wherein the at least one image content comprises a screen shot image obtained by performing a screen shot operation during the meeting, and the image content time corresponding to each image content comprises a time at which the screen shot operation was performed.
15. The apparatus according to claim 9 or 10, wherein the at least one image content includes at least one manual input content displayed on a display screen by a manual input operation, wherein the manual input operation is an input operation performed by a participant during the conference, and the image content time corresponding to each image content is an input start time and an acquisition end time of the manual input content included in the each image content.
16. The apparatus of claim 9 or 10, wherein the at least one image content comprises at least one user gaze content gazed at by a participant during the meeting and determined from image data, wherein the image data comprises captured behavioral data of the participant during the meeting or image data comprising corneal reflection points of the participant, and wherein the image content time corresponding to each image content is a time period determined by the capture time.
17. An apparatus for generating a conference summary, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to implement the method of any one of claims 1-8 when executing the instructions.
18. A non-transitory computer readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method of any of claims 1-8.
19. A computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when run in an electronic device, a processor in the electronic device performs the method of any of claims 1-8.
CN202110436295.7A 2021-04-22 2021-04-22 Method and device for generating conference summary Pending CN115240681A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110436295.7A CN115240681A (en) 2021-04-22 2021-04-22 Method and device for generating conference summary

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110436295.7A CN115240681A (en) 2021-04-22 2021-04-22 Method and device for generating conference summary

Publications (1)

Publication Number Publication Date
CN115240681A true CN115240681A (en) 2022-10-25

Family

ID=83665868

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110436295.7A Pending CN115240681A (en) 2021-04-22 2021-04-22 Method and device for generating conference summary

Country Status (1)

Country Link
CN (1) CN115240681A (en)

Similar Documents

Publication Publication Date Title
US11849196B2 (en) Automatic data extraction and conversion of video/images/sound information from a slide presentation into an editable notetaking resource with optional overlay of the presenter
US20210056251A1 (en) Automatic Data Extraction and Conversion of Video/Images/Sound Information from a Board-Presented Lecture into an Editable Notetaking Resource
CN107430629B (en) Prioritized display of visual content in a computer presentation
US10438080B2 (en) Handwriting recognition method and apparatus
US20220254158A1 (en) Learning situation analysis method, electronic device, and storage medium
Waibel et al. SMaRT: The smart meeting room task at ISL
US11363078B2 (en) System and method for augmented reality video conferencing
JP6304941B2 (en) CONFERENCE INFORMATION RECORDING SYSTEM, INFORMATION PROCESSING DEVICE, CONTROL METHOD, AND COMPUTER PROGRAM
CN113170076A (en) Dynamic curation of sequence events for a communication session
US10755087B2 (en) Automated image capture based on emotion detection
US10388325B1 (en) Non-disruptive NUI command
CN116018789A (en) Method, system and medium for context-based assessment of student attention in online learning
JP2017016296A (en) Image display device
US20240114106A1 (en) Machine learning driven teleprompter
JP6855737B2 (en) Information processing equipment, evaluation systems and programs
CN110992958B (en) Content recording method, content recording apparatus, electronic device, and storage medium
CN111627039A (en) Interaction system and interaction method based on image recognition
US11902690B2 (en) Machine learning driven teleprompter
CN115240681A (en) Method and device for generating conference summary
JP2019105751A (en) Display control apparatus, program, display system, display control method and display data
KR102457953B1 (en) Method for interactive picture service
US11928379B2 (en) Information orientation and display in extended reality environments
Bowald AR Comic Chat
CN113918114B (en) Document control method, device, computer equipment and storage medium
WO2022091970A1 (en) Online meeting support system and online meeting support program

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination