CN111833861A - Artificial intelligence based event evaluation report generation - Google Patents


Info

Publication number
CN111833861A
Authority
CN
China
Prior art keywords
participant
event
sequence
text
report
Prior art date
Legal status
Pending
Application number
CN201910317933.6A
Other languages
Chinese (zh)
Inventor
李烨
郑师宜
陈阳
Current Assignee
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Technology Licensing LLC
Priority date
Filing date
Publication date
Application filed by Microsoft Technology Licensing LLC
Priority to CN201910317933.6A
Priority to PCT/US2020/023460 (published as WO2020214316A1)
Publication of CN111833861A
Current legal status: Pending


Classifications

    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • H04N 21/4223: Cameras
    • A61B 5/0077: Devices for viewing the surface of the body, e.g. camera, magnifying lens
    • A61B 5/1176: Recognition of faces
    • A61B 5/16: Devices for psychotechnics; Testing reaction times; Devices for evaluating the psychological state
    • A61B 5/4803: Speech analysis specially adapted for diagnostic purposes
    • G06F 40/30: Semantic analysis
    • G06N 20/00: Machine learning
    • G06Q 10/06398: Performance of employee with respect to a job function
    • G06V 20/47: Detecting features for summarising video content
    • G06V 20/52: Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V 40/103: Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G06V 40/174: Facial expression recognition
    • G06V 40/20: Movements or behaviour, e.g. gesture recognition
    • G06V 40/23: Recognition of whole body movements, e.g. for sport training
    • G09B 5/06: Electrically-operated educational appliances with both visual and audible presentation of the material to be studied
    • H04N 21/233: Processing of audio elementary streams
    • H04N 21/23418: Analysing video streams, e.g. detecting features or characteristics
    • H04N 21/42203: Sound input device, e.g. microphone
    • H04N 21/47: End-user applications
    • G06V 20/44: Event detection


Abstract

The present disclosure provides a method for generating an evaluation report for an event. A multimedia stream corresponding to the event may be received, wherein the multimedia stream comprises a video stream related to a first participant of one or more participants of the event. A sequence of facial images and a sequence of body images of the first participant may be detected from the video stream. A sequence of emotions may be identified from the sequence of facial images, and a sequence of actions may be identified from the sequence of body images. The performance of the first participant in the event may be evaluated, by at least one participant evaluation model associated with the category of the event, according to at least one of the emotion sequence and the action sequence. A report related to the first participant may be generated based at least on the performance.

Description

Artificial intelligence based event evaluation report generation
Background
In various events involving one or more participants, it is often difficult to generate accurate reports relating to those participants within a short period of time, including, for example, an evaluation report on the participants' performance in each event and/or on the status of the event itself. For example, in a teaching event, assessing the performance of teachers and/or students in every class is often time consuming and insufficiently accurate. Although the teaching outcomes of teachers and students may be reflected in a relatively time-saving manner through periodic tests rather than per-class assessment, adjusting the teacher's teaching style or intervening in a student's classroom performance only after a periodic test may be too late to effectively improve the performance of teachers and/or students in a teaching event. For education providers, it is desirable to obtain immediate knowledge of the teacher's and/or student's performance, so that their subsequent performance can be effectively adjusted to improve teaching quality.
Disclosure of Invention
This summary is provided to introduce a selection of concepts that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Embodiments of the present disclosure propose a method for generating an evaluation report for an event. In the method, a multimedia stream corresponding to the event may be received, wherein the multimedia stream comprises a video stream related to a first participant of one or more participants of the event. A sequence of facial images and a sequence of body images of the first participant may be detected from the video stream. A sequence of emotions may be identified from the sequence of facial images, and a sequence of actions may be identified from the sequence of body images. The performance of the first participant in the event may be evaluated, by at least one participant evaluation model associated with the category of the event, according to at least one of the emotion sequence and the action sequence. A report related to the first participant may be generated based at least on the performance.
It should be noted that one or more of the above aspects include the features described in detail below and particularly pointed out in the claims. The following description and the annexed drawings set forth in detail certain illustrative features of the one or more aspects. These features are indicative of but a few of the various ways in which the principles of various aspects may be employed and the present disclosure is intended to include all such aspects and their equivalents.
Drawings
The disclosed aspects will hereinafter be described in conjunction with the appended drawings, which are provided to illustrate, but not to limit, the disclosed aspects.
FIG. 1 illustrates an architecture of an exemplary assessment report generation system, according to an embodiment.
Fig. 2 illustrates an exemplary process of generating a report based on a video stream corresponding to an event, according to an embodiment.
Fig. 3 illustrates a process of generating an additional report based on an audio stream corresponding to an event, according to an embodiment.
FIG. 4 illustrates an example reporting interface generated for performance of a participant in an example instructional event, according to an embodiment.
FIG. 5 illustrates an example reporting interface generated for performance of another participant in an example instructional event, according to an embodiment.
FIG. 6 is an example report interface generated for an example instructional event, according to an embodiment.
Fig. 7 illustrates a flow diagram of an exemplary method for generating an evaluation report for an event, according to an embodiment.
Fig. 8 illustrates an example apparatus for generating an evaluation report for an event, according to an embodiment.
Fig. 9 illustrates another exemplary apparatus for generating an evaluation report for an event, according to an embodiment.
Detailed Description
The present disclosure will now be discussed with reference to various exemplary embodiments. It is to be understood that the discussion of these embodiments is merely intended to enable those skilled in the art to better understand and thereby practice the embodiments of the present disclosure, and is not intended to suggest any limitation on the scope of the present disclosure.
In various categories of events, it is often difficult for people to accurately and timely know the performance of each participant in the event and/or the general status of the event. The various categories of events may involve communication between at least two participants and include, for example, teaching, debate, lecture, meeting, and the like. For example, in the teaching category, a class may be considered an event and the corresponding teachers and students may be considered participants of the event. As another example, in the debate category, a debate may be considered an event, the team members attending the debate may be considered participants of the event, and so on. In teaching, evaluating the performance of a teacher and/or a student separately for each class is certainly time consuming. However, if the performance of the teacher and/or student is instead evaluated over multiple classes spanning a period of time (e.g., a week, a month, a school term), the teacher's way of teaching and/or the student's learning condition may not be adjusted in time based on that performance.
In order to learn the performance of the various participants in an event and/or the condition of the event in a timely and efficient manner, embodiments of the present disclosure propose a method and a system for generating an evaluation report for the event. In particular, the method and system may be implemented based on Artificial Intelligence (AI) techniques. The method may receive and analyze a multimedia stream corresponding to the event, identify an emotion sequence and an action sequence of a participant based on a video stream in the multimedia stream, and/or identify text corresponding to the event based on an audio stream in the multimedia stream; evaluate the performance of the participant in the event and/or evaluate the event according to at least one of the emotion sequence, the action sequence, and the text of the audio stream; and automatically generate an evaluation report based on the result of the evaluation. The generated report may be presented to the participant or to a third party, so that the participant and/or the third party can intuitively and timely understand whether the participant performed satisfactorily in the event and how to improve in subsequent events. Further, multiple performances of the same participant may be obtained across multiple events of the same category, and a composite assessment report may be generated from a combination of those performances. For example, in teaching, for multiple lessons taught by the same teacher during a school term, an assessment report for the teacher over the term may be generated based on the teacher's classroom performance in each of the lessons. As another example, for multiple lectures given by the same speaker, whether on the same subject or on different subjects, a comprehensive or staged evaluation report for the speaker may be generated according to the speaker's performance in each lecture.
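For illustration only, the overall flow described above can be sketched in a few lines of Python. This is a minimal sketch under assumed interfaces: the model callables (face_model, body_model, speech_model, participant_model, event_model), the EvaluationReport structure, and their signatures are hypothetical and are not part of the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class EvaluationReport:
    participant_id: str
    performance: dict                 # e.g. {"mental_state": "energetic", "behavior": "positive"}
    event_result: Optional[dict] = None

def generate_report(video_stream, audio_stream, event_category,
                    face_model: Callable, body_model: Callable,
                    speech_model: Callable, participant_model: Callable,
                    event_model: Callable) -> EvaluationReport:
    """Hypothetical orchestration of the report-generation flow."""
    # 1. Derive the emotion and action sequences from the video stream.
    emotions = face_model(video_stream)      # -> ["happy", "neutral", ...]
    actions = body_model(video_stream)       # -> ["raise_hand", "yawn", ...]

    # 2. Evaluate the participant with a category-specific evaluation model.
    performance = participant_model(emotions, actions, event_category)

    # 3. Optionally transcribe the audio stream and evaluate the event itself.
    event_result = None
    if audio_stream is not None:
        text = speech_model(audio_stream)
        event_result = event_model(text, performance, event_category)

    # 4. Assemble the report.
    return EvaluationReport("participant-1", performance, event_result)
```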
It is noted that the evaluation models used to perform evaluation in different categories of events may differ, since the characteristics of different categories of events differ. For example, in teaching, the characteristics of interest may include the classroom mental state of the teacher or student, the classroom behavior representation of the teacher or student, the classroom emotion changes of the teacher or student, the interaction between teacher and student, the degree of matching between the teacher's teaching points and the courseware, and so on. In other categories, such as debate, the characteristics of interest may include the speed of response of a debater, emotional changes, the relevance of the debater's speech content to the debate topic, the debater's speech rate, the degree of collaboration between members of the same team, the debate style, and so on. In a lecture, the characteristics of interest may include, for example, the richness of the lecturer's body language, emotional changes, pitch changes, the degree of matching between the lecture content and the lecture draft, the pause time of the lecturer, and the mental state and emotional changes of the audience.
Fig. 1 illustrates the architecture of an exemplary assessment report generation system 100, according to an embodiment. In Fig. 1, a signal acquisition device 120, a terminal device 130, a set of base models 140, and a set of event-based models 150 are interconnected by a network 110. The signal acquisition device 120 may include various acquisition devices capable of capturing a video stream signal 122 and an audio stream signal 124 from participants of one or more events, such as the exemplary participants 102(A) and 102(B), including, but not limited to, a video camera, a voice recorder, a cell phone, a computer, any other electronic device with a camera and/or microphone, and so forth. In one example, the captured video stream signal 122 and audio stream signal 124 may be communicated to the set of base models 140 via the network 110, either wirelessly or by wire.
In some embodiments, the set of base models 140 may include a face recognition model 141, a body recognition model 142, a speech recognition model 143, a knowledge graph 144, a data analysis/mining model 145, and a natural language processing model 146.
In some examples, the face recognition model 141 may receive the video stream 122, detect facial images from the video stream 122, and identify emotions in those facial images, e.g., in at least one image including a face of participant 102(A) and/or participant 102(B), to output an emotion sequence for each participant. In some examples, the body recognition model 142 may receive the video stream 122, detect body images from the video stream 122, and identify actions in those body images, e.g., in at least one image including a body part of participant 102(A) and/or participant 102(B), to output an action sequence for each participant. In some examples, the speech recognition model 143 may perform speech recognition on the audio stream 124 to generate text of the audio stream 124. In some implementations, the speech recognition model 143 may also obtain speech rate statistics from the audio stream 124 and/or its text. In some examples, the knowledge graph 144 may be any general knowledge graph or domain-specific knowledge graph. In some examples, the data analysis/mining model 145 may perform data analysis/mining on the emotions derived by the face recognition model 141 and the actions derived by the body recognition model 142 based on the content of the knowledge graph 144.
In some examples, the natural language processing model 146 performs natural language processing on the text of the audio stream 124 generated by the speech recognition model 143, including, but not limited to, semantic analysis, syntactic analysis, entity extraction, and so forth. In some implementations, the natural language processing model 146 may include multiple models (not shown), such as a key content extraction model, a text sentiment analysis model, an inappropriate word detection model, and so on. In some examples, the key content extraction model may be used to extract key content from the text of the audio stream 124 based on a preset content list, e.g., key content matching the preset content list. For example, in the teaching example, the preset content list may be courseware prepared in advance by a teacher; in the lecture example, the preset content list may be the lecturer's lecture draft; and so on. In addition, the key content extraction model may also determine the distribution of the extracted key content within the text of the audio stream, such as its time of occurrence, location and frequency, and the degree of matching or coverage of the extracted key content with respect to the preset content list. The text sentiment analysis model may perform sentiment analysis on the text of the audio stream 124 using any known text sentiment analysis technique to obtain a text sentiment. The inappropriate word detection model may detect inappropriate words from the text based on a preset blacklist related to the event category, for subsequent event evaluation or for presentation to participants or third parties. The content of the blacklist may be set in advance by the system according to the event category. For example, in teaching, the preset blacklist may include words such as "stupid", "maverick", "junk", and the like.
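As an illustration of the key content extraction and inappropriate word detection described above, the following is a minimal sketch assuming a simple substring match over timestamped transcript segments; the data layout and the matching strategy are assumptions, and a deployed key content extraction model would typically use more robust text matching.

```python
from collections import defaultdict

def extract_key_content(text_segments, preset_content):
    """Match preset key content (e.g. courseware topics) against timed text segments.

    text_segments: list of (timestamp_seconds, text) tuples from speech recognition.
    preset_content: list of key phrases prepared in advance (courseware / lecture draft).
    Returns per-phrase occurrence times and an overall coverage ratio.
    """
    occurrences = defaultdict(list)
    for ts, segment in text_segments:
        for phrase in preset_content:
            if phrase.lower() in segment.lower():
                occurrences[phrase].append(ts)
    covered = sum(1 for phrase in preset_content if occurrences[phrase])
    coverage = covered / len(preset_content) if preset_content else 0.0
    return dict(occurrences), coverage

def detect_inappropriate_words(text_segments, blacklist):
    """Flag blacklisted words; the blacklist is assumed to be set per event category."""
    hits = []
    for ts, segment in text_segments:
        for word in blacklist:
            if word.lower() in segment.lower():
                hits.append((ts, word))
    return hits
```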
The data processed by the set of base models 140 is transmitted to the set of event-based models 150 to evaluate the participant's performance and/or the event via at least one evaluation model associated with the category of the event. In some implementations, the evaluation may be performed in a classification, labeling, or scoring manner. For example, the performance of a participant may be labeled as "positive", "negative", "energetic", "listless", "listening attentively", etc.; the performance of a participant may be scored, e.g., on a ten-point or percentage scale; an event may be classified as "good", "bad", and so on. In other implementations, the evaluation may be performed in a hierarchical manner, such as classifying a participant's performance or an event as "low", "medium", "high", or as "primary", "secondary", "tertiary", and so forth. The evaluation of the present disclosure may be performed in any suitable manner and is not limited to the above.
In some examples, the set of event-based models 150 includes a participant performance evaluation model 151 and an event evaluation model 152. In some examples, the participant performance evaluation model 151 may be associated with an event category and may evaluate a participant's performance in the event based on the received emotion sequence and/or action sequence of the participant. In some implementations, the participant performance evaluation model 151 may include, but is not limited to, at least one of: a behavior representation evaluation model, a mental state evaluation model, an emotion change evaluation model, and a participant interaction evaluation model. In some examples, the event evaluation model 152 may be associated with an event category and may evaluate the event according to the text of the received audio stream. By way of example, the event evaluation model 152 may perform event evaluation based on output from the speech recognition model 143 and/or the natural language processing model 146, such as the speech rate of the audio stream, key content extracted from the text, the text sentiment, inappropriate words detected in the text, and the degree of matching or coverage of the key content with respect to the preset content list.
The above-described evaluation models may be pre-trained, machine learning based evaluation models. During training, the behavior representation evaluation model may take the action sequence and the event category as inputs and generate a behavior representation as output; the mental state evaluation model may take the emotion sequence and the action sequence as inputs and generate a mental state as output; the emotion change evaluation model may take the emotion sequence as input and generate an emotion change as output; the participant interaction evaluation model may take the action sequences of multiple participants as inputs and generate an interaction condition as output; and the event evaluation model may take at least one of speech rate, key content matching/coverage, text sentiment, inappropriate words, participant performance, and the relevance between a participant's emotion change and the text sentiment as inputs, and generate an event evaluation result as output.
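The input/output relationships listed above can be summarized as the following illustrative interfaces; the type names and Protocol definitions are assumptions introduced here for clarity and do not appear in the disclosure.

```python
from typing import List, Protocol

Emotion = str      # e.g. "happy", "neutral"
Action = str       # e.g. "raise_hand", "yawn"

class BehaviorRepresentationModel(Protocol):
    def __call__(self, actions: List[Action], event_category: str) -> str:
        """Return a behavior representation, e.g. "positive"."""
        ...

class MentalStateModel(Protocol):
    def __call__(self, emotions: List[Emotion], actions: List[Action]) -> str:
        """Return a mental-state label, e.g. "energetic"."""
        ...

class EmotionChangeModel(Protocol):
    def __call__(self, emotions: List[Emotion]) -> str:
        """Return an emotion-change label, e.g. "stable" or "unstable"."""
        ...

class InteractionModel(Protocol):
    def __call__(self, action_sequences: List[List[Action]]) -> str:
        """Return an interaction condition, e.g. "more interaction"."""
        ...

class EventEvaluationModel(Protocol):
    def __call__(self, speech_rate: float, coverage: float, text_sentiment: str,
                 inappropriate_words: List[str], performance: dict) -> str:
        """Return an event evaluation result, e.g. "good"."""
        ...
```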
Although the set of base models 140 and the set of event-based models 150 are shown separately herein, they may be combined into the same model set or device, e.g., they may be included in a server, a processor, a cloud device, etc. Further, each of the above models may be trained individually through machine learning.
The evaluation results generated by the event-based model set 150 may be provided to the terminal device 130 via wired or wireless means for display via the display component 132. In some embodiments, the assessment results may be presented in the form of a report, e.g., the assessment results are included in the report. In these embodiments, the terminal device 130 may provide the received report to a database (not shown) via the network 110 for storage and/or for reporting statistics.
Furthermore, although the signal acquisition device 120 and the terminal device 130 are shown as separate devices in Fig. 1, the signal acquisition device 120 may be integrated into the terminal device 130. For example, the terminal device 130 may be a mobile phone, a computer, a tablet computer, etc., and the signal acquisition device 120 may be a component of such devices. By way of example and not limitation, the signal acquisition device 120 may be a microphone, a camera, or the like, as described above.
It should be understood that all of the components or models shown in Fig. 1 are exemplary. The word "exemplary" is used herein to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" is intended to present concepts in a concrete fashion. The term "or" as used in this application is intended to mean an inclusive "or" rather than an exclusive "or". That is, unless specified otherwise or clear from context, "X employs A or B" is intended to mean any of the natural inclusive permutations; that is, if X employs A, X employs B, or X employs both A and B, then "X employs A or B" is satisfied under any of the foregoing instances. In addition, the articles "a" and "an" as used in this application and the appended claims should generally be construed to mean "one or more" unless specified otherwise or clear from context to be directed to a singular form.
Fig. 2 illustrates an exemplary process 200 for generating a report based on a video stream corresponding to an event, according to an embodiment.
At block 202, a video stream in a multimedia stream corresponding to an event may be received, wherein the video stream relates to at least one participant of the event. In some examples, this operation may be performed by, for example, signal acquisition device 120 of fig. 1.
At block 204, data pre-processing may be performed on the video stream. In some examples, the data pre-processing includes, for example, segmenting the entire video stream into a plurality of segments, where each segment may be, for example, 5-10 seconds in duration. In some examples, the video stream may be segmented using a known continuous segmentation technique, in which each subsequent segment immediately follows the preceding segment. In other examples, the video stream may be segmented using an overlapping segmentation technique, in which each subsequent segment partially overlaps the preceding segment. For example, assuming that the duration of the video stream is 30 seconds and the duration of each segment after segmentation is 5 seconds, the segments obtained by the continuous segmentation method may be: the first segment is the 1st-5th second of the video stream, the second segment is the 6th-10th second, the third segment is the 11th-15th second, and so on. In contrast, the segments obtained by the overlapping segmentation method may be: the first segment is the 1st-5th second of the video stream, the second segment is the 3rd-7th second, the third segment is the 5th-9th second, and so on. Preferably, the video stream may be segmented in the overlapping manner, so that the emotions in the facial images and the actions in the body images can be recognized more accurately from the facial image sequence and the body image sequence obtained from the overlapping segments.
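A minimal sketch of the two segmentation strategies discussed above is shown below, reproducing the 30-second example; the function name and the choice of a 2-second step for the overlapping case are assumptions.

```python
from typing import List, Optional, Tuple

def segment_windows(duration_s: float, segment_s: float = 5.0,
                    step_s: Optional[float] = None) -> List[Tuple[float, float]]:
    """Return (start, end) windows over a stream of the given duration.

    step_s == segment_s gives continuous (back-to-back) segmentation;
    step_s < segment_s gives overlapping segmentation.
    """
    step = segment_s if step_s is None else step_s
    windows, start = [], 0.0
    while start + segment_s <= duration_s:
        windows.append((start, start + segment_s))
        start += step
    return windows

# Continuous: (0.0, 5.0), (5.0, 10.0), (10.0, 15.0), ...
print(segment_windows(30, 5))
# Overlapping with a 2-second step: (0.0, 5.0), (2.0, 7.0), (4.0, 9.0), ...
print(segment_windows(30, 5, step_s=2))
```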
At block 206, a sequence of facial images may be captured from the pre-processed video stream. For example, a face image is sampled from each segment at regular intervals (e.g., one second) by screen capture.
At block 208, the emotion of the face is identified from each captured facial image. In some examples, an emotion tag may be attached to each facial image, e.g., for a certain facial image, a tag [happy] or [face A, happy].
At block 210, an emotion sequence may be generated for the participant's face based on the identified emotions or the tags attached to the facial images.
In some examples, the operations in the example blocks 206, 208, 210 may be performed by, for example, the facial recognition model 141 of fig. 1.
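As a sketch of blocks 206-210, the following illustrates sampling one facial image per second from each pre-processed segment and building a per-participant emotion sequence; the segment interface (duration, frame_at) and the face_detector and emotion_classifier callables are hypothetical stand-ins for the face recognition model 141.

```python
def build_emotion_sequence(segments, face_detector, emotion_classifier,
                           sample_interval_s: float = 1.0):
    """Build per-participant emotion sequences from pre-processed video segments.

    segments: iterable of video segments, each supporting frame_at(t_seconds)
              and exposing a duration attribute (assumed).
    face_detector(frame)     -> list of (participant_id, face_image) pairs (assumed).
    emotion_classifier(face) -> emotion label such as "happy" (assumed).
    Returns {participant_id: ["happy", "neutral", ...]}.
    """
    sequences = {}
    for segment in segments:
        t = 0.0
        while t < segment.duration:
            frame = segment.frame_at(t)
            for participant_id, face in face_detector(frame):
                label = emotion_classifier(face)      # e.g. [face A, happy]
                sequences.setdefault(participant_id, []).append(label)
            t += sample_interval_s
    return sequences
```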
At block 212, a sequence of body images may be captured from the pre-processed video stream. For example, a body image is sampled from each segment at regular intervals by means of screen capture.
At block 214, the action being performed by the participant to whom the body belongs is identified from the captured body image or images. For example, from one or more captured body images, it may be determined that the participant is raising a hand, yawning, sleeping, and so forth.
Optionally, at block 216, one of the following labels may be attached to each identified action: positive, negative, and neutral. In one implementation, how these labels are attached to actions may depend on the category of the event. For example, in a lecture, applause from the audience may be marked as positive; in a teaching classroom, a student's clapping may be marked as negative, since a student clapping while the teacher is lecturing may disrupt classroom discipline.
Optionally, at block 218, one of the following labels may be attached to each identified action: malicious and non-malicious, wherein actions may be labeled as malicious or non-malicious by matching against preset malicious actions or by using a pre-trained machine learning model. In one implementation, only malicious actions may be marked, while unmarked actions default to non-malicious. How these labels are attached to actions may depend on the category of the event. For example, in teaching, a teacher hitting a student with a book may be considered or marked as a malicious action, while in a meeting, a participant tapping another participant with the meeting notes may be considered or marked as a non-malicious action. In some examples, the malicious actions may be listed separately in subsequently generated reports to alert viewers of the reports, e.g., the report generated at block 226 may indicate whether there are actions marked as malicious, or the actions marked as malicious may be presented in the report.
At block 220, an action sequence or a tagged action sequence may be generated for the participant based on the actions identified at block 214 and/or the labels attached at blocks 216 and 218. For example, a participant's action sequence in an event may be {raising hand, yawning, looking around, chatting with others, ...}. In another example, a participant's tagged action sequence in an event may be {clapping (positive, non-malicious), hitting another person's head (negative, malicious), sitting still (neutral, non-malicious), ...}.
In some examples, the operations in the example blocks 212, 214, 220 and optional blocks 216, 218 may be performed by, for example, the body recognition model 142 of fig. 1.
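The labeling in blocks 216-218 and the tagged action sequence of block 220 might be sketched as follows; the lookup tables and action names are illustrative assumptions, and in practice the labels could instead come from a pre-trained machine learning model as noted above.

```python
# Category-specific label tables (illustrative values only).
POLARITY = {
    "teaching": {"raise_hand": "positive", "yawn": "negative", "clap": "negative"},
    "lecture":  {"clap": "positive", "nod": "positive"},
}
MALICIOUS = {
    "teaching": {"hit_with_book", "hit_head"},
    "meeting":  set(),
}

def tag_actions(actions, event_category):
    """Attach (polarity, maliciousness) labels to each recognized action."""
    polarity_table = POLARITY.get(event_category, {})
    malicious_set = MALICIOUS.get(event_category, set())
    tagged = []
    for action in actions:
        polarity = polarity_table.get(action, "neutral")
        malicious = "malicious" if action in malicious_set else "non-malicious"
        tagged.append((action, polarity, malicious))
    return tagged

# tag_actions(["clap", "hit_head", "sit_still"], "teaching")
# -> [("clap", "negative", "non-malicious"),
#     ("hit_head", "neutral", "malicious"),
#     ("sit_still", "neutral", "non-malicious")]
```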
At block 222, the performance of the participant may be evaluated based on the sequence of emotions generated at block 210, the sequence of actions generated at block 220, or the sequence of tagged actions. For example, the performance of the participant in the event may be evaluated by at least one participant evaluation model associated with the category of the event. In some examples, the performance of the participant may include at least one of: a mental state of the participant generated from the emotion sequence and the action sequence, a behavioral representation of the participant generated from the action sequence and the event category, a change in emotion of the participant generated from the emotion sequence, and interactions of the participant with other participants derived from the action sequence.
In some examples, the operations in the example block 222 may be performed by, for example, the participant performance evaluation model 151 of fig. 1. Specifically, each performance of the participant may be evaluated by respective models included in the participant performance evaluation model 151, such as a behavior representation evaluation model, a mental state evaluation model, an emotional change evaluation model, and a participant interaction evaluation model.
In some examples, a behavior representation evaluation model may be used to evaluate the behavior representation of the participant, e.g., to perform the evaluation according to the participant's action sequence. For example, when the event is a lesson in teaching, the participant is a teacher, and the input is the teacher's thumbs-up gesture, the output of the behavior representation evaluation model may be "positive" as a performance of the teacher in the lesson, e.g., as the behavior representation within that performance. Further, depending on the category of the event, the behavior representation evaluation model may also include a plurality of sub-models, such as a teaching gesture evaluation sub-model, a body language evaluation sub-model, a malicious behavior evaluation sub-model, and so on. In this example, when the input is the teacher's thumbs-up gesture, the input may be received by the teaching gesture evaluation sub-model, which may output "positive teaching gesture" as a refined performance under the teacher's performance or behavior representation.
Furthermore, even when the inputs are similar actions, the outputs produced by evaluation models associated with different event categories may differ. For example, in teaching, for a hand-raising action of a student (i.e., a participant) in a class, a participant performance evaluation model associated with teaching may determine that the hand-raising action corresponds to the student's behavior of "answering a question", so that the student's behavior in the class may be analyzed as, for example, a "positive behavior" representation. In a conference, for a participant's hand-raising action, the participant performance evaluation model associated with the conference may determine that the hand-raising action corresponds to the participant's behavior of "voting on a proposal", so that the participant's behavior in the conference may be analyzed as, for example, a "neutral behavior" representation.
In some examples, a mental state evaluation model may be used to evaluate the mental state of the participant, e.g., based on the received emotion sequence and/or action sequence. For example, when the received emotion sequence for a student is {disgust, neutral, contempt, ...} and the action sequence is {yawning, looking around, chatting with others, ...}, the mental state evaluation model associated with teaching may evaluate the student's mental state as "distracted" as a performance of the student in the class. As another example, when the emotion sequence received for a speaker is {happy, angry, surprised, ...} and the action sequence is {waving, walking, nodding, shaking the head, ...}, the mental state evaluation model associated with a lecture may evaluate the speaker's mental state as "energetic" as a performance of the speaker in the lecture.
In some examples, the emotion change evaluation model may evaluate a participant's performance, i.e., the emotion change, based on the participant's emotion sequence. For example, the output of the emotion change evaluation model may be "stable emotion change" or "unstable emotion change", depending on the types of emotions and/or the magnitude of the intensity changes involved in the participant's emotion sequence.
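As a toy stand-in for the emotion change evaluation model, the following heuristic labels an emotion sequence as stable or unstable from the fraction of adjacent samples that switch emotion; the threshold is an assumption, and a trained model would also consider emotion types and intensity magnitudes as described above.

```python
def emotion_change_label(emotions, max_transition_ratio: float = 0.5) -> str:
    """Label an emotion sequence "stable" when few adjacent samples change emotion."""
    if len(emotions) < 2:
        return "stable"
    transitions = sum(1 for a, b in zip(emotions, emotions[1:]) if a != b)
    ratio = transitions / (len(emotions) - 1)
    return "stable" if ratio < max_transition_ratio else "unstable"

# emotion_change_label(["happy", "happy", "happy", "neutral"]) -> "stable"
```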
In some examples, the participant interaction evaluation model may evaluate the interaction between two participants based on the correspondence between their actions. For example, in the teaching category, if the teacher responds to a student within a predetermined time after the student raises a hand, e.g., by pointing a finger or hand toward the student, it may be considered that there is a correspondence between the hand-raising action and the response action, and thus that there is an interaction between the teacher and the student. Conversely, if the teacher has not responded to the student within the predetermined time after the student raises a hand, or the teacher's current action is merely walking, it may be considered that there is no correspondence between the hand-raising action and the teacher's current action, and thus that there is no interaction between the teacher and the student. In addition, the participant interaction evaluation model may output an evaluation result of "more interaction" or "less interaction" by comparing the number of times the participant interacted in the event with a threshold number of times.
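A minimal sketch of the teacher-student interaction logic described above is given below; the action names, the response window, and the interaction threshold are assumptions chosen for illustration.

```python
def count_interactions(student_actions, teacher_actions, window_s: float = 10.0) -> int:
    """Count teacher-student interactions in the teaching example.

    student_actions / teacher_actions: lists of (timestamp_seconds, action) pairs.
    A "raise_hand" followed by a teacher response action within window_s seconds
    counts as one interaction.
    """
    responses = [t for t, a in teacher_actions if a in {"point_to_student", "reply"}]
    count = 0
    for t, action in student_actions:
        if action != "raise_hand":
            continue
        if any(0 <= r - t <= window_s for r in responses):
            count += 1
    return count

def interaction_label(count: int, threshold: int = 5) -> str:
    """Compare the interaction count against a threshold number of times."""
    return "more interaction" if count >= threshold else "less interaction"
```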
At block 224, based on the evaluation at block 222, performance tags corresponding to the evaluation results may be generated for the participant. For example, taking a teaching event as an example, in terms of mental state, the performance tags may include, but are not limited to, at least one of: energetic, listless, listening attentively, distracted, and so on. In terms of the behavior representation, the performance tags may include, but are not limited to, at least one of: positive, negative, presence or absence of malicious behavior, positive or negative teaching gestures, monotonous or rich body language, the amount of teacher-student interaction, emotion changes or emotion statistics, and the like.
At block 226, a report related to the participant may be generated based at least on the participant's performance and/or performance tags. In some examples, the generated report may be provided to a terminal device, e.g., a terminal device of the participant or of a third party, such as the terminal device 130 in Fig. 1.
Further, although not shown in Fig. 2, the report generated at block 226 may be generated for different participants in a single event, or for the same participant across multiple events. In the case of multiple events involving the same participant, generating the report may further include: obtaining a plurality of performances or performance tags of the participant in the plurality of events, and generating the report according to at least a combination of the plurality of performances or performance tags.
It should be understood that all of the blocks shown in Fig. 2 and their input and output information are exemplary, and that blocks may be added or merged, and the input and output information of the blocks may be added or removed, depending on the particular arrangement. Further, it should be understood that embodiments of the present disclosure may build a machine learning based participant performance evaluation model that employs one or more of the above-described emotion sequences, action sequences, etc. as features and is trained to determine a participant's performance in an event. The model is not limited to being built using any particular machine learning technique.
Fig. 3 illustrates a process 300 for generating an additional report based on an audio stream corresponding to an event, according to an embodiment.
At block 302, an audio stream in a multimedia stream corresponding to an event may be received, wherein the audio stream relates to at least one participant of the event. In some examples, the operations in block 302 may be performed by, for example, signal acquisition device 120 of fig. 1.
At block 304, speech recognition may be performed on the received audio stream to generate text for the audio stream. In implementations, speech recognition may be performed on the audio stream using any speech recognition model known, including but not limited to, for example, Hidden Markov Models (HMMs), Convolutional Neural Networks (CNNs), deep neural network models (DNNs), and so forth. In some examples, the operations in block 304 may be performed by, for example, speech recognition model 143 of fig. 1.
At block 306, Natural Language Processing (NLP) may be performed on the text of the audio stream. In some examples, the operations in block 306 may be performed by, for example, the Natural Language Processing (NLP) model 146 of Fig. 1. In some embodiments, the natural language processing may include at least one of the following operations: speech rate statistics in block 308, key content extraction in block 310, text sentiment analysis in block 312, and inappropriate word detection in block 314, where key content is extracted from the text according to a preset content list. In some examples, the key content extraction operation may further include determining the distribution of the key content in the text, such as its time of occurrence, location, frequency, and so forth. In some examples, inappropriate words may be detected from the text based on a preset blacklist related to the category of the event.
At block 316, the event may be evaluated using an event evaluation model related to the event category, based on the output of blocks 308, 310, 312, and 314. In some examples, the event evaluation model may be implemented by a pre-trained classification model and/or regression model (e.g., a scoring model). In one embodiment, the event evaluation model may evaluate the event according to the degree of matching or coverage between the extracted key content and the preset content list. Optionally, the event evaluation operation at block 316 may also be performed according to the performance tags of a participant at block 318, such as the performance tags generated at block 224 of Fig. 2, where the participant involved at block 318 may be the same as or different from the participant involved in the audio stream received at block 302.
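For illustration, the following toy scoring function stands in for the event evaluation model of block 316, combining the outputs of blocks 308-314 and, optionally, the participant performance of block 318; the weights, the assumed comfortable speech-rate band, and the 0-100 scale are assumptions, and an actual system would use the pre-trained classification/regression model mentioned above.

```python
from typing import Optional

def evaluate_event(speech_rate_wpm: float, key_content_coverage: float,
                   text_sentiment_score: float, inappropriate_word_count: int,
                   participant_score: Optional[float] = None) -> float:
    """Toy stand-in for the event evaluation model.

    speech_rate_wpm:          words per minute (block 308).
    key_content_coverage:     0..1 coverage of the preset content list (block 310).
    text_sentiment_score:     0..1 positivity from text sentiment analysis (block 312).
    inappropriate_word_count: hits against the category blacklist (block 314).
    participant_score:        optional 0..100 performance score (block 318).
    Returns a clipped 0..100 score.
    """
    # Penalize speech far from an assumed comfortable pace around 140 wpm.
    pace = 1.0 - min(abs(speech_rate_wpm - 140) / 140, 1.0)
    score = 45 * key_content_coverage + 30 * pace + 25 * text_sentiment_score
    score -= 5 * inappropriate_word_count
    if participant_score is not None:
        score = 0.85 * score + 0.15 * participant_score
    return max(0.0, min(100.0, score))
```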
At block 320, an additional report including the results of the evaluation may be generated according to the evaluation of the event at block 316. In some examples, the additional report may also include the key content extracted from the text at block 310. In other examples, where the participant involved in the video stream is different from the participant involved in the audio stream, the additional report may include a correlation between the emotion change of the participant involved in the video stream and the text sentiment.
Further, in some examples, the additional reports generated at block 320 may be displayed separately from or merged with the reports generated at block 226 in fig. 2.
It is to be appreciated that the operations described above in block 316 for evaluating events may be implemented via a pre-trained model.
For ease of illustration and for simplicity, the reporting interface generated according to the concepts of the present application is described below in terms of a teaching class. It is to be understood that the concepts of the present application may be applied to other types of events.
Fig. 4 illustrates an example reporting interface 400 generated for performance of participants in an event of an example instructional category, according to an embodiment. The interface is displayed on an exemplary display component, such as display component 132 of FIG. 1. In this embodiment, the event is a lesson in the teaching and the participant is one of the participants associated with the video stream, such as a teacher in the lesson.
In some examples, the reporting interface 400 may present a report generated according to the method of Fig. 2, where the method is performed based on a video stream. In this embodiment, the event is the lesson identified in the interface 400 as "Course name: Unit 1, Lesson 1; Course time: 2019.4.10 14:30-16:00". The interface 400 also shows some additional information in the course information for the lesson, such as the student name and course status, which is not required in a report generated according to the embodiments of the present disclosure and thus will not be described in detail herein.
In the example shown in Fig. 4, the evaluation results for various performances of a teacher (i.e., the participant) in the lesson (i.e., the event) are shown in the reporting interface 400, including but not limited to the classroom mental state, the classroom behavior representation, classroom emotion changes, and teacher-student interaction, where the classroom behavior representation may further include classroom malicious behavior, teaching gestures, body language, and the like. As shown in Fig. 4, the evaluation of the participant's performance may be shown in a labeled or scored manner, such as the labels "listless" and "energetic" shown for the classroom mental state by a labeling model, or the score "85" shown for the classroom mental state by a scoring model; the labels "positive" and "negative" or the score "80" shown for the classroom behavior representation; the labels "none" and "severe" or the score "0" shown for classroom malicious behavior; the labels "negative" and "positive" or the score "85" shown for teaching gestures; the labels "monotonous" and "rich" or the score "75" shown for body language; the labels "little" and "much" or the score "85" shown for teacher-student interaction; and "stable" or the score "85" shown for classroom emotion changes.
In some examples, a label for the participant's (e.g., teacher's) emotion change may be generated based on the identified emotion sequence of the participant and included in the report. In this example, the label "stable" for the teacher's classroom emotion change may be derived from the emotion sequence through the labeling model. Of course, other forms of labels may be attached to the teacher's classroom emotion changes based on a pre-trained labeling model. In other examples, the teacher's classroom emotion changes may be scored by a scoring model, e.g., with a score of "85".
It should be understood that although one of the participant's (i.e., teacher) performances of "classroom malicious activity" is shown in the reporting interface 400 of FIG. 4 as optionally included within the "classroom activity representation" item, the performance "classroom malicious activity" may also be listed as a separate item in the report to alert the viewers of the report to this point.
Fig. 5 illustrates an example reporting interface 500 generated for performance of another participant in an event of an example instructional category, according to an embodiment. The interface is displayed on an exemplary display component, such as display component 132 of FIG. 1. In this embodiment, the event is a lesson in the teaching and the other participant is one of the participants associated with the video stream, such as a student in the lesson.
In some examples, the reporting interface 500 may present a report generated according to the method of Fig. 2, where the method is performed based on a video stream. In this embodiment, the event is the lesson identified in the interface 500 as "Course name: Unit 1, Lesson 1; Course time: 2019.4.10 14:30-16:00". The reporting interface 500 also shows some additional information in the course information for the lesson, such as the teacher name and course status, which is not required in a report generated according to the embodiments of the present disclosure and thus will not be described in detail herein.
In the example shown in Fig. 5, the evaluation results for various performances of a student (i.e., another participant) in the lesson (i.e., the event) are shown in the reporting interface 500, including but not limited to the classroom mental state, the classroom behavior representation, classroom emotion changes, and teacher-student interaction, where the classroom behavior representation may further include representations of whether the student arrived late, left early, left mid-way, or chatted with others. As shown in Fig. 5, the evaluation of this other participant's performance may be shown in a labeled or scored manner, such as the labels "distracted" and "listening attentively" shown for the classroom mental state by the labeling model, or the score "80" shown for the classroom mental state by the scoring model; the labels "negative" and "positive" or the score "80" shown for the classroom behavior representation; the labels "little" and "much" or the score "85" shown for teacher-student interaction; and "unstable" or the score "40" shown for classroom emotion changes.
In some examples, the participant's (e.g., student's) emotion change may be generated based on the identified emotion sequence or emotion statistics of the participant and included in the report, such as the classroom emotion statistics shown in Fig. 5. In this example, the label "unstable" for the student's classroom emotion change may be derived by the labeling model based on the emotion sequence or emotion statistics. Of course, other forms of labels may be attached to the student's classroom emotion changes based on a pre-trained labeling model. In other examples, the student's classroom emotion changes may be scored by a scoring model, e.g., with a score of "40".
Fig. 6 is an example report interface 600 generated for an event of an example instructional category, according to an embodiment. The interface is displayed on an exemplary display component, such as display component 132 of Fig. 1. In this embodiment, the event is a lesson in the teaching and the participant is one of the participants associated with the audio stream, such as the teacher in the lesson.
In some examples, the reporting interface 600 may present a report generated according to the method of Fig. 3, where the method is performed based on an audio stream. In this embodiment, the event is the lesson identified in the interface 600 as "Course name: Unit 1, Lesson 1; Course time: 2019.4.10 14:30-16:00". The reporting interface 600 also shows some additional information in the course information for the lesson, such as the student name, teacher name, and course status, which is not required in a report generated according to the embodiments of the present disclosure and thus will not be described in detail herein.
In the example shown in Fig. 6, the evaluation results for the lesson (i.e., the event) are shown in the reporting interface 600, including but not limited to a course evaluation, the knowledge point matching degree (corresponding to the key content matching), the classroom speech rate of the teacher (corresponding to the participant involved in the audio stream), the teacher's classroom inappropriate words, and the correlation between the emotion changes of the students (corresponding to the participants involved in the video stream) and the teacher's text sentiment. As shown in Fig. 6, the evaluation for an event may be shown in a labeled or scored manner, such as the labels "poor" and "good" shown for the course evaluation by a labeling model, or the score "80" shown by a scoring model; the labels "low" and "high" or the score "100" shown for the knowledge point matching degree; the labels "slow" and "fast" or the score "50" shown for the teacher's classroom speech rate; the labels "none" and "many" or the score "0" shown for the teacher's classroom inappropriate words; and the labels "low" and "high" or the score "50" shown for the correlation between the students' emotion changes and the teacher's text sentiment. In some examples, the "teacher's classroom speech rate" shown in the form of a label or score may be combined with, or replaced by, the "teacher's classroom speech rate change" shown in the form of a line graph in Fig. 6 to perform event evaluation; the "knowledge point matching degree" shown in the form of a label or score may be combined with, or replaced by, the "extracted knowledge points" in the report to perform event evaluation.
It should be understood that each of the performances for a participant and each of the evaluation results for an event in the reporting interfaces shown in figs. 4, 5, and 6 is exemplary, and in practice any of the shown performances and evaluation results may be added, removed, or replaced according to the system design or actual needs. Although different reports are shown in figs. 4, 5, and 6, respectively, these three separate reports may be displayed in any combination.
Fig. 7 illustrates a flow diagram of an exemplary method 700 for generating an evaluation report for an event, according to an embodiment.
At block 710, a multimedia stream corresponding to an event may be received, the multimedia stream including a video stream related to a first participant of one or more participants of the event.
At block 720, a sequence of facial images and a sequence of body images of the first participant may be detected from the video stream.
At block 730, a sequence of emotions may be identified from the sequence of facial images and a sequence of actions may be identified from the sequence of body images.
At block 740, the first participant's performance in the event may be evaluated by at least one participant evaluation model associated with the category of the event according to at least one of the sequence of emotions and the sequence of actions.
At block 750, a report related to the first participant may be generated based at least on the performance.
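The data flow of blocks 710-750 can be sketched end to end as follows. All functions in this Python sketch (detect_face, detect_body, recognize_emotion, recognize_action, evaluate_performance) are hypothetical stand-ins for the detection, recognition, and participant evaluation models; only the overall pipeline structure reflects the method described above.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Report:
    participant_id: str
    performance: Dict[str, object]

# Trivial stand-ins for the detection, recognition, and evaluation models.
# In the described system these would be trained models; here they are
# hypothetical placeholders so the data flow of blocks 710-750 runs end to end.
def detect_face(frame, participant_id):
    return {"frame": frame, "who": participant_id}

def detect_body(frame, participant_id):
    return {"frame": frame, "who": participant_id}

def recognize_emotion(face_image):
    return "neutral"

def recognize_action(body_image):
    return "sitting"

def evaluate_performance(emotions: List[str], actions: List[str], category: str) -> Dict[str, object]:
    # A category-specific participant evaluation model would go here.
    return {
        "mental_state": "attentive" if emotions.count("neutral") >= len(emotions) / 2 else "distracted",
        "behavior": "negative" if "sleeping" in actions else "positive",
        "event_category": category,
    }

def generate_participant_report(video_frames: List, participant_id: str, event_category: str) -> Report:
    face_seq = [detect_face(f, participant_id) for f in video_frames]       # block 720
    body_seq = [detect_body(f, participant_id) for f in video_frames]       # block 720
    emotion_seq = [recognize_emotion(f) for f in face_seq]                  # block 730
    action_seq = [recognize_action(b) for b in body_seq]                    # block 730
    performance = evaluate_performance(emotion_seq, action_seq, event_category)  # block 740
    return Report(participant_id=participant_id, performance=performance)       # block 750

print(generate_participant_report(["frame_0", "frame_1"], "student_01", "lecture"))
```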
In one implementation, the performance includes at least one of: a mental state of the first participant generated from the sequence of emotions and the sequence of actions; a behavioral representation of the first participant generated from the sequence of actions and the category of the event; a change in emotion of the first participant generated from the sequence of emotions; and an interaction of the first participant with at least one other participant of the one or more participants.
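As a minimal illustration, the four kinds of performance listed above could be carried in a simple record such as the following; the field names and value types are assumptions for illustration only, not part of this disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ParticipantPerformance:
    """Container for the four kinds of performance listed above (names are illustrative)."""
    mental_state: Optional[str] = None                       # from the emotion and action sequences
    behavior: Dict[str, bool] = field(default_factory=dict)  # e.g. {"late_arrival": False, "early_exit": False}
    emotion_change: Optional[str] = None                     # e.g. "stable" / "unstable", from the emotion sequence
    interactions: List[str] = field(default_factory=list)    # e.g. interactions with the teacher

example = ParticipantPerformance(
    mental_state="attentive",
    behavior={"late_arrival": False, "chatting_with_others": True},
    emotion_change="unstable",
    interactions=["answered a question", "asked a question"],
)
print(example)
```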
In one implementation, identifying the sequence of actions includes: depending on the category of the event, appending to each action in the sequence of actions one of the following labels: positive, negative and neutral.
In another implementation, identifying the sequence of actions includes: depending on the category of the event, appending to each action in the sequence of actions one of the following labels: malicious and non-malicious.
In further implementations, generating the report further includes: indicating in the report whether there is an action marked as malicious; and/or presenting the action marked as malicious in the report.
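A minimal sketch of the category-dependent action labeling and of the malicious-action portion of the report is shown below; the label tables, action names, and event categories are illustrative assumptions, not values taken from this disclosure.

```python
# Illustrative, category-dependent label tables; real tables would come from
# the category-specific participant evaluation models, not from this sketch.
ACTION_LABELS = {
    "lecture": {"raising_hand": "positive", "sleeping": "negative", "sitting": "neutral"},
    "security_monitoring": {"walking": "non-malicious", "fighting": "malicious"},
}

def label_actions(action_sequence, event_category, default="neutral"):
    """Append a category-dependent label to each recognized action."""
    table = ACTION_LABELS.get(event_category, {})
    return [(action, table.get(action, default)) for action in action_sequence]

def malicious_report_section(labeled_actions):
    """Build the report fields that indicate and present malicious actions."""
    malicious = [action for action, label in labeled_actions if label == "malicious"]
    return {"has_malicious_action": bool(malicious), "malicious_actions": malicious}

labeled = label_actions(["walking", "fighting", "walking"], "security_monitoring",
                        default="non-malicious")
print(labeled)
print(malicious_report_section(labeled))
```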
In one implementation, the multimedia stream further includes an audio stream related to a second participant of the one or more participants, and the method further includes: generating a text corresponding to the audio stream by performing speech recognition on the audio stream; evaluating the event at least according to the text by an event evaluation model associated with the category of the event; and generating an additional report including results of the evaluation of the event.
In further implementations, the second participant is the same as or different from the first participant.
In yet another implementation, the event is evaluated by the event evaluation model further according to at least one of: a speech rate of the second participant generated from the text; key content extracted from the text; a text sentiment generated by performing sentiment analysis on the text; and inappropriate words detected from the text.
In a further implementation, the event is evaluated by the event evaluation model further according to a degree of match between the key content and a preset content list associated with the event.
In yet another implementation, the inappropriate words are detected from the text according to a preset blacklist related to the category of the event.
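Several of the text-derived signals above (the speech rate, the matching of key content against a preset content list, and the blacklist-based detection of inappropriate words) can be sketched with plain string processing, as below. The preset knowledge points, the blacklist entries, and the words-per-minute formulation are illustrative assumptions; the text sentiment signal is omitted here since it would typically come from a separate sentiment analysis model.

```python
import re

PRESET_KNOWLEDGE_POINTS = {"photosynthesis", "chlorophyll", "glucose"}  # illustrative preset content list
INAPPROPRIATE_BLACKLIST = {"stupid", "shut up"}                         # illustrative category blacklist

def speech_rate(text: str, duration_minutes: float) -> float:
    """Words per minute, derived from the recognized text and the event duration."""
    words = re.findall(r"\w+", text)
    return len(words) / duration_minutes if duration_minutes > 0 else 0.0

def key_content_match(text: str, preset=PRESET_KNOWLEDGE_POINTS) -> float:
    """Fraction of the preset knowledge points that appear in the recognized text."""
    lowered = text.lower()
    hits = sum(1 for point in preset if point in lowered)
    return hits / len(preset) if preset else 0.0

def inappropriate_words(text: str, blacklist=INAPPROPRIATE_BLACKLIST):
    """Return the blacklist entries that occur in the recognized text."""
    lowered = text.lower()
    return [word for word in blacklist if word in lowered]

transcript = "Photosynthesis turns light into glucose; chlorophyll absorbs the light."
print(speech_rate(transcript, duration_minutes=0.2),
      key_content_match(transcript),
      inappropriate_words(transcript))
```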
In yet another implementation, the event is evaluated by the event evaluation model further based on the performance of the first participant.
In further implementations, the additional report includes at least one of: the key content extracted from the text; and, if the second participant is different from the first participant, a correlation between the change in emotion of the first participant generated from the sequence of emotions and the text sentiment generated from the text.
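One plausible way to compute the correlation mentioned above is a plain Pearson correlation between a per-segment valence of the first participant's recognized emotions and a per-segment sentiment of the second participant's recognized text, as in the sketch below. The segmentation and the choice of Pearson correlation are assumptions, since the disclosure does not fix either, and the example values are purely illustrative.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation between two equally long numeric series."""
    n = len(xs)
    if n == 0 or n != len(ys):
        return 0.0
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

# Per-segment valence of the students' recognized emotions (video stream) and
# per-segment sentiment of the teacher's recognized speech (audio stream).
student_emotion_valence = [0.2, 0.4, 0.8, 0.7, 0.3]
teacher_text_sentiment = [0.1, 0.5, 0.9, 0.6, 0.2]
print(round(pearson(student_emotion_valence, teacher_text_sentiment), 2))
```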
In one implementation, the event includes a plurality of events involving the first participant, and generating the report further includes: obtaining a plurality of performances of the first participant in the plurality of events; and generating the report based at least on a combination of the plurality of performances.
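A minimal sketch of combining per-event performances into one report-level summary is given below; the simple averaging of numeric scores is an assumed aggregation, as the disclosure only requires that the report be based on some combination of the plurality of performances.

```python
def combine_performances(per_event_scores):
    """Average each performance dimension across several events of one participant.

    per_event_scores: list of dicts like {"mental_state": 80, "behavior": 70, ...};
    the dimension names and the averaging rule are illustrative assumptions.
    """
    combined = {}
    for scores in per_event_scores:
        for key, value in scores.items():
            combined.setdefault(key, []).append(value)
    return {key: sum(values) / len(values) for key, values in combined.items()}

term_report = combine_performances([
    {"mental_state": 80, "behavior": 70, "emotion_change": 40},
    {"mental_state": 90, "behavior": 75, "emotion_change": 60},
])
print(term_report)
```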
It should be understood that method 700 may also include: any steps/processes for generating an evaluation report for an event according to embodiments of the present disclosure, as mentioned above.
Fig. 8 illustrates an example apparatus 800 for generating an evaluation report for an event, according to an embodiment.
The apparatus 800 may include: a receiving module 810 for receiving a multimedia stream corresponding to the event, the multimedia stream comprising a video stream related to a first participant of the one or more participants of the event; a detection module 820 for detecting a sequence of facial images and a sequence of body images of the first participant from the video stream; an identifying module 830 for identifying a sequence of emotions from the sequence of facial images and a sequence of actions from the sequence of body images; an evaluation module 840 for evaluating the first participant's performance in the event according to at least one of the sequence of emotions and the sequence of actions by at least one participant evaluation model associated with the category of the event; and a generating module 850 for generating a report related to the first participant based at least on the performance.
In one implementation, the performance includes at least one of: a mental state of the first participant generated from the sequence of emotions and the sequence of actions; a behavioral representation of the first participant generated from the sequence of actions and the category of the event; a change in emotion of the first participant generated from the sequence of emotions; and an interaction of the first participant with at least one other participant of the one or more participants.
In one implementation, the multimedia stream further includes an audio stream associated with a second participant of the one or more participants. Furthermore, the apparatus further comprises: a text generation module for generating a text corresponding to the audio stream by performing speech recognition on the audio stream; and an event evaluation module for evaluating the event based at least on the text via an event evaluation model associated with the category of the event. In some examples, the generating module is further for generating an additional report including results of the evaluation of the event.
In one implementation, the event evaluation module evaluates the event, through the event evaluation model, further according to at least one of the following: a speech rate of the second participant generated from the text; key content extracted from the text; a text sentiment generated by performing sentiment analysis on the text; and inappropriate words detected from the text.
In one implementation, the additional report includes at least one of: the key content extracted from the text; and, if the second participant is different from the first participant, a correlation between the change in emotion of the first participant generated from the sequence of emotions and the text sentiment generated from the text.
In one implementation, the event includes a plurality of events involving the first participant, and the generating module is further for: obtaining a plurality of performances of the first participant in the plurality of events; and generating the report based at least on a combination of the plurality of performances.
It should be understood that the apparatus 800 may further include: any other module configured for generating an evaluation report for an event according to an embodiment of the present disclosure, as mentioned above.
Fig. 9 illustrates another example apparatus 900 for generating an evaluation report for an event, according to an embodiment. The apparatus 900 may include one or more processors 910 and a memory 920 storing computer-executable instructions that, when executed, cause the one or more processors 910 to perform the following operations: receiving a multimedia stream corresponding to the event, the multimedia stream comprising a video stream related to a first participant of one or more participants of the event; detecting a sequence of facial images and a sequence of body images of the first participant from the video stream; identifying a sequence of emotions from the sequence of facial images and a sequence of actions from the sequence of body images; evaluating the first participant's performance in the event according to at least one of the sequence of emotions and the sequence of actions by at least one participant evaluation model associated with the category of the event; and generating a report related to the first participant based at least on the performance.
Embodiments of the present disclosure may be embodied in non-transitory computer readable media. The non-transitory computer-readable medium may include instructions that, when executed, cause one or more processors to perform any of the operations of the method for generating an evaluation report for an event according to embodiments of the present disclosure as described above.
It should be understood that all operations in the methods described above are exemplary only, and the present disclosure is not limited to any operations in the methods or the order of the operations, but rather should encompass all other equivalent variations under the same or similar concepts.
It should also be understood that all of the modules in the above described apparatus may be implemented in various ways. These modules may be implemented as hardware, software, or a combination thereof. In addition, any of these modules may be further divided functionally into sub-modules or combined together.
The processor has been described in connection with various apparatus and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software depends upon the particular application and the overall design constraints imposed on the system. By way of example, the processor, any portion of the processor, or any combination of processors presented in this disclosure may be implemented as a microprocessor, microcontroller, Digital Signal Processor (DSP), Field Programmable Gate Array (FPGA), Programmable Logic Device (PLD), state machine, gated logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described in this disclosure. The functionality of a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as software executed by a microprocessor, microcontroller, DSP, or other suitable platform.
Software should be viewed broadly as representing instructions, instruction sets, code segments, program code, programs, subroutines, software modules, applications, software packages, routines, objects, threads of execution, procedures, functions, and the like. The software may reside in a computer readable medium. The computer readable medium may include, for example, a memory, which may be, for example, a magnetic storage device (e.g., a hard disk, a floppy disk, a magnetic strip), an optical disk, a smart card, a flash memory device, a Random Access Memory (RAM), a Read Only Memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), a register, or a removable disk. Although the memory is shown as being separate from the processor in aspects presented in this disclosure, the memory may be located internal to the processor (e.g., a cache or a register).
The above description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described herein that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims.

Claims (20)

1. A method for generating an evaluation report for an event, comprising:
receiving a multimedia stream corresponding to the event, the multimedia stream comprising a video stream related to a first participant of one or more participants of the event;
detecting a sequence of facial images and a sequence of body images of the first participant from the video stream;
identifying a sequence of emotions from the sequence of facial images and a sequence of actions from the sequence of body images;
evaluating the first participant's performance in the event according to at least one of the sequence of emotions and the sequence of actions by at least one participant evaluation model associated with the category of the event; and
generating a report related to the first participant based at least on the performance.
2. The method of claim 1, wherein the performance comprises at least one of:
a mental state of the first participant generated from the sequence of emotions and the sequence of actions;
a behavioral representation of the first participant generated from the sequence of actions and the category of the event;
a change in emotion of the first participant generated from the sequence of emotions; and
an interaction of the first participant with at least one other participant of the one or more participants.
3. The method of claim 1, wherein identifying the sequence of actions comprises:
depending on the category of the event, appending to each action in the sequence of actions one of the following labels: positive, negative and neutral.
4. The method of claim 1, wherein identifying the sequence of actions comprises:
depending on the category of the event, appending to each action in the sequence of actions one of the following labels: malicious and non-malicious.
5. The method of claim 4, wherein generating the report further comprises:
indicating in the report whether there is an action marked as malicious; and/or
presenting the action marked as malicious in the report.
6. The method of claim 1, wherein the multimedia stream further comprises an audio stream related to a second participant of the one or more participants, and the method further comprises:
generating a text corresponding to the audio stream by performing speech recognition on the audio stream;
evaluating the event at least from the text by an event evaluation model associated with the category of the event; and
generating an additional report including results of the evaluation of the event.
7. The method of claim 6, wherein the second participant is the same as or different from the first participant.
8. The method of claim 6, wherein the event is evaluated by the event evaluation model further according to at least one of:
a speech rate of the second participant generated from the text;
key content extracted from the text;
a text sentiment generated by performing sentiment analysis on the text; and
inappropriate words detected from the text.
9. The method of claim 8, wherein the event is evaluated by the event evaluation model further based on a degree of match between the key content and a preset list of content associated with the event.
10. The method of claim 8, wherein the inappropriate words are detected from the text according to a preset blacklist relating to categories of the event.
11. The method of claim 6, wherein the event is evaluated by the event evaluation model further based on the performance of the first participant.
12. The method of claim 6, wherein the additional report comprises at least one of:
key content extracted from the text; and
a correlation between a change in emotion of the first participant generated from the sequence of emotions and a text sentiment generated from the text if the second participant is different from the first participant.
13. The method of claim 1, wherein the event comprises a plurality of events involving the first participant, and generating the report further comprises:
obtaining a plurality of performances of the first participant in the plurality of events; and
generating the report based at least on a combination of the plurality of performances.
14. An apparatus for generating an evaluation report for an event, comprising:
a receiving module for receiving a multimedia stream corresponding to the event, the multimedia stream comprising a video stream related to a first participant of one or more participants of the event;
a detection module to detect a sequence of facial images and a sequence of body images of the first participant from the video stream;
an identification module for identifying a sequence of emotions from the sequence of facial images and a sequence of actions from the sequence of body images;
an evaluation module to evaluate the first participant's performance in the event by at least one participant evaluation model associated with the category of the event according to at least one of the sequence of emotions and the sequence of actions; and
a generating module to generate a report related to the first participant based at least on the performance.
15. The apparatus of claim 14, wherein the performance comprises at least one of:
a mental state of the first participant generated from the sequence of emotions and the sequence of actions;
a behavioral representation of the first participant generated from the sequence of actions and the category of the event;
a change in emotion of the first participant generated from the sequence of emotions; and
an interaction of the first participant with at least one other participant of the one or more participants.
16. The apparatus of claim 14, wherein the multimedia stream further comprises an audio stream related to a second participant of the one or more participants, and the apparatus further comprises:
a text generation module for generating a text corresponding to the audio stream by performing speech recognition on the audio stream; and
an event evaluation module to evaluate the event at least from the text by an event evaluation model associated with a category of the event;
wherein the generating module is further for generating an additional report including a result of the evaluation of the event.
17. The apparatus of claim 16, wherein the event evaluation module evaluates the event by the event evaluation model further according to at least one of:
a speech rate of the second participant generated from the text;
key content extracted from the text;
a text sentiment generated by performing sentiment analysis on the text; and
inappropriate words detected from the text.
18. The apparatus of claim 16, wherein the additional report comprises at least one of:
key content extracted from the text; and
a correlation between a change in emotion of the first participant generated from the sequence of emotions and a text sentiment generated from the text if the second participant is different from the first participant.
19. The apparatus of claim 14, wherein the event comprises a plurality of events involving the first participant, and generating the report further comprises:
obtaining a plurality of performances of the first participant in the plurality of events; and
generating the report based at least on a combination of the plurality of performances.
20. An apparatus for generating an evaluation report for an event, comprising:
one or more processors; and
a memory storing computer-executable instructions that, when executed, cause the one or more processors to:
receiving a multimedia stream corresponding to the event, the multimedia stream comprising a video stream related to a first participant of one or more participants of the event;
detecting a sequence of facial images and a sequence of body images of the first participant from the video stream;
identifying a sequence of emotions from the sequence of facial images and a sequence of actions from the sequence of body images;
evaluating the first participant's performance in the event according to at least one of the sequence of emotions and the sequence of actions by at least one participant evaluation model associated with the category of the event; and
generating a report related to the first participant based at least on the performance.
CN201910317933.6A 2019-04-19 2019-04-19 Artificial intelligence based event evaluation report generation Pending CN111833861A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910317933.6A CN111833861A (en) 2019-04-19 2019-04-19 Artificial intelligence based event evaluation report generation
PCT/US2020/023460 WO2020214316A1 (en) 2019-04-19 2020-03-19 Artificial intelligence-based generation of event evaluation report

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910317933.6A CN111833861A (en) 2019-04-19 2019-04-19 Artificial intelligence based event evaluation report generation

Publications (1)

Publication Number Publication Date
CN111833861A true CN111833861A (en) 2020-10-27

Family

ID=70293075

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910317933.6A Pending CN111833861A (en) 2019-04-19 2019-04-19 Artificial intelligence based event evaluation report generation

Country Status (2)

Country Link
CN (1) CN111833861A (en)
WO (1) WO2020214316A1 (en)


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11516036B1 (en) 2019-11-25 2022-11-29 mmhmm inc. Systems and methods for enhancing meetings
CN112651610B (en) * 2020-12-17 2024-02-02 韦福瑞 Checking method and system for judging and identifying adaptability of simulated environment based on sound
US20230230588A1 (en) * 2022-01-20 2023-07-20 Zoom Video Communications, Inc. Extracting filler words and phrases from a communication session
CN117994098A (en) * 2024-03-05 2024-05-07 江苏薪传科技有限公司 Teaching index data determining method, device, computer equipment and medium

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110295392A1 (en) * 2010-05-27 2011-12-01 Microsoft Corporation Detecting reactions and providing feedback to an interaction
WO2015148727A1 (en) * 2014-03-26 2015-10-01 AltSchool, PBC Learning environment systems and methods
CN106156170A (en) * 2015-04-16 2016-11-23 北大方正集团有限公司 The analysis of public opinion method and device
CN106716958A (en) * 2014-09-18 2017-05-24 微软技术许可有限责任公司 Lateral movement detection
CN107292271A (en) * 2017-06-23 2017-10-24 北京易真学思教育科技有限公司 Learning-memory behavior method, device and electronic equipment
CN107316261A (en) * 2017-07-10 2017-11-03 湖北科技学院 A kind of Evaluation System for Teaching Quality based on human face analysis
CN107895244A (en) * 2017-12-26 2018-04-10 重庆大争科技有限公司 Classroom teaching quality assessment method
CN108764047A (en) * 2018-04-27 2018-11-06 深圳市商汤科技有限公司 Group's emotion-directed behavior analysis method and device, electronic equipment, medium, product
CN109035089A (en) * 2018-07-25 2018-12-18 重庆科技学院 A kind of Online class atmosphere assessment system and method
CN109063587A (en) * 2018-07-11 2018-12-21 北京大米科技有限公司 data processing method, storage medium and electronic equipment
CN109101933A (en) * 2018-08-21 2018-12-28 重庆乐教科技有限公司 A kind of emotion-directed behavior visual analysis method based on artificial intelligence
CN109598300A (en) * 2018-11-30 2019-04-09 成都数联铭品科技有限公司 A kind of assessment system and method

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113505659A (en) * 2021-02-02 2021-10-15 黑芝麻智能科技有限公司 Method for describing time event
US11887384B2 (en) 2021-02-02 2024-01-30 Black Sesame Technologies Inc. In-cabin occupant behavoir description
CN113095157A (en) * 2021-03-23 2021-07-09 深圳市创乐慧科技有限公司 Image shooting method and device based on artificial intelligence and related products
CN116260990A (en) * 2023-05-16 2023-06-13 合肥高斯智能科技有限公司 AI asynchronous detection and real-time rendering method and system for multipath video streams
CN116452072A (en) * 2023-06-19 2023-07-18 华南师范大学 Teaching evaluation method, system, equipment and readable storage medium
CN116452072B (en) * 2023-06-19 2023-08-29 华南师范大学 Teaching evaluation method, system, equipment and readable storage medium

Also Published As

Publication number Publication date
WO2020214316A1 (en) 2020-10-22

Similar Documents

Publication Publication Date Title
CN108648757B (en) Analysis method based on multi-dimensional classroom information
CN111833861A (en) Artificial intelligence based event evaluation report generation
US11551804B2 (en) Assisting psychological cure in automated chatting
WO2021232775A1 (en) Video processing method and apparatus, and electronic device and storage medium
CN107992195A (en) A kind of processing method of the content of courses, device, server and storage medium
CN114298497A (en) Evaluation method and device for classroom teaching quality of teacher
CN113537801B (en) Blackboard writing processing method, blackboard writing processing device, terminal and storage medium
CN114299617A (en) Teaching interaction condition identification method, device, equipment and storage medium
Li et al. Multi-stream deep learning framework for automated presentation assessment
CN113076770A (en) Intelligent figure portrait terminal based on dialect recognition
CN113920534A (en) Method, system and storage medium for extracting video highlight
CN113238654A (en) Multi-modal based reactive response generation
CN117615182B (en) Live broadcast interaction dynamic switching method, system and terminal
CN109754653A (en) A kind of method and system of individualized teaching
Campoy-Cubillo et al. Assessing multimodal listening comprehension through online informative videos: The operationalisation of a new listening framework for ESP in higher education
CN110046290B (en) Personalized autonomous teaching course system
Querol-Julián The multimodal genre of synchronous videoconferencing lectures: An eclectic framework to analyse interaction
CN110956142A (en) Intelligent interactive training system
Jain et al. Student’s Feedback by emotion and speech recognition through Deep Learning
CN111078010B (en) Man-machine interaction method and device, terminal equipment and readable storage medium
CN116825288A (en) Autism rehabilitation course recording method and device, electronic equipment and storage medium
Zheng et al. Automated Multi-Mode Teaching Behavior Analysis: A Pipeline Based Event Segmentation and Description
KR102658252B1 (en) Video education content providing method and apparatus based on artificial intelligence natural language processing using characters
Liu et al. Design of Voice Style Detection of Lecture Archives
Zhao et al. Design and Implementation of a Teaching Verbal Behavior Analysis Aid in Instructional Videos

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination