CN113065444A - Behavior detection method and device, readable storage medium and electronic equipment - Google Patents

Behavior detection method and device, readable storage medium and electronic equipment Download PDF

Info

Publication number
CN113065444A
Authority
CN
China
Prior art keywords
video
group
images
determining
behavior
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110326159.2A
Other languages
Chinese (zh)
Inventor
程驰
周佳
包英泽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dami Technology Co Ltd
Original Assignee
Beijing Dami Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dami Technology Co Ltd filed Critical Beijing Dami Technology Co Ltd
Priority to CN202110326159.2A
Publication of CN113065444A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 - Feature extraction; Face representation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 - Classification, e.g. identification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 - Facial expression recognition
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G10L25/87 - Detection of discrete points within a voice signal
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention discloses a behavior detection method and device, a readable storage medium and electronic equipment. The method comprises: acquiring a video stream and extracting a plurality of images from it at a set frequency; determining the violation images among the plurality of images through a pre-trained classification model; when the plurality of images contain at least one group of consecutive violation images whose number is greater than or equal to a set value, determining at least one group of video segments corresponding to the at least one group of consecutive violation images; determining at least one group of candidate video segments among the at least one group of video segments; and determining, through a pre-trained behavior recognition model, the video segments with violation behaviors among the at least one group of candidate video segments. By extracting images, detecting violation images and detecting violation behaviors, the method automatically detects violation behaviors in the video stream, which reduces labor consumption and improves detection efficiency.

Description

Behavior detection method and device, readable storage medium and electronic equipment
Technical Field
The invention relates to the technical field of computers, in particular to a behavior detection method, a behavior detection device, a readable storage medium and electronic equipment.
Background
With the development of Internet applications and the emergence of online teaching and video conferencing, traditional modes of teaching and working have changed, and online teaching is used more and more widely in daily life. An online teaching platform serves a large number of student users and therefore needs a large number of teachers. In the online teaching process, however, the teaching environment of a teacher is relatively casual, and lesson times are adjusted to the students' schedules and are relatively scattered; for example, students may need tutoring around 8 or 9 in the evening after returning home. As a result, a teacher's state during lessons fluctuates considerably, and situations such as drowsiness and lack of concentration easily occur, which lowers course quality and affects the students' learning efficiency and interest. Staff of the online teaching platform can manually review a teacher's lessons to detect whether violation behaviors, such as dozing, closing the eyes or yawning, occur during teaching, and can remind teachers who repeatedly show such behaviors, so as to ensure the quality and effect of online teaching. However, because the platform has many teachers, manual monitoring consumes a large amount of manpower and is inefficient.
In summary, how to detect teachers' violation behaviors during teaching while reducing labor consumption and improving detection efficiency is a problem that needs to be solved.
Disclosure of Invention
In view of this, embodiments of the present invention provide a behavior detection method and apparatus, a readable storage medium, and an electronic device, which automatically detect an illegal behavior of a teacher during a teaching process, reduce labor consumption, and improve detection efficiency.
In a first aspect, an embodiment of the present invention provides a behavior detection method, the method including: acquiring a video stream; extracting a plurality of images from the video stream according to a set frequency; determining violation images among the plurality of images through a pre-trained classification model; in response to the plurality of images including at least one group of consecutive violation images, determining at least one group of video segments corresponding to the at least one group of consecutive violation images, wherein the number of images in each group of consecutive violation images is greater than or equal to a set value; determining at least one group of candidate video segments among the at least one group of video segments, wherein the candidate video segments are silent video segments; and determining, through a pre-trained behavior recognition model, the video segments with violation behaviors among the at least one group of candidate video segments.
Preferably, the determining at least one group of candidate video segments in the at least one group of video segments specifically includes:
determining a voiced speech segment in the video stream;
removing, from the at least one group of video segments, the video segments that intersect with the voiced speech segments, and determining the unvoiced video segments among the at least one group of video segments as the at least one group of candidate video segments.
Preferably, the determining the voiced speech segment in the video stream specifically includes:
determining an audio stream corresponding to the video stream;
and determining the voiced speech segments in the audio stream through speech endpoint detection.
Preferably, the method further comprises:
and determining at least one group of face region video clips in the at least one group of candidate video clips through a pre-trained face detection model, wherein the face detection model is used for acquiring the face regions of the candidate video clips, and a face region video clip is a video clip formed by the face regions cropped from a candidate video clip.
Preferably, the determining, through a pre-trained behavior recognition model, the video clips with the violation behaviors in the at least one group of candidate video clips specifically includes:
and determining the video clips with the illegal behaviors in the at least one group of face region video clips through a pre-trained behavior recognition model.
Preferably, the classification model is a first classification model or a second classification model, wherein the first classification model is used for judging whether a person in the image closes eyes; the second classification model is used for judging whether the person in the image opens the mouth or not.
Preferably, the behavior recognition model is a first behavior recognition model or a second behavior recognition model, wherein the first behavior recognition model is used for recognizing whether the violation behavior of dozing with closed eyes exists in the candidate video segment; the second behavior recognition model is used for recognizing whether the violation behavior of yawning with an open mouth exists in the candidate video segment.
Preferably, the training process of the classification model includes:
acquiring a historical violation behavior image and a historical compliance behavior image;
and training the classification model according to the historical violation behavior image and the historical compliance behavior image, wherein the classification model is a binary classification model.
Preferably, the training process of the behavior recognition model includes:
acquiring historical violation behavior fragments and historical compliance behavior fragments;
and training the behavior recognition model according to the historical violation behavior fragments and the historical compliance behavior fragments, wherein the behavior recognition model is a deep learning neural network model.
Preferably, the face detection model is a deep learning neural network model trained according to historical face data.
In a second aspect, an embodiment of the present invention provides an apparatus for behavior detection, where the apparatus includes:
an acquisition unit configured to acquire a video stream;
the processing unit is used for extracting a plurality of images from the video stream according to a set frequency;
the determining unit is used for determining violation images in the multiple images through a pre-trained classification model;
the determining unit is further configured to determine, in response to at least one group of consecutive illegal images included in the plurality of images, at least one group of video segments corresponding to the at least one group of consecutive illegal images, where the number of images of the consecutive illegal images is greater than or equal to a set numerical value;
the determining unit is further configured to determine at least one set of candidate video segments of the at least one set of video segments, where the candidate video segments are silent video segments;
the determining unit is further configured to determine, through a pre-trained behavior recognition model, a video segment in which an illegal behavior exists in the at least one group of candidate video segments.
Preferably, the determining unit is specifically configured to:
determining a voiced speech segment in the video stream;
removing video segments in the at least one group of video segments, wherein the video segments intersect with the voiced speech segments, and determining at least one group of candidate video segments in the at least one group of video segments, which are not voiced.
Preferably, the determining unit is specifically configured to:
determining an audio stream corresponding to the video stream;
and determining the voiced speech segments in the audio stream through speech endpoint detection.
Preferably, the determining unit is further configured to:
and determining at least one group of face region video clips in the at least one group of candidate video clips through a pre-trained face detection model, wherein the face detection model is used for acquiring the face regions of the candidate video clips, and the face region video clips are video clips formed by face regions intercepted from the candidate video clips.
Preferably, the determining unit is specifically configured to: and determining the video clips with the illegal behaviors in the at least one group of face region video clips through a pre-trained behavior recognition model.
Preferably, the classification model is a first classification model or a second classification model, wherein the first classification model is used for judging whether a person in the image closes eyes; the second classification model is used for judging whether the person in the image opens the mouth or not.
Preferably, the behavior recognition model is a first behavior recognition model or a second behavior recognition model, wherein the first behavior recognition model is used for recognizing whether the violation behavior of dozing with closed eyes exists in the candidate video segment; the second behavior recognition model is used for recognizing whether the violation behavior of yawning with an open mouth exists in the candidate video segment.
Preferably, the obtaining unit is further configured to: acquiring a historical violation behavior image and a historical compliance behavior image;
the processing unit is further configured to: train the classification model according to the historical violation behavior image and the historical compliance behavior image, wherein the classification model is a binary classification model.
Preferably, the obtaining unit is further configured to: acquiring historical violation behavior fragments and historical compliance behavior fragments;
the processing unit is further configured to: train the behavior recognition model according to the historical violation behavior fragments and the historical compliance behavior fragments, wherein the behavior recognition model is a deep learning neural network model.
Preferably, the face detection model is a deep learning neural network model trained according to historical face data.
In a third aspect, an embodiment of the present invention provides a computer-readable storage medium on which computer program instructions are stored, which when executed by a processor implement the method according to the first aspect or any one of the possibilities of the first aspect.
In a fourth aspect, an embodiment of the present invention provides an electronic device, including a memory and a processor, the memory being configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method according to the first aspect or any one of the possibilities of the first aspect.
The embodiment of the invention acquires a video stream; extracts a plurality of images from the video stream according to a set frequency; determines violation images among the plurality of images through a pre-trained classification model; in response to the plurality of images including at least one group of consecutive violation images, determines at least one group of video segments corresponding to the at least one group of consecutive violation images, wherein the number of images in each group of consecutive violation images is greater than or equal to a set value; determines at least one group of candidate video segments among the at least one group of video segments, wherein the candidate video segments are silent video segments; and determines, through a pre-trained behavior recognition model, the video segments with violation behaviors among the at least one group of candidate video segments. Through this method (extracting images, detecting violation images, generating candidate video segments, and determining the video segments with violation behaviors among the candidates), violation behaviors in the video stream can be detected automatically, which reduces labor consumption and improves detection efficiency.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent from the following description of the embodiments of the present invention with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of a method of behavior detection in accordance with an embodiment of the present invention;
FIG. 2 is a flow chart of a method of behavior detection in accordance with an embodiment of the present invention;
FIG. 3 is a flow chart of a method of behavior detection in accordance with an embodiment of the present invention;
FIG. 4 is a flow chart of a method of behavior detection in accordance with an embodiment of the present invention;
FIG. 5 is a schematic view of an image of an embodiment of the present invention;
FIG. 6 is a schematic view of an image of an embodiment of the present invention;
FIG. 7 is a flow chart of a method of behavior detection in accordance with an embodiment of the present invention;
FIG. 8 is a flow chart of a method of behavior detection in accordance with an embodiment of the present invention;
FIG. 9 is a schematic diagram of an apparatus for behavior detection according to an embodiment of the present invention;
fig. 10 is a schematic diagram of an electronic device of an embodiment of the invention.
Detailed Description
The present disclosure is described below based on examples, but the present disclosure is not limited to only these examples. In the following detailed description of the present disclosure, certain specific details are set forth. It will be apparent to those skilled in the art that the present disclosure may be practiced without these specific details. Well-known methods, procedures, components and circuits have not been described in detail so as not to obscure the present disclosure.
Further, those of ordinary skill in the art will appreciate that the drawings provided herein are for illustrative purposes and are not necessarily drawn to scale.
Unless the context clearly requires otherwise, throughout this specification, the words "comprise", "comprising", and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, what is meant is "including, but not limited to".
In the description of the present disclosure, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present disclosure, "a plurality" means two or more unless otherwise specified.
An online teaching platform serves a large number of student users and therefore needs a large number of teachers. During online teaching, the teaching environment of a teacher is relatively casual, and lesson times are adjusted to the students' schedules and are scattered; for example, students may need tutoring around 8 or 9 in the evening after returning home. A teacher's state during lessons therefore fluctuates considerably, and situations such as drowsiness and poor spirit easily occur, which lowers course quality and affects the students' learning efficiency and interest. Similarly, during an online meeting several people may be on video at the same time, but the participants' environments are relatively casual, and if the meeting is long, participants easily become inattentive, which lowers meeting quality and affects work efficiency. For online teaching, staff of the platform can manually review a teacher's lessons and detect whether the teacher shows violation behaviors during teaching; for an online meeting, a company can assign dedicated staff to manually review the meeting and detect whether the participants show violation behaviors, such as dozing with closed eyes or yawning, and then remind the teachers or participants who repeatedly show such behaviors, so as to ensure the quality and effect of online teaching or online meetings. However, because the teachers or participants on such platforms are numerous, manual monitoring consumes a large amount of manpower and is inefficient.
In the embodiment of the invention, a behavior detection method detects the violation behaviors of teachers during teaching or of participants during a meeting, thereby reducing labor consumption and improving detection efficiency; the behavior detection method can also be applied to scenarios such as online live streaming, and the embodiment of the invention is not limited thereto.
In the embodiment of the present invention, fig. 1 is a flowchart of a method for behavior detection according to the embodiment of the present invention. As shown in fig. 1, the method specifically comprises the following steps:
and step S100, acquiring a video stream.
Specifically, after an online course ends, the teaching playback video stream of the teacher is obtained; or, after an online meeting ends, the playback video stream of the participants is obtained. In one possible implementation, the video stream may also be obtained in real time.
For example, assume that the playback video stream of an online lesson is 45 minutes long; this duration is only exemplary in the embodiment of the present invention.
Step S101, extracting a plurality of images in the video stream according to a set frequency.
Specifically, frames are extracted from the video stream at a set frequency, each frame being one image. If the set frequency is 1 frame per second, 45 × 60 = 2700 images are extracted from the 45-minute video. Optionally, the set frequency may also be 2 frames per second, or 1 frame every 2 seconds; this is not limited in the embodiment of the present invention and is determined according to the actual situation.
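As a non-limiting illustration of this step, the following sketch extracts frames at a set frequency with OpenCV; the function name and the 1-frame-per-second default are assumptions made for the example, not details taken from the embodiment.

import cv2

def extract_frames(video_path, frames_per_second=1.0):
    """Return (timestamp_seconds, frame) pairs sampled at the set frequency."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0   # fall back if FPS metadata is missing
    step = max(int(round(native_fps / frames_per_second)), 1)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:                        # keep one frame per sampling interval
            frames.append((index / native_fps, frame))
        index += 1
    cap.release()
    return frames

# e.g. a 45-minute lesson sampled at 1 frame per second yields about 45 * 60 = 2700 images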
And S102, determining violation images in the multiple images through a pre-trained classification model.
Specifically, the classification model is a first classification model or a second classification model, where the first classification model is used to determine whether a person in the image closes an eye; the second classification model is used for judging whether the person in the image opens the mouth or not.
In a possible implementation manner, a training process of the classification model is shown in fig. 2, and specifically includes:
and step S200, acquiring a historical violation image and a historical compliance image.
Specifically, when the classification model is the first classification model, the person in a historical violation behavior image has closed eyes and the person in a historical compliance behavior image has open eyes; when the classification model is the second classification model, the person in a historical violation behavior image has an open mouth and the person in a historical compliance behavior image has a closed mouth.
In the embodiment of the present invention, the classification model may classify other violations and compliance behaviors in addition to classifying eyes being closed, eyes being open, mouth being open, or mouth being closed, specifically, the classification model is determined according to an actual use situation, and the embodiment of the present invention is not limited thereto.
Step S201, training the classification model according to the historical violation behavior image and the historical compliance behavior image, wherein the classification model is a binary classification model.
In the embodiment of the invention, the first classification model or the second classification model for classifying images is generated through the above method, and the model is then used to determine the violation images among the plurality of images, that is, the images in which the person's eyes are closed or the person's mouth is open, which improves the efficiency and accuracy of determining the violation images among the plurality of images.
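The following is a minimal sketch of how such a binary classification model could be trained; the ResNet-18 backbone, the directory layout of the historical violation and compliance images, and the hyperparameters are illustrative assumptions rather than details disclosed by the embodiment.

import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

# Historical images are assumed to be organised as data/violation/*.jpg and
# data/compliance/*.jpg; ImageFolder assigns one integer label per sub-folder.
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("data", transform=transform)
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

# Small backbone with a two-way head: violation vs. compliance.
model = models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

model.train()
for epoch in range(5):                      # illustrative number of epochs
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()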
Step S103, in response to that the multiple images include at least one group of consecutive illegal images, determining at least one group of video segments corresponding to the at least one group of consecutive illegal images, where the number of images of the consecutive illegal images is greater than or equal to a set numerical value.
Specifically, assume the set value is 3. With a set frequency of 1 frame per second, 2700 images are extracted from the 45-minute video; when 3 or more of these images are consecutive violation images, the video segment corresponding to those consecutive violation images is determined.
For example, if the 7th, 8th and 9th of the 2700 images are violation images, then, because the set frequency is 1 frame per second, the 7th, 8th and 9th seconds of the video constitute a video segment; similarly, if the 21st, 22nd, 23rd, 24th and 25th images are violation images, the 21st through 25th seconds constitute a video segment. The specific numbers are determined according to the actual situation, and the invention is not limited thereto.
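A small sketch of this grouping step, under the assumption that the violation decisions are available as one boolean per sampled image (at 1 frame per second, image indices coincide with seconds); the function name and set value are illustrative.

def violation_segments(flags, set_value=3):
    """flags[i] is True when the i-th sampled image is a violation image.
    Returns (start, end) index pairs, inclusive, one per run of at least set_value."""
    segments, run_start = [], None
    for i, flag in enumerate(list(flags) + [False]):   # sentinel closes a trailing run
        if flag and run_start is None:
            run_start = i
        elif not flag and run_start is not None:
            if i - run_start >= set_value:
                segments.append((run_start, i - 1))
            run_start = None
    return segments

# e.g. images 7-9 and 21-25 flagged as violations -> [(7, 9), (21, 25)]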
Step S104, determining at least one group of candidate video segments in the at least one group of video segments, wherein the candidate video segments are silent video segments.
Specifically, assume that 20 groups of video segments are determined through step S103, and that the 10 groups of silent video segments among them are determined as candidate segments.
In an embodiment of the present invention, a specific process of determining at least one group of candidate video segments in the at least one group of video segments is shown in fig. 3, and specifically includes the following steps:
and step S300, determining an audio stream corresponding to the video stream.
Specifically, assuming that the video stream is 45 minutes, 45 minutes of audio stream is extracted from the video stream.
Step S301, determining a voiced speech segment in the audio stream through speech endpoint detection.
Specifically, the voiced speech segments in the 45-minute audio stream are determined through voice activity detection (VAD, also called speech endpoint detection), that is, the start time point and the end time point of each voiced speech segment are determined.
In the embodiment of the invention, voice endpoint detection is generally used to identify where speech starts and ends in an audio signal. Whether speech is present and how clearly it stands out from the background noise determine how stable VAD detection is; in a clean environment with little background noise, even a simple energy-based detection method can achieve a good speech detection result. In general, however, the audio signal contains background noise, so the VAD method needs good robustness to noise. The specific processing steps of the VAD method include: dividing the audio signal into frames; extracting features from each frame; training a classifier on a set of frames from known speech and silence regions; and classifying unknown frames to decide whether each frame belongs to a speech signal or a silence signal.
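The following energy-based sketch follows the framing, feature extraction and classification steps described above; the frame length and energy threshold are illustrative assumptions, and a trained classifier (or an off-the-shelf VAD) could replace the simple threshold decision.

import numpy as np

def voiced_intervals(samples, sample_rate, frame_ms=30, energy_threshold=0.01):
    """Return (start_second, end_second) intervals in which the audio is voiced."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    voiced = []
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len].astype(np.float64)
        energy = np.mean(frame ** 2)                 # per-frame feature
        voiced.append(energy > energy_threshold)     # simple speech/silence decision
    intervals, start = [], None
    for i, flag in enumerate(voiced + [False]):      # sentinel closes a trailing run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            intervals.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None
    return intervals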
Step S302, removing, from the at least one group of video segments, the video segments that intersect with the voiced speech segments, and determining the remaining unvoiced video segments as the at least one group of candidate video segments.
Specifically, suppose the video segment formed by the 7th, 8th and 9th seconds (corresponding to the 7th, 8th and 9th violation images) intersects with a voiced speech segment; for example, a voiced speech segment starts at the 8th second and ends at the 11th second. The video segment formed by the 7th, 8th and 9th seconds is then a voiced video segment, which indicates that the teacher was not dozing or yawning in this segment. Accordingly, among the 20 groups of video segments determined in step S103, suppose 10 groups intersect with voiced speech segments; the remaining 10 groups of unvoiced video segments are taken as candidate segments.
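A sketch of step S302 under the assumption that both the video segments and the voiced speech segments are represented as (start second, end second) intervals; the helper names are illustrative.

def silent_candidates(video_segments, voiced):
    """Keep only the video segments that do not overlap any voiced speech interval."""
    def overlaps(seg, voc):
        return seg[0] <= voc[1] and voc[0] <= seg[1]   # closed intervals intersect
    return [seg for seg in video_segments
            if not any(overlaps(seg, voc) for voc in voiced)]

# e.g. segment (7, 9) overlaps a voiced interval (8, 11) and is removed,
# while a fully silent segment such as (21, 25) is kept as a candidate.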
And step S105, determining the video clips with the illegal behaviors in the at least one group of candidate video clips through a pre-trained behavior recognition model.
Specifically, the behavior recognition model is a first behavior recognition model or a second behavior recognition model, wherein the first behavior recognition model is used for recognizing whether the violation behavior of dozing with closed eyes exists in the candidate video segment; the second behavior recognition model is used for recognizing whether the violation behavior of yawning with an open mouth exists in the candidate video segment.
In a possible implementation manner, a training process of the behavior recognition model is shown in fig. 4, and specifically includes:
and step S400, acquiring historical violation behavior fragments and historical compliance behavior fragments.
Specifically, when the behavior recognition model is the first behavior recognition model, the person in a historical violation behavior segment is dozing with closed eyes, and the person in a historical compliance behavior segment has open eyes and is not dozing; when the behavior recognition model is the second behavior recognition model, the person in a historical violation behavior segment is yawning with an open mouth, and the person in a historical compliance behavior segment has a closed mouth and is not yawning.
In the embodiment of the present invention, besides dozing with closed eyes, open eyes without dozing, yawning with an open mouth, and not yawning, the behavior recognition model may classify other violation and compliance behaviors; the specific behaviors are determined according to the actual use situation, and the embodiment of the present invention is not limited thereto.
Step S401, training the behavior recognition model according to the historical violation behavior segments and the historical compliance behavior segments, wherein the behavior recognition model is a deep learning neural network model.
In the embodiment of the invention, the first behavior recognition model or the second behavior recognition model for recognizing candidate video segments is generated through the above method, and the model is then used to determine the violation behaviors in the candidate video segments, that is, the video segments in which the person is dozing with closed eyes or yawning with an open mouth, which improves the efficiency and accuracy of determining the video segments with violation behaviors among the plurality of candidate video segments.
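As an illustration only, such a clip-level behavior recognition model could be a small 3D convolutional network over short face-region clips; the architecture and the tensor shapes below are assumptions, not the network disclosed by the embodiment.

import torch
import torch.nn as nn

class ClipClassifier(nn.Module):
    """Classify a short clip as violation (e.g. dozing) or compliance."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(32, 2)      # two-way output: violation vs. compliance

    def forward(self, clips):                   # clips: (batch, 3, frames, H, W)
        x = self.features(clips).flatten(1)
        return self.classifier(x)

model = ClipClassifier()
logits = model(torch.randn(2, 3, 16, 112, 112))   # two dummy clips of 16 frames each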
In a possible implementation manner, before step S105, the method further includes: determining at least one group of face region video clips in the at least one group of candidate video clips through a pre-trained face detection model, wherein the face detection model is used for acquiring the face regions of the candidate video clips, and a face region video clip is a video clip formed by the face regions cropped from a candidate video clip.
Specifically, the face detection model is a deep learning neural network model trained according to historical face data.
In the embodiment of the present invention, take one image from a candidate video segment as an example, as shown in fig. 5. Suppose the image contains the faces of two people, one with a larger face area and one with a smaller face area; for clarity, the larger face is face 1 in fig. 5 and the smaller face is face 2 in fig. 5. The largest face area in each image of the candidate video segment is determined according to the face detection model and cropped out, so that the cropped image contains only face 1, as shown in fig. 6. The cropped images are then recombined to generate a new face region video segment.
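A sketch of this face-region step, using an OpenCV Haar-cascade detector as an illustrative stand-in for the pre-trained face detection model: the largest detected face in each frame is cropped and the crops are recombined into a new clip. The output size and detector parameters are assumptions of the example.

import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_region_clip(frames, size=(112, 112)):
    """Crop the largest face from each frame and return the face-region clip."""
    cropped = []
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 0:
            continue                                        # skip frames with no detectable face
        x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # face with the largest area
        cropped.append(cv2.resize(frame[y:y + h, x:x + w], size))
    return cropped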
In the embodiment of the present invention, when a new face area video segment needs to be generated, as shown in fig. 7, the method for behavior detection specifically includes the following steps:
and step S700, acquiring the video stream.
Step S701, extracting a plurality of images from the video stream according to a set frequency.
Step S702, determining violation images in the multiple images through a pre-trained classification model.
Step S703, in response to that the plurality of images include at least one group of consecutive illegal images, determining at least one group of video clips corresponding to the at least one group of consecutive illegal images, where the number of images of the consecutive illegal images is greater than or equal to a set numerical value.
Step S704, determining at least one group of candidate video segments in the at least one group of video segments, wherein the candidate video segments are silent video segments.
Step S705, determining at least one group of face region video segments in the at least one group of candidate video segments through a pre-trained face detection model.
Step S706, determining the video clips with the illegal behaviors in the at least one group of face region video clips through a pre-trained behavior recognition model.
In the embodiment of the invention, the video segments with violation behaviors are determined among the face region video segments by the above method; because the images are cropped to the face region, the face region is clearer and the detection result is more accurate.
In a possible implementation manner, a complete data flow diagram corresponding to the behavior detection method is shown in fig. 8. Taking online teaching as an example: after the online lesson ends, S801 is executed to acquire the teacher's lesson playback video stream. Then S8021 (frame extraction from the video stream) and S8022 (extraction of the audio stream corresponding to the video stream) are executed at the same time. After S8021, S8031 is executed to judge each frame image through the classification model and determine the violation images; after S8022, S8032 is executed to analyze the audio stream through VAD and determine the voiced speech segments. After S8031, S804 is executed to determine at least one group of video segments corresponding to at least one group of consecutive violation images. Based on S804 and S8032, S805 is executed to determine at least one group of silent video segments, S806 is executed to determine at least one group of face region video segments through the face detection model, and S807 is executed to determine, through the behavior recognition model, the video segments with violation behaviors among the at least one group of face region video segments.
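Composing the sketches above, the data flow of fig. 8 could be orchestrated roughly as follows; extract_frames, voiced_intervals, violation_segments, silent_candidates and face_region_clip refer to the hypothetical helpers sketched earlier, and frame_classifier and behavior_recognizer stand for the pre-trained classification and behavior recognition models, both assumptions of this sketch.

def detect_violations(video_path, audio_samples, sample_rate,
                      frame_classifier, behavior_recognizer):
    frames = extract_frames(video_path, frames_per_second=1.0)        # S8021
    voiced = voiced_intervals(audio_samples, sample_rate)             # S8022 + S8032
    flags = [frame_classifier(img) for _, img in frames]              # S8031
    segments = violation_segments(flags, set_value=3)                 # S804
    candidates = silent_candidates(segments, voiced)                  # S805
    results = []
    for start, end in candidates:
        clip = [img for t, img in frames if start <= t <= end]
        face_clip = face_region_clip(clip)                            # S806
        if face_clip and behavior_recognizer(face_clip):              # S807
            results.append((start, end))
    return results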
In the embodiment of the invention, if the number of video segments with violation behaviors in a teacher's single lesson is greater than a set value, or the average number of such video segments over multiple lessons is greater than the set value, the teacher is reminded of the behavior so that he or she pays attention to, standardizes and corrects the teaching behavior; similarly, meeting participants can be reminded.
Fig. 9 is a schematic diagram of an apparatus for behavior detection according to an embodiment of the present invention. As shown in fig. 9, the apparatus of the present embodiment includes an acquisition unit 901, a processing unit 902, and a determination unit 903.
The acquiring unit 901 is configured to acquire a video stream; the processing unit 902 is configured to extract a plurality of images from the video stream according to a set frequency; the determining unit 903 is configured to determine, through a pre-trained classification model, the violation images among the plurality of images; the determining unit 903 is further configured to determine, in response to the plurality of images including at least one group of consecutive violation images, at least one group of video segments corresponding to the at least one group of consecutive violation images, where the number of images in each group of consecutive violation images is greater than or equal to a set value; the determining unit 903 is further configured to determine at least one group of candidate video segments among the at least one group of video segments, where the candidate video segments are silent video segments; the determining unit 903 is further configured to determine, through a pre-trained behavior recognition model, the video segments with violation behaviors among the at least one group of candidate video segments.
In the embodiment of the invention, violation behaviors in the video stream can be detected automatically by extracting images, detecting violation images, generating candidate video segments, and determining the video segments with violation behaviors among the candidate video segments, which reduces labor consumption and improves detection efficiency.
Fig. 10 is a schematic diagram of an electronic device of an embodiment of the invention. The electronic device shown in fig. 10 is a general-purpose behavior detection apparatus, which includes a general-purpose computer hardware structure, including at least a processor 1001 and a memory 1002. The processor 1001 and the memory 1002 are connected by a bus 1003. The memory 1002 is adapted to store instructions or programs executable by the processor 1001. Processor 1001 may be a stand-alone microprocessor or may be a collection of one or more microprocessors. Thus, the processor 1001 implements the processing of data and the control of other devices by executing instructions stored by the memory 1002 to perform the method flows of embodiments of the present invention as described above. The bus 1003 connects the above components together, and also connects the above components to a display controller 1004 and a display device and an input/output (I/O) device 1005. Input/output (I/O) devices 1005 may be a mouse, keyboard, modem, network interface, touch input device, motion sensing input device, printer, and other devices known in the art. Typically, input/output devices 1005 are connected to the system through an input/output (I/O) controller 1006.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, various aspects of embodiments of the invention may take the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a "circuit," "module," or "system." Furthermore, various aspects of embodiments of the invention may take the form of: a computer program product embodied in one or more computer readable media having computer readable program code embodied thereon.
Any combination of one or more computer-readable media may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of embodiments of the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to: electromagnetic, optical, or any suitable combination thereof. The computer readable signal medium may be any of the following computer readable media: is not a computer readable storage medium and may communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of embodiments of the present invention may be written in any combination of one or more programming languages, including: object oriented programming languages such as Java, Smalltalk, C++, and the like; and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention described above describe various aspects of embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (13)

1. A method of behavior detection, the method comprising:
acquiring a video stream;
extracting a plurality of images from the video stream according to a set frequency;
determining violation images in the multiple images through a pre-trained classification model;
in response to at least one group of continuous illegal images in the multiple images, determining at least one group of video clips corresponding to the at least one group of continuous illegal images, wherein the number of images of the continuous illegal images is greater than or equal to a set numerical value;
determining at least one group of candidate video segments in the at least one group of video segments, wherein the candidate video segments are silent video segments;
and determining the video clips with the illegal behaviors in the at least one group of candidate video clips through a pre-trained behavior recognition model.
2. The method of claim 1, wherein the determining at least one candidate set of video segments from the at least one set of video segments comprises:
determining a voiced speech segment in the video stream;
removing video segments in the at least one group of video segments, wherein the video segments intersect with the voiced speech segments, and determining at least one group of candidate video segments in the at least one group of video segments, which are not voiced.
3. The method of claim 2, wherein the determining the voiced speech segments in the video stream comprises:
determining an audio stream corresponding to the video stream;
and determining the voiced speech segments in the audio stream through speech endpoint detection.
4. The method of claim 1, further comprising:
and determining at least one group of face region video clips in the at least one group of candidate video clips through a pre-trained face detection model, wherein the face detection model is used for acquiring the face regions of the candidate video clips, and the face region video clips are video clips formed by face regions intercepted from the candidate video clips.
5. The method according to claim 4, wherein the determining, through a pre-trained behavior recognition model, the video segments of the at least one group of candidate video segments having the illegal behavior specifically comprises:
and determining the video clips with the illegal behaviors in the at least one group of face region video clips through a pre-trained behavior recognition model.
6. The method of claim 1, wherein the classification model is a first classification model or a second classification model, wherein the first classification model is used for judging whether a person in the image closes eyes; the second classification model is used for judging whether the person in the image opens the mouth or not.
7. The method of claim 1, wherein the behavior recognition model is a first behavior recognition model or a second behavior recognition model, wherein the first behavior recognition model is used for recognizing whether the violation behavior of dozing with closed eyes exists in the candidate video segment; the second behavior recognition model is used for recognizing whether the violation behavior of yawning with an open mouth exists in the candidate video segment.
8. The method of claim 1, wherein the training process of the classification model comprises:
acquiring a historical violation behavior image and a historical compliance behavior image;
and training the classification model according to the historical violation behavior image and the historical compliance behavior image, wherein the classification model is a binary classification model.
9. The method of claim 1, wherein the training process of the behavior recognition model comprises:
acquiring historical violation behavior fragments and historical compliance behavior fragments;
and training the behavior recognition model according to the historical violation behavior fragments and the historical compliance behavior fragments, wherein the behavior recognition model is a deep learning neural network model.
10. The method of claim 4, wherein the face detection model is a deep learning neural network model trained from historical face data.
11. An apparatus for behavior detection, the apparatus comprising:
an acquisition unit configured to acquire a video stream;
the processing unit is used for extracting a plurality of images from the video stream according to a set frequency;
the determining unit is used for determining violation images in the multiple images through a pre-trained classification model;
the determining unit is further configured to determine, in response to at least one group of consecutive illegal images included in the plurality of images, at least one group of video segments corresponding to the at least one group of consecutive illegal images, where the number of images of the consecutive illegal images is greater than or equal to a set numerical value;
the determining unit is further configured to determine at least one set of candidate video segments of the at least one set of video segments, where the candidate video segments are silent video segments;
the determining unit is further configured to determine, through a pre-trained behavior recognition model, a video segment in which an illegal behavior exists in the at least one group of candidate video segments.
12. A computer-readable storage medium on which computer program instructions are stored, which, when executed by a processor, implement the method of any one of claims 1-10.
13. An electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method of any of claims 1-10.
CN202110326159.2A 2021-03-26 2021-03-26 Behavior detection method and device, readable storage medium and electronic equipment Pending CN113065444A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110326159.2A CN113065444A (en) 2021-03-26 2021-03-26 Behavior detection method and device, readable storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110326159.2A CN113065444A (en) 2021-03-26 2021-03-26 Behavior detection method and device, readable storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN113065444A true CN113065444A (en) 2021-07-02

Family

ID=76564036

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110326159.2A Pending CN113065444A (en) 2021-03-26 2021-03-26 Behavior detection method and device, readable storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113065444A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102013176A (en) * 2010-12-01 2011-04-13 曹乃承 Online learning system
CN110798703A (en) * 2019-11-04 2020-02-14 云目未来科技(北京)有限公司 Method and device for detecting illegal video content and storage medium
CN111324764A (en) * 2020-02-18 2020-06-23 北京金山安全软件有限公司 Image detection method and device, electronic equipment and storage medium
CN111523510A (en) * 2020-05-08 2020-08-11 国家***邮政业安全中心 Behavior recognition method, behavior recognition device, behavior recognition system, electronic equipment and storage medium
CN111918122A (en) * 2020-07-28 2020-11-10 北京大米科技有限公司 Video processing method and device, electronic equipment and readable storage medium
CN111914803A (en) * 2020-08-17 2020-11-10 华侨大学 Lip language keyword detection method, device, equipment and storage medium
CN112185406A (en) * 2020-09-18 2021-01-05 北京大米科技有限公司 Sound processing method, sound processing device, electronic equipment and readable storage medium
CN112232273A (en) * 2020-11-02 2021-01-15 上海翰声信息技术有限公司 Early warning method and system based on machine learning identification image
CN112492343A (en) * 2020-12-16 2021-03-12 浙江大华技术股份有限公司 Video live broadcast monitoring method and related device

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113705370A (en) * 2021-08-09 2021-11-26 百度在线网络技术(北京)有限公司 Method and device for detecting illegal behavior of live broadcast room, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN107818798B (en) Customer service quality evaluation method, device, equipment and storage medium
WO2017084197A1 (en) Smart home control method and system based on emotion recognition
CN108898115B (en) Data processing method, storage medium and electronic device
WO2019196205A1 (en) Foreign language teaching evaluation information generating method and apparatus
WO2019218427A1 (en) Method and apparatus for detecting degree of attention based on comparison of behavior characteristics
JP2017016566A (en) Information processing device, information processing method and program
CN111027486A (en) Auxiliary analysis and evaluation system and method for big data of teaching effect of primary and secondary school classroom
WO2020214316A1 (en) Artificial intelligence-based generation of event evaluation report
CN115205764B (en) Online learning concentration monitoring method, system and medium based on machine vision
CN111901627B (en) Video processing method and device, storage medium and electronic equipment
CN111489819A (en) Method, server and computer readable medium for detecting cognitive and language disorders
CN113591489B (en) Voice interaction method and device and related equipment
Alyuz et al. An unobtrusive and multimodal approach for behavioral engagement detection of students
CN112101074A (en) Online education auxiliary scoring method and system
CN113920534A (en) Method, system and storage medium for extracting video highlight
CN113065444A (en) Behavior detection method and device, readable storage medium and electronic equipment
Katsimerou et al. A computational model for mood recognition
CN116825288A (en) Autism rehabilitation course recording method and device, electronic equipment and storage medium
Jiang et al. A classroom concentration model based on computer vision
CN111199378A (en) Student management method, student management device, electronic equipment and storage medium
Gupta et al. REDE-Detecting human emotions using CNN and RASA
CN109192211A (en) Method, device and equipment for recognizing voice signal
CN113052088A (en) Image processing method and device, readable storage medium and electronic equipment
CN115171335A (en) Image and voice fused indoor safety protection method and device for elderly people living alone
CN114898251A (en) Data processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination