CN115019390A - Video data processing method and device and electronic equipment - Google Patents


Info

Publication number
CN115019390A
CN115019390A
Authority
CN
China
Prior art keywords
target
video data
action
detection result
frame image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210583066.2A
Other languages
Chinese (zh)
Inventor
杨咏臻
程一晟
蒋智文
熊子良
曹启云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210583066.2A priority Critical patent/CN115019390A/en
Publication of CN115019390A publication Critical patent/CN115019390A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 - Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a video data processing method and apparatus, and an electronic device, and relates to the field of artificial intelligence, in particular to video analysis. The specific implementation scheme is as follows: acquiring video data; acquiring a temporal frame sequence from the video data, where the temporal frame sequence comprises multiple frame images within a preset time period; and detecting, from the multiple frames in the temporal frame sequence, whether a target subject in the video data performs a predetermined action, to obtain a target detection result.

Description

Video data processing method and device and electronic equipment
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a video data processing method and apparatus, and an electronic device.
Background
In the related art, whether a video contains violating actions is typically identified and audited by comprehensively analyzing static frames across multiple basic audit types. This approach generally suffers from low action-recognition accuracy and a heavy workload whenever the audit criteria are adjusted.
Disclosure of Invention
The present disclosure provides a video data processing method and apparatus, and an electronic device.
According to an aspect of the present disclosure, there is provided a video data processing method, including: acquiring video data; acquiring a temporal frame sequence from the video data, where the temporal frame sequence comprises multiple frame images within a preset time period; and detecting, from the multiple frames in the temporal frame sequence, whether a target subject in the video data performs a predetermined action, to obtain a target detection result.
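The claimed method can be sketched in a few lines. This is a hypothetical rendering, not the disclosure's implementation: `sample_temporal_frames` and `detect_predetermined_action` are illustrative names, and `detector` stands in for the patent's unspecified model.

```python
from typing import Callable, List, Sequence, Tuple

def sample_temporal_frames(frames: Sequence, start: int,
                           count: int, stride: int) -> List:
    """Keep `count` frames starting at `start`, one every `stride`
    frames, preserving temporal order (the 'temporal frame sequence')."""
    end = min(len(frames), start + count * stride)
    return [frames[i] for i in range(start, end, stride)]

def detect_predetermined_action(
        frames: Sequence,
        detector: Callable[[Sequence], Tuple[bool, float]]) -> Tuple[bool, float]:
    """Hand the ordered frame window to an action detector; returns
    (has_predetermined_action, confidence)."""
    return detector(frames)
```

The key point the sketch captures is that the detector receives the ordered window as a whole, not one frame at a time.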
Optionally, detecting, from the multiple frames in the temporal frame sequence, whether a target subject in the video data performs a predetermined action to obtain a target detection result includes: extracting a first frame from the multiple frames; detecting whether the first frame includes the target subject; and, when the detection result is that the first frame includes the target subject, detecting whether the target subject in the video data performs the predetermined action, to obtain the target detection result.
Optionally, detecting, from the multiple frames in the temporal frame sequence, whether a target subject in the video data performs a predetermined action to obtain a target detection result includes: cropping the multiple frames in the temporal frame sequence to obtain multiple cropped images that include the target subject, where the area occupied by the target subject in each cropped image exceeds a preset threshold; and detecting, based on the cropped images, whether the target subject in the video data performs the predetermined action, to obtain the target detection result.
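The cropping constraint above (the subject must occupy at least a preset share of the cropped image) can be sketched geometrically. The function name and the 0.25 ratio in the test are illustrative assumptions, not values from the disclosure:

```python
import math

def crop_around_subject(frame_w, frame_h, box, min_ratio=0.5):
    """box = (x, y, w, h) of the subject. Return (x0, y0, x1, y1): a crop
    centered on the subject whose area is at most box_area / min_ratio,
    clamped to the frame, so the subject fills at least min_ratio of the
    crop whenever the crop fits inside the frame."""
    x, y, w, h = box
    target_area = (w * h) / min_ratio          # largest crop area allowed
    scale = math.sqrt(target_area / (w * h))   # uniform enlargement factor
    cw, ch = w * scale, h * scale
    cx, cy = x + w / 2, y + h / 2              # subject center
    x0 = max(0.0, cx - cw / 2)
    y0 = max(0.0, cy - ch / 2)
    x1 = min(float(frame_w), cx + cw / 2)
    y1 = min(float(frame_h), cy + ch / 2)
    return x0, y0, x1, y1
```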
Optionally, detecting, from the multiple frames in the temporal frame sequence, whether a target subject in the video data performs a predetermined action to obtain a target detection result includes: detecting, based on the multiple frames in the temporal frame sequence, whether the action type of the target subject in the video data belongs to a predetermined action type, where actions of the predetermined action type include the predetermined action; and, when the detection result is that the action type belongs to the predetermined action type, detecting whether the target subject in the video data performs the predetermined action, to obtain the target detection result.
Optionally, detecting, from the multiple frames in the temporal frame sequence, whether a target subject in the video data performs a predetermined action to obtain a target detection result includes: inputting the multiple frames in the temporal frame sequence into an action detection model to obtain the confidence that the action of the target subject in the video data is the predetermined action, together with the target type to which the action belongs, where the action detection model is trained on multiple groups of sample data, each group comprising multiple sample frames and a sample detection result indicating whether the sample subject in those frames performs the predetermined action; and determining the target detection result based on the confidence and the target type.
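The final decision step, combining the model's (type, confidence) output with per-category thresholds, can be sketched as follows; the threshold-table structure and category names are illustrative, not taken from the disclosure:

```python
def target_detection_result(action_type: str, confidence: float,
                            thresholds: dict) -> bool:
    """True when the detected action type has a configured threshold and
    the model's confidence meets it; False otherwise (including when the
    type is not one being audited)."""
    threshold = thresholds.get(action_type)
    return threshold is not None and confidence >= threshold
```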
Optionally, the method further includes: when the target detection result is that the target subject in the video data performs the predetermined action, acquiring a target video clip, where the target video clip is captured from the video data; and determining a final detection result based on the target detection result and the target video clip, where the final detection result is used to identify whether the video data belongs to a predetermined video type.
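Retaining the evidence clip alongside the detection result for downstream (e.g. manual) review can be sketched minimally; the field names and verdict strings below are assumptions:

```python
from typing import Optional

def final_result(has_action: bool, clip_path: Optional[str]) -> dict:
    """Pair the target detection result with the captured clip so that a
    reviewer receives both; drop the clip when nothing was detected."""
    if has_action:
        return {"verdict": "violation_suspected", "clip": clip_path}
    return {"verdict": "pass", "clip": None}
```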
According to another aspect of the present disclosure, there is provided a video data processing apparatus, including: a first acquisition module configured to acquire video data; a second acquisition module configured to acquire a temporal frame sequence from the video data, where the temporal frame sequence comprises multiple frame images within a preset time period; and a detection module configured to detect, from the multiple frames in the temporal frame sequence, whether a target subject in the video data performs a predetermined action, to obtain a target detection result.
Optionally, the detection module includes: an extraction unit configured to extract a first frame from the multiple frames; a first detection unit configured to detect whether the first frame includes the target subject; and a second detection unit configured to detect, when the detection result is that the first frame includes the target subject, whether the target subject in the video data performs the predetermined action, to obtain the target detection result.
Optionally, the detection module includes: an image processing unit configured to crop the multiple frames in the temporal frame sequence to obtain multiple cropped images that include the target subject, where the area occupied by the target subject in each cropped image exceeds a preset threshold; and a third detection unit configured to detect, based on the cropped images, whether the target subject in the video data performs the predetermined action, to obtain the target detection result.
Optionally, the detection module includes: a fourth detection unit configured to detect, based on the multiple frames in the temporal frame sequence, whether the action type of the target subject in the video data belongs to a predetermined action type, where actions of the predetermined action type include the predetermined action; and a fifth detection unit configured to detect, when the detection result is that the action type belongs to the predetermined action type, whether the target subject in the video data performs the predetermined action, to obtain the target detection result.
Optionally, the detection module includes: a computing unit configured to input the multiple frames in the temporal frame sequence into an action detection model to obtain the confidence that the action of the target subject in the video data is the predetermined action, together with the target type to which the action belongs, where the action detection model is trained on multiple groups of sample data, each group comprising multiple sample frames and a sample detection result indicating whether the sample subject in those frames performs the predetermined action; and a determining unit configured to determine the target detection result based on the confidence and the target type.
Optionally, the apparatus further includes: a third acquisition module configured to acquire, when the target detection result is that the target subject in the video data performs the predetermined action, a target video clip captured from the video data; and a determining module configured to determine a final detection result based on the target detection result and the target video clip, where the final detection result is used to identify whether the video data belongs to a predetermined video type.
According to still another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any of the above-described methods.
According to yet another aspect of the disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform any of the above methods.
According to yet another aspect of the disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements any of the above-described methods.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor to limit its scope. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a flowchart of a video data processing method provided according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart of a conventional scheme for auditing dance movements according to an embodiment of the disclosure;
FIG. 3 is a flow chart of a video data processing method provided in accordance with an alternative embodiment of the present disclosure;
fig. 4 is a block diagram of a video data processing apparatus provided according to an embodiment of the present disclosure;
fig. 5 is a schematic block diagram of an example electronic device provided in accordance with an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In an embodiment of the present disclosure, a video data processing method is provided. Fig. 1 is a flowchart of the video data processing method; as shown in fig. 1, the method includes:
step S102, acquiring video data;
step S104, acquiring a temporal frame sequence from the video data, where the temporal frame sequence comprises multiple frame images within a preset time period;
step S106, detecting, from the multiple frames in the temporal frame sequence, whether a target subject in the video data performs a predetermined action, to obtain a target detection result.
Because a temporal frame sequence is a set of consecutive frames ordered in time, it carries far richer context than a single static frame. By obtaining the temporal frame sequence from the video data and combining it with a suitable algorithm, action recognition and auditing can be completed directly and accurately; when the audit criteria change, only the criteria for the specific actions need to be adjusted. This largely solves the problems of low action-recognition accuracy and the heavy workload of adjusting audit criteria.
As an alternative embodiment, detecting whether a target subject in the video data performs a predetermined action from the multiple frames in the temporal frame sequence may proceed as follows: extract a first frame from the multiple frames; detect whether the first frame includes the target subject; and, only when the first frame includes the target subject, detect whether the target subject performs the predetermined action to obtain the target detection result. That is, subject detection is first performed on the first frame of the temporal frame sequence to judge whether a subject requiring detection is present; action detection runs only when the first frame contains the target subject, and detection ends immediately when it does not. This first-frame subject check allows the temporal frame sequence to be filtered quickly, eliminating unnecessary action-recognition passes and improving data processing efficiency.
As an alternative embodiment, the detection may proceed as follows: crop the multiple frames in the temporal frame sequence to obtain multiple cropped images that include the target subject, where the area the subject occupies in each cropped image exceeds a preset threshold; then detect, based on the cropped images, whether the target subject performs the predetermined action. When the target subject is present, cropping the frames removes image data irrelevant to the subject, avoiding interference with action recognition and improving the efficiency of recognizing the subject's actions.
When cropping the frames, the images may be preprocessed in different ways depending on the actual application.
As an alternative embodiment, the detection may proceed as follows: based on the multiple frames in the temporal frame sequence, detect whether the action type of the target subject in the video data belongs to a predetermined action type, where actions of the predetermined action type include the predetermined action; only when the action type belongs to the predetermined action type is it further detected whether the target subject performs the predetermined action. The action type of the target subject is thus checked once, and detailed action detection runs only when a predetermined action type is present. This preliminary type check effectively filters out actions that do not belong to the predetermined type, reduces the amount of temporal frame data processed for detecting the predetermined action, and thereby improves detection efficiency.
As an alternative embodiment, the detection may proceed as follows: input the multiple frames in the temporal frame sequence into an action detection model to obtain the confidence that the action of the target subject in the video data is the predetermined action, together with the target type to which the action belongs, where the action detection model is trained on multiple groups of sample data, each group comprising multiple sample frames and a sample detection result indicating whether the sample subject in those frames performs the predetermined action; the target detection result is then determined from the confidence and the target type. Computing a confidence for the predetermined action identified in the video data converts the reliability of the detection result into a concrete value, which can be compared with a preset threshold to further judge whether the action is the predetermined action, improving the accuracy and reliability of the result. In addition, using an action detection model makes detection efficient, and since the model can be trained on a large volume of rich samples, accuracy is effectively improved.
As an optional embodiment, when the target detection result is that the target subject in the video data performs the predetermined action, a target video clip is acquired, where the target video clip is captured from the video data; a final detection result is then determined from the target detection result and the target video clip, where the final detection result identifies whether the video data belongs to a predetermined video type. Determining the final result from both the detection result and the clip not only yields the result of the preceding detection of the predetermined action but also retains the corresponding video clip in the final result, facilitating subsequent manual review or other processing.
Based on the above embodiments and alternative embodiments, an alternative implementation is provided.
In this alternative embodiment, live-streaming data is taken as an example of the video data. Live data comes from many types of streamers, such as entertainment, outdoor, and gaming streamers. Regulation of online content is becoming stricter, and companies that take no measures will struggle to survive; content-audit services are therefore extremely important.
Auditing of live streams mainly focuses on images, audio, and text. Image auditing reviews the live content itself, generally by extracting frames from the video stream and then reviewing them with AI operators.
However, as regulatory requirements increase, existing audit types can no longer meet demand, and more audit types have emerged. At present, many live rooms contain vulgar dance actions by entertainment streamers, and such actions need to be cracked down on.
Conventional auditing cannot meet this need: judging such violations requires analyzing the person's motion information, i.e., a series of temporal frames. An ordinary static image cannot support the judgment, so the risk of misjudgment is high. A model must judge from the person's movements whether the action is a violation before a final penalty can be imposed.
Taking the auditing of dance actions as an example, fig. 2 is a schematic flowchart of a conventional scheme for auditing dance actions; as shown in fig. 2, the scheme includes the following steps:
(1) the media service module pulls the stream from the live-broadcast service provider to obtain, in real time, the latest live stream that needs to be audited;
(2) after pulling the stream from the external network, the media service forwards it to a server on the internal network, where a decoding module decodes the video stream into a temporal frame sequence;
(3) the frame-extraction module extracts frames according to a strategy configured in advance by the developer, such as 1 frame per second or 1 frame every 3 seconds, and outputs static frame images;
(4) the static frames are distributed to AI operator services, such as pornography detection, suggestive-content detection, and vulgar-motion detection;
(5) after all AI operator services return results, a vulgar-dance judgment module performs a comprehensive judgment, mainly as follows:
a) set a time window, such as 20 seconds, used mainly to tally the AI operator results of different categories within those 20 s;
b) after tallying the classification results in the time window, judge whether the streamer has committed a violation according to a configured violation-threshold ratio for each result; the threshold ratio for each result can be configured independently;
c) when a new regulatory requirement must be accommodated, an AI operator can be added, or the violation-threshold ratio of an existing AI operator can be adjusted, to achieve the goal.
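The time-window comprehensive judgment in steps a) through c) amounts to threshold-ratio voting over per-frame operator labels. A hypothetical sketch (label names and ratios are illustrative):

```python
from collections import Counter

def window_violation(frame_labels, thresholds):
    """frame_labels: per-frame AI-operator labels collected within one
    time window (e.g. 20 s). The stream is flagged when any label's share
    of the window reaches its configured violation-threshold ratio."""
    counts = Counter(frame_labels)
    total = len(frame_labels)
    for label, ratio in thresholds.items():
        if total and counts[label] / total >= ratio:
            return True
    return False
```

This also makes the scheme's stated weakness visible: every ratio in `thresholds` must be re-tuned whenever the audit scale changes.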
However, the above scheme has the following disadvantages:
(1) static frames obtained by frame extraction carry little information and support only conventional detection such as pornography or misconduct, so the approach is inherently limited;
(2) judging vulgar dance requires comprehensively combining the audit results of a series of AI operators; the violation threshold of each operator is hard to tune, every adjustment of the audit scale requires careful re-tuning, and the workload is large;
(3) violation judgment for complex actions is inaccurate, and when a user performs a new kind of violating action, the system may fail to detect it or detect it with low accuracy;
(4) overall detection accuracy is low, so a large volume of data is subsequently pushed for human review and the overall cost is not significantly reduced.
The scheme provided by this alternative embodiment of the disclosure is designed and implemented on the basis of temporal-frame-sequence detection: the previous processing of static frames is replaced by processing of a dynamic temporal frame sequence, and an AI operator responsible for preprocessing is added for data screening, improving the accuracy of the final result.
Fig. 3 is a flowchart of a video data processing method according to an alternative embodiment of the disclosure, and as shown in fig. 3, the overall flow of the scheme is as follows:
(1) the media service module pulls the stream from the live-broadcast service provider to obtain, in real time, the latest live stream that needs to be audited;
(2) after pulling the stream from the external network, the media service forwards it to a server on the internal network, where the live stream within a period of time is decoded and sampled, e.g., 10 frames per second for 5 seconds, giving 50 frames in total, which form the temporal frame sequence. Meanwhile, another module cuts the live stream into video clips for storage;
(3) the first frame is extracted from the decoded temporal frame sequence and subject detection is performed on it, i.e., detecting whether a subject requiring audit, such as a person, is present. If not, all frames and video clips are discarded to await the next analysis;
(4) if a subject is present, the parts of the picture irrelevant to the subject are cropped away, retaining as much of the subject as possible in preparation for subsequent judgment;
(5) a dance-classification AI operator runs prediction on the cropped temporal frame sequence to judge whether the person in the video is dancing. If not, all frames and video clips are discarded to await the next analysis;
(6) if the person in this batch of temporal frames is dancing, it is judged whether the dance is vulgar; the AI operator finally outputs the vulgar-dance type and its confidence;
(7) the system sets a judgment threshold for each vulgar-dance category according to the regulatory scale and the actual situation. Below the threshold, all frames and video clips are discarded to await the next analysis; otherwise, the audit result and the stored video clip are pushed to the business side, so that it can make a comprehensive judgment from the clip and the AI audit result.
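Steps (3) through (7) form an early-exit cascade, which can be sketched as one function. All model callables below are stand-ins for the scheme's AI operators, and the names are illustrative:

```python
def audit_window(frames, detect_subject, crop_to_subject,
                 is_dancing, classify_vulgar, thresholds):
    """One audit cycle over a decoded temporal frame window
    (e.g. 10 fps x 5 s). Returns a result dict, or None when the
    window is discarded to await the next analysis."""
    if not frames or not detect_subject(frames[0]):
        return None                                   # step (3): no subject
    cropped = [crop_to_subject(f) for f in frames]    # step (4): crop
    if not is_dancing(cropped):
        return None                                   # step (5): no dancing
    category, confidence = classify_vulgar(cropped)   # step (6): classify
    if confidence < thresholds.get(category, 1.0):
        return None                                   # step (7): below threshold
    return {"category": category, "confidence": confidence}
```

Each early `return None` is where the preprocessing operators discard data before the expensive vulgar-dance classifier, which is the accuracy-and-cost argument the scheme makes.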
The key problems solved in the design and implementation of this solution are as follows:
(1) The static-picture problem
The limitations of static-picture auditing mainly include:
1) a static picture carries little information and supports only general basic audit types;
2) for dance auditing, the real actions and intentions of the person are hard to judge from a single picture, and only a comprehensive judgment relying on multiple audit types is possible;
3) auditing each type of vulgar dance requires combining different basic audit types, and subsequent operations such as threshold adjustment and adding new classifications consume considerable manpower.
In this alternative embodiment of the disclosure, the above problems are solved with a temporal frame sequence and a temporal AI model:
1) picture frames are extracted within a time period, for example 10 frames per second over 5 seconds of video, so the extracted frames carry temporal order and context information;
2) an AI model that accepts temporal data is trained to analyze the input temporal frame sequence rather than a single image. More information can thus be obtained from the context of the input data, overcoming the limitation of supporting only basic audits.
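The sampling arithmetic in point 1) is just a downsampling of the source frame rate; a minimal sketch (function name assumed, source fps values illustrative):

```python
def frame_indices(src_fps: float, sample_fps: float, seconds: float) -> list:
    """Indices of source frames to keep when downsampling a clip,
    e.g. a 25 fps stream sampled at 10 fps for 5 s keeps 50 frames,
    preserving temporal order and context."""
    step = src_fps / sample_fps
    return [round(i * step) for i in range(int(sample_fps * seconds))]
```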
(2) The audit-accuracy problem
The main reasons the traditional scheme has low audit accuracy are:
1) combined judgment over basic audit types requires manually analyzing each class of dance and carefully selecting the combination of classifications and per-category thresholds before different dance classes can be judged;
2) when the basic classification thresholds are adjusted, different dance classes affect one another: the threshold for one dance type cannot be judged precisely, so a chosen threshold may disturb other classes and cause misjudgment;
3) adding new classes or changing regulatory requirements forces adjustment of the basic audit types and thresholds, after which the accuracy of all classes must be tested globally.
In an alternative embodiment of the present disclosure, preprocessing and a time-series AI model are used to solve these problems:
1) a preprocessing AI algorithm passes the acquired time-series frame images through subject detection, image cropping and dance-judgment operators in sequence to confirm whether a person in the frames is dancing. Data that does not meet the conditions is discarded in advance, which avoids misjudgment by the subsequent vulgar-dance AI algorithm and improves accuracy;
2) the combined judgment over basic types is abandoned; instead, the relevant vulgar dance behaviors are subdivided as shown in the following table:
Chest shaking | Chest thrusting | M-squat | S-squat
Hip shaking | Crotch thrusting | Large-amplitude hip swinging | Others
A vulgar-dance AI algorithm that accepts the time-series frame map is used to identify the relevant vulgar-dance categories;
3) for a newly added dance category, only the relevant data needs to be collected and the AI operator retrained.
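The preprocessing cascade in 1) above (subject detection, then cropping, then a dance-judgment operator) can be sketched as follows. The operator functions are passed in as stubs because the disclosure does not specify concrete models; the names and interfaces are assumptions:

```python
def preprocess(frames, detect_subject, crop_to_subject, is_dancing):
    """Run the preprocessing cascade over one time-series frame map.

    Returns cropped frames ready for the vulgar-dance model, or None
    if the clip should be discarded before the expensive model runs.
    """
    boxes = [detect_subject(f) for f in frames]   # subject detection per frame
    if not all(boxes):                            # some frame has no subject
        return None
    cropped = [crop_to_subject(f, b) for f, b in zip(frames, boxes)]
    if not is_dancing(cropped):                   # subject is not dancing
        return None
    return cropped                                # hand off to the classifier
```

Discarding clips at this stage keeps non-qualifying data away from the more expensive time-series classifier, which is where the accuracy gain described above comes from.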
In experiments, this alternative embodiment of the present disclosure showed the following beneficial effects:
(1) the audit efficiency for vulgar-dance violations is improved: more than 5,000 streams are audited daily, amounting to roughly 11 million frame images and more than 110 hours of live broadcast. The result for a violating anchor can be pushed within 30 seconds, and combined with subsequent human review, a violating anchor can be judged at the minute level;
(2) the accuracy of violation auditing is high: 69% of the audit results pushed to human review are confirmed, meeting online real-time audit requirements.
In an embodiment of the present disclosure, there is also provided a video data processing apparatus, and fig. 4 is a block diagram of a structure of the video data processing apparatus provided according to the embodiment of the present disclosure, as shown in fig. 4, the apparatus includes: a first acquisition module 41, a second acquisition module 42 and a detection module 43, which will be explained below.
A first obtaining module 41, configured to obtain video data; a second obtaining module 42, connected to the first obtaining module 41, configured to obtain a time-series frame map from the video data, where the time-series frame map includes multiple frame images within a predetermined time period; and a detecting module 43, connected to the second obtaining module 42, for detecting whether a target subject in the video data has a predetermined behavior action according to the multiple frames of images in the time-series frame image, so as to obtain a target detection result.
As an alternative embodiment, the detection module 43 includes: the extraction unit is used for extracting a first frame image in the multi-frame images; the first detection unit is used for detecting whether the first frame image comprises a target main body or not; and the second detection unit is used for detecting whether the target subject in the video data has a preset behavior action or not under the condition that the detection result is that the first frame image comprises the target subject, so as to obtain a target detection result.
As an alternative embodiment, the detection module 43 includes: an image processing unit, configured to crop the multiple frame images in the time-series frame map to obtain multiple cropped images including the target subject, wherein the area occupied by the target subject in the cropped images exceeds a predetermined threshold; and a third detection unit, configured to detect, based on the multiple cropped images, whether the target subject in the video data has the predetermined behavior action, to obtain the target detection result.
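The cropping condition (the subject must occupy more than a predetermined share of the cropped image) can be sketched with simple bounding-box arithmetic; the 0.5 default threshold is an assumed value, not taken from the disclosure:

```python
def subject_area_ratio(frame_size, box):
    """Fraction of the frame area covered by the subject's bounding box."""
    frame_w, frame_h = frame_size
    x0, y0, x1, y1 = box
    return ((x1 - x0) * (y1 - y0)) / (frame_w * frame_h)

def needs_crop(frame_size, box, min_ratio=0.5):
    """True when the subject is too small in the frame and the image
    should be cropped down toward the subject's bounding box."""
    return subject_area_ratio(frame_size, box) < min_ratio
```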
As an alternative embodiment, the detection module 43 includes: a fourth detection unit configured to detect whether an action type of a target subject in the video data belongs to a predetermined action type based on the multiple frame images in the time-series frame image, wherein the action of the predetermined action type includes a predetermined behavior action; and the fifth detection unit is used for detecting whether the target body in the video data has the preset action or not to obtain a target detection result when the detection result is that the action type of the target body in the video data belongs to the preset action type.
As an alternative embodiment, the detection module 43 includes: a calculation unit, configured to input the multiple frame images in the time-series frame map into an action detection model to obtain a confidence that the action of the target subject in the video data is the predetermined behavior action, and a target type to which the action of the target subject belongs, wherein the action detection model is trained with multiple sets of sample data, each set including multiple sample frame images and a sample detection result indicating whether the sample subject in the sample frame images has the predetermined behavior action; and a determining unit, configured to determine the target detection result based on the confidence and the target type.
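The determining unit's combination of confidence and target type can be sketched as a simple rule. The label names below are rough translations of the subdivided classes listed earlier, and the 0.8 threshold is an illustrative assumption:

```python
# Rough translations of the subdivided vulgar-dance classes (assumed names).
VULGAR_DANCE_TYPES = {
    "chest_shaking", "chest_thrusting", "m_squat", "s_squat",
    "hip_shaking", "crotch_thrusting", "large_hip_swing", "other",
}

def target_detection_result(confidence, target_type, threshold=0.8):
    """Flag the video only when the predicted type is a vulgar-dance
    class AND the time-series model is sufficiently confident."""
    return target_type in VULGAR_DANCE_TYPES and confidence >= threshold
```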
As an alternative embodiment, the apparatus further includes: a third acquisition module, configured to acquire a target video segment when the target detection result is that the target subject in the video data has the predetermined behavior action, wherein the target video segment is a video clip cut from the video data; and a determining module, configured to determine a final detection result based on the target detection result and the target video segment, wherein the final detection result is used to identify whether the video data belongs to a predetermined video type.
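Cutting a target video segment around the detected violation reduces what the human reviewer must watch to a short evidence clip. A sketch of the boundary arithmetic; the 5-second padding on each side is an assumption (the disclosure does not specify a clip length):

```python
def clip_bounds(violation_second, duration, pre=5.0, post=5.0):
    """Start/end times (in seconds) of the evidence clip cut from the
    stream around the moment a violation was detected, clamped to the
    bounds of the available video."""
    start = max(0.0, violation_second - pre)
    end = min(duration, violation_second + post)
    return start, end
```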
In the technical solution of the present disclosure, the acquisition, storage and application of the personal information involved all comply with relevant laws and regulations and do not violate public order and good morals.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
Fig. 5 is a schematic block diagram of an example electronic device provided in accordance with an embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components illustrated herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 5, the electronic device 500 comprises a computing unit 501, which may perform various suitable actions and processes in accordance with a computer program stored in a read-only memory (ROM)502 or a computer program loaded from a storage unit 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the electronic apparatus 500 can also be stored. The computing unit 501, the ROM 502, and the RAM 503 are connected to each other by a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
A number of components in the electronic device 500 are connected to the I/O interface 505, including: an input unit 506, such as a keyboard, a mouse, or the like; an output unit 507, such as various types of displays, speakers, and the like; a storage unit 508, such as a magnetic disk, an optical disk, or the like; and a communication unit 509, such as a network card, a modem, or a wireless communication transceiver. The communication unit 509 allows the electronic device 500 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
The computing unit 501 may be a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of computing unit 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 501 performs the respective methods and processes described above, such as the video data processing method. For example, in some embodiments, the video data processing method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the video data processing method described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the video data processing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combining a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (15)

1. A video data processing method, comprising:
acquiring video data;
acquiring a time sequence frame image from the video data, wherein the time sequence frame image comprises a plurality of frame images within a preset time period;
and detecting whether a target main body in the video data has a preset behavior action or not according to the multi-frame image in the time sequence frame image to obtain a target detection result.
2. The method of claim 1, wherein the detecting whether a target subject in the video data has a predetermined behavior action according to the plurality of frames of images in the time-series frame image comprises:
extracting a first frame image in the multi-frame images;
detecting whether the first frame image comprises the target main body;
and under the condition that the detection result is that the target main body is included in the first frame image, detecting whether the target main body in the video data has the preset behavior action or not to obtain the target detection result.
3. The method according to claim 1, wherein the detecting whether a target subject in the video data has a predetermined behavior action according to the plurality of frames of images in the time-series frame image to obtain a target detection result comprises:
cropping the multiple frame images in the time-series frame map to obtain multiple cropped images including the target subject, wherein the size of the area occupied by the target subject in the cropped images exceeds a predetermined threshold;
and detecting, based on the multiple cropped images, whether the target subject in the video data has the predetermined behavior action, to obtain the target detection result.
4. The method of claim 1, wherein the detecting whether a target subject in the video data has a predetermined behavior action according to the plurality of frames of images in the time-series frame image comprises:
detecting whether an action type of the target subject in the video data belongs to a predetermined action type based on the multi-frame images in the time-series frame image, wherein the action of the predetermined action type comprises the predetermined behavior action;
and under the condition that the detection result is that the action type of the target subject in the video data belongs to a preset action type, detecting whether the target subject in the video data has a preset behavior action or not to obtain a target detection result.
5. The method according to claim 1, wherein the detecting whether a target subject in the video data has a predetermined behavior action according to the multiple frames of images in the time-series frame image to obtain a target detection result comprises:
inputting the multiple frame images in the time-series frame map into an action detection model to obtain a confidence that the action of the target subject in the video data is the predetermined behavior action, and a target type to which the action of the target subject belongs, wherein the action detection model is trained with multiple sets of sample data, each set comprising multiple sample frame images and a sample detection result of whether the sample subject in the sample frame images has the predetermined behavior action;
determining the target detection result based on the confidence and the target type.
6. The method of any of claims 1-5, wherein the method further comprises:
acquiring a target video clip under the condition that the target detection result is that the target subject in the video data has the predetermined behavior action, wherein the target video clip is a video intercepted from the video data;
determining a final detection result based on the target detection result and the target video segment, wherein the final detection result is used for identifying whether the video data comprises a predetermined video type.
7. A video data processing apparatus comprising:
the first acquisition module is used for acquiring video data;
the second acquisition module is used for acquiring a time sequence frame image from the video data, wherein the time sequence frame image comprises a plurality of frame images within a preset time period;
and the detection module is used for detecting whether a target main body in the video data has a preset behavior action or not according to the multi-frame image in the time sequence frame image to obtain a target detection result.
8. The apparatus of claim 7, wherein the detection module comprises:
the extraction unit is used for extracting a first frame image in the multi-frame images;
a first detection unit, configured to detect whether the target subject is included in the first frame image;
and the second detection unit is used for detecting whether the target subject in the video data has the predetermined behavior action or not under the condition that the detection result is that the first frame image comprises the target subject, so as to obtain the target detection result.
9. The apparatus of claim 7, wherein the detection module comprises:
an image processing unit, configured to crop the multiple frame images in the time-series frame map to obtain multiple cropped images including the target subject, wherein the size of the area occupied by the target subject in the cropped images exceeds a predetermined threshold;
and a third detection unit, configured to detect, based on the multiple cropped images, whether the target subject in the video data has the predetermined behavior action, to obtain the target detection result.
10. The apparatus of claim 7, wherein the detection module comprises:
a fourth detection unit configured to detect whether an action type of the target subject in the video data belongs to a predetermined action type based on the plurality of frame images in the time-series frame image, wherein the action of the predetermined action type includes the predetermined behavior action;
and a fifth detecting unit, configured to, when a detection result is that the motion type of the target subject in the video data belongs to a predetermined motion type, detect whether a predetermined behavioral motion exists in the target subject in the video data, and obtain a target detection result.
11. The apparatus of claim 7, wherein the detection module comprises:
a calculating unit, configured to input the multiple frame images in the time-series frame map into an action detection model to obtain a confidence that the action of the target subject in the video data is the predetermined behavioral action, and a target type to which the action of the target subject belongs, wherein the action detection model is trained with multiple sets of sample data, each set comprising multiple sample frame images and a sample detection result of whether the sample subject in the sample frame images has the predetermined behavior action;
a determining unit, configured to determine the target detection result based on the confidence and the target type.
12. The apparatus of any of claims 7 to 11, wherein the apparatus further comprises:
a third obtaining module, configured to obtain a target video segment when the target detection result indicates that the predetermined behavior action exists in the target subject in the video data, where the target video segment is a video captured from the video data;
a determining module, configured to determine a final detection result based on the target detection result and the target video segment, where the final detection result is used to identify whether the video data includes a predetermined video type.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 6.
14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1 to 6.
15. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 6.
CN202210583066.2A 2022-05-26 2022-05-26 Video data processing method and device and electronic equipment Pending CN115019390A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210583066.2A CN115019390A (en) 2022-05-26 2022-05-26 Video data processing method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210583066.2A CN115019390A (en) 2022-05-26 2022-05-26 Video data processing method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN115019390A true CN115019390A (en) 2022-09-06

Family

ID=83071834

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210583066.2A Pending CN115019390A (en) 2022-05-26 2022-05-26 Video data processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN115019390A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860030A (en) * 2019-04-24 2020-10-30 杭州海康威视数字技术股份有限公司 Behavior detection method, behavior detection device, behavior detection equipment and storage medium
CN117478838A (en) * 2023-11-01 2024-01-30 珠海经济特区伟思有限公司 Distributed video processing supervision system and method based on information security
CN117478838B (en) * 2023-11-01 2024-05-28 珠海经济特区伟思有限公司 Distributed video processing supervision system and method based on information security


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination