CN114913470A - Event detection method and device

Info

Publication number
CN114913470A
Authority
CN
China
Prior art keywords
event
video
neural network
detected
area
Prior art date
Legal status
Granted
Application number
CN202210815231.2A
Other languages
Chinese (zh)
Other versions
CN114913470B (en)
Inventor
刘智辉
余言勋
杜治江
牛中彬
黄宇
杨雪峰
Current Assignee
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd
Priority to CN202210815231.2A
Publication of CN114913470A
Application granted
Publication of CN114913470B
Status: Active


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/44 - Event detection
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/52 - Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/54 - Surveillance or monitoring of activities, e.g. for recognising suspicious objects of traffic, e.g. cars on the road, trains or boats
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 - Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The present application relates to the technical field of image information processing, and in particular to an event detection method and device, which are used for solving the problem in the prior art that inaccurate identification of target information leads to low accuracy of an event detection result. The method comprises the following steps: identifying an event in a video to be detected and obtaining an event type of the event; determining an event position of the event in the video to be detected according to the event type, wherein the event position comprises an event occurrence area and an electronic fence area of the event; the event occurrence area is used for representing the three-dimensional spatial position, within the space shot by the acquisition device, at which the event occurs, and the electronic fence area is an area obtained by expanding the event occurrence area by a set size; and identifying a participating object of the event according to the event occurrence area in the video to be detected, and identifying a sighting object of the event according to the electronic fence area in the video to be detected.

Description

Event detection method and device
Technical Field
The present application relates to the field of image information processing technologies, and in particular, to an event detection method and apparatus.
Background
As today's society attaches increasing importance to intelligent technologies such as public security and city-brain systems, understanding and detecting events in surveillance video has become especially important. In the current event detection method, an event is determined through targets: all target information included in a video to be detected is determined first, and the event in the video is then determined from all of that target information. However, some targets in the target information included in the video are irrelevant to event detection, so a detection method that determines the event through targets reduces event detection efficiency to some extent. In addition, when the target information is identified inaccurately, the accuracy of the event detection result may decrease.
Disclosure of Invention
The embodiment of the application provides an event detection method and device, and aims to solve the problem that in the prior art, target information identification is inaccurate, so that the accuracy of an event detection result is low.
In a first aspect, an embodiment of the present application provides an event detection method, including:
identifying an event in a video to be detected, and obtaining an event type of the event; determining an event position of an event in the video to be detected according to the event type, wherein the event position comprises an event occurrence area and an electronic fence area of the event; the event occurrence area is used for representing the three-dimensional space position of the event in the shooting space of the acquisition equipment, and the electronic fence area is an area with a set size expanded on the basis of the event occurrence area; and identifying a participating object of the event according to the event occurrence region in the video to be detected, and identifying a witness object of the event according to the electronic fence region in the video to be detected.
Based on this scheme, targets are determined from events rather than events being determined from targets, the participating objects and witness objects of an event can be screened out, and a reliable basis is provided for event investigation and analysis. Identifying the targets within the electronic fence area after the event occurs and screening the participating objects or witness objects can reduce the workload of target detection and improve the efficiency of event detection. In addition, detection is performed only at the event position, which can also improve the accuracy of target detection.
In a possible implementation, the method further includes: determining a time boundary of an event in the video to be detected, wherein the time boundary is used for expressing the starting time and the ending time of the event in the video to be detected; determining the event position of the event in the video to be detected according to the event type comprises the following steps: determining a video segment in the video to be detected according to the time boundary; and determining the event position of the event in the video segment according to the event type.
Based on the scheme, when event detection is carried out, the starting time and the ending time of the event can be directly obtained, and the event position can be further determined according to the event type and the starting time and the ending time of the event.
In a possible implementation manner, the identifying an event in a video to be detected and determining a time boundary of the event in the video to be detected includes: taking the video to be detected as an input of a first neural network, so as to output the event type and the time boundary through the first neural network; determining the event position of the event in the video to be detected according to the event type comprises the following steps: taking the video to be detected, the event type and the time boundary as the input of a second neural network so as to output the event position through the second neural network; the identifying the participating object of the event according to the event occurrence region in the video to be detected and identifying the sighting object of the event according to the electronic fence region in the video to be detected comprises: and taking the video to be detected and the event position as the input of a third neural network, so as to output the participating object and the sighting object of the event through the third neural network.
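As an illustration only, the three-stage pipeline described above could be sketched as follows; the network objects, their input/output conventions, and the function name are assumptions for illustration and not part of this disclosure.

```python
import torch

# Minimal sketch of the three-stage inference pipeline: net1, net2 and net3
# stand in for the first, second and third neural networks described above.
@torch.no_grad()
def detect_event(video, net1, net2, net3):
    # Stage 1: video understanding -> event type and time boundary
    event_type, t_start, t_end = net1(video)

    # Stage 2: event localization -> event occurrence area and electronic fence area
    event_area, fence_area = net2(video, event_type, (t_start, t_end))

    # Stage 3: target detection restricted to the predicted areas
    participants, witnesses = net3(video, event_area, fence_area)

    return {
        "event_type": event_type,
        "time_boundary": (t_start, t_end),
        "event_area": event_area,
        "fence_area": fence_area,
        "participants": participants,
        "witnesses": witnesses,
    }
```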
In a possible implementation manner, the event occurrence area is indicated by mapping the cubic space in which the event occurs to pixel positions in the video frames included in the video to be detected; the event occurrence area satisfies the condition shown by the following formula:

Space = {(u1, v1), (u2, v2), (u3, v3), (u4, v4), (u5, v5), (u6, v6), (u7, v7), (u8, v8)}

wherein Space is used for indicating the event occurrence area, (u1, v1) to (u4, v4) represent the pixel point positions to which the four vertices of the bottom surface of the cubic space are respectively mapped, and (u5, v5) to (u8, v8) represent the pixel point positions to which the four vertices of the top surface of the cubic space are respectively mapped.
In a possible implementation manner, the electronic fence area is obtained by taking the center point of the bottom surface of the cubic space as the center and enlarging the bottom surface in equal proportion by a set multiple.
Based on the scheme, the event occurrence area and the electronic fence area of the event of the video to be detected can be directly determined during event detection, and then target detection can be performed in the area, so that the calculated amount of target detection is reduced, and the detection efficiency of the target detection is improved.
In a possible implementation manner, the first neural network, the second neural network, and the third neural network constitute a neural network model, and the neural network model is obtained by training based on a training sample set;
the training sample set comprises a plurality of videos and video understanding results and target detection results corresponding to the videos respectively, the video understanding result of a first video comprises an event type, a time boundary, an event occurrence area and an electronic fence area of an event occurring in the first video, the first video is any one of the videos, and the target detection result of the first video comprises a participation object and a sighting object of the event occurring in the first video.
In a possible implementation, the method further includes: training the neural network model by: inputting the first video into the first neural network to output a predicted event type and a predicted time boundary of an event occurring in the first video through the first neural network, and determining a first loss value according to a loss between the predicted event type and an event type corresponding to the first video in the training sample set and a loss between the predicted time boundary and a time boundary corresponding to the first video in the training sample set;
inputting the first video, the predicted event type, and the predicted time boundary into the second neural network to output a predicted event occurrence region and a predicted fence region through the second neural network, determining a second loss value based on a loss between the predicted event occurrence region and an event occurrence region corresponding to the first video in the training sample set, and a loss between the predicted fence region and a fence region corresponding to the first video in the training sample set;
inputting the first video, the predicted event occurrence region, and the predicted fence region into a third neural network to output a predicted engagement object and a predicted sighting object through the third neural network, determining a third loss value based on a loss between the predicted engagement object and an engagement object of an event occurring with the first video in the training sample set, and a loss between the predicted sighting object and a sighting object of an event occurring with the first video in the training sample set;
weighting the first loss value, the second loss value and the third loss value to obtain a fourth loss value;
and respectively adjusting the network parameters of the first neural network, the second neural network and the third neural network according to the fourth loss value to obtain the neural network model.
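A minimal sketch of one such joint training step, assuming PyTorch-style networks, an optimizer that holds the parameters of all three networks, and placeholder loss functions and weights (all of which are assumptions, not the specific losses of this application):

```python
def train_step(net1, net2, net3, optimizer, sample, losses, weights):
    # sample: (video, event_type, time_boundary, event_area, fence_area,
    #          participants, witnesses) taken from the training sample set
    video, gt_type, gt_tb, gt_area, gt_fence, gt_part, gt_wit = sample
    cls_loss, reg_loss, det_loss = losses        # hypothetical loss functions
    w1, w2, w3 = weights                         # hypothetical weighting coefficients

    # First neural network: predicted event type and time boundary -> first loss value
    pred_type, pred_tb = net1(video)
    loss1 = cls_loss(pred_type, gt_type) + reg_loss(pred_tb, gt_tb)

    # Second neural network: predicted event area and fence area -> second loss value
    pred_area, pred_fence = net2(video, pred_type, pred_tb)
    loss2 = reg_loss(pred_area, gt_area) + reg_loss(pred_fence, gt_fence)

    # Third neural network: predicted participants and witnesses -> third loss value
    pred_part, pred_wit = net3(video, pred_area, pred_fence)
    loss3 = det_loss(pred_part, gt_part) + det_loss(pred_wit, gt_wit)

    # Weighted fusion of the three losses -> fourth loss value, back-propagated
    # so the parameters of all three networks are adjusted together
    loss4 = w1 * loss1 + w2 * loss2 + w3 * loss3
    optimizer.zero_grad()
    loss4.backward()
    optimizer.step()
    return loss4.item()
```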
Based on the scheme, a mutual fusion mode is adopted on the network structure, and the neural networks are mutually constrained, so that the accuracy rate of event detection can be improved.
In a possible implementation, the method further includes: outputting an event detection result, the event detection result including the event type, the time boundary, the event location, the engagement object, and the sighting object.
In a second aspect, an embodiment of the present application provides an event detection apparatus, including:
the first identification module is used for identifying an event in a video to be detected and obtaining the event type of the event;
the first processing module is used for determining the event position of an event in the video to be detected according to the event type, wherein the event position comprises an event occurrence area and an electronic fence area of the event; the event occurrence area is used for representing the three-dimensional space position of the event in the space shot by the acquisition equipment, and the electronic fence area is an area with a set size expanded on the basis of the event occurrence area;
and the second identification module is used for identifying a participating object of the event according to the event occurrence region in the video to be detected and identifying a sighting object of the event according to the electronic fence region in the video to be detected.
In some embodiments, the first processing module is further configured to: determining a time boundary of the event in the video to be detected, wherein the time boundary is used for expressing the starting time and the ending time of the event in the video to be detected;
the first processing module, when determining the event location of the event in the video to be detected according to the event type, is specifically configured to: determining a video segment in the video to be detected according to the time boundary; and determining the event position of the event in the video segment according to the event type.
In some embodiments, the first identifying module, when identifying an event in the video to be detected and determining a time boundary of the event in the video to be detected, is specifically configured to: taking the video to be detected as an input of a first neural network, so as to output the event type and the time boundary through the first neural network;
the first processing module, when determining the event location of the event in the video to be detected according to the event type, is specifically configured to: taking the video to be detected, the event type and the time boundary as the input of a second neural network so as to output the event position through the second neural network;
the second identification module, when identifying a participating object of the event according to the event occurrence region in the video to be detected and identifying a sighting object of the event according to the electronic fence region in the video to be detected, is specifically configured to: and taking the video to be detected and the event position as the input of a third neural network, so as to output the participation object and the sighting object of the event through the third neural network.
In some embodiments, the event occurrence area is indicated by mapping the cubic space in which the event occurs to pixel positions in the video frames included in the video to be detected; the event occurrence area satisfies the condition shown by the following formula:

Space = {(u1, v1), (u2, v2), (u3, v3), (u4, v4), (u5, v5), (u6, v6), (u7, v7), (u8, v8)}

wherein Space is used for indicating the event occurrence area, (u1, v1) to (u4, v4) represent the pixel point positions to which the four vertices of the bottom surface of the cubic space are respectively mapped, and (u5, v5) to (u8, v8) represent the pixel point positions to which the four vertices of the top surface of the cubic space are respectively mapped.
In some embodiments, the electronic fence area is an area obtained by taking the center point of the bottom surface of the cubic space as the center and enlarging the bottom surface in equal proportion by a set multiple.
In some embodiments, the first, second, and third neural networks form a neural network model, the neural network model being trained based on a set of training samples;
the training sample set comprises a plurality of videos and video understanding results and target detection results corresponding to the videos respectively, the video understanding result of a first video comprises an event type, a time boundary, an event occurrence area and an electronic fence area of an event occurring in the first video, the first video is any one of the videos, and the target detection result of the first video comprises a participation object and a sighting object of the event occurring in the first video.
In some embodiments, the first processing module is further configured to train the neural network model by:
inputting the first video into the first neural network to output a predicted event type and a predicted time boundary of an event occurring in the first video through the first neural network, and determining a first loss value according to a loss between the predicted event type and an event type corresponding to the first video in the training sample set and a loss between the predicted time boundary and a time boundary corresponding to the first video in the training sample set;
inputting the first video, the predicted event type, and the predicted time boundary into the second neural network to output a predicted event occurrence region and a predicted fence region through the second neural network, determining a second loss value based on a loss between the predicted event occurrence region and an event occurrence region corresponding to the first video in the training sample set, and a loss between the predicted fence region and a fence region corresponding to the first video in the training sample set;
inputting the first video, the predicted event occurrence region, and the predicted fence region into a third neural network to output a predicted engagement object and a predicted sighting object through the third neural network, determining a third loss value based on a loss between the predicted engagement object and an engagement object of an event occurring with the first video in the training sample set, and a loss between the predicted sighting object and a sighting object of an event occurring with the first video in the training sample set;
weighting the first loss value, the second loss value and the third loss value to obtain a fourth loss value;
and respectively adjusting the network parameters of the first neural network, the second neural network and the third neural network according to the fourth loss value to obtain the neural network model.
In some embodiments, the apparatus further comprises an output module; the output module is configured to output an event detection result, where the event detection result includes the event type, the time boundary, the event location, the participation object, and the sighting object.
In a third aspect, an embodiment of the present application provides an electronic device, including:
a memory for storing program instructions;
and the processor is used for calling the program instructions stored in the memory and executing the methods of the first aspect and the different implementation modes of the first aspect according to the obtained program.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, which stores computer instructions that, when executed on a computer, cause the computer to perform the method according to the first aspect and different implementations of the first aspect.
In addition, for technical effects brought by any one implementation manner of the second aspect to the fourth aspect, reference may be made to the technical effects brought by the first aspect and different implementation manners of the first aspect, and details are not described here.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to these drawings without inventive exercise.
Fig. 1A is a schematic view of an application scenario provided in an embodiment of the present application;
fig. 1B is a schematic structural diagram of a server according to an embodiment of the present application;
fig. 2 is a schematic flowchart of an event detection method according to an embodiment of the present application;
fig. 3 is a schematic diagram of an electronic fence area according to an embodiment of the present application;
fig. 4 is a schematic diagram of an event detection result according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a neural network model provided in an embodiment of the present application;
FIG. 6 is a schematic illustration of an event response heatmap provided by an embodiment of the present application;
FIG. 7 is a diagram illustrating a relationship between an event occurrence area and an electronic fence area according to an embodiment of the present application;
fig. 8 is a schematic time sequence diagram of a video provided by an embodiment of the present application;
fig. 9 is a schematic diagram illustrating a detection result of a sighting target according to an embodiment of the present application;
fig. 10 is a schematic diagram illustrating a training process of a neural network model according to an embodiment of the present disclosure;
FIG. 11 is a schematic diagram illustrating a flow of training information of a neural network model according to an embodiment of the present disclosure;
fig. 12 is a schematic diagram of an event detection device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
It is noted that relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
For ease of understanding, before describing the event detection method provided in the embodiments of the present application, a detailed description will be given of the technical background of the embodiments of the present application.
In the prior art, event detection can be realized by a video understanding method. The existing event detection method establishes a correspondence between a target environment and an event, performs event detection on an image to be detected according to that correspondence, and thereby obtains an event detection result. However, this method has a limitation: when the image to be detected does not match the target environment, the event in the image to be detected cannot be detected. In addition, when event detection is performed, the event is determined through targets, and all targets appearing during detection need to be detected in order to determine the event. This requires a large amount of computation, and when the target detection result is inaccurate, the accuracy of the event detection result is also low.
Based on the above problems, the embodiments of the present application provide an event detection method and apparatus. The event detection method and apparatus are suitable for traffic scenes, and are also suitable for scenes such as public service halls, hospitals and waiting halls. The scheme provided by the present application can, in one pass, understand the type, the start and end times, the event occurrence area, the electronic fence area, and the participating and witness objects of the event occurring in a video. By accurately dividing the range influenced by the event, the key area can be analyzed accurately. By "determining targets from the event" instead of the existing "determining the event from targets", the direct participants and witnesses of the event can be screened out, providing a reliable basis for event investigation and analysis. Meanwhile, targets in the electronic fence area before the event occurs can also be identified, and participating objects or witness objects screened out. On the network structure, event detection is realized through a single neural network model, which reduces the number of modules connected in series for video understanding and target detection and improves the overall identification accuracy.
The event detection method provided by the embodiments of the present application can be realized by an execution device. In some embodiments, the execution device may be an electronic device, which may be implemented by one or more servers, for example the server 100 in fig. 1A. Referring to fig. 1A, a schematic diagram of a possible application scenario provided by an embodiment of the present application is shown, which includes a server 100 and an acquisition device 200. The server 100 may be implemented by a physical server or a virtual server, and by a single server or a server cluster formed by a plurality of servers; either a single server or a server cluster may implement the event detection method provided by the present application. The acquisition device 200 is a device with an image acquisition function, including an electric alarm device, an electronic monitoring device, a monitoring camera, a video recorder, a terminal device with a video acquisition function (such as a notebook, a computer, a mobile phone or a television), and the like. The acquisition device 200 may transmit the acquired video to be detected to the server 100 through a network. Alternatively, the server 100 may be connected to the terminal device 300, receive an event detection task sent by the terminal device 300, and perform event detection according to the received video to be detected sent by the acquisition device 200. In some scenarios, the server 100 may send the event detection result to the terminal device 300. The terminal device 300 may be a television, a mobile phone, a tablet computer, a personal computer, and the like.
By way of example, referring to FIG. 1B, server 100 may include a processor 110, a communication interface 120, and a memory 130. Of course, other components, not shown in FIG. 1B, may also be included in the server 100.
The communication interface 120 is configured to communicate with the capture device 200 and the terminal device 300, and is configured to receive a video to be detected sent by the capture device 200, or receive an event detection task sent by the terminal device 300, or send an event detection result to the terminal device 300.
In the embodiments of the present application, the processor 110 may be a general-purpose processor, a digital signal processor, an application specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in a processor.
The processor 110 is the control center of the server 100; it connects various parts of the entire server 100 using various interfaces and lines, and performs the various functions of the server 100 and processes data by running or executing the software programs and/or modules stored in the memory 130 and calling the data stored in the memory 130. Optionally, the processor 110 may include one or more processing units. The processor 110 may be a control component such as a processor, a microprocessor or a controller, and may be, for example, a general-purpose Central Processing Unit (CPU), a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof.
The memory 130 may be used to store software programs and modules, and the processor 110 executes various functional applications and data processing by running the software programs and modules stored in the memory 130. The memory 130 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function, and the like; the data storage area may store data created according to business processing, and the like. The memory 130, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 130 may include at least one type of storage medium, for example, a flash memory, a hard disk, a multimedia card, a card-type memory, a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Programmable Read Only Memory (PROM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a magnetic memory, a magnetic disk, an optical disk, and so on. The memory 130 may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 130 in the embodiments of the present application may also be a circuit or any other device capable of performing a storage function, for storing program instructions and/or data.
In other embodiments, the enforcement device may be a terminal device. In some scenes, the terminal device may receive the video to be detected sent by the acquisition device, and perform event detection according to the video to be detected. The terminal device may include a display device, and the display device may be a liquid crystal display, an Organic Light-Emitting Diode (OLED) display, a projection display device, and the like, which is not particularly limited in this application.
It should be noted that the structure shown in fig. 1A and 1B is only an example, and the present embodiment is not limited thereto.
An event detection method is provided in the embodiment of the present application, and fig. 2 exemplarily shows a flow of the event detection method, which may be performed by an event detection apparatus, which may be located in the server 100 shown in fig. 1B, such as the processor 110, or the server 100. The event detection means may also be located in the terminal device. The specific process is as follows:
and 201, identifying an event in the video to be detected, and obtaining the event type of the event.
The video to be detected comprises a plurality of continuous video frames acquired by the acquisition equipment. The acquisition equipment can be electric alarm equipment, electronic monitoring equipment, a monitoring camera, a video recorder, terminal equipment (such as a notebook, a computer, a mobile phone and a television) with a video acquisition function and the like.
Illustratively, after the acquisition device acquires the video to be detected, the server acquires the video to be detected from the acquisition device.
In some embodiments, the server receives a video file sent by the capture device, and the video file includes the video to be detected. The video file may be an encoded file of video. Therefore, the server can decode the received video file to obtain the video to be detected. The video is encoded, so that the file size of a video file can be effectively reduced, and the transmission is convenient. Therefore, the transmission speed of the video can be improved, and the efficiency of confirming the video event subsequently can be improved. The method for acquiring encoded code stream data may be any suitable method, including but not limited to: real Time Streaming Protocol (RTSP), Open Network Video Interface Forum (ONVIF) standard or proprietary Protocol.
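As an illustration only (the stream address below is a placeholder and the decoding details are assumptions, not the specific decoding scheme of this application), the encoded video could be decoded into frames with OpenCV:

```python
import cv2

def read_video_frames(source):
    """Decode an encoded video file or RTSP stream into raw frames.

    `source` may be a file path or an RTSP URL such as
    "rtsp://<camera-ip>/stream" (placeholder, not a real address).
    """
    cap = cv2.VideoCapture(source)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return frames
```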
In some embodiments, after the server obtains the video to be detected collected by the collecting device, the server may identify an event in the video to be detected, and obtain an event type of the event. Illustratively, the event type may be any event that can be recognized by the server. For example, in a traffic scenario, the event type may include an event of a collision between a motor vehicle and a person, an event of parking violations, an event of putting a shelf, running a red light, or parking violations, which is not specifically limited in this embodiment of the present application.
The event occurring in the video to be detected can be identified by adopting modes of face identification, vehicle identification or human body identification and the like. In some embodiments, a neural network may be employed to identify events occurring in the video to be detected.
And determining the event type corresponding to the event occurring in the video file according to different event detection rules. Wherein, different event types correspond to different event detection rules. The event detection rules are pre-set. The event detection rule is a rule for judging whether the event occurs or not, or further includes a rule for identifying specific contents in the event. For example, the event type is a vehicle collision, and the event detection rule may include an identification rule for determining whether two or more vehicles collide.
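As a hedged illustration of such pre-set rules (the overlap criterion below is a deliberate simplification for the vehicle-collision example, not the rule actually used by this application):

```python
def boxes_overlap(a, b):
    """a, b: (x1, y1, x2, y2) detection boxes; True if the boxes intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def vehicle_collision_rule(vehicle_boxes):
    """Illustrative rule: a collision is judged when two or more recognized
    vehicles have overlapping detection boxes."""
    return any(boxes_overlap(a, b)
               for i, a in enumerate(vehicle_boxes)
               for b in vehicle_boxes[i + 1:])

# Different event types correspond to different, pre-set detection rules.
EVENT_DETECTION_RULES = {
    "vehicle_collision": vehicle_collision_rule,
}
```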
One video may correspond to one event type or to a plurality of event types.
In one possible implementation, the time boundary is obtained while identifying the event type of the event occurring in the video to be detected. The time boundary is used for expressing the starting time and the ending time of the event occurring in the video to be detected. As an example, when it is determined that the type of the event occurring in the video to be detected is a vehicle collision, the start time and the end time of the event in the video to be detected are determined to obtain the time boundary. For example, if the video to be detected is 10 seconds long in total, the vehicles collide at a certain second (denoted t1), and the collision lasts until the video end time, then the time boundary of the vehicle collision event in the video is from the t1-th second to the 10th second.
And 202, determining the event position of the event in the video to be detected according to the event type.
The event position comprises an event occurrence area and an electronic fence area. The event occurrence area is used for representing the three-dimensional spatial position, within the space shot by the acquisition device, at which the event occurs; the electronic fence area is larger than the event occurrence area, and the event occurrence area is located within the electronic fence area. The electronic fence area may be an area enlarged by a set size on the basis of the event occurrence area. For example, the electronic fence area is obtained by enlarging the event occurrence area by a set ratio with its center point as the center.
In some embodiments, in the case that the time boundary is obtained while identifying the event type of the event in the video to be detected, when determining the event position, the following method may be implemented:
determining a video segment in the video to be detected according to the obtained time boundary of the event in the video to be detected, and then determining the event position of the event in the video segment according to the event type. The server can cut the video segment out of the video to be detected through the time boundary, which reduces the length of video that has to be processed when the event position is determined according to the event type, thereby improving processing efficiency and reducing processing delay.
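A minimal sketch of cutting out the video segment from the time boundary, assuming the decoded frames and the frame rate are available (names are illustrative):

```python
def clip_by_time_boundary(frames, fps, t_start, t_end):
    """Return the frames between the start and end time of the event."""
    first = max(0, int(t_start * fps))
    last = min(len(frames), int(t_end * fps) + 1)
    return frames[first:last]
```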
Illustratively, the event occurrence area may be expressed by three-dimensional coordinates or by two-dimensional coordinates. For example, the space shot by the acquisition device in which the event occurs is a cubic space. The event occurrence area can then be understood as the pixel positions to which the 8 vertices of the cubic space are mapped in the video frames included in the video to be detected.
For example, the event occurrence area satisfies the condition shown in the following formula:
Space = {(u1, v1), (u2, v2), (u3, v3), (u4, v4), (u5, v5), (u6, v6), (u7, v7), (u8, v8)}

wherein Space is used for indicating the event occurrence area, (u1, v1) to (u4, v4) represent the pixel point positions to which the four vertices of the bottom surface of the cubic space are respectively mapped, and (u5, v5) to (u8, v8) represent the pixel point positions to which the four vertices of the top surface of the cubic space are respectively mapped.
It will be appreciated that the space, shot by the acquisition device, in which the event occurs may also take other shapes, such as a cylinder. In that case the event occurrence area can be indicated by the pixel positions to which the center point of the top surface and the center point of the bottom surface of the cylinder are mapped in the video frame, together with the radii of the top surface and the bottom surface of the cylinder.
The pixel position referred to in the embodiment of the present application may also be referred to as a pixel point position, or referred to as a pixel coordinate or a pixel point coordinate, which is not specifically limited in the embodiment of the present application.
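The application does not specify how the cube vertices are mapped to pixel positions; purely as an illustration, assuming a pinhole camera model with known intrinsic matrix K and extrinsic parameters (R, t) (all of which are assumptions), the mapping could be computed as follows:

```python
import numpy as np

def project_cube_vertices(vertices_3d, K, R, t):
    """Project the 8 corners of the event cube (world coordinates, shape (8, 3))
    to pixel positions (u, v) in the video frame."""
    pts_cam = R @ vertices_3d.T + t.reshape(3, 1)   # camera coordinates, (3, 8)
    pts_img = K @ pts_cam                           # homogeneous pixel coordinates
    uv = (pts_img[:2] / pts_img[2]).T               # perspective divide, shape (8, 2)
    return uv
```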
Optionally, when determining the event location, the event occurrence area may be determined first, and then the electronic fence area may be determined according to the event occurrence area. It is also possible to determine the fence area first and then specifically determine the event occurrence area in the fence area.
In some embodiments, the electronic fence area can be obtained by enlarging the bottom surface in equal proportion by a set multiple with the center point of the bottom surface of the cubic space as the center. Specifically, the pixel point corresponding to the center point of the bottom surface of the cubic space can be determined from the pixel points corresponding to the four vertices of the bottom surface; after the pixel coordinates corresponding to the center point are determined, the bottom surface is enlarged in equal proportion by the set multiple with the center point as the center to obtain the electronic fence area.
Illustratively, the pixel point corresponding to the center point satisfies the condition described by the following formula:
(uc, vc) = ((u1 + u2 + u3 + u4) / 4, (v1 + v2 + v3 + v4) / 4)

wherein (uc, vc) represents the coordinates of the pixel point corresponding to the center point, and (u1, v1), (u2, v2), (u3, v3) and (u4, v4) represent the coordinates of the pixel points to which the four vertices of the bottom surface of the cubic space are respectively mapped.
As an example, the coordinates of the vertices of the electronic fence area can be denoted (u'i, v'i), i = 1, 2, 3, 4, and the vertex coordinates of the electronic fence area satisfy the conditions shown in the following formulas:

u'i = uc + k × (ui - uc)

v'i = vc + k × (vi - vc), i = 1, 2, 3, 4

wherein (u'i, v'i) represents the pixel coordinates of the i-th vertex of the electronic fence area, (ui, vi) represents the pixel coordinates to which the i-th vertex of the bottom surface of the cubic space is mapped, (uc, vc) represents the pixel coordinates corresponding to the center point, and k represents the set magnification, with k > 1.
in some embodiments, when the fence area exceeds the boundary of the video frame, the area can be truncated by the image boundary to obtain the final fence area. When the electric fence area is as shown in fig. 3
Figure 376128DEST_PATH_IMAGE034
Cut off according to image boundaries when exceeding the boundaries of video frames
Figure DEST_PATH_IMAGE035
Enclosed areaAs an electronic fence area.
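A minimal sketch combining the center-point, magnification, and boundary-truncation steps above (clipping each vertex to the frame is a simplifying assumption; names are illustrative):

```python
def fence_from_bottom_face(bottom_uv, k, width, height):
    """bottom_uv: four (u, v) pixel positions of the bottom-face vertices;
    k: the set magnification (k > 1); width, height: video frame size."""
    uc = sum(u for u, _ in bottom_uv) / 4.0
    vc = sum(v for _, v in bottom_uv) / 4.0
    fence = []
    for u, v in bottom_uv:
        fu = uc + k * (u - uc)
        fv = vc + k * (v - vc)
        # truncate by the image boundary when the fence exceeds the frame
        fence.append((min(max(fu, 0), width - 1), min(max(fv, 0), height - 1)))
    return fence
```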
And 203, identifying the participating objects of the event according to the event occurrence area in the video to be detected, and identifying the sighting objects of the event according to the electronic fence area in the video to be detected.
The server may identify the participating objects of the event from the event occurrence area, and identify witness objects of the event within the electronic fence area and outside the event occurrence area. Specifically, the center point of each target object appearing in the electronic fence area is determined, and when the center point of a target object is outside the event occurrence area and inside the electronic fence area, the target object is determined to be a sighting object. As an example, as shown in fig. 4, after the event occurrence area and the electronic fence area are determined in the video to be detected, the target objects appearing in the event occurrence area are detected as the participating objects of the event. For example, taking a traffic scene, the two vehicles in the event occurrence area in fig. 4 are the participating objects of the event. The target objects whose center points are outside the event occurrence area and inside the electronic fence area are sighting objects, which may include pedestrians, vehicles, non-motor vehicles and the like. As shown in fig. 4, the sighting objects include sighting 1, sighting 2 and sighting 3. As another example, in a traffic scene, when the event type is a pedestrian running a red light, the event occurrence area of the event is determined to be the three-dimensional space in which the zebra crossing shot by the acquisition device is located, and the electronic fence area is the intersection area containing the zebra crossing. The participating object of the red-light-running event determined in the event occurrence area is the pedestrian running the red light, and the sighting objects determined in the electronic fence area when the event occurs may include pedestrians, vehicles, non-motor vehicles and the like.
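A minimal sketch of this screening rule, assuming the event occurrence area and the electronic fence area are given as polygons of pixel vertices and each detected target as a detection box (function and parameter names are illustrative):

```python
from matplotlib.path import Path

def classify_target(box, event_polygon, fence_polygon):
    """box: (x1, y1, x2, y2) detection box; polygons: lists of (u, v) vertices."""
    cx, cy = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0
    if Path(event_polygon).contains_point((cx, cy)):
        return "participant"   # center falls inside the event occurrence area
    if Path(fence_polygon).contains_point((cx, cy)):
        return "witness"       # inside the fence area but outside the event area
    return "unrelated"         # outside the electronic fence area
```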
Based on the scheme, the event is used for determining the target instead of determining the event by using the target, and the participation object and the event witness object of the event can be screened, so that a reliable basis is provided for event investigation and analysis. The targets in the electronic fence area after the event occurs are identified, and the participation objects or the witness objects are screened, so that the workload of target detection can be reduced, and the detection efficiency of event detection can be improved. In addition, only the event position is detected, and the detection accuracy of the target detection can be improved. In addition, the method can also identify a plurality of related information corresponding to the event, and provide more auxiliary information for subsequent event analysis.
In some embodiments, the electronic fence area can also be determined in the following manner: after the event occurrence area is determined, the participating object of the event is identified according to the event occurrence area, and the electronic fence area is then determined according to the participating object of the event. Specifically, the shape of the electronic fence area may be preset, and after the participating object of the event is determined, the pixel coordinates of the vertices of the electronic fence area may be determined according to the position of the participating object.
The event detection method provided by the embodiment of the application can be realized through a neural network model, and the neural network model designed by the embodiment of the application is introduced below.
The neural network model can be used for realizing the event detection method provided by the embodiment of the application, and identifying the type, time boundary, event position, participating object and sighting object of the event occurring in the video to be detected. The neural network model in the embodiment of the application structurally adopts a multi-information fusion structure, the neural network model comprises a plurality of neural networks, and different neural networks have different functions. The neural networks are mutually constrained, compared with a single neural network, the neural network model provided by the application has higher identification accuracy, and the overall structure of the neural network model is shown in fig. 5.
Specifically, the neural network model in the embodiment of the present application includes a first neural network, a second neural network, and a third neural network, and the neural network model can be used for video understanding and target detection. Wherein the video understanding part adopts two neural networks, namely a first neural network and a second neural network. Wherein the first neural network is used for understanding the video content, the recognition result, namely the event type and the time boundary (starting time and ending time) can be output from two dimensions. The second neural network is used for predicting the event position, and the output result is a series of appointed number of pixel coordinates with space geometric relation on the image. The target detection portion is implemented by a third neural network. And detecting and identifying the target object in the video to be detected through a third neural network so as to identify the participating object and the sighting object of the event. It is to be understood that, in the embodiment of the present application, the first neural network may be any one of existing image-based video understanding methods or a derivative improved network, the second neural network may be any one of target detection or keypoint detection or a derivative improved network, and the third neural network may be any one of image target detection or a derivative improved network, which is not limited in this embodiment of the present application.
In some embodiments, the first neural Network may adopt a Boundary-Matching Network (BMN), which has a good effect on video content understanding and content positioning. In event detection, a video to be detected can be used as an input of the first neural network, so that an event type and a time boundary are output through the first neural network. The output form of the first neural network can be expressed as:
out1 = (t_start, t_end, type)

wherein out1 represents the output of the first neural network, t_start is used for indicating the start time of the event, t_end is used for indicating the end time of the event, and type is used for indicating the type of the event.
As an example, a video to be detected with a length of 10 seconds may be input into the first neural network; the event occurring in the video is identified through the first neural network to obtain the event type, and the time boundary is determined according to the event type. For example, the first neural network performs video understanding on the video to be detected and identifies that the type of the event occurring in the video is a two-vehicle collision event. Further, the first neural network can also determine the time boundary of the event in the video to be detected, namely the start time and the end time, according to the event occurring in the video. For example, when the video to be detected is 10 seconds long in total, the vehicles collide at the t1-th second, and the collision lasts until the video end time, the start time of the collision event in the video is the t1-th second and the end time is the 10th second.
In some embodiments, the second neural network may adopt a pretrained monocular 3D object detection (DD 3D) network structure, and the method has a good effect in 3D object detection. The video to be detected, the event type and the time boundary may be used as inputs to a second neural network to output the event location through the second neural network. The second neural network can identify an event occurrence area, the event occurrence area is mapped to a pixel position indication in a video frame included in the video to be detected through a cubic space of event occurrence, and the event occurrence area output by the second neural network can be represented as:
out2 = Space = {(u1, v1), (u2, v2), (u3, v3), (u4, v4), (u5, v5), (u6, v6), (u7, v7), (u8, v8)}

wherein out2 represents the output of the second neural network, (u1, v1) to (u4, v4) represent the pixel point positions to which the four vertices of the bottom surface of the cubic space are respectively mapped, and (u5, v5) to (u8, v8) represent the pixel point positions to which the four vertices of the top surface of the cubic space are respectively mapped.
In some embodiments, the output of the first neural network and the output of the second neural network may be fused and represented in the form of a heat map, as shown in fig. 6. Before the event starts, the heat map shows no response of the event and no mapping of the event occurrence area to pixel positions in the video frames included in the video to be detected, such as in the first 3 heat maps on the right side of fig. 6. From the start time of the event (i.e. the time at which the event occurs), a response of the event and the mapping of the event occurrence area to pixel positions in the video appear on the heat map, until the event ends. The 8 points on the heat map represent the 8 pixel point positions to which the 8 vertices of the event occurrence area are mapped on the video frame. Where a response of the event appears on the heat map, the closer to the center of the response, the higher the predicted probability of the event.
Further, the electronic fence area can be determined from the event occurrence area. The electronic fence area is obtained by enlarging the bottom surface in equal proportion by a set multiple with the center point of the bottom surface of the cubic space as the center. The electronic fence area is located within the image area of the video frames included in the video to be detected. The relationship between the event occurrence area and the electronic fence area is shown in fig. 7.
In some embodiments, the third neural network used for target detection may employ a Yolov5 network. When target detection is performed, the video to be detected and the event position can be used as the input of the third neural network, so that the participating object and the sighting object of the event are output through the third neural network. As an example, assume that the start time is t, so that target detection can be performed on the video frame at time t within the event occurrence area or the electronic fence area, as shown in fig. 8. For example, when the electronic fence area is expressed by the area enclosed by its vertices, the target objects inside the electronic fence area and outside the event occurrence area are detected to obtain the sighting objects. The target detection result can be expressed as:

out3 = (rect, label)

where rect represents the pixel coordinates of the four vertices of the detection frame of the identified object, and label represents the type of the identified object, such as a person, a motor vehicle, a non-motor vehicle, and the like. As shown in fig. 9, rect can represent the pixel coordinates of the four vertices of the detection box of witness 1, and label can indicate that witness 1 is of the non-motor-vehicle type.
In some embodiments, since the video to be detected includes multiple video frames, the detections of the same target across multiple frames may be associated, and the best detection selected from them, to obtain the final target detection result.
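One plausible way to associate the same target across frames is greedy IoU matching, keeping the highest-confidence detection per track as the final result; both the greedy strategy and the confidence criterion are assumptions for this sketch.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def associate_across_frames(per_frame_dets, iou_thr=0.5):
    """per_frame_dets: list (one entry per frame) of (rect, label, score) detections.
    Greedily append each detection to the track whose last box overlaps it most,
    then keep the highest-scoring detection of every track as the final result."""
    tracks = []  # each track is a list of (rect, label, score)
    for dets in per_frame_dets:
        for det in dets:
            best, best_iou = None, iou_thr
            for tr in tracks:
                overlap = iou(tr[-1][0], det[0])
                if overlap > best_iou:
                    best, best_iou = tr, overlap
            if best is None:
                tracks.append([det])
            else:
                best.append(det)
    return [max(tr, key=lambda d: d[2]) for tr in tracks]
```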
In the embodiment of the application, the first neural network, the second neural network and the third neural network form a neural network model, and the neural network model is obtained by training based on a training sample set. The training sample set comprises a plurality of samples, and each sample comprises a video, a video understanding result and a target detection result corresponding to the video. Taking the first sample as an example, the first sample includes the first video and a video understanding result and a target detection result corresponding to the first video. The video understanding result comprises an event type, a time boundary, an event occurrence area and an electronic fence area of an event occurring in the first video, and the target detection result comprises a participation object and a sighting object of the event occurring in the first video.
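For concreteness, one sample of the training sample set could be organized as below; the field names and types are illustrative only and are not prescribed by the patent.

```python
from dataclasses import dataclass
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # detection box (x1, y1, x2, y2)


@dataclass
class TrainingSample:
    """A video plus its video understanding result and target detection result."""
    video_path: str
    event_type: int                         # class index of the annotated event
    time_boundary: Tuple[float, float]      # (start time, end time) of the event
    event_area: List[Tuple[float, float]]   # eight mapped vertex positions
    fence_area: List[Tuple[float, float]]   # electronic fence polygon in pixels
    participants: List[Tuple[Rect, str]]    # (box, label) of participating objects
    witnesses: List[Tuple[Rect, str]]       # (box, label) of sighting objects
```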
Taking training the neural network model through the first sample as an example, as shown in fig. 10, the training process is as follows:
the first video in the first sample is input to a first neural network 1001 to output a predicted event type and a predicted temporal boundary through the first neural network and to determine a first loss value.
In some embodiments, the first video in the first sample may be used as an input to a first neural network, and the predicted event type and predicted temporal boundary of an event occurring in the first video may be output via the first neural network. Further, a first loss value is determined based on a loss between the predicted event type and the event type corresponding to the first video and a loss between the predicted temporal boundary and the temporal boundary corresponding to the first video. Wherein the first loss value may be determined by weighting a loss between the predicted event type and the event type corresponding to the first video and a loss between the predicted temporal boundary and the temporal boundary corresponding to the first video. It can be understood that other operations may be performed according to the loss between the predicted event type and the event type corresponding to the first video and the loss between the predicted time boundary and the time boundary corresponding to the first video to determine the first loss value, which is not specifically limited in this application.
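A minimal sketch of such a weighted first loss, assuming cross-entropy for the event type and an L1 loss for the time boundary (these particular loss functions and weights are assumptions, not specified by the patent):

```python
import torch
import torch.nn.functional as F


def first_loss(pred_type_logits, gt_type, pred_boundary, gt_boundary,
               w_type=1.0, w_time=1.0):
    """Weight the event-type loss and the time-boundary loss into the first loss value."""
    type_loss = F.cross_entropy(pred_type_logits, gt_type)   # event type classification
    time_loss = F.l1_loss(pred_boundary, gt_boundary)        # start/end time regression
    return w_type * type_loss + w_time * time_loss
```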
1002, inputting the first video, the predicted event type and the predicted time boundary into the second neural network to output a predicted event occurrence area and a predicted electronic fence area through the second neural network, and determining a second loss value.
In some embodiments, the first video in the first sample, the predicted event type output by the first neural network, and the predicted time boundary may be used as inputs to the second neural network, and the predicted event occurrence area and the predicted electronic fence area may be output by the second neural network. Further, a second loss value is determined according to the loss between the predicted event occurrence area and the event occurrence area corresponding to the first video and the loss between the predicted electronic fence area and the electronic fence area corresponding to the first video. The second loss value may be determined by weighting the loss between the predicted event occurrence area and the event occurrence area corresponding to the first video and the loss between the predicted electronic fence area and the electronic fence area corresponding to the first video. It can be understood that other operations may also be performed on these two losses to determine the second loss value, which is not specifically limited in this application.
1003 inputting the first video, the predicted event occurrence area and the predicted fence area into a third neural network to output a predicted engagement object and a predicted sighting object through the third neural network, and determining a third loss value.
In some embodiments, the first video in the first sample, and the predicted event occurrence area and the predicted fence area output by the second neural network, may be used as inputs to the third neural network, and the predicted engagement object and the predicted sighting object may be output by the third neural network. Further, a third loss value is determined based on the loss between the predicted engagement object and the engagement object corresponding to the first video and the loss between the predicted sighting object and the sighting object corresponding to the first video. The third loss value may be determined by weighting the loss between the predicted engagement object and the engagement object corresponding to the first video and the loss between the predicted sighting object and the sighting object corresponding to the first video. It can be understood that other operations may also be performed on these two losses to determine the third loss value, which is not specifically limited in this application.
And 1004, weighting the first loss value, the second loss value and the third loss value to obtain a fourth loss value.
In some embodiments, the fourth loss value may also be referred to as a total loss value, and is used to represent a total loss value of a neural network model composed of the first neural network, the second neural network, and the third neural network. The total loss value satisfies the condition shown in the following formula:
Loss = α1 × Loss1 + α2 × Loss2 + α3 × Loss3
wherein Loss represents the total loss value of the neural network model, Loss1 represents the first loss value, α1 represents the weight corresponding to the first loss value, Loss2 represents the second loss value, α2 represents the weight corresponding to the second loss value, Loss3 represents the third loss value, and α3 represents the weight corresponding to the third loss value, the weights satisfying a normalization constraint (for example, α1 + α2 + α3 = 1).
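A minimal sketch of the weighted total loss; the example weight values are assumptions chosen only so that they sum to 1.

```python
def total_loss(loss1, loss2, loss3, alpha1=0.4, alpha2=0.3, alpha3=0.3):
    """Fourth (total) loss value: weighted sum of the first, second and third losses."""
    # Assumed normalization of the weights, as hedged in the text above.
    assert abs(alpha1 + alpha2 + alpha3 - 1.0) < 1e-6, "weights are expected to sum to 1"
    return alpha1 * loss1 + alpha2 * loss2 + alpha3 * loss3
```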
and 1005, respectively adjusting the network parameters of the first neural network, the second neural network and the third neural network through the fourth loss value to obtain a neural network model.
In some embodiments, a preset value may be set for the total loss value of the neural network model. When the fourth loss value is greater than the preset value, the network parameters of the first neural network, the network parameters of the second neural network and the network parameters of the third neural network are respectively adjusted by using the fourth loss value. When the fourth loss value is less than or equal to the preset value, the training is finished and the network parameters of the neural network model are stored.
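A rough sketch of this stopping rule inside a training loop; model.compute_total_loss is a hypothetical helper standing in for the three-stage loss computation described above, and the file name and epoch limit are illustrative.

```python
import torch


def train(model, optimizer, samples, preset=0.05, max_epochs=100):
    """Adjust the network parameters with the fourth (total) loss value until it is
    less than or equal to the preset value, then save the parameters and stop."""
    for _ in range(max_epochs):
        for video, labels in samples:
            loss = model.compute_total_loss(video, labels)  # hypothetical helper
            if loss.item() <= preset:
                torch.save(model.state_dict(), "event_model.pt")
                return
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    torch.save(model.state_dict(), "event_model.pt")
```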
Fig. 11 shows the training information flow of the neural network model, in which the solid lines represent the flow direction of data and the dashed lines represent the flow direction of loss information. Input 1 is the manually annotated video together with the event type, time boundary, event position, participating object and witness object of the event corresponding to the video. Output 1 of the first neural network (the event type and the time boundary) is part of the input information of the second neural network and serves as a constraint item for the second neural network, so that the second neural network can obtain a more accurate identification result; output 2 of the second neural network is part of the input information of the third neural network and serves as a constraint item for the third neural network, so that the third neural network can obtain a more accurate identification result. The total loss value of the whole neural network model is calculated from loss 1, loss 2 and loss 3, and the network parameters of the neural networks are adjusted by using the total loss value.
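The information flow of fig. 11, where each stage's output becomes an extra input for the next stage, can be sketched as a chained forward pass; the sub-network call signatures below are assumptions for this illustration.

```python
import torch.nn as nn


class EventDetectionPipeline(nn.Module):
    """Chain the three networks: outputs of earlier stages are extra inputs
    (constraint items) for the later stages."""

    def __init__(self, net1, net2, net3):
        super().__init__()
        self.net1, self.net2, self.net3 = net1, net2, net3

    def forward(self, video):
        event_type, time_boundary = self.net1(video)                          # output 1
        event_area, fence_area = self.net2(video, event_type, time_boundary)  # output 2
        participants, witnesses = self.net3(video, event_area, fence_area)    # output 3
        return event_type, time_boundary, event_area, fence_area, participants, witnesses
```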
Based on the same technical concept, an embodiment of the present application provides an event detection apparatus 1200. Referring to fig. 12, the apparatus 1200 may perform each step of the event detection method; to avoid repetition, a detailed description thereof is omitted here. The apparatus 1200 comprises a first identification module 1201, a first processing module 1202 and a second identification module 1203.
A first identification module 1201, configured to identify an event in a video to be detected, so as to obtain an event type of the event;
a first processing module 1202, configured to determine an event position of an event occurring in the video to be detected according to the event type, where the event position includes an event occurrence area and an electronic fence area; the event occurrence area is used for representing the three-dimensional space position of the event in the space shot by the acquisition equipment, and the electronic fence area is an area with a set size expanded on the basis of the event occurrence area;
a second identifying module 1203, configured to identify a participating object of the event according to the event occurrence area in the video to be detected, and identify a witness object of the event according to the electronic fence area in the video to be detected.
In some embodiments, the first processing module 1202 is further configured to determine a time boundary of an event occurring in the video to be detected, where the time boundary is used to express a start time and an end time of the event occurring in the video to be detected;
the first processing module 1202, when determining the event location of the event occurring in the video to be detected according to the event type, is specifically configured to: determining a video segment in the video to be detected according to the time boundary; and determining the event position of the event in the video segment according to the event type.
In some embodiments, the first identifying module 1201, when identifying an event in a video to be detected and determining a time boundary of the event occurring in the video to be detected, is specifically configured to: taking the video to be detected as an input of a first neural network, so as to output the event type and the time boundary through the first neural network;
the first identification module 1201, when determining an event location of an event occurring in the video to be detected according to the event type, is specifically configured to: taking the video to be detected, the event type and the time boundary as the input of a second neural network so as to output the event position through the second neural network;
the second identifying module 1203, when identifying a participating object of the event according to the event occurrence region in the video to be detected, and identifying a sighting object of the event according to the electronic fence region in the video to be detected, is specifically configured to: and taking the video to be detected and the event position as the input of a third neural network, so as to output the participating object and the sighting object of the event through the third neural network.
In some embodiments, the event occurrence area is indicated by the pixel positions in the video frames of the video to be detected onto which the cubic space where the event occurs is mapped; the event occurrence area satisfies the condition shown in the following formula:
Space = {(x1, y1), (x2, y2), (x3, y3), (x4, y4), (x5, y5), (x6, y6), (x7, y7), (x8, y8)}
wherein Space is used for indicating the event occurrence area, (x1, y1) to (x4, y4) represent the pixel point positions onto which the four vertexes of the bottom surface of the cubic space are respectively mapped, and (x5, y5) to (x8, y8) represent the pixel point positions onto which the four vertexes of the top surface of the cubic space are respectively mapped.
In some embodiments, the fence area is an area obtained by taking a center point of the bottom surface of the cubic space as a center and performing equal-scale magnification on the bottom surface by a set factor.
In some embodiments, the first, second, and third neural networks form a neural network model, the neural network model being trained based on a set of training samples;
the training sample set comprises a plurality of videos and video understanding results and target detection results corresponding to the videos respectively, the video understanding result of a first video comprises an event type, a time boundary, an event occurrence area and an electronic fence area of an event occurring in the first video, the first video is any one of the videos, and the target detection result of the first video comprises a participation object and a sighting object of the event occurring in the first video.
In some embodiments, the first processing module 1202 is further configured to train the neural network model by: inputting the first video into the first neural network to output a predicted event type and a predicted time boundary of an event occurring in the first video through the first neural network, and determining a first loss value according to a loss between the predicted event type and an event type corresponding to the first video in the training sample set and a loss between the predicted time boundary and a time boundary corresponding to the first video in the training sample set;
inputting the first video, the predicted event type, and the predicted time boundary into the second neural network to output a predicted event occurrence region and a predicted fence region through the second neural network, determining a second loss value based on a loss between the predicted event occurrence region and an event occurrence region corresponding to the first video in the training sample set, and a loss between the predicted fence region and a fence region corresponding to the first video in the training sample set;
inputting the first video, the predicted event occurrence region, and the predicted fence region into a third neural network to output a predicted engagement object and a predicted sighting object through the third neural network, determining a third loss value based on a loss between the predicted engagement object and an engagement object of an event occurring with the first video in the training sample set, and a loss between the predicted sighting object and a sighting object of an event occurring with the first video in the training sample set;
weighting the first loss value, the second loss value and the third loss value to obtain a fourth loss value;
and respectively adjusting the network parameters of the first neural network, the second neural network and the third neural network according to the fourth loss value to obtain the neural network model.
In some embodiments, the apparatus further comprises an output module 1204; the output module 1204 is configured to output an event detection result, where the event detection result includes the event type, the time boundary, the event location, the participation object, and the sighting object.
Based on the same technical concept, an embodiment of the present application provides a computer-readable storage medium, including: computer program code which, when run on a computer, causes the computer to perform the above-described event detection method. Since the principle of the computer-readable storage medium to solve the problem is similar to the event detection method, the implementation of the computer-readable storage medium can refer to the implementation of the method, and repeated details are not repeated.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (11)

1. An event detection method, comprising:
identifying an event in a video to be detected, and obtaining an event type of the event;
determining an event position of an event in the video to be detected according to the event type, wherein the event position comprises an event occurrence area and an electronic fence area of the event; the event occurrence area is used for representing the three-dimensional space position of the event in the space shot by the acquisition equipment, and the electronic fence area is an area with a set size expanded on the basis of the event occurrence area;
and identifying a participating object of the event according to the event occurrence region in the video to be detected, and identifying a witness object of the event according to the electronic fence region in the video to be detected.
2. The method of claim 1, wherein the method further comprises:
determining a time boundary of the event in the video to be detected, wherein the time boundary is used for expressing the starting time and the ending time of the event in the video to be detected;
the determining the event position of the event in the video to be detected according to the event type comprises the following steps:
determining a video segment in the video to be detected according to the time boundary;
and determining the event position of the event in the video segment according to the event type.
3. The method of claim 2, wherein said identifying an event in a video to be detected and determining a time boundary for said event in said video to be detected comprises:
taking the video to be detected as an input of a first neural network, so as to output the event type and the time boundary through the first neural network;
determining the event position of the event in the video to be detected according to the event type comprises the following steps:
taking the video to be detected, the event type and the time boundary as the input of a second neural network so as to output the event position through the second neural network;
the identifying the participating object of the event according to the event occurrence region in the video to be detected and identifying the sighting object of the event according to the electronic fence region in the video to be detected comprises:
and taking the video to be detected and the event position as the input of a third neural network, so as to output the participating object and the sighting object of the event through the third neural network.
4. The method according to claim 3, wherein the event occurrence area is indicated by pixel positions in a video frame comprised by the video to be detected onto which a cubic space where the event occurs is mapped;
the event occurrence area satisfies a condition shown by the following formula:
Space = {(x1, y1), (x2, y2), (x3, y3), (x4, y4), (x5, y5), (x6, y6), (x7, y7), (x8, y8)}
wherein Space is used for indicating the event occurrence area, (x1, y1) to (x4, y4) represent the pixel point positions onto which the four vertexes of the bottom surface of the cubic space are respectively mapped, and (x5, y5) to (x8, y8) represent the pixel point positions onto which the four vertexes of the top surface of the cubic space are respectively mapped.
5. The method of claim 4, wherein the fence area is an area obtained by enlarging the bottom surface of the cubic space in equal proportion by a set multiple, with a center point of the bottom surface as the center.
6. The method of any one of claims 3-5, wherein the first, second, and third neural networks form a neural network model, the neural network model trained based on a set of training samples;
the training sample set comprises a plurality of videos and video understanding results and target detection results corresponding to the videos respectively, the video understanding result of a first video comprises an event type, a time boundary, an event occurrence area and an electronic fence area of an event occurring in the first video, the first video is any one of the videos, and the target detection result of the first video comprises a participation object and a sighting object of the event occurring in the first video.
7. The method of claim 6, wherein the method further comprises:
training the neural network model by:
inputting the first video into the first neural network to output a predicted event type and a predicted time boundary of an event occurring in the first video through the first neural network, and determining a first loss value according to a loss between the predicted event type and an event type corresponding to the first video in the training sample set and a loss between the predicted time boundary and a time boundary corresponding to the first video in the training sample set;
inputting the first video, the predicted event type, and the predicted time boundary into the second neural network to output a predicted event occurrence region and a predicted fence region through the second neural network, determining a second loss value based on a loss between the predicted event occurrence region and an event occurrence region corresponding to the first video in the training sample set, and a loss between the predicted fence region and a fence region corresponding to the first video in the training sample set;
inputting the first video, the predicted event occurrence region, and the predicted fence region into a third neural network to output a predicted engagement object and a predicted sighting object through the third neural network, determining a third loss value based on a loss between the predicted engagement object and an engagement object of an event occurring with the first video in the training sample set, and a loss between the predicted sighting object and a sighting object of an event occurring with the first video in the training sample set;
weighting the first loss value, the second loss value and the third loss value to obtain a fourth loss value;
and respectively adjusting the network parameters of the first neural network, the second neural network and the third neural network according to the fourth loss value to obtain the neural network model.
8. The method of any of claims 2-5, 7, further comprising:
outputting an event detection result, the event detection result including the event type, the time boundary, the event location, the engagement object, and the sighting object.
9. An event detection device, comprising:
the first identification module is used for identifying an event in a video to be detected and obtaining the event type of the event;
the first processing module is used for determining the event position of an event in the video to be detected according to the event type, wherein the event position comprises an event occurrence area and an electronic fence area of the event; the event occurrence area is used for representing the three-dimensional space position of the event in the space shot by the acquisition equipment, and the electronic fence area is an area with a set size expanded on the basis of the event occurrence area;
and the second identification module is used for identifying a participating object of the event according to the event occurrence region in the video to be detected and identifying a sighting object of the event according to the electronic fence region in the video to be detected.
10. An electronic device, comprising:
a memory for storing program instructions;
a processor for invoking program instructions stored in said memory to execute the method of any of claims 1-8 in accordance with an obtained program.
11. A computer-readable storage medium having stored thereon computer instructions which, when executed on a computer, cause the computer to perform the method of any one of claims 1-8.
CN202210815231.2A 2022-07-11 2022-07-11 Event detection method and device Active CN114913470B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210815231.2A CN114913470B (en) 2022-07-11 2022-07-11 Event detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210815231.2A CN114913470B (en) 2022-07-11 2022-07-11 Event detection method and device

Publications (2)

Publication Number Publication Date
CN114913470A true CN114913470A (en) 2022-08-16
CN114913470B CN114913470B (en) 2022-10-28

Family

ID=82771844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210815231.2A Active CN114913470B (en) 2022-07-11 2022-07-11 Event detection method and device

Country Status (1)

Country Link
CN (1) CN114913470B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115225680A (en) * 2022-09-20 2022-10-21 中关村科学城城市大脑股份有限公司 Urban brain road maintenance system based on edge calculation

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102891960A (en) * 2011-07-19 2013-01-23 安讯士有限公司 Method and camera for determining an image adjustment parameter
CN103106280A (en) * 2013-02-22 2013-05-15 浙江大学 Uncertain space-time trajectory data range query method under road network environment
CN105989748A (en) * 2015-03-20 2016-10-05 现代自动车株式会社 Accident information management appratus, vehicle including the same, and accident information management method
CN105989747A (en) * 2015-03-20 2016-10-05 现代自动车株式会社 Accident information management apparatus, accident information managment method and vehicle comprising the same
CN108983806A (en) * 2017-06-01 2018-12-11 菜鸟智能物流控股有限公司 Method and system for generating area detection and air route planning data and aircraft
US20190205659A1 (en) * 2018-01-04 2019-07-04 Motionloft, Inc. Event monitoring with object detection systems
US10212572B1 (en) * 2018-02-09 2019-02-19 Banjo, Inc. Detecting and validating planned event information
CN112528716A (en) * 2019-09-19 2021-03-19 杭州海康威视数字技术股份有限公司 Event information acquisition method and device
US20210327086A1 (en) * 2019-11-28 2021-10-21 Shenzhen Sensetime Technology Co., Ltd. Detection method for pedestrian events, electronic device, and storage medium
CN112149503A (en) * 2020-08-20 2020-12-29 北京迈格威科技有限公司 Target event detection method and device, electronic equipment and readable medium
CN112446316A (en) * 2020-11-20 2021-03-05 浙江大华技术股份有限公司 Accident detection method, electronic device, and storage medium
CN112528801A (en) * 2020-12-02 2021-03-19 上海高德威智能交通***有限公司 Abnormal event detection method, model training method and device
CN112581763A (en) * 2020-12-11 2021-03-30 北京百度网讯科技有限公司 Method, device, equipment and storage medium for detecting road event
CN112464898A (en) * 2020-12-15 2021-03-09 北京市商汤科技开发有限公司 Event detection method and device, electronic equipment and storage medium
CN112437487A (en) * 2021-01-26 2021-03-02 北京深蓝长盛科技有限公司 Position positioning method, event identification method, device and computer equipment
CN113869258A (en) * 2021-10-08 2021-12-31 重庆紫光华山智安科技有限公司 Traffic incident detection method and device, electronic equipment and readable storage medium
CN114255447A (en) * 2022-01-17 2022-03-29 中国人民解放军国防科技大学 Unsupervised end-to-end video abnormal event data identification method and unsupervised end-to-end video abnormal event data identification device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MOHAMED ELTOWEISSY et al.: "Towards Autonomous Vehicular Clouds", ADHOCNETS 2010 *
YANG Tingting: "Multi-moving-vehicle tracking and traffic incident detection in intersection areas", China Master's Theses Full-text Database, Information Science and Technology (Monthly) *

Also Published As

Publication number Publication date
CN114913470B (en) 2022-10-28

Similar Documents

Publication Publication Date Title
KR102189262B1 (en) Apparatus and method for collecting traffic information using edge computing
CN108256404B (en) Pedestrian detection method and device
CN112085952B (en) Method and device for monitoring vehicle data, computer equipment and storage medium
CN110659391A (en) Video detection method and device
CN110348463B (en) Method and device for identifying vehicle
CN109325429A (en) A kind of method, apparatus, storage medium and the terminal of linked character data
CN112434566B (en) Passenger flow statistics method and device, electronic equipment and storage medium
CN112950717A (en) Space calibration method and system
CN110147731A (en) Vehicle type recognition method and Related product
CN111325107A (en) Detection model training method and device, electronic equipment and readable storage medium
JP2023505864A (en) Target movement trajectory construction method, equipment and computer storage medium
CN113111838A (en) Behavior recognition method and device, equipment and storage medium
CN111291646A (en) People flow statistical method, device, equipment and storage medium
CN114913470B (en) Event detection method and device
CN115761655A (en) Target tracking method and device
Liu et al. Temporal shift and spatial attention-based two-stream network for traffic risk assessment
CN114926791A (en) Method and device for detecting abnormal lane change of vehicles at intersection, storage medium and electronic equipment
CN114359618A (en) Training method of neural network model, electronic equipment and computer program product
CN113869258A (en) Traffic incident detection method and device, electronic equipment and readable storage medium
CN116912517B (en) Method and device for detecting camera view field boundary
CN111091089B (en) Face image processing method and device, electronic equipment and storage medium
CN111753766A (en) Image processing method, device, equipment and medium
CN112818743B (en) Image recognition method and device, electronic equipment and computer storage medium
Tang Development of a multiple-camera tracking system for accurate traffic performance measurements at intersections
CN112257666B (en) Target image content aggregation method, device, equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant