CN104243894A - Audio and video fused monitoring method - Google Patents

Audio and video fused monitoring method

Info

Publication number
CN104243894A
Authority
CN
China
Prior art keywords
target
sound
signal
video signal
feature
Prior art date
Legal status
Pending
Application number
CN201310231183.3A
Other languages
Chinese (zh)
Inventor
陈孝良
李晓东
Current Assignee
Institute of Acoustics CAS
Original Assignee
Institute of Acoustics CAS
Priority date
Filing date
Publication date
Application filed by Institute of Acoustics CAS
Priority to CN201310231183.3A
Publication of CN104243894A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to an audio and video fused monitoring method. The method comprises: collecting an audio signal and a video signal and conditioning the collected signals; performing collaborative preprocessing on the conditioned signals; judging whether the obtained signal comprises both an audio signal and a video signal; when both are present, analyzing the audio and video signals jointly and finding, from the result of the fused analysis, the target information they contain; if only an audio signal is present, analyzing the audio signal independently to obtain the target information it contains; and determining, from the obtained target information, whether the pose of the camera needs to be adjusted; if so, the pose is adjusted and the process repeats. Adjusting the camera pose comprises focusing, supplementary lighting, and angle adjustment.

Description

Audio and Video Fused Monitoring Method
Technical field
The present invention relates to the field of surveillance, and in particular to an audio and video fused monitoring method.
Background art
Video monitoring is a primary means of surveillance. Traditional video monitoring relies mainly on low-resolution monocular video sensors, and in the face of complex dynamic scenes and growing demands for intelligent real-time early warning, it faces two major challenges. First, video sensors have a narrow field of view, are easily occluded, and are susceptible to poor-visibility weather and changing light, such as rain, snow, fog, and day-night transitions. Second, detection, localization, and tracking are performed over large volumes of continuous video streams with high algorithmic complexity; the real-time performance of intelligent analysis of high-definition video in particular is poor, and cost and power consumption are also problems, which limits the application of HD video sensors in surveillance.
To address these challenges, extensive research has been carried out at home and abroad on making video monitoring more intelligent and real-time. One line of work extends and deepens intelligent video analysis based on high-level video processing algorithms; methods such as panoramic imaging, stereo cameras, and 3-D modeling compensate to some extent for the narrow field of view of monocular video sensors. Another line of work builds on multi-sensor data fusion theory, using features extracted from multiple homogeneous or heterogeneous sensors to realize object-oriented intelligent analysis. In recent years the video surveillance field has explored multi-camera linkage and the fusion of heterogeneous signals such as GPS, radar, laser, and infrared.
However, sound, a natural signal of interest, has so far received little attention in the surveillance field, mainly because microphone array technology has lagged behind. With the development of array and sensing technology, acoustic detection research based on microphone arrays has made considerable progress, with application demonstrations in fields such as medical monitoring, consumer electronics, border protection, and industrial control. Acoustic detection based on microphone arrays excels at detecting, locating, and tracking dispersed and transient moving targets, and offers low power consumption, all-weather operation, no occlusion, no blind zones, and good real-time performance, making it well suited to surveillance. However, because surveillance scenes are complex and noisy, existing microphone array localization techniques cannot be applied directly to surveillance scene analysis. In addition, acoustic detection yields relatively little information, so a microphone array alone cannot meet the demands of surveillance. At present there is no complete technical scheme for audio and video fused monitoring adapted to the surveillance field.
Summary of the invention
The object of the present invention is to overcome defects of video-only monitoring such as a narrow field of view, susceptibility to environmental conditions, and the small amount of information obtained, by providing an audio and video fused monitoring method based on a microphone array and a pan-tilt camera.
To achieve this object, the invention provides an audio and video fused monitoring method, comprising:
Step 1) collecting audio and video signals, and conditioning the collected signals;
Step 2) performing collaborative preprocessing on the conditioned signals obtained in step 1); the collaborative preprocessing comprises compression, filtering, denoising, and enhancement of the signals;
Step 3) judging whether the signal obtained in step 2) comprises both an audio signal and a video signal; when both are present, performing step 4); if only an audio signal is present, performing step 5);
Step 4) performing fused analysis on the audio and video signals, finding the target information contained in the audio and video signals according to the result of the fused analysis, and then performing step 6);
Step 5) performing independent analysis on the audio signal to obtain the target information contained in it, and then performing step 6);
Step 6) determining, according to the target information obtained in step 4) or step 5), whether the pose of the camera needs to be adjusted; if so, adjusting the pose of the camera and then re-executing step 1); wherein the camera pose adjustment comprises focusing, supplementary lighting, and angle adjustment.
In the above technical scheme, the method further comprises:
Step 7) performing pattern recognition on the current audio and video signals to obtain semantic information about the target event, including keywords, time, bearing, category, and state; the pattern recognition comprises behavior understanding, discrimination control, and state estimation, wherein the behavior understanding extracts motion features to obtain the keywords of the target event; the discrimination control, based on the result of the behavior understanding, further obtains information such as the time and bearing of the event and compares it with the corresponding keyword thresholds to determine the category of the target event; and the state estimation, according to the determined category, estimates the importance of the target event from the preset feature values of that category and sets an alarm level accordingly;
Step 8) capturing key information and core segments from the pattern-recognized audio and video signals, splicing and editing multiple segments into semantic information reflecting the monitored content, compressing and encoding the semantic information, and finally transmitting it over a network.
In the above technical scheme, step 4) comprises:
Step 4-1) extracting background noise data from a background noise database and building a background model; wherein the background noise database stores the background noise of multiple typical scenes under multiple meteorological conditions; the meteorological conditions include special weather such as wind, rain, snow, and fog, and the typical scenes include calls for help, whistles, collisions, explosions, gunshots, low-altitude flight, and crowd gathering;
Step 4-2) extracting multiple pieces of target feature information from a target feature database and combining them with the background noise model built in step 4-1) to obtain virtual target features; wherein the target feature database stores target features, including basic features, transform-domain features, statistical features, and motion features of the audio or video signal, together with their information in time, space, spectrum, and phase;
Step 4-3) comparing the audio and video signals generated in step 2) with the virtual target features generated in step 4-2) to extract the target features of the audio and video signals generated in step 2);
Step 4-4) using Bayesian analysis on the target feature extraction result of step 4-3) to make a probabilistic decision, finding the events contained in the collected audio and video signals by maximum a posteriori probability;
Step 4-5) applying, to the target detected in step 4-4), beamforming and direction-of-arrival estimation methods based on the target features and the background noise model, and computing, according to the laws of acoustic signal propagation, the energy, phase, and Doppler effect of a moving sound source target in open-space and enclosed-space environments, so as to locate the target and determine its coordinates;
Step 4-6) tracking the located target.
In the above technical scheme, between step 4-3) and step 4-4), steps 4-1) to 4-3) are also executed multiple times.
In the above technical scheme, in step 4-3), comparing the audio and video signals with the virtual target features yields a set of target feature values; these values are sorted from high to low by similarity, and the feature values above a preset threshold in the sorted result constitute the target feature extraction result of the audio and video signals.
In the above technical scheme, in step 4-6), the tracking comprises controlling the camera pose according to the coordinates determined by the microphone array, realizing focusing, supplementary lighting, and angle adjustment.
In the above technical scheme, step 5) comprises:
Step 5-1) extracting background noise data from the background noise database and building a background model, and extracting target features from the target feature database;
Step 5-2) applying beamforming and direction-of-arrival estimation methods based on the target features and the background noise model, and computing, according to the laws of acoustic signal propagation, the contributions of the energy, phase, and Doppler effect of a moving sound source target in open-space and enclosed-space environments to the models for detecting, locating, and tracking distributed targets, thereby performing optimized recognition, classification, localization, and tracking of acoustic targets.
In the above technical scheme, in step 6), step 1) is re-executed no more than 3 times.
The advantages of the invention are:
1) The invention introduces acoustic features as parameters into video detection and tracking algorithms. Acoustic signal processing has low algorithmic complexity and good real-time performance, which can improve the performance of video target recognition and tracking algorithms.
2) The invention extracts compound features that fuse the two heterogeneous signals, audio and video, making up for the shortcomings of traditional video surveillance; it offers all-weather, occlusion-free, blind-zone-free detection, localization, and tracking, and can improve the response speed of the surveillance system.
3) The invention performs automatic analysis and semantic understanding of audio and video data, captures key information and core segments from the monitored scene, splices and edits them into semantic information reflecting the monitored content, and transmits it over the network after compression and encoding, which can avoid the ever-growing mass of data in surveillance networks.
4) The invention integrates the collection, analysis, computation, and communication of multi-channel audio and video signals into one device, solving the problem that oversized microphone arrays are hard to install, while supporting wireless transmission and power line communication (PLC), which can avoid the higher cost of extensive cabling.
Brief description of the drawings
Fig. 1 is a flowchart of the audio and video fused monitoring method of the present invention;
Fig. 2 is a schematic diagram of camera pose adjustment.
Detailed description
The invention is further described below with reference to the accompanying drawings.
The audio and video fused monitoring method of the present invention monitors a scene using the sound signal obtained by a microphone array and the video signal obtained by a camera.
Before the steps of the method are described in detail, the concepts involved are first explained.
Target feature database: a target is an unexpected abnormal event in the monitored scene. The target feature database stores target features, including basic features, transform-domain features, statistical features, and motion features of the audio or video signal, together with their information in time, space, spectrum, and phase (e.g. mean, variance, cepstrum, envelope).
Background noise database: stores the background noise of multiple typical scenes under multiple meteorological conditions. The meteorological conditions include special weather such as wind, rain, snow, and fog; the typical scenes include calls for help, whistles, collisions, explosions, gunshots, low-altitude flight, and crowd gathering.
The method of the present invention is described below with reference to the drawings.
Referring to Fig. 1, the method comprises the following steps:
Step 1) collecting audio and video signals, and conditioning the collected signals.
In this step, the video signal is collected by a camera and the audio signal by a microphone array. Normally the collected signal contains both audio and video; however, when the microphone array or the camera fails, the collected signal may contain only audio or only video, and subsequent operations can still proceed in such cases.
Step 2) performing collaborative preprocessing on the conditioned signals obtained in step 1).
In this step, the collaborative preprocessing comprises compression, filtering, denoising, and enhancement performed in turn on the signal; the implementations of these operations are well known to those skilled in the art and are not repeated here.
If the signal collected in step 1) contains both audio and video, collaborative compression and collaborative filtering are used when compressing and filtering; if it contains only audio or only video, it is processed as a single signal.
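The patent does not fix concrete preprocessing algorithms. As a minimal sketch only, the pipeline below chains toy denoising and enhancement stages and falls back to single-signal processing when one modality is missing; every function and parameter name here is hypothetical:

```python
import numpy as np

def denoise(x, alpha=0.9):
    """Toy spectral-subtraction denoising: subtract a crude noise floor."""
    spec = np.fft.rfft(x)
    mag, phase = np.abs(spec), np.angle(spec)
    floor = alpha * np.median(mag)                 # crude noise-floor estimate
    mag = np.maximum(mag - floor, 0.0)
    return np.fft.irfft(mag * np.exp(1j * phase), n=len(x))

def enhance(x):
    """Peak normalization standing in for 'enhancement'."""
    peak = np.max(np.abs(x))
    return x / peak if peak > 0 else x

def collaborative_preprocess(audio=None, video=None):
    """Both modalities present -> joint ('collaborative') path;
    otherwise each available signal is processed on its own."""
    if audio is not None and video is not None:
        # Collaborative mode: e.g. align both streams to one time base
        # before filtering (alignment omitted in this sketch).
        return enhance(denoise(audio)), video
    if audio is not None:
        return enhance(denoise(audio)), None
    return None, video
```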
Step 3) judging whether the signal obtained in step 2) contains both an audio signal and a video signal; when both are present, performing step 4); if only an audio signal is present, performing step 5).
As mentioned above, the collected signal may contain only a video signal or only an audio signal; the analysis of a video-only signal is outside the scope of this application.
Step 4) performing fused analysis on the audio and video signals, finding the target information contained in them according to the result of the fused analysis, and then performing step 6).
Step 5) performing independent analysis on the audio signal to obtain the target information contained in it, and then performing step 6).
Step 6) determining, according to the target information obtained in step 4) or step 5), whether the camera pose needs to be adjusted. Referring to Fig. 2, when adjusting the pose, the current pose of the camera is first sensed; then, according to the direction and range of the acoustic target determined from the sound signal received by the microphone array, the difference between the current pose and the target pose is computed, and the pose is adjusted accordingly. The pose adjustment comprises operations such as focusing, supplementary lighting, and angle adjustment.
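A minimal sketch of the pose-difference computation, assuming the microphone array supplies the target's azimuth, elevation, and range, and that zoom is chosen from range; the `Pose` fields and the 10-meters-per-zoom-step rule are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Pose:
    pan_deg: float    # horizontal angle of the pan-tilt head
    tilt_deg: float   # vertical angle
    zoom: float       # focal-length factor

def pose_delta(current: Pose, az_deg: float, el_deg: float, range_m: float) -> Pose:
    """Difference between the current camera pose and the pose that points
    at the acoustic target reported by the microphone array."""
    target_zoom = max(1.0, range_m / 10.0)   # hypothetical: one zoom step per 10 m
    return Pose(pan_deg=az_deg - current.pan_deg,
                tilt_deg=el_deg - current.tilt_deg,
                zoom=target_zoom - current.zoom)

# Target heard at azimuth 40 deg, elevation 5 deg, 35 m away:
print(pose_delta(Pose(10.0, 0.0, 1.0), 40.0, 5.0, 35.0))
# Pose(pan_deg=30.0, tilt_deg=5.0, zoom=2.5)
```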
After the camera pose is adjusted, step 1) can be re-executed to re-collect or supplement the signal; the new result undergoes collaborative preprocessing and audio-video fused analysis as in the preceding steps, and the outcome can be used to adjust the camera pose further. This positioning loop is repeated at most 3 times, to guarantee convergence and speed, and is automatically disabled during tracking.
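The re-acquisition loop just described might be organized as below, with the 3-pass cap and the tracking lockout made explicit; `acquire`, `analyze`, and `adjust_camera` are hypothetical stand-ins for steps 1) through 6):

```python
MAX_RELOCATE = 3   # the method caps the positioning loop at 3 passes

def positioning_loop(acquire, analyze, adjust_camera, tracking_active=False):
    """Repeat acquire -> analyze -> adjust until the pose is good enough,
    at most MAX_RELOCATE times; skipped entirely while a track is active."""
    if tracking_active:                        # loop is disabled during tracking
        return None
    target = None
    for _ in range(MAX_RELOCATE):
        signals = acquire()                    # step 1): (re)collect signals
        target, pose_ok = analyze(signals)     # steps 2)-5): preprocess and fuse
        if pose_ok:                            # step 6): no adjustment needed
            break
        adjust_camera(target)                  # focusing / lighting / angle
    return target
```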
The above describes the basic steps of the method. As a preferred implementation, in another embodiment the method further comprises:
Step 7) performing pattern recognition on the current audio and video signals to obtain semantic information about the target event, such as keywords, time, bearing, category, and state. The pattern recognition comprises behavior understanding, discrimination control, and state estimation. Behavior understanding mainly extracts motion features to obtain the keywords of the target event, for example "collision" or "explosion"; discrimination control, based mainly on the result of behavior understanding, further obtains information such as the time and bearing of the event and compares it with the corresponding keyword thresholds to determine the category of the target event; state estimation, mainly according to the determined category, estimates the importance of the target event from the preset feature values of that category and sets an alarm level accordingly.
Step 8) capturing key information and core segments from the pattern-recognized audio and video signals, splicing and editing multiple segments into semantic information reflecting the monitored content, compressing and encoding the semantic information, and finally transmitting it over a network.
Through steps 7) and 8), the audio and video signals obtained during monitoring can be retrieved conveniently in subsequent operations, improving retrieval efficiency and aiding further use of the monitored information.
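As an illustration of the decision chain in step 7) (behavior understanding, then discrimination control, then state estimation), the sketch below maps a detected keyword and its score to an event record and an alarm level; the thresholds, importance weights, and level scale are all invented:

```python
# Hypothetical per-keyword detection thresholds and importance weights.
KEYWORD_THRESHOLDS = {"collision": 0.6, "explosion": 0.4, "gunshot": 0.5}
IMPORTANCE = {"collision": 2, "explosion": 5, "gunshot": 5}

def discriminate(keyword, score, time_s, bearing_deg):
    """Discrimination control: accept the event only if its score clears the
    keyword's threshold, then attach the time/bearing semantics."""
    if score < KEYWORD_THRESHOLDS.get(keyword, 1.0):
        return None
    return {"keyword": keyword, "time_s": time_s, "bearing_deg": bearing_deg}

def alarm_level(event, score):
    """State estimation: scale the preset importance by the detection score."""
    weight = IMPORTANCE[event["keyword"]]
    return min(5, round(weight * score) + 1)   # alarm levels 1..5, invented scale

event = discriminate("explosion", score=0.8, time_s=123.4, bearing_deg=42.0)
if event is not None:
    print(alarm_level(event, 0.8))   # -> 5
```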
The implementation of the relevant steps of the method is further described below.
In step 4), the fused analysis of the audio and video signals comprises several sub-steps:
Step 4-1) extracting background noise data from the background noise database and building a background model.
As described above, the background noise database contains the background noise of multiple typical scenes under multiple meteorological conditions. In this step, background noise data matching the external conditions at the time of monitoring are selected from the database and used to build the model.
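One plausible reading of the background modeling in step 4-1) is to pool the stored noise clips that match the current weather and scene and summarize their spectra; the database layout and the mean/std statistics below are assumptions, not the patent's specification:

```python
import numpy as np

def build_background_model(noise_db, weather, scene, n_fft=512):
    """noise_db: dict keyed by (weather, scene) -> list of 1-D noise clips.
    Returns mean and std of the magnitude spectrum across the matching clips."""
    clips = noise_db[(weather, scene)]
    specs = np.stack([np.abs(np.fft.rfft(c[:n_fft], n=n_fft)) for c in clips])
    return specs.mean(axis=0), specs.std(axis=0)

# Hypothetical database with two recorded clips of a rainy street scene.
rng = np.random.default_rng(0)
noise_db = {("rain", "street"): [rng.normal(size=4096) for _ in range(2)]}
mean_spec, std_spec = build_background_model(noise_db, "rain", "street")
```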
Step 4-2) extracting multiple pieces of target feature information from the target feature database and combining them with the background noise model built in step 4-1) to obtain virtual target features.
Step 4-3) comparing the audio and video signals generated in step 2) with the virtual target features generated in step 4-2) to extract the target features of the audio and video signals generated in step 2).
In this step, comparing the audio and video signals with the virtual target features yields a set of target feature values; these values are sorted from high to low by similarity, and the feature values above a preset threshold in the sorted result are exactly the target feature extraction result of the audio and video signals.
It should be noted that when the method runs on a resource-constrained embedded operating system, it may not be possible to read all of the information in the target feature database and the background noise database at once; in that case steps 4-1) to 4-3) are executed multiple times to obtain a more accurate target feature extraction result.
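A minimal sketch of the matching in steps 4-2) and 4-3), with cosine similarity standing in for whatever comparison a real implementation would use, plus the chunked database reads suggested for embedded systems; the 0.8 threshold is arbitrary:

```python
import numpy as np

def match_targets(signal_feat, virtual_feats, threshold=0.8):
    """Compare the observed feature vector with each virtual target feature,
    sort by similarity (high to low), keep those above the preset threshold."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = [(name, cosine(signal_feat, v)) for name, v in virtual_feats.items()]
    scored.sort(key=lambda kv: kv[1], reverse=True)
    return [(name, s) for name, s in scored if s >= threshold]

def match_in_chunks(signal_feat, feature_db_chunks, threshold=0.8):
    """Embedded variant: the feature database is read chunk by chunk (one pass
    of steps 4-1)..4-3) per chunk) and the partial results are merged."""
    hits = []
    for chunk in feature_db_chunks:
        hits.extend(match_targets(signal_feat, chunk, threshold))
    hits.sort(key=lambda kv: kv[1], reverse=True)
    return hits
```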
Step 4-4) using Bayesian analysis on the target feature extraction result of step 4-3) to make a probabilistic decision, finding the events contained in the collected audio and video signals by maximum a posteriori probability.
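Step 4-4) reads as standard maximum a posteriori classification over event hypotheses. A sketch under that reading, with invented priors and one-dimensional Gaussian likelihoods:

```python
import numpy as np

def map_event(feature, events):
    """Return the event maximizing posterior ∝ prior × likelihood; `events`
    maps name -> (prior, mean, std) of a 1-D Gaussian feature model."""
    best, best_log_post = None, -np.inf
    for name, (prior, mu, sigma) in events.items():
        log_lik = -0.5 * ((feature - mu) / sigma) ** 2 - np.log(sigma)
        log_post = np.log(prior) + log_lik        # unnormalized log-posterior
        if log_post > best_log_post:
            best, best_log_post = name, log_post
    return best

# Invented models: an 'explosion' has a much larger energy feature than background.
events = {"background": (0.9, 0.1, 0.05), "explosion": (0.1, 0.9, 0.1)}
print(map_event(0.85, events))   # -> 'explosion'
```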
Step 4-5) applying, to the target detected in step 4-4), beamforming and direction-of-arrival estimation methods refined by the target features and the background noise model, and computing, according to the laws of acoustic signal propagation, the energy, phase, and Doppler effect of a moving sound source target in open-space and enclosed-space environments, so as to locate the target and determine its coordinates.
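For the direction-of-arrival part of step 4-5), a classic delay-and-sum beamformer scanned over candidate angles is one possible realization; the sketch assumes a uniform linear array, far-field propagation, and a known speed of sound, none of which the patent fixes:

```python
import numpy as np

C = 343.0   # speed of sound, m/s

def doa_delay_and_sum(frames, mic_x, fs, angles_deg):
    """frames: (n_mics, n_samples) array; mic_x: mic positions on a line (m).
    Steer to each candidate angle, sum coherently, return the max-power angle."""
    n_mics, n = frames.shape
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spectra = np.fft.rfft(frames, axis=1)
    best_angle, best_power = None, -np.inf
    for ang in angles_deg:
        delays = mic_x * np.sin(np.deg2rad(ang)) / C     # far-field delays (s)
        # Undo each channel's propagation delay by a phase shift, then sum.
        steering = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
        power = np.sum(np.abs(np.sum(spectra * steering, axis=0)) ** 2)
        if power > best_power:
            best_angle, best_power = ang, power
    return best_angle

# 4-mic array, 8 cm spacing; simulate a 1 kHz tone arriving from 30 degrees.
fs, f0 = 16000, 1000.0
mic_x = np.arange(4) * 0.08
t = np.arange(1024) / fs
delays = mic_x * np.sin(np.deg2rad(30.0)) / C
frames = np.stack([np.cos(2 * np.pi * f0 * (t - d)) for d in delays])
print(doa_delay_and_sum(frames, mic_x, fs, np.arange(-90, 91, 1)))  # ≈ 30
```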
Step 4-6) tracking the located target. The tracking comprises controlling the pose of the pan-tilt camera according to the coordinates determined by the microphone array, realizing operations such as focusing, supplementary lighting, and angle adjustment, ensuring that video of the intended target can be captured continuously and stably under multi-target conditions, and realizing fast and accurate switching between multiple targets.
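The fast switching between multiple targets in step 4-6) suggests keeping tracked targets in a priority order; a toy scheduler under that assumption (the priorities, IDs, and coordinates are made up):

```python
import heapq

def next_target(tracks):
    """tracks: list of (priority, target_id, coords); highest priority first.
    Returns the target the pan-tilt camera should switch to next."""
    heap = [(-priority, target_id, coords) for priority, target_id, coords in tracks]
    heapq.heapify(heap)
    _, target_id, coords = heapq.heappop(heap)
    return target_id, coords

print(next_target([(2, "car", (5.0, 1.0, 0.0)), (5, "gunshot", (20.0, 3.0, 1.5))]))
# -> ('gunshot', (20.0, 3.0, 1.5))
```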
In step 5), the independent analysis of the audio signal comprises:
Step 5-1) extracting background noise data from the background noise database and building a background model, and extracting target features from the target feature database.
Step 5-2) applying beamforming and direction-of-arrival estimation methods based on the target features and the background noise model, and computing, according to the laws of acoustic signal propagation, the contributions of the energy, phase, and Doppler effect of a moving sound source target in open-space and enclosed-space environments to the models for detecting, locating, and tracking distributed targets, thereby performing optimized recognition, classification, localization, and tracking of acoustic targets.
Finally, it should be noted that the above embodiments merely illustrate rather than limit the technical scheme of the invention. Although the invention has been described in detail with reference to these embodiments, those of ordinary skill in the art should understand that modifications or equivalent replacements that do not depart from the spirit and scope of the technical scheme are all encompassed within the claims of the invention.

Claims (8)

1. An audio and video fused monitoring method, comprising:
Step 1) collecting audio and video signals, and conditioning the collected signals;
Step 2) performing collaborative preprocessing on the conditioned signals obtained in step 1), the collaborative preprocessing comprising compression, filtering, denoising, and enhancement of the signals;
Step 3) judging whether the signal obtained in step 2) comprises both an audio signal and a video signal; when both are present, performing step 4); if only an audio signal is present, performing step 5);
Step 4) performing fused analysis on the audio and video signals, finding the target information contained in the audio and video signals according to the result of the fused analysis, and then performing step 6);
Step 5) performing independent analysis on the audio signal to obtain the target information contained in it, and then performing step 6);
Step 6) determining, according to the target information obtained in step 4) or step 5), whether the pose of the camera needs to be adjusted; if so, adjusting the pose of the camera and then re-executing step 1); wherein the camera pose adjustment comprises focusing, supplementary lighting, and angle adjustment.
2. The audio and video fused monitoring method according to claim 1, characterized in that it further comprises:
Step 7) performing pattern recognition on the current audio and video signals to obtain semantic information about the target event, comprising keywords, time, bearing, category, and state; the pattern recognition comprising behavior understanding, discrimination control, and state estimation, wherein the behavior understanding extracts motion features to obtain the keywords of the target event; the discrimination control, based on the result of the behavior understanding, further obtains information such as the time and bearing of the event and compares it with the corresponding keyword thresholds to determine the category of the target event; and the state estimation, according to the determined category, estimates the importance of the target event from the preset feature values of that category and sets an alarm level accordingly;
Step 8) capturing key information and core segments from the pattern-recognized audio and video signals, splicing and editing multiple segments into semantic information reflecting the monitored content, compressing and encoding the semantic information, and finally transmitting it over a network.
3. The audio and video fused monitoring method according to claim 1 or 2, characterized in that step 4) comprises:
Step 4-1) extracting background noise data from a background noise database and building a background model; wherein the background noise database stores the background noise of multiple typical scenes under multiple meteorological conditions; the meteorological conditions comprise special weather such as wind, rain, snow, and fog, and the typical scenes comprise calls for help, whistles, collisions, explosions, gunshots, low-altitude flight, and crowd gathering;
Step 4-2) extracting multiple pieces of target feature information from a target feature database and combining them with the background noise model built in step 4-1) to obtain virtual target features; wherein the target feature database stores target features, comprising basic features, transform-domain features, statistical features, and motion features of the audio or video signal, together with their information in time, space, spectrum, and phase;
Step 4-3) comparing the audio and video signals generated in step 2) with the virtual target features generated in step 4-2) to extract the target features of the audio and video signals generated in step 2);
Step 4-4) using Bayesian analysis on the target feature extraction result of step 4-3) to make a probabilistic decision, finding the events contained in the collected audio and video signals by maximum a posteriori probability;
Step 4-5) applying, to the target detected in step 4-4), beamforming and direction-of-arrival estimation methods based on the target features and the background noise model, and computing, according to the laws of acoustic signal propagation, the energy, phase, and Doppler effect of a moving sound source target in open-space and enclosed-space environments, so as to locate the target and determine its coordinates;
Step 4-6) tracking the located target.
4. The audio and video fused monitoring method according to claim 3, characterized in that between step 4-3) and step 4-4), steps 4-1) to 4-3) are also executed multiple times.
5. The audio and video fused monitoring method according to claim 3, characterized in that in step 4-3), comparing the audio and video signals with the virtual target features yields a set of target feature values; these values are sorted from high to low by similarity, and the feature values above a preset threshold in the sorted result constitute the target feature extraction result of the audio and video signals.
6. The audio and video fused monitoring method according to claim 3, characterized in that in step 4-6), the tracking comprises controlling the camera pose according to the coordinates determined by the microphone array, realizing focusing, supplementary lighting, and angle adjustment.
7. The audio and video fused monitoring method according to claim 1 or 2, characterized in that step 5) comprises:
Step 5-1) extracting background noise data from the background noise database and building a background model, and extracting target features from the target feature database;
Step 5-2) applying beamforming and direction-of-arrival estimation methods based on the target features and the background noise model, and computing, according to the laws of acoustic signal propagation, the contributions of the energy, phase, and Doppler effect of a moving sound source target in open-space and enclosed-space environments to the models for detecting, locating, and tracking distributed targets, thereby performing optimized recognition, classification, localization, and tracking of acoustic targets.
8. The audio and video fused monitoring method according to claim 1 or 2, characterized in that in step 6), step 1) is re-executed no more than 3 times.
CN201310231183.3A 2013-06-09 2013-06-09 Audio and video fused monitoring method Pending CN104243894A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310231183.3A CN104243894A (en) 2013-06-09 2013-06-09 Audio and video fused monitoring method


Publications (1)

Publication Number Publication Date
CN104243894A true CN104243894A (en) 2014-12-24

Family

ID=52231136

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310231183.3A Pending CN104243894A (en) 2013-06-09 2013-06-09 Audio and video fused monitoring method

Country Status (1)

Country Link
CN (1) CN104243894A (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1381131A (en) * 2000-03-21 2002-11-20 皇家菲利浦电子有限公司 Hands-free home video production camcorder
CN1586074A (en) * 2001-11-13 2005-02-23 皇家飞利浦电子股份有限公司 A system and method for providing an awareness of remote people in the room during a videoconference
US20030174210A1 (en) * 2002-03-04 2003-09-18 Nokia Corporation Video surveillance method, video surveillance system and camera application module
CN101017591A (en) * 2007-02-06 2007-08-15 重庆大学 Video safety prevention and monitoring method based on biology sensing and image information fusion
CN101030323A (en) * 2007-04-23 2007-09-05 凌子龙 Automatic evidence collecting device on crossroad for vehicle horning against traffic regulation
CN101364408A (en) * 2008-10-07 2009-02-11 西安成峰科技有限公司 Sound image combined monitoring method and system
CN101753992A (en) * 2008-12-17 2010-06-23 深圳市先进智能技术研究所 Multi-mode intelligent monitoring system and method
CN101771814A (en) * 2009-12-29 2010-07-07 天津市亚安科技电子有限公司 Pan and tilt camera with sound identification and positioning function

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107223332A (en) * 2015-03-19 2017-09-29 英特尔公司 Audio-visual scene analysis based on acoustics camera
CN107223332B (en) * 2015-03-19 2021-02-05 英特尔公司 Audio visual scene analysis based on acoustic camera
CN106202903A (en) * 2016-07-04 2016-12-07 广州瑞康本圣生物科技有限公司 Internet of Things wisdom hospital event Flow driving engine method
CN108389586A (en) * 2017-05-17 2018-08-10 宁波桑德纳电子科技有限公司 A kind of long-range audio collecting device, monitoring device and long-range collection sound method
CN110532888A (en) * 2019-08-01 2019-12-03 悉地国际设计顾问(深圳)有限公司 A kind of monitoring method, apparatus and system
CN112396801A (en) * 2020-11-16 2021-02-23 苏州思必驰信息科技有限公司 Monitoring alarm method, monitoring alarm device and storage medium

Similar Documents

Publication Publication Date Title
CN107818571B (en) Ship automatic tracking method and system based on deep learning network and average drifting
JP5385893B2 (en) POSITIONING SYSTEM AND SENSOR DEVICE
CN104243894A (en) Audio and video fused monitoring method
CN110991289A (en) Abnormal event monitoring method and device, electronic equipment and storage medium
CN112270680B (en) Low altitude unmanned detection method based on sound and image fusion
CN103198838A (en) Abnormal sound monitoring method and abnormal sound monitoring device used for embedded system
Andersson et al. Fusion of acoustic and optical sensor data for automatic fight detection in urban environments
US20170019639A1 (en) Integrated monitoring cctv, abnormality detection apparatus, and method for operating the apparatus
CN102254394A (en) Antitheft monitoring method for poles and towers in power transmission line based on video difference analysis
CN112261719B (en) Area positioning method combining SLAM technology with deep learning
CN113096397A (en) Traffic jam analysis method based on millimeter wave radar and video detection
CN115034324B (en) Multi-sensor fusion perception efficiency enhancement method
CN105809890A (en) School-bus-safety-oriented missed-child detecting method
CN111353496B (en) Real-time detection method for infrared dim targets
CN105825520A (en) Monocular SLAM (Simultaneous Localization and Mapping) method capable of creating large-scale map
CN110377066A (en) A kind of control method of inspection device, device and equipment
CN108965789B (en) Unmanned aerial vehicle monitoring method and audio-video linkage device
CN107390164B (en) A kind of continuous tracking method of underwater distributed multi-source target
CN202958578U (en) Bird situation monitoring and bird repelling system for airport
CN110597077B (en) Method and system for realizing intelligent scene switching based on indoor positioning
CN105590021B (en) Dynamic quantity audio source tracking method based on microphone array
CN111784750A (en) Method, device and equipment for tracking moving object in video image and storage medium
CN109188419B (en) Method and device for detecting speed of obstacle, computer equipment and storage medium
CN113432276B (en) Method and equipment for automatically adjusting air conditioner and air conditioner
CN115359329A (en) Unmanned aerial vehicle tracking and identifying method and system based on audio-visual cooperation

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20141224)