EP1387514A2 - Signal comparison method and apparatus - Google Patents


Info

Publication number
EP1387514A2
Authority
EP
European Patent Office
Prior art keywords
signal
representation
audio
section
location
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP03254607A
Other languages
German (de)
French (fr)
Other versions
EP1387514A3 (en)
Inventor
Yuan-Xing Zheng
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
British Broadcasting Corp
Original Assignee
British Broadcasting Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by British Broadcasting Corp
Publication of EP1387514A2
Publication of EP1387514A3

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04HBROADCAST COMMUNICATION
    • H04H20/00Arrangements for broadcast or for distribution combined with broadcast
    • H04H20/12Arrangements for observation, testing or troubleshooting
    • H04H20/14Arrangements for observation, testing or troubleshooting for monitoring programmes
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00Details of electrophonic musical instruments
    • G10H1/0008Associated control or indicating means
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04HBROADCAST COMMUNICATION
    • H04H60/00Arrangements for broadcast applications with a direct linking to broadcast information or broadcast space-time; Broadcast-related systems
    • H04H60/56Arrangements characterised by components specially adapted for monitoring, identification or recognition covered by groups H04H60/29-H04H60/54
    • H04H60/58Arrangements characterised by components specially adapted for monitoring, identification or recognition covered by groups H04H60/29-H04H60/54 of audio
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2240/00Data organisation or data communication aspects, specifically adapted for electrophonic musical tools or instruments
    • G10H2240/121Musical libraries, i.e. musical databases indexed by musical parameters, wavetables, indexing schemes using musical parameters, musical rule bases or knowledge bases, e.g. for automatic composing methods
    • G10H2240/131Library retrieval, i.e. searching a database or selecting a specific musical piece, segment, pattern, rule or parameter set
    • G10H2240/141Library retrieval matching, i.e. any of the steps of matching an inputted segment or phrase with musical database contents, e.g. query by humming, singing or playing; the steps may include, e.g. musical analysis of the input, musical feature extraction, query formulation, or details of the retrieval process
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2250/00Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/025Envelope processing of music signals in, e.g. time domain, transform domain or cepstrum domain
    • G10H2250/031Spectrum envelope processing

Definitions

  • the invention provides a method and apparatus for determining the relative time difference or delay between first and second audio signals that represent substantially the same audio content.
  • the invention also provides a method and an apparatus for determining whether two audio signals contain the same audio content.
  • the preferred system processes the data block by extracting the dominant frequency from each frame using a Fast Fourier Transform (FFT).
  • the result of the sampling step S30 is to produce an audio signature of the type illustrated in Figure 4.
  • the wave form contains a large number of unrelated and sporadic peaks, none of which could reliably be taken to indicate the presence of a match.
  • This diagram illustrates the need to count the number of peaks for a given sample size in order to determine the quality of the plot.
  • The results of the correlation are output in step S70, indicating whether or not a match was found. If a match was found, the relative delay between the two signals may also be output. This could be used in a synchronising method, for example, where one of the signals is a master or timing signal and the other is the signal to be synchronised. Using the relative delay calculated for one signal with respect to the other, synchronisation could be achieved by calculating the phase shift or time shift required to align the two signals in time and applying this to the signal that is to be synchronised.
  • the previous signature for the second signal that is the signal from the programme originator, is also discarded, and a new source signature representing the next portion or region of the second audio signal is received for comparison.
  • This new portion may be that part of the signal following on directly in time from the previous signal section.
  • the comparison is then performed again based on the next sections of the two signals, and a match can be expected to occur at a position in the first audio signature given by one length of the second signature.
  • Figure 10 shows a schematic illustration of the preferred system in an implementation for monitoring broadcast signals at a programme originator and at a destination transmitter.
  • the input audio source of the programme originator is first received at an input terminal 102, connected to a server or computer processor 104 at the programme originator's location 106, and also to an audio network 108.
  • the signal is passed to the computer processor 104 which generates the audio signature of the master or original signal and transmits it on IT network 110 to the location 112 of the destination transmitter.
  • the audio source signal is transmitted on the private circuits of audio network 108 to the same destination 112.
  • At the destination transmitter location 112 are a correlator 114 and a processor or client computer 116.
  • the processor 116 generates an audio signature of the signal received on the audio network and passes this to the correlator 114.
  • the correlator 114 receives the generated audio signature from the IT network 110, as well as the signature generated by the processor 116 and performs the comparison method illustrated in Figure 1. The results are output to output terminal 118, and may be routed back to the programme originator or used at the destination transmitter.
  • the destination processor 116 receives the audio source signal over the audio network 108 and generates an audio signature representing the received signal. This signature is then transmitted back to the programme originator's location 106 for comparison on correlator 120.
  • the correlator 120 also receives the representation of the audio signal generated by processor 104. The correlator 120 compares the audio representation received from the transmitter location 112 with the audio signature it received from the processor 104, and the results are output to output 122 at the programme originator's location.
  • the signal being transmitted at a regional broadcasting centre will in most cases differ from the original signal transmitted from the programme originator. This is because any local content to be added to the received signal will be added at the regional centre. As a result, it is preferable that during broadcasting of local content the preferred system is deactivated. Otherwise, the system will report repeated occurrence of signals that do not match.
  • Figure 12 shows an arrangement in which the comparison is done locally within a single computer. Providing the computer has means to capture two audio signals, such as two audio capture cards connected to audio inputs, the single computer can prepare both of the audio signatures and perform the comparison.
  • the comparison may be performed by a third party computer, which is adapted to receive inputs from first and second computers which perform the audio capture.
  • 'capture' should also be understood to mean receive, as the signal could be captured by any means known in the art and then transmitted to the computer for comparison. In this case, the computer merely needs a receiver in order to receive the already captured signal.
  • the preferred system provides an audio content monitoring system that is able to work in real time on continuous signals, and that is able to match audio content in the presence of delays even when the system has no previous knowledge of which signal will arrive first.
  • the system can react quickly and reliably to indicate any incorrect audio content, and can regain the lock between the two signals once the incorrect audio content has been corrected.
  • the system is able to resist impairments to the audio signal, such as noise and coding/decoding artefacts, while requiring no external or internal synchronisation to operate.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Synchronisation In Digital Transmission Systems (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)

Abstract

The invention provides a method and apparatus for determining the relative time difference or delay between first and second audio signals that represent substantially the same audio content. The invention also provides a method and an apparatus for determining whether two audio signals contain the same audio content.
A comparison of the two audio signals is carried out using a low-bit representation of each signal that is generated using the dominant frequency within successive portions or frames of the signal. This audio representation can also be used as a means of comparing two video programmes of which the audio signal is a part.
As the analysis is based on the frequencies present within the audio signal, it can, for example, be performed more quickly than an analysis based on the energy of the audio signal. As a result, the determination can be made in real time, allowing it to be advantageously used in the field of broadcasting to confirm that a signal being transmitted by a regional broadcasting centre is in accordance with the master signal being sent to it by a programme originator.
The invention may also be used to synchronise two like signals with each other.

Description

  • This application relates to an improved method and apparatus for comparing signals. The method and apparatus relate in particular to the field of broadcasting and systems for monitoring the content of the broadcast signal.
  • In the broadcasting industry, it is common for a central programme originator to relay programmes over a private network for broadcasting at regional centres. The regional centres can then add local programmes to the received programme and transmit this for reception in their catchment area. This system means that in most cases a strong signal can be provided to a viewer, as the signal is transmitted from a local receiver, rather than a more distant central receiver. It also allows the signal to be adapted to the region in which it is received, for example, the insertion of local news programmes after national news programmes.
  • Similarly, the private network circuits can be used to relay signals in the opposite direction allowing regional centres to contribute content to other catchment areas, such as to the region of the central programme originator, or other regional areas.
  • The private circuits relaying the programmes for broadcast can fail however, either partially or totally, preventing broadcast signals from being transmitted altogether, or misrouting signals such that a different signal is received at a location than the signal that was intended. We have appreciated that such failures need to be detected, and need to be detected in real time.
  • Detection of such failures can be achieved by comparison of the signal being transmitted by the central programme originator with that being transmitted at the regional centre. However, the comparison is complicated by a number of problems inherent in the broadcasting system, such as timing delay.
  • Typically, the broadcast signal being transmitted at the regional centre will lag behind that transmitted by the programme originator by up to a few seconds. This is partly due to the inherent delay resulting from the transmission of the broadcast signal to the regional centre over the private circuits or broadcast chain, known as the "signature chain delay". MPEG coding/decoding processing time for digital signals can also have a delaying effect. For example, analogue signals are typically delayed by around 100ms, and digital signals by 1 to 2 seconds.
  • In order to perform any comparison, therefore, the two signals must be synchronised. As there is no inherent timing structure within audio data, such as radio broadcasts, synchronising audio signals can be difficult.
  • Another problem arises from limited network capacity. If comparison of the two signals is to take place, either the signals themselves, or information about the signals must eventually be routed to the same location to be compared. This can be expensive in terms of network capacity.
  • United States Patent number 4,230,990 describes a system and method for identifying broadcast programs, wherein a pattern recognition process is combined with a signalling event which acts as a trigger or cue signal. A segment of each programme at a predetermined location with respect to one of these cue signals is sampled and processed to form the programme's reference signature which is stored in computer memory. In the field, the monitoring equipment detects cue signals broadcast by a monitored station and, upon detection samples the broadcast program signal at the same predetermined location to create a broadcast signature of unknown programme identity. By comparing broadcast signatures to reference signatures, a computer identifies the broadcast of programmes whose reference signatures have been stored in memory.
  • SUMMARY OF THE INVENTION
  • The invention is defined in the appendant claims to which reference should now be made. Advantageous features are set forth in the dependent claims.
  • The invention provides a method and apparatus for determining the relative time difference or delay between first and second audio signals that represent substantially the same audio content. The invention also provides a method and an apparatus for determining whether two audio signals contain the same audio content.
  • A comparison of the two audio signals is carried out using a low-bit representation of each signal that is generated using the dominant frequency within successive portions or frames of the signal. This audio representation can also be used as a means of comparing two video programmes of which the audio signal is a part.
  • As the analysis is based on the frequencies present within the audio signal, it can for example, be performed more quickly than an analysis based on the energy of the audio signal. As a result, the determination can be made in real time, allowing it to be advantageously used in the field of broadcasting to confirm that a signal being transmitted by a regional broadcasting centre is in accordance with the master signal being sent to it by a programme originator.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The invention will now be described in detail, by way of example, and with reference to the drawings in which:
  • Figure 1 is a flow chart illustrating the operation of the preferred system for performing comparison of audio signals in accordance with the preferred embodiment of the invention;
  • Figure 2 illustrates the method of dividing an audio signal into frames for analysis in accordance with the method illustrated in Figure 1;
  • Figure 3 illustrates the step of dividing the audio signal into overlapping frames;
  • Figure 4 illustrates an example representation of a first audio signal;
  • Figure 5 illustrates an example representation of a second audio signal, the representation being shorter than the first;
  • Figure 6 illustrates a schematic representation of correlation results of the first and second audio signal;
  • Figure 7 illustrates actual correlation results produced by experiment for two signals containing the same audio content;
  • Figure 8 illustrates actual correlation results produced by experiment for two unrelated signals;
  • Figure 9 schematically illustrates the generation of a successive representation of the first audio signal;
  • Figure 10 illustrates apparatus according to a first preferred embodiment of the invention;
  • Figure 11 illustrates apparatus according to a second preferred embodiment of the invention;
  • Figure 12 illustrates an embodiment employing a single computer terminal;
  • Figure 13 illustrates an embodiment employing a remote computer terminal.
  • DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
  • The preferred system provides a method and apparatus for comparing two audio signals and determining whether they are substantially the same, that is whether they contain the same audio content. The system has particular application to the broadcast industry and allows a programme originator to verify that regional centres are transmitting the correct programmes. If they are not, then the private circuits used by the programme originator to transmit broadcast information to the regional centres may be at fault. Thus, any problems with the circuits, such as complete failure, or mis-routing may be identified and addressed.
  • Although the method compares audio signals, it will be understood that the technique provided by the preferred system is not limited to comparison of radio broadcast programmes, but can also be used to compare the audio parts of video signals.
  • Furthermore, the preferred system actually compares a low-bit rate representation of the original audio signal. Transmission of the representation to another location on the network for comparison is less costly in terms of network capacity than if the original audio signal was transmitted. The saving in network capacity is particularly germane if the representation is of the audio corresponding to a video broadcast, as it allows the comparison of two video broadcasts to be achieved while only ever transmitting a considerably pared down version of the original signal.
  • The operation of the preferred system will now be described in detail with reference to Figures 1 to 13 of the drawings.
  • Figure 1 is a flow chart illustrating the steps performed by the preferred system. The system compares two audio signals: a first audio signal, such as the local signal transmitted by a regional broadcasting centre, and a second audio signal, such as a national signal being transmitted to the regional broadcasting centre from a national broadcasting centre or programme originator.
  • In practice, these steps may be embodied by software running on a computer, or by equivalent hardware such as dedicated circuits. Operation begins therefore in step S10 which represents the initialisation of the software or circuits respectively.
  • In step S20, the raw audio data of the first audio signal is captured, that is the signal that is to be compared to the master signal. Preferably, this is achieved by sampling the original audio signal. As the capture will be occurring in real time, the system feeds the captured audio data directly into a buffer for storage. The buffer need only be large enough to store a few seconds of audio data, as the timing delay between the signals being compared is typically less than a few seconds. If larger synchronisation delays are expected then the buffer will need to be larger to accommodate sufficient data. In practice, the preferred system has been found to tolerate a delay of about 2s in the two signals being compared. This is sufficient to handle most digital signals. The maximum delay that the system can accommodate however is limited only by the size of the buffers in which the audio signal and signatures are stored and the processing power of the computer.
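The buffer sizing described above can be sketched numerically (an illustrative calculation, not part of the patent; the 11.025 kHz sampling rate is taken from the parameter values given later in the description):

```python
# Size a capture buffer to hold enough audio to cover the worst-case
# delay between the two signals being compared.
SAMPLE_RATE = 11025        # Hz, the sampling rate used by the preferred system
MAX_DELAY_S = 2.0          # the ~2 s delay the system has been found to tolerate

# Number of raw samples the buffer must be able to hold.
buffer_size = int(SAMPLE_RATE * MAX_DELAY_S)
```

A larger expected delay simply scales the buffer linearly, which is why the maximum tolerable delay is bounded only by buffer size and processing power.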
  • Once a sufficient amount of the audio signal has been captured, control flows to step S30 where processing occurs to generate a signature or representation of the captured audio data. The buffer need not be full before the processing starts, as the processing and the capture of audio data can occur simultaneously. Also, although a predetermined amount of data is preferably stored before processing begins, the processing could begin almost immediately after the first data points are stored in the buffer. In the preferred system, however, the processing does not begin until the buffer is full.
  • The processing of the audio data to generate the signature or representation will now be explained in more detail with reference to Figures 2 to 4.
  • Figure 2 shows a block representing the audio data of the first audio signal captured in step S20. For purposes of illustration, the audio data is taken to comprise 17408 samples. The x axis of the data block will be understood to represent time, and the y axis, although no variation in the data is actually shown, will be understood to represent the amplitude of the audio signal at a point in time.
  • Firstly, the preferred system breaks the audio data down into smaller frames of data for analysis. It can be seen that the block of data containing 17408 samples shown in Figure 2 can be subdivided into 17 frames of 1024 samples each.
  • The preferred system processes the data block by extracting the dominant frequency from each frame using a Fast Fourier Transform (FFT). This is a conventional technique known in the art and so will not be discussed further here. Other known techniques could also be used. Thus, each frame containing 1024 samples of audio data is converted into a single data point representing the dominant frequency of that frame. The collection of the data points from all of the frames is used as a signature or representation of the original audio data.
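The per-frame dominant-frequency extraction can be sketched as follows (an illustrative implementation, not taken from the patent: a naive DFT stands in for the FFT, and the frame is shortened to 64 samples to keep the example small):

```python
import cmath
import math

def dominant_frequency(frame, sample_rate):
    """Estimate the dominant frequency of one frame with a naive DFT.

    A real implementation would use an FFT routine; the result is the
    same: the positive-frequency bin with the largest magnitude.
    """
    n = len(frame)
    best_bin, best_mag = 1, 0.0
    for k in range(1, n // 2):               # positive bins only, skip DC
        s = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
        if abs(s) > best_mag:
            best_bin, best_mag = k, abs(s)
    return best_bin * sample_rate / n        # bin index -> frequency in Hz

# A pure tone lying exactly in bin 5 of a 64-sample frame:
rate, n = 11025, 64
frame = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
freq = dominant_frequency(frame, rate)       # ~861.3 Hz (5 * 11025 / 64)
```

Each 1024-sample frame thus collapses to a single number, which is what makes the signature so much smaller than the raw audio.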
  • To some extent, the shape of the signature or representation generated in this way will depend on the number of frames used and the starting position of the frames within the audio signal. This can cause later problems for recognition of an audio signature particularly in cases where the audio content of the signal varies quickly, such as for rock music for example.
  • The effect of starting the frames at different positions in the audio signal is exacerbated by the delay in time between the two signals being compared. Thus, in the preferred system, the audio data is broken down into a larger number of overlapping frames, as this provides greater tolerance against synchronisation problems and quickly varying audio content.
  • Figure 3 shows an illustration of the audio data broken down into overlapping frames. Four overlapping frames each containing 1024 samples are shown, displaced from each other by 256 samples. As a result, the audio data of 17408 samples is in fact broken down into 65 frames, and is therefore represented by 65 dominant frequency data points.
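The overlapping-frame division of Figures 2 and 3 can be sketched as follows (an illustrative helper, not from the patent; the hop of 256 samples corresponds to the overlap coefficient of 4 described here):

```python
def split_into_frames(samples, frame_size=1024, hop=256):
    """Divide a block of audio samples into overlapping frames.

    hop = frame_size / overlap_coefficient (256 = 1024 / 4 here); frames
    that would run past the end of the block are not produced.
    """
    return [samples[s:s + frame_size]
            for s in range(0, len(samples) - frame_size + 1, hop)]

# A 17408-sample block yields 65 overlapping 1024-sample frames,
# matching the count given for Figure 3.
frames = split_into_frames([0.0] * 17408)
```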
  • An example signature is illustrated in Figure 4 for the first audio signal, that is the one which is being compared to the master signal. The signature contains M signature points. In practice, it is preferred if M is of the order of 2000. It should be remembered that in this diagram, the x axis represents time, measured in frame number, and the y axis represents the dominant frequency.
  • The data rate of the audio signature shown in Figure 4 can be calculated from the following equation:

    signature_data_rate = (original_data_rate / fftFrameSize) × overlap_coefficient − (overlap_coefficient − 1)
  • The first term on the right hand side of this equation can be understood as the number of data points or samples of the original audio signal, that is the block size, divided by the frame size, which is given by the number of samples per frame, multiplied by the overlap co-efficient, or the number of frames per frame width. In the above example shown in Figure 3, the overlap co-efficient is 4, as for every frame width of size 1024 samples, there are four frames, beginning at sample 1, 257, 513 and 769 respectively. The next frame (the fifth) will then start at sample 1025.
  • The last term in this equation represents the fact that the last frame in the audio data block cannot be sub-divided into further frames, because those further frames will lie at least partially beyond the end of the data block. The number of such frames produced by sub-dividing the last frame in the data block is given by the overlap co-efficient minus one. In the above example, this value is therefore 3.
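The equation can be checked numerically (a sketch, assuming as in the example above that the block size is an exact multiple of the frame size):

```python
def signature_points(block_size, fft_frame_size, overlap_coefficient):
    """Number of signature points produced per audio block:

    (block_size / fft_frame_size) * overlap_coefficient
        - (overlap_coefficient - 1)
    """
    return ((block_size // fft_frame_size) * overlap_coefficient
            - (overlap_coefficient - 1))

# 17408 samples, 1024-sample frames, overlap 4 -> 65 points (Figure 3).
n4 = signature_points(17408, 1024, 4)
# The preferred overlap of 8 gives 129 points for the same block.
n8 = signature_points(17408, 1024, 8)
```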
  • The number of samples in the audio data block (given by the sampling rate and the data block duration), the FFT frame size and the overlap ratio are all variable, and can be chosen according to the application and the computer processing power available. If the sampling rate or the overlap ratio is chosen to be too high, the computer processor may not be able to perform the analysis quickly enough, and the system will cease to work in real time. It has been found that for a 1 GHz computer processor, in this case an Intel Pentium III, a sampling rate of greater than 22 KHz and an overlap ratio of 8 or greater are too high for the computer to function reliably with a frame size of 1024 samples. Increasing the size of the FFT frame, to 8196 samples say, allows the computer to function again with this sampling rate and overlap ratio. However, increasing the frame size is not desirable, as it can lead to problems synchronising the two audio signals, particularly for certain types of fast music. This is because the dominant frequency for a short block of signal may not be the same as the dominant frequency for a longer block of signal.
  • The preferred system employs an overlap ratio of 8, a sampling rate of 11.025 kHz, an FFT frame size of 1024 samples and an audio data block size of 17408 samples. In practice, however, the overlap coefficient may be varied from 4 to 32, with the sampling frequency being varied accordingly between 8 kHz and 44 kHz. For processors faster than 1 GHz a higher overlap coefficient or sampling frequency may be used.
  • The result of the sampling step S30 is to produce an audio signature of the type illustrated in Figure 4.
  • Reference shall now be made again to Figure 1. Following the generation of the audio signature in step S30, the signature is stored in a buffer in step S40, ready for comparison. If the comparison method is being employed at a regional broadcasting centre to detect reliable transmission of a signal received from the programme originator, the first audio signal stored in the buffer in step S40 will be the local broadcast signal being transmitted; the second audio signal that is to be compared with the first will be the signal transmitted to the regional centre from the national broadcasting centre.
  • In any case, in order to perform the comparison, an audio signature of the second audio signal must also be generated. The signature or representation of the second signal is generated in the same way as described above for first audio signal except that the representation is deliberately made for a smaller section of the second signal than the first. The first signal representation is therefore longer, in that it contains more samples than the second representation, and in that it therefore represents a longer period in the time domain. The second signature is illustrated in Figure 5 and contains N samples. In practice, a typical figure for N might be 340.
  • In the case of a broadcast network, the audio signature representing the second audio signal is generated at the national broadcasting centre. This signature is then transmitted to the server computer of the regional television centre via an IT network. For this reason a processing step for the second signal is not shown in Figure 1. Instead, the second signature is received in step S50.
  • A separate IT network is preferred as techniques can be employed to detect faults such as those caused by mutual mis-routing. Such faults can be differentiated from faults at source. It is also possible that the audio signature could be transmitted to the regional broadcasting centre with the regional audio signal. However if there is a fault in the private circuits routing the original audio signal, the audio signature will not be received at the regional broadcasting centre and no comparison can then be made.
  • Once the audio signature of the source signal has been obtained in step S50, comparison of the two audio signatures is performed, S60, and a judgment is made as to whether the two signatures represent the same audio content. Of course, because of the difference in size of the two signatures, the audio content will not be exactly the same; instead the two signals are analysed to determine if the second signature or representation is contained within the first.
  • This is achieved by using a standard correlation or cross-correlation technique, applied in a particular way. Figure 6 schematically illustrates the process. The top row of the figure shows the signature (n) of N signature points representing the second audio signal. The middle row represents the longer sequence of M (>N) signature points of the audio signature (m) for the local audio signal, and the bottom row shows the cross-correlation results of the signatures against each other for different relative time displacements.
  • The top two rows of the figure, showing the signatures for the two signal sections, are plotted against time on the x axis, as each point of the plot represents the dominant frequency in a frame of the audio signal. The bottom row of the figure, however, showing the correlation results, is plotted against the relative displacement of the two signatures (n) and (m), labelled D on the x axis. The first point, at position D=0, is the result of the correlation when the beginning of signature n is aligned with the beginning of signature m. The next point is obtained when signature n is shifted to the right by one signature point relative to the beginning of signature m and the correlation is performed again. The last point plotted on the axis D is given by the correlation result when the last signature point of signature n is aligned with the last signature point of signature m.
  • The cross-correlation is calculated as though the sequences of N and M signature points were both continuous wave forms expressed as a series of regular digitized samples. The cross-correlation result is also shown as if it were a continuous wave form, although it will be appreciated that it actually consists of M-N+1 discrete values.
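The sliding correlation over the M-N+1 relative displacements can be sketched as below. This is a minimal illustration using plain lists; the function name and the use of a raw sum of products (rather than a normalised correlation) are assumptions for clarity, not details taken from the preferred system.

```python
def cross_correlate(sig_m, sig_n):
    """Correlate the short signature sig_n (length N) against the long
    signature sig_m (length M), returning one value per relative
    displacement D, i.e. M - N + 1 values in total."""
    M, N = len(sig_m), len(sig_n)
    results = []
    for d in range(M - N + 1):
        # Sum of products of the short signature with the region of the
        # long signature that it overlaps at displacement d.
        window = sig_m[d:d + N]
        results.append(sum(a * b for a, b in zip(window, sig_n)))
    return results
```

When sig_n is contained within sig_m, the largest value occurs at the displacement where the two identical regions align.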
  • If a standard cross-correlation technique is employed, then the height of the cross-correlation wave form at a particular point is a function of the integral of the product of the second signature and the part of the first signature which it overlaps when the second signature is aligned with that particular point. Assuming that the second signature is contained within the first, the position of the peak of the cross-correlation wave form should occur when the second shorter signature (n) is aligned with an identical region in the longer first signature (m). This should occur at only one point in the first signature.
  • To find a match between the first and second signatures therefore, the preferred system detects the maximum value of the correlation wave form. The number of peaks contained within the correlation wave form gives an indication of how likely it is that the maximum peak represents the point at which the first and second signatures match. So, in order to determine a measure of the reliability of the match, the number of points where the value exceeds two-thirds of the maximum value is calculated. Providing the ratio between this number and the total number of points is lower than a certain predetermined value, it can be assumed that there is one clear and strong peak in the correlation results. The software will then deem that there is a match between the two signatures. This is illustrated in Figure 6, which shows two peaks of a height exceeding the threshold of two-thirds of the maximum height. In this case a match is indicated to occur at the peak on the right of the plot.
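The two-thirds-of-maximum reliability test described above can be sketched as follows. The ratio threshold used here is an illustrative assumption; the description leaves the predetermined value to be chosen in practice.

```python
def is_clear_match(correlation, ratio_threshold=0.01):
    """Count the points exceeding two-thirds of the maximum correlation
    value; deem a match only if the ratio of that count to the total
    number of points is below a predetermined threshold.
    ratio_threshold=0.01 is an assumed illustrative value."""
    peak = max(correlation)
    above = sum(1 for v in correlation if v > (2.0 / 3.0) * peak)
    return above / len(correlation) < ratio_threshold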
  • The key element of the technique employed in the preferred system is the difference in length between the two audio signatures that are cross-correlated. Were they to be the same length or close in length, the number of points (M-N+1) in the cross-correlation would be one or close to one. This would make the correlation process unreliable and intolerant of any relative delay between the two signals.
  • Although preferred values of M and N have been given, it will be appreciated that they could take any values providing that the number of points in the correlation (M-N+1) is sufficiently large to provide a reliable result. In practice, a minimum value for (M-N+1) of about 200 points has been found acceptable, although values of 1000 to 2000 for (M-N+1) are preferred.
  • Figure 7 shows the results of a correlation performed between two audio signatures in actual experiments. The results of four trials are plotted on the same axes; the first three trials lead to peaks centred at positions D=70, 71, and 72. The fourth trial results in a peak at position D=140. The difference in the positions at which the peaks occur is caused by differences in timing between the two signals being compared, resulting for example from transmission delays.
  • Figure 8 on the other hand shows a correlation wave form produced by two audio signatures which represent different audio signals. The graph is plotted on approximately the same horizontal scale, and shows the region of the correlation plot from point 600 to 800. This range has been chosen for illustration purposes.
  • As can be seen from the drawing, the wave form contains a large number of unrelated and sporadic peaks, none of which could reliably be taken to indicate the presence of a match. This diagram illustrates the need to count the number of peaks for a given sample size in order to determine the quality of the plot.
  • The results of the correlation are output in step S70, indicating whether a match was found or not. If a match was found, the relative delay between the two signals may also be output. This could be used in a synchronising method, for example, where one of the signals is a master or timing signal and the other is the signal to be synchronised. Using the relative delay calculated between the two signals, synchronisation could be achieved by calculating the phase shift or time shift required to align the two signals in time and applying this to the signal that is to be synchronised.
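The relative delay output described above follows directly from the position of the correlation peak. In the sketch below, point_interval_s, the time step between successive signature points, is an assumed illustrative value; the description does not fix it here.

```python
def relative_delay(correlation, point_interval_s=0.02):
    """Convert the position D of the correlation peak into a relative
    delay in seconds. point_interval_s is an assumed value standing in
    for the time between successive signature points."""
    d = max(range(len(correlation)), key=correlation.__getitem__)
    return d * point_interval_s
```

The returned delay could then be applied as a time shift to the signal that is to be synchronised.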
  • It will be appreciated that the audio signatures generated for the audio signals represent only a small fraction of the audio stream in a broadcast signal. Thus in order to monitor the two signals and provide a continuous comparison, it is necessary to repeat the procedure illustrated in steps S10 to S70 continuously or at least periodically in real time. Thus, once a match has been found, those signature points of the first audio signature which occur before the position at which the maximum peak of the correlation wave form was found are discarded. The remaining points of the first audio signature are then moved forward in the storage buffer, making room for new audio signature points to be added. The new audio signature points may be appended to the existing points in the buffer when sufficient points have been calculated to fill the remaining space. Alternatively, the points may be added to the buffer as they are calculated.
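The buffer update described above, where points before the matched position are discarded and new points appended, can be sketched as follows. A plain list stands in for the storage buffer; the function name is illustrative.

```python
def advance_buffer(signature, peak_position, new_points):
    """Discard the signature points occurring before the position at
    which the maximum correlation peak was found, then append newly
    calculated points to fill the remaining space."""
    return signature[peak_position:] + new_points
```

The resulting list represents a new section of the audio signal starting approximately at the point where the previous signatures were matched.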
  • Thus, a new audio signature is formed representing a new section of audio signal starting approximately at the point corresponding to that at which the previous signatures were matched. This is shown in Figure 9, which illustrates the portion of the first audio signature that is discarded and the newly appended portion which is added to the buffer to form a subsequent signature or representation of the audio signal moved on in time.
  • The previous signature for the second signal, that is the signal from the programme originator, is also discarded, and a new source signature representing the next portion or region of the second audio signal is received for comparison. This new portion may be that part of the signal following on directly in time from the previous signal section. The comparison is then performed again based on the next sections of the two signals, and a match can be expected to occur at a position in the first audio signature given by one length of the second signature.
  • Each time a match is found not to occur, a flag is preferably set within the monitoring computer or monitoring device. When the number of flags exceeds a predetermined threshold value an alarm may be raised to indicate that there is no correlation between the first and second audio signals. The threshold at which this occurs can be determined in practice based on the difficulty in matching the signals and the amount of time that is acceptable before a warning is given. The use of a threshold allows some tolerance of non-matches which occasionally occur even for signals that are identical.
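The flag-and-alarm logic above can be sketched as a small monitor class. Note that the description only says a flag is set on each non-match; clearing the flags on a successful match, as done here, is one plausible reading and is an assumption, as is the illustrative threshold value.

```python
class MatchMonitor:
    """Raise an alarm once the number of non-match flags exceeds a
    predetermined threshold, tolerating occasional non-matches."""

    def __init__(self, threshold=5):
        self.threshold = threshold  # illustrative value, chosen in practice
        self.flags = 0
        self.alarm = False

    def report(self, matched):
        if matched:
            # Resetting on a match is an assumption; the text does not
            # specify when, or whether, flags are cleared.
            self.flags = 0
        else:
            self.flags += 1
            if self.flags > self.threshold:
                self.alarm = True
```

Occasional isolated non-matches, which can occur even for identical signals, then never accumulate enough flags to trigger the alarm.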
  • It will be appreciated that matching two signals is performed more easily for simple signals. The system described above therefore works well when employed to match audio signals representing speech, such as the audio stream from news programmes. However, if the audio signals being matched represent a quickly varying rhythmic signal, such as rock music, determining that there is a match between the signals can take longer. To ensure accurate correlation in the case of such signals, it is preferable to increase the sampling or overlap coefficient.
  • Figure 10 shows a schematic illustration of the preferred system in an implementation for monitoring broadcast signals at a programme originator and at a destination transmitter. The input audio source of the programme originator is first received at an input terminal 102, connected to a server or computer processor 104 at the programme originator's location 106, and also to an audio network 108. The signal is passed to the computer processor 104 which generates the audio signature of the master or original signal and transmits it on IT network 110 to the location 112 of the destination transmitter. At the same time, the audio source signal is transmitted on the private circuits of audio network 108 to the same destination 112.
  • At the destination transmitter location 112 are a correlator 114 and a processor or client computer 116. The processor 116 generates an audio signature of the signal received on the audio network and passes this to the correlator 114. The correlator 114 receives the generated audio signature from the IT network 110, as well as the signature generated by the processor 116, and performs the comparison method illustrated in Figure 1. The results are output to output terminal 118, and may be routed back to the programme originator or used at the destination transmitter.
  • Figure 11 shows an alternative embodiment in which the comparison is performed at the location of the programme originator 106. The audio source signal is transmitted from the input 102 to the computer processor 104 and to the audio network 108, and the computer processor 104 generates an audio signature representing the original audio source.
  • The destination processor 116 receives the audio source signal over the audio network 108 and generates an audio signature representing the received signal. This signature is then transmitted back to the programme originator's location 106 for comparison on correlator 120. The correlator 120 also receives the representation of the audio signal generated by processor 104. The correlator 120 compares the audio representation received from the transmitter location 112 with the audio signature it received from the processor 104, and the results are output to output 122 at the programme originator's location.
  • The signal being transmitted at a regional broadcasting centre will in most cases differ from the original signal transmitted from the programme originator. This is because any local content to be added to the received signal will be added at the regional centre. As a result, it is preferable that during broadcasting of local content the preferred system is deactivated. Otherwise, the system will report repeated occurrence of signals that do not match.
  • Other embodiments of systems for monitoring audio content are also possible. Figure 12 for example shows an arrangement in which the comparison is done locally within a single computer. Providing the computer has means to capture two audio signals, such as two audio capture cards connected to audio inputs, the single computer can prepare both of the audio signatures and perform the comparison.
  • Alternatively, as shown in Figure 13, the comparison may be performed by a third party computer, which is adapted to receive inputs from first and second computers which perform the audio capture.
  • The term 'capture' should also be understood to mean receive, as the signal could be captured by any means known in the art and then transmitted to the computer for comparison. In this case, the computer merely needs a receiver in order to receive the already captured signal.
  • The preferred system therefore provides an effective way of comparing two signals and determining whether they are the same. It will also be appreciated that the preferred system provides an effective method and apparatus for determining the relative time delay between two like signals. Once a match between the two signals has been determined, the relative delay in timing between the two signals can be calculated. The comparison process is repeated, ensuring that synchronization once obtained is tracked and maintained.
  • The preferred system provides an audio content monitoring system that is able to work in real time on continuous signals, and that is able to match audio content in the presence of delays even when the system has no previous knowledge of which signal will arrive first. The system can react quickly and reliably to indicate any incorrect audio content, and can regain the lock between the two signals once the incorrect audio content has been corrected. The system is able to resist impairments to the audio signal, such as noise and coding/decoding artefacts, while requiring no external or internal synchronisation to operate.

Claims (58)

  1. A system for determining the relative time difference between first and second audio signals that represent substantially the same audio content, the system comprising:
    capturing means for obtaining first and second audio signals;
    processing means (104, 116) for generating a representation of sections of each of the first and second signals, the representation being based on the frequencies present within each section; the signal section and representation of the first audio signal being longer than the signal section and representation of the second audio signal; and
    a correlator (114, 120) for correlating the representation of the first audio signal section with the representation of the second signal section at different relative timing differences, and for providing an output;
       wherein the output indicates a match when the first and second audio signal sections contain substantially the same audio content, and if there is a match, indicates the timing difference between the points at which that audio content occurs in both the first and second audio signal sections.
  2. A system for detecting whether first and second audio signals represent substantially the same audio content, the system comprising:
    capturing means for obtaining first and second audio signals;
    processing means (104, 116) for generating a representation of sections of each of the first and second signals, the representation being based on the frequencies present within each section; the signal section and representation of the first audio signal being longer than the signal section and representation of the second audio signal; and
    a correlator (114, 120) for correlating the representation of the first audio signal section with the representation of the second signal section at different relative timing differences, and for providing an output;
       wherein the output indicates a match when the first and second audio signal sections contain substantially the same audio content.
  3. A system according to claims 1 or 2, wherein the processor is operable to divide the first and second signal sections into a number of constituent signal frames, and to generate the representations of the first and second signal sections using the dominant frequency of each frame.
  4. A system according to claim 3 wherein the processor is operable, in dividing the first and second signal sections into constituent frames, to cause the frames within a signal section to overlap with one or more adjacent frames in the signal section.
  5. A system according to any preceding claim wherein if the correlator indicates that there is a match, the processor is operable to generate a representation of subsequent first and second audio signal sections from the first and second audio signals.
  6. A system according to claim 5 wherein the processor is operable such that the subsequent section from the first audio signal begins substantially at the point where the audio content present in both the first and second signal sections was found to occur.
  7. A system according to claims 5 or 6 wherein the processor is operable such that the subsequent section from the second audio signal substantially begins at the end of the section already obtained.
  8. A system according to any preceding claim wherein the processor is operable to employ a Fourier Transform in generating the representation of the first and second signal sections.
  9. A system for monitoring broadcast signals at a first and second location, wherein broadcast signals are transmitted, over a network, from the second location to the first location for transmission at the first location; the system comprising:
    means for capturing a first audio signal from the broadcast signal transmitted at the first location;
    means for capturing a second audio signal from the broadcast signal transmitted at the second location, the second audio signal being the broadcast signal transmitted to the first location over the network (108);
    processing means (104, 116) for generating a representation of a signal section of each of the first and second audio signals based on the frequencies present within each section; the signal section and representation of the first audio signal being longer than the signal section and representation of the second audio signal;
    a correlator (114, 120) for correlating the representation of the first audio signal section with the representation of the second signal section at different relative timing differences, and for providing an output; wherein the output indicates a match when the first and second audio signal sections contain substantially the same audio content.
  10. A system according to claim 9 wherein the processing means includes a first processor (116) at the first location for generating the representation of the first audio signal section, and a second processor (104) at the second location for generating the representation of the second audio signal section, and wherein the system comprises a second network (110) connecting the first and second locations.
  11. A system according to claim 10 wherein the correlator comprises a correlator (114) located at the first location, and wherein the second processor (104) is operable to transmit the representation of the second signal section to the correlator at the first location via the second network (110).
  12. A system according to claims 10 or 11 wherein the correlator comprises a correlator (120) located at the second location, and wherein the first processor (116) is operable to transmit the representation of the first signal section to the correlator (120) at the second location via the second network (110).
  13. A system according to claim 9 wherein the processing means comprise a single processor for receiving both the first and second audio signals and for generating the representations of the first and second audio signal.
  14. A system according to claims 9 to 13 wherein if the correlator indicates that there is a match, the processing means is operable to generate subsequent representations of signal sections of the first and second audio signals.
  15. A system according to claim 14 wherein the processing means is operable such that the signal section used to generate the subsequent representation of the second signal begins at the end of the previous signal section.
  16. A system according to claims 14 or 15 wherein the processing means is operable such that the subsequent signal section of the first audio signal substantially begins at the point in the first audio signal where the match was found to occur.
  17. A system according to claims 9 to 16 wherein the processing means is operable to employ a Fourier Transform in generating the representation of the signal sections.
  18. A system according to any of claims 9 to 17 wherein the processing means is operable to divide the signal section into a number of constituent signal frames, and to generate the representations of the signal section using the dominant frequency of each frame.
  19. A system according to claim 18 wherein the processing means is operable, in dividing the signal section into constituent frames, to cause the frames within the signal section to overlap with one or more adjacent frames in the signal section.
  20. A system according to any preceding claim wherein the correlator is operable to indicate that the signal sections do not contain the same audio content only after a predetermined number of signal sections have been correlated.
  21. Apparatus for monitoring broadcast signals at a first location, wherein broadcast signals for transmission at the first location are received, over a network, from a second location, and wherein a signal representation of a section of the broadcast signal based on the frequencies contained within the broadcast signal is received from the second location;
       the apparatus comprising:
    means for capturing a first audio signal from the broadcast signal transmitted at the first location;
    processing means (104, 116) comprising a processor for receiving the first audio signal and for generating a first representation of a signal section of the first audio signal based on the frequencies present within the section; the first representation being longer than the signal representation received from the second location;
    a correlator (114, 120) for correlating the first representation with that received from the second location at different relative timing differences, and for providing an output;
       wherein the output indicates a match when the signal sections of the first audio signal and of the broadcast signal received from the second location contain substantially the same audio content.
  22. Apparatus according to claim 21 wherein if the correlator indicates that there is a match, the processing means is operable to generate a representation of a subsequent signal section from the first audio signal, and receive a signal representation of a subsequent signal section from the second location.
  23. Apparatus according to claim 22 wherein the processing means is operable to receive from the second location a subsequent representation of the signal section beginning at the end of the previous signal section.
  24. Apparatus according to claims 22 or 23 wherein the processing means is operable such that the subsequent signal section of the first audio signal begins at the point in the first audio signal where the match was found to occur.
  25. Apparatus according to claims 21 to 24 wherein the processing means is operable to employ a Fourier Transform in generating the representation of the signal sections.
  26. Apparatus according to claims 21 to 25 wherein the processing means is operable to divide the signal section into a number of constituent signal frames, and to generate the representations of the signal section using the dominant frequency of each frame.
  27. Apparatus according to claim 26 wherein the processing means is operable, in dividing the signal section into constituent frames, to cause the frames within the signal section to overlap with one or more adjacent frames in the signal section.
  28. Apparatus according to claims 21 to 27 wherein the correlator is operable to indicate that the signal sections do not contain the same audio content only after a predetermined number of signal sections have been correlated.
  29. A method for determining the relative time difference between first and second audio signals that represent substantially the same audio content, the method comprising the steps of:
    capturing first and second audio signals;
    generating a representation of sections of each of the first and second signals, the representation being based on the frequencies present within each section; the signal section and representation of the first audio signal being longer than the signal section and representation of the second audio signal;
    correlating the representation of the first audio signal section with the representation of the second signal section at different relative timing differences; and
    providing an output, wherein the output indicates a match when the first and second audio signal sections contain substantially the same audio content, and if there is a match, indicates the timing difference between the points at which that audio content occurs in both the first and second audio signal sections.
  30. A method for detecting whether first and second audio signals represent substantially the same audio content, the method comprising the steps of:
    capturing first and second audio signals;
    generating a representation of sections of each of the first and second signals, the representation being based on the frequencies present within each section; the signal section and representation of the first audio signal being longer than the signal section and representation of the second audio signal;
    correlating the representation of the first audio signal section with the representation of the second signal section at different relative timing differences;
    and providing an output, wherein the output indicates a match when the first and second audio signal sections contain substantially the same audio content.
  31. A method according to claims 29 or 30, wherein the generating step comprises dividing the first and second signal sections into a number of constituent signal frames, and generating the representations of the first and second signal sections using the dominant frequency of each frame.
  32. A method according to claim 31 wherein the dividing step comprises dividing the first and second signal sections into constituent frames such that the frames within a signal section overlap with one or more adjacent frames in the signal section.
  33. A method according to claims 29 to 32 comprising the step of, if the output indicates that there is a match, generating representations of subsequent first and second audio signal sections from the first and second audio signals.
  34. A method according to claim 33 wherein the generating step includes generating a representation of a subsequent section of the first audio signal beginning substantially at the point where the audio content present in both the first and second signal sections was found to occur.
  35. A method according to claims 33 or 34 wherein the generating step includes generating a representation of a subsequent section of the second audio signal substantially beginning at the end of the section already obtained.
  36. A method according to claims 29 to 35 wherein the generating step includes employing a Fourier Transform.
  37. A method for monitoring broadcast signals at a first and second location, wherein broadcast signals are transmitted, over a network, from the second location to the first location for transmission at the first location; the method comprising the steps of:
    capturing a first audio signal from the broadcast signal transmitted at the first location;
    capturing a second audio signal from the broadcast signal transmitted at the second location, the second audio signal being the broadcast signal transmitted to the first location over the network;
    generating a representation of a signal section of each of the first and second audio signals based on the frequencies present within each section; the signal section and representation of the first audio signal being longer than the signal section and representation of the second audio signal;
    correlating the representation of the first audio signal section with the representation of the second signal section at different relative timing differences; and
    providing an output, wherein the output indicates a match when the first and second audio signal sections contain substantially the same audio content.
  38. A method according to claim 37 wherein the generating step comprises generating the representation of the first audio signal section at the first location, and generating the representation of the second audio signal section at the second location.
  39. A method according to claim 38 comprising transmitting the representation of the second signal section to the first location, and wherein the correlating step is performed at the first location.
  40. A method according to claims 38 or 39 comprising transmitting the representation of the first signal section to the second location and wherein the correlating step is performed at the second location.
  41. A method according to claim 37 wherein the step of generating representations of the first and second audio signal sections is performed at a single location.
  42. A method according to claims 37 to 41 comprising the step of, if the output indicates that there is a match, generating subsequent representations of signal sections of the first and second audio signals.
  43. A method according to claim 42 wherein the generating step includes generating a representation of a subsequent section of the first audio signal beginning substantially at the point where the audio content present in both the first and second signal sections was found to occur.
  44. A method according to claims 42 or 43 wherein the generating step includes generating a representation of a subsequent section of the second audio signal substantially beginning at the end of the section already obtained.
  45. A method according to claims 42 to 44 wherein the generating step includes employing a Fourier Transform.
  46. A method according to claims 37 to 45, wherein the generating step comprises dividing the first and second signal sections into a number of constituent signal frames, and generating the representations of the first and second signal sections using the dominant frequency of each frame.
  47. A method according to claim 46 wherein the dividing step comprises dividing the first and second signal sections into constituent frames such that the frames within a signal section overlap with one or more adjacent frames in the signal section.
  48. A method according to any of claims 37 to 47 comprising the step of indicating that the signal sections do not contain the same audio content only after a predetermined number of signal sections have been correlated.
  49. A method for monitoring broadcast signals at a first location, wherein broadcast signals for transmission at the first location are received, over a network, from a second location;
       the method comprising the steps of:
    capturing a first audio signal from the broadcast signal transmitted at the first location;
    receiving from the second location a signal representation of a section of the broadcast signal based on the frequencies contained within the broadcast signal;
    generating a first representation of a signal section of the first audio signal based on the frequencies present within the section; the first representation being longer than the signal representation received from the second location;
    correlating the first representation with that received from the second location at different relative timing differences; and
    providing an output, wherein the output indicates a match when the signal sections of the first audio signal and of the broadcast signal received from the second location contain substantially the same audio content.
  50. A method according to claim 49 comprising the steps of, if the output indicates that there is a match, generating a representation of a subsequent signal section from the first audio signal, and receiving a signal representation of a subsequent signal section from the second location.
  51. A method according to claim 50 wherein in the receiving step, a subsequent representation of the signal section beginning substantially at the end of the previous signal section is received.
  52. A method according to claim 50 or 51 wherein the generating step includes generating the representation of the subsequent signal section of the first audio signal beginning at the point in the first audio signal where the match was found to occur.
  53. A method according to any of claims 49 to 52 wherein in the generating step a Fourier Transform is employed in generating the representation of the signal sections.
  54. A method according to any of claims 51 to 53, wherein the generating step comprises dividing the first signal section into a number of constituent signal frames, and generating the representation of the first signal section using the dominant frequency of each frame.
  55. A method according to claim 54 wherein the dividing step comprises dividing the first signal section into constituent frames such that the frames within a signal section overlap with one or more adjacent frames in the signal section.
  56. A method according to any of claims 49 to 55 comprising the step of indicating that the signal sections do not contain the same audio content only after a predetermined number of signal sections have been correlated.
  57. A computer software product for controlling a computer to determine the relative time difference between first and second audio signals that represent substantially the same audio content, the computer software product comprising a computer readable medium having program code stored thereon which when executed on a computer causes the computer to perform the steps of:
    capturing first and second audio signals;
    generating a representation of sections of each of the first and second signals, the representation being based on the frequencies present within each section; the signal section and representation of the first audio signal being longer than the signal section and representation of the second audio signal;
    correlating the representation of the first audio signal section with the representation of the second signal section at different relative timing differences; and
    providing an output, wherein the output indicates a match when the first and second audio signal sections contain substantially the same audio content, and if there is a match, indicates the timing difference between the points at which that audio content occurs in both the first and second audio signal sections.
  58. A computer software product for controlling a computer to detect whether first and second audio signals represent substantially the same audio content, the computer software product comprising a computer readable medium having program code stored thereon which when executed on a computer causes the computer to perform the steps of:
    capturing first and second audio signals;
    generating a representation of sections of each of the first and second signals, the representation being based on the frequencies present within each section; the signal section and representation of the first audio signal being longer than the signal section and representation of the second audio signal;
    correlating the representation of the first audio signal section with the representation of the second signal section at different relative timing differences;
    and providing an output, wherein the output indicates a match when the first and second audio signal sections contain substantially the same audio content.
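The processing recited in the claims above — dividing a signal section into overlapping frames, taking the dominant frequency of each frame via a Fourier Transform, then correlating a shorter representation against a longer one at every relative timing offset — can be sketched as follows. This is an illustrative reading only, not the patented implementation: the frame length, hop size and match threshold are assumed values, and the helper names are hypothetical.

```python
# Illustrative sketch of the claimed technique (not the patented implementation).
# Assumed parameters: frame_len, hop and threshold are example values.
import numpy as np

def dominant_frequency_signature(signal, sample_rate, frame_len=1024, hop=512):
    """Dominant frequency (Hz) of each overlapping frame of the signal,
    using an FFT per frame (cf. claims 46-47 and 54-55)."""
    window = np.hanning(frame_len)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    signature = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        spectrum = np.abs(np.fft.rfft(signal[start:start + frame_len] * window))
        signature.append(freqs[np.argmax(spectrum)])
    return np.asarray(signature)

def best_match_offset(long_rep, short_rep, threshold=0.9):
    """Slide the shorter representation along the longer one, computing a
    normalised correlation at each relative offset; return (offset, score),
    with offset None if the best score never reaches the threshold."""
    n = len(short_rep)
    best_offset, best_score = None, -1.0
    for offset in range(len(long_rep) - n + 1):
        segment = long_rep[offset:offset + n]
        denom = np.linalg.norm(segment) * np.linalg.norm(short_rep)
        score = float(np.dot(segment, short_rep) / denom) if denom else 0.0
        if score > best_score:
            best_offset, best_score = offset, score
    return (best_offset, best_score) if best_score >= threshold else (None, best_score)
```

The returned offset corresponds to the claimed timing difference between the points at which the common audio content occurs in the two signal sections; a score below the threshold corresponds to the "no match" output.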
EP03254607A 2002-07-31 2003-07-24 Signal comparison method and apparatus Withdrawn EP1387514A3 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB0217772 2002-07-31
GB0217772A GB2391322B (en) 2002-07-31 2002-07-31 Signal comparison method and apparatus

Publications (2)

Publication Number Publication Date
EP1387514A2 true EP1387514A2 (en) 2004-02-04
EP1387514A3 EP1387514A3 (en) 2008-12-10

Family

ID=9941474

Family Applications (1)

Application Number Title Priority Date Filing Date
EP03254607A Withdrawn EP1387514A3 (en) 2002-07-31 2003-07-24 Signal comparison method and apparatus

Country Status (2)

Country Link
EP (1) EP1387514A3 (en)
GB (1) GB2391322B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2685450A4 (en) * 2012-05-23 2015-11-11 Enswers Co Ltd Device and method for recognizing content using audio signals
EP2774391A4 (en) * 2011-10-31 2016-01-20 Nokia Technologies Oy Audio scene rendering by aligning series of time-varying feature data

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050080576A1 (en) * 2003-10-10 2005-04-14 Dickerson Robert T. Method and system for frequency domain time correlation
CN110718237B (en) * 2018-07-12 2023-08-18 阿里巴巴集团控股有限公司 Crosstalk data detection method and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2226718A (en) * 1988-11-17 1990-07-04 British Broadcasting Corp Aligning two audio signals
WO2000079709A1 (en) * 1999-06-18 2000-12-28 Apel Steven G Audience survey system, and systems and methods for compressing and correlating audio signals
WO2001004870A1 (en) * 1999-07-08 2001-01-18 Constantin Papaodysseus Method of automatic recognition of musical compositions and sound signals
CA2310769A1 (en) * 1999-10-27 2001-04-27 Nielsen Media Research, Inc. Audio signature extraction and correlation

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5099456A (en) * 1990-06-13 1992-03-24 Hughes Aircraft Company Passive locating system
DE4243831A1 (en) * 1992-12-23 1994-06-30 Daimler Benz Ag Procedure for estimating the runtime on disturbed voice channels


Also Published As

Publication number Publication date
GB2391322A (en) 2004-02-04
GB0217772D0 (en) 2002-09-11
GB2391322B (en) 2005-12-14
EP1387514A3 (en) 2008-12-10

Similar Documents

Publication Publication Date Title
EP2327213B1 (en) Feature based calculation of audio video synchronization errors
US9715626B2 (en) Method and apparatus for automatically recognizing input audio and/or video streams
EP3011691B1 (en) System and method to assist synchronization of distributed play out of content
US9131270B2 (en) Simulcast resolution in content matching systems
CN102084416B (en) Audio visual signature, method of deriving a signature, and method of comparing audio-visual data
US9111580B2 (en) Time alignment of recorded audio signals
US20020178410A1 (en) Generating and matching hashes of multimedia content
MXPA03004846A (en) Apparatus and method for measuring tuning of a digital broadcast receiver.
US20090234649A1 (en) Audio matching
US20140114456A1 (en) Methods and Systems for Clock Correction and/or Synchronization for Audio Media Measurement Systems
JP4697371B2 (en) Commercial detection method and apparatus
US8780209B2 (en) Systems and methods for comparing media signals
EP2315455B1 (en) AV delay measurement and correction via signature curves
EP1387514A2 (en) Signal comparison method and apparatus
US20190377954A1 (en) Comparing video sequences using fingerprints
US6549757B1 (en) Method and system for assessing, at reception level, the quality of a digital signal, such as a digital audio/video signal
CN109413469A (en) A live-streaming co-hosting ("lianmai") delay control method, device, electronic device and storage medium
US8754947B2 (en) Systems and methods for comparing media signals
US20050113953A1 (en) Method for synchronizing signals acquired from unsynchronized sensors
KR100763598B1 (en) Apparatus and method of frame synchronization using phase differential information in dvb transmission systems
US7587015B2 (en) Asynchronous digital data capture
Mosharafa et al. A novel algorithm for synchronizing audio and video streams in MPEG-2 system layer
CA2692872C (en) Systems and methods for comparing media signals
JPH11252588A (en) Broadcast confirming method and its device

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LI LU MC NL PT RO SE SI SK TR

AX Request for extension of the european patent

Extension state: AL LT LV MK

PUAL Search report despatched

Free format text: ORIGINAL CODE: 0009013

AK Designated contracting states

Kind code of ref document: A3

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LI LU MC NL PT RO SE SI SK TR

AX Request for extension of the european patent

Extension state: AL LT LV MK

17P Request for examination filed

Effective date: 20090610

AKX Designation fees paid

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LI LU MC NL PT RO SE SI SK TR

17Q First examination report despatched

Effective date: 20091029

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20150203