GB2597920A - Fault monitoring in a communications network - Google Patents


Info

Publication number
GB2597920A
GB2597920A (Application GB2011876.6A)
Authority
GB
United Kingdom
Prior art keywords
data
fault
network
trend
historical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
GB2011876.6A
Other versions
GB202011876D0 (en)
GB2597920A8 (en)
Inventor
Blake Andrew
Shannon Michael
Current Assignee
Spatialbuzz Ltd
Original Assignee
Spatialbuzz Ltd
Priority date
Filing date
Publication date
Application filed by Spatialbuzz Ltd filed Critical Spatialbuzz Ltd
Priority to GB2011876.6A priority Critical patent/GB2597920A/en
Publication of GB202011876D0 publication Critical patent/GB202011876D0/en
Publication of GB2597920A publication Critical patent/GB2597920A/en
Publication of GB2597920A8 publication Critical patent/GB2597920A8/en

Classifications

    • H04W 24/08 Testing, supervising or monitoring using real traffic
    • H04W 24/04 Arrangements for maintaining operational condition
    • H04W 24/10 Scheduling measurement reports; arrangements for measurement reports
    • H04L 41/0677 Localisation of faults
    • H04L 41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L 41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L 41/0681 Configuration of triggering conditions
    • H04L 41/069 Management of faults, events, alarms or notifications using logs of notifications; post-processing of notifications
    • H04L 41/142 Network analysis or design using statistical or mathematical methods
    • H04L 43/04 Processing captured monitoring data, e.g. for logfile generation

Abstract

A method of fault monitoring in a communications network, the method comprising: receiving network performance data from a number of user devices connected to the network; identifying a trend in the performance data; and obtaining information about a fault in the network based on the identified trend and stored historical data. The identified trend may be compared to historical trends indicative of a historical fault. The trends may match if a correlation factor between them exceeds a threshold. The threshold may be increased if the identified fault was predicted incorrectly. The information obtained may include: the type of fault, cause of fault, time of development, or possible mitigation measures. Performance data can include objective data, such as RSSI, bit-error rate, data rate, etc., and subjective data, such as user reports or status checks.

Description

FAULT MONITORING IN A COMMUNICATIONS NETWORK
FIELD OF THE INVENTION
The present invention relates to a method, system and computer program for fault monitoring in a communications network.
BACKGROUND OF THE INVENTION
A network operator of a communications network will typically have a large number (hundreds or thousands) of alarms and other forms of fault report evident on their network at any given point in time. Some of these alarms will be trivial and indicate a state which is operationally acceptable, but perhaps requires attention at the next planned maintenance visit. Examples of this might be a base-station mast-head power amplifier which is running hotter than it ordinarily should, or an RNC cabinet temperature which is higher than normal. Most alarms, however, indicate some form of 'failure', for example a lower radio-frequency power output from a base-station than should be produced based upon its operational state (e.g. number and range-distribution of users) or a complete shutdown of a site.
Still other faults may exist which do not result in an alarm being communicated to an operator. This may occur, for example, if a weather event causes an antenna system to change its pointing angle, thereby reducing or removing coverage from some users (at least some of whom would then undertake a status check to try and find out if there is a known problem).
A network operator does not typically have the resources necessary to inspect, diagnose and repair all of these faults, or even a majority of them, and most networks have to 'live' with a tolerable degree of 'failure' at any given point in time. The operator therefore has to decide how best to deploy their maintenance resources whilst achieving the greatest level of satisfaction from the network customers (the users).
At present, this may be achieved by ranking the sites exhibiting faults based upon which sites generate the most revenue. Similar metrics may also be used for other equipment, such as RNCs or transmission links; these will typically result in greater numbers of inconvenienced users as they may well serve multiple sites (resulting in more status checks from those users); this is likely to put them at, or close to, the top of the maintenance ranking.
Whilst this method works, to a degree, it makes assumptions about the numbers of users impacted and, crucially, about the users' perception of the failure. Taking an extreme example, if a base transceiver station (BTS) had failed, other nearby BTSs then took over serving the affected users, and all of those users were only sending occasional text messages (and doing nothing else), then those users would probably notice little or no difference to their service. The local BTS which had failed, however, might still appear as a high priority to repair, perhaps due to the type of alarm generated. In reality, even if the site wasn't repaired for days or weeks, these (text-message-only) users would not notice and nor would they be dissatisfied customers. Conversely, a failed site with fewer, but (say) heavier, data users would lead to many more complaints and a very dissatisfied user base. A sensible approach would be to rank the repair of the latter site higher than that of the former, but the aforementioned method would likely not do this.
An alternative approach would be to rank failed sites (or other network components or equipment alarms) according to how many users undertook a 'status check', e.g. used an app on their phone, or a web-site, in order to check if there were known service problems at their location. Such checks are an indication of user dissatisfaction with the service they are receiving, as users rarely make such checks if they are receiving a good service. Whilst this mechanism may appear to solve the above ranking problem, there are a number of issues with it:
1) Users may be suffering congestion on the network which is unrelated to equipment failure, but will still undertake status checks;
2) Users may have experienced a small drop in performance, due to a failure in a local piece of network equipment, but are not suffering unduly. For example, they may be experiencing a reduced, but still reasonable, data rate. Such users may well still undertake a status check, but would not be as unhappy as other users, elsewhere on the network, who had suffered a dramatic drop in data rate; the latter should obviously have a higher priority from a maintenance perspective;
3) Specific types of user may be suffering problems, whereas other users may be unaffected. For example, heavy data users and gaming users would suffer if a latency-related problem occurred, whereas lighter data users and voice or text users would probably not notice a problem at all. Whether this situation constitutes a high priority may be an operator-specific question; however, at the very least, diagnostic data would be useful here in order to determine why these users were unhappy;
4) At present, there is a degree of scepticism, on the part of network operators, regarding whether a rising trend of status checks in relation to a particular site or area is a valid indication of a problem in that area of the network.
This scepticism derives largely from a lack of experience, by the operators, of this relatively new technology.
A problem with using this status-check based approach in isolation is that the availability of means for making a check (e.g. penetration of an 'app' which allows status checks to be made) may be poor and hence the number of impacted users may be difficult to judge. For example, a simple scaling of the number of status checks by app penetration statistics (e.g. if 1% of customers have the app, then multiply the number of status checks by 100 to give an indication of the number of impacted users) is potentially very inaccurate, especially with very low app penetration levels (which is typical, at present).
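As an illustration of why this scaling is fragile, the naive estimate described above can be sketched as follows (the function name and figures are hypothetical, not taken from the patent):

```python
def estimate_impacted_users(status_checks: int, app_penetration: float) -> float:
    """Naively scale observed status checks up by app penetration.

    With very low penetration (e.g. 1%), each additional status check moves
    the estimate by ~100 users, which is why this approach is potentially
    very inaccurate.
    """
    if not 0 < app_penetration <= 1:
        raise ValueError("app_penetration must be a fraction in (0, 1]")
    return status_checks / app_penetration

# 12 checks at 1% penetration suggests roughly 1200 impacted users; a single
# extra check would shift the estimate by around 100 users.
```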
SUMMARY OF THE INVENTION
A first aspect of the invention provides a method of fault monitoring in a communications network, the method comprising: receiving network performance data from a number of user devices connected to the communications network; identifying a trend in the network performance data; and obtaining information about a fault in the network based on the identified trend and stored historical data.
The network performance data may comprise objective data and subjective data, the objective data and subjective data indicative of network performance.
Obtaining information about a fault in the network based on the identified trend and stored historical data may comprise: comparing the identified trend to historical trends in the historical data, each historical trend being indicative of an historical fault in the historical data; determining that the identified trend matches a first historical trend, the first historical trend being indicative of a first historical fault; extracting historical information regarding the first historical fault in the historical data; and obtaining information regarding a fault in the network based on the extracted information regarding the first historical fault.
Determining that the identified trend matches the first historical trend may comprise determining a correlation factor between the identified trend and the first historical trend, and determining that the correlation factor exceeds a predetermined threshold.
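One way the matching step above could be realised is with a Pearson correlation factor between the two trends. This is a minimal sketch, assuming the trends are equal-length numeric series; the 0.9 threshold and both function names are illustrative, not specified by the patent:

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    if sd_a == 0 or sd_b == 0:
        return 0.0  # a flat series carries no trend to correlate against
    return cov / (sd_a * sd_b)

def trend_matches(identified, historical, threshold=0.9):
    """Declare a match when the correlation factor exceeds the threshold."""
    return pearson(identified, historical) > threshold
```

For example, a steadily rising status-check count would correlate strongly with a historical ramp that preceded a known fault, while an unrelated falling trend would not.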
The method may further comprise increasing the predetermined threshold upon determining that the real outcome indicates that the identified fault was predicted incorrectly.
The obtained information may include one or more of the following about the fault in the network: the type/nature of the fault; the cause of the fault; measures that can be taken to minimise or eliminate the impact of the fault on the users of the network; and time and manner in which the fault is expected to develop.
The method may further comprise the step of reporting the obtained information to a network operator.
The network performance data may comprise one or more of: received signal strength (RSSI); signal-to-interference, noise and distortion ratio (SINAD); energy per bit to noise power spectral density ratio (Eb/N0); bit-error rate; data-rate; data throughput; latency; number of status-check and/or other types of user reports; rate of increase of status-check and/or other types of user reports; geographic distribution of status-check and/or other types of user reports.
The network coverage may be broken into a plurality of regions. Each region of the plurality of regions may be a hexagon.
Identifying a trend in the network performance data may comprise using data from only a first region of the plurality of regions to identify the trend.
Identifying a trend in the network performance data may comprise using data from more than one region of the plurality of regions to identify the trend.
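By way of illustration only, identifying a rising trend within the regions of such a grid might be sketched as follows; the interval bucketing, region identifiers and doubling criterion are all assumptions, not taken from the patent:

```python
from collections import defaultdict

def rising_regions(reports, factor=2.0):
    """Flag regions whose status-check count in the latest time interval is
    at least `factor` times the count in the preceding interval.

    `reports` is an iterable of (region_id, interval_index) pairs, e.g. one
    pair per status check, bucketed into hourly intervals.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for region, interval in reports:
        counts[region][interval] += 1
    flagged = []
    for region, per_interval in counts.items():
        latest = max(per_interval)
        previous = per_interval.get(latest - 1, 0)
        if previous and per_interval[latest] >= factor * previous:
            flagged.append(region)
    return flagged
```

A trend could equally be identified across several adjacent regions by merging their counts before applying the same test.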
The received network performance data may be stored in a user database.
A second aspect of the invention provides a computer program, which when executed by processing hardware, performs the method of the first aspect.
A third aspect of the invention provides a system with data processing resources and memory, configured to perform the method of the first aspect.
BRIEF DESCRIPTION OF THE DRAWINGS
By way of example only, some embodiments of the invention will now be described with reference to the accompanying drawings, in which:
Figure 1 shows a network receiving objective and subjective data;
Figure 2 shows a flow chart for a fault prediction algorithm;
Figure 3 shows a flow chart for determining a correlation between current and historical fault data; and
Figure 4 shows a flow chart for self-learning in a fault prediction algorithm.
DETAILED DESCRIPTION OF EMBODIMENT(S)
Figure 1 illustrates a system 100 for collecting performance data about a cellular communications network for use in diagnosing faults in the communications network. The performance data includes subjective data 124 and objective data 120, 122:
* Subjective data 124 is performance data related to the user's perceived performance of the communications network and is derived from user reports of issues with the communications network. The subjective data 124 may include status checks by the user and other user-reported metrics.
* Objective data 120, 122 is performance data derived from measurements from mobile devices 110a belonging to the users reporting issues with the communications network as well as measurements from mobile devices 110b belonging to other users nearby (to help distinguish between faults caused by a mobile device 110a, 110b from faults caused by a fault in the communications network). Objective data 120, 122 may include measurements taken by a mobile device 110a, 110b of the service quality it is experiencing (for example, received signal strength, transmitter output power, received and transmitted data rates, data throughput; latency, voice quality, bit error rate, signal-to-interference, noise and distortion (SINAD) and any other metric which the mobile device 110a, 110b is capable of reporting).
The performance data (subjective data 124 and objective data 120, 122) is stored in database 140 to build up a body of historical performance data which can be used in the diagnosis of future network faults. As the causes and impact of network faults are identified, these can be stored alongside the associated historical performance data in the database 140.
The current network performance data is then compared against comparable historic data in the database 140 in order to diagnose the cause of a fault in the communications network, based on what was identified to be the cause of the fault in the comparable historic data. In effect, the network fault diagnosis tool assesses whether similar network circumstances have occurred in the past, such as a similar level and distribution of affected users (as evidenced by the subjective data 124 such as status check requests) and similar network performance conditions (based on objective data 120, 122 measured from the mobile devices 110a belonging to the user reporting issues as well as measurements from other nearby mobile devices 110b), and optionally based upon a similar type of area (such as urban, suburban, rural, indoor, outdoor, etc.).
The network fault diagnosis tool is able to learn from the outcomes it proposes by comparing its proposal with the true cause of the fault entered into the database 140 after definitive diagnosis by a communications network engineer.
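A minimal sketch of this feedback loop, combined with the threshold increase described in the summary, might look as follows; the additive step size and the 0.99 cap are illustrative assumptions, as the patent only states that the threshold may be increased after an incorrect prediction:

```python
class MatchThreshold:
    """Correlation threshold that tightens when a prediction proves wrong."""

    def __init__(self, threshold=0.9, step=0.01, cap=0.99):
        self.threshold = threshold
        self.step = step
        self.cap = cap

    def record_outcome(self, predicted_correctly: bool) -> None:
        # An incorrect prediction means a match was declared too readily,
        # so demand a stronger correlation before the next match.
        if not predicted_correctly:
            self.threshold = min(self.threshold + self.step, self.cap)
```

The `predicted_correctly` flag would come from the engineer's definitive diagnosis entered into the database 140.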
Further details of the nature of the subjective and objective performance data will now be discussed with reference to Figure 1.
Subjective Data
Subjective data 124 is user-generated data on the status or performance of the network as perceived by the user of a mobile device 110c. Such subjective data 124 may be generated in a number of different ways, including:
* Status checks - these are checks made by the user, typically using an app on their mobile device 110c that has been provided for the purpose by the network operator (the app typically has many other functions as well, such as providing the ability to access the user's bill, usage to date, coverage maps etc.). The user will typically undertake a status check when they are experiencing a problem with the communications network or when they are receiving a poorer service than they might expect. A status check typically involves pressing a virtual button in the app on the touch-screen of the mobile device 110c, which sends a message to the network operator asking if there is any known problem on the communications network local to the user. If there is a known problem, an explanatory message will typically be sent to the user's mobile device 110c in response, acknowledging that there is a problem and perhaps indicating the nature of the problem and when it will be rectified. A status check can also be undertaken in a similar way using a web browser pointed to the operator's website.
* Feedback reports - these can either be reports voluntarily submitted by the user (for example, via the network operator's website) which are essentially complaints about the service the user is receiving, or reports elicited by the network operator sending out a survey to selected users. Such surveys could, for example, be targeted at users in an area where it is possible that a problem exists - where other local users have undertaken status checks, for example - and the network operator wants to understand other users' experiences.
* Notification subscriptions - users can subscribe to notifications relating to when a network repair will be completed. A large number of such subscriptions (in a given area) could indicate that a large number of users are very unhappy about the service (or the lack of service) that they are currently receiving and are keen to know the moment it is restored to normal.
* Calls to a call centre - users may call a customer service call centre to ask about the status of the network in their local area and to report problems with their service. A large number of calls from a particular area could indicate that there is a problem in that area.
There are, of course, many other possible ways in which a user could communicate their subjective view of the network (for example, via social media, either involving the operator or just complaining generally). It should be emphasised that all of the above reports (from users) are subjective - they relate to the user's perception of the network - and do not necessarily indicate that a fault exists, simply that the network, for whatever reason, does not meet the expectations of that particular user, in that particular location, at that particular time. Clearly, however, a large number of such reports, in a given area, at a given time, are potentially indicative of a network problem, even if that problem is simply 'congestion'.
The subjective data 124 is collected by subjective data server 138. The subjective data 124 may be collected automatically (for example, from status checks performed on an app or website, or electronic feedback reports) or manually entered (for example, following a call with a call centre, the operator may manually enter the subjective data 124 into the subjective data server 138). The subjective data server 138 processes the subjective data 124 into a format suitable for database 140, before loading the subjective data 124 onto the database 140 where it is associated with an anonymised device identifier for the particular mobile device 110c, to allow the subjective data to later be associated with other relevant performance data for the particular mobile device 110c, such as the objective measurement data discussed below.
Objective Data
Objective data is data collected or measured by the mobile devices 110a-c themselves.
Figure 1 illustrates two methods for collecting objective data: batch-data collection 119 and live-data collection 121.
Batch-data Collection
Batch-data collection 119 periodically (typically hourly) collects measurement data 120 from all mobile devices 110a connected to the communications network at measurements collection server 130. Given the need to collect measurement data 120 from all mobile devices 110a connected to the communications network, batch-data collection 119 is designed to handle very large volumes of data. For example, although measurement data 120 is typically collected from each mobile device 110a every hour, the exact collection times from each individual mobile device 110a may be randomly staggered to ensure that not all mobile devices 110a are trying to send their measurement data 120 simultaneously.
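The staggering of collection times could, for instance, be implemented by deriving a stable pseudo-random offset per device; this hash-based approach is an illustrative sketch, as the patent does not prescribe a mechanism:

```python
import hashlib

def upload_offset_seconds(device_id: str, period_seconds: int = 3600) -> int:
    """Stable pseudo-random offset within the collection period for a device.

    Hashing the device identifier spreads uploads roughly uniformly across
    the hourly collection period without any central coordination, and each
    device always computes the same slot for itself.
    """
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % period_seconds
```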
The measurement data 120 comprises measurements taken by a mobile device 110a of the network service quality it is experiencing (for example, received signal strength, transmitter output power, received and transmitted data rates, data throughput, latency, voice quality, bit error rate, signal-to-interference, noise and distortion (SINAD) and any other metric which the mobile device 110a is capable of reporting).
Measurements collection server 130 generates a measurement report data file 131 for each set of measurement data from a mobile device 110a. The measurement report data file 131 contains the measurement data 120 with a timestamp at which the measurement data 120 was collected and an identifier associated with the mobile device 110a (which is typically an anonymised version of the identifier provided by the mobile device 110a to protect user privacy).
The measurement collection server 130 typically adds each measurement report data file 131 to a data queue 132 to await processing by the measurements batch processor 134.
The measurements batch processor 134 takes the measurement report data files 131 from the data queue 132 and essentially provides a translating/transformation process, converting the measurement report data files 131 and the data within them into the correct format to be stored in the database 140.
The data leaving the measurements batch processor 134 to enter the database 140 typically contains the following:
1) Anonymised identification - the identity of the user device from which the data originated is discarded and an anonymous (random) identity is attached. This allows the data from a mobile device 110a to be assessed over time without (potentially) infringing the privacy of the user of the mobile device 110a.
Anyone interrogating the database 140 would be unable to identify the mobile device 110a or its user, only that measurements have come from the same mobile device 110a or user.
2) A randomised identifier for the measurement report itself, to allow duplicates to be recognised and eliminated.
3) A location identifier indicating the network area in which the mobile device 110a was operating at the time the measurements were taken.
4) The location of the BTS which was serving the mobile device 110a at the time the measurements were taken.
5) The (compass) bearing of the mobile device 110a from that cell site.
6) The approximate distance of the mobile device 110a from the cell site's location.
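Gathering the fields above, a record entering the database might look like the following sketch; the field names and types are illustrative, since the patent describes the content of the record, not a schema:

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class MeasurementRecord:
    """A processed measurement report as it enters the database 140."""
    device_anon_id: str                         # 1) anonymous (random) identity, not the real device id
    report_id: str = field(default_factory=lambda: uuid.uuid4().hex)  # 2) randomised id for de-duplication
    location_area: str = ""                     # 3) network area at measurement time
    serving_cell_location: tuple = (0.0, 0.0)   # 4) lat/lon of the serving BTS
    bearing_deg: float = 0.0                    # 5) compass bearing of the device from the cell site
    distance_km: float = 0.0                    # 6) approximate distance from the cell site
    measurements: dict = field(default_factory=dict)  # RSSI, data rate, latency, etc.
```

Because `report_id` is freshly randomised per record, two otherwise identical uploads can still be told apart and duplicates eliminated.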
The measurements batch processor 134 typically runs periodically (hence the requirement for the data queue 132), with the interval between initiating each run typically being around five minutes.
Although only a single measurement collection server 130 is shown in Figure 1, it is possible to have multiple measurement collection servers 130, each feeding one or more batch processors 134.
Live-data Collection
Live-data collection 121 collects live measurement data 122 from a mobile device 110b of the network service quality it is experiencing at that point in time (for example, received signal strength, transmitter output power, received and transmitted data rates, latency, voice quality, bit error rate, signal-to-interference, noise and distortion (SINAD) and any other metric which the mobile device 110b is capable of reporting).
Live data collection 121 is triggered in response to the generation of subjective data 124. For example, the occurrence of a user performing a status check from their mobile device 110b, 110c triggers their mobile device 110b, 110c to obtain live measurement data 122.
Live measurement data 122 may also be requested, by a live data server 136, from other mobile devices 110b which have not initiated a status check, but which happen to be local to an area of interest, either based for example upon the number of status checks in that area or upon a specific operator interest (such as a stadium during an event). In both cases, the trigger for the collection of live measurement data 122 is subjective, i.e. a network user is, in their opinion, experiencing a poor or degraded level of service relative to that which they have experienced in the past or would reasonably expect to receive. This is inherently subjective, as different users will have differing opinions (or thresholds) as to what constitutes 'poor' or 'degraded'. Collecting live measurement data 122 from other mobile devices 110b may aid in determining whether the issue which caused a user to initiate a status check is unique to that user (meaning that it may well be a problem with his/her mobile device 110b) or more general to the area (and if so, ascertain how widespread the issue might be). A more general experience of the problem (e.g. a low data rate) may well indicate that there is an issue with the communications network in that area.
Other triggers may also initiate live data collection 121, such as submitting web-based status requests or complaints. In this case, live measurement data 122 may be collected from nearby mobile devices 110b while a subset of this live measurement data (such as network speed) may be collected from the user or users. It is also possible to infer the connection type of the web-based user (i.e. Wi-Fi or cellular). In the case of a cellular connection, the network speed will indicate the user's network experience. If the user is connected over Wi-Fi, this may indicate that there is a catastrophic issue with the cellular network in that area (since the user needs to resort to Wi-Fi to request a status check). Measurement data from web-based users can be filtered out (and not used in subsequent fault analysis, for example) if the user is identified as not using the network operator's network when making the status check, or not using it in the location about which the status check or coverage query is made.
Live data collection 121 typically comprises fewer servers (perhaps one-tenth of the number involved in batch-data collection 119), since far less live measurement data 122 is collected (or needs to be collected) than batch measurement data 120; live measurement data 122 only needs to be collected in response to a user-initiated status check and there are few of these relative to the number of mobile devices 110b active on the communications network at a given point in time. Essentially, live measurement data 122 is only uploaded when it is interesting to do so - that is, there is an immediate reason to do so - and this uploading is undertaken immediately.
The live data server 136 enters the live measurement data 122 into the database 140 along with:
1) Anonymised identification - the identity of the mobile device 110b from which the live measurement data 122 originated is discarded and an alternative anonymous identity is attached. This allows the live measurement data 122 from a particular mobile device 110b to be assessed over time without (potentially) infringing the privacy of the user of the mobile device 110b. Anyone interrogating the database 140 would be unable to identify the mobile device 110b or its user, only that measurements have come from the same mobile device 110b or user.
2) An alternative identifier for the live measurement data 122, to allow duplicates to be recognised and eliminated.
3) A location identifier indicating the network area in which the mobile device 110b was operating at the time the measurements were taken.
4) The location of the cell site which was serving the mobile device 110b at the time the measurements were taken.
5) The (compass) bearing of the mobile device 110b from that cell site.
6) The approximate distance of the mobile device 110b from the cell site's location.
Database 140
The database 140 stores all of the measurement data (batch or live) in the form of records or tuples, within tables, in its structure. The database is typically an off-the-shelf product (such as Oracle®, Postgres® and the like) which is configured for this specific application (i.e. that of storing, and allowing access to, data collected from individual mobile devices 110a-c). It can be accessed by the network operator directly or by other systems owned, managed or used by the network operator.
The database may also store data from a range of other pertinent data sources to aid in fault diagnosis, such as: 1) Data 141 relating to network change requests (requests for changes to the network configuration, such as the position or pointing angle of one or more antennas, the installation or de-commissioning of a base-station, etc.) and/or planned maintenance operations. This can help to inform decisions regarding whether a network change may be the root cause of an increase in the number of status checks locally to the change or if they may simply be as a result of a planned local outage in the network for maintenance or upgrade purposes.
2) Data 142 relating to 'trouble tickets' and/or known incidents on the network.
These are incidents or problems of which the network operator is already aware and which may or may not be being dealt with already. Such information can be communicated to the users (e.g. in response to a status check), as appropriate.
3) Data 143 relating to network configuration information, such as cell-site locations, RNC/BSC parents and connectivity, antenna pointing angles, transmit power levels, etc. This information can be used, for example, to determine from which nearby user devices measurement data should be requested, in the event of one or more local users initiating a status check.
4) Data 144 relating to network alarms. This can be used to correlate status checks and (poor) measurement data with known alarm conditions and, potentially, thereby raise their status within the maintenance hierarchy.
5) Data 145 relating to network performance characteristics, such as the amount of traffic being handled by each cell and the availability of each cell.
6) Data 146 from a network planning tool, including the designed network topology (which may not necessarily exactly match the network as deployed). This database will contain coverage maps and coverage predictions and may be used to assess whether the reported issue stems simply from the fact that the user is outside of the designed network coverage area.
Data 145 and 146 provide the basis for a root-cause analysis to be undertaken, in order to identify the location (within the network hierarchy) of the faulty element.
Combining Subjective Data and Objective Data

Since data in the database 140 is associated with an (anonymised) identifier for each mobile device 110a-c, subjective data based on status checks and other information provided by the user of the mobile device 110c can be associated with objective data (batch and/or live measurement data) from the same mobile device 110a, 110b.
For example, if a user requests a status check from the network operator's app running on mobile device A, data relating to the status check will be stored on the database 140 with an anonymised identifier associated with mobile device A. Simultaneously, or soon after, live measurement data 122 will be requested from mobile device A, either by the live data server 136 or the app itself, and this live measurement data 122 will also be assigned to the anonymised identifier associated with mobile device A. In this way, the subjective and objective data may be combined when the database is queried to form a richer and more powerful resource to assist the network operator in identifying and diagnosing faults.
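The association described above amounts to a grouping by anonymised identifier. A minimal Python sketch (record shapes and key names are assumptions, not from the patent):

```python
from collections import defaultdict

def combine_by_device(subjective, objective):
    """Group subjective reports (status checks) and objective measurement
    records under the anonymised device identifier they share, so a query
    sees both views of one device side by side."""
    combined = defaultdict(lambda: {"subjective": [], "objective": []})
    for rec in subjective:
        combined[rec["anon_id"]]["subjective"].append(rec)
    for rec in objective:
        combined[rec["anon_id"]]["objective"].append(rec)
    return dict(combined)
```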
Each of the blocks of Figure 1 could be implemented by a physically separate piece of hardware (such as a computer, server, hard disk storage unit or other item of electronic hardware), or some functions could be combined into a single piece of hardware (e.g. the measurement collection server 130, data queue 132 and measurements batch processor 134 could be integrated into a single block). It is also possible that some or all of these hardware items could be virtualized and be assigned to disparate hardware elements deployed by a third-party service provider, such as a cloud computing services provider. In this case, a 'server' could actually be a virtual server, with tasks executed and spread across a number of physical hardware devices, potentially in different physical locations. In all of these physical hardware configurations, however, the main elements shown will be present, either physically/individually, or in varying degrees of virtualisation.
The system of Figure 1 has the ability to scale as needed; that is, it is straightforward to add more computing resources as required, depending upon the volume of reports it is receiving. This may well increase over time as more customers are encouraged to sign up to use the operator's service-reporting/billing app. The system could be implemented on a cloud computing platform to facilitate scaling.
Fault Monitoring and Prioritising Using Subjective Data

The network performance data collected by the system of Figure 1 and stored in the database 140 over time can be used in monitoring and prioritising faults in the communications network. Specifically, the network performance data can be used in combination with stored historical data to predict the emergence or development of faults in the communications network.
The manner in which this is performed is described with reference to Figure 2.
The method begins at step 200 and then moves to steps 205, 210 and 215 in parallel, in which data is received from users via their devices or directly from their devices. The network coverage is broken into a plurality of regions. In other words, the total area covered by the network is broken into a plurality of sub-areas, referred to here as regions. The method focuses on users in a single region, for example a hexagonal region, and may be duplicated or re-run focusing on other regions.
In optional step 205, data is received from one or more user devices concerning the app or apps running, or recently run, on those devices during the time in which the measurements included in the measurement reports (discussed in relation to step 215 below) were gathered.
In step 210, subjective data is received from the users' devices located in a given (hexagonal) region and a count is undertaken of the number of subjective reports received within that given region, over a given (rolling) time period. The rolling time period may be a rolling time period of 1 hour, for example. As discussed above, subjective reports are user-generated reports regarding the status or performance of the network that may be indicative of a fault.
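The rolling count described above can be sketched as follows; the sorted timestamps, second-based units and 1-hour default window are assumptions for illustration:

```python
from bisect import bisect_left, bisect_right

def reports_in_window(timestamps, now, window_s=3600):
    """Count subjective reports for one region whose (sorted, seconds)
    timestamps fall inside the rolling window [now - window_s, now].
    The 1-hour default mirrors the example rolling period above."""
    lo = bisect_left(timestamps, now - window_s)
    hi = bisect_right(timestamps, now)
    return hi - lo
```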
Subjective data reports from users may be combined in order to geographically group the reports and decide upon whether a significant number of reports exists within an area, and details of this are described in GB2546118A, the contents of which are herein incorporated by reference.
In step 215, objective data is received from the user devices reporting a problem whilst being located in the same given (hexagonal) zone, and (optionally) other mobile devices active locally to these devices and/or connected via the same resources (for example, BTS, RNC, transmission link etc.). This objective data takes the form of measurement reports on various RF and geographic parameters, such as location, signal strength or RSSI, bit-error rate, latency, dropped-call or call-retry statistics or any other quality-of-service related metrics measured and recorded by the user device. It is also desirable to collect this objective data periodically, irrespective of whether one or more users are reporting problems in that area or region. This data may be analysed, stand-alone, in order to spot trends which may indicate that a fault is developing, even before users begin to complain. Since the data is downloaded sparingly (i.e. periodically, in batches) and is highly focused upon the metrics of most relevance to a user's experience of the network, very little data is required across a whole network. This means that relatively little in the way of computing resources is required to store or process this data using the method described herein.
The method then moves on to step 220, in which the data obtained in steps 205, 210 and 215 is combined based upon an (anonymised) identity of the user device which captured/collected the relevant data. For example, a user device with an IMEI (International Mobile Equipment Identifier) number of: 123456789123456 may be given an identity for use within the method of 733788180057392 (or any other random string of characters which may or may not be partially or wholly numeric or alphabetical). All measurements, whether objective or subjective (or derived from, or about, an app operating or recently having operated on the device), from the device with an IMEI number 123456789123456 are formatted, ready for storage in a database, in a record or records identified with device identifier 733788180057392. The data may be split across a number of tables in the database with, for example, separate tables for each of the measurement characteristics (RSSI, data-rate, data throughput, error-rate, etc.), status checks and other subjective data, etc. The method then moves on to step 225, in which the above data is stored in user/measurements database 230, before moving to step 235.
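One way to realise the IMEI-to-random-identity mapping above is a persistent lookup table of random pseudonyms. This Python sketch is illustrative — the patent does not prescribe how the random string is generated or stored:

```python
import secrets

class Pseudonymiser:
    """Replace real device identifiers (e.g. IMEIs) with random pseudonyms.

    A fresh random identity is drawn on first sight of a device and reused
    thereafter, so its records remain linkable over time while anyone
    querying the database never sees the real identifier. The 15-digit
    shape simply mirrors the IMEI-style example above.
    """
    def __init__(self):
        self._table = {}

    def pseudonym(self, device_id):
        if device_id not in self._table:
            self._table[device_id] = "".join(
                secrets.choice("0123456789") for _ in range(15))
        return self._table[device_id]
```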
In step 235, the data stored in user/measurements database 230 is analysed for trends (at an individual region level or across groups of regions that are served by a common resource, such as an RNC or transmission link, or, more typically, both). The detailed operation of step 235 will be discussed below, in conjunction with Figure 3.
In step 240, a test is applied to ascertain if an identifiable trend, in some or all of the data, has been spotted which matches a previously-seen trend in any part of the network and, in particular, in a comparable part of the network (e.g. similar cell-size, geography, topography, etc.). It does this with reference to historical data (including historical trend data) stored in historical database 245. If a trend is identified, then the method moves on to step 250, which will be discussed below. If a trend is not identified, then the method moves on to step 265 in which a check is made as to whether a fault ultimately emerges in the relevant part of the network (i.e. something which the method failed to predict). If a fault does not emerge, then the method simply ends at step 275.
If, however, a fault does emerge, then the method moves to step 270 in which the details of this fault and its associated emergence trend are stored in the historical database 245, before the method ends at step 275.
Returning now to step 250, which is executed if a known trend is identified in step 240, the root cause of the fault that ultimately resulted from the recognised trend (and any associated timing data), is retrieved from historical database 245.
This historical data is analysed in step 255 and an estimate is derived of the time which may be expected to elapse before the fault fully emerges, based upon this historical data and the data on the emergence of the current trend. For example, if the identified trend is a gradual reduction in received signal strength by the mobile devices and the ultimate cause is a gradual over-heating and then failure of a base-station transmitter, then the rates of reductions of the signal strengths in the current and historic cases can be compared. If the current trend is emerging (so far) at, say, twice the rate of the historical case with which it is comparable, then the present fault may be expected to fully manifest itself in approximately half the time taken in the historic case, thereby allowing a definitive estimate to be calculated.
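The rate-based estimate in the example above reduces to a simple scaling. A Python sketch (rates in any consistent units, e.g. dB of signal-strength loss per hour):

```python
def time_to_emergence(current_rate, historic_rate, historic_duration):
    """Estimate the time until a predicted fault fully manifests by scaling
    the historic time-to-failure by the ratio of degradation rates, as in
    the worked example above (twice the rate -> roughly half the time)."""
    if current_rate <= 0:
        raise ValueError("current trend must be degrading")
    return historic_duration * (historic_rate / current_rate)
```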
It is noted that this example is likely to be overly-simplistic, for at least the following reasons: 1) Only one metric was discussed (signal strength); most trends will involve a degradation in a number of metrics and their relative degradations (or changes) will be indicative of differing faults or rates of emergence of a fault.
2) It was assumed that only a single comparable (historic) trend was identified in steps 235 and 240, whereas it is likely that multiple trends will be identified, each with its own correlation factor to the present trend (with some being more closely correlated than others, but with all those coming above a threshold still being considered relevant, by step 240). In this case, the method may consider only the most closely-correlated historic trend, or it may consider all trends, weighted by their correlation to the current trend, or it may consider only trends which have the same root cause and which form either a simple majority of the identified cases or a weighted majority of the identified cases (with the weighting being based upon the degree of correlation between each historic trend and the present situation). The method may choose to report any or all of these to the network maintenance team who can then decide upon what action to take. For example, in the case of a severe (predicted) fault, they may choose to take action based upon a number of possible causes reported by the method, e.g. remotely re-setting a number of pieces of equipment, not just the 'most likely' candidate.
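The 'weighted majority' option above can be sketched as a ranking of candidate root causes by summed correlation. The threshold value and data shapes are illustrative assumptions:

```python
from collections import defaultdict

def likely_root_causes(matches, threshold=0.5):
    """Rank candidate root causes from above-threshold historic matches.

    `matches` is a list of (correlation, root_cause) pairs; causes are
    ranked by the sum of the correlations supporting them, so a cause
    backed by several moderately-correlated historic trends can outrank
    one backed by a single strong match."""
    weights = defaultdict(float)
    for corr, cause in matches:
        if corr >= threshold:
            weights[cause] += corr
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
```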
The method then moves on to step 260 in which the above-derived estimate of the time taken for the fault to fully emerge, along with the nature of the expected fault, is reported to the network maintenance team via a console or any other suitable mechanism.
Optionally, the rate and form of degradation to the customer experience may be calculated or retrieved from historical database 245, for example the expected number of customer complaints over time, and this can also be reported. This could assist a network operator in deciding when to schedule a repair, relative to other possibly higher or lower priority faults.
The method then ends at step 275, but is typically re-run regularly or continuously.
Note that the above method has been described in relation to a single zone or region of the network (which may be hexagonal, as discussed previously). Typically, the method will be executed for all zones within the network. Such execution may be in parallel or serially or a combination of the two (e.g. 5 servers operating in parallel, each of which sequentially repeats the method for its assigned 1/5th of the zones in the network).
Figure 3 shows a more detailed view of step 235 of Figure 2. The method begins at step 300 and moves on to step 305, in which a counter 'N' is defined and set to '1'.
In step 310, metric 'N' is selected from a list of metrics to be checked, so in the first pass through the method, with N set to 1, the first metric is selected. The list of metrics to be checked may include (in no particular order): 1) Received signal strength (RSSI); 2) Signal-to-interference, noise and distortion ratio (SINAD); 3) Energy per bit to noise power spectral density ratio (Eb/N0); 4) Bit-error rate; 5) Data-rate; 6) Data throughput; 7) Latency; 8) Current number of status-check and other types of user reports; 9) Rate of increase of status-check and other types of user reports; and 10) Geographic distribution of status-check and other types of user reports.
The method then moves on to step 320 in which, for each area or region considered to be relevant by the method, the current/recent data obtained for metric N from user devices in the part of the network being examined (e.g. the cell or hexagon being considered by the method at this time) is correlated with an equivalent period of historical data, over a rolling time-period, 'T', where T may be 1 hour or any other period. In other words, this is a one-with-many correlation, where the 'one' is a particular part of the network which the method is considering and which may or may not be experiencing an emerging fault, and the 'many' is any region (including the same region) which is considered to be relevant, for comparison purposes, by the method.
The 'many' areas or regions to be considered may include all regions within the network or may include only 'equivalent' areas to the one with whose recent data a correlation is being undertaken (where 'equivalent' may be in the sense of: geographic area; type of environment, e.g. urban, rural, indoor, etc.; topography, e.g. flat, hilly, forested, low-rise buildings, high-rise buildings, etc.; or any other metric). The areas may be defined by means of hexagons of a range of sizes, or any other shape (including amorphous shapes).
The correlation being undertaken in step 320 is conducted between one set of data (covering time period T) from user/measurements database 330 and multiple sets of historical data, one correlation at a time, obtained from an historical faults and associated data trends database 345 (over a multitude of examples of time period T, rolling back over time). The output of step 320 is therefore many correlation coefficients, one for each of the correlation processes (e.g. present hexagon with a first historical hexagon, present hexagon with a second historical hexagon, etc. sliding over time, with the peak correlation of this process, for each hexagon, being recorded).
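The one-with-many sliding correlation of step 320 can be sketched as follows; a plain Pearson coefficient over equal-length windows is one reasonable reading of the 'correlation coefficient' referred to above, not the only possible one:

```python
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences;
    returns 0.0 where the coefficient is undefined (a flat series)."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sx * sy)

def peak_correlation(current, history):
    """Slide the current window of metric samples (time period T) along a
    longer historical series, one offset at a time, and keep the peak
    coefficient — the per-hexagon output of step 320 described above."""
    n = len(current)
    return max(pearson(current, history[i:i + n])
               for i in range(len(history) - n + 1))
```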
In step 340, each of these (peak) correlation coefficients is compared to a predetermined threshold. If none exceeds the predetermined threshold, in other words if the historical data does not contain any examples of sufficiently similar circumstances to the one which is evident at the present time and which subsequently result in a fault condition, then the method moves on to step 370, which will be discussed below.
If, on the other hand, one or more (peak) correlation coefficients exceeds the predetermined threshold, then the method moves on to step 350 in which the (peak) correlation coefficient value, the nature of the fault which develops and data associated with the fault, are temporarily stored in temporary store 360. Note that, for some of the data, it may be sufficient to store a pointer to the relevant data in the historical faults and associated data trends database 345.
Examples of the kind of data which may be stored include: 1) Correlation value/coefficient(s) for the identified area(s); 2) Location(s) of prior incident(s) within the historical faults and associated trends database 345 (or a copy of the data itself); 3) Data regarding the characteristics of the location(s) or region(s) of the potentially-similar historic incident(s), e.g. cell-size, geographic area, topography and any other potentially-relevant factors; 4) Details of the fault(s) which developed and their root-cause(s); 5) Details of the rate of emergence of the (historic) fault(s); and 6) Details of the measures taken to successfully remedy the fault(s) and the time taken to do so (this helps in generating relevant and accurate messaging to local users regarding the time before full service is restored).
The method then moves to step 370 in which the counter 'N' is incremented. There then follows a test 380 to ascertain if all metrics have been checked, i.e. if N now exceeds its maximum permitted value. If not, then the method loops back to step 310 and the next metric is assessed.
If N does now exceed its maximum value, Nmax, then the data from all positive correlation results, i.e. those which exceeded the predetermined threshold, is combined in step 390 based upon the same historical incident. For example, if a particular historic incident yields above-threshold correlation results for trends in signal strength, error-rate and latency, then these results would be combined within the wrapper of that single historical incident. All of the results from all of the Nmax iterations of the algorithm are then forwarded to step 240 of Figure 2 and the present method then ends at step 395. In the event that there are no above-threshold correlations found then, of course, a null result is forwarded to step 240 of Figure 2 and that method would then follow the 'No' path to step 265 and so forth.
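The per-incident grouping of step 390 can be sketched as follows (tuple shapes and identifiers are illustrative assumptions):

```python
from collections import defaultdict

def combine_by_incident(hits):
    """Group above-threshold correlation results under the historical
    incident they matched (step 390). `hits` holds (incident_id, metric,
    peak_correlation) tuples gathered over all metric iterations; an
    empty input yields the 'null result' forwarded to step 240."""
    combined = defaultdict(dict)
    for incident_id, metric, corr in hits:
        combined[incident_id][metric] = corr
    return dict(combined)
```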
As noted above, the method may be augmented by an ability to 'self-learn'. In this variant, the operator is able to feed-back to the method the nature and location of the actual fault (or none), should it differ from that predicted by the method, or an indication of which of the options provided by the method was, in actuality, the correct one. This feedback can then be used by the method to 'bias' the calculated or reported correlation values and hence, ultimately, the predictions which the method generates (or their relative correlation/confidence scores). An example method for this idea is provided in Figure 4, although any suitable self-learning or similar artificial intelligence mechanism may be employed.
The method begins at step 400 and continues to step 410 in which the network operator adds actual fault data (or a no-fault indication), for a given user-derived-data 'trend', to database 420. It is assumed that the operator will do this for any and all faults (or trends resulting in no fault) occurring in the network, with such faults (or no-faults) then being associated with any objective and subjective measurement data collected at or around the time of the fault (or none), covering a region covered by the trend-data.
The method then moves on to step 430 in which a new trend is analysed and a fault-prediction (or none) is made, for example using the method described in relation to Figure 2 and Figure 3. This new trend is analysed and will generate a fault prediction together with an associated confidence score derived from the correlation coefficient or value (which was discussed above in relation to those figures). This step will also take into account any cumulative reduction in the confidence score (discussed below in relation to step 440), for example by making it lower than the correlation value for the current trend and its historical near-equivalent(s). To take a specific example, if the correlation value from an analysis of the current trend and a historic trend is 0.8 (correlation values will typically be between 0 = no correlation and 1 = an exact match) and past experience indicates that the trend is not always a good predictor of the existence of a specific fault (but sometimes predicts the specific fault correctly), then the correlation coefficient could be reduced by a factor X, as will be discussed below, and this reduction stored in database 420, along with the relevant trend and fault data.
If X is, for example, 10%, then the confidence score would be 0.8-10% of 0.8 = 0.72.
If the same trend is seen again and the fault prediction is again incorrect, despite a relatively strong correlation of 0.8, then the confidence score could be reduced further, perhaps by another 10% and so on, with the new reductions being stored in database 420 as discussed above. The process of reducing the confidence score will be discussed further below in relation to step 450.
The method then moves on to step 440 in which it is established whether or not the newly-predicted fault is the same as, or similar to, a previously wrongly-predicted fault (in similar circumstances, e.g. geography, topography, cell-size, etc.). If so, the method moves on to step 450 in which the associated confidence score, calculated in step 430, is reduced by a percentage, X%, where X may be 10, for example (or any other figure which is >0 and <100). This new probability is stored in database 420 and associated with the measurement report data trend(s) and associated circumstances. Thus, next time a similar fault prediction is highlighted in a given set of circumstances, the new, lower, confidence score may be calculated from the correlation coefficient, using the saved (10%) correction-factor. If the same happens again, i.e. an incorrect prediction results from the same circumstances, then the correction factor used (and subsequently saved) may be increased by a further 10% (for example), taking the correction-factor to 20%. In this way the efficacy of the seemingly-strong prediction is reduced, perhaps ultimately to the point where the newly-predicted fault no longer meets the threshold for reporting by the method.
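The compounding correction factor described in steps 430-450 can be sketched as follows; the additive accumulation (10%, then 20%, and so on) follows the worked example, and the cap at 100% is an assumption:

```python
def adjusted_confidence(correlation, wrong_predictions, x_pct=10):
    """Derive a confidence score from a correlation coefficient, reduced by
    an accumulating correction factor of X% per past incorrect prediction
    in similar circumstances. With correlation 0.8 and one mispredict at
    X = 10, this gives 0.72, as in the worked example above."""
    correction = min(wrong_predictions * x_pct, 100)
    return correlation * (1 - correction / 100)
```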
Likewise, if, in step 410, it transpires that the predicted fault turns into an actual fault with great reliability, in similar circumstances (measurement report trends etc.), then its associated confidence score may be increased by Y% (where Y may equal X or may be any other number >0 and <100). In this way, the algorithm can gradually promote what may initially be unlikely diagnoses or those based upon relatively weakly-correlated present and historical trend data, in the light of repeated experience and demote what may initially appear to be likely predictions.
If, in step 440, no similar, previously wrongly-predicted, faults are found in database 420, then a correlation value is calculated/assigned, in step 460, using the method discussed previously in relation to Figure 2.
Finally, in step 470, the predicted fault and its associated confidence score is reported to the network operator, as previously.
The method then ends at step 480.
The goal of the algorithm is to improve the statistics of the outcomes, i.e. the percentage of times that the correct fault is diagnosed. Any algorithm capable of achieving this aim, whilst incorporating objective, subjective and historical data trends, may be used.
Note that it is not essential for each network to 'learn' from scratch, nor for it to start with an empty 'historical incidents' database. A database may be copied from a similar network with similar geographic features, for example, for a desert-dominated nation, a database could be used from a similarly desert-dominated nation. Likewise, the 'learning' from the earlier deployment could be built-in to the later deployment's implementation of the method, to give it a strong basis from which to learn further.
Although the invention has been described above with reference to one or more preferred embodiments, it will be appreciated that various changes or modifications may be made without departing from the scope of the invention as defined in the appended claims. In this connection, although the exemplary description has focussed quite strongly on fault analysis within a communications network, it should go without saying that the fault analysis techniques of the invention can be used in other kinds of cellular communications network or, indeed, in networks for the supply of other kinds of utility (gas supply, electricity supply, water supply, etc.).

Claims (14)

CLAIMS

1. A method of fault monitoring in a communications network, the method comprising: receiving network performance data from a number of user devices connected to the communications network; identifying a trend in the network performance data; and obtaining information about a fault in the network based on the identified trend and stored historical data.
2. The method of claim 1, wherein the network performance data comprises objective data and subjective data, the objective data and subjective data indicative of network performance.
3. The method of claim 1 or 2, wherein the obtaining information about a fault in the network based on the identified trend and stored historical data comprises: comparing the identified trend to historical trends in the historical data, each historical trend being indicative of an historical fault in the historical data; determining that the identified trend matches a first historical trend, the first historical trend being indicative of a first historical fault; extracting historical information regarding the first historical fault in the historical data; and obtaining information regarding a fault in the network based on the extracted information regarding the first historical fault.
4. The method of claim 3, wherein determining that the identified trend matches the first historical trend comprises determining a correlation factor between the identified trend and the first historical trend, and determining that the correlation factor exceeds a predetermined threshold.
5. The method of claim 4, further comprising increasing the predetermined threshold upon determining that the real outcome indicates that the identified fault was predicted incorrectly.
6. The method of any preceding claim, wherein the obtained information includes one or more of the following about the fault in the network: the type/nature of the fault; the cause of the fault; measures that can be taken to minimise or eliminate the impact of the fault on the users of the network; and the time and manner in which the fault is expected to develop.
7. The method of any preceding claim, further comprising the step of reporting the obtained information to a network operator.
8. The method of any preceding claim, wherein the network performance data comprises one or more of: received signal strength (RSSI); signal-to-interference, noise and distortion ratio (SINAD); energy per bit to noise power spectral density ratio (Eb/N0); bit-error rate; data-rate; data throughput; latency; number of status-check and/or other types of user reports; rate of increase of status-check and/or other types of user reports; geographic distribution of status-check and/or other types of user reports.
9. The method of any preceding claim, wherein the network coverage is broken into a plurality of regions.
10. The method of claim 9, wherein identifying a trend in the network performance data comprises using data from only a first region of the plurality of regions to identify the trend.
11. The method of claim 9, wherein identifying a trend in the network performance data comprises using data from more than one region of the plurality of regions to identify the trend.
12. The method of any preceding claim, wherein the received network performance data is stored in a user database.
13. A computer program, which when executed by processing hardware, performs the method of any of claims 1-12.
14. A system with data processing resources and memory, configured to perform the method of any of claims 1-12.
GB2011876.6A 2020-07-30 2020-07-30 Fault monitoring in a communications network Pending GB2597920A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB2011876.6A GB2597920A (en) 2020-07-30 2020-07-30 Fault monitoring in a communications network


Publications (3)

Publication Number Publication Date
GB202011876D0 GB202011876D0 (en) 2020-09-16
GB2597920A true GB2597920A (en) 2022-02-16
GB2597920A8 GB2597920A8 (en) 2022-11-16

Family

ID=72425186

Family Applications (1)

Application Number Title Priority Date Filing Date
GB2011876.6A Pending GB2597920A (en) 2020-07-30 2020-07-30 Fault monitoring in a communications network

Country Status (1)

Country Link
GB (1) GB2597920A (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113726582B (en) * 2021-09-17 2022-09-13 中国联合网络通信集团有限公司 Analysis method, device, equipment and storage medium of mobile network
CN113765723B (en) * 2021-09-23 2024-05-07 深圳市天威网络工程有限公司 Health diagnosis method and system based on Cable Modem terminal equipment
CN115242669B (en) * 2022-06-30 2023-10-03 北京华顺信安科技有限公司 Network quality monitoring method
CN117522382A (en) * 2023-11-28 2024-02-06 江苏圣创半导体科技有限公司 Automatic vending machine fault early warning method and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150138989A1 (en) * 2013-11-20 2015-05-21 Cellco Partnership D/B/A Verizon Wireless Performance pattern correlation
EP3110198A2 (en) * 2015-06-22 2016-12-28 Accenture Global Services Limited Wi-fi access points performance management
US20170070397A1 (en) * 2015-09-09 2017-03-09 Ca, Inc. Proactive infrastructure fault, root cause, and impact management
GB2542610A (en) * 2015-09-25 2017-03-29 British Telecomm Fault diagnosis
US20190379589A1 (en) * 2018-06-12 2019-12-12 Ciena Corporation Pattern detection in time-series data
GB2577758A (en) * 2018-12-20 2020-04-08 Spatialbuzz Ltd Identifying anomalies in communications networks


Also Published As

Publication number Publication date
GB202011876D0 (en) 2020-09-16
GB2597920A8 (en) 2022-11-16

Similar Documents

Publication Publication Date Title
GB2597920A (en) Fault monitoring in a communications network
JP4945630B2 (en) Analysis of arbitrary wireless network data using matched filter
JP6411435B2 (en) Method and system for maintaining or optimizing a cellular network
CA2628953C (en) Base station system performance measurement system in a gsm radio communication network
US11683703B2 (en) Network monitoring system and method
US11509518B2 (en) Network fault diagnosis
US10574550B2 (en) Methods and apparatus for scoring the condition of nodes in a communication network and taking action based on node health scores
TWI293009B (en) Apparatus and methods for determining network access performance of a wireless device
US7865194B2 (en) Systems and methods for characterizing the performance of a wireless network
US20100284293A1 (en) Communication network quality analysis system, quality analysis device, quality analysis method, and program
US20170230850A1 (en) Fault monitoring by assessing spatial distribution of queries in a utility supply network
CN102668622B (en) Network bottleneck management
US11082323B2 (en) Fault monitoring in a utility supply network
Nguyen et al. Absence: Usage-based failure detection in mobile networks
US11900273B2 (en) Determining dependent causes of a computer system event
GB2593529A (en) Network monitoring system and method
US20210409980A1 (en) Prioritizing incidents in a utility supply network
KR100452621B1 (en) Method for Call Fail Cause Analysis in Mobile Communication System
US20210306891A1 (en) Network monitoring system
US11657106B2 (en) Fault monitoring in a utility supply network
US11025502B2 (en) Systems and methods for using machine learning techniques to remediate network conditions
US11026108B2 (en) Fault monitoring in a utility supply network
KR20020074054A (en) Apparatus for Analyzing for Optimizing Network Performance
CN116643954A (en) Model monitoring method, monitoring terminal, device and storage medium
Yan Management of Internet-based service quality