WO2022063032A1 - Distributed system-oriented fault information association reporting method and related device - Google Patents

Distributed system-oriented fault information association reporting method and related device

Info

Publication number
WO2022063032A1
Authority
WO
WIPO (PCT)
Prior art keywords
failure event
association failure
association
cache
notification
Prior art date
Application number
PCT/CN2021/118807
Other languages
English (en)
French (fr)
Inventor
余亮
张亮
鲁志军
李煜
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2022063032A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/0677 Localisation of faults
    • H04L41/069 Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications

Definitions

  • the present application relates to the field of computer research, and in particular, to a distributed system-oriented fault information association reporting method and related equipment.
  • the distributed system has high reliability, good scalability, fast communication, and can more easily realize resource sharing among users.
  • the calls between modules in the device are complex, which makes it difficult to locate the fault when a fault occurs.
  • a trace ID is first used to track the calls between devices; after a fault occurs, a global search is performed using the trace ID of the abnormal service, and analysis and fault location are then performed.
  • the faulty device has single-end perception: when a fault occurs, only the faulty device reports fault information, and the other devices involved in processing the abnormal service may not report fault-related information. The fault-related information that the server obtains is therefore very limited, which is not conducive to subsequent fault location analysis.
  • the fault location method will collect a large amount of normal business process data.
  • the collected normal business process data has a high probability of being irrelevant to the fault, which will bring unnecessary data storage costs and analysis costs.
  • as the business volume increases, the amount of collected data grows, duplicate trace IDs become likely, and analysis becomes more difficult.
  • the present application provides a distributed system-oriented fault information association reporting method and related equipment. Fault information is reported only when a fault occurs, so normal process data need not be collected, which saves data storage and analysis costs. Besides reporting its own fault information, the faulty device also notifies the associated device (peer device) to report information, so that the server has more effective information available when it performs fault analysis.
  • the present application provides a distributed system-oriented fault information association reporting method.
  • the method includes: a first device caches a first invocation relationship, where the first invocation relationship includes device invocation information of one or more user-initiated distributed services that the first device participates in processing; when the first device fails to process a first distributed service, it reports first fault information to the server; the first device searches for a second device from the first invocation relationship, where the second device includes a device that invokes the first device or is invoked by the first device when executing the first distributed service; the first device sends a first notification to the second device, where the first notification is used to instruct the second device to report second fault information to the server; or, in the case that the second device has not reported the second fault information, the first device sends the first notification to the second device; the second fault information includes fault information when the second device processes the first distributed service.
  • the call relationship caching technology is used, and information is not collected for irrelevant device information and device call information of fault-free distributed services, which saves data storage and analysis costs.
  • the faulty device not only reports the fault information itself, but also sends a message to the devices that have a calling relationship with it, notifying those devices to report their respective fault-related information to the server. With this reporting method, the server has more fault-related information available when locating the fault, which improves the efficiency and accuracy of fault location.
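The reporting flow described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `Device` class, its method names, the in-memory `server_reports` list, and the `network` dictionary are all hypothetical stand-ins for real devices, a real server, and a real transport.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """Hypothetical device that caches call relationships per service identifier."""
    name: str
    # service_id -> names of peers that called this device or were called by it
    call_cache: dict = field(default_factory=dict)

    def record_call(self, service_id: str, peer: str) -> None:
        # Cache the call relationship while the service is being processed.
        self.call_cache.setdefault(service_id, set()).add(peer)

    def on_failure(self, service_id: str, detail: str, server: list, network: dict) -> None:
        # Step 1: report this device's own fault information to the server.
        server.append((self.name, service_id, detail))
        # Step 2: look up the peers of the failed service and notify each one.
        for peer in sorted(self.call_cache.get(service_id, ())):
            network[peer].on_notification(service_id, server)

    def on_notification(self, service_id: str, server: list) -> None:
        # A notified peer reports its own information about the same service.
        server.append((self.name, service_id, "peer context"))


server_reports = []
phone, tv = Device("phone"), Device("tv")
network = {"phone": phone, "tv": tv}
phone.record_call("svc-1", "tv")
tv.record_call("svc-1", "phone")
phone.on_failure("svc-1", "timeout", server_reports, network)
```

After the call, `server_reports` holds one entry from the faulty device and one from its peer, and nothing is collected for services that did not fail.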
  • the first device searching for the second device from the first invocation relationship includes: the first device searches for the second device in first device invocation information by using a first identifier; the first device invocation information is the device invocation information of the first distributed service in the first invocation relationship.
  • the invocation information of each distributed service has an identifier, and each identifier is cached in the device together with the device invocation information of the corresponding distributed service. When the device fails, it can therefore use the identifier in the fault information to find the cached device invocation information of the distributed service, and from it the peer device, that is, the device invoked by the faulty device or the device that invokes the faulty device in the distributed service. In this way, the faulty device finds the peer device more quickly during association reporting, which saves time.
  • the caching of the first invocation relationship includes: dividing the first invocation relationship into at least two states according to the service life cycle, where the first invocation relationships in different states correspond to services with different life cycles; and caching the first invocation relationships in different states separately.
  • the cache space of the device is limited. When the space is not enough to cache the invocation relationship of the next distributed service, the earliest cached invocation relationship is cleared. The invocation relationships of different distributed services are therefore divided into different states according to the life cycle of the service and then cached. This prevents services with a short life cycle from occupying most of the cache space: services with long and short life cycles each have their own cache space and do not affect each other.
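A minimal sketch of such a partitioned cache is shown below, assuming one first-in-first-out partition per life-cycle state. The class name, the state names `"short"` and `"long"`, and the per-state capacities are illustrative assumptions; the text does not specify them.

```python
from collections import OrderedDict


class PartitionedCallCache:
    """Call-relationship cache with one FIFO partition per life-cycle state."""

    def __init__(self, capacity_per_state: dict):
        self.capacity = dict(capacity_per_state)
        self.partitions = {state: OrderedDict() for state in capacity_per_state}

    def put(self, state: str, service_id: str, peers: list) -> None:
        partition = self.partitions[state]
        if service_id in partition:
            partition[service_id] = peers  # update in place, keep its age
            return
        if len(partition) >= self.capacity[state]:
            partition.popitem(last=False)  # evict the earliest cached entry
        partition[service_id] = peers

    def get(self, service_id: str):
        for partition in self.partitions.values():
            if service_id in partition:
                return partition[service_id]
        return None


cache = PartitionedCallCache({"short": 2, "long": 1})
cache.put("long", "video-cast", ["tv"])
for i in range(5):
    # A burst of short-lived services only evicts other short-lived entries.
    cache.put("short", f"tap-{i}", ["watch"])
```

Because eviction happens only inside the partition being written, the burst of short-lived services leaves the long-lived `video-cast` entry untouched.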
  • the method further includes: if the first device fails to send the first notification to the second device, caching a first association failure event; the first association failure event is used to represent the failure to send the first notification.
  • when notifying the peer device fails, the notification failure information is not cleared directly. Instead, an association failure event is cached, which includes the information needed to notify the peer device, so that the faulty device can later notify the peer device again to report the fault information. Especially when the service failure is caused by external factors such as the network, this information collection mechanism can greatly improve the success rate of notifying the peer device.
  • the caching of the first association failure event includes: determining, by the first device, whether there is enough cache space for caching the first association failure event; when there is enough cache space, caching the first association failure event; when there is not enough cache space, clearing a second association failure event, where the second association failure event is the association failure event with the longest cache time in the first device; if there is enough cache space after the second association failure event is cleared, caching the first association failure event; if there is still not enough cache space after the second association failure event is cleared, clearing a third association failure event, where the third association failure event is the association failure event with the longest cache time in the first device after the second association failure event is cleared.
  • a quota and an aging mechanism are provided for the cached notification failure information, that is, the association failure events that the faulty device can cache are limited.
  • when there is not enough space to continue caching, the device clears the earliest cached association failure events. This mechanism prevents the storage space of the faulty device from being occupied by too much unimportant information and saves its storage space.
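The quota and aging behavior can be sketched as a size-bounded queue that evicts the oldest events until the new one fits. The class name, the byte-based quota, and the event fields are assumptions made for illustration only.

```python
from collections import deque


class AssociationFailureCache:
    """Size-bounded cache of association failure events, oldest evicted first."""

    def __init__(self, quota_bytes: int):
        self.quota = quota_bytes
        self.used = 0
        self.events = deque()  # (event, size) pairs, oldest on the left

    def add(self, event: dict, size: int) -> bool:
        if size > self.quota:
            return False  # can never fit, even in an empty cache
        # Clear the longest-cached events one by one until there is room.
        while self.used + size > self.quota:
            _, old_size = self.events.popleft()
            self.used -= old_size
        self.events.append((event, size))
        self.used += size
        return True


failures = AssociationFailureCache(quota_bytes=100)
failures.add({"service": "svc-1", "peer": "tv"}, 60)
failures.add({"service": "svc-2", "peer": "watch"}, 30)
failures.add({"service": "svc-3", "peer": "car"}, 50)  # evicts svc-1 to make room
```

The loop mirrors the described steps: clear the second (oldest) event, re-check the space, and clear the next oldest only if the new event still does not fit.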
  • the method further includes: when the first device comes back online or processes distributed services again, checking whether the first association failure event exists; when the first association failure event exists, sending the first notification to the second device; when the first notification is successfully sent, clearing the cached first association failure event.
  • this method provides a reliable information collection mechanism, reduces the loss of positioning information for unreliable communication links, and greatly improves the success rate of association.
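As a rough sketch of the retry step, the function below re-sends every cached association failure event and keeps only those that still fail. The function name, the event fields, and the `send` callback with its boolean result are hypothetical; a real device would call its own transport here.

```python
def resend_pending(pending: list, send) -> list:
    """Retry each cached association failure event; return those still failing."""
    still_failed = []
    for event in pending:
        if send(event["peer"], event["service"]):
            continue  # delivered: the cached event is cleared
        still_failed.append(event)  # keep it for the next retry
    return still_failed


reachable = {"tv"}   # peers that can currently be reached (assumed for the demo)
delivered = []


def send(peer: str, service_id: str) -> bool:
    # Stand-in transport: succeeds only for reachable peers.
    if peer in reachable:
        delivered.append((peer, service_id))
        return True
    return False


pending = [{"peer": "tv", "service": "svc-1"}, {"peer": "watch", "service": "svc-2"}]
# Called when the device comes back online or processes a distributed service again.
pending = resend_pending(pending, send)
```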
  • a distributed system-oriented fault information association reporting method includes: a second device caches a second invocation relationship, where the second invocation relationship includes device invocation information of one or more user-initiated distributed services that the second device participates in processing; the second device receives a first notification sent by the first device, where the first notification is used to instruct the second device to report second fault information to the server; the second device reports the second fault information to the server; or, in the case that the second device has not yet reported the second fault information, the second device reports the second fault information to the server; the second fault information includes fault information when the second device processes the first distributed service.
  • after receiving the notification from the faulty device, the peer device of the faulty device reports its own fault-related information to the server. It can report directly after receiving the notification, or it can first check whether the fault-related information has already been reported.
  • This method of associated reporting enables all devices related to the abnormal distributed service to report the fault-related information, which improves the accuracy of the fault information.
  • the correlation reporting method only collects fault information after a fault affecting user functions occurs, and does not collect information on normal business processes, which reduces the load of the distributed system and the amount of analysis data.
  • the method further includes: the second device searches for a third device from the second invocation relationship, where the third device includes a device that invokes the second device or is invoked by the second device when executing the first distributed service; the second device sends a second notification to the third device, where the second notification is used to instruct the third device to report third fault information to the server; or, in the case that the third device has not reported the third fault information, the second device sends the second notification to the third device; the third fault information includes fault information when the third device processes the first distributed service.
  • the faulty device not only reports the fault information by itself, but also notifies the peer device to report the fault-related information.
  • This method enables the server to obtain relatively complete fault-related information, which is convenient for subsequent fault location analysis.
  • the second device searching for the third device from the second invocation relationship includes: the second device searches for the third device in second device invocation information by using the first identifier; the second device invocation information is the device invocation information of the first distributed service in the second invocation relationship.
  • the message sent by the faulty device to its peer device contains the identifier corresponding to the abnormal distributed service. The peer device can use the identifier to find its own peer device from the invocation relationship and then notify that device to report fault-related information, which saves the time of searching for the peer device.
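The hop-by-hop propagation can be sketched as below. The `Peer` class, the `reported` set used to skip devices that have already reported, and the three-device chain are illustrative assumptions. For brevity, the faulty device's own failure is modeled as the first call to `on_notification`.

```python
class Peer:
    """Hypothetical device that reports once per service and forwards the notification."""

    def __init__(self, name: str, calls: dict):
        self.name = name
        self.calls = calls       # service_id -> names of callers and callees
        self.reported = set()    # service identifiers already reported

    def on_notification(self, service_id: str, server: list, network: dict) -> None:
        if service_id in self.reported:
            return  # already reported for this service, nothing to do
        self.reported.add(service_id)
        server.append((self.name, service_id))
        # Use the identifier carried in the notification to find this
        # device's own peers and pass the notification on.
        for peer in self.calls.get(service_id, []):
            network[peer].on_notification(service_id, server, network)


server = []
network = {
    "a": Peer("a", {"svc-1": ["b"]}),
    "b": Peer("b", {"svc-1": ["a", "c"]}),
    "c": Peer("c", {"svc-1": ["b"]}),
}
network["a"].on_notification("svc-1", server, network)
```

Checking `reported` before forwarding is what keeps the notification from bouncing between two devices that each list the other as a peer.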
  • the caching of the second invocation relationship includes: dividing the second invocation relationship into at least two states according to the service life cycle, where the second invocation relationships in different states correspond to services with different life cycles; and caching the second invocation relationships in different states separately.
  • the invocation relationships of different distributed services are divided into different states according to the life cycle of the service and then cached. This caching method enables the device invocation relationships of services with different life cycles to be stored separately without affecting each other.
  • the method further includes: if the second device fails to send the second notification to the third device, caching a fourth association failure event; the fourth association failure event is used to represent the failure to send the second notification.
  • the caching of the fourth association failure event includes: determining, by the second device, whether there is enough cache space for caching the fourth association failure event; when there is enough cache space, caching the fourth association failure event; when there is not enough cache space, clearing a fifth association failure event, where the fifth association failure event is the association failure event with the longest cache time in the second device; if there is enough cache space after the fifth association failure event is cleared, caching the fourth association failure event; if there is still not enough cache space after the fifth association failure event is cleared, clearing a sixth association failure event, where the sixth association failure event is the association failure event with the longest cache time in the second device after the fifth association failure event is cleared.
  • the method further includes: when the second device comes back online or processes distributed services again, checking whether the fourth association failure event exists ; when the fourth association failure event exists, send the second notification to the third device; when the second notification is successfully sent, clear the cached fourth association failure event.
  • in a third aspect, a first device is provided. The first device includes: a first cache unit, configured to cache a first invocation relationship, where the first invocation relationship includes device invocation information of one or more user-initiated distributed services that the first device participates in processing; a first processing unit, configured to report first fault information to the server when the first device fails to process a first distributed service, and to search for a second device from the first invocation relationship, where the second device includes a device that invokes the first device or is invoked by the first device when executing the first distributed service; and a first sending unit, configured to send a first notification to the second device, where the first notification is used to instruct the second device to report second fault information to the server; or, in the case that the second device has not reported the second fault information, the first device sends the first notification to the second device; the second fault information includes fault information when the second device processes the first distributed service.
  • the first processing unit when configured to search for the second device from the first calling relationship, it is specifically configured to: through the first identifier, The second device is searched from the first device invocation information; the first device invocation information is the device invocation information of the first distributed service in the first invocation relationship.
  • the first cache unit is specifically configured to: divide the first calling relationship into at least two states according to the service life cycle; The life cycles of the services corresponding to the first invocation relationships are different; the first invocation relationships in different states are cached separately.
  • the first cache unit is further configured to: if the first sending unit fails to send the first notification to the second device, cache a first association failure event; the first association failure event is used to represent the failure to send the first notification.
  • when the first cache unit caches the first association failure event, it is specifically configured to: determine whether there is enough cache space to cache the first association failure event; when there is enough cache space, cache the first association failure event; when there is not enough cache space, clear a second association failure event, where the second association failure event is the association failure event with the longest cache time in the first device; if there is enough cache space after the second association failure event is cleared, cache the first association failure event; if there is still not enough cache space after the second association failure event is cleared, clear a third association failure event, where the third association failure event is the association failure event with the longest cache time in the first device after the second association failure event is cleared.
  • the first processing unit is further configured to check whether the first association failure event exists when the first device goes online again or processes distributed services again; the first sending unit is further configured to send the first notification to the second device when the first association failure event exists; the first cache unit is further configured to clear the cached first association failure event after the first sending unit successfully sends the first notification.
  • in a fourth aspect, a second device is provided, comprising: a second cache unit, configured to cache a second invocation relationship, where the second invocation relationship includes device invocation information of one or more user-initiated distributed services that the second device participates in processing; a first receiving unit, configured to receive a first notification sent by the first device, where the first notification is used to instruct the second device to report second fault information to the server; and a second processing unit, configured to report the second fault information to the server, or to report the second fault information to the server in the case that the second device has not yet reported it; the second fault information includes fault information when the second device processes the first distributed service.
  • the second processing unit is further configured to search for a third device from the second invocation relationship, where the third device includes a device that invokes the second device or is invoked by the second device when executing the first distributed service; the second device further includes a second sending unit, configured to send a second notification to the third device, where the second notification is used to instruct the third device to report third fault information to the server; or, in the case that the third device has not reported the third fault information, the second device sends the second notification to the third device; the third fault information includes fault information when the third device processes the first distributed service.
  • the second processing unit when configured to search for the third device from the second calling relationship, it is specifically configured to: through the first identifier, The third device is searched from the second device invocation information; the second device invocation information is the device invocation information of the first distributed service in the second invocation relationship.
  • the second cache unit when used to cache the second invocation relationship, it is specifically used for: storing the second invocation relationship according to the life cycle of the service It is divided into at least two states; the service life cycles corresponding to the second invocation relationships in different states are different; and the second invocation relationships in different states are cached separately.
  • the second cache unit is further configured to: if the second sending unit fails to send the second notification to the third device, cache a fourth association failure event; the fourth association failure event is used to represent the failure to send the second notification.
  • when the second cache unit is used to cache the fourth association failure event, it is specifically configured to: determine whether there is enough cache space to cache the fourth association failure event; when there is enough cache space, cache the fourth association failure event; when there is not enough cache space, clear a fifth association failure event, where the fifth association failure event is the association failure event with the longest cache time in the second device; if there is enough cache space after the fifth association failure event is cleared, cache the fourth association failure event; if there is still not enough cache space after the fifth association failure event is cleared, clear a sixth association failure event, where the sixth association failure event is the association failure event with the longest cache time in the second device after the fifth association failure event is cleared.
  • the second processing unit is further configured to check whether the fourth association failure event exists when the second device goes online again or processes distributed services again; the second sending unit is further configured to send the second notification to the third device when the fourth association failure event exists; and the cached fourth association failure event is cleared when the second notification is successfully sent.
  • in a fifth aspect, a computing device is provided. The computing device includes a processor and a memory, the memory is used for storing program code, and the processor is used for executing the program code in the memory to perform the distributed system-oriented fault information association reporting method provided by the first aspect or by any one of the implementation manners of the first aspect.
  • in a sixth aspect, a computing device is provided. The computing device includes a processor and a memory, the memory is used for storing program code, and the processor is used for executing the program code in the memory to perform the distributed system-oriented fault information association reporting method provided by the second aspect or by any one of the implementation manners of the second aspect.
  • a computer-readable storage medium stores a computer program; when the computer program is executed by a processor, the functions of the distributed system-oriented fault information association reporting method provided by the first aspect or by any one of the implementation manners of the first aspect can be implemented.
  • a computer-readable storage medium stores a computer program; when the computer program is executed by a processor, the functions of the distributed system-oriented fault information association reporting method provided by the second aspect or by any one of the implementation manners of the second aspect can be implemented.
  • the present application provides a computer program product, the computer program includes instructions, when the computer program is executed by a computer, the computer can execute the above-mentioned first aspect and any implementation manner in combination with the above-mentioned first aspect
  • the present application provides a computer program product, the computer program includes instructions, when the computer program is executed by a computer, so that the computer can execute the above second aspect and any implementation manner in combination with the above second aspect
  • the present application provides a chip system, where the chip system includes a processor for supporting a first device to implement the functions involved in the first aspect above.
  • the chip system further includes a memory for storing necessary program instructions and data of the data sending device.
  • the chip system may be composed of chips, or may include chips and other discrete devices.
  • the present application provides a chip system, where the chip system includes a processor for supporting a second device to implement the functions involved in the second aspect above.
  • the chip system further includes a memory for storing necessary program instructions and data of the data sending device.
  • the chip system may be composed of chips, or may include chips and other discrete devices.
  • FIG. 1 is a schematic diagram of a system architecture of a distributed system-oriented fault information association reporting method provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of a system architecture of another distributed system-oriented fault information association reporting method provided by an embodiment of the present application
  • FIG. 3 is a schematic flowchart of a distributed system-oriented fault information association reporting method provided by an embodiment of the present application
  • FIG. 4 is a schematic diagram of a call relationship cache provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of another call relationship cache provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a remote call provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a caching mechanism for an association failure event provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a cache processing flow of an association failure event provided by an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a first device according to an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a second device according to an embodiment of the application.
  • FIG. 11 is a schematic structural diagram of a computing device according to an embodiment of the application.
  • FIG. 12 is a schematic structural diagram of another computing device provided by an embodiment of the present application.
  • a distributed system is a software system built on a network.
  • a set of independent computers presents a unified whole to the user, as if it were a system.
  • the system has a variety of general physical and logical resources, which can dynamically allocate tasks, and the scattered physical and logical resources realize information exchange through computer networks.
  • a distributed system has only one model or paradigm to the user.
  • Embedding is a term in the field of data collection (especially in the field of user behavior data collection), which refers to related technologies and their implementation processes for capturing, processing, and sending specific user behaviors or events.
  • the technical essence of burying points is to first monitor the events in the running process of the software application, and judge and capture the events that need attention when they occur.
  • Embedding includes code embedding, visual embedding and no embedding. Among them, the code buried point adds some codes to enable the user to trigger the corresponding behavior, and then reports the data; the visual buried point uses a visual interaction method, first configures the relevant events, and then performs data collection; no buried point refers to the development of the integrated acquisition software by developers.
  • the SDK will directly capture and monitor all user behaviors in the application, and report them all, without requiring developers to add additional code. It should be noted that, in this embodiment of the present application, the buried point has nothing to do with user behavior, and is only related to a fault on a service flow.
  • Remote procedure call (RPC) mainly addresses communication transparency between distributed systems. That is, RPC lets users ignore which device hosts the service being called; from the user's point of view, calling the remote service is as safe and reliable as calling a local service.
  • Remote invocation generally involves four processes: client send (Client Send, CS), server receive (Service Receive, SR), server send (Service Send, SS) and client receive (Client Receive, CR).
  • A Universally Unique Identifier (UUID) is also called a Globally Unique Identifier (GUID).
  • The purpose of a UUID is to let every element in a distributed system have unique identification information without a central controller assigning it. Each element can thus create a UUID that does not conflict with those of other elements, so name duplication need not be considered when, for example, a database is created.
  • A UUID is a number generated on one machine that is guaranteed to be unique across all machines in the same space and time.
  • Typically, the platform provides an application programming interface (API) for generating UUIDs. The calculation follows the standard set by the Open Software Foundation (OSF) and uses the Ethernet card address, a nanosecond-level timestamp, a chip ID code, and many possible numbers.
  • A UUID consists of several parts, including the following:
  • The current date and time. The first part of a UUID is related to time: if a second UUID is generated a few seconds after the first, the first part differs and the remaining parts are the same.
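  • As a minimal illustration of the time-related part of a UUID, the following Python sketch uses the standard-library function `uuid.uuid1()`. The fixed `NODE` and `CLOCK_SEQ` values are placeholders chosen for this illustration and do not come from this application.

```python
import uuid

# Placeholder node (48-bit hardware address) and 14-bit clock sequence,
# fixed here only so that the non-time fields are reproducible.
NODE = 0x001A2B3C4D5E
CLOCK_SEQ = 0x0ABC

first = uuid.uuid1(node=NODE, clock_seq=CLOCK_SEQ)
second = uuid.uuid1(node=NODE, clock_seq=CLOCK_SEQ)

# The timestamp field differs between the two UUIDs.
print(first.time, second.time)
# The node and clock-sequence fields are identical.
print(first.node == second.node, first.clock_seq == second.clock_seq)
```

  • Only the timestamp field changes between the two values, which matches the description that the time-related part differs while the remaining parts stay the same.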
  • An Event ID is a number that represents error information, that is, the reason that prevents a service from continuing. Different Event IDs represent different error information. From the Event ID, one can roughly determine the faulty module, the impact of the fault, and even the root cause, and then resolve the problem.
  • A fault log, that is, an error log, is a text file in which software records runtime error information. Programmers and maintenance personnel can use the error log to debug and maintain the system.
  • An Open API, also known as an open platform, is a common application of service-oriented websites.
  • The service provider of a website encapsulates its website services into a series of application programming interfaces (APIs) and opens them to third-party developers. This practice is called opening the website's API, and the APIs opened in this way are called Open APIs.
  • FIG. 1 shows an architecture of a fault information reporting system provided by an embodiment of the present application, including at least one server and at least one device.
  • In FIG. 1, multiple servers and multiple devices are used as an example for illustration.
  • Devices A, B, C, D, and E are the devices involved in a distributed service process initiated by the user.
  • While devices A, B, C, D, and E process the distributed service, one of the devices fails, causing the distributed service to become abnormal.
  • The faulty device reports its failure information to the server.
  • Distributed services include but are not limited to wireless terminal requests, web page requests, and Open API requests.
  • Taking a web page request as an example, when a user clicks on a web page, the web page may call other devices (such as sub-servers) and may exchange messages with other applications. The service may also involve querying and updating a distributed database, reading and writing a distributed cache, storing distributed objects, and so on.
  • Distributed services also include operations in scenarios where smart terminals collaborate with peripheral devices, such as collaborative file transfer. This means that the present application can be applied to near-field distributed collaboration scenarios such as large-screen projection, PC collaboration, PAD collaboration, and smart wearables.
  • FIG. 2 shows another system architecture for fault information reporting provided by an embodiment of the present application.
  • Devices A, C, D, F, and H are the devices involved in a distributed service initiated by the user. The distributed service contains six pairs of RPC calls: between device A and device C, between device A and device D, between device C and device D, between device C and device F, between device C and device H, and between device D and device H.
  • While devices A, C, D, F, and H process the distributed service, one of the devices fails, causing the distributed service to become abnormal. The failed device reports its failure information to the server.
  • The devices in FIG. 1 and FIG. 2 include but are not limited to servers, routers, switches, gateways, and terminal devices such as mobile phones, computers, and tablets.
  • The servers in FIG. 1 and FIG. 2 may be ordinary servers (also called physical servers), which are real physical devices that run in an equipment room and have their own hard disks, bandwidth, and so on. They may also be servers obtained by virtualizing ordinary servers to varying degrees; part of such a server is virtualized, and only part of it may be a real physical device. They may also be cloud servers, that is, servers owned by a cloud service provider.
  • A cloud server is a highly distributed and highly virtualized server whose computing resources come from a large number of physical servers that have been integrated and virtualized.
  • The scale of such virtualization may be several, dozens, or hundreds of physical servers, or a large cloud virtual resource pool built across thousands of pieces of physical hardware in a data center. Cloud servers therefore support elastic adjustment of resources such as CPU, memory, disk, and bandwidth, which can be freely increased or decreased, and they also offer good scalability and high reliability.
  • FIG. 1 and FIG. 2 are only two exemplary implementations of the embodiments of the present application, and the system architectures of fault information reporting in the embodiments of the present application include but are not limited to the above structures.
  • The present application provides a distributed system-oriented method for associated reporting of fault information. With this method, when a fault occurs, the faulty device not only reports its own fault-related information but also notifies its peer devices to report fault information, so that the server has more fault-related information available when locating the fault.
  • This fault information reporting method improves the efficiency of fault location.
  • the method may include the following steps:
  • S310 The first device caches the first calling relationship.
  • Each distributed service process that is initiated may involve at least one device, and there are calling relationships between the devices.
  • Each device involved in processing distributed services can cache its own invocation relationship.
  • The first device and the second device in the embodiments of the present application are devices involved in processing one or more distributed services initiated by a user.
  • the first device caches the first calling relationship
  • the second device caches the second calling relationship. Table 2 below:
  • The first device may be any device in the fault information reporting architectures of the distributed system shown in FIG. 1 and FIG. 2. For example, the first device may be device A in FIG. 1, in which case the second device is device B in FIG. 1; or the first device may be device C in FIG. 2, in which case the second devices are device A, device D, device F, and device H in FIG. 2, that is, the devices that have a calling relationship with device C.
  • The first invocation relationship includes device invocation information of the distributed services that the first device participates in processing. The device invocation information is the invocation information related to the first device in those distributed services, that is, information about the first device calling other devices and information about the first device being called by other devices.
  • The device invocation information of the one or more distributed services that the first device participates in processing may include the peer type of the first device, the service type of the distributed service, the communication information related to the first device in the distributed service, and a timestamp.
  • The peer type includes the UUID of the peer device and the type of the peer device.
  • The peer device is the device that the first device invokes, or the device that invokes the first device, in the distributed service.
  • Peer device types include but are not limited to terminals, servers, routers, and modules in software and/or hardware systems.
  • Service types include but are not limited to: basic telecommunication services of the first category, such as fixed communication services, cellular mobile communication services, first-category satellite communication services, and first-category data communication services; basic telecommunication services of the second category, such as trunking communication services, wireless paging services, second-category satellite communication services, second-category data communication services, network access services, domestic communication facilities services, and network hosting services; and value-added telecommunication services, such as value-added telecommunication services of the first type and of the second type.
  • The communication information includes but is not limited to the communication information between the first device, as the caller, and the callee device, and the communication information between the first device, as the callee, and the caller device.
  • The timestamp includes but is not limited to the calling time when the first device is the caller device and the called time when the first device is the callee device.
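  • The fields listed above can be pictured as a small record type. The following Python sketch is only one possible layout: the class names, field names, and sample values are hypothetical and are not defined by this application.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PeerInfo:
    uuid: str         # UUID of the peer device
    device_type: str  # e.g. "terminal", "server", "router"


@dataclass
class DeviceInvocationInfo:
    peer: PeerInfo
    service_type: str        # type of the distributed service
    communication_info: str  # e.g. an address or channel description
    role: str                # "caller" or "callee", from the first device's view
    timestamp: float         # calling time (caller) or called time (callee)


@dataclass
class InvocationRelationship:
    entries: List[DeviceInvocationInfo] = field(default_factory=list)

    def peer_uuids(self) -> List[str]:
        """Return the UUIDs of all peers recorded in this relationship."""
        return [entry.peer.uuid for entry in self.entries]


relationship = InvocationRelationship()
relationship.entries.append(
    DeviceInvocationInfo(PeerInfo("uuid-B", "server"), "data communication",
                         "10.0.0.2:9000", "caller", 1632300000.0)
)
```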
  • The first invocation relationship can be divided into at least two states according to the life cycle of the service, and the first invocation relationships in different states are cached separately, where different states correspond to different service life cycles.
  • For example, the first invocation relationship can be divided into a hot state, a warm state, and a cold state according to the life cycle of the service, and the first invocation relationships marked as hot, warm, and cold are cached separately.
  • When the service life cycle is less than or equal to the first threshold, the first calling relationship is marked as hot; when the service life cycle is greater than the first threshold and less than or equal to the second threshold, the first calling relationship is marked as warm; when the service life cycle is greater than the second threshold, the first calling relationship is marked as cold.
  • The first invocation relationships in these three states are cached in three separate areas and do not affect one another.
  • The first threshold and the second threshold are set by research and development personnel according to the actual situation, and are not limited in this application.
  • For example, the first threshold is set to 3 minutes and the second threshold is set to 4 hours. If a screen-casting operation that casts a mobile phone screen to a PC screen is maintained for 3 hours, the calling relationship between the mobile phone and the PC is marked as warm; if a Bluetooth device stays paired for 5 hours before being unbound, the calling relationship between the Bluetooth devices is marked as cold.
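  • The three-state classification can be sketched as follows, using the example thresholds of 3 minutes and 4 hours. The sketch assumes the reading in which a life cycle of at most the first threshold is hot, one above the first threshold and at most the second is warm, and anything longer is cold. The names `classify` and `cache_relationship` are hypothetical.

```python
from datetime import timedelta

FIRST_THRESHOLD = timedelta(minutes=3)   # example value from the text
SECOND_THRESHOLD = timedelta(hours=4)    # example value from the text

# One separate cache area per state, so the states do not affect each other.
caches = {"hot": {}, "warm": {}, "cold": {}}


def classify(lifetime: timedelta) -> str:
    """Map a service life cycle to a state (assumed boundary handling)."""
    if lifetime <= FIRST_THRESHOLD:
        return "hot"
    if lifetime <= SECOND_THRESHOLD:
        return "warm"
    return "cold"


def cache_relationship(key: str, relationship: dict, lifetime: timedelta) -> str:
    """Store the relationship in the cache area that matches its state."""
    state = classify(lifetime)
    caches[state][key] = relationship
    return state


screen_cast = cache_relationship("phone-pc", {"peer": "PC"}, timedelta(hours=3))
bluetooth = cache_relationship("bt-pair", {"peer": "headset"}, timedelta(hours=5))
print(screen_cast, bluetooth)
```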
  • Alternatively, the first invocation relationship can be divided into a first state, a second state, a third state, and a fourth state according to the life cycle of the service, and the first calling relationships marked as the first state, the second state, the third state, and the fourth state are cached separately.
  • When the service life cycle is less than or equal to the third threshold, the first calling relationship is marked as the first state; when the service life cycle is greater than or equal to the fourth threshold and less than or equal to the fifth threshold, the first calling relationship is marked as the second state; when the service life cycle is greater than or equal to the sixth threshold and less than or equal to the seventh threshold, the first calling relationship is marked as the third state; when the service life cycle is greater than or equal to the eighth threshold, the first calling relationship is marked as the fourth state.
  • The first invocation relationships in these four states are cached in four separate areas and do not affect one another.
  • The fourth threshold is greater than the third threshold, the fifth threshold is greater than or equal to the fourth threshold, the sixth threshold is greater than the fifth threshold, the seventh threshold is greater than or equal to the sixth threshold, and the eighth threshold is greater than the seventh threshold.
  • The third, fourth, fifth, sixth, seventh, and eighth thresholds can be set by research and development personnel according to the actual situation, and their specific values are not limited in this application.
  • For example, both the fourth threshold and the fifth threshold are set to 2 hours; then, if and only if the service life cycle is exactly 2 hours, the device invocation relationship involved in the service is marked as the second state.
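  • The four-state variant can be sketched as a set of closed intervals. In the sketch below, only the 2-hour value for the fourth and fifth thresholds comes from the example; all other threshold values are arbitrary placeholders that merely satisfy the stated ordering, and the rule that a life cycle of at most the third threshold maps to the first state is an assumption. Life cycles that fall between intervals are returned as `None`, since the text does not say how they are handled.

```python
from datetime import timedelta
from typing import Optional

H = timedelta(hours=1)
THIRD = 1 * H      # placeholder
FOURTH = 2 * H     # example value from the text
FIFTH = 2 * H      # example value from the text
SIXTH = 3 * H      # placeholder
SEVENTH = 5 * H    # placeholder
EIGHTH = 8 * H     # placeholder


def classify_four(lifetime: timedelta) -> Optional[str]:
    """Return the state for a service life cycle, or None if it falls in a gap."""
    if lifetime <= THIRD:               # assumed rule for the first state
        return "first"
    if FOURTH <= lifetime <= FIFTH:
        return "second"
    if SIXTH <= lifetime <= SEVENTH:
        return "third"
    if lifetime >= EIGHTH:
        return "fourth"
    return None


print(classify_four(2 * H), classify_four(timedelta(hours=2, minutes=1)))
```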
  • When a failure occurs while the first device is processing the first distributed service, the first device generates first failure information and reports the first failure information to the server.
  • The first failure information includes a first failure event and a first fault log.
  • The first failure event includes the device type of the failed device (that is, the first device), the Event ID, the failure time, the faulty module, and the exception type. The device type includes but is not limited to terminals, servers, routers, and modules in software and/or hardware systems.
  • The faulty module is the module in the first device that has failed, which can be a hardware module or a software module.
  • The exception type is the type of the failure of the first device, which can be no response, freeze, negotiation failure, timeout, and so on.
  • The first fault log is the log related to the fault of the first device.
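  • As a rough picture of what the first device assembles before reporting, the following Python sketch models the failure event and the fault log as records and serializes them. The class names, the sample values, and the use of JSON as the report format are assumptions made for illustration only.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class FailureEvent:
    device_type: str     # e.g. "terminal", "server", "router"
    event_id: int        # number that identifies the error information
    failure_time: str    # time at which the failure occurred
    failure_module: str  # hardware or software module that failed
    exception_type: str  # e.g. "no response", "freeze", "timeout"


@dataclass
class FailureInformation:
    event: FailureEvent
    log: str             # log text related to the fault


def to_report(info: FailureInformation) -> str:
    """Serialize the failure information; JSON is an assumed format."""
    return json.dumps(asdict(info))


info = FailureInformation(
    FailureEvent("terminal", 1001, "2021-09-16T10:00:00Z", "bluetooth", "timeout"),
    "connection to peer timed out",
)
payload = to_report(info)
```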
  • S330 The first device searches for the second device from the first calling relationship.
  • The first device caches a first call relationship that includes the device call information of the distributed services the first device participates in processing, so the second device can be found through the first call relationship. The second device includes a device that invokes the first device, or is invoked by the first device, when the first distributed service is executed.
  • Finding the peer device can be achieved by assigning an identifier to each distributed service.
  • The identifier can be cached in the first calling relationship in the first device.
  • FIG. 4 is a schematic diagram of one way of caching the device calling relationships of distributed services.
  • As shown in FIG. 4, the identifier can be used as an index key for caching the device call information of distributed services. That is, the identifier distinguishes different distributed services, and the device call information of different distributed services is cached separately. To find the device invocation information of a particular distributed service, it is enough to look up the identifier of that service.
  • FIG. 5 is a schematic diagram of another way of caching the device invocation relationships of distributed services. As shown in FIG. 5, the identifier of a distributed service can also be stored together with its device invocation information, and the identifiers of different distributed services and their device invocation information are placed in different cache spaces. This means that, to find the device invocation information of a distributed service, the cache spaces need to be checked one by one until the cache space holding the identifier of that service is found.
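  • The two cache layouts can be contrasted in a short Python sketch. The identifiers, UUIDs, and function names below are invented for illustration, and the sketch only approximates the layouts described for FIG. 4 and FIG. 5.

```python
from typing import Dict, List, Optional

# FIG. 4 style: the identifier is the key of an index.
indexed_cache: Dict[str, List[dict]] = {
    "trace-001": [{"peer_uuid": "uuid-B", "role": "caller"}],
    "trace-002": [{"peer_uuid": "uuid-C", "role": "callee"}],
}


def lookup_indexed(identifier: str) -> Optional[List[dict]]:
    """Direct retrieval by identifier."""
    return indexed_cache.get(identifier)


# FIG. 5 style: each cache space stores an identifier with its call information.
cache_spaces: List[dict] = [
    {"identifier": "trace-001", "calls": [{"peer_uuid": "uuid-B", "role": "caller"}]},
    {"identifier": "trace-002", "calls": [{"peer_uuid": "uuid-C", "role": "callee"}]},
]


def lookup_scanned(identifier: str) -> Optional[List[dict]]:
    """Check the cache spaces one by one until the identifier is found."""
    for space in cache_spaces:
        if space["identifier"] == identifier:
            return space["calls"]
    return None


peers = [call["peer_uuid"] for call in lookup_indexed("trace-002") or []]
```

  • Both lookups return the same call information; the indexed layout retrieves it in one step, while the second layout scans the cache spaces in order.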
  • The first calling relationship therefore includes the identifiers of one or more distributed services initiated by the user, as well as the UUIDs of the peer devices.
  • When the first device fails while processing the first distributed service, first failure information is generated.
  • The first failure information includes not only the first failure event and the first fault log described above, but also the first identifier corresponding to the first distributed service.
  • The first device can therefore find the first device invocation information in the first invocation relationship by looking up the first identifier, where the first device invocation information is the device invocation information of the first distributed service in the first invocation relationship. From it, the first device finds the UUID and communication information of its peer device, that is, the UUID of the second device and the communication information between the first device and the second device.
  • The identifier, as well as the calling relationships between devices, can be obtained by instrumenting the middleware of the distributed system (placing buried points in it).
  • With the middleware instrumented, the service flow can be tracked while a distributed service takes place. In the devices participating in different distributed services, the log records related to each service can then be associated with a different identifier. At the same time, device information such as the UUID, branch information, and device type of the devices involved in the distributed service can be obtained, along with the calling relationships between devices.
  • The middleware includes but is not limited to remote procedure call middleware, data access middleware, message middleware, transaction middleware, object middleware, and terminal emulation/screen conversion middleware.
  • For example, Huawei's distributed call chain tracking system (the Hitrace system) can be used to obtain the identifier, the first call relationship, and the second call relationship. A switch configuration is added to the Hitrace system, and the first-level switch is turned on first to start tracking.
  • The Hitrace system then tracks the distributed service and records the work information of processing the distributed service.
  • The Hitrace system records an identifier, that is, the Trace ID, in the logs of the related devices. The Trace ID associates the corresponding log records with the distributed service, so that the records belonging to the distributed service can be found in the logs of the related devices according to the Trace ID. The related devices are all the devices involved in processing the distributed service.
  • FIG. 6 shows a remote call from device A to device B.
  • Device A (the client) initiates a request (CS).
  • Device B (the server) receives the request (SR).
  • Device B (the server) processes the request and sends the result to device A (the client) (SS).
  • Device A (the client) obtains the information returned by device B (the server) (CR).
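  • The four phases of this remote call can be mimicked with a small in-memory log in which every record carries the same identifier. The sketch below does not use the Hitrace API; the `Span` record, the `remote_call` function, and the randomly generated trace identifier are hypothetical stand-ins.

```python
import uuid
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    trace_id: str  # identifier shared by all records of one distributed service
    device: str
    phase: str     # "CS", "SR", "SS" or "CR"


def remote_call(trace_id: str, client: str, server: str, log: List[Span]) -> None:
    """Append one record per phase of a single remote call."""
    log.append(Span(trace_id, client, "CS"))  # client sends the request
    log.append(Span(trace_id, server, "SR"))  # server receives the request
    log.append(Span(trace_id, server, "SS"))  # server sends the result
    log.append(Span(trace_id, client, "CR"))  # client receives the result


log: List[Span] = []
trace_id = uuid.uuid4().hex
remote_call(trace_id, "device A", "device B", log)

# Records of one service are found again by filtering on its identifier.
phases = [span.phase for span in log if span.trace_id == trace_id]
devices = {span.device for span in log if span.trace_id == trace_id}
```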
  • Tracking of information between processes can be enabled by turning on a switch, and tracking of information between threads can likewise be enabled by turning on a switch.
  • the Trace ID mentioned in the example is a representation of the above-mentioned identifier, and the identifier may also have other acquisition methods and representation forms, which are not limited in this application.
  • S340 The first device sends a first notification to the second device.
  • After obtaining the UUID of the second device from the first calling relationship, the first device sends a first notification to the second device according to that UUID, where the first notification includes the Event ID and the first identifier.
  • The first device can check whether the first identifier and the UUID of the second device already exist in the device. If they exist, the first device has already sent the first notification to the second device, that is, the second device has already reported the second fault information. In this case, the first device does not need to send the first notification to the second device again, that is, it does not need to transmit the UUID of the second device, the Event ID, and the first identifier to the second device.
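  • This duplicate check can be sketched as a set of (identifier, peer UUID) pairs that is consulted before sending. The function name `notify_once`, the `send` callback, and the message layout are hypothetical.

```python
from typing import Callable, Set, Tuple

# Pairs of (identifier, peer UUID) for which a notification was already sent.
sent_records: Set[Tuple[str, str]] = set()


def notify_once(identifier: str, event_id: int, peer_uuid: str,
                send: Callable[[str, dict], bool]) -> bool:
    """Send the notification unless one was already sent for this pair.

    Returns True only if a notification was actually sent by this call.
    """
    key = (identifier, peer_uuid)
    if key in sent_records:
        return False  # already notified, nothing to do
    if send(peer_uuid, {"event_id": event_id, "identifier": identifier}):
        sent_records.add(key)
        return True
    return False


outbox = []


def fake_send(peer_uuid: str, message: dict) -> bool:
    """Stand-in transport that always succeeds and records the message."""
    outbox.append((peer_uuid, message))
    return True


first_try = notify_once("trace-001", 1001, "uuid-B", fake_send)
second_try = notify_once("trace-001", 1001, "uuid-B", fake_send)
```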
  • The first device may fail to send the first notification to the second device. For example, when a communication failure occurs between the first device and the second device, the first device cannot send a message to the second device, which means that the first device cannot, at that moment, notify the second device to report the second fault information.
  • In this case, the first device caches a first association failure event. Before caching it, the first device determines whether there is enough cache space to cache the first association failure event. If there is enough cache space, the first association failure event is cached. If there is not enough cache space, the second association failure event is cleared. If clearing the second association failure event frees enough cache space, the first association failure event is cached; if there is still not enough cache space after the second association failure event is cleared, the third association failure event is cleared.
  • The second association failure event is the association failure event that has been cached for the longest time in the first device; the third association failure event is the association failure event that has been cached for the longest time in the first device after the second association failure event is cleared.
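  • The clearing rule amounts to evicting the longest-cached event first until the new event fits. A minimal Python sketch follows; the class name, the abstract size units, and the handling of an event larger than the whole cache are assumptions.

```python
from collections import OrderedDict


class AssociationFailureCache:
    """Fixed-size cache that clears the longest-cached event first."""

    def __init__(self, capacity: int):
        self.capacity = capacity      # total space, in arbitrary units
        self.used = 0
        self._events = OrderedDict()  # insertion order equals caching order

    def put(self, key, event, size: int) -> bool:
        if size > self.capacity:
            return False              # assumption: an oversized event is dropped
        if key in self._events:       # replace an existing entry cleanly
            self.used -= self._events.pop(key)[1]
        # Clear the oldest event repeatedly until the new one fits.
        while self.used + size > self.capacity:
            _, (_, old_size) = self._events.popitem(last=False)
            self.used -= old_size
        self._events[key] = (event, size)
        self.used += size
        return True

    def keys(self):
        return list(self._events)


cache = AssociationFailureCache(capacity=2)
cache.put("second", {"peer_uuid": "uuid-X"}, size=1)  # oldest entry
cache.put("third", {"peer_uuid": "uuid-Y"}, size=1)
cache.put("first", {"peer_uuid": "uuid-B"}, size=1)   # evicts "second"
remaining = cache.keys()
```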
  • An association failure event includes the notification failure time, the identifier, the Event ID, and the UUID of the peer device, where the peer device is the device that the faulty device invokes, or the device that invokes the faulty device, in the abnormal distributed service.
  • The first association failure event includes the time when the first device fails to send the first notification, the Event ID corresponding to the failure of the first device, the first identifier, and the UUID of the second device.
  • The above cache space may be a preset space, that is, a space separate from system storage or memory.
  • The failure of the first device to send the first notification to the second device covers both the case where the first device fails to send the first notification and the case where the second device fails to receive the first notification.
  • The first device queries or checks whether the first association failure event exists in the device. If it does, the first device sends the first notification to the second device, that is, transmits the Event ID and the first identifier in the first association failure event to the second device, to notify the peer (the second device) to report the fault information. If the first notification is sent successfully, the cached first association failure event is cleared.
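  • The resend step can be sketched as a pass over the cached association failure events that clears each event whose notification now succeeds. The function name `retry_pending`, the dictionary layout, and the `send` callback are hypothetical; when the pass is triggered is not specified here.

```python
from typing import Callable, Dict, List, Tuple

Key = Tuple[str, str]  # (identifier, peer UUID)


def retry_pending(pending: Dict[Key, dict],
                  send: Callable[[str, dict], bool]) -> List[Key]:
    """Resend the notification for every cached association failure event.

    Events whose notification is sent successfully are cleared from the cache.
    """
    delivered = []
    for key, event in list(pending.items()):
        notification = {"event_id": event["event_id"],
                        "identifier": event["identifier"]}
        if send(event["peer_uuid"], notification):
            del pending[key]
            delivered.append(key)
    return delivered


pending = {
    ("trace-001", "uuid-B"): {"failure_time": 1632300000.0, "event_id": 1001,
                              "identifier": "trace-001", "peer_uuid": "uuid-B"},
    ("trace-001", "uuid-D"): {"failure_time": 1632300001.0, "event_id": 1001,
                              "identifier": "trace-001", "peer_uuid": "uuid-D"},
}


def flaky_send(peer_uuid: str, notification: dict) -> bool:
    """Stand-in transport: only the link to uuid-B has recovered."""
    return peer_uuid == "uuid-B"


delivered = retry_pending(pending, flaky_send)
```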
  • S350 The second device receives the first notification sent by the first device.
  • The second device receives the first notification sent by the first device, that is, it receives the Event ID corresponding to the failure of the first device and the first identifier. From the received first identifier, the second device learns that an abnormality has occurred in the first distributed service.
  • Before the second device receives the first notification sent by the first device, the second device also needs to cache the second invocation relationship, which includes the device invocation information of the distributed services that the second device participates in processing.
  • This device invocation information is the invocation information related to the second device in the distributed services that the second device participates in processing, that is, information about the second device invoking other devices and information about the second device being invoked by other devices.
  • The device invocation information of the distributed services that the second device participates in processing may include the peer type of the second device, the service type of the distributed service, the communication information related to the second device in the distributed service, and a timestamp.
  • For details, reference may be made to the content of the device invocation information of the one or more distributed services that the first device participates in processing, which is not repeated here.
  • The second invocation relationship can be divided into at least two states according to the life cycle of the service, where different states correspond to different service life cycles, and the second invocation relationships in different states are cached separately.
  • S360 The second device reports the second fault information to the server.
  • Specifically, the second device reports the second fault information to the server. The second fault information includes the first identifier, the Event ID, and a second fault log, where the second fault log is the log on the second device that is related to the fault.
  • Both the first device and the second device participate in processing the first distributed service, so both cache the device invocation information and the first identifier of the first distributed service.
  • The second device checks whether the first identifier already exists in the device. If so, the second device has already reported the second fault information and does not need to report it to the server again, that is, it does not need to report the received Event ID, the first identifier, and the second fault log to the server.
  • For example, a distributed service initiated by a user involves five devices: device A, device C, device D, device F, and device H.
  • There is a calling relationship between device A and device C, between device A and device D, between device C and device D, between device C and device F, between device C and device H, and between device D and device H.
  • Suppose device A fails. Device A then reports its fault information and sends a message to device C and device D, with which it has a calling relationship, to notify device C and device D to report fault information.
  • Because device C holds information related to device A, multi-level association is needed. That is, after receiving the message sent by device A, device C sends messages to device D, device F, and device H, which have a calling relationship with it; after receiving the message sent by device A, device D sends messages to device C and device H, which have a calling relationship with it. Once the first notification has been sent, the second device checks whether the first identifier already exists in the device; if so, it does not need to report. For example, when device C receives the message sent by device D, it checks whether it has already reported fault information related to this fault; if it has, it does not report again.
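  • The multi-level association in this example can be simulated on the calling relationships listed above. The sketch below is a simplified model, not the method itself: the `propagate` function, the queue-based delivery order, and the `seen` set standing in for each device's own record of the identifier are assumptions.

```python
from collections import deque

# Calling relationships from the example with devices A, C, D, F and H.
CALLS = {("A", "C"), ("A", "D"), ("C", "D"), ("C", "F"), ("C", "H"), ("D", "H")}


def peers(device: str) -> list:
    """Devices that have a calling relationship with the given device."""
    return sorted({b if a == device else a for a, b in CALLS if device in (a, b)})


def propagate(faulty: str) -> list:
    """Return the devices in the order in which they report to the server."""
    reported = [faulty]  # the faulty device reports first
    seen = {faulty}      # devices that already reported for this identifier
    queue = deque((faulty, peer) for peer in peers(faulty))
    while queue:
        sender, receiver = queue.popleft()
        if receiver in seen:
            continue     # already reported, so it does not report again
        seen.add(receiver)
        reported.append(receiver)
        # Forward the notification to the receiver's own peers.
        queue.extend((receiver, peer) for peer in peers(receiver) if peer != sender)
    return reported


order = propagate("A")
print(order)
```

  • In this model every device reports exactly once, even though device C and device D each receive the notification from more than one peer.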
  • The fault information that the faulty device (the first device) reports to the server is not the same as the fault information that other devices having a calling relationship with the faulty device (such as the second device) report to the server.
  • When a device fails, a failure event is generated, and the identifier of the corresponding service is displayed at the same time.
  • The device collects the fault log corresponding to the failure event and then reports the failure event, the fault log, and the identifier to the server.
  • The faulty device also sends a message to notify the devices that have a calling relationship with it to report fault information, and those devices report fault information to the server after receiving the message.
  • The failure event may include the device type, failure time, Event ID, faulty module, and exception type. The device type may be a terminal device such as a mobile phone, or another device such as a router. The failure time is the time at which the failure occurred.
  • The fault type corresponds to the error information represented by the Event ID.
  • For example, Windows Event IDs range from 0 to 5073.
  • The error information represented by each Event ID is different, so when a fault occurs, the type of fault that occurred on the device can be determined from the Event ID.
  • The faulty module can be a hardware or software module inside the device, or a process or middleware. Exception types include but are not limited to no response, freeze, negotiation failure, and timeout.
  • S370 The second device searches for the third device from the second calling relationship.
  • The second calling relationship includes the identifiers of one or more distributed services initiated by the user, as well as the UUIDs of the peer devices. After the second device receives the first notification sent by the first device, it can find the second device invocation information in the second invocation relationship according to the first identifier in the received first notification, where the second device invocation information is the device invocation information of the first distributed service in the second invocation relationship.
  • From the second device invocation information, the second device finds the UUID and communication information of its peer device, that is, the UUID of the third device and the communication information between the second device and the third device.
  • S380 The second device sends the second notification to the third device.
  • After obtaining the UUID of the third device from the second calling relationship, the second device sends a second notification to the third device according to that UUID, where the second notification includes the Event ID and the first identifier.
  • The second device can check whether the first identifier and the UUID of the third device already exist in the device. If so, the second device has already sent the second notification to the third device, that is, the third device has already reported the third fault information. In this case, the second device does not need to send the second notification to the third device again, that is, it does not need to transmit the UUID of the third device, the Event ID, and the first identifier to the third device.
  • The second device may fail to send the second notification to the third device.
  • The second device determines whether there is enough cache space to cache the fourth association failure event. If there is, it caches the fourth association failure event. If there is not, it clears the fifth association failure event. If enough cache space is available after the fifth association failure event is cleared, it caches the fourth association failure event; if there is still not enough cache space, it clears the sixth association failure event.
  • The fifth association failure event is the association failure event that has been cached longest in the second device; the sixth association failure event is the association failure event that has been cached longest in the second device after the fifth association failure event is cleared.
  • A failure of the second device to send the second notification to the third device covers both the case where the second device fails to send the second notification and the case where the third device fails to receive it.
  • When the second device comes back online or processes a distributed service again, it checks whether the fourth association failure event exists in the device. If it does, the second device sends the second notification to the third device, that is, it transmits the Event ID and the first identifier in the fourth association failure event to the third device. If the sending succeeds, the cached fourth association failure event is cleared.
  • The second device notifies the third device to report the third fault information in the same way that the first device notifies the second device to report the second fault information; for details, refer to the example given for the process in which the first device notifies the second device to report the second fault information.
  • FIG. 9 is a schematic structural diagram of a first device provided by the present application, where the first device is configured to execute the distributed system-oriented fault location method described in FIG. 3 .
  • This application does not limit the division of the functional units of the first device; units in the first device may be added, removed, or combined as required.
  • The operations and/or functions of the units in the first device respectively implement the corresponding procedures of the method described in FIG. 3; for brevity, details are not repeated here.
  • Figure 9 exemplarily provides a division of functional units:
  • The first device 900 includes a first cache unit 910, a first processing unit 920, and a first sending unit 930.
  • The first cache unit 910 is configured to cache a first invocation relationship, where the first invocation relationship includes device invocation information of one or more user-initiated distributed services that the first device participates in processing.
  • The first processing unit 920 is configured to: report the first fault information to the server when a fault occurs while the first device is processing the first distributed service; and search the first invocation relationship for the second device, where the second device includes a device that invokes the first device, or a device invoked by the first device, when the first distributed service is executed.
  • The first sending unit 930 is configured to send a first notification to the second device, where the first notification instructs the second device to report the second fault information to the server; or, if the second device has not reported the second fault information, to send the first notification to the second device. The second fault information includes fault information of the second device when processing the first distributed service.
  • each unit included in the first device 900 may be a software unit, a hardware unit, or a part of a software unit and a part of a hardware unit.
  • FIG. 10 is a schematic structural diagram of a second device provided by the present application, where the second device is configured to execute the method for associating and reporting fault information for a distributed system described in FIG. 3 .
  • This application does not limit the division of the functional units of the second device; units in the second device may be added, removed, or combined as required.
  • The operations and/or functions of the units in the second device respectively implement the corresponding procedures of the method described in FIG. 3; for brevity, details are not repeated here.
  • Figure 10 exemplarily provides a division of functional units:
  • The second device 1000 includes a second cache unit 1010, a first receiving unit 1020, and a second processing unit 1030.
  • The second cache unit 1010 is configured to cache a second invocation relationship, where the second invocation relationship includes device invocation information of one or more user-initiated distributed services that the second device participates in processing.
  • The first receiving unit 1020 is configured to receive a first notification sent by the first device, where the first notification instructs the second device to report the second fault information to the server.
  • The second processing unit 1030 is configured to report the second fault information to the server; or, if the second device has not reported the second fault information, to report the second fault information to the server. The second fault information includes fault information of the second device when processing the first distributed service.
  • The second device 1000 further includes a second sending unit 1040. The second sending unit 1040 is configured to send a second notification to the third device, where the second notification instructs the third device to report the third fault information to the server; or, if the third device has not reported the third fault information, to send the second notification to the third device. The third fault information includes fault information of the third device when processing the first distributed service.
  • each unit included in the second device 1000 may be a software unit, a hardware unit, or a part of a software unit and a part of a hardware unit.
  • FIG. 11 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • the computing device 1100 includes a processor 1110 , a communication interface 1120 and a memory 1130 , and the processor 1110 , the communication interface 1120 and the memory 1130 are connected to each other through an internal bus 1140 .
  • the computing device 1100 may be the first device 900 in FIG. 9 , and the functions performed by the first device 900 in FIG. 9 are actually performed by the processor 1110 of the first device 900 .
  • the processor 1110 may be composed of one or more general-purpose processors, such as a central processing unit (central processing unit, CPU), or a combination of a CPU and a hardware chip.
  • the above-mentioned hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD) or a combination thereof.
  • the above-mentioned PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a general array logic (generic array logic, GAL) or any combination thereof.
  • the communication interface 1120 is used to communicate with other devices or communication networks, such as Ethernet, radio access network (RAN), core network, wireless local area network (Wireless Local Area Networks, WLAN) and the like.
  • the bus 1140 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, or the like.
  • the bus 1140 can be divided into an address bus, a data bus, a control bus, and the like. For ease of presentation, only one thick line is used in FIG. 11, but it does not mean that there is only one bus or one type of bus.
  • The memory 1130 may include volatile memory, such as random access memory (RAM); the memory 1130 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 1130 may also include a combination of the above types.
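  • The memory 1130 is used to store program code for executing the embodiments of the fault information association reporting method for a distributed system. In one implementation, the memory 1130 can also cache other data, and execution is controlled by the processor 1110 to implement the functional units shown in the first device 900, or to implement the method steps of the method embodiment shown in FIG. 3 in which the first device 900 is the executing entity, as follows: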
  • The processor 1110 controls the memory 1130 to cache the first invocation relationship, where the first invocation relationship includes device invocation information of one or more user-initiated distributed services that the first device 900 participates in processing;
  • The processor 1110 controls the communication interface 1120 to report the first fault information to the server;
  • The processor 1110 searches the first invocation relationship for a second device, where the second device includes a device that invokes the first device 900, or a device invoked by the first device 900, when the first distributed service is executed;
  • The processor 1110 controls the communication interface 1120 to send a first notification to the second device, where the first notification instructs the second device to report the second fault information to the server; or, if the second device has not reported the second fault information, the first device 900 sends the first notification to the second device. The second fault information includes fault information of the second device when processing the first distributed service.
  • The processor 1110 searching the first invocation relationship for the second device includes: the processor 1110 searches the first device invocation information for the second device by using the first identifier, where the first device invocation information is the device invocation information of the first distributed service in the first invocation relationship.
  • The processor 1110 divides the first invocation relationship into at least two states according to the life cycle of the service, where first invocation relationships in different states correspond to services with different life cycles; the processor 1110 caches the first invocation relationships in different states separately in the memory 1130.
  • If sending the first notification to the second device fails, the memory 1130 caches the first association failure event; the first association failure event indicates that sending the first notification failed.
  • The processor 1110 determines whether there is enough cache space to cache the first association failure event. If there is currently enough cache space, the processor 1110 controls the memory 1130 to cache the first association failure event. If there is currently not enough cache space, the processor 1110 clears the second association failure event, which is the association failure event that has been cached longest in the first device. If there is enough cache space after the second association failure event is cleared, the processor 1110 controls the memory 1130 to cache the first association failure event. If there is still not enough cache space, the processor 1110 clears the third association failure event, which is the association failure event that has been cached longest in the first device after the second association failure event is cleared.
  • When the first device 900 comes back online or processes a distributed service again, the processor 1110 checks whether the first association failure event exists. If it does, the processor 1110 controls the communication interface 1120 to send the first notification to the second device. When the first notification is sent successfully, the processor 1110 clears the cached first association failure event.
  • FIG. 12 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • the computing device 1200 includes: a processor 1210 , a communication interface 1220 and a memory 1230 , and the processor 1210 , the communication interface 1220 and the memory 1230 are connected to each other through an internal bus 1240 .
  • The computing device 1200 may be the second device 1000 in FIG. 10, and the functions performed by the second device 1000 in FIG. 10 are actually performed by the processor 1210 of the second device 1000.
  • the processor 1210 may be composed of one or more general-purpose processors, such as a central processing unit (CPU), or a combination of a CPU and a hardware chip.
  • the above-mentioned hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD) or a combination thereof.
  • the above-mentioned PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a general array logic (generic array logic, GAL) or any combination thereof.
  • the communication interface 1220 is used to communicate with other devices or communication networks, such as Ethernet, radio access network (RAN), core network, wireless local area network (Wireless Local Area Networks, WLAN) and the like.
  • the bus 1240 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus or the like.
  • the bus 1240 can be divided into an address bus, a data bus, a control bus, and the like. For ease of presentation, only one thick line is used in FIG. 12, but it does not mean that there is only one bus or one type of bus.
  • the memory 1230 may include a volatile memory (volatile memory), such as random access memory (RAM); the memory 1230 may also include a non-volatile memory (non-volatile memory), such as a read-only memory (read- only memory, ROM), flash memory (flash memory), hard disk drive (HDD) or solid-state drive (solid-state drive, SSD); the memory 1230 may also include a combination of the above types.
  • The memory 1230 is used to store program code for executing the embodiments of the fault information association reporting method for a distributed system. In one implementation, the memory 1230 can also cache other data, and execution is controlled by the processor 1210 to implement the functional units shown in the second device 1000, or to implement the method steps of the method embodiment shown in FIG. 3 in which the second device 1000 is the executing entity, as follows:
  • The processor 1210 controls the memory 1230 to cache the second invocation relationship, where the second invocation relationship includes device invocation information of one or more user-initiated distributed services that the second device 1000 participates in processing;
  • The processor 1210 controls the communication interface 1220 to receive a first notification sent by the first device, where the first notification instructs the second device 1000 to report the second fault information to the server;
  • The processor 1210 reports the second fault information to the server; or, if the second device 1000 has not reported the second fault information, the second device 1000 reports the second fault information to the server. The second fault information includes fault information of the second device 1000 when processing the first distributed service.
  • The processor 1210 searches the second invocation relationship for a third device, where the third device includes a device that invokes the second device 1000, or a device invoked by the second device 1000, when the first distributed service is executed. The processor 1210 controls the communication interface 1220 to send a second notification to the third device, where the second notification instructs the third device to report the third fault information to the server; or, if the third device has not reported the third fault information, the second device 1000 sends the second notification to the third device. The third fault information includes fault information of the third device when processing the first distributed service.
  • The processor 1210 searches the second device invocation information for the third device by using the first identifier, where the second device invocation information is the device invocation information of the first distributed service in the second invocation relationship.
  • The processor 1210 divides the second invocation relationship into at least two states according to the life cycle of the service, where second invocation relationships in different states correspond to services with different life cycles; the processor 1210 caches the second invocation relationships in different states separately in the memory 1230.
  • If sending the second notification to the third device fails, the processor 1210 controls the memory 1230 to cache the fourth association failure event; the fourth association failure event indicates that sending the second notification failed.
  • The processor 1210 determines whether there is enough cache space to cache the fourth association failure event. If there is currently enough cache space, the processor 1210 controls the memory 1230 to cache the fourth association failure event. If there is currently not enough cache space, the processor 1210 clears the fifth association failure event, which is the association failure event that has been cached longest in the second device. If there is enough cache space after the fifth association failure event is cleared, the processor 1210 controls the memory 1230 to cache the fourth association failure event. If there is still not enough cache space, the processor 1210 clears the sixth association failure event, which is the association failure event that has been cached longest in the second device after the fifth association failure event is cleared.
  • When the second device 1000 comes back online or processes a distributed service again, the processor 1210 checks whether the fourth association failure event exists. If it does, the processor 1210 controls the communication interface 1220 to send the second notification to the third device. When the second notification is sent successfully, the processor 1210 clears the cached fourth association failure event.
  • Embodiments of the present application further provide a computer-readable storage medium on which a computer program is stored. When executed by a processor, the program can implement some or all of the steps described in the above method embodiments and realize the function of any one of the functional units described in FIG. 9.
  • Embodiments of the present application further provide a computer-readable storage medium on which a computer program is stored. When executed by a processor, the program can implement some or all of the steps described in the above method embodiments and realize the function of any one of the functional units described in FIG. 10.
  • Embodiments of the present application also provide a computer program product which, when run on a computer or a processor, causes the computer or processor to execute one or more steps of any of the above methods in which the first device 900 is the executing entity. If the component modules of the above device are implemented in the form of software functional units and sold or used as an independent product, they can be stored in the computer-readable storage medium.
  • Embodiments of the present application also provide a computer program product which, when run on a computer or a processor, causes the computer or processor to execute one or more steps of any of the above methods in which the second device 1000 is the executing entity. If the component modules of the above device are implemented in the form of software functional units and sold or used as an independent product, they can be stored in the computer-readable storage medium.
  • Embodiments of the present application further provide a chip system, where the chip system includes a processor for supporting the first device 900 to implement one or more steps of the method steps in any of the foregoing methods with the first device 900 as an execution subject.
  • the chip system further includes a memory for storing necessary program instructions and data of the data sending device.
  • the chip system may be composed of chips, or may include chips and other discrete devices.
  • An embodiment of the present application further provides a chip system, where the chip system includes a processor for supporting the second device 1000 to implement one or more steps of the method steps in any of the foregoing methods with the second device 1000 as the execution body.
  • the chip system further includes a memory for storing necessary program instructions and data of the data sending device.
  • the chip system may be composed of chips, or may include chips and other discrete devices.
  • The sequence numbers of the above processes do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
  • the disclosed system, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the functions, if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium.
  • The technical solution of the present application in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product.
  • The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present application.
  • The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
  • the modules in the apparatus of the embodiment of the present application may be combined, divided and deleted according to actual needs.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)

Abstract

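This application provides a fault information association reporting method for a distributed system and a related device. The method includes: a first device caches a first invocation relationship; when a fault occurs while the first device is processing a first distributed service, the first device reports first fault information to a server; the first device searches the first invocation relationship for a second device, where the second device includes a device that invokes the first device, or a device invoked by the first device, when the first distributed service is executed; the first device sends a first notification to the second device, or, if the second device has not reported the second fault information, the first device sends the first notification to the second device. The method reduces data storage and analysis costs, allows more fault-related information to be reported, and improves the efficiency of fault locating.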

Description

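A fault information association reporting method for a distributed system and a related device
This application claims priority to Chinese Patent Application No. 202011040443.5, filed with the China Patent Office on September 28, 2020 and entitled "A fault information association reporting method for a distributed system and a related device", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of computer research, and in particular to a fault information association reporting method for a distributed system and a related device.
Background
To meet the demands of growing service volume and to cope with large-scale application scenarios, distributed systems are used more and more widely. A distributed system offers high reliability, good scalability, and fast communication, and makes resource sharing among users more convenient. However, the distributed services in a distributed system involve a very large number of devices, and the invocations between devices and between modules within a device are intricate, which makes faults difficult to locate when they occur.
In existing fault locating methods, Trace ID instrumentation is first used to track invocations between devices. After a fault occurs, a global index lookup is performed using the Trace ID of the abnormal service, and analysis and locating follow. In this process the fault is perceived only at the faulty device: when a fault occurs, only the faulty device reports fault information, and the other devices that took part in processing the abnormal service may not report any fault-related information. The fault-related information obtained by the server is therefore very limited, which hinders subsequent fault locating and analysis. In addition, this method collects a large amount of data from normal service flows. That data is most likely unrelated to the fault and incurs unnecessary storage and analysis costs. Furthermore, as service volume grows and more data is collected, duplicate Trace IDs become likely, which makes analysis harder still.
Therefore, how to report more fault-related information in a distributed system so that faults can be located effectively is a problem that urgently needs to be solved.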
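Summary
This application provides a fault information association reporting method for a distributed system and a related device. Fault information is reported when a fault occurs, without collecting data from normal flows, which saves data storage and analysis costs. In addition, besides reporting its own fault information, the faulty device notifies its associated devices (peer devices) to report information, so that the server has more useful information available for fault analysis.
According to a first aspect, this application provides a fault information association reporting method for a distributed system. The method includes: a first device caches a first invocation relationship, where the first invocation relationship includes device invocation information of one or more user-initiated distributed services that the first device participates in processing; when a fault occurs while the first device is processing a first distributed service, the first device reports first fault information to a server; the first device searches the first invocation relationship for a second device, where the second device includes a device that invokes the first device, or a device invoked by the first device, when the first distributed service is executed; the first device sends a first notification to the second device, where the first notification instructs the second device to report second fault information to the server; or, if the second device has not reported the second fault information, the first device sends the first notification to the second device. The second fault information includes fault information of the second device when processing the first distributed service.
In the solution provided in this application, invocation relationship caching is used. No information is collected for unrelated devices or for the device invocation information of distributed services that have no fault, which saves data storage and analysis costs. In addition, besides reporting its own fault information, the faulty device sends a message to each device with which it has an invocation relationship, notifying that device to report its own fault-related information to the server. With this reporting approach the server has more fault-related information available when locating a fault, which improves the efficiency and accuracy of fault locating.
With reference to the first aspect, in a possible implementation of the first aspect, the first device searching the first invocation relationship for the second device includes: the first device searches first device invocation information for the second device by using a first identifier, where the first device invocation information is the device invocation information of the first distributed service in the first invocation relationship.
In the solution provided in this application, the invocation information related to each distributed service has an identifier, and each identifier is cached in the device together with the device invocation information of its distributed service. When a fault occurs, the device can therefore use the identifier in the fault information to find the cached device invocation information of the distributed service, and from it the peer device. The peer device is a device that the faulty device invokes, or a device that invokes the faulty device, in that distributed service. This lets the faulty device find its peer devices more quickly during association reporting, which saves time.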
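As an illustration of looking up peer devices by identifier, the following is a minimal Python sketch. It assumes that each cached entry is keyed by the service identifier and stores the UUIDs of callers and callees; the names `CallRecord`, `CallRelationCache`, `record_call`, and `find_peers` are hypothetical and do not come from the application.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class CallRecord:
    """Device invocation information cached for one distributed service (hypothetical layout)."""
    service_id: str                                   # identifier cached with the service
    callers: Set[str] = field(default_factory=set)    # UUIDs of devices that invoked this device
    callees: Set[str] = field(default_factory=set)    # UUIDs of devices this device invoked


class CallRelationCache:
    """Maps a service identifier to the cached device invocation information."""

    def __init__(self) -> None:
        self._records: Dict[str, CallRecord] = {}

    def record_call(self, service_id: str, peer_uuid: str, outgoing: bool) -> None:
        # Create the record on first use, then file the peer as callee or caller.
        rec = self._records.setdefault(service_id, CallRecord(service_id))
        (rec.callees if outgoing else rec.callers).add(peer_uuid)

    def find_peers(self, service_id: str) -> Set[str]:
        # Peers are the devices that invoked this device or were invoked by it.
        rec = self._records.get(service_id)
        return set() if rec is None else rec.callers | rec.callees
```

Because the lookup is a single dictionary access on the identifier carried in the fault information, the faulty device does not need to scan the invocation data of unrelated services.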
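With reference to the first aspect, in a possible implementation of the first aspect, caching the first invocation relationship includes: dividing the first invocation relationship into at least two states according to the life cycle of the service, where first invocation relationships in different states correspond to services with different life cycles; and caching the first invocation relationships in different states separately.
In the solution provided in this application, the cache space of a device is limited. When the space is insufficient to cache the invocation relationship of the next distributed service, the earliest cached invocation relationship is cleared. The invocation relationships of different distributed services are divided into different states according to the life cycle of the service and then cached. This prevents short-lived services from occupying most of the cache space: long-lived and short-lived services each have their own cache space and do not affect each other.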
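The separation of caches by life cycle can be sketched in Python as follows. This is only an illustration under stated assumptions: the state names, the per-state capacities counted in entries, and the class `LifecycleCache` are hypothetical, and the application does not specify how capacity is measured.

```python
from collections import OrderedDict


class LifecycleCache:
    """One bounded, insertion-ordered store per life-cycle state (hypothetical design)."""

    def __init__(self, capacities):
        if any(cap < 1 for cap in capacities.values()):
            raise ValueError("each state needs room for at least one entry")
        self._capacities = dict(capacities)
        self._stores = {state: OrderedDict() for state in capacities}

    def put(self, state, service_id, call_info):
        store = self._stores[state]
        if service_id in store:
            store[service_id] = call_info      # update in place, keep original age
            return
        while len(store) >= self._capacities[state]:
            store.popitem(last=False)          # clear the earliest cached entry of this state only
        store[service_id] = call_info

    def get(self, service_id):
        for store in self._stores.values():
            if service_id in store:
                return store[service_id]
        return None
```

For example, with `LifecycleCache({"short": 2, "long": 1})`, a burst of short-lived services can only evict other entries in the `"short"` store, so the entry of a long-lived service stays cached.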
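With reference to the first aspect, in a possible implementation of the first aspect, the method further includes: if sending fails when the first device sends the first notification to the second device, caching a first association failure event, where the first association failure event indicates that sending the first notification failed.
In the solution provided in this application, if the faulty device does not succeed in notifying the peer device to report fault information, it does not simply discard the information about the failed notification. Instead it caches an association failure event, which includes the information needed to notify the peer device. This allows the faulty device to notify the peer device again later. In particular, where the service failed because of external factors such as the network, this collection mechanism can greatly improve the success rate of notifying the peer device.
With reference to the first aspect, in a possible implementation of the first aspect, caching the first association failure event includes: the first device determines whether there is enough cache space to cache the first association failure event; if there is currently enough cache space, the first device caches the first association failure event; if there is currently not enough cache space, the first device clears a second association failure event, where the second association failure event is the association failure event that has been cached longest in the first device; if there is enough cache space after the second association failure event is cleared, the first device caches the first association failure event; if there is still not enough cache space after the second association failure event is cleared, the first device clears a third association failure event, where the third association failure event is the association failure event that has been cached longest in the first device after the second association failure event is cleared.
In the solution provided in this application, a quota and an aging mechanism are applied to the cached information about failed notifications. The number of association failure events that a faulty device can cache is limited; when there is not enough space to keep caching, the device clears the earliest cached association failure event. This prevents the storage space of the faulty device from being occupied by too much unimportant information and saves storage space on the faulty device.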
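The quota and aging behaviour can be sketched as a bounded queue that evicts the oldest events until the new one fits. The Python below is a minimal sketch, assuming each event carries a hypothetical `size` cost and the cache has a fixed `capacity`; the field names and the class `FailureEventCache` are illustrative, not taken from the application.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class AssociationFailureEvent:
    peer_uuid: str     # device that could not be notified
    event_id: str      # Event ID to resend
    service_id: str    # identifier of the abnormal distributed service
    size: int = 1      # hypothetical cost in cache units


class FailureEventCache:
    """Bounded cache that clears the longest-cached events first."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.used = 0
        self._events = deque()

    def add(self, event: AssociationFailureEvent) -> bool:
        if event.size > self.capacity:
            return False                       # can never fit, do not evict anything
        while self.used + event.size > self.capacity:
            oldest = self._events.popleft()    # clear the event cached for the longest time
            self.used -= oldest.size
        self._events.append(event)
        self.used += event.size
        return True

    def pending(self):
        return list(self._events)
```

The `while` loop corresponds to clearing one event, rechecking the space, and clearing the next-oldest event if the space is still insufficient.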
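With reference to the first aspect, in a possible implementation of the first aspect, the method further includes: when the first device comes back online or processes a distributed service again, checking whether the first association failure event exists; and when the first notification is sent successfully, clearing the cached first association failure event.
In the solution provided in this application, when the faulty device comes back online or processes a distributed service again, it checks whether any association failure event is cached. If one is, the faulty device notifies the peer device again to report fault-related information. This provides a reliable information collection mechanism that, over unreliable communication links, reduces the loss of locating information and greatly improves the association success rate.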
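The retry step can be sketched as a function that is called when the device comes back online or handles a distributed service again. This Python sketch assumes a caller-supplied `send` callable that returns `True` on success and may raise `OSError` on a transport error; both the function name and that contract are assumptions made for illustration.

```python
def retry_failed_notifications(pending, send):
    """Resend each cached notification and return the events that still failed.

    `pending` is an iterable of cached association failure events.
    `send` is a hypothetical transport callable: send(event) -> bool.
    """
    still_failed = []
    for event in pending:
        try:
            ok = send(event)
        except OSError:            # covers ConnectionError, TimeoutError, and similar
            ok = False
        if not ok:
            still_failed.append(event)   # keep it cached for the next attempt
    return still_failed
```

Events that were sent successfully are simply absent from the returned list, which corresponds to clearing them from the cache.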
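According to a second aspect, a fault information association reporting method for a distributed system is provided. The method includes: a second device caches a second invocation relationship, where the second invocation relationship includes device invocation information of one or more user-initiated distributed services that the second device participates in processing; the second device receives a first notification sent by a first device, where the first notification instructs the second device to report second fault information to the server; the second device reports the second fault information to the server; or, if the second device has not reported the second fault information, the second device reports the second fault information to the server. The second fault information includes fault information of the second device when processing a first distributed service.
In the solution provided in this application, after receiving the notification from the faulty device, the peer device of the faulty device reports its own fault-related information to the server. Specifically, the peer device may report directly on receiving the notification, or may first check whether it has already reported that fault-related information. With association reporting, every device related to the abnormal distributed service can report fault-related information, which makes the fault information more useful. In addition, fault information is collected only after a fault that affects user functions occurs, and information from normal service flows is not collected, which reduces the load on the distributed system and the amount of data to analyze.
With reference to the second aspect, in a possible implementation of the second aspect, the method further includes: the second device searches the second invocation relationship for a third device, where the third device includes a device that invokes the second device, or a device invoked by the second device, when the first distributed service is executed; the second device sends a second notification to the third device, where the second notification instructs the third device to report third fault information to the server; or, if the third device has not reported the third fault information, the second device sends the second notification to the third device. The third fault information includes fault information of the third device when processing the first distributed service.
In the solution provided in this application, the faulty device not only reports its own fault information but also notifies the peer device to report fault-related information. In this way the server can obtain fairly complete fault-related information, which facilitates subsequent fault locating and analysis.
With reference to the second aspect, in a possible implementation of the second aspect, the second device searching the second invocation relationship for the third device includes: the second device searches the second device invocation information for the third device by using the first identifier, where the second device invocation information is the device invocation information of the first distributed service in the second invocation relationship.
In the solution provided in this application, the message that the faulty device sends to its peer device contains the identifier of the abnormal distributed service. The peer device can use this identifier to look up its own peer device in its invocation relationship and then notify that device to report fault-related information. This saves the time needed to find the peer device.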
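The way a notification can travel from the faulty device to its peer and on to that peer's peer can be illustrated with a small in-memory simulation. This Python sketch makes several assumptions: devices call each other directly through a shared dictionary instead of a real transport, the server is a plain list, and each device remembers which (Event ID, identifier) pairs it has already handled so that it reports once. The class `Device` and its methods are hypothetical.

```python
class Device:
    """In-memory stand-in for a device taking part in association reporting."""

    def __init__(self, uuid, network, server_log):
        self.uuid = uuid
        self.network = network          # dict: UUID -> Device (stand-in for a transport)
        self.server_log = server_log    # list standing in for the server
        self.peers = {}                 # service identifier -> set of peer UUIDs
        self.handled = set()            # (event_id, service_id) pairs already reported
        network[uuid] = self

    def on_fault(self, event_id, service_id):
        # Called on the device where the fault occurred.
        self._report_and_forward(event_id, service_id, sender=None)

    def on_notification(self, event_id, service_id, sender):
        # Called when a peer asks this device to report.
        self._report_and_forward(event_id, service_id, sender)

    def _report_and_forward(self, event_id, service_id, sender):
        key = (event_id, service_id)
        if key in self.handled:
            return                      # already reported for this event
        self.handled.add(key)
        self.server_log.append((self.uuid, event_id))
        for peer in sorted(self.peers.get(service_id, ())):
            if peer != sender:          # do not notify the device that notified us
                self.network[peer].on_notification(event_id, service_id, self.uuid)
```

In a chain where A invokes B and B invokes C, calling `on_fault` on A leads to one report each from A, B, and C, and a repeated call produces no further reports.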
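With reference to the second aspect, in a possible implementation of the second aspect, caching the second invocation relationship includes: dividing the second invocation relationship into at least two states according to the life cycle of the service, where second invocation relationships in different states correspond to services with different life cycles; and caching the second invocation relationships in different states separately.
In the solution provided in this application, the invocation relationships of different distributed services are divided into different states according to the life cycle of the service before being cached. With this caching approach the device invocation relationships of services with different life cycles are stored separately and do not affect each other.
With reference to the second aspect, in a possible implementation of the second aspect, the method further includes: if sending fails when the second device sends the second notification to the third device, caching a fourth association failure event, where the fourth association failure event indicates that sending the second notification failed.
In the solution provided in this application, a reliable information collection mechanism is established. When a device does not succeed in notifying the peer device to report fault-related information, it caches an association failure event so that the notification can be sent again later. This mechanism reduces the loss of locating information and greatly improves the success rate of notifying the peer device.
With reference to the second aspect, in a possible implementation of the second aspect, caching the fourth association failure event includes: the second device determines whether there is enough cache space to cache the fourth association failure event; if there is currently enough cache space, the second device caches the fourth association failure event; if there is currently not enough cache space, the second device clears a fifth association failure event, where the fifth association failure event is the association failure event that has been cached longest in the second device; if there is enough cache space after the fifth association failure event is cleared, the second device caches the fourth association failure event; if there is still not enough cache space after the fifth association failure event is cleared, the second device clears a sixth association failure event, where the sixth association failure event is the association failure event that has been cached longest in the second device after the fifth association failure event is cleared.
In the solution provided in this application, when the device does not have enough cache space to store an association failure event, it clears the earliest stored association failure event. This prevents memory from being occupied by too much useless information and makes reasonable use of the device's storage space.
With reference to the second aspect, in a possible implementation of the second aspect, the method further includes: when the second device comes back online or processes a distributed service again, checking whether the fourth association failure event exists; when the fourth association failure event exists, sending the second notification to the third device; and when the second notification is sent successfully, clearing the cached fourth association failure event.
In the solution provided in this application, each time the device comes back online or processes a distributed service again, it checks whether any association failure event is cached. If one is cached, the device notifies the peer device again to report fault-related information. This ensures as far as possible that all related devices are notified, reduces the loss of fault-related information, and greatly improves the association success rate.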
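According to a third aspect, a first device is provided. The device includes: a first cache unit, configured to cache a first invocation relationship, where the first invocation relationship includes device invocation information of one or more user-initiated distributed services that the first device participates in processing; a first processing unit, configured to report first fault information to a server when a fault occurs while the first device is processing a first distributed service, and to search the first invocation relationship for a second device, where the second device includes a device that invokes the first device, or a device invoked by the first device, when the first distributed service is executed; and a first sending unit, configured to send a first notification to the second device, where the first notification instructs the second device to report second fault information to the server; or, if the second device has not reported the second fault information, the first device sends the first notification to the second device. The second fault information includes fault information of the second device when processing the first distributed service.
With reference to the third aspect, in a possible implementation of the third aspect, when searching the first invocation relationship for the second device, the first processing unit is specifically configured to: search first device invocation information for the second device by using a first identifier, where the first device invocation information is the device invocation information of the first distributed service in the first invocation relationship.
With reference to the third aspect, in a possible implementation of the third aspect, the first cache unit is specifically configured to: divide the first invocation relationship into at least two states according to the life cycle of the service, where first invocation relationships in different states correspond to services with different life cycles; and cache the first invocation relationships in different states separately.
With reference to the third aspect, in a possible implementation of the third aspect, the first cache unit is further configured to: if sending fails when the first sending unit sends the first notification to the second device, cache a first association failure event, where the first association failure event indicates that sending the first notification failed.
With reference to the third aspect, in a possible implementation of the third aspect, when caching the first association failure event, the first cache unit is specifically configured to: determine whether there is enough cache space to cache the first association failure event; if there is currently enough cache space, cache the first association failure event; if there is currently not enough cache space, clear a second association failure event, where the second association failure event is the association failure event that has been cached longest in the first device; if there is enough cache space after the second association failure event is cleared, cache the first association failure event; if there is still not enough cache space after the second association failure event is cleared, clear a third association failure event, where the third association failure event is the association failure event that has been cached longest in the first device after the second association failure event is cleared.
With reference to the third aspect, in a possible implementation of the third aspect, the first processing unit is further configured to check whether the first association failure event exists when the first device comes back online or processes a distributed service again; the first sending unit is further configured to send the first notification to the second device when the first association failure event exists; and the first cache unit is further configured to clear the cached first association failure event after the first sending unit successfully sends the first notification.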
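According to a fourth aspect, a second device is provided. The device includes: a second cache unit, configured to cache a second invocation relationship, where the second invocation relationship includes device invocation information of one or more user-initiated distributed services that the second device participates in processing; a first receiving unit, configured to receive a first notification sent by a first device, where the first notification instructs the second device to report second fault information to the server; and a second processing unit, configured to report the second fault information to the server; or, if the second device has not reported the second fault information, the second device reports the second fault information to the server. The second fault information includes fault information of the second device when processing a first distributed service.
With reference to the fourth aspect, in a possible implementation of the fourth aspect, the second processing unit is further configured to search the second invocation relationship for a third device, where the third device includes a device that invokes the second device, or a device invoked by the second device, when the first distributed service is executed; the second device further includes a second sending unit, configured to send a second notification to the third device, where the second notification instructs the third device to report third fault information to the server; or, if the third device has not reported the third fault information, the second device sends the second notification to the third device. The third fault information includes fault information of the third device when processing the first distributed service.
With reference to the fourth aspect, in a possible implementation of the fourth aspect, when searching the second invocation relationship for the third device, the second processing unit is specifically configured to: search the second device invocation information for the third device by using the first identifier, where the second device invocation information is the device invocation information of the first distributed service in the second invocation relationship.
With reference to the fourth aspect, in a possible implementation of the fourth aspect, when caching the second invocation relationship, the second cache unit is specifically configured to: divide the second invocation relationship into at least two states according to the life cycle of the service, where second invocation relationships in different states correspond to services with different life cycles; and cache the second invocation relationships in different states separately.
With reference to the fourth aspect, in a possible implementation of the fourth aspect, the second cache unit is further configured to: if sending fails when the second sending unit sends the second notification to the third device, cache a fourth association failure event, where the fourth association failure event indicates that sending the second notification failed.
With reference to the fourth aspect, in a possible implementation of the fourth aspect, when caching the fourth association failure event, the second cache unit is specifically configured to: determine whether there is enough cache space to cache the fourth association failure event; if there is currently enough cache space, cache the fourth association failure event; if there is currently not enough cache space, clear a fifth association failure event, where the fifth association failure event is the association failure event that has been cached longest in the second device; if there is enough cache space after the fifth association failure event is cleared, cache the fourth association failure event; if there is still not enough cache space after the fifth association failure event is cleared, clear a sixth association failure event, where the sixth association failure event is the association failure event that has been cached longest in the second device after the fifth association failure event is cleared.
With reference to the fourth aspect, in a possible implementation of the fourth aspect, the second processing unit is further configured to check whether the fourth association failure event exists when the second device comes back online or processes a distributed service again; the second sending unit is further configured to send the second notification to the third device when the fourth association failure event exists; and when the second notification is sent successfully, the cached fourth association failure event is cleared.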
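According to a fifth aspect, a computing device is provided. The computing device includes a processor and a memory. The memory is configured to store program code, and the processor is configured to invoke the program code in the memory to perform the fault information association reporting method for distributed systems provided in the first aspect or in any implementation of the first aspect.
According to a sixth aspect, a computing device is provided. The computing device includes a processor and a memory. The memory is configured to store program code, and the processor is configured to invoke the program code in the memory to perform the fault information association reporting method for distributed systems provided in the second aspect or in any implementation of the second aspect.
According to a seventh aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the functions of the fault information association reporting method for distributed systems provided in the first aspect or in any implementation of the first aspect can be implemented.
According to an eighth aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the functions of the fault information association reporting method for distributed systems provided in the second aspect or in any implementation of the second aspect can be implemented.
According to a ninth aspect, this application provides a computer program product. The computer program includes instructions, and when the computer program is executed by a computer, the computer can perform the procedure of the fault information association reporting method for distributed systems provided in the first aspect or in any implementation of the first aspect.
According to a tenth aspect, this application provides a computer program product. The computer program includes instructions, and when the computer program is executed by a computer, the computer can perform the procedure of the fault information association reporting method for distributed systems provided in the second aspect or in any implementation of the second aspect.
According to an eleventh aspect, this application provides a chip system. The chip system includes a processor, configured to support the first device in implementing the functions involved in the first aspect. In a possible design, the chip system further includes a memory, configured to store the program instructions and data necessary for the data sending device. The chip system may consist of a chip, or may include a chip and other discrete components.
According to a twelfth aspect, this application provides a chip system. The chip system includes a processor, configured to support the second device in implementing the functions involved in the second aspect. In a possible design, the chip system further includes a memory, configured to store the program instructions and data necessary for the data sending device. The chip system may consist of a chip, or may include a chip and other discrete components.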
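It can be understood that the first device provided in the third aspect, the computing device provided in the fifth aspect, the computer-readable storage medium provided in the seventh aspect, the computer program product provided in the ninth aspect, and the chip system provided in the eleventh aspect are all used to perform the fault information association reporting method for distributed systems provided in the first aspect. For the beneficial effects they achieve, refer to the beneficial effects of the method provided in the first aspect; details are not repeated here.
It can be understood that the second device provided in the fourth aspect, the computing device provided in the sixth aspect, the computer-readable storage medium provided in the eighth aspect, the computer program product provided in the tenth aspect, and the chip system provided in the twelfth aspect are all used to perform the fault information association reporting method for distributed systems provided in the second aspect. For the beneficial effects they achieve, refer to the beneficial effects of the method provided in the second aspect; details are not repeated here.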
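Brief Description of the Drawings
FIG. 1 is a schematic diagram of a system architecture for a fault information association reporting method for distributed systems according to an embodiment of this application;
FIG. 2 is a schematic diagram of another system architecture for a fault information association reporting method for distributed systems according to an embodiment of this application;
FIG. 3 is a schematic flowchart of a fault information association reporting method for distributed systems according to an embodiment of this application;
FIG. 4 is a schematic diagram of a call relationship cache according to an embodiment of this application;
FIG. 5 is a schematic diagram of another call relationship cache according to an embodiment of this application;
FIG. 6 is a schematic diagram of a remote call according to an embodiment of this application;
FIG. 7 is a schematic diagram of a caching mechanism for association failure events according to an embodiment of this application;
FIG. 8 is a schematic flowchart of cache processing for association failure events according to an embodiment of this application;
FIG. 9 is a schematic structural diagram of a first device according to an embodiment of this application;
FIG. 10 is a schematic structural diagram of a second device according to an embodiment of this application;
FIG. 11 is a schematic structural diagram of a computing device according to an embodiment of this application;
FIG. 12 is a schematic structural diagram of another computing device according to an embodiment of this application.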
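Detailed Description of Embodiments
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the protection scope of this application.
First, some terms and related technologies used in this application are explained with reference to the accompanying drawings, to help a person skilled in the art understand them.
A distributed system is a software system built on top of a network. In a distributed system, a group of independent computers appears to the user as a single, unified whole, as if it were one system. The system has a variety of general-purpose physical and logical resources and can allocate tasks dynamically; the dispersed physical and logical resources exchange information over a computer network. To the user, a distributed system usually presents a single model or paradigm. A layer of software middleware on top of the operating system is responsible for implementing this model.
Event tracking (instrumentation) is a term from the data collection field, especially user behavior data collection. It refers to the techniques, and the process of applying them, for capturing, processing, and sending specific user behaviors or events. In essence, event tracking listens for events while a software application is running, and evaluates and captures an event of interest when it occurs. Event tracking comes in three forms: code-based, visual, and codeless. Code-based tracking adds code so that data is reported when the user triggers the corresponding behavior. Visual tracking uses visual interaction to configure the relevant events first and then collect data. Codeless tracking means that, once developers integrate a collection software development kit (SDK), the SDK directly captures and monitors all user behavior in the application and reports all of it, without developers adding extra code. Note that in the embodiments of this application, event tracking is unrelated to user behavior and concerns only faults on the service flow.
Remote procedure call (RPC) emerged mainly to solve the problem of communication transparency between distributed systems. With RPC, the user does not need to care which device hosts the service being called; from the user's point of view, calling the remote service is as safe and reliable as calling a local one. A remote call generally involves four stages: client send (CS), server receive (SR), server send (SS), and client receive (CR).
A universally unique identifier (UUID) is a software construction standard and is also part of the Distributed Computing Environment work of the Open Software Foundation (OSF). The most widely used UUIDs today are Microsoft's globally unique identifiers (GUIDs). The purpose of a UUID is to let every element in a distributed system have unique identifying information without a central controller assigning it. Each element can therefore create a UUID that does not conflict with those of other elements, and duplicate names need not be considered when a database is created. A UUID is a number generated on one machine that is guaranteed to be unique across all machines in the same space and time. Platforms usually provide an application programming interface (API) for generating UUIDs. The value is computed according to the standard defined by the OSF, using the Ethernet card address, a nanosecond-level timestamp, a chip ID code, and many other possible numbers.
A UUID consists of the following parts:
(1) The current date and time. The first part of a UUID is time-related: if a second UUID is generated a few seconds after the first, the first part differs and the rest stays the same.
(2) A clock sequence.
(3) A globally unique IEEE machine identifier.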
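The three parts listed above can be inspected with a short Python sketch that uses the standard-library `uuid` module. This is only an illustration, not part of the claimed method; the node value and clock sequence below are hypothetical constants chosen so that only the time-based part changes between the two identifiers.

```python
import uuid

# Hypothetical machine identifier (48 bits) and clock sequence (14 bits),
# fixed here so that only the timestamp differs between the two UUIDs.
NODE = 0x001122334455
CLOCK_SEQ = 0x1234

first = uuid.uuid1(node=NODE, clock_seq=CLOCK_SEQ)
second = uuid.uuid1(node=NODE, clock_seq=CLOCK_SEQ)


def parts(u: uuid.UUID) -> dict:
    """Return the time, clock-sequence, and node parts of a version-1 UUID."""
    return {"time": u.time, "clock_seq": u.clock_seq, "node": u.node}


print(parts(first))
print(parts(second))
```

Both identifiers share the same clock sequence and node, while the timestamp of the second is later than that of the first.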
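An event ID (Event ID) is a number that represents error information, that is, the reason that prevents a service from continuing. Different Event IDs represent different error information. From the Event ID it is possible to get a rough idea of the faulty module, the impact of the fault, and even the root cause, and then resolve the problem.
A fault log, that is, an error log, is a text file in which software records runtime error information. Programmers and maintenance personnel can use the error log to debug and maintain the system.
Open API, also called an open platform, is a common practice among service websites: the provider wraps its website services into a series of application programming interfaces (APIs) and opens them to third-party developers. This practice is called opening the website's API, and the opened APIs are called Open APIs.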
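To facilitate understanding of the embodiments of this application, the fault information reporting system architecture for distributed systems on which the embodiments are based is described first. FIG. 1 shows a fault information reporting system architecture according to an embodiment of this application, which includes at least one server and at least one device. FIG. 1 uses multiple servers and multiple devices as an example. Devices A, B, C, D, and E are the devices involved in a distributed service flow initiated by a user. The distributed service contains four pairs of RPC calls: between device A and device B, between device B and device C, between device C and device D, and between device D and device E. If a device fails while devices A, B, C, D, and E are processing the distributed service and the service becomes abnormal as a result, the failed device reports its fault information to the server.
Note that distributed services include but are not limited to wireless client requests, web page requests, and Open API requests. Taking a web page request as an example, when a user clicks on a web page, the page may call other devices (such as sub-servers), may exchange messages with other applications (that is, a messaging service is involved), and may also query and update a distributed database, read from and write to a distributed cache, store distributed objects, and so on.
In addition, distributed services also include operations in scenarios where a smart terminal cooperates with peripheral devices, such as collaborative file transfer. This means that this application can be applied to near-field distributed collaboration scenarios such as large-screen casting, PC collaboration, tablet collaboration, and smart wearables.
In a distributed system, calls between devices are intricate, and one device may be called by several others. FIG. 2 shows another fault information reporting system architecture according to an embodiment of this application. Devices A, C, D, F, and H are the devices involved in a distributed service initiated by a user. The distributed service contains six pairs of RPC calls: between device A and device C, between device A and device D, between device C and device D, between device C and device F, between device C and device H, and between device D and device H. If a device fails while devices A, C, D, F, and H are processing the distributed service and the service becomes abnormal as a result, the failed device reports its fault information to the server.
It can be understood that the devices in FIG. 1 and FIG. 2 include but are not limited to servers, routers, switches, gateways, and terminal devices such as mobile phones, computers, and tablets.
In addition, the servers in FIG. 1 and FIG. 2 may be ordinary servers (also called physical servers), which are real physical devices that can run in an equipment room and have their own hard disks, bandwidth, and so on. They may also be servers produced by virtualizing ordinary servers to varying degrees, in which case part of the server is virtualized and only part of it may be a real physical device. They may also be cloud servers, that is, central computing device clusters owned by a cloud service provider and used to provide computing, storage, and communication resources. A cloud server is highly distributed and highly virtualized; its computing resources are scheduled from a large number of consolidated, virtualized physical servers. In terms of node scale, this may be a few, dozens, or hundreds of physical servers, or a large cloud virtual resource pool built from thousands of physical machines across data centers. A cloud server therefore supports elastic adjustment of resources, meaning that CPU, memory, disk, bandwidth, and other resources can be freely increased or reduced, and it also offers good scalability and high reliability.
It can be understood that the fault information reporting system architectures shown in FIG. 1 and FIG. 2 are only two example implementations of the embodiments of this application; the architectures in the embodiments include but are not limited to the structures above.
To avoid collecting large amounts of normal service flow data and to reduce data storage and analysis costs, while collecting as much fault-related information as possible and improving fault localization accuracy, this application provides a fault information association reporting method for distributed systems. When a fault occurs, the method has the faulty device not only report its own fault-related information but also notify its peer devices to report fault information, so that the server has more fault-related information available for fault localization. This reporting method improves the efficiency of fault localization.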
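The fault information association reporting method for distributed systems in the embodiments of this application is described below with reference to the schematic flowchart in FIG. 3. As shown in FIG. 3, the method may include the following steps:
S310: The first device caches a first call relationship.
Specifically, a user initiates one or more distributed services. Each initiated distributed service flow may involve at least one device, and call relationships exist between the devices, as shown in Table 1:
Figure PCTCN2021118807-appb-000001
Table 1
In addition, each device involved in processing a distributed service may cache its own call relationships. For example, the first device and the second device in the embodiments of this application are devices involved in processing one or more distributed services initiated by a user. The first device caches a first call relationship, and the second device caches a second call relationship, as shown in Table 2:
Figure PCTCN2021118807-appb-000002
Table 2
Optionally, the first device may be any device in the fault information reporting system architectures shown in FIG. 1 and FIG. 2. For example, the first device may be device A in FIG. 1, in which case the second device is device B in FIG. 1. Alternatively, the first device may be device C in FIG. 2, in which case the second devices are the devices that have call relationships with device C in FIG. 2, namely device A, device D, device F, and device H.
As shown in Table 2, the first call relationship includes the device call information of the distributed services that the first device participates in processing. This device call information is the call information related to the first device in those distributed services, that is, information about the first device calling other devices and about the first device being called by other devices.
Note that, in an embodiment of this application, the device call information of the one or more distributed services that the first device participates in processing may include the peer type of the first device, the service type of the distributed service, the communication information related to the first device in the distributed service, and a timestamp. The peer type includes the UUID of the peer device and the peer device type. The peer device is a device that the first device calls, or that calls the first device, in the distributed service. The peer device type includes but is not limited to a terminal, a server, a router, and a module inside a software and/or hardware system. The service type includes but is not limited to: category-1 basic telecommunications services, such as fixed communication services, cellular mobile communication services, category-1 satellite communication services, and category-1 data communication services; category-2 basic telecommunications services, such as trunked communication services, radio paging services, category-2 satellite communication services, category-2 data communication services, network access services, domestic communication facility services, and network hosting services; and value-added telecommunications services, such as category-1 and category-2 value-added telecommunications services. The communication information includes but is not limited to the communication information between the first device and the called device when the first device is the caller, and between the first device and the calling device when the first device is the callee. The timestamp includes but is not limited to the call time when the first device is the caller and the time of being called when the first device is the callee.
In addition, the first call relationship may be classified into at least two states according to the lifecycle of the service, and the first call relationships in different states are then cached separately. First call relationships in different states correspond to services with different lifecycles.
In an embodiment of this application, the first call relationship may be classified into a cold state, a warm state, and a hot state according to the lifecycle of the service, and the first call relationships marked as cold, warm, and hot are then cached separately. Specifically, when the lifecycle of the service is less than or equal to a first threshold, the first call relationship is marked as hot; when the lifecycle is greater than the first threshold and less than or equal to a second threshold, it is marked as warm; when the lifecycle is greater than the second threshold, it is marked as cold. The first call relationships in these three states are cached in three separate regions that do not affect one another.
It can be understood that the first threshold and the second threshold are set by developers according to actual conditions, and are not limited in this application.
For example, assume the first threshold is set to 3 minutes and the second threshold to 4 hours. If a screen-casting operation that projects a mobile phone screen onto a PC screen has lasted 3 hours, the call relationship between the phone and the PC is marked as warm. If a Bluetooth pairing lasts 5 hours before being unbound, the call relationship between the Bluetooth devices is marked as cold.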
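The hot/warm/cold rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the threshold values are the example values from the text, and the names `classify` and `CallRelationCache` are hypothetical, not taken from the application.

```python
from datetime import timedelta

# Example thresholds from the text; in practice developers choose them.
FIRST_THRESHOLD = timedelta(minutes=3)
SECOND_THRESHOLD = timedelta(hours=4)


def classify(lifecycle: timedelta) -> str:
    """Map a service lifecycle to the hot, warm, or cold state."""
    if lifecycle <= FIRST_THRESHOLD:
        return "hot"
    if lifecycle <= SECOND_THRESHOLD:
        return "warm"
    return "cold"


class CallRelationCache:
    """Keeps call relationships of each state in its own region."""

    def __init__(self) -> None:
        self.regions = {"hot": {}, "warm": {}, "cold": {}}

    def put(self, service_id: str, call_info: dict, lifecycle: timedelta) -> str:
        state = classify(lifecycle)
        self.regions[state][service_id] = call_info
        return state


cache = CallRelationCache()
casting_state = cache.put("screen-casting", {"peer": "PC"}, timedelta(hours=3))
bluetooth_state = cache.put("bluetooth-pairing", {"peer": "headset"}, timedelta(hours=5))
```

With the example thresholds, the 3-hour screen-casting service lands in the warm region and the 5-hour Bluetooth pairing in the cold region, matching the example in the text.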
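In another embodiment of this application, the first call relationship may be classified into a first state, a second state, a third state, and a fourth state according to the lifecycle of the service, and the first call relationships marked with these four states are then cached separately. Specifically, when the lifecycle of the service is less than or equal to a third threshold, the first call relationship is marked as the first state; when the lifecycle is greater than or equal to a fourth threshold and less than or equal to a fifth threshold, it is marked as the second state; when the lifecycle is greater than or equal to a sixth threshold and less than or equal to a seventh threshold, it is marked as the third state; when the lifecycle is greater than or equal to an eighth threshold, it is marked as the fourth state. The first call relationships in these four states are cached in four separate regions that do not affect one another.
It can be understood that the fourth threshold is greater than the third threshold, the fifth threshold is greater than or equal to the fourth threshold, the sixth threshold is greater than the fifth threshold, the seventh threshold is greater than or equal to the sixth threshold, and the eighth threshold is greater than the seventh threshold. In addition, the third to eighth thresholds may be set by developers according to actual conditions; their specific values are not limited in this application.
For example, if the fourth threshold and the fifth threshold are both set to 2 hours, the device call relationship involved in a service is marked as the second state if and only if the lifecycle of the service is exactly 2 hours.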
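S320: When a fault occurs while the first device is processing a first distributed service, the first device reports first fault information to the server.
Specifically, when a fault occurs while the first device is processing the first distributed service, the first device generates first fault information and reports it to the server.
Note that, in an embodiment of this application, the first fault information includes a first fault event and a first fault log. The first fault event includes the device type of the faulty device (that is, the first device), an Event ID, a fault time, a fault module, and an exception type. The device type includes but is not limited to a terminal, a server, a router, and a module inside a software and/or hardware system. The fault module is the module in the first device where the fault occurred, which may be a hardware module or a software module. The exception type is the type of fault that occurred in the first device, such as no response, stuttering, negotiation failure, or timeout. The first fault log is the fault-related log of the first device.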
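One possible in-memory shape for the fault information described above is sketched below in Python. The field names follow the text; the class names, the example values, and the choice of a dictionary as the report payload are assumptions made only for illustration.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class FaultEvent:
    device_type: str      # e.g. terminal, server, router
    event_id: int         # number identifying the error information
    fault_time: str       # when the fault occurred
    fault_module: str     # hardware or software module that failed
    exception_type: str   # e.g. no response, negotiation failure, timeout


@dataclass
class FaultInfo:
    event: FaultEvent
    fault_log: list = field(default_factory=list)  # fault-related log records


def build_report(info: FaultInfo) -> dict:
    """Flatten the fault information into a plain dict ready to be sent."""
    return asdict(info)


# Hypothetical example values.
report = build_report(
    FaultInfo(
        event=FaultEvent("terminal", 1001, "2021-09-16T10:00:00", "casting", "timeout"),
        fault_log=["casting: negotiation timed out"],
    )
)
```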
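S330: The first device looks up the second device in the first call relationship.
Specifically, as described above, the first device caches the first call relationship, and the first call relationship includes the device call information of the distributed services that the first device participates in processing. The second device can therefore be found through the first call relationship. The second device includes a device that calls the first device, or is called by the first device, when the first distributed service is executed.
In an embodiment of this application, peer devices can be found by assigning an identifier to each distributed service.
Note that the identifier may be cached in the first call relationship in the first device. FIG. 4 is a schematic diagram of caching the device call relationships of distributed services. As FIG. 4 shows, the identifier can be used as a key index for caching the device call information: the identifier distinguishes different distributed services, and the device call information of each distributed service is cached separately. To find the device call information of a distributed service, it is enough to look up the identifier of that service and retrieve the cached device call information. Optionally, FIG. 5 is a schematic diagram of another way of caching the device call relationships of distributed services. As FIG. 5 shows, the identifier of a distributed service can also be stored together with its device call information, with the identifiers and device call information of different distributed services placed in different cache spaces. In this case, finding the device call relationship of a distributed service requires checking the cache spaces one by one until the one containing the identifier of that service is found, which yields its device call information.
When an identifier is assigned to each distributed service in this way, the first call relationship includes the identifiers of the one or more distributed services initiated by the user, as well as the UUIDs of the peer devices. When a fault occurs while the first device is processing the first distributed service, the first device generates first fault information, which includes not only the first fault event and the first fault log described above but also the first identifier corresponding to the first distributed service. The first device can therefore use the first identifier to find first device call information in the first call relationship, where the first device call information is the device call information of the first distributed service in the first call relationship. This yields the UUID and communication information of the first device's peer device, that is, the UUID of the second device and the communication information between the first device and the second device.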
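The two cache layouts described for FIG. 4 and FIG. 5, and the lookup by identifier, can be sketched as follows. This is an illustrative Python sketch only: the identifiers, UUID strings, and function names are hypothetical, and the actual layouts in the figures are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeerCall:
    peer_uuid: str   # UUID of the peer device
    peer_type: str   # e.g. terminal, server, router
    direction: str   # "calls" or "called_by"


# FIG. 4 style (assumed): the service identifier is the key of an index.
keyed_cache = {
    "trace-001": [PeerCall("uuid-B", "terminal", "calls")],
    "trace-002": [
        PeerCall("uuid-A", "terminal", "called_by"),
        PeerCall("uuid-D", "router", "calls"),
    ],
}

# FIG. 5 style (assumed): each cache space holds an identifier and its calls.
separate_spaces = [(trace_id, calls) for trace_id, calls in keyed_cache.items()]


def find_peers_by_key(cache: dict, trace_id: str) -> list:
    """Direct lookup using the identifier as the key."""
    return [call.peer_uuid for call in cache.get(trace_id, [])]


def find_peers_by_scan(spaces: list, trace_id: str) -> list:
    """Check the cache spaces one by one until the identifier is found."""
    for stored_id, calls in spaces:
        if stored_id == trace_id:
            return [call.peer_uuid for call in calls]
    return []
```

Both functions return the same peers; the keyed layout avoids the linear scan at the cost of maintaining an index.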
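Note that, in an embodiment of this application, the identifier can be obtained by instrumenting the middleware of the distributed system, which also yields the call relationships of the devices. Instrumenting the middleware makes it possible to trace the service flow while a distributed service is in progress, so that, in the devices participating in different distributed services, the log records related to each service can be associated with a different identifier. It also yields device information such as the UUID, branch information, and device type of the devices involved in the distributed service, as well as the call relationships between devices. It can be understood that the middleware includes but is not limited to remote procedure call middleware, data access middleware, message middleware, transaction middleware, object middleware, and terminal emulation/screen conversion middleware.
For example, Huawei's distributed call chain tracing system (the Hitrace system) can be used to obtain the identifier, the first call relationship, and the second call relationship, with switch configuration added to the Hitrace system. First, the first-level switch is turned on to start tracing. If the user initiates a distributed service at this point, the Hitrace system traces the service and records the work information of processing it. The Hitrace system also adds an identifier, the Trace ID, to the corresponding log records of the related devices, and uses the Trace ID to associate those log records with the distributed service, so that the records corresponding to the service can later be found in the logs of the related devices by Trace ID. The related devices are all devices that participate in processing the distributed service. Then the second-level switch is turned on to trace information between devices. The Hitrace system now records the call information of the distributed service, that is, it obtains information about the devices at both ends of each call and thereby determines the order of the device calls. For example, FIG. 6 shows a remote call from device A to device B: device A (the client) initiates a request (CS); device B (the server) receives the request (SR); device B (the server) processes it and sends the result to device A (the client) (SS); finally, device A (the client) obtains the information returned by device B (the server) (CR). It can be understood that when the second-level switch is on, the order of the calls between device A and device B is known explicitly, and the log records related to the four time points CS, SR, SS, and CR can be output.
Optionally, besides turning on a switch to trace information between devices, in some embodiments of this application a switch can be turned on to trace information between processes or between threads.
It can be understood that the Trace ID mentioned in the example is one form of the identifier described above. The identifier may be obtained and represented in other ways, which is not limited in this application.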
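S340: The first device sends a first notification to the second device.
Specifically, after obtaining the UUID of the second device from the first call relationship, the first device sends the first notification to the second device according to that UUID. The first notification includes the Event ID and the first identifier.
Optionally, the first device may check whether the first identifier and the UUID of the second device already exist in the device. If they do, the first device has already sent the first notification to the second device, that is, the second device has already reported the second fault information. In this case the first device does not need to send the first notification to the second device, that is, it does not need to transmit the UUID of the second device, the Event ID, and the first identifier to the second device.
Note that the first device may fail to send the first notification to the second device. For example, when a communication fault occurs between the first device and the second device, the first device cannot send messages to the second device, which means the first device cannot notify the second device to report the second fault information at that time.
When this happens, as shown in FIG. 7, the first device caches a first association failure event. Before caching it, the first device first determines whether there is enough cache space for the first association failure event. When there is currently enough cache space, the first device caches the first association failure event. When there is currently not enough cache space, the first device clears a second association failure event. If there is enough cache space after the second association failure event is cleared, the first device caches the first association failure event; if there is still not enough cache space, the first device clears a third association failure event.
Note that the second association failure event is the association failure event that has been cached longest in the first device, and the third association failure event is the association failure event that has been cached longest in the first device after the second association failure event is cleared. In an embodiment of this application, an association failure event includes the time of the notification failure, the identifier, the Event ID, and the UUID of the peer device, where the peer device is a device that the faulty device calls, or that calls the faulty device, in the abnormal distributed service. In this case, the first association failure event includes the time at which the first device failed to send the first notification, the Event ID corresponding to the fault in the first device, the first identifier, and the UUID of the second device.
In addition, the cache space may be a preset space, that is, a space distinct from system storage or memory.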
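The caching rule above amounts to evicting the longest-cached association failure events until the new one fits. The Python sketch below illustrates this under stated assumptions: cache space is counted in abstract units, each event carries a hypothetical `size` field, and the class and field names are invented for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class AssociationFailureEvent:
    failed_at: float   # time at which sending the notification failed
    trace_id: str      # identifier of the distributed service
    event_id: int      # Event ID of the fault
    peer_uuid: str     # UUID of the peer device that could not be notified
    size: int = 1      # hypothetical space cost, in abstract units


class FailureEventCache:
    """Preset cache space that drops the oldest events when it is full."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.used = 0
        self.events = deque()  # oldest event on the left

    def put(self, event: AssociationFailureEvent) -> bool:
        if event.size > self.capacity:
            return False  # can never fit, even in an empty cache
        # Clear the longest-cached event repeatedly until there is room.
        while self.capacity - self.used < event.size:
            oldest = self.events.popleft()
            self.used -= oldest.size
        self.events.append(event)
        self.used += event.size
        return True


cache = FailureEventCache(capacity=2)
for n in range(3):
    cache.put(AssociationFailureEvent(float(n), f"trace-{n}", 1001, "uuid-B"))
cached_ids = [e.trace_id for e in cache.events]
```

After three insertions into a two-unit cache, only the two most recent events remain; the refusal of an event larger than the whole cache is a design choice of this sketch, not something the text specifies.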
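It can be understood that the failure of the first device to send the first notification to the second device, as described above, covers both the first device failing to send the first notification and the second device failing to receive it.
In addition, as shown in FIG. 8, when the first device comes back online or processes a distributed service again, the first device checks whether the first association failure event exists in the device. If it exists, the first device sends the first notification to the second device, that is, it transmits the Event ID and the first identifier in the first association failure event to the second device, to notify the peer (the second device) to report fault information. If the first notification is sent successfully, the cached first association failure event is cleared.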
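The resend-on-recovery flow of FIG. 8 can be sketched as a single pass over the cached events. In the Python sketch below, `send` is a hypothetical callback that returns `True` when the notification is delivered; the event layout and all names are assumptions for illustration.

```python
def resend_pending(pending: list, send) -> list:
    """Retry every cached association failure event once.

    Events whose notification is delivered are dropped; the rest are kept.
    """
    still_pending = []
    for event in pending:
        notification = {"event_id": event["event_id"], "trace_id": event["trace_id"]}
        if not send(event["peer_uuid"], notification):
            still_pending.append(event)
    return still_pending


# Hypothetical usage: the link to uuid-B is back, the link to uuid-C is not.
delivered = []


def fake_send(peer_uuid: str, notification: dict) -> bool:
    if peer_uuid == "uuid-C":
        return False
    delivered.append((peer_uuid, notification))
    return True


pending = [
    {"peer_uuid": "uuid-B", "event_id": 1001, "trace_id": "trace-001"},
    {"peer_uuid": "uuid-C", "event_id": 1001, "trace_id": "trace-001"},
]
remaining = resend_pending(pending, fake_send)
```

A device would call such a function when it comes back online or starts processing a distributed service again.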
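S350: The second device receives the first notification sent by the first device.
Specifically, the second device receives the first notification sent by the first device, that is, it receives the Event ID corresponding to the fault in the first device and the first identifier. From the received first identifier, the second device learns that the first distributed service has become abnormal.
Before receiving the first notification, the second device also needs to cache a second call relationship. The second call relationship includes the device call information of the distributed services that the second device participates in processing. This device call information is the call information related to the second device in those distributed services, that is, information about the second device calling other devices and about the second device being called by other devices.
Similar to the first call relationship, the device call information of the distributed services that the second device participates in processing may include the peer type of the second device, the service type of the distributed service, the communication information related to the second device in the distributed service, and a timestamp. For details, refer to the description of the device call information of the one or more distributed services that the first device participates in processing; details are not repeated here.
In addition, the second call relationship may be classified into at least two states according to the lifecycle of the service, where second call relationships in different states correspond to services with different lifecycles, and the second call relationships in different states are then cached separately. Refer to the examples given above for the first call relationship; details are not repeated here.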
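S360: The second device reports second fault information to the server.
Specifically, after receiving the first notification sent by the first device, the second device reports the second fault information to the server. Note that, in an embodiment of this application, the second fault information includes the first identifier, the Event ID, and a second fault log, where the second fault log consists of the fault-related log records in the second device's log.
It can be understood that the first device and the second device both participate in processing the first distributed service, so both cache the device call information of the first distributed service and the first identifier.
Optionally, after receiving the first notification sent by the first device, the second device checks whether the first identifier already exists in the device. If it does, the second device has already reported the second fault information. In this case the second device does not need to report the second fault information to the server, that is, it does not need to report the received Event ID, the first identifier, and the second fault log to the server.
For example, in the fault information reporting system architecture shown in FIG. 2, a distributed service initiated by a user involves five devices: device A, device C, device D, device F, and device H. Call relationships exist between device A and device C, device A and device D, device C and device D, device C and device F, device C and device H, and device D and device H. In general, when device A fails, device A reports fault information and sends a message to device C and device D, which have call relationships with it, to notify them to report fault information. However, when device C holds information related to device A, multi-level association is required: after receiving the message from device A, device C sends messages to device D, device F, and device H, which have call relationships with it; after receiving the message from device A, device D sends messages to device C and device H, which have call relationships with it. Because device A has call relationships with both device C and device D, and because, as described above, a second device that receives a first notification checks whether the first identifier already exists in the device and does not report again if it does, device C, on receiving the message from device D, checks whether it has already reported fault information related to this fault. Having found that it has, device C does not report again.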
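The multi-level association in the FIG. 2 example can be simulated with a short Python sketch. It assumes the six call pairs listed in the text, treats notifications as a breadth-first walk over the call graph, and lets each device ignore a notification for a fault it has already reported. The function names are hypothetical, and the optional sender-side check described in the text is left out for simplicity.

```python
from collections import deque

# The six call pairs of the FIG. 2 example.
CALL_PAIRS = {("A", "C"), ("A", "D"), ("C", "D"), ("C", "F"), ("C", "H"), ("D", "H")}


def peers(device: str) -> list:
    """Devices that call, or are called by, the given device."""
    return sorted({b if a == device else a for a, b in CALL_PAIRS if device in (a, b)})


def propagate(faulty: str):
    """Return the order in which devices report, and how many notifications were ignored."""
    reported = [faulty]      # the faulty device reports first
    queue = deque([faulty])
    ignored = 0
    while queue:
        sender = queue.popleft()
        for peer in peers(sender):
            if peer in reported:
                ignored += 1  # peer already reported for this identifier
                continue
            reported.append(peer)
            queue.append(peer)
    return reported, ignored


report_order, ignored_count = propagate("A")
```

In this simulation every device involved in the service reports exactly once, even though devices C, D, and H each receive more than one notification.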
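Note that, in an embodiment of this application, the fault information that the faulty device (the first device) reports to the server differs from the fault information reported to the server by the other devices that have call relationships with it (such as the second device). In this embodiment, when a device fails, it generates a fault event, and the identifier of the corresponding service is shown at the same time. The device collects the fault log corresponding to the fault event and reports the fault event, the fault log, and the identifier to the server. The faulty device then sends a message to notify the devices that have call relationships with it to report fault information, and those devices report their fault information to the server after receiving the message. It can be understood that the fault event may include a device type, a fault time, an Event ID, a fault module, and an exception type. The device type may be a terminal device such as a mobile phone, or another device such as a router. The fault time is the time at which the fault occurred. The Event ID represents the error information that identifies the type of fault; for example, Windows Event IDs range from 0 to 5073, and each Event ID represents different error information, so when a fault occurs, the type of fault on the device can be determined from the Event ID. The fault module may be a hardware or software module inside the device, or a process or a piece of middleware. The exception type includes but is not limited to no response, stuttering, negotiation failure, and timeout.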
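S370: The second device looks up the third device in the second call relationship.
Specifically, as described above, the second call relationship includes the identifiers of the one or more distributed services initiated by the user, as well as the UUIDs of the peer devices. After receiving the first notification sent by the first device, the second device can use the first identifier in the first notification to find second device call information in the second call relationship, where the second device call information is the device call information of the first distributed service in the second call relationship. This yields the UUID and communication information of the second device's peer device, that is, the UUID of the third device and the communication information between the second device and the third device.
S380: The second device sends a second notification to the third device.
Specifically, after obtaining the UUID of the third device from the second call relationship, the second device sends the second notification to the third device according to that UUID. The second notification includes the Event ID and the first identifier.
Optionally, the second device may check whether the first identifier and the UUID of the third device already exist in the device. If they do, the second device has already sent the second notification to the third device, that is, the third device has already reported the third fault information. In this case the second device does not need to send the second notification to the third device, that is, it does not need to transmit the UUID of the third device, the Event ID, and the first identifier to the third device.
Note that the second device may fail to send the second notification to the third device. When this happens, the second device determines whether there is enough cache space to cache a fourth association failure event. When there is currently enough cache space, the second device caches the fourth association failure event. When there is currently not enough cache space, the second device clears a fifth association failure event. If there is enough cache space after the fifth association failure event is cleared, the second device caches the fourth association failure event; if there is still not enough cache space, the second device clears a sixth association failure event.
Note that the fifth association failure event is the association failure event that has been cached longest in the second device, and the sixth association failure event is the association failure event that has been cached longest in the second device after the fifth association failure event is cleared.
It can be understood that the failure of the second device to send the second notification to the third device covers both the second device failing to send the second notification and the third device failing to receive it.
In addition, when the second device comes back online or processes a distributed service again, the second device checks whether the fourth association failure event exists in the device. If it exists, the second device sends the second notification to the third device, that is, it transmits the Event ID and the first identifier in the fourth association failure event to the third device. If the notification is sent successfully, the cached fourth association failure event is cleared.
It can be understood that the way the second device notifies the third device to report the third fault information is the same as the way the first device notifies the second device to report the second fault information. No further examples are given here; refer to the examples in the description of the first device notifying the second device.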
上述详细阐述了本申请实施例的方法,为了便于更好的实施本申请实施例的上述方案,相应地,下面还提供用于配合实施的相关设备。
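FIG. 9 is a schematic structural diagram of a first device provided in this application. The first device is configured to perform the fault information association reporting method for distributed systems described in FIG. 3 above. This application does not limit how the functional units of the first device are divided; units in the first device may be added, removed, or combined as required. In addition, the operations and/or functions of the units in the first device serve to implement the corresponding procedures of the method described in FIG. 3; for brevity, details are not repeated here. FIG. 9 provides one example division of functional units: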
第一设备900包括第一缓存单元910、第一处理单元920和第一发送单元930。
第一缓存单元910,用于缓存第一调用关系,所述第一调用关系包括用户发起的第一设备参与处理的一个或多个分布式业务的设备调用信息。
第一处理单元920,用于当第一设备处理第一分布式业务发生故障时,上报第一故障信息到服务器;从所述第一调用关系中查找第二设备;所述第二设备包括执行所述第一分布式业务时调用所述第一设备或被所述第一设备调用的设备。
第一发送单元930,用于向所述第二设备发送第一通知,所述第一通知用于指示所述第二设备将第二故障信息上报给所述服务器;或者,在所述第二设备没有上报所述第二故障信息的情况下,所述第一设备向所述第二设备发送所述第一通知;所述第二故障信息包括所述第二设备处理第一分布式业务时的故障信息。
上述三个单元之间互相可通过通信通路进行数据传输,应理解,第一设备900包括的各单元可以为软件单元、也可以为硬件单元,还可以部分为软件单元部分为硬件单元。
如图10所示,图10是本申请提供的一种第二设备的结构示意图,该第二设备用于执行上述图3所述的面向分布式***的故障信息关联上报方法。本申请对该第二设备的功能单元的划分不做限定,可以根据需要对该第二设备中的各个单元进行增加、减少或合并。此外,第二设备中的各个单元的操作和/或功能分别为了实现上述图3所描述的方法的相应流程,为了简洁,在此不再赘述。图10示例性的提供了一种功能单元的划分:
第二设备1000包括第二缓存单元1010、第一接收单元1020和第二处理单元1030。
第二缓存单元1010,用于缓存第二调用关系,所述第二调用关系包括用户发起的第二设备参与处理的一个或多个分布式业务的设备调用信息。
第一接收单元1020,用于接收第一设备发送的第一通知,所述第一通知用于指示所述第二设备将第二故障信息上报给所述服务器。
第二处理单元1030,所述第二设备将第二故障信息上报给所述服务器;或者,在所述第二设备没有上报所述第二故障信息的情况下,所述第二设备将第二故障信息上报给所述服务器;所述第二故障信息包括所述第二设备处理第一分布式业务时的故障信息。
在一种可能的实现方式中,所述第二设备1000还包括:第二发送单元1040,所述第二发送单元,用于向所述第三设备发送第二通知,所述第二通知用于指示所述第三设备将第三 故障信息上报给所述服务器;或者,在所述第三设备没有上报所述第三故障信息的情况下,所述第二设备向所述第三设备发送所述第二通知;所述第三故障信息包括所述第三设备处理第一分布式业务时的故障信息。
上述四个单元之间互相可通过通信通路进行数据传输,应理解,第二设备1000包括的各单元可以为软件单元、也可以为硬件单元,还可以部分为软件单元部分为硬件单元。
参见图11,图11是本申请实施例提供的一种计算设备的结构示意图。如图11所示,该计算设备1100包括:处理器1110、通信接口1120以及存储器1130,所述处理器1110、通信接口1120以及存储器1130通过内部总线1140相互连接。
所述计算设备1100可以是图9中的第一设备900,图9中的第一设备900所执行的功能实际上是由所述第一设备900的处理器1110来执行。
所述处理器1110可以由一个或者多个通用处理器构成,例如中央处理器(central processing unit,CPU),或者CPU和硬件芯片的组合。上述硬件芯片可以是专用集成电路(application-specific integrated circuit,ASIC)、可编程逻辑器件(programmable logic device,PLD)或其组合。上述PLD可以是复杂可编程逻辑器件(complex programmable logic device,CPLD)、现场可编程逻辑门阵列(field-programmable gate array,FPGA)、通用阵列逻辑(generic array logic,GAL)或其任意组合。
通信接口1120用于与其他设备或通信网络通信,如以太网,无线接入网(RAN),核心网,无线局域网(Wireless Local Area Networks,WLAN)等。
总线1140可以是外设部件互连标准(peripheral component interconnect,PCI)总线或扩展工业标准结构(extended industry standard architecture,EISA)总线等。所述总线1140可以分为地址总线、数据总线、控制总线等。为便于表示,图11中仅用一条粗线表示,但不表示仅有一根总线或一种类型的总线。
存储器1130可以包括易失性存储器(volatile memory),例如随机存取存储器(random access memory,RAM);存储器1130也可以包括非易失性存储器(non-volatile memory),例如只读存储器(read-only memory,ROM)、快闪存储器(flash memory)、硬盘(hard disk drive,HDD)或固态硬盘(solid-state drive,SSD);存储器1130还可以包括上述种类的组合。存储器1130用于存储执行以上述面向分布式***的故障信息关联上报方法实施例的程序代码,在一种实施方式中,存储器1130还可以缓存其他数据,并由处理器1110来控制执行,以实现第一设备900所示的功能单元,或者用于实现图3所示的方法实施例中以第一设备900为执行主体的方法步骤。具体如下:
处理器1110控制存储器1130缓存第一调用关系,所述第一调用关系包括用户发起的第一设备参与处理的一个或多个分布式业务的设备调用信息;
当第一设备900处理第一分布式业务发生故障时,处理器1110控制通信接口1120上报第一故障信息到服务器;
处理器1110从所述第一调用关系中查找第二设备;所述第二设备包括执行所述第一分布式业务时调用所述第一设备900或被所述第一设备900调用的设备;
处理器1110控制通信接口1120向所述第二设备发送第一通知,所述第一通知用于指示所述第二设备将第二故障信息上报给所述服务器;或者,在所述第二设备没有上报所述第二故障信息的情况下,所述第一设备900向所述第二设备发送所述第一通知;所述第二故障信息包括所述第二设备处理第一分布式业务时的故障信息。
在其中一种实现方式中,处理器1110从所述第一调用关系中查找第二设备,包括:处理器1110通过第一标识,从第一设备调用信息中查找第二设备;所述第一设备调用信息为所述第一调用关系中所述第一分布式业务的设备调用信息。
在其中一种实现方式中,处理器1110将所述第一调用关系根据业务的生命周期分为至少两种状态;不同状态的第一调用关系对应的业务的生命周期不同;处理器1110将所述不同状态的第一调用关系分开缓存在存储器1130中。
在其中一种实现方式中,若所述第一设备900向所述第二设备发送所述第一通知时,发送失败,存储器1130缓存第一关联失败事件;所述第一关联失败事件用于表征发送所述第一通知失败。
在其中一种实现方式中,处理器1110确定是否有足够的缓存空间用来缓存所述第一关联失败事件;当前有足够的缓存空间用来缓存所述第一关联失败事件时,处理器1110控制存储器1130缓存所述第一关联失败事件;当前无足够的缓存空间用来缓存所述第一关联失败事件时,处理器1110清除第二关联失败事件;所述第二关联失败事件是所述第一设备中缓存时间最长的关联失败事件;若清除所述第二关联失败事件后,有足够的缓存空间用来缓存所述第一关联失败事件,处理器1110控制存储器1130缓存所述第一关联失败事件;若清除所述第二关联失败事件后,仍无足够的缓存空间用来缓存所述第一关联失败事件,处理器1110清除第三关联失败事件;所述第三关联失败事件是清除所述第二关联失败事件后,所述第一设备中缓存时间最长的关联失败事件。
在其中一种实现方式中,当所述第一设备900重新上线或者再次处理分布式业务时,处理器1110检查是否存在所述第一关联失败事件;当存在所述第一关联失败事件时,处理器1110控制通信接口1120向所述第二设备发送所述第一通知;成功发送所述第一通知时,处理器1110清除缓存的所述第一关联失败事件。
参见图12,图12是本申请实施例提供的一种计算设备的结构示意图。如图12所示,该计算设备1200包括:处理器1210、通信接口1220以及存储器1230,所述处理器1210、通信接口1220以及存储器1230通过内部总线1240相互连接。
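The computing device 1200 may be the second device 1000 in FIG. 10. The functions performed by the second device 1000 in FIG. 10 are in fact performed by the processor 1210 of the second device 1000.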
所述处理器1210可以由一个或者多个通用处理器构成,例如中央处理器(central processing unit,CPU),或者CPU和硬件芯片的组合。上述硬件芯片可以是专用集成电路(application-specific integrated circuit,ASIC)、可编程逻辑器件(programmable logic device,PLD)或其组合。上述PLD可以是复杂可编程逻辑器件(complex programmable logic device,CPLD)、现场可编程逻辑门阵列(field-programmable gate array,FPGA)、通用阵列逻辑(generic array logic,GAL)或其任意组合。
通信接口1220用于与其他设备或通信网络通信,如以太网,无线接入网(RAN),核心网,无线局域网(Wireless Local Area Networks,WLAN)等。
总线1240可以是外设部件互连标准(peripheral component interconnect,PCI)总线或扩展工业标准结构(extended industry standard architecture,EISA)总线等。所述总线1240可以分为地址总线、数据总线、控制总线等。为便于表示,图12中仅用一条粗线表示,但不表示仅有一根总线或一种类型的总线。
存储器1230可以包括易失性存储器(volatile memory),例如随机存取存储器(random  access memory,RAM);存储器1230也可以包括非易失性存储器(non-volatile memory),例如只读存储器(read-only memory,ROM)、快闪存储器(flash memory)、硬盘(hard disk drive,HDD)或固态硬盘(solid-state drive,SSD);存储器1230还可以包括上述种类的组合。存储器1230用于存储执行以上述面向分布式***的故障信息关联上报方法实施例的程序代码,在一种实施方式中,存储器1230还可以缓存其他数据,并由处理器1210来控制执行,以实现第二设备1000所示的功能单元,或者用于实现图3所示的方法实施例中以第二设备1000为执行主体的方法步骤。具体如下:
处理器1210控制存储器1230缓存第二调用关系,所述第二调用关系包括用户发起的第二设备1000参与处理的一个或多个分布式业务的设备调用信息;
第二设备1000中的处理器1210通过控制通信接口1220接收第一设备发送的第一通知,所述第一通知用于指示所述第二设备将第二故障信息上报给所述服务器;
处理器1210将第二故障信息上报给所述服务器;或者,在所述第二设备1000没有上报所述第二故障信息的情况下,所述第二设备1000将第二故障信息上报给所述服务器;所述第二故障信息包括所述第二设备1000处理第一分布式业务时的故障信息。
在其中一种实现方式中,处理器1210从所述第二调用关系中查找第三设备;所述第三设备包括执行所述第一分布式业务时调用所述第二设备1000或被所述第二设备1000调用的设备;处理器1210控制通信接口1220向所述第三设备发送第二通知,所述第二通知用于指示所述第三设备将第三故障信息上报给所述服务器;或者,在所述第三设备没有上报所述第三故障信息的情况下,所述第二设备1000向所述第三设备发送所述第二通知;所述第三故障信息包括所述第三设备处理第一分布式业务时的故障信息。
在其中一种实现方式中,处理器1210通过第一标识,从所述第二设备调用信息中查找第三设备;所述第二设备调用信息为所述第二调用关系中所述第一分布式业务的设备调用信息。
在其中一种实现方式中,处理器1210将所述第二调用关系根据业务的生命周期分为至少两种状态;不同状态的第二调用关系对应的业务的生命周期不同;处理器1210将所述不同状态的第二调用关系分开缓存到存储器1230中。
在其中一种实现方式中,若处理器1210通过控制通信接口1220向所述第三设备发送所述第二通知时,发送失败,处理器1210控制存储器1230缓存第四关联失败事件;所述第四关联失败事件用于表征发送所述第二通知失败。
在其中一种实现方式中,处理器1210确定是否有足够的缓存空间用来缓存所述第四关联失败事件;当前有足够的缓存空间用来缓存所述第四关联失败事件时,处理器1210控制存储器1230缓存所述第四关联失败事件;当前无足够的缓存空间用来缓存所述第四关联失败事件时,处理器1210清除第五关联失败事件;所述第五关联失败事件是所述第二设备中缓存时间最长的关联失败事件;若清除所述第五关联失败事件后,有足够的缓存空间用来缓存所述第四关联失败事件,处理器1210控制存储器1230缓存所述第四关联失败事件;若清除所述第五关联失败事件后,仍无足够的缓存空间用来缓存所述第四关联失败事件,处理器1210清除第六关联失败事件;所述第六关联失败事件是清除所述第五关联失败事件后,所述第二设备中缓存时间最长的关联失败事件。
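In one implementation, when the second device 1000 comes back online or processes a distributed service again, the processor 1210 checks whether the fourth association failure event exists; when the fourth association failure event exists, the processor 1210 controls the communication interface 1220 to send the second notification to the third device; when the second notification is sent successfully, the processor 1210 clears the cached fourth association failure event.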
本申请实施例还提供一种计算机可读存储介质,其上存储有计算机程序,该程序被处理器执行时,可以实现上述方法实施例中记载的任意一种的部分或全部步骤,以及实现上述图9所描述的任意一个功能单元的功能。
本申请实施例还提供一种计算机可读存储介质,其上存储有计算机程序,该程序被处理器执行时,可以实现上述方法实施例中记载的任意一种的部分或全部步骤,以及实现上述图10所描述的任意一个功能单元的功能。
本申请实施例还提供了一种计算机程序产品,当其在计算机或处理器上运行时,使得计算机或处理器执行上述任一个方法中以第一设备900为执行主体的方法步骤的一个或多个步骤。上述所涉及的设备的各组成模块如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在所述计算机可读取存储介质中。
本申请实施例还提供了一种计算机程序产品,当其在计算机或处理器上运行时,使得计算机或处理器执行上述任一个方法中以第二设备1000为执行主体的方法步骤的一个或多个步骤。上述所涉及的设备的各组成模块如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在所述计算机可读取存储介质中。
本申请实施例还提供了一种芯片***,该芯片***包括处理器,用于支持第一设备900实现上述任一个方法中以第一设备900为执行主体的方法步骤的一个或多个步骤。在一种可能的设计中,所述芯片***还包括存储器,所述存储器,用于保存数据发送设备必要的程序指令和数据。该芯片***,可以由芯片构成,也可以包含芯片和其他分立器件。
本申请实施例还提供了一种芯片***,该芯片***包括处理器,用于支持第二设备1000实现上述任一个方法中以第二设备1000为执行主体的方法步骤的一个或多个步骤。在一种可能的设计中,所述芯片***还包括存储器,所述存储器,用于保存数据发送设备必要的程序指令和数据。该芯片***,可以由芯片构成,也可以包含芯片和其他分立器件。
在上述实施例中,对各个实施例的描述各有侧重,某个实施例中没有详述的部分,可以参见其它实施例的相关描述。
应理解,本文中涉及的第一、第二、第三、第四以及各种数字编号仅为描述方便进行的区分,并不用来限制本申请的范围。
应理解,本文中术语“和/或”,仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,本文中字符“/”,一般表示前后关联对象是一种“或”的关系。
还应理解,在本申请的各种实施例中,上述各过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的***、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的***、装置和方法,可以通过 其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个***,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
本申请实施例方法中的步骤可以根据实际需要进行顺序调整、合并和删减。
本申请实施例装置中的模块可以根据实际需要进行合并、划分和删减。
以上所述,以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的范围。

Claims (30)

  1. 一种面向分布式***的故障信息关联上报方法,其特征在于,所述方法包括:
    第一设备缓存第一调用关系,所述第一调用关系包括用户发起的第一设备参与处理的一个或多个分布式业务的设备调用信息;
    当所述第一设备处理第一分布式业务发生故障时,上报第一故障信息到服务器;
    所述第一设备从所述第一调用关系中查找第二设备;所述第二设备包括执行所述第一分布式业务时调用所述第一设备的设备或被所述第一设备调用的设备;
    所述第一设备向所述第二设备发送第一通知,所述第一通知用于指示所述第二设备将第二故障信息上报给所述服务器;或者,在所述第二设备没有上报所述第二故障信息的情况下,所述第一设备向所述第二设备发送所述第一通知;所述第二故障信息包括所述第二设备处理第一分布式业务时的故障信息。
  2. 如权利要求1所述的方法,其特征在于,所述第一设备从所述第一调用关系中查找第二设备,包括:
    所述第一设备通过第一标识,从第一设备调用信息中查找第二设备;所述第一设备调用信息为所述第一调用关系中所述第一分布式业务的设备调用信息。
  3. 如权利要求1或2所述的方法,其特征在于,所述缓存第一调用关系包括:
    将所述第一调用关系根据业务的生命周期分为至少两种状态;不同状态的第一调用关系对应的业务的生命周期不同;
    将所述不同状态的第一调用关系分开缓存。
  4. 如权利要求1-3任一项所述的方法,其特征在于,所述方法还包括:
    若所述第一设备向所述第二设备发送所述第一通知时,发送失败,缓存第一关联失败事件;所述第一关联失败事件用于表征发送所述第一通知失败。
  5. 如权利要求4所述的方法,其特征在于,所述缓存第一关联失败事件包括:
    所述第一设备确定是否有足够的缓存空间用来缓存所述第一关联失败事件;
    当前有足够的缓存空间用来缓存所述第一关联失败事件时,缓存所述第一关联失败事件;
    当前无足够的缓存空间用来缓存所述第一关联失败事件时,清除第二关联失败事件;所述第二关联失败事件是所述第一设备中缓存时间最长的关联失败事件;
    若清除所述第二关联失败事件后,有足够的缓存空间用来缓存所述第一关联失败事件,缓存所述第一关联失败事件;
    若清除所述第二关联失败事件后,仍无足够的缓存空间用来缓存所述第一关联失败事件,清除第三关联失败事件;所述第三关联失败事件是清除所述第二关联失败事件后,所述第一设备中缓存时间最长的关联失败事件。
  6. 如权利要求4或5所述的方法,其特征在于,所述方法还包括:
    当所述第一设备重新上线或者再次处理分布式业务时,检查是否存在所述第一关联失败事件;
    当存在所述第一关联失败事件时,向所述第二设备发送所述第一通知;
    成功发送所述第一通知时,清除缓存的所述第一关联失败事件。
  7. 一种面向分布式***的故障信息关联上报方法,其特征在于,所述方法包括:
    第二设备缓存第二调用关系,所述第二调用关系包括用户发起的第二设备参与处理的一个或多个分布式业务的设备调用信息;
    所述第二设备接收第一设备发送的第一通知,所述第一通知用于指示所述第二设备将第二故障信息上报给所述服务器;
    所述第二设备将第二故障信息上报给所述服务器;或者,在所述第二设备没有上报所述第二故障信息的情况下,所述第二设备将第二故障信息上报给所述服务器;所述第二故障信息包括所述第二设备处理第一分布式业务时的故障信息。
  8. 如权利要求7所述的方法,其特征在于,所述方法还包括:
    所述第二设备从所述第二调用关系中查找第三设备;所述第三设备包括执行所述第一分布式业务时调用所述第二设备的设备或被所述第二设备调用的设备;
    所述第二设备向所述第三设备发送第二通知,所述第二通知用于指示所述第三设备将第三故障信息上报给所述服务器;或者,在所述第三设备没有上报所述第三故障信息的情况下,所述第二设备向所述第三设备发送所述第二通知;所述第三故障信息包括所述第三设备处理第一分布式业务时的故障信息。
  9. 如权利要求8所述的方法,其特征在于,所述第二设备从所述第二调用关系中查找第三设备,包括:
    所述第二设备通过第一标识,从所述第二设备调用信息中查找第三设备;所述第二设备调用信息为所述第二调用关系中所述第一分布式业务的设备调用信息。
  10. 如权利要求7-9任一项所述的方法,其特征在于,所述缓存第二调用关系包括:
    将所述第二调用关系根据业务的生命周期分为至少两种状态;不同状态的第二调用关系对应的业务的生命周期不同;
    将所述不同状态的第二调用关系分开缓存。
  11. 如权利要求7所述的方法,其特征在于,所述方法还包括:
    若所述第二设备向所述第三设备发送所述第二通知时,发送失败,缓存第四关联失败事件;所述第四关联失败事件用于表征发送所述第二通知失败。
  12. 如权利要求11所述的方法,其特征在于,所述缓存第四关联失败事件包括:
    所述第二设备确定是否有足够的缓存空间用来缓存所述第四关联失败事件;
    当前有足够的缓存空间用来缓存所述第四关联失败事件时,缓存所述第四关联失败事件;
    当前无足够的缓存空间用来缓存所述第四关联失败事件时,清除第五关联失败事件;所述第五关联失败事件是所述第二设备中缓存时间最长的关联失败事件;
    若清除所述第五关联失败事件后,有足够的缓存空间用来缓存所述第四关联失败事件,缓存所述第四关联失败事件;
    若清除所述第五关联失败事件后,仍无足够的缓存空间用来缓存所述第四关联失败事件,清除第六关联失败事件;所述第六关联失败事件是清除所述第五关联失败事件后,所述第二设备中缓存时间最长的关联失败事件。
  13. 如权利要求11或12所述的方法,其特征在于,所述方法还包括:
    当所述第二设备重新上线或者再次处理分布式业务时,检查是否存在所述第四关联失败事件;
    当存在所述第四关联失败事件时,向所述第三设备发送所述第二通知;
    成功发送所述第二通知时,清除缓存的所述第四关联失败事件。
  14. 一种第一设备,其特征在于,包括:
    第一缓存单元,用于缓存第一调用关系,所述第一调用关系包括用户发起的第一设备参与处理的一个或多个分布式业务的设备调用信息;
    第一处理单元,用于当第一设备处理第一分布式业务发生故障时,上报第一故障信息到服务器;从所述第一调用关系中查找第二设备;所述第二设备包括执行所述第一分布式业务时调用所述第一设备的设备或被所述第一设备调用的设备;
    第一发送单元,用于向所述第二设备发送第一通知,所述第一通知用于指示所述第二设备将第二故障信息上报给所述服务器;或者,在所述第二设备没有上报所述第二故障信息的情况下,所述第一设备向所述第二设备发送所述第一通知;所述第二故障信息包括所述第二设备处理第一分布式业务时的故障信息。
  15. 如权利要求14所述的设备,其特征在于,所述第一处理单元,用于从所述第一调用关系中查找第二设备时,具体用于:
    通过第一标识,从第一设备调用信息中查找第二设备;所述第一设备调用信息为所述第一调用关系中所述第一分布式业务的设备调用信息。
  16. 如权利要求14或15所述的设备,其特征在于,所述第一缓存单元具体用于:
    将所述第一调用关系根据业务的生命周期分为至少两种状态;不同状态的第一调用关系对应的业务的生命周期不同;
    将所述不同状态的第一调用关系分开缓存。
  17. 如权利要求14-16任一项所述的设备,其特征在于,所述第一缓存单元,还用于:
    若所述第一发送单元向所述第二设备发送所述第一通知时,发送失败,缓存第一关联失败事件;所述第一关联失败事件用于表征发送所述第一通知失败。
  18. 如权利要求17所述的设备,其特征在于,所述第一缓存单元,用于缓存第一关联失败事件时,具体用于:
    确定是否有足够的缓存空间用来缓存所述第一关联失败事件;
    当前有足够的缓存空间用来缓存所述第一关联失败事件时,缓存所述第一关联失败事件;
    当前无足够的缓存空间用来缓存所述第一关联失败事件时,清除第二关联失败事件;所述第二关联失败事件是所述第一设备中缓存时间最长的关联失败事件;
    若清除所述第二关联失败事件后,有足够的缓存空间用来缓存所述第一关联失败事件,缓存所述第一关联失败事件;
    若清除所述第二关联失败事件后,仍无足够的缓存空间用来缓存所述第一关联失败事件,清除第三关联失败事件;所述第三关联失败事件是清除所述第二关联失败事件后,所述第一设备中缓存时间最长的关联失败事件。
  19. 如权利要求17或18所述的设备,其特征在于,所述第一处理单元,还用于当所述第一设备重新上线或者再次处理分布式业务时,检查是否存在所述第一关联失败事件;所述第一发送单元,还用于当存在所述第一关联失败事件时,向所述第二设备发送所述第一通知;所述第一缓存单元,还用于当所述第一发送单元成功发送所述第一通知后,清除缓存的所述第一关联失败事件。
  20. 一种第二设备,其特征在于,包括:
    第二缓存单元,用于缓存第二调用关系,所述第二调用关系包括用户发起的第二设备参与处理的一个或多个分布式业务的设备调用信息;
    第一接收单元,用于接收第一设备发送的第一通知,所述第一通知用于指示所述第二设备将第二故障信息上报给所述服务器;
    第二处理单元,所述第二设备将第二故障信息上报给所述服务器;或者,在所述第二设备没有上报所述第二故障信息的情况下,所述第二设备将第二故障信息上报给所述服务器;所述第二故障信息包括所述第二设备处理第一分布式业务时的故障信息。
  21. 如权利要求20所述的设备,其特征在于,所述第二处理单元,还用于从所述第二调用关系中查找第三设备,所述第三设备包括执行所述第一分布式业务时调用所述第二设备的设备或被所述第二设备调用的设备;
    所述第二设备还包括第二发送单元,用于向所述第三设备发送第二通知,所述第二通知用于指示所述第三设备将第三故障信息上报给所述服务器;或者,在所述第三设备没有上报所述第三故障信息的情况下,所述第二设备向所述第三设备发送所述第二通知;所述第三故障信息包括所述第三设备处理第一分布式业务时的故障信息。
  22. 如权利要求21所述的设备,其特征在于,所述第二处理单元,用于从所述第二调用关系中查找第三设备时,具体用于:
    通过第一标识,从所述第二设备调用信息中查找第三设备;所述第二设备调用信息为所述第二调用关系中所述第一分布式业务的设备调用信息。
  23. 如权利要求20-22任一项所述的设备,其特征在于,所述第二缓存单元,用于缓存第二调用关系时,具体用于:
    将所述第二调用关系根据业务的生命周期分为至少两种状态;不同状态的第二调用关系对应的业务的生命周期不同;
    将所述不同状态的第二调用关系分开缓存。
  24. 如权利要求21所述的设备,其特征在于,所述第二缓存单元,还用于:
    若所述第二发送单元向所述第三设备发送所述第二通知时,发送失败,缓存第四关联失败事件;所述第四关联失败事件用于表征发送所述第二通知失败。
  25. 如权利要求24所述的设备,其特征在于,所述第二缓存单元,用于缓存第四关联失败事件时,具体用于:
    确定是否有足够的缓存空间用来缓存所述第四关联失败事件;
    当前有足够的缓存空间用来缓存所述第四关联失败事件时,缓存所述第四关联失败事件;
    当前无足够的缓存空间用来缓存所述第四关联失败事件时,清除第五关联失败事件;所述第五关联失败事件是所述第二设备中缓存时间最长的关联失败事件;
    若清除所述第五关联失败事件后,有足够的缓存空间用来缓存所述第四关联失败事件,缓存所述第四关联失败事件;
    若清除所述第五关联失败事件后,仍无足够的缓存空间用来缓存所述第四关联失败事件,清除第六关联失败事件;所述第六关联失败事件是清除所述第五关联失败事件后,所述第二设备中缓存时间最长的关联失败事件。
  26. 如权利要求24或25所述的设备,其特征在于,所述第二处理单元,还用于当所述第二设备重新上线或者再次处理分布式业务时,检查是否存在所述第四关联失败事件;
    所述第二发送单元,还用于当存在所述第四关联失败事件时,向所述第三设备发送所述第二通知;成功发送所述第二通知时,清除缓存的所述第四关联失败事件。
  27. 一种计算设备,其特征在于,所述计算设备包括存储器和处理器,所述处理器执行所述存储器存储的计算机指令,使得所述计算设备执行权利要求1-13任一项所述的方法。
  28. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质存储有计算机程序,该计算机程序被处理器执行时实现上述权利要求1-13任意一项所述的方法。
  29. 一种计算机程序,其特征在于,所述计算机程序包括指令,当所述计算机程序被计算机执行时,使得所述计算机执行如权利要求1-13中任意一项所述的方法。
  30. 一种芯片***,其特征在于,所述芯片***包括至少一个处理器,存储器和接口电路,所述存储器、所述接口电路和所述至少一个处理器通过线路互联,所述至少一个存储器中存储有指令;所述指令被所述处理器执行时,实现权利要求1-13任一项所述的方法。
PCT/CN2021/118807 2020-09-28 2021-09-16 Fault information association reporting method for distributed systems and related device WO2022063032A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011040443.5A CN114363144B (zh) 2020-09-28 2020-09-28 Fault information association reporting method for distributed systems and related device
CN202011040443.5 2020-09-28

Publications (1)

Publication Number Publication Date
WO2022063032A1 true WO2022063032A1 (zh) 2022-03-31

Family

ID=80846244

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/118807 WO2022063032A1 (zh) 2020-09-28 2021-09-16 一种面向分布式***的故障信息关联上报方法及相关设备

Country Status (2)

Country Link
CN (1) CN114363144B (zh)
WO (1) WO2022063032A1 (zh)

Also Published As

Publication number Publication date
CN114363144B (zh) 2023-06-27
CN114363144A (zh) 2022-04-15

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21871402

Country of ref document: EP

Kind code of ref document: A1