JP2004185318A - Trouble monitoring device for CPU system

Info

Publication number: JP2004185318A
Application number: JP2002351570A
Authority: JP
Inventors: Naoki Takai; 直樹高井; Makoto Ikeda; 誠池田
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2002-12-03
Filing date: 2002-12-03
Publication date: 2004-07-02

Abstract

PROBLEM TO BE SOLVED: To efficiently store information useful for identifying the cause of a fault even when a CPU runs away or stops.

SOLUTION: A fault monitoring device for a CPU system 4, in which a CPU 1 and a peripheral circuit 2 including one or more functional devices 2a to 2b for realizing predetermined functions are connected to each other through a common bus 3, is provided with a detecting means 6 that detects, based on a predetermined signal from which the operating state of the CPU 1 can be estimated externally, a predetermined state likely to interfere with the processing of the CPU 1, and a log information recording means 7 that, when the predetermined state is detected, acquires predetermined signals exchanged between the CPU 1 and the peripheral circuit 2 and records them in a non-volatile memory 8.

COPYRIGHT: (C)2004, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明はＣＰＵシステムの障害監視装置に関し、更に詳しくは、ＣＰＵと所定の機能を実現するための１又は２以上の機能デバイスを含む周辺回路とがＣＰＵの共通バスを介して相互に接続するＣＰＵシステムの障害監視装置に関するものである。
【０００２】
今日、多くの装置はこの種のＣＰＵシステムによって実現されているが、特に複雑かつ高度な通信サービスを提供する伝送装置等では、ＣＰＵの配下で多数の機能デバイスが協動するために、システム障害発生時の原因特定が非常に困難になりつつある。
【０００３】
【従来の技術】
従来は、ＣＰＵシステム内の障害処理をＣＰＵの割込処理ファームで行っていた。しかし、ＣＰＵのハードウェアは正常であるにも関わらず、該ハードウェア以外の状態、即ち、ＣＰＵファームウェアのバグ、周辺回路の障害、多数の障害が重なったことによるＣＰＵ処理の過負荷状態、又はノイズ等の原因によってＣＰＵが暴走し又は停止する場合も少なくなく、係る場合には、迅速な対処のみならず、再発防止のための原因特定が不可欠である。
【０００４】
係る状況の下、従来は、ウォッチドッグタイマによりＣＰＵの暴走を検出し、暴走を検出すると、該ＣＰＵをリセット状態にすると共に、この区間のＣＰＵバスの制御をメモリから読み出した所定の制御用データで行い、これによって周辺機器を安全な状態で停止させた後、ＣＰＵのリセット状態を開放するものが知られている（例えば特許文献１）。
【０００５】
【特許文献１】
特開平５−１６５６５７号公報（「要約」，「００１３」、図１）。
【０００６】
【発明が解決しようとする課題】
しかし、上記従来方式ではＣＰＵシステムの異常時における迅速かつ安全な対処は可能であるが、再発防止のための有用な情報は得られない。
【０００７】
本発明は上記従来技術の問題点に鑑みなされたもので、その目的とする所は、ＣＰＵが暴走又は停止しても、原因特定のための有用な情報を効率よく蓄積可能なＣＰＵシステムの障害監視装置を提供することにある。
【０００８】
【課題を解決するための手段】
上記の課題は例えば図１の構成により解決される。即ち、本発明（１）のＣＰＵシステムの障害監視装置５は、ＣＰＵ１と所定の機能を実現するための１又は２以上の機能デバイス２ａ〜２ｂを含む周辺回路２とがＣＰＵ１の共通バス３を介して相互に接続するＣＰＵシステム４の障害監視装置において、外部よりＣＰＵ１の動作状態を推定可能な所定の信号に基づきＣＰＵ１の処理に支障を来たすであろう所定の状態を検出する検出手段６と、前記所定の状態の検出によりＣＰＵ１が周辺回路２との間でやり取りする所定の信号を取得して不揮発性メモリ８に記録するログ情報記録手段７とを備えるものである。
【０００９】
本発明（１）によれば、ＣＰＵ１の処理に支障を来たすであろう所定の状態を検出したことにより，該ＣＰＵ１が周辺回路２との間でやり取りする所定の信号を取得して不揮発性メモリ８に記録するため、その後にＣＰＵ１が暴走又は停止しても、原因特定のための有用な情報を効率よく蓄積可能となる。
【００１０】
本発明（２）では、上記本発明（１）において、検出手段は、ＣＰＵにより定期的にリセットされるべきウォッチドッグタイマの値が所定閾値を超えたこと、又は機能デバイスからのバスアサートに対する応答時間が所定閾値を超えたこと、又は機能デバイスからの割込要求に関して所定の高負荷状態を検出したことにより、ＣＰＵの処理に支障を来たすであろう所定の状態を検出するものである。
【００１１】
なお、上記機能デバイスからの割込要求に関する所定の高負荷状態とは、例えば緊急に割込処理すべき割込要求が略同時に多発した状態、又は過去にＣＰＵが暴走又は停止状態に到ったことがある場合と同一又は類似のパターンの割込要求が略同時に又はシーケンシャルに発生した状態等を意味する。従って、本発明（２）によれば、外部よりＣＰＵ１の動作状態を推定可能な所定の信号に基づきＣＰＵの処理に支障を来たすであろうシステムの状態を的確に検出できる。
【００１２】
本発明（３）では、上記本発明（１）において、ＣＰＵが周辺回路との間でやり取りする所定の信号は、共通バスのバスアクセスに関する信号、機能デバイスからＣＰＵへの割込要求に関する信号、又はＣＰＵにより起動されるＤＭＡアクセスに関する信号である。従って、これらの信号を記録に残すことで、ＣＰＵが処理障害に到った際のシステム環境を詳細に分析できる。
【００１３】
【発明の実施の形態】
以下、添付図面に従って本発明に好適なる実施の形態を詳細に説明する。なお、全図を通して同一符号は同一又は相当部分を示すものとする。
【００１４】
図２は実施の形態による障害監視方式の構成を示す図で、データ伝送装置への適用例を示している。図において、１０は例えば２回線分の伝送路を収容可能な通信制御部、１１は通信制御部１０の主制御を行うプロセッサ部、１２はそのＣＰＵ、１３はＣＰＵ１２が使用する主メモリ（ＭＭ）、１４はＣＰＵ１２が上位モジュール２０との間でやり取りするデータをＤＭＡ転送するためのＤＭＡ制御部（ＤＭＡ）、１５はＣＰＵ１２のプロセッサバス（ＰＲＢ）、１８ａ，１８ｂは各入出力回線を終端する機能デバイス（回線終端部）、１７は各機能デバイス１８ａ，１８ｂを収容するローカルバス（ＬＯＢ）、１６はプロセッサバス１５とローカルバス１７との間を接続する（プロトコル整合させる）ためのバスインタフェース部（ＢＩＦ）、１９は各機能デバイス１８ａ，１８ｂ等からの割込要求ＩＮＴを収容する割込バス（ＩＮＴＢ）、そして、２０は複数のこのような通信制御部１０に関する上位の管理・処理を行う上位モジュール、２１はＤＭＡ１４と上位モジュール２０との間を接続するＤＭＡバス（ＤＭＡＢ）である。
【００１５】
更に、３０は通信制御部（ＣＰＵシステム）１０の障害監視を行う障害監視部、３１はＤＭＡバス２１のアクセス信号を監視・取得するＤＭＡ監視部、３２はＰＲＢ１５のアクセス信号を監視・取得するＰＢＡ監視部、３３はプロセッサバス１５の所定のバスアクセス信号に基づきＣＰＵ１２の高負荷状態を検出する高負荷判定部、３４は割込バス１９上の割込要求に係る信号ＩＮＴ０〜ＩＮＴｎを監視・取得する割込監視部、３５は割込要求ＩＮＴ０〜ＩＮＴｎに対するＣＰＵ１２の処理が高負荷状態になるであろう所定の状態を検出する高負荷判定部、３６は所定のクロック信号でカウントアップすると共にＣＰＵ１２により定期的にリセット（ＲＳ）されるべきウォッチドッグタイマ（ＷＤＴ）、３７はＷＤＴ３６の計数値ｔが所定閾値ＴＨ２を超えたことによりＣＰＵ１２の高負荷状態を検出してその判定出力（即ち、メモリ書込イネーブル信号）ＷＥ１を出力する高負荷判定部、３９はフラッシュメモリやＥＥＰＲＯＭ等からなる不揮発性メモリ、３８は、高負荷状態の各判定出力ＷＥ１〜ＷＥ３により起動され、ＤＭＡバス２１，プロセッサバス１５及び又は割込バス１９から取得された各所定の信号を不揮発性メモリ３９に書き込むためのメモリ制御部、４０は上記各部の間を接続するメモリバス（ＭＢ）、４１は機能デバイス１８等からの割込要求に相当する信号ＩＮＴ０’〜ＩＮＴｎ’を擬似的に発生する擬似割込発生部、そして、５０は不揮発性メモリ３９に記録された内容をメモリ制御部３８を介して外部に読み出し、障害状況を解析するための保守端末である。
【００１６】
一例の高負荷判定部３３は、ＰＢＡ監視部３２から抽出された、ある機能デバイス１８からのバスアサートとＣＰＵ１２からのバスアックに関する各タイミング信号に基づき、前記バスアサートからバスアックに到るまでの時間を計数するためのカウンタＣＴＲと、該カウンタＣＴＲのカウント出力Ｑと所定閾値ＴＨ１とを比較する比較器ＣＭＰとを備え、該ＣＭＰは、ある機能デバイス１８からのバスアサートに対するＣＰＵ１２の応答時間Ｑが所定閾値ＴＨ１を超えたことにより、ＣＰＵ１２が高負荷状態にあることを示す判定出力（即ち、メモリ書込イネーブル信号）ＷＥ２を出力する。
【００１７】
また一例の高負荷判定部３５は、予め割込要求信号ＩＮＴ０〜ＩＮＴｎに関する所定の発生パターンＰを設定・保持するレジスタＲＥＧと、該ＲＥＧの出力パターンＰと、割込監視部３４により割込バス１９から抽出された割込要求信号ＩＮＴ０〜ＩＮＴｎとを比較する比較器ＣＭＰとを備え、該ＣＭＰは、割込バス１９から抽出された割込要求信号ＩＮＴ０〜ＩＮＴｎが所定のパターンＰであることにより、ＣＰＵ１２が高負荷状態である、又は高負荷状態になるであろうことを示す判定出力（即ち、メモリ書込イネーブル信号）ＷＥ３を出力する。なお、上記割込要求信号ＩＮＴ０〜ＩＮＴｎに関する所定の発生パターンＰとは、例えばＣＰＵ１２が緊急に割込処理すべき割込要求が略同時に多発したパターン、又は過去にＣＰＵ１２が暴走又は停止状態に到ったことがある場合と同一又は類似のパターンであって、各割込要求が略同時に又はシーケンシャルに発生した場合のパターン等を意味する。
【００１８】
擬似割込発生部４１は、周辺回路部（各機能デバイス１８を含む）における各種障害に対応する各擬似割込要求ＩＮＴ０’〜ＩＮＴｎ’を発生可能である。従来、この種の障害に対するＣＰＵの割込処理については、周辺回路部の各対応部位（例えばＣＰＵに対する動作プロトコル違反等）をその都度実際に生成（回路を改造）しないと起こせなかったが、本実施の形態によれば、各種障害に基づく擬似割込要求を１箇所に集約して能率よく発生可能である。これにより、ＣＰＵ１２への疑似負荷状態や、ＣＰＵ１２への応答違反も疑似可能であり、実運用状態で起こりうる様々な状態を再現可能となる。従って、ハードウェア障害又はファームウェアのバグに対する処理能力や処理信頼性の大幅な改善が図れる。
【００１９】
このような障害監視装置３０は、好ましくは，専用ＬＳＩ又は改版可能なＦＰＧＡ等により実現され、プロセッサ部１１及び各機能デバイス１８ａ，１８ｂと共に、同一の基盤（ボード）上に配置される。
【００２０】
図３に実施の形態による不揮発性メモリの記憶フォーマットを示す。一例の不揮発性メモリ３９は、ＣＰＵ１２と上位モジュール２０との間で行われるＤＭＡのアクセス発生回数を記憶するエリア３９ａと、配下の機能デバイス１８ａ，１８ｂを含む周辺回路部についての各種障害情報を記憶するエリア３９ｂと、周辺回路部からの割込要求信号ＩＮＴ０〜ＩＮＴ３１を記憶するエリア３９ｃとを備える。
【００２１】
上記エリア３９ｂの障害情報には、本システム上で検出される各種のアラーム信号ＡＬＭ、プロセッサバス１５上で検出されるデータパリティエラー信号ＤＡＴＰＥＲ、アドレスパリティーエラー信号ＡＤＤＰＥＲ、ＣＰＵ１２における演算オーバフローＯＶＦを知らせるための各種ステータス信号等が含まれる。また、機能ブロック１８で発生する各種障害については、一次的には割込要求信号ＩＮＴ０〜ＩＮＴ３１によって代表され、エリア３９ｃに記憶されるが、該割込要求の原因となった障害の詳細情報（ハードウェアの個別障害、ローカルバスのパリティ障害、外部インタフェース上のプロトコルエラー、電源系障害、クロックの同期障害等）については、割込要求信号ＩＮＴ０〜ＩＮＴ３１と共にエリア３９ｃに記憶してもよいし、又はエリア３９ｂに記憶してもよい。
【００２２】
以上述べた構成により、次に障害監視の動作を説明する。図２に戻り、ＣＰＵ１２は、内部に割込マスク（不図示）を備えており、適宜に必要な割込要求のみを受付け、処理可能である。一方、不揮発性メモリ３９にはＣＰＵ１２によって処理を受付けられた割込要求のみならず、未処理（即ち，処理待ち又は処理をマスクされた）の割込要求も記録可能である。従って、ＣＰＵ１２の割込処理負担が必要最小限のものに軽減されると共に、ＣＰＵ高負荷検出時の未処理の割込要求も失われずに不揮発性メモリ３９に記録される。
【００２３】
図４は実施の形態による割込情報取得のタイミングチャートであり、図において、ＷＴＤはウォッチドッグタイマ、ＷＥ１はメモリ３９への書込イネーブル信号、ＳＰは割込要求信号のサンプリングパルス、ＭＷＣはメモリ３９の書込タイミングを生成するためのカウント信号、ＭＡＤはメモリ３９の書込アドレス信号、ＭＣＳはメモリ３９のチップセレクト信号、ＭＯＥはメモリ３９のデータ読出イネーブル信号、ＭＷＥはメモリ３９のデータ書込イネーブル信号、ＭＤＡＴはメモリ３９に書き込まれるデータ信号である。
【００２４】
ウォッチドッグタイマ３６はＣＰＵ１２からの前回のリセットパルスＲＳによりリセットされて後、クロック信号ＣＬＫＡによりカウントアップしている。高負荷判定部３７は、ウォッチドッグタイマ３６のカウント値ｔが所定の閾値ＴＨ２を超えると、書込イネーブル信号ＷＥ１＝１にすると共に、割込要求信号のサンプリングパルスＳＰを発生する。これを受けたメモリ制御部３９では、該サンプリングパルスＳＰによりメモリバス４０上の割込要求信号ＩＮＴ０〜ＩＮＴｎをサンプリングすると共に、これを所定のタイミングで不揮発性メモリ３９に書き込む。
【００２５】
更に、この高負荷判定部３７は、上記書込イネーブル信号ＷＥ１＝１にした後は、例えばＴｍｓ毎の定期的に第２，第３のサンプリングパルスＳＰを発生し，これを受けたメモリ制御部３９では該パルスＳＰに同期して各時点の割込要求信号ＩＮＴ０〜ＩＮＴｎをサンプリングすると共に、これらを不揮発性メモリ３９の次アドレスに順次蓄積する。こうして、もし、ウォッチドッグタイマ３６のカウント値が所定の上限値Ｍになる前に、ウォッチドッグタイマ３６がＣＰＵ１２によりリセットされた場合には、それ以上の割込要求信号ＩＮＴ０〜ＩＮＴｎのサンプリング及びメモリ３９への書込は停止される。しかし、ＣＰＵ１２の暴走又は停止によって、ウォッチドッグタイマ３６がリセットされずに、やがてそのカウント値が上限値Ｍを超えた場合には、その直前までにどのような割込要求がどのようなパターンで発生していたかのログ情報が不揮発性メモリ２９に記憶されている。
【００２６】
なお、上記割込要求信号ＩＮＴ０〜ＩＮＴｎのサンプリング及びメモリ３９への書込は、ウォッチドッグタイマ３６のカウント値が上限値Ｍを超えた後も適当な時間だけ継続してもよい。こうすれば、ＣＰＵ１２の障害前のみならず，障害後のシステム状況も有効に記録される。
【００２７】
図５は実施の形態によるＣＰＵバスアクセス情報取得のタイミングチャートであり、図において、ＡＤ／ＤＡＴはアドレス／データ信号、ＣＭＤはリード／ライト等のコマンド信号、Ｆｒａｍｅはバスアクセスの開始信号、Ｉｒｄｙは入力レディ信号、Ｄｅｖｓｅｌはデバイス選択信号、Ｔｒｄｙは転送レディ信号、Ａｃ−ｃｎｔはバスアクセス区間を監視するためのタイミング信号、Ａｃ−ｅｎｄはバスアクセスの終了信号、ＷＥ２はメモリ３９への書込イネーブル信号である。
【００２８】
図の左側に正常時のバスアクセスを示す。一例のデータ転送シーケンスはＡｃ−ｃｎｔ＝「Ａ」までに終了している。これは、ＣＰＵ１２が高負荷状態にないことを表しており，よって書込イネーブル信号ＷＥ２はセットされない。一方，図の右側は正常時ではないバスアクセスを示している。この場合のデータ転送シーケンスはＡｃ−ｃｎｔ＝「Ａ」を経過しても終了しておらず、これはＣＰＵ１２が高負荷状態（又は異常）であることを表している。これによって書込イネーブル信号ＷＥ２はセットされ、その後は、図示しないが，バスアクセス信号が適宜にサンプリングされると共に、メモリ３９に順次記憶される。
【００２９】
なお、上記バスアクセス信号の監視は、単にアクセス時間の上限を監視するのみではなく、各途中のタイミングで発生すべき各信号レベルの発生パターンを監視するようにしてもよい。こうすれば、バスアクセスの異常状態（又はＣＰＵ１２の高負荷状態）をより早期に発見でき、よってＣＰＵ１２が暴走又は停止にいたる前のより多くのバスアクセス信号をサンプリングし、メモリ３９に記憶できる。
【００３０】
図６は実施の形態によるＤＭＡ転送回数情報取得のタイミングチャートであり、図において、ＷＤＴはウォッチドッグタイマ、ＷＥ１はメモリ３９への書込イネーブル信号、ＡＷＲはアドレス開始ビット、ＡＤＥはビットシリアルからなるコマンド／アドレスデータ信号、ＷＤＡＴはビットシリアルからなる書込データ、Ａｅｎｄは１ＤＭＡアクセスの終了を表すアクセス終了ビット、ＤＭＡｃｎｔはＤＭＡアクセスの発生回数である。
【００３１】
ＣＰＵ１２の高負荷状態が検出（即ち、ＷＥ１＝１）されると、ＤＭＡアクセス回数のカウント及びカウント値のメモリ３９への書込制御が行われる。即ち、ＤＭＡｃｎｔはアドレス開始ビットＡＷＲ毎にカウントアップされ、やがてウォッチドッグタイマＷＤＴが最大値Ｍを超えると、その時点における計数値ｎがメモリ３９に記憶される。本実施の形態におけるＤＭＡアクセスは、上位モジュール２０へのデータ転送（即ち，障害報告等）が頻発している場合に多く発生するため、障害時のシステム状況を解析する上で有用な記録情報となり得る。
【００３２】
なお、この例ではＷＥ１＝１により、ＤＭＡアクセス回数の監視・記録を開始したが、ＷＥ２＝１又はＷＥ３＝１によりＤＭＡアクセス回数の監視・記録を開始してもよい。他の割込要求情報、バスアクセス情報の監視・記録についても同様である。この場合は、メモリバス４０ヘのデータ書込アクセスが競合しないようにバスアクセスの調停部が設けられる。こうして、システム障害の解析に必要な最小限のログ情報を効率よく取得・記録できる。
【００３３】
そして、保守端末５０では、メモリ３９に記録された情報を適宜に読み出し、これを統計的に分析することで、ＣＰＵ１２の障害がシステムのどの部分での障害によるかを容易に分析できる。逆に、特定の部分での障害が発生した際に、ハードウェアがどのようにリアクションし、それがＣＰＵ１２のアプリケ―ションソフトにどう伝わり、かつハードウェア及びアプリケーションソフトが正常に対処動作出来るかどうかを検証することも出来、問題があれば再発防止の変更を折り込む処置を取るサイクルを繰り返すなど、さらなる品質向上を目指す事が可能となる。
【００３４】
なお、図示しないが、ＣＰＵ１２が高負荷状態にあることを示す各信号ＷＥ１〜ＷＥ３でランプを点灯し、外部に警告してもよい。これにより、システムダウンの可能性がある事を事前に保守者に示唆することが可能となる。また、この警告情報ＷＥ１〜ＷＥ３は、一装置内のみならず通信対向する相手側装置への警告情報としても活用でき、こうすれば通信システム全体としての運用の信頼性向上にも極めて有効となる。
【００３５】
また、上記本発明に好適なる実施の形態を述べたが、本発明思想を逸脱しない範囲内で各部の構成、制御、処理及びこれらの組み合わせの様々な変更が行えることは言うまでも無い。
【００３６】
（付記１）ＣＰＵと所定の機能を実現するための１又は２以上の機能デバイスを含む周辺回路とがＣＰＵの共通バスを介して相互に接続するＣＰＵシステムの障害監視装置において、外部よりＣＰＵの動作状態を推定可能な所定の信号に基づきＣＰＵの処理に支障を来たすであろう所定の状態を検出する検出手段と、前記所定の状態の検出によりＣＰＵが周辺回路との間でやり取りする所定の信号を取得して不揮発性メモリに記録するログ情報記録手段とを備えることを特徴とするＣＰＵシステムの障害監視装置。
【００３７】
（付記２）検出手段は、ＣＰＵにより定期的にリセットされるべきウォッチドッグタイマの値が所定閾値を超えたこと、又は機能デバイスからのバスアサートに対する応答時間が所定閾値を超えたこと、又は機能デバイスからの割込要求に関して所定の高負荷状態を検出したことにより、ＣＰＵの処理に支障を来たすであろう所定の状態を検出することを特徴とする付記１記載のＣＰＵシステムの障害監視装置。
【００３８】
（付記３）ＣＰＵが周辺回路との間でやり取りする所定の信号は、共通バスのバスアクセスに関する信号、機能デバイスからＣＰＵへの割込要求に関する信号、又はＣＰＵにより起動されるＤＭＡアクセスに関する信号であることを特徴とする付記１記載のＣＰＵシステムの障害監視装置。
【００３９】
（付記４）不揮発性メモリの内容を外部接続の装置に読み出すためのインタフェース手段を備えることを特徴とする付記１記載のＣＰＵシステムの障害監視装置。
【００４０】
（付記５）機能デバイスからの割込要求に相当する信号を擬似的に発生してＣＰＵに対する割込要求とする擬似割込発生手段を備えることを特徴とする付記１記載のＣＰＵシステムの障害監視装置。
【００４１】
【発明の効果】
以上述べた如く本発明によれば、稼働中のＣＰＵが暴走又は停止しても、その前後のシステム稼働状況の情報を自律で記録可能となるため、障害分析に有用な情報が得られると共に、再発防止に活用できる。従って、この種のＣＰＵシステムの信頼性向上に寄与するところが極めて大きい。
【図面の簡単な説明】
【図１】本発明の原理を説明する図である。
【図２】実施の形態による障害監視方式の構成を示す図である。
【図３】実施の形態による不揮発性メモリの記憶フォーマットを説明する図である。
【図４】実施の形態による割込情報取得のタイミングチャートである。
【図５】実施の形態によるＣＰＵバスアクセス情報取得のタイミングチャートである。
【図６】実施の形態によるＤＭＡ転送回数情報取得のタイミングチャートである。
【符号の説明】
１０通信制御部
１１プロセッサ部
１２ＣＰＵ
１３主メモリ（ＭＭ）
１４ＤＭＡ制御部（ＤＭＡ）
１５プロセッサバス（ＰＲＢ）
１６バスインタフェース部（ＢＩＦ）
１７ローカルバス（ＬＯＢ）
１８ａ，１８ｂ機能デバイス（回線終端部等）
１９割込バス（ＩＮＴＢ）
２０上位モジュール
２１ＤＭＡバス（ＤＭＡＢ）
３０障害監視部
３１ＤＭＡ監視部
３２ＰＢＡ監視部
３３，３５，３７高負荷判定部
３４割込監視部
３６ウォッチドッグタイマ（ＷＤＴ）
３８メモリ制御部
３９不揮発性メモリ
４１擬似割込発生部
５０保守端末
[0001]
[Technical field of the invention]
The present invention relates to a fault monitoring device for a CPU system, and more particularly to a fault monitoring device for a CPU system in which a CPU and a peripheral circuit including one or more functional devices for realizing predetermined functions are interconnected via a common bus of the CPU.
[0002]
Today, many devices are realized by this type of CPU system. In particular, in transmission devices and the like that provide complex and advanced communication services, many functional devices cooperate under the CPU, so identifying the cause when a system failure occurs is becoming very difficult.
[0003]
[Prior art]
Conventionally, failure handling in a CPU system has been performed by the interrupt-processing firmware of the CPU. However, even though the CPU hardware itself is normal, the CPU often runs away or stops because of conditions outside that hardware, namely a bug in the CPU firmware, a failure in a peripheral circuit, an overload of CPU processing caused by many overlapping failures, or noise. In such cases, not only a prompt response but also identification of the cause, in order to prevent recurrence, is indispensable.
[0004]
Under these circumstances, a conventional method is known in which a watchdog timer detects a runaway of the CPU; when a runaway is detected, the CPU is put into the reset state, and during that period the CPU bus is controlled using predetermined control data read from a memory, so that the peripheral devices are stopped in a safe state, after which the reset state of the CPU is released (for example, Patent Document 1).
[0005]
[Patent Document 1]
JP-A-5-165657 ("Abstract", paragraph [0013], FIG. 1).
[0006]
[Problems to be solved by the invention]
However, in the above-mentioned conventional method, quick and safe measures can be taken when the CPU system is abnormal, but useful information for preventing recurrence cannot be obtained.
[0007]
The present invention has been made in view of the above problems of the prior art, and its object is to provide a fault monitoring device for a CPU system that can efficiently accumulate information useful for identifying the cause even if the CPU runs away or stops.
[0008]
[Means for Solving the Problems]
The above problem is solved by, for example, the configuration of FIG. 1. That is, the fault monitoring device 5 for a CPU system according to the present invention (1) is a fault monitoring device for a CPU system 4 in which a CPU 1 and a peripheral circuit 2 including one or more functional devices 2a to 2b for realizing predetermined functions are interconnected via a common bus 3 of the CPU 1, and comprises: a detection means 6 that detects, based on a predetermined signal from which the operating state of the CPU 1 can be estimated externally, a predetermined state likely to interfere with the processing of the CPU 1; and a log information recording means 7 that, upon detection of the predetermined state, acquires predetermined signals exchanged between the CPU 1 and the peripheral circuit 2 and records them in a nonvolatile memory 8.
[0009]
According to the present invention (1), when a predetermined state likely to interfere with the processing of the CPU 1 is detected, predetermined signals exchanged between the CPU 1 and the peripheral circuit 2 are acquired and recorded in the nonvolatile memory 8. Therefore, even if the CPU 1 subsequently runs away or stops, information useful for identifying the cause can be accumulated efficiently.
[0010]
In the present invention (2), in the present invention (1), the detection means detects the predetermined state likely to interfere with the processing of the CPU when the value of a watchdog timer that should be periodically reset by the CPU exceeds a predetermined threshold, when the response time to a bus assertion from a functional device exceeds a predetermined threshold, or when a predetermined high-load state is detected with respect to interrupt requests from the functional devices.
[0011]
Note that the predetermined high-load state relating to interrupt requests from the functional devices means, for example, a state in which many interrupt requests requiring urgent interrupt processing occur at substantially the same time, or a state in which interrupt requests occur, substantially simultaneously or sequentially, in a pattern identical or similar to one that previously led the CPU to run away or stop. Therefore, according to the present invention (2), a system state likely to interfere with the processing of the CPU can be accurately detected based on predetermined signals from which the operating state of the CPU 1 can be estimated externally.
[0012]
In the present invention (3), in the present invention (1), the predetermined signals exchanged between the CPU and the peripheral circuit are signals relating to bus accesses on the common bus, signals relating to interrupt requests from the functional devices to the CPU, or signals relating to DMA accesses initiated by the CPU. Recording these signals therefore allows detailed analysis of the system environment at the time the CPU suffered a processing failure.
[0013]
[Embodiments of the invention]
Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. Note that the same reference numerals indicate the same or corresponding parts throughout the drawings.
[0014]
FIG. 2 is a diagram showing the configuration of the fault monitoring system according to the embodiment, as applied to a data transmission device. In the figure, 10 is a communication control unit capable of accommodating, for example, two transmission lines; 11 is a processor unit that performs the main control of the communication control unit 10; 12 is its CPU; 13 is a main memory (MM) used by the CPU 12; 14 is a DMA control unit (DMA) for DMA transfer of data exchanged between the CPU 12 and an upper module 20; 15 is the processor bus (PRB) of the CPU 12; 18a and 18b are functional devices (line termination units) that terminate the respective input/output lines; 17 is a local bus (LOB) that accommodates the functional devices 18a and 18b; 16 is a bus interface unit (BIF) that connects the processor bus 15 and the local bus 17 (matching their protocols); 19 is an interrupt bus (INTB) that carries interrupt requests INT from the functional devices 18a, 18b, and so on; 20 is an upper module that performs higher-level management and processing for a plurality of such communication control units 10; and 21 is a DMA bus (DMAB) that connects the DMA 14 and the upper module 20.
[0015]
Further, 30 is a fault monitoring unit that monitors the communication control unit (CPU system) 10 for faults; 31 is a DMA monitoring unit that monitors and acquires access signals on the DMA bus 21; 32 is a PBA monitoring unit that monitors and acquires access signals on the PRB 15; 33 is a high-load determination unit that detects a high-load state of the CPU 12 based on predetermined bus access signals on the processor bus 15; 34 is an interrupt monitoring unit that monitors and acquires the interrupt request signals INT0 to INTn on the interrupt bus 19; 35 is a high-load determination unit that detects a predetermined state in which the processing of the CPU 12 for the interrupt requests INT0 to INTn is likely to become highly loaded; 36 is a watchdog timer (WDT) that counts up on a predetermined clock signal and should be periodically reset (RS) by the CPU 12; 37 is a high-load determination unit that detects a high-load state of the CPU 12 when the count value t of the WDT 36 exceeds a predetermined threshold TH2 and outputs the determination output (that is, a memory write enable signal) WE1; 39 is a nonvolatile memory such as a flash memory or an EEPROM; 38 is a memory control unit that is activated by the high-load determination outputs WE1 to WE3 and writes the predetermined signals acquired from the DMA bus 21, the processor bus 15, and/or the interrupt bus 19 into the nonvolatile memory 39; 40 is a memory bus (MB) that interconnects the above units; 41 is a pseudo interrupt generation unit that artificially generates signals INT0' to INTn' equivalent to interrupt requests from the functional devices 18 and the like; and 50 is a maintenance terminal for reading the contents recorded in the nonvolatile memory 39 to the outside via the memory control unit 38 and analyzing the fault situation.
[0016]
The high-load determination unit 33, in one example, comprises a counter CTR that counts the time from a bus assertion by a functional device 18 to the bus acknowledge from the CPU 12, based on the respective timing signals extracted by the PBA monitoring unit 32, and a comparator CMP that compares the count output Q of the counter CTR with a predetermined threshold TH1. When the response time Q of the CPU 12 to a bus assertion from a functional device 18 exceeds the threshold TH1, the comparator CMP outputs a determination output (that is, a memory write enable signal) WE2 indicating that the CPU 12 is in a high-load state.
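As an illustration only, the counter-and-comparator arrangement of the high-load determination unit 33 can be modeled in software. The Python sketch below is a behavioral model under stated assumptions: the class name, the per-clock `tick` interface, the use of clock ticks as the unit of TH1, and the choice to latch WE2 once set are all hypothetical and are not specified in the description.

```python
class ResponseTimeMonitor:
    """Behavioral model of counter CTR plus comparator CMP (unit 33)."""

    def __init__(self, th1: int) -> None:
        self.th1 = th1          # threshold TH1, in clock ticks (assumed unit)
        self.q = 0              # count output Q of counter CTR
        self.counting = False   # True between bus assertion and bus acknowledge
        self.we2 = False        # determination output WE2 (latched, assumed)

    def tick(self, bus_assert: bool, bus_ack: bool) -> bool:
        """Advance one clock and return the current WE2 level."""
        if bus_ack:
            # The CPU acknowledged: stop timing this request.
            self.counting = False
            self.q = 0
        elif bus_assert and not self.counting:
            # A functional device asserted the bus: start timing from zero.
            self.counting = True
            self.q = 0
        elif self.counting:
            self.q += 1
            if self.q > self.th1:
                self.we2 = True  # response time Q exceeded TH1
        return self.we2
```

With `th1=3`, WE2 stays low if the acknowledge arrives within three ticks of the assertion and goes high on the fourth tick without an acknowledge.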
[0017]
The high-load determination unit 35, in one example, comprises a register REG in which a predetermined occurrence pattern P of the interrupt request signals INT0 to INTn is set and held in advance, and a comparator CMP that compares the output pattern P of the register REG with the interrupt request signals INT0 to INTn extracted from the interrupt bus 19 by the interrupt monitoring unit 34. When the interrupt request signals INT0 to INTn extracted from the interrupt bus 19 match the predetermined pattern P, the comparator CMP outputs a determination output (that is, a memory write enable signal) WE3 indicating that the CPU 12 is, or is about to be, in a high-load state. The predetermined occurrence pattern P of the interrupt request signals INT0 to INTn means, for example, a pattern in which many interrupt requests that the CPU 12 must process urgently occur substantially simultaneously, or a pattern identical or similar to one that previously led the CPU 12 to run away or stop, in which the interrupt requests occur substantially simultaneously or sequentially.
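The pattern comparison performed by the high-load determination unit 35 can be sketched as follows. This is a minimal model, assuming the interrupt lines INT0 to INTn are packed into one integer with INTi at bit i. The function names are hypothetical, and the description does not say how a "similar" pattern is judged; the Hamming-distance tolerance used in the second function is purely an assumption for illustration.

```python
def we3_output(int_lines: int, pattern_p: int) -> bool:
    """Comparator CMP: True when the sampled INT0..INTn bits equal pattern P."""
    return int_lines == pattern_p


def we3_output_similar(int_lines: int, pattern_p: int, max_diff_bits: int) -> bool:
    """Assumed variant for 'similar' patterns.

    Tolerates up to max_diff_bits interrupt lines that differ from pattern P.
    """
    differing = bin(int_lines ^ pattern_p).count("1")
    return differing <= max_diff_bits
```

For example, with P = `0b1011`, a sample of `0b1010` fails the exact comparison but passes the tolerant one when `max_diff_bits` is 1.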
[0018]
The pseudo interrupt generation unit 41 can generate pseudo interrupt requests INT0' to INTn' corresponding to various faults in the peripheral circuit section (including the functional devices 18). Conventionally, CPU interrupt processing for this kind of fault could not be exercised unless the fault (for example, a violation of the operating protocol toward the CPU) was actually created at the corresponding part of the peripheral circuit section each time, by modifying the circuit. According to this embodiment, pseudo interrupt requests based on various faults can be generated efficiently from a single location. This makes it possible to simulate a load on the CPU 12 and response violations toward the CPU 12, and thus to reproduce various conditions that can arise in actual operation. Consequently, the processing capability and reliability in handling hardware faults or firmware bugs can be greatly improved.
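A rough software analogue of the pseudo interrupt generation unit 41 is shown below. It assumes, without support from the description, that the pseudo requests INT0' to INTn' are simply ORed with the real interrupt lines before reaching the CPU, and that 32 lines are available; the function names are hypothetical.

```python
def pseudo_interrupts(fault_indices, n_lines: int = 32) -> int:
    """Build the INT0'..INTn' word with one bit set per simulated fault."""
    word = 0
    for i in fault_indices:
        if not 0 <= i < n_lines:
            raise ValueError(f"interrupt line {i} out of range")
        word |= 1 << i
    return word


def cpu_int_inputs(real_int: int, pseudo_int: int) -> int:
    """Combine real and pseudo requests (bitwise OR is an assumption)."""
    return real_int | pseudo_int
```

Driving several bits at once in this way corresponds to the simulated load condition mentioned above.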
[0019]
Such a fault monitoring device 30 is preferably implemented in a dedicated LSI, a reprogrammable FPGA, or the like, and is mounted on the same board as the processor unit 11 and the functional devices 18a and 18b.
[0020]
FIG. 3 shows the storage format of the nonvolatile memory according to the embodiment. The nonvolatile memory 39, in one example, comprises an area 39a that stores the number of DMA accesses performed between the CPU 12 and the upper module 20, an area 39b that stores various fault information about the peripheral circuit section including the subordinate functional devices 18a and 18b, and an area 39c that stores the interrupt request signals INT0 to INT31 from the peripheral circuit section.
[0021]
The fault information in area 39b includes various alarm signals ALM detected on the system, the data parity error signal DATPER and the address parity error signal ADDPER detected on the processor bus 15, various status signals reporting an arithmetic overflow OVF in the CPU 12, and so on. Faults occurring in the functional devices 18 are represented in the first instance by the interrupt request signals INT0 to INT31 and stored in area 39c. Detailed information on the fault that caused the interrupt request (an individual hardware fault, a local bus parity fault, a protocol error on an external interface, a power supply fault, a clock synchronization fault, and so on) may be stored in area 39c together with the interrupt request signals INT0 to INT31, or in area 39b.
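To make the three-area format concrete, one log record could be laid out as in the following sketch. The description names the areas 39a, 39b, and 39c but gives no field widths, byte order, or bit positions, so the three little-endian 32-bit words and the flag bit assignments here are assumptions, as are the function names.

```python
import struct

# Assumed layout: area 39a, area 39b, area 39c as three little-endian 32-bit words.
RECORD = struct.Struct("<III")

# Assumed bit positions for the fault flags held in area 39b.
ALM, DATPER, ADDPER, OVF = 1 << 0, 1 << 1, 1 << 2, 1 << 3


def pack_record(dma_count: int, fault_flags: int, int_lines: int) -> bytes:
    """Serialize one log record for writing to the nonvolatile memory."""
    return RECORD.pack(dma_count, fault_flags, int_lines)


def unpack_record(raw: bytes) -> dict:
    """Decode one log record, as a maintenance terminal might."""
    dma_count, fault_flags, int_lines = RECORD.unpack(raw)
    return {
        "area_39a_dma_count": dma_count,
        "area_39b_faults": fault_flags,
        "area_39c_int": int_lines,   # bit i holds INTi, i = 0..31
    }
```

A record with seven DMA accesses, a data parity error plus an overflow, and INT0 and INT31 pending would be packed with `pack_record(7, DATPER | OVF, 0x80000001)`.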
[0022]
Next, the fault monitoring operation with the above configuration will be described. Returning to FIG. 2, the CPU 12 has an internal interrupt mask (not shown) and can accept and process only the interrupt requests it needs at any given time. The nonvolatile memory 39, on the other hand, can record not only interrupt requests accepted for processing by the CPU 12 but also unprocessed ones (that is, requests awaiting processing or masked). Accordingly, the interrupt processing load on the CPU 12 is reduced to the necessary minimum, while the interrupt requests left unprocessed when a high CPU load is detected are recorded in the nonvolatile memory 39 without being lost.
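The difference between what the CPU accepts and what the monitor records can be illustrated with a small bitwise sketch. The mask polarity (a 1 bit meaning the CPU accepts that line) and the function name are assumptions; the point taken from the description is only that the monitor observes the interrupt bus ahead of the CPU's internal mask.

```python
def split_requests(int_lines: int, cpu_mask: int):
    """Return (accepted, unprocessed, logged) interrupt words.

    cpu_mask bit i = 1 means the CPU currently accepts INTi (assumed polarity).
    """
    accepted = int_lines & cpu_mask       # requests the CPU will service
    unprocessed = int_lines & ~cpu_mask   # masked or waiting requests
    logged = int_lines                    # the monitor records all of them
    return accepted, unprocessed, logged
```

With requests `0b1101` pending and a mask of `0b0101`, the CPU sees `0b0101`, while the log still contains the masked request on line 3.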
[0023]
FIG. 4 is a timing chart of interrupt information acquisition according to the embodiment. In the figure, WDT is the watchdog timer, WE1 is the write enable signal for the memory 39, SP is the sampling pulse for the interrupt request signals, MWC is a count signal for generating the write timing of the memory 39, MAD is the write address signal of the memory 39, MCS is the chip select signal of the memory 39, MOE is the data read enable signal of the memory 39, MWE is the data write enable signal of the memory 39, and MDAT is the data signal written to the memory 39.
[0024]
After being reset by the previous reset pulse RS from the CPU 12, the watchdog timer 36 counts up on the clock signal CLKA. When the count value t of the watchdog timer 36 exceeds the predetermined threshold TH2, the high-load determination unit 37 sets the write enable signal WE1 to 1 and generates a sampling pulse SP for the interrupt request signals. In response, the memory control unit 38 samples the interrupt request signals INT0 to INTn on the memory bus 40 with the sampling pulse SP and writes them to the nonvolatile memory 39 at a predetermined timing.
[0025]
Further, after setting the write enable signal WE1 to 1, the high-load determination unit 37 periodically generates second, third, and subsequent sampling pulses SP, for example every T ms. In response, the memory control unit 38 samples the interrupt request signals INT0 to INTn at each point in time in synchronization with the pulse SP and stores them sequentially at successive addresses of the nonvolatile memory 39. If the watchdog timer 36 is reset by the CPU 12 before its count value reaches a predetermined upper limit M, further sampling of the interrupt request signals INT0 to INTn and writing to the memory 39 are stopped. If, however, the watchdog timer 36 is not reset because the CPU 12 has run away or stopped, and its count value eventually exceeds the upper limit M, log information showing which interrupt requests occurred, and in what pattern, up to that point has been stored in the nonvolatile memory 39.
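The sampling sequence driven by the watchdog timer can be followed step by step in a small simulation. The sketch below is an assumption-laden model, not the disclosed circuit: time is measured directly in WDT counts, the sampling period is given in counts rather than milliseconds, and the function name and the `reset_at` parameter are hypothetical.

```python
def capture_interrupt_log(int_samples, th2, m, period, reset_at=None):
    """Simulate one watchdog cycle and return the logged (count, INT word) pairs.

    int_samples[t] is the INT0..INTn word visible when the WDT count is t.
    th2 is the early-warning threshold TH2, m the upper limit M.
    reset_at is the count at which the CPU resets the WDT, or None if never.
    """
    log = []
    for t, word in enumerate(int_samples):
        if reset_at is not None and t == reset_at:
            break                      # CPU recovered: sampling stops
        if t > m:
            break                      # WDT expired: the log so far is retained
        # WE1 = 1 once t exceeds TH2; a sampling pulse SP fires every `period` counts.
        if t > th2 and (t - th2 - 1) % period == 0:
            log.append((t, word))      # stored at the next memory address
    return log
```

With TH2 = 3, M = 9, and a period of 2, samples are taken at counts 4, 6, and 8 if the CPU never resets the timer, and only at counts 4 and 6 if it resets the timer at count 7.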
[0026]
The sampling of the interrupt request signals INT0 to INTn and the writing to the memory 39 may be continued for an appropriate time after the count value of the watchdog timer 36 exceeds the upper limit M. In this way, the system status is effectively recorded not only before but also after the failure of the CPU 12.
[0027]
FIG. 5 is a timing chart of CPU bus access information acquisition according to the embodiment. In the figure, AD/DAT is the address/data signal, CMD is a command signal such as read or write, Frame is the bus access start signal, Irdy is the input ready signal, Devsel is the device select signal, Trdy is the transfer ready signal, Ac-cnt is a timing signal for monitoring the bus access period, Ac-end is the bus access end signal, and WE2 is the write enable signal for the memory 39.
[0028]
The normal bus access is shown on the left side of the figure. The example data transfer sequence has been completed by Ac-cnt = “A”. This indicates that the CPU 12 is not in a high load state, and therefore the write enable signal WE2 is not set. On the other hand, the right side of the figure shows an abnormal bus access. The data transfer sequence in this case is not completed even after Ac-cnt = “A”, which indicates that the CPU 12 is in a high load state (or abnormal). As a result, the write enable signal WE2 is set. Thereafter, although not shown, the bus access signal is appropriately sampled and sequentially stored in the memory 39.
[0029]
The bus access signal may be monitored not only by monitoring the upper limit of the access time but also by monitoring the occurrence pattern of each signal level to be generated at each intermediate timing. In this way, an abnormal state of the bus access (or a high load state of the CPU 12) can be detected earlier, so that more bus access signals before the CPU 12 goes out of control or stops can be sampled and stored in the memory 39.
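The benefit of checking intermediate timings, rather than only the overall access time, can be seen in a short sketch. The per-signal deadlines and the function name below are hypothetical values chosen for illustration; the description does not specify when each handshake signal is expected.

```python
def first_violation(events, deadlines):
    """Return (cycle, signal) for the earliest missed deadline, or None.

    events maps a signal name to the cycle at which it was observed;
    a signal missing from events never occurred.
    deadlines maps a signal name to the latest cycle at which it may occur.
    """
    missed = [
        (limit, name)
        for name, limit in deadlines.items()
        if events.get(name, float("inf")) > limit
    ]
    return min(missed) if missed else None
```

With assumed deadlines of cycle 2 for Devsel, 4 for Trdy, and 8 for Ac-end, an access that stalls after Devsel is flagged at cycle 4 instead of cycle 8, leaving more time to sample bus signals before the CPU fails.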
[0030]
FIG. 6 is a timing chart of DMA transfer count acquisition according to the embodiment. In the figure, WDT is the watchdog timer, WE1 is the write enable signal for the memory 39, AWR is the address start bit, ADE is a bit-serial command/address data signal, WDAT is bit-serial write data, Aend is the access end bit indicating the end of one DMA access, and DMAcnt is the number of DMA accesses that have occurred.
[0031]
When a high-load state of the CPU 12 is detected (that is, WE1 = 1), counting of DMA accesses and control of writing the count value to the memory 39 are performed. That is, DMAcnt is incremented at each address start bit AWR, and when the watchdog timer WDT eventually exceeds the maximum value M, the count value n at that point is stored in the memory 39. In this embodiment, DMA accesses occur most often when data transfers to the upper module 20 (that is, fault reports and the like) are frequent, so the count can be useful recorded information for analyzing the system status at the time of the failure.
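The DMA access count can be expressed as a simple tally over the window between WE1 being set and the watchdog timer expiring. In this sketch the event times are abstract counts, the window boundaries are treated as inclusive, and the function name is hypothetical; none of these details are fixed by the description.

```python
def dma_count_at_expiry(awr_times, we1_time, expiry_time):
    """Count address start bits AWR seen from WE1 = 1 until WDT expiry.

    awr_times lists the times at which an AWR bit was observed on the DMA bus.
    """
    return sum(1 for t in awr_times if we1_time <= t <= expiry_time)
```

For AWR bits observed at times 1, 5, 6, 9, and 12, with WE1 set at time 4 and expiry at time 10, the stored count value n is 3.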
[0032]
In this example, monitoring and recording of the number of DMA accesses are started when WE1 = 1, but monitoring and recording of the number of DMA accesses may be started when WE2 = 1 or WE3 = 1. The same applies to monitoring and recording of other interrupt request information and bus access information. In this case, a bus access arbitration unit is provided so that data write access to memory bus 40 does not conflict. In this way, the minimum log information required for analyzing the system failure can be efficiently acquired and recorded.
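One way the arbitration of write accesses to the memory bus 40 could work is a fixed-priority scheme, sketched below. The description only states that an arbitration unit is provided; the fixed priority order WE1, WE2, WE3, the one-write-per-grant rule, and the function names are assumptions made for illustration.

```python
def arbitrate(requests, priority=("WE1", "WE2", "WE3")):
    """Grant the memory bus to the highest-priority active requester."""
    for name in priority:
        if requests.get(name):
            return name
    return None


def drain(pending):
    """Serialize queued writes.

    pending maps a source name (one of the names known to arbitrate) to a
    list of data words; the result is the order in which they reach memory.
    """
    queues = {name: list(words) for name, words in pending.items()}
    order = []
    while True:
        grant = arbitrate({name: bool(words) for name, words in queues.items()})
        if grant is None:
            return order
        order.append((grant, queues[grant].pop(0)))
```

A round-robin policy would serve equally well as an example; fixed priority is used here only because it is the shortest to write down.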
[0033]
At the maintenance terminal 50, the information recorded in the memory 39 is read out as needed and analyzed statistically, which makes it easy to determine which part of the system caused the failure of the CPU 12. Conversely, when a fault occurs in a specific part, it is possible to verify how the hardware reacts, how that reaction is conveyed to the application software of the CPU 12, and whether the hardware and the application software can respond correctly. If a problem is found, a cycle of incorporating changes to prevent recurrence can be repeated, aiming at further quality improvement.
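As one example of the kind of statistical analysis a maintenance terminal might perform, the logged interrupt words can be reduced to a per-line histogram. This sketch assumes each logged sample is an integer with INTi at bit i and that 32 lines are recorded; the function name is hypothetical, and the description does not prescribe any particular analysis.

```python
from collections import Counter


def interrupt_histogram(samples, n_lines: int = 32) -> Counter:
    """Count, for each interrupt line, how many logged samples had it set."""
    counts = Counter()
    for word in samples:
        for i in range(n_lines):
            if (word >> i) & 1:
                counts[i] += 1
    return counts
```

Calling `interrupt_histogram(log).most_common(1)` then points at the interrupt line that was active most often in the period leading up to the failure.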
[0034]
Although not shown, a lamp may be lit by each of the signals WE1 to WE3, which indicate that the CPU 12 is in a high-load state, to issue a warning to the outside. This makes it possible to alert maintenance personnel in advance to a possible system failure. The warning information WE1 to WE3 can also be used not only within one device but also as warning information for the opposite device in the communication link, which is highly effective in improving the operational reliability of the communication system as a whole.
[0035]
Although the preferred embodiments of the present invention have been described, it goes without saying that various changes in the configuration, control, processing, and combinations thereof can be made without departing from the spirit of the present invention.
[0036]
(Supplementary Note 1) A fault monitoring device for a CPU system in which a CPU and a peripheral circuit including one or more functional devices for realizing predetermined functions are interconnected via a common bus of the CPU, the device comprising: detecting means for detecting a predetermined state that is likely to interfere with the processing of the CPU, based on a predetermined signal from which the operating state of the CPU can be estimated from outside; and log information recording means for acquiring, upon detection of the predetermined state, a predetermined signal exchanged between the CPU and the peripheral circuit and recording the signal in a nonvolatile memory.
[0037]
(Supplementary Note 2) The fault monitoring device for a CPU system according to Supplementary Note 1, wherein the detecting means detects the predetermined state that is likely to interfere with the processing of the CPU by detecting that the value of a watchdog timer to be periodically reset by the CPU has exceeded a predetermined threshold, that the response time to a bus assertion from a functional device has exceeded a predetermined threshold, or that a predetermined high load state exists with respect to interrupt requests from a functional device.
[0038]
(Supplementary Note 3) The fault monitoring device for a CPU system according to Supplementary Note 1, wherein the predetermined signal exchanged between the CPU and the peripheral circuit is a signal relating to bus access on the common bus, a signal relating to an interrupt request from a functional device to the CPU, or a signal relating to DMA access activated by the CPU.
[0039]
(Supplementary Note 4) The failure monitoring device of the CPU system according to Supplementary Note 1, further comprising an interface unit for reading the content of the nonvolatile memory to an externally connected device.
[0040]
(Supplementary Note 5) The fault monitoring device for a CPU system according to Supplementary Note 1, further comprising a pseudo interrupt generating unit that simulates a signal corresponding to an interrupt request from a functional device and thereby issues an interrupt request to the CPU.
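The three detection conditions listed in Supplementary Note 2 can be sketched as simple threshold comparisons. This is a minimal illustration under stated assumptions: the threshold values are hypothetical, and the "high load state" for interrupt requests is modeled here as the number of pending interrupt requests exceeding a limit, which is only one possible definition. The names `Thresholds` and `detect` are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    wdt_count: int            # hypothetical limit on the watchdog timer value
    bus_response_cycles: int  # hypothetical limit on response time to a bus assertion
    pending_interrupts: int   # hypothetical limit on outstanding interrupt requests


def detect(wdt_value, bus_response, pending_irqs, th):
    """Return the set of conditions that indicate the CPU may be in trouble."""
    reasons = set()
    if wdt_value > th.wdt_count:                # CPU failed to reset the watchdog in time
        reasons.add("watchdog")
    if bus_response > th.bus_response_cycles:   # slow response to a bus assertion
        reasons.add("bus_response")
    if pending_irqs > th.pending_interrupts:    # assumed definition of interrupt high load
        reasons.add("interrupt_load")
    return reasons


limits = Thresholds(wdt_count=8, bus_response_cycles=4, pending_interrupts=5)
normal = detect(wdt_value=1, bus_response=1, pending_irqs=1, th=limits)
overdue = detect(wdt_value=10, bus_response=2, pending_irqs=0, th=limits)
```

Any non-empty result would correspond to the predetermined state that triggers log recording; returning the individual reasons rather than a single flag keeps the three conditions distinguishable in later analysis.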
[0041]
[Effect of the Invention]
As described above, according to the present invention, even if the CPU runs away or stops during operation, information on the system operating status before and after that event can be recorded autonomously, so that information useful for failure analysis is obtained and can be used to prevent recurrence. The invention therefore contributes greatly to improving the reliability of this type of CPU system.
[Brief description of the drawings]
FIG. 1 is a diagram illustrating the principle of the present invention.
FIG. 2 is a diagram illustrating a configuration of a failure monitoring method according to an embodiment.
FIG. 3 is a diagram illustrating a storage format of a nonvolatile memory according to the embodiment.
FIG. 4 is a timing chart of interrupt information acquisition according to the embodiment.
FIG. 5 is a timing chart for acquiring CPU bus access information according to the embodiment.
FIG. 6 is a timing chart for acquiring DMA transfer count information according to the embodiment.
[Explanation of symbols]
10 Communication control unit
11 Processor unit
12 CPU
13 Main memory (MM)
14 DMA control unit (DMA)
15 Processor bus (PRB)
16 Bus interface (BIF)
17 Local bus (LOB)
18a, 18b Functional device (line termination unit, etc.)
19 Interrupt bus (INTB)
20 Upper module
21 DMA bus (DMAB)
30 Failure monitoring unit
31 DMA monitoring unit
32 PBA monitoring unit
33, 35, 37 High load determination unit
34 Interrupt monitoring unit
36 Watchdog timer (WDT)
38 Memory control unit
39 Non-volatile memory
41 Pseudo interrupt generation unit
50 Maintenance terminal

Claims

A fault monitoring device for a CPU system in which a CPU and a peripheral circuit including one or more functional devices for realizing predetermined functions are interconnected via a common bus of the CPU, the fault monitoring device comprising:
detecting means for detecting a predetermined state that is likely to interfere with the processing of the CPU, based on a predetermined signal from which the operating state of the CPU can be estimated from outside; and
log information recording means for acquiring, upon detection of the predetermined state, a predetermined signal exchanged between the CPU and the peripheral circuit and recording the signal in a nonvolatile memory.

The fault monitoring device for a CPU system according to claim 1, wherein the detecting means detects the predetermined state that is likely to interfere with the processing of the CPU by detecting that the value of a watchdog timer to be periodically reset by the CPU has exceeded a predetermined threshold, that the response time to a bus assertion from a functional device has exceeded a predetermined threshold, or that a predetermined high load state exists with respect to interrupt requests from a functional device.

The fault monitoring device for a CPU system according to claim 1, wherein the predetermined signal exchanged between the CPU and the peripheral circuit is a signal relating to bus access on the common bus, a signal relating to an interrupt request from a functional device to the CPU, or a signal relating to DMA access activated by the CPU.