CN114897150B - Reliability design method of AI intelligent module - Google Patents

Reliability design method of AI intelligent module

Info

Publication number
CN114897150B
Authority
CN
China
Prior art keywords
intelligent
fpga
program
current
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210347712.5A
Other languages
Chinese (zh)
Other versions
CN114897150A (en
Inventor
徐友庆
朱宗卫
周学海
李曦
王超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Institute Of Higher Studies University Of Science And Technology Of China
Original Assignee
Suzhou Institute Of Higher Studies University Of Science And Technology Of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Institute Of Higher Studies University Of Science And Technology Of China filed Critical Suzhou Institute Of Higher Studies University Of Science And Technology Of China
Priority to CN202210347712.5A priority Critical patent/CN114897150B/en
Publication of CN114897150A publication Critical patent/CN114897150A/en
Application granted granted Critical
Publication of CN114897150B publication Critical patent/CN114897150B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/30Circuit design
    • G06F30/39Circuit design at the physical level
    • G06F30/398Design verification or optimisation, e.g. using design rule check [DRC], layout versus schematics [LVS] or finite element methods [FEM]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/60Software deployment
    • G06F8/61Installation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2119/00Details relating to the type or aim of the analysis or the optimisation
    • G06F2119/02Reliability analysis or reliability optimisation; Failure analysis, e.g. worst case scenario performance, failure mode and effects analysis [FMEA]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Biomedical Technology (AREA)
  • Neurology (AREA)
  • Geometry (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Stored Programmes (AREA)

Abstract

The invention discloses a reliability design method for an AI intelligent module. The AI intelligent module comprises an intelligent processing unit, together with a memory unit, a power supply unit, an Ethernet interface unit, a PCIE interface unit, a temperature monitoring unit and a current monitoring unit, all connected to the intelligent processing unit. Through a gigabit network interface and a PCIE high-speed interface, the module can receive data such as images and video and accelerate the relevant AI algorithms, for example image enhancement, image classification, target detection, target identification and target tracking. The reliability design method comprises three main approaches: the first is based on current-limiting protection in the hardware circuit, the second on uploading and updating the software program, and the third on a soft watchdog and a restart instruction. Implementing these three approaches generally also requires a matching auxiliary control board, such as an FPGA or a CPLD.

Description

Reliability design method of AI intelligent module
Technical Field
The invention relates to the field of artificial intelligence, and in particular to a reliability design method for an AI intelligent module.
Background
AI technology has entered a period of rapid development, and industries are competing to explore how artificial intelligence can empower them. An AI intelligent module that integrates a high-performance, high-reliability AI chip into a device is therefore needed, so as to provide an intelligent solution and to accelerate, more efficiently and reliably, the building of various deep-learning-based algorithm frameworks, enabling rapid intelligent upgrading of industry.
Disclosure of Invention
The purpose of the invention is to provide a reliability design method for an AI intelligent module that improves the reliability of the module at several different levels.
The technical scheme of the invention is as follows:
A reliability design method for an AI intelligent module, wherein the AI intelligent module comprises an intelligent processing unit, and a memory unit, a power supply unit, an Ethernet interface unit, a PCIE interface unit, a temperature monitoring unit and a current monitoring unit which are connected to the intelligent processing unit; the intelligent processing unit uses an intelligent chip with an ARM + NPU architecture, which integrates a main processor unit (CPU), a neural network unit (NPU), a video coding and decoding unit (VPU) and a picture coding and decoding unit (JPU);
the reliability design method of the AI intelligent module comprises three approaches: the first is based on hardware-circuit current-limiting protection, the second on software program upload and update, and the third on a soft watchdog and a restart instruction;
the program of the AI intelligent module is divided into a firmware program and an application program, wherein the firmware program comprises a preboot program, a uboot bootloader, a linux kernel and a rootfs file system; the application program comprises a neural network model, an algorithm library, an executable program and the corresponding script programs; the program upload and update process is divided into two stages: in the first stage, the firmware program and the application program are uploaded and written into three NorFlash chips, whose space is divided into two parts, the first storing the system firmware program and the second storing the application program; in the second stage, the firmware program and the application program are read from the three NorFlash chips, passed through "two out of three" voting logic, loaded, and executed.
Preferably, the first-stage program upload process comprises:
(1) The upper computer sends an instruction to switch to the upload state;
(2) The upper computer divides the upload data into a number of fixed-size data segments and uploads them segment by segment; each time a segment is uploaded, the FPGA automatically erases that segment, then receives and unpacks the data and writes it into the three NorFlash chips simultaneously;
(3) A CRC16 check is performed each time a data segment is received; if the check passes, the corresponding bit in the bitmap is set to 1;
(4) When the upper computer has finished sending the upload data, it sends a bitmap telemetry request instruction and the FPGA downlinks the corresponding bitmap telemetry; from it the upper computer determines whether each data segment passed its upload check and retransmits the failed segments;
(5) After all data segments have been uploaded successfully, the upper computer sends an end-of-upload instruction, the FPGA exits the NorFlash programming mode, and the first stage is complete.
Preferably, the second-stage program loading process comprises:
(1) The upper computer sends an instruction to switch to the running state, and the boot mode of the intelligent chip is switched to SPI boot;
(2) The FPGA controls the power-on and start-up of the intelligent chip; the intelligent chip reads the system firmware program from the first part of the NorFlash storage space through the SPI interface emulated by the FPGA; two-out-of-three voting is performed during reading: the data of the three NorFlash chips are compared bit by bit, and the value on which at least two chips agree is taken;
(3) During start-up, the intelligent chip automatically mounts the second part of the NorFlash storage space; this space can be divided into different partitions according to application requirements, used respectively for storing the neural network model, the algorithm library, the executable program and similar content.
Preferably, the program upload and update process further comprises emulating an SPI device with the FPGA: the FPGA emulates the first part of the NorFlash as an SPI Flash, replacing the original SPI Flash on the intelligent module; the intelligent chip reads the system firmware through the FPGA-emulated SPI interface to complete loading and start-up of the system.
Preferably, the program upload and update process further comprises emulating an SDIO device with the FPGA: the FPGA emulates the second part of the NorFlash storage space as an SD device, replacing the original eMMC device on the intelligent module. The SD FakeX module and the NorFlash together form a virtual SD card: the NorFlash only provides data storage, while SD FakeX, a Verilog module, parses and responds to SD bus commands and, when the SD host requests data, reads it from the NorFlash and sends it to the SD host. In other words, the FPGA transfers the data to the intelligent chip over SDIO, completing the upload and update of the application program.
Preferably, the reliability design further comprises a current-limiting protection design, in which 2 key currents are monitored for current-limiting control: the 5V total input current of the intelligent chip and the 1.1V DDR current. The 2 key currents are monitored through the ADC of an external main control FPGA. The thresholds of the 2 key currents at which the intelligent chip suffers single-event latch-up are measured in subsequent tests, and current-limiting control is implemented in the main control FPGA program according to these measured thresholds; when a current exceeds its set threshold, the input power supply is cut off quickly, protecting the AI intelligent module from damage;
the main control FPGA first checks the total input current of the intelligent chip; if it is greater than its threshold, the main control FPGA powers the AI intelligent module off and restarts it; if the total input current is below its threshold, the FPGA then checks whether the second current is greater than its threshold, and if so powers the AI intelligent module off and restarts it; if both currents are below their thresholds, the intelligent chip is judged to have no single-event latch-up.
Preferably, the reliability design further comprises a restart instruction: a dedicated restart instruction is defined in the instruction set, through which the AI intelligent module can be powered on and off, so that the external FPGA can power-cycle the AI intelligent module as required.
Preferably, the reliability design further comprises a soft watchdog: the AI intelligent module periodically sends state flag information to the FPGA, so that the FPGA can obtain the state of the intelligent chip in a timely manner; if no state flag information is received from the intelligent chip within a set time, the AI intelligent module is restarted.
The invention has the advantages that:
1. In the reliability design method of the AI intelligent module, the AI intelligent module mainly comprises an intelligent processing unit, an LPDDR4x memory unit, a power supply unit, an Ethernet unit, a temperature monitoring unit and a current monitoring unit. The AI intelligent module can transfer data through a gigabit network and a PCIE high-speed interface, can perform intelligent processing such as image enhancement, image classification, image target detection and image tracking, and returns the processing results to other modules through a serial port or other interfaces, thereby accelerating the building of various deep-learning-based algorithm frameworks to enable rapid intelligent upgrading of industry.
2. In the reliability design method of the AI intelligent module, the whole program upload and update is divided into two stages. The first stage writes the upload data from the upper computer into the NorFlash, whose storage space is divided into a first part for system firmware and a second part for the application program. In the second stage, the AI chip reads the corresponding system firmware program and application program from the NorFlash.
3. The reliability design method of the AI intelligent module also covers the emulated SPI device, the emulated SDIO device, current-limiting protection, the restart instruction and the soft watchdog, improving the reliability of the AI intelligent module across the board.
Drawings
The invention is further described below with reference to the following figures and examples:
FIG. 1 is a functional block diagram of an AI intelligence module of the present invention;
FIG. 2 is a functional block diagram of a power supply unit;
FIG. 3 is a diagram of a power-up sequence control for a power supply unit;
FIG. 4 is a diagram of the overall architecture of the program of the reliability design method of the present invention;
FIG. 5 is a diagram of the first-stage upload process;
FIG. 6 is a schematic diagram of the emulated SPI device;
FIG. 7 is a schematic diagram of the emulated SDIO device;
FIG. 8 is a schematic view of the current monitoring principle;
FIG. 9 is a flow chart of current monitoring.
Detailed Description
As shown in fig. 1, the scheme of the present invention is applied to the design scheme of an AI intelligent module, and the AI intelligent module mainly comprises an intelligent processing unit, a memory unit, a power supply unit, an ethernet unit, a PCIE unit, a temperature monitoring unit, and a current monitoring unit.
The AI intelligent module can transfer data through the gigabit network and the PCIE high-speed interface, can perform intelligent processing such as image enhancement, image classification, image target detection and image tracking, and returns the processing result to other modules through a serial port or other interfaces.
The AI intelligent module externally provides 2 PCIE interfaces, 2 SGMII interfaces, 10 GPIO interfaces, 2 SPI interfaces, 2 I2C interfaces, 4 UART serial ports and 2 SDIO interfaces.
The intelligent chip is the core of the intelligent processing unit. It uses an ARM + NPU architecture and integrates a main processor unit (CPU), a neural network unit (NPU), a video coding and decoding unit (VPU) and a picture coding and decoding unit (JPU).
The intelligent chip supports 2-channel 64bit LPDDR4 memory with ECC in a range of capacities, which meets the computing and storage requirements of various inference scenarios. It also offers a rich set of peripheral interfaces, such as PCIE, UART, SPI, SDIO, eMMC and GPIO.
The memory unit uses two Micron 4GB (1G × 32bit) LPDDR4x devices, combined by data-width expansion to give a total capacity of 8GB; the memory part number is MT53D1024M32D4DT-053 AIT.
The AI intelligent module takes a single DC 5V power input, which is stepped down by DC-DC converters and LDOs to the different voltages that supply each chip. The power conversion uses high-reliability DC-DC and LDO chips of three main types: the DC-DC chips RSHF2000LRH (2 pieces) and RSS0508HRH (1 piece), and the LDO chip RSW1101HRH (5 pieces). In addition, the module uses the power-up sequencing chip RSS5004CRH (2 pieces) to control the power-on order of the power chips; the power supply block diagram is shown in fig. 2.
Because the module needs several supply rails with different power-on sequences, a power-on sequencing circuit is required. Although the module has more than 30 supply branches in total, analysis shows that only 10 groups of supplies need to be started in sequence; the remaining branches start together with the main supply branch they hang under. The specific start-up sequence is shown in fig. 3.
Sequencing of the module's power-supply enable signals is handled by a radiation-hardened programmable sequencing chip, the RSS5004CRH.
The delay time is set by an external resistor and can be adjusted from 2ms to 20ms. The supply detection port is compared against an internal high-precision reference voltage, which allows accurate monitoring of the monitored supply voltage and ensures the accuracy of the delay time. The chip also provides several error detection functions: power-on and power-off sequence detection, unexpected power-down detection, input error detection, output error detection and external trigger error detection. When an error signal is detected, the power supply system is shut down to protect the downstream circuitry, improving system reliability.
VDD, the input of the internal linear regulator, accepts 3-16V. VCC5 is the output of the linear regulator and powers the internal circuitry. The VCC under-voltage lockout threshold is 2.8V: once the supply exceeds 2.8V, the internal 600mV reference and the other blocks all start up; when the supply falls below 2.8V, the chip turns off the internal clock and reference voltage, and all ENx outputs are pulled low until VCC5 drops below 1.2V.
The RSS5004CRH comes in a small LCC18 package, which helps reduce PCB area. To meet the module's power start-up sequencing requirements, a single sequencing scheme is used in which 2 cascaded RSS5004CRH chips form up to 8 sequencing channels.
The AI intelligent module converts the RGMII interface into an SGMII interface through the PHY chip 88E1512, so that the intelligent processor can communicate with an external board over a gigabit network; the module receives image data through the SGMII interface, performs intelligent computation, and returns the processing result to the external board.
The 88E1512 is a 10M/100M/1000M Ethernet transceiver in a 56-pin 8mm x 8mm QFN package, fabricated in a standard digital CMOS process. An integrated switching regulator generates the supply voltages it needs, and the chip has ultra-low power consumption. The 88E1512 supports the LVCMOS I/O standard on its RGMII pins.
The AI intelligent module integrates a temperature detection unit that monitors the module temperature in real time, preventing abnormal operation caused by overheating.
The temperature monitoring chip is the high-reliability TMP461-SP, a radiation-hardened, high-precision, low-power temperature sensor with a built-in local temperature sensor. Its temperature data is 12 bits wide with a resolution of 0.0625 ℃; it comes in an 8-pin CFP package measuring 7.1mm x 6.2mm and communicates over a two-wire SMBus serial interface.
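The stated 12-bit width and 0.0625 ℃ resolution imply that one count corresponds to 1/16 of a degree. The following minimal sketch shows that arithmetic only. It assumes the 12-bit code is a two's-complement value, which the text does not state; the function name `raw12_to_celsius` is hypothetical, and the actual register layout should be taken from the sensor datasheet.

```python
RESOLUTION_C = 0.0625  # degrees Celsius per count, as stated in the text


def raw12_to_celsius(raw: int) -> float:
    """Convert a 12-bit temperature code to degrees Celsius.

    Assumption (not from the source): the code is two's complement,
    so bit 11 is the sign bit.
    """
    if not 0 <= raw <= 0xFFF:
        raise ValueError("raw must be a 12-bit value")
    if raw & 0x800:          # sign bit set: value is negative
        raw -= 0x1000
    return raw * RESOLUTION_C


if __name__ == "__main__":
    print(raw12_to_celsius(0x190))  # 400 counts * 0.0625
```

Under this assumption a 12-bit code spans 4096 × 0.0625 = 256 ℃ in total.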
The external interfaces are described in detail in Table 1.
TABLE 1 AI external interface
[Table 1 appears as an image in the original publication.]
The reliability design method of the AI intelligent module comprises the following contents.
S1, overall process
Fig. 4 shows the overall architecture of the program upload and update. The program of the AI intelligent module is divided into a firmware program and an application program, wherein the firmware program includes a preboot program, a uboot bootloader, a linux kernel and a rootfs file system; the application program includes a neural network model, an algorithm library, an executable program and the corresponding script programs. The upload process is divided into two stages. In the first stage, the upper computer uploads the data (firmware and application) into three NorFlash chips, whose space is divided into two parts: the first stores the system firmware program, and the second stores the model, the algorithm library and the application program. In the second stage, the firmware program and the application program are read from the three NorFlash chips, passed through two-out-of-three voting logic, loaded, and executed.
As shown in fig. 5, the detailed process of the first stage is as follows:
During upload, the upper computer program divides the data file to be uploaded into a number of fixed-size data segments, and each data segment is further divided into several data packets for transmission; the maximum payload of a data packet is determined by the transmission protocol. The FPGA checks each data segment and records the result for downlink, and a retransmission mechanism handles erroneous segments, which saves link resources as far as possible. The complete upload process is as follows:
(1) The upper computer sends an instruction to switch to the upload state;
(2) The upper computer divides the upload data into a number of fixed-size data segments and uploads them segment by segment; each time a segment is uploaded, the FPGA automatically erases that segment, then receives and unpacks the data and writes it into the three NorFlash chips simultaneously;
(3) A CRC16 check is performed each time a data segment is received; if the check passes, the corresponding bit in the bitmap is set to 1;
(4) When the upper computer has finished sending the upload data, it sends a bitmap telemetry request instruction and the FPGA downlinks the corresponding bitmap telemetry; from it the upper computer determines whether each data segment passed its upload check and retransmits the failed segments;
(5) After all data segments have been uploaded successfully, the upper computer sends an end-of-upload instruction, the FPGA exits the NorFlash programming mode, and the first stage is complete.
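The first-stage steps can be illustrated with a small host-side simulation. This is a sketch under stated assumptions, not the patented implementation: the text names only "CRC16", so the CRC-16/CCITT-FALSE variant is assumed here; the 256-byte segment size, the `FpgaModel` class, the `link` callback that models the transmission channel, and the retry limit are all hypothetical.

```python
from typing import Callable, List, Optional

SEGMENT_SIZE = 256  # hypothetical fixed segment size in bytes


def crc16_ccitt(data: bytes, crc: int = 0xFFFF) -> int:
    """CRC-16/CCITT-FALSE (poly 0x1021, init 0xFFFF); the variant is an assumption."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def split_segments(image: bytes, size: int = SEGMENT_SIZE) -> List[bytes]:
    """Split the image into fixed-size segments, padding the last one with 0xFF."""
    padded = image + b"\xFF" * (-len(image) % size)
    return [padded[i:i + size] for i in range(0, len(padded), size)]


class FpgaModel:
    """Toy stand-in for the FPGA: three NorFlash copies plus a check bitmap."""

    def __init__(self, n_segments: int) -> None:
        self.flash: List[List[Optional[bytes]]] = [[None] * n_segments for _ in range(3)]
        self.bitmap = [0] * n_segments

    def receive(self, index: int, payload: bytes, crc: int) -> None:
        for chip in self.flash:
            chip[index] = None        # erase the segment before writing
        for chip in self.flash:
            chip[index] = payload     # write the same data to all three chips
        # Set the bitmap bit only when the received data passes the CRC16 check.
        self.bitmap[index] = 1 if crc16_ccitt(payload) == crc else 0


def upload(image: bytes, link: Callable[[int, bytes], bytes], max_rounds: int = 5) -> FpgaModel:
    """Upload all segments, then retransmit those the bitmap reports as failed."""
    segments = split_segments(image)
    fpga = FpgaModel(len(segments))
    pending = list(range(len(segments)))
    for _ in range(max_rounds):
        for i in pending:
            fpga.receive(i, link(i, segments[i]), crc16_ccitt(segments[i]))
        # The host reads the bitmap telemetry and collects the failed segments.
        pending = [i for i, ok in enumerate(fpga.bitmap) if not ok]
        if not pending:
            return fpga
    raise RuntimeError(f"segments still failing after retries: {pending}")
```

Passing `lambda i, data: data` as `link` models an error-free channel; a callback that flips a bit in one segment causes only that segment to be resent, which is the link-saving behaviour the retransmission mechanism is meant to provide.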
The detailed process of the second stage is as follows:
(1) The upper computer sends an instruction to switch to the running state, and the boot mode of the intelligent chip is switched to SPI boot;
(2) The FPGA controls the power-on and start-up of the intelligent chip; the intelligent chip reads the system firmware program from the first part of the NorFlash storage space through the SPI interface emulated by the FPGA; two-out-of-three voting is applied before the data is read out: the data of the three NorFlash chips are compared bit by bit, and the value on which at least two chips agree is taken;
(3) During start-up, the intelligent chip automatically mounts the second part of the NorFlash storage space; this space can be divided into different partitions according to application requirements, used respectively for storing the neural network model, the algorithm library, the application program and similar content.
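The bit-by-bit two-out-of-three comparison in step (2) is a majority vote, which for three bits reduces to the Boolean expression (a AND b) OR (a AND c) OR (b AND c). The sketch below applies it to whole bytes; in the described design this logic sits in the FPGA, so the Python function `vote_2oo3` is only an illustrative, hypothetical model of it.

```python
def vote_2oo3(a: bytes, b: bytes, c: bytes) -> bytes:
    """Bitwise two-out-of-three majority vote over three equal-length copies."""
    if not (len(a) == len(b) == len(c)):
        raise ValueError("all three copies must have the same length")
    # For each bit position, the result is the value held by at least two copies.
    return bytes((x & y) | (x & z) | (y & z) for x, y, z in zip(a, b, c))


if __name__ == "__main__":
    good = b"\xA5\x3C"
    copy_a = bytes([good[0] ^ 0x10, good[1]])   # one flipped bit in byte 0
    copy_c = bytes([good[0], good[1] ^ 0x01])   # one flipped bit in byte 1
    print(vote_2oo3(copy_a, good, copy_c) == good)
```

The vote recovers the correct data as long as no single bit position is corrupted in more than one of the three chips, even when different chips hold errors at different positions.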
S2, simulating SPI equipment
The program upload and update process also includes emulating an SPI device with the FPGA: the FPGA emulates the first part of the NorFlash as an SPI Flash, replacing the original SPI Flash of the intelligent chip. The intelligent chip reads the system firmware through the FPGA-emulated SPI interface to complete loading and start-up of the system.
As shown in fig. 6, the SPI host on the left is the intelligent chip, and SPI FakeX is a Verilog module that converts the BPI interface of the NorFlash into an SPI interface connected to the intelligent chip. The FPGA's SPI FakeX together with the first part of the NorFlash forms an SPI device that takes over the function of the intelligent chip's original SPI Flash, so that the intelligent chip can read and load the firmware program directly from the NorFlash attached to the FPGA.
S3, simulating SDIO equipment
The upload process also includes emulating an SDIO device with the FPGA. After the system has started, the SD card holding the application program must be mounted at the corresponding directory; in this project the SD card is emulated with FPGA + NorFlash. The emulated SD card is recognized by the system like an ordinary SD card, so that the applications and application drivers in the NOR Flash can be moved into the system directory.
As shown in fig. 7, the FPGA emulates the second part of the NorFlash as an SD device. The SD host on the left is an SD card host, which is typically a card reader, an embedded microprocessor or the like. The SD FakeX and the NorFlash on the right together form a virtual SD card: the NorFlash only provides data storage, while SD FakeX, a Verilog module, parses and responds to SD bus commands and, when the SD host requests data, reads it from the NorFlash and sends it to the SD host. In other words, the FPGA transfers the data to the intelligent chip over SDIO, completing the upload and update of the application program.
S4, current limiting protection
To prevent the AI intelligent module from latching up in special environments, the chip's power supply is monitored; when the current is found to rise abnormally, the module is powered off immediately to prevent chip damage and so protect the AI intelligent module. Current monitoring is shown in fig. 8:
2 key currents are monitored: the 5V total input current of the intelligent chip and the 1.1V DDR current. On the one hand, monitoring every branch would require many sampling circuits, which would take up more board area and enlarge the AI intelligent module, and too many acquisition channels would also complicate the FPGA monitoring circuit. On the other hand, the currents of the other branches are small, so measurement accuracy is hard to guarantee and thresholds are hard to determine; a small fluctuation could push a measured value above its threshold and leave the intelligent chip being powered off and restarted continually. For these reasons, 2 key currents were finally selected for current-limiting control.
The 2 key currents can be monitored through the ADC of an external main control FPGA. The thresholds of the 2 key currents at which the intelligent chip suffers single-event latch-up are measured in a subsequent single-event test, and current-limiting control is implemented in the main control FPGA program according to these measured thresholds; when a current exceeds its set threshold, the input power supply is cut off quickly, protecting the AI intelligent module from damage. The current-limiting control logic of the main control FPGA is shown in fig. 9.
The main control FPGA first checks the total input current of the intelligent chip; if it is greater than its threshold, the main control FPGA powers the intelligent chip off and restarts it. If the total input current is below its threshold, the FPGA then checks whether the second current is greater than its threshold, and if so powers the intelligent chip off and restarts it. If both currents are below their thresholds, the intelligent chip is judged to have no single-event latch-up.
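The two-step threshold check can be written as a short decision function. This is only a sketch of the decision order described in the text: the real logic runs in the main control FPGA, the names `Thresholds` and `latchup_action` are hypothetical, and the threshold values used in the example are placeholders, since the actual values come from single-event testing and are not given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Latch-up current limits in amperes; real values come from testing."""
    total_5v_amps: float
    ddr_1v1_amps: float


def latchup_action(total_5v: float, ddr_1v1: float, limits: Thresholds) -> str:
    """Return 'power_cycle' if either monitored current exceeds its limit."""
    # Step 1: check the chip's total 5V input current first.
    if total_5v > limits.total_5v_amps:
        return "power_cycle"
    # Step 2: only then check the 1.1V DDR current.
    if ddr_1v1 > limits.ddr_1v1_amps:
        return "power_cycle"
    # Both currents are within limits: no single-event latch-up detected.
    return "normal"


if __name__ == "__main__":
    limits = Thresholds(total_5v_amps=4.0, ddr_1v1_amps=1.5)  # placeholder values
    print(latchup_action(2.1, 0.6, limits))
    print(latchup_action(2.1, 1.9, limits))
```

Checking the total input current first means a latch-up that raises overall consumption triggers a power cycle immediately, while the DDR check catches a fault that is confined to the memory rail.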
S5, restarting command
To cope with unexpected conditions of the AI intelligent module, a dedicated restart instruction is defined in the instruction set; the AI intelligent module can be powered on and off by this instruction, and the external FPGA can power-cycle the AI intelligent module as required.
S6, soft watchdog
The AI intelligent module periodically sends state flag information to the FPGA so that the FPGA can obtain the state of the intelligent chip in a timely manner; if no state flag information is received from the intelligent chip for a long time, the AI intelligent module is restarted.
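The soft watchdog amounts to a timeout on the most recent state flag. The sketch below models the FPGA side in Python for illustration only; the class `SoftWatchdog`, the `restart` callback and the 5-second timeout in the example are hypothetical, as the text gives no concrete period. Time is passed in explicitly so the behaviour does not depend on a real clock.

```python
from typing import Callable


class SoftWatchdog:
    """Restart the module when no state flag arrives within `timeout_s` seconds."""

    def __init__(self, timeout_s: float, restart: Callable[[], None], now: float = 0.0) -> None:
        self.timeout_s = timeout_s
        self.restart = restart
        self.last_seen = now

    def heartbeat(self, now: float) -> None:
        """Record that state flag information was received at time `now`."""
        self.last_seen = now

    def poll(self, now: float) -> bool:
        """Check the timeout; trigger a restart and return True if it expired."""
        if now - self.last_seen > self.timeout_s:
            self.restart()
            self.last_seen = now  # give the restarted module a fresh window
            return True
        return False


if __name__ == "__main__":
    events = []
    wd = SoftWatchdog(timeout_s=5.0, restart=lambda: events.append("restart"))
    wd.heartbeat(now=1.0)
    print(wd.poll(now=4.0), wd.poll(now=7.0), events)
```

Resetting `last_seen` after a restart is a design choice made here so that the module is not restarted again before it has had time to boot and send its first state flag.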
The above embodiments are merely illustrative of the technical ideas and features of the present invention, and the purpose of the embodiments is to enable those skilled in the art to understand the contents of the present invention and implement the present invention, and not to limit the protection scope of the present invention. All modifications made according to the spirit of the main technical scheme of the invention are covered in the protection scope of the invention.

Claims (7)

1. A reliability design method of an AI intelligent module is characterized in that the AI intelligent module comprises an intelligent processing unit and a memory unit, a power supply unit, an Ethernet interface unit, a PCIE interface unit, a temperature monitoring unit and a current monitoring unit which are connected with the intelligent processing unit; the intelligent processing unit adopts an intelligent chip with an ARM + NPU architecture, and a main processor unit CPU, a neural network unit NPU, a video coding and decoding unit VPU and a picture coding and decoding unit JPU are integrated inside the intelligent processing unit;
the reliability design method of the AI intelligent module comprises three types, wherein the first type is a current-limiting protection method based on a hardware circuit, the second type is a program upload-and-update method based on a software program, and the third type is a restart instruction method based on a soft watchdog;
the program of the AI intelligent module is divided into a firmware program and an application program, wherein the firmware program comprises a preboot program, a U-Boot boot loader, a Linux kernel and a rootfs file system; the application program comprises a neural network model, an algorithm library, an executable program and a corresponding script program;
the program upload and update process is divided into two stages: in the first stage, the firmware program and the application program are uploaded to three NorFlash chips, the NorFlash space being divided into two parts, the first part storing the system firmware program and the second part storing the application program; in the second stage, the firmware program and the application program are read from the three NorFlash chips, passed through "two-out-of-three" voting logic, loaded, and executed;
the first-stage program upload flow comprises the following steps:
(1) The host computer sends an instruction to switch to the upload state;
(2) The host computer divides the upload data into a plurality of fixed-size data segments and uploads them segment by segment; each time a segment is uploaded, the FPGA automatically erases the region for the current upload data segment, then receives and unpacks the data segment and writes it into the three NorFlash chips simultaneously;
(3) A CRC16 check is performed each time a data segment is received, and if the check passes, the corresponding bit in the bitmap is set to 1;
(4) After the host computer has sent all the upload data, it sends a bitmap telemetry request command, the FPGA returns the corresponding bitmap telemetry result, and the host computer determines from it whether each data segment passed the upload check and retransmits the segments in error;
(5) After all data segments have been uploaded successfully, the host computer sends an end-of-upload instruction, the FPGA exits the NorFlash programming mode, and the first stage is complete.
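By way of non-limiting illustration, the first-stage flow of steps (1) to (5) can be modelled in Python as below. Several details are assumptions made only for the sketch: the CRC16 variant (CRC-CCITT via `binascii.crc_hqx` with initial value 0xFFFF, since the text says only "CRC16"), the tiny segment size, the names `Receiver`, `upload` and `channel`, and the modelling of the three NorFlash chips as dictionaries.

```python
import binascii

SEGMENT_SIZE = 4  # hypothetical fixed segment size in bytes, kept tiny for illustration


def crc16(data: bytes) -> int:
    # Assumption: CRC-CCITT with initial value 0xFFFF; the source only says "CRC16".
    return binascii.crc_hqx(data, 0xFFFF)


def split_segments(image: bytes, size: int = SEGMENT_SIZE) -> list:
    # Step (2): divide the upload data into fixed-size segments.
    return [image[i:i + size] for i in range(0, len(image), size)]


class Receiver:
    """Stands in for the FPGA side of the link."""

    def __init__(self, segment_count: int) -> None:
        self.bitmap = [0] * segment_count
        self.copies = [{}, {}, {}]  # three NorFlash chips modelled as dicts

    def receive(self, index: int, payload: bytes, crc: int) -> None:
        for flash in self.copies:
            flash[index] = payload  # overwrite the segment in all three copies
        # Step (3): set the bitmap bit only when the CRC16 check passes.
        self.bitmap[index] = 1 if crc16(payload) == crc else 0


def upload(segments, receiver, channel=lambda i, p, attempt: p, max_rounds=3) -> bool:
    """Host side: send all segments, read the bitmap back, resend failed segments."""
    pending = list(range(len(segments)))
    for attempt in range(max_rounds):
        for i in pending:
            # channel() models the link and may corrupt the payload in transit.
            receiver.receive(i, channel(i, segments[i], attempt), crc16(segments[i]))
        # Step (4): the bitmap tells the host which segments must be retransmitted.
        pending = [i for i, bit in enumerate(receiver.bitmap) if bit == 0]
        if not pending:
            return True  # step (5): every segment has been verified
    return False
```

The `channel` argument exists only so that a corrupted transfer can be simulated; the bound on retransmission rounds is likewise a choice of the sketch rather than something stated in the claim.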
2. The reliability design method of the AI intelligent module of claim 1, wherein the second-stage program upload flow comprises:
(1) The host computer sends an instruction to switch to the running state, and the boot mode of the intelligent chip is switched to SPI boot mode;
(2) The FPGA powers on and starts the intelligent chip; the intelligent chip reads the system firmware program from the first part of the NorFlash storage space through the SPI interface emulated by the FPGA, and two-out-of-three logic is applied during reading: the data in the three NorFlash chips are compared bit by bit, and the value held by at least two chips is taken;
(3) During startup, the intelligent chip automatically mounts the second part of the NorFlash storage space, which can be divided into different partitions according to application requirements, used respectively to store the neural network model, the algorithm library and the executable program.
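By way of non-limiting illustration, the bit-by-bit two-out-of-three judgment of step (2) corresponds to a bitwise majority vote. The Python sketch below assumes three equal-length reads from the three NorFlash chips; the function name `vote_2oo3` is hypothetical.

```python
def vote_2oo3(a: bytes, b: bytes, c: bytes) -> bytes:
    """Bitwise majority of three copies: each output bit matches at least two inputs."""
    if not (len(a) == len(b) == len(c)):
        raise ValueError("the three copies must have the same length")
    # (x & y) | (x & z) | (y & z) is 1 exactly where at least two of the three bits are 1.
    return bytes((x & y) | (x & z) | (y & z) for x, y, z in zip(a, b, c))
```

Because the vote is taken per bit, a corrupted bit in one copy is masked even when a different copy is corrupted at another bit position.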
3. The reliability design method of the AI intelligent module of claim 2, wherein the program upload and update process further comprises emulating an SPI device through the FPGA: the FPGA presents the first part of the NorFlash storage space as an SPI Flash, replacing the original SPI Flash on the intelligent module; and the intelligent chip reads the system firmware through the SPI interface emulated by the FPGA to complete system loading and startup.
4. The reliability design method of the AI intelligent module of claim 2, wherein the program upload and update process further comprises emulating an SDIO device through the FPGA: the FPGA presents the second part of the NorFlash storage space as an SD device, replacing the original eMMC device on the intelligent module, and the SD FakeX module and the NorFlash together form a virtual SD card; the NorFlash is responsible only for data storage; SD FakeX is a Verilog module responsible for parsing and responding to SD bus commands and, when the SD host requests data, reading the data from the NorFlash and transmitting it to the SD host; that is, the FPGA transmits the data to the intelligent chip through SDIO to complete the upload and update of the application program.
5. The reliability design method of the AI intelligent module of claim 1, wherein the reliability design further comprises a current-limiting protection design, in which two key currents are monitored for current-limiting control, the two key currents being the total 5 V input current of the intelligent chip and the DDR 1.1 V current; the two key currents are monitored through the ADC of the external main control FPGA; in a subsequent test, the thresholds of the two key currents at which single-event latch-up occurs in the intelligent chip are measured, current-limiting control is implemented in the main control FPGA program according to the measured thresholds, and when a current exceeds its set threshold, the input power supply is quickly cut off, protecting the AI intelligent module from damage;
first, the main control FPGA checks the total input current of the intelligent chip, and if the total input current is greater than its threshold, the main control FPGA powers the AI intelligent module off and restarts it; if the total input current is below its threshold, the main control FPGA further checks whether the second current is greater than its threshold, and if so, the main control FPGA powers the AI intelligent module off and restarts it; if both currents are below their thresholds, the intelligent chip is determined to have no single-event latch-up.
6. The reliability design method of the AI intelligent module of claim 1, wherein the reliability design further comprises a restart instruction: a dedicated restart instruction is included in the instruction design, the AI intelligent module is powered on and off under instruction control, and the external FPGA powers the AI intelligent module on and off as required.
7. The method for designing the reliability of the AI intelligent module according to claim 1, wherein the reliability design further includes a soft watchdog, and the AI intelligent module periodically sends status flag information to the FPGA, so that the FPGA can timely obtain the status information of the intelligent chip, and if the status flag information of the intelligent chip is not received within a set time, the AI intelligent module is restarted.
CN202210347712.5A 2022-04-01 2022-04-01 Reliability design method of AI intelligent module Active CN114897150B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210347712.5A CN114897150B (en) 2022-04-01 2022-04-01 Reliability design method of AI intelligent module

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210347712.5A CN114897150B (en) 2022-04-01 2022-04-01 Reliability design method of AI intelligent module

Publications (2)

Publication Number Publication Date
CN114897150A CN114897150A (en) 2022-08-12
CN114897150B true CN114897150B (en) 2023-04-07

Family

ID=82715535

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210347712.5A Active CN114897150B (en) 2022-04-01 2022-04-01 Reliability design method of AI intelligent module

Country Status (1)

Country Link
CN (1) CN114897150B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10803401B2 (en) * 2016-01-27 2020-10-13 Microsoft Technology Licensing, Llc Artificial intelligence engine having multiple independent processes on a cloud based platform configured to scale
CN106547574A (en) * 2016-12-08 2017-03-29 航天恒星科技有限公司 The outside download system and method for a kind of DSP programs and FPGA programs
CN113495799B (en) * 2020-03-20 2024-04-12 华为技术有限公司 Memory fault processing method and related equipment
CN113127407A (en) * 2021-05-18 2021-07-16 南京优存科技有限公司 Chip architecture for AI calculation based on NVM

Also Published As

Publication number Publication date
CN114897150A (en) 2022-08-12

Similar Documents

Publication Publication Date Title
US6845276B2 (en) Multiple axis modular controller and method of operating same
JP3175757B2 (en) Debug system
CN111580454B (en) Safety control method of industrial safety PLC (programmable logic controller)
EP1358555B1 (en) Service processor and system and method using a service processor
CN108549591A (en) A kind of black box device and its implementation of embedded system
CN107301042B (en) SoC application program guiding method with self-checking function
US10120702B2 (en) Platform simulation for management controller development projects
CN111800345B (en) High-reliability constellation networking space router circuit
TWI620061B (en) Error detecting apparatus of server and error detecting method thereof
US20160364298A1 (en) Energy-efficient nonvolatile microprocessor
US6760864B2 (en) Data processing system with on-chip FIFO for storing debug information and method therefor
CN102043636B (en) Method and device for loading field programmable gate array bit file
US10042692B1 (en) Circuit arrangement with transaction timeout detection
CN114897150B (en) Reliability design method of AI intelligent module
CN109117299B (en) Error detecting device and method for server
CN113204456A (en) Test method, tool, device and equipment for VPP interface of server
US7266680B1 (en) Method and apparatus for loading configuration data
CN111158950A (en) Positioning system and method for abnormal reset of embedded computer system
TWI802951B (en) Method, computer system and computer program product for storing state data of finite state machine
US6836757B1 (en) Emulation system employing serial test port and alternative data transfer protocol
CN113778487A (en) Software uploading system and method of intelligent processing module
CN115204081A (en) Chip simulation method, chip simulation platform, chip simulation system, and computer-readable storage medium
Nan et al. Design and development of module management controller for MicroTCA. 4 standard
CN101231608A (en) Device and method for detecting error
Trujilho et al. Dependable I2C communication with FPGA

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant